Survey
Abstract: This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero- or few-shot learning tasks. Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation. We believe that this review will be of help for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.
∗Equal contribution.
†This work was done when Feng Li, Hao Zhang, and Shilong Liu were interns at IDEA.
‡Corresponding author.
I. INTRODUCTION
Computer vision (CV) and natural language processing (NLP) are sub-fields of artificial intelligence (AI) that focus on the simulation of human intelligence in vision and language. In the last decade, deep learning has greatly advanced single-modality learning in the two fields and led to state-of-the-art results on a series of tasks. At the core of the remarkable progress of deep learning lies the empowerment of rapidly evolving GPUs and the availability of large-scale datasets, which allows for accelerated training of deep models at scale.
Along with the advancement in deep learning, we have witnessed the development of a series of powerful neural networks. Traditional neural networks are typically multi-layer perceptrons (MLPs) consisting of multiple stacked linear layers and non-linear activations (Rosenblatt, 1957, 1961). LeCun et al. (1998) proposed the Convolutional Neural Network (CNN) to incorporate the shift-invariant property as a better inductive bias for 2D visual input, which inspired a large number of deep neural networks, including AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2015a), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a). Another prominent breakthrough is the recurrent neural network (RNN) in the field of natural language processing (NLP), which introduced recurrent cells for sequential data modeling (Rumelhart et al., 1985; Hochreiter and Schmidhuber, 1997a). To mitigate the vanishing and exploding gradient issues in long sequence training, LSTM (Hochreiter and Schmidhuber, 1997a), a variant of RNN, and GRU (Chung et al., 2014), a more efficient version of LSTM, were proposed accordingly. Another great breakthrough in NLP is the Transformer (Vaswani et al., 2017), which utilizes the attention mechanism to pursue better language representation. Using multiple stacked attention layers, Transformers can fuse information over language tokens globally with high parallelism, which facilitates both powerful representation and large-scale training.
Though inspiring progress has been achieved in single-modality domains, real-world problems often involve multiple modalities. For example, an autonomous vehicle should be able to process human orders (language), traffic signals (vision), and road conditions (vision and sounds). Even single modality learning benefits from multi-modality. For example, language learning needs perception, which forms the basis of many semantic axioms (Bisk et al., 2020). Perception is the way humans understand the physical world, and it determines the assumptions behind human language. Since we all hear and see the same things, we leave some knowledge as common sense that is unwritten in our language (Bisk et al., 2020). Even restricted to language, speech contains more useful information than text alone, e.g., emotions can be implied through prosody. Because multi-modal perception helps in both multi-modal and single-modal tasks, a large body of research has emerged. Within the field of multi-modality, the integration of vision and language gets much attention since vision is one of the most important senses through which humans understand the environment, and language-aligned visual features can greatly improve the performance of both vision tasks and vision-language tasks. Moreover, the popularity of vision-language intelligence is also due to the availability of abundant datasets and benchmarks in this field.
The ambition to address many task-specific VL problems has fueled the initial development of VL learning. These VL
problems include image captioning, visual question answering (VQA), image-text matching, etc. Xu et al. (2015); Karpathy et al. (2014); Vinyals et al. (2015) integrated a CNN image encoder and an RNN text decoder for image captioning. Antol et al. (2015); Yang et al. (2016); Anderson et al. (2018b) addressed the VQA task by mapping images and texts into the same latent space and predicting answers from the latent representations. Kiros et al. (2014); Karpathy et al. (2014); Huang et al. (2016); Lee et al. (2018) performed image-text matching by calculating the similarity between an image and a text at either the sentence level or the token level. These models are tailored for specific problems with various datasets and can only solve one task each.
Inspired by the prevalence of pre-training and fine-tuning in both language (Devlin et al., 2018) and vision, the interdisciplinary field of vision and language embraces a new era: to learn a joint representation of vision and language by pre-training on image-text pairs. The surge of VLP models is mostly inspired by language models in both architecture design and training methods. For example, many recent studies (Li et al., 2019b; Lu et al., 2019; Zhang et al., 2021; Tan and Bansal, 2019; Li et al., 2020b; Yu et al., 2020; Chen et al., 2020) adopted BERT-like (Devlin et al., 2018) architectures and training methods. The development of VL learning faces a serious challenge due to the lack of sufficiently large-scale manually labeled data. Recently, some studies (Radford et al., 2021; Jia et al., 2021; Wang et al., 2021; Li et al., 2021b) broke this limitation by adopting contrastive learning and making use of large-scale web-crawled data to learn visual-linguistic features which can be used for zero-shot learning.
The fast-evolving progress in the VL space calls for a comprehensive survey of existing studies in this domain. This paper aims to provide a structured review of recent progress in the VL domain to help researchers gain a complete picture of, and better insight into, recent studies. We divide the development of VL learning into three eras. The first is from 2014 to 2018, where specialized models are designed for different tasks. The second era is from 2019 to 2021, during which joint representations of vision and language are learned by pre-training on well-labeled VL datasets. Finally, the third era began in 2021 with the appearance of CLIP (Shen et al., 2021), in which researchers seek to pre-train VL models on larger weakly-labeled datasets and to obtain a strong zero/few-shot vision model with VL pre-training.
When reviewing the whole development of VL intelligence, we find the general goal is to learn good visual representations. A good visual representation should have three attributes as summarized in (Li et al., 2021b), which are object-level,
Some works in this era adopt detected regions for image representation to learn object-level features. Only in the third era can researchers deal with large-scale datasets and pre-train semantic-rich features.
There are also other survey papers discussing VL intelligence. Zhang et al. (2020) analyzed multi-modal deep learning from three perspectives: multimodal representation, fusion of multimodal signals, and multimodal applications. Mogadala et al. (2021) organized their survey by tasks. They reviewed some common tasks and corresponding methods. They also included datasets and metrics. Different from them, we view VL intelligence from the perspective of time. Furthermore, we include the general goal of this area and show how research works attain the goal step by step.
To the best of our knowledge, this is the first VL survey that summarizes studies from the viewpoint of the time period. The remainder of this paper is organized as follows. We start with some task-specific problems in VL such as image captioning, VQA, and image-text retrieval in Section II. Then we comprehensively explain the vision-language joint representation learning empowered by pre-training in Section III. Finally, we show some work learning language-aligned visual representations directly from raw image-text data with large-scale vision-language pre-training in Section IV.
II. TASK SPECIFIC PROBLEMS
Early VL methods are designed for specific tasks. The VL domain encompasses a broad range of tasks, including image captioning, VQA, image-text matching, visual grounding, and visual dialog, etc. Some common VL tasks are summarized in Table I, which shows the input, output, datasets, metrics, and mainstream methods of each task. In this section, we only introduce the three most common tasks in detail, including image captioning, VQA, and image-text matching. For these three tasks, we will introduce their task formulation and the development of mainstream methods. For the remaining tasks, a short description of each task is included. In summary, task-specific methods have developed from global representations to fine-grained object-centric representations. Most VL tasks experience three stages. The first stage is global vector representation and simple fusion. The second stage is grid feature representation and cross-modal attention. The third stage is object-centric feature representation and bottom-up top-down attention (Anderson et al., 2018b). The three stages and representative work are shown in Figure 1.
[Fig. 1 (figure not reproduced): timeline of the three stages; recoverable labels are VQA, Kiros et al., Reed et al., Karpathy et al., Anderson et al., SCAN, and Cornia et al., dated between 2014 and 2019.]
TABLE I
A COMPARISON OVER TASK-SPECIFIC PROBLEMS. TASKS ARE CLASSIFIED INTO FOUR CATEGORIES. FOR EACH TASK, WE HAVE SUMMARIZED THE INPUT, OUTPUT, DATASETS, METRICS, AND MAINSTREAM METHODS.
Question answering is usually only related to some regions of the image. Therefore, global representation only leads to a sub-optimal solution due to the noise introduced by irrelevant regions. Yang et al. (2016) proposed Stacked Attention Network (SAN) to stack multiple question-guided attention layers. In each layer, the semantic representation of the question is used as a query to attend to image grids. SAN is the first work that verifies the effectiveness of attention in VQA. Fukui et al. (2016) also adopted grid features, while they fuse image and language features through bilinear pooling (Lin et al., 2017).
Grid features have limited power as we have illustrated in the image captioning task. Shih et al. (2016) proposed to use region features as visual representations. They adopted Edge Boxes (Zitnick and Dollár, 2014) to locate regions. BUTD (Anderson et al., 2018b) pre-trained a powerful detector and used the question features as queries to attend to region features. Lu et al. (2017b) argued that attention on text is of equal importance as on the image. Therefore, they developed a co-attention method that jointly performs text-guided image attention and image-guided text attention.
Besides attention, there are other strategies for modality fusion. Ren et al. (2015) treat an image feature as a language token. They concatenate the image embeddings with language tokens as the input to LSTM. Kim et al. (2016) proposed an iterative way of element-wise multiplication for modality fusion named Multimodal Residual Networks (MRN). MUTAN (Ben-Younes et al., 2017) presented parameterized bilinear interactions between modalities. Although there are many ways to fuse image and language features, attention is the most widely used one.
The core of VQA is to obtain a joint representation of image and language (the question). Researchers in this field pursued various ways to better encode and fuse image and language, which have laid the foundation for the following VLP methods. Most works in this field encode image and language independently and then fuse them, which is similar to dual-stream methods for VLP. Ren et al. (2015) treat image embedding as a language token, which is similar to single-stream methods.
C. Image Text Matching
Task definition: Image-text matching (ITM), also called image-text retrieval, is one of the fundamental topics in vision. Given a query in a certain modality (vision or language), it aims to find the semantically closest target from another modality. Depending on the query and the target modality, it contains two sub-tasks: image-to-text retrieval and text-to-image retrieval.
Methods: The core of image-text matching is to calculate the similarity or distance between an image and a piece of text. A widely adopted prototype is to map image and text into a shared embedding space and then calculate their similarity. The matched image and sentence are expected to have the highest similarity.
Early methods (Frome et al., 2013; Socher et al., 2014; Kiros et al., 2014) mainly adopted global features to encode image and text. Kiros et al. (2014) proposed to learn cross-view representation with a hinge-based triplet ranking loss. Faghri et al. (2018) considered hard negatives to improve the performance.
Karpathy et al. (2014) proposed Deep Fragment, which is the first attempt to use fine-grained representation on both the image side and text side. The architecture of Deep Fragment is shown in Fig. 4. Instead of directly representing the whole image and sentence, they map each image fragment and sentence fragment into the cross-modality embedding space. Then they align fragments between different modalities. Since one image region may be related to several words, they find the most similar region embedding for each word embedding. The similarity between the image and the sentence is the sum of the similarities between aligned word and region pairs.
Fig. 4. The overview of the Deep Fragment (Karpathy et al., 2014) architecture. Left: Detected objects are mapped to fragment embedding space. Right: Dependency tree relations are encoded to fragment embedding space.
As the attention mechanism has shown great success in other VL tasks, Huang et al. (2016) proposed to introduce attention into image-text matching (ITM). They developed a context-modulated attention scheme to attend to instance pairs appearing in both image and text. Nam et al. (2017) proposed a dual attention framework that attends to specific regions in images and words in the text through multiple steps and gathers essential information from both modalities. These methods proved the effectiveness of attention in the ITM task. However, as a limitation, they are multi-step methods and can only focus on one semantic part at a time.
Lee et al. (2018) proposed a cross-attention algorithm called SCAN to calculate the similarity between image and sentence. To enable cross attention, they represent an image as a set of regions and a sentence as a set of words. The core idea of cross attention is to not only use the sentence as a query to attend to image regions but also use the image as a query to attend to words.
Essentially, image-text matching is a problem of calculating the similarity between image and text. Early works encode image and text into global features and calculate their cosine similarity through dot product. Subsequent works adopt fine-grained features: object-level features for images and word-level features for language. They also develop more complex algorithms for calculating similarities such as cross attention.
D. Other tasks
There are a broad variety of tasks in the interdisciplinary field of vision and language that we cannot elaborate on in detail. Therefore, we list some of the important tasks in Table I, including:
Text-to-Image Generation: Given a piece of text, generate an image containing the content of the text. We leave the details of text-to-image generation to Section IV-B.
Fig. 5. (a) Original BERT with single-modality, where some language tokens are masked for prediction to train language representation. (b) Modified BERT
with multi-modality, where both image and language tokens are fed into a BERT-like Transformer model.
Visual Dialog: Given an image, a dialog history, and a question about the image, answer the question.
Visual Reasoning: Similar to VQA, which requires answering a question about an input image, visual reasoning requires a further ability to understand the image. Visual reasoning tasks usually contain adequate annotations about the objects in an image, the structure of the questions, etc.
Visual Entailment: Given an image and a text, decide whether the image semantically entails the input text.
Phrase Grounding and Reference Expression Comprehension: These two tasks require a model to output bounding boxes corresponding to the text. For phrase grounding, the text is a set of phrases and for reference expression comprehension, the text is an expression.
In the era of task-specific methods, researchers design specific models for different tasks. Although the models for different tasks vary significantly, they follow a similar trajectory. They all have three stages as shown in Fig. 1. The technological development of this era laid the foundation for the VLP era.
III. VISION LANGUAGE JOINT REPRESENTATION
The pre-training and fine-tuning paradigm has been adopted across a wide range of domains and advanced various downstream tasks. Among the most prominent factors enabling the prevailing large-scale pre-training are the availability of abundant datasets and the rapid evolution of GPUs. Motivated by the success in single-modal language/vision pre-training, researchers started to explore the joint representation of language and vision (Sun et al., 2019a; Lu et al., 2019), giving birth to the cross-modal VLP models.
The recent surge of VLP models is mostly inspired by language models in both architecture design and training methods. One of the most important breakthroughs is the Transformer, which was developed by Vaswani et al. (2017) for improving language representation. Using multiple stacked attention layers, Transformers can fuse information over language tokens globally with high parallelism, which facilitates both powerful representation and large-scale training. A successful application of Transformer is BERT (Devlin et al., 2018), which leverages Transformer encoders and introduces a bidirectional masking technique that allows each language token to attend to other tokens bidirectionally. As shown in Figure 5(a), training is conducted by replacing some text tokens with a special [MASK] token and predicting each [MASK] using its context information. With this technique, language representation training can be considered as a denoising process, in which an input sentence learns to reconstruct itself with some contaminated tokens. This denoising training forces the masked tokens to utilize all the unmasked information, hence resulting in contextualized representations. The architecture design and mask training technique developed for Transformer-based language models are the main principles behind a broad variety of cross-modal developments that contribute to the recent surge of VLP models. Figure 5(b) shows a simple cross-modal BERT (Anderson et al., 2018b). Similar to language training, the image is tokenized and embedded along with language tokens with certain techniques, which will be elaborated on later. Usually, the tokenized visual features and textual features together are fed into a Transformer encoder with masked language training to learn a joint representation.
In this section, we will go through the main components of VLP models. As shown in Figure 6, there are primarily three components in VLP models, namely the visual embedding (VE), textual embedding (TE), and modality fusion (MF) modules. VE and TE are normally pre-trained with images and texts respectively, whereas MF takes the features extracted from VE and TE, and fuses them with image-text pre-training. The goal of VLP is to learn object-level, language-aligned, and semantic-rich visual representations. Object-level means the learned representation is fine-grained and aligned with objects rather than for a whole image. Works that use features of detected objects to represent images (Tan and Bansal, 2019; Lu et al., 2019; Yu et al., 2020; Li et al., 2019b,a, 2020b; Hu et al., 2020; Li et al., 2021b) are object-level. Language-aligned aims to learn visual features that are well-aligned with language words, which is the goal for most VLP methods. Semantic-rich strives for a representation that can be generalized to a broad range of semantic concepts and needs to be learned from a large-scale dataset.
Pre-training on a massive dataset is crucial for improving the performance on downstream tasks with smaller datasets, as the learned representation can be transferred in downstream tasks. VLP models have been proven very effective to empower downstream tasks.
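As a rough illustration of the masked-token denoising training described above for BERT, the sketch below replaces a random subset of tokens with [MASK] and records the original tokens as prediction targets. The function name `mask_tokens`, the default masking rate, and the fixed seed are assumptions made for this example only; a real pipeline would operate on token ids and feed the corrupted sequence to a Transformer encoder.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets).

    targets maps each masked position to the original token, i.e. what a
    model would be trained to predict from the unmasked context.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    # Guarantee at least one prediction target so the example is never empty.
    if not targets and tokens:
        i = rng.randrange(len(tokens))
        corrupted[i] = MASK
        targets[i] = tokens[i]
    return corrupted, targets

sentence = "a dog is chasing a ball in the park".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

Positions that are not listed in `targets` are left untouched, so the model sees the full unmasked context when reconstructing the masked ones.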
Fig. 6. The architecture of VLP models usually contains Visual Embedding (VE), Textual Embedding (TE), and Modality Fusion (MF). (a) is the illustration
of a dual-stream model, and (b) shows a single stream model. In a dual-stream model, modality fusion is optional and is done by the interactions (usually cross
attention) between language and image encoder. In a single stream model, modality fusion is done in a unified encoder (usually multi-layer Transformers).
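The two layouts in Fig. 6 can be contrasted with a toy sketch. It is a simplification under stated assumptions: attention here is plain dot-product attention with no learned projections, no scaling, and a single head, and the helper names (`attend`, `single_stream`, `dual_stream`) are hypothetical. It only shows which tokens can attend to which in each layout.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Dot-product attention: each query becomes a weighted sum of keys_values."""
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) for kv in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs

def single_stream(text_tokens, image_tokens):
    """One joint sequence: every token attends to tokens of both modalities."""
    joint = text_tokens + image_tokens
    return attend(joint, joint)

def dual_stream(text_tokens, image_tokens):
    """Separate streams that exchange information through cross attention."""
    text_out = attend(text_tokens, image_tokens)   # text queries image
    image_out = attend(image_tokens, text_tokens)  # image queries text
    return text_out, image_out

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy text embeddings
image = [[0.5, 0.5], [2.0, 0.0]]             # 2 toy region embeddings
joint_out = single_stream(text, image)
text_out, image_out = dual_stream(text, image)
```

In the single-stream call the output has one vector per token of the concatenated sequence, whereas the dual-stream call keeps one output sequence per modality.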
1) Grid features are directly extracted from equally sized image grids with a convolution feature extractor as aforementioned. For example, Huang et al. (2020, 2021) adopted grid features as the image embedding of their VLP models. Grid features have two main advantages. The first is convenience, as they do not require a pre-trained object detector. The second is that besides salient objects, grid features also contain background which may be useful for downstream tasks.
2) Region features are extracted by a pre-trained object detector. Most recent VLP models adopt region features to learn object-level joint representations. In particular, most VLP models adopt Faster R-CNN trained on the Visual Genome (VG) dataset as region feature embedding following the work of BUTD (Anderson et al., 2018b). There are three essential components of region features, which are bounding boxes, object tags, and RoI features (feature vectors after RoI pooling). Bounding boxes are commonly used in VLP as position indicators, which are encoded through a transformation into the same dimensional space as RoI features and added to RoI features. Object tags are widely utilized in training methods such as Masked Region Classification, which will be elaborated later in III-D3. The advantage of region features is that they help a VLP model focus on meaningful regions of the image. These regions are usually closely related to downstream tasks.
3) Patch features are usually extracted by a linear projection on evenly divided image patches. The main difference between patch and grid features is that grid features are extracted from the feature map of a convolutional model while patch features directly utilize a linear projection. Patch features were first introduced by Vision Transformer (ViT) (Dosovitskiy et al., 2021a) and then adopted by VLP models (Kim et al., 2021; Xue et al., 2021). The advantage of using patch features is efficiency. For example, ViLT accelerates the pre-training by 10 times with competitive results (Kim et al., 2021).
Image embedding methods normally vary for different tokenization schemes. Grid features and region features are usually from pre-trained convolutional models, whereas patch features can be simply embedded by a linear layer.
C. Modality Fusion
At the core of VLP models lies the modality fusion, which models intra-modality and inter-modality fusion to produce contextualized joint representations of image and text. MF schemas can be categorized into dual-stream modeling and single-stream modeling. The general structure of VLP is shown in Figure 6.
1) Dual stream modeling:
Dual-stream modeling aims to map vision and language into the same semantic space. It is the seminal method for modality fusion (Tan and Bansal, 2019; Yu et al., 2020; Li et al., 2021a). As shown in Fig. 6(a), it adopts two separate encoders to learn high-level representations for vision and language, respectively. The dual-stream design allows variable network depth and architecture to be adaptive to each modality. Apart from intra-modal fusion within each modality, some studies (Lu et al., 2019; Tan and Bansal, 2019) also explicitly design inter-modal interactions between two encoders to enable modality fusion in different encoding stages.
2) Single stream modeling:
Single stream modeling aims to learn one joint representation. Image and text tokens are concatenated and inputted into Transformers as shown in Fig. 6(b). Most VLP models adopt this modality fusion scheme (Alberti et al., 2019; Chen et al., 2020; Li et al., 2019a; Huang et al., 2021; Jia et al., 2021). Single-stream modeling performs implicit intra-modal and inter-modal fusion, free from the architecture design of the fusion stage in dual-stream modeling.
D. Training
To learn a joint representation of vision and language, vision language pre-training methods usually use several self-supervised learning losses to pre-train the model on a large dataset. As studied in (Chen et al., 2020; Li et al., 2019a; Lu et al., 2019; Su et al., 2019; Tan and Bansal, 2019; Zhou et al., 2020), there are mainly three pre-training methods, which are Image Text Matching (ITM), Masked Language Modeling (MLM), and Masked Visual Modeling (MVM).
1) Image Text Matching:
The goal of ITM is to predict whether an image-text pair is matched. ITM can be formulated as a binary classification task. Previous work (Chen et al., 2020) applies a sigmoid function on the output of the special token [CLS] to predict whether the input image and text are matched. The loss function is
$$\mathcal{L}_{\mathrm{ITM}} = -\mathbb{E}_{(W,V)\sim D} \log p(y \mid W, V) \tag{1}$$
where $W = \{w_1, w_2, \ldots, w_n\}$ denotes a sequence of language tokens and $V$ denotes the visual content. $y \in \{0, 1\}$ indicates whether the image and text are matched ($y = 1$) or not ($y = 0$).
2) Masked Language Modeling:
According to (Chen et al., 2020), MLM is utilized to encourage the model to learn the implicit relation between language tokens and visual content. The goal is to reconstruct the masked language tokens from the known language tokens and visual contents. This goal can be formulated as
$$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{(W,V)\sim D} \log p(w_i \mid W_{\setminus i}, V) \tag{2}$$
where $W_{\setminus i}$ denotes the sentence without the $i$-th word. Note that, although BPE is normally adopted for language tokenization, the minimal masked unit is a whole word instead of a subword. The reason is that a subword can be easily predicted from its surrounding subwords due to information leakage. There are also improved versions of MLM. For example, Sun et al. (2019b) proposed Knowledge Masked Language Modeling, which performs phrase-level masking and entity-level masking to integrate phrase and entity-level knowledge into the language representation. For entity-level masking, they treat a named entity as a whole. For example, J. K. Rowling, which contains three tokens, is a person name and should be masked together in entity-level masking. The phrase-level masking treats a group of words as a conceptual unit. They mask all the tokens belonging to a phrase and predict them simultaneously.
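The whole-word and entity-level masking just described can be sketched as masking by unit rather than by token. This is a minimal illustration, not the implementation of any cited work: the helper names are hypothetical, the "##" continuation marker is a WordPiece-style assumption (BPE vocabularies mark word boundaries differently), and the entity span is specified by hand rather than produced by a tagger.

```python
MASK = "[MASK]"

def group_subwords(tokens, cont_prefix="##"):
    """Group token indices into whole words.

    Assumes a token starting with cont_prefix continues the previous word.
    """
    units = []
    for i, tok in enumerate(tokens):
        if tok.startswith(cont_prefix) and units:
            units[-1].append(i)
        else:
            units.append([i])
    return units

def mask_units(tokens, units, chosen):
    """Mask every token of each chosen unit so a unit is never partly visible."""
    masked = list(tokens)
    targets = {}
    for u in chosen:
        for i in units[u]:
            targets[i] = tokens[i]
            masked[i] = MASK
    return masked, targets

# Whole-word masking: all subwords of the chosen word are hidden together.
tokens = ["the", "snow", "##board", "##er", "jumps"]
words = group_subwords(tokens)
masked, targets = mask_units(tokens, words, chosen=[1])

# Entity-level masking: the unit is a hand-specified three-token name span.
sent = ["J.", "K.", "Rowling", "wrote", "the", "book"]
entity_units = [[0, 1, 2], [3], [4], [5]]
masked_entity, _ = mask_units(sent, entity_units, chosen=[0])
```

Because the masked unit is never partly visible, the model cannot recover a hidden subword from its neighboring subwords, which is the leakage that whole-word masking is meant to avoid.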
[Fig. 7 (figure not reproduced): timeline from April 2019 to February 2021; recoverable labels are VisualBERT, Unicoder-VL, VL-BERT, UNITER, PixelBERT, Oscar, Ernie-ViL, VinVL (Oscar+), and ViLT.]
Fig. 7. A landscape of VLP methods. Works are sorted according to the time they were published. We also show the logos of the main institutions where each work is from.
3) Masked Vision Modeling:
Inspired by MLM, MVM is designed to learn contextualized visual representation by reconstructing masked visual contents. MVM is more challenging than MLM since the information density of image is lower than that of language. When reconstructing a missing word, sophisticated language understanding is required. On the contrary, a missing image patch can be recovered from neighboring patches without cross-modality understanding (He et al., 2021). To overcome this gap, most works mask detected object regions that have relatively high information density. Other works such as SOHO (Huang et al., 2021) use a visual dictionary (VD) to represent more comprehensive and compact semantics in the visual domain so that they can apply MVM in the same way as MLM. In summary, there are mainly four MVM schemes.
1) Masked Region Prediction (MRP): MRP minimizes the distance between the predicted feature and the feature output by a pre-trained object detector. The distance metric is usually L2 (Tan and Bansal, 2019). We denote masked image regions as $v_m = \{v_m^1, \ldots, v_m^M\}$, $h_\theta^i$ as the model prediction corresponding to $v_m^i$, and $r(v_m^i)$ as the ROI-pooled feature of $v_m^i$. The loss function is
$$\mathcal{L}_{\mathrm{MRP}} = \sum_{i=1}^{M} \left\| h_\theta^i - r(v_m^i) \right\|_2^2 \tag{3}$$
2) Masked Region Classification (MRC): MRC requires a model to predict the object semantic class for each masked region. Since there is no ground-truth label, the label $c(v_m^i)$ predicted by a pre-trained object detector of $v_m^i$ is used
signal, which is the raw output of the object detector after softmax. The loss function is
$$\mathcal{L}_{\mathrm{MRC\text{-}kl}} = \sum_{i=1}^{M} D_{\mathrm{KL}}\left( \tilde{c}(v_m^i) \,\|\, g_\theta^i \right) \tag{5}$$
where $g_\theta^i$ denotes the soft label of $v_m^i$ predicted by the VLP model.
4) Masked Visual Modeling with Visual Dictionary (MVMVD): Similar to language models which have a vocabulary dictionary, MVMVD requires a visual vocabulary dictionary (VD). The goal of MVMVD is to reconstruct the masked VD tokens (Huang et al., 2021). The loss function is
$$\mathcal{L}_{\mathrm{MVM}} = -\mathbb{E}_{(W, f(V))\sim D} \log p\left( f(v_j) \mid W, f(V)_{\setminus j} \right) \tag{6}$$
where $f(\cdot)$ denotes the mapping from an image grid to a visual token in the VD and $j$ denotes the index of the masked token in the VD.
There are two points that are worth noting. Firstly, to encourage inter-modal fusion, some works such as UNITER-VL (Chen et al., 2019) only mask tokens in one modality each time during training to encourage the masked tokens to attend to another modality for missing information. Secondly, for MVMVD, neighboring image grids tend to map to the same VD token as they are highly correlated. When performing reconstruction, the model may directly copy the surrounding tokens. Therefore, all visual embedding vectors
| Architecture | Method | Visual Embedding | Pre-training Tasks | Pre-training Datasets | Downstream Tasks |
|---|---|---|---|---|---|
| Dual-Stream | ViLBERT (2019) | BUTD | ITM, MLM, MRC-kl | CC3M | VQA, VR, RE, IR, Zero-shot IR |
| Dual-Stream | LXMERT (2019) | BUTD | ITM, MLM, MRP, MRC | COCO, VG (2017a), VQA v2, GQA, Visual7W (2016) | VQA, VR |
| Dual-Stream | 12-in-1 (2020) | BUTD | ITM, MLM, MRC-kl | VQA v2, Flickr30k, SNLI-VE, COCO, GuessWhat, VG, RefCOCO, RefCOCO+, RefCOCOG, Visual 7W, GQA, NLVR2 | VQA, IR, RE, VE, VR |
| Dual-Stream | Ernie-ViL (2020) | BUTD | Object Prediction, Attribute Prediction, Relationship Prediction, ITM, MLM, MRC-kl | CC3M, SBU (out-of-domain); COCO, VG (in-domain) | VQA, VR, IT, TR, RE |
| Single-Stream | VideoBERT (2019a) | S3D (2017b) | ITM, MLM, MVM | YouTube cooking videos (2017b) | Zero-shot Action prediction (2017b), Video Captioning (2017b) |
| Single-Stream | VisualBERT (2019b) | Pre-trained Fast R-CNN | ITM, MLM | COCO | VQA, VR, PG |
| Single-Stream | B2T2 (2019) | ResNet-152 (2016b) | ITM, MLM | CC3M | VR |
| Single-Stream | Unicoder-VL (2019a) | Pre-trained Faster-RCNN (2018) | ITM, MLM, MRC | CC3M, SBU | IR, TR, VR, Zero-shot IR/TR |
| Single-Stream | VL-BERT (2019) | BUTD | MLM, MRP | CC3M, English Wikipedia, BooksCorpus (2015) | VQA, VR, RE |
| Single-Stream | Unified VLP (2020) | BUTD variant (with ResNeXt-101 backbone) (2017a) | MLM (sequentially and bidirectionally) | CC3M | IC, VQA |
| Single-Stream | UNITER (2019) | BUTD | ITM, MLM, MRP, MRC, MRC-kl | COCO, VG, CC3M, SBU | VQA, VR, VE, IR, TR, RE |
| Single-Stream | Oscar (2020b) | BUTD+tags | MLM (include tags), ITM (pollute tags) | COCO, VG, CC3M, SBU, Flickr30k, GQA | VQA, VR, VE, IR, TR, RE, IC |
| Single-Stream | PixelBERT (2020) | Pixel feature embedding (2015) | MLM, ITM | COCO, VG | VQA, IR, TR, VR |
| Single-Stream | VILLA (2020) | BUTD | ITM, MLM, MRC-kl | COCO, VG, CC3M, SBU | VQA, VR, VE, IR, TR, RE |
| Single-Stream | VIVO (2020) | BUTD | Mask Tag Prediction (Hungarian match loss) | Open Images V5 (2018; 2019) | IC |
| Single-Stream | SOHO (2021) | VD | ITM, MLM, MVMVD (based on VD) | COCO, VG | IR, TR, VQA, VR, VE |
| Single-Stream | ViLT (2021) | Patch Projection | ITM, MLM | COCO, VG, SBU, CC | VQA, VR, IR, TR, Zero-shot IR&TR |
| Hybrid | SemVLP (2021a) | Pre-trained Faster-RCNN | MLM, MRP, VQA | COCO, VG, VQA v2, GQA, Visual7W | VQA, VR, IR, TR |
TABLE II
Pre-training work comparison. The pre-training tasks correspond to the tasks described in Section III-D. We also list the pre-training datasets and downstream tasks of these works. The datasets and downstream tasks are described in Table I.
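As a concrete reading of the MRP objective in Eq. (3), the sketch below sums the squared L2 distance between each predicted feature and the corresponding ROI-pooled detector feature over the masked regions. It is a minimal illustration, not the implementation of any surveyed model: the function name `mrp_loss` and the plain-list feature representation are assumptions made here, and a real model would compute this on batched tensors inside a training loop.

```python
def mrp_loss(predicted, roi_features):
    """Sum of squared L2 distances over masked regions (cf. Eq. 3).

    predicted:    list of M feature vectors, one per masked region (h_theta^i)
    roi_features: list of M detector feature vectors (r(v_m^i))
    Both names are hypothetical; vectors are plain lists of floats.
    """
    if len(predicted) != len(roi_features):
        raise ValueError("need one target feature per masked region")
    total = 0.0
    for h_i, r_i in zip(predicted, roi_features):
        if len(h_i) != len(r_i):
            raise ValueError("feature dimensions must match")
        # Squared L2 norm of the difference for region i.
        total += sum((h - r) ** 2 for h, r in zip(h_i, r_i))
    return total


if __name__ == "__main__":
    preds = [[1.0, 2.0], [3.0, 4.0]]
    targets = [[0.0, 2.0], [3.0, 2.0]]
    print(mrp_loss(preds, targets))
```

Because the targets come from a frozen, pre-trained detector, only the predicted features would receive gradients in an actual training setup.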
E. Landscape of General Pre-training Studies

After introducing the general pipeline of VLP models, in this section, we summarize some pioneering works in the cross-domain of VLP.

Inspired by the success of pre-training in NLP and CV, a growing number of research works in the domain of VLP have recently emerged to pursue a unified cross-modality representation. A landscape of VLP works is shown in Figure 7. A more detailed comparison of related works is shown in Table II. We elaborate on some representative studies in this section.

Single Stream Models: VideoBERT (Sun et al., 2019a) is a pioneering work to learn joint representation for video and language. The primary idea is to feed visual and textual tokens into a single-stream model built upon BERT (Devlin et al., 2018). Textual tokens are extracted by converting video speech into text with an automatic speech recognition approach, and visual tokens are acquired by extracting features from video clips using a convolutional backbone. VideoBERT is capable of performing a wide range of downstream classification and generation tasks, including video captioning and zero-shot masked verb/noun prediction. Note that VideoBERT is pre-trained on cooking videos, where the contents are instructional and of high quality. It assumes that spoken words are well aligned with visual content, which limits its application to only certain videos (e.g. instructional videos). Another issue confining its scalability is its delicately designed captioning text template, for example, "now let's [MASK] the [MASK] to the [MASK], and then [MASK] the [MASK]", which only works for cooking videos.

Li et al. (2019b) proposed a simple single-stream VLP model named VisualBERT. The extracted visual and textual tokens are directly combined and fed into Transformers, where cross-modality fusion can be performed implicitly. Similar to VisualBERT, several concurrent studies such as Unicoder-VL (Li et al., 2019a), VL-BERT (Su et al., 2019), and UNITER (Chen et al., 2020) also adopt the single-stream architecture. These VLP studies are similar in the following aspects: 1) They all utilize an object detection backbone to compute image embedding. 2) The masked language modeling task is adopted by all of them. 3) They all adopt the single-stream BERT architecture. They differ from each other in their pre-training methods and datasets, as shown in Table II.

Dual Stream Models: ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) are pioneering works to extend BERT to dual-stream VLP models. They are pre-trained on the Conceptual Captions dataset (Sharma et al., 2018) and leverage a pre-trained Faster R-CNN model (Ren et al., 2017) to detect regions as visual tokens. ViLBERT processes visual and textual tokens separately with two parallel streams which can
fuse cross-modality information through cross-attention layers when needed. In other words, ViLBERT assumes different processing architectures for vision and language. Its cross-modal fusion is designed to be sparse and explicit between the two processing pipelines. LXMERT differs from ViLBERT by decoupling intra-modal and inter-modal processing. More specifically, visual and textual tokens are encoded separately in the first phase and then fed into a cross-modality encoder to produce the joint representation.

Other Fusion Methods: Fundamentally, single-stream modeling and dual-stream modeling differ in the fusion time, where single-stream fuses different modalities in an earlier stage while dual-stream prefers to extract high-level features of each modality before fusion. SemVLP (Li et al., 2021a) proposed to combine the two prevalent modeling architectures by training them iteratively. Such an approach takes advantage of both architectures and performs cross-modality semantic alignment at both the low level and the high level. In particular, the Transformer encoder is shared between both modeling methods with an additional cross-modal attention module in the dual-stream encoder, which is found to contribute to the semantic alignment and reduce parameters.

Most VLP models attempt to encode vision and language into separate tokens that interact with each other explicitly or implicitly through modality fusion. Another line of VLP models alternatively attaches visual tokens to textual tokens based on object detection models. B2T2 (Alberti et al., 2019) proposed to fuse the features of detected objects in textual tokens, based on which MLM and ITM are performed in pre-training. In B2T2, a token $T$ can be expressed as

$$T = t + \sum_{i=1}^{n} \left( h_\theta(b_i) + b_i \right) \tag{7}$$

where $t$ is the original textual embedding, $n$ is the number of detected objects whose label is token $t$, $b_i$ is the embedding of the $i$-th object's bounding box, and $h_\theta(b_i)$ denotes the extracted visual feature from the bounding box. B2T2 also analyzes the stages of fusing objects and textual tokens. The result indicates the effectiveness of early fusion.

Early Attempts to Bridge Modality Gap: To enable both generation and understanding tasks, Zhou et al. (2020) proposed a unified vision-language pre-training approach. It introduces two mask schemes, namely the bidirectional attention mask and the sequence-to-sequence mask, to empower understanding and generation tasks, respectively. It is worth noting that this unified VLP approach only adopts MLM during pre-training and achieves a competitive performance on image captioning and VQA. 12-in-1 (Lu et al., 2020) extended multi-task training to four broad tasks pre-trained on 12 datasets. The experimental results indicate that multi-task training can consistently improve the performance of downstream tasks and yield a more lightweight model with fewer parameters.

VILLA (Gan et al., 2020) introduced adversarial training at the embedding level of visual and textual tokens based on the design of UNITER (Chen et al., 2019). It performs adversarial training by adding perturbations in the embedding space as regularization and yields decent performance improvement.

Motivated by the knowledge masking scheme of ERNIE (Sun et al., 2019b), structured knowledge is first incorporated in the VLP models in ERNIE-ViL (Yu et al., 2020). To develop better cross-modality semantic alignments by constructing scene graphs, ERNIE-ViL proposes scene graph prediction tasks to model objects, attributes, and relationships in the graph to learn object-level and attribute-aware representation. Incorporating knowledge in cross-modality training is challenging and remains an open problem.

Grid & Patch features: While the prevalence of region feature embedding facilitates the training of VLP models, it also restricts the scalability and generalization capability of VLP models. We can analyze the weakness of region features from Faster R-CNN as follows.
• Limited categories: Visual features are limited by object detection models which are trained on relatively small datasets with predefined object categories. For example, the widely adopted Faster R-CNN model in BUTD (Anderson et al., 2018a) is trained on VG with a fixed number of 1594 object classes and 524 attributes.
• Low quality: Region features often suffer from low quality as the Faster R-CNN models are trained on small well-labeled datasets (Anderson et al., 2018b).
• Lack of context: Region features extract RoI features that belong to certain categories without any background information, neglecting semantic relationships between these region features. In reality, these semantic relationships are important.

PixelBERT (Huang et al., 2020) attempted to break this limitation and fully utilize visual information by directly learning from pixel features. Instead of utilizing all the pixels as visual features, to reduce computation cost and improve model robustness, a fixed number of 100 pixels are randomly sampled during pre-training. However, the experimental results indicate that random sampling only slightly improves the performance, by less than 0.5 VQA score in downstream tasks.

SOHO (Huang et al., 2021) is another pioneering work that leverages grid features for cross-modality understanding. To learn a semantically comprehensive representation for visual context, SOHO proposes to learn a VD for visual tokenization. The VD is learned by first obtaining high-level features from a convolutional network, which are then grouped according to feature similarity and fed into a moving-averaged encoder to dynamically update the VD. As visual embeddings are trainable, SOHO is an end-to-end pre-training framework that directly learns from pixels, eliminating the need for bounding boxes. With this dynamic VD updating during training, the serial number of each token in the VD can be considered as a label just like language tokens, making it natural to perform masked vision modeling. For pre-training tasks, SOHO proposes a novel MVMVD method (described in III-D3) to mask all the visual tokens of the same label simultaneously in an image to avoid any information leakage.

Image embeddings based on regions or grids as aforementioned are computationally heavy and the extracted high-level features prevent early fusion of cross-modality information. Inspired by ViT (Dosovitskiy et al., 2021b), ViLT (Kim et al., 2021) adopts simple linear projection of image patches as
visual embedding, which greatly accelerates pre-training by 10 times with competitive results. It implies that instead of designing novel visual embedding, designing better modality-fusion could be the key to improving the representation of VLP models.

Improve Aligned Representation: Vision-language-aligned representation is a fundamental goal in VLP. To achieve this goal, some works propose to adopt additional object-level data in VLP. For example, many VLP methods adopt RoI region features with detection models. However, the detected object tags, as an important component, are not explicitly modeled in VLP models. To leverage this additional information, Oscar (Li et al., 2020b) introduced object tags as anchor points to help learn cross-modality-aligned representation. This learning process is empirically natural as the detected object tags often appear in the image-paired text, which helps align vision and language. In addition, training with object tags contributes to learning co-occurrence of objects. Therefore, Oscar yields significant improvement in the downstream understanding and generation tasks. However, the drawback of Oscar is also obvious as it relies on well-labeled image-caption datasets, making it hard to scale.

As VLP models are limited by inadequate well-aligned (image, caption) pairs, VIVO (Hu et al., 2020) proposed to scale up pre-training using a large amount of (image, tag) pairs. VIVO adopts a Hungarian matching loss to perform masked tag prediction, which enables visual vocabulary learning and improves the model generalization ability to describe novel objects in downstream tasks. As a result, it surpassed human performance for the first time on the Novel Object Captioning at Scale (NoCaps) (Agrawal et al., 2019) benchmark.

Scaling VLP models based on RoI features also calls for more powerful visual representations. VinVL (Zhang et al., 2021) followed BUTD (Anderson et al., 2018b) and developed an improved object detection model for VLP with a larger training dataset. More specifically, it adopts ResNeXt152-C4 and merges four public datasets including VG, COCO, Objects365, and OpenImagesV5 for large-scale training. VinVL yields significant improvement on VLP models like VIVO and Oscar and achieves top results on the NoCaps, image captioning, and VQA leaderboards.

IV. SCALE UP MODELS AND DATA

Though inspiring progress has been made in the vision-language joint representation, most aforementioned studies primarily focus on object-level representation to pursue good cross-modal alignment. However, they make a strong assumption: image and text pairs are well labeled, which restricts training datasets to relatively small "gold-labeled" datasets. For example, the largest public dataset widely used for VL pre-training is Conceptual Captions (Sharma et al., 2018) with three million image-text pairs. To obtain richer semantics and stronger generalization capability, larger weakly-labeled datasets such as web-crawled datasets are greatly desired. CLIP (Radford et al., 2021) and DALL-E (Ramesh et al., 2021) are the first successful practices to utilize large-scale (400M image-text pairs for CLIP and 250M for DALL-E) web-crawled data for pre-training. Motivated by the success of CLIP and DALL-E, several recent works further built more powerful models with even larger datasets.

This section aims to introduce models trained with large-scale weakly-labeled datasets. The section is divided into two parts. The first part includes works utilizing large-scale datasets for visual understanding such as CLIP, ALIGN (Jia et al., 2021), SimVLM (Wang et al., 2021) and Florence (Yuan et al., 2021). The second part contains visual generation models empowered by large-scale datasets such as DALL-E, GODIVA (Wu et al., 2021a) and NÜWA (Wu et al., 2021b).

A. Visual Understanding

The core idea of CLIP is the training method. Instead of training to predict masked visual or textual tokens as in other VLP methods, CLIP learns to recognize paired image and text. Given a batch of N (image-text) pairs, the goal is to predict which of the N × N possible pairs are matched pairs (positive samples) and which are unmatched pairs (negative samples). After pre-training, CLIP can perform zero-shot image classification by using phrases such as "a photo of" plus a category name as prompts to tell the model which categories an input image is the most similar to. Compared with a fully supervised baseline, zero-shot CLIP outperforms the baseline on 16 of 27 datasets.

Similar to CLIP, ALIGN (Jia et al., 2021) also adopts a dual encoder model with a contrastive loss for zero-shot tasks. It utilizes a larger raw dataset with 1.8B image-text pairs. ALIGN outperforms CLIP on many zero-shot visual tasks, which suggests that a larger dataset leads to better performance. Besides vision tasks, ALIGN also outperforms previous work on image-text retrieval tasks. SimVLM (Wang et al., 2021) developed a new approach to VL pre-training. It follows a simple prefix language modeling objective to predict the next token in an autoregressive way. It achieves competitive results on multiple VL tasks and has the ability for text-guided zero-shot learning. Unlike previous works that adopt coarse (image-level) representations and static (image) data, Florence (Yuan et al., 2021) adopts fine-grained (object-level) representations and extends to dynamic (video) data. For object-level representations, Florence adds an adaptor Dynamic Head (Dai et al., 2021) to the image encoder and trains with an extra object detection dataset. Through pre-training on 900M image-text pairs, Florence achieves new state-of-the-art results in a majority of 44 representative benchmarks.

Apart from zero-shot classification, CLIP can also help detection. For example, ViLD (Gu et al., 2021) proposes a zero-shot detector via CLIP distillation. Other studies show that CLIP can learn multi-modal features which are more like neurons in the human brain (Goh et al., 2021) and can help VL tasks (Shen et al., 2021).

B. Visual Generation

Besides visual understanding, large-scale weakly-labeled image-text-paired data can also assist text-to-image generation. Ramesh et al. (2021) developed an image generation system called DALL-E. DALL-E converts images into discrete visual
tokens using a discrete variational autoencoder (dVAE) so that a (text, image) pair can be viewed as a single stream of data. During training, the text-image stream is fed into a decoder-only Transformer. As for the attention mask, each image token can see all the text tokens. The attention mask among text tokens is a standard causal mask, and image-to-image attention uses either a row, column, or convolutional attention mask. At inference time, given text tokens, the generation process is to predict image tokens in an auto-regressive way as in GPT. DALL-E shows impressive results in four aspects: creating anthropomorphized versions of animals and objects, combining unrelated concepts, rendering text, and applying a transformation to existing images.

Inspired by the training method in DALL-E, Wu et al. (2021a) proposed a method named GODIVA to generate videos from text. Similar to DALL-E, GODIVA tokenizes each frame of the video and concatenates the text and visual tokens sequentially as a stream to train the model. DALL-E and GODIVA are designed for text-to-image generation and text-to-video generation, respectively, while Wu et al. (2021b) proposed a unified visual generation model, NÜWA, which achieves state-of-the-art results on 8 downstream tasks including text-to-image, text-to-video, video prediction, etc. They proposed a 3D Transformer that is able to encode all three data formats including text (1D), images (2D), and videos (3D). To better attend to videos, they designed a 3D Nearby Attention to apply attention along both spatial and temporal axes.

V. FUTURE TRENDS

In the last few years, we have witnessed how VLP models scale to use large quantities of weakly-labeled and more diverse data. In the future, the models and data will continue to scale up to achieve stronger modality cooperation and even unified representation. In addition, incorporating knowledge can further empower VLP models to gain better generalization abilities. In this section, we will discuss these future trends.

specifically, the "vokenization" model is trained with image-text matching to construct a visual image vocabulary, which is leveraged to map text tokens in language-only datasets to the retrieved images with the highest score. The experimental results show that it can yield additional improvement over self-supervised language models.

2) Improve Cross-Modal Tasks with Single-Modality Data: To address the data shortage issue, some VLP models utilize extra single-modal data to improve the representation capability. For example, in an image-text dataset, text is usually short with several tokens, which restricts the textual representation. Therefore, VL-BERT (Su et al., 2019) adds additional linguistic corpora to improve the language part in cross-modal tasks.

B. Toward General Unified-Modality

Thanks to the Transformer architecture, researchers have achieved remarkable progress in both single-modal and multi-modal representation learning. In previous sections, we have discussed multi-modal representation and modal cooperation, which connect vision and language in different ways. A more ambitious goal is to build a general representation model which can unify multiple modalities. As a pioneering work, UNIMO (Li et al., 2020a) proposed a unified pre-training model, which can handle both single-modal and multi-modal downstream tasks including understanding and generation. It is trained with a huge amount of single-modal as well as cross-modal data, including BookWiki (Zhu et al., 2015) and OpenWebText (language data), OpenImages (Krasin et al., 2017) and COCO (Lin et al., 2014) (image data), and COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2016), Conceptual Captions (Sharma et al., 2018) and SBU (Ordonez et al., 2011) (image-text data). As a result, UNIMO improves many single-modal and multi-modal downstream tasks by a large margin. Another interesting work is a general-purpose vision system developed by Gupta et al. (2021) for a series of vision and cross-modal tasks.
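The retrieval step described earlier in this section for the "vokenization" model, in which each text token is mapped to the image with the highest matching score, can be illustrated with a small sketch. Everything specific here is an assumption for illustration only: the function names, the use of a dot product as the matching score, and the idea that token and image embeddings are already available as plain lists. The surveyed work learns its scoring function with image-text matching; this sketch only shows the arg-max lookup.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (assumed matching score)."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    return sum(x * y for x, y in zip(a, b))


def retrieve_vokens(token_embeddings, image_embeddings):
    """Return, for each token, the index of the best-scoring image.

    token_embeddings: list of token vectors (hypothetical, precomputed)
    image_embeddings: list of image vectors forming the visual vocabulary
    """
    if not image_embeddings:
        raise ValueError("the image vocabulary must not be empty")
    vokens = []
    for token in token_embeddings:
        scores = [dot(token, image) for image in image_embeddings]
        # Arg-max over the image vocabulary; ties resolve to the first image.
        vokens.append(max(range(len(scores)), key=scores.__getitem__))
    return vokens


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    images = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    print(retrieve_vokens(tokens, images))
```

The returned indices play the role of per-token visual labels that a language model could be trained to predict alongside its usual objective.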
efficiently utilize large wikidata with high noise and how to learn from knowledge in an explainable way.

REFERENCES

Harsh Agrawal, Karan Desai, Xinlei Chen, Rishabh Jain, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, 2019.
Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. In EMNLP, pages 2131–2140, 2019.
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, 2016.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018a.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018b.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks, 2017.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. 2005.
Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In ICCV, pages 2612–2620, 2017.
Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In CVPR, 2019.
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian. Experience grounds language, 2020.
Patrick Bordes, Éloi Zablocki, Laure Soulier, Benjamin Piwowarski, and Patrick Gallinari. Incorporating visual semantics into sentence representations within a grounded space. In 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, pages 696–707. Association for Computational Linguistics, 2019.
Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language berts. Transactions of the Association for Computational Linguistics, 9:978–994, 2021.
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision, pages 565–580. Springer, 2020.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. In CVPR, 2021.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. 2019.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In ECCV, pages 104–120, 2020.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. In CVPR, 2019.
Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, and Lei Zhang. Dynamic detr: End-to-end object detection with dynamic attention, October 2021.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, and Ajay Divakaran. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment, 2019.
Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512, 2017.
Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie-Francine Moens. Talk2car: Taking control of your self-driving car. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. doi: 10.18653/v1/d19-1215. URL http://dx.doi.org/10.18653/v1/D19-1215.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021a.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021b.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives, 2018.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In Neural Information Processing Systems (NIPS), 2013.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding, 2016.
Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, 1994.
Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195, 2020.
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question answering, 2015.
Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 6(3):e30, 2021.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation, 2021.
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. Contrastive learning for weakly supervised phrase grounding, 2020.
Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems, 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. Image Captioning: Transforming Objects into Words. In NeurIPS, 2019.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997a.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997b. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. 2013.
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering, 2017.
Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack neural module networks, 2019.
Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. Vivo: Surpassing human performance in novel object captioning with visual vocabulary pre-training. arXiv e-prints, pages arXiv–2009, 2020.
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning, 2019.
Yan Huang, Wei Wang, and Liang Wang. Instance-aware image and sentence matching with selective multimodal lstm, 2016.
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv:2004.00849, 2020.
Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985, 2021.
Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning, 2018.
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
Julia Ive, Pranava Madhyastha, and Lucia Specia. Distilling translations with visual awareness. arXiv preprint arXiv:1906.07701, 2019.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
Justin Johnson, Bharath Hariharan, Laurens van der Maaten,
16
Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-
Girshick. Inferring and executing programs for visual Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei.
reasoning, 2017. Visual Genome: Connecting Language and Vision Using
Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Syn- Crowdsourced Dense Image Annotations. IJCV, 123(1):
naeve, Ishan Misra, and Nicolas Carion. Mdetr – modulated 32–73, 2017a.
detection for end-to-end multi-modal understanding, 2021. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-
alignments for generating image descriptions. In CVPR, Jia Li, David A Shamma, et al. Visual genome: Connecting
2015. language and vision using crowdsourced dense image
Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep frag- annotations. International journal of computer vision, 123
ment embeddings for bidirectional image sentence mapping, (1):32–73, 2017b.
2014. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Ima-
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and genet classification with deep convolutional neural networks.
Tamara Berg. Referitgame: Referring to objects in pho- Advances in neural information processing systems, 25:1097–
tographs of natural scenes. In Proceedings of the 2014 1105, 2012.
conference on empirical methods in natural language pro- Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik
cessing (EMNLP), pages 787–798, 2014. Dhar, Siming Li, Yejin Choi, Alexander C Berg, and
Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Tamara L Berg. Babytalk: Understanding and generating
Nickel. Learning visually grounded sentence representations. simple image descriptions. 2013.
In Proceedings of the 2018 Conference of the North Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings,
American Chapter of the Association for Computational Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,
Linguistics: Human Language Technologies, Volume 1 (Long Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The
Papers), pages 408–418, 2018. open images dataset v4: Unified image classification, ob-
Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh ject detection, and visual relationship detection at scale.
Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. arXiv:1811.00982, 2018.
Multimodal residual learning for visual qa, 2016. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- Gradient-based learning applied to document recognition.
and-language transformer without convolution or region Proceedings of the IEEE, 86(11):2278–2324, 1998.
supervision. arXiv preprint arXiv:2102.03334, 2021. Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and
Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Uni- Xiaodong He. Stacked Cross Attention for Image-Text
fying visual-semantic embeddings with multimodal neural Matching. In ECCV, 2018.
language models, 2014. Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang,
Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. Associat- Bin Bi, and Songfang Huang. Semvlp: Vision-language
ing neural word embeddings with deep image representations pre-training by aligning semantics at multiple levels. arXiv
using fisher vectors. In Proceedings of the IEEE Conference preprint arXiv:2103.07829, 2021a.
on Computer Vision and Pattern Recognition, pages 4437– Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou.
4446, 2015. Unicoder-VL: A universal encoder for vision and language
Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush, and by cross-modal pre-training. In AAAI, pages 11336–11344,
Yoav Artzi. What is learned in visually grounded neural 2019a.
syntax acquisition. In Proceedings of the 58th Annual Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and
Meeting of the Association for Computational Linguistics, Kai-Wei Chang. VisualBERT: A simple and performant
2020. baseline for vision and language. arXiv:1908.03557, 2019b.
Satwik Kottur, José MF Moura, Devi Parikh, Dhruv Batra, Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei
and Marcus Rohrbach. Clevr-dialog: A diagnostic dataset Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan,
for multi-round reasoning in visual dialog. arXiv preprint Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng
arXiv:1903.03166, 2019. Gao. Grounded language-image pre-training, 2021b.
Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu,
Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Jiachen Liu, Hua Wu, and Haifeng Wang. Unimo: Towards
Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: unified-modal understanding and generation via cross-modal
A public dataset for large-scale multi-label and multi-class contrastive learning. arXiv preprint arXiv:2012.15409,
image classification. Dataset available from https://github. 2020a.
com/openimages, 2(3):18, 2017. Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang,
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-
Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li- driven text-to-image synthesis via adversarial training. In
Jia Li, David A Shamma, et al. Visual genome: Connecting Proceedings of the IEEE/CVF Conference on Computer
language and vision using crowdsourced dense image Vision and Pattern Recognition, pages 12174–12182, 2019c.
annotations. arXiv preprint arXiv:1602.07332, 2016. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu
17
Wei, et al. Oscar: Object-semantics aligned pre-training for arXiv preprint arXiv:1301.3781, 2013.
vision-language tasks. In ECCV, 2020b. Aditya Mogadala, Marimuthu Kalimuthu, and Dietrich Klakow.
Chin-Yew Lin. Rouge: A package for automatic evaluation of Trends in integration of vision and language research: A
summaries. 2004. survey of tasks, datasets, and methods. Journal of Artificial
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Intelligence Research, 71:1183–1317, 2021.
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual
Zitnick. Microsoft coco: Common objects in context. In attention networks for multimodal reasoning and matching.
European conference on computer vision, pages 740–755. In Proceedings of the IEEE conference on computer vision
Springer, 2014. and pattern recognition, pages 299–307, 2017.
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text:
Bilinear cnns for fine-grained visual recognition, 2017. Describing images using 1 million captioned photographs.
Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, and Jing NeurIPS, 2011.
Liu. CPTR: Full Transformer Network for Image Captioning. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
arXiv preprint arXiv:2101.10804, 2021. Zhu. BLEU: a method for automatic evaluation of machine
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. translation. 2002.
Knowing when to look: Adaptive attention via a visual Bryan A. Plummer, Liwei Wang, Chris M. Cervantes,
sentinel for image captioning, 2017a. Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Flickr30k entities: Collecting region-to-phrase correspon-
Hierarchical question-image co-attention for visual question dences for richer image-to-sentence models, 2016.
answering, 2017b. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh,
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda
baby talk. In Proceedings of the IEEE conference on Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger,
computer vision and pattern recognition, pages 7219–7228, and Ilya Sutskever. Learning Transferable Visual Mod-
2018. els From Natural Language Supervision. arXiv preprint
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: arXiv:2103.00020, 2021.
Pretraining task-agnostic visiolinguistic representations for Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
vision-and-language tasks. In NeurIPS, 2019. Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Zero-shot text-to-image generation, 2021.
and Stefan Lee. 12-in-1: Multi-task vision and language Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring
representation learning. In Proceedings of the IEEE/CVF models and data for image question answering, 2015.
Conference on Computer Vision and Pattern Recognition, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
pages 10437–10446, 2020. Faster r-cnn: Towards real-time object detection with region
Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. Multi- proposal networks, 2016.
modal convolutional neural networks for matching image Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
and sentence, 2015. Faster R-CNN: towards real-time object detection with region
Mateusz Malinowski and Mario Fritz. A multi-world approach proposal networks. 39(6):1137–1149, 2017.
to question answering about real-world scenes based on Frank Rosenblatt. The perceptron, a perceiving and recognizing
uncertain input. Advances in neural information processing automaton Project Para. Cornell Aeronautical Laboratory,
systems, 27:1682–1690, 2014. 1957.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Frank Rosenblatt. Principles of neurodynamics. perceptrons and
Ask your neurons: A neural-based approach to answering the theory of brain mechanisms. Technical report, Cornell
questions about images, 2015. Aeronautical Lab Inc Buffalo NY, 1961.
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
and Alan Yuille. Deep captioning with multimodal recurrent Learning internal representations by error propagation. Tech-
neural networks (m-rnn), 2015. nical report, California Univ San Diego La Jolla Inst for
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Cam- Cognitive Science, 1985.
buru, Alan L Yuille, and Kevin Murphy. Generation and Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki
comprehension of unambiguous object descriptions. In Cheung, Alec Radford, and Xi Chen. Improved techniques
Proceedings of the IEEE conference on computer vision for training gans. Advances in neural information processing
and pattern recognition, pages 11–20, 2016. systems, 29:2234–2242, 2016.
David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural
Majumdar. Transparency by design: Closing the gap between machine translation of rare words with subword units. arXiv
performance and interpretability in visual reasoning. 2018 preprint arXiv:1508.07909, 2015.
IEEE/CVF Conference on Computer Vision and Pattern Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu
Recognition, Jun 2018. doi: 10.1109/cvpr.2018.00519. URL Soricut. Conceptual captions: A cleaned, hypernymed, image
http://dx.doi.org/10.1109/CVPR.2018.00519. alt-text dataset for automatic image captioning. 2018.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna
Efficient estimation of word representations in vector space. Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer.
18
How much can clip benefit vision-and-language tasks?, 2021. convolutions. In CVPR, 2015.
Violetta Shevchenko, Damien Teney, Anthony Dick, and Anton Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality
van den Hengel. Reasoning over vision and language: encoder representations from transformers. 2019.
Exploring the benefits of supplemental knowledge, 2021. Hao Tan and Mohit Bansal. Vokenization: improving lan-
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. guage understanding with contextualized, visual-grounded
Visually grounded neural syntax acquisition. In Proceedings supervision. arXiv preprint arXiv:2010.06775, 2020.
of the 57th Annual Meeting of the Association for Computa- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
tional Linguistics, pages 1842–1861, 2019a. Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and Polosukhin. Attention is all you need. In NeurIPS, 2017.
explicit visual reasoning over scene graphs, 2019b. Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh.
Kevin J. Shih, Saurabh Singh, and Derek Hoiem. Where to CIDEr: Consensus-based Image Description Evaluation. In
look: Focus regions for visual question answering, 2016. CVPR, 2015.
Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru
Weston. Image chat: Engaging grounded conversations. arXiv Erhan. Show and tell: A neural image caption generator. In
preprint arXiv:1811.00945, 2018. CVPR, 2015.
Karen Simonyan and Andrew Zisserman. Very deep convolu- Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona,
tional networks for large-scale image recognition. In ICLR, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset.
2015a. 2011.
Karen Simonyan and Andrew Zisserman. Very deep convolu- Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep
tional networks for large-scale image recognition, 2015b. structure-preserving image-text embeddings, 2016.
Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia
Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Tsvetkov, and Yuan Cao. Simvlm: Simple visual language
Batra, and Devi Parikh. Pythia-a platform for vision & model pretraining with weak supervision. arXiv preprint
language research. In NeurIPS, SysML Workshop, volume arXiv:2108.10904, 2021.
2018, 2018. Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating
Manning, and Andrew Y. Ng. Grounded compositional open-domain videos from natural descriptions, 2021a.
semantics for finding and describing images with sentences. Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin
Transactions of the Association for Computational Linguis- Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training
tics, 2:207–218, 2014. doi: 10.1162/tacl a 00177. for neural visual world creation, 2021b.
Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual
Wei, and Jifeng Dai. VL-BERT: Pre-training of generic entailment task for visually-grounded language learning,
visual-linguistic representations. In ICLR, 2019. 2019.
Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
of natural language for visual reasoning. In Proceedings of Kaiming He. Aggregated residual transformations for deep
the 55th Annual Meeting of the Association for Computa- neural networks. In CVPR, pages 1492–1500, 2017a.
tional Linguistics (Volume 2: Short Papers), pages 217–223, Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and
2017. Kevin Murphy. Rethinking spatiotemporal feature learning
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun for video understanding. arXiv preprint arXiv:1712.04851,
Bai, and Yoav Artzi. A corpus for reasoning about natural 1(2):5, 2017b.
language grounded in photographs. In ACL, pages 6418– Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron
6428, 2019. Courville, Ruslan Salakhutdinov, Richard S Zemel, and
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Yoshua Bengio. Show, attend and tell: Neural image caption
Cordelia Schmid. Videobert: A joint model for video and generation with visual attention. 2015.
language representation learning. In ICCV, 2019a. Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-
Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua grained text to image generation with attentional generative
Wu. Ernie: Enhanced representation through knowledge adversarial networks. In Proceedings of the IEEE conference
integration. arXiv preprint arXiv:1904.09223, 2019b. on computer vision and pattern recognition, pages 1316–
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to 1324, 2018.
sequence learning with neural networks, 2014. Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Fu, Houqiang Li, and Jiebo Luo. Probing inter-modality:
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Visual parsing with self-attention for vision-language pre-
Vanhoucke, and Andrew Rabinovich. Going deeper with training. arXiv preprint arXiv:2106.13488, 2021.
convolutions, 2014. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Smola. Stacked attention networks for image question
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent answering, 2016.
Vanhoucke, and Andrew Rabinovich. Going deeper with Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet
19
Kohli, and Joshua B. Tenenbaum. Neural-symbolic vqa: Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov,
Disentangling reasoning from vision and language under- Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning
standing, 2019. books and movies: Towards story-like visual explanations by
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. watching movies and reading books. In Proceedings of the
From image descriptions to visual denotations: New similar- IEEE international conference on computer vision, pages
ity metrics for semantic inference over event descriptions. 19–27, 2015.
2014. C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating
Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua object proposals from edges. In European conference on
Wu, and Haifeng Wang. Ernie-vil: Knowledge enhanced computer vision, pages 391–405. Springer, 2014.
vision-language representations through scene graph. arXiv
preprint arXiv:2006.16934, 1:12, 2020.
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg,
and Tamara L. Berg. Modeling context in referring expres-
sions, 2016.
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit
Bansal, and Tamara L. Berg. Mattnet: Modular attention
network for referring expression comprehension, 2018.
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang
Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin
Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu,
Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao,
Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and
Pengchuan Zhang. Florence: A new foundation model for
computer vision, 2021.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi.
From recognition to cognition: Visual commonsense reason-
ing. In CVPR, pages 6720–6731, 2019.
Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng. Mul-
timodal intelligence: Representation learning, information
fusion, and applications. IEEE Journal of Selected Topics
in Signal Processing, 14(3):478–493, 2020.
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang
Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text
to photo-realistic image synthesis with stacked generative
adversarial networks, 2017.
Lei Zhang and Heung-Yeung Shum. Statistical foundation
behind machine learning and its impact on computer vision.
Manuscript under review, 2022.
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei
Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl:
Revisiting visual representations in vision-language models.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5579–5588, 2021.
Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama,
Eiichiro Sumita, Zuchao Li, and Hai Zhao. Neural machine
translation with universal visual representation. In Interna-
tional Conference on Learning Representations, 2019.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J
Corso, and Jianfeng Gao. Unified vision-language pre-
training for image captioning and VQA. In AAAI, pages
13041–13049, 2020.
Xinxin Zhu, Weining Wang, Longteng Guo, and Jing Liu.
AutoCaption: Image Captioning with Neural Architecture
Search. arXiv preprint arXiv:2012.09742, 2020.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei.
Visual7w: Grounded question answering in images. In
Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 4995–5004, 2016.