arXiv:2012.12556v6 [cs.CV] 10 Jul 2023
Abstract—Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the
self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to
computer vision tasks. On a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of
networks such as convolutional and recurrent neural networks. Given its high performance and reduced need for vision-specific inductive
bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision
transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we
explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient
transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the
self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the
challenges and provide several further research directions for vision transformers.
Index Terms—Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video.
1 INTRODUCTION
Fig. 1: Key milestones in the development of transformer. The vision transformer models are marked in red.
2017.6 | Transformer: proposed solely based on the attention mechanism, showing great performance on NLP tasks.
2020.5 | GPT-3: a huge transformer with 170B parameters, taking a big step towards a general NLP model.
2020.10 | ViT: pure transformer architectures work well for visual recognition.
2021 | ViT Variants: variants of the ViT model, e.g., DeiT, PVT, TNT, and Swin.
2023 | GPT-4: a generalized multi-modal model for both language and vision tasks.
between high- and mid-level vision is becoming more obscure
in DNN-based vision systems [23], [24], we treat them as a
single category here. A few examples of transformer models that
address these high/mid-level vision tasks include DETR [16], de-
formable DETR [17] for object detection, and Max-DeepLab [25]
for segmentation. Low-level image processing mainly deals with
extracting descriptions from images (such descriptions are usually
represented as images themselves) [26]. Typical applications of
low-level image processing include super-resolution, image de-
noising, and style transfer. At present, only a few works [19], [27]
in low-level vision use transformers, creating the need for further
investigation. Another category is video processing, which is an
important part in both computer vision and image-based tasks. Due
to the sequential property of video, transformer is inherently well
suited for use on video tasks [20], [28], in which it is beginning
to perform on par with conventional CNNs and RNNs. Here, we
survey the works associated with transformer-based visual models
in order to track the progress in this field. Figure 1 shows the
development timeline of vision transformer; undoubtedly, there
will be many more milestones in the future.

Fig. 2: Structure of the original transformer (image from [9]).
The rest of the paper is organized as follows. Section 2 discusses the formulation of the standard transformer and the self-attention mechanism. Section 3 is the main part of the paper, in which we summarize the vision transformer models for backbone, high/mid-level vision, low-level vision, and video tasks. We also briefly describe efficient transformer methods, as they are closely related to our main topic. In the final section, we give our conclusion and discuss several research directions and challenges. Due to the page limit, we describe the transformer methods in NLP in the supplemental material, as the research experience may be beneficial for vision tasks. In the supplemental material, we also review the self-attention mechanism for CV as a supplement to the vision transformer models. In this survey, we mainly include the representative works (early, pioneering, novel, or inspiring works), since there are many preprint works on arXiv and we cannot include them all in the limited pages.

2 FORMULATION OF TRANSFORMER

Transformer [9] was first used in the field of natural language processing (NLP) on machine translation tasks. As shown in Figure 2, it consists of an encoder and a decoder with several transformer blocks of the same architecture. The encoder generates encodings of the inputs, while the decoder takes all the encodings and uses their incorporated contextual information to generate the output sequence. Each transformer block is composed of a multi-head attention layer, a feed-forward neural network, shortcut connections and layer normalization. In the following, we describe each component of the transformer in detail.

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector q, the key vector k and the value vector v with dimension dq = dk = dv = dmodel = 512. Vectors derived from different inputs are then packed together into three different matrices, namely, Q, K and V. Subsequently, the attention function between different input vectors is calculated as follows (and shown in Figure 3 left):
• Step 1: Compute scores between different input vectors with S = Q · K⊤;
• Step 2: Normalize the scores for gradient stability with Sn = S/√dk;
• Step 3: Translate the scores into probabilities with the softmax function P = softmax(Sn);
• Step 4: Obtain the weighted value matrix with Z = P · V.
The process can be unified into a single function:

Attention(Q, K, V) = softmax(Q · K⊤ / √dk) · V.   (1)

The logic behind Eq. 1 is simple. Step 1 computes scores between each pair of different vectors, and these scores determine the degree of attention that we give other words when encoding the word at the current position. Step 2 normalizes the scores to enhance gradient stability for improved training, and step 3 translates the scores into probabilities. Finally, each value vector is weighted by these probabilities and summed, so that vectors with larger probabilities receive additional focus from the following layers.
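As a concrete illustration of the four steps and Eq. 1, the following minimal NumPy sketch computes single-head scaled dot-product self-attention; it is our own illustrative code with toy sizes and random weights, not an implementation from any of the surveyed works.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Eq. 1).

    X: (n, d_model) packed input vectors; Wq, Wk, Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query, key, value matrices
    S = Q @ K.T                                 # Step 1: pairwise scores
    Sn = S / np.sqrt(K.shape[-1])               # Step 2: scale by sqrt(d_k)
    P = np.exp(Sn - Sn.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)       # Step 3: softmax over each row
    return P @ V                                # Step 4: weighted value matrix Z

# Toy usage: n = 4 tokens, d_model = d_k = 8 (sizes chosen only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)               # Z has shape (4, 8)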
where W1 and W2 are the two parameter matrices of the two linear transformation layers, and σ represents the nonlinear activation function, such as GELU [51]. The dimensionality of the hidden layer is dh = 2048.
Residual Connection in the Encoder and Decoder. As shown in Figure 2, a residual connection is added to each sub-layer in the encoder and decoder. This strengthens the flow of information in order to achieve higher performance. A layer normalization [52] follows the residual connection. The output of these operations can be described as:

LayerNorm(X + Attention(X)).   (6)

Here, X is used as the input of the self-attention layer, and the query, key and value matrices Q, K and V are all derived from the same input matrix X. A variant, pre-layer normalization (Pre-LN), is also widely used [53], [54], [15]. Pre-LN inserts the layer normalization inside the residual connection and before the multi-head attention or FFN. For the normalization layer, there are several alternatives such as batch normalization [55]. Batch normalization usually performs worse when applied to transformers, as the feature values change acutely [56]. Some other normalization algorithms [57], [56], [58] have been proposed to improve the training of transformers.
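To make the difference between the post-LN form of Eq. 6 and the Pre-LN variant explicit, the following PyTorch-style sketch shows both orderings for a generic sub-layer. It is a simplified illustration under our own assumptions (the sublayer argument stands for any attention or FFN module), not code from the cited papers.

import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original ordering (Eq. 6): sub-layer, residual add, then LayerNorm."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer           # e.g., self-attention or FFN
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN variant: LayerNorm inside the residual branch, before the sub-layer."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

# Toy usage: a linear map stands in for the attention/FFN sub-layer
y = PreLNBlock(64, nn.Linear(64, 64))(torch.randn(2, 10, 64))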
Final Layer in the Decoder. The final layer in the decoder is used to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with dword dimensions, where dword is the number of words in the vocabulary. The softmax layer is then used to transform the logits vector into probabilities.
When used for CV tasks, most transformers adopt the original transformer's encoder module. Such transformers can be treated as a new type of feature extractor. Compared with CNNs, which focus only on local characteristics, the transformer can capture long-distance characteristics, meaning that it can easily derive global information. In contrast to RNNs, whose hidden state must be computed sequentially, the transformer is more efficient because the output of the self-attention layer and the fully connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further study into using transformer in both computer vision and NLP would yield beneficial results.

3 VISION TRANSFORMER

In this section, we review the applications of transformer-based models in computer vision, including image classification, high/mid-level vision, low-level vision and video processing. We also briefly summarize the applications of the self-attention mechanism and model compression methods for efficient transformer.

3.1 Backbone for Representation Learning

Inspired by the success that transformer has achieved in the field of NLP, some researchers have explored whether similar models can learn useful representations for images. Given that images involve more dimensions, noise and redundant modality compared to text, they are believed to be more difficult for generative modeling.

Fig. 4: A taxonomy of backbones using convolution and attention.

Other than CNNs, the transformer can be used as a backbone network for image classification. Wu et al. [59] adopted ResNet as a convenient baseline and used vision transformers to replace the last stage of convolutions. Specifically, they apply convolutional layers to extract low-level features that are then fed into the vision transformer. For the vision transformer, they use a tokenizer to group pixels into a small number of visual tokens, each representing a semantic concept in the image. These visual tokens are used directly for image classification, with the transformers being used to model the relationships between tokens. As shown in Figure 4, the works can be divided into those purely using transformer for vision and those combining CNN and transformer. We summarize the results of these models in Table 2 and Figure 6 to demonstrate the development of the backbones. In addition to supervised learning, self-supervised learning has also been explored for vision transformers.

3.1.1 Pure Transformer

ViT. Vision Transformer (ViT) [15] is a pure transformer applied directly to sequences of image patches for the image classification task. It follows the transformer's original design as closely as possible. Figure 5 shows the framework of ViT.
To handle 2D images, the image X ∈ R^{h×w×c} is reshaped into a sequence of flattened 2D patches Xp ∈ R^{n×(p²·c)}, where c is the number of channels, (h, w) is the resolution of the original image, and (p, p) is the resolution of each image patch. The effective sequence length for the transformer is therefore n = hw/p². Because the transformer uses constant widths in all of its layers, a trainable linear projection maps each vectorized patch to the model dimension d, the output of which is referred to as the patch embeddings.
Similar to BERT's [class] token, a learnable embedding is prepended to the sequence of patch embeddings, and the state of this embedding serves as the image representation. During both the pre-training and fine-tuning stages, the classification head is attached to this embedding. In addition, 1D position embeddings are added to the patch embeddings in order to retain positional information. It is worth noting that ViT utilizes only the standard transformer's encoder (except for the placement of the layer normalization), whose output precedes an MLP head. In most cases, ViT is pre-trained on large datasets and then fine-tuned for downstream tasks with smaller data.
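The patch tokenization described above can be sketched as follows: the image is cut into p × p patches, each patch is flattened and linearly projected to dimension d, a learnable [class] embedding is prepended, and position embeddings are added. The sizes below are arbitrary toy values and the code is our own illustration, not the official ViT implementation.

import numpy as np

def patchify(image, p):
    """Split an (h, w, c) image into n = h*w/p^2 flattened patches of length p*p*c."""
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0
    patches = image.reshape(h // p, p, w // p, p, c)        # block decomposition
    patches = patches.transpose(0, 2, 1, 3, 4)              # (h/p, w/p, p, p, c)
    return patches.reshape(-1, p * p * c)                   # (n, p*p*c)

# Toy example: a 32x32 RGB image, 8x8 patches, model dimension d = 64 (assumed sizes)
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
x_p = patchify(image, p=8)                                  # (16, 192) patch sequence
W_proj = rng.normal(size=(192, 64))                         # trainable linear projection
tokens = x_p @ W_proj                                       # (16, 64) patch embeddings
cls_token = rng.normal(size=(1, 64))                        # learnable [class] embedding
pos_embed = rng.normal(size=(17, 64))                       # 1D position embeddings
z0 = np.concatenate([cls_token, tokens], axis=0) + pos_embed  # encoder input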
ViT yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. Because transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, they do not generalize well when trained on insufficient amounts of data. However, the authors found that training the models on large datasets (14 million to 300 million images) surpasses inductive bias. When pre-trained at sufficient scale, transformers achieve excellent results on tasks with fewer datapoints. For example, when pre-trained on the JFT-300M dataset, ViT approached or even exceeded state-of-the-art performance on multiple image recognition benchmarks. Specifically, it reached an accuracy of 88.36% on ImageNet, and 77.16% on the VTAB suite of 19 tasks.
Touvron et al. [60] proposed a competitive convolution-free transformer, called Data-efficient image transformer (DeiT), by training on only the ImageNet database. DeiT-B, the reference vision transformer, has the same architecture as ViT-B and employs 86 million parameters. With strong data augmentation, DeiT-B achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. In addition, the authors observe that using a CNN teacher gives better performance than using a transformer. Specifically, DeiT-B can achieve a top-1 accuracy of 84.40% with the help of token-based distillation.

Fig. 5: The framework of ViT (image from [15]).

Variants of ViT. Following the paradigm of ViT, a series of ViT variants have been proposed to improve the performance on vision tasks. The main approaches include enhancing locality, improving self-attention and architecture design.
The original vision transformer is good at capturing long-range dependencies between patches, but disregards local feature extraction, as the 2D patch is projected to a vector with a simple linear layer. Recently, researchers have begun to pay attention to improving the modeling capacity for local information [29], [61], [62]. TNT [29] further divides the patch into a number of sub-patches and introduces a novel transformer-in-transformer architecture which utilizes an inner transformer block to model the relationship between sub-patches and an outer transformer block for patch-level information exchange. Twins [63] and CAT [64] alternately perform local and global attention layer by layer. Swin Transformer [61], [65] performs local attention within a window and introduces a shifted window partitioning approach for cross-window connections. Shuffle Transformer [66], [67] further utilizes the spatial shuffle operation instead of shifted window partitioning to allow cross-window connections. RegionViT [62] generates regional tokens and local tokens from an image, and local tokens receive global information via attention with regional tokens. In addition to local attention, some other works propose to boost local information through local feature aggregation, e.g., T2T [68]. These works demonstrate the benefit of the local information exchange and global information exchange in vision transformer.
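A rough sketch of the window-based local attention used by Swin-style models: the token grid is split into non-overlapping windows and self-attention is computed independently inside each window. This is our own simplified illustration with assumed sizes; it omits the shifted-window step, multi-head projections and relative position biases of the actual models.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def window_attention(tokens, hw, window):
    """Self-attention restricted to non-overlapping windows.

    tokens: (h*w, d) patch embeddings laid out on an h x w grid
    hw:     (h, w) grid size; window: window side length (must divide h and w)
    """
    h, w = hw
    d = tokens.shape[-1]
    grid = tokens.reshape(h // window, window, w // window, window, d)
    grid = grid.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, d)
    out = np.stack([attention(win, win, win) for win in grid])   # per-window attention
    out = out.reshape(h // window, w // window, window, window, d)
    return out.transpose(0, 2, 1, 3, 4).reshape(h * w, d)

# Toy usage: an 8x8 grid of 16-dim tokens with 4x4 windows (sizes are assumptions)
tokens = np.random.default_rng(0).normal(size=(64, 16))
y = window_attention(tokens, hw=(8, 8), window=4)   # same shape as the input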
As a key component of the transformer, the self-attention layer provides the ability for global interaction between image patches. Improving the calculation of the self-attention layer has attracted many researchers. DeepViT [69] proposes to establish cross-head communication to re-generate the attention maps and increase their diversity at different layers. KVT [70] introduces k-NN attention to utilize the locality of image patches and ignore noisy tokens by computing attention only with the top-k similar tokens. Refiner [71] explores attention expansion in a higher-dimensional space and applies convolution to augment local patterns of the attention maps. XCiT [72] performs the self-attention calculation across feature channels rather than tokens, which allows efficient processing of high-resolution images. The computation complexity and attention precision of the self-attention mechanism are two key points for future optimization.
The network architecture is an important factor, as demonstrated in the field of CNNs. The original architecture of ViT is a simple stack of same-shape transformer blocks. New architecture designs for vision transformer have been an interesting topic. The pyramid-like architecture is utilized by many vision transformer models [73], [61], [74], [75], [76], [77] including PVT [73], HVT [78], Swin Transformer [61] and PiT [79]. There are also other types of architectures, such as the two-stream architecture [80] and the U-net architecture [81], [30]. Neural architecture search (NAS) has also been investigated to search for better transformer architectures, e.g., Scaling-ViT [82], ViTAS [83], AutoFormer [84] and GLiT [85]. Currently, both network design and NAS for vision transformer mainly draw on the experience of CNNs. In the future, we expect specific and novel architectures to appear in the field of vision transformer.
In addition to the aforementioned approaches, there are some other directions to further improve vision transformer, e.g., positional encoding [86], [87], normalization strategy [88], shortcut connection [89] and removing attention [90], [91], [92], [93].

3.1.2 Transformer with Convolution

Although vision transformers have been successfully applied to various visual tasks due to their ability to capture long-range dependencies within the input, there are still gaps in performance between transformers and existing CNNs. One main reason can be the lack of ability to extract local information. Apart from the above-mentioned variants of ViT that enhance locality, combining the transformer with convolution can be a more straightforward way to introduce locality into the conventional transformer.
There are plenty of works trying to augment a conventional transformer block or self-attention layer with convolution. For example, CPVT [86] proposed a conditional positional encoding (CPE) scheme, which is conditioned on the local neighborhood of input tokens and adaptable to arbitrary input sizes, to leverage convolutions for fine-level feature encoding. CvT [97], CeiT [98], LocalViT [99] and CMT [95] analyzed the potential drawbacks of directly borrowing Transformer architectures from NLP and combined convolutions with transformers. Specifically, the feed-forward network (FFN) in each transformer block is combined with a convolutional layer that promotes the correlation among neighboring tokens. LeViT [100] revisited principles from the extensive literature on CNNs and applied them to transformers, proposing a hybrid neural network for fast inference image classification.
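One common instantiation of such an FFN-convolution combination inserts a depthwise convolution between the two linear layers so that neighboring tokens interact. The PyTorch-style sketch below is our own simplified illustration of this idea; the layer sizes and the exact placement of the convolution are assumptions that vary across CeiT, LocalViT, CMT and related works.

import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN with a depthwise 3x3 convolution between the two linear layers."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)   # depthwise: per-channel
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, hw):
        # x: (batch, n, dim) token sequence; hw = (h, w) with n = h * w
        h, w = hw
        x = self.act(self.fc1(x))
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)   # tokens back to a 2D feature map
        x = self.act(self.dwconv(x))                # local mixing of neighboring tokens
        x = x.reshape(b, c, n).transpose(1, 2)
        return self.fc2(x)

# Toy usage (assumed sizes): 14x14 = 196 tokens of dimension 64
y = ConvFFN(dim=64, hidden_dim=256)(torch.randn(2, 196, 64), hw=(14, 14))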
TABLE 2: ... vision transformer models. Pure transformer means only using a few convolutions in the stem stage. CNN + Transformer means using convolutions in the intermediate layers. Following [60], [61], the throughput is measured on an NVIDIA V100 GPU with PyTorch.
Model Params (M) FLOPs (B) Throughput (image/s) Top-1 (%)
CNN
ResNet-50 [12], [68] 25.6 4.1 1226 79.1

Fig. 6: Top-1 accuracy (%) versus FLOPs (B) and versus throughput (image/s) for ResNet, EfficientNet, DeiT, PVT, T2T, Swin, CMT and VOLO.
BoTNet [101] replaced the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and improved upon the baselines significantly on both instance segmentation and object detection tasks with minimal overhead in latency.
Besides, some researchers have demonstrated that transformer-based models can be more difficult to fit to data [15], [102], [103]; in other words, they are sensitive to the choice of optimizer, hyper-parameters, and the training schedule. Visformer [102] revealed the gap between transformers and CNNs under two different training settings. The first one is the standard setting for CNNs, i.e., the training schedule is shorter and the data augmentation only contains random cropping and horizontal flipping. The other one is the training setting used in [60], i.e., the training schedule is longer and the data augmentation is stronger. [103] changed the early visual processing of ViT by replacing its embedding stem with a standard convolutional stem, and found that this change allows ViT to converge faster and enables the use of either AdamW or SGD without a significant drop in accuracy. In addition to these two works, [100], [95] also choose to add a convolutional stem on top of the transformer.

3.1.3 Self-supervised Representation Learning

Generative Based Approach. Generative pre-training methods for images have existed for a long time [104], [105], [106], [107]. Chen et al. [14] re-examined this class of methods and combined them with self-supervised methods. After that, several works [108],

Here, the identity permutation πi = i is adopted for 1 ⩽ i ⩽ n, which is also known as raster order. Chen et al. also considered the BERT objective, which samples a sub-sequence M ⊂ [1, n] such that each index i independently has probability 0.15 of appearing in M. M is called the BERT mask, and the model is trained by minimizing the negative log-likelihood of the "masked" elements xM conditioned on the "unmasked" ones x[1,n]\M:

LBERT = E_{x∼X} E_M Σ_{i∈M} [−log p(xi | x_{[1,n]\M})].   (9)

During the pre-training stage, they pick either LAR or LBERT and minimize the loss over the pre-training dataset.
The GPT-2 [110] formulation of the transformer decoder block is used. To ensure proper conditioning when training the AR objective, Chen et al. apply the standard upper triangular mask to the n × n matrix of attention logits. No attention logit masking is required when the BERT objective is used: Chen et al. zero out the positions after the content embeddings are applied to the input sequence. Following the final transformer layer, they apply a layer norm and learn a projection from the output to logits parameterizing the conditional distributions at each sequence element. When training with the BERT objective, they simply ignore the logits at unmasked positions.
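A minimal sketch of how the masked negative log-likelihood of Eq. 9 can be accumulated, under our own simplifying assumptions (a toy categorical predictor over 256 pixel values and a fixed 0.15 masking probability); it only illustrates the objective and is not the actual iGPT training code.

import numpy as np

def bert_masked_nll(x, predict_fn, mask_prob=0.15, rng=None):
    """Estimate the Eq. 9 objective for one sequence x of discrete tokens.

    predict_fn(x_visible, mask) must return per-position probabilities of
    shape (n, vocab), conditioned only on the unmasked tokens.
    """
    rng = rng or np.random.default_rng()
    n = len(x)
    mask = rng.random(n) < mask_prob                 # each index kept in M with prob 0.15
    probs = predict_fn(np.where(mask, 0, x), mask)   # condition on the unmasked tokens
    nll = -np.log(probs[mask, x[mask]] + 1e-12)      # -log p(x_i | x_{[1,n]\M}) for i in M
    return nll.sum()

# Toy usage with a hypothetical uniform "model" over 256 pixel values
x = np.random.default_rng(0).integers(0, 256, size=64)
uniform = lambda x_vis, mask: np.full((len(x_vis), 256), 1.0 / 256)
loss = bert_masked_nll(x, uniform)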
During the fine-tuning stage, they average pool the output of the final layer normalization layer across the sequence dimension
large (e.g., 4096). With this simplification, the contrastive loss can be implemented in a simple way. The encoder fq consists of a backbone (e.g., ViT), a projection head and an extra prediction head; while the encoder fk has the backbone and projection head,

Fig. 7(a): Transformer-based set prediction for detection (image patches are fed to transformer encoders and a prediction head that outputs class and box).
backbone to extract features from the input image. To supplement the image features with position information, fixed positional encodings are added to the flattened features before the features are fed into the encoder-decoder transformer. The decoder consumes the embeddings from the encoder along with N learned positional encodings (object queries), and produces N output embeddings. Here N is a predefined parameter and is typically larger than the number of objects in an image. Simple feed-forward networks (FFNs) are used to compute the final predictions, which include the bounding box coordinates and class labels to indicate the specific class of object (or to indicate that no object exists). Unlike the original transformer, which computes predictions sequentially, DETR decodes N objects in parallel. DETR employs a bipartite matching algorithm to assign the predicted and ground-truth objects. As shown in Eq. 11, the Hungarian loss is exploited to compute the loss function for all matched pairs of objects:

LHungarian(y, ŷ) = Σ_{i=1}^{N} [ −log p̂_{σ̂(i)}(ci) + 1_{ci≠∅} Lbox(bi, b̂_{σ̂(i)}) ],   (11)

where σ̂ is the optimal assignment, ci and p̂_{σ̂(i)}(ci) are the target class label and predicted label, respectively, bi and b̂_{σ̂(i)} are the ground-truth and predicted bounding boxes, and y = {(ci, bi)} and ŷ are the ground truth and prediction of objects, respectively. DETR shows impressive performance on object detection, delivering accuracy and speed comparable to the popular and well-established Faster R-CNN [13] baseline on the COCO benchmark.
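The bipartite matching behind Eq. 11 can be illustrated with scipy's Hungarian solver on a toy cost matrix built from class probabilities and an L1 box distance; the cost terms below are simplified assumptions and do not reproduce DETR's exact matching cost (which also includes a generalized IoU term).

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Bipartite matching between N predictions and M ground-truth objects.

    pred_probs: (N, num_classes) softmax scores, pred_boxes: (N, 4)
    gt_labels:  (M,) class indices,              gt_boxes:  (M, 4)
    Returns (pred_idx, gt_idx) giving the lowest-cost one-to-one assignment.
    """
    class_cost = -pred_probs[:, gt_labels]                    # (N, M): -p(c_j) per prediction
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # L1 box distance
    cost = class_cost + box_cost
    return linear_sum_assignment(cost)                        # Hungarian algorithm

# Toy usage: N = 5 object queries, M = 2 ground-truth objects, 4 classes (assumed sizes)
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=5)
pred_idx, gt_idx = match_predictions(probs, rng.random((5, 4)),
                                     np.array([1, 3]), rng.random((2, 4)))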
Fig. 8: The overall architecture of DETR (image from [16]). A CNN backbone extracts a set of image features, which are combined with a positional encoding and fed into a transformer encoder-decoder together with learned object queries; FFN prediction heads output a class and box (or "no object") for each query.

DETR is a new design for the object detection framework based on transformer and empowers the community to develop fully end-to-end detectors. However, the vanilla DETR poses several challenges, specifically, a longer training schedule and poor performance for small objects. To address these challenges, Zhu et al. [17] proposed Deformable DETR, which has become a popular method that significantly improves the detection performance. The deformable attention module attends to a small set of key positions around a reference point rather than looking at all spatial locations on image feature maps as performed by the original multi-head attention mechanism in transformer. This approach significantly reduces the computational complexity and brings benefits in terms of fast convergence. More importantly, the deformable attention module can be easily applied for fusing multi-scale features. Deformable DETR achieves better performance than DETR with 10× less training cost and 1.6× faster inference speed. By using an iterative bounding box refinement method and a two-stage scheme, Deformable DETR can further improve the detection performance.
There are also several methods to deal with the slow convergence problem of the original DETR. For example, Sun et al. [122] investigated why the DETR model has slow convergence and discovered that this is mainly due to the cross-attention module in the transformer decoder. To address this issue, an encoder-only version of DETR is proposed, achieving considerable improvement in terms of detection accuracy and training convergence. In addition, a new bipartite matching scheme is designed for greater training stability and faster convergence, and two transformer-based set prediction models, i.e., TSP-FCOS and TSP-RCNN, are proposed to improve encoder-only DETR with feature pyramids. These new models achieve better performance compared with the original DETR model. Gao et al. [125] proposed the Spatially Modulated Co-Attention (SMCA) mechanism to accelerate convergence by constraining co-attention responses to be high near initially estimated bounding box locations. By integrating the proposed SMCA module into DETR, a similar mAP can be obtained with about 10× fewer training epochs under comparable inference cost.
Given the high computation complexity associated with DETR, Zheng et al. [123] proposed an Adaptive Clustering Transformer (ACT) to reduce the computation cost of pre-trained DETR. ACT adaptively clusters the query features using a locality sensitivity hashing (LSH) method and broadcasts the attention output to the queries represented by the selected prototypes. ACT is used to replace the self-attention module of the pre-trained DETR model without requiring any re-training. This approach significantly reduces the computational cost while the accuracy drops only slightly. The performance drop can be further reduced by utilizing a multi-task knowledge distillation (MTKD) method, which exploits the original transformer to distill the ACT module with a few epochs of fine-tuning. Yao et al. [126] pointed out that the random initialization in DETR is the main reason for the requirement of multiple decoder layers and slow convergence. To this end, they proposed Efficient DETR to incorporate the dense prior into the detection pipeline via an additional region proposal network. The better initialization enables them to use only one decoder layer instead of six layers to achieve competitive performance with a more compact network.
Transformer-based Backbone for Detection. Unlike DETR, which redesigns object detection as a set prediction task via transformer, Beal et al. [115] proposed to utilize transformer as a backbone for common detection frameworks such as Faster R-CNN [13]. The input image is divided into several patches and fed into a vision transformer, whose output embedding features are reorganized according to spatial information before passing through a detection head for the final results. A massive pre-training transformer backbone could bring benefits to the proposed ViT-FRCNN. There are also quite a few methods that explore versatile vision transformer backbone designs [29], [73], [61], [63] and transfer these backbones to traditional detection frameworks like RetinaNet [129] and Cascade R-CNN [130]. For example, Swin Transformer [61] obtains about 4 box AP gains over a ResNet-50 backbone with similar FLOPs for various detection frameworks.
Pre-training for Transformer-based Object Detection. Inspired by the pre-training transformer scheme in NLP, several methods have been proposed to explore different pre-training schemes for transformer-based object detection [33], [128], [131]. Dai et al. [33] proposed unsupervised pre-training for object detection (UP-DETR). Specifically, a novel unsupervised pretext task named random query patch detection is proposed to pre-train the DETR model. With this unsupervised pre-training scheme, UP-DETR significantly improves the detection accuracy on a relatively small dataset (PASCAL VOC). On the COCO benchmark with sufficient training data, UP-DETR still outperforms DETR, demonstrating the effectiveness of the unsupervised pre-training scheme.
TABLE 3: Comparison of different transformer-based object detectors on the COCO 2017 val set. Running speed (FPS) is evaluated on an
NVIDIA Tesla V100 GPU as reported in [17]. † Estimated speed according to the reported number in the paper. ‡ ViT backbone is pre-trained
on ImageNet-21k. ∗ ViT backbone is pre-trained on a private dataset with 1.3 billion images.
Method Epochs AP AP50 AP75 APS APM APL #Params (M) GFLOPs FPS
CNN based
FCOS [127] 36 41.0 59.8 44.1 26.2 44.6 52.2 - 177 23†
Faster R-CNN + FPN [13] 109 42.0 62.1 45.5 26.6 45.4 53.4 42 180 26
CNN Backbone + Transformer Head
DETR [16] 500 42.0 62.4 44.2 20.5 45.8 61.1 41 86 28
DETR-DC5 [16] 500 43.3 63.1 45.9 22.5 47.3 61.1 41 187 12
Deformable DETR [17] 50 46.2 65.2 50.0 28.8 49.2 61.7 40 173 19
TSP-FCOS [122] 36 43.1 62.3 47.0 26.6 46.8 55.9 - 189 20†
TSP-RCNN [122] 96 45.0 64.5 49.6 29.7 47.7 58.0 - 188 15†
ACT+MTKD (L=32) [123] - 43.1 61.4 47.1 22.2 - - - 169 14†
SMCA [125] 108 45.6 65.5 49.1 25.9 49.3 62.6 - - -
Efficient DETR [126] 36 45.1 63.1 49.1 28.3 48.4 59.0 35 210 -
UP-DETR [33] 150 40.5 60.8 42.6 19.0 44.4 60.0 41 - -
UP-DETR [33] 300 42.8 63.0 45.3 20.8 47.1 61.7 41 - -
Transformer Backbone + CNN Head
ViT-B/16-FRCNN‡ [115] 21 36.6 56.3 39.3 17.4 40.0 55.5 - - -
ViT-B/16-FRCNN∗ [115] 21 37.8 57.4 40.1 17.8 41.4 57.3 - - -
PVT-Small+RetinaNet [73] 12 40.4 61.3 43.0 25.0 42.9 55.7 34.2 118 -
Twins-SVT-S+RetinaNet [63] 12 43.0 64.2 46.3 28.0 46.4 57.5 34.3 104 -
Swin-T+RetinaNet [61] 12 41.5 62.1 44.2 25.1 44.9 55.5 38.5 118 -
Swin-T+ATSS [61] 36 47.2 66.5 51.3 - - - 36 215 -
Pure Transformer based
PVT-Small+DETR [73] 50 34.7 55.7 35.4 12.0 36.4 56.7 40 - -
TNT-S+DETR [29] 50 38.2 58.9 39.4 15.5 41.1 58.8 39 - -
YOLOS-Ti [128] 300 30.0 - - - - - 6.5 21 -
YOLOS-S [128] 150 37.6 57.6 39.2 15.9 40.2 57.3 28 179 -
YOLOS-B [128] 150 42.0 62.2 44.5 19.5 45.3 62.1 127 537 -
Fang et al. [128] explored how to transfer the pure ViT structure that is pre-trained on ImageNet to the more challenging object detection task and proposed the YOLOS detector. To cope with the object detection task, the proposed YOLOS first drops the classification tokens in ViT and appends learnable detection tokens. Besides, a bipartite matching loss is utilized to perform set prediction for objects. With this simple pre-training scheme on the ImageNet dataset, the proposed YOLOS shows competitive performance for object detection on the COCO benchmark.

3.2.2 Segmentation

Segmentation is an important topic in the computer vision community, which broadly includes panoptic segmentation, instance segmentation and semantic segmentation, etc. Vision transformer has also shown impressive potential in the field of segmentation.
Transformer for Panoptic Segmentation. DETR [16] can be naturally extended for panoptic segmentation tasks and achieves competitive results by appending a mask head on the decoder. Wang et al. [25] proposed Max-DeepLab to directly predict panoptic segmentation results with a mask transformer, without involving surrogate sub-tasks such as box detection. Similar to DETR, Max-DeepLab streamlines the panoptic segmentation task in an end-to-end fashion and directly predicts a set of non-overlapping masks and corresponding labels. Model training is performed using a panoptic quality (PQ) style loss, but unlike prior methods that stack a transformer on top of a CNN backbone, Max-DeepLab adopts a dual-path framework that facilitates combining the CNN and transformer.
Transformer for Instance Segmentation. VisTR, a transformer-based video instance segmentation model, was proposed by Wang et al. [34] to produce instance prediction results from a sequence of input images. A strategy for matching instance sequences is proposed to assign the predictions to ground truths. In order to obtain the mask sequence for each instance, VisTR utilizes an instance sequence segmentation module to accumulate the mask features from multiple frames and segments the mask sequence with a 3D CNN. Hu et al. [132] proposed an instance segmentation Transformer (ISTR) to predict low-dimensional mask embeddings, and match them with ground truth for the set loss. ISTR conducts detection and segmentation with a recurrent refinement strategy, which is different from the existing top-down and bottom-up frameworks. Yang et al. [133] investigated how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. Some papers such as [134], [135] also discuss using transformer to deal with the segmentation task.
Transformer for Semantic Segmentation. Zheng et al. [18] proposed a transformer-based semantic segmentation network (SETR). SETR utilizes an encoder similar to ViT [15] to extract features from an input image, and a multi-level feature aggregation module is adopted for performing pixel-wise segmentation. Strudel et al. [136] introduced Segmenter, which relies on the output embeddings corresponding to image patches and obtains class labels with a point-wise linear decoder or a mask transformer decoder. Xie et al. [137] proposed a simple, efficient yet powerful semantic segmentation framework which unifies transformers with lightweight multilayer perceptron (MLP) decoders, which outputs multiscale features and avoids complex decoders.
Transformer for Medical Image Segmentation. Cao et al. [30] proposed a Unet-like pure transformer for medical image segmentation, by feeding the tokenized image patches into a transformer-based U-shaped encoder-decoder architecture with skip-connections for local-global semantic feature learning. Valanarasu et al. [138] explored transformer-based solutions, studied the feasibility of using transformer-based network architectures for medical image segmentation tasks, and proposed a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. Cell-DETR [139], based on the DETR panoptic segmentation model, is an attempt to use transformer for cell instance segmentation. It adds skip connections that bridge features between the backbone CNN and the CNN decoder in the segmentation
head in order to enhance feature fusion. Cell-DETR achieves state-of-the-art performance for cell instance segmentation from microscopy imagery.

3.2.3 Pose Estimation

Human pose and hand pose estimation are foundational topics that have attracted significant interest from the research community. Articulated pose estimation is akin to a structured prediction task, aiming to predict the joint coordinates or mesh vertices from input RGB/D images. Here we discuss some methods [35], [36], [37], [119] that explore how to utilize transformer for modeling the global structure information of human poses and hand poses.
Transformer for Hand Pose Estimation. Huang et al. [35] proposed a transformer-based network for 3D hand pose estimation from point sets. The encoder first utilizes a PointNet [140] to extract point-wise features from input point clouds and then adopts standard multi-head self-attention modules to produce embeddings. In order to expose more global pose-related information to the decoder, a feature extractor such as PointNet++ [141] is used to extract hand joint-wise features, which are then fed into the decoder as positional encodings. Similarly, Huang et al. [36] proposed HOT-Net (short for hand-object transformer network) for 3D hand-object pose estimation. Unlike the preceding method, which employs transformer to directly predict 3D hand pose from input point clouds, HOT-Net uses a ResNet to generate the initial 2D hand-object pose and then feeds it into a transformer to predict the 3D hand-object pose. A spectral graph convolution network is therefore used to extract input embeddings for the encoder. Hampali et al. [142] proposed to estimate the 3D poses of two hands given a single color image. Specifically, appearance and spatial encodings of a set of potential 2D locations for the joints of both hands were input to a transformer, and the attention mechanism was used to sort out the correct configuration of the joints and output the 3D poses of both hands.
Transformer for Human Pose Estimation. Lin et al. [37] proposed a mesh transformer (METRO) for predicting 3D human pose and mesh from a single RGB image. METRO extracts image features via a CNN and then performs position encoding by concatenating a template human mesh to the image features. A multi-layer transformer encoder with progressive dimensionality reduction is proposed to gradually reduce the embedding dimensions and finally produce the 3D coordinates of human joints and mesh vertices. To encourage the learning of non-local relationships between human joints, METRO randomly masks some input queries during training. Yang et al. [119] constructed an explainable model named TransPose based on the Transformer architecture and low-level convolutional blocks. The attention layers built in the Transformer can capture long-range spatial relationships between keypoints and explain what dependencies the predicted keypoint locations highly rely on. Li et al. [143] proposed a novel approach based on Token representation for human Pose estimation (TokenPose). Each keypoint is explicitly embedded as a token to simultaneously learn constraint relationships and appearance cues from images. Mao et al. [144] proposed a human pose estimation framework that solves the task in a regression-based fashion. They formulated the pose estimation task into a sequence prediction problem and solve it with transformers, bypassing the drawbacks of heatmap-based pose estimators. Jiang et al. [145] proposed a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion rather than tracking body parts and trying to temporally smooth them. The method overcomes inaccuracies in detection and corrects partial or entire skeleton corruption. Hao et al. [146] proposed to personalize a human pose estimator given a set of test images of a person without using any manual annotations. The method adapts the pose estimator during test time to exploit person-specific information, and uses a Transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints.

3.2.4 Other Tasks

There are also quite a lot of different high/mid-level vision tasks that have explored the usage of vision transformer for better performance. We briefly review several tasks below.
Pedestrian Detection. Because the distribution of objects is very dense in occlusion and crowd scenes, additional analysis and adaptation are often required when common detection networks are applied to pedestrian detection tasks. Lin et al. [147] revealed that sparse uniform queries and a weak attention field in the decoder result in performance degradation when directly applying DETR or Deformable DETR to pedestrian detection tasks. To alleviate these drawbacks, the authors proposed the Pedestrian End-to-end Detector (PED), which employs a new decoder called Dense Queries and Rectified Attention field (DQRF) to support dense queries and alleviate the noisy or narrow attention field of the queries. They also proposed V-Match, which achieves additional performance improvements by fully leveraging visible annotations.
Lane Detection. Based on PolyLaneNet [148], Liu et al. [118] proposed a method called LSTR, which improves the performance of curve lane detection by learning the global context with a transformer network. Similar to PolyLaneNet, LSTR regards lane detection as a task of fitting lanes with polynomials and uses neural networks to predict the parameters of the polynomials. To capture slender structures for lanes and the global context, LSTR introduces a transformer network into the architecture. This enables processing of low-level features extracted by CNNs. In addition, LSTR uses a Hungarian loss to optimize the network parameters. As demonstrated in [118], LSTR outperforms PolyLaneNet, with 2.82% higher accuracy and 3.65× higher FPS using 5 times fewer parameters. The combination of a transformer network, CNN and Hungarian loss culminates in a lane detection framework that is precise, fast, and tiny. Considering that an entire lane line generally has an elongated shape and long range, Liu et al. [149] utilized a transformer encoder structure for more efficient context feature extraction. This transformer encoder structure greatly improves the detection of proposal points, which rely on contextual features and global information, especially in the case where the backbone network is a small model.
Scene Graph. A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene [150]. To generate scene graphs, most existing methods first extract image-based object representations and then perform message propagation between them. Graph R-CNN [151] utilizes self-attention to integrate contextual information from neighboring nodes in the graph. Recently, Sharifzadeh et al. [152] employed transformers over the extracted object embeddings. Sharifzadeh et al. [153] proposed a new pipeline called Texema and employed a pre-trained Text-to-Text Transfer Transformer (T5) [154] to create structured graphs from textual input and utilized them to improve the relational reasoning module. The T5 model enables Texema to utilize the knowledge in texts.
Tracking. Some researchers have also explored using the transformer encoder-decoder architecture in template-based discriminative trackers, such as TMT [155], TrTr [156] and TransT [157]. All these works use a Siamese-like tracking pipeline to do video object tracking and utilize the encoder-decoder network to replace the explicit cross-correlation operation for global and rich contextual inter-dependencies. Specifically, the transformer encoder and decoder are assigned to the template branch and the search branch, respectively. In addition, Sun et al. proposed TransTrack [158], which is an online joint-detection-and-tracking pipeline. It utilizes the query-key mechanism to track pre-existing objects and introduces a set of learned object queries into the pipeline to detect new objects. The proposed TransTrack achieves 74.5% and 64.5% MOTA on the MOT17 and MOT20 benchmarks.
Re-Identification. He et al. [159] proposed TransReID to investigate the application of pure transformers in the field of object re-identification (ReID). While introducing the transformer network into object ReID, TransReID slices the image into patches with overlap to preserve local neighboring structures around the patches, and introduces 2D bilinear interpolation to help handle any given input resolution. With the transformer module and the loss function, a strong baseline was proposed to achieve comparable performance with CNN-based frameworks. Moreover, the jigsaw patch module (JPM) was designed to facilitate perturbation-invariant and robust feature representations of objects, and the side information embeddings (SIE) were introduced to encode side information. The final framework, TransReID, achieves state-of-the-art performance on both person and vehicle ReID benchmarks. Both Liu et al. [160] and Zhang et al. [161] provided solutions for introducing the transformer network into video-based person Re-ID. Similarly, both of them utilized separate transformer networks to refine spatial and temporal features, and then utilized a cross-view transformer to aggregate multi-view features.
Point Cloud Learning. A number of other works exploring the transformer architecture for point cloud learning [162], [163], [164] have also emerged recently. For example, Guo et al. [163] proposed a novel framework that replaces the original self-attention module with a more suitable offset-attention module, which includes an implicit Laplace operator and normalization refinement. In addition, Zhao et al. [164] designed a novel transformer architecture called Point Transformer. The proposed self-attention layer is invariant to the permutation of the point set, making it suitable for point set processing tasks. Point Transformer shows strong performance for the semantic segmentation task on 3D point clouds.

3.2.5 Discussions

As discussed in the preceding sections, transformers have shown strong performance on several high-level tasks, including detection, segmentation and pose estimation. The key issues that need to be resolved before transformer can be adopted for high-level tasks relate to input embedding, position encoding, and prediction loss. Some methods propose improving the self-attention module from different perspectives, for example, deformable attention [17], adaptive clustering [123] and point transformer [164]. Nevertheless, exploration into the use of transformers for high-level vision tasks is still in the preliminary stages, and so further research may prove beneficial. For example, is it necessary to use feature extraction modules such as CNN and PointNet before transformer for potentially better performance? How can vision transformer be fully leveraged using large-scale pre-training datasets, as BERT and GPT-3 do in the NLP field? Is it possible to pre-train a single transformer model and fine-tune it for different downstream tasks with only a few epochs of fine-tuning? How can we design more powerful architectures by incorporating prior knowledge of the specific tasks? Several prior works have performed preliminary discussions on the aforementioned topics, and we hope more research effort is put into exploring more powerful transformers for high-level vision.

3.3 Low-level Vision

Few works apply transformers to low-level vision fields, such as image super-resolution and generation. These tasks often take images as outputs (e.g., high-resolution or denoised images), which is more challenging than high-level vision tasks such as classification, segmentation, and detection, whose outputs are labels or boxes.

Fig. 9: A generic framework for transformer in image generation: (a) GAN-based image generation; (b) transformer-based image generation.

3.3.1 Image Generation

A simple yet effective way to apply the transformer model to the image generation task is to directly change the architecture from CNNs to transformers, as shown in Figure 9 (a). Jiang et al. [39] proposed TransGAN, which builds a GAN using the transformer architecture. Since it is difficult to generate high-resolution images pixel-wise, a memory-friendly generator is utilized by gradually increasing the feature map resolution at different stages. Correspondingly, a multi-scale discriminator is designed to handle the varying sizes of inputs at different stages. Various training recipes are introduced, including grid self-attention, data augmentation, relative position encoding and modified normalization, to stabilize the training and improve its performance. Experiments on various benchmark datasets demonstrate the effectiveness and potential of the transformer-based GAN model in image generation tasks. Lee et al. [165] proposed ViTGAN, which introduces several techniques to both the generator and discriminator to stabilize the training procedure and convergence. Euclidean distance is introduced for the self-attention module to enforce the Lipschitzness of the transformer discriminator. Self-modulated layernorm and implicit neural representation are proposed to enhance the training of the generator. As a result, ViTGAN is the first work to demonstrate that transformer-based GANs can achieve performance comparable to state-of-the-art CNN-based GANs.
Parmar et al. [27] proposed Image Transformer, taking the first step toward generalizing the transformer model to formulate image translation and generation tasks in an auto-regressive manner. Image Transformer consists of two parts: an encoder for extracting
image representation and a decoder to generate pixels. For each pixel with value 0-255, a 256 × d dimensional embedding is learned for encoding each value into a d-dimensional vector, which is fed into the encoder as input. The encoder and decoder adopt the same architecture as that in [9]. Each output pixel q′ is generated by calculating self-attention between the input pixel q and previously generated pixels m1, m2, ... with position embeddings p1, p2, .... For image-conditioned generation, such as super-resolution and inpainting, an encoder-decoder architecture is used, where the encoder's input is the low-resolution or corrupted images. For unconditional and class-conditional generation (i.e., noise to image), only the decoder is used, taking noise vectors as input. Because the decoder's input is the previously generated pixels (involving high computation cost when producing high-resolution images), a local self-attention scheme is proposed. This scheme uses only the closest generated pixels as input for the decoder, enabling Image Transformer to achieve performance on par with CNN-based models for image generation and translation tasks, demonstrating the effectiveness of transformer-based models on low-level vision tasks.
Since it is difficult to directly generate high-resolution images by transformer models, Esser et al. [38] proposed Taming Transformer. Taming Transformer consists of two parts: a VQGAN and a transformer. VQGAN is a variant of VQVAE [166], which uses a discriminator and a perceptual loss to improve the visual quality. Through VQGAN, the image can be represented by a series of context-rich discrete vectors, and therefore these vectors can be easily predicted by a transformer model in an auto-regressive way. The transformer model can learn the long-range

relevance ri,j is calculated between each patch qi in Q and ki in K as:

ri,j = ⟨ qi/∥qi∥ , ki/∥ki∥ ⟩.   (12)

A hard-attention module is proposed to select high-resolution features V according to the reference image, so that the low-resolution image can be matched by using the relevance. The hard-attention map is calculated as:

hi = arg max_j ri,j.   (13)

The most relevant reference patch is ti = v_{hi}, where ti in T is the transferred feature. A soft-attention module is then used to transfer V to the low-resolution feature. The transferred features from the high-resolution texture image and the low-resolution feature are used to generate the output features of the low-resolution image. By leveraging the transformer-based architecture, TTSR can successfully transfer texture information from high-resolution reference images to low-resolution images in super-resolution tasks.

Fig. 10: A multi-head, multi-tail image processing transformer: task-specific heads and tails (e.g., denoising, deraining, ×2 upsampling) are attached to a shared transformer encoder and decoder, with task embeddings and flattened features in between.
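The relevance and hard-attention computations of Eqs. 12 and 13 reduce to a cosine similarity between normalized query and key patches followed by a per-query argmax; the sketch below is our own toy illustration with assumed patch counts and feature dimensions, not the TTSR implementation.

import numpy as np

def hard_attention(Q, K):
    """Return the most relevant key index for each query patch (Eqs. 12-13).

    Q: (n_q, d) query patches from the low-resolution image
    K: (n_k, d) key patches from the reference image
    """
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # normalize each query patch
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)   # normalize each key patch
    r = Qn @ Kn.T                                       # r_{i,j}: cosine relevance (Eq. 12)
    return r, r.argmax(axis=1)                          # h_i = argmax_j r_{i,j} (Eq. 13)

# Toy usage (assumed sizes): 10 LR patches, 20 reference patches, 48-dim features
rng = np.random.default_rng(0)
r, h = hard_attention(rng.normal(size=(10, 48)), rng.normal(size=(20, 48)))
V = rng.normal(size=(20, 48))        # high-resolution value patches
T = V[h]                             # transferred features t_i = v_{h_i}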
decoder can be used to predict series of objects and their location, group activity recognition [177]. Gavrilyuk et al. proposed an
category, and size. This has enabled SceneFormer to outperform actor-transformer [178] architecture to learn the representation,
conventional CNN-based methods in user studies. using the static and dynamic representations generated by the 2D
and 3D networks as input. The output of the transformer is the
predicted activity.
Images
patches Video Retrieval. The key to content-based video retrieval is to
find the similarity between videos. Leveraging only the image-
level of video-level features to overcome the associated chal-
Transformer Transformer lenges, Shao et al. [179] suggested using the transformer to model
Encoders Decoders
the long-range semantic dependency. They also introduced the
supervised contrastive learning strategy to perform hard negative
mining. The results of using this approach on benchmark datasets
Images
patches demonstrate its performance and speed advantages. In addition,
Gabeur et al. [180] presented a multi-modal transformer to learn
Fig. 11: A generic framework for transformer in image processing.
It should be noted that iGPT [14] is pre-trained on an different cross-modal cues in order to represent videos.
inpainting-like task. Since iGPT mainly focus on the fine-tuning Video Object Detection. To detect objects in a video, both
performance on image classification tasks, we treat this work more global and local information is required. Chen et al. introduced
like an attempt on image classification task using transformer than the memory enhanced global-local aggregation (MEGA) [181]
low-level vision tasks. to capture more content. The representative features enhance the
In conclusion, different to classification and detection tasks, overall performance and address the ineffective and insufficient
the outputs of image generation and processing are images. Fig- problems. Furthermore, Yin et al. [182] proposed a spatiotem-
ure 11 illustrates using transformers in low-level vision. In image poral transformer to aggregate spatial and temporal information.
processing tasks, the images are first encoded into a sequence of Together with another spatial feature encoding component, these
tokens or patches and the transformer encoder uses the sequence two components perform well on 3D video object detection tasks.
as input, allowing the transformer decoder to successfully produce Multi-task Learning. Untrimmed video usually contains many
desired images. In image generation tasks, the GAN-based models frames that are irrelevant to the target tasks. It is therefore
directly learn a decoder to generated patches to outputting images crucial to mine the relevant information and discard the redundant
through linear projection, while the transformer-based models information. To extract such information, Seong et al. proposed
train a auto-encoder to learn a codebook for images and use an the video multi-task transformer network [183], which handles
auto-regression transformer model to predict the encoded tokens. multi-task learning on untrimmed videos. For the CoVieW dataset,
A meaningful direction for future research would be designing a the tasks are scene recognition, action recognition and importance
suitable architecture for different image processing tasks. score prediction. Two pre-trained networks on ImageNet and
3.4 Video Processing

Transformer performs surprisingly well on sequence-based tasks, especially NLP tasks. In computer vision (specifically, video tasks), both spatial and temporal information is required, giving rise to the application of transformer in a number of video tasks, such as frame synthesis [171], action recognition [172], and video retrieval [173].

3.4.1 High-level Video Processing

Video Action Recognition. Video human action tasks, as the name suggests, involve identifying and localizing human actions in videos. Context (such as other people and objects) plays a critical role in recognizing human actions. Rohit et al. proposed the action transformer [172] to model the underlying relationship between the human of interest and the surrounding context. Specifically, I3D [174] is used as the backbone to extract high-level feature maps. The features extracted (using RoI pooling) from intermediate feature maps are viewed as the query (Q), while the key (K) and values (V) are calculated from the intermediate features. A self-attention mechanism is applied to the three components, and it outputs the classification and regression predictions. Lohit et al. [175] proposed an interpretable differentiable module, named temporal transformer network, to reduce the intra-class variance and increase the inter-class variance. Fayyaz and Gall proposed a temporal transformer [176] to perform action recognition under weakly supervised settings. Beyond human action recognition, transformer has also been utilized for group activity recognition [177]: Gavrilyuk et al. proposed an actor-transformer [178] architecture to learn the representation, using the static and dynamic representations generated by 2D and 3D networks as input; the output of the transformer is the predicted activity.
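As a rough illustration of the query/key/value construction used by the action transformer described above, the sketch below treats an RoI-pooled person feature as the query and a flattened spatio-temporal feature map as keys and values. The feature dimensions, the single attention layer, and the 80-class head are illustrative placeholders rather than the published configuration.

```python
import torch
import torch.nn.functional as F

dim = 256
person_feat = torch.randn(1, dim)            # query: RoI-pooled feature of the tracked person
context_feat = torch.randn(1, 400, dim)      # keys/values: flattened spatio-temporal feature map

w_q = torch.nn.Linear(dim, dim)
w_k = torch.nn.Linear(dim, dim)
w_v = torch.nn.Linear(dim, dim)

q = w_q(person_feat).unsqueeze(1)                             # (1, 1, dim)
k, v = w_k(context_feat), w_v(context_feat)                   # (1, 400, dim)
attn = F.softmax(q @ k.transpose(1, 2) / dim ** 0.5, dim=-1)  # (1, 1, 400) attention over context
person_with_context = (attn @ v).squeeze(1)                   # (1, dim) context-aware person feature

action_logits = torch.nn.Linear(dim, 80)(person_with_context) # classification head (class count arbitrary)
```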
Video Retrieval. The key to content-based video retrieval is to find the similarity between videos. Leveraging only image-level or video-level features to overcome the associated challenges, Shao et al. [179] suggested using the transformer to model the long-range semantic dependency. They also introduced a supervised contrastive learning strategy to perform hard negative mining. The results of using this approach on benchmark datasets demonstrate its performance and speed advantages. In addition, Gabeur et al. [180] presented a multi-modal transformer to learn different cross-modal cues in order to represent videos.

Video Object Detection. To detect objects in a video, both global and local information is required. Chen et al. introduced the memory enhanced global-local aggregation (MEGA) [181] to capture more content. The representative features enhance the overall performance and address the ineffective and insufficient problems. Furthermore, Yin et al. [182] proposed a spatiotemporal transformer to aggregate spatial and temporal information. Together with another spatial feature encoding component, these two components perform well on 3D video object detection tasks.

Multi-task Learning. Untrimmed video usually contains many frames that are irrelevant to the target tasks. It is therefore crucial to mine the relevant information and discard the redundant information. To extract such information, Seong et al. proposed the video multi-task transformer network [183], which handles multi-task learning on untrimmed videos. For the CoVieW dataset, the tasks are scene recognition, action recognition and importance score prediction. Two networks pre-trained on ImageNet and Places365 extract the scene and object features. The multi-task transformers are stacked to implement feature fusion, leveraging the class conversion matrix (CCM).

3.4.2 Low-level Video Processing

Frame/Video Synthesis. Frame synthesis tasks involve synthesizing the frames between two consecutive frames or after a frame sequence, while video synthesis tasks involve synthesizing a whole video. Liu et al. proposed the ConvTransformer [171], which is comprised of five components: feature embedding, position encoding, encoder, query decoder, and the synthesis feed-forward network. Compared with LSTM-based works, the ConvTransformer achieves superior results with a more parallelizable architecture. Another transformer-based approach was proposed by Schatz et al. [184], which uses a recurrent transformer network to synthesize human actions from novel views.

Video Inpainting. Video inpainting tasks involve completing any missing regions within a frame. This is challenging, as it requires information along the spatial and temporal dimensions to be merged. Zeng et al. proposed a spatial-temporal transformer network [28], which uses all the input frames as input and fills them in parallel. A spatial-temporal adversarial loss is used to optimize the transformer network.

3.4.3 Discussions

Compared to images, video has an extra dimension that encodes the temporal information. Exploiting both spatial and temporal information helps to obtain a better understanding of a video. Thanks to the relationship modeling capability of transformer,
video processing tasks have been improved by mining spatial and temporal information simultaneously. Nevertheless, due to the high complexity and heavy redundancy of video data, how to efficiently and accurately model both spatial and temporal relationships is still an open problem.

3.5 Multi-Modal Tasks

Owing to the success of transformer across text-based NLP tasks, many researchers are keen to exploit its potential for processing multi-modal tasks (e.g., video-text, image-text and audio-text). One example of this is VideoBERT [185], which uses a CNN-based module to pre-process videos in order to obtain representation tokens. A transformer encoder is then trained on these tokens to learn the video-text representations for downstream tasks, such as video captioning. Some other examples include VisualBERT [186] and VL-BERT [187], which adopt a single-stream unified transformer to capture visual elements and image-text relationships for downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR). In addition, several studies such as SpeechBERT [188] explore the possibility of encoding audio and text pairs with a transformer encoder to process audio-text tasks such as speech question answering (SQA).

Fig. 12: The framework of the CLIP (image from [41]).

Apart from the aforementioned pioneering multi-modal transformers, Contrastive Language-Image Pre-training (CLIP) [41] takes natural language as supervision to learn more efficient image representations. CLIP jointly trains a text encoder and an image encoder to predict the corresponding training text-image pairs. The text encoder of CLIP is a standard transformer with masked self-attention, used to preserve the initialization ability of pre-trained language models. For the image encoder, CLIP considers two types of architecture, ResNet and Vision Transformer. CLIP is trained on a new dataset containing 400 million (image, text) pairs collected from the Internet. More specifically, given a batch of N (image, text) pairs, CLIP learns both text and image embeddings jointly to maximize the cosine similarity of the N matched embeddings while minimizing that of the N^2 − N incorrectly matched embeddings. On zero-shot transfer, CLIP demonstrates astonishing classification performance, achieving 76.2% top-1 accuracy on the ImageNet-1K dataset without using any ImageNet training labels. Concretely, at inference, the text encoder of CLIP first computes the feature embeddings of all ImageNet labels and the image encoder then computes the embeddings of all images. By calculating the cosine similarity between text and image embeddings, the text-image pair with the highest score should be the image and its corresponding label. Further experiments on 30 different CV benchmarks show the zero-shot transfer ability of CLIP and the diversity of the features it learns.
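The training objective and zero-shot inference procedure just described can be summarized in a short sketch. The encoders are replaced here by random stand-in embeddings, and the temperature value is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature         # (N, N) cosine similarities
    targets = torch.arange(logits.shape[0])              # the diagonal holds the N matched pairs
    return 0.5 * (F.cross_entropy(logits, targets) +     # pull matched pairs together,
                  F.cross_entropy(logits.t(), targets))  # push the N^2 - N mismatches apart

def zero_shot_classify(img_emb, label_emb):
    """Pick the label whose text embedding is most similar to each image embedding."""
    img_emb = F.normalize(img_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)            # one text embedding per class name
    return (img_emb @ label_emb.t()).argmax(dim=-1)

# Toy usage with random stand-in embeddings (a real system would use the trained encoders).
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_style_loss(images, texts), zero_shot_classify(images, torch.randn(1000, 512)).shape)
```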
While CLIP maps images to the descriptions given in text, another work, DALL-E [42], synthesizes new images of categories described in an input text. Similar to GPT-3, DALL-E is a multi-modal transformer with 12 billion model parameters autoregressively trained on a dataset of 3.3 million text-image pairs. More specifically, to train DALL-E, a two-stage training procedure is used: in stage 1, a discrete variational autoencoder is used to compress 256×256 RGB images into 32×32 grids of image tokens, and in stage 2, an autoregressive transformer is trained to model the joint distribution over the image and text tokens. Experimental results show that DALL-E can generate images of various styles from scratch, including photorealistic imagery, cartoons and emoji, or extend an existing image while still matching the description in the text. Subsequently, Ding et al. proposed CogView [43], a transformer with a VQ-VAE tokenizer similar to DALL-E that supports Chinese text input. They claim CogView outperforms DALL-E and previous GAN-based methods; moreover, unlike DALL-E, CogView does not need an additional CLIP model to rerank the samples drawn from the transformer.

Recently, a Unified Transformer (UniT) [189] model was proposed to cope with multi-modal multi-task learning; it can simultaneously handle multiple tasks across different domains, including object detection, natural language understanding and vision-language reasoning. Specifically, UniT has two transformer encoders to handle image and text inputs, respectively, and the transformer decoder then takes the single or concatenated encoder outputs according to the task modality. Finally, a task-specific prediction head is applied to the decoder outputs for the different tasks. In the training stage, all tasks are jointly trained by randomly selecting a specific task within each iteration. The experiments show UniT achieves satisfactory performance on every task with a compact set of model parameters.

In conclusion, current transformer-based multi-modal models demonstrate an architectural superiority for unifying data and tasks of various modalities, which shows the potential of transformer to build general-purpose intelligent agents able to cope with a vast number of applications. Future research could explore effective training schemes or the extendability of multi-modal transformers (e.g., GPT-4 [44]).

3.6 Efficient Transformer

Although transformer models have achieved success in various tasks, their high demands on memory and computing resources block their implementation on resource-limited devices such as mobile phones. In this section, we review the research carried out into compressing and accelerating transformer models for efficient implementation. This includes network pruning, low-rank decomposition, knowledge distillation, network quantization, and compact architecture design. Table 4 lists some representative works for compressing transformer-based models.

3.6.1 Pruning and Decomposition

In transformer-based pre-trained models (e.g., BERT), multiple attention operations are performed in parallel to independently model the relationship between different tokens [9], [10]. However, specific tasks do not require all heads to be used. For example, Michel et al. [45] presented empirical evidence that a large percentage of attention heads can be removed at test time without significantly impacting performance. The number of heads required varies across layers; some layers may even require only one head. Considering the redundancy of attention heads, importance scores are defined to estimate the influence of
each head on the final output in [45], and unimportant heads can be removed for efficient deployment. Dalvi et al. [190] analyzed the redundancy in pre-trained transformer models from two perspectives: general redundancy and task-specific redundancy. Following the lottery ticket hypothesis [191], Prasanna et al. [190] analyzed the lotteries in BERT and showed that good sub-networks also exist in transformer-based models, reducing both the FFN layers and the attention heads in order to achieve high compression rates. For the vision transformer [15], which splits an image into multiple patches, Tang et al. [192] proposed to reduce the patch calculation to accelerate inference; the redundant patches can be discovered automatically by considering their contributions to the effective output features. Zhu et al. [193] extended the network slimming approach [194] to vision transformers to reduce the dimensions of the linear projections in both the FFN and the attention modules.
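The head-pruning idea can be sketched as follows. The importance proxy used here (mean output magnitude per head) is a deliberate simplification; [45] instead estimates head importance from the sensitivity of the loss when a head is masked out.

```python
import torch

def prune_heads(per_head_outputs, keep_ratio=0.5):
    """per_head_outputs: (num_heads, seq_len, head_dim) outputs of one attention layer."""
    num_heads = per_head_outputs.shape[0]
    importance = per_head_outputs.abs().mean(dim=(1, 2))   # one (simplified) score per head
    k = max(1, int(keep_ratio * num_heads))
    keep = importance.topk(k).indices                      # indices of the most useful heads
    mask = torch.zeros(num_heads, dtype=torch.bool)
    mask[keep] = True
    return per_head_outputs * mask[:, None, None], mask    # zero out pruned heads

outputs = torch.randn(12, 197, 64)                         # e.g., 12 heads on a ViT-like layer
pruned, kept = prune_heads(outputs, keep_ratio=0.25)
print(kept.sum().item(), "heads kept out of", outputs.shape[0])
```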
In addition to the width of transformer models, the depth (i.e., the number of layers) can also be reduced to accelerate the inference process [204], [205]. Differing from the fact that different attention heads in transformer models can be computed in parallel, different layers have to be calculated sequentially because the input of the next layer depends on the output of the previous layers. Fan et al. [204] proposed a layer-wise dropping strategy to regularize the training of models, and then whole layers are removed together at the test phase.

Beyond the pruning methods that directly discard modules in transformer models, matrix decomposition aims to approximate the large matrices with multiple small matrices based on the low-rank assumption. For example, Wang et al. [206] decomposed the standard matrix multiplication in transformer models, improving the inference efficiency.

3.6.2 Knowledge Distillation

… student networks, thereby facilitating the mimicking process. Due to the various types of layers in the transformer model (i.e., the self-attention layer, the embedding layer, and the prediction layers), Jiao et al. [46] design different objective functions to transfer knowledge from teachers to students. For example, the outputs of the student models' embedding layers imitate those of the teachers via MSE losses. For the vision transformer, Jia et al. [213] proposed a fine-grained manifold distillation method, which excavates effective knowledge through the relationship between images and the divided patches.
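The embedding-imitation objective mentioned above can be written down compactly. The vocabulary size, the teacher/student widths, and the linear projection below are illustrative stand-ins rather than the configuration used in [46].

```python
import torch
import torch.nn as nn

teacher_dim, student_dim, seq_len = 768, 384, 128

teacher_embed = nn.Embedding(30522, teacher_dim)   # teacher embedding layer (used only for targets)
student_embed = nn.Embedding(30522, student_dim)   # smaller student embedding layer
proj = nn.Linear(student_dim, teacher_dim)         # maps the student space to the teacher space

tokens = torch.randint(0, 30522, (4, seq_len))
with torch.no_grad():
    teacher_out = teacher_embed(tokens)            # detached target representations
student_out = proj(student_embed(tokens))

embed_distill_loss = nn.functional.mse_loss(student_out, teacher_out)
embed_distill_loss.backward()                      # gradients flow only into the student and proj
```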
3.6.3 Quantization

Quantization aims to reduce the number of bits needed to represent network weights or intermediate features [214], [215]. Quantization methods for general neural networks have been discussed at length and achieve performance on par with the original networks [216], [217], [218]. Recently, there has been growing interest in how to specially quantize transformer models [219], [220]. For example, Shridhar et al. [221] suggested embedding the input into binary high-dimensional vectors, and then using this binary input representation to train binary neural networks. Cheong et al. [222] represented the weights in transformer models by a low-bit (e.g., 4-bit) representation. Zhao et al. [223] empirically investigated various quantization methods and showed that k-means quantization has huge development potential. Aimed at machine translation tasks, Prato et al. [47] proposed a fully quantized transformer which, as the paper claims, is the first 8-bit model not to suffer any loss in translation quality. Besides, Liu et al. [224] explored a post-training quantization scheme to reduce the memory storage and computational costs of vision transformers.
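A minimal sketch of uniform post-training weight quantization, in the spirit of the 8-bit schemes above, is given below. Real pipelines additionally calibrate activations and use per-channel scales, which are omitted here for brevity.

```python
import torch

def quantize_dequantize(weight, bits=8):
    qmax = 2 ** (bits - 1) - 1                      # e.g., 127 for signed 8-bit
    scale = weight.abs().max() / qmax               # one scale for the whole tensor
    q = torch.clamp((weight / scale).round(), -qmax - 1, qmax)  # integer codes
    return q * scale, q.to(torch.int8), scale       # dequantized weight, codes, scale

w = torch.randn(768, 768)                           # a transformer projection matrix, for example
w_hat, codes, s = quantize_dequantize(w)
print("max abs error:", (w - w_hat).abs().max().item())
```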
transformers perform better on large datasets. The question for the future is whether to use CNN or transformer.

By training with large datasets, transformers can achieve state-of-the-art performance on both NLP [11], [10] and CV benchmarks [15]. It is possible that neural networks need big data rather than inductive bias. In closing, we leave you with a question: can transformer obtain satisfactory results with a very simple computational paradigm (e.g., with only fully connected layers) and massive data training?

ACKNOWLEDGEMENT

This research is partially supported by MindSpore (https://mindspore.cn/) and CANN (Compute Architecture for Neural Networks).

APPENDIX

A1. General Formulation of Self-attention

The self-attention module [9] for machine translation computes the response at each position in a sequence by estimating attention scores to all positions and gathering the corresponding embeddings based on these scores. This can be viewed as a form of non-local filtering operation [252], [253]. We follow the convention of [252] to formulate the self-attention module. Given an input signal (e.g., image, sequence, video or feature) X ∈ R^{n×d}, where n = h × w (the number of pixels in the feature) and d is the number of channels, the output signal is generated as

    y_i = (1 / C(x_i)) Σ_{∀j} f(x_i, x_j) g(x_j),                                (14)

where x_i ∈ R^{1×d} and y_i ∈ R^{1×d} indicate the i-th position (e.g., in space, time or spacetime) of the input signal X and the output signal Y, respectively. Subscript j is the index that enumerates all positions, and the pairwise function f(·) computes a relationship (such as affinity) between i and all j. The function g(·) computes a representation of the input signal at position j, and the response is normalized by the factor C(x_i).

Note that there are many choices for the pairwise function f(·). For example, a simple extension of the Gaussian function could be used to compute the similarity in an embedding space. As such, f(·) can be formulated as

    f(x_i, x_j) = e^{θ(x_i) ϕ(x_j)^T},                                           (15)

where θ(·) and ϕ(·) can be any embedding layers. If we consider θ(·), ϕ(·), g(·) in the form of linear embeddings, θ(X) = X W_θ, ϕ(X) = X W_ϕ, g(X) = X W_g with W_θ ∈ R^{d×d_k}, W_ϕ ∈ R^{d×d_k}, W_g ∈ R^{d×d_v}, and set the normalization factor as C(x_i) = Σ_{∀j} f(x_i, x_j), then Eq. 14 can be rewritten as

    y_i = Σ_{∀j} [ e^{x_i w_{θ,i} w_{ϕ,j}^T x_j^T} / Σ_{∀j} e^{x_i w_{θ,i} w_{ϕ,j}^T x_j^T} ] x_j w_{g,j},    (16)

where w_{θ,i} ∈ R^{d×1} is the i-th row of the weight matrix W_θ. For a given index i, (1 / C(x_i)) f(x_i, x_j) becomes the softmax output along the dimension j. The formulation can be further rewritten as

    Y = softmax(X W_θ W_ϕ^T X^T) g(X),                                           (17)

where Y ∈ R^{n×c} is the output signal of the same size as X. Compared with the query, key and value representations Q = X W_q, K = X W_k, V = X W_v from the translation module, once W_q = W_θ, W_k = W_ϕ, W_v = W_g, Eq. 17 can be formulated as

    Y = softmax(Q K^T) V = Attention(Q, K, V).                                   (18)

The self-attention module [9] proposed for machine translation is, to some extent, the same as the preceding non-local filtering operations proposed for computer vision. Generally, the final output signal of the self-attention module for computer vision is wrapped as

    Z = Y W^o + X,                                                               (19)

where Y is generated through Eq. 17. If W^o is initialized as zero, this self-attention module can be inserted into any existing model without breaking its initial behavior.
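Eqs. 18 and 19 translate almost directly into code. The sketch below omits the multi-head structure and the 1/sqrt(d_k) scaling, matching the simplified single-head form above, and zero-initializes W^o so that the module is an identity mapping at insertion time.

```python
import torch
import torch.nn as nn

class SelfAttention2D(nn.Module):
    """Self-attention wrapped as in Eqs. 18-19: Z = softmax(QK^T) V W^o + X.
    W^o starts at zero, so plugging the module into an existing network
    initially leaves that network's behavior unchanged."""
    def __init__(self, d, d_k):
        super().__init__()
        self.w_q = nn.Linear(d, d_k, bias=False)   # W_theta in Eq. 17
        self.w_k = nn.Linear(d, d_k, bias=False)   # W_phi
        self.w_v = nn.Linear(d, d, bias=False)     # W_g
        self.w_o = nn.Linear(d, d, bias=False)     # W^o, zero-initialized
        nn.init.zeros_(self.w_o.weight)

    def forward(self, x):                          # x: (n, d) with n = h * w positions
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        y = torch.softmax(q @ k.t(), dim=-1) @ v   # Eq. 18 (no scaling, as in the text)
        return self.w_o(y) + x                     # Eq. 19: residual connection

x = torch.randn(64, 32)                            # an 8x8 feature map flattened to 64 positions, d = 32
print(torch.allclose(SelfAttention2D(32, 16)(x), x))  # True: identity at initialization
```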
A2. Revisiting Transformers for NLP

Before transformer was developed, RNNs (e.g., GRU [254] and LSTM [6]) with added attention [7] empowered most of the state-of-the-art language models. However, RNNs require the information flow to be processed sequentially from the previous hidden state to the next one. This rules out acceleration and parallelization during training, and consequently hinders the potential of RNNs to process longer sequences or build larger models. In 2017, Vaswani et al. [9] proposed transformer, a novel encoder-decoder architecture built solely on multi-head self-attention mechanisms and feed-forward neural networks. Its purpose was to solve sequence-to-sequence natural language tasks (e.g., machine translation) easily by acquiring global dependencies. The subsequent success of transformer demonstrates that leveraging attention mechanisms alone can achieve performance comparable with attentive RNNs. Furthermore, the architecture of transformer lends itself to massively parallel computing, which enables training on larger datasets. This has given rise to the surge of large pre-trained models (PTMs) for natural language processing.

BERT [10] and its variants (e.g., SpanBERT [255], RoBERTa [256]) are a series of PTMs built on the multi-layer transformer encoder architecture. Two tasks are conducted on the BookCorpus [257] and English Wikipedia datasets at the pre-training stage of BERT: 1) masked language modeling (MLM), which involves first randomly masking out some tokens in the input and then training the model to predict them; 2) next sentence prediction, which uses paired sentences as input and predicts whether the second sentence is the original one in the document. After pre-training, BERT can be fine-tuned on a wide range of downstream tasks by adding an output layer. More specifically, when performing sequence-level tasks (e.g., sentiment analysis), BERT uses the representation of the first token for classification; for token-level tasks (e.g., named entity recognition), all tokens are fed into the softmax layer for classification. At the time of its release, BERT achieved state-of-the-art performance on 11 NLP tasks, setting a milestone in pre-trained language models. Generative Pre-trained Transformer models (e.g., GPT [258], GPT-2 [110]) are another type of PTM, based on the transformer decoder architecture, which uses masked self-attention mechanisms. The main difference between the GPT series and BERT is the way in which pre-training is performed. Unlike BERT, GPT models are unidirectional language models pre-trained using left-to-right (LTR) language modeling. Furthermore, BERT learns the sentence separator ([SEP]) and classifier token ([CLS]) embeddings during
pre-training, whereas these embeddings are involved in only the fine-tuning stage of GPT. Due to its unidirectional pre-training strategy, GPT achieves superior performance in many natural language generation tasks. More recently, a massive transformer-based model called GPT-3, which has an astonishing 175 billion parameters, was developed [11]. By pre-training on 45 TB of compressed plaintext data, GPT-3 can directly process different types of downstream natural language tasks without fine-tuning. As a result, it achieves strong performance on many NLP datasets, covering both natural language understanding and generation. Since the introduction of transformer, many other models have been proposed in addition to the transformer-based PTMs mentioned earlier. We list a few representative models in Table 5 for interested readers, but this is not the focus of our study.

TABLE 5: List of representative language models built on transformer. Transformer is the standard encoder-decoder architecture; Transformer Enc. and Dec. represent the encoder and decoder, respectively. The decoder uses masked self-attention to prevent attending to future tokens. The data in the table are from [203].

Models             | Architecture                | # of Params  | Fine-tuning
GPT [258]          | Transformer Dec.            | 117M         | Yes
GPT-2 [110]        | Transformer Dec.            | 117M-1542M   | No
GPT-3 [11]         | Transformer Dec.            | 125M-175B    | No
BERT [10]          | Transformer Enc.            | 110M-340M    | Yes
RoBERTa [256]      | Transformer Enc.            | 355M         | Yes
XLNet [259]        | Two-Stream Transformer Enc. | ≈ BERT       | Yes
ELECTRA [260]      | Transformer Enc.            | 335M         | Yes
UniLM [261]        | Transformer Enc.            | 340M         | Yes
BART [262]         | Transformer                 | 110% of BERT | Yes
T5 [154]           | Transformer                 | 220M-11B     | Yes
ERNIE (THU) [263]  | Transformer Enc.            | 114M         | Yes
KnowBERT [264]     | Transformer Enc.            | 253M-523M    | Yes

Apart from the PTMs trained on large corpora for general NLP tasks, transformer-based models have also been applied in many other NLP-related domains and to multi-modal tasks.

BioNLP Domain. Transformer-based models have outperformed many traditional biomedical methods. Examples include BioBERT [265], which uses a transformer architecture for biomedical text mining tasks, and SciBERT [266], which is developed by training transformer on 114M scientific articles (covering the biomedical and computer science fields) with the aim of executing NLP tasks in the scientific domain more precisely. Another example is ClinicalBERT, proposed by Huang et al. [267]. It utilizes transformer to develop and evaluate continuous representations of clinical notes. One of the side effects of this is that the attention map of ClinicalBERT can be used to explain predictions, thereby allowing high-quality connections between different medical contents to be discovered.

The rapid development of transformer-based models on a variety of NLP-related tasks demonstrates their structural superiority and versatility, opening up the possibility that transformer will become a universal module applied in many AI fields beyond NLP. The following part of this survey focuses on the applications of transformer in the wide range of computer vision tasks that have emerged over the past two years.

A3. Self-attention for Computer Vision

The preceding sections reviewed methods that use a transformer architecture for vision tasks. We can conclude that self-attention plays a pivotal role in transformer. The self-attention module can also be considered a building block of CNN architectures, which have low scaling properties concerning large receptive fields. This building block is widely used on top of networks to capture long-range interactions and enhance high-level semantic features for vision tasks. In this section, we delve deeply into models based on self-attention designed for challenging tasks in computer vision. Such tasks include semantic segmentation, instance segmentation, object detection, keypoint detection, and depth estimation. Here we briefly summarize the existing applications of self-attention in computer vision.

Image Classification. Trainable attention for classification consists of two main streams: hard attention [268], [269], [270] regarding the use of an image region, and soft attention [271], [272], [273], [274] generating non-rigid feature maps. Ba et al. [268] first proposed the term "visual attention" for image classification tasks, and used attention to select relevant regions and locations within the input image, which can also reduce the computational complexity of the proposed model with respect to the size of the input image. For medical image classification, AG-CNN [275] was proposed to crop a sub-region from a global image according to an attention heat map. Instead of using hard attention and recalibrating the crop of feature maps, SENet [276] was proposed to reweight the channel-wise responses of the convolutional features using soft self-attention. Jetley et al. [272] used attention maps generated by corresponding estimators to reweight intermediate features in DNNs. In addition, Han et al. [273] utilized attribute-aware attention to enhance the representation of CNNs.
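A minimal squeeze-and-excitation style block illustrating the channel reweighting idea attributed to SENet above is sketched below; the reduction ratio and tensor shapes are illustrative choices.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel reweighting in the SENet style: globally pooled channel statistics
    pass through a small bottleneck MLP and a sigmoid, producing one soft
    attention weight per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, height, width)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze: per-channel statistics -> weights in (0, 1)
        return x * w[:, :, None, None]         # excite: rescale each channel response

feat = torch.randn(2, 256, 14, 14)
print(SEBlock(256)(feat).shape)                # torch.Size([2, 256, 14, 14])
```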
Semantic Segmentation. PSANet [277], OCNet [278], DANet [279] and CFNet [280] are the pioneering works proposing to use the self-attention module in semantic segmentation tasks. These works consider and augment the relationship and similarity [281], [282], [283], [284], [285], [286] between contextual pixels. DANet [279] simultaneously leverages the self-attention module on the spatial and channel dimensions, whereas A^2-Net [287] groups the pixels into a set of regions, and then augments the pixel representations by aggregating the region representations with the generated attention weights. DGCNet [288] employs a dual graph CNN to model coordinate-space similarity and feature-space similarity in a single framework. To improve the efficiency of the self-attention module for semantic segmentation, several works [289], [290], [291], [292], [293] have been proposed, aiming to alleviate the huge number of parameters brought by calculating pixel similarities. For example, CGNL [289] applies the Taylor series of the RBF kernel function to approximate the pixel similarities. CCNet [290] approximates the original self-attention scheme via two consecutive criss-cross attention modules. In addition, ISSA [291] factorizes the dense affinity matrix as the product of two sparse affinity matrices. There are other related works using attention-based graph reasoning modules [294], [295], [292] to enhance both the local and global representations.

Object Detection. Ramachandran et al. [274] propose an attention-based layer that swaps out the conventional convolution layers to build a fully attentional detector that outperforms the typical RetinaNet [129] on the COCO benchmark [296]. GCNet [297] assumes that the global contexts modeled by non-local operations are almost the same for different query positions within an image,
and unifies the simplified formulation and SENet [276] into a general framework for global context modeling [298], [299], [300], [301]. Vo et al. [302] design a bidirectional operation to gather and distribute information from a query position to all possible positions. Zhang et al. [120] suggest that previous methods fail to interact with cross-scale features, and propose the Feature Pyramid Transformer, based on the self-attention module, to fully exploit interactions across both space and scales.

Conventional detection methods usually exploit a single visual representation (e.g., bounding box or corner point) for predicting the final results. Hu et al. [303] propose a relation module based on self-attention to process a set of objects simultaneously through interaction between their appearance features. Cheng et al. [121] propose RelationNet++ with the bridging visual representations (BVR) module to combine different heterogeneous representations into a single one, similar to that in the self-attention module. Specifically, the master representation is treated as the query input and the auxiliary representations are regarded as the key input. The enhanced feature can therefore bridge the information from the auxiliary representations and benefit the final detection results.

Other Vision Tasks. Zhang et al. [304] propose a resolution-wise attention module to learn enhanced feature maps when training multi-resolution networks, in order to obtain accurate human keypoint locations for the pose estimation task. Furthermore, Chang et al. [305] use an attention-mechanism based feature fusion block to improve the accuracy of the human keypoint detection model.

To explore more generalized contextual information for improving self-supervised monocular depth estimation, Johnston et al. [306] directly leverage the self-attention module. Chen et al. [307] also propose an attention-based aggregation network to capture context information that differs across diverse scenes for depth estimation. Aich et al. [308] propose bidirectional attention modules that utilize forward and backward attention operations for better results in monocular depth estimation.

REFERENCES
[1] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
[2] F. Rosenblatt. Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical report, 1961.
[3] Y. LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[4] A. Krizhevsky et al. Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.
[5] D. E. Rumelhart et al. Learning internal representations by error propagation. Technical report, 1985.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] D. Bahdanau et al. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[8] A. Parikh et al. A decomposable attention model for natural language inference. In EMNLP, 2016.
[9] A. Vaswani et al. Attention is all you need. In NeurIPS, 2017.
[10] J. Devlin et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[11] T. B. Brown et al. Language models are few-shot learners. In NeurIPS, 2020.
[12] K. He et al. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
[13] S. Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[14] M. Chen et al. Generative pretraining from pixels. In ICML, 2020.
[15] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[16] N. Carion et al. End-to-end object detection with transformers. In ECCV, 2020.
[17] X. Zhu et al. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2021.
[18] S. Zheng et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[19] H. Chen et al. Pre-trained image processing transformer. In CVPR, 2021.
[20] L. Zhou et al. End-to-end dense video captioning with masked transformer. In CVPR, pp. 8739–8748, 2018.
[21] S. Ullman et al. High-level vision: Object recognition and visual cognition, volume 2. MIT Press, Cambridge, MA, 1996.
[22] R. Kimchi et al. Perceptual organization in vision: Behavioral and neural perspectives. Psychology Press, 2003.
[23] J. Zhu et al. Top-down saliency detection via contextual pooling. Journal of Signal Processing Systems, 74(1):33–46, 2014.
[24] J. Long et al. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[25] H. Wang et al. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In CVPR, pp. 5463–5474, 2021.
[26] R. B. Fisher. Cvonline: The evolving, distributed, non-proprietary, on-line compendium of computer vision. Retrieved January 28, 2006 from http://homepages.inf.ed.ac.uk/rbf/CVonline, 2008.
[27] N. Parmar et al. Image transformer. In ICML, 2018.
[28] Y. Zeng et al. Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Springer, 2020.
[29] K. Han et al. Transformer in transformer. In NeurIPS, 2021.
[30] H. Cao et al. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv:2105.05537, 2021.
[31] X. Chen et al. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
[32] K. He et al. Masked autoencoders are scalable vision learners. In CVPR, pp. 16000–16009, 2022.
[33] Z. Dai et al. UP-DETR: unsupervised pre-training for object detection with transformers. In CVPR, 2021.
[34] Y. Wang et al. End-to-end video instance segmentation with transformers. In CVPR, 2021.
[35] L. Huang et al. Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In ECCV, pp. 17–33, 2020.
[36] L. Huang et al. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In ACM MM, pp. 3136–3145, 2020.
[37] K. Lin et al. End-to-end human pose and mesh reconstruction with transformers. In CVPR, 2021.
[38] P. Esser et al. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[39] Y. Jiang et al. Transgan: Two transformers can make one strong gan. In NeurIPS, 2021.
[40] F. Yang et al. Learning texture transformer network for image super-resolution. In CVPR, pp. 5791–5800, 2020.
[41] A. Radford et al. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[42] A. Ramesh et al. Zero-shot text-to-image generation. In ICML, 2021.
[43] M. Ding et al. Cogview: Mastering text-to-image generation via transformers. In NeurIPS, 2021.
[44] OpenAI. Gpt-4 technical report, 2023.
[45] P. Michel et al. Are sixteen heads really better than one? In NeurIPS, pp. 14014–14024, 2019.
[46] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pp. 4163–4174, 2020.
[47] G. Prato et al. Fully quantized transformer for machine translation. In Findings of EMNLP, 2020.
[48] Z.-H. Jiang et al. Convbert: Improving bert with span-based dynamic convolution. NeurIPS, 33, 2020.
[49] J. Gehring et al. Convolutional sequence to sequence learning. In ICML, pp. 1243–1252. PMLR, 2017.
[50] P. Shaw et al. Self-attention with relative position representations. In NAACL, pp. 464–468, 2018.
[51] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.
[52] J. L. Ba et al. Layer normalization. arXiv:1607.06450, 2016.
[53] A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In ICLR, 2019.
[54] Q. Wang et al. Learning deep transformer models for machine translation. In ACL, pp. 1810–1822, 2019.
[55] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[56] S. Shen et al. Powernorm: Rethinking batch normalization in transformers. In ICML, 2020.
[57] J. Xu et al. Understanding and improving layer normalization. In NeurIPS, 2019.
[58] T. Bachlechner et al. Rezero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pp. 1352–1361. PMLR, 2021.
[59] B. Wu et al. Visual transformers: Token-based image representation and processing for computer vision. arXiv:2006.03677, 2020.
[60] H. Touvron et al. Training data-efficient image transformers & distillation through attention. In ICML, 2020.
[61] Z. Liu et al. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[62] C.-F. Chen et al. Regionvit: Regional-to-local attention for vision transformers. arXiv:2106.02689, 2021.
[63] X. Chu et al. Twins: Revisiting the design of spatial attention in vision transformers. arXiv:2104.13840, 2021.
[64] H. Lin et al. Cat: Cross attention in vision transformer. arXiv, 2021.
[65] X. Dong et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv:2107.00652, 2021.
[66] Z. Huang et al. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv:2106.03650, 2021.
[67] J. Fang et al. Msg-transformer: Exchanging local spatial information by manipulating messenger tokens. arXiv:2105.15168, 2021.
[68] L. Yuan et al. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021.
[69] D. Zhou et al. Deepvit: Towards deeper vision transformer. arXiv, 2021.
[70] P. Wang et al. Kvt: k-nn attention for boosting vision transformers. arXiv:2106.00515, 2021.
[71] D. Zhou et al. Refiner: Refining self-attention for vision transformers. arXiv:2106.03714, 2021.
[72] A. El-Nouby et al. Xcit: Cross-covariance image transformers. arXiv:2106.09681, 2021.
[73] W. Wang et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[74] S. Sun et al. Visual parser: Representing part-whole hierarchies with transformers. arXiv:2107.05790, 2021.
[75] H. Fan et al. Multiscale vision transformers. arXiv:2104.11227, 2021.
[76] Z. Zhang et al. Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In AAAI, 2022.
[77] Z. Pan et al. Less is more: Pay less attention in vision transformers. In AAAI, 2022.
[78] Z. Pan et al. Scalable visual transformers with hierarchical pooling. In ICCV, 2021.
[79] B. Heo et al. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
[80] C.-F. Chen et al. Crossvit: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.
[81] Z. Wang et al. Uformer: A general u-shaped transformer for image restoration. arXiv:2106.03106, 2021.
[82] X. Zhai et al. Scaling vision transformers. arXiv:2106.04560, 2021.
[83] X. Su et al. Vision transformer architecture search. arXiv, 2021.
[84] M. Chen et al. Autoformer: Searching transformers for visual recognition. In ICCV, pp. 12270–12280, 2021.
[85] B. Chen et al. Glit: Neural architecture search for global and local image transformer. In ICCV, pp. 12–21, 2021.
[86] X. Chu et al. Conditional positional encodings for vision transformers. arXiv:2102.10882, 2021.
[87] K. Wu et al. Rethinking and improving relative position encoding for vision transformer. In ICCV, 2021.
[88] H. Touvron et al. Going deeper with image transformers. arXiv:2103.17239, 2021.
[89] Y. Tang et al. Augmented shortcuts for vision transformers. In NeurIPS, 2021.
[90] I. Tolstikhin et al. Mlp-mixer: An all-mlp architecture for vision. arXiv:2105.01601, 2021.
[91] L. Melas-Kyriazi. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet. arXiv:2105.02723, 2021.
[92] M.-H. Guo et al. Beyond self-attention: External attention using two linear layers for visual tasks. arXiv:2105.02358, 2021.
[93] H. Touvron et al. Resmlp: Feedforward networks for image classification with data-efficient training. arXiv:2105.03404, 2021.
[94] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[95] J. Guo et al. Cmt: Convolutional neural networks meet vision transformers. arXiv:2107.06263, 2021.
[96] L. Yuan et al. Volo: Vision outlooker for visual recognition. arXiv:2106.13112, 2021.
[97] H. Wu et al. Cvt: Introducing convolutions to vision transformers. arXiv:2103.15808, 2021.
[98] K. Yuan et al. Incorporating convolution designs into visual transformers. arXiv:2103.11816, 2021.
[99] Y. Li et al. Localvit: Bringing locality to vision transformers. arXiv:2104.05707, 2021.
[100] B. Graham et al. Levit: a vision transformer in convnet's clothing for faster inference. In ICCV, 2021.
[101] A. Srinivas et al. Bottleneck transformers for visual recognition. In CVPR, 2021.
[102] Z. Chen et al. Visformer: The vision-friendly transformer. arXiv, 2021.
[103] T. Xiao et al. Early convolutions help transformers see better. In NeurIPS, volume 34, 2021.
[104] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length, and helmholtz free energy. NIPS, 6:3–10, 1994.
[105] P. Vincent et al. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.
[106] A. v. d. Oord et al. Conditional image generation with pixelcnn decoders. arXiv:1606.05328, 2016.
[107] D. Pathak et al. Context encoders: Feature learning by inpainting. In CVPR, pp. 2536–2544, 2016.
[108] Z. Li et al. Mst: Masked self-supervised transformer for visual representation. In NeurIPS, 2021.
[109] H. Bao et al. Beit: Bert pre-training of image transformers. arXiv:2106.08254, 2021.
[110] A. Radford et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[111] Z. Xie et al. Simmim: A simple framework for masked image modeling. In CVPR, pp. 9653–9663, 2022.
[112] Z. Xie et al. Self-supervised learning with swin transformers. arXiv:2105.04553, 2021.
[113] C. Li et al. Efficient self-supervised vision transformers for representation learning. arXiv:2106.09785, 2021.
[114] K. He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[115] J. Beal et al. Toward transformer-based object detection. arXiv:2012.09958, 2020.
[116] Z. Yuan et al. Temporal-channel transformer for 3d lidar-based video object detection for autonomous driving. IEEE TCSVT, 2021.
[117] X. Pan et al. 3d object detection with pointformer. In CVPR, 2021.
[118] R. Liu et al. End-to-end lane shape prediction with transformers. In WACV, 2021.
[119] S. Yang et al. Transpose: Keypoint localization via transformer. In ICCV, 2021.
[120] D. Zhang et al. Feature pyramid transformer. In ECCV, 2020.
[121] C. Chi et al. Relationnet++: Bridging visual representations for object detection via transformer decoder. NeurIPS, 2020.
[122] Z. Sun et al. Rethinking transformer-based set prediction for object detection. In ICCV, pp. 3611–3620, 2021.
[123] M. Zheng et al. End-to-end object detection with adaptive clustering transformer. In BMVC, 2021.
[124] T. Ma et al. Oriented object detection with transformer. arXiv:2106.03146, 2021.
[125] P. Gao et al. Fast convergence of detr with spatially modulated co-attention. In ICCV, 2021.
[126] Z. Yao et al. Efficient detr: Improving end-to-end object detector with dense prior. arXiv:2104.01318, 2021.
[127] Z. Tian et al. Fcos: Fully convolutional one-stage object detection. In ICCV, pp. 9627–9636, 2019.
[128] Y. Fang et al. You only look at one sequence: Rethinking transformer in vision through object detection. In NeurIPS, 2021.
[129] T.-Y. Lin et al. Focal loss for dense object detection. In ICCV, 2017.
[130] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
[131] A. Bar et al. Detreg: Unsupervised pretraining with region priors for object detection. arXiv:2106.04550, 2021.
[132] J. Hu et al. Istr: End-to-end instance segmentation with transformers. arXiv:2105.00637, 2021.
[133] Z. Yang et al. Associating objects with transformers for video object segmentation. In NeurIPS, 2021.
[134] S. Wu et al. Fully transformer networks for semantic image segmentation. arXiv:2106.04108, 2021.
[135] B. Dong et al. Solq: Segmenting objects by learning queries. In NeurIPS, 2021.
[136] R. Strudel et al. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[137] E. Xie et al. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[138] J. M. J. Valanarasu et al. Medical transformer: Gated axial-attention for medical image segmentation. In MICCAI, 2021.
[139] T. Prangemeier et al. Attention-based transformers for instance segmentation of cells in microstructures. In International Conference on Bioinformatics and Biomedicine, pp. 700–707. IEEE, 2020.
[140] C. R. Qi et al. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 652–660, 2017.
[141] C. R. Qi et al. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 30:5099–5108, 2017.
[142] S. Hampali et al. Handsformer: Keypoint transformer for monocular 3d pose estimation of hands and object in interaction. arXiv, 2021.
[143] Y. Li et al. Tokenpose: Learning keypoint tokens for human pose estimation. In ICCV, 2021.
[144] W. Mao et al. Tfpose: Direct human pose estimation with transformers. arXiv:2103.15320, 2021.
[145] T. Jiang et al. Skeletor: Skeletal transformers for robust body-pose estimation. In CVPR, 2021.
[146] Y. Li et al. Test-time personalization with a transformer for human pose estimation. NeurIPS, 34, 2021.
[147] M. Lin et al. Detr for pedestrian detection. arXiv:2012.06785, 2020.
[148] L. Tabelini et al. Polylanenet: Lane estimation via deep polynomial regression. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6150–6156. IEEE, 2021.
[149] L. Liu et al. Condlanenet: a top-to-down lane detection framework based on conditional convolution. arXiv:2105.05003, 2021.
[150] P. Xu et al. A survey of scene graph: Generation and application. IEEE Trans. Neural Netw. Learn. Syst, 2020.
[151] J. Yang et al. Graph r-cnn for scene graph generation. In ECCV, 2018.
[152] S. Sharifzadeh et al. Classification by attention: Scene graph classification with prior knowledge. In AAAI, 2021.
[153] S. Sharifzadeh et al. Improving visual reasoning by exploiting the knowledge in texts. arXiv:2102.04760, 2021.
[154] C. Raffel et al. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020.
[155] N. Wang et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In CVPR, 2021.
[156] M. Zhao et al. TrTr: Visual tracking with transformer. arXiv:2105.03817, 2021.
[157] X. Chen et al. Transformer tracking. In CVPR, 2021.
[158] P. Sun et al. TransTrack: Multiple object tracking with transformer. arXiv:2012.15460, 2021.
[159] S. He et al. TransReID: Transformer-based object re-identification. In ICCV, 2021.
[160] X. Liu et al. A video is worth three views: Trigeminal transformers for video-based person re-identification. arXiv:2104.01745, 2021.
[161] T. Zhang et al. Spatiotemporal transformer for video-based person re-identification. arXiv:2103.16469, 2021.
[162] N. Engel et al. Point transformer. IEEE Access, 9:134826–134840, 2021.
[163] M.-H. Guo et al. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
[164] H. Zhao et al. Point transformer. In ICCV, pp. 16259–16268, 2021.
[165] K. Lee et al. Vitgan: Training gans with vision transformers. arXiv:2107.04589, 2021.
[166] A. v. d. Oord et al. Neural discrete representation learning. arXiv, 2017.
[167] J. Ho et al. Denoising diffusion probabilistic models. volume 33, pp. 6840–6851, 2020.
[168] A. Ramesh et al. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.
[169] R. Rombach et al. High-resolution image synthesis with latent diffusion models. In CVPR, pp. 10684–10695, 2022.
[170] X. Wang et al. Sceneformer: Indoor scene generation with transformers. In 3DV, pp. 106–115. IEEE, 2021.
[171] Z. Liu et al. Convtransformer: A convolutional transformer network for video frame synthesis. arXiv:2011.10185, 2020.
[172] R. Girdhar et al. Video action transformer network. In CVPR, 2019.
[173] H. Liu et al. Two-stream transformer networks for video-based face alignment. T-PAMI, 40(11):2546–2554, 2017.
[174] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
[175] S. Lohit et al. Temporal transformer networks: Joint learning of invariant and discriminative time warping. In CVPR, 2019.
[176] M. Fayyaz and J. Gall. Sct: Set constrained temporal transformer for set supervised action segmentation. In 2020 CVPR, pp. 501–510, 2020.
[177] W. Choi et al. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In ICCVW, 2009.
[178] K. Gavrilyuk et al. Actor-transformers for group activity recognition. In CVPR, pp. 839–848, 2020.
[179] J. Shao et al. Temporal context aggregation for video retrieval with contrastive learning. In WACV, 2021.
[180] V. Gabeur et al. Multi-modal transformer for video retrieval. In ECCV, pp. 214–229, 2020.
[181] Y. Chen et al. Memory enhanced global-local aggregation for video object detection. In CVPR, pp. 10337–10346, 2020.
[182] J. Yin et al. Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In 2020 CVPR, pp. 11495–11504, 2020.
[183] H. Seong et al. Video multitask transformer network. In ICCVW, 2019.
[184] K. M. Schatz et al. A recurrent transformer network for novel view action synthesis. In ECCV (27), pp. 410–426, 2020.
[185] C. Sun et al. Videobert: A joint model for video and language representation learning. In ICCV, pp. 7464–7473, 2019.
[186] L. H. Li et al. Visualbert: A simple and performant baseline for vision and language. arXiv:1908.03557, 2019.
[187] W. Su et al. Vl-bert: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[188] Y.-S. Chuang et al. Speechbert: Cross-modal pre-trained language model for end-to-end spoken question answering. In Interspeech, 2020.
[189] R. Hu and A. Singh. Unit: Multimodal multitask learning with a unified transformer. In ICCV, 2021.
[190] S. Prasanna et al. When bert plays the lottery, all tickets are winning. In EMNLP, 2020.
[191] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.
[192] Y. Tang et al. Patch slimming for efficient vision transformers. arXiv:2106.02852, 2021.
[193] M. Zhu et al. Vision transformer pruning. arXiv:2104.08500, 2021.
[194] Z. Liu et al. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[195] Z. Lan et al. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020.
[196] C. Xu et al. Bert-of-theseus: Compressing bert by progressive module replacing. In EMNLP, pp. 7859–7869, 2020.
[197] S. Shen et al. Q-bert: Hessian based ultra low precision quantization of bert. In AAAI, pp. 8815–8821, 2020.
[198] O. Zafrir et al. Q8bert: Quantized 8bit bert. arXiv:1910.06188, 2019.
[199] V. Sanh et al. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2019.
[200] S. Sun et al. Patient knowledge distillation for bert model compression. In EMNLP-IJCNLP, pp. 4323–4332, 2019.
[201] Z. Sun et al. Mobilebert: a compact task-agnostic bert for resource-limited devices. In ACL, pp. 2158–2170, 2020.
[202] I. Turc et al. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv:1908.08962, 2019.
[203] X. Qiu et al. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pp. 1–26, 2020.
[204] A. Fan et al. Reducing transformer depth on demand with structured dropout. In ICLR, 2020.
[205] L. Hou et al. Dynabert: Dynamic bert with adaptive width and depth. NeurIPS, 33, 2020.
[206] Z. Wang et al. Structured pruning of large language models. In EMNLP, pp. 6151–6162, 2020.
[207] G. Hinton et al. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[208] C. Buciluǎ et al. Model compression. In SIGKDD, pp. 535–541, 2006.
[209] J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS, 2014.
[210] S. Mukherjee and A. H. Awadallah. Xtremedistil: Multi-stage distillation for massive multilingual models. In ACL, pp. 2221–2234, 2020.
[211] W. Wang et al. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv:2002.10957, 2020.
[212] S. I. Mirzadeh et al. Improved knowledge distillation via teacher assistant. In AAAI, 2020.
[213] D. Jia et al. Efficient vision transformers via fine-grained manifold distillation. arXiv:2107.01378, 2021.
[214] V. Vanhoucke et al. Improving the speed of neural networks on cpus. In NIPS Workshop, 2011.
[215] Z. Yang et al. Searching for low-bit weights in quantized neural networks. In NeurIPS, 2020.
[216] E. Park and S. Yoo. Profit: A novel training method for sub-4-bit mobilenet models. In ECCV, pp. 430–446. Springer, 2020.
[217] J. Fromm et al. Riptide: Fast end-to-end binarized neural networks. Proceedings of Machine Learning and Systems, 2:379–389, 2020.
[218] Y. Bai et al. Proxquant: Quantized neural networks via proximal operators. In ICLR, 2019.
[219] A. Bhandare et al. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv:1906.00532, 2019.
[220] C. Fan. Quantized transformer. Technical report, Stanford Univ., 2019.
[221] K. Shridhar et al. End to end binarized neural networks for text classification. In SustaiNLP, 2020.
[222] R. Cheong and R. Daniel. transformers.zip: Compressing transformers with pruning and quantization. Technical report, 2019.
[223] Z. Zhao et al. An investigation on different underlying quantization schemes for pre-trained language models. In NLPCC, 2020.
[224] Z. Liu et al. Post-training quantization for vision transformer. In NeurIPS, 2021.
[225] Z. Wu et al. Lite transformer with long-short range attention. In ICLR, 2020.
[226] Z. Geng et al. Is attention better than matrix decomposition? In ICLR, 2020.
[227] Y. Guo et al. Nat: Neural architecture transformer for accurate and compact architectures. In NeurIPS, pp. 737–748, 2019.
[228] D. So et al. The evolved transformer. In ICML, pp. 5877–5886, 2019.
[229] C. Li et al. Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In ICCV, 2021.
[230] A. Katharopoulos et al. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020.
[231] C. Yun et al. o(n) connections are expressive enough: Universal approximability of sparse transformers. In NeurIPS, 2020.
[232] M. Zaheer et al. Big bird: Transformers for longer sequences. In NeurIPS, 2020.
[233] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4), 2011.
[234] F. Chung and L. Lu. The average distances in random graphs with given expected degrees. PNAS, 99(25):15879–15882, 2002.
[235] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[236] X. Zhai et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv:1910.04867, 2019.
[237] Y. Cheng et al. Robust neural machine translation with doubly adversarial inputs. In ACL, 2019.
[238] W. E. Zhang et al. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM TIST, 11(3):1–41, 2020.
[239] K. Mahmood et al. On the robustness of vision transformers to adversarial examples. arXiv:2104.02610, 2021.
[240] X. Mao et al. Towards robust vision transformer. arXiv, 2021.
[241] S. Serrano and N. A. Smith. Is attention interpretable? In ACL, 2019.
[242] S. Wiegreffe and Y. Pinter. Attention is not not explanation. In EMNLP-IJCNLP, 2019.
[243] H. Chefer et al. Transformer interpretability beyond attention visualization. In CVPR, pp. 782–791, 2021.
[244] R. Livni et al. On the computational efficiency of training neural networks. In NeurIPS, 2014.
[245] B. Neyshabur et al. Towards understanding the role of over-parametrization in generalization of neural networks. In ICLR, 2019.
[246] K. Han et al. Ghostnet: More features from cheap operations. In CVPR, pp. 1580–1589, 2020.
[247] K. Han et al. Model rubik's cube: Twisting resolution, depth and width for tinynets. NeurIPS, 33, 2020.
[248] T. Chen et al. Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, pp. 269–284, 2014.
[249] H. Liao et al. Davinci: A scalable architecture for neural network computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), 2019.
[250] A. Jaegle et al. Perceiver: General perception with iterative attention. In ICML, volume 139, pp. 4651–4664. PMLR, 2021.
[251] A. Jaegle et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv:2107.14795, 2021.
[252] X. Wang et al. Non-local neural networks. In CVPR, pp. 7794–7803, 2018.
[253] A. Buades et al. A non-local algorithm for image denoising. In CVPR, pp. 60–65, 2005.
[254] J. Chung et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[255] M. Joshi et al. Spanbert: Improving pre-training by representing and
[257] Y. Zhu et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, pp. 19–27, 2015.
[258] A. Radford et al. Improving language understanding by generative pre-training, 2018.
[259] Z. Yang et al. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5753–5763, 2019.
[260] K. Clark et al. Electra: Pre-training text encoders as discriminators rather than generators. arXiv:2003.10555, 2020.
[261] L. Dong et al. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pp. 13063–13075, 2019.
[262] M. Lewis et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv:1910.13461, 2019.
[263] Z. Zhang et al. Ernie: Enhanced language representation with informative entities. arXiv:1905.07129, 2019.
[264] M. E. Peters et al. Knowledge enhanced contextual word representations. arXiv:1909.04164, 2019.
[265] J. Lee et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
[266] I. Beltagy et al. Scibert: A pretrained language model for scientific text. arXiv:1903.10676, 2019.
[267] K. Huang et al. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv:1904.05342, 2019.
[268] J. Ba et al. Multiple object recognition with visual attention. In ICLR, 2014.
[269] V. Mnih et al. Recurrent models of visual attention. NeurIPS, pp. 2204–2212, 2014.
[270] K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048–2057, 2015.
[271] F. Wang et al. Residual attention network for image classification. In CVPR, pp. 3156–3164, 2017.
[272] S. Jetley et al. Learn to pay attention. In ICLR, 2018.
[273] K. Han et al. Attribute-aware attention model for fine-grained representation learning. In ACM MM, pp. 2040–2048, 2018.
[274] P. Ramachandran et al. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[275] Q. Guan et al. Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv:1801.09927, 2018.
[276] J. Hu et al. Squeeze-and-excitation networks. In CVPR, pp. 7132–7141, 2018.
[277] H. Zhao et al. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283, 2018.
[278] Y. Yuan et al. Ocnet: Object context for semantic segmentation. International Journal of Computer Vision, pp. 1–24, 2021.
[279] J. Fu et al. Dual attention network for scene segmentation. In CVPR, pp. 3146–3154, 2019.
[280] H. Zhang et al. Co-occurrent features in semantic segmentation. In CVPR, pp. 548–557, 2019.
[281] F. Zhang et al. Acfnet: Attentional class feature network for semantic segmentation. In ICCV, pp. 6798–6807, 2019.
[282] X. Li et al. Expectation-maximization attention networks for semantic segmentation. In ICCV, pp. 9167–9176, 2019.
[283] J. He et al. Adaptive pyramid context network for semantic segmentation. In CVPR, pp. 7519–7528, 2019.
[284] O. Oktay et al. Attention u-net: Learning where to look for the pancreas. 2018.
[285] Y. Wang et al. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, pp. 12275–12284, 2020.
[286] X. Li et al. Global aggregation then local distribution in fully convolutional networks. In BMVC, 2019.
[287] Y. Chen et al. A^2-nets: Double attention networks. NeurIPS, pp. 352–361, 2018.
[288] L. Zhang et al. Dual graph convolutional network for semantic segmentation. In BMVC, 2019.
[289] K. Yue et al. Compact generalized non-local network. In NeurIPS, pp. 6510–6519, 2018.
[290] Z. Huang et al. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, pp. 603–612, 2019.
predicting spans. Transactions of the Association for Computational [291] L. Huang et al. Interlaced sparse self-attention for semantic segmenta-
Linguistics, 8:64–77, 2020. tion. arXiv:1907.12273, 2019.
[256] Y. Liu et al. Roberta: A robustly optimized bert pretraining approach. [292] Y. Li and A. Gupta. Beyond grids: Learning graph representations for
arXiv:1907.11692, 2019. visual recognition. NeurIPS, pp. 9225–9235, 2018.