0% found this document useful (0 votes)

17 views23 pages

2012.12556

This document is a survey on visual transformers, exploring their application in computer vision tasks and categorizing various models based on their functions such as backbone networks, high/mid-level vision, low-level vision, and video processing. The paper highlights the advantages of transformer models over traditional convolutional and recurrent neural networks, emphasizing their strong representation capabilities and efficiency in real-world applications. Additionally, it discusses the self-attention mechanism, challenges in the field, and potential future research directions.

Uploaded by

f.alpcelik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views23 pages

2012.12556

Uploaded by

f.alpcelik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1

A Survey on Visual Transformer

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao,
Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao Fellow, IEEE

Abstract—Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the
self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to
computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of
networks such as convolutional and recurrent neural networks. Given its high performance and less need for vision-specific inductive
bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision
transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we
explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient
transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the
arXiv:2012.12556v6 [cs.CV] 10 Jul 2023

self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the
challenges and provide several further research directions for vision transformers.

Index Terms—Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video.

1 I NTRODUCTION

D EEP neural networks (DNNs) have become the fundamental

infrastructure in today’s artificial intelligence (AI) systems.
Different types of tasks have typically involved different types
TB of compressed plaintext data using 175 billion parameters.
It achieved strong performance on different types of downstream
natural language tasks without requiring any fine-tuning. These
of networks. For example, multi-layer perceptron (MLP) or the transformer-based models, with their strong representation capac-
fully connected (FC) network is the classical type of neural ity, have achieved significant breakthroughs in NLP.
network, which is composed of multiple linear layers and non- Inspired by the major success of transformer architectures in
linear activations stacked together [1], [2]. Convolutional neural the field of NLP, researchers have recently applied transformer
networks (CNNs) introduce convolutional layers and pooling to computer vision (CV) tasks. In vision applications, CNNs are
layers for processing shift-invariant data such as images [3], [4]. considered the fundamental component [12], [13], but nowadays
And recurrent neural networks (RNNs) utilize recurrent cells to transformer is showing it is a potential alternative to CNN. Chen et
process sequential data or time series data [5], [6]. Transformer is al. [14] trained a sequence transformer to auto-regressively predict
a new type of neural network. It mainly utilizes the self-attention pixels, achieving results comparable to CNNs on image classi-
mechanism [7], [8] to extract intrinsic features [9] and shows great fication tasks. Another vision transformer model is ViT, which
potential for extensive use in AI applications. applies a pure transformer directly to sequences of image patches
Transformer was first applied to natural language processing to classify the full image. Recently proposed by Dosovitskiy et
(NLP) tasks where it achieved significant improvements [9], [10], al. [15], it has achieved state-of-the-art performance on multiple
[11]. For example, Vaswani et al. [9] first proposed transformer image recognition benchmarks. In addition to image classification,
based on attention mechanism for machine translation and English transformer has been utilized to address a variety of other vision
constituency parsing tasks. Devlin et al. [10] introduced a new lan- problems, including object detection [16], [17], semantic segmen-
guage representation model called BERT (short for Bidirectional tation [18], image processing [19], and video understanding [20].
Encoder Representations from Transformers), which pre-trains a Thanks to its exceptional performance, more and more researchers
transformer on unlabeled text taking into account the context of are proposing transformer-based models for improving a wide
each word as it is bidirectional. When BERT was published, it range of visual tasks.
obtained state-of-the-art performance on 11 NLP tasks. Brown et Due to the rapid increase in the number of transformer-based
al. [11] pre-trained a massive transformer-based model called vision models, keeping pace with the rate of new progress is
GPT-3 (short for Generative Pre-trained Transformer 3) on 45 becoming increasingly difficult. As such, a survey of the existing
works is urgent and would be beneficial for the community. In
• Kai Han, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui this paper, we focus on providing a comprehensive overview
Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, of the recent advances in vision transformers and discuss the
and Yunhe Wang are with Huawei Noah’s Ark Lab. E-mail: {kai.han,
yunhe.wang}@huawei.com.
potential directions for further improvement. To facilitate future
• Hanting Chen, Zhenhua Liu, Yehui Tang, and Zhaohui Yang are also with research on different topics, we categorize the transformer models
School of EECS, Peking University. by their application scenarios, as listed in Table 1. The main
• Dacheng Tao is with the School of Computer Science, in the Faculty of categories include backbone network, high/mid-level vision, low-
Engineering, at The University of Sydney, 6 Cleveland St, Darlington,
NSW 2008, Australia. E-mail: [email protected]. level vision, and video processing. High-level vision deals with the
• Corresponding to Yunhe Wang and Dacheng Tao. interpretation and use of what is seen in the image [21], whereas
• All authors are listed in alphabetical order of last name (except the mid-level vision deals with how this information is organized into
primary and corresponding authors).
what we experience as objects and surfaces [22]. Given the gap
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2

2017.6 | Transformer 2020.5 | GPT-3 2020.10 | ViT 2021 | ViT Variants 2023 | GPT4
Solely based on attention A huge transformer with Pure transformer Variants of ViT models, A generalized multi-
mechanism, the Transformer is 170B parameters, takes a architectures work well for e.g., DeiT, PVT, TNT, modal model for both
proposed and shows great big step towards general language and vision
visual recognition. and Swin.
performance on NLP tasks. NLP model.
tasks.

2018.10 | BERT 2020.5 | DETR End of 2020 | IPT/SETR/CLIP 2022 | DALLE2/StableDiffsusion

Pre-training transformer models A simple yet effective Applications of transformer Generating high-quality
begin to be dominated in the framework for high-level vision model on low-level vision, images from natural
field of NLP. by viewing object detection as
segment and multimodal language descriptions with
a direct set prediction problem.
tasks, respectively. diffusion models.

Fig. 1: Key milestones in the development of transformer. The vision transformer models are marked in red.
between high- and mid-level vision is becoming more obscure
in DNN-based vision systems [23], [24], we treat them as a
single category here. A few examples of transformer models that
address these high/mid-level vision tasks include DETR [16], de-
formable DETR [17] for object detection, and Max-DeepLab [25]
for segmentation. Low-level image processing mainly deals with
extracting descriptions from images (such descriptions are usually
represented as images themselves) [26]. Typical applications of
low-level image processing include super-resolution, image de-
noising, and style transfer. At present, only a few works [19], [27]
in low-level vision use transformers, creating the need for further
investigation. Another category is video processing, which is an
important part in both computer vision and image-based tasks. Due
to the sequential property of video, transformer is inherently well
suited for use on video tasks [20], [28], in which it is beginning
to perform on par with conventional CNNs and RNNs. Here, we
survey the works associated with transformer-based visual models
in order to track the progress in this field. Figure 1 shows the
development timeline of vision transformer — undoubtedly, there Fig. 2: Structure of the original transformer (image from [9]).
will be many more milestones in the future.
The rest of the paper is organized as follows. Section 2 2.1 Self-Attention
discusses the formulation of the standard transformer and the self-
In the self-attention layer, the input vector is first transformed into
attention mechanism. Section 4 is the main part of the paper, in
three different vectors: the query vector q, the key vector k and the
which we summarize the vision transformer models on backbone,
value vector v with dimension dq = dk = dv = dmodel = 512.
high/mid-level vision, low-level vision, and video tasks. We also
Vectors derived from different inputs are then packed together into
briefly describe efficient transformer methods, as they are closely
three different matrices, namely, Q, K and V. Subsequently, the
related to our main topic. In the final section, we give our
attention function between different input vectors is calculated as
conclusion and discuss several research directions and challenges.
follows (and shown in Figure 3 left):
Due to the page limit, we describe the methods of transformer in
NLP in the supplemental material, as the research experience may • Step 1: Compute scores between different input vectors
be beneficial for vision tasks. In the supplemental material, we also with S = Q · K⊤ ;
review the self-attention mechanism for CV as the supplementary • Step 2: Normalize
√ the scores for the stability of gradient
of vision transformer models. In this survey, we mainly include the with Sn = S/ dk ;
representative works (early, pioneering, novel, or inspiring works) • Step 3: Translate the scores into probabilities with softmax
since there are many preprinted works on arXiv and we cannot function P = softmax(Sn );
include them all in limited pages. • Step 4: Obtain the weighted value matrix with Z = V ·P.
The process can be unified into a single function:
2 F ORMULATION OF T RANSFORMER
Q · K⊤
Transformer [9] was first used in the field of natural language Attention(Q, K, V) = softmax( √ ) · V. (1)
processing (NLP) on machine translation tasks. As shown in dk
Figure 2, it consists of an encoder and a decoder with several The logic behind Eq. 1 is simple. Step 1 computes scores between
transformer blocks of the same architecture. The encoder gener- each pair of different vectors, and these scores determine the
ates encodings of inputs, while the decoder takes all the encodings degree of attention that we give other words when encoding
and using their incorporated contextual information to generate the word at the current position. Step 2 normalizes the scores
the output sequence. Each transformer block is composed of a to enhance gradient stability for improved training, and step 3
multi-head attention layer, a feed-forward neural network, shortcut translates the scores into probabilities. Finally, each value vector
connection and layer normalization. In the following, we describe is multiplied by the sum of the probabilities. Vectors with larger
each component of the transformer in detail. probabilities receive additional focus from the following layers.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3

TABLE 1: Representative works of vision transformers.

Category Sub-category Method Highlights Publication
ViT [15] Image patches, standard transformer ICLR 2021
Supervised pretraining TNT [29] Transformer in transformer, local attention NeurIPS 2021
Backbone Swin [30] Shifted window, window-based self-attention ICCV 2021
iGPT [14] Pixel prediction self-supervised learning, GPT model ICML 2020
Self-supervised pretraining MoCo v3 [31] Contrastive self-supervised learning, ViT ICCV 2021
MAE [32] Masked image modeling, ViT CVPR 2022
DETR [16] Set-based prediction, bipartite matching, transformer ECCV 2020
Object detection Deformable DETR [17] DETR, deformable attention module ICLR 2021
UP-DETR [33] Unsupervised pre-training, random query patch detection CVPR 2021
High/Mid-level Max-DeepLab [25] PQ-style bipartite matching, dual-path transformer CVPR 2021
Segmentation VisTR [34] Instance sequence matching and segmentation CVPR 2021
vision
SETR [18] Sequence-to-sequence prediction, standard transformer CVPR 2021
Hand-Transformer [35] Non-autoregressive transformer, 3D point set ECCV 2020
Pose Estimation HOT-Net [36] Structured-reference extractor MM 2020
METRO [37] Progressive dimensionality reduction CVPR 2021
Image Transformer [27] Pixel generation using transformer ICML 2018
Image generation Taming transformer [38] VQ-GAN, auto-regressive transformer CVPR 2021
Low-level
vision TransGAN [39] GAN using pure transformer architecture NeurIPS 2021
IPT [19] Multi-task, ImageNet pre-training, transformer model CVPR 2021
Image enhancement
TTSR [40] Texture transformer, RefSR CVPR 2020
Video Video inpainting STTN [28] Spatial-temporal adversarial loss ECCV 2020
processing Video captioning Masked Transformer [20] Masking network, event proposal CVPR 2018
Classification CLIP [41] NLP supervision for images, zero-shot transfer arXiv 2021
DALL-E [42] Zero-shot text-to image generation ICML 2021
Multimodality Image generation
Cogview [43] VQ-VAE, Chinese input NeurIPS 2021
Multi-task GPT-4 [44] Large Multi-modal model for NLP & CV tasks arXiv 2023
Decomposition ASH [45] Number of heads, importance estimation NeurIPS 2019
Efficient Distillation TinyBert [46] Various losses for different modules EMNLP Findings 2020
transformer Quantization FullyQT [47] Fully quantized transformer EMNLP Findings 2020
Architecture design ConvBert [48] Local dependence, dynamic convolution NeurIPS 2020

The encoder-decoder attention layer in the decoder module

is similar to the self-attention layer in the encoder module with
the following exceptions: The key matrix K and value matrix V
are derived from the encoder module, and the query matrix Q is
derived from the previous layer.
Note that the preceding process is invariant to the position of
each word, meaning that the self-attention layer lacks the ability
to capture the positional information of words in a sentence.
However, the sequential nature of sentences in a language requires
us to incorporate the positional information within our encoding.
Fig. 3: (Left) Self-attention process. (Right) Multi-head attention.
To address this issue and allow the final input vector of the word
The image is from [9].
to be obtained, a positional encoding with dimension dmodel is
added to the original input embedding. Specifically, the position is
encoded with the following equations: attention on other equally important positions at the same time.
This is achieved by giving attention layers different representation
pos subspace. Specifically, different query, key and value matrices are
P E(pos, 2i) = sin( 2i ); (2)
10000 dmodel used for different heads, and these matrices can project the input
pos vectors into different representation subspace after training due to
P E(pos, 2i + 1) = cos( 2i ), (3)
10000 dmodel random initialization.
in which pos denotes the position of the word in a sentence, and To elaborate on this in greater detail, given an input vector
i represents the current dimension of the positional encoding. In and the number of heads h, the input vector is first transformed
this way, each element of the positional encoding corresponds to into three different groups of vectors: the query group, the key
a sinusoid, and it allows the transformer model to learn to attend group and the value group. In each group, there are h vectors with
by relative positions and extrapolate to longer sequence lengths dimension dq′ = dk′ = dv′ = dmodel /h = 64. The vectors
during inference. In apart from the fixed positional encoding in the derived from different inputs are then packed together into three
vanilla transformer, learned positional encoding [49] and relative different groups of matrices: {Qi }hi=1 , {Ki }hi=1 and {Vi }hi=1 .
positional encoding [50] are also utilized in various models [10], The multi-head attention process is shown as follows:
[15].
Multi-Head Attention. Multi-head attention is a mechanism MultiHead(Q′ , K′ , V′ ) = Concat(head1 , · · · , headh )Wo ,
that can be used to boost the performance of the vanilla self-
where headi = Attention(Qi , Ki , Vi ). (4)
attention layer. Note that for a given reference word, we often
want to focus on several other words when going through the
sentence. A single-head self-attention layer limits our ability to Here, Q′ (and similarly K′ and V′ ) is the concatenation of
focus on one or more specific positions without influencing the {Qi }hi=1 , and Wo ∈ Rdmodel ×dmodel is the projection weight.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 4

2.2 Other Key Concepts in Transformer Backbone for Representation Learning

Feed-Forward Network. A feed-forward network (FFN) is ap-
plied after the self-attention layers in each encoder and decoder. It Convolution Attention
consists of two linear transformation layers and a nonlinear acti-
vation function within them, and can be denoted as the following CNN CNN + Transformer Transformer SENet
function: NLNet
FFN(X) = W2 σ(W1 X), (5) AlexNet/ResNet/DenseNet BoTNet/CeiT ViT/PVT/TNT/Swin GCNet

where W1 and W2 are the two parameter matrices of the Fig. 4: A taxonomy of backbone using convolution and attention.
two linear transformation layers, and σ represents the nonlinear
activation function, such as GELU [51]. The dimensionality of the more dimensions, noise and redundant modality compared to text,
hidden layer is dh = 2048. they are believed to be more difficult for generative modeling.
Residual Connection in the Encoder and Decoder. As shown in Other than CNNs, the transformer can be used as backbone
Figure 2, a residual connection is added to each sub-layer in the networks for image classification. Wu et al. [59] adopted ResNet
encoder and decoder. This strengthens the flow of information in as a convenient baseline and used vision transformers to replace
order to achieve higher performance. A layer-normalization [52] the last stage of convolutions. Specifically, they apply convolu-
is followed after the residual connection. The output of these 1tional layers
Huawei Confidential to extract low-level features that are then fed into

operations can be described as: the vision transformer. For the vision transformer, they use a
tokenizer to group pixels into a small number of visual tokens,
LayerNorm(X + Attention(X)). (6) each representing a semantic concept in the image. These visual
Here, X is used as the input of self-attention layer, and the query, tokens are used directly for image classification, with the trans-
key and value matrices Q, K and V are all derived from the same formers being used to model the relationships between tokens. As
input matrix X. A variant pre-layer normalization (Pre-LN) is also shown in Figure 4, the works can be divided into purely using
widely-used [53], [54], [15]. Pre-LN inserts the layer normaliza- transformer for vision and combining CNN and transformer. We
tion inside the residual connection and before multi-head attention summarize the results of these models in Table 2 and Figure 6
or FFN. For the normalization layer, there are several alternatives to demonstrate the development of the backbones. In addition to
such as batch normalization [55]. Batch normalization usually supervised learning, self-supervised learning is also explored in
perform worse when applied on transformer as the feature values vision transformer.
change acutely [56]. Some other normalization algorithms [57],
3.1.1 Pure Transformer
[56], [58] have been proposed to improve training of transformer.
Final Layer in the Decoder. The final layer in the decoder is used ViT. Vision Transformer (ViT) [15] is a pure transformer directly
to turn the stack of vectors back into a word. This is achieved by a applies to the sequences of image patches for image classification
linear layer followed by a softmax layer. The linear layer projects task. It follows transformer’s original design as much as possible.
the vector into a logits vector with dword dimensions, in which Figure 5 shows the framework of ViT.
dword is the number of words in the vocabulary. The softmax To handle 2D images, the image X ∈ Rh×w×c is reshaped
2

layer is then used to transform the logits vector into probabilities. into a sequence of flattened 2D patches Xp ∈ Rn×(p ·c) such
When used for CV tasks, most transformers adopt the original that c is the number of channels. (h, w) is the resolution of the
transformer’s encoder module. Such transformers can be treated original image, while (p, p) is the resolution of each image patch.
as a new type of feature extractor. Compared with CNNs which The effective sequence length for the transformer is therefore n =
focus only on local characteristics, transformer can capture long- hw/p2 . Because the transformer uses constant widths in all of its
distance characteristics, meaning that it can easily derive global layers, a trainable linear projection maps each vectorized path to
information. And in contrast to RNNs, whose hidden state must the model dimension d, the output of which is referred to as patch
be computed sequentially, transformer is more efficient because embeddings.
the output of the self-attention layer and the fully connected layers Similar to BERT’s [class] token, a learnable embedding is
can be computed in parallel and easily accelerated. From this, we applied to the sequence of embedding patches. The state of this
can conclude that further study into using transformer in computer embedding serves as the image representation. During both pre-
vision as well as NLP would yield beneficial results. training and fine-tuning stage, the classification heads are attached
to the same size. In addition, 1D position embeddings are added
to the patch embeddings in order to retain positional information.
3 V ISION T RANSFORMER It is worth noting that ViT utilizes only the standard transformer’s
In this section, we review the applications of transformer- encoder (except for the place for the layer normalization), whose
based models in computer vision, including image classification, output precedes an MLP head. In most cases, ViT is pre-trained
high/mid-level vision, low-level vision and video processing. We on large datasets, and then fine-tuned for downstream tasks with
also briefly summarize the applications of the self-attention mech- smaller data.
anism and model compression methods for efficient transformer. ViT yields modest results when trained on mid-sized datasets
such as ImageNet, achieving accuracies of a few percentage
points below ResNets of comparable size. Because transformers
3.1 Backbone for Representation Learning lack some inductive biases inherent to CNNs–such as translation
Inspired by the success that transformer has achieved in the field of equivariance and locality–they do not generalize well when trained
NLP, some researchers have explored whether similar models can on insufficient amounts of data. However, the authors found
learn useful representations for images. Given that images involve that training the models on large datasets (14 million to 300
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 5

million images) surpassed inductive bias. When pre-trained at Improving the calculation of self-attention layer has attracted
sufficient scale, transformers achieve excellent results on tasks many researchers. DeepViT [69] proposes to establish cross-
with fewer datapoints. For example, when pre-trained on the head communication to re-generate the attention maps to increase
JFT-300M dataset, ViT approached or even exceeded state of the diversity at different layers. KVT [70] introduces the k -NN
the art performance on multiple image recognition benchmarks. attention to utilize locality of images patches and ignore noisy
Specifically, it reached an accuracy of 88.36% on ImageNet, and tokens by only computing attentions with top-k similar tokens. Re-
77.16% on the VTAB suite of 19 tasks. finer [71] explores attention expansion in higher-dimension space
Touvron et al. [60] proposed a competitive convolution-free and applied convolution to augment local patterns of the attention
transformer, called Data-efficient image transformer (DeiT), by maps. XCiT [72] performs self-attention calculation across feature
training on only the ImageNet database. DeiT-B, the reference vi- channels rather than tokens, which allows efficient processing of
sion transformer, has the same architecture as ViT-B and employs high-resolution images. The computation complexity and attention
86 million parameters. With a strong data augmentation, DeiT- precision of the self-attention mechanism are two key-points for
B achieves top-1 accuracy of 83.1% (single-crop evaluation) on future optimization.
ImageNet with no external data. In addition, the authors observe The network architecture is an important factor as demon-
that using a CNN teacher gives better performance than using strated in the field of CNNs. The original architecture of ViT
a transformer. Specifically, DeiT-B can achieve top-1 accuracy is a simple stack of the same-shape transformer block. New
84.40% with the help of a token-based distillation. architecture design for vision transformer has been an interesting
topic. The pyramid-like architecture is utilized by many vision
transformer models [73], [61], [74], [75], [76], [77] including
PVT [73], HVT [78], Swin Transformer [61] and PiT [79]. There
are also other types of architectures, such as two-stream architec-
ture [80] and U-net architecture [81], [30]. Neural architecture
search (NAS) has also been investigated to search for better
transformer architectures, e.g., Scaling-ViT [82], ViTAS [83],
AutoFormer [84] and GLiT [85]. Currently, both network design
and NAS for vision transformer mainly draw on the experience of
CNN. In the future, we expect the specific and novel architectures
appear in the filed of vision transformer.
In addition to the aforementioned approaches, there are some
other directions to further improve vision transformer, e.g., posi-
Fig. 5: The framework of ViT (image from [15]). tional encoding [86], [87], normalization strategy [88], shortcut
connection [89] and removing attention [90], [91], [92], [93].
Variants of ViT. Following the paradigm of ViT, a series of
variants of ViT have been proposed to improve the performance 3.1.2 Transformer with Convolution
on vision tasks. The main approaches include enhancing locality, Although vision transformers have been successfully applied to
self-attention improvement and architecture design. various visual tasks due to their ability to capture long-range
The original vision transformer is good at capturing long-range dependencies within the input, there are still gaps in performance
dependencies between patches, but disregard the local feature between transformers and existing CNNs. One main reason can be
extraction as the 2D patch is projected to a vector with simple the lack of ability to extract local information. Except the above
linear layer. Recently, the researchers begin to pay attention to mentioned variants of ViT that enhance the locality, combining the
improve the modeling capacity for local information [29], [61], transformer with convolution can be a more straightforward way
[62]. TNT [29] further divides the patch into a number of sub- to introduce the locality into the conventional transformer.
patches and introduces a novel transformer-in-transformer archi- There are plenty of works trying to augment a conventional
tecture which utilizes an inner transformer block to model the transformer block or self-attention layer with convolution. For
relationship between sub-patches and an outer transformer block example, CPVT [86] proposed a conditional positional encoding
for patch-level information exchange. Twins [63] and CAT [64] (CPE) scheme, which is conditioned on the local neighborhood
alternately perform local and global attention layer-by-layer. Swin of input tokens and adaptable to arbitrary input sizes, to leverage
Transformers [61], [65] performs local attention within a win- convolutions for fine-level feature encoding. CvT [97], CeiT [98],
dow and introduces a shifted window partitioning approach for LocalViT [99] and CMT [95] analyzed the potential drawbacks
cross-window connections. Shuffle Transformer [66], [67] further when directly borrowing Transformer architectures from NLP and
utilizes the spatial shuffle operation instead of shifted window combined the convolutions with transformers together. Specifi-
partitioning to allow cross-window connections. RegionViT [62] cally, the feed-forward network (FFN) in each transformer block is
generates regional tokens and local tokens from an image, and combined with a convolutional layer that promotes the correlation
local tokens receive global information via attention with regional among neighboring tokens. LeViT [100] revisited principles from
tokens. In addition to the local attention, some other works propose extensive literature on CNNs and applied them to transformers,
to boost local information through local feature aggregation, e.g., proposing a hybrid neural network for fast inference image clas-
T2T [68]. These works demonstrate the benefit of the local sification. BoTNet [101] replaced the spatial convolutions with
information exchange and global information exchange in vision global self-attention in the final three bottleneck blocks of a
transformer. ResNet, and improved upon the baselines significantly on both
As a key component of transformer, self-attention layer pro-
vides the ability for global interaction between image patches. 1.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 6

TABLE 2: ImageNet result comparison of representative CNN and 86 86

vision transformer models. Pure transformer means only using a few 85 85 ResNet
EfficientNet
T2T
Swin
convolutions in the stem stage. CNN + Transformer means using 84 84 DeiT CMT
PVT VOLO
convolutions in the intermediate layers. Following [60], [61], the 83 83

Accuracy (%)

Accuracy (%)
throughput is measured on NVIDIA V100 GPU and Pytorch, with 82 82

224×224 input size. 81 81

80 ResNet T2T 80
Params FLOPs Throughput Top-1 EfficientNet Swin
Model 79 DeiT CMT 79
(M) (B) (image/s) (%) PVT VOLO
CNN 78 78
2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 200 400 600 800 1000 1200
ResNet-50 [12], [68] 25.6 4.1 1226 79.1 FLOPs (B) Throughput (image/s)

ResNet-101 [12], [68] 44.7 7.9 753 79.9

ResNet-152 [12], [68] 60.2 11.5 526 80.8 (a) Acc v.s. FLOPs. (b) Acc v.s. throughput.
EfficientNet-B0 [94] 5.3 0.39 2694 77.1
EfficientNet-B1 [94] 7.8 0.70 1662 79.1 Fig. 6: FLOPs and throughput comparison of representative CNN
EfficientNet-B2 [94] 9.2 1.0 1255 80.1 and vision transformer models.
EfficientNet-B3 [94] 12 1.8 732 81.6
EfficientNet-B4 [94] 19 4.2 349 82.9
Pure Transformer [109] were proposed to extend generative based self-supervised
DeiT-Ti [15], [60] 5 1.3 2536 72.2 learning for vision transformer.
DeiT-S [15], [60] 22 4.6 940 79.8
DeiT-B [15], [60] 86 17.6 292 81.8 We briefly introduce iGPT [14] to demonstrate its mechanism.
T2T-ViT-14 [68] 21.5 5.2 764 81.5 This approach consists of a pre-training stage followed by a fine-
T2T-ViT-19 [68] 39.2 8.9 464 81.9
T2T-ViT-24 [68] 64.1 14.1 312 82.3 tuning stage. During the pre-training stage, auto-regressive and
PVT-Small [73] 24.5 3.8 820 79.8 BERT objectives are explored. To implement pixel prediction, a
PVT-Medium [73] 44.2 6.7 526 81.2
PVT-Large [73] 61.4 9.8 367 81.7
sequence transformer architecture is adopted instead of language
TNT-S [29] 23.8 5.2 428 81.5 tokens (as used in NLP). Pre-training can be thought of as a
TNT-B [29] 65.6 14.1 246 82.9 favorable initialization or regularizer when used in combination
CPVT-S [86] 23 4.6 930 80.5
CPVT-B [86] 88 17.6 285 82.3 with early stopping. During the fine-tuning stage, they add a
Swin-T [61] 29 4.5 755 81.3 small classification head to the model. This helps optimize a
Swin-S [61] 50 8.7 437 83.0
Swin-B [61] 88 15.4 278 83.3
classification objective and adapts all weights.
CNN + Transformer The image pixels are transformed into a sequential data by
Twins-SVT-S [63] 24 2.9 1059 81.7 k -means clustering. Given an unlabeled dataset X consisting of
Twins-SVT-B [63] 56 8.6 469 83.2
Twins-SVT-L [63] 99.2 15.1 288 83.7 high dimensional data x = (x1 , · · · , xn ), they train the model by
Shuffle-T [66] 29 4.6 791 82.5 minimizing the negative log-likelihood of the data:
Shuffle-S [66] 50 8.9 450 83.5
Shuffle-B [66] 88 15.6 279 84.0 LAR = E [− log p(x)], (7)
CMT-S [95] 25.1 4.0 563 83.5 x∼X
CMT-B [95] 45.7 9.3 285 84.5
VOLO-D1 [96] 27 6.8 481 84.2 where p(x) is the probability density of the data of images, which
VOLO-D2 [96] 59 14.1 244 85.2 can be modeled as:
VOLO-D3 [96] 86 20.6 168 85.4
n
VOLO-D4 [96] 193 43.8 100 85.7 Y
VOLO-D5 [96] 296 69.0 64 86.1 p(x) = p(xπi |xπ1 , · · · , xπi−1 , θ). (8)
i=1

instance segmentation and object detection tasks with minimal Here, the identity permutation πi = i is adopted for 1 ⩽ i ⩽ n,
overhead in latency. which is also known as raster order. Chen et al. also considered the
Besides, some researchers have demonstrated that transformer BERT objective, which samples a sub-sequence M ⊂ [1, n] such
based models can be more difficult to enjoy a favorable ability of that each index i independently has probability 0.15 of appearing
fitting data [15], [102], [103], in other words, they are sensitive in M . M is called the BERT mask, and the model is trained by
to the choice of optimizer, hyper-parameter, and the schedule of minimizing the negative log-likelihood of the “masked” elements
training. Visformer [102] revealed the gap between transformers xM conditioned on the “unmasked” ones x[1,n]\M :
and CNNs with two different training settings. The first one is the LBERT = E E
X
[− log p(xi |x[1,n]\M )]. (9)
standard setting for CNNs, i.e., the training schedule is shorter x∼X M
i∈M
and the data augmentation only contains random cropping and
horizental flipping. The other one is the training setting used During the pre-training stage, they pick either LAR or LBERT
in [60], i.e., the training schedule is longer and the data augmenta- and minimize the loss over the pre-training dataset.
tion is stronger. [103] changed the early visual processing of ViT GPT-2 [110] formulation of the transformer decoder block
by replacing its embedding stem with a standard convolutional is used. To ensure proper conditioning when training the AR
stem, and found that this change allows ViT to converge faster objective, Chen et al. apply the standard upper triangular mask
and enables the use of either AdamW or SGD without a significant to the n × n matrix of attention logits. No attention logit masking
drop in accuracy. In addition to these two works, [100], [95] also is required when the BERT objective is used: Chen et al. zero
choose to add convolutional stem on the top of the transformer. out the positions after the content embeddings are applied to the
input sequence. Following the final transformer layer, they apply a
layer norm and learn a projection from the output to logits param-
3.1.3 Self-supervised Representation Learning eterizing the conditional distributions at each sequence element.
Generative Based Approach. Generative pre-training methods When training BERT, they simply ignore the logits at unmasked
for images have existed for a long time [104], [105], [106], [107]. positions.
Chen et al. [14] re-examined this class of methods and combined During the fine-tuning stage, they average pool the output of
it with self-supervised methods. After that, several works [108], the final layer normalization layer across the sequence dimension
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 7

to extract a d-dimensional vector of features per example. They 3.1.4 Discussions

learn a projection from the pooled feature to class logits and All of the components of vision transformer including multi-
use this projection to minimize a cross entropy loss. Practical head self-attention, multi-layer perceptron, shortcut connection,
applications offer empirical evidence that the joint objective of layer normalization, positional encoding and network topology,
cross entropy loss and pretraining loss (LAR or LBERT ) works play key roles in visual recognition. As stated above, a number
even better. After iGPT, masked image modeling is proposed such of works have been proposed to improve the effectiveness and
as MAE [32] and SimMIM [111] which achieves competitive efficiency of vision transformer. From the results in Figure 6,
performance on downstream tasks. we can see that combining CNN and transformer achieve the
iGPT and ViT are two pioneering works to apply transformer better performance, indicating their complementation to each other
for visual tasks. The difference of iGPT and ViT-like models through local connection and global connection. Further investi-
mainly lies on 3 aspects: 1) The input of iGPT is a sequence of gation on backbone networks can lead to the improvement for the
color palettes by clustering pixels, while ViT uniformly divided whole vision community. As for the self-supervised representation
the image into a number of local patches; 2) The architec- learning for vision transformer, we still need to make effort to
ture of iGPT is an encoder-decoder framework, while ViT only pursue the success of large-scale pretraining in the filed of NLP.
has transformer encoder; 3) iGPT utilizes auto-regressive self-
supervised loss for training, while ViT is trained by supervised 3.2 High/Mid-level Vision
image classification task. Recently there has been growing interest in using transformer
Contrastive Learning Based Approach. Currently, contrastive for high/mid-level computer vision tasks, such as object detec-
learning is the most popular manner of self-supervised learning for tion [16], [17], [115], [116], [117], lane detection [118], segmen-
computer vision. Contrastive learning has been applied on vision tation [34], [25], [18] and pose estimation [35], [36], [37], [119].
transformer for unsupervised pretraining [31], [112], [113]. We review these methods in this section.
Chen et al. [31] investigate the effects of several fundamental
components for training self-supervised ViT. The authors observe 3.2.1 Generic Object Detection
that instability is a major issue that degrades accuracy, and these Traditional object detectors are mainly built upon CNNs, but
results are indeed partial failure and they can be improved when transformer-based object detection has gained significant interest
training is made more stable. recently due to its advantageous capability.
They introduce a “MoCo v3” framework, which is an incre- Some object detection methods have attempted to use trans-
mental improvement of MoCo [114]. Specifically, the authors take former’s self-attention mechanism and then enhance the spe-
two crops for each image under random data augmentation. They cific modules for modern detectors, such as feature fusion
are encodes by two encoders, fq and fk , with output vectors module [120] and prediction head [121]. We discuss this in
q and k. Intuitively, q behaves like a “query” and the goal of the supplemental material. Transformer-based object detection
learning is to retrieve the corresponding “key”. This is formulated methods are broadly categorized into two groups: transformer-
as minimizing a contrastive loss function, which can be written as: based set prediction methods [16], [17], [122], [123], [124] and
transformer-based backbone methods [115], [117], as shown in
Fig. 7. Transformer-based methods have shown strong perfor-
exp(q · k+ /τ ) mance compared with CNN-based detectors, in terms of both
Lq = −log . (10)
exp(q · k+ /τ ) + k− exp(q · k− /τ )
P
accuracy and running speed. Table 3 shows the detection results
for different transformer-based object detectors mentioned earlier
Here k+ is fk ’s output on the same image as q, known as q’s on the COCO 2012 val set.
positive sample. The set k− consists of fk ’s outputs from other class box
images, known as q’s negative samples. τ is a temperature hyper-
Transformer Transformer
parameter for l2 -normalized q, k. MoCo v3 uses the keys that nat- Image CNN FFN
Encoders Decoders Set prediction
urally co-exist in the same batch and abandon the memory queue, Positional Object
which they find has diminishing gain if the batch is sufficiently Encoding Queries

large (e.g., 4096). With this simplification, the contrastive loss can (a) Transformer-based set prediction for detection
be implemented in a simple way. The encoder fq consists of a
backbone (e.g., ViT), a projection head and an extra prediction RPN
class
Transformer Predict
Image
head; while the encoder fk has the backbone and projection head, patches Encoders Head box

but not the prediction head. fk is updated by the moving-average Positional

Encoding
of fq , excluding the prediction head.
(b) Transformer-based backbone for detection
MoCo v3 shows that the instability is a major issue of training
the self-supervised ViT, thus they describe a simple trick that can Fig. 7: General framework of transformer-based object detection.
improve the stability in various cases of the experiments. They Transformer-based Set Prediction for Detection. As a pioneer
observe that it is not necessary to train the patch projection layer. for transformer-based detection method, the detection transformer
For the standard ViT patch size, the patch projection matrix is (DETR) proposed by Carion et al. [16] redesigns the framework
complete or over-complete. And in this case, random projection of object detection. DETR, a simple and fully end-to-end object
should be sufficient to preserve the information of the original detector, treats the object detection task as an intuitive set pre-
patches. However, the trick alleviates the issue, but does not solve diction problem, eliminating traditional hand-crafted components
it. The model can still be unstable if the learning rate is too big and such as anchor generation and non-maximum suppression (NMS)
the first layer is unlikely the essential reason for the instability. post-processing. As shown in Fig. 8, DETR starts with a CNN
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 8

backbone to extract features from the input image. To supplement addition, a new bipartite matching scheme is designed for greater
the image features with position information, fixed positional en- training stability and faster convergence and two transformer-
codings are added to the flattened features before the features are based set prediction models, i.e. TSP-FCOS and TSP-RCNN, are
fed into the encoder-decoder transformer. The decoder consumes proposed to improve encoder-only DETR with feature pyramids.
the embeddings from the encoder along with N learned positional These new models achieve better performance compared with the
encodings (object queries), and produces N output embeddings. original DETR model. Gao et al. [125] proposed the Spatially
Here N is a predefined parameter and typically larger than the Modulated Co-Attention (SMCA) mechanism to accelerate the
number of objects in an image. Simple feed-forward networks convergence by constraining co-attention responses to be high
(FFNs) are used to compute the final predictions, which include near initially estimated bounding box locations. By integrating
the bounding box coordinates and class labels to indicate the the proposed SMCA module into DETR, similar mAP could be
specific class of object (or to indicate that no object exists). Unlike obtained with about 10× less training epochs under comparable
the original transformer, which computes predictions sequentially, inference cost.
DETR decodes N objects in parallel. DETR employs a bipartite Given the high computation complexity associated with
matching algorithm to assign the predicted and ground-truth DETR, Zheng et al. [123] proposed an Adaptive Clustering
objects. As shown in Eq. 11, the Hungarian loss is exploited to Transformer (ACT) to reduce the computation cost of pre-trained
compute the loss function for all matched pairs of objects. DETR. ACT adaptively clusters the query features using a locality
N h i
sensitivity hashing (LSH) method and broadcasts the attention
X
LHungarian (y, ŷ) = − log p̂σ̂(i) (ci ) + 1{ci ̸=∅} Lbox (bi , b̂σ̂ (i)) , (11) output to the queries represented by the selected prototypes. ACT
i=1
is used to replace the self-attention module of the pre-trained
where σ̂ is the optimal assignment, ci and p̂σ̂(i) (ci ) are the target DETR model without requiring any re-training. This approach
class label and predicted label, respectively, and bi and b̂σ̂ (i) are significantly reduces the computational cost while the accuracy
the ground truth and predicted bounding box, y = {(ci , bi )} slides slightly. The performance drop can be further reduced by
and ŷ are the ground truth and prediction of objects, respectively. utilizing a multi-task knowledge distillation (MTKD) method,
DETR shows impressive performance on object detection, deliv- which exploits the original transformer to distill the ACT module
ering comparable accuracy and speed with the popular and well- with a few epochs of fine-tuning. Yao et al. [126] pointed out
established Faster R-CNN [13] baseline on COCO benchmark. that the random initialization in DETR is the main reason for
the requirement of multiple decoder layers and slow convergence.
backbone encoder decoder prediction heads
To this end, they proposed the Efficient DETR to incorporate
set of image features
… FFN
class, the dense prior into the detection pipeline via an additional
box
CNN
FFN
no region proposal network. The better initialization enables them
transformer transformer object
+ encoder decoder class, to use only one decoder layers instead of six layers to achieve
FFN
box

… FFN
no
object
competitive performance with a more compact network.
positional encoding object queries
Transformer-based Backbone for Detection. Unlike DETR
which redesigns object detection as a set prediction tasks via
Fig. 8: The overall architecture of DETR (image from [16]). transformer, Beal et al. [115] proposed to utilize transformer as
DETR is a new design for the object detection framework a backbone for common detection frameworks such as Faster R-
based on transformer and empowers the community to develop CNN [13]. The input image is divided into several patches and
fully end-to-end detectors. However, the vanilla DETR poses fed into a vision transformer, whose output embedding features
several challenges, specifically, longer training schedule and poor are reorganized according to spatial information before passing
performance for small objects. To address these challenges, Zhu et through a detection head for the final results. A massive pre-
al. [17] proposed Deformable DETR, which has become a popular training transformer backbone could bring benefits to the proposed
method that significantly improves the detection performance. The ViT-FRCNN. There are also quite a few methods to explore versa-
deformable attention module attends to a small set of key positions tile vision transformer backbone design [29], [73], [61], [63] and
around a reference point rather than looking at all spatial locations transfer these backbones to traditional detection frameworks like
on image feature maps as performed by the original multi-head RetinaNet [129] and Cascade R-CNN [130]. For example, Swin
attention mechanism in transformer. This approach significantly Transformer [61] obtains about 4 box AP gains over ResNet-50
reduces the computational complexity and brings benefits in terms backbone with similar FLOPs for various detection frameworks.
of fast convergence. More importantly, the deformable attention Pre-training for Transformer-based Object Detection. Inspired
module can be easily applied for fusing multi-scale features. by the pre-training transformer scheme in NLP, several methods
Deformable DETR achieves better performance than DETR with have been proposed to explore different pre-training scheme
10× less training cost and 1.6× faster inference speed. And by for transformer-based object detection [33], [128], [131]. Dai et
using an iterative bounding box refinement method and two-stage al. [33] proposed unsupervised pre-training for object detection
scheme, Deformable DETR can further improve the detection (UP-DETR). Specifically, a novel unsupervised pretext task named
performance. random query patch detection is proposed to pre-train the DETR
There are also several methods to deal with the slow conver- model. With this unsupervised pre-training scheme, UP-DETR
gence problem of the original DETR. For example, Sun et al. [122] significantly improves the detection accuracy on a relatively small
investigated why the DETR model has slow convergence and dataset (PASCAL VOC). On the COCO benchmark with sufficient
discovered that this is mainly due to the cross-attention module training data, UP-DETR still outperforms DETR, demonstrating
in the transformer decoder. To address this issue, an encoder-only the effectiveness of the unsupervised pre-training scheme.
version of DETR is proposed, achieving considerable improve- Fang et al. [128] explored how to transfer the pure ViT
ment in terms of detection accuracy and training convergence. In structure that is pre-trained on ImageNet to the more challenging
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 9

TABLE 3: Comparison of different transformer-based object detectors on COCO 2017 val set. Running speed (FPS) is evaluated on an
NVIDIA Tesla V100 GPU as reported in [17]. † Estimated speed according to the reported number in the paper. ‡ ViT backbone is pre-trained
on ImageNet-21k. ∗ ViT backbone is pre-trained on an private dataset with 1.3 billion images.
Method Epochs AP AP50 AP75 APS APM APL #Params (M) GFLOPs FPS
CNN based
FCOS [127] 36 41.0 59.8 44.1 26.2 44.6 52.2 - 177 23†
Faster R-CNN + FPN [13] 109 42.0 62.1 45.5 26.6 45.4 53.4 42 180 26
CNN Backbone + Transformer Head
DETR [16] 500 42.0 62.4 44.2 20.5 45.8 61.1 41 86 28
DETR-DC5 [16] 500 43.3 63.1 45.9 22.5 47.3 61.1 41 187 12
Deformable DETR [17] 50 46.2 65.2 50.0 28.8 49.2 61.7 40 173 19
TSP-FCOS [122] 36 43.1 62.3 47.0 26.6 46.8 55.9 - 189 20†
TSP-RCNN [122] 96 45.0 64.5 49.6 29.7 47.7 58.0 - 188 15†
ACT+MKKD (L=32) [123] - 43.1 - - 61.4 47.1 22.2 - 169 14†
SMCA [125] 108 45.6 65.5 49.1 25.9 49.3 62.6 - - -
Efficient DETR [126] 36 45.1 63.1 49.1 28.3 48.4 59.0 35 210 -
UP-DETR [33] 150 40.5 60.8 42.6 19.0 44.4 60.0 41 - -
UP-DETR [33] 300 42.8 63.0 45.3 20.8 47.1 61.7 41 - -
Transformer Backbone + CNN Head
ViT-B/16-FRCNN‡ [115] 21 36.6 56.3 39.3 17.4 40.0 55.5 - - -
ViT-B/16-FRCNN∗ [115] 21 37.8 57.4 40.1 17.8 41.4 57.3 - - -
PVT-Small+RetinaNet [73] 12 40.4 61.3 43.0 25.0 42.9 55.7 34.2 118 -
Twins-SVT-S+RetinaNet [63] 12 43.0 64.2 46.3 28.0 46.4 57.5 34.3 104 -
Swin-T+RetinaNet [61] 12 41.5 62.1 44.2 25.1 44.9 55.5 38.5 118 -
Swin-T+ATSS [61] 36 47.2 66.5 51.3 - - - 36 215 -
Pure Transformer based
PVT-Small+DETR [73] 50 34.7 55.7 35.4 12.0 36.4 56.7 40 - -
TNT-S+DETR [29] 50 38.2 58.9 39.4 15.5 41.1 58.8 39 - -
YOLOS-Ti [128] 300 30.0 - - - - - 6.5 21 -
YOLOS-S [128] 150 37.6 57.6 39.2 15.9 40.2 57.3 28 179 -
YOLOS-B [128] 150 42.0 62.2 44.5 19.5 45.3 62.1 127 537 -

object detection task and proposed the YOLOS detector. To cope mask embeddings, and match them with ground truth for the set
with the object detection task, the proposed YOLOS first drops loss. ISTR conducted detection and segmentation with a recurrent
the classification tokens in ViT and appends learnable detection refinement strategy which is different from the existing top-down
tokens. Besides, the bipartite matching loss is utilized to perform and bottom-up frameworks. Yang et al. [133] investigated how
set prediction for objects. With this simple pre-training scheme to realize better and more efficient embedding learning to tackle
on ImageNet dataset, the proposed YOLOS shows competitive the semi-supervised video object segmentation under challenging
performance for object detection on COCO benchmark. multi-object scenarios. Some papers such as [134], [135] also
discussed using Transformer to deal with segmentation task.
3.2.2 Segmentation
Transformer for Semantic Segmentation. Zheng et al. [18]
Segmentation is an important topic in computer vision community,
proposed a transformer-based semantic segmentation network
which broadly includes panoptic segmentation, instance segmen-
(SETR). SETR utilizes an encoder similar to ViT [15] as the
tation and semantic segmentation etc. Vision transformer has also
encoder to extract features from an input image. A multi-level
shown impressive potential on the field of segmentation.
Transformer for Panoptic Segmentation. DETR [16] can be feature aggregation module is adopted for performing pixel-wise
naturally extended for panoptic segmentation tasks and achieve segmentation. Strudel et al. [136] introduced Segmenter which
competitive results by appending a mask head on the decoder. relies on the output embedding corresponding to image patches
Wang et al. [25] proposed Max-DeepLab to directly predict and obtains class labels with a point-wise linear decoder or a mask
panoptic segmentation results with a mask transformer, without transformer decoder. Xie et al. [137] proposed a simple, efficient
involving surrogate sub-tasks such as box detection. Similar to yet powerful semantic segmentation framework which unifies
DETR, Max-DeepLab streamlines the panoptic segmentation tasks Transformers with lightweight multilayer perception (MLP) de-
in an end-to-end fashion and directly predicts a set of non- coders, which outputs multiscale features and avoids complex
overlapping masks and corresponding labels. Model training is decoders.
performed using a panoptic quality (PQ) style loss, but unlike prior Transformer for Medical Image Segmentation. Cao et al. [30]
methods that stack a transformer on top of a CNN backbone, Max- proposed an Unet-like pure Transformer for medical image seg-
DeepLab adopts a dual-path framework that facilitates combining mentation, by feeding the tokenized image patches into the
the CNN and transformer. Transformer-based U-shaped Encoder-Decoder architecture with
Transformer for Instance Segmentation. VisTR, a transformer- skip-connections for local-global semantic feature learning. Vala-
based video instance segmentation model, was proposed by narasu et al. [138] explored transformer-based solutions and study
Wang et al. [34] to produce instance prediction results from the feasibility of using transformer-based network architectures
a sequence of input images. A strategy for matching instance for medical image segmentation tasks and proposed a Gated
sequence is proposed to assign the predictions with ground truths. Axial-Attention model which extends the existing architectures by
In order to obtain the mask sequence for each instance, VisTR introducing an additional control mechanism in the self-attention
utilizes the instance sequence segmentation module to accumulate module. Cell-DETR [139], based on the DETR panoptic segmen-
the mask features from multiple frames and segment the mask tation model, is an attempt to use transformer for cell instance seg-
sequence with a 3D CNN. Hu et al. [132] proposed an instance mentation. It adds skip connections that bridge features between
segmentation Transformer (ISTR) to predict low-dimensional the backbone CNN and the CNN decoder in the segmentation
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 10

head in order to enhance feature fusion. Cell-DETR achieves inaccuracies in detection and corrected partial or entire skeleton
state-of-the-art performance for cell instance segmentation from corruption. Hao et al. [146] proposed to personalize a human pose
microscopy imagery. estimator given a set of test images of a person without using
any manual annotations. The method adapted the pose estimator
3.2.3 Pose Estimation during test time to exploit person-specific information, and used
Human pose and hand pose estimation are foundational topics that a Transformer model to build a transformation between the self-
have attracted significant interest from the research community. supervised keypoints and the supervised keypoints.
Articulated pose estimation is akin to a structured prediction task,
aiming to predict the joint coordinates or mesh vertices from input 3.2.4 Other Tasks
RGB/D images. Here we discuss some methods [35], [36], [37], There are also quite a lot different high/mid-level vision tasks
[119] that explore how to utilize transformer for modeling the that have explored the usage of vision transformer for better
global structure information of human poses and hand poses. performance. We briefly review several tasks below.
Transformer for Hand Pose Estimation. Huang et al. [35] pro- Pedestrian Detection. Because the distribution of objects is very
posed a transformer based network for 3D hand pose estimation dense in occlusion and crowd scenes, additional analysis and
from point sets. The encoder first utilizes a PointNet [140] to adaptation are often required when common detection networks
extract point-wise features from input point clouds and then adopts are applied to pedestrian detection tasks. Lin et al. [147] revealed
standard multi-head self-attention module to produce embeddings. that sparse uniform queries and a weak attention field in the
In order to expose more global pose-related information to the decoder result in performance degradation when directly applying
decoder, a feature extractor such as PointNet++ [141] is used DETR or Deformable DETR to pedestrian detection tasks. To
to extract hand joint-wise features, which are then fed into the alleviate these drawbacks, the authors proposes Pedestrian End-
decoder as positional encodings. Similarly, Huang et al. [36] to-end Detector (PED), which employs a new decoder called
proposed HOT-Net (short for hand-object transformer network) Dense Queries and Rectified Attention field (DQRF) to support
for 3D hand-object pose estimation. Unlike the preceding method dense queries and alleviate the noisy or narrow attention field
which employs transformer to directly predict 3D hand pose from of the queries. They also proposed V-Match, which achieves
input point clouds, HOT-Net uses a ResNet to generate initial 2D additional performance improvements by fully leveraging visible
hand-object pose and then feeds it into a transformer to predict annotations.
the 3D hand-object pose. A spectral graph convolution network Lane Detection. Based on PolyLaneNet [148], Liu et al. [118]
is therefore used to extract input embeddings for the encoder. proposed a method called LSTR, which improves performance
Hampali et al. [142] proposed to estimate the 3D poses of two of curve lane detection by learning the global context with
hands given a single color image. Specifically, appearance and a transformer network. Similar to PolyLaneNet, LSTR regards
spatial encodings of a set of potential 2D locations for the joints lane detection as a task of fitting lanes with polynomials and
of both hands were inputted to a transformer, and the attention uses neural networks to predict the parameters of polynomials.
mechanisms were used to sort out the correct configuration of the To capture slender structures for lanes and the global context,
joints and outputted the 3D poses of both hands. LSTR introduces a transformer network into the architecture. This
Transformer for Human Pose Estimation. Lin et al. [37] enables processing of low-level features extracted by CNNs. In ad-
proposed a mesh transformer (METRO) for predicting 3D human dition, LSTR uses Hungarian loss to optimize network parameters.
pose and mesh from a single RGB image. METRO extracts As demonstrated in [118], LSTR outperforms PolyLaneNet, with
image features via a CNN and then perform position encoding 2.82% higher accuracy and 3.65× higher FPS using 5-times fewer
by concatenating a template human mesh to the image feature. A parameters. The combination of a transformer network, CNN and
multi-layer transformer encoder with progressive dimensionality Hungarian Loss culminates in a lane detection framework that
reduction is proposed to gradually reduce the embedding dimen- is precise, fast, and tiny. Considering that the entire lane line
sions and finally produce 3D coordinates of human joint and mesh generally has an elongated shape and long-range, Liu et al. [149]
vertices. To encourage the learning of non-local relationships be- utilized a transformer encoder structure for more efficient context
tween human joints, METRO randomly mask some input queries feature extraction. This transformer encoder structure improves
during training. Yang et al. [119] constructed an explainable model the detection of the proposal points a lot, which rely on contextual
named TransPose based on Transformer architecture and low-level features and global information, especially in the case where the
convolutional blocks. The attention layers built in Transformer can backbone network is a small model.
capture long-range spatial relationships between keypoints and ex- Scene Graph. Scene graph is a structured representation of a
plain what dependencies the predicted keypoints locations highly scene that can clearly express the objects, attributes, and rela-
rely on. Li et al. [143] proposed a novel approach based on Token tionships between objects in the scene [150]. To generate scene
representation for human Pose estimation (TokenPose). Each key- graph, most of existing methods first extract image-based object
point was explicitly embedded as a token to simultaneously learn representations and then do message propagation between them.
constraint relationships and appearance cues from images. Mao et Graph R-CNN [151] utilizes self-attention to integrate contextual
al. [144] proposed a human pose estimation framework that solved information from neighboring nodes in the graph. Recently, Shar-
the task in the regression-based fashion. They formulated the pose ifzadeh et al. [152] employed transformers over the extracted
estimation task into a sequence prediction problem and solve it by object embedding. Sharifzadeh et al. [153] proposed a new
transformers, which bypass the drawbacks of the heatmap-based pipeline called Texema and employed a pre-trained Text-to-Text
pose estimator. Jiang et al. [145] proposed a novel transformer Transfer Transformer (T5) [154] to create structured graphs from
based network that can learn a distribution over both pose and textual input and utilized them to improve the relational reasoning
motion in an unsupervised fashion rather than tracking body parts module. The T5 model enables Texema to utilize the knowledge
and trying to temporally smooth them. The method overcame in texts.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 11

Tracking. Some researchers also explored to use transformer fully leveraged using large scale pre-training datasets as BERT
encoder-decoder architecture in template-based discriminative and GPT-3 do in the NLP field? And is it possible to pre-train a
trackers, such as TMT [155], TrTr [156] and TransT [157]. All single transformer model and fine-tune it for different downstream
these work use a Siamese-like tracking pipeline to do video tasks with only a few epochs of fine-tuning? How to design more
object tracking and utilize the encoder-decoder network to re- powerful architecture by incorporating prior knowledge of the
place explicit cross-correlation operation for global and rich specific tasks? Several prior works have performed preliminary
contextual inter-dependencies. Specifically, the transformer en- discussions for the aforementioned topics and We hope more
coder and decoder are assigned to the template branch and the further research effort is conducted into exploring more powerful
searching branch, respectively. In addition, Sun et al. proposed transformers for high-level vision.
TransTrack [158], which is an online joint-detection-and-tracking
pipeline. It utilizes the query-key mechanism to track pre-existing 3.3 Low-level Vision
objects and introduces a set of learned object queries into the
Few works apply transformers on low-level vision fields, such
pipeline to detect new-coming objects. The proposed TransTrack
as image super-resolution and generation. These tasks often take
achieves 74.5% and 64.5% MOTA on MOT17 and MOT20 bench-
images as outputs (e.g., high-resolution or denoised images),
mark.
which is more challenging than high-level vision tasks such as
Re-Identification. He et al. [159] proposed TransReID to inves-
classification, segmentation, and detection, whose outputs are
tigate the application of pure transformers in the field of object
labels or boxes.
re-identification (ReID). While introducing transformer network
into object ReID, TransReID slices with overlap to reserve local Training Only
neighboring structures around the patches and introduces 2D bilin-
Images CNN
ear interpolation to help handle any given input resolution. With Images Encoder
the transformer module and the loss function, a strong baseline
was proposed to achieve comparable performance with CNN-
patches
based frameworks. Moreover, The jigsaw patch module (JPM) Transformer Tokens
was designed to facilitate perturbation-invariant and robust feature Decoders
Transformer
representation of objects and the side information embeddings Decoders
(SIE) was introduced to encode side information. The final frame-
work TransReID achieves state-of-the-art performance on both Images CNN
Decoder
person and vehicle ReID benchmarks. Both Liu et al. [160] and noise
Zhang et al. [161] provided solutions for introducing transformer (a) Image Generation (b) Image Generation
(GAN-based) (Transformer-based)
network into video-based person Re-ID. And similarly, both of the
them utilized separated transformer networks to refine spatial and Fig. 9: A generic framework for transformer in image generation.
temporal features, and then utilized a cross view transformer to
aggregate multi-view features. 3.3.1 Image Generation
Point Cloud Learning. A number of other works exploring An simple yet effective to apply transformer model to the image
transformer architecture for point cloud learning [162], [163], generation task is to directly change the architectures from CNNs
[164] have also emerged recently. For example, Guo et al. [163] to transformers, as shown in Figure 9 (a). Jiang et al. [39] proposed
proposed a novel framework that replaces the original self- TransGAN, which build GAN using the transformer architec-
attention module with a more suitable offset-attention module, ture. Since the it is difficult to generate high-resolution images
which includes implicit Laplace operator and normalization refine- pixel-wise, a memory-friendly generator is utilized by gradually
ment. In addition, Zhao et al. [164] designed a novel transformer increasing the feature map resolution at different stages. Corre-
architecture called Point Transformer. The proposed self-attention spondingly, a multi-scale discriminator is designed to handle the
layer is invariant to the permutation of the point set, making it varying size of inputs in different stages. Various training recipes
suitable for point set processing tasks. Point Transformer shows are introduced including grid self-attention, data augmentation,
strong performance for semantic segmentation task from 3D point relative position encoding and modified normalization to stabilize
clouds. the training and improve its performance. Experiments on various
benchmark datasets demonstrate the effectiveness and potential
3.2.5 Discussions of the transformer-based GAN model in image generation tasks.
As discussed in the preceding sections, transformers have shown Kwonjoon Lee et al. [165] proposed ViTGAN, which introduce
strong performance on several high-level tasks, including detec- several technique to both generator and discriminator to stabilize
tion, segmentation and pose estimation. The key issues that need to the training procedure and convergence. Euclidean distance is
be resolved before transformer can be adopted for high-level tasks introduced for the self-attention module to enforce the Lips-
relate to input embedding, position encoding, and prediction loss. chitzness of transformer discriminator. Self-modulated layernorm
Some methods propose improving the self-attention module from and implicit neural representation are proposed to enhance the
different perspectives, for example, deformable attention [17], training for the generator. As a result, ViTGAN is the first work
adaptive clustering [123] and point transformer [164]. Neverthe- to demonstrate transformer-based GANs can achieve comparable
less, exploration into the use of transformers for high-level vision performance to state-of-the-art CNN-based GANs.
tasks is still in the preliminary stages and so further research Parmar et al. [27] proposed Image Transformer, taking the first
may prove beneficial. For example, is it necessary to use feature step toward generalizing the transformer model to formulate image
extraction modules such as CNN and PointNet before transformer translation and generation tasks in an auto-regressive manner.
for potential better performance? How can vision transformer be Image Transformer consists of two parts: an encoder for extracting
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 12

image representation and a decoder to generate pixels. For each relevance ri,j is calculated between each patch qi in Q and ki in
pixel with value 0 − 255, a 256 × d dimensional embedding K as:
is learned for encoding each value into a d dimensional vector, qi ki
ri,j = , . (12)
which is fed into the encoder as input. The encoder and decoder ∥qi ∥ ∥ki ∥
adopt the same architecture as that in [9]. Each output pixel A hard-attention module is proposed to select high-resolution
q ′ is generated by calculating self-attention between the input features V according to the reference image, so that the low-
pixel q and previously generated pixels m1 , m2 , ... with position resolution image can be matched by using the relevance. The
embedding p1 , p2 , .... For image-conditioned generation, such as hard-attention map is calculated as:
super-resolution and inpainting, an encoder-decoder architecture is
used, where the encoder’s input is the low-resolution or corrupted hi = arg max ri,j (13)
j
images. For unconditional and class-conditional generation (i.e.,
noise to image), only the decoder is used for inputting noise vec- The most relevant reference patch is ti = vhi , where ti in
tors. Because the decoder’s input is the previously generated pixels T is the transferred features. A soft-attention module is then
(involving high computation cost when producing high-resolution used to transfer V to the low-resolution feature. The transferred
images), a local self-attention scheme is proposed. This scheme features from the high-resolution texture image and the low-
uses only the closest generated pixels as input for the decoder, resolution feature are used to generate the output features of
enabling Image Transformer to achieve performance on par with the low-resolution image. By leveraging the transformer-based
CNN-based models for image generation and translation tasks, architecture, TTSR can successfully transfer texture information
demonstrating the effectiveness of transformer-based models on from high-resolution reference images to low-resolution images in
low-level vision tasks. super-resolution tasks.
Since it is difficult to directly generate high-resolution images
Multi-head Flatten features Multi-tail
by transformer models, Esser et al. [38] proposed Taming Trans-
Denoising Denoising
former. Taming Transformer consists of two parts: a VQGAN Head Tail
and a transformer. VQGAN is a variant of VQVAE [166], which Transformer Encoder

uses a discriminator and perceptual loss to improve the visual Deraining Deraining
Tail
Head
quality. Through VQGAN, the image can be represented by a Features
Task embedding
series of context-rich discrete vectors and therefore these vectors x2 Up Features x2 Up
Head Tail
can be easily predicted by a transformer model through an auto-

…
regression way. The transformer model can learn the long-range
…

…
Transformer Decoder

interactions for generating high-resolution images. As a result, the x4 Up x4 Up

Tail
Head Reshape
proposed Taming Transformer achieves state-of-the-art results on
a wide variety of image synthesis tasks.
Fig. 10: Diagram of IPT architecture (image from [19]).
Besides image generation, DALL·E [42] proposed the trans-
former model for text-to-image generation, which synthesizes Different from the preceding methods that use transformer
images according to the given captions. The whole framework models on single tasks, Chen et al. [19] proposed Image Pro-
consists of two stages. In the first stage, a discrete VAE is cessing Transformer (IPT), which fully utilizes the advantages
utilized to learn the visual codebook. In the second stage, the of transformers by using large pre-training datasets. It achieves
text is decoded by BPE-encode and the corresponding image state-of-the-art performance in several image processing tasks,
is decoded by dVAE learned in the first stage. Then an auto- including super-resolution, denoising, and deraining. As shown
regression transformer is used to learn the prior between the in Figure 10, IPT consists of multiple heads, an encoder, a
encoded text and image. During the inference procedure, tokens decoder, and multiple tails. The multi-head, multi-tail structure
of images are predicted by the transformer and decoded by the and task embeddings are introduced for different image processing
learned decoder. The CLIP model [41] is introduced to rank tasks. The features are divided into patches, which are fed into
generated samples. Experiments on text-to-image generation task the encoder-decoder architecture. Following this, the outputs are
demonstrate the powerful ability of the proposed model. Note that reshaped to features with the same size. Given the advantages
our survey mainly focus on pure vision tasks, we do not include of pre-training transformer models on large datasets, IPT uses
the framework of DALL·E in Figure 9. The image generation has the ImageNet dataset for pre-training. Specifically, images from
been pushed to a higher level with the introduction of diffusion this dataset are degraded by manually adding noise, rain streaks,
model [167], such as DALLE2 [168] and Stable Diffusion [169]. or downsampling in order to generate corrupted images. The
degraded images are used as inputs for IPT, while the original
images are used as the optimization goal of the outputs. A self-
3.3.2 Image Processing supervised method is also introduced to enhance the generalization
ability of the IPT model. Once the model is trained, it is fine-
A number of recent works eschew using each pixel as the input tuned on each task by using the corresponding head, tail, and
for transformer models and instead use patches (set of pixels) as task embedding. IPT largely achieves performance improvements
input. For example, Yang et al. [40] proposed Texture Transformer on image processing tasks (e.g., 2 dB in image denoising tasks),
Network for Image Super-Resolution (TTSR), using the trans- demonstrating the huge potential of applying transformer-based
former architecture in the reference-based image super-resolution models to the field of low-level vision.
problem. It aims to transfer relevant textures from reference Besides single image generation, Wang et al. [170] proposed
images to low-resolution images. Taking a low-resolution image SceneFormer to utilize transformer in 3D indoor scene generation.
and reference image as the query Q and key K, respectively, the By treating a scene as a sequence of objects, the transformer
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13

decoder can be used to predict series of objects and their location, group activity recognition [177]. Gavrilyuk et al. proposed an
category, and size. This has enabled SceneFormer to outperform actor-transformer [178] architecture to learn the representation,
conventional CNN-based methods in user studies. using the static and dynamic representations generated by the 2D
and 3D networks as input. The output of the transformer is the
predicted activity.
Images
patches Video Retrieval. The key to content-based video retrieval is to
find the similarity between videos. Leveraging only the image-
level of video-level features to overcome the associated chal-
Transformer Transformer lenges, Shao et al. [179] suggested using the transformer to model
Encoders Decoders
the long-range semantic dependency. They also introduced the
supervised contrastive learning strategy to perform hard negative
mining. The results of using this approach on benchmark datasets
Images
patches demonstrate its performance and speed advantages. In addition,
Gabeur et al. [180] presented a multi-modal transformer to learn
Fig. 11: A generic framework for transformer in image processing.
It should be noted that iGPT [14] is pre-trained on an different cross-modal cues in order to represent videos.
inpainting-like task. Since iGPT mainly focus on the fine-tuning Video Object Detection. To detect objects in a video, both
performance on image classification tasks, we treat this work more global and local information is required. Chen et al. introduced
like an attempt on image classification task using transformer than the memory enhanced global-local aggregation (MEGA) [181]
low-level vision tasks. to capture more content. The representative features enhance the
In conclusion, different to classification and detection tasks, overall performance and address the ineffective and insufficient
the outputs of image generation and processing are images. Fig- problems. Furthermore, Yin et al. [182] proposed a spatiotem-
ure 11 illustrates using transformers in low-level vision. In image poral transformer to aggregate spatial and temporal information.
processing tasks, the images are first encoded into a sequence of Together with another spatial feature encoding component, these
tokens or patches and the transformer encoder uses the sequence two components perform well on 3D video object detection tasks.
as input, allowing the transformer decoder to successfully produce Multi-task Learning. Untrimmed video usually contains many
desired images. In image generation tasks, the GAN-based models frames that are irrelevant to the target tasks. It is therefore
directly learn a decoder to generated patches to outputting images crucial to mine the relevant information and discard the redundant
through linear projection, while the transformer-based models information. To extract such information, Seong et al. proposed
train a auto-encoder to learn a codebook for images and use an the video multi-task transformer network [183], which handles
auto-regression transformer model to predict the encoded tokens. multi-task learning on untrimmed videos. For the CoVieW dataset,
A meaningful direction for future research would be designing a the tasks are scene recognition, action recognition and importance
suitable architecture for different image processing tasks. score prediction. Two pre-trained networks on ImageNet and
Places365 extract the scene features and object features. The
multi-task transformers are stacked to implement feature fusion,
3.4 Video Processing leveraging the class conversion matrix (CCM).
Transformer performs surprisingly well on sequence-based tasks
and especially on NLP tasks. In computer vision (specifically, 3.4.2 Low-level Video Processing
video tasks), spatial and temporal dimension information is fa- Frame/Video Synthesis. Frame synthesis tasks involve synthe-
vored, giving rise to the application of transformer in a number sizing the frames between two consecutive frames or after a
of video tasks, such as frame synthesis [171], action recogni- frame sequence while video synthesis tasks involve synthesizing
tion [172], and video retrieval [173]. a video. Liu et al. proposed the ConvTransformer [171], which is
comprised of five components: feature embedding, position encod-
3.4.1 High-level Video Processing ing, encoder, query decoder, and the synthesis feed-forward net-
Video Action Recognition. Video human action tasks, as the work. Compared with LSTM based works, the ConvTransformer
name suggests, involves identifying and localizing human actions achieves superior results with a more parallelizable architecture.
in videos. Context (such as other people and objects) plays a Another transformer-based approach was proposed by Schatz et
critical role in recognizing human actions. Rohit et al. proposed al. [184], which uses a recurrent transformer network to synthetize
the action transformer [172] to model the underlying relationship human actions from novel views.
between the human of interest and the surrounding context. Video Inpainting. Video inpainting tasks involve completing
Specifically, the I3D [174] is used as the backbone to extract high- any missing regions within a frame. This is challenging, as it
level feature maps. The features extracted (using RoI pooling) requires information along the spatial and temporal dimensions
from intermediate feature maps are viewed as the query (Q), while to be merged. Zeng et al. proposed a spatial-temporal transformer
the key (K) and values (V) are calculated from the intermediate network [28], which uses all the input frames as input and fills
features. A self-attention mechanism is applied to the three compo- them in parallel. The spatial-temporal adversarial loss is used to
nents, and it outputs the classification and regressions predictions. optimize the transformer network.
Lohit et al. [175] proposed an interpretable differentiable module,
named temporal transformer network, to reduce the intra-class 3.4.3 Discussions
variance and increase the inter-class variance. In addition, Fayyaz Compared to image, video has an extra dimension to encode
and Gall proposed a temporal transformer [176] to perform action the temporal information. Exploiting both spatial and temporal
recognition tasks under weakly supervised settings. In addition information helps to have a better understanding of a video.
to human action recognition, transformer has been utilized for Thanks to the relationship modeling capability of transformer,
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 14

video processing tasks have been improved by mining spatial described in an input text. Similar to GPT-3, DALL-E is a multi-
and temporal information simultaneously. Nevertheless, due to modal transformer with 12 billion model parameters autoregres-
the high complexity and much redundancy of video data, how sively trained on a dataset of 3.3 million text-image pairs. More
to efficiently and accurately modeling both spatial and temporal specifically, to train DALL-E, a two-stage training procedure is
relationships is still an open problem. used, where in stage 1, a discrete variational autoencoder is used
to compress 256× 256 RGB images into 32×32 image tokens and
3.5 Multi-Modal Tasks then in stage 2, an autoregressive transformer is trained to model
the joint distribution over the image and text tokens. Experimental
Owing to the success of transformer across text-based NLP tasks,
results show that DALL-E can generate images of various styles
many researches are keen to exploit its potential for processing
from scratch, including photorealistic imagery, cartoons and emoji
multi-modal tasks (e.g., video-text, image-text and audio-text).
or extend an existing image while still matching the description in
One example of this is VideoBERT [185], which uses a CNN-
the text. Subsequently, Ding et al. proposes CogView [43], which
based module to pre-process videos in order to obtain representa-
is a transformer with VQ-VAE tokenizer similar to DALL-E, but
tion tokens. A transformer encoder is then trained on these tokens
supports Chinese text input. They claim CogView outperforms
to learn the video-text representations for downstream tasks, such
DALL-E and previous GAN-bsed methods and also unlike DALL-
as video caption. Some other examples include VisualBERT [186]
E, CogView does not need an additional CLIP model to rerank the
and VL-BERT [187], which adopt a single-stream unified trans-
samples drawn from transformer, i.e. DALL-E.
former to capture visual elements and image-text relationship for
Recently, a Unified Transformer (UniT) [189] model is pro-
downstream tasks such as visual question answering (VQA) and
posed to cope with multi-modal multi-task learning, which can
visual commonsense reasoning (VCR). In addition, several studies
simultaneously handle multiple tasks across different domains,
such as SpeechBERT [188] explore the possibility of encoding
including object detection, natural language understanding and
audio and text pairs with a transformer encoder to process auto-
vision-language reasoning. Specifically, UniT has two transformer
text tasks such as speech question answering (SQA).
encoders to handle image and text inputs, respectively, and then
the transformer decoder takes the single or concatenated encoder
outputs according to the task modality. Finally, a task-specific
prediction head is applied to the decoder outputs for different
tasks. In the training stage, all tasks are jointly trained by randomly
selecting a specific task within an iteration. The experiments show
UniT achieves satisfactory performance on every task with a
compact set of model parameters.
In conclusion, current transformer-based mutil-modal models
demonstrates its architectural superiority for unifying data and
tasks of various modalities, which demonstrates the potential of
Fig. 12: The framework of the CLIP (image from [41]).
transformer to build a general-purpose intelligence agents able to
Apart from the aforementioned pioneering multi-modal trans- cope with vast amount of applications. Future researches can be
formers, Contrastive Language-Image Pre-training (CLIP) [41] conducted in exploring the effective training or the extendability
takes natural language as supervision to learn more efficient image of multi-modal transformers (e.g., GPT-4 [44]).
representation. CLIP jointly trains a text encoder and an image
encoder to predict the corresponding training text-image pairs.
3.6 Efficient Transformer
The text encoder of CLIP is a standard transformer with masked
self-attention used to preserve the initialization ability of the pre- Although transformer models have achieved success in various
trained language models. For the image encoder, CLIP considers tasks, their high requirements for memory and computing re-
two types of architecture, ResNet and Vision Transformer. CLIP is sources block their implementation on resource-limited devices
trained on a new dataset containing 400 million (image, text) pairs such as mobile phones. In this section, we review the researches
collected from the Internet. More specifically, given a batch of N carried out into compressing and accelerating transformer models
(image, text) pairs, CLIP learns both text and image embeddings for efficient implementation. This includes including network
jointly to maximize the cosine similarity of those N matched pruning, low-rank decomposition, knowledge distillation, network
embeddings while minimize N 2 − N incorrectly matched embed- quantization, and compact architecture design. Table 4 lists some
dings. On Zero-Shot transfer, CLIP demonstrates astonishing zero- representative works for compressing transformer-based models.
shot classification performances, achieving 76.2% top-1 accuracy
on ImageNet-1K dataset without using any ImageNet training 3.6.1 Pruning and Decomposition
labels. Concretely, at inference, the text encoder of CLIP first In transformer based pre-trained models (e.g., BERT), multiple
computes the feature embeddings of all ImageNet Labels and the attention operations are performed in parallel to independently
image encoder then computes the embeddings of all images. By model the relationship between different tokens [9], [10]. How-
calculating the cosine similarity of text and image embeddings, ever, specific tasks do not require all heads to be used. For
the text-image pair with the highest score should be the image and example, Michel et al. [45] presented empirical evidence that a
its corresponding label. Further experiments on 30 various CV large percentage of attention heads can be removed at test time
benchmarks show the zero-shot transfer ability of CLIP and the without impacting performance significantly. The number of heads
feature diversity learned by CLIP. required varies across different layers — some layers may even
While CLIP maps images according to the description in text, require only one head. Considering the redundancy on attention
another work DALL-E [42] synthesizes new images of categories heads, importance scores are defined to estimate the influence of
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15

each head on the final output in [45], and unimportant heads can be student networks, thereby facilitating the mimicking process. Due
removed for efficient deployment. Dalvi et al. [190] analyzed the to the various types of layers in the transformer model (i.e., self-
redundancy in pre-trained transformer models from two perspec- attention layer, embedding layer, and prediction layers), Jiao et
tives: general redundancy and task-specific redundancy. Following al. [46] design different objective functions to transfer knowledge
the lottery ticket hypothesis [191], Prasanna et al. [190] analyzed from teachers to students. For example, the outputs of student
the lotteries in BERT and showed that good sub-networks also models’ embedding layers imitate those of teachers via MSE
exist in transformer-based models, reducing both the FFN layers losses. For the vision transformer, Jia et al. [213] proposed a fine-
and attention heads in order to achieve high compression rates. grained manifold distillation method, which excavates effective
For the vision transformer [15] which splits an image to multiple knowledge through the relationship between images and the di-
patches, Tang et al. [192] proposed to reduce patch calculation vided patches.
to accelerate the inference, and the redundant patches can be
automatically discovered by considering their contributions to the 3.6.3 Quantization
effective output features. Zhu et al. [193] extended the network Quantization aims to reduce the number of bits needed to represent
slimming approach [194] to vision transformers for reducing network weight or intermediate features [214], [215]. Quantization
the dimensions of linear projections in both FFN and attention methods for general neural networks have been discussed at length
modules. and achieve performance on par with the original networks [216],
In addition to the width of transformer models, the depth [217], [218]. Recently, there has been growing interest in how
(i.e., the number of layers) can also be reduced to accelerate the to specially quantize transformer models [219], [220]. For ex-
inference process [204], [205]. Differing from the concept that ample, Shridhar et al. [221] suggested embedding the input into
different attention heads in transformer models can be computed in binary high-dimensional vectors, and then using the binary input
parallel, different layers have to be calculated sequentially because representation to train the binary neural networks. Cheong et
the input of the next layer depends on the output of previous al. [222] represented the weights in the transformer models by
layers. Fan et al. [204] proposed a layer-wisely dropping strategy low-bit (e.g., 4-bit) representation. Zhao et al. [223] empirically
to regularize the training of models, and then the whole layers are investigated various quantization methods and showed that k-
removed together at the test phase. means quantization has a huge development potential. Aimed
Beyond the pruning methods that directly discard modules in at machine translation tasks, Prato et al. [47] proposed a fully
transformer models, matrix decomposition aims to approximate quantized transformer, which, as the paper claims, is the first 8-
the large matrices with multiple small matrices based on the low- bit model not to suffer any loss in translation quality. Beside,
rank assumption. For example, Wang et al. [206] decomposed the Liu et al. [224] explored a post-training quantization scheme to
standard matrix multiplication in transformer models, improving reduce the memory storage and computational costs of vision
the inference efficiency. transformers.

3.6.2 Knowledge Distillation 3.6.4 Compact Architecture Design

Knowledge distillation aims to train student networks by trans- Beyond compressing pre-defined transformer models into smaller
ferring knowledge from large teacher networks [207], [208], ones, some works attempt to design compact models di-
[209]. Compared with teacher networks, student networks usually rectly [225], [48]. Jiang et al. [48] simplified the calculation
have thinner and shallower architectures, which are easier to of self-attention by proposing a new module — called span-
be deployed on resource-limited resources. Both the output and based dynamic convolution — that combine the fully-connected
intermediate features of neural networks can also be used to layers and the convolutional layers. Interesting “hamburger” layers
transfer effective information from teachers to students. Focused are proposed in [226], using matrix decomposition to substitute
on transformer models, Mukherjee et al. [210] used the pre-trained the original self-attention layers.Compared with standard self-
BERT [10] as a teacher to guide the training of small models, attention operations, matrix decomposition can be calculated more
leveraging large amounts of unlabeled data. Wang et al. [211] train efficiently while clearly reflecting the dependence between differ-
the student networks to mimic the output of self-attention layers ent tokens. The design of efficient transformer architectures can
in the pre-trained teacher models. The dot-product between values also be automated with neural architecture search (NAS) [227],
is introduced as a new form of knowledge for guiding students. A [228], which automatically searches how to combine different
teacher’s assistant [212] is also introduced in [211], reducing the components. For example, Su et al. [83] searched patch size and
gap between large pre-trained transformer models and compact dimensions of linear projections and head number of attention
TABLE 4: List of representative compressed transformer- modules to get an efficient vision transformer. Li et al. [229]
based models. The data of the Table is from [203]. explored a self-supervised search strategy to get a hybrid architec-
ture composing of both convolutional modules and self-attention
Models Compress Type #Layer Params Speed Up modules.
BERTBASE [10] Baseline 12 110M ×1 The self-attention operation in transformer models calculates
ALBERT [195] Decomposition 12 12M ×5.6
the dot product between representations from different input to-
BERT- Architecture
6 66M ×1.94 kens in a given sequence (patches in image recognition tasks [15]),
of-Theseus [196] design
Q-BERT [197]
Quantization
12
- -
whose complexity is O(N ), where N is the length of the se-
Q8BERT [198] 12 quence. Recently, there has been a targeted focus to reduce the
TinyBERT [46] 4 14.5M ×9.4
DistilBERT [199] 6 6.6m ×1.63
complexity to O(N ) in large methods so that transformer models
BERT-PKD [200] Distillation 3∼6 45.7∼67M ×3.73∼1.64 can scale to long sequences [230], [231], [232]. For example,
MobileBERT [201] 24 25.3M ×4.0 Katharopoulos et al. [230] approximated self-attention as a linear
PD [202] 6 67.5M ×2.0 dot-product of kernel feature maps and revealed the relationship
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 16

tasks. Practitioners concern the robustness of transformer (e.g.

Pruning & Decomposition
Redundant Compact Quantization Low-Bit the vulnerability issue [237]). Although the robustness has been
Model Knowledge Distillation Model Model investigated in [238], [239], [240], it is still an open problem
waiting to be solved.
Although numerous works have explained the use of trans-
Expertise Architecture Design formers in NLP [241], [242], it remains a challenging subject
or NAS
to clearly explain why transformer works well on visual tasks.
The inductive biases, including translation equivariance and lo-
Fig. 13: Different methods for compressing transformers. cality, are attributed to CNN’s success, but transformer lacks any
between tokens via RNNs. Zaheer et al. [232] considered each to- inductive bias. The current literature usually analyzes the effect
ken as a vertex in a graph and defined the inner product calculation in an intuitive way [15], [243]. For example, Dosovitskiy et
between two tokens as an edge. Inspired by graph theories [233], al. [15] claim that large-scale training can surpass inductive
[234], various sparse graph are combined to approximate the dense bias. Position embeddings are added into image patches to retain
graph in transformer models, and can achieve O(N ) complexity. positional information, which is important in computer vision
Discussion. The preceding methods take different approaches tasks. Inspired by the heavy parameter usage in transformers,
in how they attempt to identify redundancy in transformer mod- over-parameterization [244], [245] may be a potential point to the
els (see Figure 13). Pruning and decomposition methods usually interpretability of vision transformers.
require pre-defined models with redundancy. Specifically, pruning Last but not least, developing efficient transformer models for
focuses on reducing the number of components (e.g., layers, CV remains an open problem. Transformer models are usually
heads) in transformer models while decomposition represents an huge and computationally expensive. For example, the base ViT
original matrix with multiple small matrices. Compact models model [15] requires 18 billion FLOPs to process an image. In
also can be directly designed either manually (requiring sufficient contrast, the lightweight CNN model GhostNet [246], [247] can
expertise) or automatically (e.g., via NAS). The obtained compact achieve similar performance with only about 600 million FLOPs.
models can be further represented with low-bits via quantization Although several methods have been proposed to compress trans-
methods for efficient deployment on resource-limited devices. former, they remain highly complex. And these methods, which
were originally designed for NLP, may not be suitable for CV.
Consequently, efficient transformer models are urgently needed
4 C ONCLUSIONS AND D ISCUSSIONS so that vision transformer can be deployed on resource-limited
Transformer is becoming a hot topic in the field of computer devices.
vision due to its competitive performance and tremendous poten-
tial compared with CNNs. To discover and utilize the power of 4.2 Future Prospects
transformer, as summarized in this survey, a number of methods
have been proposed in recent years. These methods show excellent In order to drive the development of vision transformers, we
performance on a wide range of visual tasks, including backbone, provide several potential directions for future study.
high/mid-level vision, low-level vision, and video processing. One direction is the effectiveness and the efficiency of trans-
Nevertheless, the potential of transformer for computer vision has formers in computer vision. The goal is to develop highly ef-
not yet been fully explored, meaning that several challenges still fective and efficient vision transformers; specifically, transformers
need to be resolved. In this section, we discuss these challenges with high performance and low resource cost. The performance
and provide insights on the future prospects. determines whether the model can be applied on real-world
applications, while the resource cost influences the deployment
on devices [248], [249]. The effectiveness is usually correlated
4.1 Challenges with the efficiency, so determining how to achieve a better balance
Although researchers have proposed many transformer-based between them is a meaningful topic for future study.
models to tackle computer vision tasks, these works are only the Most of the existing vision transformer models are designed to
first steps in this field and still have much room for improvement. handle only a single task. Many NLP models such as GPT-3 [11]
For example, the transformer architecture in ViT [15] follows have demonstrated how transformer can deal with multiple tasks in
the standard transformer for NLP [9], but an improved version one model. IPT [19] in the CV field is also able to process multiple
specifically designed for CV remains to be explored. Moreover, it low-level vision tasks, such as super-resolution, image denoising,
is necessary to apply transformer to more tasks other than those and deraining. Perceiver [250] and Perceiver IO [251] are the
mentioned earlier. pioneering models that can work on several domains including
The generalization and robustness of transformers for com- images, audio, multimodal, point clouds. We believe that more
puter vision are also challenging. Compared with CNNs, pure tasks can be involved in only one model. Unifying all visual
transformers lack some inductive biases and rely heavily on tasks and even other tasks in one transformer (i.e., a grand unified
massive datasets for large-scale training [15]. Consequently, the model) is an exciting topic.
quality of data has a significant influence on the generalization There have been various types of neural networks, such as
and robustness of transformers. Although ViT shows exceptional CNN, RNN, and transformer. In the CV field, CNNs used to
performance on downstream image classification tasks such as be the mainstream choice [12], [94], but now transformer is
CIFAR [235] and VTAB [236], directly applying the ViT back- becoming popular. CNNs can capture inductive biases such as
bone on object detection has failed to achieve better results than translation equivariance and locality, whereas ViT uses large-scale
CNNs [115]. There is still a long way to go in order to better training to surpass inductive bias [15]. From the evidence currently
generalize pre-trained transformers on more generalized visual available [15], CNNs perform well on small datasets, whereas
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 17

transformers perform better on large datasets. The question for the XWq , K = XWk , V = XWv from the translation module,
future is whether to use CNN or transformer. once Wq = Wθ , Wk = Wϕ , Wv = Wg , Eq. 17 can be
By training with large datasets, transformers can achieve state- formulated as:
of-the-art performance on both NLP [11], [10] and CV bench-
marks [15]. It is possible that neural networks need big data rather Y = softmax(QKT )V = Attention(Q, K, V), (18)
than inductive bias. In closing, we leave you with a question: The self-attention module [9] proposed for machine translation
Can transformer obtains satisfactory results with a very simple is, to some extent, the same as the preceding non-local filtering
computational paradigm (e.g., with only fully connected layers) operations proposed for computer vision.
and massive data training? Generally, the final output signal of the self-attention module
for computer vision will be wrapped as:
ACKNOWLEDGEMENT Z = YWo + X (19)
This research is partially supported by MindSpore (https:// o
mindspore.cn/) and CANN (Compute Architecture for Neural where Y is generated through Eq. 17. If W is initialized as zero,
Networks). this self-attention module can be inserted into any existing model
without breaking its initial behavior.

A PPENDIX
A2. Revisiting Transformers for NLP
A1. General Formulation of Self-attention
Before transformer was developed, RNNs ( e.g., GRU [254]
The self-attention module [9] for machine translation computes the and LSTM [6]) with added attention [7] empowered most of
responses at each position in a sequence by estimating attention the state-of-the-art language models. However, RNNs require the
scores to all positions and gathering the corresponding embed- information flow to be processed sequentially from the previous
dings based on the scores accordingly. This can be viewed as a hidden states to the next one. This rules out the possibility of using
form of non-local filtering operations [252], [253]. We follow the acceleration and parallelization during training, and consequently
convention [252] to formulate the self-attention module. Given an hinders the potential of RNNs to process longer sequences or build
input signal (e.g., image, sequence, video and feature) X ∈ Rn×d , larger models. In 2017, Vaswani et al. [9] proposed transformer,
where n = h × w (indicating the number of pixels in feature) and a novel encoder-decoder architecture built solely on multi-head
d is the number of channels, the output signal is generated as: self-attention mechanisms and feed-forward neural networks. Its
1 X purpose was to solve seq-to-seq natural language tasks (e.g.,
yi = f (xi , xj )g(xj ), (14)
C(xi ) ∀j machine translation) easily by acquiring global dependencies. The
subsequent success of transformer demonstrates that leveraging
where xi ∈ R1×d and yi ∈ R1×d indicate the ith position attention mechanisms alone can achieve performance comparable
(e.g., space, time and spacetime) of the input signal X and output with attentive RNNs. Furthermore, the architecture of transformer
signal Y , respectively. Subscript j is the index that enumerates all lends itself to massively parallel computing, which enables train-
positions, and a pairwise function f (·) computes a representing ing on larger datasets. This has given rise to the surge of large
relationship (such as affinity) between i and all j . The function pre-trained models (PTMs) for natural language processing.
g(·) computes a representation of the input signal at position j , BERT [10] and its variants (e.g., SpanBERT [255],
and the response is normalized by a factor C(xi ). RoBERTa [256]) are a series of PTMs built on the multi-layer
Note that there are many choices for the pairwise function transformer encoder architecture. Two tasks are conducted on
f (·). For example, a simple extension of the Gaussian function BookCorpus [257] and English Wikipedia datasets at the pre-
could be used to compute the similarity in an embedding space. training stage of BERT: 1) Masked language modeling (MLM),
As such, the function f (·) can be formulated as: which involves first randomly masking out some tokens in the
T input and then training the model to predict; 2) Next sentence pre-
f (xi , xj ) = eθ(xi )ϕ(xj ) (15) diction, which uses paired sentences as input and predicts whether
where θ(·) and ϕ(·) can be any embedding layers. If we the second sentence is the original one in the document. After
consider the θ(·), ϕ(·), g(·) in the form of linear embedding: pre-training, BERT can be fine-tuned by adding an output layer
θ(X) = XWθ , ϕ(X) = XWϕ , g(X) = XWg where on a wide range of downstream tasks. More specifically, when
Wθ ∈ Rd×dk , Wϕ ∈ Rd×dk ,P Wg ∈ Rd×dv , and set the performing sequence-level tasks (e.g., sentiment analysis), BERT
normalization factor as C(xi ) = ∀j f (xi , xj ), the Eq. 14 can uses the representation of the first token for classification; for
be rewritten as: token-level tasks (e.g., name entity recognition), all tokens are fed
T T into the softmax layer for classification. At the time of its release,
exi wθ,i wϕ,j xj BERT achieved the state-of-the-art performance on 11 NLP tasks,
yi = P x w wT xT xj wg,j , (16)
je
i θ,i ϕ,j j setting a milestone in pre-trained language models. Generative
Pre-trained Transformer models (e.g., GPT [258], GPT-2 [110])
where wθ,i ∈ Rd×1 is the ith row of the weight matrix Wθ . For a are another type of PTMs based on the transformer decoder
1
given index i, C(x i)
f (xi , xj ) becomes the softmax output along architecture, which uses masked self-attention mechanisms. The
the dimension j . The formulation can be further rewritten as: main difference between the GPT series and BERT is the way
Y = softmax(XWθ WϕT X)g(X), (17) in which pre-training is performed. Unlike BERT, GPT models
are unidirectional language models pre-trained using Left-to-Right
where Y ∈ Rn×c is the output signal of the same size as X. (LTR) language modeling. Furthermore, BERT learns the sentence
Compared with the query, key and value representations Q = separator ([SEP]) and classifier token ([CLS]) embeddings during
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 18

pre-training, whereas these embeddings are involved in only the A3. Self-attention for Computer Vision
fine-tuning stage of GPT. Due to its unidirectional pre-training
strategy, GPT achieves superior performance in many natural The preceding sections reviewed methods that use a transformer
language generation tasks. More recently, a massive transformer- architecture for vision tasks. We can conclude that self-attention
based model called GPT-3, which has an astonishing 175 billion plays a pivotal role in transformer. The self-attention module can
parameters, was developed [11]. By pre-training on 45 TB of also be considered a building block of CNN architectures, which
compressed plaintext data, GPT-3 can directly process different have low scaling properties concerning the large receptive fields.
types of downstream natural language tasks without fine-tuning. This building block is widely used on top of the networks to
As a result, it achieves strong performance on many NLP datasets, capture long-range interactions and enhance high-level semantic
including both natural language understanding and generation. features for vision tasks. In this section, we delve deeply into
Since the introduction of transformer, many other models have the models based on self-attention designed for challenging tasks
been proposed in addition to the transformer-based PTMs men- in computer vision. Such tasks include semantic segmentation,
tioned earlier. We list a few representative models in Table 5 for instance segmentation, object detection, keypoint detection, and
interested readers, but this is not the focus of our study. depth estimation. Here we briefly summarize the existing applica-
tions using self-attention for computer vision.
Image Classification. Trainable attention for classification
TABLE 5: List of representative language models built on consists of two main streams: hard attention [268], [269], [270] re-
transformer. Transformer is the standard encoder-decoder ar- garding the use of an image region, and soft attention [271], [272],
chitecture. Transformer Enc. and Dec. represent the encoder [273], [274] generating non-rigid feature maps. Ba et al. [268]
and decoder, respectively. Decoder uses mask self-attention to first proposed the term “visual attention” for image classification
prevent attending to the future tokens. The data of the Table tasks, and used attention to select relevant regions and locations
is from [203]. within the input image. This can also reduce the computational
complexity of the proposed model regarding the size of the input
Models Architecture # of Params Fine-tuning image. For medical image classification, AG-CNN [275] was
GPT [258] Transformer Dec. 117M Yes proposed to crop a sub-region from a global image by the attention
GPT-2 [110] Transformer Dec. 117M-1542M No heat map. And instead of using hard attention and recalibrating the
GPT-3 [11] Transformer Dec. 125M-175B No
BERT [10] Transformer Enc. 110M-340M Yes
crop of feature maps, SENet [276] was proposed to reweight the
RoBERTa [256] Transformer Enc. 355M Yes channel-wise responses of the convolutional features using soft
Two-Stream self-attention. Jetley et al. [272] used attention maps generated
XLNet [259] ≈ BERT Yes
Transformer Enc. by corresponding estimators to reweight intermediate features in
ELECTRA [260] Transformer Enc. 335M Yes DNNs. In addition, Han et al. [273] utilized the attribute-aware
UniLM [261] Transformer Enc. 340M Yes attention to enhance the representation of CNNs.
BART [262] Transformer 110% of BERT Yes
T5 [154] Transfomer 220M-11B Yes Semantic Segmentation. PSANet [277], OCNet [278],
ERNIE (THU) [263] Transformer Enc. 114M Yes DANet [279] and CFNet [280] are the pioneering works to propose
KnowBERT [264] Transformer Enc. 253M-523M Yes using the self-attention module in semantic segmentation tasks.
These works consider and augment the relationship and similar-
ity [281], [282], [283], [284], [285], [286] between the contextual
Apart from the PTMs trained on large corpora for general NLP pixels. DANet [279] simultaneously leverages the self-attention
tasks, transformer-based models have also been applied in many module on spatial and channel dimensions, whereas A2 Net [287]
other NLP-related domains and to multi-modal tasks. groups the pixels into a set of regions, and then augments the
pixel representations by aggregating the region representations
BioNLP Domain. Transformer-based models have outperformed with the generated attention weights. DGCNet [288] employs a
many traditional biomedical methods. Some examples of such dual graph CNN to model coordinate space similarity and feature
models include BioBERT [265], which uses a transformer ar- space similarity in a single framework. To improve the efficiency
chitecture for biomedical text mining tasks, and SciBERT [266], of the self-attention module for semantic segmentation, several
which is developed by training transformer on 114M scientific works [289], [290], [291], [292], [293] have been proposed,
articles (covering biomedical and computer science fields) with aiming to alleviate the huge amount of parameters brought by
the aim of executing NLP tasks in the scientific domain more calculating pixel similarities. For example, CGNL [289] applies
precisely. Another example is ClinicalBERT, proposed by Huang the Taylor series of the RBF kernel function to approximate the
et al. [267]. It utilizes transformer to develop and evaluate con- pixel similarities. CCNet [290] approximates the original self-
tinuous representations of clinical notes. One of the side effects attention scheme via two consecutive criss-cross attention mod-
of this is that the attention map of ClinicalBERT can be used ules. In addition, ISSA [291] factorizes the dense affinity matrix as
to explain predictions, thereby allowing high-quality connections the product of two sparse affinity matrices. There are other related
between different medical contents to be discovered. works using attention based graph reasoning modules [294], [295],
The rapid development of transformer-based models on a [292] to enhance both the local and global representations.
variety of NLP-related tasks demonstrates its structural superiority Object Detection. Ramachandran et al. [274] proposes an
and versatility, opening up the possibility that it will become a attention-based layer and swapped the conventional convolution
universal module applied in many AI fields other than just NLP. layers to build a fully attentional detector that outperforms the
The following part of this survey focuses on the applications of typical RetinaNet [129] on COCO benchmark [296]. GCNet [297]
transformer in a wide range of computer vision tasks that have assumes that the global contexts modeled by non-local operations
emerged over the past two years. are almost the same for different query positions within an image,
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 19

and unifies the simplified formulation and SENet [276] into a [17] X. Zhu et al. Deformable detr: Deformable transformers for end-to-end
general framework for global context modeling [298], [299], object detection. In ICLR, 2021.
[18] S. Zheng et al. Rethinking semantic segmentation from a sequence-to-
[300], [301]. Vo et al. [302] designs a bidirectional operation sequence perspective with transformers. In CVPR, 2021.
to gather and distribute information from a query position to [19] H. Chen et al. Pre-trained image processing transformer. In CVPR,
all possible positions. Zhang et al. [120] suggests that previous 2021.
methods fail to interact with cross-scale features, and proposes [20] L. Zhou et al. End-to-end dense video captioning with masked
transformer. In CVPR, pp. 8739–8748, 2018.
Feature Pyramid Transformer, based on the self-attention module, [21] S. Ullman et al. High-level vision: Object recognition and visual
to fully exploit interactions across both space and scales. cognition, volume 2. MIT press Cambridge, MA, 1996.
Conventional detection methods usually exploit a single visual [22] R. Kimchi et al. Perceptual organization in vision: Behavioral and
representation (e.g., bounding box and corner point) for predicting neural perspectives. Psychology Press, 2003.
[23] J. Zhu et al. Top-down saliency detection via contextual pooling.
the final results. Hu et al. [303] proposes a relation module based Journal of Signal Processing Systems, 74(1):33–46, 2014.
on self-attention to process a set of objects simultaneously through [24] J. Long et al. Fully convolutional networks for semantic segmentation.
interaction between their appearance features. Cheng et al. [121] In CVPR, 2015.
proposes RelationNet++ with the bridging visual representations [25] H. Wang et al. Max-deeplab: End-to-end panoptic segmentation with
mask transformers. In CVPR, pp. 5463–5474, 2021.
(BVR) module to combine different heterogeneous representations [26] R. B. Fisher. Cvonline: The evolving, distributed, non-proprietary, on-
into a single one similar to that in the self-attention module. line compendium of computer vision. Retrieved January 28, 2006 from
Specifically, the master representation is treated as the query input http://homepages. inf. ed. ac. uk/rbf/CVonline, 2008.
and the auxiliary representations are regarded as the key input. [27] N. Parmar et al. Image transformer. In ICML, 2018.
[28] Y. Zeng et al. Learning joint spatial-temporal transformations for video
The enhanced feature can therefore bridge the information from inpainting. In ECCV, pp. 528–543. Springer, 2020.
auxiliary representations and benefit final detection results. [29] K. Han et al. Transformer in transformer. In NeurIPS, 2021.
Other Vision Tasks. Zhang et al. [304] proposes a resolution- [30] H. Cao et al. Swin-unet: Unet-like pure transformer for medical image
wise attention module to learn enhanced feature maps when segmentation. arXiv:2105.05537, 2021.
[31] X. Chen et al. An empirical study of training self-supervised vision
training multi-resolution networks to obtain accurate human key- transformers. In ICCV, 2021.
point locations for pose estimation task. Furthermore, Chang et [32] K. He et al. Masked autoencoders are scalable vision learners. In CVPR,
al. [305] uses an attention-mechanism based feature fusion block pp. 16000–16009, 2022.
to improve the accuracy of the human keypoint detection model. [33] Z. Dai et al. UP-DETR: unsupervised pre-training for object detection
with transformers. In CVPR, 2021.
To explore more generalized contextual information for im- [34] Y. Wang et al. End-to-end video instance segmentation with transform-
proving the self-supervised monocular trained depth estimation, ers. In CVPR, 2021.
Johnston et al. [306] directly leverages self-attention module. [35] L. Huang et al. Hand-transformer: Non-autoregressive structured mod-
Chen et al. [307] also proposes an attention-based aggregation net- eling for 3d hand pose estimation. In ECCV, pp. 17–33, 2020.
[36] L. Huang et al. Hot-net: Non-autoregressive transformer for 3d hand-
work to capture context information that differs in diverse scenes object pose estimation. In ACM MM, pp. 3136–3145, 2020.
for depth estimation. And Aich et al. [308] proposes bidirectional [37] K. Lin et al. End-to-end human pose and mesh reconstruction with
attention modules that utilize the forward and backward attention transformers. In CVPR, 2021.
operations for better results of monocular depth estimation. [38] P. Esser et al. Taming transformers for high-resolution image synthesis.
In CVPR, 2021.
[39] Y. Jiang et al. Transgan: Two transformers can make one strong gan. In
NeurIPS, 2021.
R EFERENCES [40] F. Yang et al. Learning texture transformer network for image super-
resolution. In CVPR, pp. 5791–5800, 2020.
[1] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton
[41] A. Radford et al. Learning transferable visual models from natural
Project Para. Cornell Aeronautical Laboratory, 1957.
language supervision. arXiv:2103.00020, 2021.
[2] F. ROSENBLATT. Principles of neurodynamics. perceptrons and the
theory of brain mechanisms. Technical report, 1961. [42] A. Ramesh et al. Zero-shot text-to-image generation. In ICML, 2021.
[3] Y. LeCun et al. Gradient-based learning applied to document recogni- [43] M. Ding et al. Cogview: Mastering text-to-image generation via
tion. Proceedings of the IEEE, 86(11):2278–2324, 1998. transformers. In NeurIPS, 2021.
[4] A. Krizhevsky et al. Imagenet classification with deep convolutional [44] OpenAI. Gpt-4 technical report, 2023.
neural networks. In NeurIPS, pp. 1097–1105, 2012. [45] P. Michel et al. Are sixteen heads really better than one? In NeurIPS,
[5] D. E. Rumelhart et al. Learning internal representations by error pp. 14014–14024, 2019.
propagation. Technical report, 1985. [46] X. Jiao et al. TinyBERT: Distilling BERT for natural language under-
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural standing. In Findings of EMNLP, pp. 4163–4174, 2020.
computation, 9(8):1735–1780, 1997. [47] G. Prato et al. Fully quantized transformer for machine translation. In
[7] D. Bahdanau et al. Neural machine translation by jointly learning to Findings of EMNLP, 2020.
align and translate. In ICLR, 2015. [48] Z.-H. Jiang et al. Convbert: Improving bert with span-based dynamic
[8] A. Parikh et al. A decomposable attention model for natural language convolution. NeurIPS, 33, 2020.
inference. In EMNLP, 2016. [49] J. Gehring et al. Convolutional sequence to sequence learning. In ICML,
[9] A. Vaswani et al. Attention is all you need. In NeurIPS, 2017. pp. 1243–1252. PMLR, 2017.
[10] J. Devlin et al. Bert: Pre-training of deep bidirectional transformers for [50] P. Shaw et al. Self-attention with relative position representations. In
language understanding. In NAACL-HLT, 2019. NAACL, pp. 464–468, 2018.
[11] T. B. Brown et al. Language models are few-shot learners. In NeurIPS, [51] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus).
2020. arXiv:1606.08415, 2016.
[12] K. He et al. Deep residual learning for image recognition. In CVPR, [52] J. L. Ba et al. Layer normalization. arXiv:1607.06450, 2016.
pp. 770–778, 2016. [53] A. Baevski and M. Auli. Adaptive input representations for neural
[13] S. Ren et al. Faster R-CNN: Towards real-time object detection with language modeling. In ICLR, 2019.
region proposal networks. In NeurIPS, 2015. [54] Q. Wang et al. Learning deep transformer models for machine transla-
[14] M. Chen et al. Generative pretraining from pixels. In ICML, 2020. tion. In ACL, pp. 1810–1822, 2019.
[15] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for [55] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
image recognition at scale. In ICLR, 2021. network training by reducing internal covariate shift. In ICML, 2015.
[16] N. Carion et al. End-to-end object detection with transformers. In [56] S. Shen et al. Powernorm: Rethinking batch normalization in transform-
ECCV, 2020. ers. In ICML, 2020.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 20

[57] J. Xu et al. Understanding and improving layer normalization. In [97] H. Wu et al. Cvt: Introducing convolutions to vision transformers.
NeurIPS, 2019. arXiv:2103.15808, 2021.
[58] T. Bachlechner et al. Rezero is all you need: Fast convergence at large [98] K. Yuan et al. Incorporating convolution designs into visual transform-
depth. In Uncertainty in Artificial Intelligence, pp. 1352–1361. PMLR, ers. arXiv:2103.11816, 2021.
2021. [99] Y. Li et al. Localvit: Bringing locality to vision transformers.
[59] B. Wu et al. Visual transformers: Token-based image representation and arXiv:2104.05707, 2021.
processing for computer vision. arXiv:2006.03677, 2020. [100] B. Graham et al. Levit: a vision transformer in convnet’s clothing for
[60] H. Touvron et al. Training data-efficient image transformers & distilla- faster inference. In ICCV, 2021.
tion through attention. In ICML, 2020. [101] A. Srinivas et al. Bottleneck transformers for visual recognition. In
[61] Z. Liu et al. Swin transformer: Hierarchical vision transformer using CVPR, 2021.
shifted windows. In ICCV, 2021. [102] Z. Chen et al. Visformer: The vision-friendly transformer. arXiv, 2021.
[62] C.-F. Chen et al. Regionvit: Regional-to-local attention for vision [103] T. Xiao et al. Early convolutions help transformers see better. In
transformers. arXiv:2106.02689, 2021. NeurIPS, volume 34, 2021.
[63] X. Chu et al. Twins: Revisiting the design of spatial attention in vision [104] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description
transformers. arXiv:2104.13840, 2021. length, and helmholtz free energy. NIPS, 6:3–10, 1994.
[64] H. Lin et al. Cat: Cross attention in vision transformer. arXiv, 2021. [105] P. Vincent et al. Extracting and composing robust features with
[65] X. Dong et al. Cswin transformer: A general vision transformer denoising autoencoders. In ICML, pp. 1096–1103, 2008.
backbone with cross-shaped windows. arXiv:2107.00652, 2021. [106] A. v. d. Oord et al. Conditional image generation with pixelcnn
[66] Z. Huang et al. Shuffle transformer: Rethinking spatial shuffle for vision decoders. arXiv preprint arXiv:1606.05328, 2016.
transformer. arXiv:2106.03650, 2021. [107] D. Pathak et al. Context encoders: Feature learning by inpainting. In
[67] J. Fang et al. Msg-transformer: Exchanging local spatial information by CVPR, pp. 2536–2544, 2016.
manipulating messenger tokens. arXiv:2105.15168, 2021. [108] Z. Li et al. Mst: Masked self-supervised transformer for visual
[68] L. Yuan et al. Tokens-to-token vit: Training vision transformers from representation. In NeurIPS, 2021.
scratch on imagenet. In ICCV, 2021. [109] H. Bao et al. Beit: Bert pre-training of image transformers.
[69] D. Zhou et al. Deepvit: Towards deeper vision transformer. arXiv, 2021. arXiv:2106.08254, 2021.
[70] P. Wang et al. Kvt: k-nn attention for boosting vision transformers. [110] A. Radford et al. Language models are unsupervised multitask learners.
arXiv:2106.00515, 2021. OpenAI blog, 1(8):9, 2019.
[71] D. Zhou et al. Refiner: Refining self-attention for vision transformers. [111] Z. Xie et al. Simmim: A simple framework for masked image modeling.
arXiv:2106.03714, 2021. In CVPR, pp. 9653–9663, 2022.
[72] A. El-Nouby et al. Xcit: Cross-covariance image transformers. [112] Z. Xie et al. Self-supervised learning with swin transformers.
arXiv:2106.09681, 2021. arXiv:2105.04553, 2021.
[73] W. Wang et al. Pyramid vision transformer: A versatile backbone for [113] C. Li et al. Efficient self-supervised vision transformers for representa-
dense prediction without convolutions. In ICCV, 2021. tion learning. arXiv:2106.09785, 2021.
[74] S. Sun* et al. Visual parser: Representing part-whole hierarchies with [114] K. He et al. Momentum contrast for unsupervised visual representation
transformers. arXiv:2107.05790, 2021. learning. In CVPR, 2020.
[75] H. Fan et al. Multiscale vision transformers. arXiv:2104.11227, 2021. [115] J. Beal et al. Toward transformer-based object detection.
[76] Z. Zhang et al. Nested hierarchical transformer: Towards accurate, data- arXiv:2012.09958, 2020.
efficient and interpretable visual understanding. In AAAI, 2022. [116] Z. Yuan et al. Temporal-channel transformer for 3d lidar-based video
[77] Z. Pan et al. Less is more: Pay less attention in vision transformers. In object detection for autonomous driving. IEEE TCSVT, 2021.
AAAI, 2022. [117] X. Pan et al. 3d object detection with pointformer. In CVPR, 2021.
[78] Z. Pan et al. Scalable visual transformers with hierarchical pooling. In [118] R. Liu et al. End-to-end lane shape prediction with transformers. In
ICCV, 2021. WACV, 2021.
[79] B. Heo et al. Rethinking spatial dimensions of vision transformers. In [119] S. Yang et al. Transpose: Keypoint localization via transformer. In
ICCV, 2021. ICCV, 2021.
[80] C.-F. Chen et al. Crossvit: Cross-attention multi-scale vision trans- [120] D. Zhang et al. Feature pyramid transformer. In ECCV, 2020.
former for image classification. In ICCV, 2021. [121] C. Chi et al. Relationnet++: Bridging visual representations for object
[81] Z. Wang et al. Uformer: A general u-shaped transformer for image detection via transformer decoder. NeurIPS, 2020.
restoration. arXiv:2106.03106, 2021. [122] Z. Sun et al. Rethinking transformer-based set prediction for object
[82] X. Zhai et al. Scaling vision transformers. arXiv:2106.04560, 2021. detection. In ICCV, pp. 3611–3620, 2021.
[83] X. Su et al. Vision transformer architecture search. arXiv, 2021. [123] M. Zheng et al. End-to-end object detection with adaptive clustering
[84] M. Chen et al. Autoformer: Searching transformers for visual recogni- transformer. In BMVC, 2021.
tion. In ICCV, pp. 12270–12280, 2021. [124] T. Ma et al. Oriented object detection with transformer.
[85] B. Chen et al. Glit: Neural architecture search for global and local arXiv:2106.03146, 2021.
image transformer. In ICCV, pp. 12–21, 2021. [125] P. Gao et al. Fast convergence of detr with spatially modulated co-
[86] X. Chu et al. Conditional positional encodings for vision transformers. attention. In ICCV, 2021.
arXiv:2102.10882, 2021. [126] Z. Yao et al. Efficient detr: Improving end-to-end object detector with
[87] K. Wu et al. Rethinking and improving relative position encoding for dense prior. arXiv:2104.01318, 2021.
vision transformer. In ICCV, 2021. [127] Z. Tian et al. Fcos: Fully convolutional one-stage object detection. In
[88] H. Touvron et al. Going deeper with image transformers. ICCV, pp. 9627–9636, 2019.
arXiv:2103.17239, 2021. [128] Y. Fang et al. You only look at one sequence: Rethinking transformer
[89] Y. Tang et al. Augmented shortcuts for vision transformers. In NeurIPS, in vision through object detection. In NeurIPS, 2021.
2021. [129] T.-Y. Lin et al. Focal loss for dense object detection. In ICCV, 2017.
[90] I. Tolstikhin et al. Mlp-mixer: An all-mlp architecture for vision. [130] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality
arXiv:2105.01601, 2021. object detection. In CVPR, 2018.
[91] L. Melas-Kyriazi. Do you even need attention? a stack of feed-forward [131] A. Bar et al. Detreg: Unsupervised pretraining with region priors for
layers does surprisingly well on imagenet. arXiv:2105.02723, 2021. object detection. arXiv:2106.04550, 2021.
[92] M.-H. Guo et al. Beyond self-attention: External attention using two [132] J. Hu et al. Istr: End-to-end instance segmentation with transformers.
linear layers for visual tasks. arXiv:2105.02358, 2021. arXiv:2105.00637, 2021.
[93] H. Touvron et al. Resmlp: Feedforward networks for image classifica- [133] Z. Yang et al. Associating objects with transformers for video object
tion with data-efficient training. arXiv:2105.03404, 2021. segmentation. In NeurIPS, 2021.
[94] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolu- [134] S. Wu et al. Fully transformer networks for semantic image segmenta-
tional neural networks. In ICML, 2019. tion. arXiv:2106.04108, 2021.
[95] J. Guo et al. Cmt: Convolutional neural networks meet vision trans- [135] B. Dong et al. Solq: Segmenting objects by learning queries. In
formers. arXiv:2107.06263, 2021. NeurIPS, 2021.
[96] L. Yuan et al. Volo: Vision outlooker for visual recognition. [136] R. Strudel et al. Segmenter: Transformer for semantic segmentation. In
arXiv:2106.13112, 2021. ICCV, 2021.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 21

[137] E. Xie et al. Segformer: Simple and efficient design for semantic [177] W. Choi et al. What are they doing?: Collective activity classification
segmentation with transformers. In NeurIPS, 2021. using spatio-temporal relationship among people. In ICCVW, 2009.
[138] J. M. J. Valanarasu et al. Medical transformer: Gated axial-attention for [178] K. Gavrilyuk et al. Actor-transformers for group activity recognition.
medical image segmentation. In MICCAI, 2021. In CVPR, pp. 839–848, 2020.
[139] T. Prangemeier et al. Attention-based transformers for instance seg- [179] J. Shao et al. Temporal context aggregation for video retrieval with
mentation of cells in microstructures. In International Conference on contrastive learning. In WACV, 2021.
Bioinformatics and Biomedicine, pp. 700–707. IEEE, 2020. [180] V. Gabeur et al. Multi-modal transformer for video retrieval. In ECCV,
[140] C. R. Qi et al. Pointnet: Deep learning on point sets for 3d classification pp. 214–229, 2020.
and segmentation. In CVPR, pp. 652–660, 2017. [181] Y. Chen et al. Memory enhanced global-local aggregation for video
[141] C. R. Qi et al. Pointnet++: Deep hierarchical feature learning on point object detection. In CVPR, pp. 10337–10346, 2020.
sets in a metric space. NeurIPS, 30:5099–5108, 2017. [182] J. Yin et al. Lidar-based online 3d video object detection with graph-
[142] S. Hampali et al. Handsformer: Keypoint transformer for monocular 3d based message passing and spatiotemporal transformer attention. In
pose estimation ofhands and object in interaction. arXiv, 2021. 2020 CVPR, pp. 11495–11504, 2020.
[143] Y. Li et al. Tokenpose: Learning keypoint tokens for human pose [183] H. Seong et al. Video multitask transformer network. In ICCVW, 2019.
estimation. In ICCV, 2021. [184] K. M. Schatz et al. A recurrent transformer network for novel view
[144] W. Mao et al. Tfpose: Direct human pose estimation with transformers. action synthesis. In ECCV (27), pp. 410–426, 2020.
arXiv:2103.15320, 2021. [185] C. Sun et al. Videobert: A joint model for video and language
[145] T. Jiang et al. Skeletor: Skeletal transformers for robust body-pose representation learning. In ICCV, pp. 7464–7473, 2019.
estimation. In CVPR, 2021. [186] L. H. Li et al. Visualbert: A simple and performant baseline for vision
[146] Y. Li et al. Test-time personalization with a transformer for human pose and language. arXiv:1908.03557, 2019.
estimation. Advances in Neural Information Processing Systems, 34, [187] W. Su et al. Vl-bert: Pre-training of generic visual-linguistic represen-
2021. tations. In ICLR, 2020.
[147] M. Lin et al. Detr for pedestrian detection. arXiv:2012.06785, 2020. [188] Y.-S. Chuang et al. Speechbert: Cross-modal pre-trained language
[148] L. Tabelini et al. Polylanenet: Lane estimation via deep polynomial re- model for end-to-end spoken question answering. In Interspeech, 2020.
gression. In 2020 25th International Conference on Pattern Recognition [189] R. Hu and A. Singh. Unit: Multimodal multitask learning with a unified
(ICPR), pp. 6150–6156. IEEE, 2021. transformer. In ICCV, 2021.
[149] L. Liu et al. Condlanenet: a top-to-down lane detection framework [190] S. Prasanna et al. When bert plays the lottery, all tickets are winning.
based on conditional convolution. arXiv:2105.05003, 2021. In EMNLP, 2020.
[150] P. Xu et al. A survey of scene graph: Generation and application. IEEE [191] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse,
Trans. Neural Netw. Learn. Syst, 2020. trainable neural networks. In ICLR, 2018.
[151] J. Yang et al. Graph r-cnn for scene graph generation. In ECCV, 2018. [192] Y. Tang et al. Patch slimming for efficient vision transformers.
[152] S. Sharifzadeh et al. Classification by attention: Scene graph classifica- arXiv:2106.02852, 2021.
tion with prior knowledge. In AAAI, 2021. [193] M. Zhu et al. Vision transformer pruning. arXiv:2104.08500, 2021.
[153] S. Sharifzadeh et al. Improving Visual Reasoning by Exploiting The
[194] Z. Liu et al. Learning efficient convolutional networks through network
Knowledge in Texts. arXiv:2102.04760, 2021.
slimming. In ICCV, 2017.
[154] C. Raffel et al. Exploring the limits of transfer learning with a unified
[195] Z. Lan et al. Albert: A lite bert for self-supervised learning of language
text-to-text transformer. JMLR, 21(140):1–67, 2020.
representations. In ICLR, 2020.
[155] N. Wang et al. Transformer meets tracker: Exploiting temporal context
[196] C. Xu et al. Bert-of-theseus: Compressing bert by progressive module
for robust visual tracking. In CVPR, 2021.
replacing. In EMNLP, pp. 7859–7869, 2020.
[156] M. Zhao et al. TrTr: Visual Tracking with Transformer.
[197] S. Shen et al. Q-bert: Hessian based ultra low precision quantization of
arXiv:2105.03817 [cs], May 2021. arXiv: 2105.03817.
bert. In AAAI, pp. 8815–8821, 2020.
[157] X. Chen et al. Transformer tracking. In CVPR, 2021.
[198] O. Zafrir et al. Q8bert: Quantized 8bit bert. arXiv:1910.06188, 2019.
[158] P. Sun et al. TransTrack: Multiple Object Tracking with Transformer.
arXiv:2012.15460 [cs], May 2021. arXiv: 2012.15460. [199] V. Sanh et al. Distilbert, a distilled version of bert: smaller, faster,
[159] S. He et al. TransReID: Transformer-based object re-identification. In cheaper and lighter. arXiv:1910.01108, 2019.
ICCV, 2021. [200] S. Sun et al. Patient knowledge distillation for bert model compression.
[160] X. Liu et al. A video is worth three views: Trigeminal transformers for In EMNLP-IJCNLP, pp. 4323–4332, 2019.
video-based person re-identification. arXiv:2104.01745, 2021. [201] Z. Sun et al. Mobilebert: a compact task-agnostic bert for resource-
[161] T. Zhang et al. Spatiotemporal transformer for video-based person re- limited devices. In ACL, pp. 2158–2170, 2020.
identification. arXiv:2103.16469, 2021. [202] I. Turc et al. Well-read students learn better: The impact of student
[162] N. Engel et al. Point transformer. IEEE Access, 9:134826–134840, initialization on knowledge distillation. arXiv:1908.08962, 2019.
2021. [203] X. Qiu et al. Pre-trained models for natural language processing: A
[163] M.-H. Guo et al. Pct: Point cloud transformer. Computational Visual survey. Science China Technological Sciences, pp. 1–26, 2020.
Media, 7(2):187–199, 2021. [204] A. Fan et al. Reducing transformer depth on demand with structured
[164] H. Zhao et al. Point transformer. In ICCV, pp. 16259–16268, 2021. dropout. In ICLR, 2020.
[165] K. Lee et al. Vitgan: Training gans with vision transformers. arXiv [205] L. Hou et al. Dynabert: Dynamic bert with adaptive width and depth.
preprint arXiv:2107.04589, 2021. NeurIPS, 33, 2020.
[166] A. v. d. Oord et al. Neural discrete representation learning. arXiv, 2017. [206] Z. Wang et al. Structured pruning of large language models. In EMNLP,
[167] J. Ho et al. Denoising diffusion probabilistic models. volume 33, pp. pp. 6151–6162, 2020.
6840–6851, 2020. [207] G. Hinton et al. Distilling the knowledge in a neural network.
[168] A. Ramesh et al. Hierarchical text-conditional image generation with arXiv:1503.02531, 2015.
clip latents. arXiv preprint arXiv:2204.06125, 2022. [208] C. Buciluǎ et al. Model compression. In SIGKDD, pp. 535–541, 2006.
[169] R. Rombach et al. High-resolution image synthesis with latent diffusion [209] J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS, 2014.
models. In CVPR, pp. 10684–10695, 2022. [210] S. Mukherjee and A. H. Awadallah. Xtremedistil: Multi-stage distilla-
[170] X. Wang et al. Sceneformer: Indoor scene generation with transformers. tion for massive multilingual models. In ACL, pp. 2221–2234, 2020.
In 3DV, pp. 106–115. IEEE, 2021. [211] W. Wang et al. Minilm: Deep self-attention distillation for task-agnostic
[171] Z. Liu et al. Convtransformer: A convolutional transformer network for compression of pre-trained transformers. arXiv:2002.10957, 2020.
video frame synthesis. arXiv:2011.10185, 2020. [212] S. I. Mirzadeh et al. Improved knowledge distillation via teacher
[172] R. Girdhar et al. Video action transformer network. In CVPR, 2019. assistant. In AAAI, 2020.
[173] H. Liu et al. Two-stream transformer networks for video-based face [213] D. Jia et al. Efficient vision transformers via fine-grained manifold
alignment. T-PAMI, 40(11):2546–2554, 2017. distillation. arXiv:2107.01378, 2021.
[174] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new [214] V. Vanhoucke et al. Improving the speed of neural networks on cpus.
model and the kinetics dataset. In CVPR, 2017. In NIPS Workshop, 2011.
[175] S. Lohit et al. Temporal transformer networks: Joint learning of [215] Z. Yang et al. Searching for low-bit weights in quantized neural
invariant and discriminative time warping. In CVPR, 2019. networks. In NeurIPS, 2020.
[176] M. Fayyaz and J. Gall. Sct: Set constrained temporal transformer for [216] E. Park and S. Yoo. Profit: A novel training method for sub-4-bit
set supervised action segmentation. In 2020 CVPR, pp. 501–510, 2020. mobilenet models. In ECCV, pp. 430–446. Springer, 2020.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 22

[217] J. Fromm et al. Riptide: Fast end-to-end binarized neural networks. [257] Y. Zhu et al. Aligning books and movies: Towards story-like visual
Proceedings of Machine Learning and Systems, 2:379–389, 2020. explanations by watching movies and reading books. In ICCV, pp.
[218] Y. Bai et al. Proxquant: Quantized neural networks via proximal 19–27, 2015.
operators. In ICLR, 2019. [258] A. Radford et al. Improving language understanding by generative pre-
[219] A. Bhandare et al. Efficient 8-bit quantization of transformer neural training, 2018.
machine language translation model. arXiv:1906.00532, 2019. [259] Z. Yang et al. Xlnet: Generalized autoregressive pretraining for lan-
[220] C. Fan. Quantized transformer. Technical report, Stanford Univ., 2019. guage understanding. In NeurIPS, pp. 5753–5763, 2019.
[221] K. Shridhar et al. End to end binarized neural networks for text [260] K. Clark et al. Electra: Pre-training text encoders as discriminators
classification. In SustaiNLP, 2020. rather than generators. arXiv:2003.10555, 2020.
[222] R. Cheong and R. Daniel. transformers. zip: Compressing transformers [261] L. Dong et al. Unified language model pre-training for natural language
with pruning and quantization. Technical report, 2019. understanding and generation. In NeurIPS, pp. 13063–13075, 2019.
[223] Z. Zhao et al. An investigation on different underlying quantization [262] M. Lewis et al. Bart: Denoising sequence-to-sequence pre-training
schemes for pre-trained language models. In NLPCC, 2020. for natural language generation, translation, and comprehension.
[224] Z. Liu et al. Post-training quantization for vision transformer. In arXiv:1910.13461, 2019.
NeurIPS, 2021. [263] Z. Zhang et al. Ernie: Enhanced language representation with informa-
[225] Z. Wu et al. Lite transformer with long-short range attention. In ICLR, tive entities. arXiv:1905.07129, 2019.
2020. [264] M. E. Peters et al. Knowledge enhanced contextual word representa-
[226] Z. Geng et al. Is attention better than matrix decomposition? In ICLR, tions. arXiv:1909.04164, 2019.
2020. [265] J. Lee et al. Biobert: a pre-trained biomedical language representation
[227] Y. Guo et al. Nat: Neural architecture transformer for accurate and model for biomedical text mining. Bioinformatics, 36(4):1234–1240,
compact architectures. In NeurIPS, pp. 737–748, 2019. 2020.
[228] D. So et al. The evolved transformer. In ICML, pp. 5877–5886, 2019. [266] I. Beltagy et al. Scibert: A pretrained language model for scientific text.
[229] C. Li et al. Bossnas: Exploring hybrid cnn-transformers with block- arXiv:1903.10676, 2019.
wisely self-supervised neural architecture search. In ICCV, 2021. [267] K. Huang et al. Clinicalbert: Modeling clinical notes and predicting
[230] A. Katharopoulos et al. Transformers are rnns: Fast autoregressive hospital readmission. arXiv:1904.05342, 2019.
transformers with linear attention. In ICML, 2020. [268] J. Ba et al. Multiple object recognition with visual attention. In ICLR,
[231] C. Yun et al. o(n) connections are expressive enough: Universal 2014.
approximability of sparse transformers. In NeurIPS, 2020. [269] V. Mnih et al. Recurrent models of visual attention. NeurIPS, pp.
[232] M. Zaheer et al. Big bird: Transformers for longer sequences. In 2204–2212, 2014.
NeurIPS, 2020. [270] K. Xu et al. Show, attend and tell: Neural image caption generation
[233] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM with visual attention. In International conference on machine learning,
Journal on Computing, 40(4), 2011. pp. 2048–2057, 2015.
[234] F. Chung and L. Lu. The average distances in random graphs with given [271] F. Wang et al. Residual attention network for image classification. In
expected degrees. PNAS, 99(25):15879–15882, 2002. CVPR, pp. 3156–3164, 2017.
[235] A. Krizhevsky and G. Hinton. Learning multiple layers of features from [272] S. Jetley et al. Learn to pay attention. In ICLR, 2018.
tiny images. Technical report, Citeseer, 2009. [273] K. Han et al. Attribute-aware attention model for fine-grained represen-
[236] X. Zhai et al. A large-scale study of representation learning with the tation learning. In ACM MM, pp. 2040–2048, 2018.
visual task adaptation benchmark. arXiv:1910.04867, 2019. [274] P. Ramachandran et al. Stand-alone self-attention in vision models. In
[237] Y. Cheng et al. Robust neural machine translation with doubly adver- NeurIPS, 2019.
sarial inputs. In ACL, 2019. [275] Q. Guan et al. Diagnose like a radiologist: Attention guided
[238] W. E. Zhang et al. Adversarial attacks on deep-learning models in convolutional neural network for thorax disease classification. In
natural language processing: A survey. ACM TIST, 11(3):1–41, 2020. arXiv:1801.09927, 2018.
[239] K. Mahmood et al. On the robustness of vision transformers to [276] J. Hu et al. Squeeze-and-excitation networks. In CVPR, pp. 7132–7141,
adversarial examples. arXiv:2104.02610, 2021. 2018.
[240] X. Mao et al. Towards robust vision transformer. arXiv, 2021. [277] H. Zhao et al. Psanet: Point-wise spatial attention network for scene
[241] S. Serrano and N. A. Smith. Is attention interpretable? In ACL, 2019. parsing. In ECCV, pp. 267–283, 2018.
[242] S. Wiegreffe and Y. Pinter. Attention is not not explanation. In EMNLP- [278] Y. Yuan et al. Ocnet: Object context for semantic segmentation.
IJCNLP, 2019. International Journal of Computer Vision, pp. 1–24, 2021.
[243] H. Chefer et al. Transformer interpretability beyond attention visualiza- [279] J. Fu et al. Dual attention network for scene segmentation. In CVPR,
tion. In CVPR, pp. 782–791, 2021. pp. 3146–3154, 2019.
[244] R. Livni et al. On the computational efficiency of training neural [280] H. Zhang et al. Co-occurrent features in semantic segmentation. In
networks. In NeurIPS, 2014. CVPR, pp. 548–557, 2019.
[245] B. Neyshabur et al. Towards understanding the role of over- [281] F. Zhang et al. Acfnet: Attentional class feature network for semantic
parametrization in generalization of neural networks. In ICLR, 2019. segmentation. In ICCV, pp. 6798–6807, 2019.
[246] K. Han et al. Ghostnet: More features from cheap operations. In CVPR, [282] X. Li et al. Expectation-maximization attention networks for semantic
pp. 1580–1589, 2020. segmentation. In ICCV, pp. 9167–9176, 2019.
[247] K. Han et al. Model rubik’s cube: Twisting resolution, depth and width [283] J. He et al. Adaptive pyramid context network for semantic segmenta-
for tinynets. NeurIPS, 33, 2020. tion. In CVPR, pp. 7519–7528, 2019.
[248] T. Chen et al. Diannao: a small-footprint high-throughput accelerator [284] O. Oktay et al. Attention u-net: Learning where to look for the pancreas.
for ubiquitous machine-learning. In ASPLOS, pp. 269–284, 2014. 2018.
[249] H. Liao et al. Davinci: A scalable architecture for neural network [285] Y. Wang et al. Self-supervised equivariant attention mechanism for
computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), 2019. weakly supervised semantic segmentation. In CVPR, pp. 12275–12284,
[250] A. Jaegle et al. Perceiver: General perception with iterative attention. 2020.
In ICML, volume 139, pp. 4651–4664. PMLR, 18–24 Jul 2021. [286] X. Li et al. Global aggregation then local distribution in fully convolu-
[251] A. Jaegle et al. Perceiver io: A general architecture for structured inputs tional networks. In BMVC, 2019.
& outputs. arXiv preprint arXiv:2107.14795, 2021. [287] Y. Chen et al. Aˆ 2-nets: Double attention networks. NeurIPS, pp.
[252] X. Wang et al. Non-local neural networks. In CVPR, pp. 7794–7803, 352–361, 2018.
2018. [288] L. Zhang et al. Dual graph convolutional network for semantic
[253] A. Buades et al. A non-local algorithm for image denoising. In CVPR, segmentation. In BMVC, 2019.
pp. 60–65, 2005. [289] K. Yue et al. Compact generalized non-local network. In NeurIPS, pp.
[254] J. Chung et al. Empirical evaluation of gated recurrent neural networks 6510–6519, 2018.
on sequence modeling. arXiv:1412.3555, 2014. [290] Z. Huang et al. Ccnet: Criss-cross attention for semantic segmentation.
[255] M. Joshi et al. Spanbert: Improving pre-training by representing and In ICCV, pp. 603–612, 2019.
predicting spans. Transactions of the Association for Computational [291] L. Huang et al. Interlaced sparse self-attention for semantic segmenta-
Linguistics, 8:64–77, 2020. tion. arXiv:1907.12273, 2019.
[256] Y. Liu et al. Roberta: A robustly optimized bert pretraining approach. [292] Y. Li and A. Gupta. Beyond grids: Learning graph representations for
arXiv:1907.11692, 2019. visual recognition. NeurIPS, pp. 9225–9235, 2018.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 23

[293] S. Kumaar et al. Cabinet: Efficient context aggregation network for

low-latency semantic segmentation. arXiv:2011.00993, 2020.
[294] X. Liang et al. Symbolic graph reasoning meets convolutions. NeurIPS,
pp. 1853–1863, 2018.
[295] Y. Chen et al. Graph-based global reasoning networks. In CVPR, pp.
433–442, 2019.
[296] T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV,
pp. 740–755, 2014.
[297] Y. Cao et al. Gcnet: Non-local networks meet squeeze-excitation
networks and beyond. In ICCV Workshops, 2019.
[298] W. Li et al. Object detection based on an adaptive attention mechanism.
Scientific Reports, pp. 1–13, 2020.
[299] T.-I. Hsieh et al. One-shot object detection with co-attention and co-
excitation. In NeurIPS, pp. 2725–2734, 2019.
[300] Q. Fan et al. Few-shot object detection with attention-rpn and multi-
relation detector. In CVPR, pp. 4013–4022, 2020.
[301] H. Perreault et al. Spotnet: Self-attention multi-task network for object
detection. In 2020 17th Conference on Computer and Robot Vision
(CRV), pp. 230–237, 2020.
[302] X.-T. Vo et al. Bidirectional non-local networks for object detection. In
International Conference on Computational Collective Intelligence, pp.
491–501, 2020.
[303] H. Hu et al. Relation networks for object detection. In CVPR, pp.
3588–3597, 2018.
[304] K. Zhang et al. Learning enhanced resolution-wise features for human
pose estimation. In 2020 IEEE International Conference on Image
Processing (ICIP), pp. 2256–2260, 2020.
[305] Y. Chang et al. The same size dilated attention network for keypoint
detection. In International Conference on Artificial Neural Networks,
pp. 471–483, 2019.
[306] A. Johnston and G. Carneiro. Self-supervised monocular trained depth
estimation using self-attention and discrete disparity volume. In CVPR,
pp. 4756–4765, 2020.
[307] Y. Chen et al. Attention-based context aggregation network for monoc-
ular depth estimation. International Journal of Machine Learning and
Cybernetics, pp. 1583–1596, 2021.
[308] S. Aich et al. Bidirectional attention network for monocular depth
estimation. In ICRA, 2021.

A Survey On Vision Transformer
No ratings yet
A Survey On Vision Transformer
24 pages
02-Complexity Analysis of An Algorithm
No ratings yet
02-Complexity Analysis of An Algorithm
25 pages
Daraz Digital Marketing Strategy
No ratings yet
Daraz Digital Marketing Strategy
38 pages
A Survey On Vision Transformer
No ratings yet
A Survey On Vision Transformer
23 pages
Transformers For Vision
No ratings yet
Transformers For Vision
28 pages
ViT Survey On Segmentation
No ratings yet
ViT Survey On Segmentation
30 pages
Transformers_in_computational_visual_media_A_surve
No ratings yet
Transformers_in_computational_visual_media_A_surve
30 pages
ViT Transformers SEMINAR
No ratings yet
ViT Transformers SEMINAR
16 pages
TSP_CMC_50790
No ratings yet
TSP_CMC_50790
24 pages
A Survey of Visual Transformers
No ratings yet
A Survey of Visual Transformers
23 pages
An Overview of Vision Transformers For Image Processing A Survey
No ratings yet
An Overview of Vision Transformers For Image Processing A Survey
17 pages
Gaurav_Vision_Transformer
No ratings yet
Gaurav_Vision_Transformer
10 pages
NeurIPS 2021 Transformer in Transformer Paper
No ratings yet
NeurIPS 2021 Transformer in Transformer Paper
12 pages
A Survey of Visual Transformers
No ratings yet
A Survey of Visual Transformers
21 pages
10 Transformers
No ratings yet
10 Transformers
22 pages
Video Quality Assessment (VQA) Using Vision Transformers
No ratings yet
Video Quality Assessment (VQA) Using Vision Transformers
5 pages
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
No ratings yet
Transformer in Transformer: Kai Han An Xiao Enhua Wu Jianyuan Guo Chunjing Xu Yunhe Wang
14 pages
Ai Lakshmana Sai Vision Transformer
No ratings yet
Ai Lakshmana Sai Vision Transformer
19 pages
Vi Transformer
No ratings yet
Vi Transformer
21 pages
Neural Architecture Search For Transformers A Surv
No ratings yet
Neural Architecture Search For Transformers A Surv
39 pages
Video Quality Assessment (VQA) Using Vision Transformers
No ratings yet
Video Quality Assessment (VQA) Using Vision Transformers
6 pages
Behavior Cloning For Self Driving Cars Using Attention Models
No ratings yet
Behavior Cloning For Self Driving Cars Using Attention Models
5 pages
Lec25 Architectures
No ratings yet
Lec25 Architectures
52 pages
Transformer-Based Visual Segmentation: A Survey
No ratings yet
Transformer-Based Visual Segmentation: A Survey
25 pages
Research Notes
No ratings yet
Research Notes
9 pages
good note - ViT
No ratings yet
good note - ViT
13 pages
Vision Transformers: Revolutionizing Computer Vision
No ratings yet
Vision Transformers: Revolutionizing Computer Vision
14 pages
Applsci 13 05521 v2
No ratings yet
Applsci 13 05521 v2
17 pages
Challenging Task[1]
No ratings yet
Challenging Task[1]
21 pages
1-s2.0-S0957417423031688-main
No ratings yet
1-s2.0-S0957417423031688-main
48 pages
A_Survey_on_Efficient_Vision_Transformers_Algorithms_Techniques_and_Performance_Benchmarking
No ratings yet
A_Survey_on_Efficient_Vision_Transformers_Algorithms_Techniques_and_Performance_Benchmarking
19 pages
Transformer-Based Visual Segmentation - A Survey
No ratings yet
Transformer-Based Visual Segmentation - A Survey
23 pages
Government College of Engineering Aurangabad: Submitted BY
No ratings yet
Government College of Engineering Aurangabad: Submitted BY
22 pages
CMT: Convolutional Neural Networks Meet Vision Transformers
No ratings yet
CMT: Convolutional Neural Networks Meet Vision Transformers
11 pages
Chen Et Al. - 2020 - Pre-Trained Image Processing Transformer
No ratings yet
Chen Et Al. - 2020 - Pre-Trained Image Processing Transformer
13 pages
A_review_of_advances_in_image_recognition_models_F
No ratings yet
A_review_of_advances_in_image_recognition_models_F
5 pages
paper2
No ratings yet
paper2
8 pages
A survey of the Vision Transformers and its CNN-Transformer based Variants_Khan et al_
No ratings yet
A survey of the Vision Transformers and its CNN-Transformer based Variants_Khan et al_
82 pages
2022 AIOpen A Survey of Transformers Lin, Wang, Liu, Qiu
No ratings yet
2022 AIOpen A Survey of Transformers Lin, Wang, Liu, Qiu
22 pages
ViTA_A_Vision_Transformer_Inference_Accelerator_for_Edge_Applications
No ratings yet
ViTA_A_Vision_Transformer_Inference_Accelerator_for_Edge_Applications
5 pages
applsci-14-04316
No ratings yet
applsci-14-04316
27 pages
ViViT: A Video Vision Transformer
No ratings yet
ViViT: A Video Vision Transformer
14 pages
Yang 2022
No ratings yet
Yang 2022
20 pages
03 - ViViT - A Video Vision Transformer
No ratings yet
03 - ViViT - A Video Vision Transformer
13 pages
paper3
No ratings yet
paper3
7 pages
Research Paper (2) Done
No ratings yet
Research Paper (2) Done
17 pages
Jimaging 09 00147 v2
No ratings yet
Jimaging 09 00147 v2
4 pages
The Evolution of Deep Learning
No ratings yet
The Evolution of Deep Learning
53 pages
Abstract
No ratings yet
Abstract
2 pages
Transformers in Single Object Tracking: An Experimental Survey
No ratings yet
Transformers in Single Object Tracking: An Experimental Survey
32 pages
(NIPS23) Scattering Transformation For ViT
No ratings yet
(NIPS23) Scattering Transformation For ViT
21 pages
Transformer Architectures_ResearchPaper (1)
No ratings yet
Transformer Architectures_ResearchPaper (1)
13 pages
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
No ratings yet
AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Hirtika Mirghani
2 pages
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
No ratings yet
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
16 pages
Seminar
No ratings yet
Seminar
61 pages
A Comprehensive Survey On Applications of Transformers For Deep Learning Tasks
No ratings yet
A Comprehensive Survey On Applications of Transformers For Deep Learning Tasks
58 pages
Advanced Deep Learning and Transformers - Cirrincione
No ratings yet
Advanced Deep Learning and Transformers - Cirrincione
3 pages
Vision Transformers in Medical Imaging- A Review
No ratings yet
Vision Transformers in Medical Imaging- A Review
31 pages
Computer Vision
No ratings yet
Computer Vision
2 pages
s10278-025-01481-y
No ratings yet
s10278-025-01481-y
44 pages
Assignment Transforming Computer Vision The Rise of Vision Transformers and Its Impact
No ratings yet
Assignment Transforming Computer Vision The Rise of Vision Transformers and Its Impact
3 pages
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
bu_konu_guzel
No ratings yet
bu_konu_guzel
19 pages
Hasan2024
No ratings yet
Hasan2024
234 pages
Analysis of the Transit Traverse Survey Method of the United Stat
No ratings yet
Analysis of the Transit Traverse Survey Method of the United Stat
103 pages
Wen_2021_IOP_Conf._Ser.__Earth_Environ._Sci._783_012095
No ratings yet
Wen_2021_IOP_Conf._Ser.__Earth_Environ._Sci._783_012095
8 pages
What are Database Recovery Techniques
No ratings yet
What are Database Recovery Techniques
11 pages
General Description Features: High Precision Low Cost MCM Power Switch
No ratings yet
General Description Features: High Precision Low Cost MCM Power Switch
10 pages
Manual For Oil Bunker Flow System
No ratings yet
Manual For Oil Bunker Flow System
9 pages
Trees-graph Theory_module 7
No ratings yet
Trees-graph Theory_module 7
35 pages
Seraph Content Analysis
No ratings yet
Seraph Content Analysis
4 pages
DILEEP Business Proposal
No ratings yet
DILEEP Business Proposal
4 pages
Math 2 - Practice Test 2B PDF
No ratings yet
Math 2 - Practice Test 2B PDF
24 pages
Step-By-step Guide To Create Payroll Custom Function
No ratings yet
Step-By-step Guide To Create Payroll Custom Function
8 pages
EE2202 Electromagnetic Theory Lecture Notes
No ratings yet
EE2202 Electromagnetic Theory Lecture Notes
125 pages
Lab 1 ROS Intro LAB 02 PDF
No ratings yet
Lab 1 ROS Intro LAB 02 PDF
33 pages
Guide For Salesforce AdminSuperSet
No ratings yet
Guide For Salesforce AdminSuperSet
4 pages
Compass v1.6.5 - Browser Interface User Guide (20.09.28)
No ratings yet
Compass v1.6.5 - Browser Interface User Guide (20.09.28)
71 pages
12V To 5V Converter Using LM7805 IC - Power Supply
No ratings yet
12V To 5V Converter Using LM7805 IC - Power Supply
4 pages
C#13Expansion
No ratings yet
C#13Expansion
10 pages
Application of Drone in Agriculture A Re
100% (1)
Application of Drone in Agriculture A Re
7 pages
Impact
No ratings yet
Impact
28 pages
SikaSwell-Swellable Profiles and Sealants
No ratings yet
SikaSwell-Swellable Profiles and Sealants
5 pages
AAR603 4B - Axially Loaded Column
No ratings yet
AAR603 4B - Axially Loaded Column
20 pages
Blockchain Technology Report Final
100% (1)
Blockchain Technology Report Final
25 pages
PLMSS-2021 Brochure
No ratings yet
PLMSS-2021 Brochure
8 pages
Free Thesis PowerPoint Template
No ratings yet
Free Thesis PowerPoint Template
29 pages
Engine Mechanical (5Vz-Fe) : SST (Special Service Tools)
No ratings yet
Engine Mechanical (5Vz-Fe) : SST (Special Service Tools)
3 pages
What is JDBC
No ratings yet
What is JDBC
3 pages
VERSI 1-Latihan
No ratings yet
VERSI 1-Latihan
7 pages
Sipoxipapalunuwusijolur 2
No ratings yet
Sipoxipapalunuwusijolur 2
5 pages
2. Mathematical Methods Class 11 Physics Notes
No ratings yet
2. Mathematical Methods Class 11 Physics Notes
60 pages
Supply Autarsys ESS Medium
No ratings yet
Supply Autarsys ESS Medium
4 pages
NAE GUIDE Johnson Controls
No ratings yet
NAE GUIDE Johnson Controls
124 pages