
SimA: Simple Softmax-free Attention

for Vision Transformers

Soroush Abbasi Koohpayegani1,2 Hamed Pirsiavash2


1 University of Maryland, Baltimore County 2 University of California, Davis
[email protected] [email protected]
arXiv:2206.08898v1 [cs.CV] 17 Jun 2022

Abstract
Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive, partly due to the Softmax layer in the attention block. We introduce a simple but effective Softmax-free attention block, SimA, which normalizes the query and key matrices with a simple ℓ1-norm instead of using a Softmax layer. The attention block in SimA is then a simple multiplication of three matrices, so SimA can dynamically change the ordering of the computation at test time to achieve linear computation in the number of tokens or the number of channels. We empirically show that SimA applied to three SOTA variations of transformers, DeiT, XCiT, and CvT, results in on-par accuracy compared to the SOTA models, without any need for a Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on the accuracy, which simplifies the attention block further. The code
is available here: https://github.com/UCDvision/sima

1 Introduction
Recently, vision transformers have become very popular. Compared to CNNs, they achieve better
accuracy; however, deploying transformers on devices with smaller computational resources is
challenging. One reason is that a transformer model calls the Softmax layer several times, which
in turn calls the exp(.) operation. The exp(.) operation is costly, particularly on smaller
devices with limited computational resources. For instance, implementing exp(.) on an FPGA is
much more costly than implementing simple multiplication or addition.
We are interested in simplifying the attention mechanism by removing the Softmax layer. We believe
one role of the Softmax layer is to normalize the attention values so that tokens can compete with
each other. Our main idea is to enable this competition by normalizing the query and key matrices
with their ℓ1-norm before multiplying them. Removing the Softmax layer then boils the whole
attention mechanism down to simply multiplying the three matrices "query", "key", and "value".
As a by-product, due to the associative property of matrix multiplication, there are two possible
orderings for multiplying these three matrices at test time. Depending on the ordering, the
computation can be quadratic in the number of tokens, N, or the number of channels, D. Hence, we
can reduce the computation further by deciding on the ordering at test time by comparing N and D.
Note that this decision is made dynamically at test time depending on the image resolution without
affecting the training process. Moreover, since we normalize the vectors before multiplying them,
our method is numerically more stable, so we can use half-precision floating point without overflowing.
The attention mechanism deals with the tokens without considering their ordering. This is an
interesting property that opens the door to many applications. For instance, the distribution of
the token values is relatively robust compared to CNNs when we mask (drop) 75% of the tokens as in
masked auto-encoders (MAE [22]). Moreover, the tokens can be seen as a non-ordered set that can


Figure 1: Effect of Softmax on inference time: We evaluate the performance of each model on a single
RTX 8000 GPU with a batch size of 8. When comparing the baseline to our method (SimA), we fix the
order of the product QK^T V to have the same dot-product complexity as the baseline. For example,
when comparing with DeiT, if N > D, then it is more efficient to do Q̂(K̂^T V) for our method, but we
do (Q̂K̂^T)V to have the same complexity as DeiT (O(N^2 D)). We do this to solely evaluate the effect
of Softmax on the computation time. In the left figure, we fix the token dimension to 384 and increase
the image resolution. At 1536 × 1536 resolution, DeiT is 58% slower than our method due to the
overhead of the exp(.) function in Softmax. In the right figure, we fix the resolution and increase
the capacity of the model (dimensions of Q and K). With a dimension of 8192, XCiT is 22% slower due
to the Softmax overhead.

come from various sources (e.g., multiple cameras or non-camera sensors). Note that this permutation
invariance property does not exist in some other models like MLP-Mixer [52]. Hence, instead of using
MLP-Mixer, which does not have Softmax by default, we are interested in removing Softmax from the
original transformers to keep this permutation invariance property.
We perform experiments with our simple attention block, denoted SimA, by using it in standard vision
transformers, DeiT, CvT, and XCiT. Our method achieves on-par results with SOTA on ImageNet
classification, MS-COCO object detection and segmentation, and also self-supervised learning.
In summary, our SimA attention block does not use Softmax, which makes it computationally more
efficient in general (see Fig. 1), and on edge devices specifically. SimA can dynamically choose
to be linear in N or D at test time depending on the image resolution or the number of tokens.
Changing multi-head attention to single-head attention, or changing the GELU activation function to
ReLU, has a very small effect on the accuracy of SimA. This makes SimA very simple and effective for
various applications.

2 Method

2.1 Background on Vision Transformers:

Self-Attention Block: The original vision transformer [12] uses the self-attention block introduced
in [56]. The self-attention block gets X ∈ R^{N×D} as input, where N is the number of tokens and D
is the dimensionality of each token. Then Wq ∈ R^{D×D}, Wk ∈ R^{D×D}, and Wv ∈ R^{D×D} project
X into three N × D matrices: query (Q = XWq), key (K = XWk), and value (V = XWv). We
calculate the attention matrix A ∈ R^{N×N} defined as A = Softmax(QK^T / √D), where Softmax is
applied to each row independently, so each row of A sums to one. Using the attention matrix A, we
calculate the output O = AV. Each row of O ∈ R^{N×D} corresponds to one token, and since the rows of
A sum to one, each token is a weighted average of the values of all tokens.
Additionally, vision transformers divide Q, K, and V of each token into H heads, where each head
has its own attention over the corresponding head in all tokens. For example, Q = [Q1; Q2; ...; QH]
where Qi ∈ R^{N×D/H} is the query matrix for the i'th head. Then, we calculate the self-attention
for all H heads in parallel and concatenate the outputs to get O = [O1; O2; ...; OH]. Due to the use
of multiple heads, this block is called Multi-Head Self-Attention (MSA).
Finally, the self-attention block has an additional output projection Wproj ∈ R^{D×D}, so the final
output of the self-attention block is OWproj, which is in R^{N×D}.
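To make this background concrete, the block can be sketched in PyTorch roughly as follows (a simplified sketch with illustrative names and the common per-head √d scaling, not the exact DeiT implementation):

import torch.nn as nn
import torch.nn.functional as F

class MSA(nn.Module):
    # A minimal sketch of standard multi-head self-attention; illustrative only.
    def __init__(self, dim, num_heads=6, qkv_bias=True):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)  # Wq, Wk, Wv stacked
        self.proj = nn.Linear(dim, dim)                     # Wproj

    def forward(self, x):
        B, N, D = x.shape
        H, d = self.num_heads, D // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, H, d).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                    # each (B x H x N x d)
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # rows sum to one
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the H heads
        return self.proj(out)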

Figure 2: Our Simple Attention (SimA): First, we normalize each channel in Q and K with the ℓ1-norm
across the tokens, to get Q̂ and K̂. Next, we can choose either (Q̂K̂^T)V or Q̂(K̂^T V) depending on
the number of input tokens N. Compared to XCA and MSA, our method has the following benefits: (1)
It is free of Softmax, hence it is more efficient. (2) At test time we can dynamically switch between
(Q̂K̂^T)V and Q̂(K̂^T V) based on the number of input tokens (e.g., different image resolutions).

Cross-covariance Attention Block (XCA): The vanilla self-attention block has a complexity of
O(DN^2), which is quadratic in N. [2, 47] introduce an attention mechanism that is linear in
N. In XCA, we calculate the attention matrix as A = K^T Q, where A is a D × D matrix. Next,
we apply Softmax to each column, so that the columns sum to one. Then we calculate the output as
O = V A. Note that A is an attention of channels on each other rather than tokens. Compared to
vanilla self-attention (MSA), XCA has a complexity of O(D^2 N). Since XCA is linear in N, it is more
efficient when N ≫ D and less efficient when N ≪ D.
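A minimal sketch of the XCA computation described above could look like the following (the full XCiT block additionally ℓ2-normalizes Q and K and uses a learnable temperature, which are omitted here):

import torch.nn.functional as F

def xca_attention(q, k, v):
    # q, k, v: (B x N x D). Attention is computed across the D channels,
    # so the cost is linear in the number of tokens N.
    attn = k.transpose(-2, -1) @ q     # (B x D x D) cross-covariance matrix
    attn = F.softmax(attn, dim=-2)     # Softmax over each column: columns sum to one
    return v @ attn                    # output O = V A, shape (B x N x D)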
Vision Transformer Block: The vision transformer architecture contains n consecutive Vision Trans-
former blocks. Each block has an MSA block followed by a Feed-Forward Network (FFN), both with
skip connections. The FFN is a simple 2-layer MLP which projects tokens from D dimensions to 4D
and back to D dimensions. The FFN uses GELU [24] as the activation function. Moreover, we use
LayerNorm [3] on each token before forwarding it through the MSA or FFN blocks. The following
two update rules summarize each block of the vision transformer:

(Step 1) X ← X + MSA(LayerNorm1(X))

(Step 2) X ← X + FFN(LayerNorm2(X))
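These two update rules can be sketched roughly as follows (a simplified block; the names and the mlp_ratio default are illustrative):

import torch.nn as nn

class Block(nn.Module):
    # A minimal transformer block following the two update rules above.
    # `attention` can be any attention module (MSA, XCA, or SimA).
    def __init__(self, dim, attention, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attention
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                 # 2-layer MLP: D -> 4D -> D
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))          # Step 1
        x = x + self.ffn(self.norm2(x))           # Step 2
        return x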

2.2 Simple Attention (SimA):

Our main goal is to reduce the computation by removing the Softmax layer. We believe one of the
roles of the Softmax layer is to normalize the attention so that each token is a weighted average of
the values of all tokens. This ensures that the attention values are bounded. Hence, we introduce an
alternative normalization method that does not need a Softmax layer.
In the regular attention block, if a channel in Q and/or K has large values, that channel may dominate
the dot product QK^T. This results in other channels being ignored in calculating the attention. We
believe this may be one of the reasons for the superior performance of multi-head attention (MSA)
compared to single-head attention: in MSA, the dominating channel can dominate a single head only,
leaving the other heads still operational. We propose to take this solution to the extreme by
normalizing each channel in Q and K across tokens so that different channels become more comparable.
We do this by simply dividing the values of each channel by the ℓ1-norm of that channel across all
tokens:

Q̂^i := Q^i / |Q^i|_1    and    K̂^i := K^i / |K^i|_1

where Q^i is the i'th column of Q (the values of the i'th channel for all tokens), and Q̂ and K̂ are
the normalized query and key matrices.
Given this simple normalization, we remove the Softmax layer, so the attention block can be
written as:

O = Q̂K̂^T V

where O ∈ R^{N×D}. Similar to standard transformers, we use this block for each head separately,
concatenate the outputs, and finally apply the output projection OWproj.
One can view Q̂K̂^T as the attention matrix that quantifies the effect of one token on another.
Interestingly, if the query and key vectors have a large angle, the attention values can become negative,
meaning that a token can affect another one negatively. This is in contrast to regular transformers
where the attention is always positive.
Due to our normalization, the attention values are bounded between −D and D. The extremes happen
when only a single row of Q and a single row of K are nonzero; in this case, all other tokens have
zero query and key vectors. One may divide the attention by D to bound it between −1 and 1.
This constant scalar multiplier can be absorbed into Wv, the projection matrix for V.
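As a quick numerical sanity check of this bound (illustrative, with randomly generated Q and K for a single head):

import torch
import torch.nn.functional as F

# After l1-normalizing each channel across tokens, every entry of Q_hat and K_hat has
# absolute value at most 1, so each entry of Q_hat @ K_hat^T is a sum of D terms,
# each bounded by 1 in absolute value, i.e., it lies in [-D, D].
N, D = 196, 384
q_hat = F.normalize(torch.randn(N, D), p=1.0, dim=0)  # normalize each channel over tokens
k_hat = F.normalize(torch.randn(N, D), p=1.0, dim=0)
attn = q_hat @ k_hat.t()                               # (N x N) attention matrix
print(attn.abs().max().item() <= D)                    # True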
The cost of Softmax: Both XCA and MSA use Softmax for normalization. Softmax requires running
exp(.), which is costly. MSA uses Softmax on a matrix of size N × N while XCA uses Softmax on a
matrix of size D × D. Hence, the number of exp(.) operations is O(HN^2) for MSA and O(HD^2) for
XCA. Therefore, Softmax becomes a bottleneck when increasing the number of tokens (higher image
resolutions) in MSA and the number of channels (higher-capacity transformers) in XCA. On the other
hand, our attention block does not use the exp(.) operation at all. Note that our main experiment still
uses the GELU activation in the MLP layers; however, its computation is not dominant since it involves
only O(DN) exp(.) operations. Moreover, in the last row of Table 1, we show that changing GELU to
ReLU in SimA gives accuracy comparable to the main experiment (79.6% vs 79.8%). This version of
SimA does not use any exp(.) operation at inference time. The reduction in the computation cost
of Softmax is shown in Fig. 1-left for increasing N and in Fig. 1-right for increasing D. This figure
shows the speed-up due to removing Softmax only and does not include the speed-up due to changing
the order of multiplications. We believe this speed-up will be even larger in smaller computation
units since calculating exp(.) is more challenging in such units.
The cost of matrix dot products: The computation of the attention block in standard transformers
is of order O(N^2 D), while some recent works including XCiT change that to O(ND^2). Since
we simply multiply three matrices together, we can choose either one of these two by changing the
order of multiplications: (Q̂K̂^T)V or Q̂(K̂^T V). Our method enables choosing the order dynamically
at test time depending on the number of tokens: if N < D, we use (Q̂K̂^T)V, and otherwise we
use Q̂(K̂^T V).
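The decision rule follows from counting the multiply-accumulate operations of the two orderings; the small helper below is purely illustrative and not part of the method:

def attention_matmul_cost(n_tokens, dim):
    """Multiply-accumulate counts of the two orderings (per attention block)."""
    qk_first = 2 * n_tokens ** 2 * dim   # (Q K^T) V : O(N^2 D)
    kv_first = 2 * n_tokens * dim ** 2   # Q (K^T V) : O(N D^2)
    return qk_first, kv_first

# 224x224 image, 16x16 patches, D=384: N=196 < D, so (Q K^T) V is cheaper.
print(attention_matmul_cost(196, 384))    # (~29.5M, ~57.8M)
# 1536x1536 image, 16x16 patches: N=9216 >> D, so Q (K^T V) is cheaper.
print(attention_matmul_cost(9216, 384))   # (~65.2G, ~2.7G)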

3 Related Work

Vision Transformers: Convolutional Neural Networks (CNNs) have become ubiquitous as the
most commonly used network architecture for computer vision tasks. AlexNet [30], ResNet [20],
Xception [7] and MobileNet [26] are some of the well-known CNN architectures. While CNNs are
the de facto standard architecture for image recognition tasks, transformers have
recently emerged as a promising alternative for many computer vision tasks. Transformers
[56] are a class of network architectures that rely entirely on the self-attention mechanism and
were originally introduced for machine translation and other NLP tasks. ViT [12] adapts
transformers to obtain a convolution-free architecture for computer vision tasks by dividing each
image into 16 × 16 patches and considering each patch as an input token. DeiT [55] improves the training
efficiency of ViT, resulting in a transformer that does not require a large dataset for training. CaiT [53]
explores transformers with increased depth. The Scaled Dot-Product Attention module [56] used by
transformers relies on the softmax operation for normalization. Unlike CNN/MLP-based architectures
[30, 20, 48, 50, 54, 52], softmax is an essential part of the transformer architecture. In this paper, we
address replacing the softmax operation in the self-attention module of vision transformers.
Efficient Vision Transformers: Transformers are expensive and have a high memory footprint. Deploy-
ment of transformers on edge devices, where resources are limited, is a major concern. LeViT [16] uses
down-sampling in stages to improve efficiency. [40, 60] integrate convolution into transformers. [34, 25]
improve the efficiency of self-attention by limiting the attention of each token to a subset of tokens. [38]
uses distillation to improve the efficiency of the network. [14, 46, 39] decrease the number of tokens
by pruning unimportant ones. Although these works reduce the overall computation, softmax is
still required to calculate the attention. Our idea is orthogonal to these methods since we can replace
the attention block in any transformer with our Softmax-free attention block.
Linear Attention: Vanilla attention has O(N^2 D) computation and memory complexity, where N is
the number of input tokens and D is the dimension of each token. Some works target this issue by
replacing vanilla attention with a linear attention of O(ND^2) complexity. XCiT [47, 2] uses attention
across feature channels rather than tokens. Some works use similarity kernels to approximate softmax,
so linear complexity is possible by computing φ(Q)(φ(K)^T V) instead of (φ(Q)φ(K)^T)V,
where φ(x) is the kernel function. [28] uses φ(x) = 1 + elu(x), whereas [37, 42] use Gaussian kernel
functions. [63] uses SVD and [8] uses positive random features to approximate softmax.
[57] approximates attention with a low-rank matrix. All these methods use either the exponential function
or sin/cos functions, which are costly. For example, SOFT [37] removes Softmax without reducing
the number of exp(.) operations. A recent work in the NLP community, CosFormer [44], passes Q
and K through a ReLU unit and normalizes their product. It also adds a re-weighting method that
improves the locality of the data using sine(.) and cosine(.) functions, which is costly. Our ideas
are different since we aim to remove the Softmax operation, reduce the computation, and switch
the order of multiplication depending on the number of tokens. Also, our idea is simpler and we
apply it to visual recognition rather than NLP. Moreover, the focus of those methods is linear
attention with respect to the number of tokens, which is not the main focus of this paper.
Softmax Approximation: Softmax is an expensive operation in hardware since it requires exponen-
tial calculation. More specifically, softmax in the transformer architecture contributes a major part of
the computation when the input is large [49]. Some works address this issue by designing
hardware-friendly softmax approximations. [4] approximates softmax with Taylor expansions, whereas
[15, 13, 19, 70] design hardware architectures to approximate softmax. Softermax [49]
uses a low-precision implementation of 2^x. [66] uses lower-precision computation. [43, 33] use
quantized softmax. While these works approximate Softmax at the hardware level, we replace Softmax
completely with ℓ1 normalization at the model-architecture level.

4 Experiments

We evaluate the effectiveness of the SimA attention block by replacing self-attention in three popular
vision transformer families: DeiT, XCiT, and CvT. In Section 4.1, we evaluate our model on the image
classification task. In Section 4.2, we evaluate our model on object detection and image segmentation
tasks. In Section 4.3, we show that our method works on a pretext task for self-supervised learning.

Table 1: ImageNet classification: We denote replacing Softmax attention with SimA by X → SimA.
The Softmax column indicates the number of exp(.) operations in the backbone. N is the number of
tokens, D is the token dimension, H is the number of heads, M is the local window size, and R
is the reduction ratio. For a fair comparison, we also report ResNet50 RA (with RandAug [10]).
Models indicated by * use a teacher during training. EfficientNet outperforms our method, but it is
a convolutional network and uses more FLOPs at a higher image resolution. SOFT also has the exp(.)
function in its backbone, which is costly. Purple rows are our method while blue rows are baselines.
Our method is a Softmax-free transformer and has on-par accuracy with SOTA transformers. To
simplify SimA even further, we investigate two more variations in the yellow rows: (1) replacing GELU
with ReLU, and (2) replacing multi-head attention with single-head attention. Interestingly, SimA has
comparable performance even with single-head attention and ReLU. Note that the ReLU version does
not need any exp(.) operation at inference time.
Type  Model  params  FLOPs  Resolution  Softmax/#exp  Top1-Acc
CNN  ResNet18 [20]  12M  1.8B  224  0  69.8
Transformer  XCiT-T12/8 [2]  7M  4.8B  224  HD^2  79.7
Transformer  XCiT-T12/8 → SimA  7M  4.8B  224  0  79.4
CNN  ResNet50 RA [10]  25M  3.9B  224  0  77.6
CNN  EfficientNet-B5 RA [10]  30M  9.9B  456  0  83.9
CNN  RegNetY-4GF [45]  21M  4.0B  224  0  80.0
CNN  ConvNeXt-T [35]  29M  4.5B  224  0  82.1
MLP  ResMLP-S24 [54]  30M  6.0B  224  0  79.4
MLP  MS-MLP-T [69]  28M  4.9B  224  0  82.1
MLP  Hire-MLP-S [18]  33M  4.2B  224  0  82.1
Hybrid  Twins-SVT-S [9]  24M  3.7B  224  HM^2 N  81.7
Hybrid  CvT-13 [61]  20M  4.5B  224  HN^2  81.6
Hybrid  CvT-13 → SimA  20M  4.5B  224  0  81.4
Transformer  Swin-T [34]  29M  4.5B  224  HM^2 N  81.3
Transformer  PVT-S [58]  24M  4.0B  224  HN^2/R  79.8
Transformer  T2T-ViT-14 [64]  21M  5.2B  224  HN^2  80.7
Transformer  CaiT-XS24* [53]  26M  19.3B  384  HN^2  84.1
Transformer  SOFT-S [37]  24M  3.3B  224  HN^2  82.2
Transformer  DeiT-S* [55]  22M  4.6B  224  HN^2  81.2
Transformer  XCiT-S12/16* [2]  26M  4.8B  224  HD^2  83.3
Transformer  DeiT-S [55]  22M  4.6B  224  HN^2  79.8
Transformer  XCiT-S12/16 [2]  26M  4.8B  224  HD^2  82.0
Transformer  DeiT-S → SimA  22M  4.6B  224  0  79.8
Transformer  XCiT-S12/16 → SimA  26M  4.8B  224  0  82.1
Multi-Head/GELU  DeiT-S → SimA  22M  4.6B  224  0  79.8
Multi-Head → Single-Head  DeiT-S → SimA  22M  4.6B  224  0  79.4
GELU → ReLU  DeiT-S → SimA  22M  4.6B  224  0  79.6

4.1 Image Classification

Dataset: We train all models on ImageNet-1k [11], a large scale annotated dataset for image
classification. ImageNet consists of 1000 classes with a train set of 1.2M images and a validation set
of 50K images. We report Top-1 accuracy on the validation set for evaluation.
Implementation Details: We use the PyTorch [41] and timm [59] libraries to train our models with a
setup similar to [2, 55]. We use the AdamW [36] optimizer. We train CvT and DeiT models for 300
epochs and XCiT models for 400 epochs. We set the batch size to 1024 and the weight decay to 0.05.
We use cosine scheduling with an initial learning rate of 5e-4 and a stochastic depth drop rate
[27] of 0.05. Data augmentations are the same as those in [55], including Rand-Augment [10], CutMix [65],
and Mixup [67]. Following [2, 53], we train our models with images of resolution 224 and evaluate
using images with a crop ratio of 1.0. Training DeiT-S or XCiT-S12/16 with 8 RTX 6000 GPUs takes
approximately 100 hours.
Baselines: To show the generalization of our SimA attention mechanism to different transformer
architectures, we use three popular backbone architectures:
DeiT: DeiT [55] is a well-known transformer architecture based on ViT [12]. We use DeiT-S in our
experiments. The DeiT-S architecture has the following settings: patch size = 16, embedding dimension = 384,
Table 2: Transfer to MS-COCO dataset: Models with * are pretrained with a teacher on ImageNet.
Swin-T has more parameters and Softmax overhead. XCiT-S12/8 has 4× more tokens. Our method
is Softmax-free, thus it is more efficient for high resolution images and high capacity models (Fig. 1).
Backbone  params  Softmax  AP^box  AP^box_50  AP^box_75  AP^mask  AP^mask_50  AP^mask_75  (box: detection, mask: segmentation)
ResNet50 [20]  44.2M  ✗  41.0  61.7  44.9  37.1  58.4  40.1
PVT-Small [58]  44.1M  ✓  43.0  65.3  46.9  39.9  62.5  42.8
ViL-Small [68]  45.0M  ✓  43.4  64.9  47.0  39.6  62.1  42.4
Swin-T [34]  47.8M  ✓  46.0  68.1  50.3  41.6  65.1  44.9
XCiT-S12/16*  44.3M  ✓  45.3  67.0  49.5  40.8  64.0  43.8
XCiT-S12/8*  43.1M  ✓  47.0  68.9  51.7  42.3  66.0  45.4
XCiT-S12/16  44.3M  ✓  45.0  66.7  48.9  40.5  63.6  43.2
XCiT-S12/16 → SimA  44.3M  ✗  44.8  66.5  48.8  40.3  63.2  43.3

number of heads = 6, and layers = 12. Self-attention in DeiT has a complexity of O(DN^2), which is
quadratic in the number of tokens N.
XCiT: XCiT [2] is a state-of-the-art vision transformer architecture with a linear attention. Compared
to DeiT, XCiT has two major differences: (1) XCiT has Local Patch Interaction (LPI) in each block,
which consists of one depth-wise 3×3 convolution followed by Batch Normalization, GELU, and
another depth-wise 3×3 convolution. (2) XCiT has separate class attention layers similar to [53]. The
CLS token is added at the end of the initial self-attention stage, and class attention layers are used to
aggregate information from image tokens into the class token. This modification adds extra parameters
and computation to the model.
We replace the attention with SimA in two variants of XCiT: XCiT-S12/16 and XCiT-T12/8. XCiT-S12/16 has a patch
size of 16, an embedding dimension of 384, 8 heads, 12 layers, and 2 class attention layers. XCiT-T12/8
is similar to XCiT-S12/16 but with a patch size of 8, an embedding dimension of 192, and 4 heads.
CvT: To show that our SimA attention mechanism generalizes to more transformer architectures, we
apply SimA to CvT [61], which is a SOTA hybrid convolution/transformer architecture. CvT has 3
stages. Each stage has a Convolutional Token Embedding layer followed by transformer blocks. We use
CvT-13 in our experiments, which has 1, 2, and 10 transformer blocks in stages 1 to 3 respectively (13
blocks in total). Additionally, CvT uses a Convolutional Projection layer for the Q, K, and V projections.
Results on ImageNet: We replace the MSA and XCA blocks with our SimA block in DeiT, CvT, and
XCiT, respectively, and train our models on ImageNet. Note that we train our models from scratch
without distillation from a teacher. Results are in Table 1. In XCiT models, we get comparable
results when replacing the XCA block with the SimA block (0.1 point improvement in XCiT-S12/16 and
0.3 point reduction in XCiT-T12/8). Our attention block also performs on par with DeiT-S. Moreover,
our method, with no Softmax layer, achieves comparable accuracy (0.2 points lower) compared to
CvT-13. This suggests that one can replace the attention block with SimA in these standard SOTA
transformers without degrading their performance. Since SimA is Softmax-free, it has an advantage
over regular attention architectures in terms of efficiency and simplicity.

4.2 Transfer To Object Detection and Semantic Segmentation

As shown in Fig. 1 and [49], the softmax operation represents a large fraction of the runtime in vision
transformers, especially when the image resolution is high. In object detection and segmentation
tasks, we usually forward high-resolution images. We demonstrate the transferability of SimA to these
dense prediction tasks by fine-tuning our ImageNet-pretrained model on them.
Dataset: We use the MS-COCO [32] dataset for our dense prediction tasks. MS-COCO has 118K
training images and 5K validation images with 80 categories. Images are annotated with bounding
boxes and segmentation masks.
Implementation Details: We follow [2, 34, 6] for the setup and implementation. We use our
pretrained model as the backbone of Mask R-CNN [21]. Similar to [2], we use an FPN [31] to extract
features from layers 4, 6, 8, and 12 of the transformer. We use the AdamW [36] optimizer with a learning
Table 3: Self-Supervised Learning: We train the SimA attention block with DINO (SSL). Our method
achieves performance comparable to Softmax-based transformer models trained for 100 epochs.
Note that methods with a different SSL task and a higher number of epochs are not directly comparable.
SSL Method  Model  params  epochs  Softmax  FLOPs  Linear  k-NN
ISD [51]  ResNet50 [20]  25M  200  ✗  3.9B  69.8  62.0
MoCo v2 [23]  ResNet50 [20]  25M  200  ✗  3.9B  69.9  -
MSF [29]  ResNet50 [20]  25M  200  ✗  3.9B  72.4  64.9
BYOL [17]  ResNet50 [20]  25M  1000  ✗  3.9B  74.3  66.9
MoBY [62]  Swin-T [34]  29M  300  ✓  4.5B  75.0  -
DINO [5]  ResNet-50 [20]  23M  300  ✗  4.1B  74.5  65.6
DINO [5]  ResMLP-S24 [54]  30M  300  ✗  6.0B  72.8  69.4
DINO [5]  ViT-S/16 [12]  22M  300  ✓  4.6B  76.1  72.8
DINO [5]  XCiT-S12/16  26M  300  ✓  4.9B  77.8  76.0
DINO [5]  ViT-S/16  22M  100  ✓  4.6B  74.0  69.3
DINO [5]  XCiT-S12/16  26M  100  ✓  4.9B  75.8  71.6
DINO [5]  XCiT-S12/16 → SimA  26M  100  ✗  4.9B  75.5  71.2

rate of 1e-4 and a weight decay of 0.05. We train our model for 36 epochs with a batch size of 16 on 8
RTX 2080Ti GPUs. Training takes 36 hours.
Results on MS-COCO: We compare our XCiT-S12/16 → SimA model with other vision transformers
and ResNet in Table 2. We report the performance on the minival set. For a fair comparison, we
limit the comparison to models that are initialized with ImageNet-1K pretrained backbones and
trained with the same training budget (3x schedule) on the MS-COCO dataset. In comparison to
other transformers, our method gets on-par performance while it is free of the Softmax overhead for
high-resolution images or high-capacity models (refer to Fig. 1).

4.3 Self-Supervised Learning

To show the generalizability of SimA, we train our SimA model on a pretext task for self-supervised
learning (SSL). We use the non-contrastive task introduced by [5] for SSL pre-training. We train
our model on the ImageNet train set (1.2M images) without the use of ground-truth labels. DINO training is
relatively expensive, since it requires forwarding multi-crop augmentations through the teacher and the student.
Due to limited resources, we train our model and the baseline methods for 100 epochs. To train our
XCiT-S12/16 → SimA model with DINO [5], we follow the training configuration of XCiT-S12/16
from the official repository of DINO 1 . Similar to DINO, we use the AdamW optimizer in PyTorch
with an initial learning rate of 0.00025 and cosine scheduling. We use an initial weight decay of
0.04 and increase it to 0.4 with cosine scheduling. We train for 100 epochs with minibatches of size
256. The training takes approximately 100 hours on four RTX 3090 GPUs. We use similar settings
for training our method and the baseline (XCiT-S12/16).
Results of SSL training: Following [5, 1], we report k-NN and linear evaluation metrics for
evaluating the SSL models. For k-NN evaluation, we forward the images of the training and validation
sets through the frozen backbone and extract features. We report 20-NN accuracy on the validation set. For
linear evaluation, we freeze the backbone, train a linear layer on features extracted from the frozen
backbone, and report Top-1 accuracy on the ImageNet validation set. We adopt an approach similar
to DINO [5] for extracting features from the XCiT architecture. We extract the classification tokens of
the last two class attention layers and the global average pooling of the last two regular attention layers.
Each of those 4 vectors is of size 384. We concatenate them and train a linear layer from 4 × 384
dimensions to the 1000 classes of ImageNet-1K. We use training settings similar to DINO [5] to train the
linear layer for both our method and the baseline (XCiT-S12/16). We train for 100 epochs with the SGD
optimizer and the following settings: learning rate 0.001 with cosine scheduling, batch size 1024, and
weight decay 0. Results are shown in Table 3. Our Softmax-free method performs comparably to the
baselines with 100 epochs of training.

1 https://github.com/facebookresearch/dino

4.4 Single-head vs Multi-head Attention:

As mentioned in Section 2.2, in the regular attention block, if a channel in Q and/or K has large
values, that channel may dominate the dot product QK^T. We believe multi-head attention (MSA)
mitigates this issue to some degree by containing the dominant channel in one head only, so that the
other heads can still have a reasonable effect on the final attention. In SimA, by ℓ1-normalizing
each channel in Q and K across tokens, different channels become more comparable in the dot
product QK^T, so multi-head attention may not be needed. To evaluate this hypothesis empirically,
we train both DeiT-S → SimA and DeiT-S with single-head attention only. Results are in Table 4.
Interestingly, our method gets comparable results even with single-head attention (0.4
points lower). On the other hand, the accuracy of DeiT-S drops by 2.8 points with single-head attention.
This suggests that, unlike the vanilla attention block, multi-head attention is not critically important in
SimA, which simplifies SimA even further.

Table 4: Effect of Removing Multi-Head Attention: We use DeiT-S → SimA for our method. In
single head, our method degrades much less compared to DeiT due to normalization of Q and K.
Model SimA DeiT-S
Attention Heads 6 (Multi-Head) 1 (Single) 6 (Multi-Head) 1 (Single)
ImageNet Top-1 acc. 79.8 79.4 (-0.4) 79.8 77.0 (-2.8)

4.5 Ablation

In this section, we study different components of SimA. We train all ablation experiments on
ImageNet-1K. We use XCiT-S12/16 → SimA with the same hyperparameters as our main ImageNet
classification experiment in Section 4.1. Results are in Table 5.
Effect of ℓ1 Normalization: To see the effect of ℓ1 normalization, we train our model without
normalization. Note that by removing the normalization, the range of QK^T can be from −∞ to +∞.
We observe that training becomes unstable and results in frequent NaN losses in our trials. We also
replace the ℓ1 normalization with ℓ2 normalization and observe that the accuracy drops by 2.9 points.
SimA without LPI: Although XCiT [2] shows that the LPI layer can improve the accuracy by 1.2 points, it
limits the applicability of the vanilla transformer (e.g., running masked auto-encoder models like MAE [22]
is not straightforward). To show that our method does not depend on LPI, we train our model without
LPI and observe that the accuracy drops by 1.2 points. Hence, although LPI boosts the accuracy, our
method has comparable performance without it.
Replacing GELU with ReLU: Similar to the Softmax function, the GELU activation function also uses
the exp(.) operation, which is costly. We replace all GELU activation functions in DeiT-S → SimA with
ReLU. We observe that DeiT-S → SimA with ReLU gets an accuracy of 79.6, which is only 0.2 points
lower than DeiT-S → SimA with the GELU activation function. Note that SimA with ReLU does not use
any exp(.) operation at inference time, leading to further efficiency of the model. Results are in
Table 1 (yellow rows).

Table 5: Ablation: We use XCiT-S12/16 → SimA for all models and train for 400 epochs on
ImageNet-1K. Note that "Fail" means that we could not train the model due to NaN loss during training.
Our method uses ℓ1 normalization by default.
Ablation  SimA  w/o normalization  ℓ2-normalization  w/o LPI layer
ImageNet Top-1 acc.  82.1  Fail  79.2  80.9

Numerical stability: Instead of using Softmax, one could normalize the product of Q and K by
dividing it by a simple row-wise summation, where changing the order of the multiplications is
still possible. However, this may result in a negative denominator, flipping the sign of the
attention values, which is not desired. We can solve this problem by passing Q and K through ReLU.
We experimented with this setting; however, it was very unstable in training, and even early
snapshots of the model threw overflow errors in half-precision when switching the order of
the multiplications. We observe that a higher learning rate makes the training very unstable and results
in a NaN loss value. We were able to train this method with a smaller learning rate (5e-5) and got a Top-1
Figure 3: Our method (SimA): We extract Q̂ and K̂ from layer 12 of the transformer. We take the ℓ2-norm
of each token of Q̂ and K̂, normalize it to the range [0, 1], and overlay it as a heatmap on the image.
Interestingly, the magnitude of a token represents its significance in our method.

accuracy of 76.1%, which is 2.9 points lower than DeiT-S → SimA. We believe this is caused by
multiplying large matrices without any normalization. A similar instability is also
noted in XCiT [2], where the network becomes unstable when the norm of the tokens is not controlled.

4.6 Visualization

The dot product Q̂K̂^T is correlated with the magnitudes of the Q̂ and K̂ vectors. Hence, we believe
this magnitude can highlight the important tokens or image regions. This can be seen as a form of
explanation or saliency map. First, we extract Q̂ and K̂ from the last layer of the transformer (layer 12).
Then, we calculate the ℓ2-norm of Q̂ along the channel dimension to get a single non-negative scalar
for each token. We reshape this N × 1 vector to the image shape, up-sample it to the original image size,
normalize it to the range [0, 1], and overlay it on the image as a heatmap. We repeat the same for K̂.
As shown qualitatively in Fig. 3, such a visualization highlights the important regions of the image.
Moreover, we apply the same visualization to standard DeiT with the ℓ2-norm of both Q and K and
get a relatively flat heatmap, shown in Fig. 3. Note that we report these
results for a qualitative understanding of the model and do not evaluate them quantitatively or compare
them with other network explanation methods. Also, note that the comparison with DeiT is not fair
since in DeiT, Q and K are not necessarily comparable, as the normalization happens in the Softmax
operation after multiplying them.
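A rough sketch of this visualization step could look as follows (function and argument names are illustrative; the class token, if present, is assumed to be dropped beforehand):

import torch.nn.functional as F

def token_saliency(q_hat, img_size, patch_size=16):
    # q_hat: (N x D) l1-normalized query matrix from the last layer,
    # with N = (img_size / patch_size) ** 2 patch tokens.
    scores = q_hat.norm(p=2, dim=-1)                      # l2-norm per token, shape (N,)
    h = w = img_size // patch_size
    heatmap = scores.reshape(1, 1, h, w)                  # back to the patch grid
    heatmap = F.interpolate(heatmap, size=(img_size, img_size), mode="bilinear",
                            align_corners=False)          # up-sample to the image size
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    return heatmap[0, 0]                                  # (img_size x img_size), in [0, 1]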

4.7 Simple Pseudocode of SimA:

Our code is publicly available2 . Since our method is simple, we include the pseudocode of SimA:

Algorithm 1 Pseudocode of SimA (Single Head) in a PyTorch-like style.

# self.qkv: nn.Linear(dim, dim * 3, bias=qkv_bias)      ; query, key, value projection
# self.proj: nn.Linear(dim, dim, bias=output_proj_bias) ; output projection

def forward(self, x):
    B, N, D = x.shape  # B: batch size, N: number of tokens, D: dimension of tokens
    qkv = self.qkv(x).reshape(B, N, 3, D).permute(2, 0, 1, 3)  # (3 x B x N x D)
    q, k, v = qkv[0], qkv[1], qkv[2]  # split into query, key and value, each (B x N x D)

    q = torch.nn.functional.normalize(q, p=1.0, dim=-2)  # l1-normalized query (B x N x D)
    k = torch.nn.functional.normalize(k, p=1.0, dim=-2)  # l1-normalized key (B x N x D)

    if N < D:
        x = (q @ k.transpose(-2, -1)) @ v  # (B x N x D), cost O(N^2 D)
    else:
        x = q @ (k.transpose(-2, -1) @ v)  # (B x N x D), cost O(N D^2)

    x = self.proj(x)  # output (B x N x D)
    return x
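For reference, a self-contained multi-head variant could be sketched as below, following the description in Section 2.2 (per-head ℓ1 normalization, per-head attention, concatenation, and output projection); the class name and defaults are illustrative rather than the official implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimAAttention(nn.Module):
    # Illustrative multi-head SimA block; see the official repository for the exact code.
    def __init__(self, dim, num_heads=6, qkv_bias=True):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        H, d = self.num_heads, D // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, H, d).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]              # each (B x H x N x d)

        q = F.normalize(q, p=1.0, dim=-2)             # l1-normalize each channel over tokens
        k = F.normalize(k, p=1.0, dim=-2)

        # pick the multiplication order with the lower cost for this head size
        if N < d:
            out = (q @ k.transpose(-2, -1)) @ v       # O(N^2 d) per head
        else:
            out = q @ (k.transpose(-2, -1) @ v)       # O(N d^2) per head

        out = out.transpose(1, 2).reshape(B, N, D)    # concatenate heads
        return self.proj(out)

# quick shape check
x = torch.randn(2, 196, 384)
print(SimAAttention(384)(x).shape)  # torch.Size([2, 196, 384])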

5 Conclusion
We introduced SimA, a simple attention block that does not involve the exp(.) operation, to reduce the
computational cost of transformers, particularly on edge devices. SimA normalizes the key
and query matrices before multiplying them, enabling dynamic switching between O(DN^2) and
O(D^2 N) computation depending on the number of tokens (e.g., the image resolution). Our extensive
experiments show that while reducing the cost of inference, SimA achieves on-par results compared to SOTA
methods on various benchmarks including ImageNet classification, MS-COCO object detection and
segmentation, and self-supervised learning. Moreover, a single-head variation of SimA, which is even
simpler, achieves on-par accuracy compared to SOTA multi-head attention models. We believe
SimA can encourage research in this direction, leading to easier adoption of transformers on edge
devices with limited resources.
Societal Impact: Our core motivation is to reduce the computation at inference, which has
positive societal impacts including democratizing AI by reducing the need for computational resources
and reducing the carbon footprint of models at inference time. However, similar to most AI
methods, it can be harmful in the hands of an adversary. For example, it can enable running transformers
on low-cost cameras that may be used in military or surveillance applications.
Acknowledgments: This material is based upon work partially supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR00112190135, the United States Air
Force under Contract No. FA8750-19-C-0098, funding from SAP SE, and NSF grants 1845216 and
1920079. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the United States Air Force, DARPA,
or other funding agencies. Moreover, we would like to thank K L Navaneet, Vipin Pillai, and Kossar
Pourahmadi for the valuable discussions and proof-reading the paper.

References
[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. “Compress: Self-
supervised learning by compressing representations”. In: Advances in Neural Information
Processing Systems 33 (2020), pp. 12980–12992.
[2] Alaaeldin Ali et al. “Xcit: Cross-covariance image transformers”. In: Advances in neural
information processing systems 34 (2021).
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer normalization”. In: arXiv
preprint arXiv:1607.06450 (2016).
2 https://github.com/UCDvision/sima

[4] Kunal Banerjee et al. “Exploring alternatives to softmax function”. In: arXiv preprint
arXiv:2011.11538 (2020).
[5] Mathilde Caron et al. “Emerging properties in self-supervised vision transformers”. In: Pro-
ceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 9650–
9660.
[6] Kai Chen et al. “MMDetection: Open mmlab detection toolbox and benchmark”. In: arXiv
preprint arXiv:1906.07155 (2019).
[7] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–
1258.
[8] Krzysztof Choromanski et al. “Rethinking attention with performers”. In: arXiv preprint
arXiv:2009.14794 (2020).
[9] Xiangxiang Chu et al. “Twins: Revisiting the Design of Spatial Attention in Vision Transform-
ers”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34.
Curran Associates, Inc., 2021, pp. 9355–9366. URL: https://proceedings.neurips.cc/
paper/2021/file/4e0928de075538c593fbdabb0c5ef2c3-Paper.pdf.
[10] Ekin D Cubuk et al. “Randaugment: Practical automated data augmentation with a reduced
search space”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops. 2020, pp. 702–703.
[11] Jia Deng et al. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE confer-
ence on computer vision and pattern recognition. Ieee. 2009, pp. 248–255.
[12] Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition
at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
[13] Gaoming Du et al. “Efficient softmax hardware architecture for deep neural networks”. In:
Proceedings of the 2019 on Great Lakes Symposium on VLSI. 2019, pp. 75–80.
[14] Mohsen Fayyaz et al. “Ats: Adaptive token sampling for efficient vision transformers”. In:
arXiv preprint arXiv:2111.15667 (2021).
[15] Yue Gao, Weiqiang Liu, and Fabrizio Lombardi. “Design and implementation of an approxi-
mate softmax layer for deep neural networks”. In: 2020 IEEE International Symposium on
Circuits and Systems (ISCAS). IEEE. 2020, pp. 1–5.
[16] Benjamin Graham et al. “LeViT: a Vision Transformer in ConvNet’s Clothing for Faster
Inference”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision.
2021, pp. 12259–12269.
[17] Jean-Bastien Grill et al. “Bootstrap your own latent: A new approach to self-supervised
learning”. In: arXiv preprint arXiv:2006.07733 (2020).
[18] Jianyuan Guo et al. “Hire-mlp: Vision mlp via hierarchical rearrangement”. In: arXiv preprint
arXiv:2108.13341 (2021).
[19] Tae Jun Ham et al. “A^3: Accelerating attention mechanisms in neural networks with approxi-
mation”. In: 2020 IEEE International Symposium on High Performance Computer Architecture
(HPCA). IEEE. 2020, pp. 328–341.
[20] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 770–778.
[21] Kaiming He et al. “Mask r-cnn”. In: Proceedings of the IEEE international conference on
computer vision. 2017, pp. 2961–2969.
[22] Kaiming He et al. “Masked autoencoders are scalable vision learners”. In: arXiv preprint
arXiv:2111.06377 (2021).
[23] Kaiming He et al. “Momentum contrast for unsupervised visual representation learning”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020,
pp. 9729–9738.
[24] Dan Hendrycks and Kevin Gimpel. “Gaussian error linear units (gelus)”. In: arXiv preprint
arXiv:1606.08415 (2016).
[25] Jonathan Ho et al. “Axial attention in multidimensional transformers”. In: arXiv preprint
arXiv:1912.12180 (2019).
[26] Andrew G Howard et al. “Mobilenets: Efficient convolutional neural networks for mobile
vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).

[27] Gao Huang et al. “Deep networks with stochastic depth”. In: European conference on computer
vision. Springer. 2016, pp. 646–661.
[28] Angelos Katharopoulos et al. “Transformers are rnns: Fast autoregressive transformers with
linear attention”. In: International Conference on Machine Learning. PMLR. 2020, pp. 5156–
5165.
[29] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. “Mean shift for
self-supervised learning”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2021, pp. 10326–10335.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25
(2012).
[31] Tsung-Yi Lin et al. “Feature pyramid networks for object detection”. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2017, pp. 2117–2125.
[32] Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference
on computer vision. Springer. 2014, pp. 740–755.
[33] Ye Lin et al. “Towards fully 8-bit integer inference for the transformer model”. In: arXiv
preprint arXiv:2009.08034 (2020).
[34] Ze Liu et al. “Swin transformer: Hierarchical vision transformer using shifted windows”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 10012–
10022.
[35] Zhuang Liu et al. “A ConvNet for the 2020s”. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) (2022).
[36] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint
arXiv:1711.05101 (2017).
[37] Jiachen Lu et al. “Soft: Softmax-free transformer with linear complexity”. In: Advances in
Neural Information Processing Systems 34 (2021).
[38] Wenhao Lu, Jian Jiao, and Ruofei Zhang. “Twinbert: Distilling knowledge to twin-structured
bert models for efficient retrieval”. In: arXiv preprint arXiv:2002.06275 (2020).
[39] Dmitrii Marin et al. “Token pooling in vision transformers”. In: arXiv preprint
arXiv:2110.03860 (2021).
[40] Sachin Mehta and Mohammad Rastegari. MobileViT: Light-weight, General-purpose, and
Mobile-friendly Vision Transformer. 2021. arXiv: 2110.02178 [cs.CV].
[41] Adam Paszke et al. “Pytorch: An imperative style, high-performance deep learning library”.
In: Advances in neural information processing systems 32 (2019).
[42] Hao Peng et al. “Random feature attention”. In: arXiv preprint arXiv:2103.02143 (2021).
[43] Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh. “Fully quantized transformer for
machine translation”. In: arXiv preprint arXiv:1910.10485 (2019).
[44] Zhen Qin et al. “cosFormer: Rethinking Softmax in Attention”. In: arXiv preprint
arXiv:2202.08791 (2022).
[45] Ilija Radosavovic et al. “Designing network design spaces”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2020, pp. 10428–10436.
[46] Yongming Rao et al. “Dynamicvit: Efficient vision transformers with dynamic token sparsifi-
cation”. In: Advances in neural information processing systems 34 (2021).
[47] Zhuoran Shen et al. “Efficient attention: Attention with linear complexities”. In: Proceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021, pp. 3531–3539.
[48] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale
image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[49] Jacob R Stevens et al. “Softermax: Hardware/Software Co-Design of an Efficient Softmax for
Transformers”. In: 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE. 2021,
pp. 469–474.
[50] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural
networks”. In: International conference on machine learning. PMLR. 2019, pp. 6105–6114.
[51] Ajinkya Tejankar et al. “ISD: Self-Supervised Learning by Iterative Similarity Distillation”.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Oct.
2021, pp. 9609–9618.

[52] Ilya O Tolstikhin et al. “Mlp-mixer: An all-mlp architecture for vision”. In: Advances in Neural
Information Processing Systems 34 (2021).
[53] Hugo Touvron et al. “Going deeper with image transformers”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 32–42.
[54] Hugo Touvron et al. “Resmlp: Feedforward networks for image classification with data-efficient
training”. In: arXiv preprint arXiv:2105.03404 (2021).
[55] Hugo Touvron et al. “Training data-efficient image transformers & distillation through atten-
tion”. In: International Conference on Machine Learning. PMLR. 2021, pp. 10347–10357.
[56] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information process-
ing systems 30 (2017).
[57] Sinong Wang et al. “Linformer: Self-attention with linear complexity”. In: arXiv preprint
arXiv:2006.04768 (2020).
[58] Wenhai Wang et al. “Pyramid vision transformer: A versatile backbone for dense predic-
tion without convolutions”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2021, pp. 568–578.
[59] Ross Wightman. Pytorch image models. 2019.
[60] Haiping Wu et al. “CvT: Introducing Convolutions to Vision Transformers”. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision (ICCV). Oct. 2021, pp. 22–31.
[61] Haiping Wu et al. “Cvt: Introducing convolutions to vision transformers”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision. 2021, pp. 22–31.
[62] Zhenda Xie et al. “Self-supervised learning with swin transformers”. In: arXiv preprint
arXiv:2105.04553 (2021).
[63] Yunyang Xiong et al. “Nyströmformer: A Nyström-based Algorithm for Approximating
Self-Attention”. In: (2021).
[64] Li Yuan et al. “Tokens-to-Token ViT: Training Vision Transformers From Scratch on Im-
ageNet”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision
(ICCV). Oct. 2021, pp. 558–567.
[65] Sangdoo Yun et al. “Cutmix: Regularization strategy to train strong classifiers with localizable
features”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019,
pp. 6023–6032.
[66] Ofir Zafrir et al. “Q8bert: Quantized 8bit bert”. In: 2019 Fifth Workshop on Energy Efficient
Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS). IEEE. 2019,
pp. 36–39.
[67] Hongyi Zhang et al. “mixup: Beyond empirical risk minimization”. In: arXiv preprint
arXiv:1710.09412 (2017).
[68] Pengchuan Zhang et al. “Multi-Scale Vision Longformer: A New Vision Transformer for High-
Resolution Image Encoding”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV). Oct. 2021, pp. 2998–3008.
[69] Huangjie Zheng et al. “Mixing and shifting: Exploiting global and local dependencies in vision
MLPs”. In: arXiv preprint arXiv:2202.06510 (2022).
[70] Danyang Zhu et al. “Efficient precision-adjustable architecture for softmax function in deep
learning”. In: IEEE Transactions on Circuits and Systems II: Express Briefs 67.12 (2020),
pp. 3382–3386.

SimA: Simple Softmax-free Attention for Vision Transformers
Appendix

Visualization: Figure A1 provides more results similar to Figure 3. Please see Section 4.6 for details.

Figure A1: Our method (SimA): We extract Q̂ and K̂ from layer 12 of the transformer. We take the ℓ2-norm
of each token of Q̂ and K̂, normalize it to the range [0, 1], and overlay it as a heatmap on the image.
Interestingly, the magnitude of a token represents its significance in our method. Note that
all images are randomly selected from the MS-COCO test set without any visual inspection or
cherry-picking.
