Efficient Transformers: A Survey
Yi Tay [email protected]
Google Research
Mostafa Dehghani [email protected]
Google Research, Brain team
Dara Bahri [email protected]
Google Research
Editor: Preprint
Abstract
Transformer model architectures have garnered immense interest lately due to their effec-
tiveness across a range of domains like language, vision and reinforcement learning. In the
field of natural language processing for example, Transformers have become an indispens-
able staple in the modern deep learning stack. Recently, a dizzying number of “X-former”
models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few -
which improve upon the original Transformer architecture, many of them targeting
computational and memory efficiency. With the aim of helping the avid
researcher navigate this flurry, this paper characterizes a large and thoughtful selection of
recent efficiency-flavored “X-former” models, providing an organized and comprehensive
overview of existing work and models across multiple domains.
Keywords: Deep Learning, Natural Language Processing, Transformer Models, Atten-
tion Models
1. Introduction
Transformers (Vaswani et al., 2017) are a formidable force in the modern deep learning
stack. Transformers are pervasive and have made tremendous impact in many fields such
as language understanding (Devlin et al., 2018; Brown et al., 2020; Raffel et al., 2019) and
image processing (Parmar et al., 2018; Carion et al., 2020). As such, it is only natural that
a wealth of research has been dedicated to making fundamental improvements to the model
over the past few years (Dehghani et al., 2018; So et al., 2019; Ahmed et al., 2017). This
immense interest has also spurred research into more efficient variants of the model (Kitaev
et al., 2020; Roy et al., 2020; Beltagy et al., 2020; Katharopoulos et al., 2020; Tay et al.,
2020b; Wang et al., 2020b; Rae et al., 2020).
There has been such a surge of Transformer model variants proposed recently, that
researchers and practitioners alike may find it challenging to keep pace with the rate of
innovation. As of this writing (circa August 2020), there have been nearly a dozen new
efficiency-focused models proposed in just the past 6 months. Thus, a survey of the existing
literature is both beneficial for the community and quite timely.
2. Background on Transformers
This section provides an overview of the well-established Transformer architecture (Vaswani
et al., 2017). Transformers are multi-layered architectures formed by stacking Transformer
blocks on top of one another.
Transformer blocks are characterized by a multi-head self-attention mechanism, a position-
wise feed-forward network, layer normalization (Ba et al., 2016) modules and residual con-
nectors. The input to the Transformer model is often a tensor of shape R^{B×N}, where B
is the batch size and N the sequence length.
The input first passes through an embedding layer that converts each one-hot token
representation into a d-dimensional embedding, i.e., a tensor of shape R^{B×N×d}. The new tensor is
then additively composed with positional encodings, and passed through a multi-headed
self-attention module. Positional encodings can take the form of a sinusoidal input (as
per (Vaswani et al., 2017)) or be trainable embeddings.
The inputs and output of the multi-headed self-attention module are connected by
residual connectors and a layer normalization layer. The output of the multi-headed self-
attention module is then passed to a two-layered feed-forward network which has its in-
puts/outputs similarly connected in a residual fashion with layer normalization. The sub-
layer residual connector with layer norm is expressed as:
X = LayerNorm(F_S(X)) + X
where F_S is the sub-layer module, which is either the multi-headed self-attention or the
position-wise feed-forward layer.
The outputs of all heads are then concatenated and projected as W_o[A_1 · · · A_{N_h}], where W_o is an
output linear projection. Note that the computation of A is typically done in a parallel fashion by
considering tensors of shape R^{B×N×N_h×(d/N_h)} and computing the linear transforms for all
heads in parallel.
The attention matrix A = QK^⊤ is chiefly responsible for learning alignment scores
between tokens in the sequence. In this formulation, the dot product between each ele-
ment/token in the query (Q) and key (K) is taken. This drives the self-alignment process
in self-attention whereby tokens learn to gather from each other.
On the scalability of Self-Attention At this point, it is apparent that the memory and
computational complexity required to compute the attention matrix is quadratic in the input
sequence length, i.e., N × N. In particular, the QK^⊤ matrix multiplication operation alone
consumes N^2 time and memory. This restricts the overall utility of self-attentive models
in applications which demand the processing of long sequences. In subsequent sections, we
discuss methods that reduce the cost of self-attention.
X_A = LayerNorm(MultiheadSelfAttention(X)) + X
X_B = LayerNorm(PositionFFN(X_A)) + X_A
where X is the input of the Transformer block and X_B is the output of the Transformer
block.
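To make the composition above concrete, the following is a minimal NumPy sketch of a single Transformer block under the sub-layer formulation given above. It is single-head and batch-free, and the layer_norm/softmax helpers, parameter names, and shapes are illustrative simplifications rather than a reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d-dimensional vector (learned scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Full self-attention: the N x N matrix A = Q K^T is the quadratic-cost term.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # shape (N, N)
    return A @ V

def transformer_block(X, p):
    # X_A = LayerNorm(SelfAttention(X)) + X
    XA = layer_norm(self_attention(X, p["Wq"], p["Wk"], p["Wv"])) + X
    # X_B = LayerNorm(FFN(X_A)) + X_A, with a two-layer position-wise feed-forward network.
    H = np.maximum(0, XA @ p["W1"] + p["b1"])
    return layer_norm(H @ p["W2"] + p["b2"]) + XA

N, d, d_ff = 8, 16, 32
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=s) for k, s in
          [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
           ("W1", (d, d_ff)), ("b1", (d_ff,)), ("W2", (d_ff, d)), ("b2", (d,))]}
X = rng.normal(size=(N, d))
print(transformer_block(X, params).shape)  # (8, 16)
```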
[Figure: A taxonomy of efficient Transformer architectures, including Transformer-XL (Dai et al., 2019), Compressive Transformer (Rae et al., 2020), Memory Compressed (Liu et al., 2018), Set Transformer (Lee et al., 2019), ETC (Ainslie et al., 2020), Big Bird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020), Routing Transformer (Roy et al., 2020), Reformer (Kitaev et al., 2020), Sinkhorn Transformer (Tay et al., 2020b), Sparse Transformer (Child et al., 2019), Blockwise Transformer (Qiu et al., 2019), Image Transformer (Parmar et al., 2018), Axial Transformer (Ho et al., 2019), Linformer (Wang et al., 2020b), Linear Transformer (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), and Synthesizer (Tay et al., 2020a).]
• Memory - Another prominent method is to leverage a side memory module that can
access multiple tokens at once. A common form is global memory which is able to
access the entire sequence. The global tokens act as a form of memory that learns to
gather from input sequence tokens. This was first introduced in Set Transformers (Lee
et al., 2019) as the inducing points method. These parameters are often interpreted
as “memory” and are used as a form of temporary context for future processing.
This can be thought of as a form of parameter attention (Sukhbaatar et al., 2019).
Global memory is also used in ETC (Ainslie et al., 2020) and Longformer (Beltagy
et al., 2020). With a limited number of memory tokens (or inducing points), we are able
to perform a preliminary pooling-like operation to compress the input sequence - a neat
trick to have at one’s disposal when designing efficient self-attention modules.
1. We note that this is also often referred to as factorization approaches, e.g., in (Child et al., 2019). We
decide to refer to this class of models as combination approaches because (1) it is a better fit to what these
models are actually doing and (2) to avoid confusion with matrix factorization or low-rank approaches.
classic example of this technique, as it projects the length dimension of keys and val-
ues to a lower-dimensional representation (N → k). It is easy to see that the low-rank
method ameliorates the memory complexity problem of the self-attention because the
N × N matrix is now decomposed to N × k.
We note that these buckets are a broad characterization of the different efficient Transformer
models. In reality, there is no sharp boundary between these groups and models may be
composed of multiple different technical innovations. For example, the k-means clustering in
the Routing Transformer (Roy et al., 2020) can also be interpreted as a form of global memory
approach, since one can view the centroids as parameterized memory. In the Reformer, however,
bucketing is used to learn the sparsity pattern of the attention weights. Additionally, pooling (Liu
et al., 2018) can also be interpreted as a form of memory model.
Local Attention Span A straightforward solution for dealing with long sequences in
Transformers is to limit the attention span to a local neighborhood. Liu et al. (2018)
proposed dividing the input sequence into blocks of similar length and running self-attention
within each block independently. This keeps the cost of attention per block constant, so
the number of activations scales linearly with the input length.
Computation and Memory Complexity For a block size of b, the computational
and memory cost of self-attention within each block is O(b^2); given that there are n/b blocks, the
computational and memory cost of local attention is O(b · n). For memory-compressed
attention, applying a convolution with kernel size and stride of k reduces the computational and
memory cost of the attention mechanism to O(n · n/k).
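A minimal sketch of the block-local scheme just described, assuming the sequence length n is divisible by the block size b; the helper names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_local_attention(Q, K, V, b):
    """Run attention independently within each contiguous block of size b.

    Cost is O(b^2) per block and O(n * b) overall, versus O(n^2) for full attention.
    Assumes n is divisible by b for simplicity.
    """
    n, d = Q.shape
    out = np.empty_like(V)
    for start in range(0, n, b):
        sl = slice(start, start + b)
        A = softmax(Q[sl] @ K[sl].T / np.sqrt(d), axis=-1)  # (b, b) block of weights
        out[sl] = A @ V[sl]
    return out
```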
Localized Attention Span Limiting the receptive field to a local neighborhood (Parmar
et al., 2018, 2019) addresses the issue with the computation and memory cost of running
global self-attention on large inputs, but changing the neighborhood per query position
would prohibit packing the computations of the self-attention into two matrix multiplica-
tions. To avoid that, Image Transformer proposes partitioning the inputs into “query blocks”
and their associated “memory blocks”, where for all queries from a single query block, the
model attends to the same memory block. There are two different schemes for choosing
query blocks and their associated memory block neighborhoods: 1-dimensional local atten-
tion and 2-dimensional local attention. Here we briefly explain these schemes in the decoder
case.
For the 1-dimensional local attention, the image is flattened in raster order and
partitioned into non-overlapping query blocks Q of length l_q, and for each query block,
a memory block M is built from the same pixels as in Q as well as a fixed number of
pixels, l_m, generated before the query pixel. In the 2-dimensional local attention, the image is
partitioned into multiple non-overlapping rectangular query blocks of length l_q = w_q × h_q.
The memory block extends the query block to the top and left by h_m and w_m pixels and to
the right by w_m pixels, so l_m = (w_q × h_q) + 2 × (h_m + w_m). Each query pixel can attend
to all other pixels. In the 2-dimensional local attention, pixels in the image are generated one
query block after another. Generated blocks are in raster order, as are the generated pixels
inside every block.
Restrictions The Image Transformer, and in general restricting the context of the attention
mechanism to a local neighborhood, decreases the cost of memory and computation at
the price of losing the global receptive field. This can be an issue when global information
is required to solve the task. Moreover, local attention still has quadratic complexity with
respect to the region length, which introduces an extra hyperparameter into the trade-off
between performance and computational complexity.
process literature to reduce the complexity of attention from quadratic to linear in the size
of the input set.
Problems involving sets of objects often have a permutation invariance property: the
target value for the set is the same regardless of the order of the objects in the set. Zaheer
et al. (2017) proved that all permutation-invariant functions can be represented by the
following functional form:
net({x_1, . . . , x_n}) = ρ(pool({φ(x_1), . . . , φ(x_n)}))
where the pooling function pool is a simple summation and φ and ρ are continuous functions.
This form can be interpreted as the composition of an encoder φ and a decoder ρ(pool(·)).
While this form is a universal approximator in the space of permutation-invariant functions,
it is unclear how well such models will fit tasks in practice, given a limited capacity. The
Set Transformer proposes a solution that can be viewed as an encoder + pooled decoder, but
where, unlike the form given above, the encoder and decoder can attend to input elements
individually and the pooling function is parameterized.
Attention Blocks The model introduces the following constructs: “Multihead Attention
Block” (MAB), “Set Attention Block” (SAB), “Induced Set Attention Block” (ISAB), and
“Pooling by Multihead Attention” (PMA). They are defined as follows:
MAB(X, Y) = LayerNorm(H + rFF(H)), where H = LayerNorm(X + Multihead(X, Y, Y)),
SAB(X) = MAB(X, X),
ISAB_m(X) = MAB(X, MAB(I_m, X)),
PMA_k(X) = MAB(S_k, rFF(X)),
where rFF is a row-wise feed-forward network, I_m denotes m trainable inducing points, and S_k denotes k trainable seed vectors.
It is straightforward to see that both ISAB and SAB are permutation equivariant - in other
words, if the input is permuted in some way then the corresponding output of the block
is permuted in exactly the same way. Meanwhile, the pooling layer PMA is permutation
invariant. Since functional composition, i.e. layering, preserves these properties, the Set
Transformer encoder-decoder combination is permutation invariant.
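For intuition, here is a heavily simplified, single-head sketch of the inducing-point idea behind ISAB. It omits the rFF and layer-norm components of the full MAB, and I_m would be a trainable parameter in practice; names and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Plain single-head dot-product attention (the full MAB also adds rFF + layer norm).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def isab(X, I):
    """Induced Set Attention Block (simplified).

    X: (n, d) input set; I: (m, d) trainable inducing points with m << n.
    Cost is O(nm) instead of the O(n^2) of a SAB that attends X to itself.
    """
    H = attend(I, X, X)     # m inducing points summarize the n inputs: (m, d)
    return attend(X, H, H)  # the inputs read the summary back: (n, d)
```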
Efficiency We can understand the m inducing points I_m learned in each ISAB layer as a
form of static memory. In addition to reducing the O(n^2) complexity of the self-attending
SAB layer to O(mn), a reduction particularly valuable when the input set is large, the
inducing points effectively encode some global structure that helps explain the inputs. For
example, in the problem of amortized clustering, where one attempts to learn to map an
input set of points to the centers of clusters of points inside the set, the inducing points
learned could be appropriately distributed so that the encoder can effectively compare query
elements with each other implicitly via their proximity to the inducing points.
The trainable k seeds S_k used in the pooling layer PMA_k can be viewed as static memory
in a similar light, reducing the memory and runtime complexity of the architecture.
Figure 4: Illustration of patterns of the attention matrix for dense self-attention in Transformers and sparse fixed attention in Sparse Transformers.
where A_ij is the attention weight between q_i and k_j and ⌊·⌋ denotes the floor operation. In this case,
we only compute the attention if ⌊j/N⌋ = ⌊i/N⌋ (i.e., i and j lie within the same block).
Strided Attention Heads The other half of the heads are dedicated to fixed strided
patterns. This is also referred to as strides. Concretely,
Â_ij = Q_i K_j^⊤  if (i − j) mod N = 0,  and  Â_ij = 0  otherwise.
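Both halves of the fixed pattern can be viewed as boolean masks over the attention matrix. The sketch below builds the local (same-block) and strided patterns for illustration, with N playing the role of the block/stride length as in the notation above; the function and variable names are assumptions.

```python
import numpy as np

def sparse_fixed_masks(seq_len, stride):
    """Boolean masks for the two fixed Sparse Transformer patterns.

    local[i, j]   is True when i and j fall in the same length-`stride` block.
    strided[i, j] is True when (i - j) is a multiple of `stride`.
    """
    idx = np.arange(seq_len)
    local = (idx[:, None] // stride) == (idx[None, :] // stride)
    strided = ((idx[:, None] - idx[None, :]) % stride) == 0
    return local, strided

local, strided = sparse_fixed_masks(seq_len=16, stride=4)
# Each row keeps only O(stride) local entries and O(seq_len / stride) strided entries.
print(local.sum(), strided.sum())
```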
Figure 5: Attention span in the Axial Transformer on a two-dimensional input (row attention and column attention).
An advantage of the Axial Transformer over similar methods like the Sparse Transformer is that
while it provides a global receptive field, it is straightforward to implement and does
not require custom kernels for an efficient implementation.
3.2.6 Longformer
Longformer (Beltagy et al., 2020) is a variant of Sparse Transformer. Its key distinction
compared to Sparse Transformer is “Dilated Sliding Windows”, which can enable better
long-range coverage without sacrificing sparsity. This is achieved by increasing the receptive
fields by having gaps in the attention patterns. The Longformer also gradually increases the
receptive field as the model goes deeper, dedicating lower levels for modeling local patterns
and upper levels for modeling global patterns.
Global Attention For classification tasks, the Longformer adopts global tokens (e.g., the
CLS token) that have access to the entire input sequence.
Parameter and Memory Complexity The complexity of the model is reduced from
O(n^2) to O(nk), where k is the size of the window. When using global attention, the
Longformer creates another set of query-key-value projections for this global attention,
doubling the cost of the parameters at the attention layer.
Memory and Parameter Complexity The memory complexity of the ETC model is
O(n_g^2 + n_g N), where n_g is the number of global tokens and N is the input sequence length.
Restrictions Intuitively, it is easy to observe that ETC cannot be used for auto-regressive
decoding. This is because causal masks cannot be computed in the presence of the global
attention.
3.2.8 BigBird
The BigBird model (Zaheer et al., 2020) is another Transformer for modeling longer se-
quences and is primarily built on top of ETC (Ainslie et al., 2020). The BigBird model com-
prises several key components, namely (1) global tokens, (2) random attention (queries
attend to random keys) and (3) fixed patterns (local sliding windows).
Global Attention Fundamentally, the idea of using global memory can be traced all the
way back to the Longformer/ETC and Set Transformer models. Notably, the global memory in
BigBird is extended to contain tokens within the sequence, instead of simply parameterized
memory. The authors call this the ‘internal transformer construction (ITC)’, in which a
subset of indices is selected as global tokens. This can be interpreted as a memory-based
approach.
Sliding Window Attention The windowed attention was first proposed in early local
attention models (the Image Transformer, Memory Compressed Attention and/or the Sparse
Transformer). In BigBird, each query attends to w/2 tokens to the left and w/2 tokens to the
right. This corresponds to a fixed pattern (FP) approach.
Random Attention Finally, each query attends to r random keys. This pattern is fixed.
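Putting the three components together, a rough sketch of how a combined BigBird-style boolean mask could be assembled; the window size w, the r random keys per query, and the set of global indices are illustrative parameters rather than the exact construction used in the paper.

```python
import numpy as np

def bigbird_mask(n, w, r, global_idx, rng):
    """Combine sliding-window, random, and global attention into one boolean mask."""
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= w // 2         # sliding window (w/2 left, w/2 right)
    for q in range(n):                                        # r random keys per query
        mask[q, rng.choice(n, size=r, replace=False)] = True
    mask[global_idx, :] = True                                # global tokens attend everywhere
    mask[:, global_idx] = True                                # and are attended to by every token
    return mask

mask = bigbird_mask(n=64, w=8, r=3, global_idx=np.array([0, 1]),
                    rng=np.random.default_rng(0))
print(mask.mean())  # fraction of the full n x n matrix that is actually computed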
Memory and Parameter Complexity The memory complexity of the self-attention
is linear, i.e., O(n). The BigBird model does not introduce new parameters beyond the
Transformer model.
Restrictions Similar to ETC, the BigBird model cannot be used to autoregressively
decode, hence qualifying it as an encoder-only model.
where C_i is the cluster that vector R_i is assigned to. In other words, the token at i only
attends to tokens in the same cluster.
Memory and Parameter Complexity The Routing Transformer introduces additional
parameters in the clustering mechanism, namely k × d centroid vectors and a projection
matrix W_r. The memory complexity is O(n^{1.5}).
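As a rough sketch of the routing idea, the cluster-restricted attention mask can be built from a nearest-centroid assignment; the single assignment step below stands in for the online k-means used by the actual model, and all names are illustrative.

```python
import numpy as np

def routing_mask(R, centroids):
    """R: (n, d) projected token vectors; centroids: (k, d) cluster centroids.

    Token i may attend to token j only if both are assigned to the same centroid,
    mirroring the cluster-restricted attention of the Routing Transformer.
    """
    # Nearest-centroid assignment (a single k-means assignment step).
    d2 = ((R[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k) squared distances
    assign = d2.argmin(-1)                                       # (n,) cluster ids
    return assign[:, None] == assign[None, :]                    # (n, n) boolean mask
```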
3.2.10 Reformer
Reformer (Kitaev et al., 2020) is another efficient attention model based on locality sen-
sitive hashing (LSH). The Reformer also introduces reversible Transformer layers, which
contributes to further reducing its memory footprint.
LSH Attention The LSH attention introduces parameter-sharing between query and
keys. It hashes the query-keys into buckets using a random-projection based hashing func-
tion. The key idea is that nearby vectors should obtain a similar hash while distant vectors
should not, hence the term ‘locality sensitive’. To perform hashing, a random matrix
R ∈ R^{k×b/2} is first introduced. Next, the hashing function is defined as:
h(x) = arg max([xR; −xR])
where [; ] denotes the concatenation of two vectors. For all queries, attention is computed if and
only if the query and key hashes match, i.e., h(qi ) = h(kj ). In other words, attention
is computed amongst query and keys if they fall in the same hash bucket. In order to
maintain causal masking (the ability to auto-regressively decode), the Reformer assigns
and maintains a position index for every query/key. It is therefore able to check whether each
query-key comparison is auto-regressively valid.
Memory Efficiency with LSH Attention The key idea behind LSH attention is to
classify tokens into buckets and then process them bucket by bucket in a chunked fashion.
To this end, queries are first sorted by bucket number and then by sequence order within
the same bucket. During computation, tokens only attend to tokens of the same bucket within their
own chunk and the previous chunk. The chunking and sorted bucketing techniques help to improve the
overall efficiency of the Reformer model.
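A minimal sketch of the hashing and bucket-sorting steps described above (the chunked attention computation itself is omitted); the shapes, the number of buckets, and the helper names are illustrative assumptions.

```python
import numpy as np

def lsh_hash(x, R):
    """Angular LSH as used by the Reformer: h(x) = argmax([xR ; -xR]).

    x: (n, d) shared query/key vectors; R: (d, n_buckets // 2) random projection.
    Nearby vectors are likely to land in the same bucket.
    """
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def sort_by_bucket(x, buckets):
    # Sort tokens by bucket id (stable, so sequence order is kept within a bucket);
    # attention can then be computed chunk by chunk over neighbouring sorted tokens.
    order = np.argsort(buckets, kind="stable")
    return x[order], order

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
R = rng.normal(size=(16, 4))          # 8 buckets
buckets = lsh_hash(x, R)
x_sorted, order = sort_by_bucket(x, buckets)
```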
Parameter and Memory Complexity The memory complexity of the Reformer is
O(n log n). In terms of parameter costs, the Reformer shares queries and keys, which
reduces the cost of the QKV transforms by a third. The random projections are not
trainable parameters and hence do not incur parameter costs. Overall, the Reformer has
fewer parameters than vanilla Transformers. The reversible layers in the Reformer also reduce
memory consumption during training by enabling activations to be reconstructed from
those of the next layer. This reduces memory cost since it eliminates the need to store activations
for all layers during backpropagation.
Sorting Network The sorting operator is parameterized by a meta sorting network. Let
X be the input sequence of dimension N × d.
where F_S(·) is a parameterized function such as a two-layer feed-forward network with ReLU
activation. The output of F_S(·) is a tensor of size n_B × n_B. The BlockSum function learns the
sum embeddings of local blocks. The BlockShape function reshapes the input tensor into
R^{N×d} → R^{n_B×b×d}. Here, we note that N = n_B × b, where b is the size of the block and n_B
is the total number of blocks.
Sinkhorn Sorting φ is the Sinkhorn balancing operator (Sinkhorn, 1964; Adams and
Zemel, 2011), which converts the n_B × n_B matrix into a soft permutation matrix. Specifi-
cally, a series of row- and column-wise normalizations is applied to the matrix output of
F_S(BlockSum(X)). For the sake of brevity, we do not delve into the details of this operation.
Further details can be found in (Adams and Zemel, 2011; Tay et al., 2020b).
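For intuition only, a tiny sketch of Sinkhorn balancing: alternately normalizing the rows and columns of a non-negative matrix pushes it toward a doubly-stochastic (soft permutation) matrix. The iteration count and input are illustrative.

```python
import numpy as np

def sinkhorn(scores, n_iters=20):
    """Sinkhorn balancing: alternately normalize rows and columns of exp(scores).

    The result approaches a doubly-stochastic matrix, which serves as a soft
    permutation over the n_B blocks in the Sinkhorn Transformer.
    """
    P = np.exp(scores)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # row normalization
        P = P / P.sum(axis=0, keepdims=True)  # column normalization
    return P

P = sinkhorn(np.random.default_rng(0).normal(size=(4, 4)))
print(P.sum(axis=0), P.sum(axis=1))  # both close to a vector of ones
```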
Parameter and Memory Complexity The memory complexity of the Sinkhorn Trans-
former is O(b^2), where b is the block size and b = N/n_B. Additional parameter costs are incurred
by the meta sorting network F_S(·). The number of additional parameters is therefore 2d^2
when a two-layer ReLU network is used as the sorting network.
3.2.12 Linformer
The Linformer (Wang et al., 2020b) is an efficient Transformer based on the idea of low-rank
self-attention.
recurrent neural network (RNN). The model has been shown to improve inference speeds
by up to three orders of magnitude without much loss in predictive performance.
The method rests on the simple but powerful observation that the accumulated value
V'_i for the query Q_i at position i can be written as:
V'_i = ( Σ_{j=1}^{p} sim(Q_i, K_j) V_j ) / ( Σ_{j=1}^{p} sim(Q_i, K_j) ).
Here, p = N in full, unmasked attention and p = i in the case of causal masking. In
usual softmax attention, sim(q, k) = exp(q^⊤k / √d). The Linear Transformer, however, expresses
the similarity as a kernel function, that is, sim(q, k) := φ(q)^⊤φ(k), where φ is a, possibly
high-dimensional, feature map. With this choice, we can rewrite V'_i as:
V'_i = (φ(Q_i)^⊤ S_p) / (φ(Q_i)^⊤ Z_p),
S_p := Σ_{j=1}^{p} φ(K_j) V_j^⊤,
Z_p := Σ_{j=1}^{p} φ(K_j).
For unmasked attention, since p = N, we only need to compute S_N and Z_N once and
reuse them for the computation at every position i. For causal attention, the S_i's
and Z_i's can be viewed as states of an RNN that are updated by the following recurrence
relations:
S_i = S_{i−1} + φ(K_i) V_i^⊤,   Z_i = Z_{i−1} + φ(K_i),
with initial condition S_0 = Z_0 = 0. If the dimensions of the keys, queries, and values are all d
and the cost to compute φ is O(c), then the overall run-time complexity of the Linear Transformer
is O(N cd). The authors choose
φ(x) = elu(x) + 1,
where elu(·) denotes the exponential linear unit (Clevert et al., 2015). With this choice of
feature map, c = d and the end-to-end complexity of the model is O(N d^2). The authors
go further and show that in addition to the forward pass, the backward pass (i.e. gradient
computation) can be achieved in linear time and constant memory by using cumulative
sums. We defer readers interested in this derivation to the original paper.
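A minimal sketch of causal linear attention with the running sums S_i and Z_i and the feature map φ(x) = elu(x) + 1; the explicit Python loop is for clarity and is not how an optimized implementation would be written.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, elementwise and strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal linear attention via the running sums S_i and Z_i.

    Each step costs O(d^2), so the whole sequence costs O(N d^2) time with O(d^2)
    memory for the state, instead of materializing the N x N attention matrix.
    """
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    n, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of phi(K_j) V_j^T
    Z = np.zeros(d)                  # running sum of phi(K_j)
    out = np.empty_like(V)
    for i in range(n):
        S += np.outer(Kf[i], V[i])
        Z += Kf[i]
        out[i] = (Qf[i] @ S) / (Qf[i] @ Z + 1e-6)
    return out
```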
3.2.14 Performer
The Performer (Choromanski et al., 2020) model is characterized by its Generalized Atten-
tion mechanism and its usage of random kernels.
where D̂ = diag(Q′((K′)^⊤ 1_N)), Q′ = D_Q φ(Q^⊤)^⊤, and K′ = D_K φ(K^⊤)^⊤. Note that
D_Q = g(Q_i^⊤) and D_K = h(K_i^⊤). The function φ(x) is defined as:
φ(x) = (c / √M) f(Wx + b)^⊤    (7)
where c > 0 is a constant, W ∈ R^{M×d} is a random feature matrix, and M is the dimension-
ality of this matrix that controls the number of random features. We are able to see that
we do not explicitly compute A = QK^⊤ and hence avoid paying the N^2 cost. For rigorous
theoretical analysis and further details, we refer interested readers to (Choromanski et al.,
2020).
Parameter and Memory Complexity The complexity of the bidirectional FAVOR algo-
rithm is O(Md + Nd + MN), where M is the dimensionality of the random features.
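As an illustrative sketch of the random feature construction in equation (7) and the resulting attention computation, the snippet below instantiates f as a ReLU (one possible choice) and folds the normalizer D̂ into a per-row division; it is a simplification under these assumptions, not the exact FAVOR algorithm.

```python
import numpy as np

def phi(X, W, b, c=1.0):
    """Random feature map phi(x) = (c / sqrt(M)) * f(Wx + b), here with f = ReLU.

    W is an (M, d) random feature matrix and b an (M,) random bias; other choices
    of f yield different kernels (this is only an illustrative instantiation).
    """
    M = W.shape[0]
    return (c / np.sqrt(M)) * np.maximum(0.0, X @ W.T + b)   # (N, M)

def random_feature_attention(Q, K, V, W, b):
    # Unnormalized output Q'((K')^T V) followed by the diagonal normalizer
    # diag(Q'((K')^T 1_N)); the N x N matrix QK^T is never materialized.
    Qp, Kp = phi(Q, W, b), phi(K, W, b)          # (N, M)
    out = Qp @ (Kp.T @ V)                        # O(N M d) instead of O(N^2 d)
    denom = Qp @ Kp.sum(axis=0) + 1e-6           # (N,)
    return out / denom[:, None]
```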
3.2.15 Synthesizers
Synthesizer models (Tay et al., 2020a) are an attempt to study and investigate the true
importance of conditioning within the self-attention mechanism. In (Tay et al., 2020a), the
authors study a synthetic self-attention module in which attention weights are approximated
instead of being computed by pairwise dot products. Synthesizers are only implicitly related
to efficient Transformers. However, the factorized variants can be considered a low-rank
efficient Transformer model.
Dense Synthesizers In the Dense Synthesizer, each token x_i is projected to a vector
of length N using a two-layered non-linear feed-forward network. The computation of the
attention matrix A is described as:
A = W_2(σ_R(W_1(X) + b)) + b    (8)
Random Synthesizers Another variant of the Synthesizer model uses random matrices
for A. In this case, the output can be expressed by:
Y = Softmax(R)G(X). (10)
where R ∈ R^{N×N} is a trainable and/or non-trainable matrix. In (Tay et al., 2020a), the
authors show that Random Synthesizers achieve competitive performance.
Factorized Variants The Dense and Random Synthesizers also come with factorized
variants that consider a low-rank structure of the attention matrix. The factorized Random
Synthesizer can be written as:
Y = Softmax(R_1 R_2^⊤)G(X). (11)
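A small sketch of the Random and factorized Random Synthesizer variants, where G(X) is assumed to be a simple value projection for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_synthesizer(X, R, Wv):
    # Y = Softmax(R) G(X): the attention weights come from a (trainable or fixed)
    # N x N matrix R and do not depend on token-token dot products.
    return softmax(R) @ (X @ Wv)

def factorized_random_synthesizer(X, R1, R2, Wv):
    # Low-rank variant: R is replaced by R1 R2^T with R1, R2 of shape (N, k), k << N.
    return softmax(R1 @ R2.T) @ (X @ Wv)
```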
3.2.16 Transformer-XL
The Transformer-XL model (Dai et al., 2019) relies on segment-based recurrence. Segment-
based recurrence can be considered an orthogonal approach to the other techniques discussed
since it does not explicitly sparsify the dense self-attention matrix. Instead, it connects
adjacent blocks with a recurrent mechanism.
Segment Recurrence The recurrent mechanism in Transformer-XL is described as:
h̃^{n−1}_{τ+1} = [SG(h^{n−1}_τ) ◦ h^{n−1}_{τ+1}]    (13)
q^n_{τ+1}, k^n_{τ+1}, v^n_{τ+1} = h^{n−1}_{τ+1} W_q^⊤, h̃^{n−1}_{τ+1} W_k^⊤, h̃^{n−1}_{τ+1} W_v^⊤    (14)
h^n_{τ+1} = Transformer(q^n_{τ+1}, k^n_{τ+1}, v^n_{τ+1})    (15)
where SG(·) is the stop-gradient function and ◦ denotes the concatenation of two sequences
along the length dimension. Notably, the keys and values are conditioned on the extended
context h̃^{n−1}_{τ+1} instead of h^{n−1}_{τ+1}.
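A simplified single-head sketch of the segment-level recurrence: keys and values are computed over the concatenation of cached previous-segment states (treated here as constants, playing the role of SG(·)) and the current segment, while queries come from the current segment only. Relative positional encodings and causal masking are omitted, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segment_recurrence_attention(h_curr, h_mem, Wq, Wk, Wv):
    """h_curr: (L, d) current-segment states; h_mem: (M, d) cached previous-segment states.

    The cache is assumed to be a constant array, standing in for SG(.); causal masking
    and relative positional terms are omitted for brevity.
    """
    h_tilde = np.concatenate([h_mem, h_curr], axis=0)        # [SG(h_mem) ; h_curr]
    q = h_curr @ Wq                                          # queries from the current segment only
    k, v = h_tilde @ Wk, h_tilde @ Wv                        # keys/values see the extended context
    A = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)     # (L, M + L)
    return A @ v
```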
Memory Reconstruction In order to better retain memories over long sequences, the
Compressive Transformer implements an auto-encoding loss that learns to reconstruct the
original memory from its compressed version, i.e., L_ae = ||old_mem − g(new_cm^{(i)})||, where
g(·) : R^{(n_s/c)×d} → R^{n_s×d} is a parameterized function. A second objective, attention
reconstruction, is a lossy variant that attempts to reconstruct the attention over the memory
instead of losslessly reconstructing the memory itself.
4. Discussion
This section discusses the state of research pertaining to this class of efficient models.
4.1 On Evaluation
While the field is bustling with new Transformer models, there is hardly an easy way to
compare these models side by side. Many research papers select their own benchmarks to
showcase the abilities of the proposed model. This is also coupled with different hyperpa-
rameter settings, like model sizes and configurations, which can make it difficult to correctly
attribute the reason for the performance gains. Moreover, some papers conflate this with
pretraining (Devlin et al., 2018), which makes it even harder to distinguish the relative
performance of these different models. It remains a mystery which fundamental efficient
Transformer block one should consider using.
On one hand, there are multiple models that focus on generative modeling, showcasing
the ability of the proposed Transformer unit on auto-regressive modeling of sequences. To
this end, Sparse Transformers (Child et al., 2019), Adaptive Transformers (Correia et al.,
2019), Routing Transformers (Roy et al., 2020) and Reformers (Kitaev et al., 2020) are
mainly focused on generative modeling tasks. These benchmarks typically involve language
modeling and/or pixel-wise image generation on datasets such as wikitext, enwik8 and/or
ImageNet/CIFAR. Models that use segment based recurrence such as Transformer-XL and
Compressive Transformers are also focused on long-range language modeling tasks such as
PG-19.
On the other hand, a collection of models is mainly focused on encoding-only tasks such as
question answering, reading comprehension and/or selections from the GLUE benchmark.
For example, the ETC model (Ainslie et al., 2020) only runs experiments on question
answering benchmarks such as NaturalQuestions or TriviaQA, while the
Linformer (Wang et al., 2020b) focuses on subsets of the GLUE benchmark. This split
is very natural and intuitive, since models like ETC and Linformer cannot be used in an
auto-regressive fashion, i.e., cannot be used to decode. This aggravates the difficulty of
comparing these encoder-only models with the other models.
There are models that focus on a balance of both. The Longformer (Beltagy et al., 2020)
tries to balance this by running benchmarks on both generative modeling and encoder-only
tasks. The Sinkhorn Transformer (Tay et al., 2020b) compares on both generative modeling
tasks as well as encoding-only tasks.
Additionally, it is also worth noting that, although Seq2Seq machine translation (MT)
was one of the problems that popularized Transformer models, not many of these efficient
Transformer models evaluate on MT. This is likely because sequence lengths in MT are not
long enough to warrant the usage of these models.
While generative modeling, GLUE tasks and/or question answering appear to be the
common evaluation benchmarks adopted by many of these works, there are several niche
benchmarks that a small, isolated number of papers choose to evaluate on. For starters,
the Performer model (Choromanski et al., 2020) evaluates on masked language modeling
on proteins, deviating from serious head-on comparisons with other efficient Transformer
models. The Linear Transformer (Katharopoulos et al., 2020) also evaluates on speech
recognition, which is a rare benchmark amongst this group of papers.
We observe the next wave of models comes in the form of learnable sparsity patterns.
Reformer (Kitaev et al., 2020) and Routing Transformers (Roy et al., 2020) are very similar
in the sense that they are models that learn to cluster/bucket tokens before performing
attention. The key difference is the means to the end whereby Reformer uses a hashing
function while the Routing Transformer uses online k-means for cluster assignment. In
parallel, Sinkhorn Transformers (Tay et al., 2020b) are also based on the idea of sorting,
albeit at block-level. These three models largely follow a similar paradigm of re-arranging
sequences for efficient computation of attention scores.
Next, we then observe several extensions that are largely built off the Sparse Transformer
paradigm. The ETC (Ainslie et al., 2020) and Longformer (Beltagy et al., 2020) models
are very similar ideas that are fundamentally Sparse Transformer extensions. These models
incorporate the notion of a global memory, which is reminiscent of the Set Transformer’s
inducing point method or the global memory of the Star Transformer. Modifications to
strides, such as using dilated windows, were also proposed in the Longformer work.
The most recent wave of models is based on low-rank
approximation or kernel methods, e.g., models such as the Linformer (Wang et al., 2020b),
Performer (Choromanski et al., 2020) and/or Linear Transformers (Katharopoulos et al.,
2020). However, due to the state of evaluation and the high parallelism of research, it
is quite unclear if this low-rank or kernel paradigm is actually better than the learnable
pattern (LP) or memory-based efficient Transformer models.
As an aside, it is worth noting that the recurrence-based models (Transformer-XL and
Compressive Transformers) operate orthogonally and are less directly comparable
with the other models.
• Quantization / Mixed Precision Learning mixed-precision models has the poten-
tial to reduce memory costs. Q-BERT (Shen et al., 2020) quantizes Transformer
models to ultra-low precision. Meanwhile, mixed-precision training (Ott et al., 2019)
is a highly popular technique to reduce the memory costs of training Transformers.
Fan et al. (2020) apply quantization-aware training to Transformer models.
• Knowledge Distillation Knowledge distillation (KD) (Hinton et al., 2015) has been
a useful technique for transferring the knowledge learned from a larger teacher model
to a smaller student model. The smaller model can then be efficiently deployed into
production. There have been many attempts to distill large Transformer models; examples
include DistilBERT (Sanh et al., 2019), task-specific distillation (Tang et al., 2019)
and TinyBERT (Jiao et al., 2019).
• Task Adapters This line of research has been primarily focused on the problem of
fine-tuning large Transformers on T tasks while aiming to reuse parameters across a
variety of tasks. The key idea is that task adapters (Houlsby et al., 2019) enable
reuse of parameters across tasks and remove the need of serving T models in production
- resulting in overall parameter savings. A modest number of models have been
proposed, such as PALS (Stickland and Murray, 2019), MAD-X (Pfeiffer et al., 2020)
and HyperGrid (Tay et al., 2020c).
5. Conclusion
In this paper we surveyed the literature on efficient Transformer models, especially pertain-
ing to the quadratic complexity of the self-attention module. We provided a taxonomy and
high-level abstraction of the core techniques employed in this class of new models. We
characterized the existing models based on their techniques and provided a comprehensive walk-
through of several of the efficient Transformer models. Finally, we discussed the evaluation
landscape of these models along with their design trends. We ended with a brief discussion
of other parallel, orthogonal efforts that may improve the efficiency of Transformer models
in general.
References
Ryan Prescott Adams and Richard S Zemel. Ranking via sinkhorn propagation. arXiv
preprint arXiv:1106.1925, 2011.
Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. Weighted transformer network
for machine translation. arXiv preprint arXiv:1711.02132, 2017.
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Philip Pham, Anirudh Ravula, and Sumit
Sanghai. Etc: Encoding long and structured data in transformers. arXiv preprint
arXiv:2004.08483, 2020.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document trans-
former. arXiv preprint arXiv:2004.05150, 2020.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov,
and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint
arXiv:2005.12872, 2020.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences
with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis,
Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. Masked language
modeling for proteins via linearly scalable long-context transformers. arXiv preprint
arXiv:2006.03555, 2020.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep
network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,
2015.
Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers.
arXiv preprint arXiv:1909.00015, 2019.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhut-
dinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv
preprint arXiv:1901.02860, 2019.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Je-
gou, and Armand Joulin. Training with quantization noise for extreme fixed-point com-
pression. arXiv preprint arXiv:2004.07320, 2020.
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang.
Star-transformer. arXiv preprint arXiv:1902.09113, 2019a.
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang.
Nat: Neural architecture transformer for accurate and compact architectures. In Advances
in Neural Information Processing Systems, pages 737–748, 2019b.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in
multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Larous-
silhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer
learning for nlp. arXiv preprint arXiv:1902.00751, 2019.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and
Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint
arXiv:1909.10351, 2019.
Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions
for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Trans-
formers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint
arXiv:2006.16236, 2020.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient trans-
former. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=rkgNKkHtvB.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. Albert: A lite bert for self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942, 2019.
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh.
Set transformer: A framework for attention-based permutation-invariant neural networks.
In International Conference on Machine Learning, pages 3744–3753, 2019.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser,
and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint
arXiv:1801.10198, 2018.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling.
arXiv preprint arXiv:1904.01038, 2019.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander
Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and
Jon Shlens. Stand-alone self-attention in vision models. In Advances in Neural Informa-
tion Processing Systems, pages 68–80, 2019.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. Mad-x: An adapter-based
framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052, 2020.
Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Block-
wise self-attention for long document understanding. arXiv preprint arXiv:1911.02972,
2019.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. Compressive transformers for long-range sequence modelling. In International
Conference on Learning Representations, 2020. URL https://openreview.net/forum?
id=SylKikSYDH.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-
based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997, 2020.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled
version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,
2019.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W
Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of
bert. 2020.
Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic
matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
David R So, Chen Liang, and Quoc V Le. The evolved transformer. arXiv preprint
arXiv:1901.11117, 2019.
Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient
adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019.
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Ar-
mand Joulin. Augmenting self-attention with persistent memory. arXiv preprint
arXiv:1907.01470, 2019.
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Dis-
tilling task-specific knowledge from bert into simple neural networks. arXiv preprint
arXiv:1903.12136, 2019.
Yi Tay, Aston Zhang, Luu Anh Tuan, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu,
and Siu Cheung Hui. Lightweight and efficient neural natural language processing with
quaternion networks. arXiv preprint arXiv:1906.04393, 2019.
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthe-
sizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743,
2020a.
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn
attention. arXiv preprint arXiv:2002.11296, 2020b.
Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. Hypergrid: Efficient
multi-task transformers with grid-wise decomposable hyper projections. arXiv preprint
arXiv:2007.05891, 2020c.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in neural information processing systems, pages 5998–6008, 2017.
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song
Han. Hat: Hardware-aware transformers for efficient natural language processing. arXiv
preprint arXiv:2005.14187, 2020a.
Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-
attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhut-
dinov, and Alexander J Smola. Deep sets. In Advances in neural information processing
systems, pages 3391–3401, 2017.
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi-
ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird:
Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.