
Efficient Transformers: A Survey


Yi Tay [email protected]
Google Research
Mostafa Dehghani [email protected]
Google Research, Brain team
Dara Bahri [email protected]
Google Research
Donald Metzler [email protected]
Google Research

arXiv:2009.06732v2 [cs.LG] 16 Sep 2020

Editor: Preprint

Abstract
Transformer model architectures have garnered immense interest lately due to their effec-
tiveness across a range of domains like language, vision and reinforcement learning. In the
field of natural language processing for example, Transformers have become an indispens-
able staple in the modern deep learning stack. Recently, a dizzying number of “X-former”
models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few
- which improve upon the original Transformer architecture, many of which make improve-
ments around computational and memory efficiency. With the aim of helping the avid
researcher navigate this flurry, this paper characterizes a large and thoughtful selection of
recent efficiency-flavored “X-former” models, providing an organized and comprehensive
overview of existing work and models across multiple domains.
Keywords: Deep Learning, Natural Language Processing, Transformer Models, Atten-
tion Models

1. Introduction
Transformers (Vaswani et al., 2017) are a formidable force in the modern deep learning
stack. Transformers are pervasive and have made tremendous impact in many fields such
as language understanding (Devlin et al., 2018; Brown et al., 2020; Raffel et al., 2019) and
image processing (Parmar et al., 2018; Carion et al., 2020). As such, it is only natural that
a wealth of research has been dedicated to making fundamental improvements to the model
over the past few years (Dehghani et al., 2018; So et al., 2019; Ahmed et al., 2017). This
immense interest has also spurred research into more efficient variants of the model (Kitaev
et al., 2020; Roy et al., 2020; Beltagy et al., 2020; Katharopoulos et al., 2020; Tay et al.,
2020b; Wang et al., 2020b; Rae et al., 2020).
There has been such a surge of Transformer model variants proposed recently, that
researchers and practitioners alike may find it challenging to keep pace with the rate of
innovation. As of this writing (circa August 2020), there have been nearly a dozen new
efficiency-focused models proposed in just the past 6 months. Thus, a survey of the existing
literature is both beneficial for the community and quite timely.


The self-attention mechanism is a key defining characteristic of Transformer models.


The mechanism can be viewed as a graph-like inductive bias that connects all tokens in
a sequence with a relevance-based pooling operation. A well-known concern with self-
attention is the quadratic time and memory complexity, which can hinder model scalability
in many settings. There has been an overwhelming influx of model variants proposed
recently that address this problem. We hereinafter name this class of models “efficient
Transformers”.
Based on the context, efficiency of a model can be interpreted differently. It might
refer to the memory footprint of the model, which is of importance when the memory
of accelerators on which the model is running is limited. The efficiency might also refer
to computational costs, e.g. number of FLOPs, both during training and inference. In
particular, for on-device applications, models are supposed to be able to operate with a
limited computational budget. In this survey, we refer to the efficiency of Transformers,
both in terms of memory and computation, when they are used for modeling large inputs.
Efficient self-attention models are crucial in applications that model long sequences.
For example, documents, images, and videos are all often composed of a relatively large
number of pixels or tokens. Efficiency in processing long sequences is therefore paramount
for widespread adoption of Transformers.
This survey sets out to provide a comprehensive overview of the recent advances made
in this class of models. We are primarily interested in modeling advances and architectural
innovations that improve the efficiency of Transformers by tackling the quadratic complexity
issue of the self-attention mechanism; we also briefly discuss general improvements and other
efficiency improvements in subsequent sections.
In this paper, we propose a taxonomy of efficient Transformer models, characterizing
them by the technical innovation and primary use case. Specifically, we review Transformer
models that have applications in both language and vision domains, attempting to consoli-
date the literature across the spectrum. We also provide a detailed walk-through of many
of these models and draw connections between them.

2. Background on Transformers
This section provides an overview of the well-established Transformer architecture (Vaswani
et al., 2017). Transformers are multi-layered architectures formed by stacking Transformer
blocks on top of one another.
Transformer blocks are characterized by a multi-head self-attention mechanism, a position-
wise feed-forward network, layer normalization (Ba et al., 2016) modules and residual con-
nectors. The input to the Transformer model is often a tensor of shape R^B × R^N, where B is the batch size and N the sequence length.
The input first passes through an embedding layer that converts each one-hot token representation into a d-dimensional embedding, i.e., R^B × R^N × R^d. The new tensor is
then additively composed with positional encodings, and passed through a multi-headed
self-attention module. Positional encodings can take the form of a sinusoidal input (as
per (Vaswani et al., 2017)) or be trainable embeddings.
The inputs and output of the multi-headed self-attention module are connected by
residual connectors and a layer normalization layer.

Figure 1: Architecture of the standard Transformer (Vaswani et al., 2017).

The output of the multi-headed self-attention module is then passed to a two-layered feed-forward network which has its in-
puts/outputs similarly connected in a residual fashion with layer normalization. The sub-
layer wrapping with residual connections and layer norm is expressed as:

X = LayerNorm(F_S(X)) + X,

where F_S is the sub-layer module, which is either the multi-headed self-attention or the position-wise feed-forward layer.

2.1 Multi-Head Self-Attention


The Transformer model leverages a multi-headed self-attention mechanism. The key idea
behind the mechanism is to learn an alignment (Bahdanau et al., 2014) in which each
element in the sequence learns to gather from other tokens in the sequence. The operation
for a single head is defined as:

A_h = Softmax(α Q_h K_h^T) V_h,

where Q_h = W_q X, K_h = W_k X and V_h = W_v X are linear transformations applied on the temporal dimension of the input sequence, and W_q, W_k, W_v ∈ R^{d×d/N_H} are the weight matrices (parameters) for the query, key and value projections that project the input X to an output tensor of d dimensions. Here X is a matrix of size R^N × R^d, N_H is the number of heads, and α is a scaling factor that is typically set to 1/√d. The outputs of the heads A_1 · · · A_{N_H} are concatenated together and passed into a dense layer. The output Y can thus be expressed as Y = W_o[A_1 · · · A_{N_H}], where W_o is an output linear projection. Note that the computation of A is typically done in a parallel fashion by considering tensors of R^B × R^N × R^{N_H} × R^{d/N_H} and computing the linear transforms for all heads in parallel.
The attention matrix A = QK^T is chiefly responsible for learning alignment scores
between tokens in the sequence. In this formulation, the dot product between each ele-
ment/token in the query (Q) and key (K) is taken. This drives the self-alignment process
in self-attention whereby tokens learn to gather from each other.
On the scalability of Self-Attention At this point, it is apparent that the memory and
computational complexity required to compute the attention matrix is quadratic in the input
sequence length, i.e., N × N. In particular, the QK^T matrix multiplication operation alone consumes N^2 time and memory. This restricts the overall utility of self-attentive models
in applications which demand the processing of long sequences. In subsequent sections, we
discuss methods that reduce the cost of self-attention.
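To make the preceding definitions concrete, the following is a minimal NumPy sketch of the standard O(N^2) multi-head self-attention described above (function names, weight shapes, and hyperparameters are ours and purely illustrative, not the reference implementation):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
        """Standard O(N^2) multi-head self-attention for a single sequence X of shape (N, d)."""
        N, d = X.shape
        dh = d // num_heads                                  # per-head dimension d / N_H
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # each (N, d), all heads in parallel
        split = lambda M: M.reshape(N, num_heads, dh).transpose(1, 0, 2)   # (N_H, N, d/N_H)
        Qh, Kh, Vh = split(Q), split(K), split(V)
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)    # (N_H, N, N): the quadratic-cost term
        A = softmax(scores, axis=-1)
        heads = A @ Vh                                       # (N_H, N, d/N_H)
        return heads.transpose(1, 0, 2).reshape(N, d) @ Wo   # concatenate heads, output projection

    N, d, H = 8, 16, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, d))
    Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
    print(multi_head_self_attention(X, Wq, Wk, Wv, Wo, H).shape)   # (8, 16)

The scores tensor of shape (N_H, N, N) is precisely the quadratic-cost object that the models surveyed in Section 3 try to avoid materializing.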

2.2 Position-wise Feed-forward Layers


The output of the self-attention module is then passed into a two-layered feed-forward network with ReLU activations. This feed-forward layer operates on each position independently, hence the term position-wise. This is expressed as follows:

F_2(ReLU(F_1(X_A))),

where F_1, F_2 are feed-forward functions of the form Wx + b.

2.3 Putting it all together


Each Transformer block can be expressed as:

X_A = LayerNorm(MultiheadSelfAttention(X)) + X
X_B = LayerNorm(PositionFFN(X_A)) + X_A

where X is the input of the Transformer block and X_B is the output of the Transformer block.
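As an illustration of how these pieces compose, here is a rough sketch under our own naming; the sub-layer wiring follows the equations above, and the attention function is left as a pluggable stand-in so that any efficient variant from Section 3 can be dropped in:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def position_ffn(X, W1, b1, W2, b2):
        # Two-layer position-wise feed-forward network with ReLU: F2(ReLU(F1(X)))
        return np.maximum(X @ W1 + b1, 0) @ W2 + b2

    def transformer_block(X, self_attention, ffn):
        XA = layer_norm(self_attention(X)) + X        # X_A = LayerNorm(MultiheadSelfAttention(X)) + X
        XB = layer_norm(position_ffn(XA, *ffn)) + XA  # X_B = LayerNorm(PositionFFN(X_A)) + X_A
        return XB

    N, d, dff = 8, 16, 32
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, d))
    ffn = (rng.normal(size=(d, dff)) * 0.1, np.zeros(dff),
           rng.normal(size=(dff, d)) * 0.1, np.zeros(d))
    toy_attention = lambda X: X    # placeholder; any self-attention variant from Section 3 fits here
    print(transformer_block(X, toy_attention, ffn).shape)   # (8, 16)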

2.4 Transformer Mode


It is important to note the differences in the mode of usage of the Transformer block.
Transformers can primarily be used in three ways, namely: (1) encoder-only (e.g., for
classification), (2) decoder-only (e.g., for language modeling), and (3) encoder-decoder (e.g.,
for machine translation). In encoder-decoder mode, there are usually multiple multi-headed
self-attention modules, including a standard self-attention in both the encoder and the
decoder, along with an encoder-decoder cross-attention that allows the decoder to utilize
information from the encoder. This influences the design of the self-attention mechanism. In
the encoder mode, there is no restriction or constraint that the self-attention mechanism has
to be causal, i.e., dependent solely on the present and past tokens. In the encoder-decoder
setting, the encoder and encoder-decoder cross attention can afford to be non-causal but
the decoder self-attention must be causal. The ability to support causal auto-regressive decoding is required when designing efficient self-attention mechanisms since it can be a limiting factor in many applications.

Figure 2: Taxonomy of Efficient Transformer Architectures.

3. A Survey of Efficient Transformer Models

In this section, we provide a high-level overview of efficient Transformer models. We begin


by presenting a characterization of the different models. Table 1 lists the efficient Trans-
formers released to date while Figure 2 presents a graphical overview of several key efficient
Transformer models.

3.1 A Taxonomy of Efficient Transformers

This section outlines a general taxonomy of efficient Transformer models, characterized by


their core techniques and primary use case. The primary goal of most of these models, with
the exception of those based on segment-based recurrence, is to approximate the quadratic-
cost attention matrix. Each method applies some notion of sparsity to the otherwise dense
attention mechanism.


Model / Paper                                        Complexity          Decode   Class

Memory Compressed† (Liu et al., 2018)                O(n_c^2)            ✓        FP+M
Image Transformer† (Parmar et al., 2018)             O(n·m)              ✓        FP
Set Transformer† (Lee et al., 2019)                  O(nk)               ✗        M
Transformer-XL† (Dai et al., 2019)                   O(n^2)              ✓        RC
Sparse Transformer (Child et al., 2019)              O(n√n)              ✓        FP
Reformer† (Kitaev et al., 2020)                      O(n log n)          ✓        LP
Routing Transformer (Roy et al., 2020)               O(n log n)          ✓        LP
Axial Transformer (Ho et al., 2019)                  O(n√n)              ✓        FP
Compressive Transformer† (Rae et al., 2020)          O(n^2)              ✓        RC
Sinkhorn Transformer† (Tay et al., 2020b)            O(b^2)              ✓        LP
Longformer (Beltagy et al., 2020)                    O(n(k + m))         ✓        FP+M
ETC (Ainslie et al., 2020)                           O(n_g^2 + n·n_g)    ✗        FP+M
Synthesizer (Tay et al., 2020a)                      O(n^2)              ✓        LR+LP
Performer (Choromanski et al., 2020)                 O(n)                ✓        KR
Linformer (Wang et al., 2020b)                       O(n)                ✗        LR
Linear Transformers† (Katharopoulos et al., 2020)    O(n)                ✓        KR
Big Bird (Zaheer et al., 2020)                       O(n)                ✗        FP+M

Table 1: Summary of efficient Transformer models presented in chronological order of their first public disclosure. Some papers presented sequentially may have first appeared at the same time, e.g., as ICLR submissions. Papers annotated with a superscript † are peer-reviewed. Class abbreviations include: FP = Fixed Patterns or Combinations of Fixed Patterns, M = Memory, LP = Learnable Pattern, LR = Low Rank, KR = Kernel and RC = Recurrence. Furthermore, n generally refers to the sequence length and b is the local window (or block) size. We use the subscript g on n to denote the global memory length and n_c to denote the convolutionally compressed sequence length. In the Decode column, ✓ indicates that the model supports auto-regressive decoding and ✗ indicates that it does not.

• Fixed Patterns (FP) - The earliest modifications to self-attention simply sparsify


the attention matrix by limiting the field of view to fixed, predefined patterns such as
local windows and block patterns of fixed strides.

– Blockwise Patterns The simplest example of this technique in practice is the


blockwise (or chunking) paradigm which considers blocks of local receptive fields
by chunking input sequences into fixed blocks. Examples of models that do
this include Blockwise (Qiu et al., 2019) and/or Local Attention (Parmar et al.,
2018). Chunking input sequences into blocks reduces the complexity from N^2 to B^2 (where B is the block size), with B ≪ N, significantly reducing the cost. These blockwise
or chunking methods serve as a basis for many more complex models.
– Strided Patterns Another approach is to consider strided attention patterns,
i.e., only attending at fixed intervals. Models such as Sparse Transformer (Child
et al., 2019) and/or Longformer (Beltagy et al., 2020) employ strided or “dilated”
windows.
– Compressed Patterns - Another line of attack here is to use some pooling
operator to down-sample the sequence length to be a form of fixed pattern. For instance, Compressed Attention (Liu et al., 2018) uses strided convolution to effectively reduce the sequence length.

• Combination of Patterns (CP) - The key idea of combined1 approaches is to


improve coverage by combining two or more distinct access patterns. For example,
the Sparse Transformer (Child et al., 2019) combines strided and local attention by
assigning half of its heads to each pattern. Similarly, Axial Transformer (Ho et al., 2019)
applies a sequence of self-attention computations given a high dimensional tensor
as input, each along a single axis of the input tensor. In essence, the combination
of patterns reduces memory complexity in the same way that fixed patterns does.
The difference, however, is that the aggregation and combination of multiple patterns
improves the overall coverage of the self-attention mechanism.

• Learnable Patterns (LP) - An extension to fixed, pre-determined patterns are learnable ones. Unsurprisingly, models using learnable patterns aim to learn the access pat-
tern in a data-driven fashion. A key characteristic of learning patterns is to determine
a notion of token relevance and then assign tokens to buckets or clusters. Notably, Re-
former (Kitaev et al., 2020) introduces a hash-based similarity measure to efficiently
cluster tokens into chunks. In a similar vein, the Routing Transformer (Roy et al., 2020) employs online k-means clustering on the tokens. Meanwhile, the Sinkhorn Sorting Network (Tay et al., 2020b) exposes the sparsity in attention weights by learning to sort blocks of the input sequence. In all these models, the similarity function is
trained end-to-end jointly with the rest of the network. The key idea of learnable
patterns is still to exploit fixed patterns (chunked patterns). However, this class of
methods learn to sort/cluster the input tokens - enabling a more optimal global view
of the sequence while maintaining the efficiency benefits of fixed patterns approaches.

• Memory - Another prominent method is to leverage a side memory module that can
access multiple tokens at once. A common form is global memory which is able to
access the entire sequence. The global tokens act as a form of memory that learns to
gather from input sequence tokens. This was first introduced in Set Transformers (Lee
et al., 2019) as the inducing points method. These parameters are often interpreted
as “memory” and are used as a form of temporary context for future processing.
This can be thought of as a form of parameter attention (Sukhbaatar et al., 2019).
Global memory is also used in ETC (Ainslie et al., 2020) and Longformer (Beltagy
et al., 2020). With a limited number of memory tokens (or inducing points), we are able to perform a preliminary pooling-like operation on the input sequence to compress it - a neat trick to have at one's disposal when designing efficient self-attention modules.

• Low-Rank Methods - Another emerging technique is to improve efficiency by lever-


aging low-rank approximations of the self-attention matrix. The key idea is to assume
low-rank structure in the N × N matrix. The Linformer (Wang et al., 2020b) is a classic example of this technique, as it projects the length dimension of keys and values to a lower-dimensional representation (N → k). It is easy to see that the low-rank method ameliorates the memory complexity problem of the self-attention because the N × N matrix is now decomposed to N × k.

1. We note that this is also often referred to as factorization approaches, e.g., in (Child et al., 2019). We decide to refer to this class of models as combination approaches because (1) it is a better fit to what these models are actually doing and (2) to avoid confusion with matrix factorization or low-rank approaches.

• Kernels - Another recently popular method to improve the efficiency of Transform-


ers is to view the attention mechanism through kernelization. The usage of kernels
(Katharopoulos et al., 2020; Choromanski et al., 2020) enable clever mathematical
re-writing of the self-attention mechanism to avoid explicitly computing the N × N
matrix. Since kernels are a form of approximation of the attention matrix, they can
be also viewed as a form of low-rank method (Choromanski et al., 2020).

• Recurrence - A natural extension to the blockwise method is to connect these blocks


via recurrence. Transformer-XL (Dai et al., 2019) proposed a segment-level recurrence
mechanism that connects multiple segments and blocks. These models can, in some
sense, be viewed as fixed pattern models. However, we decided to give this approach its own category due to its deviation from other block/local approaches.

We note that these buckets are a broad characterization of the different efficient Transformer
models. In reality, there is no sharp boundary between these groups and models may be
composed of multiple different technical innovations. For example, the k-means clustering in
Routing Transformer (Roy et al., 2020) can be also interpreted as a form of global memory
approach, since one can view the centroids as parameterized memory. In Reformer, however,
this is used to learn the sparsity pattern of the attention weights. Additionally, pooling (Liu
et al., 2018) can be also interpreted as a form of memory model.

3.2 Detailed Walk-through of Efficient Transformer Models


This section delves into the details of several key efficient Transformer models, discussing
their pros, cons, and unique talking points. Due to the large number of Transformer models,
we sample several models to elaborate on.
Structure of this section We begin by discussing local and fixed patterns models such
as the Memory Compressed Transformer (Liu et al., 2018) and Image Transformer (Parmar
et al., 2018). We then discuss the Set Transformers (Lee et al., 2019), an early approach for
utilizing global memory. Following which, we move on to models that utilize combinations
of patterns such as Sparse Transformers (Child et al., 2019) and Axial Transformers (Ho
et al., 2019). Next, we discuss Longformer (Beltagy et al., 2020) and ETC (Ainslie et al.,
2020), an introduction of the memory-based approaches to the Sparse Transformer fam-
ily. Our detailed walkthrough moves on to models that incorporate learnable patterns
(LP) such as Routing Transformers (Roy et al., 2020), Reformer (Kitaev et al., 2020) and
Sinkhorn Transformers (Tay et al., 2020b). After which, we introduce Linformer (Wang
et al., 2020b) and Synthesizers (Tay et al., 2020a), models that can be considered low-
rank factorization approaches. We then discuss models based on kernel approaches such
as Performer (Choromanski et al., 2020) and Linear Transformers (Katharopoulos et al.,
2020). Finally, we discuss the models that are based on segment-based recurrence such as
Transformer-XL (Dai et al., 2019) and Compressive Transformers (Rae et al., 2020).


3.2.1 Memory Compressed Transformer


Memory Compressed Transformer (Liu et al., 2018) is one of the early attempts at modifying Transformers to handle longer sequences. The modification introduced in Memory Compressed Transformer is twofold: localizing the attention span and using memory-compressed attention.

Local Attention Span A straightforward solution for dealing with long sequences in Transformers is to limit the attention span to a local neighborhood. Liu et al. (2018) proposed dividing the input sequence into blocks of similar length and running self-attention within each block independently. This keeps the cost of attention per block constant, so the number of activations scales linearly with the input length.

Memory-compressed Attention The idea of memory compressed attention is to re-


duce the number of keys and values using a strided convolution, while the queries remain unchanged. This reduces the size of the attention matrix, as well as the attention computations, by a compression factor that depends on the kernel size and the stride of the convolution. Memory-compressed attention lets the model exchange information globally across the input sequence, as opposed to local attention.

Computation and Memory Complexity For a block size of b, the computational and memory cost of self-attention within each block is O(b^2), and given that there are n/b blocks, the computational and memory cost of local attention is O(b·n). For memory-compressed attention, applying a convolution with kernel size and stride of k reduces the computational and memory cost of the attention mechanism to O(n·n/k).
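A rough NumPy sketch of the compression idea follows (our own simplification: a single shared compression kernel is used for both keys and values, and batching, heads, and masking are omitted):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def strided_conv1d(X, W, stride):
        """Compress the length dimension of X (n, d) with a kernel W (stride, d, d): output (n/stride, d)."""
        n, d = X.shape
        out = [np.einsum('kd,kde->e', X[s:s + stride], W)
               for s in range(0, n - stride + 1, stride)]
        return np.stack(out)

    def memory_compressed_attention(X, Wq, Wk, Wv, Wc, k):
        """Queries keep the full length n; keys and values are compressed by a factor k."""
        Q = X @ Wq                                     # (n, d)
        K = strided_conv1d(X @ Wk, Wc, k)              # (n/k, d)
        V = strided_conv1d(X @ Wv, Wc, k)              # (n/k, d)
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n, n/k) instead of (n, n)
        return A @ V

    n, d, k = 16, 8, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    Wc = rng.normal(size=(k, d, d)) * 0.1
    print(memory_compressed_attention(X, Wq, Wk, Wv, Wc, k).shape)   # (16, 8)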

3.2.2 Image Transformer


Image Transformer (Parmar et al., 2018), inspired by convolutional neural networks, re-
stricts the receptive field of self-attention to only local neighborhoods. This helps the
model scale up to process larger batch sizes while keeping the likelihood loss tractable.
Besides the efficiency gains, adopting the notion of locality can be a desirable inductive bias for processing images. Image Transformer adopts an encoder-decoder architecture, where the
encoder generates a contextualized representation for every pixel-channel in the inputs and
the decoder autoregressively generates one channel per pixel at each time step.

Localized Attention Span Limiting the receptive field to a local neighborhood (Parmar
et al., 2018, 2019) addresses the issue with the computation and memory cost of running
global self-attention on large inputs, but changing the neighborhood per query position
would prohibit packing the computations of the self-attention into two matrix multiplica-
tions. To avoid that, Image Transformer proposes partitioning the inputs into “query blocks
and their associated “memory blocks“, where for all queries from a single query block, the
model attends to the same memory block. There are two different schemes for choosing
query blocks and their associated memory block neighborhoods: 1-dimensional local atten-
tion and 2-dimensional local attention. Here we briefly explain these schemes in the decoder
case.
For the 1-dimensional local attention, the image is flattened in raster order and partitioned into non-overlapping query blocks Q of length l_q; for each query block, a memory block M is built from the same pixels as in Q, as well as a fixed number of pixels, l_m, generated before the query pixel. For the 2-dimensional local attention, the image is partitioned into multiple non-overlapping rectangular query blocks of length l_q = w_q × h_q. The memory block extends the query block to the top by h_m pixels and to the left and right by w_m pixels each, so l_m = (w_q × h_q) + 2 × (h_m + w_m). The query pixel can attend to all other pixels in its memory block. In the 2-dimensional local attention, pixels in the image are generated one query block after another; blocks are generated in raster order, as are the pixels inside every block.

Figure 3: Attention span in Image Transformer on a two-dimensional input: (a) 1-dimensional local attention, (b) 2-dimensional local attention.

Computational and Memory Complexity In Image Transformer, the attention ma-


trix has shape l_q × m, where l_q is the chosen length of the query blocks and m is the length of the memory block (which is in fact l_q + l_m). Given that memory blocks do not overlap, we have to compute n/l_q attention matrices. Thus the memory and computational complexity of Image Transformer is O(n·m).

Restrictions Image Transformer, and in general restricting the context in the attention
mechanism to a local neighborhood, can decrease the cost of memory and computation at
the price of losing the global receptive field. This can be an issue where global information
is required to solve the task. Also, local-attention still has quadratic complexity to the
region length, introducing an extra hyper-parameter to the trade-off between performance
and computational complexity.

3.2.3 Set Transformer


The Set Transformer (Lee et al., 2019) adapts the Transformer model for set-input problems
- that is, problems wherein the input is a set of features and the output is some function
of this set (and is thereby invariant to the permutation, or ordering, of the input features).
The Set Transformer leverages attention to capture interactions between elements of the
input set. Furthermore, it applies the idea of inducing points from the sparse Gaussian


process literature to reduce the complexity of attention from quadratic to linear in the size
of the input set.
Problems involving sets of objects often have a permutation invariance property: the
target value for the set is the same regardless of the order of the objects in the set. Zaheer
et al. (2017) proved that all permutation-invariant functions can be represented by the
following functional form:

network({x_1, . . . , x_N}) = ρ(pool({φ(x_1), . . . , φ(x_N)})),

where the pooling function pool is a simple summation and φ and ρ are continuous functions.
This form can be interpreted as the composition of an encoder φ and decoder ρ (pool(·)).
While this form is a universal approximator in the space of permutation-invariant functions,
it is unclear how well such models will fit tasks in practice, given a limited capacity. The
Set Transformer proposes a solution that can be viewed as an encoder + pooled decoder, but
where, unlike the form given above, the encoder and decoder can attend to input elements
individually and the pooling function is parameterized.
Attention Blocks The model introduces the following constructs: “Multihead Attention
Block” (MAB), “Set Attention Block” (SAB), “Induced Set Attention Block” (ISAB), and
“Pooling by Multihead Attention” (PMA). They are defined as follows.

MAB(X, Y) := LayerNorm(H + rFF(H)),
H := LayerNorm(X + Multihead(X, Y, Y)),
SAB(X) := MAB(X, X),
ISAB_m(X) := MAB(X, MAB(I_m, X)),
PMA_k(X) := MAB(S_k, rFF(X)).

Here, X ∈ R^{N×d} represents N d-dimensional inputs/outputs stacked row-wise and rFF is a parameterized feed-forward layer that operates on each row of its input matrix separately. I_m ∈ R^{m×d} represents m trainable d-dimensional “inducing points” while S_k ∈ R^{k×d} represents k trainable d-dimensional “seed vectors” (with k set to 1 except when k > 1 correlated outputs are needed). The Set Transformer’s encoder is just N layers of either SAB or ISAB (with N often set to 2 in practice) while its decoder is given by:

Decoder(X) := rFF(SAB(PMA_k(X))).

It is straightforward to see that both ISAB and SAB are permutation equivariant - in other
words, if the input is permuted in some way then the corresponding output of the block
is permuted in exactly the same way. Meanwhile, the pooling layer PMA is permutation
invariant. Since functional composition, i.e. layering, preserves these properties, the Set
Transformer encoder-decoder combination is permutation invariant.
Efficiency We can understand the m inducing points I_m learned in each ISAB layer as a form of static memory. In addition to reducing the O(N·n^2) complexity of the self-attending SAB layer to O(N·m·n), a reduction particularly valuable when the input set is large, the
inducing points effectively encode some global structure that helps explain its inputs. For
example, in the problem of amortized clustering, where one attempts to learn to map an input set of points to the centers of clusters of points inside the set, the inducing points learned could be appropriately distributed so that the encoder can effectively compare query elements with each other implicitly via their proximity to the inducing points.

Figure 4: Illustration of patterns of the attention matrix for dense self-attention in Transformers and sparse fixed attention in Sparse Transformers: (a) Transformer, (b) Sparse Transformer.
The trainable k seeds S_k used in the pooling layer PMA_k can be viewed as static memory in a similar light, reducing the memory and runtime complexity of the architecture.
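A simplified sketch of ISAB with inducing points is given below (single-head attention without projections stands in for the Multihead operator, so this is only a schematic of the O(mn) structure, not the authors' implementation):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def attention(X, Y):
        # Single-head stand-in for Multihead(X, Y, Y): queries from X, keys/values from Y.
        return softmax(X @ Y.T / np.sqrt(X.shape[-1])) @ Y

    def mab(X, Y, Wff):
        H = layer_norm(X + attention(X, Y))
        return layer_norm(H + np.maximum(H @ Wff[0], 0) @ Wff[1])   # rFF applied row-wise

    def isab(X, I, Wff1, Wff2):
        # ISAB_m(X) = MAB(X, MAB(I_m, X)): cost O(mn) rather than O(n^2).
        H = mab(I, X, Wff1)      # (m, d): the m inducing points summarize the whole set
        return mab(X, H, Wff2)   # (n, d): each element attends only to the m summaries

    n, m, d = 32, 4, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, d))
    I = rng.normal(size=(m, d))                        # trainable inducing points I_m
    Wff1 = (rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
    Wff2 = (rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
    print(isab(X, I, Wff1, Wff2).shape)                # (32, 8)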

3.2.4 Sparse Transformer


The Sparse Transformer (Child et al., 2019) presents a simple initial attempt to reduce
the quadratic complexity of the standard self-attention mechanism. The key idea is to
reduce the dense attention matrix to a sparse version by only computing attention on a
sparse subset of q_i, k_j pairs. Sparse Transformer employs fixed attention patterns which
are defined by strides and local neighbourhoods. Computation is factorized, wherein local
and stride patterns are split amongst the heads.
Local Attention Heads Half of the heads in the Sparse Transformer are dedicated to
local attention.
Â_ij = Q_i K_j^T  if ⌊j/N⌋ = ⌊i/N⌋,  and  Â_ij = 0 otherwise,

where Â_ij is the attention weight of q_i, k_j and ⌊·⌋ denotes the floor operation. In this case, we only compute the attention if ⌊j/N⌋ = ⌊i/N⌋ (i.e., within the same block).
Strided Attention Heads The other half of the heads are dedicated to fixed strided
patterns. This is also referred to as strides. Concretely,
Â_ij = Q_i K_j^T  if (i − j) mod N = 0,  and  Â_ij = 0 otherwise.


Figure 5: Attention span in Axial Transformer on a two-dimensional input (row attention and column attention).

The final result of the factorized sparse attention is visualized in Figure 4.
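The two fixed patterns can be sketched as boolean masks (illustrative only; the dense masks below are for clarity, whereas the actual implementation relies on block-sparse kernels and never materializes an n × n matrix):

    import numpy as np

    def local_mask(n, block):
        # Heads of the first type: attend within the same block, floor(j/block) == floor(i/block).
        idx = np.arange(n)
        return (idx[:, None] // block) == (idx[None, :] // block)

    def strided_mask(n, stride):
        # Heads of the second type: attend at fixed intervals, (i - j) mod stride == 0.
        idx = np.arange(n)
        return ((idx[:, None] - idx[None, :]) % stride) == 0

    n, block = 16, 4
    causal = np.tril(np.ones((n, n), dtype=bool))
    print((local_mask(n, block) & causal).sum(),
          (strided_mask(n, block) & causal).sum())
    # A full causal attention matrix would have n*(n+1)/2 = 136 entries; each sparse head keeps far fewer.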

Parameter and Memory Complexity The modification in the self-attention mecha-


nism does not alter the parameter costs of the model since the model still retains the Q, K, V transforms from the original Transformer model. The memory complexity of the attention layer is reduced from O(n^2) to O(n√n).

Restrictions The Sparse Transformer implementation requires custom GPU kernels to


implement a specific block-sparse variant of matrix-matrix-multiplication and cannot be
easily implemented on other hardware such as TPUs.

3.2.5 Axial Transformer


Axial Transformer (Ho et al., 2019) uses factorization in a simple yet effective setup for
the self-attention mechanism to process large inputs that are organized as multidimensional
tensors. Instead of applying attention to the flattened version of the input, Axial Trans-
former simply applies multiple attentions, each along a single axis of the input tensor. Each
attention, in fact, mixes information along a particular axis, while keeping information
along other axes independent. Since the length of any single axis is typically much smaller
than the total number of elements, Axial Transformer significantly saves computation and
memory.
Axial Transformer offers an encoder-decoder architecture. For the decoding, to be able
to implement the causal mask, Axial Transformer combines axial attentions with shift op-
erations. For instance, for a model on 2-dimensional tensors, pixels are generated in raster
order and to do that, first, the model encodes all pixels through an unmasked row and
unmasked column attention. Then, for each row, the model applies an unmasked row and
masked column attention to integrate the previously sampled rows. Finally, the model shifts
the encoded representation up to make sure the conditioning information satisfies causality,
and runs a masked row-attention to sample a new row in the image.


An advantage of Axial Transformer over similar methods like Sparse Transformer is that while it provides the global receptive field, it is straightforward to implement and does not require a custom kernel for an efficient implementation.

Computational and Memory Complexity In terms of memory and computational complexity, on a square image of size N, Axial Transformer performs the attention computation in O(n√n), which saves O(√n) over normal self-attention. For instance, for a square image with N pixels organized in a b × b grid, Axial Transformer runs b attention sequences of length b, which is of complexity O(b·b^2). In a more general case, for a d-dimensional tensor of shape N = N^{1/d} × . . . × N^{1/d}, Axial Transformer saves an O(N^{(d−1)/d}) factor of resources over standard self-attention.
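A minimal sketch of unmasked axial attention on a 2-dimensional input follows (projections, masking, and the shift operations needed for decoding are omitted; shapes are illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def axial_attention(X):
        """Unmasked axial self-attention on a (H, W, d) tensor: attend along rows, then columns.
        Each pass mixes information along a single axis, costing O(H*W*(H+W)) rather than O((H*W)^2)."""
        H, W, d = X.shape
        row = softmax(np.einsum('hid,hjd->hij', X, X) / np.sqrt(d))   # per-row attention over W positions
        X = np.einsum('hij,hjd->hid', row, X)
        col = softmax(np.einsum('iwd,jwd->wij', X, X) / np.sqrt(d))   # per-column attention over H positions
        return np.einsum('wij,jwd->iwd', col, X)

    img = np.random.default_rng(0).normal(size=(8, 8, 4))
    print(axial_attention(img).shape)   # (8, 8, 4)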

3.2.6 Longformer
Longformer (Beltagy et al., 2020) is a variant of Sparse Transformer. Its key distinction
compared to Sparse Transformer is “Dilated Sliding Windows”, which can enable better
long-range coverage without sacrificing sparsity. This is achieved by increasing the receptive
fields by having gaps in the attention patterns. The Longformer also gradually increases the
receptive field as the model goes deeper, dedicating lower levels for modeling local patterns
and upper levels for modeling global patterns.

Global Attention For classification tasks, the Longformer adopts global tokens (e.g.,
CLS tokens) that have access to all input sequences.

Parameter and Memory Complexity The complexity of the model is reduced from
O(n^2) to O(nk), where k is the size of the window. When using global attention, the
Longformer creates another set of query-key-value projections for this global attention,
doubling the cost of the parameters at the attention layer.
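The sliding-window pattern with optional dilation and global tokens can be sketched as a boolean mask (a toy illustration with our own parameter names; the real Longformer relies on a custom banded implementation rather than a dense mask):

    import numpy as np

    def longformer_mask(n, w, dilation=1, global_idx=()):
        """Boolean attention mask: each token sees w//2 (possibly dilated) neighbours per side;
        positions in global_idx attend to, and are attended by, every token."""
        idx = np.arange(n)
        offset = idx[:, None] - idx[None, :]
        mask = (np.abs(offset) <= (w // 2) * dilation) & (offset % dilation == 0)
        for g in global_idx:
            mask[g, :] = True      # the global token attends everywhere
            mask[:, g] = True      # every token attends to the global token
        return mask

    # Window of 4 with dilation 2, plus a CLS-style global token at position 0.
    print(longformer_mask(12, w=4, dilation=2, global_idx=(0,)).sum())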

3.2.7 Extended Transformer Construction (ETC)


The ETC model (Ainslie et al., 2020) is another variation in the Sparse Transformer family.
It introduces a new global-local attention mechanism. There are four components to this
new attention mechanism, namely (1) global-to-global (g2g), global-to-local (g2l), local-to-
global (l2g) and local-to-local (l2l).
Aside from the original input to the model, ETC introduces n_g auxiliary tokens as a
prefix to the original input sequence. These tokens are regarded as global tokens and take
part in global-to-∗ and ∗-to-global attention. The local-to-local component acts as the local
attention with a fixed radius of k. Overall, ETC is quite similar to the Longformer in the
way it introduces global auxiliary tokens. These tokens are trainable parameters and can be
interpreted as a form of memory that pools across the sequence to collect global sequence
information.

Memory and Parameter Complexity The memory complexity of the ETC model is
O(n_g^2 + n_g·N), where n_g is the number of global tokens and N is the input sequence length.

Restrictions Intuitively, it is easy to observe that ETC cannot be used for auto-regressive
decoding. This is because causal masks cannot be computed in the presence of the global attention.

3.2.8 BigBird
The BigBird model (Zaheer et al., 2020) is another Transformer for modeling longer se-
quences and is primarily built on top of ETC (Ainslie et al., 2020). The BigBird model comprises several key components, namely (1) global tokens, (2) random attention (queries
attend to random keys) and (3) fixed patterns (local sliding windows).


Global Attention Fundamentally, the idea of using global memory can be traced all the
way back to Longformer/ETC and Set Transformer model. Notably, the global memory in
Big Bird is extended to contain tokens within the sequence, instead of simply parameterized
memory. The authors call this the ‘internal transformer construction (ITC)’ in which a
subset of indices is selected as global tokens. This can be interpreted as a memory based
approach.
Sliding Window Attention The windowed attention was first proposed in early local-
based attention models (Image Transformer, Compressed Attention and/or Sparse Trans-
former). In BigBird, each query attends to w/2 tokens to the left and w/2 tokens to the
right. This corresponds to a fixed pattern (FP) approach.
Random Attention Finally, each query attends to r random keys. This pattern is fixed.
Memory and Parameter Complexity The memory complexity of the self-attention
is linear, i.e., O(n). The BigBird model does not introduce new parameters beyond the
Transformer model.
Restrictions Similar to ETC, the BigBird model cannot be used to autoregressively decode, hence qualifying it as an encoder-only model.

3.2.9 Routing Transformer


The Routing Transformer (Roy et al., 2020) is a content-based sparse attention mechanism.
It proposes a clustering-based attention mechanism that learns the attention sparsity in
a data driven fashion. The first step is to project Q and K into a routing matrix R of
dimensions n × d.

R = Q W_R + K W_R    (1)

where W_R is a d × d orthonormal projection matrix.


k-means Clustering The R matrix undergoes k-means clustering with a series of pa-
rameterized cluster centroids u_1, u_2, · · ·, u_k. The k-means in Routing Transformer is trained in an online fashion. To ensure a similar number of tokens in each cluster, the model initializes √n clusters, computes each token’s distance against the cluster centroids, and takes an equal
top-k for each centroid. Since the cluster centroids are trainable parameters, this is also
reminiscent of the all-attention layer proposed by (Sukhbaatar et al., 2019).
Routing Strategy Thereafter, the routing strategy is defined as:
X'_i = Σ_{j ∈ C_i, j ≤ i} A_ij V_j    (2)

where C_i is the cluster that vector R_i is assigned to. In other words, the token at position i only attends to tokens in the same cluster.
Memory and Parameter Complexity The Routing Transformer introduces additional
parameters in the clustering mechanism, namely k × d centroid vectors and a W_R projection matrix. The memory complexity is O(n^1.5).
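A toy sketch of cluster-restricted attention follows (hard nearest-centroid assignment and a dense mask are used here for clarity; the actual model trains the centroids online, balances clusters with an equal top-k assignment, and gathers tokens per cluster instead of building an n × n matrix):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def routing_attention(Q, K, V, WR, centroids):
        """Tokens attend only to earlier tokens assigned to the same (nearest) centroid."""
        n, d = Q.shape
        R = Q @ WR + K @ WR                                              # routing matrix, eq. (1)
        dists = ((R[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
        assign = dists.argmin(-1)                                        # hard cluster assignment
        scores = Q @ K.T / np.sqrt(d)
        same_cluster = assign[:, None] == assign[None, :]
        causal = np.tril(np.ones((n, n), dtype=bool))                    # j <= i, as in eq. (2)
        scores = np.where(same_cluster & causal, scores, -1e9)
        return softmax(scores) @ V

    n, d, k = 16, 8, 4
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    WR = rng.normal(size=(d, d)) * 0.1
    centroids = rng.normal(size=(k, d))
    print(routing_attention(Q, K, V, WR, centroids).shape)   # (16, 8)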


3.2.10 Reformer
Reformer (Kitaev et al., 2020) is another efficient attention model based on locality sen-
sitive hashing (LSH). The Reformer also introduces reversible Transformer layers, which
contributes to further reducing its memory footprint.
LSH Attention The LSH attention introduces parameter-sharing between query and
keys. It hashes the query-keys into buckets using a random-projection based hashing func-
tion. The key idea is that nearby vectors should obtain a similar hash while distant vectors
should not, hence being termed as ‘locality sensitive’. To perform hashing, a random matrix
R ∈ R^{k×b/2} is first introduced. Next, the hashing function is defined as:

h(x) = arg max([xR; −xR])    (3)

where [; ] is the concatenation of two vectors. For all queries, attention is computed if and
only if the query and key hashes match, i.e., h(q_i) = h(k_j). In other words, attention
is computed amongst query and keys if they fall in the same hash bucket. In order to
maintain causal masking (the ability to auto-regressively decode), the Reformer assigns
and maintains a position index for every query/key. It is therefore able to compare if each
query key comparison is auto-regressively valid.
Memory Efficiency with LSH Attention The key idea behind LSH attention is to
classify tokens into buckets and then process them bucket by bucket in a chunked fashion.
To this end, queries are first sorted by bucket number and then by sequence order within
the same bucket. During computation, tokens only attend to tokens in the same bucket within their own chunk and the previous chunk. The chunking and sorted bucketing techniques help to improve the
overall efficiency of the Reformer model.
Parameter and Memory Complexity The memory complexity of the Reformer is
O(n log n). In terms of parameter costs, the Reformer shares queries and keys, which
reduces the cost of the QKV transforms by a third. The random projections are not
trainable parameters and hence do not incur parameter costs. Overall, the Reformer has
fewer parameters than vanilla Transformers. The reversible layers in the Reformer further reduce memory consumption during training by enabling activations to be reconstructed from those of the next layer, eliminating the need to store activations for all layers during backpropagation.
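The hashing step of Equation (3) is simple to sketch (bucket counts and shapes are illustrative; chunking, sorting, multi-round hashing and the causal bookkeeping are omitted):

    import numpy as np

    def lsh_hash(x, R):
        """Angular LSH of eq. (3): h(x) = argmax([xR; -xR]).  x: (n, d), R: (d, n_buckets/2)."""
        xR = x @ R
        return np.concatenate([xR, -xR], axis=-1).argmax(-1)   # one bucket id per position

    n, d, n_buckets = 16, 8, 8
    rng = np.random.default_rng(0)
    qk = rng.normal(size=(n, d))                  # the Reformer shares queries and keys
    R = rng.normal(size=(d, n_buckets // 2))
    buckets = lsh_hash(qk, R)
    print(buckets)
    # Attention is then computed only between positions i, j with buckets[i] == buckets[j]
    # (subject to the causal constraint), after sorting positions by bucket and chunking.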

3.2.11 Sinkhorn Transformers


This section introduces the Sparse Sinkhorn Transformer (Tay et al., 2020b). The Sinkhorn
Transformer belongs to the family of learned patterns. This model is a chunked/blocked
model that learns sparse patterns by re-sorting the input key and values in a block-wise
fashion and then applying local block-based attention.
A_ij = Q_i ψ_S(K)_j^T  if ⌊j/N⌋ = ⌊i/N⌋,  and  A_ij = 0 otherwise,

where ψ_S applies a sorting operator on the sequence length dimension.


Sorting Network The sorting operator is parameterized by a meta sorting network. Let
X be the input sequence of dimension N × d.

ψ_S(X) = φ_S(F_S(BlockSum(X))) BlockShape(X)    (4)

where F_S(·) is a parameterized function such as a two-layer feed-forward network with ReLU activation. The output of F_S(·) is a tensor of size n_B × n_B. The BlockSum function learns the sum embeddings of local blocks. The BlockShape function reshapes the input tensor into R^{N×d} → R^{n_B×b×d}. Here, we note that N = n_B × b, where b is the size of the block and n_B is the total number of blocks.

Sinkhorn Sorting φ_S is the Sinkhorn balancing operator (Sinkhorn, 1964; Adams and Zemel, 2011) which converts the n_B × n_B matrix into a soft permutation matrix. Specifically, a series of row- and column-wise normalizations are applied to the matrix output of F_S(BlockSum(X)). For the sake of brevity, we do not delve into the details of this operation.
Further details can be found at (Adams and Zemel, 2011; Tay et al., 2020b).

Parameter and Memory Complexity The memory complexity of the Sinkhorn Trans-
former is O(b^2) where b is the block size and b = N/n_B. Additional parameter costs are incurred by the meta sorting network F_S(·). The number of additional parameters is therefore 2d^2 when a two-layer ReLU network is used as the sorting network.
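A small sketch of the block-sorting machinery follows (our own simplification, with a log-space Sinkhorn iteration and a two-layer ReLU network standing in for F_S; the temperature, noise, and training details of the actual model are omitted):

    import numpy as np

    def sinkhorn(logits, n_iters=8):
        """Turn an (n_B, n_B) score matrix into an (approximately) doubly-stochastic
        soft permutation matrix by alternating row- and column-wise normalization."""
        log_p = logits
        for _ in range(n_iters):
            log_p = log_p - np.log(np.exp(log_p).sum(axis=1, keepdims=True))   # row normalize
            log_p = log_p - np.log(np.exp(log_p).sum(axis=0, keepdims=True))   # column normalize
        return np.exp(log_p)

    def block_sum(X, b):
        return X.reshape(-1, b, X.shape[-1]).sum(axis=1)      # (N, d) -> (n_B, d)

    N, d, b = 16, 8, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, d))
    W1 = rng.normal(size=(d, d)) * 0.1
    W2 = rng.normal(size=(d, N // b)) * 0.1
    scores = np.maximum(block_sum(X, b) @ W1, 0) @ W2         # meta sorting network F_S
    P = sinkhorn(scores)                                      # soft permutation over blocks
    X_sorted = (P @ X.reshape(N // b, b * d)).reshape(N, d)   # re-sort blocks of the keys/values
    print(P.sum(axis=0), P.sum(axis=1))                       # both approximately all-ones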

3.2.12 Linformer
The Linformer (Wang et al., 2020b) is an efficient Transformer based on the idea of low-rank
self-attention.

Low Rank Projections on Length Dimensions The Linformer projects the N × d


dimensional keys and values to k × d dimensions using additional projection layers. Note that this is a reduction on the length dimension instead of the key and value dimensions. Given the newly projected keys (K') and values (V'), the Q(K')^T matrix is now of dimensions N × k instead of N × N. The attention matrix Softmax(Q(K')^T) multiplies with V' ∈ R^{k×d} to result in an output tensor of dimensions N × d. To some extent, Linformer is reminiscent of
depth-wise convolutions (Kaiser et al., 2017). A projection on the length dimension causes
mixing of sequence information (dimension-wise) in a single transformation. Hence, it is
non-trivial to maintain causal masking and/or prevent mixing of past and future information
when computing attention scores.

Parameter and Memory Complexity The memory complexity of Linformer is O(n).


The Linformer incurs only a minimal parameter cost due to the extra N × k length projections. If k is sufficiently small, the parameter cost incurred is negligible.
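The length-dimension projection is straightforward to sketch (E and F denote the assumed additional projection layers; names and shapes are illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def linformer_attention(Q, K, V, E, F):
        """Low-rank attention on the length dimension: E, F project (N, d) keys/values to (k, d)."""
        Kp, Vp = E @ K, F @ V                            # (k, d): compress the length dimension N -> k
        A = softmax(Q @ Kp.T / np.sqrt(Q.shape[-1]))     # (N, k) instead of (N, N)
        return A @ Vp                                    # (N, d)

    N, d, k = 16, 8, 4
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
    E, F = (rng.normal(size=(k, N)) * 0.1 for _ in range(2))
    print(linformer_attention(Q, K, V, E, F).shape)      # (16, 8)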

3.2.13 Linear Transformer


The Linear Transformer (Katharopoulos et al., 2020) improves the complexity of self-
attention from quadratic to linear by using a kernel-based formulation of self-attention
and the associative property of matrix products. Furthermore, it reduces attention with causal masking (which is used in auto-regressive decoding) to a linear-time, constant-memory


recurrent neural network (RNN). The model has been shown to improve inference speeds
up to three orders of magnitude without much loss in predictive performance.
The method rests on the simple but powerful observation that the accumulated value
V'_i for the query Q_i in position i can be written as:

V'_i = ( Σ_{j=1}^{p} sim(Q_i, K_j) V_j ) / ( Σ_{j=1}^{p} sim(Q_i, K_j) ).

Here, p = N in full, unmasked attention and p = i in the case of causal masking. Now, in usual softmax attention, sim(q, k) = exp(q^T k / √d). The Linear Transformer, however, expresses the similarity as a kernel function. That is, sim(q, k) := φ(q)^T φ(k), where φ is a, possibly high-dimensional, feature map. With this choice, we can rewrite V'_i as:

V'_i = ( φ(Q_i)^T S_p ) / ( φ(Q_i)^T Z_p ),

S_p := Σ_{j=1}^{p} φ(K_j) V_j^T,
Z_p := Σ_{j=1}^{p} φ(K_j).

For unmasked attention, since p = N we only need to compute S_N and Z_N once and we reuse them for the computation at every position 0 ≤ i ≤ N. For causal attention, the S_i's and Z_i's can be viewed as states of an RNN that are updated by the following recurrence relations:

S_i = S_{i−1} + φ(K_i) V_i^T,
Z_i = Z_{i−1} + φ(K_i),

with initial condition S_0 = Z_0 = 0. If the dimension of the keys, queries, and values is d and the cost to compute φ is O(c), then the overall run-time complexity of the Linear Transformer is O(Ncd). The authors choose

φ(x) = elu(x) + 1,

where elu(·) denotes the exponential linear unit (Clevert et al., 2015). With this choice of feature map, c = d and the end-to-end complexity of the model is O(Nd^2). The authors
go further and show that in addition to the forward pass, the backward pass (i.e. gradient
computation) can be achieved in linear time and constant memory by using cumulative
sums. We defer readers interested in this derivation to the original paper.
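A compact sketch of both the unmasked and the causal (RNN-style) formulations with the elu(x) + 1 feature map follows (our own naming; numerical stabilizers and the custom backward pass are omitted):

    import numpy as np

    def phi(x):
        # Feature map chosen by the authors: elu(x) + 1, which is strictly positive.
        return np.where(x > 0, x + 1, np.exp(x))

    def linear_attention(Q, K, V, causal=False):
        """Kernelized attention computed in O(N d^2) instead of O(N^2 d)."""
        Qf, Kf = phi(Q), phi(K)                          # (N, d)
        if not causal:
            S = Kf.T @ V                                 # (d, d): sum_j phi(K_j) V_j^T
            Z = Kf.sum(axis=0)                           # (d,):   sum_j phi(K_j)
            return (Qf @ S) / (Qf @ Z)[:, None]
        N, d = Q.shape
        S, Z = np.zeros((d, V.shape[-1])), np.zeros(d)   # RNN-like running state
        out = np.zeros_like(V)
        for i in range(N):
            S += np.outer(Kf[i], V[i])                   # S_i = S_{i-1} + phi(K_i) V_i^T
            Z += Kf[i]                                   # Z_i = Z_{i-1} + phi(K_i)
            out[i] = (Qf[i] @ S) / (Qf[i] @ Z)
        return out

    N, d = 16, 8
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
    print(np.allclose(linear_attention(Q, K, V, causal=True)[-1],
                      linear_attention(Q, K, V)[-1]))    # True: the last position sees the full prefix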

3.2.14 Performer
The Performer (Choromanski et al., 2020) model is characterized by its Generalized Atten-
tion mechanism and its usage of random kernels.


Generalized Attention The generalized attention entangles Qi , Kj with a kernel func-


tion K. The attention matrix in Performer is computed via:
A = [g(Q_i^T) K(Q_i^T, K_j^T) h(K_j^T)]_{i,j},    (5)

where K(·,·) is a kernel function that maps a pair of d-dimensional vectors to a scalar value in R and g, h are functions that map a d-dimensional vector to a scalar value in R.
Fast Attention via Orthogonal Random Features (FAVOR) The above computa-
tion is still quadratic in complexity. Hence, the Performer leverages approximation tricks to
avoid storing and computing the N × N attention matrix. It leverages orthogonal random
features (ORF) for doing so. The final attention output Y of the Performer is described as
follows:
Y = D̂^{-1} (Q'((K')^T V)),    (6)

where D̂ = diag(Q'((K')^T 1_N)), Q' = D_Q φ(Q^T)^T, and K' = D_K φ(K^T)^T. Note that D_Q = diag(g(Q_i^T)) and D_K = diag(h(K_i^T)). The function φ(x) is defined as:

φ(x) = (c / √M) f(W x + b)^T,    (7)

where c > 0 is a constant, W ∈ R^{M×d} is a random feature matrix, and M is the dimensionality of this matrix, which controls the number of random features. We are able to see that
we do not explicitly compute A = QK > and hence avoid paying the N 2 cost. For rigorous
theoretical analysis and further details, we refer interested readers to (Choromanski et al.,
2020).
Parameter and Memory Complexity The complexity of the bidirectional FAVOR algorithm is O(Md + Nd + MN), where M is the dimensionality of the random features.
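The linear-complexity trick of Equation (6), computing (K')^T V before multiplying by Q', can be sketched with a generic positive random-feature map for the softmax kernel (a plain Gaussian W is used here rather than the orthogonal random features of FAVOR, so this is only an approximation in the same spirit, not the Performer's exact feature map):

    import numpy as np

    def positive_random_features(X, W):
        """Positive random features for the softmax kernel:
        exp(q.k) ~= E_w[exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2)] with w ~ N(0, I)."""
        M = W.shape[0]
        return np.exp(X @ W.T - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(M)

    def performer_attention(Q, K, V, W):
        d = Q.shape[-1]
        Qp = positive_random_features(Q / d ** 0.25, W)   # (N, M)
        Kp = positive_random_features(K / d ** 0.25, W)   # (N, M)
        num = Qp @ (Kp.T @ V)                             # linear cost: (K'^T V) is computed first
        den = Qp @ Kp.sum(axis=0)                         # normalizer, playing the role of D-hat in eq. (6)
        return num / den[:, None]

    def exact_softmax_attention(Q, K, V):
        s = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
        return (s / s.sum(-1, keepdims=True)) @ V

    N, d, M = 16, 8, 256
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(N, d)) * 0.3 for _ in range(3))
    W = rng.normal(size=(M, d))
    print(np.abs(performer_attention(Q, K, V, W) - exact_softmax_attention(Q, K, V)).max())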

3.2.15 Synthesizers
Synthesizer models (Tay et al., 2020a) are an attempt to study and investigate the true
importance of conditioning within the self-attention mechanism. In (Tay et al., 2020a), the
authors study a synthetic self-attention module in which attention weights are approximated
instead of being computed by pairwise dot products. Synthesizers are only implicitly related
to efficient Transformers. However, the factorized variants can be considered a low-rank
efficient Transformer model.
Dense Synthesizers In the Dense Synthesizer, each token xi is projected to a vector
of length N using a two-layered non-linear feed-forward network. The computation of the
attention matrix A is described as:
A = W_2(σ_R(W_1(X) + b)) + b,    (8)

where X ∈ R^{N×d} is the input sequence, W_2 ∈ R^{d×N}, W_1 ∈ R^{d×d}, and σ_R is the ReLU activation function. Given A, the output of the Synthetic Dense function is computed as:

Y = Softmax(A) G(X),    (9)

where G(X) is another parameterized function R^{N×d} → R^{N×d}.


Random Synthesizers Another variant of the Synthesizer model uses random matrices
for A. In this case, the output can be expressed by:
Y = Softmax(R) G(X),    (10)

where R ∈ R^{N×N} is a trainable and/or non-trainable matrix. In (Tay et al., 2020a), the
authors show that Random Synthesizers achieve competitive performance.
Factorized Variants The Dense and Random Synthesizers also come with factorized
variants that consider a low-rank structure of the attention matrix. The factorized Random Synthesizer can be written as:

Y = Softmax(R_1 R_2^T) G(X),    (11)

where R_1, R_2 ∈ R^{N×k}. On the other hand, the Dense Synthesizer can be factorized as follows:

A = H_B(B) ∗ H_C(C)  where  B, C = F_B(X_i), F_C(X_i),    (12)

where F_B(·) projects X_i onto b dimensions and F_C(·) projects X_i onto c dimensions with c × b = N. H_B, H_C are tile and repeat functions respectively.
Parameter and Memory Complexity For Random Synthesizers that adopt a non-
trainable R, there is no need to store N^2 activations at this layer. For the trainable Random Synthesizer, the memory complexity and parameter complexity remain N^2. However, there is no need to compute N^2 dot products, reducing the computational costs significantly. The factorized Random Synthesizers reduce the parameter costs to 2(N × k).
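A sketch of the Dense and factorized Random variants follows (parameter shapes are ours; G(X) is left as a stand-in for the parameterized value function):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def dense_synthesizer(X, W1, b1, W2, b2, G):
        # Eq. (8)-(9): attention weights are synthesized per token, with no query-key dot products.
        A = np.maximum(X @ W1 + b1, 0) @ W2 + b2          # (N, N)
        return softmax(A) @ G(X)

    def factorized_random_synthesizer(X, R1, R2, G):
        # Eq. (11): random attention with rank at most k, parameter cost 2*N*k instead of N^2.
        return softmax(R1 @ R2.T) @ G(X)

    N, d, k = 16, 8, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, d))
    W1, b1 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
    W2, b2 = rng.normal(size=(d, N)) * 0.1, np.zeros(N)
    R1, R2 = (rng.normal(size=(N, k)) for _ in range(2))
    G = lambda X: X                                       # stand-in for the parameterized G(X)
    print(dense_synthesizer(X, W1, b1, W2, b2, G).shape,
          factorized_random_synthesizer(X, R1, R2, G).shape)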

3.2.16 Transformer-XL
The Transformer-XL model (Dai et al., 2019) relies on segment-based recurrence. Segment-
based recurrence can be considered an orthogonal approach to the other techniques discussed
since it does not explicitly sparsify the dense self-attention matrix. Instead, it connects
adjacent blocks with a recurrent mechanism.
Segment Recurrence The recurrent mechanism in Transformer-XL is described as:

h̃_{τ+1}^{n−1} = [SG(h_τ^{n−1}) ◦ h_{τ+1}^{n−1}]    (13)
q_{τ+1}^{n}, k_{τ+1}^{n}, v_{τ+1}^{n} = h_{τ+1}^{n−1} W_q^T, h̃_{τ+1}^{n−1} W_k^T, h̃_{τ+1}^{n−1} W_v^T    (14)
h_{τ+1}^{n} = Transformer(q_{τ+1}^{n}, k_{τ+1}^{n}, v_{τ+1}^{n})    (15)

where SG(·) is the stop-gradient function and ◦ is the concatenation of two sequences along the length dimension. Notably, the keys and values are conditioned on the extended sequence h̃_{τ+1}^{n−1} (which includes the cached previous segment) instead of h_{τ+1}^{n−1}.

Relative Positional Encodings Transformer-XL introduces novel relative position en-


codings. In this scheme, absolute positional encodings are not added to the content embed-
dings. Instead, they are only considered while computing attention weights where they can
be replaced with relative position encodings. Since the relative position encodings are not
directly relevant to the efficiency of the model, we refer interested readers to (Dai et al.,
2019) for more details.
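A single attention layer with segment-level recurrence can be sketched as follows (relative positional encodings, multiple heads, and the surrounding feed-forward/residual structure are omitted; the cached segment stands in for SG(h)):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def xl_attention(h, memory, Wq, Wk, Wv):
        """One attention layer with segment-level recurrence: queries come from the current
        segment h, while keys and values also see the cached previous segment (eq. 13-15)."""
        h_tilde = np.concatenate([memory, h], axis=0)           # [SG(h_prev) concat h]
        Q, K, V = h @ Wq, h_tilde @ Wk, h_tilde @ Wv
        L, M = h.shape[0], memory.shape[0]
        mask = np.tril(np.ones((L, L + M), dtype=bool), k=M)    # position i sees memory + positions <= i
        scores = np.where(mask, Q @ K.T / np.sqrt(h.shape[-1]), -1e9)
        return softmax(scores) @ V

    L, d = 8, 16
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    prev_segment = rng.normal(size=(L, d))                      # cached activations; gradients are stopped
    curr_segment = rng.normal(size=(L, d))
    print(xl_attention(curr_segment, prev_segment, Wq, Wk, Wv).shape)   # (8, 16)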


3.2.17 Compressive Transformers


Compressive Transformers (Rae et al., 2020) are a natural extension of the Transformer-XL
model. The key idea behind the Compressive Transformer is to maintain a fine-grained
memory of past segment activations. This is unlike Transformer-XL, which discards past
activations as it moves across segments.

Memory The Compressive Transformer is characterized by a dual memory system - a


primary memory and a secondary compressed memory. It maintains a memory with n_m memory slots and n_cm compressive memory slots. Whenever the model accepts a new input segment, the oldest n_s activations in the primary memory are moved to the compressed
memory where a compression function is applied.

Compression These memories are compressed with a variety of compression functions


such as (1) mean/max pooling (2) 1D convolutions, (3) dilated convolutions, and (4) most
used (e.g., sorted by usage of attention).

Memory Reconstruction In order to better retain memories over long sequences, the
Compressive Transformer implements an auto-encoding loss that learns to reconstruct the
original memory from its compressed version, i.e., L_ae = ||old_mem − g(new_cm^(i))||, where g(·) : R^{(n_s/c)×d} → R^{n_s×d} is a parameterized function. A second, attention-reconstruction loss is a lossy variant that attempts to reconstruct the attention over the memory instead of losslessly reconstructing the memory itself.


4. Discussion
This section discusses the state of research pertaining to this class of efficient models.

4.1 On Evaluation
While the field is bustling with new Transformer models, there is hardly an easy way to
compare these models side by side. Many research papers select their own benchmarks to
showcase the abilities of the proposed model. This is also coupled with different hyperpa-
rameter settings like model sizes and configurations which can make it difficult to correctly
attribute the reason for the performance gains. Moreover, some papers conflate this with pretraining (Devlin et al., 2018), which makes it even harder to distinguish the relative performance of these different models. It remains unclear which fundamental efficient Transformer block one should consider using.
On one hand, there are multiple models that focus on generative modeling, showcasing
the ability of the proposed Transformer unit on auto-regressive modeling of sequences. To


this end, Sparse Transformers (Child et al., 2019), Adaptively Sparse Transformers (Correia et al., 2019), Routing Transformers (Roy et al., 2020), and Reformers (Kitaev et al., 2020) are
mainly focused on generative modeling tasks. These benchmarks typically involve language
modeling and/or pixel-wise image generation on datasets such as wikitext, enwik8 and/or
ImageNet/CIFAR. Models that use segment based recurrence such as Transformer-XL and
Compressive Transformers are also focused on long-range language modeling tasks such as
PG-19.
On the other hand, a collection of models is mainly focused on encoder-only tasks such as question answering, reading comprehension, and/or selections from the GLUE benchmark. For example, the ETC model (Ainslie et al., 2020) only runs experiments on question answering benchmarks such as NaturalQuestions or TriviaQA, while the Linformer (Wang et al., 2020b) focuses on subsets of the GLUE benchmark. This split is very natural and intuitive, since models like ETC and Linformer cannot be used in an auto-regressive fashion, i.e., cannot be used to decode. This exacerbates the difficulty of comparing these encoder-only models with the other models.
Some models focus on striking a balance between the two. The Longformer (Beltagy et al., 2020) runs benchmarks on both generative modeling and encoder-only tasks, and the Sinkhorn Transformer (Tay et al., 2020b) likewise compares on both generative modeling and encoder-only tasks.
Additionally, it is worth noting that, although Seq2Seq machine translation (MT) was one of the problems that popularized Transformer models, not many of these efficient Transformer models are evaluated on MT. This is likely because sequence lengths in MT are not long enough to warrant the usage of these models.
While generative modeling, GLUE tasks, and/or question answering appear to be the common evaluation benchmarks adopted by many of these models, there are several niche benchmarks that a small number of papers choose to evaluate on. For instance, the Performer model (Choromanski et al., 2020) evaluates on masked language modeling for proteins, deviating from head-on comparisons with other efficient Transformer models. The Linear Transformer (Katharopoulos et al., 2020) also evaluates on speech recognition, which is a rare benchmark amongst this group of papers.

4.2 On Model Design Trends


When matching our broad categorization against the timeline of when these models were introduced, we can see the trend the community is taking towards designing efficient Transformer models. Early work in this area has primarily been dominated by more intuitive and simple approaches such as fixed patterns; most of it is based on block/local patterns such as the Image Transformer (Parmar et al., 2018), Compressed Attention (Liu et al., 2018), the Blockwise Transformer (Qiu et al., 2019), or the local windows in the Sparse Transformer (Child et al., 2019).
The paradigm of factorizing various fixed patterns was first introduced by the Sparse Transformer (Child et al., 2019) and the Axial Transformer (Ho et al., 2019). Around the same time, we start to observe early traces of memory-based approaches, namely the inducing point method in the Set Transformer (Lee et al., 2019) and the global nodes in the Star Transformer (Guo et al., 2019a) model.


We observe that the next wave of models comes in the form of learnable sparsity patterns. The Reformer (Kitaev et al., 2020) and Routing Transformer (Roy et al., 2020) are very similar in the sense that both learn to cluster/bucket tokens before performing attention. The key difference lies in the means to that end: the Reformer uses a hashing function while the Routing Transformer uses online k-means for cluster assignment. In parallel, the Sinkhorn Transformer (Tay et al., 2020b) is also based on the idea of sorting, albeit at the block level. These three models largely follow a similar paradigm of re-arranging sequences for efficient computation of attention scores.
Next, we observe several extensions that are largely built off the Sparse Transformer paradigm. The ETC (Ainslie et al., 2020) and Longformer (Beltagy et al., 2020) models are very similar ideas that are fundamentally Sparse Transformer extensions. These models incorporate the notion of a global memory, which is reminiscent of the Set Transformer's inducing point method or the global memory of the Star Transformer. Modifications to strides, such as using dilated windows, were also proposed in the Longformer work.
The most recent wave of models is based on low-rank approximation or kernel methods, e.g., the Linformer (Wang et al., 2020b), the Performer (Choromanski et al., 2020), and Linear Transformers (Katharopoulos et al., 2020). However, given the current state of evaluation and the high degree of parallelism in research, it remains unclear whether this low-rank or kernel paradigm is actually better than the learnable pattern (LP) or memory-based efficient Transformer models.
As an aside, it is worth noting that the recurrence-based models (Transformer-XL and Compressive Transformers) operate largely orthogonally to these categories and are less directly comparable with the other models.

4.3 Brief Discussion on Orthogonal Efficiency Efforts


While this paper is focused on the computational and memory complexity of the self-
attention module, we briefly summarize several orthogonal efforts that may also contribute
to model efficiency, scalability and overall usability of Transformer models.
• Weight Sharing Sharing the parameters of a Transformer model helps reduce the overall model size. The Universal Transformer (Dehghani et al., 2018) ties attention and transition weights across layers. Similarly, ALBERT (Lan et al., 2019) shares parameters across layers. On the other hand, the Quaternion Transformer (Tay et al., 2019) proposes a weight sharing scheme inspired by Hamilton products that locally shares the components in the linear transformation layers.

• Quantization / Mixed Precision Learning mixed precision models has the potential to reduce memory costs. Q-BERT (Shen et al., 2020) quantizes Transformer models to ultra-low precision. Meanwhile, mixed precision training (Ott et al., 2019) is a highly popular technique for reducing the memory costs of training Transformers. (Fan et al., 2020) apply quantization-aware training to Transformer models.

• Knowledge Distillation Knowledge distillation (KD) (Hinton et al., 2015) has been a useful technique for transferring the knowledge learned by a larger teacher model to a smaller student model. The smaller model can then be efficiently deployed into production. There have been many attempts to distill large Transformer models, for example DistilBERT (Sanh et al., 2019), task-specific distillation (Tang et al., 2019), and TinyBERT (Jiao et al., 2019).

• Neural Architecture Search (NAS) Searching for more efficient Transformer architectures is also a common strategy. (Guo et al., 2019b) proposed the Neural Architecture Transformer (NAT), using NAS to search for more compact and efficient Transformers by removing redundant operations. (Wang et al., 2020a) proposed HAT (Hardware-aware Transformers), a method that leverages NAS and uses hardware efficiency feedback as a reward signal.

• Task Adapters This line of research has been primarily focused on the problem of fine-tuning a large Transformer on T tasks while aiming to reuse parameters across tasks. The key idea is that task adapters (Houlsby et al., 2019) enable the reuse of parameters across tasks and remove the need to serve T separate models in production, resulting in overall parameter savings. A modest number of models have been proposed, such as PALS (Stickland and Murray, 2019), MAD-X (Pfeiffer et al., 2020), and HyperGrid (Tay et al., 2020c).

5. Conclusion
In this paper we surveyed the literature on efficient Transformer models, especially pertaining to the quadratic complexity of the self-attention module. We provided a taxonomy and high-level abstraction of the core techniques employed in this class of new models. We characterized the existing models based on their techniques and provided a comprehensive walkthrough of several of the efficient Transformer models. Finally, we discussed the evaluation landscape of these models along with their design trends, and ended with a brief discussion of other parallel, orthogonal efforts that may improve the efficiency of Transformer models in general.


References
Ryan Prescott Adams and Richard S Zemel. Ranking via sinkhorn propagation. arXiv
preprint arXiv:1106.1925, 2011.

Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. Weighted transformer network
for machine translation. arXiv preprint arXiv:1711.02132, 2017.

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Philip Pham, Anirudh Ravula, and Sumit
Sanghai. Etc: Encoding long and structured data in transformers. arXiv preprint
arXiv:2004.08483, 2020.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document trans-
former. arXiv preprint arXiv:2004.05150, 2020.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov,
and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint
arXiv:2005.12872, 2020.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences
with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis,
Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. Masked language
modeling for proteins via linearly scalable long-context transformers. arXiv preprint
arXiv:2006.03555, 2020.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep
network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,
2015.

Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers.
arXiv preprint arXiv:1909.00015, 2019.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhut-
dinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv
preprint arXiv:1901.02860, 2019.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
Universal transformers. arXiv preprint arXiv:1807.03819, 2018.


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Je-
gou, and Armand Joulin. Training with quantization noise for extreme fixed-point com-
pression. arXiv preprint arXiv:2004.07320, 2020.
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang.
Star-transformer. arXiv preprint arXiv:1902.09113, 2019a.
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang.
Nat: Neural architecture transformer for accurate and compact architectures. In Advances
in Neural Information Processing Systems, pages 737–748, 2019b.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in
multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Larous-
silhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer
learning for nlp. arXiv preprint arXiv:1902.00751, 2019.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and
Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint
arXiv:1909.10351, 2019.
Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions
for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Trans-
formers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint
arXiv:2006.16236, 2020.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient trans-
former. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=rkgNKkHtvB.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. Albert: A lite bert for self-supervised learning of language representations.
arXiv preprint arXiv:1909.11942, 2019.
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh.
Set transformer: A framework for attention-based permutation-invariant neural networks.
In International Conference on Machine Learning, pages 3744–3753, 2019.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser,
and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint
arXiv:1801.10198, 2018.


Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling.
arXiv preprint arXiv:1904.01038, 2019.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander
Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and
Jon Shlens. Stand-alone self-attention in vision models. In Advances in Neural Informa-
tion Processing Systems, pages 68–80, 2019.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. Mad-x: An adapter-based
framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052, 2020.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Block-
wise self-attention for long document understanding. arXiv preprint arXiv:1911.02972,
2019.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. Compressive transformers for long-range sequence modelling. In International
Conference on Learning Representations, 2020. URL https://openreview.net/forum?
id=SylKikSYDH.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-
based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997, 2020.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled
version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,
2019.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W
Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of
bert. 2020.

Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic
matrices. The annals of mathematical statistics, 35(2):876–879, 1964.

David R So, Chen Liang, and Quoc V Le. The evolved transformer. arXiv preprint
arXiv:1901.11117, 2019.

Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient
adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019.

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Ar-
mand Joulin. Augmenting self-attention with persistent memory. arXiv preprint
arXiv:1907.01470, 2019.


Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Dis-
tilling task-specific knowledge from bert into simple neural networks. arXiv preprint
arXiv:1903.12136, 2019.

Yi Tay, Aston Zhang, Luu Anh Tuan, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu,
and Siu Cheung Hui. Lightweight and efficient neural natural language processing with
quaternion networks. arXiv preprint arXiv:1906.04393, 2019.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthe-
sizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743,
2020a.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn
attention. arXiv preprint arXiv:2002.11296, 2020b.

Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. Hypergrid: Efficient
multi-task transformers with grid-wise decomposable hyper projections. arXiv preprint
arXiv:2007.05891, 2020c.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in neural information processing systems, pages 5998–6008, 2017.

Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song
Han. Hat: Hardware-aware transformers for efficient natural language processing. arXiv
preprint arXiv:2005.14187, 2020a.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-
attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhut-
dinov, and Alexander J Smola. Deep sets. In Advances in neural information processing
systems, pages 3391–3401, 2017.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi-
ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird:
Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.
