SimpleTRON: Simple Transformer with O(N) Complexity
Anonymous Author(s)
Affiliation
Address
email
Abstract
In this paper, we propose that the dot-product pairwise-matching attention layer, which is widely used in Transformer-based models, is redundant for model performance. Attention, in its original formulation, should rather be seen as a human-level tool to explore and/or visualize relevancy scores in sequential data. However, the way it is constructed leads to significant computational complexity. Instead, we present SimpleTRON: Simple Transformer with O(N) Complexity, a simple and fast alternative without any approximation that, unlike other approximation models, has no architecture-related overhead and can therefore be seen as a purely linear Transformer-like model. This architecture, to the best of our knowledge, outperforms existing sub-quadratic attention-approximation models on several tasks from the Long-Range Arena benchmark. Moreover, we show that SimpleTRON can benefit from weight transfer from pretrained large language models, as its parameters are fully transferable.
1 Introduction
Initially designed for natural language processing, the Transformer architecture [1] spread to a wide range of other domains and quickly became the state of the art in language modeling [2] as well as in generative tasks [3, 4], image processing [5, 6], speech recognition [7], reinforcement learning [8], and others. Ever since the original paper, "Attention is All You Need" [1], it has been widely assumed that the query-key-value framework, which implies a global pairwise comparison between query and key tokens, is a necessary condition for model performance. Even though such a mechanism allows a human-comprehensible visualization of interactions between the tokens, unveiling interpretability to some extent, an element-wise token comparison leads to quadratic complexity in both time and space. Therefore, even though the original Transformer architecture can in principle handle arbitrarily long-range dependencies given infinite compute, in contrast to most recurrent neural networks [9], the complexity of regular full-rank attention limits Transformer applications when long sequences are required.
In this paper we present the SimpleTRON model with a SimpleAttention mechanism as an extremely simple yet efficient replacement for the original quadratic-complexity softmax attention. The proposed mechanism not only has linear time and memory complexity, but also outperforms current state-of-the-art Transformers on the text classification, matching, and ListOps tasks from the LRA [10] benchmark, which has become a widely used test for sequence-processing models. Moreover, since SimpleAttention has building blocks analogous to those of the original attention, it is suitable for transfer learning: one can reuse pre-trained weights from existing Transformer models.
[Figure 1 diagram: "Original Self-Attention Mechanism" (top panel) and "Simple Self-Attention Mechanism" (bottom panel)]
Figure 1: Illustration of the self-attention calculation in the vanilla Transformer (top) and the SimpleTRON self-attention (bottom). Dashed blocks represent matrices: X is the input sequence {x_1, x_2, ..., x_L} of length L; Q*, K*, V* are the query, key, and value square matrices of dimensions $D_E \times D_{hid}$; d is the dimension of a single head ($d = D_{hid}/N_h$); $D_E$ is the embedding dimensionality.
2 Related Work
The computational complexity of the original model is a restrictive limitation that motivated the community to search for ways to approximate the architecture with asymptotically faster models [11]. As a result, a dizzying number of so-called "Efficient Transformers" have recently appeared. Each of these implementations applies some notion of sparsity to the otherwise dense attention mechanism and reaches sub-quadratic complexity with comparable performance.
Among the solutions to rationalize Transformer complexity, there are engineering approaches such as sparse attention [12, 13], graph attention [14], and compressive attention [15], which maps past hidden activations to a smaller set of compressed representations and thus allows longer sequences at comparable compute. Further engineering methods include Longformer [16], whose attention mechanism combines a windowed local-context self-attention with a global attention that encodes inductive bias; Conformer [17] and Attention Augmented CNNs [18], which are hybrid architectures of CNN-augmented Transformers; Imputer [19], a model that generates output sequences iteratively via imputations and dynamic programming; Reformer [20], which uses locality-sensitive-hashing attention and reversible residual layers; N-gram Masked Self-Attention [21]; and others.
Another branch of research on reducing Transformer complexity is dedicated to matrix and kernel approximations with a strong mathematical foundation. This includes Performer [22], which uses a kernel approximation, factorized attention [23], random feature attention [24], and, for example, Nyströmformer [25], which uses the Nyström matrix approximation. A learnable kernel approximation was presented by Chowdhury et al. [26], who learn the kernel's spectral distribution and approximate the Transformer kernel as a dot product between spectral feature maps. Continual Transformers [27] reduce time and memory complexity by introducing continual retroactive and single-output attention mechanisms that aggregate matrices over time and cache the step vectors.
Additionally, several studies have changed the whole concept of attention, replacing it with the Fast Fourier Transform (FFT), which does not require any training [28], or with dense layers that mix the tokens along both axes [29]. Another work [30] reported a vision-transformer-inspired [5] model that independently mixes the spatial and channel locations of image patches using depth-wise convolutions, outperforming existing Transformer-based solutions for image recognition.
3 The Model
The original multi-head attention layer applies a softmax normalization to the head-wise product of the query and transposed key matrices, combined with the value matrix as
$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^T}{\sqrt{d}}\right) V_h,$$
where $Q_h, K_h, V_h \in \mathbb{R}^{L \times d}$ are the query, key, and value, respectively, corresponding to the $h$-th head, $d$ is the query dimensionality, and $L$ is the length of the input sequence. The head-wise inputs to the operation are obtained by splitting the $Q, K, V \in \mathbb{R}^{L \times D_{hid}}$ matrices along the hidden-dimension axis $D_{hid}$ into $N_h$ pieces of size $d = D_{hid}/N_h$, corresponding to the $N_h$ heads. The input $X \in \mathbb{R}^{L \times D_{hid}}$ is transformed into the $Q$, $K$, and $V$ matrices by a linear transformation with matrices $Q^*, K^*, V^* \in \mathbb{R}^{D_{hid} \times D_{hid}}$ and biases $q^*, k^*, v^* \in \mathbb{R}^{L \times D_{hid}}$ as parameters:
$$Q = XQ^* + q^*, \quad K = XK^* + k^*, \quad V = XV^* + v^*.$$
The final output of the attention layer is then produced by applying another linear layer to the concatenation of all heads and adding the duplicated input $X$, which corresponds to a skip connection:
$$\mathrm{SelfAttention}(X) = X + W\big(\mathrm{Attention}(Q_1, K_1, V_1), \ldots, \mathrm{Attention}(Q_{N_h}, K_{N_h}, V_{N_h})\big) + w,$$
where $W \in \mathbb{R}^{L \times D_{hid}}$ and $w \in \mathbb{R}^{L \times D_{hid}}$ are the parameters of the linear layer.
$Q_h$, $K_h$, and $V_h$ are rectangular matrices whose first dimension typically dominates the second. The quadratic complexity with respect to the sequence length thus arises from the $Q_h K_h^T$ operation. Swapping the matrix multiplication order (first $K_h^T V_h$, then multiplying by $Q_h$) would reduce the complexity to linear; however, the softmax non-linearity forbids such a reordering. Here, we applied several major tweaks to the model in order to reach linear complexity and to improve performance.
As a result, we obtain a no-softmax attention with the direct q-k-v product, which can be described by the following simple formula for the attention operation on a single head:
$$\mathrm{SimpleAttention}(Q_h, K_h, V_h) = \frac{1}{\sqrt{L}}\, Q_h \big(K_h^T V_h\big).$$
When the linear layer is not used to produce the final output, the q-k-v product, concatenated over all heads, goes directly into the residual sum with the duplicated input from the skip connection.
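As an illustration of the reordering, the following PyTorch sketch (module and variable names are ours, not taken from the released code) implements a multi-head SimpleAttention layer that computes $K_h^T V_h$ first and adds the residual sum at the end:

```python
# Minimal sketch of SimpleAttention, assuming the shapes defined above.
import torch
import torch.nn as nn


class SimpleAttention(nn.Module):
    def __init__(self, d_hid: int, n_heads: int):
        super().__init__()
        assert d_hid % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_hid // n_heads
        # Same q-k-v projections as in softmax attention, so weights stay transferable.
        self.q_proj = nn.Linear(d_hid, d_hid)
        self.k_proj = nn.Linear(d_hid, d_hid)
        self.v_proj = nn.Linear(d_hid, d_hid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_hid)
        b, L, _ = x.shape

        def split(t):  # (batch, L, d_hid) -> (batch, n_heads, L, d_head)
            return t.view(b, L, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Key step: compute K^T V first (d_head x d_head), then multiply by Q.
        # No softmax, only the 1/sqrt(L) scaling term.
        context = torch.matmul(k.transpose(-2, -1), v)   # (b, heads, d_head, d_head)
        out = torch.matmul(q, context) / (L ** 0.5)      # (b, heads, L, d_head)
        out = out.transpose(1, 2).reshape(b, L, -1)      # concatenate heads
        return x + out                                   # residual (skip) connection
```

Because $K_h^T V_h$ is a $d \times d$ matrix, both multiplications cost $O(L d^2)$, i.e. linear in the sequence length.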
Unlike linear mixing models such as [28], this transformation is not linear in the input $X$. To see this, note that
$$Q_h = Q I_h = X Q^* I_h + q^* I_h,$$
and analogously for $K_h$ and $V_h$, where $I_h \in \mathbb{R}^{D_{hid} \times d}$ is a matrix with all entries zero except for a $d \times d$ identity matrix on rows $d(h-1)+1$ to $dh$; i.e., the multiplication $Q I_h$ selects exactly those columns of $Q$ that correspond to the $h$-th head. Hence, we obtain
$$\mathrm{SimpleAttention}(X) = \frac{1}{\sqrt{L}}\, (X Q^* I_h + q^* I_h)\big((X K^* I_h + k^* I_h)^T (X V^* I_h + v^* I_h)\big).$$
This expression resembles a quadratic form multiplied again by the input.
We refer to the above-mentioned mechanism as SimpleAttention and to the model as SimpleTRON, which stands for Simple Transformer with O(N) Complexity. The matrix operations within a single head of SimpleAttention are illustrated in Figure 1.
Table 1: Baseline and proposed models on the three LRA tasks. We denote the sequence length as L, the attention span as K, and the Sinkhorn model block size as B. The notation for our models is: Simple - SimpleAttention without the skip connection and the linear layer; Simple-Res - SimpleAttention with the skip connection and without the linear layer; Simple-ResL - SimpleAttention with the skip connection and the linear layer after the q-k-v multiplication.
4 Experiments
Even though numerous sub-quadratic approximations of the vanilla Transformer claim comparable or even superior performance to the original model, it is fair to say that each model can be task-dependent and yield strikingly different results across modalities. Moreover, some benchmark tests can be parameter-dependent, so bigger models may perform better. Up to a point, effective evaluation of Transformer-like models was therefore uncertain due to the absence of a unified and systematic benchmark. In this regard, Tay et al. [10] published a benchmark for efficient Transformer models called "Long Range Arena" (LRA), which consists of tasks of various data types and modalities, with data presented in sequences ranging from 1K to 16K tokens. We use LRA [10] as the standardised benchmark for efficient Transformer evaluation:
• Following the recommendations of [10], we replicate the learning schedule and all the hyperparameters related to our model architecture, while keeping the additional parametrization below 10%.
• To reproduce the experimental setup of [10], we used gradient accumulation to simulate larger batch sizes (a minimal sketch is shown after this list).
• Given the stochastic weight initialization and sampling, each model was trained 5 times to observe model behavior and accuracy variance and to avoid so-called black swans, random seeds that give radically different results [31]. Best results are reported in Table 1.
• As we focus on the NLP domain in the present work, we test our model on three LRA tasks: BPE text classification, information retrieval (matching), and ListOps.
• Since our models tend to converge more slowly in terms of the number of iterations, we prolonged training on the matching and ListOps tasks to 15K steps.
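The gradient-accumulation setup mentioned above can be sketched as a minimal PyTorch training loop (placeholder names; the actual schedules and optimizers follow [10]):

```python
# Sketch of gradient accumulation used to simulate a larger effective batch size.
# Here 8 micro-batches are accumulated per optimizer step; the model, loader and
# criterion are placeholders, not the exact LRA training configuration.
accum_steps = 8  # effective batch = accum_steps * micro-batch size


def train_epoch(model, loader, optimizer, criterion):
    model.train()
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = criterion(model(x), y) / accum_steps  # scale so accumulated gradients average
        loss.backward()                              # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```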
At the beginning of the paper, we raised the question of whether a pairwise-matching attention layer, or any of its approximations, is needed at all. To find the answer, we performed a simple experiment. First, the SimpleTRON model was trained regularly until it reached its top accuracy. Since the q-k-v matrices have the same dimensions in both SimpleAttention and the original SoftmaxAttention, the weights are interchangeable. We therefore transferred the trained weights from SimpleAttention to SoftmaxAttention, froze the q-k-v layers, and retrained the rest of the model. The logic behind this experiment is simple: if a pairwise comparison in the q-k product were needed, the model would not be able to reach the accuracy of a vanilla Transformer, because the q-k-v layers are frozen and not trained optimally.
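A rough sketch of this transfer-and-freeze step, assuming attention modules that expose q_proj/k_proj/v_proj linear layers as in the sketch above (the attribute names are illustrative, not the authors' code):

```python
# Copy the q-k-v projection weights from a trained SimpleAttention layer into a
# softmax attention layer of the same shape, then freeze them so only the rest
# of the model is retrained.
def transfer_and_freeze(simple_layer, softmax_layer):
    for name in ("q_proj", "k_proj", "v_proj"):
        src = getattr(simple_layer, name)
        dst = getattr(softmax_layer, name)
        dst.load_state_dict(src.state_dict())   # shapes match, so weights are interchangeable
        for p in dst.parameters():
            p.requires_grad = False             # keep the transferred q-k-v weights frozen
```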
Moreover, the community has invested an immense effort into pretraining large language models on comprehensive datasets, allowing many researchers and companies to reap the benefits of transfer learning by fine-tuning pretrained models on specific tasks from various domains [33, 34, 35]. Lately, it has been proposed that the learning abilities of Transformer models trained on an extensive language dataset can transcend the NLP modality and be used as universal computation engines [36]. On the other hand, such training is an extremely resource-demanding process [37] with a considerable carbon footprint [38]. Therefore, simply out of curiosity, we tried fine-tuning our SimpleTRON model for text classification using weights from pretrained BERT [2]. This weight transfer is applicable to the SimpleTRON architecture because the size and dimensionality of the model layers can be made fully identical and the weights are therefore transferable.
Number of parameters By removing the feed-forward linear layer following the q-k-v product, our model in fact has fewer parameters than its counterparts given the restrictions reported in [10]. The difference is only a small margin, but worth mentioning since the authors placed parametrization restrictions in their paper.
Training speed By swapping the order of the q-k-v product matrices and avoiding any kind of approximation, we reach truly linear complexity with respect to the input length. It has to be emphasized that most linear attention approximations reporting linear complexity in fact omit a large architecture-dependent multiplier, which should be taken into account in practice.
Memory efficiency On the above-mentioned LRA tasks, the current model was found to be an order of magnitude more memory-efficient than the vanilla Transformer. For a visual comparison, see Figure 3, which was obtained using the text-classification task architecture on a synthetic dataset with sequences of length 256 to 32K, in order to explore the memory and time consumption of the model compared to the vanilla Transformer. As we used a Tesla V100 GPU with 16 GB of RAM, the vanilla Transformer model could not fit the data in memory even with a batch size of 1.
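The measurement behind Figure 3 can be approximated by a simple profiling loop of the following form (a sketch with placeholder model construction and lengths, not the exact benchmarking script):

```python
# Push random token sequences of increasing length through the classifier and
# record peak GPU memory and wall-clock time per forward/backward pass.
import time
import torch


def profile(model, vocab_size=256, lengths=(256, 1024, 4096, 16384, 32768), device="cuda"):
    model = model.to(device)
    for L in lengths:
        x = torch.randint(0, vocab_size, (1, L), device=device)  # batch size 1
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.time()
        model(x).sum().backward()                                 # forward + backward pass
        torch.cuda.synchronize(device)
        mem_mb = torch.cuda.max_memory_allocated(device) / 2**20
        print(f"L={L:6d}  time={time.time() - start:6.3f}s  peak memory={mem_mb:8.1f} MB")
```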
Classification accuracy To observe the model behaviour and accuracy variance, we train our models 5 times to avoid so-called black swans, random seeds that give radically different results [31]. As a result, our model showed 66.75/73.92/37.45% top accuracy on the text classification/matching/ListOps test splits and 66.61/73.74/37.15% mean test accuracy (over 5 runs), respectively, outperforming other known linear approximations of the attention mechanism.
[Figure 2 plots: accuracy (and learning rate, bottom panel) versus iteration; legends: train/validation (top, center) and Softmax Attention / Pretrained Simple Attention (bottom)]
Figure 2: Training evolution plots of example runs on ListOps (top) and Matching (center), and (bottom) training (dashed lines) and validation (solid lines) accuracy curves of the original vanilla Transformer architecture (red lines) and a Transformer with frozen q-k-v layers transferred from the SimpleAttention model (blue lines).
Normalization The original Transformer model uses the $1/\sqrt{d}$ normalization term on the $QK^T$ product to counteract vector-magnitude explosion and the resulting decreased gradient flow through the softmax. Since we do not use any saturating function in the attention module, our model works without any normalization term. However, we have found the $1/\sqrt{L}$ term to be useful for more stable convergence.
Convergence We have found our model to converge more slowly than the vanilla Transformer in terms of the number of iterations on certain tasks, such as Matching and ListOps (see Figure 2); however, this is counteracted by faster computation, especially for the longer sequences in the ListOps benchmark, as shown in Figure 3.
[Figure 3 plot: epoch time (s) and memory (MB) versus sequence length for SimpleTRON and the Transformer]
Figure 3: Memory and time complexity of the SimpleAttention model in comparison with the vanilla Transformer with respect to the input length.
Attention transfer Transferring the weights from the SimpleAttention model to the original SoftmaxAttention model showed an interesting behaviour: with about 30% fewer trainable parameters and in fact no ability to learn pairwise relations between the tokens in a sequence, the model trained up to the original accuracy of the vanilla Transformer (Figure 2 (bottom)), but in far fewer training epochs. Given that the gradients in the q-k-v layers are frozen, we take such SimpleAttention pre-training as evidence that the pairwise token comparison is redundant. This also opens a way toward fast (re)training of already deployed models.
It has to be emphasized that models in which the order of matrix multiplication in the attention head is simply swapped, without any further modification, often do not converge. Successful training is possible only when the number of blocks is low (e.g., 4 blocks for the text classification task). Even though, at the very early stage of training, deeper models with simple attention are on par with the vanilla Transformer, after a certain number of epochs they work no better than random choice.
One pathway to stable and efficient training is to remove the linear layer that follows the attention output, as described above. However, we empirically found that for larger models with a higher number of parameters, as usually applied in practice (i.e., comparable with BERT [2] in the number of parameters), the original SimpleTRON architecture shows performance lagging behind the vanilla Transformer. Moreover, in some cases a linear layer following the attention operation is technically necessary, namely when the dimensionality $D_E$ differs from $D_{hid}$. Another case where the linear layer is beneficial is weight transfer from a pretrained Transformer-like model.
Another option is to add skip connections [39] through the SimpleTRON block. Although the original Transformer block already has a skip connection from the block input to the Layer Normalization layer, in the present implementation we added another skip connection (shown in red in Figure 5). This allowed us to train larger models of arbitrary depth. In this case, the presence of an additional linear layer does not have any deleterious effect on the model's convergence.
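A possible reading of this block, with the extra skip connection around the attention module, is sketched below (the module names and the exact LayerNorm placement are our interpretation of Figure 5, not the reference implementation):

```python
# Sketch of a SimpleTRON block with an additional skip connection around SimpleAttention.
import torch.nn as nn


class SimpleTronBlock(nn.Module):
    def __init__(self, d_hid: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = SimpleAttention(d_hid, n_heads)   # from the earlier sketch
        self.norm1 = nn.LayerNorm(d_hid)
        self.norm2 = nn.LayerNorm(d_hid)
        self.ff = nn.Sequential(nn.Linear(d_hid, d_ff), nn.ReLU(), nn.Linear(d_ff, d_hid))

    def forward(self, x):
        # SimpleAttention already adds x internally (first skip); adding x again here
        # is the additional skip connection through the block described in the text.
        h = self.norm1(x + self.attn(x))
        return self.norm2(h + self.ff(h))
```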
The reason deeper models fail to train is that the weights of the SimpleAttention output tend to become symmetrical in the deeper blocks of the model. We found that the additional skip connections lead to a higher variance in the weights and therefore to a better inference ability of the model. Furthermore, to show that the models with skip connections perform well on the LRA dataset, we performed experiments using models with skip connections, with and without the linear layer. The results are consistent with those of the original SimpleTRON model, while the model with the linear layer usually does not converge without skip connections (see Table 1). It is worth mentioning that we could not obtain any accuracy gain by stacking more SimpleTRON blocks together, either with or without skip connections.
[Figure 4 plots: three panels of accuracy versus iteration, with train and validation curves]
Figure 4: Training evolution plots for a SimpleAttention model containing 8 blocks, on the text classification task (training and validation accuracy versus iteration).
5.3 Larger models and utilizing weights from pretrained BERT
As shown above, our architecture is superior on long-text classification from the LRA benchmark, which is a unified test for efficient Transformer models. However, the true power of the Transformer architecture is its ability to capture patterns from large-scale comprehensive datasets (often natural language datasets). Therefore, we performed a preliminary experiment comparing the BERT [2] language model with a SimpleTRON model of similar architecture. Training from scratch on the AG News Corpus dataset [40] showed that a SimpleTRON model (with skip connections and the linear layer) whose architecture mimicked BERT reached 89.9% accuracy, while training a vanilla Transformer with the BERT-base architecture yielded 1.2% higher accuracy.
[Figure 5 diagram: two block diagrams, each containing "Multi-Head Simple Self-Attention" and "Add & LayerNorm" modules]
Figure 5: Illustration of the SimpleAttention block with a double skip connection, without (left) and with (right) a normalized output.
Table 2: Model performance with respect to the number of layers in the model, trained on the AG News Corpus dataset. (Table footnotes: 1 parallel blocks; 2 normalized.)
However, since the SimpleTRON model in this experiment contained the linear layer, the weights from BERT are fully transferable to the proposed architecture. Therefore, using the weights from a pretrained BERT model, we were able to fine-tune the SimpleTRON architecture. Interestingly, even though SimpleTRON is in fact a different model, we obtained an inference gain by using the weights from the pretrained model, reaching an accuracy of 92.8%, which is 2.7% higher than training the model from scratch on the AG News dataset. As mentioned earlier, training a large language model properly requires vast resources. Weight transfer from a Transformer to SimpleTRON is a step towards more sustainable training that reuses already trained models.
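Assuming the Hugging Face `transformers` layout of BERT and the hypothetical SimpleTRON modules sketched earlier (the `blocks` and `q_proj`/`k_proj`/`v_proj` names are ours), the weight transfer could look roughly like this:

```python
# Sketch of reusing pretrained BERT q-k-v weights in a SimpleTRON model with matching shapes.
from transformers import BertModel


def load_bert_qkv(simpletron, bert_name="bert-base-uncased"):
    bert = BertModel.from_pretrained(bert_name)
    for block, bert_layer in zip(simpletron.blocks, bert.encoder.layer):
        # The q-k-v projections have identical shapes, so the weights transfer directly.
        block.attn.q_proj.load_state_dict(bert_layer.attention.self.query.state_dict())
        block.attn.k_proj.load_state_dict(bert_layer.attention.self.key.state_dict())
        block.attn.v_proj.load_state_dict(bert_layer.attention.self.value.state_dict())
    return simpletron
```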
As discussed above, the SimpleTRON architecture works especially well when the number of stacked blocks is limited; therefore, we performed experiments on models with reduced depth, both for fine-tuning and for training from scratch. Indeed, the SimpleTRON architecture was found to outperform the BERT-based model when the two models, each containing 6 blocks, were trained from scratch. While transferring 6 blocks of pretrained BERT gave a notable increase in performance, the 6-block SimpleTRON model lags only a small margin behind the original BERT architecture.
Overall, we found that even though SimpleTRON blocks are able to outperform the Transformer architecture, in the case of larger models the proposed architecture does not take good advantage of stacked blocks. This is a subject for further investigation of SimpleTRON training and regularization. The performance of models with depths from 1 to 12 blocks shows that the testing accuracy of our model saturates quickly with depth and may even decrease, whereas with the original self-attention mechanism inference accuracy increases with model depth. Nevertheless, this model degradation can be mitigated by stacking blocks in parallel (Table 2); even though the parallel architecture, i.e., a wide model, does not show rapidly increasing accuracy with the number of blocks, training is more stable and the wider models converge better.
As discussed above, the weights of the SimpleAttention output tend to become symmetrical in the deeper blocks of the model; therefore, the third approach we tried for deeper models was weight normalization. So far, the best results were obtained with a small tweak of the SimpleAttention block in which the second skip connection is normalized using LayerNorm before the final output (see Figure 5 (right)). According to the preliminary results shown in Table 2, such a model outperforms BERT for 1, 6, and 12 layers when trained from scratch on the AG News Corpus dataset. Therefore, in this case we managed to preserve the approximation capacity of the SimpleAttention block and take advantage of stacked attention blocks.
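A sketch of this normalized variant, building on the hypothetical block above and placing an extra LayerNorm on the output path (our interpretation of the figure, not the authors' code):

```python
# Normalized variant: an additional LayerNorm on the block output / second skip path.
import torch.nn as nn


class NormalizedSimpleTronBlock(SimpleTronBlock):  # SimpleTronBlock from the sketch above
    def __init__(self, d_hid: int, n_heads: int, d_ff: int):
        super().__init__(d_hid, n_heads, d_ff)
        self.out_norm = nn.LayerNorm(d_hid)  # normalizes the second skip path before output

    def forward(self, x):
        h = self.norm1(x + self.attn(x))
        return self.out_norm(self.norm2(h + self.ff(h)))
```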
Different tasks The Transformer architecture is known to be pervasive; however, we are fully aware that the performance of any model can be task-dependent. Here we show results on the text-classification tasks from the LRA benchmark and on AG News. Our goal in the near future is therefore to expand SimpleTRON to other modalities, such as computer vision, as well as to look for a more efficient way to utilize model depth.
q-k-v framework elimination In the present work, we followed the original q-k-v framework in order to take a step further towards an attention-less Transformer architecture. However, we believe that a more efficient framework exists, since the q-k-v formulation originally assumed a global pairwise comparison.
References
[1] Ashish Vaswani et al. Attention Is All You Need. 2017. arXiv: 1706.03762 [cs.CL].
[2] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[3] Alec Radford et al. "Language Models are Unsupervised Multitask Learners". In: (2018). URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[4] Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL].
[5] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV].
[6] Hugo Touvron et al. Training data-efficient image transformers and distillation through attention. 2021. arXiv: 2012.12877 [cs.CV].
[7] Yangyang Shi et al. Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition. 2020. arXiv: 2010.10759 [cs.SD].
[8] Lili Chen et al. Decision Transformer: Reinforcement Learning via Sequence Modeling. 2021. arXiv: 2106.01345 [cs.LG].
[9] Ilya Sutskever. Training recurrent neural networks. University of Toronto, Toronto, Canada, 2013.
[10] Yi Tay et al. Long Range Arena: A Benchmark for Efficient Transformers. 2020. arXiv: 2011.04006 [cs.LG].
[11] Yi Tay et al. Efficient Transformers: A Survey. 2020. arXiv: 2009.06732 [cs.LG].
[12] Aurko Roy et al. Efficient Content-Based Sparse Attention with Routing Transformers. 2020. arXiv: 2003.05997 [cs.LG].
[13] Rewon Child et al. Generating Long Sequences with Sparse Transformers. 2019. arXiv: 1904.10509 [cs.LG].
[14] Wasi Uddin Ahmad, Nanyun Peng, and Kai-Wei Chang. GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction. 2021. arXiv: 2010.03009 [cs.CL].
[15] Jack W. Rae et al. Compressive Transformers for Long-Range Sequence Modelling. 2019. arXiv: 1911.05507 [cs.LG].
[16] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. 2020. arXiv: 2004.05150 [cs.CL].
[17] Anmol Gulati et al. Conformer: Convolution-augmented Transformer for Speech Recognition. 2020. arXiv: 2005.08100 [eess.AS].
[18] Irwan Bello et al. Attention Augmented Convolutional Networks. 2020. arXiv: 1904.09925 [cs.CV].
[19] William Chan et al. Imputer: Sequence Modelling via Imputation and Dynamic Programming. 2020. arXiv: 2002.08926 [eess.AS].
[20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. 2020. arXiv: 2001.04451 [cs.LG].
[21] Ciprian Chelba et al. Faster Transformer Decoding: N-gram Masked Self-Attention. 2020. arXiv: 2001.04589 [cs.LG].
[22] Krzysztof Choromanski et al. Rethinking Attention with Performers. 2021. arXiv: 2009.14794 [cs.LG].
[23] Zhuoran Shen et al. Efficient Attention: Attention with Linear Complexities. 2020. arXiv: 1812.01243 [cs.CV].
[24] Hao Peng et al. Random Feature Attention. 2021. arXiv: 2103.02143 [cs.CL].
[25] Yunyang Xiong et al. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention. 2021. arXiv: 2102.03902 [cs.CL].
[26] Sankalan Pal Chowdhury et al. On Learning the Transformer Kernel. 2021. arXiv: 2110.08323 [cs.LG].
[27] Lukas Hedegaard, Arian Bakhtiarnia, and Alexandros Iosifidis. "Continual Transformers: Redundancy-Free Attention for Online Inference". In: arXiv preprint arXiv:2201.06268 (2022).
[28] James Lee-Thorp et al. FNet: Mixing Tokens with Fourier Transforms. 2021. arXiv: 2105.03824 [cs.CL].
[29] Ilya Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. 2021. arXiv: 2105.01601 [cs.CV].
[30] Anonymous. "Patches Are All You Need?" In: Submitted to The Tenth International Conference on Learning Representations. Under review. 2022. URL: https://openreview.net/forum?id=TVHS5Y4dNvM.
[31] David Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. 2021. arXiv: 2109.08203 [cs.CV].
[32] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024–8035.
[33] Chi Sun et al. "How to Fine-Tune BERT for Text Classification?" In: China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194–206.
[34] Tianyi Zhang et al. Revisiting Few-sample BERT Fine-tuning. 2021. arXiv: 2006.05987 [cs.CL].
[35] Wietse de Vries and Malvina Nissim. "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages". In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021). DOI: 10.18653/v1/2021.findings-acl.74. URL: http://dx.doi.org/10.18653/v1/2021.findings-acl.74.
[36] Kevin Lu et al. Pretrained Transformers as Universal Computation Engines. 2021. arXiv: 2103.05247 [cs.LG].
[37] Robert Dale. "GPT-3: What's it good for?" In: Natural Language Engineering 27.1 (2021), pp. 113–118.
[38] David Patterson et al. "Carbon emissions and large neural network training". In: arXiv preprint arXiv:2104.10350 (2021).
[39] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[40] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. "Character-level Convolutional Networks for Text Classification". In: NIPS. 2015.
A Appendix
The code to reproduce the main experimental results can be found at:
https://anonymous.4open.science/r/simpleTron-6DDB/README.md
Table 3: Hyperparameters used for this experiment
[Figure 6 plots: standard deviation of attention output weights versus training epoch; panel title "Attention STD, block 1"; legend: Transformer, Simple + Linear, Simple]
Figure 6: Training evolution of the standard deviation of the attention output weights for the vanilla Transformer ($\mathrm{softmax}(\frac{QK^T}{\sqrt{d}})V$) and the SimpleTRON ($\frac{1}{\sqrt{L}} Q K^T V$) models containing 8 blocks, on the text classification task. For the vanilla Transformer, the softmax normalization is omitted in the standard-deviation calculation.