SimpleTRON: Simple Transformer with O(N) Complexity
Anonymous Author(s)
Affiliation
Address
email
Abstract
In this paper, we propose that the dot-product pairwise-matching attention layer, which is widely used in Transformer-based models, is redundant for model performance. Attention, in its original formulation, should rather be seen as a human-level tool to explore and/or visualize relevancy scores in sequential data. However, the way it is constructed leads to significant computational complexity. Instead, we present SimpleTRON: Simple Transformer with O(N) Complexity, a simple and fast alternative without any approximation that, unlike other approximation models, has no architecture-related overhead and can therefore be seen as a purely linear Transformer-like model. This architecture, to the best of our knowledge, outperforms existing sub-quadratic attention-approximation models on several tasks from the Long-Range Arena benchmark. Moreover, we show that SimpleTRON can benefit from weight transfer from pretrained large language models, as its parameters are fully transferable.
1 Introduction
Initially designed for natural language processing, the Transformer architecture [1] spread to a wide range of other domains and quickly became the state of the art in language modeling [2] as well as in generative tasks [3, 4], image processing [5, 6], speech recognition [7], reinforcement learning [8], and others. Ever since the original paper, "Attention is All You Need" [1], it has been widely assumed that the query-key-value framework, which implies a global pairwise comparison between query and key tokens, is a necessary condition for model performance. Even though such a mechanism allows a human-comprehensible visualization of interactions between the tokens, unveiling interpretability to some extent, an element-wise token comparison leads to quadratic complexity in both time and space. Therefore, even though the original Transformer architecture can in principle handle arbitrarily long-range dependencies given infinite compute, in contrast to most recurrent neural networks [9], the complexity of regular full-rank attention limits Transformer applications when long sequences are required.
In this paper we present the SimpleTRON model with a SimpleAttention mechanism as an extremely simple yet efficient replacement for the original quadratic-complexity softmax attention. The proposed mechanism not only has linear time and memory complexity, but also outperforms current state-of-the-art Transformers on the text classification, matching, and ListOps tasks from the LRA [10] benchmark, which has become a widely used test for sequence-processing models. Moreover, since SimpleAttention has building blocks analogous to those of the original attention, it is suitable for transfer learning: one can reuse pre-trained weights from existing Transformer models.
[Figure 1 diagram: "Original Self-Attention Mechanism" (top panel) and "Simple Self-Attention Mechanism" (bottom panel)]
Figure 1: Illustration of the self-attention calculation in the vanilla Transformer (top) and the SimpleTRON self-attention (bottom). Dashed blocks represent matrices: X is the input sequence {x_1, x_2, ..., x_L} of length L; Q*, K*, V* are the query, key, and value square matrices of dimensions $D_E \times D_{hid}$; d is the dimension of a single head ($d = D_{hid}/N_h$); $D_E$ is the embedding dimensionality.
2 Related Work
The computational complexity of the original model is a restrictive limitation that motivated the community to search for ways to approximate the architecture with asymptotically faster models [11]. As a result, a dizzying number of so-called "Efficient Transformers" have recently appeared. Each of these implementations applies some notion of sparsity to the otherwise dense attention mechanism and reaches sub-quadratic complexity with comparable performance.
Among the solutions to rationalize Transformer complexity, there are engineering approaches such as sparse attention [12, 13], graph attention [14], and compressive attention [15], which maps past hidden activations to a smaller set of compressed representations and thus allows longer sequences at comparable compute. Further engineering methods include Longformer [16], whose attention mechanism combines a windowed local-context self-attention with a global attention that encodes inductive bias; Conformer [17] and Attention Augmented CNNs [18], which are hybrid architectures of CNN-augmented Transformers; Imputer [19], a model that generates output sequences iteratively via imputations and dynamic programming; Reformer [20], which uses locality-sensitive-hashing attention and reversible residual layers; N-gram Masked Self-Attention [21]; and others.
Another branch of research on reducing Transformer complexity is dedicated to matrix and kernel approximations with a strong mathematical foundation. This includes Performer [22], which uses a kernel approximation, factorized attention [23], random feature attention [24], and, for example, Nyströmformer [25], which uses the Nyström matrix approximation. A learnable kernel approximation was presented by Chowdhury et al. [26], who learn the kernel's spectral distribution and approximate the Transformer kernel as a dot product between spectral feature maps. Continual Transformers [27] reduce time and memory complexity by introducing continual retroactive and single-output attention mechanisms that aggregate matrices over time and cache the step vectors.
Additionally, several studies have changed the whole concept of attention, replacing it with the Fast Fourier Transform (FFT), which does not require any training [28], or with dense layers that mix the tokens along both axes [29]. Another work [30] reported a vision-transformer-inspired [5] model that independently mixes the spatial and channel locations of image patches using depth-wise convolutions, outperforming existing Transformer-based solutions for image recognition.
3 The Model
The original multi-head attention layer applies a softmax normalization to the head-wise product of the query and transposed key matrices, combined with the value matrix as
$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^T}{\sqrt{d}}\right) V_h,$$
where $Q_h, K_h, V_h \in \mathbb{R}^{L \times d}$ are the query, key, and value, respectively, corresponding to the $h$-th head, $d$ is the query dimensionality, and $L$ is the length of the input sequence. The head-wise inputs to the operation are obtained by splitting the $Q, K, V \in \mathbb{R}^{L \times D_{hid}}$ matrices along the hidden-dimension axis $D_{hid}$ into $N_h$ pieces of size $d = D_{hid}/N_h$, corresponding to the $N_h$ heads. The input $X \in \mathbb{R}^{L \times D_{hid}}$ is transformed into the $Q$, $K$, and $V$ matrices by a linear transformation with matrices $Q^*, K^*, V^* \in \mathbb{R}^{D_{hid} \times D_{hid}}$ and biases $q^*, k^*, v^* \in \mathbb{R}^{L \times D_{hid}}$ as parameters:
$$Q = XQ^* + q^*, \quad K = XK^* + k^*, \quad V = XV^* + v^*.$$
The final output of the attention layer is then produced by applying another linear layer to the concatenation of all heads and adding the duplicated input $X$, which corresponds to a skip connection:
$$\mathrm{SelfAttention}(X) = X + W\big(\mathrm{Attention}(Q_1, K_1, V_1), \ldots, \mathrm{Attention}(Q_{N_h}, K_{N_h}, V_{N_h})\big) + w,$$
where $W \in \mathbb{R}^{L \times D_{hid}}$ and $w \in \mathbb{R}^{L \times D_{hid}}$ are the parameters of the linear layer.
$Q_h$, $K_h$, and $V_h$ are rectangular matrices whose first dimension typically dominates the second. The quadratic complexity with respect to the sequence length thus arises from the $Q_h K_h^T$ operation. Swapping the matrix multiplication order (first $K_h^T V_h$, then multiplying by $Q_h$) would reduce the complexity to linear; however, the softmax non-linearity forbids such a reordering. Here, we applied several major tweaks to the model in order to reach linear complexity and to improve performance.
As a result, we obtain a no-softmax attention with the direct q-k-v product, which can be described by the following simple formula for the attention operation on a single head:
$$\mathrm{SimpleAttention}(Q_h, K_h, V_h) = \frac{1}{\sqrt{L}}\, Q_h \big(K_h^T V_h\big).$$
When the linear layer is not used to produce the final output, the q-k-v product, concatenated over all heads, goes directly into the residual sum with the duplicated input from the skip connection.
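As an illustration of the reordering, the following PyTorch sketch (module and variable names are ours, not taken from the released code) implements a multi-head SimpleAttention layer that computes $K_h^T V_h$ first and adds the residual sum at the end:

```python
# Minimal sketch of SimpleAttention, assuming the shapes defined above.
import torch
import torch.nn as nn


class SimpleAttention(nn.Module):
    def __init__(self, d_hid: int, n_heads: int):
        super().__init__()
        assert d_hid % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_hid // n_heads
        # Same q-k-v projections as in softmax attention, so weights stay transferable.
        self.q_proj = nn.Linear(d_hid, d_hid)
        self.k_proj = nn.Linear(d_hid, d_hid)
        self.v_proj = nn.Linear(d_hid, d_hid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_hid)
        b, L, _ = x.shape

        def split(t):  # (batch, L, d_hid) -> (batch, n_heads, L, d_head)
            return t.view(b, L, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Key step: compute K^T V first (d_head x d_head), then multiply by Q.
        # No softmax, only the 1/sqrt(L) scaling term.
        context = torch.matmul(k.transpose(-2, -1), v)   # (b, heads, d_head, d_head)
        out = torch.matmul(q, context) / (L ** 0.5)      # (b, heads, L, d_head)
        out = out.transpose(1, 2).reshape(b, L, -1)      # concatenate heads
        return x + out                                   # residual (skip) connection
```

Because $K_h^T V_h$ is a $d \times d$ matrix, both multiplications cost $O(L d^2)$, i.e. linear in the sequence length.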
Unlike linear mixing models such as [28], this transformation is not linear in the input $X$. To see this, note that
$$Q_h = Q I_h = X Q^* I_h + q^* I_h,$$
and analogously for $K_h$ and $V_h$, where $I_h \in \mathbb{R}^{D_{hid} \times d}$ is a matrix with all entries zero except for a $d \times d$ identity matrix on rows $d(h-1)+1$ to $dh$; i.e., the multiplication $Q I_h$ selects exactly those columns of $Q$ that correspond to the $h$-th head. Hence, we obtain
$$\mathrm{SimpleAttention}(X) = \frac{1}{\sqrt{L}}\, (X Q^* I_h + q^* I_h)\big((X K^* I_h + k^* I_h)^T (X V^* I_h + v^* I_h)\big).$$
This expression resembles a quadratic form multiplied again by the input.
We refer to the above-mentioned mechanism as SimpleAttention and to the model as SimpleTRON, which stands for Simple Transformer with O(N) Complexity. The matrix operations within a single head of SimpleAttention are illustrated in Figure 1.
Table 1: Baseline and proposed models on the three LRA tasks. We denote the sequence length as L, the attention span as K, and the Sinkhorn model block size as B. The notation for our models is: Simple - SimpleAttention without the skip connection and the linear layer; Simple-Res - SimpleAttention with the skip connection and without the linear layer; Simple-ResL - SimpleAttention with the skip connection and the linear layer after the q-k-v multiplication.
4 Experiments
Even though numerous sub-quadratic approximations of the vanilla Transformer claim comparable or even superior performance to the original model, it is fair to say that each model can be task-dependent and yield strikingly different results across modalities. Moreover, some benchmark tests can be parameter-dependent, so bigger models may perform better. Up to a point, effective evaluation of Transformer-like models was therefore uncertain due to the absence of a unified and systematic benchmark. In this regard, Tay et al. [10] published a benchmark for efficient Transformer models called "Long Range Arena" (LRA), which consists of tasks of various data types and modalities, with data presented in sequences ranging from 1K to 16K tokens. We use LRA [10] as the standardised benchmark for efficient Transformer evaluation:
• Following the recommendations of [10], we replicate the learning schedule and all the hyperparameters related to our model architecture, while keeping the additional parametrization below 10%.
• To reproduce the experimental setup of [10], we used gradient accumulation to simulate larger batch sizes (a minimal sketch is shown after this list).
• Given the stochastic weight initialization and sampling, each model was trained 5 times to observe model behavior and accuracy variance and to avoid so-called black swans, random seeds that give radically different results [31]. Best results are reported in Table 1.
• As we focus on the NLP domain in the present work, we test our model on three LRA tasks: BPE text classification, information retrieval (matching), and ListOps.
• Since our models tend to converge more slowly in terms of the number of iterations, we prolonged training on the matching and ListOps tasks to 15K steps.
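The gradient-accumulation setup mentioned above can be sketched as a minimal PyTorch training loop (placeholder names; the actual schedules and optimizers follow [10]):

```python
# Sketch of gradient accumulation used to simulate a larger effective batch size.
# Here 8 micro-batches are accumulated per optimizer step; the model, loader and
# criterion are placeholders, not the exact LRA training configuration.
accum_steps = 8  # effective batch = accum_steps * micro-batch size


def train_epoch(model, loader, optimizer, criterion):
    model.train()
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = criterion(model(x), y) / accum_steps  # scale so accumulated gradients average
        loss.backward()                              # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```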
At the beginning of the paper, we raised the question of whether a pairwise-matching attention layer, or any of its approximations, is needed at all. To find the answer, we performed a simple experiment. First, the SimpleTRON model was trained regularly until it reached its top accuracy. Since the q-k-v matrices have the same dimensions in both SimpleAttention and the original SoftmaxAttention, the weights are interchangeable. We therefore transferred the trained weights from SimpleAttention to SoftmaxAttention, froze the q-k-v layers, and retrained the rest of the model. The logic behind this experiment is simple: if a pairwise comparison in the q-k product were needed, the model would not be able to reach the accuracy of a vanilla Transformer, because the q-k-v layers are frozen and not trained optimally.
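A rough sketch of this transfer-and-freeze step, assuming attention modules that expose q_proj/k_proj/v_proj linear layers as in the sketch above (the attribute names are illustrative, not the authors' code):

```python
# Copy the q-k-v projection weights from a trained SimpleAttention layer into a
# softmax attention layer of the same shape, then freeze them so only the rest
# of the model is retrained.
def transfer_and_freeze(simple_layer, softmax_layer):
    for name in ("q_proj", "k_proj", "v_proj"):
        src = getattr(simple_layer, name)
        dst = getattr(softmax_layer, name)
        dst.load_state_dict(src.state_dict())   # shapes match, so weights are interchangeable
        for p in dst.parameters():
            p.requires_grad = False             # keep the transferred q-k-v weights frozen
```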
Moreover, the community has invested an immense effort into pretraining large language models on comprehensive datasets, allowing many researchers and companies to reap the benefits of transfer learning by fine-tuning pretrained models on specific tasks from various domains [33, 34, 35]. Lately, it has been proposed that the learning abilities of Transformer models trained on an extensive language dataset can transcend the NLP modality and be used as universal computation engines [36]. On the other hand, such training is an extremely resource-demanding process [37] with a considerable carbon footprint [38]. Therefore, simply out of curiosity, we tried fine-tuning our SimpleTRON model for text classification using weights from pretrained BERT [2]. This weight transfer is applicable to the SimpleTRON architecture because the size and dimensionality of the model layers can be made fully identical and the weights are therefore transferable.
Number of parameters By removing the feed-forward linear layer following the q-k-v product, our model in fact has fewer parameters than its counterparts given the restrictions reported in [10]. The difference is only a small margin, but worth mentioning since the authors placed parametrization restrictions in their paper.
Training speed By swapping the order of the q-k-v product matrices and avoiding any kind of approximation, we reach truly linear complexity with respect to the input length. It has to be emphasized that most linear attention approximations reporting linear complexity in fact omit a large architecture-dependent multiplier, which should be taken into account in practice.
Memory efficiency On the above-mentioned LRA tasks, the current model was found to be an order of magnitude more memory-efficient than the vanilla Transformer. For a visual comparison, see Figure 3, which was obtained using the text-classification task architecture on a synthetic dataset with sequences of length 256 to 32K, in order to explore the memory and time consumption of the model compared to the vanilla Transformer. As we used a Tesla V100 GPU with 16 GB of RAM, the vanilla Transformer model could not fit the data in memory even with a batch size of 1.
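The measurement behind Figure 3 can be approximated by a simple profiling loop of the following form (a sketch with placeholder model construction and lengths, not the exact benchmarking script):

```python
# Push random token sequences of increasing length through the classifier and
# record peak GPU memory and wall-clock time per forward/backward pass.
import time
import torch


def profile(model, vocab_size=256, lengths=(256, 1024, 4096, 16384, 32768), device="cuda"):
    model = model.to(device)
    for L in lengths:
        x = torch.randint(0, vocab_size, (1, L), device=device)  # batch size 1
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.time()
        model(x).sum().backward()                                 # forward + backward pass
        torch.cuda.synchronize(device)
        mem_mb = torch.cuda.max_memory_allocated(device) / 2**20
        print(f"L={L:6d}  time={time.time() - start:6.3f}s  peak memory={mem_mb:8.1f} MB")
```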
Classification accuracy To observe the model behaviour and accuracy variance, we train our models 5 times to avoid so-called black swans, random seeds that give radically different results [31]. As a result, our model showed 66.75/73.92/37.45% top accuracy on the text classification/matching/ListOps test splits and 66.61/73.74/37.15% mean test accuracy (over 5 runs), respectively, outperforming other known linear approximations of the attention mechanism.
[Figure 2 plots: accuracy (and learning rate, bottom panel) versus iteration; legends: train/validation (top, center) and Softmax Attention / Pretrained Simple Attention (bottom)]
Figure 2: Training evolution plots of example runs on ListOps (top) and Matching (center), and (bottom) training (dashed lines) and validation (solid lines) accuracy curves of the original vanilla Transformer architecture (red lines) and a Transformer with frozen q-k-v layers transferred from the SimpleAttention model (blue lines).
Normalization The original Transformer model uses the $1/\sqrt{d}$ normalization term on the $QK^T$ product to counteract vector-magnitude explosion and the resulting decreased gradient flow through the softmax. Since we do not use any saturating function in the attention module, our model works without any normalization term. However, we have found the $1/\sqrt{L}$ term to be useful for more stable convergence.
Convergence We have found our model to converge more slowly than the vanilla Transformer in terms of the number of iterations on certain tasks, such as Matching and ListOps (see Figure 2); however, this is counteracted by faster computation, especially for the longer sequences in the ListOps benchmark, as shown in Figure 3.
[Figure 3 plot: epoch time (s) and memory (MB) versus sequence length for SimpleTRON and the Transformer]
Figure 3: Memory and time complexity of the SimpleAttention model in comparison with the vanilla Transformer with respect to the input length.
Attention transfer Transferring the weights from the SimpleAttention model to the original SoftmaxAttention model showed an interesting behaviour: with about 30% fewer trainable parameters and in fact no ability to learn pairwise relations between the tokens in a sequence, the model trained up to the original accuracy of the vanilla Transformer (Figure 2 (bottom)), but in far fewer training epochs. Given that the gradients in the q-k-v layers are frozen, we take such SimpleAttention pre-training as evidence that the pairwise token comparison is redundant. This also opens a way toward fast (re)training of already deployed models.
It has to be emphasized that models in which the order of matrix multiplication in the attention head is simply swapped, without any further modification, often do not converge. Successful training is possible only when the number of blocks is low (e.g., 4 blocks for the text classification task). Even though, at the very early stage of training, deeper models with simple attention are on par with the vanilla Transformer, after a certain number of epochs they work no better than random choice.
One pathway to stable and efficient training is to remove the linear layer that follows the attention output, as described above. However, we empirically found that for larger models with a higher number of parameters, as usually applied in practice (i.e., comparable with BERT [2] in the number of parameters), the original SimpleTRON architecture shows performance lagging behind the vanilla Transformer. Moreover, in some cases a linear layer following the attention operation is technically necessary, namely when the dimensionality $D_E$ differs from $D_{hid}$. Another case where the linear layer is beneficial is weight transfer from a pretrained Transformer-like model.
Another option is to add skip connections [39] through the SimpleTRON block. Although the original Transformer block already has a skip connection from the block input to the Layer Normalization layer, in the present implementation we added another skip connection (shown in red in Figure 5). This allowed us to train larger models of arbitrary depth. In this case, the presence of an additional linear layer does not have any deleterious effect on the model's convergence.
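A possible reading of this block, with the extra skip connection around the attention module, is sketched below (the module names and the exact LayerNorm placement are our interpretation of Figure 5, not the reference implementation):

```python
# Sketch of a SimpleTRON block with an additional skip connection around SimpleAttention.
import torch.nn as nn


class SimpleTronBlock(nn.Module):
    def __init__(self, d_hid: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = SimpleAttention(d_hid, n_heads)   # from the earlier sketch
        self.norm1 = nn.LayerNorm(d_hid)
        self.norm2 = nn.LayerNorm(d_hid)
        self.ff = nn.Sequential(nn.Linear(d_hid, d_ff), nn.ReLU(), nn.Linear(d_ff, d_hid))

    def forward(self, x):
        # SimpleAttention already adds x internally (first skip); adding x again here
        # is the additional skip connection through the block described in the text.
        h = self.norm1(x + self.attn(x))
        return self.norm2(h + self.ff(h))
```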
The reason deeper models fail to train is that the weights of the SimpleAttention output tend to become symmetrical in the deeper blocks of the model. We found that the additional skip connections lead to a higher variance in the weights and therefore to a better inference ability of the model. Furthermore, to show that the models with skip connections perform well on the LRA dataset, we performed experiments using models with skip connections, with and without the linear layer. The results are consistent with those of the original SimpleTRON model, while the model with the linear layer usually does not converge without skip connections (see Table 1). It is worth mentioning that we could not obtain any accuracy gain by stacking more SimpleTRON blocks together, either with or without skip connections.
[Figure 4 plots: three panels of accuracy versus iteration, with train and validation curves]
Figure 4: Training evolution plots for a SimpleAttention model containing 8 blocks, on the text classification task (training and validation accuracy versus iteration).
5.3 Larger models and utilizing weights from pretrained BERT
As shown above, our architecture is superior on long-text classification from the LRA benchmark, which is a unified test for efficient Transformer models. However, the true power of the Transformer architecture is its ability to capture patterns from large-scale comprehensive datasets (often natural language datasets). Therefore, we performed a preliminary experiment comparing the BERT [2] language model with a SimpleTRON model of similar architecture. Training from scratch on the AG News Corpus dataset [40] showed that a SimpleTRON model (with skip connections and the linear layer) whose architecture mimicked BERT reached 89.9% accuracy, while training a vanilla Transformer with the BERT-base architecture yielded 1.2% higher accuracy.
[Figure 5 diagram: two block diagrams, each containing "Multi-Head Simple Self-Attention" and "Add & LayerNorm" modules]
Figure 5: Illustration of the SimpleAttention block with a double skip connection, without (left) and with (right) a normalized output.
Table 2: Model performance with respect to the number of layers in the model, trained on the AG News Corpus dataset. (Table footnotes: 1 parallel blocks; 2 normalized.)
However, since the SimpleTRON model in this experiment contained the linear layer, the weights from BERT are fully transferable to the proposed architecture. Therefore, using the weights from a pretrained BERT model, we were able to fine-tune the SimpleTRON architecture. Interestingly, even though SimpleTRON is in fact a different model, we obtained an inference gain by using the weights from the pretrained model, reaching an accuracy of 92.8%, which is 2.7% higher than training the model from scratch on the AG News dataset. As mentioned earlier, training a large language model properly requires vast resources. Weight transfer from a Transformer to SimpleTRON is a step towards more sustainable training that reuses already trained models.
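Assuming the Hugging Face `transformers` layout of BERT and the hypothetical SimpleTRON modules sketched earlier (the `blocks` and `q_proj`/`k_proj`/`v_proj` names are ours), the weight transfer could look roughly like this:

```python
# Sketch of reusing pretrained BERT q-k-v weights in a SimpleTRON model with matching shapes.
from transformers import BertModel


def load_bert_qkv(simpletron, bert_name="bert-base-uncased"):
    bert = BertModel.from_pretrained(bert_name)
    for block, bert_layer in zip(simpletron.blocks, bert.encoder.layer):
        # The q-k-v projections have identical shapes, so the weights transfer directly.
        block.attn.q_proj.load_state_dict(bert_layer.attention.self.query.state_dict())
        block.attn.k_proj.load_state_dict(bert_layer.attention.self.key.state_dict())
        block.attn.v_proj.load_state_dict(bert_layer.attention.self.value.state_dict())
    return simpletron
```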
As discussed above, the SimpleTRON architecture works especially well when the number of stacked blocks is limited; therefore, we performed experiments on models with reduced depth, both for fine-tuning and for training from scratch. Indeed, the SimpleTRON architecture was found to outperform the BERT-based model when the two models, each containing 6 blocks, were trained from scratch. While transferring 6 blocks of pretrained BERT gave a notable increase in performance, the 6-block SimpleTRON model lags only a small margin behind the original BERT architecture.
Overall, we found that even though SimpleTRON blocks are able to outperform the Transformer architecture, in the case of larger models the proposed architecture does not take good advantage of stacked blocks. This is a subject for further investigation of SimpleTRON training and regularization. The performance of models with depths from 1 to 12 blocks shows that the testing accuracy of our model saturates quickly with depth and may even decrease, whereas with the original self-attention mechanism inference accuracy increases with model depth. Nevertheless, this model degradation can be mitigated by stacking blocks in parallel (Table 2); even though the parallel architecture, i.e., a wide model, does not show rapidly increasing accuracy with the number of blocks, training is more stable and the wider models converge better.
As discussed above, the weights of the SimpleAttention output tend to become symmetrical in the deeper blocks of the model; therefore, the third approach we tried for deeper models was weight normalization. So far, the best results were obtained with a small tweak of the SimpleAttention block in which the second skip connection is normalized using LayerNorm before the final output (see Figure 5 (right)). According to the preliminary results shown in Table 2, such a model outperforms BERT for 1, 6, and 12 layers when trained from scratch on the AG News Corpus dataset. Therefore, in this case we managed to preserve the approximation capacity of the SimpleAttention block and take advantage of stacked attention blocks.
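A sketch of this normalized variant, building on the hypothetical block above and placing an extra LayerNorm on the output path (our interpretation of the figure, not the authors' code):

```python
# Normalized variant: an additional LayerNorm on the block output / second skip path.
import torch.nn as nn


class NormalizedSimpleTronBlock(SimpleTronBlock):  # SimpleTronBlock from the sketch above
    def __init__(self, d_hid: int, n_heads: int, d_ff: int):
        super().__init__(d_hid, n_heads, d_ff)
        self.out_norm = nn.LayerNorm(d_hid)  # normalizes the second skip path before output

    def forward(self, x):
        h = self.norm1(x + self.attn(x))
        return self.out_norm(self.norm2(h + self.ff(h)))
```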
Different tasks The Transformer architecture is known to be pervasive; however, we are fully aware that the performance of any model can be task-dependent. Here we show results on the text-classification tasks from the LRA benchmark and on AG News. Our goal in the near future is therefore to expand SimpleTRON to other modalities, such as computer vision, as well as to look for a more efficient way to utilize model depth.
q-k-v framework elimination In the present work, we followed the original q-k-v framework in order to take a step further towards an attention-less Transformer architecture. However, we believe that a more efficient framework exists, since the q-k-v formulation originally assumed a global pairwise comparison.
References
[1] Ashish Vaswani et al. Attention Is All You Need. 2017. arXiv: 1706.03762 [cs.CL].
[2] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[3] Alec Radford et al. "Language Models are Unsupervised Multitask Learners". In: (2018). URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[4] Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL].
[5] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV].
[6] Hugo Touvron et al. Training data-efficient image transformers and distillation through attention. 2021. arXiv: 2012.12877 [cs.CV].
[7] Yangyang Shi et al. Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition. 2020. arXiv: 2010.10759 [cs.SD].
[8] Lili Chen et al. Decision Transformer: Reinforcement Learning via Sequence Modeling. 2021. arXiv: 2106.01345 [cs.LG].
[9] Ilya Sutskever. Training recurrent neural networks. University of Toronto, Toronto, Canada, 2013.
[10] Yi Tay et al. Long Range Arena: A Benchmark for Efficient Transformers. 2020. arXiv: 2011.04006 [cs.LG].
[11] Yi Tay et al. Efficient Transformers: A Survey. 2020. arXiv: 2009.06732 [cs.LG].
[12] Aurko Roy et al. Efficient Content-Based Sparse Attention with Routing Transformers. 2020. arXiv: 2003.05997 [cs.LG].
[13] Rewon Child et al. Generating Long Sequences with Sparse Transformers. 2019. arXiv: 1904.10509 [cs.LG].
[14] Wasi Uddin Ahmad, Nanyun Peng, and Kai-Wei Chang. GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction. 2021. arXiv: 2010.03009 [cs.CL].
[15] Jack W. Rae et al. Compressive Transformers for Long-Range Sequence Modelling. 2019. arXiv: 1911.05507 [cs.LG].
[16] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. 2020. arXiv: 2004.05150 [cs.CL].
[17] Anmol Gulati et al. Conformer: Convolution-augmented Transformer for Speech Recognition. 2020. arXiv: 2005.08100 [eess.AS].
[18] Irwan Bello et al. Attention Augmented Convolutional Networks. 2020. arXiv: 1904.09925 [cs.CV].
[19] William Chan et al. Imputer: Sequence Modelling via Imputation and Dynamic Programming. 2020. arXiv: 2002.08926 [eess.AS].
[20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. 2020. arXiv: 2001.04451 [cs.LG].
[21] Ciprian Chelba et al. Faster Transformer Decoding: N-gram Masked Self-Attention. 2020. arXiv: 2001.04589 [cs.LG].
[22] Krzysztof Choromanski et al. Rethinking Attention with Performers. 2021. arXiv: 2009.14794 [cs.LG].
[23] Zhuoran Shen et al. Efficient Attention: Attention with Linear Complexities. 2020. arXiv: 1812.01243 [cs.CV].
[24] Hao Peng et al. Random Feature Attention. 2021. arXiv: 2103.02143 [cs.CL].
[25] Yunyang Xiong et al. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention. 2021. arXiv: 2102.03902 [cs.CL].
[26] Sankalan Pal Chowdhury et al. On Learning the Transformer Kernel. 2021. arXiv: 2110.08323 [cs.LG].
[27] Lukas Hedegaard, Arian Bakhtiarnia, and Alexandros Iosifidis. "Continual Transformers: Redundancy-Free Attention for Online Inference". In: arXiv preprint arXiv:2201.06268 (2022).
[28] James Lee-Thorp et al. FNet: Mixing Tokens with Fourier Transforms. 2021. arXiv: 2105.03824 [cs.CL].
[29] Ilya Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. 2021. arXiv: 2105.01601 [cs.CV].
[30] Anonymous. "Patches Are All You Need?" In: Submitted to The Tenth International Conference on Learning Representations. Under review. 2022. URL: https://openreview.net/forum?id=TVHS5Y4dNvM.
[31] David Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. 2021. arXiv: 2109.08203 [cs.CV].
[32] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024–8035.
[33] Chi Sun et al. "How to Fine-Tune BERT for Text Classification?" In: China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194–206.
[34] Tianyi Zhang et al. Revisiting Few-sample BERT Fine-tuning. 2021. arXiv: 2006.05987 [cs.CL].
[35] Wietse de Vries and Malvina Nissim. "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages". In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021). DOI: 10.18653/v1/2021.findings-acl.74. URL: http://dx.doi.org/10.18653/v1/2021.findings-acl.74.
[36] Kevin Lu et al. Pretrained Transformers as Universal Computation Engines. 2021. arXiv: 2103.05247 [cs.LG].
[37] Robert Dale. "GPT-3: What's it good for?" In: Natural Language Engineering 27.1 (2021), pp. 113–118.
[38] David Patterson et al. "Carbon emissions and large neural network training". In: arXiv preprint arXiv:2104.10350 (2021).
[39] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[40] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. "Character-level Convolutional Networks for Text Classification". In: NIPS. 2015.
A Appendix
The code to reproduce the main experimental results can be found at:
https://anonymous.4open.science/r/simpleTron-6DDB/README.md
Table 3: Hyperparameters used for this experiment
[Figure 6 plots: standard deviation of attention output weights versus training epoch; panel title "Attention STD, block 1"; legend: Transformer, Simple + Linear, Simple]
Figure 6: Training evolution of the standard deviation of the attention output weights for the vanilla Transformer ($\mathrm{softmax}(\frac{QK^T}{\sqrt{d}})V$) and the SimpleTRON ($\frac{1}{\sqrt{L}} Q K^T V$) models containing 8 blocks, on the text classification task. For the vanilla Transformer, the softmax normalization is omitted in the standard-deviation calculation.