Report: Transformer
1   Introduction
The Transformer is a model architecture that eschews recurrence and instead
relies entirely on an attention mechanism to draw global dependencies between
input and output.
2   Transformer Architecture
                    Fig. 1. The Transformer-model architecture
The Transformer has an encoder-decoder structure. The encoder maps an input
sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of
continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then
generates an output
sequence of symbols $(y_1, \dots, y_m)$ one element at a time. At each step, the
model is auto-regressive, consuming the previously generated symbols as
additional input when generating the next.
2.1   Encoder and Decoder Stacks
Encoder The encoder is composed of a stack of N = 6 identical layers. Each
layer has a multi-head self-attention sub-layer and a position-wise feed-forward
sub-layer. Each sub-layer is wrapped in a residual connection followed by layer
normalization. All sub-layers in the model produce outputs of dimension
$d_{\text{model}} = 512$.
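The wrapping of each sub-layer described above amounts to computing
LayerNorm(x + Sublayer(x)). The following is a minimal NumPy sketch of that
pattern, not the report's implementation. The names `layer_norm` and
`residual_sublayer`, the epsilon value, and the omission of the learned gain and
bias of layer normalization are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance.

    Assumption: the learned gain and bias of layer normalization are omitted.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Apply a sub-layer, add the residual connection, then normalize."""
    return layer_norm(x + sublayer(x))

# Toy usage: 4 positions with d_model = 512 and a stand-in sub-layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))
out = residual_sublayer(x, lambda h: 0.1 * h)
print(out.shape)
```

Because the residual sum requires x and Sublayer(x) to have the same shape,
every sub-layer has to keep the output dimension equal to the input dimension.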
Decoder The decoder is also composed of a stack of N = 6 identical layers.
Each decoder layer adds a third sub-layer: multi-head attention over the encoder
output. As in the encoder, each sub-layer is wrapped in a residual connection
followed by layer normalization. Decoder self-attention is masked so that
position i cannot attend to positions after i.
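The masking can be illustrated with a lower-triangular boolean matrix. This is
a sketch under the assumption, consistent with Section 2.2.3, that each position
may attend to itself and to earlier positions. The helper name `causal_mask` and
the fill value -1e9 are illustrative choices, not taken from the report.

```python
import numpy as np

def causal_mask(n):
    """Return an (n, n) boolean matrix; entry [i, j] is True when
    position i is allowed to attend to position j (that is, j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)

# Blocked entries receive a large negative score, so that a subsequent
# softmax assigns them a weight that is effectively zero.
scores = np.zeros((4, 4))
masked_scores = np.where(mask, scores, -1e9)
print(mask.astype(int))
```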
2.2   Attention
An attention function can be described as mapping a query and a set of key-value
pairs to an output, where the query, keys, values, and output are all vectors. The
output is computed as a weighted sum of the values, where the weight assigned
to each value is computed by a compatibility function of the query with the
corresponding key.
Fig. 2. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of
several attention layers running in parallel.
2.2.1 Scaled Dot-Product Attention The input consists of queries and
keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot
products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a
softmax function to obtain the weights on the values. In practice, we compute
the attention function
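on a set of queries simultaneously, packed together into a matrix Q, with the
keys and values packed into matrices K and V. The matrix of outputs is:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{1}
\]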
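Equation (1) can be sketched in a few lines of NumPy. This is an illustrative
sketch rather than a reference implementation. The function names, the optional
`mask` argument, and the toy sizes (3 queries, 5 keys, d_k = 64, d_v = 32) are
assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the (n_q, d_v) outputs and the (n_q, n_k) attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hypothetical masking convention
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 32))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each output row is a weighted sum of the rows of V, with the weights for a given
query summing to one.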
2.2.2 Multi-Head Attention Instead of performing a single attention function
with $d_{\text{model}}$-dimensional keys, values, and queries, it is beneficial
to linearly project the queries, keys, and values h times with different,
learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions, respectively.
On each of these projected versions of queries, keys, and values, we then
perform the attention function in parallel, yielding $d_v$-dimensional output
values. These are concatenated and once again projected, resulting in the final
values, as depicted in Fig. 2.
This allows the model to attend to information from different representation
subspaces at different positions. The formula can be represented as:
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O \tag{2}
\]
where
\[
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{3}
\]
Here the projections are the parameter matrices $W_i^Q$, $W_i^K$, $W_i^V$, and
$W^O$.
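The multi-head computation can be sketched as follows. This is a minimal NumPy
illustration under stated assumptions: the projection matrices are stored as
stacked arrays of shape (h, d_model, d_k) or (h, d_model, d_v), the toy sizes
(d_model = 16, h = 4) are arbitrary, and the weights are random rather than
learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    h = W_q.shape[0]
    # Project the inputs separately for every head, then attend per head.
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    # Concatenate the heads along the feature axis and project once more.
    return np.concatenate(heads, axis=-1) @ W_o

# Illustrative sizes only (assumptions, not values from the report).
d_model, h = 16, 4
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))

x = rng.normal(size=(5, d_model))   # self-attention: Q, K, V all come from x
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)
print(out.shape)
```

Choosing d_k = d_v = d_model / h, as in this sketch, keeps the concatenated
head outputs at d_model dimensions before the final projection.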
2.2.3 Applications of Attention in Transformer The Transformer uses
multi-head attention in three different ways:
 – In encoder-decoder attention, queries come from the decoder, while keys
   and values come from the encoder output, allowing each decoder position to
   attend to the entire input sequence.
 – The encoder uses self-attention, where queries, keys, and values all come from
   the previous encoder layer, allowing each position to attend to all others.
 – Decoder self-attention lets each position attend to itself and earlier positions,
   with future tokens masked to maintain the auto-regressive property.
2.3     Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in the encoder and decoder
contains a fully connected feed-forward network, which is applied to each position
separately and identically. This consists of two linear transformations with a
ReLU activation in between.
\[
\mathrm{FFN}(x) = \max(0, x W_1 + b_1)\, W_2 + b_2 \tag{4}
\]
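A short NumPy sketch of the feed-forward network in equation (4) follows. The
inner dimension `d_ff` and the toy sizes are hypothetical, and the weights are
random rather than learned.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between, applied to every row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Illustrative sizes only (assumptions, not values from the report).
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))   # 5 positions
out = feed_forward(x, W1, b1, W2, b2)

# "Position-wise": processing one position alone gives the same result
# as processing it together with the others.
single = feed_forward(x[2:3], W1, b1, W2, b2)[0]
print(np.allclose(out[2], single))
```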
2.4     Embeddings and Softmax
Learned embeddings are used to convert the input and output tokens to vectors
of dimension $d_{\text{model}}$. We also use the usual learned linear
transformation and softmax function to convert the decoder output to next-token
probabilities. In this model, we share the same weight matrix between the two
embedding layers and the pre-softmax linear transformation. In the embedding
layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
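The weight sharing can be sketched with a single matrix `E` that serves both as
the embedding table and, transposed, as the pre-softmax projection. The names
`embed` and `next_token_probs`, the toy vocabulary size, and the random weights
are assumptions for illustration only.

```python
import numpy as np

vocab_size, d_model = 10, 8          # illustrative sizes (assumptions)
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))   # the shared weight matrix

def embed(token_ids):
    """Look up token embeddings and scale them by sqrt(d_model)."""
    return E[token_ids] * np.sqrt(d_model)

def next_token_probs(decoder_output):
    """Project decoder outputs onto the vocabulary with E^T, then apply softmax."""
    logits = decoder_output @ E.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.array([1, 4, 7])
x = embed(tokens)
probs = next_token_probs(rng.normal(size=(3, d_model)))
print(x.shape, probs.shape)
```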
2.5   Positional Encoding
For the model to make use of the order of the sequence, we add "positional
encodings" to the outputs of the input and output embeddings. These have the
same dimension $d_{\text{model}}$ as the embeddings. We use sine and cosine
functions of different frequencies:
\[
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \tag{5}
\]
\[
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \tag{6}
\]
where pos is the position and i indexes the pair of dimensions (2i, 2i + 1).
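The sinusoidal encodings can be computed as in the following NumPy sketch. The
function name `positional_encoding`, the `max_len` parameter, and the
requirement that the model dimension be even are assumptions of this example.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings.

    Assumption: d_model is even, so dimensions pair up as (2i, 2i + 1).
    """
    pos = np.arange(max_len)[:, None]        # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)
```

The resulting matrix is added row by row to the embedded tokens, which is why it
must have the same dimension as the embeddings.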
3     Why Self-Attention
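We can obtain the same size of result using convolutional or recurrent layers
instead of self-attention ones. However, a self-attention layer connects all
positions with a constant number of sequential operations, and its per-layer
cost of $O(n^2 \cdot d)$ is lower than the $O(n \cdot d^2)$ of a recurrent layer
whenever the sequence length n is smaller than the representation dimension d,
while also achieving superior performance.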
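To make the complexity comparison concrete, the sketch below assumes the
commonly cited per-layer operation counts of roughly n^2 * d for self-attention
and n * d^2 for a recurrent layer, where n is the sequence length and d the
representation dimension. Constant factors are ignored, and the function names
and sample values of n are illustrative.

```python
def self_attention_ops(n, d):
    """Approximate per-layer operation count for self-attention: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Approximate per-layer operation count for a recurrent layer: n * d^2."""
    return n * d * d

d = 512
for n in (50, 512, 2000):
    print(n, self_attention_ops(n, d), recurrent_ops(n, d))
```

Under these assumptions self-attention is cheaper for sequences shorter than d
and more expensive for longer ones.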
4     Training
4.1   Training Data
We train on the standard WMT 2014 English-German and WMT 2014 English-French
datasets. Sentence pairs are batched together by approximate sequence length;
each batch contains approximately 25000 source tokens and 25000 target tokens.
4.2   Optimizer
We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and
$\epsilon = 10^{-9}$. We vary the learning rate over the course of training
according to the formula:
\[
lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right) \tag{7}
\]
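The schedule in equation (7) can be written as a small Python function. The
default `warmup_steps=4000` is an assumed value that the text above does not
state, and the function name `learning_rate` is illustrative.

```python
def learning_rate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num must be >= 1).

    Assumption: warmup_steps=4000 is a placeholder default, not a value
    given in this report.
    """
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, learning_rate(step))
```

The second argument of `min` is the smaller one during the first
`warmup_steps` steps, so the rate grows linearly there; afterwards the first
argument takes over and the rate decays in proportion to the inverse square root
of the step number.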
4.3   Regularization
We use three types of regularization during training: (1) dropout on the
output of every sub-layer, (2) dropout on the inputs to the encoder and decoder
stacks, both with a probability of 0.1, and (3) label smoothing, which improves
accuracy and BLEU score.
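Both techniques can be sketched in NumPy. The sketch makes several assumptions
that the text does not specify: the smoothing value `epsilon=0.1`, the variant
of label smoothing that spreads epsilon uniformly over the whole vocabulary, and
the "inverted" form of dropout that rescales the kept elements.

```python
import numpy as np

def dropout(x, p=0.1, rng=None):
    """Zero each element with probability p and rescale the rest by 1 / (1 - p).

    Assumption: inverted dropout, applied only during training.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def smooth_labels(target_ids, vocab_size, epsilon=0.1):
    """Turn class indices into smoothed target distributions.

    Assumption: epsilon is spread uniformly over all vocab_size classes.
    """
    one_hot = np.eye(vocab_size)[target_ids]
    return (1.0 - epsilon) * one_hot + epsilon / vocab_size

y = dropout(np.ones(1000), p=0.1, rng=np.random.default_rng(0))
targets = smooth_labels(np.array([2, 0]), vocab_size=5)
print(targets)
```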
5     Conclusion
The Transformer is the first sequence transduction model based entirely on
attention. We plan to apply the Transformer to other domains with large inputs
and outputs, such as images, audio, and video.