Transformer
October 5th, 2023 http://adl.miulab.tw
2 Sequence Encoding: Basic Attention
3 Representations of Variable Length Data
◉ Input: word sequence, image pixels, audio signal, click logs
◉ Properties: continuity, temporal structure, importance distribution
◉ Example
✓ Basic combination: average, sum
✓ Neural combination: network architectures should consider input domain properties
− CNN (convolutional neural network)
− RNN (recurrent neural network): temporal information
5 Convolutional Neural Networks
◉ Easy to parallelize
◉ Exploit local dependencies
✓ Long-distance dependencies require many layers
6 Attention
◉ Encoder-decoder model is important in NMT
◉ RNNs need attention mechanism to handle long dependencies
◉ Attention allows us to access any state
[Figure: an RNN encoder-decoder translating the source "深度學習" ("deep learning"): the final encoder state h_4 carries the information of the whole sentence and is passed to the decoder, which generates "deep", "learning", <END>.]
7 Basic Attention
[Figure: basic attention over the source "深度學習": the decoder state is matched against the encoder states h_1 ... h_4 (keys) to produce match scores α, and the context vector c is the α-weighted sum of the same encoder states (values).]
8 Dot-Product Attention
◉ Input: a query 𝑞 and a set of key-value (𝑘-𝑣) pairs, which are mapped to an output
◉ Output: weighted sum of values
The weight on each value is the softmax of the inner product between the query and the corresponding key:
A(q, K, V) = Σ_i [exp(q · k_i) / Σ_j exp(q · k_j)] v_i
In matrix form, A(Q, K, V) = softmax(Q K^T) V, with the softmax applied row-wise.
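A minimal NumPy sketch of this definition (not from the slides; names and shapes are illustrative). Vectors are stored as rows, so the weights are softmax(Q K^T) applied row-wise; note that the full Transformer additionally scales the scores by 1/√d_k.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_kv, d), V: (n_kv, d_v).
    Each output row is a weighted sum of the value rows, weighted by the
    softmax of the inner products between the query and the keys."""
    scores = Q @ K.T                    # (n_q, n_kv) inner products
    weights = softmax(scores, axis=-1)  # softmax applied row-wise
    return weights @ V                  # weighted sum of values

# Toy usage: 2 queries attending over 4 key-value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(dot_product_attention(Q, K, V).shape)  # (2, 16)
```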
10 Sequence Encoding: Self-Attention
11 Attention
◉ Encoder-decoder model is important in NMT
◉ RNNs need attention mechanism to handle long dependencies
◉ Attention allows us to access any state
[Figure: self-attention first asks how relevant each input a^1, a^2, a^3, a^4 is to a given position (weights α), then combines the inputs by a weighted sum.]
Attention scores are dot products between a query and the keys:
α_{1,2} = q^1 · k^2, α_{1,3} = q^1 · k^3, α_{1,4} = q^1 · k^4
where q^1 = W^q a^1 is the query and k^i = W^k a^i are the keys.
16 Self-Attention
The scores are normalized with a softmax:
α'_{1,i} = exp(α_{1,i}) / Σ_j exp(α_{1,j})
with q^1 = W^q a^1 and k^i = W^k a^i (including k^1 = W^k a^1).
17 Self-Attention
Extract information based on the attention scores:
b^1 = Σ_i α'_{1,i} v^i
where v^i = W^v a^i are the values.
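A step-by-step sketch of the three steps above (score, softmax, weighted sum) for the first position. This is NumPy with random placeholder weights for W^q, W^k, W^v; vectors are stored as rows here, so the projections read a^i W rather than W a^i.

```python
import numpy as np

d_in, d_k, d_v, seq_len = 6, 4, 4, 4
rng = np.random.default_rng(1)
A  = rng.normal(size=(seq_len, d_in))   # rows are the inputs a^1 ... a^4
Wq = rng.normal(size=(d_in, d_k))       # placeholder W^q
Wk = rng.normal(size=(d_in, d_k))       # placeholder W^k
Wv = rng.normal(size=(d_in, d_v))       # placeholder W^v

q1 = A[0] @ Wq                          # q^1 = W^q a^1
K  = A @ Wk                             # k^i = W^k a^i, stacked as rows
V  = A @ Wv                             # v^i = W^v a^i, stacked as rows

scores  = K @ q1                        # alpha_{1,i} = q^1 . k^i
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # alpha'_{1,i}: softmax over i
b1 = weights @ V                        # b^1 = sum_i alpha'_{1,i} v^i
print(b1.shape)                         # (4,), i.e. d_v
```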
18 Self-Attention
All outputs are computed in parallel: b^1, b^2, b^3, b^4 are obtained from a^1, a^2, a^3, a^4 at the same time, e.g.
b^2 = Σ_i α'_{2,i} v^i
using the query q^2 together with the same keys k^i and values v^i.
20 Self-Attention
Stacking the inputs as the columns of I = [a^1 a^2 a^3 a^4], the projections become matrix products:
q^i = W^q a^i → Q = W^q I
k^i = W^k a^i → K = W^k I
v^i = W^v a^i → V = W^v I
21 Self-Attention
Each attention score is an inner product between a key and a query:
α_{1,1} = (k^1)^T q^1, α_{1,2} = (k^2)^T q^1, α_{1,3} = (k^3)^T q^1, α_{1,4} = (k^4)^T q^1
Collecting the scores for all queries gives the attention matrix
A = K^T Q, and A' = softmax(A), normalized over each column (one column per query).
22 Self-Attention
Each output is a weighted sum of the values, b^1 = Σ_i α'_{1,i} v^i; stacking the outputs as columns,
O = [b^1 b^2 b^3 b^4] = V A'
23 Self-Attention
Q = W^q I, K = W^k I, V = W^v I (W^q, W^k, W^v are the parameters to be learned)
A = K^T Q, A' = softmax(A) (the attention matrix)
O = V A'
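The same computation in the matrix form of slides 20-23, keeping the slides' convention that the input vectors are the columns of I. The weights are random placeholders; this is a minimal sketch rather than an optimized implementation.

```python
import numpy as np

def softmax_cols(X):
    """Column-wise softmax: each column of the attention matrix is one query."""
    X = X - X.max(axis=0, keepdims=True)
    E = np.exp(X)
    return E / E.sum(axis=0, keepdims=True)

d_in, d_k, d_v, seq_len = 6, 4, 4, 4
rng = np.random.default_rng(2)
I  = rng.normal(size=(d_in, seq_len))   # columns are a^1 ... a^4
Wq = rng.normal(size=(d_k, d_in))       # parameters to be learned (placeholders)
Wk = rng.normal(size=(d_k, d_in))
Wv = rng.normal(size=(d_v, d_in))

Q  = Wq @ I                             # Q = W^q I
K  = Wk @ I                             # K = W^k I
V  = Wv @ I                             # V = W^v I
A  = K.T @ Q                            # attention matrix; entry (i, j) = k^i . q^j (0-indexed)
Ap = softmax_cols(A)                    # A' = softmax(A), one column per query
O  = V @ Ap                             # O = V A', columns are b^1 ... b^4
print(O.shape)                          # (d_v, seq_len)
```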
24 Transformer Idea
[Figure: the encoder stacks self-attention and feed-forward (FFNN) layers; the decoder stacks self-attention, encoder-decoder attention, and feed-forward layers. Inside each attention block, queries, keys, and values are produced by matrix multiplications (MatMul), scored by a dot product, normalized by a softmax, and used to form a weighted sum of the values.]
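To make the diagram concrete, here is a minimal sketch of a single encoder layer: self-attention followed by a position-wise feed-forward network, each with a residual connection. Layer normalization and multiple heads are omitted and all weights are random placeholders, so this is an illustration rather than the full Transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model), tokens as rows."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T, axis=-1) @ V        # one weight distribution per query

def encoder_layer(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """Self-attention sub-layer, then a position-wise feed-forward sub-layer,
    each followed by a residual connection (Add); Norm is omitted here."""
    X = X + self_attention(X, Wq, Wk, Wv)
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2  # ReLU feed-forward network
    return X + ffn

# Toy usage with random placeholder weights
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(3)
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(encoder_layer(X, Wq, Wk, Wv, W1, b1, W2, b2).shape)  # (5, 8)
```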
[Figure: motivation for multiple heads. For the verb "kicked", different attention heads can focus on different relations, e.g. who did the kicking, what was done, and to whom.]
36 Multi-Head Attention (2 heads as example)
Each position i first computes q^i = W^q a^i, k^i = W^k a^i, v^i = W^v a^i. Every head then applies its own projection, e.g. for the queries:
q^{i,1} = W^{q,1} q^i, q^{i,2} = W^{q,2} q^i
and likewise for k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2} (at every position i, j). Head 1 attends using only the *,1 vectors and produces b^{i,1}; head 2 uses the *,2 vectors and produces b^{i,2}.
37 Multi-Head Attention (2 heads as example)
The per-head outputs are concatenated and projected back to the model dimension:
b^i = W^O [b^{i,1}; b^{i,2}]
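A sketch of the two-head case (NumPy, random placeholder weights). The slides first form q^i, k^i, v^i and then split them per head; here the per-head projections are applied directly to the inputs, which is a common equivalent formulation. Each head runs attention independently, and the concatenated head outputs are mixed by W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T, axis=-1) @ V      # rows are queries

d_model, d_head, seq_len, n_heads = 8, 4, 5, 2
rng = np.random.default_rng(4)
X = rng.normal(size=(seq_len, d_model))       # rows are a^1 ... a^5

heads = []
for h in range(n_heads):
    # per-head projections (placeholders for W^{q,h}, W^{k,h}, W^{v,h})
    Wq = rng.normal(size=(d_model, d_head))
    Wk = rng.normal(size=(d_model, d_head))
    Wv = rng.normal(size=(d_model, d_head))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))   # rows are b^{i,h}

Wo = rng.normal(size=(n_heads * d_head, d_model))     # output projection W^O
B  = np.concatenate(heads, axis=-1) @ Wo              # b^i = W^O [b^{i,1}; b^{i,2}]
print(B.shape)                                        # (5, 8)
```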
38 Sequence Encoding: Transformer
39 Transformer Overview
◉ Non-recurrent encoder-decoder for MT
◉ PyTorch explanation by Sasha Rush
http://nlp.seas.harvard.edu/2018/04/03/attention.html
[Figure: Transformer architecture. The decoder uses masked multi-head attention over its own outputs, followed by multi-head attention over the encoder outputs.]
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
45 Encoder Input
◉ Problem: temporal information is missing (self-attention itself is insensitive to word order)
◉ Solution: positional encoding allows words at different positions to have different embeddings with fixed dimensions
[Figure: encoder block with Feed Forward and Add & Norm layers.]
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
46 Positional Encoding
◉ Criteria for positional encoding
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences
49 Positional Encoding
◉ Criteria for positional encoding
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences
◉ Idea 3:
○ Use the normalized value of the position (0~1)
○ Drawback: the distance between neighboring positions differs for sentences of different lengths
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
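For reference (not shown on the slides above), the sinusoidal positional encoding from the original Transformer paper is one encoding that satisfies these criteria: it is deterministic, unique per position, keeps a consistent notion of distance, and extends to sequences longer than those seen in training. A minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i   = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16); added element-wise to the token embeddings
```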
54 Multi-Head Attention Details
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
55 Training Tips
◉ Byte-pair encodings (BPE)
◉ Checkpoint averaging
◉ Adam optimizer with a warm-up learning-rate schedule (see the sketch after this list)
◉ Dropout during training at every layer just before adding residual
◉ Label smoothing
◉ Auto-regressive decoding with beam search and length penalties
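A sketch of the learning-rate schedule from the original Transformer paper (linear warm-up followed by inverse-square-root decay); the constants below are the paper's defaults, not values stated on these slides.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warm-up for the first `warmup_steps`, then ~1/sqrt(step) decay."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 40000):
    print(s, round(transformer_lr(s), 6))
```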
56 MT Experiments