
Applied Deep Learning

Transformer
October 5th, 2023 http://adl.miulab.tw
2 Sequence Encoding: Basic Attention
3 Representations of Variable Length Data
◉ Input: word sequence, image pixels, audio signal, click logs
◉ Properties: continuity, temporal structure, importance distribution
◉ Example
✓ Basic combination: average, sum
✓ Neural combination: network architectures should consider input domain properties
− CNN (convolutional neural network)
− RNN (recurrent neural network): temporal information



4 Recurrent Neural Networks
◉ Learning variable-length representations
✓ Fit for sentences and sequences of values
◉ Sequential computation makes parallelization difficult
◉ No explicit modeling of long- and short-range dependencies

[Figure: stacked RNN layers processing the input "have a nice …" token by token]
5 Convolutional Neural Networks
◉ Easy to parallelize
◉ Exploit local dependencies
✓ Long-distance dependencies require many layers

[Figure: convolution layers over the input "have a nice …", with FFNN layers stacked on top]
6 Attention
◉ Encoder-decoder model is important in NMT
◉ RNNs need an attention mechanism to handle long-range dependencies
◉ Attention allows us to access any state

[Figure: an RNN encoder over the source "深度學習" ("deep learning") produces states ℎ1–ℎ4; without attention, the decoder generating "deep learning <END>" must rely on the final state ℎ4 alone to carry the information of the whole sentence]
7 Basic Attention

[Figure: the decoder state 𝑧0 (query) is matched against the encoder states ℎ1–ℎ4 (keys) to produce scores; a softmax turns them into weights α̂1–α̂4 (e.g. 0.5, 0.5, 0.0, 0.0), and the context vector 𝑐0 is the corresponding weighted sum of the values ℎ1–ℎ4 for the source "深度學習"]
8 Dot-Product Attention
◉ Input: a query 𝑞 and a set of key-value (𝑘-𝑣) pairs
◉ Output: a weighted sum of the values, where each weight is computed from the inner product of the query and the corresponding key

  A(𝑞, 𝐾, 𝑉) = Σ_i [exp(𝑞 ∙ 𝑘_i) / Σ_j exp(𝑞 ∙ 𝑘_j)] 𝑣_i

✓ Query 𝑞 is a 𝑑_𝑘-dim vector
✓ Key 𝑘 is a 𝑑_𝑘-dim vector
✓ Value 𝑣 is a 𝑑_𝑣-dim vector
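A minimal NumPy sketch of this definition (illustrative only; the shapes and names are my own, not from the slides): the query is scored against every key with an inner product, the scores are normalized by a softmax, and the output is the corresponding weighted sum of the values.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values."""
    scores = K @ q                       # inner product of q with every key, shape (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the n key positions
    return weights @ V                   # weighted sum of values, shape (d_v,)

# toy example: 4 key-value pairs, d_k = d_v = 3
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
print(dot_product_attention(q, K, V))
```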
9 Dot-Product Attention in Matrix
◉ Input: multiple queries stacked as rows of a matrix 𝑄 and a set of key-value (𝑘-𝑣) pairs
◉ Output: a set of weighted sums of the values

  A(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^T) 𝑉   (softmax applied row-wise)
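The same computation for a whole set of queries, as a sketch with hypothetical shapes: stacking the queries as rows of 𝑄 turns the attention into one matrix product followed by a row-wise softmax.

```python
import numpy as np

def attention_matrix(Q, K, V):
    """Q: (m, d_k) queries; K: (n, d_k) keys; V: (n, d_v) values."""
    scores = Q @ K.T                               # (m, n): one row of scores per query
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                   # (m, d_v): one weighted sum per query

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 3)), rng.normal(size=(4, 3)), rng.normal(size=(4, 5))
print(attention_matrix(Q, K, V).shape)             # (2, 5)
```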
10 Sequence Encoding: Self-Attention
11 Attention
◉ Encoder-decoder model is important in NMT
◉ RNNs need an attention mechanism to handle long-range dependencies
◉ Attention allows us to access any state

Using attention to replace recurrent architectures


12 Self-Attention
◉ Constant “path length” between two positions
◉ Easy to parallelize

[Figure: over "have a nice day", every position is compared (cmp) with every other position, the scores are normalized with a softmax, and the weighted sum feeds a position-wise FFNN]


13 Self-Attention
◉ Constant “path length” between two positions
◉ Easy to parallelize
[Figure: outputs 𝒃^𝟏–𝒃^𝟒 are each computed by attending over all inputs 𝒂^𝟏–𝒂^𝟒]
(𝒂^𝟏–𝒂^𝟒 can be either the input or a hidden layer)


14 Self-Attention
◉ Constant “path length” between two positions
◉ Easy to parallelize
[Figure: to compute 𝒃^𝟏, the model asks how relevant each 𝒂^𝒊 is to 𝒂^𝟏 and expresses the relevance as an attention weight α]
(𝒂^𝟏–𝒂^𝟒 can be either the input or a hidden layer)


15 Self-Attention

◉ Each input is projected to a query and keys:
  𝒒^𝟏 = 𝑊^𝑞 𝒂^𝟏   𝒌^𝟐 = 𝑊^𝑘 𝒂^𝟐   𝒌^𝟑 = 𝑊^𝑘 𝒂^𝟑   𝒌^𝟒 = 𝑊^𝑘 𝒂^𝟒
◉ The attention score is the dot product of query and key:
  α_{1,2} = 𝒒^𝟏 ∙ 𝒌^𝟐   α_{1,3} = 𝒒^𝟏 ∙ 𝒌^𝟑   α_{1,4} = 𝒒^𝟏 ∙ 𝒌^𝟒
16 Self-Attention
◉ Normalize the scores with a softmax:
  α′_{1,i} = exp(α_{1,i}) / Σ_j exp(α_{1,j})
  (α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4}) → (α′_{1,1}, α′_{1,2}, α′_{1,3}, α′_{1,4})
◉ where 𝒒^𝟏 = 𝑊^𝑞 𝒂^𝟏 and 𝒌^𝒊 = 𝑊^𝑘 𝒂^𝒊 for every position, including 𝒌^𝟏 = 𝑊^𝑘 𝒂^𝟏
17 Self-Attention
◉ Extract information by weighting the values with the attention scores:
  𝒃^𝟏 = Σ_i α′_{1,i} 𝒗^𝒊
  where 𝒗^𝟏 = 𝑊^𝑣 𝒂^𝟏   𝒗^𝟐 = 𝑊^𝑣 𝒂^𝟐   𝒗^𝟑 = 𝑊^𝑣 𝒂^𝟑   𝒗^𝟒 = 𝑊^𝑣 𝒂^𝟒
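The steps of slides 15-17 for position 1, as a short sketch (𝑊^𝑞, 𝑊^𝑘, 𝑊^𝑣 are random stand-ins for the learned matrices, in a row-vector convention): project each 𝒂^𝒊 to a query, key, and value, score 𝒒^𝟏 against every key, apply the softmax, and take the weighted sum of the values to get 𝒃^𝟏.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # dimensionality of each a^i (toy value)
a = rng.normal(size=(4, d))                    # a^1 ... a^4 as rows
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

q1 = a[0] @ W_q                                # q^1 from a^1
k = a @ W_k                                    # k^i for every position
v = a @ W_v                                    # v^i for every position

alpha = k @ q1                                 # alpha_{1,i} = q^1 . k^i
alpha_prime = np.exp(alpha - alpha.max())
alpha_prime /= alpha_prime.sum()               # softmax -> alpha'_{1,i}

b1 = alpha_prime @ v                           # b^1 = sum_i alpha'_{1,i} v^i
print(b1)
```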
18 Self-Attention
◉ All outputs are computed in parallel:
  𝒃^𝟏, 𝒃^𝟐, 𝒃^𝟑, 𝒃^𝟒 are obtained from 𝒂^𝟏, 𝒂^𝟐, 𝒂^𝟑, 𝒂^𝟒 simultaneously
  (𝒂^𝟏–𝒂^𝟒 can be either the input or a hidden layer)


19 Self-Attention
◉ The same computation yields the other positions, e.g. for position 2:
  𝒃^𝟐 = Σ_i α′_{2,i} 𝒗^𝒊
  using 𝒒^𝟐 against all keys 𝒌^𝟏, 𝒌^𝟐, 𝒌^𝟑, 𝒌^𝟒 and values 𝒗^𝟏, 𝒗^𝟐, 𝒗^𝟑, 𝒗^𝟒
20 Self-Attention
◉ Stack the inputs as columns of a matrix I = [𝒂^𝟏 𝒂^𝟐 𝒂^𝟑 𝒂^𝟒]; then
  𝒒^𝒊 = 𝑊^𝑞 𝒂^𝒊  →  𝑄 = [𝒒^𝟏 𝒒^𝟐 𝒒^𝟑 𝒒^𝟒] = 𝑊^𝑞 I
  𝒌^𝒊 = 𝑊^𝑘 𝒂^𝒊  →  𝐾 = [𝒌^𝟏 𝒌^𝟐 𝒌^𝟑 𝒌^𝟒] = 𝑊^𝑘 I
  𝒗^𝒊 = 𝑊^𝑣 𝒂^𝒊  →  𝑉 = [𝒗^𝟏 𝒗^𝟐 𝒗^𝟑 𝒗^𝟒] = 𝑊^𝑣 I
21 Self-Attention
◉ All attention scores for query 1:
  α_{1,1} = 𝒌^𝟏 ∙ 𝒒^𝟏   α_{1,2} = 𝒌^𝟐 ∙ 𝒒^𝟏   α_{1,3} = 𝒌^𝟑 ∙ 𝒒^𝟏   α_{1,4} = 𝒌^𝟒 ∙ 𝒒^𝟏
◉ Collecting every query and key into one product:
  A = 𝐾^T 𝑄, where column i of A holds the scores (α_{i,1}, α_{i,2}, α_{i,3}, α_{i,4})
  A′ = softmax(A), applied column-wise
22 Self-Attention
◉ The outputs are weighted sums of the values:
  𝒃^𝟏 = Σ_i α′_{1,i} 𝒗^𝒊
◉ In matrix form:
  O = [𝒃^𝟏 𝒃^𝟐 𝒃^𝟑 𝒃^𝟒] = 𝑉 A′
23 Self-Attention
◉ Summary (the parameters to be learned are 𝑊^𝑞, 𝑊^𝑘, 𝑊^𝑣):
  Q = 𝑊^𝑞 I   K = 𝑊^𝑘 I   V = 𝑊^𝑣 I
  A = 𝐾^T Q   A′ = softmax(A)   (attention matrix)
  O = 𝑉 A′
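Slide 23's summary written out in the same column-vector convention (the columns of I are the inputs 𝒂^𝟏…𝒂^𝟒; the weight matrices are random placeholders for the learned parameters):

```python
import numpy as np

def softmax_columns(A):
    A = A - A.max(axis=0, keepdims=True)       # stabilize each column
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)    # softmax over each column (one column per query)

rng = np.random.default_rng(0)
d, n = 4, 4
I = rng.normal(size=(d, n))                    # columns are a^1 ... a^4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q = W_q @ I                                    # Q = W^q I
K = W_k @ I                                    # K = W^k I
V = W_v @ I                                    # V = W^v I

A = K.T @ Q                                    # A = K^T Q, column i holds query i's scores
A_prime = softmax_columns(A)                   # A' = softmax(A), column-wise
O = V @ A_prime                                # O = V A', columns are b^1 ... b^4
print(O.shape)                                 # (4, 4)
```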
24 Transformer Idea
◉ Encoder block: Self-Attention → Feed-Forward NN
◉ Decoder block: Self-Attention → Encoder-Decoder Attention → Feed-Forward NN → softmax over the output vocabulary

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


25 Encoder Self-Attention (Vaswani+, 2017)

[Figure: each encoder position computes keys (MatMulK), queries (MatMulQ), and values (MatMulV); query-key dot products go through a softmax, the weighted sum of values is taken, and the result feeds a position-wise FFNN]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


26 Decoder Self-Attention (Vaswani+, 2017)

[Figure: the decoder applies the same key/query/value dot-product attention over its own positions]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


27 Sequence Encoding: Multi-Head Attention
28 Convolutions

[Figure: a convolution centered on "kicked" in "I kicked the ball" uses different relative positions to capture who, did what, and to whom]


29 Self-Attention

[Figure: a single self-attention head centered on "kicked" takes one weighted average over all words of "I kicked the ball"]


30 Attention Head: who

[Figure: this attention head relates "kicked" to the subject "I" (who)]


31 Attention Head: did what

[Figure: this attention head captures the "did what" relation for "kicked"]


32 Attention Head: to whom

[Figure: this attention head relates "kicked" to "the ball" (to whom)]


33 Multi-Head Attention

[Figure: multiple heads jointly capture who, did what, and to whom for "kicked" in "I kicked the ball"]


34 Comparison
◉ Convolution: different linear transformations by relative position
◉ Attention: a single weighted average over the sequence
◉ Multi-Head Attention: parallel attention layers with different linear transformations on the input/output


35 Multi-Head Attention
◉ Each position i first computes 𝒒^𝒊 = 𝑊^𝑞 𝒂^𝒊 (and likewise 𝒌^𝒊, 𝒗^𝒊), then splits them into heads (2 heads as example):
  𝒒^{𝒊,𝟏} = 𝑊^{𝑞,1} 𝒒^𝒊   𝒒^{𝒊,𝟐} = 𝑊^{𝑞,2} 𝒒^𝒊   (similarly for keys and values)
◉ Head 1 attends using only the head-1 vectors (𝒒^{𝒊,𝟏}, 𝒌^{𝒊,𝟏}, 𝒗^{𝒊,𝟏}, 𝒌^{𝒋,𝟏}, 𝒗^{𝒋,𝟏}, …) and produces 𝒃^{𝒊,𝟏}
36 Multi-Head Attention
◉ Head 2 attends in the same way using 𝒒^{𝒊,𝟐}, 𝒌^{𝒊,𝟐}, 𝒗^{𝒊,𝟐} (and 𝒌^{𝒋,𝟐}, 𝒗^{𝒋,𝟐}, …) and produces 𝒃^{𝒊,𝟐}
37 Multi-Head Attention
◉ The per-head outputs are concatenated and linearly transformed to get the final output:
  𝒃^𝒊 = 𝑊^𝑂 [𝒃^{𝒊,𝟏} ; 𝒃^{𝒊,𝟐}]
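A compact sketch of the two-head example on slides 35-37 (all weight matrices are random placeholders, and the head dimension d/heads is my own choice): each head runs self-attention with its own queries, keys, and values, and the per-position head outputs are concatenated and mapped through 𝑊^𝑂.

```python
import numpy as np

def col_softmax(A):
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def self_attention(Q, K, V):
    return V @ col_softmax(K.T @ Q)            # O = V softmax(K^T Q), column convention

rng = np.random.default_rng(0)
d, n, heads = 4, 5, 2
I = rng.normal(size=(d, n))                    # columns are a^1 ... a^n
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = W_q @ I, W_k @ I, W_v @ I

# per-head projections, e.g. q^{i,1} = W^{q,1} q^i (d/heads dims per head)
dh = d // heads
W_qh = [rng.normal(size=(dh, d)) for _ in range(heads)]
W_kh = [rng.normal(size=(dh, d)) for _ in range(heads)]
W_vh = [rng.normal(size=(dh, d)) for _ in range(heads)]
W_o = rng.normal(size=(d, heads * dh))         # b^i = W^O [b^{i,1}; b^{i,2}]

head_outputs = [self_attention(W_qh[h] @ Q, W_kh[h] @ K, W_vh[h] @ V)
                for h in range(heads)]         # each is (dh, n)
B = W_o @ np.concatenate(head_outputs, axis=0) # concatenate per position, then project
print(B.shape)                                 # (d, n): columns are b^1 ... b^n
```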
38 Sequence Encoding: Transformer
39 Transformer Overview
◉ Non-recurrent encoder-decoder for MT
◉ PyTorch explanation by Sasha Rush
http://nlp.seas.harvard.edu/2018/04/03/attention.html

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


40 Transformer Overview
◉ Non-recurrent encoder-decoder for MT
◉ PyTorch explanation by Sasha Rush
http://nlp.seas.harvard.edu/2018/04/03/attention.html
[Figure: the encoder uses multi-head attention; the decoder uses masked multi-head attention over its own outputs and multi-head attention over the encoder states]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


41 Multi-Head Attention
◉ Idea: allow words to interact with one another
◉ Model
− Map V, K, Q to lower dimensional spaces
− Apply attention, concatenate outputs
− Linear transformation

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


42 Scaled Dot-Product Attention
◉ Problem: when 𝑑_𝑘 gets large, the variance of 𝑞^T 𝑘 increases
  → assume the components of 𝑞 and 𝑘 are random variables with mean 0 and variance 1
  → then 𝑞^T 𝑘 has mean 0 and variance 𝑑_𝑘
  → variance 1 is preferred, so that the softmax does not saturate
◉ Solution: scale the dot products by √𝑑_𝑘
  Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^T / √𝑑_𝑘) 𝑉

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
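A quick numerical check of the argument above, plus the scaled attention itself (a sketch; shapes and sample sizes are my own): with unit-variance random 𝑞 and 𝑘, the raw dot product has variance ≈ 𝑑_𝑘, while dividing by √𝑑_𝑘 brings it back to ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(20000, d_k))              # many unit-variance queries
k = rng.normal(size=(20000, d_k))              # many unit-variance keys
dots = (q * k).sum(axis=1)                     # q^T k for each pair
print(dots.var())                              # ~ d_k = 64
print((dots / np.sqrt(d_k)).var())             # ~ 1 after scaling

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v); softmax(QK^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V
```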


43 Transformer Overview
◉ Non-recurrent encoder-decoder for MT
◉ PyTorch explanation by Sasha Rush
http://nlp.seas.harvard.edu/2018/04/03/attention.html
[Figure: an encoder layer is Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm]

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


44 Transformer Encoder Block
◉ Each block has
  − multi-head attention
  − a 2-layer feed-forward NN (with ReLU)
◉ Both sublayers also have
  − a residual connection
  − layer normalization (LayerNorm)
◉ LayerNorm changes the input to have mean 0 and variance 1 per layer and per training point
  → each sublayer computes LayerNorm(x + sublayer(x))

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
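A minimal sketch of one encoder block following LayerNorm(x + sublayer(x)); the shapes, hidden size, and the simplified single-head attention are my own choices, not the slides'.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)            # mean 0, variance 1 per position

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(x, W_q, W_k, W_v, W_1, b_1, W_2, b_2):
    """x: (n, d_model) token representations (rows are positions)."""
    # sublayer 1: (single-head) self-attention + residual connection + LayerNorm
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn)
    # sublayer 2: 2-layer position-wise feed-forward NN (ReLU) + residual + LayerNorm
    ffn = np.maximum(0, x @ W_1 + b_1) @ W_2 + b_2
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 16
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
          rng.normal(size=(d, d_ff)), np.zeros(d_ff), rng.normal(size=(d_ff, d)), np.zeros(d))
print(encoder_block(rng.normal(size=(n, d)), *params).shape)   # (5, 8)
```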
45 Encoder Input
◉ Problem: temporal information is missing
◉ Solution: positional encoding allows words at different locations to have different embeddings of fixed dimensionality
[Figure: positional encodings are added to the input embeddings before the first encoder block]

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
46 Positional Encoding
◉ Criteria for positional encoding
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences

Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad's Blog


47 Positional Encoding
◉ Criteria for positional encoding
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences
◉ Idea 1:
○ A value to indicate the word’s position
○ Larger values (for longer sentences) may not generalize easily
48 Positional Encoding
◉ Criteria for positional embeddings
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences
◉ Idea 2: 1-hot encoding
○ A d-dim vector to encode d positions
○ Cannot generalize to longer sentences 

  e.g. a d-dim one-hot vector [1, 0, 0, …, 0] can only represent sequences with length ≤ d
49 Positional Encoding
◉ Criteria for positional encoding
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences

◉ Idea 3:
○ The normalized value of the position (0~1)
○ Distances may differ in sentences with different lengths 

  e.g. a 4-word sentence gives PE = 0.25, 0.50, 0.75, 1.00, while a 10-word sentence gives PE = 0.10, 0.20, …, 1.00, so the first word is encoded as 0.25 in one sentence but 0.10 in the other
50 Sinusoidal Positional Encoding
◉ Criteria for positional embeddings
○ Unique encoding for each position
○ Deterministic
○ Distance between neighboring positions should be the same
○ Model can easily generalize to longer sentences
◉ Idea:

○ A d-dim vector to represent positions


51 Sinusoidal Positional Encoding

Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad's Blog
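The standard sinusoidal encoding from Vaswani et al. is PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}); here is a short NumPy sketch (the function name and toy sizes are mine). The final print shows that the dot product between neighboring positions depends only on the offset, not on where they sit in the sentence.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; row p is the encoding of position p."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)      # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dims: sine
    pe[:, 1::2] = np.cos(angle)                       # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape, np.round(pe[1] @ pe[2], 3), np.round(pe[11] @ pe[12], 3))
```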


52 Sinusoidal Positional Encoding
◉ Distance between neighboring positions is
  ○ symmetrical
  ○ decays nicely with time
[Figure: dot product of the position embeddings for all time-steps]

Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad's Blog


53 Encoder Input
◉ Problem: temporal information is missing
◉ Solution: positional encoding allows words at different locations to have different embeddings of fixed dimensionality
[Figure: positional encodings are added to the input embeddings before the first encoder block]

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
54 Multi-Head Attention Details

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
55 Training Tips
◉ Byte-pair encodings (BPE)
◉ Checkpoint averaging
◉ Adam optimizer with a warm-up learning-rate schedule (sketched below)
◉ Dropout during training at every layer, just before adding the residual
◉ Label smoothing
◉ Auto-regressive decoding with beam search and length penalties
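As a concrete example of the schedule used in the paper, lrate = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}) with warmup_steps = 4000; the function below is a sketch, with hyperparameter names of my own choosing.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warm-up followed by inverse-square-root decay (Vaswani et al., 2017)."""
    step = max(step, 1)                        # avoid 0^(-0.5) at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# the rate rises linearly for the first 4000 steps, then decays as 1/sqrt(step)
for s in (100, 4000, 100000):
    print(s, round(transformer_lr(s), 6))
```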
56 MT Experiments

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


57 Parsing Experiments

Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.


58 Concluding Remarks

◉ The non-recurrent model is easy to parallelize


◉ Multi-head attention captures different aspects of the interactions between words
◉ Positional encoding captures location information
◉ Each transformer block can be applied to diverse tasks
