05 Attention Slides
[Figure: an RNN sequence-to-sequence model; the encoder reads the input "This is my cat ." (x1…x5) into states z0…z5, the final state z5 serves as the context for the decoder states h1…h4, which produce the outputs y1…y4.]
• The problem with this model: it is difficult to encode the whole sentence in a single vector z5 of fixed size.
Encoding to a varying-length representation
• Intuition: The longer the input sentence, the longer our representation should be. Let the length of our representation be equal to the length of the input sequence.
[Figure: an encoder maps the inputs x1…x5 to a same-length representation z1…z5, for example the hidden states h1…h5 of an RNN.]
• We can use the intermediate states of an RNN as the representations, but this does not work well: the representation z1 at the first position does not depend on the subsequent words.
Encoding with bi-directional RNN
• In the classical model (Bahdanau et al., 2014), the varying-length representation was built using a bi-directional RNN.
[Figure: a bi-directional RNN over x1…x5 with backward states ←z1…←z5 and forward states →z1…→z5, combined into the representations z1…z5.]
• The bi-directional RNN does two passes through the input sequence: forward and backward.
• The output at position j is a concatenation $z_j = [\overrightarrow{z}_j; \overleftarrow{z}_j]$ of the states (or outputs) of the forward and backward passes.
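A minimal sketch of such an encoder, assuming PyTorch; the GRU cell, the class name BiRNNEncoder, and the hyperparameters are illustrative choices rather than the lecture's setup. The concatenation $z_j = [\overrightarrow{z}_j; \overleftarrow{z}_j]$ is exactly what a bidirectional RNN returns at each position.

```python
# A minimal sketch (not the lecture code) of a bi-directional RNN encoder.
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs the forward and backward passes in one call
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, seq_len) token ids
        states, _ = self.rnn(self.embed(x))
        # states[:, j] = [forward state at j ; backward state at j], size 2 * hidden_dim
        return states                # (batch, seq_len, 2 * hidden_dim)

z = BiRNNEncoder()(torch.randint(0, 1000, (1, 5)))   # one sentence of 5 tokens
print(z.shape)                                        # torch.Size([1, 5, 256])
```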
Sequence-to-sequence model: context in decoding
[Figure: the decoder is conditioned on a context computed by the encoder from the inputs x1…x5.]
Attention: Using context of varying length
[Figure: when generating each output y1…y4, the decoder uses a context of varying length computed from the inputs x1…x5.]
Attention mechanism from (Bahdanau et al., 2014)
• Scores $e_j$ are computed using the current decoder state $h_{i-1}$ and the representation $z_j$:
$$e_j = f(h_{i-1}, z_j)$$
• The scores are normalized with a softmax into weights $\alpha_{ij}$, which are used to form the context $c_i = \sum_j \alpha_{ij} z_j$ for decoder step $i$.
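A minimal NumPy sketch of one such attention step, using an additive (MLP) scoring function $f$ as in Bahdanau et al. (2014); the weight matrices Wa, Ua, va and all sizes are illustrative assumptions.

```python
# A minimal NumPy sketch of Bahdanau-style (additive) attention for one decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_step(h_prev, Z, Wa, Ua, va):
    """h_prev: (dh,) previous decoder state; Z: (n, dz) encoder representations."""
    e = np.tanh(h_prev @ Wa + Z @ Ua) @ va        # e_j = f(h_{i-1}, z_j), additive scoring
    alpha = softmax(e)                            # attention weights alpha_{ij}
    context = alpha @ Z                           # c_i = sum_j alpha_{ij} z_j
    return context, alpha

rng = np.random.default_rng(0)
n, dz, dh, da = 5, 8, 6, 10
context, alpha = bahdanau_step(rng.normal(size=dh), rng.normal(size=(n, dz)),
                               rng.normal(size=(dh, da)), rng.normal(size=(dz, da)),
                               rng.normal(size=da))
print(alpha.round(2), context.shape)              # weights over 5 positions, (8,)
```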
Full architecture from (Bahdanau et al., 2014)
[Figure: full architecture — a bi-directional RNN encoder produces z1…z5 (concatenations of forward states →z1…→z5 and backward states ←z1…←z5) from x1…x5; the attention module combines them into a context (e.g. c3) using the previous decoder state (e.g. h2); the decoder states h1…h4 produce y1…y4.]
Attention-based models have much better performance
Attention coefficients
• The weights $\alpha_{ij}$ can be visualized: the x-axis and y-axis of each plot correspond to the words in the source sentence and the generated translation, respectively.
Neural image captioning (Xu et al., 2016)
• The image is preprocessed into 14 × 14 feature maps with a convolutional network pre-trained on ImageNet.
• The 14 × 14 feature maps are split into L annotation vectors zj.
• The annotation vectors are used as the context in the decoding RNN.
[Figure: the decoder states h1…h4 attend over the annotation vectors z1…zL (e.g. forming context c3 from state h2) while generating the caption y1…y4.]
Convolutional sequence-to-sequence models
(Gehring et al., 2017)
Problems with RNN encoders
Convolutional encoder
• Gehring et al. (2017) proposed to use a convolutional network (CNN) to encode input sequences.
• Since convolutional layers have shared weights, they can process sequences of varying lengths.
[Figure: a CNN maps the input embeddings x1 + p1, …, x5 + p5 to the representations z1…z5.]
• Advantage: the CNN can compute the representations at all positions in parallel (unlike a bi-directional RNN, which processes the sequence sequentially), while still using both preceding and subsequent positions.
• Disadvantage: the CNN does not take into account whether a position is at the beginning or at the end of the sequence.
• Gehring et al. (2017) fix this problem by adding position embeddings pj to the word embeddings xj (see the sketch below).
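A minimal sketch, assuming PyTorch, of a convolutional encoder with learned position embeddings in the spirit of Gehring et al. (2017); the ReLU non-linearity, the residual connections, and all sizes are simplifying assumptions (the original model uses gated linear units).

```python
# A minimal sketch of a convolutional encoder with position embeddings added to word embeddings.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, vocab_size=1000, max_len=100, d=128, kernel=3, layers=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_len, d)          # p_j, one embedding per position
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d, kernel, padding=kernel // 2) for _ in range(layers))

    def forward(self, x):                                # x: (batch, seq_len) token ids
        pos = torch.arange(x.size(1), device=x.device)
        h = self.word_emb(x) + self.pos_emb(pos)         # x_j + p_j
        h = h.transpose(1, 2)                            # (batch, d, seq_len) for Conv1d
        for conv in self.convs:
            h = torch.relu(conv(h)) + h                  # conv layer with residual connection
        return h.transpose(1, 2)                         # z_1..z_n: (batch, seq_len, d)

z = ConvEncoder()(torch.randint(0, 1000, (1, 5)))
print(z.shape)                                            # torch.Size([1, 5, 128])
```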
Decoding with convolutional layers
• Can we also avoid using RNNs in the decoder?
• The decoder is an autoregressive model with the context provided by the encoder.
[Figure: the decoder produces y1…y4 from the context provided by the encoder.]
Decoding with convolutional layers
• Advantage of a convolutional decoder: during training, we can compute the output elements at all positions in parallel. Recall that with the RNN decoder, we had to produce the output sequence one element at a time.
[Figure: the decoder takes the shifted sequence SOS, y1, y2, y3 and the context as input and produces y1…y4; each output yi is computed only from SOS, y1, …, yi−1.]
An autoregressive model with a 1d convolutional layer
• We can make sure that the receptive field of $y_i$ does not contain the subsequent elements $y_{i'}$, $i' \ge i$, by using shifted convolutions.
[Figure: a standard convolution vs. a shifted convolution over the inputs SOS, y1, …, y4 producing y1, …, y5; the shifted convolution only looks at the current and preceding positions.]
• If we stack multiple convolutional layers built in the same way, the desired property is preserved (see the sketch below).
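A minimal sketch of a shifted (causal) 1d convolution, assuming PyTorch: padding only on the left keeps the sequence length while guaranteeing that the output at position i only sees inputs at positions up to i. The class name and sizes are illustrative.

```python
# A minimal sketch of a "shifted" (causal) 1d convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedConv1d(nn.Module):
    def __init__(self, channels=64, kernel=3):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv1d(channels, channels, kernel)   # no built-in padding

    def forward(self, h):                   # h: (batch, channels, seq_len)
        h = F.pad(h, (self.kernel - 1, 0))  # pad (kernel - 1) zeros on the left only
        return self.conv(h)                 # same seq_len, causal receptive field

x = torch.randn(1, 64, 5, requires_grad=True)
y = ShiftedConv1d()(x)
# Check causality: the output at position 2 has zero gradient w.r.t. inputs at positions > 2
y[0, :, 2].sum().backward()
print(x.grad[0, :, 3:].abs().sum().item())  # 0.0
```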
Attention in a convolutional decoder
• How can we use the context provided by the encoder in such a decoder?
• Attention used in (Gehring et al., 2017):
$$o_i = \sum_{j=1}^{n} \alpha_{ij} \, (z_j + x_j + p_j)$$
where $x_j$ are the word embeddings and $p_j$ are the position embeddings of the input sequence.
• The attention weights are
$$\alpha_{ij} = \frac{\exp(h_i^\top z_j)}{\sum_{j'=1}^{n} \exp(h_i^\top z_{j'})}$$
[Figure: the decoder states h1…h4 are computed from SOS, y1, y2, y3 with shifted 1d convolutions; the attention module combines them with the encoder outputs z1…zn to produce o1…o4.]
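A minimal NumPy sketch of the attention above for a batch of decoder states; all arrays are toy values, and the matrix H stands in for the decoder states produced by the shifted convolutions.

```python
# A minimal NumPy sketch of the ConvS2S decoder attention.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                    # source length, model dimension
Z, X, P = (rng.normal(size=(n, d)) for _ in range(3))   # encoder outputs, word and position embeddings
H = rng.normal(size=(4, d))                    # decoder states h_1..h_4 from the shifted convolutions

scores = H @ Z.T                                          # h_i^T z_j
scores -= scores.max(axis=1, keepdims=True)               # for numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
O = alpha @ (Z + X + P)                                   # o_i = sum_j alpha_ij (z_j + x_j + p_j)
print(alpha.shape, O.shape)                               # (4, 5) (4, 8)
```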
Full architecture from (Gehring et al., 2017)
• They used multiple decoder blocks stacked on top of each other.
• Each decoder block attends to the encoder outputs z1…zn.
[Figure: a CNN encoder maps x1…xn to z1…zn; each decoder block applies shifted convolutions to SOS, y1, y2, y3 and attends to z1…zn, producing y1…y4.]
Full architecture (Gehring et al., 2017)
Translation performance
Transformers
(Vaswani et al., 2017)
Transformer architecture
• The general architecture is similar to ConvS2S:
  • The encoder converts the input sequence (x1, ..., xn) into continuous representations (z1, ..., zn).
  • The decoder processes all positions in parallel, using the shifted output sequence (y1, ..., ym) as input and output. The autoregressive structure is preserved by masking.
  • The decoder attends to the representations (z1, ..., zn).
[Figure: the encoder maps x1…xn to z1…zn; a decoder layer computes states h1…h4 from SOS, y1, y2, y3 and attends to z1…zn to produce y1…y4.]
Transformer: Attention mechanism
Transformer: Scaled dot-product attention
• In the notation used so far (encoder representations $z_j$, decoder state $h_i$):
$$o_i = \sum_{j=1}^{n} \alpha_{ij} z_j, \qquad \alpha_{ij} = \frac{\exp(z_j^\top h_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(z_{j'}^\top h_i / \sqrt{d_k})}$$
• In the Transformer's query–key–value notation (queries $q_i$, keys $k_j$, values $v_j$):
$$o_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(k_j^\top q_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(k_{j'}^\top q_i / \sqrt{d_k})}$$
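A minimal NumPy sketch of scaled dot-product attention in matrix form; the matrix sizes in the usage example are illustrative.

```python
# A minimal NumPy sketch of scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # q_i^T k_j / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)            # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ V                                         # o_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
O = scaled_dot_product_attention(rng.normal(size=(4, 16)),
                                 rng.normal(size=(5, 16)),
                                 rng.normal(size=(5, 32)))
print(O.shape)                                               # (4, 32)
```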
Multi-head attention
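The figure on this slide is not reproduced; as a reference point, here is a minimal NumPy sketch of multi-head attention as described in Vaswani et al. (2017): several scaled dot-product attentions run in parallel on learned linear projections of the queries, keys, and values, and their outputs are concatenated and projected. All weight matrices, sizes, and the head-splitting layout are illustrative assumptions.

```python
# A minimal NumPy sketch of multi-head attention.
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    a = np.exp(s - s.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ V

def multi_head_attention(X_q, X_kv, Wq, Wk, Wv, Wo, heads=4):
    d = Wq.shape[1] // heads                         # per-head dimension
    outs = [attention(X_q @ Wq[:, i*d:(i+1)*d],      # per-head query projection
                      X_kv @ Wk[:, i*d:(i+1)*d],     # per-head key projection
                      X_kv @ Wv[:, i*d:(i+1)*d])     # per-head value projection
            for i in range(heads)]
    return np.concatenate(outs, axis=-1) @ Wo        # concatenate heads, final projection

rng = np.random.default_rng(0)
d_model = 32
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(rng.normal(size=(4, d_model)), rng.normal(size=(5, d_model)),
                           Wq, Wk, Wv, Wo)
print(out.shape)                                     # (4, 32)
```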
Encoder-decoder attention
[Figure: the encoder-decoder attention block receives the values V and keys K from the encoder and the queries Q from the decoder, which produces (y1, ..., ym).]
Transformer encoder: Self-attention
• Advantage: the first position affects the representation at the last position (and vice versa) already after one layer! (Compare with how many layers are needed for that in RNN or convolutional encoders.)
Transformer encoder
• MLP networks are inside the "Feed Forward" block.
• The encoder is a stack of multiple such blocks (each block contains an attention module and a mini-MLP).
• Each block contains standard deep learning tricks:
  • skip connections
  • layer normalization
[Figure: the encoder stack maps the inputs to the representations (z1, ..., zn).]
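A minimal sketch of one such encoder block, assuming PyTorch; the post-norm placement of layer normalization, the ReLU feed-forward network, and all sizes are illustrative assumptions rather than the exact configuration from the lecture.

```python
# A minimal sketch of one Transformer encoder block: self-attention and a mini-MLP,
# each wrapped in a skip connection followed by layer normalization.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)                          # self-attention: Q = K = V = x
        x = self.norm1(x + a)                              # skip connection + layer norm
        x = self.norm2(x + self.ff(x))                     # mini-MLP with skip connection
        return x

z = EncoderBlock()(torch.randn(2, 5, 128))
print(z.shape)                                             # torch.Size([2, 5, 128])
```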
Transformer decoder
• Similarly to ConvS2S, the decoder implements an autoregressive model with the context provided by the encoder.
• When predicting the word yi, we can use the preceding words y1, ..., yi−1 but not the subsequent words yi, ..., ym.
[Figure: the decoder produces y1…y4 using the encoder outputs z1…zn as the context.]
Transformer decoder: Masked self-attention
• When computing the output $h_i$, we do not want to use the subsequent positions $v_{i+1}, ..., v_m$. We can do that by adding attention masks $m_{ij}$ to the scores before the softmax:
$$m_{ij} = 0 \quad \text{if } j \le i, \qquad m_{ij} = -\infty \ \text{(and therefore } \alpha_{ij} = 0\text{)} \quad \text{if } j > i.$$
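A minimal NumPy sketch of masked self-attention: the mask is added to the scores before the softmax, and the final check verifies that no weight is placed on subsequent positions. Sizes are illustrative.

```python
# A minimal NumPy sketch of masked (causal) self-attention with an additive mask.
import numpy as np

def masked_self_attention(Q, K, V):
    m = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.where(np.tril(np.ones((m, m))) == 1, 0.0, -np.inf)   # m_ij = 0 if j <= i, else -inf
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # in self-attention, Q, K and V come from the same sequence
out, alpha = masked_self_attention(X, X, X)
print(np.triu(alpha, k=1).sum())           # 0.0: no weight on subsequent positions
```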
Decoder
Transformer’s positional encoding
• If we shuffle the elements of the input sequence, the outputs will be shuffled in the same way. Thus, the computed representations do not depend on the order of the elements in the input sequence.
• This is not desired: the order of the words is important for understanding the meaning of a sentence.
Transformer’s positional encoding
• Recall: ConvS2S used position embeddings (positions are embedded just like words) and added them to the word embeddings. The Transformer instead adds fixed sinusoidal positional encodings $PE_p$ to the word embeddings.
• Motivation: it is easy for the model to learn to attend by relative positions, since for any fixed offset $k$, $PE_{p+k}$ can be represented as a linear function of $PE_p$.
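A minimal NumPy sketch of the sinusoidal encoding from Vaswani et al. (2017), $PE_{(p,\,2i)} = \sin(p / 10000^{2i/d})$ and $PE_{(p,\,2i+1)} = \cos(p / 10000^{2i/d})$; the sequence length and model dimension in the usage example are illustrative.

```python
# A minimal NumPy sketch of the sinusoidal positional encoding.
import numpy as np

def positional_encoding(seq_len, d_model):
    p = np.arange(seq_len)[:, None]                        # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]                   # dimension pairs
    angles = p / np.power(10000.0, 2 * i / d_model)        # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

PE = positional_encoding(50, 128)                          # added to the word embeddings
print(PE.shape)                                            # (50, 128)
```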
Transformer: Full model
Translation performance
Rotary position embedding (Su et al., 2021)
• The Transformer's attention uses the dot product $\langle \cdot, \cdot \rangle$ of transformed word embeddings for matching queries to keys:
$$(W_q (x_i + p_i))^\top (W_k (x_j + p_j)) = \langle f_q(x_i, i), f_k(x_j, j) \rangle$$
• Intuitively, we would like an attention mechanism that is invariant to the absolute positions of the words in the sequence but is sensitive to their relative positions:
$$\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$$
• In the two-dimensional case, Su et al. (2021) show that this is satisfied by
$$f_q(x_m, m) = (W_q x_m)\, e^{i m \theta}, \quad f_k(x_n, n) = (W_k x_n)\, e^{i n \theta}, \quad g(x_m, x_n, m - n) = \mathrm{Re}\big[(W_q x_m)(W_k x_n)^* e^{i (m - n) \theta}\big],$$
where Re[·] is the real part of a complex number and (·)∗ denotes the complex conjugate.
Rotary position embedding (Su et al., 2021)
• General case:
$$f_q(x_m, m) = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix} W_q x_m$$
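A minimal NumPy sketch of the rotation above: each consecutive pair of dimensions of the projected vector $W_q x_m$ is rotated by the angle $m\theta_i$. The function name, the choice of $\theta_i$, and the sizes are illustrative; the final check shows that the resulting dot product depends only on the relative offset.

```python
# A minimal NumPy sketch of rotary position embedding (RoPE) applied to one projected vector.
import numpy as np

def rope(q, m, base=10000.0):
    """q: (d,) projected vector (e.g. W_q x_m) at position m; d must be even."""
    d = q.shape[0]
    theta = base ** (-2 * np.arange(d // 2) / d)           # theta_1, ..., theta_{d/2}
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    q1, q2 = q[0::2], q[1::2]                              # the two components of each 2d pair
    out = np.empty_like(q)
    out[0::2] = q1 * cos - q2 * sin                        # 2d rotation of each pair
    out[1::2] = q1 * sin + q2 * cos
    return out

rng = np.random.default_rng(0)
x_q, x_k = rng.normal(size=8), rng.normal(size=8)
# The attention score depends only on the relative offset m - n:
print(np.allclose(rope(x_q, 3) @ rope(x_k, 5), rope(x_q, 13) @ rope(x_k, 15)))   # True
```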
Vision Transformers (Dosovitskiy et al., 2020)
• Although introduced for natural language processing tasks, transformers are now used in many other domains and show strong performance.
• The Vision Transformer (ViT) is typically pre-trained on large datasets and then fine-tuned on (smaller) downstream tasks.
BERT: Transformer-based language model
(Devlin et al., 2018)
Transfer learning in natural language processing
BERT: Pre-trained language model (Devlin et al., 2018)
BERT: Pre-training task 1
[Figure: the input to BERT is a pair of sentences, Sentence A and Sentence B, given as a sequence of tokens.]
BERT: Pre-training task 2
[Figure: the model outputs a True/False prediction for the input sentence pair.]
Fine-tuning BERT on a sentence classification task
Sentence: Although the value added services being provided are great but the prices are high. Class: mixed review
Sentence: Great work done #XYZ Problem resolved by customer care in just one day. Class: positive review
Fine-tuning BERT on a question answering task
Paragraph: Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer, dancer and actress. Born
and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the
lead singer of Destiny’s Child, one of the best-selling girl groups of all time.
BERT: GLUE Test results
The number below each task denotes the number of training examples.
F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B,
and accuracy scores are reported for the other tasks.
Recap
Home assignment
Assignment 05 transformer
Recommended reading