
CS-E4890 Deep Learning

Lecture #5: Attention-based models

Jorma Laaksonen — Juho Kannala — Alexander Ilin


Simple sequence-to-sequence model

• Previously we considered a sequence-to-sequence model for statistical machine translation:

[Figure: an encoder RNN reads the input "This is my cat ." (x1, ..., x5) into states z1, ..., z5; the final state z5 is passed as the context to the decoder RNN, which produces y1, ..., y4.]

• The problem with this model: It is difficult to encode the whole sentence in a single vector z5 of
fixed size.
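
• As a point of reference for the attention-based models below, a minimal PyTorch sketch of such a fixed-context model (GRU cells and the layer sizes are assumptions for the example, not the exact model from the lecture):

```python
# Minimal sketch: a seq2seq model where the whole source sentence is squeezed
# into a single fixed-size context vector (the final encoder state).
import torch
import torch.nn as nn

class SimpleSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: only the final hidden state (the fixed-size context) is kept.
        _, context = self.encoder(self.src_emb(src))       # (1, B, hidden_dim)
        # Decoder: the context initializes the decoder state.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), context)
        return self.out(dec_states)                        # (B, T_tgt, tgt_vocab)
```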

1
Encoding to a representation of varying length

• Intuition: The longer the input sentence, the longer our representation should be. Let the length
of our representation be equal to the length of the input sequence.
[Figure: the encoder maps x1, ..., x5 to representations z1, ..., z5; in the version shown here, the intermediate states h1, ..., h5 of an RNN serve as these representations.]

• We can use intermediate states of the RNN as representations but this does not work well:
representation z1 at the first position does not depend on subsequent words.

2
Encoding with bi-directional RNN

• In the classical model (Bahdanau et al., 2014), the varying-length representation was built using a
  bi-directional RNN.

[Figure: a bi-directional RNN over x1, ..., x5 computes forward and backward hidden states, which are combined into the representations z1, ..., z5.]

• The bi-directional RNN does two passes through the input sequence: forward and backward.
• The output at position j is a concatenation $z_j = [\overrightarrow{z}_j; \overleftarrow{z}_j]$ of the states (or outputs) in the forward
  and backward passes.
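
• A small sketch of such an encoder, assuming GRU cells and arbitrary layer sizes:

```python
# Sketch of a bi-directional RNN encoder: the representation at each position j
# concatenates the forward and backward hidden states, so z_j depends on the
# whole input sentence.
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                  # x: (B, n) token ids
        z, _ = self.rnn(self.emb(x))       # z: (B, n, 2 * hidden_dim)
        return z                           # z[:, j] = [forward state; backward state]

encoder = BiRNNEncoder(vocab_size=1000)
z = encoder(torch.randint(0, 1000, (2, 5)))   # representations z1, ..., z5 for a batch of 2
```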

3
Sequence-to-sequence model: context in decoding

• In the simple seq2seq model (see the first slide), we used the last state of the encoder RNN as the
  context for decoding in each step.
• What should we do if we want to use an encoded representation of varying length?

[Figure: the encoder states z0, ..., z5 over x1, ..., x5 and the decoder states h1, ..., h4 producing y1, ..., y4.]

4
Attention: Using context of varying length

• We can select one of the vectors zj as our context when decoding at step i.
• Which one to select? We let the neural network decide it by itself using the attention mechanism.
• You can think of attention as a switch that selects one of the inputs zj.

[Figure: an attention block sits between the encoder outputs z1, ..., z5 and the decoder states h1, ..., h4.]

5
Attention mechanism from (Bahdanau et al., 2014)

• Select (softly) one of the inputs as the output:

  $c = \sum_{j=1}^{n} \alpha_j z_j, \qquad 0 < \alpha_j < 1, \quad \sum_{j=1}^{n} \alpha_j = 1$

• Weights αj are computed using softmax:

  $\alpha_j = \frac{\exp(e_j)}{\sum_{j'=1}^{n} \exp(e_{j'})}$

• Scores ej are computed using the current decoder state hi−1 and representation zj:

  $e_j = f(h_{i-1}, z_j)$

  where f can be modeled by a multilayer perceptron (MLP).

[Figure: the attention block mixes the encoder outputs z1, ..., z5 into the context c, which is fed to the decoder together with the previous decoder state (h2 in the figure).]
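
• A minimal sketch of this additive (MLP-scored) attention; the two-layer scorer and the layer sizes are illustrative assumptions:

```python
# Sketch of the additive attention of Bahdanau et al. (2014): an MLP scores each
# encoder representation z_j against the previous decoder state, and a softmax
# turns the scores into the mixing weights of the context vector.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim=64):
        super().__init__()
        # f(h_{i-1}, z_j): a small MLP scoring how well z_j matches the decoder state.
        self.score = nn.Sequential(
            nn.Linear(dec_dim + enc_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, h_prev, z):
        # h_prev: (B, dec_dim)   current decoder state h_{i-1}
        # z:      (B, n, enc_dim) encoder representations z_1, ..., z_n
        h_exp = h_prev.unsqueeze(1).expand(-1, z.size(1), -1)      # (B, n, dec_dim)
        e = self.score(torch.cat([h_exp, z], dim=-1)).squeeze(-1)  # (B, n) scores e_j
        alpha = torch.softmax(e, dim=-1)                           # (B, n) weights alpha_j
        c = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)            # (B, enc_dim) context c
        return c, alpha
```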

6
Full architecture from (Bahdanau et al., 2014)

[Figure: the full architecture: a bi-directional RNN encoder over x1, ..., x5 produces z1, ..., z5; at every decoding step the attention block computes a context from the previous decoder state (c3 from h2 in the figure), which the decoder RNN uses to produce y1, ..., y4.]

7
Attention-based models have much better performance

• Using attention significantly improves the quality of translation.

The translation performances on an English-to-French translation task (WMT’14)


Model BLEU
Simple Enc-Dec 17.82
Attention-based Enc-Dec 28.45
Attention-based Enc-Dec (LV) 34.11
Attention-based Enc-Dec (LV, ensemble) 37.19
LV - large vocabulary
source: (Jean et al., 2014)

8
Attention coefficients

• Weights αij can be visualized. The x-axis and y-axis of each plot correspond to the words in the
source sentence and the generated translation, respectively.

9
Neural image captioning (Xu et al., 2016)

• Models with attention have been used in many domains.


• The “Show, Attend and Tell” paper solves the task of image captioning similarly to a translation
  task: images are “translated” to sentences.

10
Neural image captioning (Xu et al., 2016)

• The image is preprocessed into 14 × 14 feature maps with a convolutional network pre-trained on
ImageNet.
• The 14 × 14 feature maps are split into L annotation vectors zj .
• The annotation vectors are used as context in the decoding RNN.

[Figure: the CNN produces annotation vectors z1, ..., zL from the image; the decoder RNN attends over them (e.g., the context c3 is computed from h2) while generating the caption y1, ..., y4.]
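
• A possible way to turn a CNN feature map into annotation vectors, sketched with an assumed 512-channel 14 × 14 feature map (so L = 196):

```python
# Sketch: splitting a 14 x 14 CNN feature map into L = 196 annotation vectors z_j.
# The 512-channel feature map is an assumed shape; in the paper it comes from a
# convolutional network pre-trained on ImageNet.
import torch

feature_map = torch.randn(1, 512, 14, 14)              # (B, C, 14, 14) from the CNN
B, C, H, W = feature_map.shape
annotations = feature_map.flatten(2).transpose(1, 2)   # (B, L, C) with L = H * W = 196
print(annotations.shape)                                # torch.Size([1, 196, 512])
# Each annotations[:, j] is one annotation vector z_j used as context for the decoder attention.
```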

11
Convolutional sequence-to-sequence models
(Gehring et al., 2017)
Problems with RNN encoders

• Problem with RNN encoding:
  • The number of steps is equal to the number of words in the input sentence. This can make
    training slow.
  • We need to take multiple steps to model relations between distant words. Modeling long-term
    dependencies can be difficult with RNNs.
• Since we know how to deal with encodings of varying lengths (using attention), we do not really
  need to use an RNN. The encoder can be any network that converts the input sequence (x1, ..., xn)
  into representations (z1, ..., zn).

[Figure: the bi-directional RNN encoder over x1, ..., x5 shown as the component to be replaced.]

13
Convolutional encoder

• Gehring et al. (2017) proposed to use a convolutional network (CNN) to encode input sequences.
• Since convolutional layers have shared weights, they can process sequences of varying lengths.

[Figure: a CNN maps the inputs x1 + p1, ..., x5 + p5 (word embeddings plus position embeddings) to the representations z1, ..., z5.]

• Advantage: a CNN can compute the representations at all positions in parallel (unlike a
  bi-directional RNN, which processes the sequence step by step), while still using both preceding
  and subsequent positions.
• Disadvantage: a CNN does not take into account whether a position is at the beginning of the
  sequence or at the end.
• Gehring et al. (2017) fix this problem by adding position embeddings pj to the word embeddings xj,
  as sketched below.
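
• A simplified sketch of such a convolutional encoder with learned position embeddings (the depth, widths and plain ReLU blocks are assumptions; the actual ConvS2S model uses gated linear units):

```python
# Sketch of a convolutional encoder in the spirit of Gehring et al. (2017):
# word embeddings plus position embeddings, followed by a stack of 1d convolutions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, max_len=512, d=256, layers=4, kernel=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_len, d)
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d, kernel, padding=kernel // 2) for _ in range(layers)
        )

    def forward(self, x):                              # x: (B, n) token ids
        pos = torch.arange(x.size(1), device=x.device)
        h = self.word_emb(x) + self.pos_emb(pos)       # add position embeddings p_j
        h = h.transpose(1, 2)                          # (B, d, n) for Conv1d
        for conv in self.convs:
            h = torch.relu(conv(h)) + h                # conv block with a skip connection
        return h.transpose(1, 2)                       # (B, n, d): representations z_1, ..., z_n
```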

14
Decoding with convolutional layers

• Can we also avoid using RNNs in the decoder?
• The decoder is an autoregressive model with the context provided by the encoder:

  $y_i = f(y_{i-1}, ..., y_1, z_1, ..., z_n)$

• This function can be modeled by a (convolutional) network:
  • Since we process sequences (inputs with a one-dimensional structure), we use 1d convolutions.
  • Inputs and outputs are the same sequences, but 1) the output is shifted by one position, 2) the
    input sequence starts with a special SOS token.
  • The receptive field of yi should not contain subsequent elements $y_{i'}, i' \geq i$ (this can be achieved
    by using shifted convolutions).

[Figure: a convolutional decoder reads SOS, y1, y2, y3 together with the encoder context and outputs y1, y2, y3, y4.]

15
Decoding with convolutional layers

• Advantage of a convolutional decoder: During training, we can compute the output elements for all
  positions in parallel. Recall that in the RNN decoder, we had to produce the output sequence one
  element at a time.
• At test time (generation mode), we still have to produce the output sequence one element at a time
  (since it is an autoregressive model): each generated element is appended to the decoder input
  before the next element is produced.

[Figure: during training the decoder maps SOS, y1, y2, y3 (plus the encoder context) to y1, y2, y3, y4 in one parallel pass; at test time y1, y2, y3, y4 are generated one at a time, each fed back as input.]

16
An autoregressive model with 1d convolutional layer

• We can make sure that the receptive field of yi does not contain subsequent elements $y_{i'}, i' \geq i$
  by using shifted convolutions.

[Figure: a standard convolution vs. a shifted (causal) convolution mapping the inputs SOS, y1, ..., y4 to the outputs y1, ..., y5; in the shifted version each output only sees the current and preceding inputs.]

• If we stack multiple convolutional layers built in the same way, the desired property is preserved.
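
• A minimal sketch of a shifted (causal) 1d convolution implemented by left-padding the input:

```python
# Sketch of a "shifted" (causal) 1d convolution: left-pad the input by (kernel_size - 1)
# so that output position i only depends on input positions <= i.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                      # x: (B, channels, length)
        x = F.pad(x, (self.pad, 0))            # pad only on the left
        return self.conv(x)                    # (B, channels, length), causal

# Stacking such layers preserves causality: the receptive field only grows to the left.
layer = CausalConv1d(channels=8)
y = layer(torch.randn(2, 8, 5))
print(y.shape)                                  # torch.Size([2, 8, 5])
```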

17
Attention in a convolutional decoder

• How can we use the context provided by the encoder in such a decoder?
• Attention used in (Gehring et al., 2017):

  $o_i = \sum_{j=1}^{n} \alpha_{ij} (z_j + x_j + p_j)$

  where xj are word embeddings and pj are position embeddings for the input sequence.
• The attention weights are

  $\alpha_{ij} = \frac{\exp(h_i^\top z_j)}{\sum_{j'=1}^{n} \exp(h_i^\top z_{j'})}$

• Attention compares hi to the representations zj using a dot product and passes the value zj + xj + pj
  corresponding to the best match.

[Figure: the decoder states h1, ..., h4 produced by a shifted 1d convolution over SOS, y1, y2, y3 attend to the encoder outputs z1, ..., zn to produce o1, ..., o4.]
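
• A minimal sketch of this decoder attention, omitting the learned projections used in the actual model:

```python
# Sketch of the ConvS2S-style decoder attention: dot-product weights between decoder
# states h_i and encoder outputs z_j, attending over the values z_j + x_j + p_j.
import torch

def convs2s_attention(h, z, x_plus_p):
    # h: (B, m, d) decoder states, z: (B, n, d) encoder outputs,
    # x_plus_p: (B, n, d) input word embeddings plus position embeddings
    scores = torch.bmm(h, z.transpose(1, 2))          # (B, m, n): h_i . z_j
    alpha = torch.softmax(scores, dim=-1)             # attention weights alpha_ij
    return torch.bmm(alpha, z + x_plus_p)             # (B, m, d): o_i = sum_j alpha_ij (z_j + x_j + p_j)

o = convs2s_attention(torch.randn(2, 4, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 8))
```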

18
Full architecture from (Gehring et al., 2017)

• They used multiple decoder blocks stacked on top of each other.
• Each decoder block attends to the outputs of the encoder zj.
• There are skip connections (skipping the attention block).

[Figure: the CNN encoder produces z1, ..., zn from the embeddings x1, ..., xn; each decoder block applies a shifted 1d convolution over SOS, y1, y2, y3 followed by attention over the encoder outputs, producing y1, ..., y4.]

19
Translation performance

The translation performances on an English-to-French translation task (WMT’14)


Model BLEU
Simple Enc-Dec 17.82
Attention-based Enc-Dec 28.45
Attention-based Enc-Dec (LV) 34.11
Attention-based Enc-Dec (LV, ensemble) 37.19
ConvS2S (BPE 40K) 40.51

21
Transformers
(Vaswani et al., 2017)
Transformer architecture

• The general architecture is similar to ConvS2S:
  • The encoder converts the input sequence (x1, ..., xn) into continuous representations (z1, ..., zn).
  • The decoder processes all positions in parallel using the shifted output sequence (y1, ..., ym) as
    input and output. The autoregressive structure is preserved by masking.
  • The decoder attends to the representations (z1, ..., zn).

[Figure: the encoder maps (x1, ..., xn) to (z1, ..., zn); the decoder takes (SOS, y1, ..., ym−1) as input, attends to (z1, ..., zn) and produces (y1, ..., ym).]

23
Transformer: Attention mechanism

• Intuition from ConvS2S: The attention mechanism compares the intermediate representations hi
  developed in the decoder with the encoded inputs zj and outputs the zj that is closest to hi.
• Basic attention mechanism in transformers:

  $o_i = \sum_{j=1}^{n} \alpha_{ij} z_j, \qquad \alpha_{ij} = \frac{\exp(z_j^\top h_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(z_{j'}^\top h_i / \sqrt{d_k})}$

  where $d_k$ is the dimensionality of the zj and hi.
• The authors called this scaled dot-product attention.

[Figure: the decoder states h1, ..., h4 attend to the encoder outputs z1, ..., zn, producing o1, ..., o4.]

25
Transformer: Scaled dot-product attention

(y1 , ..., ym )

• We can think of the scaled dot-product attention as finding


values vj = zj with keys kj = zj that are closest to query
qi = hi . (z1 , ..., zn )

• Re-writing the scaled dot-product attention using keys, values


attention
and query: V K Q

n n
X X encoder decoder
oi = αij zj oi = αij vj
j=1 j=1
√ √
exp(z>j hi / dk ) exp(k> j qi / dk )
αij = P n >
√ αij = Pn √
0
j =1
exp(z j 0 hi / dk ) j 0 =1
exp(k>j 0 qi / dk )

(x1 , ..., xn ) (SOS, y1 , ..., ym−1 )

26
Scaled dot-product attention

• Scaled dot-product attention:

  $o_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(k_j^\top q_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(k_{j'}^\top q_i / \sqrt{d_k})}$

• In the matrix form:

  $\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

  with $V \in \mathbb{R}^{n \times d_v}$, $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$.

[Figure: the scaled dot-product attention block from (Vaswani et al., 2017).]
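
• The matrix form maps directly to code; a minimal sketch (the optional additive mask argument anticipates the masked self-attention used later in the decoder):

```python
# Minimal sketch of scaled dot-product attention in matrix form.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., m, d_k), K: (..., n, d_k), V: (..., n, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., m, n)
    if mask is not None:
        scores = scores + mask                           # e.g. -inf at disallowed positions
    alpha = torch.softmax(scores, dim=-1)                # attention weights
    return alpha @ V                                     # (..., m, d_v)

out = scaled_dot_product_attention(torch.randn(2, 4, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 16))
print(out.shape)                                          # torch.Size([2, 4, 16])
```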

27
Multi-head attention

• Instead of doing a single scaled dot-product attention, the authors found it beneficial to project the
  keys, queries and values into lower-dimensional spaces, perform scaled dot-product attention there
  and concatenate the outputs:

  $\mathrm{head}_i = \mathrm{attention}(Q W_i^Q, K W_i^K, V W_i^V)$

  $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h) W^O$

  with $V \in \mathbb{R}^{n \times d_v}$, $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $\mathrm{head}_i \in \mathbb{R}^{m \times d_i}$, output $\in \mathbb{R}^{m \times d_k}$.

[Figure: the multi-head attention block from (Vaswani et al., 2017).]
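
• A compact sketch of multi-head attention, under the common assumption that a single model dimension is split evenly across the heads:

```python
# Sketch of multi-head attention built from scaled dot-product attention per head.
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split(self, x):                                   # (B, len, d_model) -> (B, h, len, d_head)
        B, L, _ = x.shape
        return x.view(B, L, self.h, self.d_head).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        q, k, v = self.split(self.W_q(Q)), self.split(self.W_k(K)), self.split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores + mask
        out = torch.softmax(scores, dim=-1) @ v           # one scaled dot-product attention per head
        out = out.transpose(1, 2).reshape(Q.size(0), Q.size(1), -1)  # concatenate the heads
        return self.W_o(out)                              # output projection W^O
```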

28
Encoder-decoder attention

• Multi-head attention is used by the decoder to attend to the encoder outputs.

[Figure: in the encoder-decoder attention, the encoder outputs (z1, ..., zn) provide the values V and keys K, and the decoder provides the queries Q.]

29
Transformer encoder: Self-attention

• How to implement the encoder? Previously we used a bi-directional RNN or a CNN.
• The transformer encoder uses the following blocks:
  • Process every position x1, ..., xn with the same multilayer perceptron (MLP) network.
  • Mix information from different positions using the multi-head attention mechanism
    (attention is all you need).
  • Self-attention: the inputs xi are used as keys, values and queries.
• Example with a scaled dot-product attention (for simplicity):

  $z_i = \sum_{j=1}^{n} \alpha_{ij} x_j, \qquad \alpha_{ij} = \frac{\exp(x_j^\top x_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(x_{j'}^\top x_i / \sqrt{d_k})}$

• Advantage: The first position affects the representation at the last position (and vice versa)
  already after one layer! (Think how many layers are needed for that in RNN or convolutional
  encoders.)

[Figure: a self-attention layer over x1, ..., x4 followed by position-wise MLPs producing z1, ..., z4.]
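
• A minimal sketch of this simplified self-attention (without the learned projections of multi-head attention):

```python
# Sketch of encoder self-attention: the same inputs play the roles of queries, keys and
# values, so every position can attend to every other position after a single layer.
import math
import torch

def self_attention(X):                                    # X: (B, n, d)
    scores = X @ X.transpose(-2, -1) / math.sqrt(X.size(-1))
    return torch.softmax(scores, dim=-1) @ X              # (B, n, d): each z_i mixes all x_j

Z = self_attention(torch.randn(2, 5, 16))
```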

30
Transformer encoder

• MLP networks are inside the "Feed Forward" block.
• The encoder is a stack of multiple such blocks (each block contains an attention module and a
  mini-MLP).
• Each block contains standard deep learning tricks:
  • skip connections
  • layer normalization

[Figure: the encoder shown inside the full encoder-decoder block diagram.]

31
Transformer decoder

• Similarly to ConvS2S, the decoder implements an autoregressive model with the context provided
  by the encoder:

  $y_i = f(y_{i-1}, ..., y_1, z_1, ..., z_n)$

• When predicting the word yi we can use the preceding words y1, ..., yi−1 but not the subsequent
  words yi, ..., ym.

[Figure: the decoder reads SOS, y1, y2, y3 and the encoder outputs z1, ..., zn and produces y1, y2, y3, y4.]

32
Transformer decoder

• The cross-attention block is the multi-head attention module described earlier.
• Again, the attention-is-all-you-need idea: use self-attention as a building block of the decoder.
• We need to make sure that we do not use the subsequent inputs yi, ..., ym when producing the
  output oi at position i. This is done using masked self-attention (see the next slide).

[Figure: a masked self-attention layer over SOS, y1, y2, y3 produces the queries h1, ..., h4 for the cross-attention, whose keys and values are the encoder outputs (kj = vj = zj); the outputs are o1, ..., o4.]

33
Transformer decoder: Masked self-attention

• Let us denote the inputs of the self-attention layer as vj and the outputs as hj.
• For simplicity, assume that we use scaled dot-product attention:

  $h_i = \sum_{j=1}^{m} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(v_j^\top v_i / \sqrt{d_k} + m_{ij})}{\sum_{j'=1}^{m} \exp(v_{j'}^\top v_i / \sqrt{d_k} + m_{ij'})}$

• We do not want to use the subsequent positions vi+1, ..., vm when computing the output hi. We can
  do that using attention masks mij:

  $m_{ij} = 0 \text{ if } j \le i, \qquad m_{ij} = -\infty \text{ (and therefore } \alpha_{ij} = 0\text{) if } j > i$

[Figure: a masked self-attention layer mapping v1, ..., v4 to h1, ..., h4.]
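
• A minimal sketch of masked self-attention using such an additive mask:

```python
# Sketch of masked (causal) self-attention: an additive mask with -inf above the
# diagonal zeroes out the attention weights for subsequent positions.
import math
import torch

def masked_self_attention(V):                              # V: (B, m, d)
    m = V.size(1)
    mask = torch.full((m, m), float("-inf")).triu(diagonal=1)  # m_ij = -inf for j > i, 0 otherwise
    scores = V @ V.transpose(-2, -1) / math.sqrt(V.size(-1)) + mask
    return torch.softmax(scores, dim=-1) @ V               # h_i depends only on v_1, ..., v_i

H = masked_self_attention(torch.randn(2, 4, 16))
```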

34
Decoder

• After self-attention and cross-attention, the representation in each position is processed with a
  mini-MLP (the "Feed Forward" block in the figure).
• The decoder is a stack of multiple such blocks (each block contains two attention modules and a
  mini-MLP).
• Each block contains standard deep learning tricks:
  • skip connections
  • layer normalization

[Figure: the decoder shown inside the full encoder-decoder block diagram.]

35
Transformer’s positional encoding

• For simplicity, assume that we use scaled dot-product attention:

  $z_i = \sum_{j=1}^{n} \alpha_{ij} x_j, \qquad \alpha_{ij} = \frac{\exp(x_j^\top x_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(x_{j'}^\top x_i / \sqrt{d_k})}$

• What will happen to the outputs if we shuffle the inputs (change their order)?
• The outputs will be shuffled in the same way. Thus, the computed representations will not depend
  on the order of the elements in the input sequence.
• This is not desired: the order of the words is important for understanding the meaning of a
  sentence.

[Figure: shuffling the inputs of a self-attention layer to x2, x4, x1, x3 just shuffles the outputs to z2, z4, z1, z3.]

36
Transformer’s positional encoding

• Recall: ConvS2S used position embeddings (embedding positions just like words) and added them
  to the word embeddings.
• Transformers use hard-coded (not learned) positional encodings:

  $PE(p, 2i) = \sin(p / 10000^{2i/d}), \qquad PE(p, 2i+1) = \cos(p / 10000^{2i/d})$

  where p is the position and i indexes the elements of the encoding.
• This encoding has the same dimensionality d as the input/output embeddings.

[Figure: the sinusoidal positional encodings at different dimensions (source: Annotated Transformer).]

• Motivation: It is easy for the model to learn to attend by relative positions, since for any fixed
  offset k, PE(p+k) can be represented as a linear function of PE(p).
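
• A small sketch that computes this positional encoding table (assuming an even dimensionality d):

```python
# Sketch of the sinusoidal positional encoding:
# PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i + 1) = cos(p / 10000^(2i/d)).
import torch

def positional_encoding(max_len, d):
    p = torch.arange(max_len).unsqueeze(1).float()       # positions p
    two_i = torch.arange(0, d, 2).float()                # even dimension indices 2i
    angles = p / (10000 ** (two_i / d))                  # p / 10000^(2i/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                             # (max_len, d), added to the embeddings

pe = positional_encoding(max_len=50, d=16)
```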

37
Transformer: Full model

• Training of the transformer model needs a ramp-up (warm-up) of the learning rate.

[Figure: learning rate schedules with a warm-up phase followed by decay (source: Annotated Transformer).]

• If you have trouble understanding the model, check out the Annotated Transformer blog post.
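
• The schedule used in (Vaswani et al., 2017) increases the learning rate linearly for the first warmup steps and then decays it with the inverse square root of the step number; a small sketch:

```python
# Sketch of the transformer learning-rate schedule from (Vaswani et al., 2017):
# lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
def transformer_lr(step, d_model=512, warmup=4000):
    step = max(step, 1)                                   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example: the rate ramps up during the first `warmup` steps and then decays.
rates = [transformer_lr(s) for s in (1, 1000, 4000, 20000)]
```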

38
Translation performance

The translation performances on an English-to-French translation task (WMT’14) according to (Vaswani et al., 2017)
Model BLEU
ConvS2S 40.46
ConvS2S (ensemble) 41.29
Transformer (base model) 38.1
Transformer (big) 41.8

39
Rotary position embedding (Su et al., 2021)

• Transformers' attention uses the dot product $\langle \cdot, \cdot \rangle$ of transformed word embeddings for matching
  queries to keys:

  $(W_q (x_i + p_i))^\top (W_k (x_j + p_j)) = \langle f_q(x_i, i), f_k(x_j, j) \rangle$

• Intuitively, we would like an attention mechanism that is invariant to the absolute positions of the
  words in the sequence, but is sensitive to their relative positions:

  $\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$

• Su et al. (2021) derive a solution, which for the 2d case is:

  $f_q(x_m, m) = (W_q x_m) e^{jm\theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_q x_m$

  $f_k(x_n, n) = (W_k x_n) e^{jn\theta}$

  $g(x_m, x_n, m - n) = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^* e^{j(m-n)\theta}\right]$

  where $\mathrm{Re}[\cdot]$ is the real part of a complex number and $(\cdot)^*$ denotes the complex conjugate.

40
Rotary position embedding (Su et al., 2021)

• General case:

  $f_q(x_m, m) = \begin{pmatrix}
  \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
  \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
  0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
  0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
  \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
  0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
  0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
  \end{pmatrix} W_q x_m$

  with $\theta_i = 10000^{-2(i-1)/d}$, $i \in \{1, \ldots, d/2\}$ (following the original positional encoding).


• In contrast to the original positional encoding which is additive, the rotary position embedding is
multiplicative.
• Rotary position embedding is usually applied in every attention layer of the transformer.
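
• A minimal sketch of applying the rotary embedding to a batch of queries or keys; the pairing of consecutive dimensions follows the block-diagonal matrix above:

```python
# Sketch of rotary position embedding (RoPE): rotate consecutive pairs of query/key
# dimensions by an angle proportional to the position m, so that dot products depend
# only on relative positions.
import torch

def apply_rope(x):                                        # x: (B, seq_len, d), d even
    B, L, d = x.shape
    pos = torch.arange(L).float().unsqueeze(1)            # positions m
    theta = 10000.0 ** (-2 * torch.arange(d // 2).float() / d)  # theta_i
    angles = pos * theta                                   # (L, d/2): m * theta_i
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                    # the two halves of each 2d pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                   # 2d rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q_rot = apply_rope(torch.randn(2, 6, 16))                  # rotated queries (same for keys)
```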

41
Vision Transformers (Dosovitskiy et al., 2020)

• Although introduced for natural language processing tasks, transformers have now been used in
many other domains and they show great performance.

• One example is the Vision Transformer (ViT), which is a transformer-based architecture for image
  processing tasks (see the sketch below):
  • split an image into fixed-size patches
  • linearly embed each of them
  • add position embeddings
  • feed the resulting sequence of vectors to a standard Transformer encoder
  • in order to perform classification, add an extra learnable “classification token” to the sequence.

• ViT is typically pre-trained on large datasets and then fine-tuned to (smaller) downstream tasks.
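
• A minimal sketch of the ViT input pipeline described above (patch size 16 and embedding width 192 are assumed values):

```python
# Sketch of the ViT input: split the image into patches, embed them linearly,
# prepend a learnable classification token and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img=224, patch=16, channels=3, d=192):
        super().__init__()
        self.n_patches = (img // patch) ** 2
        self.proj = nn.Conv2d(channels, d, kernel_size=patch, stride=patch)  # linear patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, d))                        # classification token
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, d))       # position embeddings

    def forward(self, images):                              # images: (B, 3, 224, 224)
        x = self.proj(images).flatten(2).transpose(1, 2)    # (B, n_patches, d)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos        # sequence fed to the transformer encoder

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))      # (2, 197, 192)
```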

42
BERT: Transformer-based language model
(Devlin et al., 2018)
Transfer learning in natural language processing

• Common natural language understanding tasks (see, e.g., GLUE benchmark):


• Sentiment analysis: e.g., classification of sentences extracted from movie reviews
• Question answering: e.g., detect pairs (question, sentence) which contain the correct answer
• Determine if two sentences are semantically equivalent (binary classification)
• Labeled datasets for such tasks are often limited (labels are expensive to collect).
• Transfer learning:
• Pre-train language models on large text corpora (unlabeled data, thus unsupervised learning)
• Fine-tune a pre-trained model to a specific task

44
BERT: Pre-trained language model (Devlin et al., 2018)

• The model is essentially a transformer encoder (Vaswani et al., 2017).
• The model can represent either a single sentence or a pair of sentences (we need to process pairs
  in some downstream tasks, such as question answering).
• Pre-trained on a large corpus, e.g., English Wikipedia (2,500M words).

[Figure: the input CLS, Tok 1, ..., Tok N, SEP, Tok 1, ..., Tok M (sentence A and sentence B) is processed by the BERT transformer encoder into the outputs C, T1, ..., TN, TSEP, T'1, ..., T'M.]

45
BERT: Pre-training task 1

• Pre-training task 1: Predict a masked input token (denoising task).
• Use special MASK token to “corrupt” the input sequence.
• The task is to reconstruct the masked token.

[Figure: the input CLS, Tok 1, MASK, Tok 3, ..., Tok N, SEP, Tok 1, ..., Tok M (sentences A and B); the model reconstructs the masked token (Tok 2) from its output at the masked position.]

46
BERT: Pre-training task 2

• Pre-training task 2: Predict whether sentence B follows sentence A.
• 50% of the time sentence B follows sentence A in the corpus.
• 50% of the time sentence B is randomly chosen.
• Binary classification task.

[Figure: the input CLS, Tok 1, ..., Tok N, SEP, Tok 1, ..., Tok M (sentences A and B); the output at the first (CLS) position is classified as True/False.]

47
Fine-tuning BERT on a sentence classification task

Sentence: Although the value added services being provided are great but the prices are high. Class: mixed review

Sentence: Great work done #XYZ Problem resolved by customer care in just one day. Class: positive review

• BERT is fine-tuned on task-specific training data: task-specific inputs and outputs.
• Example: sentence classification task:
  • Sentence A: input sentence
  • Sentence B: ∅
  • Output: target class (taken from the first position)
  • A new layer is introduced to convert the output at the first position into class probabilities.
• All parameters of the model are fine-tuned!

[Figure: the input CLS, Tok 1, ..., Tok N is processed by the BERT transformer encoder; the output at the first position is mapped to the class.]

48
Fine-tuning BERT on a question answering task

Paragraph: Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer, dancer and actress. Born
and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the
lead singer of Destiny’s Child, one of the best-selling girl groups of all time.

Question: When did Beyonce start becoming popular?

Correct answer: in the late 1990s


• Fine-tuning BERT on a question answering task:
  • Sentence A: question
  • Sentence B: paragraph (passage)
  • Output sequence: probabilities of each word in the passage being the start and the end of the
    answer
• All parameters of the model are updated on the task-specific data!

[Figure: the input CLS, Tok 1, ..., Tok N (question), SEP, Tok 1, ..., Tok M (paragraph); for each paragraph token the model outputs IsStart and IsEnd probabilities.]

49
BERT: GLUE Test results

GLUE Test results

The number below each task denotes the number of training examples.
F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B,
and accuracy scores are reported for the other tasks.

50
Recap
Recap

• Attention allows the encoder of a sequence-to-sequence model to produce representations of
  varying length, which dramatically increases the quality of the neural machine translation model.
• A sequence-to-sequence model can be implemented by CNNs: convolutional layers can process
sequential data and an autoregressive decoder can be implemented by shifted (causal) 1d
convolutions.
• Transformers are neural networks that use attention (self-attention, cross-attention) as the main
computational block.
• Multi-head attention allows paying attention to different parts of the attended sequence.
• Transformers have been used in multiple domains (e.g., vision, protein modeling).
• BERT is a popular transformer-based language model which can be tuned to custom NLU tasks.

52
Home assignment
Assignment 05 transformer

• You need to implement and train a transformer model for the statistical machine translation task
  (the same task as in the previous assignment).

54
Recommended reading

• Papers cited in the lecture slides.


• The Annotated Transformer blog post.

55
