
CS-E4890 Deep Learning

Lecture #5: Attention-based models

Jorma Laaksonen — Juho Kannala — Alexander Ilin


Simple sequence-to-sequence model

• Previously we considered a sequence-to-sequence model for statistical machine translation:

[Figure: an encoder RNN reads the input "This is my cat ." (x1, ..., x5) into states z1, ..., z5; the final state z5 is passed as the context to the decoder RNN, which produces y1, ..., y4.]

• The problem with this model: It is difficult to encode the whole sentence in a single vector z5 of
fixed size.
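
• As a point of reference for the attention-based models below, a minimal PyTorch sketch of such a fixed-context model (GRU cells and the layer sizes are assumptions for the example, not the exact model from the lecture):

```python
# Minimal sketch: a seq2seq model where the whole source sentence is squeezed
# into a single fixed-size context vector (the final encoder state).
import torch
import torch.nn as nn

class SimpleSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: only the final hidden state (the fixed-size context) is kept.
        _, context = self.encoder(self.src_emb(src))       # (1, B, hidden_dim)
        # Decoder: the context initializes the decoder state.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), context)
        return self.out(dec_states)                        # (B, T_tgt, tgt_vocab)
```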

1
Encoding to a representation of varying length

• Intuition: The longer the input sentence, the longer our representation should be. Let the length
of our representation be equal to the length of the input sequence.
[Figure: the encoder maps x1, ..., x5 to representations z1, ..., z5; in the version shown here, the intermediate states h1, ..., h5 of an RNN serve as these representations.]

• We can use intermediate states of the RNN as representations but this does not work well:
representation z1 at the first position does not depend on subsequent words.

2
Encoding with bi-directional RNN

• In the classical model (Bahdanau et al., 2014), the varying-length representation was built using a
  bi-directional RNN.

[Figure: a bi-directional RNN over x1, ..., x5 computes forward and backward hidden states, which are combined into the representations z1, ..., z5.]

• The bi-directional RNN does two passes through the input sequence: forward and backward.
• The output at position j is a concatenation $z_j = [\overrightarrow{z}_j; \overleftarrow{z}_j]$ of the states (or outputs) in the forward
  and backward passes.
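
• A small sketch of such an encoder, assuming GRU cells and arbitrary layer sizes:

```python
# Sketch of a bi-directional RNN encoder: the representation at each position j
# concatenates the forward and backward hidden states, so z_j depends on the
# whole input sentence.
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                  # x: (B, n) token ids
        z, _ = self.rnn(self.emb(x))       # z: (B, n, 2 * hidden_dim)
        return z                           # z[:, j] = [forward state; backward state]

encoder = BiRNNEncoder(vocab_size=1000)
z = encoder(torch.randint(0, 1000, (2, 5)))   # representations z1, ..., z5 for a batch of 2
```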

3
Sequence-to-sequence model: context in decoding

• In the simple seq2seq model (see the first slide), we used the last state of the encoder RNN as the
  context for decoding in each step.
• What should we do if we want to use an encoded representation of varying length?

[Figure: the encoder states z0, ..., z5 over x1, ..., x5 and the decoder states h1, ..., h4 producing y1, ..., y4.]

4
Attention: Using context of varying length

• We can select one of the vectors zj as our context when decoding at step i.
• Which one to select? We let the neural network decide it by itself using the attention mechanism.
• You can think of attention as a switch that selects one of the inputs zj.

[Figure: an attention block sits between the encoder outputs z1, ..., z5 and the decoder states h1, ..., h4.]

5
Attention mechanism from (Bahdanau et al., 2014)

• Select (softly) one of the inputs as the output:

  $c = \sum_{j=1}^{n} \alpha_j z_j, \qquad 0 < \alpha_j < 1, \quad \sum_{j=1}^{n} \alpha_j = 1$

• Weights αj are computed using softmax:

  $\alpha_j = \frac{\exp(e_j)}{\sum_{j'=1}^{n} \exp(e_{j'})}$

• Scores ej are computed using the current decoder state hi−1 and representation zj:

  $e_j = f(h_{i-1}, z_j)$

  where f can be modeled by a multilayer perceptron (MLP).

[Figure: the attention block mixes the encoder outputs z1, ..., z5 into the context c, which is fed to the decoder together with the previous decoder state (h2 in the figure).]
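
• A minimal sketch of this additive (MLP-scored) attention; the two-layer scorer and the layer sizes are illustrative assumptions:

```python
# Sketch of the additive attention of Bahdanau et al. (2014): an MLP scores each
# encoder representation z_j against the previous decoder state, and a softmax
# turns the scores into the mixing weights of the context vector.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim=64):
        super().__init__()
        # f(h_{i-1}, z_j): a small MLP scoring how well z_j matches the decoder state.
        self.score = nn.Sequential(
            nn.Linear(dec_dim + enc_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, h_prev, z):
        # h_prev: (B, dec_dim)   current decoder state h_{i-1}
        # z:      (B, n, enc_dim) encoder representations z_1, ..., z_n
        h_exp = h_prev.unsqueeze(1).expand(-1, z.size(1), -1)      # (B, n, dec_dim)
        e = self.score(torch.cat([h_exp, z], dim=-1)).squeeze(-1)  # (B, n) scores e_j
        alpha = torch.softmax(e, dim=-1)                           # (B, n) weights alpha_j
        c = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)            # (B, enc_dim) context c
        return c, alpha
```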

6
Full architecture from (Bahdanau et al., 2014)

[Figure: the full architecture: a bi-directional RNN encoder over x1, ..., x5 produces z1, ..., z5; at every decoding step the attention block computes a context from the previous decoder state (c3 from h2 in the figure), which the decoder RNN uses to produce y1, ..., y4.]

7
Attention-based models have much better performance

• Using attention significantly improves the quality of translation.

The translation performances on an English-to-French translation task (WMT’14)


Model BLEU
Simple Enc-Dec 17.82
Attention-based Enc-Dec 28.45
Attention-based Enc-Dec (LV) 34.11
Attention-based Enc-Dec (LV, ensemble) 37.19
LV - large vocabulary
source: (Jean et al., 2014)

8
Attention coefficients

• Weights αij can be visualized. The x-axis and y-axis of each plot correspond to the words in the
source sentence and the generated translation, respectively.

9
Neural image captioning (Xu et al., 2016)

• Models with attention have been used in many domains.


• The “Show, Attend and Tell” paper solves the task of image captioning similarly to a translation
  task: images are “translated” to sentences.

10
Neural image captioning (Xu et al., 2016)

• The image is preprocessed into 14 × 14 feature maps with a convolutional network pre-trained on
ImageNet.
• The 14 × 14 feature maps are split into L annotation vectors zj .
• The annotation vectors are used as context in the decoding RNN.

[Figure: the CNN produces annotation vectors z1, ..., zL from the image; the decoder RNN attends over them (e.g., the context c3 is computed from h2) while generating the caption y1, ..., y4.]
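
• A possible way to turn a CNN feature map into annotation vectors, sketched with an assumed 512-channel 14 × 14 feature map (so L = 196):

```python
# Sketch: splitting a 14 x 14 CNN feature map into L = 196 annotation vectors z_j.
# The 512-channel feature map is an assumed shape; in the paper it comes from a
# convolutional network pre-trained on ImageNet.
import torch

feature_map = torch.randn(1, 512, 14, 14)              # (B, C, 14, 14) from the CNN
B, C, H, W = feature_map.shape
annotations = feature_map.flatten(2).transpose(1, 2)   # (B, L, C) with L = H * W = 196
print(annotations.shape)                                # torch.Size([1, 196, 512])
# Each annotations[:, j] is one annotation vector z_j used as context for the decoder attention.
```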

11
Convolutional sequence-to-sequence models
(Gehring et al., 2017)
Problems with RNN encoders

• Problem with RNN encoding:
  • The number of steps is equal to the number of words in the input sentence. This can make
    training slow.
  • We need to take multiple steps to model relations between distant words. Modeling long-term
    dependencies can be difficult with RNNs.
• Since we know how to deal with encodings of varying lengths (using attention), we do not really
  need to use an RNN. The encoder can be any network that converts the input sequence (x1, ..., xn)
  into representations (z1, ..., zn).

[Figure: the bi-directional RNN encoder over x1, ..., x5 shown as the component to be replaced.]

13
Convolutional encoder

• Gehring et al. (2017) proposed to use a convolutional network (CNN) to encode input sequences.
• Since convolutional layers have shared weights, they can process sequences of varying lengths.

[Figure: a CNN maps the inputs x1 + p1, ..., x5 + p5 (word embeddings plus position embeddings) to the representations z1, ..., z5.]

• Advantage: a CNN can compute the representations at all positions in parallel (unlike a
  bi-directional RNN, which processes the sequence step by step), while still using both preceding
  and subsequent positions.
• Disadvantage: a CNN does not take into account whether a position is at the beginning of the
  sequence or at the end.
• Gehring et al. (2017) fix this problem by adding position embeddings pj to the word embeddings xj,
  as sketched below.
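
• A simplified sketch of such a convolutional encoder with learned position embeddings (the depth, widths and plain ReLU blocks are assumptions; the actual ConvS2S model uses gated linear units):

```python
# Sketch of a convolutional encoder in the spirit of Gehring et al. (2017):
# word embeddings plus position embeddings, followed by a stack of 1d convolutions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, max_len=512, d=256, layers=4, kernel=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_len, d)
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d, kernel, padding=kernel // 2) for _ in range(layers)
        )

    def forward(self, x):                              # x: (B, n) token ids
        pos = torch.arange(x.size(1), device=x.device)
        h = self.word_emb(x) + self.pos_emb(pos)       # add position embeddings p_j
        h = h.transpose(1, 2)                          # (B, d, n) for Conv1d
        for conv in self.convs:
            h = torch.relu(conv(h)) + h                # conv block with a skip connection
        return h.transpose(1, 2)                       # (B, n, d): representations z_1, ..., z_n
```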

14
Decoding with convolutional layers

• Can we also avoid using RNNs in the decoder?
• The decoder is an autoregressive model with the context provided by the encoder:

  $y_i = f(y_{i-1}, ..., y_1, z_1, ..., z_n)$

• This function can be modeled by a (convolutional) network:
  • Since we process sequences (inputs with a one-dimensional structure), we use 1d convolutions.
  • Inputs and outputs are the same sequences, but 1) the output is shifted by one position, 2) the
    input sequence starts with a special SOS token.
  • The receptive field of yi should not contain subsequent elements $y_{i'}, i' \geq i$ (this can be achieved
    by using shifted convolutions).

[Figure: a convolutional decoder reads SOS, y1, y2, y3 together with the encoder context and outputs y1, y2, y3, y4.]

15
Decoding with convolutional layers

• Advantage of a convolutional decoder: During training, we can compute the output elements for all
  positions in parallel. Recall that in the RNN decoder, we had to produce the output sequence one
  element at a time.
• At test time (generation mode), we still have to produce the output sequence one element at a time
  (since it is an autoregressive model): each generated element is appended to the decoder input
  before the next element is produced.

[Figure: during training the decoder maps SOS, y1, y2, y3 (plus the encoder context) to y1, y2, y3, y4 in one parallel pass; at test time y1, y2, y3, y4 are generated one at a time, each fed back as input.]

16
An autoregressive model with 1d convolutional layer

• We can make sure that the receptive field of yi does not contain subsequent elements $y_{i'}, i' \geq i$
  by using shifted convolutions.

[Figure: a standard convolution vs. a shifted (causal) convolution mapping the inputs SOS, y1, ..., y4 to the outputs y1, ..., y5; in the shifted version each output only sees the current and preceding inputs.]

• If we stack multiple convolutional layers built in the same way, the desired property is preserved.
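
• A minimal sketch of a shifted (causal) 1d convolution implemented by left-padding the input:

```python
# Sketch of a "shifted" (causal) 1d convolution: left-pad the input by (kernel_size - 1)
# so that output position i only depends on input positions <= i.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                      # x: (B, channels, length)
        x = F.pad(x, (self.pad, 0))            # pad only on the left
        return self.conv(x)                    # (B, channels, length), causal

# Stacking such layers preserves causality: the receptive field only grows to the left.
layer = CausalConv1d(channels=8)
y = layer(torch.randn(2, 8, 5))
print(y.shape)                                  # torch.Size([2, 8, 5])
```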

17
Attention in a convolutional decoder

• How can we use the context provided by the encoder in such a decoder?
• Attention used in (Gehring et al., 2017):

  $o_i = \sum_{j=1}^{n} \alpha_{ij} (z_j + x_j + p_j)$

  where xj are word embeddings and pj are position embeddings for the input sequence.
• The attention weights are

  $\alpha_{ij} = \frac{\exp(h_i^\top z_j)}{\sum_{j'=1}^{n} \exp(h_i^\top z_{j'})}$

• Attention compares hi to the representations zj using a dot product and passes the value zj + xj + pj
  corresponding to the best match.

[Figure: the decoder states h1, ..., h4 produced by a shifted 1d convolution over SOS, y1, y2, y3 attend to the encoder outputs z1, ..., zn to produce o1, ..., o4.]
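
• A minimal sketch of this decoder attention, omitting the learned projections used in the actual model:

```python
# Sketch of the ConvS2S-style decoder attention: dot-product weights between decoder
# states h_i and encoder outputs z_j, attending over the values z_j + x_j + p_j.
import torch

def convs2s_attention(h, z, x_plus_p):
    # h: (B, m, d) decoder states, z: (B, n, d) encoder outputs,
    # x_plus_p: (B, n, d) input word embeddings plus position embeddings
    scores = torch.bmm(h, z.transpose(1, 2))          # (B, m, n): h_i . z_j
    alpha = torch.softmax(scores, dim=-1)             # attention weights alpha_ij
    return torch.bmm(alpha, z + x_plus_p)             # (B, m, d): o_i = sum_j alpha_ij (z_j + x_j + p_j)

o = convs2s_attention(torch.randn(2, 4, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 8))
```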

18
Full architecture from (Gehring et al., 2017)

• They used multiple decoder blocks stacked on top of each other.
• Each decoder block attends to the outputs of the encoder zj.
• There are skip connections (skipping the attention block).

[Figure: the CNN encoder produces z1, ..., zn from the embeddings x1, ..., xn; each decoder block applies a shifted 1d convolution over SOS, y1, y2, y3 followed by attention over the encoder outputs, producing y1, ..., y4.]

19
Translation performance

The translation performances on an English-to-French translation task (WMT’14)


Model BLEU
Simple Enc-Dec 17.82
Attention-based Enc-Dec 28.45
Attention-based Enc-Dec (LV) 34.11
Attention-based Enc-Dec (LV, ensemble) 37.19
ConvS2S (BPE 40K) 40.51

21
Transformers
(Vaswani et al., 2017)
Transformer architecture

• The general architecture is similar to ConvS2S:
  • The encoder converts the input sequence (x1, ..., xn) into continuous representations (z1, ..., zn).
  • The decoder processes all positions in parallel using the shifted output sequence (y1, ..., ym) as
    input and output. The autoregressive structure is preserved by masking.
  • The decoder attends to the representations (z1, ..., zn).

[Figure: the encoder maps (x1, ..., xn) to (z1, ..., zn); the decoder takes (SOS, y1, ..., ym−1) as input, attends to (z1, ..., zn) and produces (y1, ..., ym).]

23
Transformer: Attention mechanism

• Intuition from ConvS2S: The attention mechanism compares the intermediate representations hi
  developed in the decoder with the encoded inputs zj and outputs the zj that is closest to hi.
• Basic attention mechanism in transformers:

  $o_i = \sum_{j=1}^{n} \alpha_{ij} z_j, \qquad \alpha_{ij} = \frac{\exp(z_j^\top h_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(z_{j'}^\top h_i / \sqrt{d_k})}$

  where $d_k$ is the dimensionality of the zj and hi.
• The authors called this scaled dot-product attention.

[Figure: the decoder states h1, ..., h4 attend to the encoder outputs z1, ..., zn, producing o1, ..., o4.]

25
Transformer: Scaled dot-product attention

(y1 , ..., ym )

• We can think of the scaled dot-product attention as finding


values vj = zj with keys kj = zj that are closest to query
qi = hi . (z1 , ..., zn )

• Re-writing the scaled dot-product attention using keys, values


attention
and query: V K Q

n n
X X encoder decoder
oi = αij zj oi = αij vj
j=1 j=1
√ √
exp(z>j hi / dk ) exp(k> j qi / dk )
αij = P n >
√ αij = Pn √
0
j =1
exp(z j 0 hi / dk ) j 0 =1
exp(k>j 0 qi / dk )

(x1 , ..., xn ) (SOS, y1 , ..., ym−1 )

26
Scaled dot-product attention

• Scaled dot-product attention:

  $o_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(k_j^\top q_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(k_{j'}^\top q_i / \sqrt{d_k})}$

• In the matrix form:

  $\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

  with $V \in \mathbb{R}^{n \times d_v}$, $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$.

[Figure: the scaled dot-product attention block from (Vaswani et al., 2017).]
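
• The matrix form maps directly to code; a minimal sketch (the optional additive mask argument anticipates the masked self-attention used later in the decoder):

```python
# Minimal sketch of scaled dot-product attention in matrix form.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., m, d_k), K: (..., n, d_k), V: (..., n, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., m, n)
    if mask is not None:
        scores = scores + mask                           # e.g. -inf at disallowed positions
    alpha = torch.softmax(scores, dim=-1)                # attention weights
    return alpha @ V                                     # (..., m, d_v)

out = scaled_dot_product_attention(torch.randn(2, 4, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 16))
print(out.shape)                                          # torch.Size([2, 4, 16])
```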

27
Multi-head attention

• Instead of doing a single scaled dot-product attention, the authors found it beneficial to project the
  keys, queries and values into lower-dimensional spaces, perform scaled dot-product attention there
  and concatenate the outputs:

  $\mathrm{head}_i = \mathrm{attention}(Q W_i^Q, K W_i^K, V W_i^V)$

  $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h) W^O$

  with $V \in \mathbb{R}^{n \times d_v}$, $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $\mathrm{head}_i \in \mathbb{R}^{m \times d_i}$, output $\in \mathbb{R}^{m \times d_k}$.

[Figure: the multi-head attention block from (Vaswani et al., 2017).]
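
• A compact sketch of multi-head attention, under the common assumption that a single model dimension is split evenly across the heads:

```python
# Sketch of multi-head attention built from scaled dot-product attention per head.
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split(self, x):                                   # (B, len, d_model) -> (B, h, len, d_head)
        B, L, _ = x.shape
        return x.view(B, L, self.h, self.d_head).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        q, k, v = self.split(self.W_q(Q)), self.split(self.W_k(K)), self.split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores + mask
        out = torch.softmax(scores, dim=-1) @ v           # one scaled dot-product attention per head
        out = out.transpose(1, 2).reshape(Q.size(0), Q.size(1), -1)  # concatenate the heads
        return self.W_o(out)                              # output projection W^O
```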

28
Encoder-decoder attention

• Multi-head attention is used by the decoder to attend to the encoder outputs.

[Figure: in the encoder-decoder attention, the encoder outputs (z1, ..., zn) provide the values V and keys K, and the decoder provides the queries Q.]

29
Transformer encoder: Self-attention

• How to implement the encoder? Previously we used a bi-directional RNN or a CNN.
• The transformer encoder uses the following blocks:
  • Process every position x1, ..., xn with the same multilayer perceptron (MLP) network.
  • Mix information from different positions using the multi-head attention mechanism
    (attention is all you need).
  • Self-attention: the inputs xi are used as keys, values and queries.
• Example with a scaled dot-product attention (for simplicity):

  $z_i = \sum_{j=1}^{n} \alpha_{ij} x_j, \qquad \alpha_{ij} = \frac{\exp(x_j^\top x_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(x_{j'}^\top x_i / \sqrt{d_k})}$

• Advantage: The first position affects the representation at the last position (and vice versa)
  already after one layer! (Think how many layers are needed for that in RNN or convolutional
  encoders.)

[Figure: a self-attention layer over x1, ..., x4 followed by position-wise MLPs producing z1, ..., z4.]
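
• A minimal sketch of this simplified self-attention (without the learned projections of multi-head attention):

```python
# Sketch of encoder self-attention: the same inputs play the roles of queries, keys and
# values, so every position can attend to every other position after a single layer.
import math
import torch

def self_attention(X):                                    # X: (B, n, d)
    scores = X @ X.transpose(-2, -1) / math.sqrt(X.size(-1))
    return torch.softmax(scores, dim=-1) @ X              # (B, n, d): each z_i mixes all x_j

Z = self_attention(torch.randn(2, 5, 16))
```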

30
Transformer encoder

• MLP networks are inside the "Feed Forward" block.
• The encoder is a stack of multiple such blocks (each block contains an attention module and a
  mini-MLP).
• Each block contains standard deep learning tricks:
  • skip connections
  • layer normalization

[Figure: the encoder shown inside the full encoder-decoder block diagram.]

31
Transformer decoder

• Similarly to ConvS2S, the decoder implements an autoregressive model with the context provided
  by the encoder:

  $y_i = f(y_{i-1}, ..., y_1, z_1, ..., z_n)$

• When predicting the word yi we can use the preceding words y1, ..., yi−1 but not the subsequent
  words yi, ..., ym.

[Figure: the decoder reads SOS, y1, y2, y3 and the encoder outputs z1, ..., zn and produces y1, y2, y3, y4.]

32
Transformer decoder

• The cross-attention block is the multi-head attention module described earlier.
• Again, the attention-is-all-you-need idea: use self-attention as a building block of the decoder.
• We need to make sure that we do not use the subsequent inputs yi, ..., ym when producing the
  output oi at position i. This is done using masked self-attention (see the next slide).

[Figure: a masked self-attention layer over SOS, y1, y2, y3 produces the queries h1, ..., h4 for the cross-attention, whose keys and values are the encoder outputs (kj = vj = zj); the outputs are o1, ..., o4.]

33
Transformer decoder: Masked self-attention

• Let us denote the inputs of the self-attention layer as vj and the outputs as hj.
• For simplicity, assume that we use scaled dot-product attention:

  $h_i = \sum_{j=1}^{m} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(v_j^\top v_i / \sqrt{d_k} + m_{ij})}{\sum_{j'=1}^{m} \exp(v_{j'}^\top v_i / \sqrt{d_k} + m_{ij'})}$

• We do not want to use the subsequent positions vi+1, ..., vm when computing the output hi. We can
  do that using attention masks mij:

  $m_{ij} = 0 \text{ if } j \le i, \qquad m_{ij} = -\infty \text{ (and therefore } \alpha_{ij} = 0\text{) if } j > i$

[Figure: a masked self-attention layer mapping v1, ..., v4 to h1, ..., h4.]
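
• A minimal sketch of masked self-attention using such an additive mask:

```python
# Sketch of masked (causal) self-attention: an additive mask with -inf above the
# diagonal zeroes out the attention weights for subsequent positions.
import math
import torch

def masked_self_attention(V):                              # V: (B, m, d)
    m = V.size(1)
    mask = torch.full((m, m), float("-inf")).triu(diagonal=1)  # m_ij = -inf for j > i, 0 otherwise
    scores = V @ V.transpose(-2, -1) / math.sqrt(V.size(-1)) + mask
    return torch.softmax(scores, dim=-1) @ V               # h_i depends only on v_1, ..., v_i

H = masked_self_attention(torch.randn(2, 4, 16))
```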

34
Decoder

• After self-attention and cross-attention, the representation in each position is processed with a
  mini-MLP (the "Feed Forward" block in the figure).
• The decoder is a stack of multiple such blocks (each block contains two attention modules and a
  mini-MLP).
• Each block contains standard deep learning tricks:
  • skip connections
  • layer normalization

[Figure: the decoder shown inside the full encoder-decoder block diagram.]

35
Transformer’s positional encoding

• For simplicity, assume that we use scaled dot-product attention:

  $z_i = \sum_{j=1}^{n} \alpha_{ij} x_j, \qquad \alpha_{ij} = \frac{\exp(x_j^\top x_i / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(x_{j'}^\top x_i / \sqrt{d_k})}$

• What will happen to the outputs if we shuffle the inputs (change their order)?
• The outputs will be shuffled in the same way. Thus, the computed representations will not depend
  on the order of the elements in the input sequence.
• This is not desired: the order of the words is important for understanding the meaning of a
  sentence.

[Figure: shuffling the inputs of a self-attention layer to x2, x4, x1, x3 just shuffles the outputs to z2, z4, z1, z3.]

36
Transformer’s positional encoding

• Recall: ConvS2S used position embeddings (embedding positions just like words) and added them
  to the word embeddings.
• Transformers use hard-coded (not learned) positional encodings:

  $PE(p, 2i) = \sin(p / 10000^{2i/d}), \qquad PE(p, 2i+1) = \cos(p / 10000^{2i/d})$

  where p is the position and i indexes the elements of the encoding.
• This encoding has the same dimensionality d as the input/output embeddings.

[Figure: the sinusoidal positional encodings at different dimensions (source: Annotated Transformer).]

• Motivation: It is easy for the model to learn to attend by relative positions, since for any fixed
  offset k, PE(p+k) can be represented as a linear function of PE(p).
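
• A small sketch that computes this positional encoding table (assuming an even dimensionality d):

```python
# Sketch of the sinusoidal positional encoding:
# PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i + 1) = cos(p / 10000^(2i/d)).
import torch

def positional_encoding(max_len, d):
    p = torch.arange(max_len).unsqueeze(1).float()       # positions p
    two_i = torch.arange(0, d, 2).float()                # even dimension indices 2i
    angles = p / (10000 ** (two_i / d))                  # p / 10000^(2i/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                             # (max_len, d), added to the embeddings

pe = positional_encoding(max_len=50, d=16)
```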

37
Transformer: Full model

• Training of the transformer model needs a ramp-up (warm-up) of the learning rate.

[Figure: learning rate schedules with a warm-up phase followed by decay (source: Annotated Transformer).]

• If you have trouble understanding the model, check out the Annotated Transformer blog post.
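
• The schedule used in (Vaswani et al., 2017) increases the learning rate linearly for the first warmup steps and then decays it with the inverse square root of the step number; a small sketch:

```python
# Sketch of the transformer learning-rate schedule from (Vaswani et al., 2017):
# lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
def transformer_lr(step, d_model=512, warmup=4000):
    step = max(step, 1)                                   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example: the rate ramps up during the first `warmup` steps and then decays.
rates = [transformer_lr(s) for s in (1, 1000, 4000, 20000)]
```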

38
Translation performance

The translation performances on an English-to-French translation task (WMT’14) according to (Vaswani et al., 2017)
Model BLEU
ConvS2S 40.46
ConvS2S (ensemble) 41.29
Transformer (base model) 38.1
Transformer (big) 41.8

39
Rotary position embedding (Su et al., 2021)

• Transformers' attention uses the dot product $\langle \cdot, \cdot \rangle$ of transformed word embeddings for matching
  queries to keys:

  $(W_q (x_i + p_i))^\top (W_k (x_j + p_j)) = \langle f_q(x_i, i), f_k(x_j, j) \rangle$

• Intuitively, we would like an attention mechanism that is invariant to the absolute positions of the
  words in the sequence, but is sensitive to their relative positions:

  $\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$

• Su et al. (2021) derive a solution, which for the 2d case is:

  $f_q(x_m, m) = (W_q x_m) e^{jm\theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_q x_m$

  $f_k(x_n, n) = (W_k x_n) e^{jn\theta}$

  $g(x_m, x_n, m - n) = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^* e^{j(m-n)\theta}\right]$

  where $\mathrm{Re}[\cdot]$ is the real part of a complex number and $(\cdot)^*$ denotes the complex conjugate.

40
Rotary position embedding (Su et al., 2021)

• General case:

  $f_q(x_m, m) = \begin{pmatrix}
  \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
  \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
  0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
  0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
  \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
  0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
  0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
  \end{pmatrix} W_q x_m$

  with $\theta_i = 10000^{-2(i-1)/d}$, $i \in \{1, \ldots, d/2\}$ (following the original positional encoding).


• In contrast to the original positional encoding which is additive, the rotary position embedding is
multiplicative.
• Rotary position embedding is usually applied in every attention layer of the transformer.
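
• A minimal sketch of applying the rotary embedding to a batch of queries or keys; the pairing of consecutive dimensions follows the block-diagonal matrix above:

```python
# Sketch of rotary position embedding (RoPE): rotate consecutive pairs of query/key
# dimensions by an angle proportional to the position m, so that dot products depend
# only on relative positions.
import torch

def apply_rope(x):                                        # x: (B, seq_len, d), d even
    B, L, d = x.shape
    pos = torch.arange(L).float().unsqueeze(1)            # positions m
    theta = 10000.0 ** (-2 * torch.arange(d // 2).float() / d)  # theta_i
    angles = pos * theta                                   # (L, d/2): m * theta_i
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                    # the two halves of each 2d pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                   # 2d rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q_rot = apply_rope(torch.randn(2, 6, 16))                  # rotated queries (same for keys)
```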

41
Vision Transformers (Dosovitskiy et al., 2020)

• Although introduced for natural language processing tasks, transformers have now been used in
many other domains and they show great performance.

• One example is the Vision Transformer (ViT), which is a transformer-based architecture for image
  processing tasks (see the sketch below):
  • split an image into fixed-size patches
  • linearly embed each of them
  • add position embeddings
  • feed the resulting sequence of vectors to a standard Transformer encoder
  • in order to perform classification, add an extra learnable “classification token” to the sequence.

• ViT is typically pre-trained on large datasets and then fine-tuned to (smaller) downstream tasks.
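
• A minimal sketch of the ViT input pipeline described above (patch size 16 and embedding width 192 are assumed values):

```python
# Sketch of the ViT input: split the image into patches, embed them linearly,
# prepend a learnable classification token and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img=224, patch=16, channels=3, d=192):
        super().__init__()
        self.n_patches = (img // patch) ** 2
        self.proj = nn.Conv2d(channels, d, kernel_size=patch, stride=patch)  # linear patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, d))                        # classification token
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, d))       # position embeddings

    def forward(self, images):                              # images: (B, 3, 224, 224)
        x = self.proj(images).flatten(2).transpose(1, 2)    # (B, n_patches, d)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos        # sequence fed to the transformer encoder

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))      # (2, 197, 192)
```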

42
BERT: Transformer-based language model
(Devlin et al., 2018)
Transfer learning in natural language processing

• Common natural language understanding tasks (see, e.g., GLUE benchmark):


• Sentiment analysis: e.g., classification of sentences extracted from movie reviews
• Question answering: e.g., detect pairs (question, sentence) which contain the correct answer
• Determine if two sentences are semantically equivalent (binary classification)
• Labeled datasets for such tasks are often limited (labels are expensive to collect).
• Transfer learning:
• Pre-train language models on large text corpora (unlabeled data, thus unsupervised learning)
• Fine-tune a pre-trained model to a specific task

44
BERT: Pre-trained language model (Devlin et al., 2018)

• The model is essentially a transformer encoder (Vaswani et al., 2017).
• The model can represent either a single sentence or a pair of sentences (we need to process pairs
  in some downstream tasks, such as question answering).
• Pre-trained on a large corpus, e.g., English Wikipedia (2,500M words).

[Figure: the input CLS, Tok 1, ..., Tok N, SEP, Tok 1, ..., Tok M (sentence A and sentence B) is processed by the BERT transformer encoder into the outputs C, T1, ..., TN, TSEP, T'1, ..., T'M.]

45
BERT: Pre-training task 1

• Pre-training task 1: Predict a masked input token (denoising task).
• Use special MASK token to “corrupt” the input sequence.
• The task is to reconstruct the masked token.

[Figure: the input CLS, Tok 1, MASK, Tok 3, ..., Tok N, SEP, Tok 1, ..., Tok M (sentences A and B); the model reconstructs the masked token (Tok 2) from its output at the masked position.]

46
BERT: Pre-training task 2

• Pre-training task 2: Predict whether sentence B follows sentence A.
• 50% of the time sentence B follows sentence A in the corpus.
• 50% of the time sentence B is randomly chosen.
• Binary classification task.

[Figure: the input CLS, Tok 1, ..., Tok N, SEP, Tok 1, ..., Tok M (sentences A and B); the output at the first (CLS) position is classified as True/False.]

47
Fine-tuning BERT on a sentence classification task

Sentence: Although the value added services being provided are great but the prices are high. Class: mixed review

Sentence: Great work done #XYZ Problem resolved by customer care in just one day. Class: positive review

• BERT is fine-tuned on task-specific training data: task-specific inputs and outputs.
• Example: sentence classification task:
  • Sentence A: input sentence
  • Sentence B: ∅
  • Output: target class (taken from the first position)
  • A new layer is introduced to convert the output at the first position into class probabilities.
• All parameters of the model are fine-tuned!

[Figure: the input CLS, Tok 1, ..., Tok N is processed by the BERT transformer encoder; the output at the first position is mapped to the class.]

48
Fine-tuning BERT on a question answering task

Paragraph: Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer, dancer and actress. Born
and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the
lead singer of Destiny’s Child, one of the best-selling girl groups of all time.

Question: When did Beyonce start becoming popular?

Correct answer: in the late 1990s


• Fine-tuning BERT on a question answering task:
  • Sentence A: question
  • Sentence B: paragraph (passage)
  • Output sequence: probabilities of each word in the passage being the start and the end of the
    answer
• All parameters of the model are updated on the task-specific data!

[Figure: the input CLS, Tok 1, ..., Tok N (question), SEP, Tok 1, ..., Tok M (paragraph); for each paragraph token the model outputs IsStart and IsEnd probabilities.]

49
BERT: GLUE Test results

GLUE Test results

The number below each task denotes the number of training examples.
F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B,
and accuracy scores are reported for the other tasks.

50
Recap
Recap

• Attention allows the encoder of a sequence-to-sequence model to produce representations of
  varying length, which dramatically increases the quality of the neural machine translation model.
• A sequence-to-sequence model can be implemented by CNNs: convolutional layers can process
sequential data and an autoregressive decoder can be implemented by shifted (causal) 1d
convolutions.
• Transformers are neural networks that use attention (self-attention, cross-attention) as the main
computational block.
• Multi-head attention allows paying attention to different parts of the attended sequence.
• Transformers have been used in multiple domains (e.g., vision, protein modeling).
• BERT is a popular transformer-based language model which can be tuned to custom NLU tasks.

52
Home assignment
Assignment 05 transformer

• You need to implement and train a transformer model for the statistical machine translation task
  (the same task as in the previous assignment).

54
Recommended reading

• Papers cited in the lecture slides.


• The Annotated Transformer blog post.

55
