Attention & Transformers

This lecture focuses on the Transformer architecture, which consists of an encoder-decoder model that utilizes multi-head self-attention mechanisms. It addresses the limitations of RNNs, such as vanishing gradients and lack of parallelizability, and introduces attention as a method for capturing relationships between tokens in a sequence. Key concepts include self-attention, query-key-value representations, and the ability to parallelize computations for efficiency.


COMP 3361 Natural Language Processing

Lecture 9: Attention and Transformers

Spring 2024

Many materials from CSE447@UW (Liwei Jiang), COS 484@Princeton, and CS224n@Stanford with special thanks!
Transformers

(Vaswani et al., 2017)


Transformer encoder-decoder
• Transformer encoder + Transformer decoder
• First designed for and evaluated on neural machine translation (NMT)
Transformer encoder-decoder
• Transformer encoder = a stack of encoder layers

• Transformer decoder = a stack of decoder layers

Transformer encoder: BERT, RoBERTa, ELECTRA

Transformer decoder: GPT-3, ChatGPT, PaLM

Transformer encoder-decoder: T5, BART

• Key innovation: multi-head self-attention


• Transformers don’t have any recurrence structures!
  (i.e., no recurrence of the form ht = f(ht−1, xt) ∈ ℝ^h, as in RNNs)
Transformers: roadmap

• From attention to self-attention


• From self-attention to multi-head self-attention
• Feedforward layers
• Positional encoding
• Residual connections + layer normalization
• Transformer encoder vs Transformer decoder
Issues with RNNs: Linear Interaction Distance
• RNNs are unrolled left-to-right.
• Linear locality is a useful heuristic: nearby words often affect each other's meaning!
• However, there's the vanishing gradient problem for long sequences: the gradients used to update the network become extremely small or "vanish" as they are backpropagated from the output layers to the earlier layers.
• The interaction distance between two tokens grows as O(sequence length), so RNNs fail to capture long-term dependencies (e.g., relating "Steve Jobs" to "Apple" across a long intervening clause: "Steve Jobs, who … Apple").


Issues with RNNs: Lack of Parallelizability
• Forward and backward passes have O(sequence length) unparallelizable operations
• GPUs can perform many independent computations (like addition) at once!
• But future RNN hidden states can’t be computed in full before past RNN hidden
states have been computed.
• Training and inference are slow; this inhibits training on very large datasets!

[Figure: RNN hidden states h1, h2, h3, …, hT; the number above each state indicates the minimum number of steps before it can be computed, growing linearly along the sequence.]


The New De Facto Method: Attention

Instead of deciding the next token solely based on the previously seen tokens, each token will "look at" all input tokens at the same time to decide which ones are most important for predicting the next token.

In practice, the actions of all tokens are done in parallel!


Building the Intuition of Attention
• Attention treats each token’s representation as a query to access and incorporate
information from a set of values.
• Today we look at attention within a single sequence.
• Number of unparallelizable operations does NOT increase with sequence length.
• Maximum interaction distance: O(1), since all tokens interact at every layer!

[Figure: two attention layers stacked on top of the embeddings h1 … hT; all tokens attend to all tokens in the previous layer (most arrows omitted), and each layer can be computed in a constant number of unparallelizable steps.]


Attention as a soft, averaging lookup table
We can think of attention as performing fuzzy lookup in a key-value store.

• In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
• In attention, the query matches all keys softly, to a weight between 0 and 1. The keys' values are multiplied by the weights and summed.


Self-Attention: Basic Concepts [Lena Viota Blog]

• Query: asking for information
• Key: saying that it has some information
• Value: giving the information


Self-Attention: Walk-through
• A self-attention layer maps inputs a1, a2, a3, a4 to outputs b1, b2, b3, b4.
• Each bi is obtained by considering all ai; the inputs ai can be either input embeddings or a hidden layer.
• To compute b1: how relevant are a2, a3, a4 to a1? We denote the level of relevance as α.


How to compute α?
• Method 1 (most common), dot product: α = q ⋅ k, with q = WQ a1 and k = WK a4. We'll use this!
• Method 2, additive: α = W tanh(q + k).


Self-Attention: Walk-through

• Query: q1 = WQ a1
• Keys: k1 = WK a1, k2 = WK a2, k3 = WK a3, k4 = WK a4 (the token also attends to itself)
• Attention scores: α1,1 = q1 ⋅ k1, α1,2 = q1 ⋅ k2, α1,3 = q1 ⋅ k3, α1,4 = q1 ⋅ k4


• Normalize the attention scores with a softmax:  α′1,i = exp(α1,i) / Σj exp(α1,j)
• The resulting α′1,1, α′1,2, α′1,3, α′1,4 denote how relevant each token is to a1!






Use attention scores to extract information



• Values: v1 = WV a1, v2 = WV a2, v3 = WV a3, v4 = WV a4
• Output: b1 = Σi α′1,i vi (a weighted sum of the values, weighted by the attention scores)







• The higher the attention score α′1,i is, the more important ai is to composing b1.
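As a concrete sketch of this walk-through, the snippet below computes b1 for a toy sequence of four vectors in NumPy; the inputs and weight matrices are random stand-ins, not values from the lecture:

```python
# A minimal sketch of the single-query walk-through above (NumPy;
# inputs and weights are random stand-ins).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # model dimension
a = rng.normal(size=(4, d))             # a1, a2, a3, a4 as rows

W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

q1 = a[0] @ W_Q                         # query for a1
k = a @ W_K                             # keys k1..k4 (rows)
v = a @ W_V                             # values v1..v4 (rows)

alpha = k @ q1                          # attention scores alpha_{1,i} = q1 . k_i
alpha_prime = np.exp(alpha - alpha.max())
alpha_prime /= alpha_prime.sum()        # softmax over the scores
b1 = alpha_prime @ v                    # b1 = sum_i alpha'_{1,i} v_i

print(alpha_prime)                      # how relevant each a_i is to a1
print(b1.shape)                         # (d,)
```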








Repeat the same calculation for all ai to obtain bi

• For example, b2 = Σi α′2,i vi, computed from q2, the keys k1, …, k4, and the values v1, …, v4.







• Note that the computation of each bi can be parallelized, as they are independent of each other.







Parallelize the computation: Q, K, V

• Stack a1, …, a4 as the rows of a matrix I. Then all queries, keys, and values are computed at once:
  Q = I WQ (rows q1, …, q4),  K = I WK (rows k1, …, k4),  V = I WV (rows v1, …, v4)


Parallelize the computation: attention scores for one query

• The scores for q1 form one row:  [α1,1, α1,2, α1,3, α1,4] = q1 Kᵀ  (the dot product of q1 with every key)


Parallelize the computation: the full score matrix

• Stacking all queries:  A = Q Kᵀ, where Ai,j = αi,j = qi ⋅ kj
• A row-wise softmax gives the normalized scores A′, with A′i,j = α′i,j




















Parallelize the computation: weighted sum of values with attention scores

• For one output:  b1 = α′1,1 v1 + α′1,2 v2 + α′1,3 v3 + α′1,4 v4 = [α′1,1, α′1,2, α′1,3, α′1,4] V








• Stacking all outputs:  O = A′ V, where row i of O is bi = Σj α′i,j vj

















Putting it all together:
  Q = I WQ,  K = I WK,  V = I WV
  A = Q Kᵀ = I WQ WKᵀ Iᵀ
  A′ = softmax(A)
  O = A′ V






The Matrix Form of Self-Attention

  I = {a1, . . . , an} ∈ ℝ^(n×d), where ai ∈ ℝ^d
  WQ, WK, WV ∈ ℝ^(d×d)
  Q = I WQ,  K = I WK,  V = I WV,  so Q, K, V ∈ ℝ^(n×d)
  A = Q Kᵀ = I WQ WKᵀ Iᵀ,  so A, A′ ∈ ℝ^(n×n)
  A′ = softmax(A)
  O = A′ V ∈ ℝ^(n×d)
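These equations map almost line-for-line onto array operations. Below is a minimal NumPy sketch under the dimensions above (random weights; no masking, multiple heads, or scaling yet — those come later):

```python
# Matrix form of self-attention: O = softmax(Q K^T) V.
# A minimal sketch with random weights; single head, no scaling or masking.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_Q, W_K, W_V):
    Q = I @ W_Q                   # (n, d)
    K = I @ W_K                   # (n, d)
    V = I @ W_V                   # (n, d)
    A = Q @ K.T                   # (n, n) attention scores
    A_prime = softmax(A, axis=-1) # row-wise softmax
    return A_prime @ V            # (n, d) outputs

rng = np.random.default_rng(0)
n, d = 5, 16
I = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
O = self_attention(I, W_Q, W_K, W_V)
print(O.shape)                    # (5, 16)
```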





Self-Attention: Summary
Let w1:n be a sequence of words in vocabulary V, like "Steve Jobs founded Apple".
For each wi, let ai = E wi, where E ∈ ℝ^(d×|V|) is an embedding matrix.
1. Transform each word embedding with weight matrices WQ, WK, WV, each in ℝ^(d×d):
   qi = WQ ai (queries),  ki = WK ai (keys),  vi = WV ai (values)
2. Compute pairwise similarities between keys and queries; normalize with softmax:
   αi,j = kjᵀ qi,   α′i,j = exp(αi,j) / Σj exp(αi,j)
3. Compute the output for each word as a weighted sum of values:
   bi = Σj α′i,j vj

Limitations and Solutions of Self-Attention

• No Sequence Order → Position Embedding
• No Nonlinearities → Adding Feed-forward Networks
• Looking into the Future → Masking



No Sequence Order → Position Embedding
• All tokens in an input sequence are simultaneously fed into self-attention
blocks. Thus, there’s no difference between tokens at different positions.
• We lose the position info!

• How do we bring the position info back, just like in RNNs?
• Represent each sequence index as a vector: pi ∈ ℝ^d, for i ∈ {1, ..., n}
• How do we incorporate the position info into the self-attention blocks?
  • Just add pi to the input: âi = ai + pi, where ai is the embedding of the word at index i.
  • In deep self-attention networks, we do this at the first layer.
  • We could also concatenate ai and pi, but more commonly we add them.
Position Representation Vectors via Sinusoids
Sinusoidal Position Representations (from the original Transformer paper):
concatenate sinusoidal functions of varying periods.

  pi = [ sin(i / 10000^(2·1/d)), cos(i / 10000^(2·1/d)), …, sin(i / 10000^(2·(d/2)/d)), cos(i / 10000^(2·(d/2)/d)) ]

  [Figure: each dimension of pi is a sinusoid of a different period, plotted over the index in the sequence.]
  https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/

• Periodicity indicates that maybe “absolute position” isn’t as important


• Maybe can extrapolate to longer sequences as periods restart!

• Not learnable; also the extrapolation doesn’t really work!


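A small sketch of how such sinusoidal position vectors can be computed (following the formula above; the exact interleaving of sin and cos dimensions is one common convention and an assumption here):

```python
# Sinusoidal position representations: even dimensions use sin, odd use cos,
# with period 10000^(2k/d). Indexing conventions vary between implementations.
import numpy as np

def sinusoidal_positions(n, d):
    pos = np.arange(n)[:, None]                      # (n, 1) sequence indices
    k = np.arange(d // 2)[None, :]                   # (1, d/2) frequency indices
    angle = pos / np.power(10000.0, 2 * k / d)       # (n, d/2)
    p = np.zeros((n, d))
    p[:, 0::2] = np.sin(angle)
    p[:, 1::2] = np.cos(angle)
    return p

p = sinusoidal_positions(n=50, d=16)
print(p.shape)        # (50, 16); row i is p_i, added to word embedding a_i
```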
Learnable Position Representation Vectors
Learned absolute position representations: pi contains learnable parameters.
• Learn a matrix p ∈ ℝ^(d×n), and let each pi be a column of that matrix.
• Most systems use this method.
• Flexibility: each position gets to be learned to fit the data.
• Cannot extrapolate to indices outside 1, ..., n.

Sometimes people try more flexible representations of position:
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position [Wang et al., 2019]


No Nonlinearities → Add Feed-forward Networks
• There are no element-wise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors.
• Easy fix: add a feed-forward network to post-process each output vector (applied position-wise: a1 … an → self-attention → b1 … bn → FF → c1 … cn). A sketch follows below.
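Below is a minimal NumPy sketch of such a position-wise feed-forward network; the ReLU nonlinearity and the 4d hidden width are common choices and are assumptions here, not specified on the slide:

```python
# Position-wise feed-forward network applied to each output vector b_i
# independently: FF(b) = ReLU(b W1 + c1) W2 + c2.
# The 4*d hidden size is a common default (an assumption, not from the slides).
import numpy as np

def feed_forward(B, W1, c1, W2, c2):
    H = np.maximum(0.0, B @ W1 + c1)    # ReLU nonlinearity, (n, 4d)
    return H @ W2 + c2                  # project back to (n, d)

rng = np.random.default_rng(0)
n, d = 5, 16
B = rng.normal(size=(n, d))             # outputs of the self-attention layer
W1, c1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, c2 = rng.normal(size=(4 * d, d)), np.zeros(d)
C = feed_forward(B, W1, c1, W2, c2)
print(C.shape)                          # (5, 16)
```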




Looking into the Future → Masking
• In decoders (language modeling, producing the next word given previous context), we need to ensure we don't peek at the future.

https://jalammar.github.io/illustrated-gpt2/

• When encoding a word, we can only look at the words that are not in the future (e.g., for the prefix "[START] The chef who").
• To enable parallelization, we mask out attention to future words by setting their attention scores to −∞:

  αi,j = qi ⋅ kj  if j ≤ i,   and  αi,j = −∞  if j > i
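A minimal sketch of how this masking can be implemented, assuming the matrix form from earlier (the −∞ entries become zeros after the softmax):

```python
# Causal (future) masking: set alpha_{i,j} = -inf for j > i before the softmax,
# so row i of the attention matrix only attends to positions <= i.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(I, W_Q, W_K, W_V):
    Q, K, V = I @ W_Q, I @ W_K, I @ W_V
    A = Q @ K.T                                        # (n, n) scores
    n = A.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above diagonal
    A = np.where(mask, -np.inf, A)                     # future positions get -inf
    return softmax(A, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
I = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
O = masked_self_attention(I, W_Q, W_K, W_V)
print(O.shape)   # (4, 8); row i depends only on inputs 1..i
```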


Now We Put Things Together

• Self-attention: the basic computation
• Positional encoding: specify the sequence order
• Nonlinearities: adding a feed-forward network at the output of the self-attention block
• Masking: parallelize operations (looking at all tokens) while not leaking info from the future

[Diagram: Inputs → Input Embeddings + Position Embedding → a Block of (Masked Self-Attention → Feed-Forward), repeated for the number of blocks → Linear → Softmax → Output Probabilities]
The Transformer Decoder

• A Transformer decoder is what we use to build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components:
  • Residual connection ("Add")
  • Layer normalization ("Norm")
  • Replace self-attention with multi-head self-attention.

[Diagram: Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm), repeated → Linear → Softmax → Output Probabilities]
Multi-head Attention
“The Beast with Many Heads”

• It is better to use multiple attention functions instead of one!
• Each attention function ("head") can focus on different positions.

https://jalammar.github.io/illustrated-transformer/
Multi-Head Attention: Walk-through

• Each input ai still produces qi, ki, vi, but these are split into per-head pieces: (qi,1, ki,1, vi,1) for head 1, (qi,2, ki,2, vi,2) for head 2, and so on.
• Head 1 computes its own attention scores α′i,j,1 and output bi,1 using only the head-1 queries, keys, and values; head 2 likewise computes bi,2.
• Concatenate the per-head outputs and apply some transformation: bi = [bi,1 ; bi,2] Y.
Recall the Matrix Form of Self-Attention

  Q = I WQ,  K = I WK,  V = I WV,  with I ∈ ℝ^(n×d), WQ, WK, WV ∈ ℝ^(d×d), Q, K, V ∈ ℝ^(n×d)
  A = Q Kᵀ = I WQ WKᵀ Iᵀ ∈ ℝ^(n×n),  A′ = softmax(A),  O = A′ V ∈ ℝ^(n×d)





Multi-head Attention in Matrices

• Multiple attention "heads" can be defined via multiple WQ, WK, WV matrices.
• Let WQ^ℓ, WK^ℓ, WV^ℓ ∈ ℝ^(d×(d/h)), where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  O^ℓ = softmax(I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) I WV^ℓ
• Then concatenate the outputs of the different attention heads:
  O = [O^1 ; . . . ; O^h] Y, where Y ∈ ℝ^(d×d)
The Matrix Form of Multi-head Attention

  I = {a1, . . . , an} ∈ ℝ^(n×d), where ai ∈ ℝ^d
  WQ^ℓ, WK^ℓ, WV^ℓ ∈ ℝ^(d×(d/h)) for each head ℓ = 1, …, h
  Q^ℓ = I WQ^ℓ,  K^ℓ = I WK^ℓ,  V^ℓ = I WV^ℓ,  so Q^ℓ, K^ℓ, V^ℓ ∈ ℝ^(n×(d/h))
  A^ℓ = Q^ℓ (K^ℓ)ᵀ ∈ ℝ^(n×n),  A′^ℓ = softmax(A^ℓ)
  O^ℓ = A′^ℓ V^ℓ ∈ ℝ^(n×(d/h))
  O = [O^1 ; . . . ; O^h] Y,  with [O^1 ; . . . ; O^h] ∈ ℝ^(n×d),  Y ∈ ℝ^(d×d),  O ∈ ℝ^(n×d)



Multi-head Attention is Computationally Efficient

• Even though we compute h many attention heads, it's not more costly.
• We compute I WQ ∈ ℝ^(n×d), and then reshape to ℝ^(n×h×(d/h)). Likewise for I WK and I WV.
• Then we transpose to ℝ^(h×n×(d/h)); now the head axis is like a batch axis.
• Almost everything else is identical. All we need to do is reshape the tensors!
• The score computation I WQ WKᵀ Iᵀ now yields h sets of attention scores, one per head: a tensor in ℝ^(h×n×n).
• softmax(I WQ WKᵀ Iᵀ) I WV = O′,  and  O′ Y = O ∈ ℝ^(n×d)
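A sketch of the reshape trick described above, in NumPy; the head axis becomes a leading batch-like axis and matmul broadcasts over it (scaling is omitted here and added in the next section):

```python
# Multi-head attention via reshaping: compute I W_Q once as (n, d), view it as
# (n, h, d/h), transpose to (h, n, d/h), and batch the attention over heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(I, W_Q, W_K, W_V, Y, h):
    n, d = I.shape
    def split_heads(X):                       # (n, d) -> (h, n, d/h)
        return X.reshape(n, h, d // h).transpose(1, 0, 2)
    Q, K, V = split_heads(I @ W_Q), split_heads(I @ W_K), split_heads(I @ W_V)
    A = Q @ K.transpose(0, 2, 1)              # (h, n, n): h sets of attention scores
    O = softmax(A, axis=-1) @ V               # (h, n, d/h)
    O = O.transpose(1, 0, 2).reshape(n, d)    # concatenate heads -> (n, d)
    return O @ Y                              # output projection, (n, d)

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
I = rng.normal(size=(n, d))
W_Q, W_K, W_V, Y = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(I, W_Q, W_K, W_V, Y, h).shape)   # (6, 16)
```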
Scaled Dot Product
• "Scaled dot product" attention aids in training.
• When the dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen:
  O^ℓ = softmax(I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) I WV^ℓ
• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  O^ℓ = softmax( (I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) / √(d/h) ) I WV^ℓ
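With the scaling added, the per-head attention becomes the following small function (a sketch; Q, K, V are assumed to be already split into heads as in the previous snippet):

```python
# Scaled dot-product attention for h heads in a batch: divide the scores by
# sqrt(d/h) before the softmax so they don't grow with the head dimension.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_head = Q.shape[-1]                                    # this is d/h per head
    A = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)          # (h, n, n) scaled scores
    return softmax(A, axis=-1) @ V                          # (h, n, d/h)

rng = np.random.default_rng(0)
h, n, d_head = 4, 6, 8
Q, K, V = (rng.normal(size=(h, n, d_head)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)          # (4, 6, 8)
```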
The Transformer Decoder

• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks:
  • Residual connection ("Add")
  • Layer normalization ("Norm")

[Diagram: Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm), repeated → Linear → Softmax → Output Probabilities]
Residual Connections
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)), so we only have to learn "the residual" from the previous layer.
• The gradient is great through the residual connection; it's 1!
• Bias towards the identity function!

[Figure: loss landscape visualization with and without residuals; Li et al., 2018, on a ResNet]


Layer Normalization
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to unit mean
and standard deviation within each layer.
• LayerNorm’s success may be due to its normalizing gradients [Xu et al., 2019]
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = sqrt( (1/d) Σ_{j=1}^{d} (x_j − μ)^2 ); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = (x − μ) / (σ + ε) ∗ γ + β
  (normalize by the scalar mean and standard deviation, then modulate by the learned element-wise gain and bias)
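A sketch of layer normalization as defined above, together with the "Add & Norm" pattern (residual connection followed by layer normalization) in which it is used; γ, β, and ε follow the definitions above:

```python
# Layer normalization of d-dimensional vectors (applied row-wise),
# followed by the "Add & Norm" pattern: x <- LayerNorm(x + Sublayer(x)).
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)          # scalar mean per vector
    sigma = x.std(axis=-1, keepdims=True)        # scalar std per vector
    return (x - mu) / (sigma + eps) * gamma + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    return layer_norm(x + sublayer_out, gamma, beta)   # residual, then normalize

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(5, d))
gamma, beta = np.ones(d), np.zeros(d)            # learned gain and bias (fixed here)
y = add_and_norm(x, rng.normal(size=(5, d)), gamma, beta)
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))  # ~0 mean, ~1 std per row
```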
The Transformer Decoder

• The Transformer Decoder is a stack of Transformer Decoder Blocks.
• Each Block consists of:
  • Masked Multi-head Self-attention
  • Add & Norm
  • Feed-Forward
  • Add & Norm

[Diagram: Inputs → Input Embeddings + Position Embedding → repeated Decoder Blocks → Linear → Softmax → Output Probabilities]
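Putting the pieces together, here is a compact, illustrative sketch of one decoder block. It is simplified to a single attention head with no dropout and no learned LayerNorm parameters, so treat it as an assumption-laden toy rather than a faithful implementation:

```python
# One simplified Transformer decoder block (single head, no dropout):
# masked self-attention -> Add & Norm -> feed-forward -> Add & Norm.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_block(X, params):
    n, d = X.shape
    # Masked (causal) self-attention with scaled dot-product scores
    Q, K, V = X @ params["W_Q"], X @ params["W_K"], X @ params["W_V"]
    A = Q @ K.T / np.sqrt(d)
    A = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, A)
    attn = softmax(A) @ V
    X = layer_norm(X + attn)                     # Add & Norm
    # Position-wise feed-forward network
    H = np.maximum(0.0, X @ params["W_1"]) @ params["W_2"]
    return layer_norm(X + H)                     # Add & Norm

rng = np.random.default_rng(0)
n, d = 6, 16
params = {k: rng.normal(size=(d, d)) * 0.1 for k in ["W_Q", "W_K", "W_V"]}
params["W_1"] = rng.normal(size=(d, 4 * d)) * 0.1
params["W_2"] = rng.normal(size=(4 * d, d)) * 0.1
X = rng.normal(size=(n, d))
print(decoder_block(X, params).shape)            # (6, 16)
```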
The Transformer Encoder

• The Transformer Decoder constrains attention to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• We use the Transformer Encoder — the ONLY difference is that we remove the masking in the self-attention. No masks!

[Diagram: same stack of blocks as the decoder, but with (unmasked) Multi-head Attention over the Encoder Inputs.]
The Transformer Encoder-Decoder

• More on Encoder-Decoder models will be introduced in the next lecture!
• Right now we only need to know that it processes the source sentence w1, . . . , wt1 with a bidirectional model (Encoder) and generates the target wt1+1, . . . , wt2 with a unidirectional model (Decoder).
• The Transformer Decoder is modified to perform cross-attention to the output of the Encoder.
Cross-Attention

[Diagram: the full encoder-decoder. Encoder: Encoder Inputs → Input Embeddings + Position Embedding → Block of (Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm). Decoder: Decoder Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Cross-Attention, taking K and V from the encoder output and Q from the decoder → Add & Norm → Feed-Forward → Add & Norm) → Linear → Softmax → Output Probabilities.]
Cross-Attention Details
• Self-attention: queries, keys, and values come from the same source.
• Cross-attention: keys and values are from the Encoder (like a memory); queries are from the Decoder.
• Let h1, …, hn be output vectors from the Transformer encoder, hi ∈ ℝ^d.
• Let z1, …, zn be input vectors from the Transformer decoder, zi ∈ ℝ^d.
• Keys and values are drawn from the encoder: ki = WK hi, vi = WV hi
• Queries are drawn from the decoder: qi = WQ zi
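A sketch of this wiring: the encoder outputs supply the keys and values, the decoder states supply the queries. The 1/√d scaling and the distinct source/target lengths n and m are my own additions for generality:

```python
# Cross-attention: queries come from the decoder states z_1..z_m, while keys and
# values come from the encoder outputs h_1..h_n (a memory the decoder reads from).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H_enc, Z_dec, W_Q, W_K, W_V):
    K = H_enc @ W_K                  # (n, d) keys from the encoder
    V = H_enc @ W_V                  # (n, d) values from the encoder
    Q = Z_dec @ W_Q                  # (m, d) queries from the decoder
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (m, n)
    return A @ V                     # (m, d): each decoder position reads the encoder

rng = np.random.default_rng(0)
n, m, d = 7, 5, 16
H_enc = rng.normal(size=(n, d))      # encoder outputs h_1..h_n
Z_dec = rng.normal(size=(m, d))      # decoder inputs z_1..z_m
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H_enc, Z_dec, W_Q, W_K, W_V).shape)   # (5, 16)
```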
Transformers: pros and cons
• Easier to capture long-range dependencies: we draw attention between every pair of words!

• Easier to parallelize:
  Q = X WQ,  K = X WK,  V = X WV  (each is a single matrix multiplication over the whole sequence)
• Are positional encodings enough to capture positional information? Otherwise self-attention is an unordered function of its input.

• Quadratic computation in self-attention: can become very slow when the sequence length is large.


Quadratic computation as a function of sequence length

• Q = X WQ is (n × dq), Kᵀ is (dk × n), V is (n × dv): we need to compute n² pairs of scores (dot products), i.e., O(n²d) time.
• RNNs only require O(nd²) running time: ht = g(W ht−1 + U xt + b) (assuming input dimension = hidden dimension = d).
• Max sequence length = 1,024 in GPT-2.
• What if we want to scale to n ≥ 50,000, for example, to work on long documents? A rough calculation follows below.
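To make the quadratic cost concrete, a rough back-of-the-envelope calculation (the byte counts and GB figures are my own arithmetic, not from the slides):

```python
# Rough size of the n x n attention-score matrix per head, per layer, in float32.
bytes_per_score = 4                          # float32

for n in (1_024, 50_000):
    scores = n * n                           # n^2 pairs of dot products
    print(f"n={n}: {scores:,} scores "
          f"(~{scores * bytes_per_score / 1e9:.3f} GB per head per layer)")
# e.g. n=1024 -> ~1M scores (~0.004 GB); n=50000 -> 2.5B scores (~10 GB)
```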


The Revolutionary Impact of Transformers
• Almost all current-day leading language models use Transformer building blocks.
• E.g., GPT1/2/3/4, T5, Llama 1/2, BERT, … almost anything we can name
• Transformer-based models dominate nearly all NLP leaderboards.

• Since Transformers were popularized in language applications, computer vision has also adopted them, e.g., Vision Transformers. [Khan et al., 2021]
What’s next after
Transformers?

