Attention & Transformers

This lecture focuses on the Transformer architecture, which consists of an encoder-decoder model that utilizes multi-head self-attention mechanisms. It addresses the limitations of RNNs, such as vanishing gradients and lack of parallelizability, and introduces attention as a method for capturing relationships between tokens in a sequence. Key concepts include self-attention, query-key-value representations, and the ability to parallelize computations for efficiency.


COMP 3361 Natural Language Processing

Lecture 9: Attention and Transformers

Spring 2024

Many materials from CSE447@UW (Liwei Jiang), COS 484@Princeton, and CS224n@Stanford with special thanks!
Transformers

(Vaswani et al., 2017)


Transformer encoder-decoder
• Transformer encoder + Transformer decoder
• First designed for and evaluated on neural machine translation (NMT)
Transformer encoder-decoder
• Transformer encoder = a stack of encoder layers

• Transformer decoder = a stack of decoder layers

Transformer encoder: BERT, RoBERTa, ELECTRA

Transformer decoder: GPT-3, ChatGPT, PaLM

Transformer encoder-decoder: T5, BART

• Key innovation: multi-head self-attention


• Transformers don’t have any recurrence structures!
  (i.e., no recurrence of the form ht = f(ht−1, xt) ∈ ℝ^h, as in RNNs)
Transformers: roadmap

• From attention to self-attention


• From self-attention to multi-head self-attention
• Feedforward layers
• Positional encoding
• Residual connections + layer normalization
• Transformer encoder vs Transformer decoder
Issues with RNNs: Linear Interaction Distance
• RNNs are unrolled left-to-right.
• Linear locality is a useful heuristic: nearby words often affect each other's meaning!
• However, there's the vanishing gradient problem for long sequences: the gradients used to update the network become extremely small or "vanish" as they are backpropagated from the output layers to the earlier layers.
• The interaction distance between two tokens grows as O(sequence length), so RNNs fail to capture long-term dependencies (e.g., relating "Steve Jobs" to "Apple" across a long intervening clause: "Steve Jobs, who … Apple").


Issues with RNNs: Lack of Parallelizability
• Forward and backward passes have O(sequence length) unparallelizable operations
• GPUs can perform many independent computations (like addition) at once!
• But future RNN hidden states can’t be computed in full before past RNN hidden
states have been computed.
• Training and inference are slow; this inhibits training on very large datasets!

[Figure: RNN hidden states h1, h2, h3, …, hT; the number above each state indicates the minimum number of steps before it can be computed, growing linearly along the sequence.]


The New De Facto Method: Attention

Instead of deciding the next token solely based on the previously seen tokens, each token will "look at" all input tokens at the same time to decide which ones are most important for predicting the next token.

In practice, the actions of all tokens are done in parallel!


Building the Intuition of Attention
• Attention treats each token’s representation as a query to access and incorporate
information from a set of values.
• Today we look at attention within a single sequence.
• Number of unparallelizable operations does NOT increase with sequence length.
• Maximum interaction distance: O(1), since all tokens interact at every layer!

[Figure: two attention layers stacked on top of the embeddings h1 … hT; all tokens attend to all tokens in the previous layer (most arrows omitted), and each layer can be computed in a constant number of unparallelizable steps.]


Attention as a soft, averaging lookup table
We can think of attention as performing fuzzy lookup in a key-value store.

• In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
• In attention, the query matches all keys softly, to a weight between 0 and 1. The keys' values are multiplied by the weights and summed.


Self-Attention: Basic Concepts [Lena Viota Blog]

• Query: asking for information
• Key: saying that it has some information
• Value: giving the information


Self-Attention: Walk-through
• A self-attention layer maps inputs a1, a2, a3, a4 to outputs b1, b2, b3, b4.
• Each bi is obtained by considering all ai; the inputs ai can be either input embeddings or a hidden layer.
• To compute b1: how relevant are a2, a3, a4 to a1? We denote the level of relevance as α.


How to compute α?
• Method 1 (most common), dot product: α = q ⋅ k, with q = WQ a1 and k = WK a4. We'll use this!
• Method 2, additive: α = W tanh(q + k).


Self-Attention: Walk-through

• Query: q1 = WQ a1
• Keys: k1 = WK a1, k2 = WK a2, k3 = WK a3, k4 = WK a4 (the token also attends to itself)
• Attention scores: α1,1 = q1 ⋅ k1, α1,2 = q1 ⋅ k2, α1,3 = q1 ⋅ k3, α1,4 = q1 ⋅ k4


• Normalize the attention scores with a softmax:  α′1,i = exp(α1,i) / Σj exp(α1,j)
• The resulting α′1,1, α′1,2, α′1,3, α′1,4 denote how relevant each token is to a1!






Use attention scores to extract information



• Values: v1 = WV a1, v2 = WV a2, v3 = WV a3, v4 = WV a4
• Output: b1 = Σi α′1,i vi (a weighted sum of the values, weighted by the attention scores)







• The higher the attention score α′1,i is, the more important ai is to composing b1.
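As a concrete sketch of this walk-through, the snippet below computes b1 for a toy sequence of four vectors in NumPy; the inputs and weight matrices are random stand-ins, not values from the lecture:

```python
# A minimal sketch of the single-query walk-through above (NumPy;
# inputs and weights are random stand-ins).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # model dimension
a = rng.normal(size=(4, d))             # a1, a2, a3, a4 as rows

W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

q1 = a[0] @ W_Q                         # query for a1
k = a @ W_K                             # keys k1..k4 (rows)
v = a @ W_V                             # values v1..v4 (rows)

alpha = k @ q1                          # attention scores alpha_{1,i} = q1 . k_i
alpha_prime = np.exp(alpha - alpha.max())
alpha_prime /= alpha_prime.sum()        # softmax over the scores
b1 = alpha_prime @ v                    # b1 = sum_i alpha'_{1,i} v_i

print(alpha_prime)                      # how relevant each a_i is to a1
print(b1.shape)                         # (d,)
```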








Repeat the same calculation for all ai to obtain bi

• For example, b2 = Σi α′2,i vi, computed from q2, the keys k1, …, k4, and the values v1, …, v4.







• Note that the computation of each bi can be parallelized, as they are independent of each other.







Parallelize the computation: Q, K, V

• Stack a1, …, a4 as the rows of a matrix I. Then all queries, keys, and values are computed at once:
  Q = I WQ (rows q1, …, q4),  K = I WK (rows k1, …, k4),  V = I WV (rows v1, …, v4)


Parallelize the computation: attention scores for one query

• The scores for q1 form one row:  [α1,1, α1,2, α1,3, α1,4] = q1 Kᵀ  (the dot product of q1 with every key)


Parallelize the computation: the full score matrix

• Stacking all queries:  A = Q Kᵀ, where Ai,j = αi,j = qi ⋅ kj
• A row-wise softmax gives the normalized scores A′, with A′i,j = α′i,j




















Parallelize the computation: weighted sum of values with attention scores

• For one output:  b1 = α′1,1 v1 + α′1,2 v2 + α′1,3 v3 + α′1,4 v4 = [α′1,1, α′1,2, α′1,3, α′1,4] V








• Stacking all outputs:  O = A′ V, where row i of O is bi = Σj α′i,j vj

















Putting it all together:
  Q = I WQ,  K = I WK,  V = I WV
  A = Q Kᵀ = I WQ WKᵀ Iᵀ
  A′ = softmax(A)
  O = A′ V






The Matrix Form of Self-Attention

  I = {a1, . . . , an} ∈ ℝ^(n×d), where ai ∈ ℝ^d
  WQ, WK, WV ∈ ℝ^(d×d)
  Q = I WQ,  K = I WK,  V = I WV,  so Q, K, V ∈ ℝ^(n×d)
  A = Q Kᵀ = I WQ WKᵀ Iᵀ,  so A, A′ ∈ ℝ^(n×n)
  A′ = softmax(A)
  O = A′ V ∈ ℝ^(n×d)
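These equations map almost line-for-line onto array operations. Below is a minimal NumPy sketch under the dimensions above (random weights; no masking, multiple heads, or scaling yet — those come later):

```python
# Matrix form of self-attention: O = softmax(Q K^T) V.
# A minimal sketch with random weights; single head, no scaling or masking.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_Q, W_K, W_V):
    Q = I @ W_Q                   # (n, d)
    K = I @ W_K                   # (n, d)
    V = I @ W_V                   # (n, d)
    A = Q @ K.T                   # (n, n) attention scores
    A_prime = softmax(A, axis=-1) # row-wise softmax
    return A_prime @ V            # (n, d) outputs

rng = np.random.default_rng(0)
n, d = 5, 16
I = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
O = self_attention(I, W_Q, W_K, W_V)
print(O.shape)                    # (5, 16)
```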





Self-Attention: Summary
Let w1:n be a sequence of words in vocabulary V, like "Steve Jobs founded Apple".
For each wi, let ai = E wi, where E ∈ ℝ^(d×|V|) is an embedding matrix.
1. Transform each word embedding with weight matrices WQ, WK, WV, each in ℝ^(d×d):
   qi = WQ ai (queries),  ki = WK ai (keys),  vi = WV ai (values)
2. Compute pairwise similarities between keys and queries; normalize with softmax:
   αi,j = kjᵀ qi,   α′i,j = exp(αi,j) / Σj exp(αi,j)
3. Compute the output for each word as a weighted sum of values:
   bi = Σj α′i,j vj

Limitations and Solutions of Self-Attention

• No Sequence Order → Position Embedding
• No Nonlinearities → Adding Feed-forward Networks
• Looking into the Future → Masking



No Sequence Order → Position Embedding
• All tokens in an input sequence are simultaneously fed into self-attention
blocks. Thus, there’s no difference between tokens at different positions.
• We lose the position info!

• How do we bring the position info back, just like in RNNs?
• Represent each sequence index as a vector: pi ∈ ℝ^d, for i ∈ {1, ..., n}
• How do we incorporate the position info into the self-attention blocks?
  • Just add pi to the input: âi = ai + pi, where ai is the embedding of the word at index i.
  • In deep self-attention networks, we do this at the first layer.
  • We could also concatenate ai and pi, but more commonly we add them.
Position Representation Vectors via Sinusoids
Sinusoidal Position Representations (from the original Transformer paper):
concatenate sinusoidal functions of varying periods.

  pi = [ sin(i / 10000^(2·1/d)), cos(i / 10000^(2·1/d)), …, sin(i / 10000^(2·(d/2)/d)), cos(i / 10000^(2·(d/2)/d)) ]

  [Figure: each dimension of pi is a sinusoid of a different period, plotted over the index in the sequence.]
  https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/

• Periodicity indicates that maybe “absolute position” isn’t as important


• Maybe can extrapolate to longer sequences as periods restart!

• Not learnable; also the extrapolation doesn’t really work!


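A small sketch of how such sinusoidal position vectors can be computed (following the formula above; the exact interleaving of sin and cos dimensions is one common convention and an assumption here):

```python
# Sinusoidal position representations: even dimensions use sin, odd use cos,
# with period 10000^(2k/d). Indexing conventions vary between implementations.
import numpy as np

def sinusoidal_positions(n, d):
    pos = np.arange(n)[:, None]                      # (n, 1) sequence indices
    k = np.arange(d // 2)[None, :]                   # (1, d/2) frequency indices
    angle = pos / np.power(10000.0, 2 * k / d)       # (n, d/2)
    p = np.zeros((n, d))
    p[:, 0::2] = np.sin(angle)
    p[:, 1::2] = np.cos(angle)
    return p

p = sinusoidal_positions(n=50, d=16)
print(p.shape)        # (50, 16); row i is p_i, added to word embedding a_i
```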
Learnable Position Representation Vectors
Learned absolute position representations: pi contains learnable parameters.
• Learn a matrix p ∈ ℝ^(d×n), and let each pi be a column of that matrix.
• Most systems use this method.
• Flexibility: each position gets to be learned to fit the data.
• Cannot extrapolate to indices outside 1, ..., n.

Sometimes people try more flexible representations of position:
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position [Wang et al., 2019]


No Nonlinearities → Add Feed-forward Networks
• There are no element-wise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors.
• Easy fix: add a feed-forward network to post-process each output vector (applied position-wise: a1 … an → self-attention → b1 … bn → FF → c1 … cn). A sketch follows below.
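Below is a minimal NumPy sketch of such a position-wise feed-forward network; the ReLU nonlinearity and the 4d hidden width are common choices and are assumptions here, not specified on the slide:

```python
# Position-wise feed-forward network applied to each output vector b_i
# independently: FF(b) = ReLU(b W1 + c1) W2 + c2.
# The 4*d hidden size is a common default (an assumption, not from the slides).
import numpy as np

def feed_forward(B, W1, c1, W2, c2):
    H = np.maximum(0.0, B @ W1 + c1)    # ReLU nonlinearity, (n, 4d)
    return H @ W2 + c2                  # project back to (n, d)

rng = np.random.default_rng(0)
n, d = 5, 16
B = rng.normal(size=(n, d))             # outputs of the self-attention layer
W1, c1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, c2 = rng.normal(size=(4 * d, d)), np.zeros(d)
C = feed_forward(B, W1, c1, W2, c2)
print(C.shape)                          # (5, 16)
```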




Looking into the Future → Masking
• In decoders (language modeling, producing the next word given previous context), we need to ensure we don't peek at the future.

https://jalammar.github.io/illustrated-gpt2/

• When encoding a word, we can only look at the words that are not in the future (e.g., for the prefix "[START] The chef who").
• To enable parallelization, we mask out attention to future words by setting their attention scores to −∞:

  αi,j = qi ⋅ kj  if j ≤ i,   and  αi,j = −∞  if j > i
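A minimal sketch of how this masking can be implemented, assuming the matrix form from earlier (the −∞ entries become zeros after the softmax):

```python
# Causal (future) masking: set alpha_{i,j} = -inf for j > i before the softmax,
# so row i of the attention matrix only attends to positions <= i.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(I, W_Q, W_K, W_V):
    Q, K, V = I @ W_Q, I @ W_K, I @ W_V
    A = Q @ K.T                                        # (n, n) scores
    n = A.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above diagonal
    A = np.where(mask, -np.inf, A)                     # future positions get -inf
    return softmax(A, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
I = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
O = masked_self_attention(I, W_Q, W_K, W_V)
print(O.shape)   # (4, 8); row i depends only on inputs 1..i
```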


Now We Put Things Together

• Self-attention: the basic computation
• Positional encoding: specify the sequence order
• Nonlinearities: adding a feed-forward network at the output of the self-attention block
• Masking: parallelize operations (looking at all tokens) while not leaking info from the future

[Diagram: Inputs → Input Embeddings + Position Embedding → a Block of (Masked Self-Attention → Feed-Forward), repeated for the number of blocks → Linear → Softmax → Output Probabilities]
The Transformer Decoder

• A Transformer decoder is what we use to build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components:
  • Residual connection ("Add")
  • Layer normalization ("Norm")
  • Replace self-attention with multi-head self-attention.

[Diagram: Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm), repeated → Linear → Softmax → Output Probabilities]
Multi-head Attention
“The Beast with Many Heads”

• It is better to use multiple attention functions instead of one!
• Each attention function ("head") can focus on different positions.

https://jalammar.github.io/illustrated-transformer/
Multi-Head Attention: Walk-through

• Each input ai still produces qi, ki, vi, but these are split into per-head pieces: (qi,1, ki,1, vi,1) for head 1, (qi,2, ki,2, vi,2) for head 2, and so on.
• Head 1 computes its own attention scores α′i,j,1 and output bi,1 using only the head-1 queries, keys, and values; head 2 likewise computes bi,2.
• Concatenate the per-head outputs and apply some transformation: bi = [bi,1 ; bi,2] Y.
Recall the Matrix Form of Self-Attention

  Q = I WQ,  K = I WK,  V = I WV,  with I ∈ ℝ^(n×d), WQ, WK, WV ∈ ℝ^(d×d), Q, K, V ∈ ℝ^(n×d)
  A = Q Kᵀ = I WQ WKᵀ Iᵀ ∈ ℝ^(n×n),  A′ = softmax(A),  O = A′ V ∈ ℝ^(n×d)





Multi-head Attention in Matrices

• Multiple attention "heads" can be defined via multiple WQ, WK, WV matrices.
• Let WQ^ℓ, WK^ℓ, WV^ℓ ∈ ℝ^(d×(d/h)), where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  O^ℓ = softmax(I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) I WV^ℓ
• Then concatenate the outputs of the different attention heads:
  O = [O^1 ; . . . ; O^h] Y, where Y ∈ ℝ^(d×d)
The Matrix Form of Multi-head Attention

  I = {a1, . . . , an} ∈ ℝ^(n×d), where ai ∈ ℝ^d
  WQ^ℓ, WK^ℓ, WV^ℓ ∈ ℝ^(d×(d/h)) for each head ℓ = 1, …, h
  Q^ℓ = I WQ^ℓ,  K^ℓ = I WK^ℓ,  V^ℓ = I WV^ℓ,  so Q^ℓ, K^ℓ, V^ℓ ∈ ℝ^(n×(d/h))
  A^ℓ = Q^ℓ (K^ℓ)ᵀ ∈ ℝ^(n×n),  A′^ℓ = softmax(A^ℓ)
  O^ℓ = A′^ℓ V^ℓ ∈ ℝ^(n×(d/h))
  O = [O^1 ; . . . ; O^h] Y,  with [O^1 ; . . . ; O^h] ∈ ℝ^(n×d),  Y ∈ ℝ^(d×d),  O ∈ ℝ^(n×d)



Multi-head Attention is Computationally Efficient

• Even though we compute h many attention heads, it's not more costly.
• We compute I WQ ∈ ℝ^(n×d), and then reshape to ℝ^(n×h×(d/h)). Likewise for I WK and I WV.
• Then we transpose to ℝ^(h×n×(d/h)); now the head axis is like a batch axis.
• Almost everything else is identical. All we need to do is reshape the tensors!
• The score computation I WQ WKᵀ Iᵀ now yields h sets of attention scores, one per head: a tensor in ℝ^(h×n×n).
• softmax(I WQ WKᵀ Iᵀ) I WV = O′,  and  O′ Y = O ∈ ℝ^(n×d)
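A sketch of the reshape trick described above, in NumPy; the head axis becomes a leading batch-like axis and matmul broadcasts over it (scaling is omitted here and added in the next section):

```python
# Multi-head attention via reshaping: compute I W_Q once as (n, d), view it as
# (n, h, d/h), transpose to (h, n, d/h), and batch the attention over heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(I, W_Q, W_K, W_V, Y, h):
    n, d = I.shape
    def split_heads(X):                       # (n, d) -> (h, n, d/h)
        return X.reshape(n, h, d // h).transpose(1, 0, 2)
    Q, K, V = split_heads(I @ W_Q), split_heads(I @ W_K), split_heads(I @ W_V)
    A = Q @ K.transpose(0, 2, 1)              # (h, n, n): h sets of attention scores
    O = softmax(A, axis=-1) @ V               # (h, n, d/h)
    O = O.transpose(1, 0, 2).reshape(n, d)    # concatenate heads -> (n, d)
    return O @ Y                              # output projection, (n, d)

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
I = rng.normal(size=(n, d))
W_Q, W_K, W_V, Y = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(I, W_Q, W_K, W_V, Y, h).shape)   # (6, 16)
```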
Scaled Dot Product
• "Scaled dot product" attention aids in training.
• When the dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen:
  O^ℓ = softmax(I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) I WV^ℓ
• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  O^ℓ = softmax( (I WQ^ℓ (WK^ℓ)ᵀ Iᵀ) / √(d/h) ) I WV^ℓ
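With the scaling added, the per-head attention becomes the following small function (a sketch; Q, K, V are assumed to be already split into heads as in the previous snippet):

```python
# Scaled dot-product attention for h heads in a batch: divide the scores by
# sqrt(d/h) before the softmax so they don't grow with the head dimension.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_head = Q.shape[-1]                                    # this is d/h per head
    A = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)          # (h, n, n) scaled scores
    return softmax(A, axis=-1) @ V                          # (h, n, d/h)

rng = np.random.default_rng(0)
h, n, d_head = 4, 6, 8
Q, K, V = (rng.normal(size=(h, n, d_head)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)          # (4, 6, 8)
```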
The Transformer Decoder

• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks:
  • Residual connection ("Add")
  • Layer normalization ("Norm")

[Diagram: Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm), repeated → Linear → Softmax → Output Probabilities]
Residual Connections
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)), so we only have to learn "the residual" from the previous layer.
• The gradient is great through the residual connection; it's 1!
• Bias towards the identity function!

[Figure: loss landscape visualization with and without residuals; Li et al., 2018, on a ResNet]


Layer Normalization
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to unit mean
and standard deviation within each layer.
• LayerNorm’s success may be due to its normalizing gradients [Xu et al., 2019]
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = sqrt( (1/d) Σ_{j=1}^{d} (x_j − μ)^2 ); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = (x − μ) / (σ + ε) ∗ γ + β
  (normalize by the scalar mean and standard deviation, then modulate by the learned element-wise gain and bias)
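A sketch of layer normalization as defined above, together with the "Add & Norm" pattern (residual connection followed by layer normalization) in which it is used; γ, β, and ε follow the definitions above:

```python
# Layer normalization of d-dimensional vectors (applied row-wise),
# followed by the "Add & Norm" pattern: x <- LayerNorm(x + Sublayer(x)).
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)          # scalar mean per vector
    sigma = x.std(axis=-1, keepdims=True)        # scalar std per vector
    return (x - mu) / (sigma + eps) * gamma + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    return layer_norm(x + sublayer_out, gamma, beta)   # residual, then normalize

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(5, d))
gamma, beta = np.ones(d), np.zeros(d)            # learned gain and bias (fixed here)
y = add_and_norm(x, rng.normal(size=(5, d)), gamma, beta)
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))  # ~0 mean, ~1 std per row
```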
The Transformer Decoder

• The Transformer Decoder is a stack of Transformer Decoder Blocks.
• Each Block consists of:
  • Masked Multi-head Self-attention
  • Add & Norm
  • Feed-Forward
  • Add & Norm

[Diagram: Inputs → Input Embeddings + Position Embedding → repeated Decoder Blocks → Linear → Softmax → Output Probabilities]
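Putting the pieces together, here is a compact, illustrative sketch of one decoder block. It is simplified to a single attention head with no dropout and no learned LayerNorm parameters, so treat it as an assumption-laden toy rather than a faithful implementation:

```python
# One simplified Transformer decoder block (single head, no dropout):
# masked self-attention -> Add & Norm -> feed-forward -> Add & Norm.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_block(X, params):
    n, d = X.shape
    # Masked (causal) self-attention with scaled dot-product scores
    Q, K, V = X @ params["W_Q"], X @ params["W_K"], X @ params["W_V"]
    A = Q @ K.T / np.sqrt(d)
    A = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, A)
    attn = softmax(A) @ V
    X = layer_norm(X + attn)                     # Add & Norm
    # Position-wise feed-forward network
    H = np.maximum(0.0, X @ params["W_1"]) @ params["W_2"]
    return layer_norm(X + H)                     # Add & Norm

rng = np.random.default_rng(0)
n, d = 6, 16
params = {k: rng.normal(size=(d, d)) * 0.1 for k in ["W_Q", "W_K", "W_V"]}
params["W_1"] = rng.normal(size=(d, 4 * d)) * 0.1
params["W_2"] = rng.normal(size=(4 * d, d)) * 0.1
X = rng.normal(size=(n, d))
print(decoder_block(X, params).shape)            # (6, 16)
```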
The Transformer Encoder

• The Transformer Decoder constrains attention to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• We use the Transformer Encoder — the ONLY difference is that we remove the masking in the self-attention. No masks!

[Diagram: same stack of blocks as the decoder, but with (unmasked) Multi-head Attention over the Encoder Inputs.]
The Transformer Encoder-Decoder

• More on Encoder-Decoder models will be introduced in the next lecture!
• Right now we only need to know that it processes the source sentence w1, . . . , wt1 with a bidirectional model (Encoder) and generates the target wt1+1, . . . , wt2 with a unidirectional model (Decoder).
• The Transformer Decoder is modified to perform cross-attention to the output of the Encoder.
Cross-Attention

[Diagram: the full encoder-decoder. Encoder: Encoder Inputs → Input Embeddings + Position Embedding → Block of (Multi-head Attention → Add & Norm → Feed-Forward → Add & Norm). Decoder: Decoder Inputs → Input Embeddings + Position Embedding → Block of (Masked Multi-head Attention → Add & Norm → Cross-Attention, taking K and V from the encoder output and Q from the decoder → Add & Norm → Feed-Forward → Add & Norm) → Linear → Softmax → Output Probabilities.]
Cross-Attention Details
• Self-attention: queries, keys, and values come from the same source.
• Cross-attention: keys and values are from the Encoder (like a memory); queries are from the Decoder.
• Let h1, …, hn be output vectors from the Transformer encoder, hi ∈ ℝ^d.
• Let z1, …, zn be input vectors from the Transformer decoder, zi ∈ ℝ^d.
• Keys and values are drawn from the encoder: ki = WK hi, vi = WV hi
• Queries are drawn from the decoder: qi = WQ zi
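A sketch of this wiring: the encoder outputs supply the keys and values, the decoder states supply the queries. The 1/√d scaling and the distinct source/target lengths n and m are my own additions for generality:

```python
# Cross-attention: queries come from the decoder states z_1..z_m, while keys and
# values come from the encoder outputs h_1..h_n (a memory the decoder reads from).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H_enc, Z_dec, W_Q, W_K, W_V):
    K = H_enc @ W_K                  # (n, d) keys from the encoder
    V = H_enc @ W_V                  # (n, d) values from the encoder
    Q = Z_dec @ W_Q                  # (m, d) queries from the decoder
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (m, n)
    return A @ V                     # (m, d): each decoder position reads the encoder

rng = np.random.default_rng(0)
n, m, d = 7, 5, 16
H_enc = rng.normal(size=(n, d))      # encoder outputs h_1..h_n
Z_dec = rng.normal(size=(m, d))      # decoder inputs z_1..z_m
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H_enc, Z_dec, W_Q, W_K, W_V).shape)   # (5, 16)
```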
Transformers: pros and cons
• Easier to capture long-range dependencies: we draw attention between every pair of words!

• Easier to parallelize:
  Q = X WQ,  K = X WK,  V = X WV  (each is a single matrix multiplication over the whole sequence)
• Are positional encodings enough to capture positional information? Otherwise self-attention is an unordered function of its input.

• Quadratic computation in self-attention: can become very slow when the sequence length is large.


Quadratic computation as a function of sequence length

• Q = X WQ is (n × dq), Kᵀ is (dk × n), V is (n × dv): we need to compute n² pairs of scores (dot products), i.e., O(n²d) time.
• RNNs only require O(nd²) running time: ht = g(W ht−1 + U xt + b) (assuming input dimension = hidden dimension = d).
• Max sequence length = 1,024 in GPT-2.
• What if we want to scale to n ≥ 50,000, for example, to work on long documents? A rough calculation follows below.
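To make the quadratic cost concrete, a rough back-of-the-envelope calculation (the byte counts and GB figures are my own arithmetic, not from the slides):

```python
# Rough size of the n x n attention-score matrix per head, per layer, in float32.
bytes_per_score = 4                          # float32

for n in (1_024, 50_000):
    scores = n * n                           # n^2 pairs of dot products
    print(f"n={n}: {scores:,} scores "
          f"(~{scores * bytes_per_score / 1e9:.3f} GB per head per layer)")
# e.g. n=1024 -> ~1M scores (~0.004 GB); n=50000 -> 2.5B scores (~10 GB)
```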


The Revolutionary Impact of Transformers
• Almost all current-day leading language models use Transformer building blocks.
• E.g., GPT1/2/3/4, T5, Llama 1/2, BERT, … almost anything we can name
• Transformer-based models dominate nearly all NLP leaderboards.

• Since Transformers were popularized in language applications, computer vision has also adopted them, e.g., Vision Transformers. [Khan et al., 2021]
What’s next after
Transformers?

