
Introduction to Machine Learning

Transformers: The Brain of GPT Models

MATH 370: Machine Learning


Tanujit Chakraborty
@ Sorbonne
[email protected]
Convolutional networks were ruling AI till 2017. But then…
Transformer: a specific kind of network architecture, like a fancier feedforward network, but based on attention

“Attention is all you need” (Vaswani et al, 2017)


Transformers were originally about Language, not Images

● Each word is a vector, and attention is nothing but a matrix of how much each word is connected to every other word. Mathematically, each entry is the dot product of two word vectors.
Attention
▪ We use attention to “focus” on some part of interest in an input
▪ Other nearby relevant parts help us focus
▪ Other irrelevant parts do not contribute to the process

▪ In sequence modeling problems, we can use attention between input and output
tokens (between encoder and decoder parts), as well as among the inputs only
(only within the encoder part)
Need for Attention
▪ Each layer in standard deep neural nets computes a linear transform + nonlinearity

▪ For 𝑁 inputs, collectively denoting the inputs as 𝑿 ∈ ℝ^(𝑁×𝐾1) and the outputs as 𝑯 ∈ ℝ^(𝑁×𝐾2):

  𝑯 = 𝑔(𝑿𝑾)

  Notation alert: the input 𝑿 can be the data (if 𝑯 denotes the first hidden layer) or the 𝑯 of the previous hidden layer

▪ Here the weights 𝑾 ∈ ℝ^(𝐾1×𝐾2) do not depend on the inputs 𝑿

▪ The output 𝒉𝑛 = 𝑔(𝑾⊤𝒙𝑛) ∈ ℝ^(𝐾2) depends only on 𝒙𝑛 ∈ ℝ^(𝐾1) and pays no attention to 𝒙𝑚, 𝑚 ≠ 𝑛

▪ When different inputs/outputs have inter-dependencies (e.g., they denote representations of words in a sentence, or patches in an image), paying attention to the other inputs is helpful/needed
Attention Mechanism
▪ Don’t define the output 𝒉𝑛 as 𝒉𝑛 = 𝑔(𝑾𝒙𝑛), but as a weighted combination of all the inputs:

  𝒉𝑛 = Σ_{𝑖=1}^{𝑁} 𝛼𝑛𝑖(𝑿) 𝑓(𝒙𝑖) = Σ_{𝑖=1}^{𝑁} 𝛼𝑛𝑖(𝑿) 𝒗𝑖

  Here 𝛼𝑛𝑖(𝑿) is the attention score (to be learned), which tells us how much input 𝒙𝑖 should attend to the output 𝒉𝑛 (i.e., how much 𝒙𝑖 should be used to compute 𝒉𝑛), and 𝒗𝑖 = 𝑓(𝒙𝑖) is the "value" vector of input 𝒙𝑖

▪ The attention scores 𝛼𝑛𝑖(𝑿) and the "value" 𝒗𝑖 = 𝑓(𝒙𝑖) of 𝒙𝑖 can be defined in various ways, typically via three projections of the inputs:

  𝑸 = 𝑿𝑾𝑄  (𝑁 × 𝐾 matrix of "queries"; row 𝑛 is 𝒒𝑛, which will be used to compute the attention scores for 𝒉𝑛)
  𝑲 = 𝑿𝑾𝐾  (𝑁 × 𝐾 matrix of "keys"; each query 𝒒𝑛 is compared with each of the 𝑁 keys)
  𝑽 = 𝑿𝑾𝑉  (𝑁 × 𝐾 matrix of the 𝑁 "value" vectors)

▪ One popular way to define the attention scores:

  𝛼𝑛𝑖(𝑿) = exp(𝒒𝑛⊤𝒌𝑖) / Σ_{𝑗=1}^{𝑁} exp(𝒒𝑛⊤𝒌𝑗)

▪ The attention mechanism (especially self-attention) is used in transformers


Attention Mechanism
[Figure: for each input 𝒙𝑖, a query 𝒒𝑖, key 𝒌𝑖, and value 𝒗𝑖 are computed via 𝑸 = 𝑿𝑾𝑄, 𝑲 = 𝑿𝑾𝐾, 𝑽 = 𝑿𝑾𝑉; the output 𝒉𝑛 is the combination of the value vectors weighted by the attention scores 𝛼𝑛,𝑖.]


RNN with Attention
▪ RNNs have also been augmented with attention to help remember the distant past

▪ Attention mechanism for a bi-directional RNN encoder-decoder model: each encoder state 𝒉𝑖 concatenates the forward and backward hidden states, so its computation depends on the embeddings of all (past/future) tokens

▪ The decoder's context vector at step 𝑡 is an attention-weighted sum of the encoder states:

  𝒄𝑡 = Σ_{𝑖=1}^{𝑇} 𝛼𝑡,𝑖 𝒉𝑖

Pic source: https://matthewmcateer.me/blog/getting-started-with-attention-for-classification/


Self-Attention
▪ With self-attention, each token 𝒙𝑛 can “attend to” all other tokens of the same
sequence when computing this token’s embedding 𝒉𝑛
[Figure: two copies of the same input sentence, differing only in the last word (tired → wide). The highlighting shows how much the word "it" is being "attended to" by the other words in the input sequence; note how the attention changes when the last word changes.]
Example credit: https://blog.research.google/2017/08/transformer-novel-neural-network.html

▪ Attention helps capture the context better and in a much more "global" manner
▪ "Global": long ranges are captured, and in both directions (backward and forward)
Self-Attention
▪ Each token is provided in the form of a 𝐷-dimensional embedding (e.g., word2vec), so for an 𝑁-length sequence the input layer gives an 𝑁 × 𝐷 matrix 𝑿 of original embeddings

▪ The attention scores for each token 𝒙𝑛 are computed using:
  ▪ A query vector 𝒒𝑛 associated with that token: 𝑸 = 𝑿𝑾𝑄 is the 𝑁 × 𝑑 matrix of "queries" (row 𝑛 is 𝒒𝑛), obtained by a linear projection with a learnable 𝐷 × 𝑑 matrix 𝑾𝑄
  ▪ 𝑁 key vectors 𝑲 = {𝒌1, 𝒌2, …, 𝒌𝑁} (one per token): 𝑲 = 𝑿𝑾𝐾 is the 𝑁 × 𝑑 matrix of keys, with a learnable 𝐷 × 𝑑 matrix 𝑾𝐾 (keys assumed to have the same size 𝑑 as the queries)
  ▪ 𝑁 value vectors 𝑽 = {𝒗1, 𝒗2, …, 𝒗𝑁} associated with the keys: 𝑽 = 𝑿𝑾𝑉 is the 𝑁 × 𝑣 matrix of "value" vectors, with a learnable 𝐷 × 𝑣 matrix 𝑾𝑉

▪ One way to compute the attention score (dot-product attention; query and key assumed 𝑑-dimensional) of how much token 𝑖 attends to token 𝑛:

  𝛼𝑛,𝑖 = exp(𝒒𝑛⊤𝒌𝑖) / Σ_{𝑗=1}^{𝑁} exp(𝒒𝑛⊤𝒌𝑗)

▪ Given the attention scores, the encoder's hidden state for 𝒙𝑛 is the attention-weighted sum of the value vectors of all the tokens in the sequence (thus the encoding of 𝒙𝑛 depends on all the tokens in the sequence):

  𝒉𝑛 = Σ_{𝑖=1}^{𝑁} 𝛼𝑛,𝑖 𝒗𝑖

▪ In matrix form, with 𝑸 and 𝑲 of size 𝑁 × 𝑑 and 𝑽 of size 𝑁 × 𝑣, the 𝑁 × 𝑣 output is

  𝑯 = softmax(𝑸𝑲⊤ / √𝑑) 𝑽

  Dividing by √𝑑 ensures the variance of the dot products is 1 ("scaled" dot-product attention)
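To make the formulas concrete, here is a minimal NumPy sketch of (single-head) scaled dot-product self-attention. The names Q, K, V and the dimensions N, D, d follow the slide; the random weight matrices and toy sizes are placeholder assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: H = softmax(Q K^T / sqrt(d)) V."""
    Q = X @ W_Q                               # (N, d) queries
    K = X @ W_K                               # (N, d) keys
    V = X @ W_V                               # (N, v) values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N) scaled dot products
    alpha = softmax(scores, axis=-1)          # attention weights; each row sums to 1
    return alpha @ V                          # (N, v) attention-weighted sum of values

# Toy usage: N = 5 tokens with D = 8 dimensional embeddings, d = v = 4
rng = np.random.default_rng(0)
N, D, d = 5, 8, 4
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (5, 4)
```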
Query (Q), Keys (K), and Values (V) in Language

● Attention matches Queries (questions) against Keys (answers); the attention scores express how related they are.

● The Values give new embeddings of the words, combined according to the relationship between the Queries and the Keys.
Transformers
▪ Transformers also use the idea of attention
▪ “self-attention” over all input tokens
▪ “self-attention” over each output token and previous tokens
▪ “cross-attention” between output tokens and input tokens
▪ Transformers also compute the embeddings of all tokens in parallel

▪ Transformers are based on the following key ideas


▪ “Self-attention” and “cross-attention” for computing the hidden
states
▪ Positional encoding
▪ Residual connections

▪ Attention helps capture the context better and in a much more "global" manner in sequence data
Positional Encoding
▪ Transformers also need a “positional encoding” for each token of the input since they
don’t process the tokens sequentially (unlike RNNs)
▪ Let 𝒑𝑖 ∈ ℝ^𝑑 be the positional encoding for location 𝑖. One way to define it (here 𝐶 denotes the maximum possible length of a sequence; note the smooth transition as the position index changes):

  𝑝𝑖,2𝑗 = sin(𝑖 / 𝐶^(2𝑗/𝑑)),  𝑝𝑖,2𝑗+1 = cos(𝑖 / 𝐶^(2𝑗/𝑑))

▪ For example, the positional encoding vector for location 𝑖 assuming 𝑑 = 4 is

  𝒑𝑖 = [ sin(𝑖 / 𝐶^(0/4)), cos(𝑖 / 𝐶^(0/4)), sin(𝑖 / 𝐶^(2/4)), cos(𝑖 / 𝐶^(2/4)) ]

▪ Given the positional encodings, we add them to the token embeddings:

  𝒙̂𝑖 = 𝒙𝑖 + 𝒑𝑖
▪ The above positional encoding is pre-defined but can also be learned
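A short NumPy sketch of the sinusoidal encoding above; the constant C = 10000 (the value used in Vaswani et al., 2017) and the toy sizes are assumptions for illustration, and the code assumes an even embedding dimension d.

```python
import numpy as np

def positional_encoding(max_len, d, C=10000.0):
    """p[i, 2j] = sin(i / C^(2j/d)), p[i, 2j+1] = cos(i / C^(2j/d)); assumes d is even."""
    P = np.zeros((max_len, d))
    i = np.arange(max_len)[:, None]        # positions 0, ..., max_len - 1
    denom = C ** (np.arange(0, d, 2) / d)  # C^(2j/d) for j = 0, 1, ...
    P[:, 0::2] = np.sin(i / denom)         # even dimensions get sin
    P[:, 1::2] = np.cos(i / denom)         # odd dimensions get cos
    return P

# Add the encodings to the token embeddings X (shape (N, d)): X_hat = X + P[:N]
print(positional_encoding(max_len=3, d=4))   # encodings for positions 0, 1, 2 with d = 4
```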
Zooming into the encoder and the decoder…

▪ An Encoder Block (N such blocks): Self-Attention Layer → Layer Normalization → FF → Layer Normalization. The blue arrows in the figure are residual connections.
▪ Each FF (feed-forward) is usually a linear layer + ReLU nonlinearity + another linear layer. The FF operation is applied "position-wise" (to each token separately), but all FF blocks in the layer have the same weights.
▪ Layer normalization is used rather than batch normalization (batch normalization is difficult since different input sequences can be of different lengths).
▪ The input ("source") token embeddings are fixed in the first layer (obtained from an input embedding table) and learned for subsequent layers.
▪ A Decoder Block (N such blocks): Masked Self-Attention Layer → Layer Normalization → Cross-Attention Layer (connected with the corresponding encoder block) → Layer Normalization → FF → Layer Normalization.
▪ Decoder's output layer: like the FF, a position-wise linear operation with weight matrix 𝑾 of size 𝑉 × 𝐷 (where 𝑉 is the vocab size and 𝐷 is the dimensionality of the last decoder block's embeddings 𝑡𝑚^(𝑁)), followed by a softmax. The most likely output token at step 𝑚 is

  𝑡̂𝑚 = argmax_{𝑖=1,…,𝑉} softmax(𝑾 𝑡𝑚^(𝑁))

▪ Generation is "auto-regressive": the output 𝑡̂𝑚−1 from the previous step (𝑚 − 1) is fed in as the decoder input at step 𝑚. Note the one-position shift between the decoder's input and its output ("target") token embeddings. 𝑡𝑠 is a special start-of-sequence (SOS) token and 𝑡𝑒 is a special end-of-sequence (EOS) token.
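The encoder description above can be summarized in a short NumPy sketch of one encoder block (self-attention → add & layer-norm → position-wise feed-forward → add & layer-norm). All weight shapes and the omission of learned layer-norm scale/shift parameters are simplifying assumptions, not the exact configuration of any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean / unit variance (learned scale and shift omitted)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(X, W_Q, W_K, W_V, W1, b1, W2, b2):
    """One encoder block: self-attention + residual + layer norm, then position-wise FF + residual + layer norm."""
    # Self-attention sub-layer (here W_V keeps the width of X so the residual connection type-checks)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    X = layer_norm(X + A @ V)                      # residual connection + layer normalization
    # Position-wise FF: linear -> ReLU -> linear, applied to each token separately with shared weights
    FF = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return layer_norm(X + FF)                      # residual connection + layer normalization

# Toy usage: D = 8-dim tokens, FF hidden width 16, random placeholder weights
rng = np.random.default_rng(1)
N, D, F = 5, 8, 16
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
W1, b1, W2, b2 = rng.normal(size=(D, F)), np.zeros(F), rng.normal(size=(F, D)), np.zeros(D)
print(encoder_block(X, W_Q, W_K, W_V, W1, b1, W2, b2).shape)   # (5, 8)
```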
Multi-head Attention (MHA)
▪ A single attention function can capture only one notion of similarity
▪ Transformers therefore use multi-head attention (MHA)
[Figure: single attention vs MHA. A single attention head maps (Query, Keys, Values) to one output; MHA runs several attention heads in parallel on the same (Query, Keys, Values), then concatenates their outputs and projects them to produce the final output.]

Pic credit: https://huggingface.co/learn/nlp-course/chapter1/
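To illustrate the figure, here is a compact NumPy sketch of multi-head attention: several heads run scaled dot-product attention in parallel, and their outputs are concatenated and projected. The number of heads and all weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head; W_O projects the concatenated head outputs."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product attention inside each head
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs, axis=-1) @ W_O # concatenate heads, then project

# Toy usage: 2 heads, D = 8, per-head dimension d = 4
rng = np.random.default_rng(2)
N, D, d, n_heads = 5, 8, 4, 2
X = rng.normal(size=(N, D))
heads = [tuple(rng.normal(size=(D, d)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d, D))
print(multi_head_attention(X, heads, W_O).shape)  # (5, 8)
```

Because each head has its own query/key/value projections, each can capture a different notion of similarity.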


(Masked) Multi-head Attention (MHA)

Here, on the output (decoder) side, we use "masked" MHA because, during output generation, we don't want to look at future tokens.

Pic source: "Attention is all you need" (Vaswani et al, 2017)
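One common way to implement this masking (a minimal sketch, assuming the mask is applied additively to the attention scores before the softmax): scores for future positions j > i are set to −∞, so those positions receive zero attention weight.

```python
import numpy as np

def causal_mask(scores):
    """Set attention scores for future positions (column j > row i) to -inf."""
    N = scores.shape[-1]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True strictly above the diagonal
    return np.where(future, -np.inf, scores)

scores = np.zeros((4, 4))                  # uniform scores over 4 tokens
masked = causal_mask(scores)
weights = np.exp(masked)
weights /= weights.sum(-1, keepdims=True)  # row-wise softmax; exp(-inf) = 0
print(np.round(weights, 2))                # row i spreads its attention only over tokens 0..i
```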
Computing Attention Efficiently…
▪ The standard attention mechanism is inefficient for large sequences
  𝑯 = softmax(𝑸𝑲⊤ / √𝑑) 𝑽  requires 𝑂(𝑇²) storage and computation cost for a length-𝑇 sequence

▪ Many ways to make it more efficient, e.g.,
  ▪ Sparse attention: each token attends only to a subset of the other tokens
  ▪ Linearized attention: exp(𝑸𝑲⊤) ≈ 𝜙(𝑸)𝜙(𝑲)⊤, where 𝜙 is a nonlinear projection based on kernels (e.g., a kernel random features projection)
Pic source: A Survey of Transformers (Lin et al, 2021)
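As an illustration of the linearized-attention idea, here is a minimal NumPy sketch using the simple feature map φ(x) = elu(x) + 1 (a choice used in parts of the linear-attention literature) instead of kernel random features; the point is that φ(Q)(φ(K)⊤V) can be computed in O(T) without ever forming the T × T attention matrix. The shapes and the feature map are assumptions for illustration, not the exact method shown in the survey figure.

```python
import numpy as np

def phi(x):
    """Simple positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """H ~= phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1): linear in the sequence length T."""
    pQ, pK = phi(Q), phi(K)
    KV = pK.T @ V                           # (d, v): summarize keys and values once
    normalizer = pQ @ pK.sum(axis=0)        # (T,): phi(q_n)^T sum_j phi(k_j)
    return (pQ @ KV) / normalizer[:, None]  # (T, v) outputs; no T x T matrix is ever built

# Toy shape check: T = 6 tokens, d = v = 4
rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)      # (6, 4)
```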


Attention in Images: An Image is Worth 16x16 Words

● The image is divided into patches. Each such patch acts as a "word", similar to language.
1 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al, 2020)
Attention in Images: An Image is Worth 16x16 Words

● Each patch is represented by a vector that contains both image information and position information.
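A minimal sketch of turning an image into a sequence of patch tokens as described above; the 224 × 224 image, 16 × 16 patches, and 64-dimensional embeddings are illustrative assumptions in the spirit of ViT, not its exact configuration.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, d=64, rng=np.random.default_rng(4)):
    """Split an (H, W, C) image into patch x patch patches, flatten each, and linearly project to d dims."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)   # (num_patches, patch*patch*C)
    W_E = rng.normal(size=(patch * patch * C, d))                   # patch-embedding matrix (learnable; random here)
    tokens = p @ W_E                                                # each patch becomes a d-dim "word"
    positions = np.arange(len(tokens))                              # position info, to be added via positional encodings
    return tokens, positions

tokens, positions = image_to_patch_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)   # (196, 64): 14 x 14 patches of size 16 x 16
```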
Popular Transformer Variants: BERT and GPT
▪ The standard transformer architecture is an encoder-decoder model
▪ Some models use just the encoder or the decoder of the transformer
▪ BERT (Bidirectional Encoder Representations from Transformers)
▪ Basic BERT can be learned to encode token sequences
▪ GPT (Generative Pretrained Transformer)
▪ Basic GPT can be used to generate token sequences similar to its training data

[Figure: BERT (encoder) vs GPT (decoder)]
▪ BERT: a transformer which contains only the encoder. It is trained unsupervisedly using a missing-token prediction objective (the figure shows a missing token which BERT tries to predict; the first input is just a start-of-sentence token). This encoder can be used for other tasks by fine-tuning.
▪ GPT: a transformer which contains only the decoder (also, no cross-attention since there is no encoder). It is pre-trained using a next-token prediction objective.
Transformers for Images: ViT
▪ Transformers can be used for images as well1. For image classification, it looks like this
▪ Only the encoder part of the transformer is needed; on the output side, we just need an MLP with a softmax output

▪ Treat image patches as tokens of a sequence

▪ Also use the position information

▪ Early work showed ViT can outperform CNNs given a very large amount of training data
▪ However, recent work² has shown that good old CNNs still rule! ViT and CNNs perform comparably at scale, i.e., when both are given a large amount of compute and training data
² ConvNets Match Vision Transformers at Scale (Smith et al, 2023)
Exercise: Transformer

https://www.byhand.ai/
Any questions?

Readings for you:


▪ Deep Learning book
▪ Nice Demo from Stanford CS-231 Course
▪ Dive into Deep Learning (Very Useful)
▪ Special thanks to Daniel Jurafsky, Piyush Rai, and Srijit Mukherjee
– I adapted some of their slides available online.
