
Introduction to Machine Learning

Transformers: The Brain of GPT Models

MATH 370: Machine Learning


Tanujit Chakraborty
@ Sorbonne
[email protected]
Convolutional networks were ruling AI till 2017. But then…
Transformer: a specific kind of network architecture, like a fancier feedforward network, but based on attention

“Attention is all you need” (Vaswani et al, 2017)


Transformers were originally about Language, not Images

● Each word is a vector, and attention is nothing but a matrix of how much each word is connected to every other word. Mathematically, each entry is the dot product of two word vectors.
Attention
▪ We use attention to “focus” on some part of interest in an input
▪ Other nearby relevant parts help us focus
▪ Other irrelevant parts do not contribute to the process

▪ In sequence modeling problems, we can use attention between input and output
tokens (between encoder and decoder parts), as well as among the inputs only
(only within the encoder part)
Need for Attention
▪ Each layer in standard deep neural nets computes a linear transform + nonlinearity

▪ For 𝑁 inputs, collectively denoting the inputs as 𝑿 ∈ ℝ^(𝑁×𝐾1) and the outputs as 𝑯 ∈ ℝ^(𝑁×𝐾2):

  𝑯 = 𝑔(𝑿𝑾)

  Notation alert: the input 𝑿 can be the data (if 𝑯 denotes the first hidden layer) or the 𝑯 of the previous hidden layer

▪ Here the weights 𝑾 ∈ ℝ^(𝐾1×𝐾2) do not depend on the inputs 𝑿

▪ The output 𝒉𝑛 = 𝑔(𝑾⊤𝒙𝑛) ∈ ℝ^(𝐾2) depends only on 𝒙𝑛 ∈ ℝ^(𝐾1) and pays no attention to 𝒙𝑚, 𝑚 ≠ 𝑛

▪ When different inputs/outputs have inter-dependencies (e.g., they denote representations of words in a sentence, or patches in an image), paying attention to the other inputs is helpful/needed
Attention Mechanism
▪ Don’t define the output 𝒉𝑛 as 𝒉𝑛 = 𝑔(𝑾𝒙𝑛), but as a weighted combination of all the inputs:

  𝒉𝑛 = Σ_{𝑖=1}^{𝑁} 𝛼𝑛𝑖(𝑿) 𝑓(𝒙𝑖) = Σ_{𝑖=1}^{𝑁} 𝛼𝑛𝑖(𝑿) 𝒗𝑖

  Here 𝛼𝑛𝑖(𝑿) is the attention score (to be learned), which tells us how much input 𝒙𝑖 should attend to the output 𝒉𝑛 (i.e., how much 𝒙𝑖 should be used to compute 𝒉𝑛), and 𝒗𝑖 = 𝑓(𝒙𝑖) is the "value" vector of input 𝒙𝑖

▪ The attention scores 𝛼𝑛𝑖(𝑿) and the "value" 𝒗𝑖 = 𝑓(𝒙𝑖) of 𝒙𝑖 can be defined in various ways, typically via three projections of the inputs:

  𝑸 = 𝑿𝑾𝑄  (𝑁 × 𝐾 matrix of "queries"; row 𝑛 is 𝒒𝑛, which will be used to compute the attention scores for 𝒉𝑛)
  𝑲 = 𝑿𝑾𝐾  (𝑁 × 𝐾 matrix of "keys"; each query 𝒒𝑛 is compared with each of the 𝑁 keys)
  𝑽 = 𝑿𝑾𝑉  (𝑁 × 𝐾 matrix of the 𝑁 "value" vectors)

▪ One popular way to define the attention scores:

  𝛼𝑛𝑖(𝑿) = exp(𝒒𝑛⊤𝒌𝑖) / Σ_{𝑗=1}^{𝑁} exp(𝒒𝑛⊤𝒌𝑗)

▪ The attention mechanism (especially self-attention) is used in transformers


Attention Mechanism
[Figure: for each input 𝒙𝑖, a query 𝒒𝑖, key 𝒌𝑖, and value 𝒗𝑖 are computed via 𝑸 = 𝑿𝑾𝑄, 𝑲 = 𝑿𝑾𝐾, 𝑽 = 𝑿𝑾𝑉; the output 𝒉𝑛 is the combination of the value vectors weighted by the attention scores 𝛼𝑛,𝑖.]


RNN with Attention
▪ RNNs have also been augmented with attention to help remember the distant past

▪ Attention mechanism for a bi-directional RNN encoder-decoder model: each encoder state 𝒉𝑖 concatenates the forward and backward hidden states, so its computation depends on the embeddings of all (past/future) tokens

▪ The decoder's context vector at step 𝑡 is an attention-weighted sum of the encoder states:

  𝒄𝑡 = Σ_{𝑖=1}^{𝑇} 𝛼𝑡,𝑖 𝒉𝑖

Pic source: https://matthewmcateer.me/blog/getting-started-with-attention-for-classification/


Self-Attention
▪ With self-attention, each token 𝒙𝑛 can “attend to” all other tokens of the same
sequence when computing this token’s embedding 𝒉𝑛
[Figure: two copies of the same input sentence, differing only in the last word (tired → wide). The highlighting shows how much the word "it" is being "attended to" by the other words in the input sequence; note how the attention changes when the last word changes.]
Example credit: https://blog.research.google/2017/08/transformer-novel-neural-network.html

▪ Attention helps capture the context better and in a much more "global" manner
▪ "Global": long ranges are captured, and in both directions (backward and forward)
Self-Attention
▪ Each token is provided in the form of a 𝐷-dimensional embedding (e.g., word2vec), so for an 𝑁-length sequence the input layer gives an 𝑁 × 𝐷 matrix 𝑿 of original embeddings

▪ The attention scores for each token 𝒙𝑛 are computed using:
  ▪ A query vector 𝒒𝑛 associated with that token: 𝑸 = 𝑿𝑾𝑄 is the 𝑁 × 𝑑 matrix of "queries" (row 𝑛 is 𝒒𝑛), obtained by a linear projection with a learnable 𝐷 × 𝑑 matrix 𝑾𝑄
  ▪ 𝑁 key vectors 𝑲 = {𝒌1, 𝒌2, …, 𝒌𝑁} (one per token): 𝑲 = 𝑿𝑾𝐾 is the 𝑁 × 𝑑 matrix of keys, with a learnable 𝐷 × 𝑑 matrix 𝑾𝐾 (keys assumed to have the same size 𝑑 as the queries)
  ▪ 𝑁 value vectors 𝑽 = {𝒗1, 𝒗2, …, 𝒗𝑁} associated with the keys: 𝑽 = 𝑿𝑾𝑉 is the 𝑁 × 𝑣 matrix of "value" vectors, with a learnable 𝐷 × 𝑣 matrix 𝑾𝑉

▪ One way to compute the attention score (dot-product attention; query and key assumed 𝑑-dimensional) of how much token 𝑖 attends to token 𝑛:

  𝛼𝑛,𝑖 = exp(𝒒𝑛⊤𝒌𝑖) / Σ_{𝑗=1}^{𝑁} exp(𝒒𝑛⊤𝒌𝑗)

▪ Given the attention scores, the encoder's hidden state for 𝒙𝑛 is the attention-weighted sum of the value vectors of all the tokens in the sequence (thus the encoding of 𝒙𝑛 depends on all the tokens in the sequence):

  𝒉𝑛 = Σ_{𝑖=1}^{𝑁} 𝛼𝑛,𝑖 𝒗𝑖

▪ In matrix form, with 𝑸 and 𝑲 of size 𝑁 × 𝑑 and 𝑽 of size 𝑁 × 𝑣, the 𝑁 × 𝑣 output is

  𝑯 = softmax(𝑸𝑲⊤ / √𝑑) 𝑽

  Dividing by √𝑑 ensures the variance of the dot products is 1 ("scaled" dot-product attention)
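To make the formulas concrete, here is a minimal NumPy sketch of (single-head) scaled dot-product self-attention. The names Q, K, V and the dimensions N, D, d follow the slide; the random weight matrices and toy sizes are placeholder assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: H = softmax(Q K^T / sqrt(d)) V."""
    Q = X @ W_Q                               # (N, d) queries
    K = X @ W_K                               # (N, d) keys
    V = X @ W_V                               # (N, v) values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N) scaled dot products
    alpha = softmax(scores, axis=-1)          # attention weights; each row sums to 1
    return alpha @ V                          # (N, v) attention-weighted sum of values

# Toy usage: N = 5 tokens with D = 8 dimensional embeddings, d = v = 4
rng = np.random.default_rng(0)
N, D, d = 5, 8, 4
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (5, 4)
```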
Query (Q), Keys (K), and Values (V) in Language

● Attention matches Queries (questions) against Keys (answers); the attention scores express how related they are.

● The Values give new embeddings of the words, combined according to the relationship between the Queries and the Keys.
Transformers
▪ Transformers also use the idea of attention
▪ “self-attention” over all input tokens
▪ “self-attention” over each output token and previous tokens
▪ “cross-attention” between output tokens and input tokens
▪ Transformers also compute the embeddings of all tokens in parallel

▪ Transformers are based on the following key ideas


▪ “Self-attention” and “cross-attention” for computing the hidden
states
▪ Positional encoding
▪ Residual connections

▪ Attention helps capture the context better and in a much more "global" manner in sequence data
Positional Encoding
▪ Transformers also need a “positional encoding” for each token of the input since they
don’t process the tokens sequentially (unlike RNNs)
▪ Let 𝒑𝑖 ∈ ℝ^𝑑 be the positional encoding for location 𝑖. One way to define it (here 𝐶 denotes the maximum possible length of a sequence; note the smooth transition as the position index changes):

  𝑝𝑖,2𝑗 = sin(𝑖 / 𝐶^(2𝑗/𝑑)),  𝑝𝑖,2𝑗+1 = cos(𝑖 / 𝐶^(2𝑗/𝑑))

▪ For example, the positional encoding vector for location 𝑖 assuming 𝑑 = 4 is

  𝒑𝑖 = [ sin(𝑖 / 𝐶^(0/4)), cos(𝑖 / 𝐶^(0/4)), sin(𝑖 / 𝐶^(2/4)), cos(𝑖 / 𝐶^(2/4)) ]

▪ Given the positional encodings, we add them to the token embeddings:

  𝒙̂𝑖 = 𝒙𝑖 + 𝒑𝑖
▪ The above positional encoding is pre-defined but can also be learned
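A short NumPy sketch of the sinusoidal encoding above; the constant C = 10000 (the value used in Vaswani et al., 2017) and the toy sizes are assumptions for illustration, and the code assumes an even embedding dimension d.

```python
import numpy as np

def positional_encoding(max_len, d, C=10000.0):
    """p[i, 2j] = sin(i / C^(2j/d)), p[i, 2j+1] = cos(i / C^(2j/d)); assumes d is even."""
    P = np.zeros((max_len, d))
    i = np.arange(max_len)[:, None]        # positions 0, ..., max_len - 1
    denom = C ** (np.arange(0, d, 2) / d)  # C^(2j/d) for j = 0, 1, ...
    P[:, 0::2] = np.sin(i / denom)         # even dimensions get sin
    P[:, 1::2] = np.cos(i / denom)         # odd dimensions get cos
    return P

# Add the encodings to the token embeddings X (shape (N, d)): X_hat = X + P[:N]
print(positional_encoding(max_len=3, d=4))   # encodings for positions 0, 1, 2 with d = 4
```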
Zooming into the encoder and the decoder…

▪ An Encoder Block (N such blocks): Self-Attention Layer → Layer Normalization → FF → Layer Normalization. The blue arrows in the figure are residual connections.
▪ Each FF (feed-forward) is usually a linear layer + ReLU nonlinearity + another linear layer. The FF operation is applied "position-wise" (to each token separately), but all FF blocks in the layer have the same weights.
▪ Layer normalization is used rather than batch normalization (batch normalization is difficult since different input sequences can be of different lengths).
▪ The input ("source") token embeddings are fixed in the first layer (obtained from an input embedding table) and learned for subsequent layers.
▪ A Decoder Block (N such blocks): Masked Self-Attention Layer → Layer Normalization → Cross-Attention Layer (connected with the corresponding encoder block) → Layer Normalization → FF → Layer Normalization.
▪ Decoder's output layer: like the FF, a position-wise linear operation with weight matrix 𝑾 of size 𝑉 × 𝐷 (where 𝑉 is the vocab size and 𝐷 is the dimensionality of the last decoder block's embeddings 𝑡𝑚^(𝑁)), followed by a softmax. The most likely output token at step 𝑚 is

  𝑡̂𝑚 = argmax_{𝑖=1,…,𝑉} softmax(𝑾 𝑡𝑚^(𝑁))

▪ Generation is "auto-regressive": the output 𝑡̂𝑚−1 from the previous step (𝑚 − 1) is fed in as the decoder input at step 𝑚. Note the one-position shift between the decoder's input and its output ("target") token embeddings. 𝑡𝑠 is a special start-of-sequence (SOS) token and 𝑡𝑒 is a special end-of-sequence (EOS) token.
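The encoder description above can be summarized in a short NumPy sketch of one encoder block (self-attention → add & layer-norm → position-wise feed-forward → add & layer-norm). All weight shapes and the omission of learned layer-norm scale/shift parameters are simplifying assumptions, not the exact configuration of any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean / unit variance (learned scale and shift omitted)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(X, W_Q, W_K, W_V, W1, b1, W2, b2):
    """One encoder block: self-attention + residual + layer norm, then position-wise FF + residual + layer norm."""
    # Self-attention sub-layer (here W_V keeps the width of X so the residual connection type-checks)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    X = layer_norm(X + A @ V)                      # residual connection + layer normalization
    # Position-wise FF: linear -> ReLU -> linear, applied to each token separately with shared weights
    FF = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return layer_norm(X + FF)                      # residual connection + layer normalization

# Toy usage: D = 8-dim tokens, FF hidden width 16, random placeholder weights
rng = np.random.default_rng(1)
N, D, F = 5, 8, 16
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
W1, b1, W2, b2 = rng.normal(size=(D, F)), np.zeros(F), rng.normal(size=(F, D)), np.zeros(D)
print(encoder_block(X, W_Q, W_K, W_V, W1, b1, W2, b2).shape)   # (5, 8)
```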
Multi-head Attention (MHA)
▪ A single attention function can capture only one notion of similarity
▪ Transformers therefore use multi-head attention (MHA)
[Figure: single attention vs MHA. A single attention head maps (Query, Keys, Values) to one output; MHA runs several attention heads in parallel on the same (Query, Keys, Values), then concatenates their outputs and projects them to produce the final output.]

Pic credit: https://huggingface.co/learn/nlp-course/chapter1/
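To illustrate the figure, here is a compact NumPy sketch of multi-head attention: several heads run scaled dot-product attention in parallel, and their outputs are concatenated and projected. The number of heads and all weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head; W_O projects the concatenated head outputs."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product attention inside each head
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs, axis=-1) @ W_O # concatenate heads, then project

# Toy usage: 2 heads, D = 8, per-head dimension d = 4
rng = np.random.default_rng(2)
N, D, d, n_heads = 5, 8, 4, 2
X = rng.normal(size=(N, D))
heads = [tuple(rng.normal(size=(D, d)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d, D))
print(multi_head_attention(X, heads, W_O).shape)  # (5, 8)
```

Because each head has its own query/key/value projections, each can capture a different notion of similarity.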


(Masked) Multi-head Attention (MHA)

Here, on the output (decoder) side, we use "masked" MHA because, during output generation, we don't want to look at future tokens.

Pic source: "Attention is all you need" (Vaswani et al, 2017)
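One common way to implement this masking (a minimal sketch, assuming the mask is applied additively to the attention scores before the softmax): scores for future positions j > i are set to −∞, so those positions receive zero attention weight.

```python
import numpy as np

def causal_mask(scores):
    """Set attention scores for future positions (column j > row i) to -inf."""
    N = scores.shape[-1]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True strictly above the diagonal
    return np.where(future, -np.inf, scores)

scores = np.zeros((4, 4))                  # uniform scores over 4 tokens
masked = causal_mask(scores)
weights = np.exp(masked)
weights /= weights.sum(-1, keepdims=True)  # row-wise softmax; exp(-inf) = 0
print(np.round(weights, 2))                # row i spreads its attention only over tokens 0..i
```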
Computing Attention Efficiently…
▪ The standard attention mechanism is inefficient for large sequences
  𝑯 = softmax(𝑸𝑲⊤ / √𝑑) 𝑽  requires 𝑂(𝑇²) storage and computation cost for a length-𝑇 sequence

▪ Many ways to make it more efficient, e.g.,
  ▪ Sparse attention: each token attends only to a subset of the other tokens
  ▪ Linearized attention: exp(𝑸𝑲⊤) ≈ 𝜙(𝑸)𝜙(𝑲)⊤, where 𝜙 is a nonlinear projection based on kernels (e.g., a kernel random features projection)
Pic source: A Survey of Transformers (Lin et al, 2021)
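As an illustration of the linearized-attention idea, here is a minimal NumPy sketch using the simple feature map φ(x) = elu(x) + 1 (a choice used in parts of the linear-attention literature) instead of kernel random features; the point is that φ(Q)(φ(K)⊤V) can be computed in O(T) without ever forming the T × T attention matrix. The shapes and the feature map are assumptions for illustration, not the exact method shown in the survey figure.

```python
import numpy as np

def phi(x):
    """Simple positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """H ~= phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1): linear in the sequence length T."""
    pQ, pK = phi(Q), phi(K)
    KV = pK.T @ V                           # (d, v): summarize keys and values once
    normalizer = pQ @ pK.sum(axis=0)        # (T,): phi(q_n)^T sum_j phi(k_j)
    return (pQ @ KV) / normalizer[:, None]  # (T, v) outputs; no T x T matrix is ever built

# Toy shape check: T = 6 tokens, d = v = 4
rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)      # (6, 4)
```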


Attention in Images: An Image is Worth 16x16 Words

● The image is divided into patches. Each such patch acts as a "word", similar to language.
1 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al, 2020)
Attention in Images: An Image is Worth 16x16 Words

● Each patch is represented by a vector that contains both image information and position information.
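A minimal sketch of turning an image into a sequence of patch tokens as described above; the 224 × 224 image, 16 × 16 patches, and 64-dimensional embeddings are illustrative assumptions in the spirit of ViT, not its exact configuration.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, d=64, rng=np.random.default_rng(4)):
    """Split an (H, W, C) image into patch x patch patches, flatten each, and linearly project to d dims."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)   # (num_patches, patch*patch*C)
    W_E = rng.normal(size=(patch * patch * C, d))                   # patch-embedding matrix (learnable; random here)
    tokens = p @ W_E                                                # each patch becomes a d-dim "word"
    positions = np.arange(len(tokens))                              # position info, to be added via positional encodings
    return tokens, positions

tokens, positions = image_to_patch_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)   # (196, 64): 14 x 14 patches of size 16 x 16
```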
Popular Transformer Variants: BERT and GPT
▪ The standard transformer architecture is an encoder-decoder model
▪ Some models use just the encoder or the decoder of the transformer
▪ BERT (Bidirectional Encoder Representations from Transformers)
▪ Basic BERT can be learned to encode token sequences
▪ GPT (Generative Pretrained Transformer)
▪ Basic GPT can be used to generate token sequences similar to its training data

[Figure: BERT (encoder) vs GPT (decoder)]
▪ BERT: a transformer which contains only the encoder. It is trained unsupervisedly using a missing-token prediction objective (the figure shows a missing token which BERT tries to predict; the first input is just a start-of-sentence token). This encoder can be used for other tasks by fine-tuning.
▪ GPT: a transformer which contains only the decoder (also, no cross-attention since there is no encoder). It is pre-trained using a next-token prediction objective.
Transformers for Images: ViT
▪ Transformers can be used for images as well1. For image classification, it looks like this
▪ Only the encoder part of the transformer is needed; on the output side, we just need an MLP with a softmax output

▪ Treat image patches as tokens of a sequence

▪ Also use the position information

▪ Early work showed ViT can outperform CNNs given a very large amount of training data
▪ However, recent work² has shown that good old CNNs still rule! ViT and CNNs perform comparably at scale, i.e., when both are given a large amount of compute and training data
² ConvNets Match Vision Transformers at Scale (Smith et al, 2023)
Exercise: Transformer

https://www.byhand.ai/
Any questions?

Readings for you:


▪ Deep Learning book
▪ Nice Demo from Stanford CS-231 Course
▪ Dive into Deep Learning (Very Useful)
▪ Special thanks to Daniel Jurafsky, Piyush Rai, and Srijit Mukherjee
– I adapted some of their slides available online.
