Understanding Self-Attention
The Core of Transformers
Presented by
Sachin
2303864
M.Sc. (Integrated)
Figure 1 Transformer Architecture
Source: Attention Is All You Need (research paper)
From Scratch…
Task: Machine Translation
Text → Neural Network → Text
• Neural networks can’t process raw text the way humans do.
• So, we need to represent words numerically before feeding them into the model.
Initial Work
One Hot Encoding (OHE):
“Mat Cat Mat”
“Rat Cat Rat”
• Considers unique words

      Mat  Cat  Rat
Mat    1    0    0
Cat    0    1    0
Rat    0    0    1

• Representation:
“Mat Cat Mat” : [ 1 0 0 ] [ 0 1 0 ] [ 1 0 0 ]
“Rat Cat Rat” : [ 0 0 1 ] [ 0 1 0 ] [ 0 0 1 ]
• Limitation: High-dimensional and sparse for large vocabularies
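As a sketch, one-hot encoding for the slide’s toy three-word vocabulary can be written in a few lines of Python (the `one_hot` helper is ours, for illustration only):

```python
# Toy vocabulary from the slide.
vocab = ["Mat", "Cat", "Rat"]

def one_hot(word):
    # A single 1 at the word's index in the vocabulary, 0 elsewhere.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

sentence = "Mat Cat Mat".split()
encoded = [one_hot(w) for w in sentence]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

For a real vocabulary of tens of thousands of words, each vector would be that long and almost entirely zeros, which is exactly the sparsity problem noted above.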
Initial Work
Bag Of Words (BOW):
“Mat Cat Mat”
“Rat Cat Rat”
• Considers word count
• Representation:
Mat Cat Rat
“Mat Cat Mat” : [ 2 1 0 ]
“Rat Cat Rat ” : [ 0 1 2 ]
• Limitation: Can’t capture semantic meaning
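A minimal Bag-of-Words sketch for the same toy corpus (the `bag_of_words` helper is an illustrative name, not a library function):

```python
from collections import Counter

# Toy vocabulary from the slide; a BOW vector holds per-word counts.
vocab = ["Mat", "Cat", "Rat"]

def bag_of_words(sentence):
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]  # Counter returns 0 for absent words

print(bag_of_words("Mat Cat Mat"))  # [2, 1, 0]
print(bag_of_words("Rat Cat Rat"))  # [0, 1, 2]
```

Note that word order is discarded entirely: “Cat Mat Mat” would produce the same vector as “Mat Cat Mat”, which is one facet of the lost semantic meaning.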
Word Embeddings
• Advantage: Captures “Average meaning”.
Word → Neural Network → [ 9, 4, 2, 6, … ] (n-dim vector)
Drone: [ 0, 3, 3 ]
Rocket: [ 0, 4, 2 ]
Eagle: [ 3, 0, 3 ]
• Each dimension in the vector represents a feature of the word.
• Similar words will have similar vectors.
Figure 2 Geometric meaning of word embeddings
Source: https://corpling.hypotheses.org/495
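To see “similar words will have similar vectors” concretely, one can compare the slide’s toy vectors with cosine similarity (a common vector-similarity measure; the numbers are just the slide’s examples):

```python
import numpy as np

# Toy 3-dimensional embeddings from the slide.
drone  = np.array([0, 3, 3])
rocket = np.array([0, 4, 2])
eagle  = np.array([3, 0, 3])

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(drone, rocket), 3))  # 0.949 -> very similar
print(round(cosine(drone, eagle), 3))   # 0.5   -> less similar
```

Drone and Rocket (both machines that fly) end up closer to each other than Drone and Eagle, matching the geometric intuition in Figure 2.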
The Problem of “Average meaning”
• Dataset:
• An apple a day keeps the doctor away.
• Apple is health.
• Apple is better than orange.
• Apple makes good phones.
• …
Word → Neural Network → [ x y ]  ( fruit, company )
• If the data is biased, then the “average meaning” would also be biased.
The Problem of “Average meaning”
• Suppose we want to translate a sentence:
“Apple launched a new phone, when I was eating an orange.”
• If the word “Apple” is used more often as a fruit than as a company in the dataset, there is a higher chance that the word “Apple” in this sentence is misunderstood as a fruit.
• The problem is that word embeddings are static: they are created once but reused again and again, and are therefore independent of the context of the sentence.
Self-Attention
• Self-Attention is a mechanism that takes static embeddings as input and generates smart, contextual embeddings as output.
Figure 3 Self-Attention overview
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Let’s take two sentences as examples:
• Money Bank Grows.
• River Bank Flows.
• Static models like Word2Vec or GloVe assign the same vector to “Bank” in both sentences.
• But clearly, the meaning is different depending on context.
• This can lead to errors:
• Model may mix up meanings
• Poor performance in tasks like translation, question answering, etc.
Thought behind Self-Attention
• What if we represent the word “Bank” as:
• Sentence-1: Bank = 0.3 x Money + 0.6 x Bank + 0.1 x Grows
• Sentence-2: Bank = 0.25 x River + 0.7 x Bank + 0.05 x Flows
• Now the representations are contextual.
• But… What are these coefficients?
• Normalized similarity scores between the static embeddings of different words.
• Since the embeddings are vectors, the easiest way to calculate similarity scores is the Dot Product.
Figure 4 Contextual Embedding Generation
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
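This whole weighted-sum idea can be sketched without any learned parameters. The embedding values below are made-up toy numbers for “Money Bank Grows”, assumed purely for illustration:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; subtracting the max is for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical static embeddings for "Money Bank Grows" (toy numbers).
E = np.array([[1.0, 0.2],   # Money
              [0.8, 0.6],   # Bank
              [0.1, 0.9]])  # Grows

scores  = E @ E.T           # pairwise dot-product similarity scores
weights = softmax(scores)   # normalized coefficients; each row sums to 1
context = weights @ E       # each row: contextual embedding as a weighted sum

print(weights[1])  # the coefficients used to rebuild "Bank" from all three words
```

Row 1 of `weights` plays exactly the role of the 0.3 / 0.6 / 0.1 coefficients in the slide, only here they are computed from the embeddings rather than written by hand.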
Thought behind Self-Attention
• Advantage: The whole process is parallel, so it is highly scalable.
Figure 5 Contextual embedding generation in parallel
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Disadvantage:
• Since the model is parallel, the sequential information of the sentence is lost.
• No parameters are involved, which means no learning.
The Problem of “No Learning”
• Current contextual embeddings are general: they do not depend on the task we are performing, and hence are not task-specific.
• Let’s take an example of English to Hindi translation:
“piece of cake” : “बहुत आसान काम” (a very easy task)
• With general contextual embeddings, the model might instead translate this as,
“piece of cake” : “केक का टुकड़ा” (a literal piece of cake)
• Hence, we need task-specific embeddings.
Task-Specific Contextual
Embeddings
• In order to generate task-specific contextual embeddings, we need to introduce some learning parameters in the self-attention model.
• But where and how?
Introducing Learning Parameters
• Each embedding vector is performing three roles:
• Query
• Key
• Value
• But a single vector might not be a good fit for all three roles.
• Therefore, we should derive three different vectors specifically for their roles.
Figure 6 Introducing Query, Key and Value vectors in place of Embedding vector
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Introducing Learning Parameters
• How to derive query, key and value vectors from the embedding vector?
• Two mathematical operations derive a new vector from an existing one:
• Scaling
• Linear Transformation
• In our problem, Linear Transformation is the better option, because Scaling only changes the magnitude, not the direction.
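A sketch of deriving the three role-specific vectors by linear transformation. The matrices here are random stand-ins for what would, in a real model, be learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 3            # embedding size and query/key size (toy numbers)
x = rng.normal(size=d_model)   # one static embedding vector

# Transformation matrices; random here, learned from data in practice.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Linear transformation: the same vector, projected into three roles.
q, k, v = x @ W_q, x @ W_k, x @ W_v
print(q.shape, k.shape, v.shape)  # (3,) (3,) (3,)
```

The entries of `W_q`, `W_k` and `W_v` are precisely the learning parameters the slides ask for: training adjusts them so that the resulting Q, K and V serve the task at hand.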
Introducing Learning Parameters
• How to get the transformation matrix?
• It will simply be learned from the data, and hence the values inside the matrix will be the learning parameters that generate task-specific contextual embeddings.
Figure 7 Deriving Query, Key and Value vectors from Embedding using linear transformation
Source:
Figure 8 Parallel Execution of Self-Attention
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
There Is One More Thing …
Source: Attention Is All You Need (research paper)
But... Why Scaling?
Source: Attention Is All You Need (research paper)
• The simple reason is dot-product.
• Dot-product results in:
• Small value for vectors of lower dimension.
• Large value for vectors of higher dimension.
But... Why Scaling?
• Vectors of higher dimension can capture more features of a given word.
• So, high-dimensional vectors are more favourable.
• But the Dot-Product of high-dimensional vectors results in higher values.
• The higher the values, the higher the variance.
• And high variance is a problem.
Figure 10 Histogram of Dot-Products of 1000 vectors of dimension 3
Figure 11 Histogram of Dot-Products of 1000 vectors of dimension 100
Figure 12 Histogram of Dot-Products of 1000 vectors of dimension 1000
Figure 13 Comparison of Histograms of Dot-Products of 1000 vectors of dimension 3, 100 and 1000
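The effect shown in the histograms can be reproduced numerically: with unit-variance components, the variance of a dot product grows roughly linearly with the dimension, and dividing by √d brings it back to about 1 (a toy experiment in NumPy, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

for d in (3, 100, 1000):
    a = rng.normal(size=(1000, d))       # 1000 random vectors of dimension d
    b = rng.normal(size=(1000, d))
    dots = (a * b).sum(axis=1)           # 1000 raw dot products
    scaled = dots / np.sqrt(d)           # the same dot products, scaled by sqrt(d)
    # raw variance grows roughly like d; scaled variance stays near 1
    print(d, round(dots.var(), 1), round(scaled.var(), 2))
```

This is exactly why √dk appears in the denominator of the attention formula: it keeps the score variance roughly constant regardless of the key dimension.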
Why High Variance Is A Problem?
• The higher the variance, the bigger the difference between the smallest and largest values.
• And the SoftMax function returns:
• High probability for higher values
• Low probability for lower values
Why High Variance Is A Problem?
• Since we will use backpropagation to train the model, these tiny probability values will lead to the vanishing gradient problem, and hence the overall training would be unstable.
• So, we need to reduce the variance. But... How?
• Simply, by scaling down the vectors. But... by what factor?
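Why unscaled scores are a problem for SoftMax can be seen directly: multiplying the same scores by a large constant (as a high-dimensional dot product effectively does) pushes SoftMax into a near one-hot output (toy numbers for illustration):

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps exp() numerically stable.
    e = np.exp(x - x.max())
    return e / e.sum()

# Same relative ordering of scores, small vs large magnitude.
small = np.array([1.0, 2.0, 3.0])
large = small * 20  # mimics unscaled high-dimensional dot products

print(softmax(small))  # moderate probabilities -> useful gradients
print(softmax(large))  # nearly one-hot -> near-zero gradients elsewhere
```

With the large-magnitude scores, almost all probability mass lands on a single position and the rest sit near zero, which is where the vanishing gradients come from.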
Why High Variance Is A Problem?
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Scaled Dot-Product Attention
• Self-Attention is given by the following equation:

Attention(Q, K, V) = softmax( Q Kᵀ / √dk ) V

• Here,
• Q : Query matrix,
• K : Key matrix,
• V : Value matrix,
• dk : Dimensionality of the Key vectors.
Figure 9 Scaled Dot-Product Attention
Source: Attention Is All You Need (research paper)
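A minimal NumPy sketch of this equation, with toy random matrices standing in for real Q, K and V:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; max subtraction is for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled similarity scores
    weights = softmax(scores)         # attention weights; each row sums to 1
    return weights @ V                # contextual embeddings

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 tokens, d_k = 4 (toy sizes)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one contextual vector per token
```

Each output row is a weighted mixture of the value vectors, with the weights decided by how well that token’s query matches every token’s key.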
References
• https://youtu.be/r7mAt0iVqwo?si=Fjl9319Wlu-kSJcc
• https://youtu.be/-tCKPl_8Xb8?si=2i1NQnqTzhZX2KSg
• https://youtu.be/XnGGmvpDLA0?si=SnOHRYzVXwZP3GEY
• https://youtu.be/BjRVS2wTtcA?si=DtLbS7_Y6bUxnfpv
• Attention Is All You Need (https://arxiv.org/abs/1706.03762)
Thank You