Understanding Self-Attention
The Core of Transformers
Presented by
Sachin
2303864
M.Sc. (Integrated)
Figure 1 Transformer Architecture
Source: Attention Is All You Need (research paper)
From Scratch…
Task: Machine Translation
Text → Neural Network → Text
• Neural networks can’t process raw text the way humans do.
• So, we need to represent words numerically before feeding them into the model.
Initial Work
One Hot Encoding (OHE):
“Mat Cat Mat”
“Rat Cat Rat”
• Considers unique words

      Mat  Cat  Rat
Mat    1    0    0
Cat    0    1    0
Rat    0    0    1

• Representation:
“Mat Cat Mat” : [ 1 0 0 ] [ 0 1 0 ] [ 1 0 0 ]
“Rat Cat Rat” : [ 0 0 1 ] [ 0 1 0 ] [ 0 0 1 ]
• Limitation: High-dimensional and sparse for large vocabularies
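As a sketch, one-hot encoding for the slide’s toy three-word vocabulary can be written in a few lines of Python (the `one_hot` helper is ours, for illustration only):

```python
# Toy vocabulary from the slide.
vocab = ["Mat", "Cat", "Rat"]

def one_hot(word):
    # A single 1 at the word's index in the vocabulary, 0 elsewhere.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

sentence = "Mat Cat Mat".split()
encoded = [one_hot(w) for w in sentence]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

For a real vocabulary of tens of thousands of words, each vector would be that long and almost entirely zeros, which is exactly the sparsity problem noted above.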
Initial Work
Bag Of Words (BOW):
“Mat Cat Mat”
“Rat Cat Rat”
• Considers word count
• Representation:
Mat Cat Rat
“Mat Cat Mat” : [ 2 1 0 ]
“Rat Cat Rat ” : [ 0 1 2 ]
• Limitation: Can’t capture semantic meaning
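A minimal Bag-of-Words sketch for the same toy corpus (the `bag_of_words` helper is an illustrative name, not a library function):

```python
from collections import Counter

# Toy vocabulary from the slide; a BOW vector holds per-word counts.
vocab = ["Mat", "Cat", "Rat"]

def bag_of_words(sentence):
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]  # Counter returns 0 for absent words

print(bag_of_words("Mat Cat Mat"))  # [2, 1, 0]
print(bag_of_words("Rat Cat Rat"))  # [0, 1, 2]
```

Note that word order is discarded entirely: “Cat Mat Mat” would produce the same vector as “Mat Cat Mat”, which is one facet of the lost semantic meaning.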
Word Embeddings
• Advantage: Captures “Average meaning”.
Word → Neural Network → [ 9, 4, 2, 6, … ] (n-dim vector)
Drone: [ 0, 3, 3 ]
Rocket: [ 0, 4, 2 ]
Eagle: [ 3, 0, 3 ]
• Each dimension in the vector represents a feature of the word.
• Similar words will have similar vectors.
Figure 2 Geometric meaning of word embeddings
Source: https://corpling.hypotheses.org/495
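To see “similar words will have similar vectors” concretely, one can compare the slide’s toy vectors with cosine similarity (a common vector-similarity measure; the numbers are just the slide’s examples):

```python
import numpy as np

# Toy 3-dimensional embeddings from the slide.
drone  = np.array([0, 3, 3])
rocket = np.array([0, 4, 2])
eagle  = np.array([3, 0, 3])

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(drone, rocket), 3))  # 0.949 -> very similar
print(round(cosine(drone, eagle), 3))   # 0.5   -> less similar
```

Drone and Rocket (both machines that fly) end up closer to each other than Drone and Eagle, matching the geometric intuition in Figure 2.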
The Problem of “Average meaning”
• Dataset:
• An apple a day keeps the doctor away.
• Apple is health.
• Apple is better than orange.
• Apple makes good phones.
• …
Word → Neural Network → [ x y ]  ( fruit, company )
• If the data is biased, then the “average meaning” would also be biased.
The Problem of “Average meaning”
• Suppose we want to translate a sentence:
“Apple launched a new phone, when I was eating an orange.”
• If the word “Apple” is used more often as a fruit than as a company in the dataset, there is a higher chance that the word “Apple” in this sentence is misunderstood as a fruit.
• The problem is that word embeddings are static: they are created once but reused again and again, and are therefore independent of the context of the sentence.
Self-Attention
• Self-Attention is a mechanism that takes static embeddings as input and generates smart, contextual embeddings as output.
Figure 3 Self-Attention overview
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Let’s take two sentences as examples:
• Money Bank Grows.
• River Bank Flows.
• Static models like Word2Vec or GloVe assign the same vector to “Bank” in both sentences.
• But clearly, the meaning is different depending on context.
• This can lead to errors:
• Model may mix up meanings
• Poor performance in tasks like translation, question answering, etc.
Thought behind Self-Attention
• What if we represent the word “Bank” as:
• Sentence-1: Bank = 0.3 x Money + 0.6 x Bank + 0.1 x Grows
• Sentence-2: Bank = 0.25 x River + 0.7 x Bank + 0.05 x Flows
• Now the representations are contextual.
• But… What are these coefficients?
• Normalized similarity scores between the static embeddings of different words.
• Since the embeddings are vectors, the easiest way to calculate similarity scores is the Dot Product.
Figure 4 Contextual Embedding Generation
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
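This whole weighted-sum idea can be sketched without any learned parameters. The embedding values below are made-up toy numbers for “Money Bank Grows”, assumed purely for illustration:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; subtracting the max is for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical static embeddings for "Money Bank Grows" (toy numbers).
E = np.array([[1.0, 0.2],   # Money
              [0.8, 0.6],   # Bank
              [0.1, 0.9]])  # Grows

scores  = E @ E.T           # pairwise dot-product similarity scores
weights = softmax(scores)   # normalized coefficients; each row sums to 1
context = weights @ E       # each row: contextual embedding as a weighted sum

print(weights[1])  # the coefficients used to rebuild "Bank" from all three words
```

Row 1 of `weights` plays exactly the role of the 0.3 / 0.6 / 0.1 coefficients in the slide, only here they are computed from the embeddings rather than written by hand.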
Thought behind Self-Attention
• Advantage: The whole process is parallel, so it is highly scalable.
Figure 5 Contextual embedding generation in parallel
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Disadvantage:
• Since the model is parallel, the sequential information of the sentence is lost.
• No parameters are involved, which means no learning.
The Problem of “No Learning”
• Current contextual embeddings are general: they do not depend on the task we are performing, and hence are not task-specific.
• Let’s take an example of English to Hindi translation:
“piece of cake” : “बहुत आसान काम” (a very easy task)
• With general contextual embeddings, the model might instead translate this as,
“piece of cake” : “केक का टुकड़ा” (a literal piece of cake)
• Hence, we need task-specific embeddings.
Task-Specific Contextual
Embeddings
• In order to generate task-specific contextual embeddings, we need to introduce some learning parameters in the self-attention model.
• But where and how?
Introducing Learning Parameters
• Each embedding vector is performing three roles:
• Query
• Key
• Value
• But a single vector might not be a good fit for all three roles.
• Therefore, we should derive three different vectors specifically for their roles.
Figure 6 Introducing Query, Key and Value vectors in place of Embedding vector
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Introducing Learning Parameters
• How to derive query, key and value vectors from the embedding vector?
• Two mathematical operations derive a new vector from an existing one:
• Scaling
• Linear Transformation
• In our problem, Linear Transformation is the better option, because Scaling only changes the magnitude, not the direction.
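A sketch of deriving the three role-specific vectors by linear transformation. The matrices here are random stand-ins for what would, in a real model, be learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 3            # embedding size and query/key size (toy numbers)
x = rng.normal(size=d_model)   # one static embedding vector

# Transformation matrices; random here, learned from data in practice.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Linear transformation: the same vector, projected into three roles.
q, k, v = x @ W_q, x @ W_k, x @ W_v
print(q.shape, k.shape, v.shape)  # (3,) (3,) (3,)
```

The entries of `W_q`, `W_k` and `W_v` are precisely the learning parameters the slides ask for: training adjusts them so that the resulting Q, K and V serve the task at hand.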
Introducing Learning Parameters
• How to get the transformation matrix?
• It will simply be learned from the data, and hence the values inside the matrix will be the learning parameters that generate task-specific contextual embeddings.
Figure 7 Deriving Query, Key and Value vectors from Embedding using linear transformation
Source:
Figure 8 Parallel Execution of Self-Attention
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
There Is One More Thing …
Source: Attention Is All You Need (research paper)
But... Why Scaling?
Source: Attention Is All You Need (research paper)
• The simple reason is dot-product.
• Dot-product results in:
• Small value for vectors of lower dimension.
• Large value for vectors of higher dimension.
But... Why Scaling?
• Vectors of higher dimension can capture more features of a given word.
• So, high-dimensional vectors are more favourable.
• But the Dot-Product of high-dimensional vectors results in higher values.
• The higher the values, the higher the variance.
• And high variance is a problem.
Figure 10 Histogram of Dot-Products of 1000 vectors of dimension 3
Figure 11 Histogram of Dot-Products of 1000 vectors of dimension 100
Figure 12 Histogram of Dot-Products of 1000 vectors of dimension 1000
Figure 13 Comparison of Histograms of Dot-Products of 1000 vectors of dimension 3, 100 and 1000
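The effect shown in the histograms can be reproduced numerically: with unit-variance components, the variance of a dot product grows roughly linearly with the dimension, and dividing by √d brings it back to about 1 (a toy experiment in NumPy, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

for d in (3, 100, 1000):
    a = rng.normal(size=(1000, d))       # 1000 random vectors of dimension d
    b = rng.normal(size=(1000, d))
    dots = (a * b).sum(axis=1)           # 1000 raw dot products
    scaled = dots / np.sqrt(d)           # the same dot products, scaled by sqrt(d)
    # raw variance grows roughly like d; scaled variance stays near 1
    print(d, round(dots.var(), 1), round(scaled.var(), 2))
```

This is exactly why √dk appears in the denominator of the attention formula: it keeps the score variance roughly constant regardless of the key dimension.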
Why High Variance Is A Problem?
• The higher the variance, the bigger the difference between the smallest and largest values.
• And the SoftMax function returns:
• High probability for higher values
• Low probability for lower values
Why High Variance Is A Problem?
• Since we will use backpropagation to train the model, these tiny probability values will lead to the vanishing gradient problem, and hence the overall training would be unstable.
• So, we need to reduce the variance. But... How?
• Simply, by scaling down the vectors. But... by what factor?
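Why unscaled scores are a problem for SoftMax can be seen directly: multiplying the same scores by a large constant (as a high-dimensional dot product effectively does) pushes SoftMax into a near one-hot output (toy numbers for illustration):

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps exp() numerically stable.
    e = np.exp(x - x.max())
    return e / e.sum()

# Same relative ordering of scores, small vs large magnitude.
small = np.array([1.0, 2.0, 3.0])
large = small * 20  # mimics unscaled high-dimensional dot products

print(softmax(small))  # moderate probabilities -> useful gradients
print(softmax(large))  # nearly one-hot -> near-zero gradients elsewhere
```

With the large-magnitude scores, almost all probability mass lands on a single position and the rest sit near zero, which is where the vanishing gradients come from.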
Why High Variance Is A Problem?
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Scaled Dot-Product Attention
• Self-Attention is given by the following equation:

Attention(Q, K, V) = softmax( Q Kᵀ / √dk ) V

• Here,
• Q : Query matrix,
• K : Key matrix,
• V : Value matrix,
• dk : Dimensionality of the Key vectors.
Figure 9 Scaled Dot-Product Attention
Source: Attention Is All You Need (research paper)
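A minimal NumPy sketch of this equation, with toy random matrices standing in for real Q, K and V:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; max subtraction is for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled similarity scores
    weights = softmax(scores)         # attention weights; each row sums to 1
    return weights @ V                # contextual embeddings

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 tokens, d_k = 4 (toy sizes)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one contextual vector per token
```

Each output row is a weighted mixture of the value vectors, with the weights decided by how well that token’s query matches every token’s key.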
References
• https://youtu.be/r7mAt0iVqwo?si=Fjl9319Wlu-kSJcc
• https://youtu.be/-tCKPl_8Xb8?si=2i1NQnqTzhZX2KSg
• https://youtu.be/XnGGmvpDLA0?si=SnOHRYzVXwZP3GEY
• https://youtu.be/BjRVS2wTtcA?si=DtLbS7_Y6bUxnfpv
• Attention Is All You Need (https://arxiv.org/abs/1706.03762)
Thank You