Understanding Self-Attention

The document discusses the concept of self-attention in transformer architectures, highlighting its importance in generating contextual embeddings for machine translation. It explains the limitations of traditional word representation methods like One Hot Encoding and Bag Of Words, and introduces self-attention as a solution to capture context-dependent meanings. Additionally, it addresses the need for task-specific embeddings and the introduction of learning parameters to enhance the self-attention mechanism.


Understanding Self-Attention
The Core of Transformers

Presented by
Sachin
2303864
M.Sc. (Integrated)
Figure 1 Transformer Architecture
Source: Attention Is All You Need (research paper)
From Scratch…
Task: Machine Translation

Text → Neural Network → Text

• Neural networks can’t process raw text the way humans do.
• So, we need to represent words numerically before feeding them into the model.
Initial Work
One Hot Encoding (OHE):
“Mat Cat Mat”
“Rat Cat Rat”

• Considers unique words
• One-hot table (rows = words, columns = vocabulary positions):

        Mat  Cat  Rat
  Mat    1    0    0
  Cat    0    1    0
  Rat    0    0    1

• Representation:
“Mat Cat Mat” : [ 1 0 0 ] [ 0 1 0 ] [ 1 0 0 ]
“Rat Cat Rat” : [ 0 0 1 ] [ 0 1 0 ] [ 0 0 1 ]
• Limitation: High-dimensional and sparse for large vocabularies
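The encoding above can be sketched in a few lines of Python (a toy illustration using the slide’s three-word vocabulary; the names `vocab` and `one_hot` are ours, not from the slides):

```python
# One-hot encoding sketch for the toy vocabulary {Mat, Cat, Rat}.
vocab = ["Mat", "Cat", "Rat"]  # unique words in the corpus

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

encoded = [one_hot(w) for w in "Mat Cat Mat".split()]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Note how every vector has length equal to the vocabulary size, which is exactly the sparsity problem for large vocabularies.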
Initial Work
Bag Of Words (BOW):
“Mat Cat Mat”
“Rat Cat Rat”
• Considers word count
• Representation:
Mat Cat Rat
“Mat Cat Mat” : [ 2 1 0 ]
“Rat Cat Rat ” : [ 0 1 2 ]
• Limitation: Can’t capture semantic meaning
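The count-based representation above can be sketched as follows (a toy illustration; `bag_of_words` is our name for the helper):

```python
from collections import Counter

vocab = ["Mat", "Cat", "Rat"]

def bag_of_words(sentence):
    # Count each word, then read the counts off in vocabulary order.
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

print(bag_of_words("Mat Cat Mat"))  # [2, 1, 0]
print(bag_of_words("Rat Cat Rat"))  # [0, 1, 2]
```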
Word Embeddings
• Advantage: Captures “Average meaning”.

Word → Neural Network → [ 9, 4, 2, 6, … ] (n-dim vector)
Drone: [ 0, 3, 3 ]
Rocket: [ 0, 4, 2 ]
Eagle: [ 3, 0, 3 ]
• Each dimension in the vector represents a feature of the word.
• Similar words have similar vectors.
Figure 2 Geometric meaning of word
embeddings
Source: https://corpling.hypotheses.org/495
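As an illustration, cosine similarity on the toy vectors above confirms that Drone sits closer to Rocket than to Eagle (the `cosine` helper is ours, not from the slides):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

drone, rocket, eagle = [0, 3, 3], [0, 4, 2], [3, 0, 3]
print(cosine(drone, rocket))  # ≈ 0.95 — both are flying machines
print(cosine(drone, eagle))   # 0.50 — less similar
```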
The Problem of “Average meaning”
• Dataset:
• An apple a day keeps the doctor away.
• Apple is health.
• Apple is better than orange.
• Apple makes good phones.
• …
• …

Word → Neural Network → [ x y ] ( fruit, company )

• If the data is biased, then the “average meaning” would also be biased.
The Problem of “Average meaning”
• Suppose we want to translate a sentence:
“Apple launched a new phone, when I was eating an
orange”
• If the word “Apple” is used more often as a fruit than as a company in the dataset, there is a higher chance that the word “Apple” in the given sentence will be misunderstood as a fruit.
• The problem is that word embeddings are static: they are created once and reused again and again, and hence are independent of the context of the sentence.
Self-Attention
• Self-Attention is a mechanism that takes static embeddings as input and generates smart contextual embeddings as output.

Figure 3 Self-Attention overview
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Let’s take two sentences as examples:
• Money Bank Grows.
• River Bank Flows.
• Static models like Word2Vec or GloVe assign the same
vector to "Bank" in both sentences.
• But clearly, the meaning is different depending on
context.
• This can lead to errors:
• Model may mix up meanings
• Poor performance in tasks like translation, question answering,
etc.
Thought behind Self-Attention
• What if we represent the word “Bank” as:
• Sentence-1: Bank = 0.3 x Money + 0.6 x Bank + 0.1 x Grows
• Sentence-2: Bank = 0.25 x River + 0.7 x Bank + 0.05 x Flows
• Now the representations are contextual.
• But… what are these coefficients?
• They are normalized similarity scores between the static embeddings of the different words.
• Since the embeddings are vectors, the easiest way to calculate similarity scores is the dot product.
Figure 4 Contextual Embedding Generation
Source:
https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
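The weighted-sum idea above can be sketched as parameter-free self-attention (a minimal illustration; the embedding values are made up for the example):

```python
import numpy as np

def simple_self_attention(X):
    """Parameter-free self-attention: each output row is a
    softmax-weighted mix of all input embedding rows."""
    scores = X @ X.T                                # pairwise dot-product similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax (normalized coefficients)
    return weights @ X                              # contextual embeddings

# Toy static embeddings for "Money Bank Grows" (values are hypothetical)
X = np.array([[1.0, 0.2], [0.3, 1.0], [0.1, 0.4]])
print(simple_self_attention(X))
```

Each output row is a convex combination of all input embeddings, which is exactly the “Bank = 0.3 × Money + 0.6 × Bank + 0.1 × Grows” intuition above.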
Thought behind Self-Attention
• Advantage: The whole process is parallel, so it is highly scalable.

Figure 5 Contextual embedding generation in parallel
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Thought behind Self-Attention
• Disadvantage:
• Since the process is parallel, the sequential information of the sentence is lost.
• No learnable parameters are involved, which means no learning.
The Problem of “No Learning”
• The current contextual embeddings are general: they do not depend on the task we are performing, and hence are not task-specific.
• Let’s take an example of English to Hindi translation:
“piece of cake” : “बहुत आसान काम”
• With the general contextual embeddings, the model
might translate this sentence to,
“piece of cake” : “केक का टुकड़ा”
• Hence, we need task-specific embeddings.
Task-Specific Contextual
Embeddings
• In order to generate task-specific contextual
embedding, we need to introduce some learning
parameters in the self-attention model.
• But where and how?

query    key    value
Introducing Learning Parameters
• Each embedding vector is
performing three roles:
• Query
• Key
• Value
• But a single vector might not be a good fit for all three roles.
• Therefore, we should derive three different vectors, each specialized for its role.
Figure 6 Introducing Query, Key and Value vectors in place of Embedding vector
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Introducing Learning Parameters
• How do we derive the query, key and value vectors from the embedding vector?
• Two mathematical operations can derive a new vector from an existing one:
• Scaling
• Linear Transformation
• For our problem, Linear Transformation is the better option, because Scaling only changes the magnitude.
Introducing Learning Parameters
• How to get the
transformation
matrix?
• It will simply be learned from the data, and hence the values inside the matrix will be the learning parameters that generate task-specific contextual embeddings.

Figure 7 Deriving Query, Key and Value vectors from Embedding using linear transformation
Source:
Figure 8 Parallel Execution of Self-Attention
Source: https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
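A minimal sketch of the projection step, with randomly initialised weight matrices standing in for the parameters that would be learned during training (sizes are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # embedding size and query/key/value size (illustrative)

# Learnable projection matrices. Here they are random placeholders;
# in training, their values are the learning parameters.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

X = rng.normal(size=(5, d_model))        # 5 toy token embeddings

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # three linear transformations
print(Q.shape, K.shape, V.shape)         # (5, 3) (5, 3) (5, 3)
```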
There Is One More Thing …

Source: Attention Is All You Need (research paper)


But... Why Scaling?

Source: Attention Is All You Need (research paper)

• The simple reason is the dot-product.
• The dot-product results in:
• Small values for vectors of lower dimension.
• Large values for vectors of higher dimension.
But... Why Scaling?
• Vectors of higher dimension can capture more features of a given word.
• So, high-dimensional vectors are more favourable.
• But the dot-product of high-dimensional vectors results in higher values.
• The higher the values, the higher the variance.
• And high variance is a problem.
Figure 10 Histogram of Dot-Products of 1000 vectors of dimension 3
Figure 11 Histogram of Dot-Products of 1000 vectors of dimension 100
Figure 12 Histogram of Dot-Products of 1000 vectors of dimension 1000
Figure 13 Comparison of Histograms of Dot-Products of 1000 vectors of dimension 3, 100 and 1000
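The growth of dot-product variance with dimension (the effect Figures 10–13 plot) can be reproduced in a few lines, assuming standard-normal random vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
variances = {}

for d in (3, 100, 1000):
    a = rng.normal(size=(1000, d))   # 1000 random vectors per dimension
    b = rng.normal(size=(1000, d))
    dots = (a * b).sum(axis=1)       # 1000 dot products
    variances[d] = dots.var()
    print(f"d={d:4d}  var={dots.var():7.1f}  var after /sqrt(d)={(dots / np.sqrt(d)).var():.2f}")
```

For unit-variance components, the variance of the dot product grows linearly with the dimension d, while dividing by √d brings it back to roughly 1 for every dimension.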
Why High Variance Is A Problem?
• The higher the variance, the larger the difference between the smallest and largest values.
• And the SoftMax function returns:
• High probability values for higher inputs
• Low probability values for lower inputs
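A quick numerical illustration of how SoftMax saturates as the spread of the input scores grows (the values are made up for the example):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))     # fairly spread-out probabilities
print(softmax(np.array([10.0, 20.0, 30.0])))  # almost one-hot: small entries vanish
```

With the wider spread, the smaller entries receive probabilities near zero, which is the source of the vanishing gradients discussed next.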
Why High Variance Is A Problem?
• Since we will use backpropagation to train the model, these near-zero probabilities lead to the vanishing gradient problem, and hence the overall training becomes unstable.
• So, we need to reduce the variance. But... how?
• Simply by scaling down the dot-product scores. But... by what factor?
Why High Variance Is A Problem?

Source:
https://youtu.be/-tCKPl_8Xb8?si=U4zgmn_D9tlxN9-4
Scaled Dot-Product Attention
• Self-Attention is given by the following equation:

Attention(Q, K, V) = softmax( Q·Kᵀ / √dk ) · V

• Here,
• Q : Query matrix,
• K : Key matrix,
• V : Value matrix,
• dk : Dimensionality of the Key vectors.

Figure 9 Scaled Dot-Product Attention
Source: Attention Is All You Need (research paper)
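Putting the pieces together, a minimal sketch of the equation above (NumPy, with a batch of 5 toy tokens; not an optimized implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarities, scaled down by sqrt(d_k)
    # Row-wise softmax over the keys (max subtracted for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))
K = rng.normal(size=(5, 3))
V = rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```

Each output row is a convex combination of the rows of V, with the mixing weights decided by how well the token’s query matches every key.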
References
• https://youtu.be/r7mAt0iVqwo?si=Fjl9319Wlu-kSJcc
• https://youtu.be/-tCKPl_8Xb8?si=2i1NQnqTzhZX2KSg
• https://youtu.be/XnGGmvpDLA0?si=SnOHRYzVXwZP3GEY
• https://youtu.be/BjRVS2wTtcA?si=DtLbS7_Y6bUxnfpv
• Attention Is All You Need (https://arxiv.org/abs/1706.03762)
Thank You
