Deep Dive into Self-Attention by Hand✍︎
Explore the intricacies of the attention mechanism responsible for fueling transformers
Srijanie Dey, PhD
Apr 22, 2024
Attention! Attention!
Because ‘Attention is All You Need’.
No, I am not saying that, the Transformer is.
Image by author (Robtimus Prime seeking attention. As per my son, bright rainbow colors
work better for attention and hence the color scheme.)
As of today, the world has been swept over by the power of transformers. Not the likes of
‘Robtimus Prime’ but the ones that constitute neural networks. And that power is because of
the concept of ‘attention’. So, what does attention in the context of transformers really
mean? Let’s try to find out some answers here:
First and foremost:
What are transformers?
Transformers are neural networks that specialize in learning context from the data. Quite
similar to us trying to find the meaning of ‘attention and context’ in terms of transformers.
How do transformers learn context from the data?
By using the attention mechanism.
What is the attention mechanism?
The attention mechanism helps the model scan all parts of a sequence at each step and
determine which elements need to be focused on. It was proposed as an alternative to the
‘strict/hard’ approach of fixed-length vectors in the encoder-decoder architecture,
providing a ‘soft’ solution that focuses only on the relevant parts.
What is self-attention?
The attention mechanism worked to improve the performance of Recurrent Neural
Networks (RNNs), with the effect seeping into Convolutional Neural Networks (CNNs).
However, with the introduction of the transformer architecture in the year 2017, the need for
RNNs and CNNs was quietly obliterated. And the central reason for it was the self-attention
mechanism.
The self-attention mechanism was special in the sense that it was built to incorporate the
context of the input sequence in order to enhance the attention mechanism. This idea became
transformational as it was able to capture the complex nuances of a language.
As an example:
When I ask my 4-year-old what transformers are, his answer only contains the words robots
and cars, because that is the only context he has. But for me, transformers also mean neural
networks, as this second context is available to my slightly more experienced mind.
And that is how different contexts lead to different answers, which is why context is so
important.
The word ‘self’ refers to the fact that the attention
mechanism examines the same input sequence that it is
processing.
There are many variations of how self-attention is performed. But the scaled dot-product
mechanism has been one of the most popular ones. This was the one introduced in the
original transformer architecture paper in 2017 — “Attention is All You Need”.
Where and how does self-attention feature in
transformers?
I like to see the transformer architecture as a combination of two shells — the outer shell and
the inner shell.
1. The outer shell is a combination of the attention-weighting mechanism and the feed-forward
layer, about which I talk in detail in this article.
2. The inner shell consists of the self-attention mechanism and is part of the attention-
weighting feature.
So, without further delay, let us dive into the details behind the self-attention mechanism and
unravel the workings behind it. The Query-Key module and the SoftMax function play a
crucial role in this technique.
This discussion is based on Prof. Tom Yeh’s wonderful AI by Hand Series on Self-Attention.
(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-
mentioned AI by Hand Series, which I have edited with his permission.)
So here we go:
Self-Attention
To set the stage, here is a pointer to how we process the ‘Attention-Weighting’
in the transformer outer shell.
Attention weight matrix (A)
The attention weight matrix A is obtained by feeding the input features into the Query-Key
(QK) module. This matrix tries to find the most relevant parts in the input sequence. Self-
Attention comes into play while creating the Attention weight matrix A using the QK-
module.
How does the QK-module work?
Let us look at the different components of Self-Attention: Query (Q), Key (K) and Value
(V).
I love using the spotlight analogy here as it helps me visualize the model throwing light on
each element of the sequence and trying to find the most relevant parts. Taking this analogy a
bit further, let us use it to understand the different components of Self-Attention.
Imagine a big stage getting ready for the world’s largest Macbeth production. The audience
outside is teeming with excitement.
The lead actor walks onto the stage, the spotlight shines on him and he asks in his
booming voice “Should I seize the crown?”. The audience whispers in hushed tones
and wonders which path this question will lead to. Thus, Macbeth himself represents
the role of Query (Q) as he asks pivotal questions and drives the story forward.
Based on Macbeth’s query, the spotlight shifts to other crucial characters that hold
information to the answer. The influence of other crucial characters in the story, like
Lady Macbeth, triggers Macbeth’s own ambitions and actions. These other
characters can be seen as the Key (K) as they unravel different facets of the story
based on the particulars they know.
Finally, these characters provide enough motivation and information to Macbeth by
their actions and perspectives. These can be seen as Value (V). The Value (V) pushes
Macbeth towards his decisions and shapes the fate of the story.
And with that, one of the world’s finest performances is created, one that remains etched in the
minds of the awestruck audience for years to come.
Now that we have witnessed the role of Q, K, V in the fantastical world of performing arts,
let’s return to planet matrices and learn the mathematical nitty-gritty behind the QK-module.
This is the roadmap that we will follow:
Roadmap for the Self-Attention mechanism
And so the process begins.
We are given:
A set of 4 feature vectors (dimension 6)
Our goal :
Transform the given features into Attention Weighted Features.
[1] Create Query, Key, Value Matrices
To do so, we multiply the features with linear transformation matrices W_Q, W_K, and
W_V, to obtain query vectors (q1,q2,q3,q4), key vectors (k1,k2,k3,k4), and value vectors
(v1,v2,v3,v4) respectively as shown below:
To get Q, multiply W_Q with X:
To get K, multiply W_K with X:
Similarly, to get V, multiply W_V with X.
To be noted:
1. As can be seen from the calculation above, we use the same set of features for both
queries and keys. And that is how the idea of “self” comes into play here, i.e. the
model uses the same set of features to create its query vector as well as the key vector.
2. The query vector represents the current word (or token) for which we want to
compute attention scores relative to other words in the sequence.
3. The key vector represents the other words (or tokens) in the input sequence and we
compute the attention score for each of them with respect to the current word.
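As a minimal NumPy sketch of this step (the matrix entries below are random and purely illustrative; only the shapes follow the walkthrough: 4 feature vectors of dimension 6, stored as columns and projected down to dimension 3):

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 feature vectors of dimension 6, stored as the columns of X (6 x 4).
X = rng.integers(0, 3, size=(6, 4)).astype(float)

# Linear transformation matrices that project dimension 6 down to d_k = 3.
W_Q = rng.integers(-1, 2, size=(3, 6)).astype(float)  # query projection
W_K = rng.integers(-1, 2, size=(3, 6)).astype(float)  # key projection
W_V = rng.integers(-1, 2, size=(3, 6)).astype(float)  # value projection

Q = W_Q @ X  # query vectors q1..q4 as columns (3 x 4)
K = W_K @ X  # key vectors   k1..k4 as columns (3 x 4)
V = W_V @ X  # value vectors v1..v4 as columns (3 x 4)
```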
[2] Matrix Multiplication
The next step is to multiply the transpose of K with Q, i.e. compute K^T · Q.
The idea here is to calculate the dot product between every pair of query and key vectors.
Calculating the dot product gives us an estimate of the matching score between every “key-
query” pair, by using the idea of Cosine Similarity between the two vectors. This is the ‘dot-
product’ part of the scaled dot-product attention.
Cosine-Similarity
Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of
the vectors divided by the product of their lengths. It roughly measures if two vectors are
pointing in the same direction thus implying the two vectors are similar.
Remember cos(0°) = 1, cos(90°) = 0 , cos(180°) =-1
- If the cosine similarity between the two vectors is approximately 1, the angle between them
is almost zero, meaning they point in nearly the same direction.
- If it is approximately 0, the two vectors are orthogonal to each other and not very similar.
- If it is approximately -1, the angle between them is almost 180°, meaning they point in
opposite directions.
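Continuing the sketch above, the raw matching scores and a small cosine-similarity example look like this:

```python
# Raw matching scores: entry (i, j) of K^T · Q is the dot product k_i · q_j,
# i.e. how well key i matches query j.
scores = K.T @ Q  # (4 x 4)

# Cosine similarity: the dot product divided by the product of the vector lengths.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a
cos_ab = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_ab)  # 1.0, since the vectors point the same way
```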
[3] Scale
The next step is to scale/normalize each element by the square root of the dimension ‘d_k’. In
our case the number is 3. Scaling down helps to keep the impact of the dimension on the
matching score in check.
How does it do so? As per the original Transformer paper, and going back to Probability 101:
if we take the dot product of two vectors q and k of dimension d_k whose components are
independent and identically distributed (i.i.d.) with mean 0 and variance 1, the result is a new
random variable whose mean remains 0 but whose variance grows to d_k.
Now imagine how the matching score would look if our dimension is increased to 32, 64, 128
or even 4960 for that matter. The larger dimension would make the variance higher and push
the values into regions ‘unknown’.
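Reusing the rng from the sketch above, here is a quick numerical check of this argument (the dimension 64 and the sample count are arbitrary choices for illustration):

```python
# Dot products of random vectors with i.i.d. components (mean 0, variance 1)
# have variance roughly equal to the dimension d; dividing by sqrt(d) tames it.
d = 64
q = rng.standard_normal(size=(10_000, d))
k = rng.standard_normal(size=(10_000, d))
dots = np.einsum('nd,nd->n', q, k)  # 10,000 sample dot products
print(dots.var())                   # roughly d, i.e. about 64
print((dots / np.sqrt(d)).var())    # roughly 1 after scaling
```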
To keep the hand calculation simple here, since sqrt(3) is approximately 1.73205, we
approximate the division by instead halving each element and taking the floor, i.e. replacing
each element x with floor(x/2).
Floor Function
The floor function takes a real number as an argument and returns the largest integer less than
or equal to that real number.
E.g.: floor(1.5) = 1, floor(2.9) = 2, floor(2.01) = 2, floor(0.5) = 0.
The opposite of the floor function is the ceiling function.
This is the ‘scaled’ part of the scaled dot-product attention.
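In code, the standard scaling and the hand-exercise shortcut could look like this (the floor(x/2) trick is just the walkthrough's paper-friendly stand-in for dividing by sqrt(3)):

```python
d_k = 3  # dimension of the query/key vectors in this walkthrough

# Standard scaled dot-product attention: divide the raw scores by sqrt(d_k).
scaled_scores = scores / np.sqrt(d_k)

# Hand-exercise approximation: sqrt(3) is about 1.73, so halve each element and floor it.
scaled_scores_by_hand = np.floor(scores / 2)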
[4] Softmax
There are three parts to this step:
1. Raise e to the power of the number in each cell (To make things easy, we use 3 to the
power of the number in each cell.)
2. Sum these new values across each column.
3. For each column, divide each element by its respective sum (Normalize). The purpose
of normalizing each column is to have numbers sum up to 1. In other words, each
column then becomes a probability distribution of attention, which gives us our
Attention Weight Matrix (A).
This Attention Weight Matrix is what we had obtained
after passing our feature matrix through the QK-module
in Step 2 in the Transformers section.
(Remark: The first column in the Attention Weight Matrix has a typo as the current elements
don’t add up to 1. Please double-check. We are allowed these errors because we are human.)
The Softmax step is important as it assigns probabilities to the scores obtained in the previous
steps and thus helps the model decide how much importance (higher/lower attention weights)
needs to be given to each word given the current query. As is to be expected, higher attention
weights signify greater relevance, allowing the model to capture dependencies more
accurately.
Once again, the scaling in the previous step becomes important here. Without it, the values of
the resultant matrix get pushed into regions that are not processed well by the Softmax
function, which may result in vanishing gradients.
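A minimal column-wise Softmax applied to the scaled scores from the sketch above might look as follows (the hand exercise uses powers of 3 instead of e to keep the arithmetic easy; the code uses the usual exponential):

```python
def softmax_columns(scores):
    """Turn each column of the score matrix into a probability distribution."""
    # Subtracting the per-column maximum is a standard numerical-stability trick
    # and does not change the result.
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

A = softmax_columns(scaled_scores)  # attention weight matrix (4 x 4)
print(A.sum(axis=0))                # every column sums to 1
```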
[5] Matrix Multiplication
Finally we multiply the value vectors (Vs) with the Attention Weight Matrix (A). These value
vectors are important as they contain the information associated with each word in the
sequence.
The result of this final multiplication is the set of attention-weighted features Zs, which are
the final output of the self-attention mechanism. These attention-weighted features essentially
contain a weighted representation of the input features, assigning higher weights to the
features with higher relevance as per the context.
Now with this information available, we continue to the next step in the transformer
architecture where the feed-forward layer processes this information further.
And this brings us to the end of the brilliant self-attention technique!
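Putting the five steps together, and reusing softmax_columns from the sketch above, a rough end-to-end version of the mechanism could look like this (features stored as columns, as in the walkthrough):

```python
def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention, with feature vectors as columns of X."""
    Q = W_Q @ X                      # [1] queries
    K = W_K @ X                      #     keys
    V = W_V @ X                      #     values
    d_k = Q.shape[0]
    scores = K.T @ Q                 # [2] matching score for every key-query pair
    scaled = scores / np.sqrt(d_k)   # [3] scale
    A = softmax_columns(scaled)      # [4] attention weight matrix
    return V @ A                     # [5] attention-weighted features Z

Z = self_attention(X, W_Q, W_K, W_V)  # (3 x 4): one weighted feature per input position
```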
Reviewing all the key points based on the ideas we talked about above:
1. The attention mechanism was the result of an effort to improve the performance of RNNs,
addressing the issue of fixed-length vector representations in the encoder-decoder
architecture. The flexibility of a ‘soft’ focus on the relevant parts of a sequence was the
core strength behind attention.
2. Self-attention was introduced as a way to incorporate the idea of context into the model.
The self-attention mechanism evaluates the same input sequence that it processes,
hence the use of the word ‘self’.
3. There are many variants of the self-attention mechanism, and efforts are ongoing to
make it more efficient. However, scaled dot-product attention is one of the most
popular ones and a crucial reason why the transformer architecture turned out to be
so powerful.
4. The scaled dot-product self-attention mechanism comprises the Query-Key module (QK-
module) along with the Softmax function. The QK-module is responsible for
extracting the relevance of each element of the input sequence by calculating the
attention scores, and the Softmax function complements it by assigning probabilities
to those scores.
5. Once the attention scores are calculated, they are multiplied with the value vectors
to obtain the attention-weighted features, which are then passed on to the feed-
forward layer.
Multi-Head Attention
To capture a richer, more varied representation of the sequence, multiple copies of the self-
attention mechanism are run in parallel, and their outputs are concatenated to produce
the final attention-weighted values. This is called Multi-Head Attention.
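As a rough sketch building on the self_attention function above (the two heads, their sizes, and the output projection W_O are illustrative assumptions; the original paper concatenates the head outputs and then applies such an output projection):

```python
def multi_head_attention(X, heads, W_O):
    """Run several self-attention heads in parallel and mix their outputs.

    heads: list of (W_Q, W_K, W_V) tuples, one per head.
    W_O:   output projection applied to the concatenated head outputs.
    """
    outputs = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    concatenated = np.concatenate(outputs, axis=0)  # stack along the feature axis
    return W_O @ concatenated

# Two illustrative heads, each projecting dimension 6 down to 3; W_O maps 6 back to 6.
heads = [tuple(rng.integers(-1, 2, size=(3, 6)).astype(float) for _ in range(3))
         for _ in range(2)]
W_O = rng.integers(-1, 2, size=(6, 6)).astype(float)
Y = multi_head_attention(X, heads, W_O)  # (6 x 4)
```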
Transformer in a Nutshell
This is how the inner-shell of the transformer architecture works. And bringing it together
with the outer shell, here is a summary of the Transformer mechanism:
1. The two big ideas in the Transformer architecture are attention-weighting and
the feed-forward layer (FFN). Together, they allow the Transformer to analyze the
input sequence from two directions. Attention looks at the sequence based on
positions, while the FFN does it based on the dimensions of the feature matrix.
2. The part that powers the attention mechanism is the scaled dot-product Attention
which consists of the QK-module and outputs the attention weighted features.
‘Attention Is really All You Need’
Transformers have been around for only a few years, and the field of AI has already seen
tremendous progress built on them. And the effort is still ongoing. When the authors of the
paper chose that title, they were not kidding.
It is interesting to see once again how a fundamental idea — the ‘dot product’ coupled with
certain embellishments can turn out to be so powerful!
Image by author
P.S. If you would like to work through this exercise on your own, here are the blank
templates for you to use.
Blank Template for hand-exercise
Now go have some fun with the exercise while paying attention to your Robtimus Prime!
Related Work:
Here are the other articles in the Deep Dive by Hand Series:
Deep Dive into Vector Databases by Hand ✍ that explores what exactly happens
behind-the-scenes in Vector Databases.
Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand ✍ that explores the
secret behind Sora’s state-of-the-art videos.
Deep Dive into Transformers by Hand ✍ that explores the power behind the power of
transformers.
References:
1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances
in neural information processing systems 30 (2017).
2. Bahdanau, Dzmitry, Kyunghyun Cho and Yoshua Bengio. “Neural Machine
Translation by Jointly Learning to Align and Translate.” CoRR abs/1409.0473 (2014).