
Assignment 2 (Total Points: 20)

Part A: Basics of Neural Nets

1. [5 points] Assume you have the following neural network with 3 neurons (circles in green).

Here,
Z11 = w11.x1 + w21.x2, A11 = ReLU(Z11)
Z12 = w12.x1 + w22.x2, A12 = ReLU(Z12)
Z2 = w3.A11 + w4.A12, A2 = ReLU(Z2)

a. Find the values of the w’s so that A2 gives the same prediction as Y in the table on the left.
b. Find the values of the w’s so that the neural network produces Y = X1 XNOR X2. (A code sketch for checking candidate weights is given below.)
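
A minimal sketch (assuming NumPy; the weight values below are placeholders to be replaced by your hand-derived answers) for checking that a candidate set of w’s reproduces a target truth table:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x1, x2, w11, w21, w12, w22, w3, w4):
    # Hidden layer: two neurons with ReLU activations (as defined above)
    a11 = relu(w11 * x1 + w21 * x2)
    a12 = relu(w12 * x1 + w22 * x2)
    # Output neuron, also with ReLU activation
    return relu(w3 * a11 + w4 * a12)

# Placeholder weights -- substitute the values you derive by hand
w = dict(w11=1.0, w21=1.0, w12=1.0, w22=1.0, w3=1.0, w4=1.0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, forward(x1, x2, **w))
```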

2. [5 points] Assume you have a single neuron with a sign() activation function; that is, the output of the
neuron will be either 0 or 1, and it is given by:

a = sign(wᵀx + b)

The sign function is defined as below


sign(z) = 1 if z > 0, and 0 otherwise

a. Given w = [5 6] and b = 30, what is the output of the neuron when x1 = [1 2] and when x2 = [2 8]?

b. Plot the decision boundary (the curve that separates the points for which the output is 1 from
those for which the output is 0). Use the values of w and b from part (a). (A code sketch follows below.)
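
A small sketch (assuming NumPy and Matplotlib) that evaluates the neuron on the two given inputs and plots the line wᵀx + b = 0, which is the decision boundary for this neuron:

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.array([5.0, 6.0])
b = 30.0

def neuron(x):
    # sign() as defined above: 1 if z > 0, else 0
    return 1 if w @ x + b > 0 else 0

print(neuron(np.array([1, 2])))   # output for x1 = [1 2]
print(neuron(np.array([2, 8])))   # output for x2 = [2 8]

# Decision boundary: w[0]*x1 + w[1]*x2 + b = 0  =>  x2 = -(w[0]*x1 + b) / w[1]
x1_vals = np.linspace(-10, 10, 100)
x2_vals = -(w[0] * x1_vals + b) / w[1]
plt.plot(x1_vals, x2_vals, label="5*x1 + 6*x2 + 30 = 0")
plt.scatter([1, 2], [2, 8], color="red")   # the two given points
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```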
Part B: Self-attention [10 points]

This part of the assignment/tutorial will take you through the self-attention computation that was
proposed in the seminal paper “Attention Is All You Need”. You are to do the computations by hand.

Assume you have embeddings or representations of 3 tokens given by x1, x2, x3; each of these xi’s is a
5-vector (the dimension of the vectors is 5), stored as a row in the following 3 x 5 matrix X:

X = [ 1 2 3 7 1
      2 1 0 1 0
      0 1 3 5 1 ]

The self-attention mechanism computes the pairwise attention or relationship between the (transformed)
vectors and, based on these pairwise attentions, produces outputs that are linear combinations of the
vectors.

Figure: the self-attention computation is shown on the left; the same computation is represented on the right,
in the form used in the “Attention Is All You Need” paper.

The self-attention mechanism will first construct 3 query, 3 key, and 3 value vectors by transforming
each of the token vectors, multiplying them with matrices WQ, WK, and WV. The dimensions of the
W matrices determine the dimensions of the query, key, and value vectors. Assume the following
WQ, WK, and WV matrices:

WQ = [ 1 0 1 0        WK = [ 1 0 1 0        WV = [ 1 0 0 1
       0 1 0 1               2 1 1 0               0 1 1 1
       0 2 0 0               0 0 0 0               0 1 0 1
       0 0 1 1               0 1 0 0               1 1 1 1
       0 0 0 1 ]             0 1 0 1 ]             1 0 0 0 ]
Compute the query, key and value vectors

The vectors will be just linear transformations of the xi’s. The linear transformations are represented
by the W matrices.

query vector, q1 = x1 * WQ =
query vector, q2 = x2 * WQ =
query vector, q3 = x3 *WQ =

Observe that these can be computed directly by Q = XWQ (matrix multiplication),


where q1, q2, q3 are the rows of Q (a 3 x 4 matrix). So,

Q=

Similarly,

find K = XWK, where k1, k2, k3 are the rows of K representing the key vectors corresponding to the
vectors x1, x2, x3

K=

and find V = XWV, where v1, v2, v3 are the rows of V representing the value vectors corresponding to
the vectors x1, x2, x3

V=
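
If you want to verify your hand computations, a minimal NumPy sketch using the X, WQ, WK, and WV matrices given above:

```python
import numpy as np

X = np.array([[1, 2, 3, 7, 1],
              [2, 1, 0, 1, 0],
              [0, 1, 3, 5, 1]])

WQ = np.array([[1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 2, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 0, 1]])

WK = np.array([[1, 0, 1, 0],
               [2, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 0, 1]])

WV = np.array([[1, 0, 0, 1],
               [0, 1, 1, 1],
               [0, 1, 0, 1],
               [1, 1, 1, 1],
               [1, 0, 0, 0]])

# Rows of Q, K, V are the query, key, and value vectors for x1, x2, x3
Q = X @ WQ
K = X @ WK
V = X @ WV
print(Q)
print(K)
print(V)
```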

Compute the Attention Scores

Once the Q, K, V matrices are computed, we can use the equation mentioned in the paper: Attention(Q, K, V) = softmax(QKᵀ / √dk) V.

Next, we use the query and key vectors to compute the pairwise attentions; we will use dot-product
attention. We will get 9 dot products that can be easily stored in a 3 x 3 matrix, QKᵀ.

QKᵀ =
Observe that the ij-th element of QKᵀ will be the dot product between qi (the query vector corresponding to xi)
and kj (the key vector corresponding to xj).

Now we scale the entries of the matrix by 1/√dk, where dk is the dimension of the query or key vectors.
In this case, dk = 4.

S = (1/√dk) QKᵀ =

Observe that the first row of S contains the relationship/attention (dot products) that q1 has with k1, k2, and
k3. In this step, softmax() is applied to these attentions to produce attention scores that are between 0
and 1 and that sum up to 1.

Therefore, softmax() is applied to each row separately, producing a 3 x 3 matrix A:

A = softmax(S) =
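
To check S and A, a short sketch (NumPy assumed; attention_scores is just an illustrative helper name) of the scaling and row-wise softmax applied to the Q and K computed in the earlier sketch:

```python
import numpy as np

def attention_scores(Q, K):
    # Scaled dot products followed by a row-wise softmax
    dk = Q.shape[1]                                  # dimension of query/key vectors (4 here)
    S = (Q @ K.T) / np.sqrt(dk)
    E = np.exp(S - S.max(axis=1, keepdims=True))     # subtract row max for numerical stability
    return S, E / E.sum(axis=1, keepdims=True)

# S, A = attention_scores(Q, K)   # Q, K from the earlier sketch; each row of A sums to 1
```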

Output:

With these attention scores (weights) we produce output vectors by taking weighted sums of v1, v2, and
v3. So, if the attention scores in A’s first row are 0.6, 0.1, and 0.3, then we output o1 as
o1 = 0.6 v1 + 0.1 v2 + 0.3 v3.

Therefore, we basically find the updated vectors by computing

Output = A*V =
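
Putting the steps together, a minimal sketch of the full single-head computation, Output = softmax(QKᵀ/√dk) V, under the same assumptions (NumPy; X, WQ, WK, WV are the matrices given above):

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    # Single-head scaled dot-product self-attention: softmax(Q K^T / sqrt(dk)) V
    Q, K, V = X @ WQ, X @ WK, X @ WV
    S = (Q @ K.T) / np.sqrt(Q.shape[1])              # scaled scores
    E = np.exp(S - S.max(axis=1, keepdims=True))
    A = E / E.sum(axis=1, keepdims=True)             # attention weights (rows sum to 1)
    return A @ V                                     # each output row is a weighted sum of the value rows

# Output = self_attention(X, WQ, WK, WV) using the matrices defined earlier
```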

*Note on Multi-head Attention:

In the “Attention Is All You Need” paper, the authors also presented the idea
of having multiple heads, where each head produces an output as we have
done above.

Therefore, if there are n heads, each head i will have its own WQ, WK, and
WV matrices. The final output is then obtained by concatenating the outputs of each
head and applying a linear transformation (multiplying by a matrix) to produce
the final output, as shown in the figure on the right. (A rough sketch follows below.)
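
A rough sketch of how n heads could be combined; the output projection matrix WO below is an assumption for illustration and is not given in the assignment:

```python
import numpy as np

def multi_head_attention(X, heads, WO):
    # heads: list of (WQ, WK, WV) tuples, one per head; WO: output projection matrix
    outputs = []
    for WQ, WK, WV in heads:
        Q, K, V = X @ WQ, X @ WK, X @ WV
        S = (Q @ K.T) / np.sqrt(Q.shape[1])          # scaled dot-product scores
        E = np.exp(S - S.max(axis=1, keepdims=True))
        A = E / E.sum(axis=1, keepdims=True)         # row-wise softmax
        outputs.append(A @ V)                        # per-head output
    # Concatenate the per-head outputs along the feature axis, then project
    return np.concatenate(outputs, axis=1) @ WO
```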
