
Assignment 2 (Total Points: 20)

Part A: Basics of Neural Nets

1. [5 points] Assume you have the following neural network with 3 neurons (circles in green).

Here,
Z11 = w11.x1 + w21.x2, A11 = ReLU(Z11)
Z12 = w12.x1 + w22.x2, A12 = ReLU(Z12)
Z2 = w3.A11 + w4.A12, A2 = ReLU(Z2)

a. Find the values of the w’s so that A2 gives the same prediction as Y in the table on the left.
b. Find the values of the w’s so that the neural network produces Y = X1 XNOR X2. (A code sketch for checking candidate weights is given below.)
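
A minimal sketch (assuming NumPy; the weight values below are placeholders to be replaced by your hand-derived answers) for checking that a candidate set of w’s reproduces a target truth table:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x1, x2, w11, w21, w12, w22, w3, w4):
    # Hidden layer: two neurons with ReLU activations (as defined above)
    a11 = relu(w11 * x1 + w21 * x2)
    a12 = relu(w12 * x1 + w22 * x2)
    # Output neuron, also with ReLU activation
    return relu(w3 * a11 + w4 * a12)

# Placeholder weights -- substitute the values you derive by hand
w = dict(w11=1.0, w21=1.0, w12=1.0, w22=1.0, w3=1.0, w4=1.0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, forward(x1, x2, **w))
```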

2. [5 points] Assume you have a single neuron with a sign() activation function; that is, the output of the
neuron will be either 0 or 1, and it is given by:

a = sign(wᵀx + b)

The sign function is defined as below


sign(z) = 1 if z > 0, and 0 otherwise

a. Given w = [5 6] and b = 30, what is the output of the neuron when x1 = [1 2] and when x2 = [2 8]?

b. Plot the decision boundary (the curve that separates the points for which the output is 1 from
those for which the output is 0). Use the values of w and b from part (a). (A code sketch follows below.)
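
A small sketch (assuming NumPy and Matplotlib) that evaluates the neuron on the two given inputs and plots the line wᵀx + b = 0, which is the decision boundary for this neuron:

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.array([5.0, 6.0])
b = 30.0

def neuron(x):
    # sign() as defined above: 1 if z > 0, else 0
    return 1 if w @ x + b > 0 else 0

print(neuron(np.array([1, 2])))   # output for x1 = [1 2]
print(neuron(np.array([2, 8])))   # output for x2 = [2 8]

# Decision boundary: w[0]*x1 + w[1]*x2 + b = 0  =>  x2 = -(w[0]*x1 + b) / w[1]
x1_vals = np.linspace(-10, 10, 100)
x2_vals = -(w[0] * x1_vals + b) / w[1]
plt.plot(x1_vals, x2_vals, label="5*x1 + 6*x2 + 30 = 0")
plt.scatter([1, 2], [2, 8], color="red")   # the two given points
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```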
Part B: Self-attention [10 points]

This part of the assignment/tutorial will take you through the self-attention computation that was
proposed in the seminal paper “Attention Is All You Need”. You are to do the computations by hand.

Assume you have embeddings or representations of 3 tokens given by x1, x2, x3; each of these xi’s is a
5-vector (the dimension of the vectors is 5), stored as a row in the following 3 x 5 matrix X:

X = [ 1 2 3 7 1
      2 1 0 1 0
      0 1 3 5 1 ]

The self-attention mechanism computes the pairwise attention or relationship between the (transformed)
vectors and, based on these pairwise attentions, produces outputs that are linear combinations of the
vectors.

Figure: the self-attention computation is shown on the left; the same computation is represented on the right,
in the form used in the “Attention Is All You Need” paper.

The self-attention mechanism will first construct 3 query, 3 key, and 3 value vectors by transforming
each of the token vectors, multiplying them with matrices WQ, WK, and WV. The dimensions of the
W matrices determine the dimensions of the query, key, and value vectors. Assume the following
WQ, WK, and WV matrices:

WQ = [ 1 0 1 0        WK = [ 1 0 1 0        WV = [ 1 0 0 1
       0 1 0 1               2 1 1 0               0 1 1 1
       0 2 0 0               0 0 0 0               0 1 0 1
       0 0 1 1               0 1 0 0               1 1 1 1
       0 0 0 1 ]             0 1 0 1 ]             1 0 0 0 ]
Compute the query, key and value vectors

The vectors will be just linear transformations of the xi’s. The linear transformations are represented
by the W matrices.

query vector, q1 = x1 * WQ =
query vector, q2 = x2 * WQ =
query vector, q3 = x3 *WQ =

Observe that these can be computed directly by Q = XWQ (matrix multiplication),


where q1, q2, q3 are the rows of Q (a 3 x 4 matrix). So,

Q=

Similarly,

find K = XWK, where k1, k2, k3 are the rows of K representing the key vectors corresponding to the
vectors x1, x2, x3

K=

and find V = XWV, where v1, v2, v3 are the rows of V representing the value vectors corresponding to
the vectors x1, x2, x3

V=
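
If you want to verify your hand computations, a minimal NumPy sketch using the X, WQ, WK, and WV matrices given above:

```python
import numpy as np

X = np.array([[1, 2, 3, 7, 1],
              [2, 1, 0, 1, 0],
              [0, 1, 3, 5, 1]])

WQ = np.array([[1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 2, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 0, 1]])

WK = np.array([[1, 0, 1, 0],
               [2, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 0, 1]])

WV = np.array([[1, 0, 0, 1],
               [0, 1, 1, 1],
               [0, 1, 0, 1],
               [1, 1, 1, 1],
               [1, 0, 0, 0]])

# Rows of Q, K, V are the query, key, and value vectors for x1, x2, x3
Q = X @ WQ
K = X @ WK
V = X @ WV
print(Q)
print(K)
print(V)
```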

Compute the Attention Scores

Once the Q, K, V matrices are computed, we can use the equation mentioned in the paper: Attention(Q, K, V) = softmax(QKᵀ / √dk) V.

Next, we use the query and key vectors to compute the pairwise attentions; we will use dot-product
attention. We will get 9 dot products that can be easily stored in a 3 x 3 matrix, QKᵀ.

QKᵀ =
Observe that the ij-th element of QKᵀ will be the dot product between qi (the query vector corresponding to xi)
and kj (the key vector corresponding to xj).

Now we scale the entries of the matrix by 1/√dk, where dk is the dimension of the query or key vectors.
In this case, dk = 4.

S = (1/√dk) QKᵀ =

Observe that the first row of S contains the relationship/attention (dot products) that q1 has with k1, k2, and
k3. In this step, softmax() is applied to these attentions to produce attention scores that are between 0
and 1 and that sum up to 1.

Therefore, softmax() is applied to each row separately, producing a 3 x 3 matrix A:

A = softmax(S) =
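
To check S and A, a short sketch (NumPy assumed; attention_scores is just an illustrative helper name) of the scaling and row-wise softmax applied to the Q and K computed in the earlier sketch:

```python
import numpy as np

def attention_scores(Q, K):
    # Scaled dot products followed by a row-wise softmax
    dk = Q.shape[1]                                  # dimension of query/key vectors (4 here)
    S = (Q @ K.T) / np.sqrt(dk)
    E = np.exp(S - S.max(axis=1, keepdims=True))     # subtract row max for numerical stability
    return S, E / E.sum(axis=1, keepdims=True)

# S, A = attention_scores(Q, K)   # Q, K from the earlier sketch; each row of A sums to 1
```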

Output:

With these attention scores (weights) we produce output vectors by taking weighted sums of v1, v2, and
v3. So, if the attention scores in A’s first row are 0.6, 0.1, and 0.3, then we output o1 as
o1 = 0.6 v1 + 0.1 v2 + 0.3 v3.

Therefore, we basically find the updated vectors by computing

Output = A*V =
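
Putting the steps together, a minimal sketch of the full single-head computation, Output = softmax(QKᵀ/√dk) V, under the same assumptions (NumPy; X, WQ, WK, WV are the matrices given above):

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    # Single-head scaled dot-product self-attention: softmax(Q K^T / sqrt(dk)) V
    Q, K, V = X @ WQ, X @ WK, X @ WV
    S = (Q @ K.T) / np.sqrt(Q.shape[1])              # scaled scores
    E = np.exp(S - S.max(axis=1, keepdims=True))
    A = E / E.sum(axis=1, keepdims=True)             # attention weights (rows sum to 1)
    return A @ V                                     # each output row is a weighted sum of the value rows

# Output = self_attention(X, WQ, WK, WV) using the matrices defined earlier
```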

*Note on Multi-head Attention:

In the “Attention Is All You Need” paper, the authors also presented the idea
of having multiple heads, where each head produces an output as we have
done above.

Therefore, if there are n heads, each head i will have its own WQ, WK, and
WV matrices. The final output is then obtained by concatenating the outputs of each
head and applying a linear transformation (multiplying by a matrix) to produce
the final output, as shown in the figure on the right. (A rough sketch follows below.)
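
A rough sketch of how n heads could be combined; the output projection matrix WO below is an assumption for illustration and is not given in the assignment:

```python
import numpy as np

def multi_head_attention(X, heads, WO):
    # heads: list of (WQ, WK, WV) tuples, one per head; WO: output projection matrix
    outputs = []
    for WQ, WK, WV in heads:
        Q, K, V = X @ WQ, X @ WK, X @ WV
        S = (Q @ K.T) / np.sqrt(Q.shape[1])          # scaled dot-product scores
        E = np.exp(S - S.max(axis=1, keepdims=True))
        A = E / E.sum(axis=1, keepdims=True)         # row-wise softmax
        outputs.append(A @ V)                        # per-head output
    # Concatenate the per-head outputs along the feature axis, then project
    return np.concatenate(outputs, axis=1) @ WO
```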
