Online Softmax Normalizer Calculation
Uploaded by Hoang Lại

12:11 16/1/25 [CS115]_softmax_backpropagation (1).ipynb - Colab

We can use Softmax Regression for this problem. Softmax Regression is also known as Multinomial Logistic Regression.

Softmax Regression

This is a generalization of Logistic Regression (for 2 classes) to an arbitrary number of classes.

We want to build a model to discriminate red, green, and blue points in 2-dimensional space.

One-hot vector representation

Given a point in 2D space x = (x_1, x_2), we want to output either red, green, or blue.
We represent the output as a one-hot vector.

For example, we represent red as [1, 0, 0]^T, green as [0, 1, 0]^T, and blue as [0, 0, 1]^T.

# Setting up dataset
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('classic')
np.set_printoptions(precision=3, suppress=True)

X = np.array([[-0.1, 1.4],
              [-0.5, 0.2],
              [ 1.3, 0.9],
              [-0.6, 0.4],
              [-1.6, 0.2],
              [ 0.2, 0.2],
              [-0.3,-0.4],
              [ 0.7,-0.8],
              [ 1.1,-1.5],
              [-1.0, 0.9],
              [-0.5, 1.5],
              [-1.3,-0.4],
              [-1.4,-1.2],
              [-0.9,-0.7],
              [ 0.4,-1.3],
              [-0.4, 0.6],
              [ 0.3,-0.5],
              [-1.6,-0.7],
              [-0.5,-1.4],
              [-1.0,-1.4]])

y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])

colormap = np.array(['r', 'g', 'b'])

def plot_scatter(X, y, colormap):
    plt.figure()
    plt.xlim(left=-2.0, right=2.0)
    plt.ylim(bottom=-2.0, top=2.0)
    plt.scatter(X[:,0], X[:, 1], s=80, c=colormap[y])
    plt.xlabel('$x_1$', size=20)
    plt.ylabel('$x_2$', size=20)

plot_scatter(X, y, colormap)

print(y)

[0 0 1 0 2 1 1 1 1 0 0 2 2 2 1 0 1 2 2 2]

print(np.eye(3))

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

# Convert the labels to one-hot encoding
Y = np.eye(3)[y]
print(Y)

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]
Computation Graph

The parameters of our Softmax Regression model are:

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
We need to learn these parameters of the Softmax Regression model. The process of finding the appropriate values for these parameters from our dataset, so that the predictions of the model are as close as possible to the targets, is called training.

Problem Formulation

The coordinates of all the data points are put into the matrix X.
The labels of all data points are put into the vector y.

Here, the dataset (X, y) contains m samples, and for each sample i we have (x^(i), y^(i)), where x^(i) = (x_1^(i), x_2^(i)) ∈ R^2 is the coordinate of the data point, and y^(i) ∈ {0, 1, 2} is its label (red, green, blue).

Therefore, X is of size (20 × 2), and y is of size (20 × 1).

We convert each y^(i) into a one-hot vector. We thus convert vector y into a matrix Y of size (20 × 3).

Softmax Function

Feed-forward Phase

Feed-forward means: given the model parameters W and b, and given a sample (x, y), we produce the output a and compute the loss L.

$$z = Wx + b$$

where x is the input coordinate vector of size (2 × 1). Therefore, z is of size (3 × 1).

Softmax

After getting z, we apply the softmax function to compute a.

a = softmax(z)

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

Here, because z is of size (3 × 1), a is also of size (3 × 1), i.e., (a_1, a_2, a_3).

The softmax function above produces a probability distribution, i.e., $\sum_i a_i = 1$.

In classification, we want to compute the probability that a sample belongs to a class/category.

Example: Calculate softmax values for z = [1, 2, 3]^T

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = np.array([1., 2., 3.])

print(z)
print(softmax(z))

[1. 2. 3.]
[0.09  0.245 0.665]

Softmax emphasizes the relative difference between large and small values.

However, a naive implementation of softmax can suffer from numerical instability.

Example: Calculate softmax values for z = [1000, 2000, 3000]^T

z = np.array([1000.0, 2000.0, 3000.0])

print(z)
print(softmax(z))

[1000. 2000. 3000.]
[nan nan nan]
<ipython-input-7-2d8ec219fdf8>:2: RuntimeWarning: overflow encountered in exp
  return np.exp(z) / np.sum(np.exp(z))
<ipython-input-7-2d8ec219fdf8>:2: RuntimeWarning: invalid value encountered in divide

np.exp(1000.0)

<ipython-input-9-ca0c76ca62c8>:1: RuntimeWarning: overflow encountered in exp
  np.exp(1000.0)
inf

np.log(0.00000001)

-18.420680743952367

Numerically stable version

We can compute a as follows.

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = \frac{e^{K} e^{z_i}}{e^{K} \sum_{j=1}^{C} e^{z_j}} = \frac{e^{z_i + K}}{\sum_{j=1}^{C} e^{z_j + K}}$$

where $K = -\max(z_1, z_2, \ldots, z_C)$.

def stable_softmax(z):
    return np.exp(z - max(z)) / np.sum(np.exp(z - max(z)))

z = np.array([1000., 2000., 3000.])

print(z)
print(stable_softmax(z))

[1000. 2000. 3000.]
[0. 0. 1.]

import torch

torch.softmax(torch.from_numpy(z), 0)

tensor([0., 0., 1.], dtype=torch.float64)

Feed-forward Implementation

Suppose someone already computed W and b for us as follows.

W = np.array([[ 0.31, 3.95],
              [ 7.07, -0.23],
              [-6.27, -2.35]])

b = np.array([[ 1.2 ],
              [ 2.93 ],
              [-4.14 ]])

def forward(W, b, x):
    z = np.dot(W, x) + b
    a = stable_softmax(z)
    return z, a

plot_scatter(X, y, colormap)
Let's try to predict the class (red, green, or blue) of the first example in our dataset as follows.

# X[0,:] has the shape (1,2)
x = X[0,:].reshape(2,1)  # reshape from (1,2) --> (2,1)

z, a = forward(W, b, x)

print(x.flatten())
print(z.flatten())
print(a.flatten())
print(y[0])

[-0.1  1.4]
[ 6.699  1.901 -6.803]
[0.992 0.008 0.   ]
0

The class of the first example is red (i.e., 0).
Our classifier predicts the probability of each class [red, green, blue] is [0.992, 0.008, 0.0].

If we already had weights W and bias b, it would be easy to make predictions. But how do we learn these parameters (weights and bias) properly?

Maximum Likelihood Estimation (MLE) and Loss Function for Softmax Regression

We need to learn a matrix W of size (3 × 2) and a bias b of size (3 × 1) that best discriminate red, green, and blue points.

Let's say we have 3 points and their classes (and their one-hot vector representations) as follows.

$$x^{(1)} = \begin{bmatrix} -0.1 \\ 1.4 \end{bmatrix}, \quad y^{(1)} = 0 \text{ (red)}, \quad y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

$$x^{(2)} = \begin{bmatrix} 1.3 \\ 0.9 \end{bmatrix}, \quad y^{(2)} = 1 \text{ (green)}, \quad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

$$x^{(3)} = \begin{bmatrix} -1.4 \\ -1.1 \end{bmatrix}, \quad y^{(3)} = 2 \text{ (blue)}, \quad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

Assume that there's a classifier that makes the following predictions a's about these 3 points.

$$y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad a^{(1)} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0 \end{bmatrix}$$

$$y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \quad a^{(2)} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix}$$

$$y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad a^{(3)} = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.7 \end{bmatrix}$$

Is this a good classifier? How likely is it?

Intuitively, we want to find a classifier that produces a's similar to y's.

For example, if $y^{(i)} = [0, 1, 0]^T$, having $a^{(i)} = [0.1, 0.8, 0.1]^T$ is more desirable than having $a^{(i)} = [0.2, 0.6, 0.2]^T$.

The likelihood of the classifier producing a^(i) regarding one example is:

$$\prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}}$$

For example, with a^(1) and y^(1) as above, we have:

$$\prod_{j=1}^{3} \left(a_j^{(1)}\right)^{y_j^{(1)}} = (0.9)^1 \times (0.1)^0 \times (0.0)^0 = 0.9 \times 1 \times 1 = 0.9$$

Similar to the case of Logistic Regression, we need to apply Maximum Likelihood Estimation (MLE) for multiple examples. We need to find weights W and bias b to maximize the following:

$$\prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}} \quad \text{where } a^{(i)} = \text{softmax}(Wx^{(i)} + b)$$

and $(x^{(i)}, y^{(i)})$ is the i-th example in our dataset of N training examples.

Maximizing the above is equivalent to maximizing the following:

$$\log \left( \prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}} \right) = \sum_{i=1}^{N} \sum_{j=1}^{3} \log \left(a_j^{(i)}\right)^{y_j^{(i)}} = \sum_{i=1}^{N} \sum_{j=1}^{3} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Maximizing the above is equivalent to minimizing the following:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{3} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

which is the loss function (cost function/log loss function) for our Softmax Regression problem here.

This loss function can also be called the Cross Entropy Loss function. In Logistic Regression (binary classification), this is the Binary Cross Entropy loss.

In this example, we have only 3 classes: red, green, blue. In the general case, where we need to make predictions for C classes, the loss function of Softmax Regression would be:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Numerical Stability of Loss Function

The loss function for one sample is

$$L = -\sum_{j=1}^{C} y_j \log(a_j)$$
def compute_loss(y, a):
    return -1.0 * np.sum(y * np.log(a))

The logarithm function is not numerically stable for small input values 0 < a_j < 1.

So we derive a different computation as follows.

$$L = -\sum_{j=1}^{C} y_j \log(a_j)$$

$$= -\sum_{j=1}^{C} y_j \log \left( \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}} \right)$$

$$= -\sum_{j=1}^{C} y_j \log \left( \frac{e^{z_j + K}}{\sum_{k=1}^{C} e^{z_k + K}} \right) \quad \text{where } K = -\max(z_1, z_2, \ldots, z_C)$$

$$= -\sum_{j=1}^{C} y_j \left[ \log\left(e^{z_j + K}\right) - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right) \right]$$

$$= -\sum_{j=1}^{C} y_j \left[ z_j + K - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right) \right]$$

The first log is eliminated. The domain of the second log function is [1, ∞), which is the numerically stable range for the logarithm function.

def compute_loss_stable_version(y, z):
    return -1.0 * np.sum(y * (z - z.max() - np.log(np.sum(np.exp(z - z.max())))))

Gradient Descent for Softmax Regression

We can start with random weights W and bias b.

We make the parameters (weights and bias) better and better gradually by updating them in each iteration:

$$W = W - \alpha \frac{dL}{dW}$$

$$b = b - \alpha \frac{dL}{db}$$

To compute $\frac{dL}{dW}$ and $\frac{dL}{db}$, we need to do backpropagation.

Let's compute each element of the gradient separately.

$$\frac{dL}{dw_{m,n}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{dw_{m,n}}$$

$$\frac{dL}{db_m} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{db_m}$$

For example, if we want to compute $\frac{dL}{dw_{2,1}}$:

$$\frac{dL}{dw_{2,1}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_2} \frac{dz_2}{dw_{2,1}} = \frac{dL}{da_1}\frac{da_1}{dz_2}\frac{dz_2}{dw_{2,1}} + \frac{dL}{da_2}\frac{da_2}{dz_2}\frac{dz_2}{dw_{2,1}} + \frac{dL}{da_3}\frac{da_3}{dz_2}\frac{dz_2}{dw_{2,1}}$$

Changing $w_{2,1}$ leads to changing $z_2$, which leads to changing $a_1, a_2, a_3$, which all lead to changing L.

Therefore, we are computing the effect of how changing $w_{2,1}$ influences L over 3 different paths. We then sum up all the paths together to compute $\frac{dL}{dw_{2,1}}$.

We can easily compute these derivatives:

$$\frac{dz_m}{dw_{m,n}} = \frac{d}{dw_{m,n}} (w_{m,1} x_1 + w_{m,2} x_2 + \ldots + w_{m,n} x_n + \ldots + w_{m,d} x_d + b_m) = x_n$$

$$\frac{dz_m}{db_m} = \frac{d}{db_m} (w_{m,1} x_1 + w_{m,2} x_2 + \ldots + w_{m,n} x_n + \ldots + w_{m,d} x_d + b_m) = 1$$

For the example above, we have:

$$\frac{dz_2}{dw_{2,1}} = \frac{d}{dw_{2,1}} (w_{2,1} x_1 + w_{2,2} x_2 + b_2) = x_1$$

$$\frac{dz_2}{db_2} = \frac{d}{db_2} (w_{2,1} x_1 + w_{2,2} x_2 + b_2) = 1$$

We then have

$$\frac{dz_1}{dw_{1,1}} = x_1 \qquad \frac{dz_1}{dw_{1,2}} = x_2 \qquad \frac{dz_1}{db_1} = 1$$

$$\frac{dz_2}{dw_{2,1}} = x_1 \qquad \frac{dz_2}{dw_{2,2}} = x_2 \qquad \frac{dz_2}{db_2} = 1$$

$$\frac{dz_3}{dw_{3,1}} = x_1 \qquad \frac{dz_3}{dw_{3,2}} = x_2 \qquad \frac{dz_3}{db_3} = 1$$

We can also easily compute $\frac{dL}{da_i}$ as follows.

$$\frac{dL}{da_i} = \frac{d}{da_i} \left( -\sum_{j=1}^{C} y_j \log(a_j) \right) = \frac{d}{da_i} \left( -y_i \log(a_i) \right) = \frac{-y_i}{a_i}$$

We then have:

$$\frac{dL}{da_1} = \frac{-y_1}{a_1} \qquad \frac{dL}{da_2} = \frac{-y_2}{a_2} \qquad \frac{dL}{da_3} = \frac{-y_3}{a_3}$$

Then,

$$\frac{dL}{da} = \begin{bmatrix} \frac{dL}{da_1} \\ \frac{dL}{da_2} \\ \frac{dL}{da_3} \end{bmatrix} = \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix}$$

We still need to compute the derivative of the Softmax function, $\frac{da_i}{dz_m}$.

Derivative of Softmax function

$\frac{da_i}{dz_m}$ tells us how much $a_i$ would change if $z_m$ changes.

$$\frac{d}{dz_m} a_i = \frac{d}{dz_m} \left( \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \right)$$

Applying the Quotient Rule

$$\frac{d}{dx} \frac{f(x)}{g(x)} = \frac{f'(x)g(x) - g'(x)f(x)}{(g(x))^2}$$

Here, $f(x) = e^{z_i}$ and $g(x) = \sum_{j=1}^{C} e^{z_j}$.

We have two cases, i = m and i ≠ m.
If i = m:

$$\frac{da_i}{dz_m} = \frac{d}{dz_m} \left( \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \right) = \frac{\left(e^{z_i}\right)' \sum_{j=1}^{C} e^{z_j} - \left(\sum_{j=1}^{C} e^{z_j}\right)' e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_i} \sum_{j=1}^{C} e^{z_j} - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2}$$

$$= \frac{e^{z_i} \left(\sum_{j=1}^{C} e^{z_j} - e^{z_m}\right)}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_m} \left(\sum_{j=1}^{C} e^{z_j} - e^{z_m}\right)}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{\sum_{j=1}^{C} e^{z_j} - e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} = a_m (1 - a_m)$$

If i ≠ m:

$$\frac{da_i}{dz_m} = \frac{d}{dz_m} \left( \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \right) = \frac{\left(e^{z_i}\right)' \sum_{j=1}^{C} e^{z_j} - \left(\sum_{j=1}^{C} e^{z_j}\right)' e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{0 - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = -\frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = -a_m a_i$$

In summary, we have

$$\frac{da_i}{dz_m} = \begin{cases} a_m(1 - a_m) & \text{if } i = m \\ -a_m a_i & \text{if } i \neq m \end{cases}$$

We can arrange $\frac{da}{dz}$ as a (C × C) matrix.

For our example above, we have

$$\frac{da}{dz} = \begin{bmatrix} \frac{da_1}{dz_1} & \frac{da_2}{dz_1} & \frac{da_3}{dz_1} \\ \frac{da_1}{dz_2} & \frac{da_2}{dz_2} & \frac{da_3}{dz_2} \\ \frac{da_1}{dz_3} & \frac{da_2}{dz_3} & \frac{da_3}{dz_3} \end{bmatrix} = \begin{bmatrix} a_1(1-a_1) & a_1(-a_2) & a_1(-a_3) \\ a_2(-a_1) & a_2(1-a_2) & a_2(-a_3) \\ a_3(-a_1) & a_3(-a_2) & a_3(1-a_3) \end{bmatrix}$$

$$= \begin{bmatrix} a_1 & a_1 & a_1 \\ a_2 & a_2 & a_2 \\ a_3 & a_3 & a_3 \end{bmatrix} \circ \begin{bmatrix} (1-a_1) & -a_2 & -a_3 \\ -a_1 & (1-a_2) & -a_3 \\ -a_1 & -a_2 & (1-a_3) \end{bmatrix}$$

$$= \begin{bmatrix} a_1 & a_1 & a_1 \\ a_2 & a_2 & a_2 \\ a_3 & a_3 & a_3 \end{bmatrix} \circ \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} a_1 & a_2 & a_3 \\ a_1 & a_2 & a_3 \\ a_1 & a_2 & a_3 \end{bmatrix} \right)$$

$$= \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \circ \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \right) = a\mathbf{1}^T \circ (I - \mathbf{1}a^T)$$

We can compute $\frac{dL}{dz_1}, \frac{dL}{dz_2}, \frac{dL}{dz_3}$ as follows.

$$\frac{dL}{dz} = \begin{bmatrix} \frac{dL}{dz_1} \\ \frac{dL}{dz_2} \\ \frac{dL}{dz_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{da_1}\frac{da_1}{dz_1} + \frac{dL}{da_2}\frac{da_2}{dz_1} + \frac{dL}{da_3}\frac{da_3}{dz_1} \\ \frac{dL}{da_1}\frac{da_1}{dz_2} + \frac{dL}{da_2}\frac{da_2}{dz_2} + \frac{dL}{da_3}\frac{da_3}{dz_2} \\ \frac{dL}{da_1}\frac{da_1}{dz_3} + \frac{dL}{da_2}\frac{da_2}{dz_3} + \frac{dL}{da_3}\frac{da_3}{dz_3} \end{bmatrix} = \begin{bmatrix} \frac{da_1}{dz_1} & \frac{da_2}{dz_1} & \frac{da_3}{dz_1} \\ \frac{da_1}{dz_2} & \frac{da_2}{dz_2} & \frac{da_3}{dz_2} \\ \frac{da_1}{dz_3} & \frac{da_2}{dz_3} & \frac{da_3}{dz_3} \end{bmatrix} \begin{bmatrix} \frac{dL}{da_1} \\ \frac{dL}{da_2} \\ \frac{dL}{da_3} \end{bmatrix} = \frac{da}{dz} \frac{dL}{da}$$

For the example here, we have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dw_{1,1}} & \frac{dL}{dw_{1,2}} \\ \frac{dL}{dw_{2,1}} & \frac{dL}{dw_{2,2}} \\ \frac{dL}{dw_{3,1}} & \frac{dL}{dw_{3,2}} \end{bmatrix} \quad \text{and} \quad \frac{dL}{db} = \begin{bmatrix} \frac{dL}{db_1} \\ \frac{dL}{db_2} \\ \frac{dL}{db_3} \end{bmatrix}$$

Above, we already had

$$\frac{dL}{dw_{2,1}} = \sum_{i=1}^{3} \frac{dL}{da_i}\frac{da_i}{dz_2}\frac{dz_2}{dw_{2,1}} = \frac{dL}{da_1}\frac{da_1}{dz_2}\frac{dz_2}{dw_{2,1}} + \frac{dL}{da_2}\frac{da_2}{dz_2}\frac{dz_2}{dw_{2,1}} + \frac{dL}{da_3}\frac{da_3}{dz_2}\frac{dz_2}{dw_{2,1}}$$

$$= \left( \frac{dL}{da_1}\frac{da_1}{dz_2} + \frac{dL}{da_2}\frac{da_2}{dz_2} + \frac{dL}{da_3}\frac{da_3}{dz_2} \right) \frac{dz_2}{dw_{2,1}} = \frac{dL}{dz_2}\frac{dz_2}{dw_{2,1}}$$

Similarly,

$$\frac{dL}{db_2} = \sum_{i=1}^{3} \frac{dL}{da_i}\frac{da_i}{dz_2}\frac{dz_2}{db_2} = \left( \frac{dL}{da_1}\frac{da_1}{dz_2} + \frac{dL}{da_2}\frac{da_2}{dz_2} + \frac{dL}{da_3}\frac{da_3}{dz_2} \right) \frac{dz_2}{db_2} = \frac{dL}{dz_2}\frac{dz_2}{db_2}$$

From above, we have

$$\frac{dz_1}{dw_{1,1}} = x_1 \qquad \frac{dz_1}{dw_{1,2}} = x_2 \qquad \frac{dz_1}{db_1} = 1$$

$$\frac{dz_2}{dw_{2,1}} = x_1 \qquad \frac{dz_2}{dw_{2,2}} = x_2 \qquad \frac{dz_2}{db_2} = 1$$

$$\frac{dz_3}{dw_{3,1}} = x_1 \qquad \frac{dz_3}{dw_{3,2}} = x_2 \qquad \frac{dz_3}{db_3} = 1$$

We then have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{dw_{1,1}} & \frac{dL}{dz_1}\frac{dz_1}{dw_{1,2}} \\ \frac{dL}{dz_2}\frac{dz_2}{dw_{2,1}} & \frac{dL}{dz_2}\frac{dz_2}{dw_{2,2}} \\ \frac{dL}{dz_3}\frac{dz_3}{dw_{3,1}} & \frac{dL}{dz_3}\frac{dz_3}{dw_{3,2}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} x_1 & \frac{dL}{dz_1} x_2 \\ \frac{dL}{dz_2} x_1 & \frac{dL}{dz_2} x_2 \\ \frac{dL}{dz_3} x_1 & \frac{dL}{dz_3} x_2 \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \\ \frac{dL}{dz_2} \\ \frac{dL}{dz_3} \end{bmatrix} \begin{bmatrix} x_1 & x_2 \end{bmatrix} = \frac{dL}{dz} x^T$$

and

$$\frac{dL}{db} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{db_1} \\ \frac{dL}{dz_2}\frac{dz_2}{db_2} \\ \frac{dL}{dz_3}\frac{dz_3}{db_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \cdot 1 \\ \frac{dL}{dz_2} \cdot 1 \\ \frac{dL}{dz_3} \cdot 1 \end{bmatrix} = \frac{dL}{dz}$$

Backpropagation Phase

In the backpropagation phase, we compute $\frac{dL}{dW}$ and $\frac{dL}{db}$ using the derivatives we have computed above.

NOTE: The notations here are for the sake of convenience.
From above, we have:

$$\frac{da}{dz} = \begin{bmatrix} a_1(1-a_1) & a_1(-a_2) & a_1(-a_3) \\ a_2(-a_1) & a_2(1-a_2) & a_2(-a_3) \\ a_3(-a_1) & a_3(-a_2) & a_3(1-a_3) \end{bmatrix} \qquad \frac{dL}{da} = \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix}$$

$$\frac{dL}{dz} = \begin{bmatrix} a_1(1-a_1) & a_1(-a_2) & a_1(-a_3) \\ a_2(-a_1) & a_2(1-a_2) & a_2(-a_3) \\ a_3(-a_1) & a_3(-a_2) & a_3(1-a_3) \end{bmatrix} \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix} = \begin{bmatrix} -y_1(1-a_1) + y_2 a_1 + y_3 a_1 \\ y_1 a_2 - y_2(1-a_2) + y_3 a_2 \\ y_1 a_3 + y_2 a_3 - y_3(1-a_3) \end{bmatrix}$$

$$= \begin{bmatrix} a_1 - 1 & a_1 & a_1 \\ a_2 & a_2 - 1 & a_2 \\ a_3 & a_3 & a_3 - 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \left( \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \right) \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}$$

$$= (a\mathbf{1}^T - I)y = a\mathbf{1}^T y - Iy = a - y \quad \text{[here } y \text{ is a one-hot vector, so } \mathbf{1}^T y = 1\text{]}$$

So $\frac{dL}{dz} = a - y$.

Finally, we have

$$\frac{dL}{dW} = \frac{dL}{dz} x^T = (a - y)x^T$$

and

$$\frac{dL}{db} = \frac{dL}{dz} = (a - y)$$

def compute_gradient(x, y, z, a):
    # compute dL/da
    da = -y / a

    # compute da/dz
    matrix = np.dot(a, np.ones((1, 3))) * (np.eye(3) - np.dot(np.ones((3, 1)), a.T))

    # compute dL/dz
    #dz = np.dot(matrix, da)
    dz = a - y

    # compute dL/dW
    dW = np.dot(dz, x.T)
    # compute dL/db
    db = dz.copy()

    return dW, db

Now we can implement Gradient Descent to learn the parameters of Softmax Regression for our problem here as follows.

learning_rate = 2.0  # This is just for this DEMO. Normally, the learning rate is quite small.
num_epochs = 40

def gradient_descent(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b

# random initialization
W_initial = np.random.randn(3, 2)
W = W_initial.copy()
b = np.zeros((3, 1))

W_cache = []
b_cache = []
L_cache = []

for i in range(num_epochs):
    dW = np.zeros(W.shape)
    db = np.zeros(b.shape)
    L = 0

    # Loop through each example in the dataset
    for j in range(X.shape[0]):
        x_j = X[j,:].reshape(2,1)
        y_j = Y[j,:].reshape(3,1)

        z_j, a_j = forward(W, b, x_j)
        loss_j = compute_loss_stable_version(y_j, z_j)
        dW_j, db_j = compute_gradient(x_j, y_j, z_j, a_j)

        dW += dW_j
        db += db_j
        L += loss_j

    dW = (1.0/X.shape[0]) * dW
    db = (1.0/X.shape[0]) * db
    L = (1.0/X.shape[0]) * L

    # gradient descent
    W, b = gradient_descent(W, b, dW, db, learning_rate)

    W_cache.append(W)
    b_cache.append(b)
    L_cache.append(L)

plt.figure()
plt.xlabel('Number of iterations', size=20)
plt.ylabel('Loss', size=20)
plt.ylim(bottom=0.0, top=max(L_cache)*1.1)
plt.plot(L_cache)
print(L_cache[-1])

0.10777593297399604

Decision Boundary

def plot_decision_boundary(X, Y, W, b):
    plt.figure()
    plt.xlim([-2.0, 2.0])
    plt.ylim([-2.0, 2.0])
    plt.xlabel('$x_1$', size=20)
    plt.ylabel('$x_2$', size=20)
    plt.title('Decision boundary', size=18)

    plt.scatter(X[:, 0], X[:, 1], s=50, c=colormap[y])

    xs = np.array([-2.0, 2.0])
    ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
    ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
    ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])

    plt.plot(xs, ys1, c='black')
    plt.plot(xs, ys2, c='black')
    plt.plot(xs, ys3, c='black')

plot_decision_boundary(X, Y, W, b)

# Import for animation on Google Colab
from matplotlib import rc
rc('animation', html='jshtml')

import matplotlib.animation as animation

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)
ax.set_title('Decision boundary - Animated', size=18)

def animate(i):
    xs = np.array([-2.0, 2.0])
    W = W_cache[i]
    b = b_cache[i]

    ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
    ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
    ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])

    lines1.set_data(xs, ys1)
    lines2.set_data(xs, ys2)
    lines3.set_data(xs, ys3)

    text_box.set_text('Iteration: {}'.format(i))

    return lines1, lines2, lines3, text_box

lines1, = ax.plot([], [], c='black')
lines2, = ax.plot([], [], c='black')
lines3, = ax.plot([], [], c='black')

ax.scatter(X[:, 0], X[:, 1], s=50, c=colormap[y])
text_box = ax.text(1.1, 1.6, 'Iteration 0', size=16)

anim = animation.FuncAnimation(fig, animate, len(W_cache), blit=False, interval=500)
plt.close(fig)
anim
NX = 100
NY = 100

def plot_decision_boundary_heatmap(X, Y, W, b):
    plt.figure()
    plt.xlim([-2.0, 2.0])
    plt.ylim([-2.0, 2.0])