Online Softmax Normalizer Calculation
We can use Softmax Regression for this problem. Softmax Regression is also called Multinomial Logistic Regression.

Softmax Regression

This is a generalization of Logistic Regression (for 2 classes) to an arbitrary number of classes.

We want to build a model to discriminate red, green, and blue points in 2-dimensional space.

One-hot vector representation

Given a point in 2D space $x = (x_1, x_2)$, we want to output either red, green, or blue.

We represent the output as a one-hot vector.
For example, we represent red as $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, green as $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and blue as $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$.

# Setting up dataset
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('classic')
np.set_printoptions(precision=3, suppress=True)
X = np.array([[-0.1, 1.4],
              [-0.5, 0.2],
              [ 1.3, 0.9],
              [-0.6, 0.4],
              [-1.6, 0.2],
              [ 0.2, 0.2],
              [-0.3,-0.4],
              [ 0.7,-0.8],
              [ 1.1,-1.5],
              [-1.0, 0.9],
              [-0.5, 1.5],
              [-1.3,-0.4],
              [-1.4,-1.2],
              [-0.9,-0.7],
              [ 0.4,-1.3],
              [-0.4, 0.6],
              [ 0.3,-0.5],
              [-1.6,-0.7],
              [-0.5,-1.4],
              [-1.0,-1.4]])

y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])

print(y)

[0 0 1 0 2 1 1 1 1 0 0 2 2 2 1 0 1 2 2 2]

print(np.eye(3))

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

# Convert the labels to one-hot encoding
Y = np.eye(3)[y]
print(Y)

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]

colormap = np.array(['r', 'g', 'b'])

def plot_scatter(X, y, colormap):
    plt.figure()
    plt.xlim(left=-2.0, right=2.0)
    plt.ylim(bottom=-2.0, top=2.0)
    plt.scatter(X[:,0], X[:,1], s=80, c=colormap[y])
    plt.xlabel('$x_1$', size=20)
    plt.ylabel('$x_2$', size=20)

plot_scatter(X, y, colormap)
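As a quick sanity check on the one-hot encoding (a small sketch, not a cell from the original notebook): `np.argmax` inverts the `np.eye` indexing trick and recovers the integer labels.

```python
import numpy as np

y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])

# Build the one-hot matrix by indexing rows of the identity matrix
Y = np.eye(3)[y]

# argmax along each row recovers the integer labels
recovered = np.argmax(Y, axis=1)
print(np.array_equal(recovered, y))  # → True
```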
Computation Graph

The model computes $z = Wx + b$ with parameters

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
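To make the shapes concrete, here is a minimal sketch of the computation $z = Wx + b$; the parameter values below are made up for illustration, not the notebook's:

```python
import numpy as np

# Hypothetical parameter values, just to illustrate the shapes
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.3],
              [-0.4,  0.1]])   # shape (3, 2): one row per class
b = np.array([[0.1],
              [0.2],
              [0.3]])          # shape (3, 1): one bias per class

x = np.array([[-0.1],
              [ 1.4]])         # shape (2, 1): one 2D point

z = np.dot(W, x) + b           # shape (3, 1): one score per class
print(z.shape)                 # → (3, 1)
```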
Finding the parameters W and b from our dataset so that the predictions of the model are as close as possible to the targets is called training.

The coordinates of all the data points are put into the matrix X.

The labels of all the data points are put into the vector y.

Here, the dataset (X, y) contains $m = 20$ samples, and for each sample $i$ we have $(x^{(i)}, y^{(i)})$, where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}) \in \mathbb{R}^2$ is the coordinate of the data point and $y^{(i)}$ is its label (red, green, or blue).

The logarithm of a very small positive number is a large negative number, which will matter for numerical stability later:

np.log(1000.0)

6.907755278982137

np.log(0.00000001)

-18.420680743952367
Softmax Function

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = \frac{e^{K} e^{z_i}}{e^{K} \sum_{j=1}^{C} e^{z_j}} = \frac{e^{z_i + K}}{\sum_{j=1}^{C} e^{z_j + K}}$$

where $K = -\max(z_1, z_2, \ldots, z_C)$.
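This identity can be checked numerically; a quick sketch:

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = np.array([2.0, 1.0, 0.1])
K = -np.max(z)

# Shifting all inputs by K multiplies numerator and denominator by e^K,
# so the probabilities are unchanged
print(np.allclose(softmax(z), softmax(z + K)))  # → True
```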
Feed-forward Phase

Feed-forward means: given the model parameters W and b, and given a sample (x, y), we produce the output a and compute the loss L.

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

In classification, we want to compute the probability that a sample belongs to a class/category.

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

Suppose someone already computed W and b for us as follows.

b = np.array([[ 1.2 ],
              [ 2.93],
              [-4.14]])

plot_scatter(X, y, colormap)
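To see why `stable_softmax` matters, a small sketch: the naive softmax overflows for large scores, while the shifted version does not.

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([1000.0, 990.0, 980.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax(z)          # e^1000 overflows to inf, giving nan

stable = stable_softmax(z)      # largest shifted score is e^0 = 1

print(np.isnan(naive).any())    # → True
print(stable)                   # well-defined probabilities summing to 1
```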
Softmax emphasizes the relative difference between large and small values.

Suppose the classifier produces the following outputs:

$$y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad a^{(1)} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix} \qquad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \quad a^{(2)} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix} \qquad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad a^{(3)} = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.7 \end{bmatrix}$$

For example, if $y^{(i)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, having $a^{(i)} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix}$ is more desirable than having $a^{(i)} = \begin{bmatrix} 0.2 \\ 0.6 \\ 0.2 \end{bmatrix}$.

The likelihood of the classifier producing $a^{(i)}$ for one example is:

$$\prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}}$$
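Because $y^{(i)}$ is one-hot, the product simply picks out the predicted probability of the true class. A quick sketch using $y^{(2)}$ and $a^{(2)}$ from above:

```python
import numpy as np

a = np.array([0.1, 0.8, 0.1])   # a^(2) from above
y = np.array([0, 1, 0])         # y^(2) from above

likelihood = np.prod(a ** y)    # 0.1^0 * 0.8^1 * 0.1^0
print(likelihood)               # → 0.8
```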
For example, with $a^{(1)}$ and $y^{(1)}$ as above, we have:

$$\prod_{j=1}^{3} \left(a_j^{(1)}\right)^{y_j^{(1)}} = (0.9)^1 \times (0.1)^0 \times (0.0)^0 = 0.9 \times 1 \times 1 = 0.9$$

Let's try to predict the class (red, green, or blue) of the first example in our dataset as follows.

# X[0,:] has the shape (1,2)
x = X[0,:].reshape(2,1)
z = np.dot(W, x) + b
a = stable_softmax(z)
print(x.flatten())
print(z.flatten())
print(a.flatten())
print(y[0])

[-0.1  1.4]
[ 6.699  1.901 -6.803]
[0.992 0.008 0.   ]
0

The class of the first example is red (i.e., 0).

Our classifier predicts the probabilities of the classes [red, green, blue] as [0.992, 0.008, 0.0].

Similar to the case of Logistic Regression, we need to apply Maximum Likelihood Estimation (MLE) for multiple examples.

We need to find the weights W and bias b that maximize the following:

$$\prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}} \quad \text{where } a^{(i)} = \mathrm{softmax}\left(Wx^{(i)} + b\right)$$
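One practical reason to work with the logarithm of this product (taken next) is numerical: a product of many probabilities underflows quickly, while the sum of their logs stays in a comfortable range. A small sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities of the true class for 1000 examples
p_true = np.full(1000, 0.9)

product = np.prod(p_true)          # shrinks toward 0 as examples multiply
log_sum = np.sum(np.log(p_true))   # stays in a comfortable range

print(product)   # on the order of 1e-46 after only 1000 examples
print(log_sum)   # 1000 * log(0.9), about -105.4
```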
Maximum Likelihood Estimation (MLE) and Loss Function for Softmax Regression

Maximizing the likelihood is equivalent to maximizing its logarithm:

$$\log\left(\prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}}\right) = \sum_{i=1}^{N} \sum_{j=1}^{3} \log\left(a_j^{(i)}\right)^{y_j^{(i)}} = \sum_{i=1}^{N} \sum_{j=1}^{3} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Negating this sum (to turn maximization into minimization) gives the loss function (cost function / log loss function) for our Softmax Regression problem here.
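For a single example, this loss is just the negative log of the predicted probability of the true class. A sketch using $y^{(1)}$ and $a^{(1)}$ from above (the small epsilon is added in this sketch only, to avoid log(0)):

```python
import numpy as np

y = np.array([1, 0, 0])              # y^(1)
a = np.array([0.9, 0.1, 0.0])        # a^(1)

eps = 1e-12                          # sketch-only guard against log(0)
loss = -np.sum(y * np.log(a + eps))
print(loss)                          # -log(0.9), about 0.105
```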
For example:

$$x^{(1)} = \begin{bmatrix} -0.1 \\ 1.4 \end{bmatrix}, \quad y^{(1)} = 0 \text{ (red)}, \quad y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

$$x^{(2)} = \begin{bmatrix} 1.3 \\ 0.9 \end{bmatrix}, \quad y^{(2)} = 1 \text{ (green)}, \quad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

$$x^{(3)} = \begin{bmatrix} -1.4 \\ -1.1 \end{bmatrix}, \quad y^{(3)} = 2 \text{ (blue)}, \quad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

This loss function can also be called the Cross Entropy Loss function. In Logistic Regression (binary classification), it is the Binary Cross Entropy loss.

In this example, we have only 3 classes: red, green, and blue. In the general case, where we need to make predictions for C classes, the loss function of Softmax Regression would be:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Numerical Stability of Loss Function

The logarithm function is not numerically stable for small input values $0 < a_j < 1$.

So we derive a different computation as follows.
$$L = -\sum_{j=1}^{C} y_j \log(a_j) = -\sum_{j=1}^{C} y_j \log\left(\frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}\right) = -\sum_{j=1}^{C} y_j \log\left(\frac{e^{z_j + K}}{\sum_{k=1}^{C} e^{z_k + K}}\right) \quad \text{where } K = -\max(z_1, z_2, \ldots, z_C)$$

$$= -\sum_{j=1}^{C} y_j \left[\log\left(e^{z_j + K}\right) - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right)\right] = -\sum_{j=1}^{C} y_j \left[z_j + K - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right)\right]$$

The first log is eliminated. The domain of the second log function is $[1, \infty)$, which is the numerically stable range for the logarithm function.

def compute_loss_stable_version(y, z):
    K = -np.max(z)
    return -np.sum(y * (z + K - np.log(np.sum(np.exp(z + K)))))
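We can check (a quick sketch) that the stable formulation agrees with the naive cross-entropy when the naive one is well-behaved:

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

def compute_loss_stable_version(y, z):
    K = -np.max(z)
    return -np.sum(y * (z + K - np.log(np.sum(np.exp(z + K)))))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0, 0, 1])

naive = -np.sum(y * np.log(stable_softmax(z)))
stable = compute_loss_stable_version(y, z)
print(np.isclose(naive, stable))  # → True
```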
We can also easily compute $\frac{dL}{da_i}$ as follows:

$$\frac{dL}{da_i} = \frac{d}{da_i}\left(-\sum_{j=1}^{C} y_j \log(a_j)\right) = \frac{d}{da_i}\left(-y_i \log(a_i)\right) = \frac{-y_i}{a_i}$$

Then,

$$\frac{dL}{da} = \begin{bmatrix} \frac{dL}{da_1} \\ \frac{dL}{da_2} \\ \frac{dL}{da_3} \end{bmatrix} = \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix}$$

We can start with random weights W and bias b.

We make the parameters (weights and bias) better and better gradually by updating them in each iteration:

$$W = W - \alpha \frac{dL}{dW} \qquad b = b - \alpha \frac{dL}{db}$$

To compute $\frac{dL}{dW}$ and $\frac{dL}{db}$, we need to do backpropagation.

We still need to compute the derivative of the Softmax function: $\frac{da_i}{dz_m}$ tells us how much $a_i$ would change if $z_m$ changes.
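The `gradient_descent` helper used later in the notebook is not shown in this extract; below is a minimal sketch consistent with the update rule above, exercised on hypothetical gradient values:

```python
import numpy as np

def gradient_descent(W, b, dW, db, learning_rate):
    # One step of W = W - alpha * dL/dW and b = b - alpha * dL/db
    return W - learning_rate * dW, b - learning_rate * db

# Hypothetical values, just to exercise the update
W0 = np.zeros((3, 2))
b0 = np.zeros((3, 1))
dW0 = 0.5 * np.ones((3, 2))
db0 = 0.1 * np.ones((3, 1))

W1, b1 = gradient_descent(W0, b0, dW0, db0, learning_rate=0.1)
print(W1[0, 0], b1[0, 0])
```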
Since $z_m = w_{m,1} x_1 + w_{m,2} x_2 + b_m$, we have

$$\frac{dz_1}{dw_{1,1}} = x_1 \quad \frac{dz_1}{dw_{1,2}} = x_2 \quad \frac{dz_1}{db_1} = 1$$

$$\frac{dz_2}{dw_{2,1}} = x_1 \quad \frac{dz_2}{dw_{2,2}} = x_2 \quad \frac{dz_2}{db_2} = 1$$

$$\frac{dz_3}{dw_{3,1}} = x_1 \quad \frac{dz_3}{dw_{3,2}} = x_2 \quad \frac{dz_3}{db_3} = 1$$

In general,

$$\frac{dz_m}{db_m} = \frac{d}{db_m}\left(w_{m,1} x_1 + w_{m,2} x_2 + \ldots + w_{m,n} x_n + \ldots + w_{m,d} x_d + b_m\right) = 1$$

By the chain rule,

$$\frac{dL}{dw_{m,n}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{dw_{m,n}} \qquad \frac{dL}{db_m} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{db_m}$$

For example,

$$\frac{dL}{dw_{2,1}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_2} \frac{dz_2}{dw_{2,1}}$$

Changing $w_{2,1}$ leads to changing $z_2$, which leads to changing $a_1, a_2, a_3$, which all lead to changing $L$. Therefore, we are computing the effect of changing $w_{2,1}$ on $L$ over 3 different paths, and we sum up all the paths to compute $\frac{dL}{dw_{2,1}}$.

For the Softmax derivative, we apply the quotient rule to

$$\frac{d}{dz_m} a_i = \frac{d}{dz_m}\left(\frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\right)$$

with $f = e^{z_i}$ and $g = \sum_{j=1}^{C} e^{z_j}$. We have two cases, $i = m$ and $i \neq m$.
If $i = m$:

$$\frac{da_i}{dz_m} = \frac{\left(e^{z_i}\right)' \sum_{j=1}^{C} e^{z_j} - \left(\sum_{j=1}^{C} e^{z_j}\right)' e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_i} \sum_{j=1}^{C} e^{z_j} - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_m}\left(\sum_{j=1}^{C} e^{z_j} - e^{z_m}\right)}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2}$$
$$\frac{da_i}{dz_m} = \frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{\sum_{j=1}^{C} e^{z_j} - e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} = a_m (1 - a_m)$$

If $i \neq m$, the numerator $e^{z_i}$ does not depend on $z_m$, so the quotient rule gives

$$\frac{da_i}{dz_m} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j} - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = -\frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = -a_m a_i$$

For the example here, we have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dw_{1,1}} & \frac{dL}{dw_{1,2}} \\ \frac{dL}{dw_{2,1}} & \frac{dL}{dw_{2,2}} \\ \frac{dL}{dw_{3,1}} & \frac{dL}{dw_{3,2}} \end{bmatrix} \quad \text{and} \quad \frac{dL}{db} = \begin{bmatrix} \frac{dL}{db_1} \\ \frac{dL}{db_2} \\ \frac{dL}{db_3} \end{bmatrix}$$

Similarly,
j=1 j=1 dai dz2 da1 dz2 da2 dz2 da3 dz2 da1 da2 da3 dz2
dL 3 dL dL dL dL dL dL dL
= ∑ = + + = ( + + )
db2 i=1 dai dz2 db2 da1 dz2 db2 da2 dz2 db2 da3 dz2 db2 da1 dz2 da2 dz2 da3 dz2 db2
dL dz2
In summary, we have =
dz2 db2
dai am (1 − am ) if i = m
= {
dzm
−am ai if i ≠ m From above, we have
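This two-case formula can be sanity-checked against a finite-difference approximation (a quick numerical sketch):

```python
import numpy as np

def softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([0.3, -1.2, 0.8])
a = softmax(z)

# Analytic Jacobian: a_m(1 - a_m) if i == m, else -a_m * a_i
jac = np.diag(a) - np.outer(a, a)

# Central finite-difference approximation of da_i/dz_m
eps = 1e-6
num = np.zeros((3, 3))
for m in range(3):
    dz = np.zeros(3)
    dz[m] = eps
    num[:, m] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jac, num, atol=1e-6))  # → True
```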
From above, we have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dw_{1,1}} & \frac{dL}{dw_{1,2}} \\ \frac{dL}{dw_{2,1}} & \frac{dL}{dw_{2,2}} \\ \frac{dL}{dw_{3,1}} & \frac{dL}{dw_{3,2}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{dw_{1,1}} & \frac{dL}{dz_1}\frac{dz_1}{dw_{1,2}} \\ \frac{dL}{dz_2}\frac{dz_2}{dw_{2,1}} & \frac{dL}{dz_2}\frac{dz_2}{dw_{2,2}} \\ \frac{dL}{dz_3}\frac{dz_3}{dw_{3,1}} & \frac{dL}{dz_3}\frac{dz_3}{dw_{3,2}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} x_1 & \frac{dL}{dz_1} x_2 \\ \frac{dL}{dz_2} x_1 & \frac{dL}{dz_2} x_2 \\ \frac{dL}{dz_3} x_1 & \frac{dL}{dz_3} x_2 \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \\ \frac{dL}{dz_2} \\ \frac{dL}{dz_3} \end{bmatrix} \begin{bmatrix} x_1 & x_2 \end{bmatrix} = \frac{dL}{dz} x^T$$

and

$$\frac{dL}{db} = \begin{bmatrix} \frac{dL}{db_1} \\ \frac{dL}{db_2} \\ \frac{dL}{db_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{db_1} \\ \frac{dL}{dz_2}\frac{dz_2}{db_2} \\ \frac{dL}{dz_3}\frac{dz_3}{db_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \cdot 1 \\ \frac{dL}{dz_2} \cdot 1 \\ \frac{dL}{dz_3} \cdot 1 \end{bmatrix} = \frac{dL}{dz}$$

The matrix of Softmax derivatives, with entry $(i, m)$ equal to $\frac{da_i}{dz_m}$, can be written compactly:

$$\frac{da}{dz} = \begin{bmatrix} a_1 & a_1 & a_1 \\ a_2 & a_2 & a_2 \\ a_3 & a_3 & a_3 \end{bmatrix} \circ \begin{bmatrix} 1 - a_1 & -a_2 & -a_3 \\ -a_1 & 1 - a_2 & -a_3 \\ -a_1 & -a_2 & 1 - a_3 \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \circ \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \right) = a \mathbf{1}^T \circ \left( I - \mathbf{1} a^T \right)$$
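The compact form $a\mathbf{1}^T \circ (I - \mathbf{1}a^T)$ can be checked numerically against the elementwise two-case formula; a quick sketch:

```python
import numpy as np

a = np.array([[0.2], [0.5], [0.3]])   # a column of probabilities

# Compact form: a 1^T ∘ (I - 1 a^T)
ones = np.ones((3, 1))
compact = np.dot(a, ones.T) * (np.eye(3) - np.dot(ones, a.T))

# Elementwise form: a_m(1 - a_m) on the diagonal, -a_m a_i off it
elementwise = np.diag(a.flatten()) - np.dot(a, a.T)

print(np.allclose(compact, elementwise))  # → True
```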
We can then compute $\frac{dL}{dz}$ as follows.

$$\frac{dL}{dz} = \begin{bmatrix} a_1(1-a_1) & a_1(-a_2) & a_1(-a_3) \\ a_2(-a_1) & a_2(1-a_2) & a_2(-a_3) \\ a_3(-a_1) & a_3(-a_2) & a_3(1-a_3) \end{bmatrix} \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix} = \begin{bmatrix} -y_1(1-a_1) + y_2 a_1 + y_3 a_1 \\ y_1 a_2 - y_2(1-a_2) + y_3 a_2 \\ y_1 a_3 + y_2 a_3 - y_3(1-a_3) \end{bmatrix} = \left(a\mathbf{1}^T - I\right)y = a\mathbf{1}^T y - Iy = a - y$$

[here $y$ is a one-hot vector, so $\mathbf{1}^T y = 1$].

def compute_gradient(x, y, a):
    # x: input column (2,1); y: one-hot target (3,1); a: softmax output (3,1)
    # compute da/dz
    matrix = np.dot(a, np.ones((1, 3))) * (np.eye(3) - np.dot(np.ones((3, 1)), a.T))
    # compute dL/dz (the chain-rule product simplifies to a - y)
    #dz = np.dot(matrix, da)
    dz = a - y
    # compute dL/dW
    dW = np.dot(dz, x.T)
    # compute dL/db
    db = dz.copy()
    return dW, db

W_cache, b_cache, L_cache = [], [], []

# Loop through each example in the dataset
for j in range(X.shape[0]):
    x_j = X[j,:].reshape(2,1)
    y_j = Y[j,:].reshape(3,1)

    # feed-forward
    z_j = np.dot(W, x_j) + b
    a_j = stable_softmax(z_j)
    L = compute_loss_stable_version(y_j, z_j)

    # backpropagation
    dW, db = compute_gradient(x_j, y_j, a_j)

    # gradient descent
    W, b = gradient_descent(W, b, dW, db, learning_rate)

    W_cache.append(W)
    b_cache.append(b)
    L_cache.append(L)
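The shortcut dz = a - y in compute_gradient can be verified against the full chain-rule product (a quick sketch):

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([[0.4], [1.1], [-0.3]])
a = stable_softmax(z)
y = np.array([[0.0], [1.0], [0.0]])   # one-hot target

# Full chain rule: dL/dz = (da/dz) @ (dL/da)
matrix = np.dot(a, np.ones((1, 3))) * (np.eye(3) - np.dot(np.ones((3, 1)), a.T))
da = -y / a
dz_full = np.dot(matrix, da)

# Shortcut
dz_short = a - y

print(np.allclose(dz_full, dz_short))  # → True
```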
Now we can implement Gradient Descent to learn the parameters of Softmax Regression for our problem here as follows.

learning_rate = 2.0  # This is just for this DEMO. Normally, the learning rate is quite small.
num_epochs = 40

plot_decision_boundary(X, Y, W, b)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)

def animate(i):
    xs = np.array([-2.0, 2.0])
    W = W_cache[i]
    b = b_cache[i]
    ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
    ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
    ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])
    lines1.set_data(xs, ys1)
    lines2.set_data(xs, ys2)
    lines3.set_data(xs, ys3)
    text_box.set_text('Iteration: {}'.format(i))
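Putting the pieces together, here is a condensed, self-contained sketch of the whole training procedure on the notebook's dataset; the zero initialization and the learning rate of 0.1 are choices made for this sketch, not the notebook's values:

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

X = np.array([[-0.1, 1.4], [-0.5, 0.2], [ 1.3, 0.9], [-0.6, 0.4],
              [-1.6, 0.2], [ 0.2, 0.2], [-0.3,-0.4], [ 0.7,-0.8],
              [ 1.1,-1.5], [-1.0, 0.9], [-0.5, 1.5], [-1.3,-0.4],
              [-1.4,-1.2], [-0.9,-0.7], [ 0.4,-1.3], [-0.4, 0.6],
              [ 0.3,-0.5], [-1.6,-0.7], [-0.5,-1.4], [-1.0,-1.4]])
y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])
Y = np.eye(3)[y]

W = np.zeros((3, 2))           # sketch-only initialization
b = np.zeros((3, 1))
learning_rate = 0.1            # sketch-only choice

def total_loss(W, b):
    # Sum of the numerically stable per-example losses
    loss = 0.0
    for j in range(X.shape[0]):
        z = np.dot(W, X[j,:].reshape(2,1)) + b
        K = -np.max(z)
        loss += -np.sum(Y[j,:].reshape(3,1) * (z + K - np.log(np.sum(np.exp(z + K)))))
    return loss

loss_before = total_loss(W, b)

for epoch in range(40):
    for j in range(X.shape[0]):
        x_j = X[j,:].reshape(2,1)
        y_j = Y[j,:].reshape(3,1)
        a_j = stable_softmax(np.dot(W, x_j) + b)
        dz = a_j - y_j                      # dL/dz = a - y
        W -= learning_rate * np.dot(dz, x_j.T)
        b -= learning_rate * dz

loss_after = total_loss(W, b)
print(loss_after < loss_before)  # → True: training reduces the total loss
```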