Online Softmax Normalizer Calculation
We can use Softmax Regression for this problem. Softmax Regression is also called Multinomial Logistic Regression.

Softmax Regression

This is a generalization of Logistic Regression (for 2 classes) to an arbitrary number of classes.

We want to build a model to discriminate red, green, and blue points in 2-dimensional space.

One-hot vector representation

Given a point in 2D space $x = (x_1, x_2)$, we want to output either red, green, or blue.

We represent the output as a one-hot vector.
For example, we represent red as $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, green as $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and blue as $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$.

# Setting up dataset
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('classic')
np.set_printoptions(precision=3, suppress=True)
X = np.array([[-0.1, 1.4],
              [-0.5, 0.2],
              [ 1.3, 0.9],
              [-0.6, 0.4],
              [-1.6, 0.2],
              [ 0.2, 0.2],
              [-0.3,-0.4],
              [ 0.7,-0.8],
              [ 1.1,-1.5],
              [-1.0, 0.9],
              [-0.5, 1.5],
              [-1.3,-0.4],
              [-1.4,-1.2],
              [-0.9,-0.7],
              [ 0.4,-1.3],
              [-0.4, 0.6],
              [ 0.3,-0.5],
              [-1.6,-0.7],
              [-0.5,-1.4],
              [-1.0,-1.4]])

y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])

print(y)

[0 0 1 0 2 1 1 1 1 0 0 2 2 2 1 0 1 2 2 2]

print(np.eye(3))

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

# Convert the labels to one-hot encoding
Y = np.eye(3)[y]
print(Y)

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]

colormap = np.array(['r', 'g', 'b'])

def plot_scatter(X, y, colormap):
    plt.figure()
    plt.xlim(left=-2.0, right=2.0)
    plt.ylim(bottom=-2.0, top=2.0)
    plt.scatter(X[:,0], X[:,1], s=80, c=colormap[y])
    plt.xlabel('$x_1$', size=20)
    plt.ylabel('$x_2$', size=20)

plot_scatter(X, y, colormap)
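As a quick sanity check on the one-hot encoding (a small sketch, not a cell from the original notebook): `np.argmax` inverts the `np.eye` indexing trick and recovers the integer labels.

```python
import numpy as np

y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])

# Build the one-hot matrix by indexing rows of the identity matrix
Y = np.eye(3)[y]

# argmax along each row recovers the integer labels
recovered = np.argmax(Y, axis=1)
print(np.array_equal(recovered, y))  # → True
```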
Computation Graph

The model computes $z = Wx + b$ with parameters

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
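To make the shapes concrete, here is a minimal sketch of the computation $z = Wx + b$; the parameter values below are made up for illustration, not the notebook's:

```python
import numpy as np

# Hypothetical parameter values, just to illustrate the shapes
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.3],
              [-0.4,  0.1]])   # shape (3, 2): one row per class
b = np.array([[0.1],
              [0.2],
              [0.3]])          # shape (3, 1): one bias per class

x = np.array([[-0.1],
              [ 1.4]])         # shape (2, 1): one 2D point

z = np.dot(W, x) + b           # shape (3, 1): one score per class
print(z.shape)                 # → (3, 1)
```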
Finding the parameters W and b from our dataset so that the predictions of the model are as close as possible to the targets is called training.

The coordinates of all the data points are put into the matrix X.

The labels of all the data points are put into the vector y.

Here, the dataset (X, y) contains $m = 20$ samples, and for each sample $i$ we have $(x^{(i)}, y^{(i)})$, where $x^{(i)} = (x_1^{(i)}, x_2^{(i)}) \in \mathbb{R}^2$ is the coordinate of the data point and $y^{(i)}$ is its label (red, green, or blue).

The logarithm of a very small positive number is a large negative number, which will matter for numerical stability later:

np.log(1000.0)

6.907755278982137

np.log(0.00000001)

-18.420680743952367
Softmax Function

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = \frac{e^{K} e^{z_i}}{e^{K} \sum_{j=1}^{C} e^{z_j}} = \frac{e^{z_i + K}}{\sum_{j=1}^{C} e^{z_j + K}}$$

where $K = -\max(z_1, z_2, \ldots, z_C)$.
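This identity can be checked numerically; a quick sketch:

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = np.array([2.0, 1.0, 0.1])
K = -np.max(z)

# Shifting all inputs by K multiplies numerator and denominator by e^K,
# so the probabilities are unchanged
print(np.allclose(softmax(z), softmax(z + K)))  # → True
```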
Feed-forward Phase

Feed-forward means: given the model parameters W and b, and given a sample (x, y), we produce the output a and compute the loss L.

$$a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

In classification, we want to compute the probability that a sample belongs to a class/category.

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

Suppose someone already computed W and b for us as follows.

b = np.array([[ 1.2 ],
              [ 2.93],
              [-4.14]])

plot_scatter(X, y, colormap)
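To see why `stable_softmax` matters, a small sketch: the naive softmax overflows for large scores, while the shifted version does not.

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([1000.0, 990.0, 980.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax(z)          # e^1000 overflows to inf, giving nan

stable = stable_softmax(z)      # largest shifted score is e^0 = 1

print(np.isnan(naive).any())    # → True
print(stable)                   # well-defined probabilities summing to 1
```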
Softmax emphasizes the relative difference between large and small values.

Suppose the classifier produces the following outputs:

$$y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad a^{(1)} = \begin{bmatrix} 0.9 \\ 0.1 \\ 0.0 \end{bmatrix} \qquad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \quad a^{(2)} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix} \qquad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad a^{(3)} = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.7 \end{bmatrix}$$

For example, if $y^{(i)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, having $a^{(i)} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.1 \end{bmatrix}$ is more desirable than having $a^{(i)} = \begin{bmatrix} 0.2 \\ 0.6 \\ 0.2 \end{bmatrix}$.

The likelihood of the classifier producing $a^{(i)}$ for one example is:

$$\prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}}$$
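Because $y^{(i)}$ is one-hot, the product simply picks out the predicted probability of the true class. A quick sketch using $y^{(2)}$ and $a^{(2)}$ from above:

```python
import numpy as np

a = np.array([0.1, 0.8, 0.1])   # a^(2) from above
y = np.array([0, 1, 0])         # y^(2) from above

likelihood = np.prod(a ** y)    # 0.1^0 * 0.8^1 * 0.1^0
print(likelihood)               # → 0.8
```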
For example, with $a^{(1)}$ and $y^{(1)}$ as above, we have:

$$\prod_{j=1}^{3} \left(a_j^{(1)}\right)^{y_j^{(1)}} = (0.9)^1 \times (0.1)^0 \times (0.0)^0 = 0.9 \times 1 \times 1 = 0.9$$

Let's try to predict the class (red, green, or blue) of the first example in our dataset as follows.

# X[0,:] has the shape (1,2)
x = X[0,:].reshape(2,1)
z = np.dot(W, x) + b
a = stable_softmax(z)
print(x.flatten())
print(z.flatten())
print(a.flatten())
print(y[0])

[-0.1  1.4]
[ 6.699  1.901 -6.803]
[0.992 0.008 0.   ]
0

The class of the first example is red (i.e., 0).

Our classifier predicts the probabilities of the classes [red, green, blue] as [0.992, 0.008, 0.0].

Similar to the case of Logistic Regression, we need to apply Maximum Likelihood Estimation (MLE) for multiple examples.

We need to find the weights W and bias b that maximize the following:

$$\prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}} \quad \text{where } a^{(i)} = \mathrm{softmax}\left(Wx^{(i)} + b\right)$$
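One practical reason to work with the logarithm of this product (taken next) is numerical: a product of many probabilities underflows quickly, while the sum of their logs stays in a comfortable range. A small sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities of the true class for 1000 examples
p_true = np.full(1000, 0.9)

product = np.prod(p_true)          # shrinks toward 0 as examples multiply
log_sum = np.sum(np.log(p_true))   # stays in a comfortable range

print(product)   # on the order of 1e-46 after only 1000 examples
print(log_sum)   # 1000 * log(0.9), about -105.4
```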
Maximum Likelihood Estimation (MLE) and Loss Function for Softmax Regression

Maximizing the likelihood is equivalent to maximizing its logarithm:

$$\log\left(\prod_{i=1}^{N} \prod_{j=1}^{3} \left(a_j^{(i)}\right)^{y_j^{(i)}}\right) = \sum_{i=1}^{N} \sum_{j=1}^{3} \log\left(a_j^{(i)}\right)^{y_j^{(i)}} = \sum_{i=1}^{N} \sum_{j=1}^{3} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Negating this sum (to turn maximization into minimization) gives the loss function (cost function / log loss function) for our Softmax Regression problem here.
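For a single example, this loss is just the negative log of the predicted probability of the true class. A sketch using $y^{(1)}$ and $a^{(1)}$ from above (the small epsilon is added in this sketch only, to avoid log(0)):

```python
import numpy as np

y = np.array([1, 0, 0])              # y^(1)
a = np.array([0.9, 0.1, 0.0])        # a^(1)

eps = 1e-12                          # sketch-only guard against log(0)
loss = -np.sum(y * np.log(a + eps))
print(loss)                          # -log(0.9), about 0.105
```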
For example:

$$x^{(1)} = \begin{bmatrix} -0.1 \\ 1.4 \end{bmatrix}, \quad y^{(1)} = 0 \text{ (red)}, \quad y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

$$x^{(2)} = \begin{bmatrix} 1.3 \\ 0.9 \end{bmatrix}, \quad y^{(2)} = 1 \text{ (green)}, \quad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

$$x^{(3)} = \begin{bmatrix} -1.4 \\ -1.1 \end{bmatrix}, \quad y^{(3)} = 2 \text{ (blue)}, \quad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

This loss function can also be called the Cross Entropy Loss function. In Logistic Regression (binary classification), it is the Binary Cross Entropy loss.

In this example, we have only 3 classes: red, green, and blue. In the general case, where we need to make predictions for C classes, the loss function of Softmax Regression would be:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^{(i)} \log\left(a_j^{(i)}\right)$$

Numerical Stability of Loss Function

The logarithm function is not numerically stable for small input values $0 < a_j < 1$.

So we derive a different computation as follows.
$$L = -\sum_{j=1}^{C} y_j \log(a_j) = -\sum_{j=1}^{C} y_j \log\left(\frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}\right) = -\sum_{j=1}^{C} y_j \log\left(\frac{e^{z_j + K}}{\sum_{k=1}^{C} e^{z_k + K}}\right) \quad \text{where } K = -\max(z_1, z_2, \ldots, z_C)$$

$$= -\sum_{j=1}^{C} y_j \left[\log\left(e^{z_j + K}\right) - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right)\right] = -\sum_{j=1}^{C} y_j \left[z_j + K - \log\left(\sum_{k=1}^{C} e^{z_k + K}\right)\right]$$

The first log is eliminated. The domain of the second log function is $[1, \infty)$, which is the numerically stable range for the logarithm function.

def compute_loss_stable_version(y, z):
    K = -np.max(z)
    return -np.sum(y * (z + K - np.log(np.sum(np.exp(z + K)))))
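We can check (a quick sketch) that the stable formulation agrees with the naive cross-entropy when the naive one is well-behaved:

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

def compute_loss_stable_version(y, z):
    K = -np.max(z)
    return -np.sum(y * (z + K - np.log(np.sum(np.exp(z + K)))))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0, 0, 1])

naive = -np.sum(y * np.log(stable_softmax(z)))
stable = compute_loss_stable_version(y, z)
print(np.isclose(naive, stable))  # → True
```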
We can also easily compute $\frac{dL}{da_i}$ as follows:

$$\frac{dL}{da_i} = \frac{d}{da_i}\left(-\sum_{j=1}^{C} y_j \log(a_j)\right) = \frac{d}{da_i}\left(-y_i \log(a_i)\right) = \frac{-y_i}{a_i}$$

Then,

$$\frac{dL}{da} = \begin{bmatrix} \frac{dL}{da_1} \\ \frac{dL}{da_2} \\ \frac{dL}{da_3} \end{bmatrix} = \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix}$$

We can start with random weights W and bias b.

We make the parameters (weights and bias) better and better gradually by updating them in each iteration:

$$W = W - \alpha \frac{dL}{dW} \qquad b = b - \alpha \frac{dL}{db}$$

To compute $\frac{dL}{dW}$ and $\frac{dL}{db}$, we need to do backpropagation.

We still need to compute the derivative of the Softmax function: $\frac{da_i}{dz_m}$ tells us how much $a_i$ would change if $z_m$ changes.
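The `gradient_descent` helper used later in the notebook is not shown in this extract; below is a minimal sketch consistent with the update rule above, exercised on hypothetical gradient values:

```python
import numpy as np

def gradient_descent(W, b, dW, db, learning_rate):
    # One step of W = W - alpha * dL/dW and b = b - alpha * dL/db
    return W - learning_rate * dW, b - learning_rate * db

# Hypothetical values, just to exercise the update
W0 = np.zeros((3, 2))
b0 = np.zeros((3, 1))
dW0 = 0.5 * np.ones((3, 2))
db0 = 0.1 * np.ones((3, 1))

W1, b1 = gradient_descent(W0, b0, dW0, db0, learning_rate=0.1)
print(W1[0, 0], b1[0, 0])
```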
Since $z_m = w_{m,1} x_1 + w_{m,2} x_2 + b_m$, we have

$$\frac{dz_1}{dw_{1,1}} = x_1 \quad \frac{dz_1}{dw_{1,2}} = x_2 \quad \frac{dz_1}{db_1} = 1$$

$$\frac{dz_2}{dw_{2,1}} = x_1 \quad \frac{dz_2}{dw_{2,2}} = x_2 \quad \frac{dz_2}{db_2} = 1$$

$$\frac{dz_3}{dw_{3,1}} = x_1 \quad \frac{dz_3}{dw_{3,2}} = x_2 \quad \frac{dz_3}{db_3} = 1$$

In general,

$$\frac{dz_m}{db_m} = \frac{d}{db_m}\left(w_{m,1} x_1 + w_{m,2} x_2 + \ldots + w_{m,n} x_n + \ldots + w_{m,d} x_d + b_m\right) = 1$$

By the chain rule,

$$\frac{dL}{dw_{m,n}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{dw_{m,n}} \qquad \frac{dL}{db_m} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_m} \frac{dz_m}{db_m}$$

For example,

$$\frac{dL}{dw_{2,1}} = \sum_{i=1}^{3} \frac{dL}{da_i} \frac{da_i}{dz_2} \frac{dz_2}{dw_{2,1}}$$

Changing $w_{2,1}$ leads to changing $z_2$, which leads to changing $a_1, a_2, a_3$, which all lead to changing $L$. Therefore, we are computing the effect of changing $w_{2,1}$ on $L$ over 3 different paths, and we sum up all the paths to compute $\frac{dL}{dw_{2,1}}$.

For the Softmax derivative, we apply the quotient rule to

$$\frac{d}{dz_m} a_i = \frac{d}{dz_m}\left(\frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\right)$$

with $f = e^{z_i}$ and $g = \sum_{j=1}^{C} e^{z_j}$. We have two cases, $i = m$ and $i \neq m$.
If $i = m$:

$$\frac{da_i}{dz_m} = \frac{\left(e^{z_i}\right)' \sum_{j=1}^{C} e^{z_j} - \left(\sum_{j=1}^{C} e^{z_j}\right)' e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_i} \sum_{j=1}^{C} e^{z_j} - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_m}\left(\sum_{j=1}^{C} e^{z_j} - e^{z_m}\right)}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2}$$
$$\frac{da_i}{dz_m} = \frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{\sum_{j=1}^{C} e^{z_j} - e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} = a_m (1 - a_m)$$

If $i \neq m$, the numerator $e^{z_i}$ does not depend on $z_m$, so the quotient rule gives

$$\frac{da_i}{dz_m} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j} - e^{z_m} e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = -\frac{e^{z_m}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = -a_m a_i$$

For the example here, we have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dw_{1,1}} & \frac{dL}{dw_{1,2}} \\ \frac{dL}{dw_{2,1}} & \frac{dL}{dw_{2,2}} \\ \frac{dL}{dw_{3,1}} & \frac{dL}{dw_{3,2}} \end{bmatrix} \quad \text{and} \quad \frac{dL}{db} = \begin{bmatrix} \frac{dL}{db_1} \\ \frac{dL}{db_2} \\ \frac{dL}{db_3} \end{bmatrix}$$

Similarly,
j=1 j=1 dai dz2 da1 dz2 da2 dz2 da3 dz2 da1 da2 da3 dz2
dL 3 dL dL dL dL dL dL dL
= ∑ = + + = ( + + )
db2 i=1 dai dz2 db2 da1 dz2 db2 da2 dz2 db2 da3 dz2 db2 da1 dz2 da2 dz2 da3 dz2 db2
dL dz2
In summary, we have =
dz2 db2
dai am (1 − am ) if i = m
= {
dzm
−am ai if i ≠ m From above, we have
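This two-case formula can be sanity-checked against a finite-difference approximation (a quick numerical sketch):

```python
import numpy as np

def softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([0.3, -1.2, 0.8])
a = softmax(z)

# Analytic Jacobian: a_m(1 - a_m) if i == m, else -a_m * a_i
jac = np.diag(a) - np.outer(a, a)

# Central finite-difference approximation of da_i/dz_m
eps = 1e-6
num = np.zeros((3, 3))
for m in range(3):
    dz = np.zeros(3)
    dz[m] = eps
    num[:, m] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jac, num, atol=1e-6))  # → True
```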
From above, we have

$$\frac{dL}{dW} = \begin{bmatrix} \frac{dL}{dw_{1,1}} & \frac{dL}{dw_{1,2}} \\ \frac{dL}{dw_{2,1}} & \frac{dL}{dw_{2,2}} \\ \frac{dL}{dw_{3,1}} & \frac{dL}{dw_{3,2}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{dw_{1,1}} & \frac{dL}{dz_1}\frac{dz_1}{dw_{1,2}} \\ \frac{dL}{dz_2}\frac{dz_2}{dw_{2,1}} & \frac{dL}{dz_2}\frac{dz_2}{dw_{2,2}} \\ \frac{dL}{dz_3}\frac{dz_3}{dw_{3,1}} & \frac{dL}{dz_3}\frac{dz_3}{dw_{3,2}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} x_1 & \frac{dL}{dz_1} x_2 \\ \frac{dL}{dz_2} x_1 & \frac{dL}{dz_2} x_2 \\ \frac{dL}{dz_3} x_1 & \frac{dL}{dz_3} x_2 \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \\ \frac{dL}{dz_2} \\ \frac{dL}{dz_3} \end{bmatrix} \begin{bmatrix} x_1 & x_2 \end{bmatrix} = \frac{dL}{dz} x^T$$

and

$$\frac{dL}{db} = \begin{bmatrix} \frac{dL}{db_1} \\ \frac{dL}{db_2} \\ \frac{dL}{db_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1}\frac{dz_1}{db_1} \\ \frac{dL}{dz_2}\frac{dz_2}{db_2} \\ \frac{dL}{dz_3}\frac{dz_3}{db_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz_1} \cdot 1 \\ \frac{dL}{dz_2} \cdot 1 \\ \frac{dL}{dz_3} \cdot 1 \end{bmatrix} = \frac{dL}{dz}$$

The matrix of Softmax derivatives, with entry $(i, m)$ equal to $\frac{da_i}{dz_m}$, can be written compactly:

$$\frac{da}{dz} = \begin{bmatrix} a_1 & a_1 & a_1 \\ a_2 & a_2 & a_2 \\ a_3 & a_3 & a_3 \end{bmatrix} \circ \begin{bmatrix} 1 - a_1 & -a_2 & -a_3 \\ -a_1 & 1 - a_2 & -a_3 \\ -a_1 & -a_2 & 1 - a_3 \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \circ \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \right) = a \mathbf{1}^T \circ \left( I - \mathbf{1} a^T \right)$$
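The compact form $a\mathbf{1}^T \circ (I - \mathbf{1}a^T)$ can be checked numerically against the elementwise two-case formula; a quick sketch:

```python
import numpy as np

a = np.array([[0.2], [0.5], [0.3]])   # a column of probabilities

# Compact form: a 1^T ∘ (I - 1 a^T)
ones = np.ones((3, 1))
compact = np.dot(a, ones.T) * (np.eye(3) - np.dot(ones, a.T))

# Elementwise form: a_m(1 - a_m) on the diagonal, -a_m a_i off it
elementwise = np.diag(a.flatten()) - np.dot(a, a.T)

print(np.allclose(compact, elementwise))  # → True
```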
We can then compute $\frac{dL}{dz}$ as follows.

$$\frac{dL}{dz} = \begin{bmatrix} a_1(1-a_1) & a_1(-a_2) & a_1(-a_3) \\ a_2(-a_1) & a_2(1-a_2) & a_2(-a_3) \\ a_3(-a_1) & a_3(-a_2) & a_3(1-a_3) \end{bmatrix} \begin{bmatrix} -\frac{y_1}{a_1} \\ -\frac{y_2}{a_2} \\ -\frac{y_3}{a_3} \end{bmatrix} = \begin{bmatrix} -y_1(1-a_1) + y_2 a_1 + y_3 a_1 \\ y_1 a_2 - y_2(1-a_2) + y_3 a_2 \\ y_1 a_3 + y_2 a_3 - y_3(1-a_3) \end{bmatrix} = \left(a\mathbf{1}^T - I\right)y = a\mathbf{1}^T y - Iy = a - y$$

[here $y$ is a one-hot vector, so $\mathbf{1}^T y = 1$].

def compute_gradient(x, y, a):
    # x: input column (2,1); y: one-hot target (3,1); a: softmax output (3,1)
    # compute da/dz
    matrix = np.dot(a, np.ones((1, 3))) * (np.eye(3) - np.dot(np.ones((3, 1)), a.T))
    # compute dL/dz (the chain-rule product simplifies to a - y)
    #dz = np.dot(matrix, da)
    dz = a - y
    # compute dL/dW
    dW = np.dot(dz, x.T)
    # compute dL/db
    db = dz.copy()
    return dW, db

W_cache, b_cache, L_cache = [], [], []

# Loop through each example in the dataset
for j in range(X.shape[0]):
    x_j = X[j,:].reshape(2,1)
    y_j = Y[j,:].reshape(3,1)

    # feed-forward
    z_j = np.dot(W, x_j) + b
    a_j = stable_softmax(z_j)
    L = compute_loss_stable_version(y_j, z_j)

    # backpropagation
    dW, db = compute_gradient(x_j, y_j, a_j)

    # gradient descent
    W, b = gradient_descent(W, b, dW, db, learning_rate)

    W_cache.append(W)
    b_cache.append(b)
    L_cache.append(L)
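The shortcut dz = a - y in compute_gradient can be verified against the full chain-rule product (a quick sketch):

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

z = np.array([[0.4], [1.1], [-0.3]])
a = stable_softmax(z)
y = np.array([[0.0], [1.0], [0.0]])   # one-hot target

# Full chain rule: dL/dz = (da/dz) @ (dL/da)
matrix = np.dot(a, np.ones((1, 3))) * (np.eye(3) - np.dot(np.ones((3, 1)), a.T))
da = -y / a
dz_full = np.dot(matrix, da)

# Shortcut
dz_short = a - y

print(np.allclose(dz_full, dz_short))  # → True
```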
Now we can implement Gradient Descent to learn the parameters of Softmax Regression for our problem here as follows.

learning_rate = 2.0  # This is just for this DEMO. Normally, the learning rate is quite small.
num_epochs = 40

plot_decision_boundary(X, Y, W, b)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim([-2.0, 2.0])
ax.set_ylim([-2.0, 2.0])
ax.set_xlabel('$x_1$', size=20)
ax.set_ylabel('$x_2$', size=20)

def animate(i):
    xs = np.array([-2.0, 2.0])
    W = W_cache[i]
    b = b_cache[i]
    ys1 = ((b[1, 0] - b[0, 0]) - (W[0, 0] - W[1, 0]) * xs) / (W[0, 1] - W[1, 1])
    ys2 = ((b[2, 0] - b[0, 0]) - (W[0, 0] - W[2, 0]) * xs) / (W[0, 1] - W[2, 1])
    ys3 = ((b[2, 0] - b[1, 0]) - (W[1, 0] - W[2, 0]) * xs) / (W[1, 1] - W[2, 1])
    lines1.set_data(xs, ys1)
    lines2.set_data(xs, ys2)
    lines3.set_data(xs, ys3)
    text_box.set_text('Iteration: {}'.format(i))
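Putting the pieces together, here is a condensed, self-contained sketch of the whole training procedure on the notebook's dataset; the zero initialization and the learning rate of 0.1 are choices made for this sketch, not the notebook's values:

```python
import numpy as np

def stable_softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

X = np.array([[-0.1, 1.4], [-0.5, 0.2], [ 1.3, 0.9], [-0.6, 0.4],
              [-1.6, 0.2], [ 0.2, 0.2], [-0.3,-0.4], [ 0.7,-0.8],
              [ 1.1,-1.5], [-1.0, 0.9], [-0.5, 1.5], [-1.3,-0.4],
              [-1.4,-1.2], [-0.9,-0.7], [ 0.4,-1.3], [-0.4, 0.6],
              [ 0.3,-0.5], [-1.6,-0.7], [-0.5,-1.4], [-1.0,-1.4]])
y = np.array([0, 0, 1, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 2, 1, 0, 1, 2, 2, 2])
Y = np.eye(3)[y]

W = np.zeros((3, 2))           # sketch-only initialization
b = np.zeros((3, 1))
learning_rate = 0.1            # sketch-only choice

def total_loss(W, b):
    # Sum of the numerically stable per-example losses
    loss = 0.0
    for j in range(X.shape[0]):
        z = np.dot(W, X[j,:].reshape(2,1)) + b
        K = -np.max(z)
        loss += -np.sum(Y[j,:].reshape(3,1) * (z + K - np.log(np.sum(np.exp(z + K)))))
    return loss

loss_before = total_loss(W, b)

for epoch in range(40):
    for j in range(X.shape[0]):
        x_j = X[j,:].reshape(2,1)
        y_j = Y[j,:].reshape(3,1)
        a_j = stable_softmax(np.dot(W, x_j) + b)
        dz = a_j - y_j                      # dL/dz = a - y
        W -= learning_rate * np.dot(dz, x_j.T)
        b -= learning_rate * dz

loss_after = total_loss(W, b)
print(loss_after < loss_before)  # → True: training reduces the total loss
```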