Introduction to Machine Learning
Homework 7: Neural Networks
Prof. Sundeep Rangan and Yao Wang
Solution
1. (a) The linear functions in the hidden layer are:
\[
z^H = W^H x + b^H =
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} +
\begin{bmatrix} 0 \\ 0 \\ -1 \\ 1 \end{bmatrix} =
\begin{bmatrix} x_1 + x_3 \\ x_2 + x_3 \\ x_1 + x_2 - 1 \\ x_1 + x_2 + x_3 + 1 \end{bmatrix}.
\]
Hence, the activation functions are
\[
u^H = g_{\rm act}(z^H) =
\begin{bmatrix} g_{\rm act}(x_1 + x_3) \\ g_{\rm act}(x_2 + x_3) \\ g_{\rm act}(x_1 + x_2 - 1) \\ g_{\rm act}(x_1 + x_2 + x_3 + 1) \end{bmatrix}
=
\begin{bmatrix} \mathbb{1}_{\{x_1 + x_3 \geq 0\}} \\ \mathbb{1}_{\{x_2 + x_3 \geq 0\}} \\ \mathbb{1}_{\{x_1 + x_2 - 1 \geq 0\}} \\ \mathbb{1}_{\{x_1 + x_2 + x_3 + 1 \geq 0\}} \end{bmatrix}.
\]
For example, the region where $u^H_1 = 1$ is described by $x_1 + x_3 \geq 0$.
(b) The output $z^O$ is
\[
z^O = W^O u^H + b^O = [1,\ 1,\ -1,\ -1]
\begin{bmatrix} \mathbb{1}_{\{x_1 + x_3 \geq 0\}} \\ \mathbb{1}_{\{x_2 + x_3 \geq 0\}} \\ \mathbb{1}_{\{x_1 + x_2 \geq 1\}} \\ \mathbb{1}_{\{x_1 + x_2 + x_3 \geq -1\}} \end{bmatrix} - 1.5
= \mathbb{1}_{\{x_1 + x_3 \geq 0\}} + \mathbb{1}_{\{x_2 + x_3 \geq 0\}} - \mathbb{1}_{\{x_1 + x_2 \geq 1\}} - \mathbb{1}_{\{x_1 + x_2 + x_3 \geq -1\}} - 1.5.
\]
The points with $\hat{y} = 1$ lie in the region $z^O \geq 0$. The visualization of this region is not required.
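As an optional check, here is a minimal numpy sketch (not part of the required solution) that evaluates the network at a couple of points using the weights above; the test points are arbitrary illustrations.

import numpy as np

# Weights and biases from parts (a) and (b)
WH = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 1, 1]])
bH = np.array([0, 0, -1, 1])
WO = np.array([1, 1, -1, -1])
bO = -1.5

def yhat(x):
    """Evaluate the network at a point x = (x1, x2, x3)."""
    zH = WH @ x + bH
    uH = (zH >= 0).astype(float)   # step activation 1{z >= 0}
    zO = WO @ uH + bO
    return int(zO >= 0)

# Two hand-picked test points
print(yhat(np.array([1.0, 1.0, 1.0])))     # uH = [1,1,1,1], zO = -1.5, so yhat = 0
print(yhat(np.array([-2.0, -2.0, 2.0])))   # uH = [1,1,0,0], zO =  0.5, so yhat = 1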
2. (a) Since $W^H$ has three rows, the number of hidden units is $N_h = 3$. The outputs $z^H$ in the
hidden layer are
\[
z^H = W^H x + b^H =
\begin{bmatrix} -x \\ x \\ x \end{bmatrix} +
\begin{bmatrix} -1 \\ 1 \\ -2 \end{bmatrix} =
\begin{bmatrix} -x - 1 \\ x + 1 \\ x - 2 \end{bmatrix}.
\]
So, the outputs after ReLU activation are,
\[
u^H = \begin{bmatrix} u^H_1 \\ u^H_2 \\ u^H_3 \end{bmatrix} =
\begin{bmatrix} \max\{0, -x - 1\} \\ \max\{0, x + 1\} \\ \max\{0, x - 2\} \end{bmatrix}.
\]
The functions are plotted in Figure 1.
Figure 1: Problem 2(a). Hidden layer activations, $u^H_j$ vs. $x$ for $j = 1, 2, 3$.
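A minimal matplotlib sketch that reproduces Figure 1 (assuming matplotlib is available; not required by the problem):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 4, 200)
u1 = np.maximum(0, -x - 1)   # uH_1
u2 = np.maximum(0, x + 1)    # uH_2
u3 = np.maximum(0, x - 2)    # uH_3

for u, label in [(u1, 'uH_1'), (u2, 'uH_2'), (u3, 'uH_3')]:
    plt.plot(x, u, label=label)
plt.xlabel('x')
plt.legend()
plt.show()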
(b) Since the network is for regression, you can take
\[
\hat{y} = g_{\rm out}(z^O) = z^O.
\]
One possible loss function is the squared error,
\[
L = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.
\]
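In code, with yhat and y as length-$N$ numpy arrays of predictions and targets, this loss is a single line:

L = np.sum((yhat - y)**2)   # squared-error loss over the training set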
(c) Note that the output layer can be thought of as a linear regressor whose input is the
hidden layer activations $u^H$ and whose output is $\hat{y}$. Let $U$ be the data matrix whose $i$-th
row is $[1, (u^H_i)^T] = [1, u^H_{i,1}, u^H_{i,2}, u^H_{i,3}]$, let $\tilde{w}^T = [b^O, (W^O)^T]$, and let
$y = [y_1, y_2, \ldots, y_N]^T$. The problem is then the least squares problem
\[
\text{Minimize } \|U\tilde{w} - y\|^2.
\]
The analytical solution can be expressed as
\[
\tilde{w} = (U^T U)^{-1} U^T y.
\]
From the given $x_i$, we first determine the hidden layer outputs $z^H_i$ and $u^H_i$. These can be
computed manually from the equations in part (a), or with the Python code given at the end
of this part.
The table below lists the corresponding values:

$x_i$:            -2     -1      0      3      3.5
$z^H_{i,1}$:       1      0     -1     -4     -4.5
$z^H_{i,2}$:      -1      0      1      4      4.5
$z^H_{i,3}$:      -4     -3     -2      1      1.5
$u^H_{i,1}$:       1      0      0      0      0
$u^H_{i,2}$:       0      0      1      4      4.5
$u^H_{i,3}$:       0      0      0      1      1.5
$y_i$:             0      0      1      3      3
The data matrix is thus
\[
U = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 4 & 1 \\
1 & 0 & 4.5 & 1.5
\end{bmatrix}.
\]
The target vector is
\[
y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 3 \\ 3 \end{bmatrix}.
\]
The solution is
\[
\tilde{w} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \end{bmatrix},
\]
or equivalently
\[
b^O = 0, \qquad W^O = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}.
\]
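As a quick check of this result, the normal-equation form $\tilde{w} = (U^T U)^{-1} U^T y$ can be evaluated directly; a short sketch, with $U$ and $y$ as listed above:

import numpy as np

U = np.array([[1, 1, 0,   0  ],
              [1, 0, 0,   0  ],
              [1, 0, 1,   0  ],
              [1, 0, 4,   1  ],
              [1, 0, 4.5, 1.5]])
y = np.array([0, 0, 1, 3, 3])

# Normal equations: solve (U^T U) w = U^T y rather than forming the inverse explicitly
w_tilde = np.linalg.solve(U.T @ U, U.T @ y)
print(w_tilde)           # approximately [0, 0, 1, -1]
print(U @ w_tilde - y)   # residual is (numerically) zero, so the fit is exact

Solving the normal equations directly is fine here because $U^T U$ is small and well conditioned; np.linalg.lstsq, used in the script below, is the more general-purpose choice.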
The Python code for computing the hidden layer outputs and for determining the least
squares solution is given below.
import numpy as np

# Hidden layer parameters from part (a)
Wh = np.array([-1, 1, 1])
bh = np.array([-1, 1, -2])

# Training data
x = np.array([-2, -1, 0, 3, 3.5])
y = np.array([0, 0, 1, 3, 3])

# Hidden layer outputs: broadcasting gives one row per sample, one column per unit
zh = x[:, None] * Wh[None, :] + bh[None, :]
Uh = np.maximum(0, zh)

# Data matrix with a leading column of ones for the bias, then least squares
U = np.hstack((np.ones((5, 1)), Uh))
w_tilde = np.linalg.lstsq(U, y, rcond=None)[0]
bo = w_tilde[0]
Wo = w_tilde[1:]
(d) We can use the Python function predict defined in the next subproblem to compute $\hat{y}$
for $x$ in the range $[-3, 4]$. The resulting curve is shown in Figure 2.

x = np.linspace(-3, 4)
yhat = predict(x, Wh, bh, Wo, bo)
Figure 2: Problem 2(d). Output $\hat{y}$ vs. $x$ for the training data.
(e) We represent Wh, Wo, and bh as vectors and bo as a scalar. Then, we can write the predict
function as:
def predict(x, Wh, bh, Wo, bo):
    # Hidden layer: broadcasting gives one row per sample, one column per unit
    zh = x[:, None] * Wh[None, :] + bh[None, :]
    uh = np.maximum(0, zh)
    # Output layer: linear combination of the hidden activations plus bias
    yhat = uh.dot(Wo) + bo
    return yhat
Note the use of Python broadcasting.
3. (a) We simply add the index $i$ to all the terms:
\[
z_{ij} = \sum_{k=1}^{N_i} W_{jk} x_{ik} + b_j, \qquad
u_{ij} = \frac{1}{1 + \exp(-z_{ij})}, \qquad j = 1, \ldots, M, \qquad
\hat{y}_i = \frac{\sum_{j=1}^{M} a_j u_{ij}}{\sum_{j=1}^{M} u_{ij}}.
\tag{1}
\]
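As an illustration, a minimal numpy sketch of this forward pass; the array names (X for the $N \times N_i$ input matrix, W, b, a for the parameters) are assumptions for the sketch, not notation from the problem:

import numpy as np

def forward(X, W, b, a):
    # z_{ij} = sum_k W_{jk} x_{ik} + b_j, shape (N, M)
    Z = X @ W.T + b[None, :]
    # u_{ij} = 1 / (1 + exp(-z_{ij}))
    U = 1.0 / (1.0 + np.exp(-Z))
    # yhat_i = sum_j a_j u_{ij} / sum_j u_{ij}
    yhat = (U @ a) / U.sum(axis=1)
    return Z, U, yhat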
(b) The computation graph is shown in Fig. 3.
[Graph: $x_i \to z_i \to u_i \to \hat{y}_i \to L$, with parameters $(W, b)$ entering at $z_i$, $a$ entering at $\hat{y}_i$, and the target $y_i$ entering at $L$.]

Figure 3: Computation graph for Problem 3 mapping the training data $(x_i, y_i)$ and parameters
to the loss function $L$. Parameters are shown in light blue and data in light green.
(c) The gradient is as follows:
\[
\frac{\partial L}{\partial \hat{y}_i} = -2(y_i - \hat{y}_i)
\]
for all $i = 1, \ldots, N$.
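In code, with y and yhat as length-$N$ numpy arrays (e.g. from the forward-pass sketch above), this gradient is one line:

dloss_dyhat = -2 * (y - yhat)   # dL/dyhat_i for i = 1, ..., N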
(d) We first compute the partial derivative $\partial \hat{y}_i / \partial u_{ij}$. We rewrite the equation for $\hat{y}_i$ as
\[
\hat{y}_i = \frac{\sum_{l=1}^{M} a_l u_{il}}{\sum_{l=1}^{M} u_{il}}.
\tag{2}
\]
Note that before taking the derivative with respect to $u_{ij}$ we had to rewrite the sum in
(2) with the index $l$ so that it is not confused with the index $j$ of the variable $u_{ij}$.
Now, we use the quotient rule to find the derivative $\partial \hat{y}_i / \partial u_{ij}$:
\[
\frac{\partial \hat{y}_i}{\partial u_{ij}}
= \frac{\left( \sum_{l=1}^{M} u_{il} \right) \dfrac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} a_l u_{il}
      - \left( \sum_{l=1}^{M} a_l u_{il} \right) \dfrac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} u_{il}}
       {\left( \sum_{l=1}^{M} u_{il} \right)^2}.
\]
Note that
\[
\frac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} a_l u_{il} = a_j
\qquad \text{and} \qquad
\frac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} u_{il} = 1.
\]
Therefore,
\[
\frac{\partial \hat{y}_i}{\partial u_{ij}}
= \frac{a_j}{\sum_{l=1}^{M} u_{il}}
- \frac{\sum_{l=1}^{M} a_l u_{il}}{\left( \sum_{l=1}^{M} u_{il} \right)^2}.
\]
If we know $\partial L / \partial \hat{y}_i$ for all $i$, then $\partial L / \partial u_{ij}$ (for all $i, j$) is computed as
\[
\frac{\partial L}{\partial u_{ij}}
= \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial u_{ij}}
= \frac{\partial L}{\partial \hat{y}_i}
  \left[ \frac{a_j}{\sum_{l=1}^{M} u_{il}}
       - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left( \sum_{l=1}^{M} u_{il} \right)^2} \right].
\]
The derivative can be simplified further, but it is not necessary.
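A small finite-difference sketch to sanity-check the expression for $\partial \hat{y}_i / \partial u_{ij}$; the values of a and u are arbitrary and only for illustration:

import numpy as np

a = np.array([1.0, -2.0, 0.5, 3.0])   # arbitrary a_l
u = np.array([0.2, 0.7, 0.5, 0.9])    # one row u_{i,:} of sigmoid outputs
M = len(a)

def yhat_of_u(u):
    return np.sum(a * u) / np.sum(u)

# Analytical derivative from the expression above
usum = np.sum(u)
grad = a / usum - np.sum(a * u) / usum**2

# Centered finite differences
eps = 1e-6
grad_num = np.array([(yhat_of_u(u + eps * np.eye(M)[j]) -
                      yhat_of_u(u - eps * np.eye(M)[j])) / (2 * eps) for j in range(M)])
print(np.max(np.abs(grad - grad_num)))   # should be tiny (finite-difference error only)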
(e) Note that
\[
\frac{\partial u_{ij}}{\partial z_{ij}} = \frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\]
Given that $\partial L / \partial u_{ij}$ is known, we compute the gradient $\partial L / \partial z_{ij}$ as
\[
\frac{\partial L}{\partial z_{ij}}
= \frac{\partial L}{\partial u_{ij}} \frac{\partial u_{ij}}{\partial z_{ij}}
= \frac{\partial L}{\partial u_{ij}} \frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\]
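Continuing the numpy sketch, with Z and U from the forward pass and dloss_du holding $\partial L / \partial u_{ij}$ (computed as in part (h)); note the standard identity $e^{-z}/(1+e^{-z})^2 = u(1-u)$, which gives an equivalent form:

# du_{ij}/dz_{ij} for the sigmoid, written two equivalent ways
du_dz = np.exp(-Z) / (1 + np.exp(-Z))**2   # form used in the equations above
# du_dz = U * (1 - U)                      # equivalent via the identity above
dloss_dz = dloss_du * du_dz                # elementwise product, shape (N, M)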
(f) We first rewrite the sum in $z_{ij}$ using the index $\ell$, so that it is not confused with the index $k$:
\[
z_{ij} = \sum_{\ell=1}^{N_i} W_{j\ell} x_{i\ell} + b_j.
\]
Taking the partial derivatives,
\[
\frac{\partial z_{ij}}{\partial W_{jk}} = x_{ik}
\qquad \text{and} \qquad
\frac{\partial z_{ij}}{\partial b_j} = 1.
\]
Now, given $\partial L / \partial z_{ij}$, we can compute the gradient $\partial L / \partial W_{jk}$ using the multivariate chain
rule. (Note that $L$ depends on $W_{jk}$ through all of $z_{1j}, z_{2j}, \ldots, z_{Nj}$.)
\[
\frac{\partial L}{\partial W_{jk}}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}} \frac{\partial z_{ij}}{\partial W_{jk}}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}} x_{ik}.
\]
Similarly, $\partial L / \partial b_j$ is computed using the multivariate chain rule,
\[
\frac{\partial L}{\partial b_j}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}} \frac{\partial z_{ij}}{\partial b_j}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}}.
\]
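In code, with dloss_dz holding $\partial L / \partial z_{ij}$ (shape $N \times M$) and X the input matrix (shape $N \times N_i$), these two sums over $i$ reduce to a matrix product and a column sum (a sketch with assumed array names):

dloss_dW = dloss_dz.T @ X          # dL/dW_{jk} = sum_i (dL/dz_{ij}) x_{ik}, shape (M, N_i)
dloss_db = dloss_dz.sum(axis=0)    # dL/db_j    = sum_i  dL/dz_{ij},         shape (M,)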
(g) Putting all of this together, we get
\[
\frac{\partial L}{\partial W_{jk}}
= \sum_{i=1}^{N} -2(y_i - \hat{y}_i)
  \left[ \frac{a_j}{\sum_{l=1}^{M} u_{il}}
       - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left( \sum_{l=1}^{M} u_{il} \right)^2} \right]
  \frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2} \, x_{ik},
\tag{3}
\]
\[
\frac{\partial L}{\partial b_j}
= \sum_{i=1}^{N} -2(y_i - \hat{y}_i)
  \left[ \frac{a_j}{\sum_{l=1}^{M} u_{il}}
       - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left( \sum_{l=1}^{M} u_{il} \right)^2} \right]
  \frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\tag{4}
\]
(h) Assume u is an $N \times M$ matrix with entries $u_{ij}$ and dloss_dyhat is a length-$N$ vector
holding $\partial L / \partial \hat{y}_i$. Then, we can compute the gradients via Python broadcasting:
usum = np.sum(u, axis=1)                 # sum_l u_{il}, shape (N,)
uasum = np.sum(u * a[None, :], axis=1)   # sum_l a_l u_{il}, shape (N,)
dyhat_du = a[None, :] / usum[:, None] - (uasum / usum**2)[:, None]
dloss_du = dloss_dyhat[:, None] * dyhat_du
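Finally, a self-contained sketch that assembles the whole chain and checks one entry of $\partial L / \partial W_{jk}$ from equation (3) against a finite difference; the data and parameter values are random and purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
N, Ni, M = 4, 3, 5                      # number of samples, input size, hidden units
X = rng.normal(size=(N, Ni))
y = rng.normal(size=N)
W = rng.normal(size=(M, Ni))
b = rng.normal(size=M)
a = rng.normal(size=M)

def loss(W):
    Z = X @ W.T + b[None, :]
    U = 1.0 / (1.0 + np.exp(-Z))
    yhat = (U @ a) / U.sum(axis=1)
    return np.sum((yhat - y)**2)

def grad_W(W):
    Z = X @ W.T + b[None, :]
    U = 1.0 / (1.0 + np.exp(-Z))
    usum = U.sum(axis=1)
    yhat = (U @ a) / usum
    dloss_dyhat = -2 * (y - yhat)
    dyhat_du = a[None, :] / usum[:, None] - ((U @ a) / usum**2)[:, None]
    dloss_du = dloss_dyhat[:, None] * dyhat_du
    dloss_dz = dloss_du * U * (1 - U)   # sigmoid derivative u(1 - u)
    return dloss_dz.T @ X

# Compare one entry of the analytical gradient against a centered difference
j, k, eps = 1, 2, 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[j, k] += eps
Wm[j, k] -= eps
print(grad_W(W)[j, k], (loss(Wp) - loss(Wm)) / (2 * eps))   # the two values should agree closely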