
Introduction to Machine Learning

Homework 7: Neural Networks


Prof. Sundeep Rangan and Yao Wang

Solution

1. (a) The linear functions in the hidden layer are:

\[
\mathbf{z}^H = \mathbf{W}^H \mathbf{x} + \mathbf{b}^H =
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} +
\begin{bmatrix} 0 \\ 0 \\ -1 \\ 1 \end{bmatrix} =
\begin{bmatrix} x_1 + x_3 \\ x_2 + x_3 \\ x_1 + x_2 - 1 \\ x_1 + x_2 + x_3 + 1 \end{bmatrix}.
\]

Hence, the activation functions are

\[
\mathbf{u}^H = g_{\mathrm{act}}(\mathbf{z}^H) =
\begin{bmatrix} g_{\mathrm{act}}(x_1 + x_3) \\ g_{\mathrm{act}}(x_2 + x_3) \\ g_{\mathrm{act}}(x_1 + x_2 - 1) \\ g_{\mathrm{act}}(x_1 + x_2 + x_3 + 1) \end{bmatrix} =
\begin{bmatrix} \mathbb{1}\{x_1 + x_3 \ge 0\} \\ \mathbb{1}\{x_2 + x_3 \ge 0\} \\ \mathbb{1}\{x_1 + x_2 - 1 \ge 0\} \\ \mathbb{1}\{x_1 + x_2 + x_3 + 1 \ge 0\} \end{bmatrix}.
\]

For example, the region where u^H_1 = 1 is described by x_1 + x_3 ≥ 0.


(b) The output z^O is

\[
z^O = \mathbf{W}^O \mathbf{u}^H + b^O = [1,\ 1,\ -1,\ -1]
\begin{bmatrix} \mathbb{1}\{x_1 + x_3 \ge 0\} \\ \mathbb{1}\{x_2 + x_3 \ge 0\} \\ \mathbb{1}\{x_1 + x_2 \ge 1\} \\ \mathbb{1}\{x_1 + x_2 + x_3 \ge -1\} \end{bmatrix} - 1.5
= \mathbb{1}\{x_1 + x_3 \ge 0\} + \mathbb{1}\{x_2 + x_3 \ge 0\} - \mathbb{1}\{x_1 + x_2 \ge 1\} - \mathbb{1}\{x_1 + x_2 + x_3 \ge -1\} - 1.5.
\]
The points with ŷ = 1 lie in the region z^O ≥ 0. The visualization of this region is not required.
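As an optional sanity check (not part of the required solution), the network can be evaluated numerically. The sketch below assumes numpy and a hard-threshold (unit-step) activation, with the weights and biases read off from the expressions above.

import numpy as np

# weights and biases of the Problem 1 network (unit-step hidden activation assumed)
WH = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 1, 1]])
bH = np.array([0, 0, -1, 1])
Wo = np.array([1, 1, -1, -1])
bo = -1.5

def classify(x):
    zH = WH @ x + bH              # hidden pre-activations
    uH = (zH >= 0).astype(float)  # step activations
    zO = Wo @ uH + bo             # output score
    return int(zO >= 0)           # yhat = 1 when zO >= 0

print(classify(np.array([1.0, 1.0, 1.0])))   # example query point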
2. (a) Since W^H has three rows, the number of hidden units is N_h = 3. The outputs z^H in the
hidden layer are

\[
\mathbf{z}^H = \mathbf{W}^H x + \mathbf{b}^H =
\begin{bmatrix} -x \\ x \\ x \end{bmatrix} +
\begin{bmatrix} -1 \\ 1 \\ -2 \end{bmatrix} =
\begin{bmatrix} -x - 1 \\ x + 1 \\ x - 2 \end{bmatrix}.
\]

So, the outputs after ReLU activation are

\[
\mathbf{u}^H = \begin{bmatrix} u^H_1 \\ u^H_2 \\ u^H_3 \end{bmatrix} =
\begin{bmatrix} \max\{0, -x - 1\} \\ \max\{0, x + 1\} \\ \max\{0, x - 2\} \end{bmatrix}.
\]
The functions are plotted in Figure 1.

Figure 1: Problem 2(a). Hidden layer activations u^H_j vs. x for j = 1, 2, 3.
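Figure 1 can be reproduced with a short script; the following is only a sketch (matplotlib is assumed and plotting code was not required).

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 4, 200)
Wh = np.array([-1, 1, 1])
bh = np.array([-1, 1, -2])
uh = np.maximum(0, x[:, None] * Wh[None, :] + bh[None, :])   # ReLU hidden activations

for j in range(3):
    plt.plot(x, uh[:, j], label='u%d' % (j + 1))
plt.xlabel('x')
plt.legend()
plt.show()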

(b) Since the network is for regression, you can take

\[
\hat{y} = g_{\mathrm{out}}(z^O) = z^O.
\]

One possible loss function is the squared error,

\[
L = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.
\]
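In numpy this loss is a one-liner (a sketch; numpy is assumed imported as np, and yhat and y are length-N arrays of predictions and targets):

L = np.sum((yhat - y)**2)   # squared-error loss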

(c) Note that the output layer can be thought of as a linear regressor whose input is the
hidden layer activation u^H and whose output is ŷ. Let U be the data matrix whose i-th
row is [1, (u^H_i)^T] = [1, u^H_{i,1}, u^H_{i,2}, u^H_{i,3}], let \tilde{w}^T = [b^O, (W^O)^T], and let
y = [y_1, y_2, \ldots, y_N]^T. The problem is then to solve the least squares problem

\[
\min_{\tilde{w}} \ \| U \tilde{w} - y \|^2.
\]

The analytical solution can be expressed as

\[
\tilde{w} = (U^T U)^{-1} U^T y.
\]
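In code, this closed form can be evaluated directly through the normal equations (a sketch; numpy is assumed imported as np, U and y are the arrays constructed below, and np.linalg.lstsq, used later, is the numerically safer choice):

w_tilde = np.linalg.solve(U.T @ U, U.T @ y)   # (U^T U)^{-1} U^T y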

From the given x_i, we first determine the hidden layer outputs z^H_i and u^H_i. These can be
computed by hand from the equations in Part (a), or with the Python code given below.
The table below lists the corresponding values:
      x_i           -2            -1             0             3            3.5
  (z^H_i)^T    [1, -1, -4]    [0, 0, -3]    [-1, 1, -2]    [-4, 4, 1]   [-4.5, 4.5, 1.5]
  (u^H_i)^T    [1, 0, 0]      [0, 0, 0]     [0, 1, 0]      [0, 4, 1]    [0, 4.5, 1.5]
      y_i            0             0             1             3             3

The data matrix is thus

\[
U = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 4 & 1 \\
1 & 0 & 4.5 & 1.5
\end{bmatrix}.
\]
The target vector is

\[
y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 3 \\ 3 \end{bmatrix}.
\]
The solution is

\[
\tilde{w} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \end{bmatrix},
\]

or

\[
b^O = 0, \qquad W^O = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}.
\]
The Python code for computing the hidden layer outputs and for determining the least
squares solution is given below.

import numpy as np

Wh = np.array([-1, 1, 1])
bh = np.array([-1, 1, -2])
x = np.array([-2, -1, 0, 3, 3.5])
y = np.array([0, 0, 1, 3, 3])

# hidden layer pre-activations and ReLU outputs for each sample
zh = x[:, None] * Wh[None, :] + bh[None, :]
Uh = np.maximum(0, zh)

# data matrix with a leading column of ones, then the least squares fit
U = np.hstack((np.ones((5, 1)), Uh))
w_tilde = np.linalg.lstsq(U, y, rcond=None)[0]
bo = w_tilde[0]
Wo = w_tilde[1:]

(d) We can use the Python function predict, defined in the next subproblem, to compute ŷ
for x in the range [−3, 4]. The resulting curve is shown in Figure 2 below.

x = np.linspace(-3, 4)
yhat = predict(x, Wh, bh, Wo, bo)
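To actually draw the curve, something like the following matplotlib snippet can be used (a sketch; matplotlib is assumed and plotting code was not required):

import matplotlib.pyplot as plt

plt.plot(x, yhat)                                     # predicted output over [-3, 4]
plt.scatter([-2, -1, 0, 3, 3.5], [0, 0, 1, 3, 3])     # training data from the table above
plt.xlabel('x')
plt.ylabel('yhat')
plt.show()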


Figure 2: Problem 2(c). Output for the training data.

(e) We represent Wh, Wo, and bh as vectors and bo as a scalar. Then, we can write the predict
function as:

def predict(x, Wh, bh, Wo, bo):
    # hidden layer: broadcasting gives one row of pre-activations per sample
    zh = x[:, None] * Wh[None, :] + bh[None, :]
    uh = np.maximum(0, zh)
    # output layer: linear combination of the hidden activations
    yhat = uh.dot(Wo) + bo
    return yhat

Note the use of numpy broadcasting.


3. (a) We simply add the index i to all the terms:

\[
z_{ij} = \sum_{k=1}^{N_i} W_{jk} x_{ik} + b_j, \qquad
u_{ij} = \frac{1}{1 + \exp(-z_{ij})}, \qquad j = 1, \ldots, M, \qquad
\hat{y}_i = \frac{\sum_{j=1}^{M} a_j u_{ij}}{\sum_{j=1}^{M} u_{ij}}.
\tag{1}
\]
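For concreteness, a vectorized numpy forward pass for this model might look as follows (a sketch, not part of the required solution; here X is an N x d data matrix, W is M x d, and b and a are length-M vectors):

import numpy as np

def forward(X, W, b, a):
    Z = X @ W.T + b[None, :]          # z[i, j] = sum_k W[j, k] x[i, k] + b[j]
    U = 1.0 / (1.0 + np.exp(-Z))      # sigmoid hidden units
    yhat = (U @ a) / U.sum(axis=1)    # weighted average output of equation (1)
    return Z, U, yhat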

(b) The computation graph is shown in Fig. 3.

[Computation graph: x_i → z_i → u_i → ŷ_i → L, with y_i feeding into L; the parameters (W, b) feed into z_i and a feeds into ŷ_i.]

Figure 3: Computation graph for Problem 3 mapping the training data (x_i, y_i) and parameters
to the loss function L. Parameters are shown in light blue and data in light green.

(c) The gradient is as follows:

\[
\frac{\partial L}{\partial \hat{y}_i} = -2(y_i - \hat{y}_i)
\]

for all i = 1, \ldots, N.
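In numpy this is a single broadcast expression (a sketch; y and yhat are length-N arrays):

dloss_dyhat = -2 * (y - yhat)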
(d) We first compute the partial derivative ∂ŷ_i/∂u_{ij}. We rewrite the equation for ŷ_i as

\[
\hat{y}_i = \frac{\sum_{l=1}^{M} a_l u_{il}}{\sum_{l=1}^{M} u_{il}}.
\tag{2}
\]

Note that before taking the derivative with respect to u_{ij} we had to rewrite the sum in
(2) with the index l so that it is not confused with the index j of the variable u_{ij}.
Now, we use the quotient rule to find the derivative ∂ŷ_i/∂u_{ij}:

\[
\frac{\partial \hat{y}_i}{\partial u_{ij}} =
\frac{\left(\sum_{l=1}^{M} u_{il}\right) \dfrac{\partial}{\partial u_{ij}}\left(\sum_{l=1}^{M} a_l u_{il}\right)
      - \left(\sum_{l=1}^{M} a_l u_{il}\right) \dfrac{\partial}{\partial u_{ij}}\left(\sum_{l=1}^{M} u_{il}\right)}
     {\left(\sum_{l=1}^{M} u_{il}\right)^2}.
\]

Note that

\[
\frac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} a_l u_{il} = a_j
\qquad \text{and} \qquad
\frac{\partial}{\partial u_{ij}} \sum_{l=1}^{M} u_{il} = 1.
\]

Therefore,

\[
\frac{\partial \hat{y}_i}{\partial u_{ij}} =
\frac{a_j}{\sum_{l=1}^{M} u_{il}} - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left(\sum_{l=1}^{M} u_{il}\right)^2}.
\]

If we know ∂L/∂ŷ_i for all i, then ∂L/∂u_{ij} (for all i, j) is computed as

\[
\frac{\partial L}{\partial u_{ij}} = \frac{\partial L}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial u_{ij}}
= \frac{\partial L}{\partial \hat{y}_i}\left(\frac{a_j}{\sum_{l=1}^{M} u_{il}} - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left(\sum_{l=1}^{M} u_{il}\right)^2}\right).
\]

The derivative can be simplified further, but it is not necessary.


(e) Note that

\[
\frac{\partial u_{ij}}{\partial z_{ij}} = \frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\]

Given that ∂L/∂u_{ij} is known, we compute the gradient ∂L/∂z_{ij} as

\[
\frac{\partial L}{\partial z_{ij}} = \frac{\partial L}{\partial u_{ij}}\,\frac{\partial u_{ij}}{\partial z_{ij}}
= \frac{\partial L}{\partial u_{ij}}\,\frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\]
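Elementwise in numpy this step could look like the following (a sketch; z and dloss_du are assumed to be N x M arrays; the equivalent identity exp(-z)/(1+exp(-z))^2 = u(1-u) is also commonly used):

du_dz = np.exp(-z) / (1.0 + np.exp(-z))**2   # sigmoid derivative, elementwise
dloss_dz = dloss_du * du_dz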

(f) We first rewrite the sum in z_{ij} using the index ℓ, so that it is not confused with the index k:

\[
z_{ij} = \sum_{\ell=1}^{N_i} W_{j\ell}\, x_{i\ell} + b_j.
\]

Taking the partial derivatives,

\[
\frac{\partial z_{ij}}{\partial W_{jk}} = x_{ik}
\qquad \text{and} \qquad
\frac{\partial z_{ij}}{\partial b_j} = 1.
\]
Now, given ∂L/∂z_{ij}, we can compute the gradient ∂L/∂W_{jk} using the multivariate chain rule
(note that L depends on W_{jk} only through the N variables z_{1j}, z_{2j}, \ldots, z_{Nj}):

\[
\frac{\partial L}{\partial W_{jk}} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}}\,\frac{\partial z_{ij}}{\partial W_{jk}}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}}\, x_{ik}.
\]

Similarly, ∂L/∂b_j is computed using the multivariate chain rule,

\[
\frac{\partial L}{\partial b_j} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}}\,\frac{\partial z_{ij}}{\partial b_j}
= \sum_{i=1}^{N} \frac{\partial L}{\partial z_{ij}}.
\]
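These sums over i are simple matrix operations in numpy (a sketch; X is assumed to be the N x d input matrix and dloss_dz the N x M array of gradients from part (e)):

dloss_dW = dloss_dz.T @ X           # shape (M, d): entry (j, k) is sum_i dL/dz_ij * x_ik
dloss_db = dloss_dz.sum(axis=0)     # shape (M,):   entry j is sum_i dL/dz_ij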

(g) Putting all this together, we get

\[
\frac{\partial L}{\partial W_{jk}} = \sum_{i=1}^{N} -2(y_i - \hat{y}_i)
\left(\frac{a_j}{\sum_{l=1}^{M} u_{il}} - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left(\sum_{l=1}^{M} u_{il}\right)^2}\right)
\frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}\, x_{ik},
\tag{3}
\]

\[
\frac{\partial L}{\partial b_j} = \sum_{i=1}^{N} -2(y_i - \hat{y}_i)
\left(\frac{a_j}{\sum_{l=1}^{M} u_{il}} - \frac{\sum_{l=1}^{M} a_l u_{il}}{\left(\sum_{l=1}^{M} u_{il}\right)^2}\right)
\frac{\exp(-z_{ij})}{(1 + \exp(-z_{ij}))^2}.
\tag{4}
\]

(h) Assume u is an N x M matrix and dloss_dyhat is a length-N vector. Then, we can compute the
gradients via Python broadcasting:

# row sums appearing in the quotient-rule expression
usum = np.sum(u, axis=1)
uasum = np.sum(u * a[None, :], axis=1)
# dyhat/du: a_j / sum_l u_il  -  (sum_l a_l u_il) / (sum_l u_il)^2
dyhat_du = a[None, :] / usum[:, None] - uasum[:, None] / (usum[:, None]**2)
dloss_du = dloss_dyhat[:, None] * dyhat_du
