
Christopher M. Bishop
with Hugh Bishop

Deep Learning
Foundations and Concepts

Solutions to Exercises
Chapters 2 to 10 | Version 1.0

This is version 1.0 of the solutions manual for Deep Learning: Foundations
and Concepts by C. M. Bishop and H. Bishop (Springer, 2024) and contains worked
solutions for exercises in Chapters 2 to 10. A full solutions manual, including
solutions to all exercises in the book, will be released soon. The most recent
version of the solutions manual, along with a free-to-use digital version of the
book as well as downloadable versions of the figures in PDF and JPEG formats, can
be found on the book web site:

https://www.bishopbook.com

If you have any feedback on the book or associated materials, including this
solutions manual, please send email to the authors at

[email protected]
Contents

Chapter 2: Probabilities
Chapter 3: Standard Distributions
Chapter 4: Single-layer Networks: Regression
Chapter 5: Single-layer Networks: Classification
Chapter 6: Deep Neural Networks
Chapter 7: Gradient Descent
Chapter 8: Backpropagation
Chapter 9: Regularization
Chapter 10: Convolutional Networks

Chapter 2 Probabilities

2.1 We first compute p(T = 1) by modifying (2.20):

    p(T = 1) = p(T = 1|C = 0) p(C = 0) + p(T = 1|C = 1) p(C = 1)
             = (3/100) × (999/1,000) + (90/100) × (1/1,000)
             = 3,087/100,000 = 0.03087.    (1)

Then we evaluate p(C = 1|T = 1) by modifying (2.22):

    p(C = 1|T = 1) = p(T = 1|C = 1) p(C = 1) / p(T = 1)    (2)
                   = (90/100) × (1/1,000) × (100,000/3,087)
                   = 90/3,087 ≈ 0.029.    (3)

Hence we see that the probability of having cancer, even after a positive test, remains
very small.
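As a quick numerical check of this calculation (a sketch, using the probabilities stated in the exercise):

```python
# Numeric check of the screening-test calculation in Solution 2.1,
# using the probabilities stated in the exercise.
p_C1 = 1 / 1000           # prior probability of cancer, p(C = 1)
p_C0 = 1 - p_C1
p_T1_given_C1 = 90 / 100  # p(T = 1 | C = 1)
p_T1_given_C0 = 3 / 100   # p(T = 1 | C = 0)

# Sum rule, as in equation (1)
p_T1 = p_T1_given_C0 * p_C0 + p_T1_given_C1 * p_C1

# Bayes' theorem, as in equations (2)-(3)
p_C1_given_T1 = p_T1_given_C1 * p_C1 / p_T1

print(p_T1)           # 0.03087
print(p_C1_given_T1)  # ≈ 0.029
```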

2.2 Note that each of the numbers 0, 1, 2, 3, 4, 5, and 6 appears on only one of the dice,
which means that when we roll one die against another, there can never be a draw.
Look first at the red die, and notice that it has four copies of the number 2 and two
copies of the number 6. Two-thirds of the time, when we roll the red die it will
give a 2, and one third of the time it will give a 6. Therefore, if we roll the red die
against the yellow die (which always gives a 3), the yellow die will, on average, win
two-thirds of the time, and will lose one-third of the time.
Now look at the blue die, and notice that it has four copies of the number 4, and two
copies of the number 0. When we roll it against the yellow die, it will therefore give
a 4 two thirds of the time, in which case it wins, and a 0 one-third of the time, in
which case it loses.
Next consider the green die versus the blue die. The green die has three copies of
the number 1 and three copies of the number 5. To work out the probability that
the green die will win we first note that there is a probability of 1/2 that the green
die will give a 5, in which case it is certain to win against the blue die. Likewise,
there is a probability of 1/2 that the green die will give a 1, in which case there is a
probability of 1/3 that it will win. The overall probability that the green die will win
is then given by multiplying the probabilities:

    (1/2) × 1 + (1/2) × (1/3) = 2/3.    (4)

Finally, consider the probability of the red die winning against the green die. There
is a probability of 1/3 that the red die will produce a 6, in which case it is certain
that the red die will win. There is similarly a probability of 2/3 that the red die will

produce a 2 in which case there is a 1/2 chance that the red die will win. The overall
probability of the red die winning is again obtained by multiplying the probabilities:

    (1/3) × 1 + (2/3) × (1/2) = 2/3.    (5)

For more information on these dice, see:


microsoft.com/en-us/research/project/non-transitive-dice/
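The four pairwise contests can also be checked by exhaustive enumeration. The face multisets below are inferred from the solution text (red: four 2s and two 6s; yellow: six 3s; blue: four 4s and two 0s; green: three 1s and three 5s):

```python
from fractions import Fraction
from itertools import product

# Face multisets inferred from the description in Solution 2.2.
dice = {
    "red":    [2, 2, 2, 2, 6, 6],
    "yellow": [3, 3, 3, 3, 3, 3],
    "blue":   [4, 4, 4, 4, 0, 0],
    "green":  [1, 1, 1, 5, 5, 5],
}

def p_win(a, b):
    """Probability that die a beats die b (no draws occur for these pairs)."""
    wins = sum(1 for x, y in product(dice[a], dice[b]) if x > y)
    return Fraction(wins, 36)

# Each die beats the next in the cycle with probability 2/3:
for a, b in [("yellow", "red"), ("blue", "yellow"),
             ("green", "blue"), ("red", "green")]:
    print(a, "beats", b, "with probability", p_win(a, b))
```

This makes the non-transitivity explicit: every die in the cycle is beaten by another with probability 2/3.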

2.3 Using the sum and product rules of probability we can write the desired distribution
in the form

    p(y) = ∫∫ p(y|u, v) p_u(u) p_v(v) du dv.    (6)

Since y is a deterministic function of u and v, its conditional distribution is given
by a Dirac delta function in the form

    p(y|u, v) = δ(y − u − v).    (7)

Substituting (7) into (6) allows us to perform the integration over v to give

    p(y) = ∫ p_u(u) p_v(y − u) du    (8)

as required.

2.4 If we integrate the uniform distribution (2.33) over x we obtain

    ∫_{−∞}^{∞} p(x) dx = ∫_c^d 1/(d − c) dx = (d − c)/(d − c) = 1    (9)

and hence this distribution is normalized. For the mean of the distribution we have

    E[x] = ∫_c^d x/(d − c) dx = [ x²/(2(d − c)) ]_c^d = (d² − c²)/(2(d − c)) = (c + d)/2.

The variance can be found by first evaluating

    E[x²] = ∫_c^d x²/(d − c) dx = [ x³/(3(d − c)) ]_c^d = (d³ − c³)/(3(d − c)) = (c² + cd + d²)/3

and then using (2.46) to give

    var[x] = E[x²] − E[x]² = (c² + cd + d²)/3 − (c + d)²/4 = (d − c)²/12.
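The closed-form mean and variance can be checked against direct numerical integration; the endpoint values c and d below are arbitrary choices for the check:

```python
# Numerical check of the uniform-distribution moments in Solution 2.4.
# The endpoints c and d are arbitrary choices for the check.
c, d = 1.5, 4.0
n = 100_000
dx = (d - c) / n

# Midpoint-rule integrals of x p(x) and x^2 p(x), with p(x) = 1/(d - c)
xs = [c + (i + 0.5) * dx for i in range(n)]
mean = sum(xs) * dx / (d - c)
second = sum(x * x for x in xs) * dx / (d - c)
var = second - mean**2

print(mean)  # ≈ (c + d)/2 = 2.75
print(var)   # ≈ (d - c)^2 / 12 ≈ 0.5208
```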

2.5 Integrating the exponential distribution (2.34) over 0 ≤ x < ∞ we obtain

    ∫_0^∞ p(x|λ) dx = ∫_0^∞ λ exp(−λx) dx
                    = ∫_0^∞ exp(−y) dy
                    = [ −exp(−y) ]_0^∞
                    = 1    (10)

where we have used the change of variables y = λx. Hence, the exponential dis-
tribution is normalized. Likewise, if we integrate the Laplace distribution (2.35) we
obtain

    ∫_{−∞}^{∞} p(x|µ, γ) dx = ∫_{−∞}^{∞} (1/(2γ)) exp( −|x − µ|/γ ) dx
      = ∫_{−∞}^{µ} (1/(2γ)) exp( (x − µ)/γ ) dx + ∫_{µ}^{∞} (1/(2γ)) exp( −(x − µ)/γ ) dx
      = ∫_{−∞}^{0} (1/2) exp(z) dz + ∫_{0}^{∞} (1/2) exp(−z) dz
      = [ (1/2) exp(z) ]_{−∞}^{0} + [ −(1/2) exp(−z) ]_{0}^{∞}
      = 1/2 + 1/2 = 1    (11)

where we have made the substitution z = (x − µ)/γ in each of the two integrals.
Hence we see that the Laplace distribution is also normalized.
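Both normalization results can be confirmed by numerical quadrature; the parameter values and integration limits below are arbitrary choices for the check:

```python
import math

# Numerical check that the exponential density lambda*exp(-lambda*x) and
# the Laplace density exp(-|x - mu|/gamma)/(2*gamma) integrate to one.
# Parameter values and integration limits are arbitrary choices.
lam, mu, gamma = 0.7, 1.2, 2.5

def integrate(f, lo, hi, n=200_000):
    """Midpoint-rule quadrature of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

Z_exp = integrate(lambda x: lam * math.exp(-lam * x), 0.0, 60.0)
Z_lap = integrate(lambda x: math.exp(-abs(x - mu) / gamma) / (2 * gamma),
                  -80.0, 80.0)
print(Z_exp, Z_lap)  # both ≈ 1
```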

2.6 Integrating the empirical density (2.37) we obtain

    ∫_{−∞}^{∞} p(x|D) dx = (1/N) Σ_{n=1}^{N} ∫_{−∞}^{∞} δ(x − x_n) dx
                         = (1/N) Σ_{n=1}^{N} ∫_{−∞}^{∞} δ(y) dy
                         = (1/N) Σ_{n=1}^{N} 1 = 1    (12)

as required. Here we have substituted y = x − x_n into the nth integral in the
summation and then used the definition of the Dirac delta function. This result is
easily generalized to a multi-dimensional data variable x.

2.7 If we substitute the empirical distribution (2.37) into the definition of the expectation
with respect to a continuous density given by (2.39) we obtain

    E[f] = ∫_{−∞}^{∞} p(x) f(x) dx
         ≈ (1/N) Σ_{n=1}^{N} ∫_{−∞}^{∞} δ(x − x_n) f(x) dx
         = (1/N) Σ_{n=1}^{N} ∫_{−∞}^{∞} δ(y_n) f(y_n + x_n) dy_n
         = (1/N) Σ_{n=1}^{N} f(x_n)    (13)

as required. Here we have used the change of variable y_n = x − x_n separately
in each of the integrals, along with the property that the Dirac delta function δ(y_n)
integrates to unity, with the only non-zero contribution coming from y_n = 0.
2.8 Expanding the square we have

    E[(f(x) − E[f(x)])²] = E[f(x)² − 2 f(x) E[f(x)] + E[f(x)]²]
                         = E[f(x)²] − 2 E[f(x)] E[f(x)] + E[f(x)]²
                         = E[f(x)²] − E[f(x)]²

as required.
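This identity is easy to confirm empirically; the choice of f and of the distribution of x below is arbitrary:

```python
import random

# Empirical check of var[f] = E[f^2] - E[f]^2 from Solution 2.8.
# The function f and the distribution of x are arbitrary choices.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50_000)]
fs = [x**3 - 2 * x for x in xs]

N = len(fs)
mean_f = sum(fs) / N
lhs = sum((f - mean_f) ** 2 for f in fs) / N     # E[(f - E[f])^2]
rhs = sum(f * f for f in fs) / N - mean_f ** 2   # E[f^2] - E[f]^2
print(abs(lhs - rhs))  # agreement to floating-point precision
```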
2.9 The definition of covariance is given by (2.47) as

    cov[x, y] = E[xy] − E[x] E[y].

Using (2.38) and the fact that p(x, y) = p(x) p(y) when x and y are independent, we
obtain

    E[xy] = Σ_x Σ_y p(x, y) x y = ( Σ_x p(x) x ) ( Σ_y p(y) y ) = E[x] E[y]

and hence cov[x, y] = 0. The case where x and y are continuous variables is analo-
gous, with (2.38) replaced by (2.39) and the sums replaced by integrals.
2.10 Since x and z are independent, their joint distribution factorizes as p(x, z) = p(x) p(z),
and so

    E[x + z] = ∫∫ (x + z) p(x) p(z) dx dz    (14)
             = ∫ x p(x) dx + ∫ z p(z) dz    (15)
             = E[x] + E[z].    (16)

Similarly for the variances, we first note that

    (x + z − E[x + z])² = (x − E[x])² + (z − E[z])² + 2(x − E[x])(z − E[z])    (17)

where the final term will integrate to zero with respect to the factorized distribution
p(x) p(z). Hence

    var[x + z] = ∫∫ (x + z − E[x + z])² p(x) p(z) dx dz
               = ∫ (x − E[x])² p(x) dx + ∫ (z − E[z])² p(z) dz
               = var[x] + var[z].    (18)

For discrete variables the integrals are replaced by summations, and the same results
are again obtained.
2.11 Using the definition (2.39) of expectation we have

    E_y[E_x[x|y]] = ∫ p(y) { ∫ p(x|y) x dx } dy
                  = ∫∫ p(x, y) x dx dy
                  = ∫ p(x) x dx = E[x]    (19)

where we have used the product rule of probability p(x|y) p(y) = p(x, y). Now we
make use of the result (2.46) to write

    E_y[var_x[x|y]] + var_y[E_x[x|y]] =
      E_y[E_x[x²|y]] − E_y[E_x[x|y]²] + E_y[E_x[x|y]²] − E_y[E_x[x|y]]².    (20)

We now note that the second and third terms on the right-hand side of (20) cancel.
The first term on the right-hand side of (20) can be written as

    E_y[E_x[x²|y]] = ∫ p(y) { ∫ p(x|y) x² dx } dy
                   = ∫∫ p(x, y) x² dx dy
                   = ∫ p(x) x² dx = E[x²].    (21)

Likewise, we can again make use of the result (2.46) to write the fourth term on the
right-hand side of (20) in the form E[x]². Hence we have

    E_y[var_x[x|y]] + var_y[E_x[x|y]] = E[x²] − E[x]² = var[x]    (22)

as required.
2.12 The transformation from Cartesian to polar coordinates is defined by

    x = r cos θ    (23)
    y = r sin θ    (24)

and hence we have x² + y² = r², where we have used the well-known trigonometric
result (3.127). Also the Jacobian of the change of variables is easily seen to be

    ∂(x, y)/∂(r, θ) = | ∂x/∂r  ∂x/∂θ |  =  | cos θ   −r sin θ |  =  r cos²θ + r sin²θ = r
                      | ∂y/∂r  ∂y/∂θ |     | sin θ    r cos θ |

where again we have used (3.127). Thus the double integral in (2.125) becomes

    I² = ∫_0^{2π} ∫_0^{∞} exp( −r²/(2σ²) ) r dr dθ    (25)
       = 2π ∫_0^{∞} exp( −u/(2σ²) ) (1/2) du    (26)
       = π [ exp( −u/(2σ²) ) (−2σ²) ]_0^{∞}    (27)
       = 2πσ²    (28)

where we have used the change of variables r² = u. Thus

    I = (2πσ²)^{1/2}.

Finally, using the transformation y = x − µ, the integral of the Gaussian distribution
becomes

    ∫_{−∞}^{∞} N(x|µ, σ²) dx = (2πσ²)^{−1/2} ∫_{−∞}^{∞} exp( −y²/(2σ²) ) dy
                             = I/(2πσ²)^{1/2} = 1

as required.
2.13 From the definition (2.49) of the univariate Gaussian distribution, we have

    E[x] = ∫_{−∞}^{∞} (1/(2πσ²))^{1/2} exp( −(x − µ)²/(2σ²) ) x dx.    (29)

Now change variables using y = x − µ to give

    E[x] = ∫_{−∞}^{∞} (1/(2πσ²))^{1/2} exp( −y²/(2σ²) ) (y + µ) dy.    (30)

We now note that in the factor (y + µ) the first term in y corresponds to an odd
integrand and so this integral must vanish (to show this explicitly, write the integral
as the sum of two integrals, one from −∞ to 0 and the other from 0 to ∞, and then
show that these two integrals cancel). In the second term, µ is a constant and pulls
outside the integral, leaving a normalized Gaussian distribution which integrates to
1, and so we obtain (2.52).

To derive (2.53) we first substitute the expression (2.49) for the normal distribution
into the normalization result (2.51) and re-arrange to obtain

    ∫_{−∞}^{∞} exp( −(x − µ)²/(2σ²) ) dx = (2πσ²)^{1/2}.    (31)

We now differentiate both sides of (31) with respect to σ² and then re-arrange to
obtain

    (1/(2πσ²))^{1/2} ∫_{−∞}^{∞} exp( −(x − µ)²/(2σ²) ) (x − µ)² dx = σ²    (32)

which directly shows that

    E[(x − µ)²] = var[x] = σ².    (33)

Now we expand the square on the left-hand side giving

    E[x²] − 2µ E[x] + µ² = σ².

Making use of (2.52) then gives (2.53) as required.

Finally, (2.54) follows directly from (2.52) and (2.53):

    E[x²] − E[x]² = µ² + σ² − µ² = σ².


2.14 For the univariate case, we simply differentiate (2.49) with respect to x to obtain

    (d/dx) N(x|µ, σ²) = −N(x|µ, σ²) (x − µ)/σ².

Setting this to zero we obtain x = µ.
2.15 We use ℓ to denote ln p(X|µ, σ²) from (2.56). By standard rules of differentiation
we obtain

    ∂ℓ/∂µ = (1/σ²) Σ_{n=1}^{N} (x_n − µ).

Setting this equal to zero and moving the terms involving µ to the other side of the
equation we get

    (1/σ²) Σ_{n=1}^{N} x_n = (1/σ²) N µ

and by multiplying both sides by σ²/N we get (2.57).

Similarly we have

    ∂ℓ/∂σ² = (1/(2(σ²)²)) Σ_{n=1}^{N} (x_n − µ)² − (N/2)(1/σ²)

and setting this to zero we obtain

    (N/2)(1/σ²) = (1/(2(σ²)²)) Σ_{n=1}^{N} (x_n − µ)².

Multiplying both sides by 2(σ²)²/N and substituting µ_ML for µ we get (2.58).
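The resulting estimators are just the sample mean and the biased sample variance, which can be checked by simulation (the true parameter values below are arbitrary choices):

```python
import random

# Check that the ML solutions (2.57) and (2.58) are the sample mean and
# the (biased) sample variance. True parameter values are arbitrary.
random.seed(1)
mu_true, sigma_true, N = 3.0, 2.0, 100_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(N)]

mu_ml = sum(xs) / N                             # equation (2.57)
var_ml = sum((x - mu_ml) ** 2 for x in xs) / N  # equation (2.58)
print(mu_ml, var_ml)  # ≈ 3.0 and ≈ 4.0
```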

2.16 If m = n then x_n x_m = x_n², and using (2.53) we obtain E[x_n²] = µ² + σ², whereas if
n ≠ m then the two data points x_n and x_m are independent and hence E[x_n x_m] =
E[x_n] E[x_m] = µ², where we have used (2.52). Combining these two results we
obtain (2.128).

Next we have

    E[µ_ML] = (1/N) Σ_{n=1}^{N} E[x_n] = µ    (34)

using (2.52).

Finally, consider E[σ²_ML]. From (2.57) and (2.58), and making use of (2.128), we
have

    E[σ²_ML] = E[ (1/N) Σ_{n=1}^{N} ( x_n − (1/N) Σ_{m=1}^{N} x_m )² ]
             = (1/N) Σ_{n=1}^{N} E[ x_n² − (2/N) x_n Σ_{m=1}^{N} x_m
                                   + (1/N²) Σ_{m=1}^{N} Σ_{l=1}^{N} x_m x_l ]
             = µ² + σ² − 2( µ² + (1/N) σ² ) + µ² + (1/N) σ²
             = ( (N − 1)/N ) σ²    (35)

as required.
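The factor (N − 1)/N can be illustrated by Monte Carlo, averaging the ML variance estimate over many small data sets (the sample size and trial count below are arbitrary choices):

```python
import random

# Monte Carlo illustration of E[sigma_ML^2] = ((N - 1)/N) sigma^2 from
# Solution 2.16, averaging over many small data sets.
random.seed(2)
N, trials = 5, 100_000   # true distribution: standard normal, sigma^2 = 1

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = sum(xs) / N
    total += sum((x - m) ** 2 for x in xs) / N

avg_var_ml = total / trials
print(avg_var_ml)  # ≈ (N - 1)/N = 0.8
```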
2.17 From the definition (2.61), and making use of (2.52) and (2.53), we have

    E[σ̂²] = E[ (1/N) Σ_{n=1}^{N} (x_n − µ)² ]
          = (1/N) Σ_{n=1}^{N} E[ x_n² − 2 x_n µ + µ² ]
          = (1/N) Σ_{n=1}^{N} ( µ² + σ² − 2µµ + µ² )
          = σ²    (36)

as required.
2.18 Differentiating (2.66) with respect to σ² gives

    (∂/∂σ²) ln p(t|x, w, σ²) = (1/(2σ⁴)) Σ_{n=1}^{N} { y(x_n, w) − t_n }² − (N/2)(1/σ²).    (37)

Setting the derivative to zero and rearranging then gives

    σ²_ML = (1/N) Σ_{n=1}^{N} { y(x_n, w_ML) − t_n }²    (38)

as required.
2.19 If we assume that the function y = f(x) is strictly monotonic, which is necessary to
exclude the possibility of spikes of infinite density in p(y), we are guaranteed that
the inverse function x = f⁻¹(y) exists. We can then use (2.71) to write

    p(y) = q(f⁻¹(y)) |df⁻¹/dy|.    (39)

Since the only restriction on f is that it is monotonic, it can distribute the probability
mass over x arbitrarily over y. This is illustrated in Figure 2.12 on page 44.
From (39) we see directly that

    |f′(x)| = q(x)/p(f(x)).

2.20 The Jacobian matrix for the transformation from (x₁, x₂) to (y₁, y₂) is defined by

    J = | ∂y₁/∂x₁  ∂y₁/∂x₂ |
        | ∂y₂/∂x₁  ∂y₂/∂x₂ |.    (40)

For the specific transformation defined by (2.78) and (2.79) we have

    ∂y₁/∂x₁ = 1 + 5 sech²(x₁)    (41)
    ∂y₁/∂x₂ = 0    (42)
    ∂y₂/∂x₁ = x₁²    (43)
    ∂y₂/∂x₂ = 1 + 5 sech²(x₂).    (44)

2.21 From the discussion of the introduction of Section 2.5, we have

    h(p²) = h(p) + h(p) = 2 h(p).

We then assume that h(p^k) = k h(p) for all k ≤ K. For k = K + 1 we have

    h(p^{K+1}) = h(p^K p) = h(p^K) + h(p) = K h(p) + h(p) = (K + 1) h(p).

Moreover,

    h(p^{n/m}) = n h(p^{1/m}) = (n/m) m h(p^{1/m}) = (n/m) h(p^{m/m}) = (n/m) h(p)

and so, by continuity, we have that h(p^x) = x h(p) for any real number x.

Now consider the positive real numbers p and q and the real number x such that
p = q^x. From the above discussion, we see that

    h(p)/ln(p) = h(q^x)/ln(q^x) = x h(q)/(x ln(q)) = h(q)/ln(q)

and hence h(p) ∝ ln(p).
2.22 We wish to maximize the entropy (2.86) subject to the constraint that the probabili-
ties sum to one, so that

    Σ_i p(x_i) = 1.    (45)

We introduce a Lagrange multiplier λ to enforce this constraint, and hence we maxi-
mize

    − Σ_i p(x_i) ln p(x_i) + λ ( Σ_i p(x_i) − 1 ).    (46)

Setting the derivative with respect to p(x_i) to zero gives

    − ln p(x_i) − 1 + λ = 0.    (47)

Solving for p(x_i) we obtain

    p(x_i) = exp(−1 + λ).    (48)

Since the right-hand side does not depend on i, this shows that the probabilities are
all equal. From (45) it then follows that p(x_i) = 1/M. Substituting this result into
(2.86) then shows that the value of the entropy at its maximum is equal to ln M.
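Both this maximizer and the resulting bound can be checked numerically (M and the non-uniform example below are arbitrary choices):

```python
import math

# Check of Solution 2.22: over M states, entropy is maximized by the
# uniform distribution, with maximum value ln M. M is arbitrary here.
M = 6

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

uniform = [1.0 / M] * M
print(entropy(uniform), math.log(M))  # equal

# Any non-uniform distribution has smaller entropy:
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
print(entropy(skewed) < math.log(M))  # True
```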

2.23 The entropy of an M-state discrete variable x can be written in the form

    H(x) = − Σ_{i=1}^{M} p(x_i) ln p(x_i) = Σ_{i=1}^{M} p(x_i) ln( 1/p(x_i) ).    (49)

The function ln(x) is concave and so we can apply Jensen's inequality in the form
(2.102), but with the inequality reversed, so that

    H(x) ≤ ln( Σ_{i=1}^{M} p(x_i) (1/p(x_i)) ) = ln M.    (50)

2.24 Obtaining the required functional derivative can be done simply by inspection. How-
ever, if a more formal approach is required, we can proceed as follows using the
techniques set out in Appendix B. Consider first the functional

    I[p(x)] = ∫ p(x) f(x) dx.

Under a small variation p(x) → p(x) + εη(x) we have

    I[p(x) + εη(x)] = ∫ p(x) f(x) dx + ε ∫ η(x) f(x) dx

and hence from (B.3) we deduce that the functional derivative is given by

    δI/δp(x) = f(x).

Similarly, if we define

    J[p(x)] = ∫ p(x) ln p(x) dx

then under a small variation p(x) → p(x) + εη(x) we have

    J[p(x) + εη(x)] = ∫ p(x) ln p(x) dx
                      + ε { ∫ η(x) ln p(x) dx + ∫ p(x) (1/p(x)) η(x) dx } + O(ε²)

and hence

    δJ/δp(x) = ln p(x) + 1.

Using these two results we obtain the following result for the functional derivative:

    − ln p(x) − 1 + λ₁ + λ₂ x + λ₃ (x − µ)².

Re-arranging then gives (2.97).

To eliminate the Lagrange multipliers we substitute (2.97) into each of the three
constraints (2.93), (2.94) and (2.95) in turn. The solution is most easily obtained by
comparison with the standard form of the Gaussian, and noting that the results

    λ₁ = 1 − (1/2) ln(2πσ²)    (51)
    λ₂ = 0    (52)
    λ₃ = −1/(2σ²)    (53)

do indeed satisfy the three constraints.

Note that there is a typographical error in the question, which should read "Use
calculus of variations to show that the stationary point of the functional shown just
before (1.108) is given by (1.108)".

For the multivariate version of this derivation, see Exercise 3.8.
2.25 Substituting the right-hand side of (2.98) into the argument of the logarithm on the
right-hand side of (2.91), we obtain

    H[x] = − ∫ p(x) ln p(x) dx
         = − ∫ p(x) { −(1/2) ln(2πσ²) − (x − µ)²/(2σ²) } dx
         = (1/2) { ln(2πσ²) + (1/σ²) ∫ p(x)(x − µ)² dx }
         = (1/2) { ln(2πσ²) + 1 },

where in the last step we used (2.95).
2.26 The Kullback-Leibler divergence takes the form

    KL(p‖q) = − ∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx.

Substituting the Gaussian for q(x) we obtain

    KL(p‖q) = − ∫ p(x) { −(1/2) ln|Σ| − (1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) } dx + const.
            = (1/2) { ln|Σ| + Tr( Σ⁻¹ E[(x − µ)(x − µ)ᵀ] ) } + const.
            = (1/2) { ln|Σ| + µᵀΣ⁻¹µ − 2µᵀΣ⁻¹E[x] + Tr( Σ⁻¹ E[xxᵀ] ) }
              + const.    (54)

Differentiating this w.r.t. µ, using results from Appendix A, and setting the result to
zero, we see that

    µ = E[x].    (55)

Similarly, differentiating (54) w.r.t. Σ⁻¹, again using results from Appendix A and
also making use of (55) and (2.48), we see that

    Σ = E[xxᵀ] − µµᵀ = cov[x].    (56)

2.27 From (2.100) we have

    KL(p‖q) = − ∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx.    (57)

Using (2.49) and (2.51)–(2.53), we can rewrite the first integral on the r.h.s. of (57)
as

    − ∫ p(x) ln q(x) dx = (1/2) ∫ N(x|µ, σ²) { ln(2πs²) + (x − m)²/s² } dx
      = (1/2) { ln(2πs²) + (1/s²) ∫ N(x|µ, σ²)(x² − 2xm + m²) dx }
      = (1/2) { ln(2πs²) + (σ² + µ² − 2µm + m²)/s² }.    (58)

The second integral on the r.h.s. of (57) we recognize from (2.91) as the negative
differential entropy of a Gaussian. Thus, from (57), (58) and (2.99), we have

    KL(p‖q) = (1/2) { ln(2πs²) + (σ² + µ² − 2µm + m²)/s² − 1 − ln(2πσ²) }
            = (1/2) { ln(s²/σ²) + (σ² + µ² − 2µm + m²)/s² − 1 }.
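This closed form can be checked against a direct numerical evaluation of the defining integral ∫ p ln(p/q) dx (the parameter values below are arbitrary choices):

```python
import math

# Numerical check of the closed-form result of Solution 2.27 for
# KL(N(mu, sigma^2) || N(m, s^2)). Parameter values are arbitrary.
mu, sigma = 0.5, 1.0
m, s = -0.3, 1.7

# Closed form; note sigma^2 + (mu - m)^2 = sigma^2 + mu^2 - 2 mu m + m^2
closed = 0.5 * (math.log(s**2 / sigma**2)
                + (sigma**2 + (mu - m) ** 2) / s**2 - 1.0)

def gauss(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

# Midpoint-rule evaluation of the integral of p ln(p/q)
lo, hi, n = -12.0, 12.0, 200_000
h = (hi - lo) / n
numeric = sum(
    gauss(x, mu, sigma) * math.log(gauss(x, mu, sigma) / gauss(x, m, s))
    for x in (lo + (i + 0.5) * h for i in range(n))
) * h

print(closed, numeric)  # agree to several decimal places
```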

2.28 First, let us set α = 1 − ε. Then

    p(x)^{(1+α)/2} = p(x)^{1−ε/2} = p(x) { 1 − (ε/2) ln p(x) + O(ε²) }.    (59)

Likewise, we have

    q(x)^{(1−α)/2} = q(x)^{ε/2} = 1 + (ε/2) ln q(x) + O(ε²).    (60)

We also have

    1 − α² = 1 − 1 + 2ε − ε² = 2ε + O(ε²).    (61)

Substituting these expressions into the alpha divergence defined by (2.129) we obtain

    D_α(p‖q) = (4/(2ε)) { 1 − ∫ p(x) ( 1 − (ε/2) ln p(x) )( 1 + (ε/2) ln q(x) ) dx } + O(ε)
             = − ∫ p(x) ln( q(x)/p(x) ) dx + O(ε)    (62)

where we have used

    ∫ p(x) dx = 1.    (63)

Taking the limit ε → 0 gives D₁(p‖q) = KL(p‖q) as required. We can similarly
consider α = −1 + ε, giving

    p(x)^{(1+α)/2} = p(x)^{ε/2} = 1 + (ε/2) ln p(x) + O(ε²)    (64)

together with

    q(x)^{(1−α)/2} = q(x)^{1−ε/2} = q(x) { 1 − (ε/2) ln q(x) + O(ε²) }.    (65)

We also have

    1 − α² = 1 − 1 + 2ε − ε² = 2ε + O(ε²).    (66)

Substituting these expressions into the alpha divergence defined by (2.129) we obtain

    D_α(p‖q) = (4/(2ε)) { 1 − ∫ q(x) ( 1 − (ε/2) ln q(x) )( 1 + (ε/2) ln p(x) ) dx } + O(ε)
             = − ∫ q(x) ln( p(x)/q(x) ) dx + O(ε)    (67)

where we have used

    ∫ q(x) dx = 1.    (68)

Taking the limit ε → 0 gives D₋₁(p‖q) = KL(q‖p) as required.


2.29 We first make use of the relation I(x; y) = H(y) − H(y|x), which we obtained in
(2.110), and note that the mutual information satisfies I(x; y) ≥ 0 since it is a form
of Kullback-Leibler divergence. Finally we make use of the relation (2.108) to obtain
the desired result (2.130).

To show that statistical independence is a sufficient condition for the equality to be
satisfied, we substitute p(x, y) = p(x)p(y) into the definition of the entropy, giving

    H(x, y) = − ∫∫ p(x, y) ln p(x, y) dx dy
            = − ∫∫ p(x)p(y) { ln p(x) + ln p(y) } dx dy
            = − ∫ p(x) ln p(x) dx − ∫ p(y) ln p(y) dy
            = H(x) + H(y).

To show that statistical independence is a necessary condition, we combine the equal-
ity condition

    H(x, y) = H(x) + H(y)

with the result (2.108) to give

    H(y|x) = H(y).

We now note that the right-hand side is independent of x and hence the left-hand side
must also be constant with respect to x. Using (2.110) it then follows that the mutual
information I(x; y) = 0. Finally, using (2.109) we see that the mutual information is
a form of KL divergence, and this vanishes only if the two distributions are equal, so
that p(x, y) = p(x)p(y) as required.
2.30 When we make a change of variables, the probability density is transformed by the
Jacobian of the change of variables. Thus we have

    p(x) = p(y) |∂y_i/∂x_j| = p(y)|A|    (69)

where |·| denotes the determinant. Then the entropy of y can be written

    H(y) = − ∫ p(y) ln p(y) dy = − ∫ p(x) ln( p(x)|A|⁻¹ ) dx = H(x) + ln|A|    (70)

as required.
2.31 The conditional entropy H(y|x) can be written

    H(y|x) = − Σ_i Σ_j p(y_i|x_j) p(x_j) ln p(y_i|x_j)    (71)

which equals 0 by assumption. Since the quantity −p(y_i|x_j) ln p(y_i|x_j) is non-
negative, each of these terms must vanish for any value x_j such that p(x_j) ≠ 0.
However, the quantity p ln p only vanishes for p = 0 or p = 1. Thus the quantities
p(y_i|x_j) are all either 0 or 1. However, they must also sum to 1, since this is a
normalized probability distribution, and so precisely one of the p(y_i|x_j) is 1, and
the rest are 0. Thus, for each value x_j there is a unique value y_i with non-zero
probability.
2.32 Consider (2.101) with λ = 0.5 and b = a + 2ε (and hence a = b − 2ε):

    0.5 f(a) + 0.5 f(b) ≥ f(0.5a + 0.5b)
                        = 0.5 f( 0.5a + 0.5(a + 2ε) ) + 0.5 f( 0.5(b − 2ε) + 0.5b )
                        = 0.5 f(a + ε) + 0.5 f(b − ε).

We can rewrite this as

    f(b) − f(b − ε) ≥ f(a + ε) − f(a).

We then divide both sides by ε and let ε → 0, giving

    f′(b) ≥ f′(a).

Since this holds at all points, it follows that f″(x) ≥ 0 everywhere.

To show the implication in the other direction, we make use of Taylor's theorem
(with the remainder in Lagrange form), according to which there exists an x⋆ such
that

    f(x) = f(x₀) + f′(x₀)(x − x₀) + (1/2) f″(x⋆)(x − x₀)².

Since we assume that f″(x) ≥ 0 everywhere, the third term on the r.h.s. will always
be non-negative and therefore

    f(x) ≥ f(x₀) + f′(x₀)(x − x₀).

Now let x₀ = λa + (1 − λ)b and consider setting x = a, which gives

    f(a) ≥ f(x₀) + f′(x₀)(a − x₀)
         = f(x₀) + f′(x₀)(1 − λ)(a − b).    (72)

Similarly, setting x = b gives

    f(b) ≥ f(x₀) + f′(x₀)λ(b − a).    (73)

Multiplying (72) by λ and (73) by (1 − λ) and adding up the results on both sides, we
obtain

    λ f(a) + (1 − λ) f(b) ≥ f(x₀) = f(λa + (1 − λ)b)

as required.
2.33 From (2.101) we know that the result (2.102) holds for M = 1. We now suppose that
it holds for some general value M and show that it must therefore hold for M + 1.
Consider the left-hand side of (2.102):

    f( Σ_{i=1}^{M+1} λ_i x_i ) = f( λ_{M+1} x_{M+1} + Σ_{i=1}^{M} λ_i x_i )    (74)
                               = f( λ_{M+1} x_{M+1} + (1 − λ_{M+1}) Σ_{i=1}^{M} η_i x_i )    (75)

where we have defined

    η_i = λ_i / (1 − λ_{M+1}).    (76)

We now apply (2.101) to give

    f( Σ_{i=1}^{M+1} λ_i x_i ) ≤ λ_{M+1} f(x_{M+1}) + (1 − λ_{M+1}) f( Σ_{i=1}^{M} η_i x_i ).    (77)

We now note that the quantities λ_i by definition satisfy

    Σ_{i=1}^{M+1} λ_i = 1    (78)

and hence we have

    Σ_{i=1}^{M} λ_i = 1 − λ_{M+1}.    (79)

Then using (76) we see that the quantities η_i satisfy the property

    Σ_{i=1}^{M} η_i = (1/(1 − λ_{M+1})) Σ_{i=1}^{M} λ_i = 1.    (80)

Thus we can apply the result (2.102) at order M, and so (77) becomes

    f( Σ_{i=1}^{M+1} λ_i x_i ) ≤ λ_{M+1} f(x_{M+1}) + (1 − λ_{M+1}) Σ_{i=1}^{M} η_i f(x_i)
                               = Σ_{i=1}^{M+1} λ_i f(x_i)    (81)

where we have made use of (76).

2.34 For a one-dimensional variable the KL divergence takes the form

    KL(p‖q) = − ∫ p(x) ln( q(x)/p(x) ) dx
            = − ∫ p(x) ln q(x) dx + const.    (82)

where the constant term is just the negative entropy of the fixed distribution p(x).
Substituting for p(x) in the first term using the empirical distribution (2.37), and
substituting for q(x) using the model distribution q(x|θ), gives

    KL(p‖q) = − ∫ (1/N) Σ_{n=1}^{N} δ(x − x_n) ln q(x|θ) dx + const.
            = − (1/N) Σ_{n=1}^{N} ln q(x_n|θ) + const.    (83)

which is the required negative log likelihood function up to an additive constant.


2.35 From (2.92), making use of (2.107), we have

    H[x, y] = − ∫∫ p(x, y) ln p(x, y) dx dy
            = − ∫∫ p(x, y) ln( p(y|x) p(x) ) dx dy
            = − ∫∫ p(x, y) { ln p(y|x) + ln p(x) } dx dy
            = − ∫∫ p(x, y) ln p(y|x) dx dy − ∫∫ p(x, y) ln p(x) dx dy
            = − ∫∫ p(x, y) ln p(y|x) dx dy − ∫ p(x) ln p(x) dx
            = H[y|x] + H[x].

2.36 We first evaluate the marginal and conditional probabilities p(x), p(y), p(x|y), and
p(y|x), to give the results shown in the tables below.

    p(x):  p(x = 0) = 2/3,   p(x = 1) = 1/3
    p(y):  p(y = 0) = 1/3,   p(y = 1) = 2/3

    p(x|y):             y = 0    y = 1
            x = 0         1       1/2
            x = 1         0       1/2

    p(y|x):             y = 0    y = 1
            x = 0        1/2      1/2
            x = 1         0        1

From these tables, together with the definitions

    H(x) = − Σ_i p(x_i) ln p(x_i)    (84)
    H(x|y) = − Σ_i Σ_j p(x_i, y_j) ln p(x_i|y_j)    (85)

and similar definitions for H(y) and H(y|x), we obtain the following results:

(a) H(x) = ln 3 − (2/3) ln 2
(b) H(y) = ln 3 − (2/3) ln 2
(c) H(y|x) = (2/3) ln 2
(d) H(x|y) = (2/3) ln 2
(e) H(x, y) = ln 3
(f) I(x; y) = ln 3 − (4/3) ln 2

where we have used (2.110) to evaluate the mutual information. The corresponding
diagram is shown in Figure 1.
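These entropies can be recomputed directly from the joint distribution, which (reconstructed from the stated marginals and conditionals) places probability 1/3 on each of (x, y) = (0, 0), (0, 1) and (1, 1):

```python
import math

# Recompute the entropies of Solution 2.36 from the joint distribution,
# reconstructed as mass 1/3 on (0, 0), (0, 1) and (1, 1).
p = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
py = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (0, 1)}

def H(dist):
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

Hx, Hy, Hxy = H(px), H(py), H(p)
H_y_given_x = Hxy - Hx   # via (2.108)
H_x_given_y = Hxy - Hy
I = Hx + Hy - Hxy        # mutual information

print(Hx, Hy)        # both ln 3 - (2/3) ln 2
print(H_y_given_x)   # (2/3) ln 2
print(Hxy)           # ln 3
print(I)             # ln 3 - (4/3) ln 2
```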

2.37 The arithmetic and geometric means are defined as

    x̄_A = (1/K) Σ_k x_k   and   x̄_G = ( Π_k x_k )^{1/K},

respectively. Taking the logarithm of x̄_A and x̄_G, we see that

    ln x̄_A = ln( (1/K) Σ_k x_k )   and   ln x̄_G = (1/K) Σ_k ln x_k.

By matching f with ln and λ_i with 1/K in (2.102), taking into account that the
logarithm is concave rather than convex and the inequality therefore goes the other
way, we obtain the desired result.

2.38 From the product rule of probability we have p(x, y) = p(y|x) p(x), and so (2.109)
can be written as

    I(x; y) = − ∫∫ p(x, y) ln p(y) dx dy + ∫∫ p(x, y) ln p(y|x) dx dy
            = − ∫ p(y) ln p(y) dy + ∫∫ p(x, y) ln p(y|x) dx dy
            = H(y) − H(y|x).    (86)

[Figure 1: Diagram showing the relationship between marginal, conditional and joint
entropy and the mutual information.]

2.39 If z₁ and z₂ are independent, then

    cov[z₁, z₂] = ∫∫ (z₁ − z̄₁)(z₂ − z̄₂) p(z₁, z₂) dz₁ dz₂
                = ∫∫ (z₁ − z̄₁)(z₂ − z̄₂) p(z₁) p(z₂) dz₁ dz₂
                = ∫ (z₁ − z̄₁) p(z₁) dz₁ ∫ (z₂ − z̄₂) p(z₂) dz₂
                = 0,

where

    z̄_i = E[z_i] = ∫ z_i p(z_i) dz_i.

For y₂ we have

    p(y₂|y₁) = δ(y₂ − y₁²),

i.e., a spike of probability mass one at y₁², which is clearly dependent on y₁. With ȳ_i
defined analogously to z̄_i above, we get

    cov[y₁, y₂] = ∫∫ (y₁ − ȳ₁)(y₂ − ȳ₂) p(y₁, y₂) dy₁ dy₂
                = ∫∫ y₁ (y₂ − ȳ₂) p(y₂|y₁) p(y₁) dy₁ dy₂
                = ∫ (y₁³ − y₁ ȳ₂) p(y₁) dy₁
                = 0,

where we have used the fact that all odd moments of y₁ will be zero, since it is
symmetric around zero.
2.40 [The original printing of Deep Learning: Foundations and Concepts has a typo in
this exercise in which the word 'convex' in the first sentence should read 'concave'.
Note, however, that the exercise can equally well be solved with 'convex' by follow-
ing the same reasoning as given here.] We introduce a binary variable C such that
C = 1 denotes that the coin lands concave side up, and a binary variable Q for which
Q = 1 denotes that the concave side of the coin is heads. From the stated physical
properties of the coin, and from the assumed prior probability that the concave side
is heads, we have

    p(C = 1) = 0.6    (87)
    p(Q = 1) = 0.1.    (88)

The data set D = {x₁, ..., x₁₀} consists of 10 observations x, each of which takes
the value H for heads or T for tails. We assume that, conditioned on Q, the data
points are independent and identically distributed, so the ordering does not matter.
A particular coin flip can land heads up either because it lands concave side up and
the concave side is heads or because it lands convex side up and the concave side is
tails. Thus the probability of landing heads is given by

    p(x = H) = p(C = 1)p(Q = 1) + p(C = 0)p(Q = 0)
             = 0.6 × 0.1 + 0.4 × 0.9
             = 0.06 + 0.36
             = 0.42    (89)

from which it follows that p(x = T) = 0.58. Conditioned on Q, a flip lands heads up
with probability p(x = H|Q = 1) = p(C = 1) = 0.6 and p(x = H|Q = 0) = p(C = 0)
= 0.4, and so the probability of observing 8 heads and 2 tails is

    p(D|Q = 1) = C(10, 8) × (0.6)⁸ × (0.4)²    (90)
    p(D|Q = 0) = C(10, 8) × (0.4)⁸ × (0.6)²    (91)

where the binomial coefficient is given by

    C(N, K) = N! / ( K!(N − K)! ).    (92)

The posterior probability that the concave side is heads is then given by Bayes'
theorem, in which p(D) is obtained by summing over the two hypotheses for Q:

    p(Q = 1|D) = p(D|Q = 1)p(Q = 1) / { p(D|Q = 1)p(Q = 1) + p(D|Q = 0)p(Q = 0) }.    (93)

Substituting, and noting that the binomial coefficients cancel, we obtain

    p(Q = 1|D) = (0.6)⁸(0.4)² × 0.1 / { (0.6)⁸(0.4)² × 0.1 + (0.4)⁸(0.6)² × 0.9 }
               ≈ 0.559    (94)

and so we see this is a higher probability than the prior of 0.1, which is intuitively
reasonable since we saw a larger number of heads compared to tails in the data set.
The probability that the next flip will land heads up is then given by

    p(H|D) = p(H|Q = 1, D)p(Q = 1|D) + p(H|Q = 0, D)p(Q = 0|D)
           = p(H|Q = 1)p(Q = 1|D) + p(H|Q = 0)p(Q = 0|D)
           ≈ 0.6 × 0.559 + 0.4 × (1 − 0.559) ≈ 0.512    (95)

which we see is higher than the probability 0.42 of heads before we observed the
data. Again this is intuitively reasonable given the predominance of heads in the
data set.
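As a numerical check, the sketch below evaluates the posterior by summing over the two hypotheses for Q, under each of which the flips are i.i.d.:

```python
from math import comb

# Numerical check of the coin problem. Given Q, the flips are i.i.d.
# with p(H | Q = 1) = 0.6 and p(H | Q = 0) = 0.4.
p_Q1 = 0.1   # prior that the concave side is heads
p_C1 = 0.6   # probability the coin lands concave side up

# Marginal probability of heads on a single flip
p_H = p_C1 * p_Q1 + (1 - p_C1) * (1 - p_Q1)
print(p_H)  # 0.42

# Likelihoods of 8 heads and 2 tails under each hypothesis
like_Q1 = comb(10, 8) * 0.6**8 * 0.4**2
like_Q0 = comb(10, 8) * 0.4**8 * 0.6**2

# Posterior over Q, normalizing across both hypotheses
post_Q1 = like_Q1 * p_Q1 / (like_Q1 * p_Q1 + like_Q0 * (1 - p_Q1))
print(post_Q1)  # ≈ 0.559

# Predictive probability that the next flip lands heads
p_H_next = 0.6 * post_Q1 + 0.4 * (1 - post_Q1)
print(p_H_next)  # ≈ 0.512
```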

2.41 If we substitute (2.115) into (2.114), we obtain (2.116). We now use (2.66) for the log
likelihood of the linear regression model, and note that ln p(t|x, w, σ²) corresponds
to ln p(D|w). We also note that Σ_i w_i² = wᵀw. Hence we obtain (2.117) for the
regularized error function.

Chapter 3 Standard Distributions

3.1 From the definition (3.2) of the Bernoulli distribution we have

    Σ_{x∈{0,1}} p(x|µ) = p(x = 0|µ) + p(x = 1|µ) = (1 − µ) + µ = 1

    Σ_{x∈{0,1}} x p(x|µ) = 0 × p(x = 0|µ) + 1 × p(x = 1|µ) = µ

    Σ_{x∈{0,1}} (x − µ)² p(x|µ) = µ² p(x = 0|µ) + (1 − µ)² p(x = 1|µ)
                                = µ²(1 − µ) + (1 − µ)²µ = µ(1 − µ).

The entropy is given by

    H[x] = − Σ_{x∈{0,1}} p(x|µ) ln p(x|µ)
         = − Σ_{x∈{0,1}} µˣ(1 − µ)^{1−x} { x ln µ + (1 − x) ln(1 − µ) }
         = −(1 − µ) ln(1 − µ) − µ ln µ.
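These results can be confirmed numerically for an arbitrary value of µ (written mu below):

```python
import math

# Check of the Bernoulli results in Solution 3.1; mu is arbitrary.
mu = 0.3
probs = {0: 1 - mu, 1: mu}

norm = sum(probs.values())
mean = sum(x * p for x, p in probs.items())
var = sum((x - mean) ** 2 * p for x, p in probs.items())
H = -sum(p * math.log(p) for p in probs.values())

print(norm, mean)  # 1.0 and mu
print(var)         # mu (1 - mu) = 0.21
print(H)           # -(1 - mu) ln(1 - mu) - mu ln mu
```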

3.2 The normalization of (3.195) follows from

    p(x = +1|µ) + p(x = −1|µ) = (1 + µ)/2 + (1 − µ)/2 = 1.

The mean is given by

    E[x] = (1 + µ)/2 − (1 − µ)/2 = µ.

To evaluate the variance we use

    E[x²] = (1 − µ)/2 + (1 + µ)/2 = 1

from which we have

    var[x] = E[x²] − E[x]² = 1 − µ².

Finally the entropy is given by

    H[x] = − Σ_{x=−1}^{+1} p(x|µ) ln p(x|µ)
         = − ((1 − µ)/2) ln((1 − µ)/2) − ((1 + µ)/2) ln((1 + µ)/2).
3.3 Using the definition (3.10) we have
\[
\binom{N}{n} + \binom{N}{n-1} = \frac{N!}{n!(N-n)!} + \frac{N!}{(n-1)!(N+1-n)!}
= \frac{(N+1-n)N! + nN!}{n!(N+1-n)!} = \frac{(N+1)!}{n!(N+1-n)!} = \binom{N+1}{n}. \tag{96}
\]
To prove the binomial theorem (3.197) we note that the theorem is trivially true for N = 0. We now assume that it holds for some general value N and prove its correctness for N + 1, which can be done as follows
\begin{align*}
(1+x)^{N+1} &= (1+x)\sum_{n=0}^{N}\binom{N}{n}x^n \\
&= \sum_{n=0}^{N}\binom{N}{n}x^n + \sum_{n=1}^{N+1}\binom{N}{n-1}x^n \\
&= \binom{N}{0}x^0 + \sum_{n=1}^{N}\left\{\binom{N}{n} + \binom{N}{n-1}\right\}x^n + \binom{N}{N}x^{N+1} \\
&= \binom{N+1}{0}x^0 + \sum_{n=1}^{N}\binom{N+1}{n}x^n + \binom{N+1}{N+1}x^{N+1} \\
&= \sum_{n=0}^{N+1}\binom{N+1}{n}x^n \tag{97}
\end{align*}
which completes the inductive proof. Finally, using the binomial theorem, the normalization condition (3.198) for the binomial distribution gives
\[
\sum_{n=0}^{N}\binom{N}{n}\mu^n(1-\mu)^{N-n} = (1-\mu)^N\sum_{n=0}^{N}\binom{N}{n}\left(\frac{\mu}{1-\mu}\right)^n
= (1-\mu)^N\left(1 + \frac{\mu}{1-\mu}\right)^N = 1 \tag{98}
\]
as required.
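Both Pascal's rule (96) and the normalization (98) are easy to confirm numerically; the following sketch (illustrative values only, not part of the printed solution) uses Python's math.comb:

```python
import math

# Pascal's rule (96)
N, n = 10, 4
assert math.comb(N, n) + math.comb(N, n - 1) == math.comb(N + 1, n)

# Binomial theorem (97)
x = 0.7
lhs = (1 + x) ** N
rhs = sum(math.comb(N, k) * x ** k for k in range(N + 1))
assert abs(lhs - rhs) < 1e-9

# Normalization (98) of the binomial distribution
mu = 0.35
norm = sum(math.comb(N, k) * mu**k * (1 - mu)**(N - k) for k in range(N + 1))
assert abs(norm - 1.0) < 1e-12
```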
3.4 Differentiating (3.198) with respect to µ we obtain
\[
\sum_{n=1}^{N}\binom{N}{n}\mu^n(1-\mu)^{N-n}\left[\frac{n}{\mu} - \frac{(N-n)}{(1-\mu)}\right] = 0.
\]
Multiplying through by µ(1 − µ) and re-arranging we obtain (3.11).
If we differentiate (3.198) twice with respect to µ we obtain
\[
\sum_{n=1}^{N}\binom{N}{n}\mu^n(1-\mu)^{N-n}\left\{\left[\frac{n}{\mu} - \frac{(N-n)}{(1-\mu)}\right]^2 - \frac{n}{\mu^2} - \frac{(N-n)}{(1-\mu)^2}\right\} = 0.
\]
We now multiply through by µ²(1 − µ)² and re-arrange, making use of the result (3.11) for the mean of the binomial distribution, to obtain
\[
E[n^2] = N\mu(1-\mu) + N^2\mu^2.
\]
Finally, we use (2.46) to obtain the result (3.12) for the variance.
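The moments (3.11) and (3.12) can be checked by summing over the binomial distribution directly; the sketch below (arbitrary N and µ, an illustrative check only) does this:

```python
import math

N, mu = 12, 0.3
probs = [math.comb(N, n) * mu**n * (1 - mu)**(N - n) for n in range(N + 1)]

E_n = sum(n * p for n, p in enumerate(probs))        # mean (3.11)
E_n2 = sum(n * n * p for n, p in enumerate(probs))   # second moment
var = E_n2 - E_n ** 2                                # variance (3.12)

assert abs(E_n - N * mu) < 1e-9
assert abs(E_n2 - (N * mu * (1 - mu) + (N * mu) ** 2)) < 1e-9
assert abs(var - N * mu * (1 - mu)) < 1e-9
```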

3.5 We differentiate (3.26) with respect to x to obtain
\[
\frac{\partial}{\partial x} N(x|\mu,\Sigma) = -\frac{1}{2} N(x|\mu,\Sigma)\,\nabla_x\left\{(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}
= -N(x|\mu,\Sigma)\,\Sigma^{-1}(x-\mu),
\]
where we have used (A.19), (A.20), and the fact that Σ⁻¹ is symmetric. Setting this derivative equal to 0, and left-multiplying by Σ, leads to the solution x = µ.
3.6 First, we define y = Ax + b, and we assume that y has the same dimensionality as x, so that A is a square matrix. We also assume that A is symmetric and has an inverse. It follows that x = A⁻¹(y − b). From the sum and product rules of probability we have
\begin{align*}
p(y) &= \int p(y|x)p(x)\,dx \\
&= \int \delta(y - Ax - b)\,p(x)\,dx \\
&\propto \int \delta\left(x - A^{-1}(y-b)\right)p(x)\,dx \tag{99}
\end{align*}
where δ(·) is the Dirac delta function. Substituting for p(x) using the Gaussian distribution we have
\begin{align*}
p(y) &\propto \int \delta\left(x - A^{-1}(y-b)\right) N(x|\mu,\Sigma)\,dx \\
&\propto \int \delta\left(x - A^{-1}(y-b)\right)\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}dx \\
&\propto \exp\left\{-\frac{1}{2}\left(A^{-1}(y-b)-\mu\right)^T\Sigma^{-1}\left(A^{-1}(y-b)-\mu\right)\right\} \\
&\propto \exp\left\{-\frac{1}{2}(y-b-A\mu)^T A^{-1}\Sigma^{-1}A^{-1}(y-b-A\mu)\right\} \tag{100}
\end{align*}
where we have used the property of a symmetric matrix that its inverse is also symmetric. By inspection we see that p(y) is a Gaussian distribution and that its mean is Aµ + b and that its covariance is
\[
\left(A^{-1}\Sigma^{-1}A^{-1}\right)^{-1} = A\Sigma A. \tag{101}
\]
3.7 From (2.100) we have
\[
\mathrm{KL}(q(x)\|p(x)) = -\int q(x)\ln p(x)\,dx + \int q(x)\ln q(x)\,dx. \tag{102}
\]
Using (3.26), (3.40), (3.42) and (3.46), we can rewrite the first integral on the r.h.s. of (102) as
\begin{align*}
-\int q(x)\ln p(x)\,dx
&= \frac{1}{2}\int N(x|\mu_q,\Sigma_q)\left[D\ln(2\pi) + \ln|\Sigma_p| + (x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p)\right]dx \\
&= \frac{1}{2}\Big[D\ln(2\pi) + \ln|\Sigma_p| + \mathrm{Tr}\{\Sigma_p^{-1}(\mu_q\mu_q^T + \Sigma_q)\} \\
&\qquad\quad - \mu_p^T\Sigma_p^{-1}\mu_q - \mu_q^T\Sigma_p^{-1}\mu_p + \mu_p^T\Sigma_p^{-1}\mu_p\Big]. \tag{103}
\end{align*}
The second integral on the r.h.s. of (102) we recognize from (2.92) as the negative differential entropy of a multivariate Gaussian. Thus, from (102), (103) and (3.204), we have
\[
\mathrm{KL}(q(x)\|p(x)) = \frac{1}{2}\left[\ln\frac{|\Sigma_p|}{|\Sigma_q|} - D + \mathrm{Tr}\left\{\Sigma_p^{-1}\Sigma_q\right\} + (\mu_p-\mu_q)^T\Sigma_p^{-1}(\mu_p-\mu_q)\right]. \tag{104}
\]
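Specializing (104) to D = 1 gives KL = ½[ln(σp²/σq²) − 1 + σq²/σp² + (µp − µq)²/σp²], which can be compared against a brute-force numerical integration of (102). The Python sketch below (arbitrary parameter values, an illustrative check rather than part of the solution) does exactly this:

```python
import math

def gauss(x, mu, s2):
    """Univariate Gaussian density N(x|mu, s2)."""
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

mq, s2q = 0.5, 1.2    # q(x)
mp, s2p = -0.3, 2.0   # p(x)

# Riemann-sum approximation of KL(q||p) = integral of q ln(q/p)
dx = 1e-3
kl_num = sum(
    gauss(x, mq, s2q) * math.log(gauss(x, mq, s2q) / gauss(x, mp, s2p)) * dx
    for x in (i * dx - 20.0 for i in range(40001))
)

# Closed form (104) specialized to D = 1
kl_closed = 0.5 * (math.log(s2p / s2q) - 1 + s2q / s2p + (mp - mq) ** 2 / s2p)

assert abs(kl_num - kl_closed) < 1e-4
```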

3.8 We can make use of Lagrange multipliers to enforce the constraints on the maximum entropy solution. Note that we need a single Lagrange multiplier for the normalization constraint (3.201), a D-dimensional vector m of Lagrange multipliers for the D constraints given by (3.202), and a D × D matrix L of Lagrange multipliers to enforce the D² constraints represented by (3.203). Thus we maximize
\begin{align*}
-\int p(x)\ln p(x)\,dx &+ \lambda\left(\int p(x)\,dx - 1\right) + m^T\left(\int p(x)\,x\,dx - \mu\right) \\
&+ \mathrm{Tr}\left\{L\left(\int p(x)(x-\mu)(x-\mu)^T dx - \Sigma\right)\right\}. \tag{105}
\end{align*}
By functional differentiation (Appendix B) the maximum of this functional with respect to p(x) occurs when
\[
0 = -1 - \ln p(x) + \lambda + m^T x + \mathrm{Tr}\{L(x-\mu)(x-\mu)^T\}.
\]
Solving for p(x) we obtain
\[
p(x) = \exp\left\{\lambda - 1 + m^T x + (x-\mu)^T L(x-\mu)\right\}. \tag{106}
\]
We now find the values of the Lagrange multipliers by applying the constraints. First we complete the square inside the exponential, which becomes
\[
\lambda - 1 + \left(x - \mu + \frac{1}{2}L^{-1}m\right)^T L\left(x - \mu + \frac{1}{2}L^{-1}m\right) + \mu^T m - \frac{1}{4}m^T L^{-1}m.
\]
We now make the change of variable
\[
y = x - \mu + \frac{1}{2}L^{-1}m.
\]
The constraint (3.202) then becomes
\[
\int \exp\left\{\lambda - 1 + y^T L y + \mu^T m - \frac{1}{4}m^T L^{-1}m\right\}\left(y + \mu - \frac{1}{2}L^{-1}m\right)dy = \mu.
\]
In the final parentheses, the term in y vanishes by symmetry, while the term in µ simply integrates to µ by virtue of the normalization constraint (3.201) which now takes the form
\[
\int \exp\left\{\lambda - 1 + y^T L y + \mu^T m - \frac{1}{4}m^T L^{-1}m\right\}dy = 1,
\]
and hence we have
\[
-\frac{1}{2}L^{-1}m = 0
\]
where again we have made use of the constraint (3.201). Thus m = 0 and so the density becomes
\[
p(x) = \exp\left\{\lambda - 1 + (x-\mu)^T L(x-\mu)\right\}.
\]
Substituting this into the final constraint (3.203), and making the change of variable x − µ = z, we obtain
\[
\int \exp\left\{\lambda - 1 + z^T L z\right\}z z^T dz = \Sigma.
\]
Applying an analogous argument to that used to derive (3.48) we obtain L = −½Σ⁻¹.
Finally, the value of λ is simply that value needed to ensure that the Gaussian distribution is correctly normalized, as derived in Section 3.2, and hence is given by
\[
\lambda - 1 = \ln\left\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\right\}.
\]
3.9 From the definitions of the multivariate differential entropy (2.92) and the multivariate Gaussian distribution (3.26), we get
\begin{align*}
H[x] &= -\int N(x|\mu,\Sigma)\ln N(x|\mu,\Sigma)\,dx \\
&= \frac{1}{2}\int N(x|\mu,\Sigma)\left[D\ln(2\pi) + \ln|\Sigma| + (x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx \\
&= \frac{1}{2}\left[D\ln(2\pi) + \ln|\Sigma| + \mathrm{Tr}\left\{\Sigma^{-1}\Sigma\right\}\right] \\
&= \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\left(1 + \ln(2\pi)\right).
\end{align*}
3.10 We have p(x₁) = N(x₁|µ₁, τ₁⁻¹) and p(x₂) = N(x₂|µ₂, τ₂⁻¹). Since x = x₁ + x₂ we also have p(x|x₂) = N(x|µ₁ + x₂, τ₁⁻¹). We now evaluate the convolution integral given by (3.205) which takes the form
\[
p(x) = \left(\frac{\tau_1}{2\pi}\right)^{1/2}\left(\frac{\tau_2}{2\pi}\right)^{1/2}\int_{-\infty}^{\infty}\exp\left\{-\frac{\tau_1}{2}(x-\mu_1-x_2)^2 - \frac{\tau_2}{2}(x_2-\mu_2)^2\right\}dx_2. \tag{107}
\]
Since the final result will be a Gaussian distribution for p(x) we need only evaluate its precision, since, from (2.99), the entropy is determined by the variance or equivalently the precision, and is independent of the mean. This allows us to simplify the calculation by ignoring such things as normalization constants.
We begin by considering the terms in the exponent of (107) which depend on x₂, which are given by
\begin{align*}
&-\frac{1}{2}x_2^2(\tau_1+\tau_2) + x_2\{\tau_1(x-\mu_1) + \tau_2\mu_2\} \\
&\qquad= -\frac{1}{2}(\tau_1+\tau_2)\left[x_2 - \frac{\tau_1(x-\mu_1)+\tau_2\mu_2}{\tau_1+\tau_2}\right]^2 + \frac{\{\tau_1(x-\mu_1)+\tau_2\mu_2\}^2}{2(\tau_1+\tau_2)}
\end{align*}
where we have completed the square over x₂. When we integrate out x₂, the first term on the right hand side will simply give rise to a constant factor independent of x. The second term, when expanded out, will involve a term in x². Since the precision of x is given directly in terms of the coefficient of x² in the exponent, it is only such terms that we need to consider. There is one other term in x² arising from the original exponent in (107). Combining these we have
\[
-\frac{\tau_1}{2}x^2 + \frac{\tau_1^2}{2(\tau_1+\tau_2)}x^2 = -\frac{1}{2}\frac{\tau_1\tau_2}{\tau_1+\tau_2}x^2
\]
from which we see that x has precision τ₁τ₂/(τ₁ + τ₂).
We can also obtain this result for the precision directly by appealing to the general result (3.99) for the convolution of two linear-Gaussian distributions.
The entropy of x is then given, from (2.99), by
\[
H[x] = \frac{1}{2}\ln\left\{\frac{2\pi(\tau_1+\tau_2)}{\tau_1\tau_2}\right\}.
\]
3.11 We can use an analogous argument to that used in the solution of Exercise ??. Consider a general square matrix Λ with elements Λᵢⱼ. Then we can always write Λ = Λᴬ + Λˢ where
\[
\Lambda^S_{ij} = \frac{\Lambda_{ij}+\Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij}-\Lambda_{ji}}{2} \tag{108}
\]
and it is easily verified that Λˢ is symmetric so that Λˢᵢⱼ = Λˢⱼᵢ, and Λᴬ is antisymmetric so that Λᴬᵢⱼ = −Λᴬⱼᵢ. The quadratic form in the exponent of a D-dimensional multivariate Gaussian distribution can be written
\[
\frac{1}{2}\sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\Lambda_{ij}(x_j-\mu_j) \tag{109}
\]
where Λ = Σ⁻¹ is the precision matrix. When we substitute Λ = Λᴬ + Λˢ into (109) we see that the term involving Λᴬ vanishes since for every positive term there is an equal and opposite negative term. Thus we can always take Λ to be symmetric.
3.12 We start by pre-multiplying both sides of (3.28) by uᵢ†, the conjugate transpose of uᵢ. This gives us
\[
u_i^{\dagger}\Sigma u_i = \lambda_i u_i^{\dagger}u_i. \tag{110}
\]
Next consider the conjugate transpose of (3.28) and post-multiply it by uᵢ, which gives us
\[
u_i^{\dagger}\Sigma^{\dagger}u_i = \lambda_i^{*}u_i^{\dagger}u_i, \tag{111}
\]
where λᵢ* is the complex conjugate of λᵢ. We now subtract (110) from (111) and use the fact that Σ is real and symmetric, and hence Σ = Σ†, to get
\[
0 = (\lambda_i^{*} - \lambda_i)u_i^{\dagger}u_i.
\]
Hence λᵢ* = λᵢ and so λᵢ must be real.
Now consider
\[
u_i^T u_j\lambda_j = u_i^T\Sigma u_j = u_i^T\Sigma^T u_j = (\Sigma u_i)^T u_j = \lambda_i u_i^T u_j,
\]
where we have used (3.28) and the fact that Σ is symmetric. If we assume that 0 ≠ λᵢ ≠ λⱼ ≠ 0, the only solution to this equation is that uᵢᵀuⱼ = 0, i.e., that uᵢ and uⱼ are orthogonal.
If 0 ≠ λᵢ = λⱼ ≠ 0, any linear combination of uᵢ and uⱼ will be an eigenvector with eigenvalue λ = λᵢ = λⱼ, since, from (3.28),
\[
\Sigma(a u_i + b u_j) = a\lambda_i u_i + b\lambda_j u_j = \lambda(a u_i + b u_j).
\]
Assuming that uᵢ ≠ uⱼ, we can construct
\begin{align*}
u_{\alpha} &= a u_i + b u_j \\
u_{\beta} &= c u_i + d u_j
\end{align*}
such that u_α and u_β are mutually orthogonal and of unit length. Since uᵢ and uⱼ are orthogonal to u_k (k ≠ i, k ≠ j), so are u_α and u_β. Thus, u_α and u_β satisfy (3.29).
Finally, if λᵢ = 0, Σ must be singular, with uᵢ lying in the nullspace of Σ. In this case, uᵢ will be orthogonal to the eigenvectors projecting onto the rowspace of Σ and we can choose ‖uᵢ‖ = 1, so that (3.29) is satisfied. If more than one eigenvalue equals zero, we can choose the corresponding eigenvectors arbitrarily, as long as they remain in the nullspace of Σ, and so we can choose them to satisfy (3.29).
3.13 We can write the r.h.s. of (3.31) in matrix form as
\[
\sum_{i=1}^{D}\lambda_i u_i u_i^T = U\Lambda U^T = M,
\]
where U is a D × D matrix with the eigenvectors u₁, . . . , u_D as its columns and Λ is a diagonal matrix with the eigenvalues λ₁, . . . , λ_D along its diagonal.
Thus we have
\[
U^T M U = U^T U\Lambda U^T U = \Lambda.
\]
However, from (3.28)–(3.30), we also have that
\[
U^T\Sigma U = U^T U\Lambda = \Lambda,
\]
and so M = Σ and (3.31) holds.
Moreover, since U is orthonormal, U⁻¹ = Uᵀ and so
\[
\Sigma^{-1} = \left(U\Lambda U^T\right)^{-1} = \left(U^T\right)^{-1}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^T = \sum_{i=1}^{D}\frac{1}{\lambda_i}u_i u_i^T.
\]
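The spectral decomposition (3.31) and the corresponding expansion of Σ⁻¹ can be checked on a small example. The sketch below (an illustrative check, not part of the printed solution) uses the symmetric matrix [[2, 1], [1, 2]], whose eigenpairs are known in closed form:

```python
import math

# Eigenvalues 3 and 1, with eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2)
lam = [3.0, 1.0]
s = 1.0 / math.sqrt(2.0)
u = [[s, s], [s, -s]]

def outer_sum(coeffs):
    """Return sum_i coeffs[i] * u_i u_i^T as a 2x2 matrix."""
    M = [[0.0, 0.0], [0.0, 0.0]]
    for c, vec in zip(coeffs, u):
        for i in range(2):
            for j in range(2):
                M[i][j] += c * vec[i] * vec[j]
    return M

Sigma = outer_sum(lam)                          # reconstruction (3.31)
Sigma_inv = outer_sum([1.0 / l for l in lam])   # expansion of the inverse

target = [[2.0, 1.0], [1.0, 2.0]]
assert all(abs(Sigma[i][j] - target[i][j]) < 1e-12 for i in range(2) for j in range(2))

# Sigma times Sigma_inv should be the identity
prod = [[sum(Sigma[i][k] * Sigma_inv[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
assert all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))
```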

3.14 Since u₁, . . . , u_D constitute a basis for ℝᴰ, we can write
\[
a = \hat{a}_1 u_1 + \hat{a}_2 u_2 + \ldots + \hat{a}_D u_D,
\]
where â₁, . . . , â_D are coefficients obtained by projecting a on u₁, . . . , u_D. Note that they typically do not equal the elements of a.
Using this we can write
\[
a^T\Sigma a = \left(\hat{a}_1 u_1^T + \ldots + \hat{a}_D u_D^T\right)\Sigma\left(\hat{a}_1 u_1 + \ldots + \hat{a}_D u_D\right)
\]
and combining this result with (3.28) we get
\[
\left(\hat{a}_1 u_1^T + \ldots + \hat{a}_D u_D^T\right)\left(\hat{a}_1\lambda_1 u_1 + \ldots + \hat{a}_D\lambda_D u_D\right).
\]
Now, since uᵢᵀuⱼ = 1 only if i = j, and 0 otherwise, this becomes
\[
\hat{a}_1^2\lambda_1 + \ldots + \hat{a}_D^2\lambda_D
\]
and since a is real, we see that this expression will be strictly positive for any non-zero a if all eigenvalues are strictly positive. It is also clear that if an eigenvalue, λᵢ, is zero or negative, there exists a vector a (e.g. a = uᵢ) for which this expression will be less than or equal to zero. Thus, having all eigenvalues strictly positive is a necessary and sufficient condition for the matrix to be positive definite.
3.15 A D × D matrix has D² elements. If it is symmetric then the elements not on the leading diagonal form pairs of equal value. There are D elements on the diagonal so the number of elements not on the diagonal is D² − D, and only half of these are independent, giving
\[
\frac{D^2 - D}{2}.
\]
If we now add back the D elements on the diagonal we get
\[
\frac{D^2 - D}{2} + D = \frac{D(D+1)}{2}.
\]
3.16 Consider a matrix M which is symmetric, so that Mᵀ = M. The inverse matrix M⁻¹ satisfies
\[
M M^{-1} = I.
\]
Taking the transpose of both sides of this equation, and using the relation (A.1), we obtain
\[
\left(M^{-1}\right)^T M^T = I^T = I
\]
since the identity matrix is symmetric. Making use of the symmetry condition for M we then have
\[
\left(M^{-1}\right)^T M = I
\]
and hence, from the definition of the matrix inverse,
\[
\left(M^{-1}\right)^T = M^{-1}
\]
and so M⁻¹ is also a symmetric matrix.
3.17 Recall that the transformation (3.34) diagonalizes the coordinate system and that the quadratic form (3.27), corresponding to the square of the Mahalanobis distance, is then given by (3.33). This corresponds to a shift in the origin of the coordinate system and a rotation so that the hyper-ellipsoidal contours along which the Mahalanobis distance is constant become axis aligned. The volume contained within any one such contour is unchanged by shifts and rotations. We now make the further transformation yᵢ = λᵢ^{1/2} zᵢ for i = 1, . . . , D. The volume within the hyper-ellipsoid then becomes
\[
\int\prod_{i=1}^{D}dy_i = \prod_{i=1}^{D}\lambda_i^{1/2}\int\prod_{i=1}^{D}dz_i = |\Sigma|^{1/2}V_D\Delta^D
\]
where we have used the property that the determinant of Σ is given by the product of its eigenvalues, together with the fact that in the z coordinates the volume has become a sphere of radius ∆ whose volume is V_D∆ᴰ.
3.18 Multiplying the left hand side of (3.60) by the matrix (3.208) trivially gives the identity matrix. On the right hand side consider the four blocks of the resulting partitioned matrix. Upper left:
\[
AM - BD^{-1}CM = (A - BD^{-1}C)(A - BD^{-1}C)^{-1} = I.
\]
Upper right:
\begin{align*}
-AMBD^{-1} &+ BD^{-1} + BD^{-1}CMBD^{-1} \\
&= -(A - BD^{-1}C)(A - BD^{-1}C)^{-1}BD^{-1} + BD^{-1} \\
&= -BD^{-1} + BD^{-1} = 0.
\end{align*}
Lower left:
\[
CM - DD^{-1}CM = CM - CM = 0.
\]
Lower right:
\[
-CMBD^{-1} + DD^{-1} + DD^{-1}CMBD^{-1} = DD^{-1} = I.
\]
Thus the right hand side also equals the identity matrix.
3.19 We first of all take the joint distribution p(x_a, x_b, x_c) and marginalize to obtain the distribution p(x_a, x_b). Using the results of Section 3.2.5 this is again a Gaussian distribution with mean and covariance given by
\[
\mu = \begin{pmatrix}\mu_a \\ \mu_b\end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}.
\]
From Section 3.2.4 the conditional distribution p(x_a|x_b) is then Gaussian with mean and covariance given by (3.65) and (3.66) respectively.
3.20 Multiplying the left hand side of (3.210) by (A + BCD) trivially gives the identity matrix I. On the right hand side we obtain
\begin{align*}
(A + BCD)&\left(A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1}\right) \\
&= I + BCDA^{-1} - B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} - BCDA^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} \\
&= I + BCDA^{-1} - BC(C^{-1} + DA^{-1}B)(C^{-1} + DA^{-1}B)^{-1}DA^{-1} \\
&= I + BCDA^{-1} - BCDA^{-1} = I.
\end{align*}
3.21 From y = x + z we have trivially that E[y] = E[x] + E[z]. For the covariance we have
\begin{align*}
\mathrm{cov}[y] &= E\left[(x - E[x] + z - E[z])(x - E[x] + z - E[z])^T\right] \\
&= E\left[(x - E[x])(x - E[x])^T\right] + E\left[(z - E[z])(z - E[z])^T\right] \\
&\quad + \underbrace{E\left[(x - E[x])(z - E[z])^T\right]}_{=0} + \underbrace{E\left[(z - E[z])(x - E[x])^T\right]}_{=0} \\
&= \mathrm{cov}[x] + \mathrm{cov}[z]
\end{align*}
where we have used the independence of x and z, together with E[(x − E[x])] = E[(z − E[z])] = 0, to set the third and fourth terms in the expansion to zero. For 1-dimensional variables the covariances become variances and we obtain the result of Exercise 2.10 as a special case.
3.22 For the marginal distribution p(x) we see from (3.76) that the mean is given by the upper partition of (3.92), which is simply µ. Similarly from (3.77) we see that the covariance is given by the top left partition of (3.89) and is therefore given by Λ⁻¹.
Now consider the conditional distribution p(y|x). Applying the result (3.65) for the conditional mean we obtain
\[
\mu_{y|x} = A\mu + b + A\Lambda^{-1}\Lambda(x - \mu) = Ax + b.
\]
Similarly applying the result (3.66) for the covariance of the conditional distribution we have
\[
\mathrm{cov}[y|x] = L^{-1} + A\Lambda^{-1}A^T - A\Lambda^{-1}\Lambda\Lambda^{-1}A^T = L^{-1}
\]
as required.
3.23 We first define
\[
X = \Lambda + A^T L A \tag{112}
\]
and
\[
W = -LA, \quad\text{and thus}\quad W^T = -A^T L^T = -A^T L, \tag{113}
\]
since L is symmetric. We can use (112) and (113) to re-write (3.88) as
\[
R = \begin{pmatrix} X & W^T \\ W & L \end{pmatrix}
\]
and using (3.60) we get
\[
\begin{pmatrix} X & W^T \\ W & L \end{pmatrix}^{-1}
= \begin{pmatrix} M & -MW^T L^{-1} \\ -L^{-1}WM & L^{-1} + L^{-1}WMW^T L^{-1} \end{pmatrix}
\]
where now
\[
M = \left(X - W^T L^{-1}W\right)^{-1}.
\]
Substituting X and W using (112) and (113), respectively, we get
\[
M = \left(\Lambda + A^T L A - A^T L L^{-1} L A\right)^{-1} = \Lambda^{-1},
\]
\[
-MW^T L^{-1} = \Lambda^{-1}A^T L L^{-1} = \Lambda^{-1}A^T
\]
and
\[
L^{-1} + L^{-1}WMW^T L^{-1} = L^{-1} + L^{-1}LA\Lambda^{-1}A^T L L^{-1} = L^{-1} + A\Lambda^{-1}A^T,
\]
as required.
3.24 Substituting the leftmost expression of (3.89) for R⁻¹ in (3.91), we get
\begin{align*}
\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & S^{-1} + A\Lambda^{-1}A^T \end{pmatrix}
\begin{pmatrix} \Lambda\mu - A^T S b \\ S b \end{pmatrix}
&= \begin{pmatrix} \Lambda^{-1}\left(\Lambda\mu - A^T S b\right) + \Lambda^{-1}A^T S b \\ A\Lambda^{-1}\left(\Lambda\mu - A^T S b\right) + \left(S^{-1} + A\Lambda^{-1}A^T\right)S b \end{pmatrix} \\
&= \begin{pmatrix} \mu - \Lambda^{-1}A^T S b + \Lambda^{-1}A^T S b \\ A\mu - A\Lambda^{-1}A^T S b + b + A\Lambda^{-1}A^T S b \end{pmatrix} \\
&= \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}.
\end{align*}
3.25 Since y = x + z we can write the conditional distribution of y given x in the form
p(y|x) = N (y|µz + x, Σz ). This gives a decomposition of the joint distribution
of x and y in the form p(x, y) = p(y|x)p(x) where p(x) = N (x|µx , Σx ). This
therefore takes the form of (3.83) and (3.84) in which we can identify µ → µx ,
Λ−1 → Σx , A → I, b → µz and L−1 → Σz . We can now obtain the marginal
distribution p(y) by making use of the result (3.99) from which we obtain p(y) =
N (y|µx + µz , Σz + Σx ). Thus both the means and the covariances are additive, in
agreement with the results of Exercise 3.21.

3.26 The quadratic form in the exponential of the joint distribution is given by
\[
-\frac{1}{2}(x-\mu)^T\Lambda(x-\mu) - \frac{1}{2}(y - Ax - b)^T L(y - Ax - b). \tag{114}
\]
We now extract all of those terms involving x and assemble them into a standard Gaussian quadratic form by completing the square
\begin{align*}
&= -\frac{1}{2}x^T(\Lambda + A^T L A)x + x^T\left\{\Lambda\mu + A^T L(y-b)\right\} + \text{const} \\
&= -\frac{1}{2}(x-m)^T(\Lambda + A^T L A)(x-m) + \frac{1}{2}m^T(\Lambda + A^T L A)m + \text{const} \tag{115}
\end{align*}
where
\[
m = (\Lambda + A^T L A)^{-1}\left\{\Lambda\mu + A^T L(y-b)\right\}.
\]
We can now perform the integration over x which eliminates the first term in (115). Then we extract the terms in y from the final term in (115) and combine these with the remaining terms from the quadratic form (114) which depend on y to give
\begin{align*}
= -\frac{1}{2}y^T&\left\{L - LA(\Lambda + A^T L A)^{-1}A^T L\right\}y \\
+\, y^T&\left[\left\{L - LA(\Lambda + A^T L A)^{-1}A^T L\right\}b + LA(\Lambda + A^T L A)^{-1}\Lambda\mu\right]. \tag{116}
\end{align*}
We can identify the precision of the marginal distribution p(y) from the second order term in y. To find the corresponding covariance, we take the inverse of the precision and apply the Woodbury inversion formula (3.210) to give
\[
\left\{L - LA(\Lambda + A^T L A)^{-1}A^T L\right\}^{-1} = L^{-1} + A\Lambda^{-1}A^T \tag{117}
\]
which corresponds to (3.94).
Next we identify the mean ν of the marginal distribution. To do this we make use of (117) in (116) and then complete the square to give
\[
-\frac{1}{2}(y-\nu)^T\left\{L^{-1} + A\Lambda^{-1}A^T\right\}^{-1}(y-\nu) + \text{const}
\]
where
\[
\nu = \left(L^{-1} + A\Lambda^{-1}A^T\right)\left[\left\{L - LA(\Lambda + A^T L A)^{-1}A^T L\right\}b + LA(\Lambda + A^T L A)^{-1}\Lambda\mu\right].
\]
Now consider the two terms in the square brackets, the first one involving b and the second involving µ. The first of these contributions simply gives b, while the term in µ can be written
\begin{align*}
\left(L^{-1} + A\Lambda^{-1}A^T\right)LA(\Lambda + A^T L A)^{-1}\Lambda\mu
&= A\left(I + \Lambda^{-1}A^T L A\right)(\Lambda + A^T L A)^{-1}\Lambda\mu \\
&= A\left(I + \Lambda^{-1}A^T L A\right)\left(I + \Lambda^{-1}A^T L A\right)^{-1}\Lambda^{-1}\Lambda\mu = A\mu
\end{align*}
where we have used the general result (BC)⁻¹ = C⁻¹B⁻¹. Hence we obtain (3.93).
3.27 To find the conditional distribution p(x|y) we start from the quadratic form (114) corresponding to the joint distribution p(x, y). Now, however, we treat y as a constant and simply complete the square over x to give
\begin{align*}
-\frac{1}{2}&(x-\mu)^T\Lambda(x-\mu) - \frac{1}{2}(y - Ax - b)^T L(y - Ax - b) \\
&= -\frac{1}{2}x^T(\Lambda + A^T L A)x + x^T\left\{\Lambda\mu + A^T L(y-b)\right\} + \text{const} \\
&= -\frac{1}{2}(x-m)^T(\Lambda + A^T L A)(x-m) + \text{const}
\end{align*}
where, as in the solution to Exercise 3.26, we have defined
\[
m = (\Lambda + A^T L A)^{-1}\left\{\Lambda\mu + A^T L(y-b)\right\}
\]
from which we obtain directly the mean and covariance of the conditional distribution in the form (3.95) and (3.96).
3.28 Differentiating (3.102) with respect to Σ we obtain two terms:
\[
-\frac{N}{2}\frac{\partial}{\partial\Sigma}\ln|\Sigma| - \frac{1}{2}\frac{\partial}{\partial\Sigma}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu).
\]
For the first term, we can apply (A.28) directly to get
\[
-\frac{N}{2}\frac{\partial}{\partial\Sigma}\ln|\Sigma| = -\frac{N}{2}\left(\Sigma^{-1}\right)^T = -\frac{N}{2}\Sigma^{-1}.
\]
For the second term, we first re-write the sum
\[
\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) = N\,\mathrm{Tr}\left[\Sigma^{-1}S\right],
\]
where
\[
S = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^T.
\]
Using this together with (A.21), in which x = Σᵢⱼ (element (i, j) in Σ), and properties of the trace we get
\begin{align*}
\frac{\partial}{\partial\Sigma_{ij}}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)
&= N\frac{\partial}{\partial\Sigma_{ij}}\mathrm{Tr}\left[\Sigma^{-1}S\right] \\
&= N\,\mathrm{Tr}\left[\frac{\partial}{\partial\Sigma_{ij}}\Sigma^{-1}S\right] \\
&= -N\,\mathrm{Tr}\left[\Sigma^{-1}\frac{\partial\Sigma}{\partial\Sigma_{ij}}\Sigma^{-1}S\right] \\
&= -N\,\mathrm{Tr}\left[\frac{\partial\Sigma}{\partial\Sigma_{ij}}\Sigma^{-1}S\Sigma^{-1}\right] \\
&= -N\left(\Sigma^{-1}S\Sigma^{-1}\right)_{ij}
\end{align*}
where we have used (A.26). Note that in the last step we have ignored the fact that Σᵢⱼ = Σⱼᵢ, so that ∂Σ/∂Σᵢⱼ has a 1 in position (i, j) only and 0 everywhere else. Treating this result as valid nevertheless, we get
\[
-\frac{1}{2}\frac{\partial}{\partial\Sigma}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) = \frac{N}{2}\Sigma^{-1}S\Sigma^{-1}.
\]
Combining the derivatives of the two terms and setting the result to zero, we obtain
\[
\frac{N}{2}\Sigma^{-1} = \frac{N}{2}\Sigma^{-1}S\Sigma^{-1}.
\]
Re-arrangement then yields
\[
\Sigma = S
\]
as required.
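In one dimension Σ reduces to a scalar variance, and the stationary point Σ = S can be checked numerically. The sketch below (arbitrary data, mean held fixed at the sample mean, an illustrative check only) verifies that S locally maximizes the log likelihood:

```python
import math

data = [1.2, -0.7, 0.4, 2.1, 0.0, -1.3]
mu = sum(data) / len(data)

# S is the mean squared deviation about mu (the 1-D analogue of the matrix S)
S = sum((x - mu) ** 2 for x in data) / len(data)

def log_lik(s2):
    """1-D Gaussian log likelihood of (3.102), up to additive constants."""
    N = len(data)
    return -0.5 * N * math.log(s2) - sum((x - mu) ** 2 for x in data) / (2 * s2)

# S should be a local maximum of the log likelihood
eps = 1e-4
assert log_lik(S) > log_lik(S + eps)
assert log_lik(S) > log_lik(S - eps)
```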

3.29 The derivation of (3.46) follows directly from the discussion given in the text between (3.42) and (3.46). If m = n then, using (3.46), we have E[xₙxₙᵀ] = µµᵀ + Σ, whereas if n ≠ m then the two data points xₙ and xₘ are independent and hence E[xₙxₘᵀ] = µµᵀ, where we have used (3.42). Combining these results we obtain (3.213). From (3.42) and (3.46) we then have
\begin{align*}
E[\Sigma_{\mathrm{ML}}]
&= E\left[\frac{1}{N}\sum_{n=1}^{N}\left(x_n - \frac{1}{N}\sum_{m=1}^{N}x_m\right)\left(x_n - \frac{1}{N}\sum_{l=1}^{N}x_l\right)^T\right] \\
&= \frac{1}{N}\sum_{n=1}^{N}E\left[x_n x_n^T - \frac{2}{N}\sum_{m=1}^{N}x_n x_m^T + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N}x_m x_l^T\right] \\
&= \mu\mu^T + \Sigma - 2\left(\mu\mu^T + \frac{1}{N}\Sigma\right) + \left(\mu\mu^T + \frac{1}{N}\Sigma\right) \\
&= \left(\frac{N-1}{N}\right)\Sigma \tag{118}
\end{align*}
as required.
3.30 Using the relation (3.214) we have
\[
1 = \exp(iA)\exp(-iA) = (\cos A + i\sin A)(\cos A - i\sin A) = \cos^2 A + \sin^2 A.
\]
Similarly, we have
\begin{align*}
\cos(A - B) &= \Re\exp\{i(A-B)\} \\
&= \Re\,\exp(iA)\exp(-iB) \\
&= \Re\,(\cos A + i\sin A)(\cos B - i\sin B) \\
&= \cos A\cos B + \sin A\sin B.
\end{align*}
Finally
\begin{align*}
\sin(A - B) &= \Im\exp\{i(A-B)\} \\
&= \Im\,\exp(iA)\exp(-iB) \\
&= \Im\,(\cos A + i\sin A)(\cos B - i\sin B) \\
&= \sin A\cos B - \cos A\sin B.
\end{align*}
3.31 Expressed in terms of ξ the von Mises distribution becomes
\[
p(\xi) \propto \exp\left\{m\cos\left(m^{-1/2}\xi\right)\right\}.
\]
For large m we have cos(m^{−1/2}ξ) = 1 − m⁻¹ξ²/2 + O(m⁻²) and so
\[
p(\xi) \propto \exp\left\{-\xi^2/2\right\}
\]
and hence p(θ) ∝ exp{−m(θ − θ₀)²/2}.
3.32 Using (3.133), we can write (3.132) as
\[
\sum_{n=1}^{N}\left(\cos\theta_0\sin\theta_n - \cos\theta_n\sin\theta_0\right)
= \cos\theta_0\sum_{n=1}^{N}\sin\theta_n - \sin\theta_0\sum_{n=1}^{N}\cos\theta_n = 0.
\]
Rearranging this, we get
\[
\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n} = \frac{\sin\theta_0}{\cos\theta_0} = \tan\theta_0,
\]
which we can solve w.r.t. θ₀ to obtain (3.134).
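The solution (3.134) can be written θ₀ = atan2(Σₙ sin θₙ, Σₙ cos θₙ), which resolves the quadrant ambiguity of the tangent. The sketch below (arbitrary angles, an illustrative check only) confirms that this value maximizes Σₙ cos(θₙ − θ₀) against a fine grid:

```python
import math

thetas = [0.3, 0.5, -0.2, 1.1, 0.9]

# Maximum likelihood mean direction from (3.134), via atan2
theta0 = math.atan2(sum(math.sin(t) for t in thetas),
                    sum(math.cos(t) for t in thetas))

def objective(t0):
    """The theta0-dependent part of the von Mises log likelihood."""
    return sum(math.cos(t - t0) for t in thetas)

# Compare against an exhaustive grid over [-pi, pi)
grid_best = max(objective(-math.pi + 2 * math.pi * k / 100000)
                for k in range(100000))
assert objective(theta0) >= grid_best - 1e-8
```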
3.33 Differentiating the von Mises distribution (3.129) we have
\[
p'(\theta) = -\frac{m}{2\pi I_0(m)}\exp\left\{m\cos(\theta-\theta_0)\right\}\sin(\theta-\theta_0)
\]
which vanishes when θ = θ₀ or when θ = θ₀ + π (mod 2π). Differentiating again we have
\[
p''(\theta) = -\frac{m}{2\pi I_0(m)}\exp\left\{m\cos(\theta-\theta_0)\right\}\left[\cos(\theta-\theta_0) - m\sin^2(\theta-\theta_0)\right].
\]
Since I₀(m) > 0 we see that p''(θ) < 0 when θ = θ₀, which therefore represents a maximum of the density, while p''(θ) > 0 when θ = θ₀ + π (mod 2π), which is therefore a minimum.
3.34 From (3.119) and (3.134), we see that θ̄ = θ₀ᴹᴸ. Using this together with (3.118) and (3.127), we can rewrite (3.137) as follows:
\begin{align*}
A(m_{\mathrm{ML}}) &= \left(\frac{1}{N}\sum_{n=1}^{N}\cos\theta_n\right)\cos\theta_0^{\mathrm{ML}} + \left(\frac{1}{N}\sum_{n=1}^{N}\sin\theta_n\right)\sin\theta_0^{\mathrm{ML}} \\
&= \bar{r}\cos\bar{\theta}\cos\theta_0^{\mathrm{ML}} + \bar{r}\sin\bar{\theta}\sin\theta_0^{\mathrm{ML}} \\
&= \bar{r}\left(\cos^2\theta_0^{\mathrm{ML}} + \sin^2\theta_0^{\mathrm{ML}}\right) = \bar{r}.
\end{align*}
3.35 Starting from (3.26), we can rewrite the argument of the exponential as
\[
-\frac{1}{2}\mathrm{Tr}\left[\Sigma^{-1}xx^T\right] + \mu^T\Sigma^{-1}x - \frac{1}{2}\mu^T\Sigma^{-1}\mu.
\]
The last term is independent of x but depends on µ and Σ and so should go into g(η). The second term is already an inner product and can be kept as is. To deal with the first term, we define the D²-dimensional vectors z and λ, which consist of the columns of xxᵀ and Σ⁻¹, respectively, stacked on top of each other. Now we can write the multivariate Gaussian distribution in the form (3.138), with
\begin{align*}
\eta &= \begin{pmatrix}\Sigma^{-1}\mu \\ -\frac{1}{2}\lambda\end{pmatrix} \\
u(x) &= \begin{pmatrix}x \\ z\end{pmatrix} \\
h(x) &= (2\pi)^{-D/2} \\
g(\eta) &= |\Sigma|^{-1/2}\exp\left(-\frac{1}{2}\mu^T\Sigma^{-1}\mu\right).
\end{align*}
3.36 Taking the first derivative of (3.172) we obtain, as in the text,
\[
-\nabla\ln g(\eta) = g(\eta)\int h(x)\exp\left\{\eta^T u(x)\right\}u(x)\,dx.
\]
Taking the gradient again gives
\begin{align*}
-\nabla\nabla\ln g(\eta) &= g(\eta)\int h(x)\exp\left\{\eta^T u(x)\right\}u(x)u(x)^T dx
+ \nabla g(\eta)\int h(x)\exp\left\{\eta^T u(x)\right\}u(x)\,dx \\
&= E[u(x)u(x)^T] - E[u(x)]\,E[u(x)^T] \\
&= \mathrm{cov}[u(x)]
\end{align*}
where we have used the result (3.172).
3.37 The value of the density p(x) at a point xₙ is given by h_{j(n)}, where the notation j(n) denotes that data point xₙ falls within region j. Thus the log likelihood function takes the form
\[
\sum_{n=1}^{N}\ln p(x_n) = \sum_{n=1}^{N}\ln h_{j(n)}.
\]
We now need to take account of the constraint that p(x) must integrate to unity. Since p(x) has the constant value hᵢ over region i, which has volume ∆ᵢ, the normalization constraint becomes Σᵢ hᵢ∆ᵢ = 1. Introducing a Lagrange multiplier λ we then maximize the function
\[
\sum_{n=1}^{N}\ln h_{j(n)} + \lambda\left(\sum_i h_i\Delta_i - 1\right)
\]
with respect to h_k to give
\[
0 = \frac{n_k}{h_k} + \lambda\Delta_k
\]
where n_k denotes the total number of data points falling within region k. Multiplying both sides by h_k, summing over k and making use of the normalization constraint, we obtain λ = −N. Eliminating λ then gives our final result for the maximum likelihood solution for h_k in the form
\[
h_k = \frac{n_k}{N}\frac{1}{\Delta_k}.
\]
Note that, for equal sized bins ∆_k = ∆ we obtain a bin height h_k which is proportional to the fraction of points falling within that bin, as expected.
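The result h_k = n_k/(N∆_k) can be checked on a toy example: it normalizes correctly, and any normalization-preserving perturbation lowers the likelihood. The sketch below (illustrative counts and bin widths, not part of the printed solution) demonstrates both properties:

```python
import math

counts = [3, 5, 2]          # n_k: points per bin
widths = [0.5, 1.0, 0.25]   # Delta_k
N = sum(counts)

# Maximum likelihood bin heights
h = [n / (N * d) for n, d in zip(counts, widths)]
assert abs(sum(hk * dk for hk, dk in zip(h, widths)) - 1.0) < 1e-12

def log_lik(heights):
    return sum(n * math.log(hk) for n, hk in zip(counts, heights))

# Move a little probability mass from bin 1 to bin 0, keeping normalization
eps = 1e-3
h_pert = [h[0] + eps / widths[0], h[1] - eps / widths[1], h[2]]
assert abs(sum(hk * dk for hk, dk in zip(h_pert, widths)) - 1.0) < 1e-12
assert log_lik(h) > log_lik(h_pert)
```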
3.38 From (3.180) we have
\[
p(x) = \frac{K}{NV(\rho)}
\]
where V(ρ) is the volume of a D-dimensional hypersphere with radius ρ, where in turn ρ is the distance from x to its Kᵗʰ nearest neighbour in the data set. Thus, in polar coordinates, if we consider sufficiently large values for the radial coordinate r, we have
\[
p(x) \propto r^{-D}.
\]
If we consider the integral of p(x) and note that the volume element dx can be written as r^{D−1} dr, we get
\[
\int p(x)\,dx \propto \int_0^{\infty} r^{-D}r^{D-1}\,dr = \int_0^{\infty} r^{-1}\,dr
\]
which diverges logarithmically.


Chapter 4 Single-layer Networks: Regression

4.1 Substituting (1.1) into (1.2) and then differentiating with respect to wᵢ we obtain
\[
\sum_{n=1}^{N}\left(\sum_{j=0}^{M}w_j x_n^j - t_n\right)x_n^i = 0. \tag{119}
\]
Re-arranging terms then gives the required result.


4.2 For the regularized sum-of-squares error function given by (1.4) the corresponding linear equations are again obtained by differentiation, and take the same form as (4.53), but with Aᵢⱼ replaced by Ãᵢⱼ, given by
\[
\widetilde{A}_{ij} = A_{ij} + \lambda I_{ij}. \tag{120}
\]

4.3 Using (4.6), we have
\begin{align*}
2\sigma(2a) - 1 &= \frac{2}{1 + e^{-2a}} - 1 \\
&= \frac{2 - (1 + e^{-2a})}{1 + e^{-2a}} \\
&= \frac{1 - e^{-2a}}{1 + e^{-2a}} \\
&= \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = \tanh(a).
\end{align*}
If we now take a_j = (x − µ_j)/(2s), we can rewrite (4.57) as
\begin{align*}
y(x, w) &= w_0 + \sum_{j=1}^{M}w_j\sigma(2a_j) \\
&= w_0 + \sum_{j=1}^{M}\frac{w_j}{2}\left(2\sigma(2a_j) - 1 + 1\right) \\
&= u_0 + \sum_{j=1}^{M}u_j\tanh(a_j),
\end{align*}
where u_j = w_j/2, for j = 1, . . . , M, and u₀ = w₀ + Σⱼ₌₁ᴹ wⱼ/2.
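The identity 2σ(2a) − 1 = tanh(a) and the resulting equivalence of the two network parameterizations are easy to verify numerically. The sketch below (arbitrary weights and basis-function centres, an illustrative check only) does both:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# The identity 2*sigma(2a) - 1 = tanh(a)
for a in [-3.0, -0.5, 0.0, 0.7, 2.5]:
    assert abs(2 * sigmoid(2 * a) - 1 - math.tanh(a)) < 1e-12

# Equivalence of the sigmoidal and tanh networks for one weight setting
w0, w = 0.2, [0.5, -1.0, 0.3]
mus, s = [0.0, 1.0, -1.0], 0.4
u0 = w0 + sum(wj / 2 for wj in w)
u = [wj / 2 for wj in w]

for x in [-2.0, -0.3, 0.0, 1.4]:
    a = [(x - m) / (2 * s) for m in mus]                       # a_j = (x - mu_j)/(2s)
    y_sig = w0 + sum(wj * sigmoid((x - m) / s) for wj, m in zip(w, mus))
    y_tanh = u0 + sum(uj * math.tanh(aj) for uj, aj in zip(u, a))
    assert abs(y_sig - y_tanh) < 1e-12
```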
4.4 We first write
\[
\Phi(\Phi^T\Phi)^{-1}\Phi^T v = \Phi\widetilde{v} = \varphi_1\widetilde{v}^{(1)} + \varphi_2\widetilde{v}^{(2)} + \ldots + \varphi_M\widetilde{v}^{(M)}
\]
where φₘ is the m-th column of Φ and ṽ = (ΦᵀΦ)⁻¹Φᵀv. By comparing this with the least squares solution in (4.14), we see that
\[
y = \Phi w_{\mathrm{ML}} = \Phi(\Phi^T\Phi)^{-1}\Phi^T t
\]
corresponds to a projection of t onto the space spanned by the columns of Φ. To see that this is indeed an orthogonal projection, we first note that for any column of Φ, φⱼ,
\[
\Phi(\Phi^T\Phi)^{-1}\Phi^T\varphi_j = \left(\Phi(\Phi^T\Phi)^{-1}\Phi^T\Phi\right)_j = \varphi_j
\]
and therefore
\[
(y - t)^T\varphi_j = (\Phi w_{\mathrm{ML}} - t)^T\varphi_j = t^T\left(\Phi(\Phi^T\Phi)^{-1}\Phi^T - I\right)\varphi_j = 0
\]
and thus (y − t) is orthogonal to every column of Φ and hence is orthogonal to S.
4.5 If we define R = diag(r₁, . . . , r_N) to be a diagonal matrix containing the weighting coefficients, then we can write the weighted sum-of-squares cost function in the form
\[
E_D(w) = \frac{1}{2}(t - \Phi w)^T R(t - \Phi w).
\]
Setting the derivative with respect to w to zero, and re-arranging, then gives
\[
w^{\star} = \left(\Phi^T R\Phi\right)^{-1}\Phi^T R t
\]
which reduces to the standard solution (4.14) for the case R = I.
If we compare (4.60) with (4.9)–(4.11), we see that rₙ can be regarded as a precision (inverse variance) parameter, particular to the data point (xₙ, tₙ), that either replaces or scales β.
Alternatively, rₙ can be regarded as an effective number of replicated observations of data point (xₙ, tₙ); this becomes particularly clear if we consider (4.60) with rₙ taking positive integer values, although it is valid for any rₙ > 0.
4.6 Taking the gradient of (4.26) with respect to w and setting this to zero, we obtain
\[
0 = -\sum_{n=1}^{N}\left\{t_n - w^T\phi(x_n)\right\}\phi(x_n) + \lambda w \tag{121}
\]
which we can rewrite as
\[
0 = -\Phi^T t + \Phi^T\Phi w + \lambda w. \tag{122}
\]
Rearranging to solve for w we then obtain
\[
w = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T t \tag{123}
\]
as required.
4.7 We first write down the log likelihood function which is given by
\[
\ln L(W, \Sigma) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\left(t_n - W^T\phi(x_n)\right)^T\Sigma^{-1}\left(t_n - W^T\phi(x_n)\right).
\]
First of all we set the derivative with respect to W equal to zero, giving
\[
0 = -\sum_{n=1}^{N}\Sigma^{-1}\left(t_n - W^T\phi(x_n)\right)\phi(x_n)^T.
\]
Multiplying through by Σ and introducing the design matrix Φ and the target data matrix T we have
\[
\Phi^T\Phi W = \Phi^T T.
\]
Solving for W then gives (4.14) as required.
The maximum likelihood solution for Σ is easily found by appealing to the standard result from Chapter ??, giving
\[
\Sigma = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - W_{\mathrm{ML}}^T\phi(x_n)\right)\left(t_n - W_{\mathrm{ML}}^T\phi(x_n)\right)^T
\]
as required. Since we are finding a joint maximum with respect to both W and Σ we see that it is W_ML which appears in this expression, as in the standard result for an unconditional Gaussian distribution.
4.8 The expected squared loss for a vectorial target variable is given by
\[
E[L] = \iint \|y(x) - t\|^2\,p(t, x)\,dx\,dt.
\]
Our goal is to choose y(x) so as to minimize E[L]. We can do this formally using the calculus of variations to give
\[
\frac{\delta E[L]}{\delta y(x)} = \int 2(y(x) - t)\,p(t, x)\,dt = 0.
\]
Solving for y(x), and using the sum and product rules of probability, we obtain
\[
y(x) = \frac{\displaystyle\int t\,p(t, x)\,dt}{\displaystyle\int p(t, x)\,dt} = \int t\,p(t|x)\,dt
\]
which is the conditional average of t conditioned on x. For the case of a scalar target variable we have
\[
y(x) = \int t\,p(t|x)\,dt
\]
which is equivalent to (4.37).

4.9 We start by expanding the square in (??), in a similar fashion to the univariate case in the equation preceding (4.39),
\begin{align*}
\|y(x) - t\|^2 &= \|y(x) - E[t|x] + E[t|x] - t\|^2 \\
&= \|y(x) - E[t|x]\|^2 + (y(x) - E[t|x])^T(E[t|x] - t) \\
&\quad + (E[t|x] - t)^T(y(x) - E[t|x]) + \|E[t|x] - t\|^2.
\end{align*}
Following the treatment of the univariate case, we now substitute this into (4.64) and perform the integral over t. Again the cross-term vanishes and we are left with
\[
E[L] = \int \|y(x) - E[t|x]\|^2\,p(x)\,dx + \int \mathrm{var}[t|x]\,p(x)\,dx
\]
from which we see directly that the function y(x) that minimizes E[L] is given by E[t|x].
4.10 This exercise is just a repeat of Exercise 4.9.
4.11 To prove the normalization of the distribution (4.66) consider the integral
\[
I = \int_{-\infty}^{\infty}\exp\left(-\frac{|x|^q}{2\sigma^2}\right)dx = 2\int_{0}^{\infty}\exp\left(-\frac{x^q}{2\sigma^2}\right)dx
\]
and make the change of variable
\[
u = \frac{x^q}{2\sigma^2}.
\]
Using the definition (??) of the Gamma function, this gives
\[
I = 2\int_{0}^{\infty}\frac{2\sigma^2}{q}\left(2\sigma^2 u\right)^{(1-q)/q}\exp(-u)\,du = \frac{2(2\sigma^2)^{1/q}\Gamma(1/q)}{q}
\]
from which the normalization of (4.66) follows.
For the given noise distribution, the conditional distribution of the target variable given the input variable is
\[
p(t|x, w, \sigma^2) = \frac{q}{2(2\sigma^2)^{1/q}\Gamma(1/q)}\exp\left(-\frac{|t - y(x, w)|^q}{2\sigma^2}\right).
\]
The likelihood function is obtained by taking products of factors of this form, over all pairs {xₙ, tₙ}. Taking the logarithm, and discarding additive constants, we obtain the desired result.
4.12 Since we can choose y(x) independently for each value of x, the minimum of the
expected Lq loss can be found by minimizing the integrand given by
Z
|y(x) − t|q p(t|x) dt (124)
48 Solution 4.12

for each value of x. Setting the derivative of (124) with respect to y(x) to zero gives
the stationarity condition
Z
q|y(x) − t|q−1 sign(y(x) − t)p(t|x) dt
Z y(x) Z ∞
q−1
= q |y(x) − t| p(t|x) dt − q |y(x) − t|q−1 p(t|x) dt = 0
−∞ y (x)

which can also be obtained directly by setting the functional derivative of (4.40) with
respect to y(x) equal to zero. It follows that y(x) must satisfy
Z y (x) Z ∞
q−1
|y(x) − t| p(t|x) dt = |y(x) − t|q−1 p(t|x) dt. (125)
−∞ y (x)

For the case of q = 1 this reduces to


Z y(x) Z ∞
p(t|x) dt = p(t|x) dt. (126)
−∞ y (x)

which says that y(x) must be the conditional median of t.


For q → 0 we note that, as a function of t, the quantity |y(x) − t|q is close to 1
everywhere except in a small neighbourhood around t = y(x) where it falls to zero.
The value of (124) will therefore be close to 1, since the density p(t) is normalized,
but reduced slightly by the ‘notch’ close to t = y(x). We obtain the biggest reduction
in (124) by choosing the location of the notch to coincide with the largest value of
p(t), i.e. with the (conditional) mode.
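These limiting cases can be checked numerically. The sketch below (an illustrative check, not code from the text; the skewed gamma sample and the grid are our own choices) minimizes the empirical Lq risk for a constant prediction and recovers the sample median for q = 1 and the sample mean for q = 2.

```python
import numpy as np

def lq_risk(y, t, q):
    # Empirical expectation of the Lq loss |y - t|^q for a constant prediction y.
    return np.mean(np.abs(y - t) ** q)

rng = np.random.default_rng(0)
t = rng.gamma(2.0, 1.0, size=20_000)   # a skewed stand-in for p(t|x)

grid = np.linspace(0.0, 6.0, 601)
y_q1 = grid[np.argmin([lq_risk(y, t, 1) for y in grid])]  # minimizer for q = 1
y_q2 = grid[np.argmin([lq_risk(y, t, 2) for y in grid])]  # minimizer for q = 2
```

For q = 1 the minimizer tracks `np.median(t)`, matching (126), while for q = 2 it tracks `np.mean(t)`, the conditional-mean result for squared loss.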

Chapter 5 Single-layer Networks: Classification

5.1 Consider the component tk of t. By definition we have


p(tk = 1|x) = p(Ck |x) (127)
p(tk = 0|x) = 1 − p(Ck |x). (128)
Taking the expectation of tk then gives
E[tk |x] = 1 × p(Ck |x) + 0 × {1 − p(Ck |x)} = p(Ck |x) (129)
as required.
5.2 Assume that the convex hulls of {x_n} and {y_m} intersect. Then there exists at least one point z such that

$$z = \sum_n \alpha_n x_n = \sum_m \beta_m y_m \qquad (130)$$

where β_m ⩾ 0 for all m and Σ_m β_m = 1. If {x_n} and {y_m} were also linearly separable, we would have that
$$\widehat{w}^{\mathrm T} z + w_0 = \sum_n \alpha_n \widehat{w}^{\mathrm T} x_n + w_0 = \sum_n \alpha_n \left(\widehat{w}^{\mathrm T} x_n + w_0\right) > 0, \qquad (131)$$

since $\widehat{w}^{\mathrm T} x_n + w_0 > 0$ and the {α_n} are all non-negative and sum to 1. However, by the corresponding argument
$$\widehat{w}^{\mathrm T} z + w_0 = \sum_m \beta_m \widehat{w}^{\mathrm T} y_m + w_0 = \sum_m \beta_m \left(\widehat{w}^{\mathrm T} y_m + w_0\right) < 0, \qquad (132)$$

which is a contradiction and hence {xn } and {ym } cannot be linearly separable if
their convex hulls intersect.
If we instead assume that {x_n} and {y_m} are linearly separable and consider a point z in the intersection of their convex hulls, the same contradiction arises. Thus no such point can exist and the intersection of the convex hulls of {x_n} and {y_m} must be empty.
5.3 For the purpose of this exercise, we make the contribution of the bias weights explicit in (5.14), giving

$$E_D(\widetilde{W}) = \frac{1}{2}\,\mathrm{Tr}\left\{(XW + 1w_0^{\mathrm T} - T)^{\mathrm T}(XW + 1w_0^{\mathrm T} - T)\right\}, \qquad (133)$$

where $w_0$ is the column vector of bias weights (the top row of $\widetilde{W}$ transposed) and $1$ is a column vector of N ones.
We can take the derivative of (133) w.r.t. $w_0$, giving

$$2N w_0 + 2(XW - T)^{\mathrm T} 1. \qquad (134)$$

Setting this to zero, and solving for w0 , we obtain


$$w_0 = \bar{t} - W^{\mathrm T}\bar{x} \qquad (135)$$

where

$$\bar{t} = \frac{1}{N}\,T^{\mathrm T} 1 \quad\text{and}\quad \bar{x} = \frac{1}{N}\,X^{\mathrm T} 1. \qquad (136)$$

If we substitute (135) into (133), we get

$$E_D(W) = \frac{1}{2}\,\mathrm{Tr}\left\{(XW + \overline{T} - \overline{X}W - T)^{\mathrm T}(XW + \overline{T} - \overline{X}W - T)\right\}, \qquad (137)$$

where

$$\overline{T} = 1\bar{t}^{\mathrm T} \quad\text{and}\quad \overline{X} = 1\bar{x}^{\mathrm T}. \qquad (138)$$

Setting the derivative of this w.r.t. W to zero we get

$$W = \big(\widehat{X}^{\mathrm T}\widehat{X}\big)^{-1}\widehat{X}^{\mathrm T}\widehat{T} = \widehat{X}^{\dagger}\widehat{T}, \qquad (139)$$

where we have defined $\widehat{X} = X - \overline{X}$ and $\widehat{T} = T - \overline{T}$.
Now consider the prediction for a new input vector $x^{\star}$,

$$y(x^{\star}) = W^{\mathrm T} x^{\star} + w_0 = W^{\mathrm T} x^{\star} + \bar{t} - W^{\mathrm T}\bar{x} = \bar{t} + \widehat{T}^{\mathrm T}\big(\widehat{X}^{\dagger}\big)^{\mathrm T}(x^{\star} - \bar{x}). \qquad (140)$$

If we apply (5.97) to $\bar{t}$, we get

$$a^{\mathrm T}\bar{t} = \frac{1}{N}\,a^{\mathrm T} T^{\mathrm T} 1 = -b. \qquad (141)$$

Therefore, applying (5.97) to (140), we obtain

$$a^{\mathrm T} y(x^{\star}) = a^{\mathrm T}\bar{t} + a^{\mathrm T}\widehat{T}^{\mathrm T}\big(\widehat{X}^{\dagger}\big)^{\mathrm T}(x^{\star} - \bar{x}) = a^{\mathrm T}\bar{t} = -b,$$

since $a^{\mathrm T}\widehat{T}^{\mathrm T} = a^{\mathrm T}(T - \overline{T})^{\mathrm T} = b(1 - 1)^{\mathrm T} = 0^{\mathrm T}$.
5.4 When we consider several simultaneous constraints, (5.97) becomes
Atn + b = 0, (142)
where A is a matrix and b is a column vector such that each row of A and element
of b correspond to one linear constraint.
If we apply (142) to (140), we obtain

$$A y(x^{\star}) = A\bar{t} + A\widehat{T}^{\mathrm T}\big(\widehat{X}^{\dagger}\big)^{\mathrm T}(x^{\star} - \bar{x}) = A\bar{t} = -b,$$

since $A\widehat{T}^{\mathrm T} = A(T - \overline{T})^{\mathrm T} = b1^{\mathrm T} - b1^{\mathrm T} = 0^{\mathrm T}$. Thus $Ay(x^{\star}) + b = 0$.

5.5 Using the definitions (5.30) and (5.31) we have

$$\text{Precision} \times \text{Recall} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \times \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} = \frac{N_{\mathrm{TP}}^2}{(N_{\mathrm{TP}} + N_{\mathrm{FP}})(N_{\mathrm{TP}} + N_{\mathrm{FN}})}.$$

Similarly, taking a common denominator we have

$$\text{Precision} + \text{Recall} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} + \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} = \frac{N_{\mathrm{TP}}(2N_{\mathrm{TP}} + N_{\mathrm{FN}} + N_{\mathrm{FP}})}{(N_{\mathrm{TP}} + N_{\mathrm{FP}})(N_{\mathrm{TP}} + N_{\mathrm{FN}})}.$$
Substituting these two results into (5.38) we obtain
$$F = \frac{2N_{\mathrm{TP}}}{2N_{\mathrm{TP}} + N_{\mathrm{FP}} + N_{\mathrm{FN}}} \qquad (143)$$

as required.
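As a quick numerical check of (143), the harmonic-mean form of the F-score and the counts-only form agree exactly; the helper below is an illustrative sketch, not code from the text.

```python
def f_score(n_tp, n_fp, n_fn):
    """F-score from confusion-matrix counts; see (5.38) and (143)."""
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    harmonic = 2 * precision * recall / (precision + recall)
    direct = 2 * n_tp / (2 * n_tp + n_fp + n_fn)   # equation (143)
    assert abs(harmonic - direct) < 1e-12          # the two forms coincide
    return direct
```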
5.6 Since the square root function is monotonic for non-negative numbers, we can take the square root of the relation a ⩽ b to obtain a^{1/2} ⩽ b^{1/2}. Then we multiply both sides by the non-negative quantity a^{1/2} to obtain a ⩽ (ab)^{1/2}.
The probability of a misclassification is given, from (??), by

$$p(\text{mistake}) = \int_{\mathcal{R}_1} p(x, \mathcal{C}_2)\,\mathrm{d}x + \int_{\mathcal{R}_2} p(x, \mathcal{C}_1)\,\mathrm{d}x = \int_{\mathcal{R}_1} p(\mathcal{C}_2|x)\,p(x)\,\mathrm{d}x + \int_{\mathcal{R}_2} p(\mathcal{C}_1|x)\,p(x)\,\mathrm{d}x. \qquad (144)$$

Since we have chosen the decision regions to minimize the probability of misclassification we must have p(C₂|x) ⩽ p(C₁|x) in region R₁, and p(C₁|x) ⩽ p(C₂|x) in region R₂. We now apply the result a ⩽ b ⇒ a^{1/2} ⩽ b^{1/2} to give

$$p(\text{mistake}) \leqslant \int_{\mathcal{R}_1} \{p(\mathcal{C}_1|x)\,p(\mathcal{C}_2|x)\}^{1/2}\,p(x)\,\mathrm{d}x + \int_{\mathcal{R}_2} \{p(\mathcal{C}_1|x)\,p(\mathcal{C}_2|x)\}^{1/2}\,p(x)\,\mathrm{d}x = \int \{p(\mathcal{C}_1|x)p(x)\,p(\mathcal{C}_2|x)p(x)\}^{1/2}\,\mathrm{d}x \qquad (145)$$

since the two integrals have the same integrand. The final integral is taken over the
whole of the domain of x.

5.7 Substituting Lkj = 1 − δkj into (5.23), and using the fact that the posterior proba-
bilities sum to one, we find that, for each x we should choose the class j for which
1 − p(Cj |x) is a minimum, which is equivalent to choosing the j for which the pos-
terior probability p(Cj |x) is a maximum. This loss matrix assigns a loss of one if
the example is misclassified, and a loss of zero if it is correctly classified, and hence
minimizing the expected loss will minimize the misclassification rate.
5.8 From (5.23) we see that for a general loss matrix and arbitrary class priors, the ex-
pected loss is minimized by assigning an input x to class the j which minimizes
$$\sum_k L_{kj}\,p(\mathcal{C}_k|x) = \frac{1}{p(x)}\sum_k L_{kj}\,p(x|\mathcal{C}_k)\,p(\mathcal{C}_k) \qquad (146)$$

and so there is a direct trade-off between the priors p(Ck ) and the loss matrix Lkj .
5.9 We recognise the sum over data points in (5.100) as the finite-sample approximation
to an expectation, as seen in (2.40). Taking the limit N → ∞ we can use (2.39) to
write the expectation in the form

$$\mathbb{E}\left[p(\mathcal{C}_k|x)\right] = \int p(\mathcal{C}_k|x)\,p(x)\,\mathrm{d}x = \int p(\mathcal{C}_k, x)\,\mathrm{d}x = p(\mathcal{C}_k) \qquad (147)$$
where we have used the product and sum rules of probability.


5.10 A vector x belongs to class C_k with probability p(C_k|x). If we decide to assign x to class C_j we will incur an expected loss of Σ_k L_{kj} p(C_k|x), whereas if we select the reject option we will incur a loss of λ. Thus, if

$$j = \arg\min_{l}\sum_k L_{kl}\,p(\mathcal{C}_k|x) \qquad (148)$$

then we minimize the expected loss if we take the following action

$$\text{choose}\;\begin{cases} \text{class } j, & \text{if } \min_l \sum_k L_{kl}\,p(\mathcal{C}_k|x) < \lambda; \\ \text{reject}, & \text{otherwise.} \end{cases} \qquad (149)$$

For a loss matrix L_{kj} = 1 − I_{kj} we have Σ_k L_{kl} p(C_k|x) = 1 − p(C_l|x) and so we reject unless the smallest value of 1 − p(C_l|x) is less than λ, or equivalently if the largest value of p(C_l|x) is less than 1 − λ. In the standard reject criterion we reject if the largest posterior probability is less than θ. Thus these two criteria for rejection are equivalent provided θ = 1 − λ.
5.11 From (5.42) we have

$$1 - \sigma(a) = 1 - \frac{1}{1 + e^{-a}} = \frac{1 + e^{-a} - 1}{1 + e^{-a}} = \frac{e^{-a}}{1 + e^{-a}} = \frac{1}{e^{a} + 1} = \sigma(-a).$$

The inverse of the logistic sigmoid is easily found as follows

$$y = \sigma(a) = \frac{1}{1 + e^{-a}} \;\Rightarrow\; \frac{1}{y} - 1 = e^{-a} \;\Rightarrow\; \ln\left(\frac{1-y}{y}\right) = -a \;\Rightarrow\; \ln\left(\frac{y}{1-y}\right) = a = \sigma^{-1}(y).$$
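Both properties are easy to confirm numerically; the sketch below uses only the standard library (function names are ours).

```python
import math

def sigmoid(a):
    # logistic sigmoid (5.42)
    return 1.0 / (1.0 + math.exp(-a))

def logit(y):
    # inverse sigmoid, a = ln(y / (1 - y))
    return math.log(y / (1.0 - y))
```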

5.12 Substituting (5.47) into (5.41), we see that the normalizing constants cancel and we are left with

$$\begin{aligned}
a &= \ln \frac{\exp\left\{-\tfrac{1}{2}(x - \mu_1)^{\mathrm T}\Sigma^{-1}(x - \mu_1)\right\} p(\mathcal{C}_1)}{\exp\left\{-\tfrac{1}{2}(x - \mu_2)^{\mathrm T}\Sigma^{-1}(x - \mu_2)\right\} p(\mathcal{C}_2)} \\
&= -\frac{1}{2}\left(x^{\mathrm T}\Sigma^{-1}x - x^{\mathrm T}\Sigma^{-1}\mu_1 - \mu_1^{\mathrm T}\Sigma^{-1}x + \mu_1^{\mathrm T}\Sigma^{-1}\mu_1 - x^{\mathrm T}\Sigma^{-1}x + x^{\mathrm T}\Sigma^{-1}\mu_2 + \mu_2^{\mathrm T}\Sigma^{-1}x - \mu_2^{\mathrm T}\Sigma^{-1}\mu_2\right) + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\
&= (\mu_1 - \mu_2)^{\mathrm T}\Sigma^{-1}x - \frac{1}{2}\mu_1^{\mathrm T}\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^{\mathrm T}\Sigma^{-1}\mu_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}.
\end{aligned}$$

Substituting this into the rightmost form of (5.40) we obtain (5.48), with w and w₀ given by (5.49) and (5.50), respectively.

5.13 The likelihood function is given by

$$p\left(\{\phi_n, t_n\}|\{\pi_k\}\right) = \prod_{n=1}^{N}\prod_{k=1}^{K}\left\{p(\phi_n|\mathcal{C}_k)\,\pi_k\right\}^{t_{nk}}$$

and taking the logarithm, we obtain

$$\ln p\left(\{\phi_n, t_n\}|\{\pi_k\}\right) = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left\{\ln p(\phi_n|\mathcal{C}_k) + \ln \pi_k\right\}. \qquad (150)$$

In order to maximize the log likelihood with respect to π_k we need to preserve the constraint Σ_k π_k = 1. This can be done by introducing a Lagrange multiplier λ and maximizing

$$\ln p\left(\{\phi_n, t_n\}|\{\pi_k\}\right) + \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right). \qquad (151)$$

Setting the derivative with respect to π_k equal to zero, we obtain

$$\sum_{n=1}^{N}\frac{t_{nk}}{\pi_k} + \lambda = 0. \qquad (152)$$

Re-arranging then gives

$$-\pi_k \lambda = \sum_{n=1}^{N} t_{nk} = N_k. \qquad (153)$$

Summing both sides over k we find that λ = −N, and using this to eliminate λ we obtain (5.101).

5.14 If we substitute (5.102) into (150) and then use the definition of the multivariate Gaussian, (3.26), we obtain

$$\ln p\left(\{\phi_n, t_n\}|\{\pi_k\}\right) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left\{\ln|\Sigma| + (\phi_n - \mu_k)^{\mathrm T}\Sigma^{-1}(\phi_n - \mu_k)\right\}, \qquad (154)$$

where we have dropped terms independent of {μ_k} and Σ.

Setting the derivative of the r.h.s. of (154) w.r.t. μ_k, obtained by using (A.19), to zero, we get

$$\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,\Sigma^{-1}(\phi_n - \mu_k) = 0. \qquad (155)$$

Making use of (153), we can re-arrange this to obtain (5.103).


Rewriting the r.h.s. of (154) as

$$-\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left\{\ln|\Sigma| + \mathrm{Tr}\left[\Sigma^{-1}(\phi_n - \mu_k)(\phi_n - \mu_k)^{\mathrm T}\right]\right\}, \qquad (156)$$

we can use (A.24) and (A.28) to calculate the derivative w.r.t. Σ⁻¹. Setting this to zero we obtain

$$\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left\{\Sigma - (\phi_n - \mu_k)(\phi_n - \mu_k)^{\mathrm T}\right\} = 0. \qquad (157)$$

Again making use of (153), we can re-arrange this to obtain (5.104), with Sk given
by (5.105).
Note that, as in Exercise 3.28, we do not enforce that Σ should be symmetric, but
simply note that the solution is automatically symmetric.

5.15 We assume that the training set consists of data points xn each of which is labelled
with the associated class Ck . This allows the parameters {µki } to be fitted for each
class independently. From (5.64) the log likelihood function for class Ck is then
given by

$$\ln p(\mathcal{D}|\mathcal{C}_k) = \sum_{n=1}^{N}\sum_{i=1}^{D}\left\{x_{ni}\ln\mu_{ki} + (1 - x_{ni})\ln(1 - \mu_{ki})\right\}. \qquad (158)$$

Setting the derivative with respect to μ_{ki} equal to zero gives

$$0 = \sum_{n=1}^{N}\left\{\frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}}\right\}. \qquad (159)$$

Rearranging to solve for μ_{ki} we finally obtain

$$\mu_{ki} = \frac{1}{N}\sum_{n=1}^{N} x_{ni} \qquad (160)$$

which is the intuitively pleasing result that, for each class k and for each component
i, the value of µki is given by the average of the values of the corresponding compo-
nents xni of those data vectors that belong to class Ck . Since the xni are binary this
is just the fraction of data points for which the corresponding value of i is equal to
one.
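The closed-form estimate (160) is just a per-class average of the binary features. A minimal sketch (the function name and toy data are ours, assuming NumPy is available):

```python
import numpy as np

def fit_bernoulli_means(X, labels):
    # mu[k, i] = fraction of class-k points with x_i = 1, i.e. equation (160)
    # applied separately to the data points belonging to each class.
    classes = np.unique(labels)
    return np.stack([X[labels == k].mean(axis=0) for k in classes])
```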
5.16 The generative model for φ corresponding to the chosen coding scheme is given by

$$p(\phi|\mathcal{C}_k) = \prod_{m=1}^{M} p(\phi_m|\mathcal{C}_k) \qquad (161)$$

where

$$p(\phi_m|\mathcal{C}_k) = \prod_{l=1}^{L}\mu_{kml}^{\phi_{ml}}, \qquad (162)$$

where in turn {μ_{kml}} are the parameters of the multinomial models for φ.
Substituting this into (5.46) we see that

ak = ln p (φ | Ck ) p (Ck )
M
X
= ln p (Ck ) + ln p (φm | Ck )
m=1
M X
X L
= ln p (Ck ) + φml ln µkml , (163)
m=1 l=1

which is linear in φml .



5.17 We denote the data set by D = {φnml } where n = 1, . . . , N . From the naive Bayes
assumption we can fit each class Ck separately to the training data. For class Ck the
log likelihood function takes the form

$$\ln p(\mathcal{D}|\mathcal{C}_k) = \sum_{n=1}^{N}\sum_{m=1}^{M}\sum_{l=1}^{L}\phi_{nml}\ln\mu_{kml}. \qquad (164)$$

Note that the parameter µkml represents the probability that for class Ck the compo-
nent m will have its non-zero element in position l. In order to find the maximum
likelihood solution we need to take account of the constraint that these probabilities
must sum to one, separately for each value of m, so that

$$\sum_{l=1}^{L}\mu_{kml} = 1. \qquad (165)$$

We can handle this by introducing Lagrange multipliers, one per component, and then maximize the modified likelihood function given by

$$\sum_{n=1}^{N}\sum_{m=1}^{M}\sum_{l=1}^{L}\phi_{nml}\ln\mu_{kml} + \sum_{m=1}^{M}\lambda_m\left(\sum_{l=1}^{L}\mu_{kml} - 1\right). \qquad (166)$$

Setting the derivative with respect to μ_{kml} equal to zero gives

$$0 = \sum_{n=1}^{N}\frac{\phi_{nml}}{\mu_{kml}} + \lambda_m. \qquad (167)$$

Rearranging to solve for μ_{kml} we obtain

$$\mu_{kml} = -\frac{1}{\lambda_m}\sum_{n=1}^{N}\phi_{nml}. \qquad (168)$$

To find the Lagrange multipliers we substitute this result into the constraint (165) and rearrange to give

$$\lambda_m = -\sum_{n=1}^{N}\sum_{l=1}^{L}\phi_{nml}. \qquad (169)$$

We now use this to replace the Lagrange multiplier in (168) to give the final result for the maximum likelihood solution for the parameters in the form

$$\mu_{kml} = \frac{\displaystyle\sum_{n=1}^{N}\phi_{nml}}{\displaystyle\sum_{n=1}^{N}\sum_{l=1}^{L}\phi_{nml}}. \qquad (170)$$

5.18 Differentiating (5.42) we obtain

$$\frac{\mathrm{d}\sigma}{\mathrm{d}a} = \frac{e^{-a}}{(1 + e^{-a})^2} = \sigma(a)\left(\frac{e^{-a}}{1 + e^{-a}}\right) = \sigma(a)\left(\frac{1 + e^{-a}}{1 + e^{-a}} - \frac{1}{1 + e^{-a}}\right) = \sigma(a)\big(1 - \sigma(a)\big).$$

5.19 We start by computing the derivative of (5.74) w.r.t. y_n

$$\frac{\partial E}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n} = \frac{y_n(1 - t_n) - t_n(1 - y_n)}{y_n(1 - y_n)} = \frac{y_n - y_n t_n - t_n + y_n t_n}{y_n(1 - y_n)} = \frac{y_n - t_n}{y_n(1 - y_n)}. \qquad (173)$$

From (5.72), we see that

$$\frac{\partial y_n}{\partial a_n} = \frac{\partial \sigma(a_n)}{\partial a_n} = \sigma(a_n)\big(1 - \sigma(a_n)\big) = y_n(1 - y_n). \qquad (174)$$

Finally, we have

$$\nabla a_n = \phi_n \qquad (175)$$
where ∇ denotes the gradient with respect to w. Combining (173), (174) and (175)
using the chain rule, we obtain
$$\nabla E = \sum_{n=1}^{N}\frac{\partial E}{\partial y_n}\frac{\partial y_n}{\partial a_n}\nabla a_n = \sum_{n=1}^{N}(y_n - t_n)\phi_n$$

as required.
5.20 If the data set is linearly separable, any decision boundary separating the two classes
will have the property

$$w^{\mathrm T}\phi_n \;\begin{cases} > 0 & \text{if } t_n = 1, \\ < 0 & \text{otherwise.} \end{cases} \qquad (176)$$

Moreover, from (5.74) we see that the negative log-likelihood will be minimized
(i.e., the likelihood maximized) when yn = σ (wT φn ) = tn for all n. This will be
the case when the sigmoid function is saturated, which occurs when its argument,
wT φ, goes to ±∞, i.e., when the magnitude of w goes to infinity.

5.21 From (5.76) we have

$$\frac{\partial y_k}{\partial a_k} = \frac{e^{a_k}}{\sum_i e^{a_i}} - \left(\frac{e^{a_k}}{\sum_i e^{a_i}}\right)^2 = y_k(1 - y_k),$$

$$\frac{\partial y_k}{\partial a_j} = -\frac{e^{a_k} e^{a_j}}{\left(\sum_i e^{a_i}\right)^2} = -y_k y_j, \qquad j \neq k.$$

Combining these results we obtain (5.78).

5.22 From (5.80) we have


$$\frac{\partial E}{\partial y_{nk}} = -\frac{t_{nk}}{y_{nk}}. \qquad (177)$$

If we combine this with (5.78) using the chain rule, we get

$$\frac{\partial E}{\partial a_{nj}} = \sum_{k=1}^{K}\frac{\partial E}{\partial y_{nk}}\frac{\partial y_{nk}}{\partial a_{nj}} = -\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj} - y_{nj}) = y_{nj} - t_{nj},$$

where we have used the fact that Σ_k t_{nk} = 1 for all n.

If we combine this with (175), again using the chain rule, we obtain (5.81).

5.23 We consider the two cases where a ⩾ 0 and a < 0 separately. In the first case, we can use (3.25) to rewrite (5.86) as

$$\begin{aligned}
\Phi(a) &= \int_{-\infty}^{0}\mathcal{N}(\theta|0,1)\,\mathrm{d}\theta + \int_{0}^{a}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta^2}{2}\right)\mathrm{d}\theta \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{a/\sqrt{2}}\exp\left(-u^2\right)\sqrt{2}\,\mathrm{d}u \\
&= \frac{1}{2}\left\{1 + \mathrm{erf}\left(\frac{a}{\sqrt{2}}\right)\right\}, \qquad (178)
\end{aligned}$$

where, in the last line, we have used (5.87).

When a < 0, the symmetry of the Gaussian distribution gives

$$\Phi(a) = 1 - \Phi(-a). \qquad (179)$$



Combining this with (178), we get

$$\Phi(a) = 1 - \frac{1}{2}\left\{1 + \mathrm{erf}\left(-\frac{a}{\sqrt{2}}\right)\right\} = \frac{1}{2}\left\{1 + \mathrm{erf}\left(\frac{a}{\sqrt{2}}\right)\right\},$$

where we have used the fact that the erf function is anti-symmetric, i.e., erf(−a) = −erf(a).
5.24 From (5.72) we have that

$$\left.\frac{\mathrm{d}\sigma}{\mathrm{d}a}\right|_{a=0} = \sigma(0)\big(1 - \sigma(0)\big) = \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{1}{4}. \qquad (180)$$

Since the derivative of a cumulative distribution function is simply the corresponding density function, (5.86) gives

$$\left.\frac{\mathrm{d}\Phi(\lambda a)}{\mathrm{d}a}\right|_{a=0} = \lambda\,\mathcal{N}(0|0,1) = \frac{\lambda}{\sqrt{2\pi}}.$$

Setting this equal to (180), we see that

$$\lambda = \frac{\sqrt{2\pi}}{4} \quad\text{or equivalently}\quad \lambda^2 = \frac{\pi}{8}. \qquad (181)$$
The comparison of the logistic sigmoid function and the scaled probit function is
illustrated in Figure 5.12.
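The slope-matching condition can be verified numerically; the finite-difference step below is an illustrative choice. With λ² = π/8 both functions have slope 1/4 at the origin.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def probit(a):
    # standard normal CDF Phi(a), via the erf relation (178)
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

lam = math.sqrt(math.pi / 8.0)
h = 1e-6
slope_sigmoid = (sigmoid(h) - sigmoid(-h)) / (2.0 * h)
slope_scaled_probit = (probit(lam * h) - probit(-lam * h)) / (2.0 * h)
```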

Chapter 6 Deep Neural Networks

6.1 On the right-hand side of (6.51) we make the change of variables u = r² to give

$$\frac{1}{2}\,S_D\int_{0}^{\infty} e^{-u}\,u^{D/2 - 1}\,\mathrm{d}u = \frac{1}{2}\,S_D\,\Gamma(D/2) \qquad (182)$$

where we have used the definition (??) of the Gamma function. On the left-hand side of (6.51) we can use (2.126) to obtain π^{D/2}. Equating these we obtain the desired result (6.53).

The volume of a sphere of radius 1 in D dimensions is obtained by integration

$$V_D = S_D\int_{0}^{1} r^{D-1}\,\mathrm{d}r = \frac{S_D}{D}. \qquad (183)$$

For D = 2 and D = 3 we obtain the following results

$$S_2 = 2\pi, \quad S_3 = 4\pi, \quad V_2 = \pi, \quad V_3 = \frac{4}{3}\pi. \qquad (184)$$

6.2 The volume of the cube is (2a)^D. Combining this with (6.53) and (6.54) we obtain (6.55). Using Stirling's formula (6.56) in (6.55) the ratio becomes, for large D,

$$\frac{\text{volume of sphere}}{\text{volume of cube}} \simeq \left(\frac{\pi e}{2D}\right)^{D/2}\frac{1}{(\pi D)^{1/2}} \qquad (185)$$

which goes to 0 as D → ∞. The distance from the centre of the cube to the mid-point of one of the sides is a, since this is where it makes contact with the sphere. Similarly, the distance to one of the corners is a√D from Pythagoras' theorem. Thus the ratio of these distances is √D.
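The collapse of the sphere-to-cube ratio can be made concrete with (6.53) and (6.54); this sketch (our own function name) evaluates the exact ratio for the unit-radius ball inside its bounding cube of side 2.

```python
import math

def sphere_to_cube_ratio(D):
    # Exact ratio of the unit-radius D-ball volume to the volume of the
    # enclosing cube of side 2, using S_D = 2 pi^{D/2} / Gamma(D/2).
    s_d = 2.0 * math.pi ** (D / 2.0) / math.gamma(D / 2.0)
    v_d = s_d / D
    return v_d / 2.0 ** D
```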

6.3 Since p(x) is radially symmetric it will be roughly constant over the shell of radius r and thickness ε. This shell has volume S_D r^{D−1} ε and since ‖x‖² = r² we have

$$\int_{\text{shell}} p(x)\,\mathrm{d}x \simeq p(r)\,S_D\,r^{D-1}\epsilon \qquad (186)$$

from which we obtain (6.58). We can find the stationary points of p(r) by differentiation

$$\frac{\mathrm{d}}{\mathrm{d}r}\left[r^{D-1}\exp\left(-\frac{r^2}{2\sigma^2}\right)\right] \propto \left[(D-1)r^{D-2} + r^{D-1}\left(-\frac{r}{\sigma^2}\right)\right]\exp\left(-\frac{r^2}{2\sigma^2}\right) = 0. \qquad (187)$$

Solving for r, and using D ≫ 1, we obtain $\widehat{r} \simeq \sqrt{D}\,\sigma$.

Next we note that

$$p(\widehat{r} + \epsilon) \propto (\widehat{r} + \epsilon)^{D-1}\exp\left[-\frac{(\widehat{r} + \epsilon)^2}{2\sigma^2}\right] = \exp\left[-\frac{(\widehat{r} + \epsilon)^2}{2\sigma^2} + (D-1)\ln(\widehat{r} + \epsilon)\right]. \qquad (188)$$

We now expand p(r) around the point $\widehat{r}$. Since this is a stationary point of p(r) we must keep terms up to second order. Making use of the expansion ln(1 + x) = x − x²/2 + O(x³), together with D ≫ 1, we obtain (6.59).
Finally, from (6.57) we see that the probability density at the origin is given by

$$p(x = 0) = \frac{1}{(2\pi\sigma^2)^{D/2}}$$

while the density at $\|x\| = \widehat{r}$ is given from (6.57) by

$$p(\|x\| = \widehat{r}) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\left(-\frac{\widehat{r}^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\left(-\frac{D}{2}\right)$$

where we have used $\widehat{r}^2 \simeq D\sigma^2$. Thus the ratio of densities is given by exp(D/2).
6.4 Using the definition of the tanh function we have

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = \frac{1 - e^{-2a}}{1 + e^{-2a}} = \frac{2}{1 + e^{-2a}} - \frac{1 + e^{-2a}}{1 + e^{-2a}} = \frac{2}{1 + e^{-2a}} - 1 = 2\sigma(2a) - 1 \qquad (189)$$

where we have made use of the definition of the sigmoid function in (6.60). Rearranging we obtain

$$\sigma(a) = \frac{1}{2}\left(\tanh(a/2) + 1\right). \qquad (190)$$
2
For the case of a logistic sigmoid activation function, the argument of the output-unit activation function in (6.11) is given by

$$\sum_{j=0}^{M} w_{kj}^{(2)}\,\sigma\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right). \qquad (191)$$

Figure 2 The swish activation function plotted for values of β = 0.1, 1 and 10

Substituting for the sigmoid function using (190) we obtain

$$\sum_{j=0}^{M} \widetilde{w}_{kj}^{(2)}\tanh\!\left(\sum_{i=0}^{D} \widetilde{w}_{ji}^{(1)} x_i\right) \qquad (192)$$

where we have defined

$$\widetilde{w}_{kj}^{(2)} = \frac{1}{2}\,w_{kj}^{(2)}, \qquad j = 1, \ldots, M \qquad (193)$$

$$\widetilde{w}_{k0}^{(2)} = w_{k0}^{(2)} + \frac{1}{2}\sum_{j=1}^{M} w_{kj}^{(2)} \qquad (194)$$

$$\widetilde{w}_{ji}^{(1)} = \frac{1}{2}\,w_{ji}^{(1)} \qquad (195)$$

which again takes the form (6.11).
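Both identities, (189) and (190), are easy to confirm numerically; the sketch below uses only the standard library (function names are ours).

```python
import math

def sigmoid(a):
    # logistic sigmoid (6.60)
    return 1.0 / (1.0 + math.exp(-a))

def tanh_via_sigmoid(a):
    # identity (189): tanh(a) = 2 sigma(2a) - 1
    return 2.0 * sigmoid(2.0 * a) - 1.0

def sigmoid_via_tanh(a):
    # identity (190): sigma(a) = (tanh(a/2) + 1) / 2
    return 0.5 * (math.tanh(a / 2.0) + 1.0)
```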

6.5 Using the definition of the logistic sigmoid, the swish activation function can be written as

$$\mathrm{sw}(x) = \frac{x}{1 + \exp(-\beta x)}. \qquad (196)$$

This is plotted for β = 0.1, β = 1.0, and β = 10 in Figure 2. In the limit β → ∞ we


can consider the behaviour of the swish function separately for positive and negative
values of x. For x > 0 the function exp(−βx) → 0 and hence sw(x) → x, while
for x < 0 the function exp(−βx) → ∞ and hence sw(x) → 0. Thus, in this limit
the swish function becomes a ReLU function.
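The limiting behaviour is easy to see numerically: for large β the swish output is already indistinguishable from ReLU at moderate |x| (β = 50 here is an arbitrary illustrative value).

```python
import math

def swish(x, beta):
    # swish activation x * sigma(beta * x), in the form (196)
    return x / (1.0 + math.exp(-beta * x))
```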

6.6 From (6.14), using standard derivatives, we get

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}a}\tanh(a) &= \frac{\mathrm{d}}{\mathrm{d}a}\,\frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \\
&= \frac{(e^{a} + e^{-a})(e^{a} + e^{-a}) - (e^{a} - e^{-a})(e^{a} - e^{-a})}{(e^{a} + e^{-a})^2} \\
&= 1 - \tanh^2(a).
\end{aligned}$$

6.7 The softplus activation function is given by

ζ(a) = ln (1 + exp(a)) . (197)

We can prove the property (6.62) using the following steps

ζ(a) − ζ(−a) = ln (1 + exp(a)) − ln (1 + exp(−a)) (198)


= ln [exp(a) (exp(−a) + 1)] − ln (1 + exp(−a)) (199)
= a + ln (exp(−a) + 1) − ln (1 + exp(−a)) (200)
= a. (201)

To prove the property (6.63) we have

ln σ(a) = − ln (1 + exp(−a)) = −ζ(−a). (202)

For the derivative (6.64) of the softplus function we have


d d
ζ(a) = ln (1 + exp(a))
da da
exp(a)
=
1 + exp(a)
1
= = σ(a). (203)
1 + exp(−a)

Finally, to find the inverse (6.65) of the softplus function let y = ζ −1 (a), then

a = ζ(y) = ln (1 + exp(y)) . (204)

Rearranging we obtain
exp(a) = 1 + exp(y) (205)
and hence
y = ln(exp(a) − 1). (206)
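The softplus properties above can be confirmed numerically (the finite-difference step is an illustrative choice; function names are ours).

```python
import math

def softplus(a):
    # zeta(a) = ln(1 + exp(a)), equation (197)
    return math.log1p(math.exp(a))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softplus_inv(a):
    # inverse (6.65), defined for a > 0
    return math.log(math.expm1(a))
```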

6.8 Differentiating the error (6.25) with respect to σ² and setting the derivative to zero gives

$$0 = -\frac{1}{2\sigma^4}\sum_{n=1}^{N}\left\{y(x_n, w) - t_n\right\}^2 + \frac{N}{2}\,\frac{1}{\sigma^2}. \qquad (207)$$

Rearranging to solve for σ² we obtain

$$\sigma^2 = \frac{1}{N}\sum_{n=1}^{N}\left\{y(x_n, w^{\star}) - t_n\right\}^2 \qquad (208)$$

as required.
6.9 The likelihood function for an i.i.d. data set, {(x1 , t1 ), . . . , (xN , tN )}, under the
conditional distribution (6.28) is given by
$$\prod_{n=1}^{N}\mathcal{N}\left(t_n|y(x_n, w),\,\beta^{-1}I\right).$$

If we take the logarithm of this, using (3.26), we get

$$\sum_{n=1}^{N}\ln\mathcal{N}\left(t_n|y(x_n, w),\,\beta^{-1}I\right) = -\frac{1}{2}\sum_{n=1}^{N}(t_n - y(x_n, w))^{\mathrm T}(\beta I)(t_n - y(x_n, w)) + \text{const} = -\frac{\beta}{2}\sum_{n=1}^{N}\|t_n - y(x_n, w)\|^2 + \text{const},$$

where ‘const’ comprises terms which are independent of w. The first term on the
right hand side is proportional to the negative of (6.29) and hence maximizing the
log-likelihood is equivalent to minimizing the sum-of-squares error.
6.10 In this case, the likelihood function becomes
$$p(T|X, w, \Sigma) = \prod_{n=1}^{N}\mathcal{N}\left(t_n|y(x_n, w),\,\Sigma\right),$$

with the corresponding log-likelihood function

$$\ln p(T|X, w, \Sigma) = -\frac{N}{2}\left(\ln|\Sigma| + K\ln(2\pi)\right) - \frac{1}{2}\sum_{n=1}^{N}(t_n - y_n)^{\mathrm T}\Sigma^{-1}(t_n - y_n), \qquad (209)$$

where yn = y(xn , w) and K is the dimensionality of y and t.



If we first treat Σ as fixed and known, we can drop terms that are independent of w
from (209), and by changing the sign we get the error function
$$E(w) = \frac{1}{2}\sum_{n=1}^{N}(t_n - y_n)^{\mathrm T}\Sigma^{-1}(t_n - y_n).$$

If we consider maximizing (209) w.r.t. Σ, the terms that need to be kept are

$$-\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(t_n - y_n)^{\mathrm T}\Sigma^{-1}(t_n - y_n).$$

By rewriting the second term we get

$$-\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\,\mathrm{Tr}\left[\Sigma^{-1}\sum_{n=1}^{N}(t_n - y_n)(t_n - y_n)^{\mathrm T}\right].$$

Using results from Appendix ??, we can maximize this by setting the derivative w.r.t. Σ⁻¹ to zero, yielding

$$\Sigma = \frac{1}{N}\sum_{n=1}^{N}(t_n - y_n)(t_n - y_n)^{\mathrm T}.$$

Thus the optimal value for Σ depends on w through yn .


A possible way to address this mutual dependency between w and Σ when it comes
to optimization, is to adopt an iterative scheme, alternating between updates of w
and Σ until some convergence criterion is reached.
6.11 Let t ∈ {0, 1} denote the data set label and let k ∈ {0, 1} denote the true class label. We want the network output to have the interpretation y(x, w) = p(k = 1|x). From the rules of probability we have

$$p(t = 1|x) = \sum_{k=0}^{1} p(t = 1|k)\,p(k|x) = (1 - \epsilon)\,y(x, w) + \epsilon\,\big(1 - y(x, w)\big).$$

The conditional probability of the data label is then

$$p(t|x) = p(t = 1|x)^{t}\,\big(1 - p(t = 1|x)\big)^{1-t}.$$

Forming the likelihood and taking the negative logarithm we then obtain the error function in the form

$$E(w) = -\sum_{n=1}^{N}\Big\{t_n\ln\left[(1 - \epsilon)\,y(x_n, w) + \epsilon\,(1 - y(x_n, w))\right] + (1 - t_n)\ln\left[1 - (1 - \epsilon)\,y(x_n, w) - \epsilon\,(1 - y(x_n, w))\right]\Big\}.$$
See also Solution ??.

6.12 This simply corresponds to a scaling and shifting of the binary outputs, which directly gives the activation function, using the notation from (??), in the form

$$y = 2\sigma(a) - 1.$$

The corresponding error function can be constructed from (6.33) by applying the inverse transform to y_n and t_n, yielding

$$\begin{aligned}
E(w) &= -\sum_{n=1}^{N}\left\{\frac{1 + t_n}{2}\ln\left(\frac{1 + y_n}{2}\right) + \left(1 - \frac{1 + t_n}{2}\right)\ln\left(1 - \frac{1 + y_n}{2}\right)\right\} \\
&= -\frac{1}{2}\sum_{n=1}^{N}\left\{(1 + t_n)\ln(1 + y_n) + (1 - t_n)\ln(1 - y_n)\right\} + N\ln 2
\end{aligned}$$

where the last term can be dropped, since it is independent of w.

To find the corresponding activation function we simply apply the linear transformation to the logistic sigmoid given by (??), which gives

$$y(a) = 2\sigma(a) - 1 = \frac{2}{1 + e^{-a}} - 1 = \frac{1 - e^{-a}}{1 + e^{-a}} = \frac{e^{a/2} - e^{-a/2}}{e^{a/2} + e^{-a/2}} = \tanh(a/2).$$

6.13 For the given interpretation of y_k(x, w), the conditional distribution of the target vector for a multiclass neural network is

$$p(t|w_1, \ldots, w_K) = \prod_{k=1}^{K} y_k^{t_k}.$$

Thus, for a data set of N points, the likelihood function will be

$$p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}.$$

Taking the negative logarithm in order to derive an error function we obtain (6.36) as required. Note that this is the same result as for the multiclass logistic regression model, given by (5.80).

6.14 Differentiating (6.33) with respect to the activation a_n corresponding to a particular data point n, we obtain

$$\frac{\partial E}{\partial a_n} = -t_n\frac{1}{y_n}\frac{\partial y_n}{\partial a_n} + (1 - t_n)\frac{1}{1 - y_n}\frac{\partial y_n}{\partial a_n}. \qquad (210)$$

From (5.72), we have

$$\frac{\partial y_n}{\partial a_n} = y_n(1 - y_n). \qquad (211)$$

Substituting (211) into (210), we get

$$\frac{\partial E}{\partial a_n} = -t_n\frac{y_n(1 - y_n)}{y_n} + (1 - t_n)\frac{y_n(1 - y_n)}{1 - y_n} = y_n - t_n$$

as required.
6.15 Consider a specific data point n and, to minimize clutter, omit the suffix n on variables such as a_k and y_k. We can use the chain rule of calculus to write

$$\frac{\partial E}{\partial a_k} = \sum_{j=1}^{K}\frac{\partial E}{\partial y_j}\frac{\partial y_j}{\partial a_k}. \qquad (212)$$

From (6.36) we have

$$\frac{\partial E}{\partial y_j} = -\frac{t_j}{y_j}. \qquad (213)$$

We can write (6.37) in the form

$$y_j = \frac{\exp(a_j)}{\displaystyle\sum_{l}\exp(a_l)}. \qquad (214)$$

For the derivative ∂y_j/∂a_k there are two contributions, one from the numerator and one from the denominator, so that

$$\frac{\partial y_j}{\partial a_k} = \frac{\exp(a_j)\,\delta_{jk}}{\sum_l \exp(a_l)} - \frac{\exp(a_j)\exp(a_k)}{\left\{\sum_l \exp(a_l)\right\}^2} = y_j\delta_{jk} - y_j y_k. \qquad (215)$$

Substituting (213) and (215) into (212) we then have

$$\frac{\partial E}{\partial a_k} = -\sum_{j=1}^{K}\frac{t_j}{y_j}\left\{y_j\delta_{jk} - y_j y_k\right\} = y_k - t_k \qquad (216)$$

as required. In the final step we have used

$$\sum_{j=1}^{K} t_j = 1 \qquad (217)$$

which follows from the 1-of-K coding scheme used for the {t_j}.

6.16 From standard trigonometric rules we get the position of the end of the first arm,

$$\left(x_1^{(1)}, x_2^{(1)}\right) = \big(L_1\cos(\theta_1),\; L_1\sin(\theta_1)\big).$$

Similarly, the position of the end of the second arm relative to the end of the first arm is given by the corresponding equation, with an angle offset of π (see Figure 6.16), which equals a change of sign

$$\left(x_1^{(2)}, x_2^{(2)}\right) = \big(L_2\cos(\theta_1 + \theta_2 - \pi),\; L_2\sin(\theta_1 + \theta_2 - \pi)\big) = -\big(L_2\cos(\theta_1 + \theta_2),\; L_2\sin(\theta_1 + \theta_2)\big).$$

Putting this together, we must also take into account that θ₂ is measured relative to the first arm, and so we get the position of the end of the second arm relative to the attachment point of the first arm as

$$(x_1, x_2) = \big(L_1\cos(\theta_1) - L_2\cos(\theta_1 + \theta_2),\; L_1\sin(\theta_1) - L_2\sin(\theta_1 + \theta_2)\big).$$

6.17 The interpretation of γnk as a posterior probability follows from Bayes’ theorem for
the probability of the component indexed by k, given observed data t, in which all
quantities are also conditioned on the input variable x. Therefore x simply appears as
a conditioning variable in the right-hand side of all quantities. From Bayes’ theorem
we have

$$p(k|t, x) = \frac{p(t|k, x)\,p(k|x)}{p(t|x)} \qquad (218)$$

where, as usual, the denominator can be expressed as a marginalization over the terms in the numerator, so that

$$p(t|x) = \sum_{l} p(t|l, x)\,p(l|x). \qquad (219)$$

The quantities π_k(x) defined by (6.40) satisfy (6.39) and hence meet the requirements to be viewed as probabilities, and so we equate p(k|x) = π_k(x). Similarly, the class-conditional distribution p(t|k, x) is given by the Gaussian $\mathcal{N}_{nk} = \mathcal{N}(t_n|\mu_k(x_n), \sigma_k^2(x_n))$. Substituting into (218) then gives

$$p(k|t_n, x_n) = \frac{\pi_k\,\mathcal{N}_{nk}}{\sum_l \pi_l\,\mathcal{N}_{nl}} = \gamma_{nk} \qquad (220)$$
as required.

6.18 We start by using the chain rule to write

$$\frac{\partial E_n}{\partial a_k^{\pi}} = \sum_{j=1}^{K}\frac{\partial E_n}{\partial \pi_j}\frac{\partial \pi_j}{\partial a_k^{\pi}}. \qquad (221)$$

Note that because of the coupling between outputs caused by the softmax activation
function, the dependence on the activation of a single output unit involves all the
output units.
For the first factor inside the sum on the r.h.s. of (221), standard derivatives applied to the nth term of (6.43) gives

$$\frac{\partial E_n}{\partial \pi_j} = -\frac{\mathcal{N}_{nj}}{\sum_{l=1}^{K}\pi_l\,\mathcal{N}_{nl}} = -\frac{\gamma_{nj}}{\pi_j}. \qquad (222)$$

For the second factor, we have from (5.78) that

$$\frac{\partial \pi_j}{\partial a_k^{\pi}} = \pi_j(I_{jk} - \pi_k). \qquad (223)$$

Combining (221), (222) and (223), we get

$$\frac{\partial E_n}{\partial a_k^{\pi}} = -\sum_{j=1}^{K}\frac{\gamma_{nj}}{\pi_j}\,\pi_j(I_{jk} - \pi_k) = -\gamma_{nk} + \sum_{j=1}^{K}\gamma_{nj}\,\pi_k = \pi_k - \gamma_{nk},$$

where we have used the fact that, by (6.44), $\sum_{j=1}^{K}\gamma_{nj} = 1$ for all n.
6.19 Note: see Solution 6.18.

From (6.42) we have $a_{kl}^{\mu} = \mu_{kl}$ and thus

$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \frac{\partial E_n}{\partial \mu_{kl}}.$$

From (3.26), (6.43) and (6.44), we get

$$\frac{\partial E_n}{\partial \mu_{kl}} = -\frac{\pi_k\,\mathcal{N}_{nk}}{\sum_{k'}\pi_{k'}\,\mathcal{N}_{nk'}}\,\frac{t_{nl} - \mu_{kl}}{\sigma_k^2(x_n)} = \gamma_{nk}(t_n|x_n)\,\frac{\mu_{kl} - t_{nl}}{\sigma_k^2(x_n)}.$$

6.20 From (6.41) and (6.43), we see that

$$\frac{\partial E_n}{\partial a_k^{\sigma}} = \frac{\partial E_n}{\partial \sigma_k}\frac{\partial \sigma_k}{\partial a_k^{\sigma}}, \qquad (224)$$

where, from (6.41),

$$\frac{\partial \sigma_k}{\partial a_k^{\sigma}} = \sigma_k. \qquad (225)$$

From (3.26), (6.43) and (6.44), we get

$$\begin{aligned}
\frac{\partial E_n}{\partial \sigma_k} &= -\frac{\pi_k}{\sum_{k'}\mathcal{N}_{nk'}\,\pi_{k'}}\left(\frac{1}{2\pi}\right)^{L/2}\left[-\frac{L}{\sigma_k^{L+1}}\exp\left(-\frac{\|t_n - \mu_k\|^2}{2\sigma_k^2}\right) + \frac{1}{\sigma_k^{L}}\exp\left(-\frac{\|t_n - \mu_k\|^2}{2\sigma_k^2}\right)\frac{\|t_n - \mu_k\|^2}{\sigma_k^3}\right] \\
&= \gamma_{nk}\left(\frac{L}{\sigma_k} - \frac{\|t_n - \mu_k\|^2}{\sigma_k^3}\right).
\end{aligned}$$

Combining this with (224) and (225), we get

$$\frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk}\left(L - \frac{\|t_n - \mu_k\|^2}{\sigma_k^2}\right).$$

6.21 From (3.42) and (6.38) we have

$$\mathbb{E}[t|x] = \int t\,p(t|x)\,\mathrm{d}t = \int t\sum_{k=1}^{K}\pi_k(x)\,\mathcal{N}\left(t|\mu_k(x), \sigma_k^2(x)\right)\mathrm{d}t = \sum_{k=1}^{K}\pi_k(x)\int t\,\mathcal{N}\left(t|\mu_k(x), \sigma_k^2(x)\right)\mathrm{d}t = \sum_{k=1}^{K}\pi_k(x)\,\mu_k(x).$$

We now introduce the shorthand notation

$$\bar{t}_k = \mu_k(x) \quad\text{and}\quad \bar{t} = \sum_{k=1}^{K}\pi_k(x)\,\bar{t}_k.$$

Using this together with (3.42), (3.46), (6.38) and (6.48), we get

$$\begin{aligned}
s^2(x) &= \mathbb{E}\left[\|t - \mathbb{E}[t|x]\|^2\,\big|\,x\right] = \int \|t - \bar{t}\|^2\,p(t|x)\,\mathrm{d}t \\
&= \int\left(t^{\mathrm T}t - t^{\mathrm T}\bar{t} - \bar{t}^{\mathrm T}t + \bar{t}^{\mathrm T}\bar{t}\right)\sum_{k=1}^{K}\pi_k\,\mathcal{N}\left(t|\mu_k(x), \sigma_k^2(x)\right)\mathrm{d}t \\
&= \sum_{k=1}^{K}\pi_k(x)\left\{\sigma_k^2 + \bar{t}_k^{\mathrm T}\bar{t}_k - \bar{t}_k^{\mathrm T}\bar{t} - \bar{t}^{\mathrm T}\bar{t}_k + \bar{t}^{\mathrm T}\bar{t}\right\} \\
&= \sum_{k=1}^{K}\pi_k(x)\left\{\sigma_k^2 + \|\bar{t}_k - \bar{t}\|^2\right\} \\
&= \sum_{k=1}^{K}\pi_k(x)\left\{\sigma_k^2 + \Big\|\mu_k(x) - \sum_{l}\pi_l\,\mu_l(x)\Big\|^2\right\}.
\end{aligned}$$
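The mixture mean and variance can be checked against direct moment computation; this is a sketch with our own function name, for scalar targets, assuming NumPy.

```python
import numpy as np

def mixture_mean_var(pi, mu, sigma2):
    # E[t|x] = sum_k pi_k mu_k and
    # s^2(x) = sum_k pi_k (sigma_k^2 + (mu_k - mean)^2), as derived above.
    pi, mu, sigma2 = map(np.asarray, (pi, mu, sigma2))
    mean = np.sum(pi * mu)
    s2 = np.sum(pi * (sigma2 + (mu - mean) ** 2))
    return mean, s2
```

An equivalent check is via second moments: s² = E[t²] − (E[t])² with E[t²] = Σ_k π_k(σ_k² + μ_k²).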

Chapter 7 Gradient Descent

7.1 Substituting (7.10) into (7.7) we obtain

$$E(w) = E(w^{\star}) + \frac{1}{2}\left(\sum_i \alpha_i u_i^{\mathrm T}\right)H\left(\sum_j \alpha_j u_j\right).$$

Making use of (7.8) then gives

$$E(w) = E(w^{\star}) + \frac{1}{2}\left(\sum_i \alpha_i u_i^{\mathrm T}\right)\left(\sum_j \lambda_j \alpha_j u_j\right).$$

Now making use of (7.9) we have

$$E(w) = E(w^{\star}) + \frac{1}{2}\sum_i\sum_j \alpha_i \lambda_j \alpha_j \delta_{ij} = E(w^{\star}) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2 \qquad (226)$$

as required.
7.2 From (7.8) and (7.10) we have

$$u_i^{\mathrm T} H u_i = u_i^{\mathrm T}\lambda_i u_i = \lambda_i.$$

Assume that H is positive definite, so that (7.12) holds. Then by setting v = u_i it follows that

$$\lambda_i = u_i^{\mathrm T} H u_i > 0$$

for all values of i. Thus, if H is positive definite, all of its eigenvalues will be positive.

Conversely, assume that all of the eigenvalues satisfy λ_i > 0. Then, for any vector v, we can make use of (7.13) to give

$$v^{\mathrm T} H v = \left(\sum_i c_i u_i\right)^{\mathrm T} H\left(\sum_j c_j u_j\right) = \left(\sum_i c_i u_i\right)^{\mathrm T}\left(\sum_j \lambda_j c_j u_j\right) = \sum_i \lambda_i c_i^2 > 0$$

where we have used (7.8) and (7.9) along with the positivity of the eigenvalues. Thus, if all of the eigenvalues are positive, the Hessian matrix will be positive definite.

7.3 From (7.12) we see that, if H is positive definite, then the second term in (7.7) will
be positive whenever (w − w? ) is non-zero. Thus the smallest value which E(w)
can take is E(w? ), and so w? is the minimum of E(w). Conversely, if w? is the
minimum of E(w), then, for any vector w 6= w? , E(w) > E(w? ). This will only
be the case if the second term of (7.7) is positive for all values of w 6= w? (since
the first term is independent of w). Since w − w? can be set to any vector of real
numbers, it follows from the definition (7.12) that H must be positive definite.
7.4 The first derivatives of the error function are given by

$$\frac{\partial E}{\partial w} = \sum_n (y_n - t_n)x_n \qquad (227)$$

$$\frac{\partial E}{\partial b} = \sum_n (y_n - t_n) \qquad (228)$$

where y_n = y(x_n, w, b). The second derivatives are then given by

$$\frac{\partial^2 E}{\partial w^2} = \sum_n x_n^2 \qquad (229)$$

$$\frac{\partial^2 E}{\partial w\,\partial b} = \frac{\partial^2 E}{\partial b\,\partial w} = \sum_n x_n = N\bar{x} \qquad (230)$$

$$\frac{\partial^2 E}{\partial b^2} = \sum_n 1 = N \qquad (231)$$

where $\bar{x}$ is the sample mean defined by

$$\bar{x} = \frac{1}{N}\sum_n x_n.$$

The Hessian matrix is given by

$$H = \begin{pmatrix} \dfrac{\partial^2 E}{\partial w^2} & \dfrac{\partial^2 E}{\partial b\,\partial w} \\[2mm] \dfrac{\partial^2 E}{\partial w\,\partial b} & \dfrac{\partial^2 E}{\partial b^2} \end{pmatrix} = \begin{pmatrix} \displaystyle\sum_n x_n^2 & N\bar{x} \\[2mm] N\bar{x} & N \end{pmatrix}.$$

Note that the Hessian does not depend on the target values for this simple model, nor does it depend on the model parameters. It is a function only of the input data variables. The trace and determinant of the Hessian are given by

$$\mathrm{Tr}\,H = \sum_n x_n^2 + N \qquad (232)$$

$$\det H = N\sum_n x_n^2 - (N\bar{x})^2 = N^2\sigma^2 \qquad (233)$$

where σ² is the sample variance defined by

$$\sigma^2 = \frac{1}{N}\sum_n (x_n - \bar{x})^2.$$
We see that the trace and the determinant are positive. Since the determinant is the
product of the two eigenvalues of the Hessian it follows that either both eigenvalues
are positive or they are both negative. Since the trace is the sum of the eigenvalues,
and the trace is also positive, it follows that both eigenvalues must be positive and so
the Hessian must be positive definite. Here we ignore the degenerate case where the
number of data points is one or where all the data points are the same.
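The positive-definiteness argument can be checked directly: the sketch below (our helper name; assumes NumPy) builds the Hessian from a sample and confirms both eigenvalues are positive and that det H = N²σ².

```python
import numpy as np

def hessian_linear_fit(x):
    # Hessian of E = 0.5 * sum_n (w * x_n + b - t_n)^2; note it is
    # independent of the targets t_n and of the parameters (w, b).
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([[np.sum(x ** 2), n * x.mean()],
                     [n * x.mean(), float(n)]])
```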
7.5 Note that in the original printing of the book there is a minus sign missing from the right-hand side of (7.64). We first note that the derivative of the logistic sigmoid function is given by (5.18). Using this result, the first derivatives of the error function are given by

$$\frac{\partial E}{\partial w} = -\sum_n\left(\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right)y_n(1 - y_n)x_n = \sum_n (y_n - t_n)x_n, \qquad (234)$$

$$\frac{\partial E}{\partial b} = -\sum_n\left(\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right)y_n(1 - y_n) = \sum_n (y_n - t_n) \qquad (235)$$

where yn = y(xn , w, b). The second derivatives are then given by


$$\frac{\partial^2 E}{\partial w^2} = \sum_n y_n(1 - y_n)x_n^2 \qquad (236)$$

$$\frac{\partial^2 E}{\partial w\,\partial b} = \frac{\partial^2 E}{\partial b\,\partial w} = \sum_n y_n(1 - y_n)x_n \qquad (237)$$

$$\frac{\partial^2 E}{\partial b^2} = \sum_n y_n(1 - y_n). \qquad (238)$$
The Hessian matrix is given by

$$H = \begin{pmatrix} \dfrac{\partial^2 E}{\partial w^2} & \dfrac{\partial^2 E}{\partial b\,\partial w} \\[2mm] \dfrac{\partial^2 E}{\partial w\,\partial b} & \dfrac{\partial^2 E}{\partial b^2} \end{pmatrix} = \begin{pmatrix} \displaystyle\sum_n y_n(1 - y_n)x_n^2 & \displaystyle\sum_n y_n(1 - y_n)x_n \\[2mm] \displaystyle\sum_n y_n(1 - y_n)x_n & \displaystyle\sum_n y_n(1 - y_n) \end{pmatrix}. \qquad (239)$$

Note that the Hessian does not depend on the target values for this simple model,
but it is a function of the model parameters w and b, corresponding to the fact that
the error function is non-quadratic. Since the logistic sigmoid function satisfies 0 <
σ(·) < 1 we see that yn (1 − yn ) is always a positive quantity. We therefore see
that the elements of the leading diagonal of the Hessian are given by the sum of
positive terms and are therefore themselves positive. Thus the trace of the Hessian
is positive. Note that we ignore the degenerate case where all of the data points
are identical, leading to a trace of zero, since this is of no practical interest. For the
determinant we first define cn = yn (1−yn ) in order to keep the notation uncluttered.
We then have
\[
\det \mathbf{H} = \left(\sum_n c_n x_n^2\right)\left(\sum_n c_n\right) - \left(\sum_n c_n x_n\right)^2
= \left(\sum_n c_n\right)\left\{\sum_n c_n x_n^2 - \frac{\left(\sum_n c_n x_n\right)^2}{\sum_n c_n}\right\}. \tag{240}
\]

The determinant can equivalently be written as \((\sum_n c_n)\sum_n c_n(x_n - \bar{x})^2\), where \(\bar{x} = (\sum_n c_n x_n)/(\sum_n c_n)\) is the \(c_n\)-weighted mean, and therefore comprises a sum of non-negative terms, so the determinant is positive. Hence both the trace and the determinant are positive. Since the
determinant is the product of the two eigenvalues of the Hessian, it follows that either
both eigenvalues are positive or they are both negative. Since the trace is the sum of
the eigenvalues, and the trace is also positive, it follows that both eigenvalues must
be positive and so the Hessian must be positive definite.
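This positive-definiteness argument is easy to check numerically. Below is a minimal sketch (the data and the parameter values `w` and `b` are arbitrary illustrative choices, not from the book) that builds the 2×2 Hessian of (239) and inspects its eigenvalues:

```python
import numpy as np

# Hypothetical 1-D inputs and logistic-regression parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
w, b = 0.7, -0.3

y = 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid outputs y_n
c = y * (1.0 - y)                        # c_n = y_n (1 - y_n) > 0

# Hessian from (239): [[sum c x^2, sum c x], [sum c x, sum c]]
H = np.array([[np.sum(c * x**2), np.sum(c * x)],
              [np.sum(c * x),    np.sum(c)]])

eigvals = np.linalg.eigvalsh(H)
print(eigvals)   # both eigenvalues are positive, so H is positive definite
```

Repeating this with different data or parameter values (outside the degenerate cases noted above) always yields two positive eigenvalues.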
7.6 We start by making the change of variable given by (7.10) which allows the error
function to be written in the form (7.11). Setting the value of the error function
E(w) to a constant value C we obtain
\[
E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2 = C.
\]
Re-arranging gives
\[
\sum_i \lambda_i \alpha_i^2 = 2C - 2E(\mathbf{w}^\star) = \widetilde{C}
\]
where \(\widetilde{C}\) is also a constant. This is the equation for an ellipse whose axes are aligned
with the coordinate directions described by the variables \(\{\alpha_i\}\). The length of axis \(j\) is found
by setting \(\alpha_i = 0\) for all \(i \neq j\) and solving for \(\alpha_j\), giving
\[
\alpha_j = \left(\frac{\widetilde{C}}{\lambda_j}\right)^{1/2}
\]

which is inversely proportional to the square root of the corresponding eigenvalue.



7.7 A W × W matrix has W 2 elements. If it is symmetric then the elements not on the
leading diagonal form pairs of equal value. There are W elements on the diagonal
so the number of elements not on the diagonal is W 2 − W and only half of these are
independent giving
\[
\frac{W^2 - W}{2}
\]
as the number of independent off-diagonal elements. If we now add back the \(W\)
elements on the diagonal we get
\[
\frac{W^2 - W}{2} + W = \frac{W(W+1)}{2}.
\]
Finally, we add the \(W\) elements of the gradient vector \(\mathbf{b}\) to give
\[
\frac{W(W+1)}{2} + W = \frac{W(W+1) + 2W}{2} = \frac{W^2 + 3W}{2} = \frac{W(W+3)}{2}.
\]

7.8 From the property (2.52) of the Gaussian distribution we have, for a single data point,
\[
\mathbb{E}[x_n] = \mu.
\]
Two independent data points will be uncorrelated and hence
\[
\mathbb{E}[x_n x_m] = \mathbb{E}[x_n]\,\mathbb{E}[x_m] = \mu^2 \quad \text{if } n \neq m.
\]
Therefore, using (2.53), it follows that
\[
\mathbb{E}[x_n x_m] = \delta_{nm}\sigma^2 + \mu^2.
\]
Using the definition (7.65) of \(\bar{x}\) we have
\[
\mathbb{E}[\bar{x}] = \mu
\]
\[
\mathbb{E}[\bar{x}^2] = \frac{1}{N^2}\sum_{n=1}^N\sum_{m=1}^N \mathbb{E}[x_n x_m] = \frac{1}{N}\sigma^2 + \mu^2.
\]
Hence
\[
\mathbb{E}[(\bar{x}-\mu)^2] = \mathbb{E}[\bar{x}^2 - 2\bar{x}\mu + \mu^2] = \frac{1}{N}\sigma^2 + \mu^2 - 2\mu^2 + \mu^2 = \frac{\sigma^2}{N}.
\]
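A quick Monte Carlo sketch of this result, with arbitrary illustrative values for \(\mu\), \(\sigma\), and \(N\):

```python
import numpy as np

# Monte Carlo check that the sample mean of N Gaussian draws has
# variance sigma^2 / N.
rng = np.random.default_rng(1)
mu, sigma, N, trials = 3.0, 2.0, 10, 200_000

means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
print(means.var())   # close to sigma^2 / N = 0.4
```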
7.9 Note that in the original printing, the left-hand side of (7.20) should be \(\mathrm{var}[a_i^{(l)}]\)
instead of \(\mathrm{var}[z_j^{(l)}]\). The expectation over \(a_i^{(l)}\) involves averaging over both the
distribution of the weights \(w_{ij}\) and the distribution of \(z_j^{(l-1)}\). For a given value of \(z_j^{(l-1)}\) the quantity
\(w_{ij} z_j^{(l-1)}\) has zero mean since we are assuming that the weights \(w_{ij}\) are drawn from
a zero-mean Gaussian \(\mathcal{N}(0, \epsilon^2)\). Since this is true for every value of \(j\) in the summation we have
\[
\mathbb{E}[a_i^{(l)}] = 0. \tag{241}
\]
To find the variance of \(a_i^{(l)}\) we note that the quantity \(a_i^{(l)}\) comprises the sum of
\(M\) terms which are themselves independent random variables, and we know from
(2.121) that the total variance of a sum of independent variables is the sum of the
variances of the individual variables. Hence we have
\[
\begin{aligned}
\mathrm{var}[a_i^{(l)}] &= M\,\mathrm{var}\bigl[w_{ij} z_j^{(l-1)}\bigr]\\
&= M\,\mathbb{E}_{w,z}\bigl[(w_{ij} z_j^{(l-1)})^2\bigr] - M\,\mathbb{E}_{w,z}\bigl[w_{ij} z_j^{(l-1)}\bigr]^2\\
&= M\,\mathbb{E}_{w,z}\bigl[(w_{ij} z_j^{(l-1)})^2\bigr]\\
&= M\,\mathbb{E}_w\bigl[w_{ij}^2\bigr]\,\mathbb{E}_z\bigl[(z_j^{(l-1)})^2\bigr]
\end{aligned}
\]
since \(\mathbb{E}_{w,z}[w_{ij} z_j^{(l-1)}] = 0\) as discussed above. Because the weights \(w_{ij}\) are drawn
from a Gaussian \(\mathcal{N}(0, \epsilon^2)\) we have \(\mathbb{E}[w_{ij}^2] = \epsilon^2\). To find \(\mathbb{E}[(z_j^{(l-1)})^2]\) we note that
\[
z_j^{(l-1)} = \mathrm{ReLU}\bigl(a_j^{(l-1)}\bigr)
\]
and therefore
\[
\bigl(z_j^{(l-1)}\bigr)^2 = \mathrm{ReLU}\bigl(a_j^{(l-1)}\bigr)^2.
\]
When we take the square of the ReLU, the result will be zero if the argument is zero
or negative, and will be the square of the argument if the argument is positive.
The quantity \(a_j^{(l-1)}\) will have a symmetric distribution about zero since each term in
its sum is the product of a zero-mean Gaussian weight with an activation from the previous
layer and therefore has a symmetric distribution, and the sum of symmetric distributions is
itself symmetric. Thus, when we take the expectation of the square of the ReLU of this
quantity, half the terms will contribute zero and half the terms will contribute the square of the
argument, and hence
\[
\mathbb{E}\bigl[(z_j^{(l-1)})^2\bigr] = \mathbb{E}\bigl[\mathrm{ReLU}(a_j^{(l-1)})^2\bigr] = \frac{1}{2}\mathbb{E}\bigl[(a_j^{(l-1)})^2\bigr] = \frac{1}{2}\mathrm{var}\bigl[a_j^{(l-1)}\bigr] = \frac{1}{2}\lambda^2
\]
where we have used the property that \(a_j^{(l-1)}\) has zero mean. Combining these results
gives
\[
\mathrm{var}[a_i^{(l)}] = \frac{M}{2}\epsilon^2\lambda^2
\]

as required. Finally, if the variance of \(a_i^{(l)}\) is also to equal \(\lambda^2\) then
\[
\frac{M}{2}\epsilon^2\lambda^2 = \lambda^2
\]
from which it follows that \(\epsilon\) must be chosen to have the value
\[
\epsilon = \sqrt{\frac{2}{M}}. \tag{242}
\]
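This choice of \(\epsilon\) is exactly He initialization, and the variance-preservation property can be checked empirically. The sketch below (layer sizes and sample counts are arbitrary choices) estimates the second moment of \(a^{(l)}\), averaged over both weights and activations, which the argument above says equals \(\lambda^2 = 1\):

```python
import numpy as np

# Empirical sketch of var[a^(l)] = (M/2) eps^2 lambda^2: with
# eps = sqrt(2/M), the second moment lambda^2 = 1 is preserved
# through a ReLU layer.
rng = np.random.default_rng(2)
M, K, samples = 256, 256, 10_000

a_prev = rng.normal(0.0, 1.0, size=(samples, M))    # a^(l-1), lambda^2 = 1
z = np.maximum(a_prev, 0.0)                         # ReLU activations z^(l-1)
W = rng.normal(0.0, np.sqrt(2.0 / M), size=(K, M))  # K units, eps = sqrt(2/M)
a_next = z @ W.T                                    # pre-activations a^(l)

print(np.mean(a_next ** 2))   # close to lambda^2 = 1.0
```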

7.10 Taking the gradient of (7.7) we obtain
\[
\nabla E = \mathbf{H}(\mathbf{w} - \mathbf{w}^\star). \tag{243}
\]
Substituting for \((\mathbf{w} - \mathbf{w}^\star)\) using (7.10) then gives
\[
\nabla E = \mathbf{H}\left(\sum_i \alpha_i \mathbf{u}_i\right). \tag{244}
\]
Making use of (7.8) we have
\[
\nabla E = \sum_i \alpha_i \lambda_i \mathbf{u}_i. \tag{245}
\]
If we consider a change \(\Delta\alpha_i\) in the coefficients \(\alpha_i\) then the corresponding change in
\(\mathbf{w}\) is obtained by taking finite differences of (7.10) to give
\[
\Delta\mathbf{w} = \sum_i \Delta\alpha_i \mathbf{u}_i. \tag{246}
\]
Next, using (7.16) we have
\[
\Delta\mathbf{w} = -\eta\nabla E. \tag{247}
\]
Substituting on both sides then gives
\[
\sum_i \Delta\alpha_i \mathbf{u}_i = -\eta \sum_i \alpha_i \lambda_i \mathbf{u}_i. \tag{248}
\]
If we multiply both sides by \(\mathbf{u}_j^{\mathrm{T}}\) and make use of the orthonormality relation (7.9)
we finally obtain
\[
\Delta\alpha_j = -\eta\alpha_j\lambda_j \tag{249}
\]
as required.
7.11 For small values of µ we can make a Taylor expansion of the first term on the right-
hand side of (7.34) in powers of µ to give

∆w(τ −1) = − η∇E w(τ −1) − ηµ∇∇E w(τ −1) ∆w(τ −2) + O(µ2 )
 

+ µ∆w(τ −2) .

If we now assume that \(\eta = O(\epsilon)\) and \(\mu = O(\epsilon)\), then we can neglect higher-order
terms in the Taylor expansion and also omit the term with coefficient \(\eta\mu\) since that is
\(O(\epsilon^2)\). Here we have assumed that the error surface is slowly varying and hence that
the Hessian term \(\nabla\nabla E\) is \(O(1)\). We then obtain the standard formula for gradient
descent with momentum defined by (7.31).
7.12 If we apply (7.66) recursively we obtain
\[
\begin{aligned}
\mu_n &= \beta\mu_{n-1} + (1-\beta)x_n \\
&= \beta\left(\beta\mu_{n-2} + (1-\beta)x_{n-1}\right) + (1-\beta)x_n \\
&= \beta^2\mu_{n-2} + \beta(1-\beta)x_{n-1} + (1-\beta)x_n \\
&= \beta^3\mu_{n-3} + \beta^2(1-\beta)x_{n-2} + \beta(1-\beta)x_{n-1} + (1-\beta)x_n \\
&\;\;\vdots \\
&= \beta^n\mu_0 + \sum_{k=1}^{n}\beta^{k-1}(1-\beta)x_{n-k+1}.
\end{aligned}
\]
We now set \(\mu_0 = 0\), and then take the expectation of both sides with respect to
the distribution of \(x\), noting that the \(\{x_n\}\) are independent, identically distributed
samples from this distribution. This gives
\[
\mathbb{E}[\mu_n] = \sum_{k=1}^n \beta^{k-1}(1-\beta)\,\mathbb{E}[x_{n-k+1}] = \sum_{k=1}^n \beta^{k-1}(1-\beta)\,\bar{x}
\]
where \(\bar{x}\) is the true mean of the distribution of \(x\). Making use of the result (7.67) we
can write this as
\[
\mathbb{E}[\mu_n] = (1-\beta^n)\,\bar{x}.
\]
Thus we see that \(\mathbb{E}[\mu_n] \neq \bar{x}\) and hence that the estimate \(\mu_n\) is biased. This bias is
easily corrected by using the estimator
\[
\widehat{\mu}_n = \frac{\mu_n}{1-\beta^n} \tag{250}
\]
which has the property \(\mathbb{E}[\widehat{\mu}_n] = \bar{x}\) and is therefore unbiased.
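The bias and its correction (the same correction used in the Adam optimizer) are easy to see in simulation; \(\beta\), the true mean, and the sample count below are arbitrary choices:

```python
import numpy as np

# Exponential moving average of i.i.d. samples: mu_n is biased toward the
# mu_0 = 0 initialization, and dividing by (1 - beta^n) removes the bias.
rng = np.random.default_rng(3)
beta, true_mean = 0.99, 5.0
x = rng.normal(true_mean, 1.0, size=50)

mu = 0.0
for n, xn in enumerate(x, start=1):
    mu = beta * mu + (1 - beta) * xn
    mu_hat = mu / (1 - beta ** n)   # bias-corrected estimator (250)

print(mu, mu_hat)  # mu badly underestimates 5.0; mu_hat is close to it
```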
7.13 Setting the derivative of (7.69) with respect to \(\lambda\) equal to zero we have
\[
\frac{\partial}{\partial\lambda} E\!\left(\mathbf{w}^{(\tau)} + \lambda\mathbf{d}\right) = 0. \tag{251}
\]
To evaluate this derivative we can use the chain rule of calculus. Define
\[
\mathbf{v} = \mathbf{w}^{(\tau)} + \lambda\mathbf{d}. \tag{252}
\]


Then we have
\[
\frac{\partial}{\partial\lambda}E = \sum_{i=1}^M \frac{\partial v_i}{\partial\lambda}\frac{\partial E}{\partial v_i} = \sum_{i=1}^M d_i \frac{\partial E}{\partial v_i} = \mathbf{d}^{\mathrm{T}}\nabla E \tag{253}
\]
where \(\{d_i\}\) are the components of \(\mathbf{d}\). This derivative vanishes for a particular value
\(\lambda^\star\) which then defines the new location in weight space
\[
\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \lambda^\star\mathbf{d}. \tag{254}
\]
We therefore have
\[
\mathbf{d}^{\mathrm{T}}\nabla E(\mathbf{w}^{(\tau)} + \lambda^\star\mathbf{d}) = \mathbf{d}^{\mathrm{T}}\nabla E(\mathbf{w}^{(\tau+1)}) = 0 \tag{255}
\]
and hence the gradient of the error function at the line-search minimum is orthogonal
to the search direction.
7.14 Summing both sides of (7.50) over \(n\) to compute the sample mean, we have
\[
\frac{1}{N}\sum_{n=1}^N \widetilde{x}_{ni} = \frac{1}{N\sigma_i}\left(\sum_{n=1}^N x_{ni} - N\mu_i\right) = 0 \tag{256}
\]
where we have used (7.48). Similarly, if we consider the sample variance
\[
\frac{1}{N}\sum_{n=1}^N (\widetilde{x}_{ni} - 0)^2 = \frac{1}{N\sigma_i^2}\sum_{n=1}^N (x_{ni} - \mu_i)^2 = 1 \tag{257}
\]
where we have made use of (256) together with (7.49).
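A small numerical sketch of this standardization, using random data with arbitrary mean and scale:

```python
import numpy as np

# After the standardization (7.50), each input dimension has sample mean 0
# and sample variance 1.
rng = np.random.default_rng(4)
X = rng.normal(3.0, 2.5, size=(100, 5))

mu = X.mean(axis=0)
sigma = X.std(axis=0)            # population (1/N) convention, as in (7.49)
X_tilde = (X - mu) / sigma

print(X_tilde.mean(axis=0))      # ~0 in every dimension
print(X_tilde.var(axis=0))       # 1 in every dimension
```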



Chapter 8 Backpropagation

8.1 From (8.12) we have
\[
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j} \tag{258}
\]
which follows from the chain rule of calculus. Then (8.8) gives
\[
\delta_k = \frac{\partial E_n}{\partial a_k}. \tag{259}
\]
Again using the chain rule we have
\[
\frac{\partial a_k}{\partial a_j} = \frac{\partial a_k}{\partial z_j}\frac{\partial z_j}{\partial a_j} = w_{kj}\frac{\partial z_j}{\partial a_j} = w_{kj}\,h'(a_j) \tag{260}
\]
where we have used (8.5) and (8.6). Substituting these results into (258) we obtain
\[
\delta_j = h'(a_j)\sum_k w_{kj}\delta_k \tag{261}
\]
as required.
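The recursion (261) can be sketched in vectorized form. The tanh activation below is an illustrative assumption (the result holds for any differentiable \(h\)):

```python
import numpy as np

# Minimal sketch of the backpropagation recursion (261):
# delta_j = h'(a_j) * sum_k w_kj delta_k, for one layer with tanh units.
def backprop_deltas(a_hidden, W_out, delta_out):
    """a_hidden: (M,) pre-activations; W_out: (K, M); delta_out: (K,) errors."""
    h_prime = 1.0 - np.tanh(a_hidden) ** 2        # derivative of tanh
    return h_prime * (W_out.T @ delta_out)        # (261) in vector form

rng = np.random.default_rng(5)
a = rng.normal(size=4)
W = rng.normal(size=(3, 4))
d_out = rng.normal(size=3)
print(backprop_deltas(a, W, d_out))
```

Note that the matrix multiplication uses the transpose of the forward-propagation weight matrix, anticipating the matrix form derived in Exercise 8.2.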

8.2 The forward propagation equations in matrix notation are given by (6.19) in the form
\[
\mathbf{z}^{(l)} = h^{(l)}\!\left(\mathbf{W}^{(l)}\mathbf{z}^{(l-1)}\right) \tag{262}
\]
where \(\mathbf{W}^{(l)}\) is a matrix with elements \(w_{jk}^{(l)}\) comprising the weights in layer \(l\) of
the network, and the activation function \(h^{(l)}(\cdot)\) acts on each element of its vector
argument independently. If we define \(\boldsymbol{\delta}^{(l)}\) to be the vector of errors with elements \(\delta_j^{(l)}\),
then the backpropagation equations in matrix notation take the form
\[
\boldsymbol{\delta}^{(l-1)} = h^{(l-1)\prime}\!\left(\mathbf{a}^{(l-1)}\right) \odot \left\{\left(\mathbf{W}^{(l)}\right)^{\mathrm{T}}\boldsymbol{\delta}^{(l)}\right\} \tag{263}
\]
where \(\odot\) denotes the Hadamard product, which comprises the element-wise multiplication
of two vectors. Note that the forward propagation equation (8.5) involves
a summation over the second index of \(w_{ji}\) whereas the backpropagation equation
(8.13) involves a summation over the first index. Hence, when we write the backpropagation
equation in matrix notation, it involves the transpose of the matrix that
appears in the forward propagation equation.

8.3 We are interested in determining how the correction term
\[
\delta = \frac{E(w_{ij}+\epsilon) - E(w_{ij}-\epsilon)}{2\epsilon} - E'(w_{ij}) \tag{264}
\]
depends on \(\epsilon\). Using Taylor expansions, we can rewrite the numerator of the first term
of (264) as
\[
E(w_{ij}) + \epsilon E'(w_{ij}) + \frac{\epsilon^2}{2}E''(w_{ij}) + O(\epsilon^3)
- \left(E(w_{ij}) - \epsilon E'(w_{ij}) + \frac{\epsilon^2}{2}E''(w_{ij}) + O(\epsilon^3)\right) = 2\epsilon E'(w_{ij}) + O(\epsilon^3).
\]
Note that the \(\epsilon^2\) terms cancel. Substituting this into (264) we get
\[
\delta = \frac{2\epsilon E'(w_{ij}) + O(\epsilon^3)}{2\epsilon} - E'(w_{ij}) = O(\epsilon^2). \tag{265}
\]
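The \(O(\epsilon^2)\) behaviour is easy to observe numerically; the sketch below uses \(\sin\) as an arbitrary stand-in for the scalar error function \(E(w)\):

```python
import numpy as np

# The correction term delta for central differences is O(eps^2): shrinking
# eps by a factor of 10 shrinks the error by roughly 100.
E = np.sin
E_prime = np.cos
w = 0.8

for eps in (1e-2, 1e-3, 1e-4):
    approx = (E(w + eps) - E(w - eps)) / (2 * eps)
    print(eps, abs(approx - E_prime(w)))
```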
8.4 If we introduce skip-layer weights, \(\mathbf{U}\), into the model described in Section 8.1.3, this
will only affect the last of the forward propagation equations, (8.20), which becomes
\[
y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j + \sum_{i=1}^{D} u_{ki} x_i. \tag{266}
\]
Note that there is no need to include the input bias. The derivative w.r.t. \(u_{ki}\) can be
expressed using the output \(\{\delta_k\}\) of (8.21),
\[
\frac{\partial E}{\partial u_{ki}} = \delta_k x_i. \tag{267}
\]
8.5 The alternative forward propagation scheme takes the first line of (8.29) as its starting
point. However, rather than proceeding with a 'recursive' definition of \(\partial y_k/\partial a_j\), we
instead make use of a corresponding definition for \(\partial a_j/\partial x_i\). More formally,
\[
J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} \tag{268}
\]
where \(\partial y_k/\partial a_j\) is defined by (8.33) for logistic sigmoid output units, by (8.34) for
softmax output units, or simply as \(\delta_{kj}\) for the case of linear output units. We define
\(\partial a_j/\partial x_i = w_{ji}\) if \(a_j\) is in the first hidden layer, and otherwise
\[
\frac{\partial a_j}{\partial x_i} = \sum_l \frac{\partial a_j}{\partial a_l}\frac{\partial a_l}{\partial x_i} \tag{269}
\]
where
\[
\frac{\partial a_j}{\partial a_l} = w_{jl}\,h'(a_l). \tag{270}
\]
Thus we can evaluate \(J_{ki}\) by forward propagating \(\partial a_j/\partial x_i\), with initial value \(w_{ji}\),
alongside \(a_j\), using (269) and (270).

8.6 Using the chain rule together with (8.5) and (8.77), we have
\[
\frac{\partial E_n}{\partial w_{kj}^{(2)}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj}^{(2)}} = \delta_k z_j. \tag{271}
\]
Thus,
\[
\frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\,\partial w_{k'j'}^{(2)}} = \frac{\partial(\delta_k z_j)}{\partial w_{k'j'}^{(2)}} \tag{272}
\]
and, since \(z_j\) is independent of the second-layer weights,
\[
\frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\,\partial w_{k'j'}^{(2)}}
= z_j\,\frac{\partial\delta_k}{\partial w_{k'j'}^{(2)}}
= z_j\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j'}^{(2)}}
= z_j z_{j'} M_{kk'},
\]
where we again have used the chain rule together with (8.5) and (8.77). If both
weights are in the first layer, we again use the chain rule, this time together with
(8.5), (8.12) and (8.13), to get
\[
\frac{\partial E_n}{\partial w_{ji}^{(1)}}
= \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}^{(1)}}
= x_i\sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}
= x_i\,h'(a_j)\sum_k w_{kj}^{(2)}\delta_k.
\]
Thus we have
\[
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}}
= \frac{\partial}{\partial w_{j'i'}^{(1)}}\left(x_i\,h'(a_j)\sum_k w_{kj}^{(2)}\delta_k\right). \tag{273}
\]
Now we note that \(x_i\) and \(w_{kj}^{(2)}\) do not depend on \(w_{j'i'}^{(1)}\), while \(h'(a_j)\) is only affected
in the case where \(j = j'\). Using these observations together with (8.5), we get
\[
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}}
= x_i x_{i'}\,h''(a_j)\,I_{jj'}\sum_k w_{kj}^{(2)}\delta_k
+ x_i\,h'(a_j)\sum_k w_{kj}^{(2)}\frac{\partial\delta_k}{\partial w_{j'i'}^{(1)}}. \tag{274}
\]
From (8.5), (8.12), (8.13), (8.77) and the chain rule, we have
\[
\frac{\partial\delta_k}{\partial w_{j'i'}^{(1)}}
= \sum_{k'}\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial a_{j'}}\frac{\partial a_{j'}}{\partial w_{j'i'}^{(1)}}
= x_{i'}\,h'(a_{j'})\sum_{k'} w_{k'j'}^{(2)} M_{kk'}. \tag{275}
\]

Substituting this back into (274), we obtain (??). Finally, from (271) we have
\[
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}} = \frac{\partial(\delta_k z_{j'})}{\partial w_{ji}^{(1)}}. \tag{276}
\]
Using (275), we get
\[
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}}
= z_{j'}\,x_i\,h'(a_j)\sum_{k'} w_{k'j}^{(2)} M_{kk'} + \delta_k I_{jj'}\,h'(a_j)\,x_i
= x_i\,h'(a_j)\left(\delta_k I_{jj'} + z_{j'}\sum_{k'} w_{k'j}^{(2)} M_{kk'}\right).
\]

8.7 If we introduce skip-layer weights into the model discussed in Section ??, three new
cases are added to the three already covered in Exercise 8.6. The first derivative w.r.t.
a skip-layer weight \(u_{ki}\) can be written
\[
\frac{\partial E_n}{\partial u_{ki}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial u_{ki}} = \frac{\partial E_n}{\partial a_k}\,x_i. \tag{277}
\]
Using this, we can consider the first new case, where both weights are in the skip layer:
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial u_{k'i'}}
= x_i\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial u_{k'i'}}
= M_{kk'}\,x_i x_{i'},
\]
where we have also used (8.77). When one weight is in the skip layer and the other
weight is in the hidden-to-output layer, we can use (277), (8.5) and (8.77) to get
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial w_{k'j}^{(2)}}
= x_i\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j}^{(2)}}
= M_{kk'}\,z_j x_i.
\]
Finally, if one weight is a skip-layer weight and the other is in the input-to-hidden
layer, (277), (8.5), (8.12), (8.13) and (8.77) together give
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial w_{ji'}^{(1)}}
= \frac{\partial}{\partial w_{ji'}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\,x_i\right)
= x_i\sum_{k'}\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{ji'}^{(1)}}
= x_i x_{i'}\,h'(a_j)\sum_{k'} M_{kk'}\,w_{k'j}^{(2)}.
\]

8.8 The multivariate form of (8.38) is
\[
E = \frac{1}{2}\sum_{n=1}^N (\mathbf{y}_n - \mathbf{t}_n)^{\mathrm{T}}(\mathbf{y}_n - \mathbf{t}_n). \tag{278}
\]
The elements of the first and second derivatives then become
\[
\frac{\partial E}{\partial w_i} = \sum_{n=1}^N (\mathbf{y}_n - \mathbf{t}_n)^{\mathrm{T}}\frac{\partial\mathbf{y}_n}{\partial w_i} \tag{279}
\]
and
\[
\frac{\partial^2 E}{\partial w_i\,\partial w_j} = \sum_{n=1}^N \left\{\left(\frac{\partial\mathbf{y}_n}{\partial w_j}\right)^{\mathrm{T}}\frac{\partial\mathbf{y}_n}{\partial w_i} + (\mathbf{y}_n - \mathbf{t}_n)^{\mathrm{T}}\frac{\partial^2\mathbf{y}_n}{\partial w_j\,\partial w_i}\right\}. \tag{280}
\]
As for the univariate case, we again assume that the second term of the second derivative
vanishes and we are left with
\[
\mathbf{H} = \sum_{n=1}^N \mathbf{B}_n\mathbf{B}_n^{\mathrm{T}}, \tag{281}
\]
where \(\mathbf{B}_n\) is a \(W\times K\) matrix, \(K\) being the dimensionality of \(\mathbf{y}_n\), with elements
\[
(\mathbf{B}_n)_{lk} = \frac{\partial y_{nk}}{\partial w_l}. \tag{282}
\]

8.9 Taking the second derivatives of (8.78) with respect to two weights \(w_r\) and \(w_s\) we
obtain
\[
\frac{\partial^2 E}{\partial w_r\,\partial w_s} = \sum_k \int \frac{\partial y_k}{\partial w_r}\frac{\partial y_k}{\partial w_s}\, p(\mathbf{x})\,\mathrm{d}\mathbf{x}
+ \sum_k \int \left(y_k(\mathbf{x}) - \mathbb{E}_{t_k}[t_k|\mathbf{x}]\right)\frac{\partial^2 y_k}{\partial w_r\,\partial w_s}\, p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{283}
\]
Using the result (4.37) that the outputs \(y_k(\mathbf{x})\) of the trained network represent the
conditional averages of the target data, we see that the second term in (283) vanishes.
The Hessian is therefore given by an integral of terms involving only the products of
first derivatives. For a finite data set, we can write this result in the form
\[
\frac{\partial^2 E}{\partial w_r\,\partial w_s} = \frac{1}{N}\sum_{n=1}^N\sum_k \frac{\partial y_{kn}}{\partial w_r}\frac{\partial y_{kn}}{\partial w_s} \tag{284}
\]
which is identical with (8.40) up to a scaling factor.


8.10 If we take the gradient of (6.33) with respect to \(\mathbf{w}\), we obtain
\[
\nabla E(\mathbf{w}) = \sum_{n=1}^N \frac{\partial E}{\partial a_n}\nabla a_n = \sum_{n=1}^N (y_n - t_n)\nabla a_n,
\]

where we have used the result proved earlier in the solution to Exercise 6.14. Taking
the second derivatives we have
\[
\nabla\nabla E(\mathbf{w}) = \sum_{n=1}^N \left\{\frac{\partial y_n}{\partial a_n}\nabla a_n\nabla a_n^{\mathrm{T}} + (y_n - t_n)\nabla\nabla a_n\right\}.
\]
Dropping the last term and using the result (5.72) for the derivative of the logistic
sigmoid function, proved in the solution to Exercise 5.18, we finally get
\[
\nabla\nabla E(\mathbf{w}) \simeq \sum_{n=1}^N y_n(1-y_n)\nabla a_n\nabla a_n^{\mathrm{T}} = \sum_{n=1}^N y_n(1-y_n)\,\mathbf{b}_n\mathbf{b}_n^{\mathrm{T}}
\]
where \(\mathbf{b}_n \equiv \nabla a_n\).
8.11 Using the chain rule, we can write the first derivative of (6.36) as
\[
\frac{\partial E}{\partial w_i} = \sum_{n=1}^N\sum_{k=1}^K \frac{\partial E}{\partial a_{nk}}\frac{\partial a_{nk}}{\partial w_i}. \tag{285}
\]
From Exercise 6.15, we know that
\[
\frac{\partial E}{\partial a_{nk}} = y_{nk} - t_{nk}. \tag{286}
\]
Using this and (5.78), we can get the derivative of (285) w.r.t. \(w_j\) as
\[
\frac{\partial^2 E}{\partial w_i\,\partial w_j} = \sum_{n=1}^N\sum_{k=1}^K\left(\sum_{l=1}^K y_{nk}(I_{kl}-y_{nl})\frac{\partial a_{nk}}{\partial w_i}\frac{\partial a_{nl}}{\partial w_j} + (y_{nk}-t_{nk})\frac{\partial^2 a_{nk}}{\partial w_i\,\partial w_j}\right).
\]
For a trained model, the network outputs will approximate the conditional class probabilities
and so the last term inside the parentheses will vanish in the limit of a large
data set, leaving us with
\[
(\mathbf{H})_{ij} \simeq \sum_{n=1}^N\sum_{k=1}^K\sum_{l=1}^K y_{nk}(I_{kl}-y_{nl})\frac{\partial a_{nk}}{\partial w_i}\frac{\partial a_{nl}}{\partial w_j}. \tag{287}
\]

8.12 Suppose we have already obtained the inverse Hessian using the first \(L\) data points.
By separating off the contribution from data point \(L+1\) in (8.40), we obtain
\[
\mathbf{H}_{L+1} = \mathbf{H}_L + \nabla a_{L+1}\nabla a_{L+1}^{\mathrm{T}}. \tag{288}
\]
We now consider the matrix identity (8.80). If we identify \(\mathbf{H}_L\) with \(\mathbf{M}\) and
\(\mathbf{b}_{L+1}\) with \(\mathbf{v}\), we obtain
\[
\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1}\nabla a_{L+1}\nabla a_{L+1}^{\mathrm{T}}\mathbf{H}_L^{-1}}{1 + \nabla a_{L+1}^{\mathrm{T}}\mathbf{H}_L^{-1}\nabla a_{L+1}}. \tag{289}
\]

In this way, data points are sequentially absorbed until L+1 = N and the whole data
set has been processed. This result therefore represents a procedure for evaluating
the inverse of the Hessian using a single pass through the data set. The initial matrix
H0 is chosen to be αI, where α is a small quantity, so that the algorithm actually
finds the inverse of H + αI. The results are not particularly sensitive to the precise
value of α.
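The sequential update (289), which is an instance of the Sherman–Morrison identity, can be sketched as follows; the dimensions and the value of \(\alpha\) below are arbitrary illustrative choices:

```python
import numpy as np

# Sequential update (289) for the inverse of the outer-product Hessian
# approximation, checked against direct inversion at the end.
rng = np.random.default_rng(6)
W_dim, N, alpha = 5, 50, 0.01

grads = rng.normal(size=(N, W_dim))     # one gradient vector per data point
H_inv = np.eye(W_dim) / alpha           # start from (alpha * I)^{-1}

for g in grads:
    Hg = H_inv @ g
    H_inv = H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)

H_direct = alpha * np.eye(W_dim) + grads.T @ grads
print(np.allclose(H_inv, np.linalg.inv(H_direct)))  # True
```

A single pass through the data thus yields the inverse of \(\mathbf{H} + \alpha\mathbf{I}\) without ever inverting a matrix directly.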

8.13 The function \(h(\cdot)\) is given by a soft ReLU
\[
h(a) = \ln(1+\exp(a)) \tag{290}
\]
and its derivative is given by
\[
h'(a) = \frac{\exp(a)}{1+\exp(a)}. \tag{291}
\]
We can now use the chain rule of calculus in the form
\[
\frac{\mathrm{d}y}{\mathrm{d}w_1} = \frac{\mathrm{d}y}{\mathrm{d}z}\frac{\mathrm{d}z}{\mathrm{d}w_1}. \tag{292}
\]
Using (8.44) and (291) we have
\[
\frac{\mathrm{d}z}{\mathrm{d}w_1} = \frac{\exp(w_1 x + b_1)}{1+\exp(w_1 x + b_1)}\,x. \tag{293}
\]
Similarly, using (8.45) and (291) we have
\[
\frac{\mathrm{d}y}{\mathrm{d}z} = \frac{\exp(w_2 z + b_2)}{1+\exp(w_2 z + b_2)}\,w_2. \tag{294}
\]
Finally, combining these derivatives using the chain rule, and then substituting for \(z\)
using (8.44), we obtain
\[
\frac{\partial y}{\partial w_1} = \frac{w_2\,x\,\exp\left(w_1 x + b_1 + b_2 + w_2\ln[1+\exp(w_1 x + b_1)]\right)}{\left(1+\exp(w_1 x + b_1)\right)\left(1+\exp\left(b_2 + w_2\ln[1+\exp(w_1 x + b_1)]\right)\right)}. \tag{295}
\]

8.14 The evaluation trace equations are given directly from the definition of the logistic
map:
\[
\begin{aligned}
L_1 &= x & (296)\\
L_2 &= 4L_1(1-L_1) & (297)\\
L_3 &= 4L_2(1-L_2) & (298)\\
L_4 &= 4L_3(1-L_3). & (299)
\end{aligned}
\]

The corresponding explicit functions, without simplification, are then given by
\[
\begin{aligned}
L_1(x) &= x & (301)\\
L_2(x) &= 4x(1-x) & (302)\\
L_3(x) &= 16x(1-x)(1-2x)^2 & (303)\\
L_4(x) &= 64x(1-x)(1-2x)^2(1-8x+8x^2)^2. & (304)
\end{aligned}
\]
Finally, taking derivatives, we obtain the following expressions, again without simplification:
\[
\begin{aligned}
L_1'(x) &= 1 & (305)\\
L_2'(x) &= 4(1-x) - 4x & (306)\\
L_3'(x) &= 16(1-x)(1-2x)^2 - 16x(1-2x)^2 - 64x(1-x)(1-2x) & (307)\\
L_4'(x) &= 128x(1-x)(-8+16x)(1-2x)^2(1-8x+8x^2) & \\
&\quad + 64(1-x)(1-2x)^2(1-8x+8x^2)^2 & \\
&\quad - 64x(1-2x)^2(1-8x+8x^2)^2 & \\
&\quad - 256x(1-x)(1-2x)(1-8x+8x^2)^2. & (308)
\end{aligned}
\]
Note that the complexity of the expressions for the derivatives grows much faster
than the complexity of the expressions for the corresponding functions.
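In practice, forward-mode automatic differentiation sidesteps this blow-up by propagating a numerical tangent alongside the value instead of expanding symbolic expressions. A minimal sketch for the iterated logistic map (three iterations from \(L_1 = x\) give \(L_4\)):

```python
# Forward-mode differentiation of the iterated logistic map: propagate the
# tangent dL/dx alongside the value.
def logistic_map_with_tangent(x, iterations=3):
    L, dL = x, 1.0
    for _ in range(iterations):
        # both right-hand sides use the old L, as forward mode requires
        L, dL = 4 * L * (1 - L), 4 * (1 - 2 * L) * dL
    return L, dL

def L4_fn(x):
    L = x
    for _ in range(3):
        L = 4 * L * (1 - L)
    return L

x = 0.3
L4, dL4 = logistic_map_with_tangent(x)   # L_4(x) and L_4'(x)
eps = 1e-7
print(dL4, (L4_fn(x + eps) - L4_fn(x - eps)) / (2 * eps))  # values agree
```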

8.15 To derive the forward-mode equations we apply (8.57) in the form
\[
\dot{v}_i = \sum_{j\in\mathrm{pa}(i)} \frac{\partial v_i}{\partial v_j}\,\dot{v}_j
\]
where \(\mathrm{pa}(i)\) denotes the parents of node \(i\) in the evaluation trace diagram. Using the
evaluation trace diagram in Figure 8.4, together with (8.50) to (8.56), we then have
\[
\begin{aligned}
\dot{v}_1 &= 1 & (309)\\
\dot{v}_2 &= 0 & (310)\\
\dot{v}_3 &= \frac{\partial v_3}{\partial v_1}\dot{v}_1 + \frac{\partial v_3}{\partial v_2}\dot{v}_2 = \dot{v}_1 v_2 + \dot{v}_2 v_1 & (311)\\
\dot{v}_4 &= \frac{\partial v_4}{\partial v_2}\dot{v}_2 = \dot{v}_2\cos(v_2) & (312)\\
\dot{v}_5 &= \frac{\partial v_5}{\partial v_3}\dot{v}_3 = \dot{v}_3\exp(v_3) & (313)\\
\dot{v}_6 &= \frac{\partial v_6}{\partial v_3}\dot{v}_3 + \frac{\partial v_6}{\partial v_4}\dot{v}_4 = \dot{v}_3 - \dot{v}_4 & (314)\\
\dot{v}_7 &= \frac{\partial v_7}{\partial v_5}\dot{v}_5 + \frac{\partial v_7}{\partial v_6}\dot{v}_6 = \dot{v}_5 + \dot{v}_6. & (315)
\end{aligned}
\]

8.16 To derive the reverse-mode equations we apply (8.69) in the form
\[
\overline{v}_i = \sum_{j\in\mathrm{ch}(i)} \frac{\partial v_j}{\partial v_i}\,\overline{v}_j. \tag{316}
\]
Here \(\mathrm{ch}(i)\) denotes the children of node \(i\) in the evaluation trace graph. Using the
evaluation trace diagram in Figure 8.4, together with (8.50) to (8.56), and starting at
the output of the graph and working backwards, we then have
\[
\begin{aligned}
\overline{v}_7 &= 1 & (317)\\
\overline{v}_6 &= \overline{v}_7\frac{\partial v_7}{\partial v_6} = \overline{v}_7 & (318)\\
\overline{v}_5 &= \overline{v}_7\frac{\partial v_7}{\partial v_5} = \overline{v}_7 & (319)\\
\overline{v}_4 &= \overline{v}_6\frac{\partial v_6}{\partial v_4} = -\overline{v}_6 & (320)\\
\overline{v}_3 &= \overline{v}_5\frac{\partial v_5}{\partial v_3} + \overline{v}_6\frac{\partial v_6}{\partial v_3} = \overline{v}_5 v_5 + \overline{v}_6 & (321)\\
\overline{v}_2 &= \overline{v}_3\frac{\partial v_3}{\partial v_2} + \overline{v}_4\frac{\partial v_4}{\partial v_2} = \overline{v}_3 v_1 + \overline{v}_4\cos(v_2) & (322)\\
\overline{v}_1 &= \overline{v}_3\frac{\partial v_3}{\partial v_1} = \overline{v}_3 v_2. & (323)
\end{aligned}
\]

8.17 From (8.49) we have the following expression for the partial derivative
\[
\frac{\partial f}{\partial x_1} = x_2 + x_2\exp(x_1 x_2). \tag{324}
\]
Evaluating this for \((x_1, x_2) = (1, 2)\) gives
\[
\left.\frac{\partial f}{\partial x_1}\right|_{x_1=1,\,x_2=2} = 2 + 2\exp(2). \tag{325}
\]
From the evaluation trace equations (8.50) to (8.56) we have
\[
\begin{aligned}
v_1 &= 1 & (326)\\
v_2 &= 2 & (327)\\
v_3 &= 2 & (328)\\
v_4 &= \sin(2) & (329)\\
v_5 &= \exp(2) & (330)\\
v_6 &= 2 - \sin(2) & (331)\\
v_7 &= 2 + \exp(2) - \sin(2). & (332)
\end{aligned}
\]

For the tangent variables we can then use (8.58) to (8.64) to give
\[
\begin{aligned}
\dot{v}_1 &= 1 & (333)\\
\dot{v}_2 &= 0 & (334)\\
\dot{v}_3 &= 2 & (335)\\
\dot{v}_4 &= 0 & (336)\\
\dot{v}_5 &= 2\exp(2) & (337)\\
\dot{v}_6 &= 2 & (338)\\
\dot{v}_7 &= 2 + 2\exp(2) & (339)
\end{aligned}
\]
and so we see that \(\dot{v}_7\) does indeed represent the correct value for the derivative given
by (325). Similarly, we can use the evaluation trace equations of reverse-mode automatic
differentiation (8.70) to (8.76) to evaluate the adjoint variables as follows:
\[
\begin{aligned}
\overline{v}_7 &= 1 & (340)\\
\overline{v}_6 &= 1 & (341)\\
\overline{v}_5 &= 1 & (342)\\
\overline{v}_4 &= -1 & (343)\\
\overline{v}_3 &= \exp(2) + 1 & (344)\\
\overline{v}_2 &= (\exp(2) + 1) - \cos(2) & (345)\\
\overline{v}_1 &= 2\exp(2) + 2. & (346)
\end{aligned}
\]
From (8.68) we have
\[
\overline{v}_1 = \frac{\partial f}{\partial x_1} \tag{347}
\]
and so again we see that this agrees with the required derivative.

8.18 The vectors \(\mathbf{e}_1, \ldots, \mathbf{e}_D\) form a complete orthonormal basis and so we can expand an
arbitrary \(D\)-dimensional vector \(\mathbf{r}\) in the form
\[
\mathbf{r} = \sum_{i=1}^D \alpha_i\mathbf{e}_i. \tag{348}
\]
Taking the product of both sides with \(\mathbf{e}_j^{\mathrm{T}}\) we obtain
\[
r_j = \mathbf{e}_j^{\mathrm{T}}\mathbf{r} = \alpha_j \tag{349}
\]
where \(r_j\) is the \(j\)th component of \(\mathbf{r}\). Hence we can write the expansion in the form
\[
\mathbf{r} = \sum_{i=1}^D r_i\mathbf{e}_i. \tag{350}
\]

Multiplying both sides by the Jacobian then gives
\[
\mathbf{J}\mathbf{r} = \sum_{i=1}^D r_i\,\mathbf{J}\mathbf{e}_i = \sum_{i=1}^D r_i\,\frac{\partial\mathbf{f}}{\partial x_i} \tag{351}
\]
where \(\mathbf{f}(\mathbf{x})\) is the original network function with elements \(f_k(\mathbf{x})\). This can be
interpreted as a single pass of forward-mode automatic differentiation in which the
tangent variables associated with the input variables are given by \(\dot{x}_i = r_i\).
One way to see this more clearly is to introduce a function \(\mathbf{g}(z)\) where \(z\) is a scalar
variable and the elements of \(\mathbf{g}\) are given by \(g_i(z) = r_i z\). From the perspective of
a network diagram this can be viewed as introducing an extra layer from a single
input \(z\) to the original inputs \(\{x_i\}\). The overall composite function can be written
as \(\mathbf{f}(\mathbf{g}(z))\), which is now a function with just one input, whose Jacobian is therefore
a matrix with a single column and can therefore be evaluated in a single pass of
forward-mode automatic differentiation. The elements of this vector are given by
\[
\frac{\partial f_k}{\partial z} = \sum_{i=1}^D \frac{\partial f_k}{\partial x_i}\frac{\partial x_i}{\partial z} = \sum_{i=1}^D J_{ki} r_i = (\mathbf{J}\mathbf{r})_k \tag{352}
\]
and are therefore the elements of the Jacobian-vector product as required. The tangent
variables at the inputs to the main network are then given by
\[
\dot{x}_i = \frac{\partial x_i}{\partial z} = r_i. \tag{353}
\]
Thus, we see that if the tangent variable \(\dot{x}_i\) for each input \(i\) is set to the corresponding
element \(r_i\) of \(\mathbf{r}\), then a single pass of forward-mode automatic differentiation will
compute the Jacobian-vector product as required.
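This construction can be sketched for a tiny two-layer network: seeding the input tangents with \(\mathbf{r}\) and running one forward pass yields \(\mathbf{J}\mathbf{r}\), which we check against the explicit Jacobian (the weights, sizes, and tanh activation are arbitrary illustrative choices):

```python
import numpy as np

# A single forward-mode pass computes the Jacobian-vector product J r for a
# small two-layer tanh network by seeding the input tangents with r.
rng = np.random.default_rng(7)
D, M, K = 4, 6, 3
W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(K, M))

def forward_with_tangent(x, x_dot):
    a = W1 @ x
    a_dot = W1 @ x_dot
    z = np.tanh(a)
    z_dot = (1 - z ** 2) * a_dot
    return W2 @ z, W2 @ z_dot        # returns (f(x), J @ x_dot)

x = rng.normal(size=D)
r = rng.normal(size=D)
_, Jr = forward_with_tangent(x, r)

# check against the explicit Jacobian J = W2 diag(h'(a)) W1
J = W2 @ np.diag(1 - np.tanh(W1 @ x) ** 2) @ W1
print(np.allclose(Jr, J @ r))  # True
```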

Chapter 9 Regularization

9.1 We will start by showing that the set of rotations by multiples of 90° forms a
group:

• Closure: Suppose A represents a rotation of a◦ and B represents a rotation of


b◦ , and both a and b are multiples of 90, making A and B both members of the
set. A ◦ B therefore represents a rotation by a + b which is also a multiple of
90, making this new rotation also a member of the set.
• Associativity: Now suppose we have three rotations, A, B and C, which rep-
resent rotations of a, b and c degrees respectively, again each a multiple of 90.
Now we can see that both (A ◦ B) ◦ C and A ◦ (B ◦ C) correspond to rotations
of (a + b + c)◦ , making the set associative under composition of rotations.
• Identity: A rotation of 0◦ is in the set and also leaves other rotations unchanged
when composed with them.
• Inverse: If we take an element A in the set, which again represents a rotation
of a°, then its inverse A⁻¹ will be a rotation by (360 − a)°, meaning A ◦ A⁻¹
will give a rotation of 360°, which is the same as a rotation by 0°, which is the
identity.

By showing that these four axioms are satisfied, we have shown that this is indeed
a group. Now we will do the same for the group of translations of an object in a
two-dimensional plane.

• Closure: Suppose A represents a translation of ax in x and ay in y, and B


represents a translation of bx in x and by in y. A ◦ B therefore represents a
translation by ax + bx in x and ay + by in y which is also a translation in a 2D
plane.
• Associativity: Now suppose we have three translations, A, B and C, which
represent translations of ax , bx and cx in x respectively, and ay , by and cy in y.
Now we can see that both (A ◦ B) ◦ C and A ◦ (B ◦ C) correspond to translations
of ax + bx + cx in x and ay + by + cy in y.
• Identity: The composition of a translation of 0 in both dimensions with any
other translation will leave the latter unchanged and therefore the translation of
0 in both dimensions is the identity for this group.
• Inverse: If we take an element A in the set, which again represents a translation
of aₓ in x and a_y in y, we can see that composing this with a translation of −aₓ
in x and −a_y in y gives us a translation of 0 in both dimensions, which means
this negative translation is the inverse of A.

9.2 Let
\[
\widetilde{y}_n = w_0 + \sum_{i=1}^D w_i(x_{ni} + \epsilon_{ni}) = y_n + \sum_{i=1}^D w_i\epsilon_{ni}
\]
where \(y_n = y(\mathbf{x}_n, \mathbf{w})\) and \(\epsilon_{ni} \sim \mathcal{N}(0, \sigma^2)\), and we have used (9.52). From (9.53) we
then define
\[
\begin{aligned}
\widetilde{E} &= \frac{1}{2}\sum_{n=1}^N \left\{\widetilde{y}_n - t_n\right\}^2\\
&= \frac{1}{2}\sum_{n=1}^N \left\{\widetilde{y}_n^2 - 2\widetilde{y}_n t_n + t_n^2\right\}\\
&= \frac{1}{2}\sum_{n=1}^N \Biggl\{y_n^2 + 2y_n\sum_{i=1}^D w_i\epsilon_{ni} + \left(\sum_{i=1}^D w_i\epsilon_{ni}\right)^2
- 2t_n y_n - 2t_n\sum_{i=1}^D w_i\epsilon_{ni} + t_n^2\Biggr\}.
\end{aligned}
\]
If we take the expectation of \(\widetilde{E}\) under the distribution of \(\epsilon_{ni}\), we see that the second
and fifth terms disappear, since \(\mathbb{E}[\epsilon_{ni}] = 0\), while for the third term we get
\[
\mathbb{E}\left[\left(\sum_{i=1}^D w_i\epsilon_{ni}\right)^2\right] = \sum_{i=1}^D w_i^2\sigma^2
\]
since the \(\epsilon_{ni}\) are all independent with variance \(\sigma^2\). From this and (9.53) we see that
\[
\mathbb{E}\bigl[\widetilde{E}\bigr] = E_D + \frac{1}{2}\sum_{i=1}^D w_i^2\sigma^2,
\]
as required.
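A Monte Carlo sketch of this result for a single data point (all numerical values below are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check (for one data point) that adding input noise inflates the
# expected squared error by the weight-decay term (sigma^2 / 2) * sum w_i^2.
rng = np.random.default_rng(8)
w = np.array([0.5, -1.0, 2.0]); w0 = 0.3; sigma = 0.1
x = np.array([1.0, -0.5, 0.2]); t = 1.5

y = w0 + w @ x
E_D = 0.5 * (y - t) ** 2                 # noise-free error

trials = 500_000
eps = rng.normal(0, sigma, size=(trials, 3))
y_noisy = y + eps @ w
E_noisy = 0.5 * (y_noisy - t) ** 2

print(E_noisy.mean(), E_D + 0.5 * sigma ** 2 * np.sum(w ** 2))
```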
9.3 We first write the gradient descent formula in terms of continuous time \(t\) in the form
\[
\mathbf{w}(t+\epsilon) = \mathbf{w}(t) - \epsilon\widetilde{\eta}\,\nabla\Omega(\mathbf{w}) \tag{354}
\]
where \(\epsilon\) represents some finite time step and we have defined \(\widetilde{\eta} = \eta/\epsilon\). We now
make a Taylor expansion of the left-hand side in powers of \(\epsilon\) to give
\[
\mathbf{w}(t) + \epsilon\frac{\mathrm{d}\mathbf{w}(t)}{\mathrm{d}t} + O(\epsilon^2) = \mathbf{w}(t) - \epsilon\widetilde{\eta}\,\nabla\Omega(\mathbf{w}). \tag{355}
\]

We now take the limit \(\epsilon\to 0\) to give
\[
\frac{\mathrm{d}\mathbf{w}(t)}{\mathrm{d}t} = -\widetilde{\eta}\,\nabla\Omega(\mathbf{w}). \tag{356}
\]
Next we substitute for \(\Omega(\mathbf{w})\) using
\[
\Omega(\mathbf{w}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \tag{357}
\]
to give
\[
\frac{\mathrm{d}\mathbf{w}(t)}{\mathrm{d}t} = -\widetilde{\eta}\,\mathbf{w}. \tag{358}
\]
This has the solution
\[
\mathbf{w}(t) = \mathbf{w}(0)\exp\{-\widetilde{\eta}\,t\} \tag{359}
\]
as can easily be verified by substitution. Thus, the elements of \(\mathbf{w}\) decay exponentially
to zero.
9.4 With the transformed inputs, weights and biases, (9.6) becomes
\[
z_j = h\left(\sum_i \widetilde{w}_{ji}\widetilde{x}_i + \widetilde{w}_{j0}\right).
\]
Using (9.8)–(9.10), we can rewrite the argument of \(h(\cdot)\) on the r.h.s. as
\[
\sum_i \frac{1}{a} w_{ji}(a x_i + b) + w_{j0} - \frac{b}{a}\sum_i w_{ji}
= \sum_i w_{ji} x_i + \frac{b}{a}\sum_i w_{ji} + w_{j0} - \frac{b}{a}\sum_i w_{ji}
= \sum_i w_{ji} x_i + w_{j0}.
\]
Similarly, with the transformed outputs, weights and biases, (9.7) becomes
\[
\widetilde{y}_k = \sum_j \widetilde{w}_{kj} z_j + \widetilde{w}_{k0}.
\]
Using (9.11)–(9.13), we can rewrite this as
\[
c y_k + d = \sum_j c\,w_{kj} z_j + c\,w_{k0} + d = c\left(\sum_j w_{kj} z_j + w_{k0}\right) + d.
\]
By subtracting \(d\) and subsequently dividing by \(c\) on both sides, we recover (9.7) in
its original form.

9.5 We can rewrite the constraint (9.20) in the form
\[
\sum_{j=1}^M |w_j|^q - \eta \leqslant 0. \tag{360}
\]
This constraint can be enforced by adding a term to the un-regularized error \(E(\mathbf{w})\)
using a Lagrange multiplier, which we denote \(\lambda/2\), to give
\[
E(\mathbf{w}) + \frac{\lambda}{2}\left(\sum_{j=1}^M |w_j|^q - \eta\right). \tag{361}
\]
Since the term \(\lambda\eta/2\) is constant with respect to \(\mathbf{w}\), minimizing (361) is equivalent to
minimizing
\[
E(\mathbf{w}) + \frac{\lambda}{2}\sum_{j=1}^M |w_j|^q. \tag{362}
\]
If the constraint is active (see Appendix C) then \(\lambda \neq 0\) and hence
\[
\sum_{j=1}^M |w_j|^q = \eta. \tag{363}
\]
The strength of the regularization increases as \(\lambda\) increases, driving weights to smaller
values. Similarly, the strength of the constraint increases as \(\eta\) decreases, again driving
weights to smaller values. However, the precise relationship between \(\lambda\) and \(\eta\)
depends on the form of \(E(\mathbf{w})\).
9.6 The gradient of (9.56) is given by
\[
\nabla E = \mathbf{H}(\mathbf{w} - \mathbf{w}^\star)
\]
and hence the update formula (9.57) becomes
\[
\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \rho\mathbf{H}(\mathbf{w}^{(\tau-1)} - \mathbf{w}^\star).
\]
Pre-multiplying both sides with \(\mathbf{u}_j^{\mathrm{T}}\) we get
\[
\begin{aligned}
w_j^{(\tau)} &= \mathbf{u}_j^{\mathrm{T}}\mathbf{w}^{(\tau)} & (364)\\
&= \mathbf{u}_j^{\mathrm{T}}\mathbf{w}^{(\tau-1)} - \rho\,\mathbf{u}_j^{\mathrm{T}}\mathbf{H}(\mathbf{w}^{(\tau-1)} - \mathbf{w}^\star) & \\
&= w_j^{(\tau-1)} - \rho\eta_j\,\mathbf{u}_j^{\mathrm{T}}(\mathbf{w}^{(\tau-1)} - \mathbf{w}^\star) & \\
&= w_j^{(\tau-1)} - \rho\eta_j\,(w_j^{(\tau-1)} - w_j^\star), & (365)
\end{aligned}
\]
where we have used (9.59). To show that
\[
w_j^{(\tau)} = \left\{1 - (1-\rho\eta_j)^\tau\right\} w_j^\star
\]

for \(\tau = 1, 2, \ldots\), we can use proof by induction. For \(\tau = 1\), we recall that \(\mathbf{w}^{(0)} = \mathbf{0}\)
and insert this into (365), giving
\[
w_j^{(1)} = w_j^{(0)} - \rho\eta_j(w_j^{(0)} - w_j^\star) = \rho\eta_j w_j^\star = \{1 - (1-\rho\eta_j)\}\,w_j^\star.
\]
Now we assume that the result holds for \(\tau = N - 1\) and then make use of (365):
\[
\begin{aligned}
w_j^{(N)} &= w_j^{(N-1)} - \rho\eta_j(w_j^{(N-1)} - w_j^\star)\\
&= w_j^{(N-1)}(1-\rho\eta_j) + \rho\eta_j w_j^\star\\
&= \left\{1 - (1-\rho\eta_j)^{N-1}\right\} w_j^\star (1-\rho\eta_j) + \rho\eta_j w_j^\star\\
&= \left\{(1-\rho\eta_j) - (1-\rho\eta_j)^N\right\} w_j^\star + \rho\eta_j w_j^\star\\
&= \left\{1 - (1-\rho\eta_j)^N\right\} w_j^\star
\end{aligned}
\]
as required. Provided that \(|1-\rho\eta_j| < 1\), we have \((1-\rho\eta_j)^\tau \to 0\) as \(\tau\to\infty\), and
hence \(\{1-(1-\rho\eta_j)^\tau\} \to 1\) and \(\mathbf{w}^{(\tau)} \to \mathbf{w}^\star\). If \(\tau\) is finite but \(\eta_j \gg (\rho\tau)^{-1}\), then \(\tau\)
must still be large, since \(\eta_j\rho\tau \gg 1\), even though \(|1-\rho\eta_j| < 1\). If \(\tau\) is large, it follows
from the argument above that \(w_j^{(\tau)} \simeq w_j^\star\). If, on the other hand, \(\eta_j \ll (\rho\tau)^{-1}\), this
means that \(\rho\eta_j\) must be small, since \(\rho\eta_j\tau \ll 1\) and \(\tau\) is an integer greater than or
equal to one. If we expand
\[
(1-\rho\eta_j)^\tau = 1 - \tau\rho\eta_j + O(\rho^2\eta_j^2)
\]
and insert this into (9.58), we get
\[
|w_j^{(\tau)}| = \left|\left\{1 - (1-\rho\eta_j)^\tau\right\} w_j^\star\right|
= \left|\left\{\tau\rho\eta_j + O(\rho^2\eta_j^2)\right\} w_j^\star\right|
\simeq \tau\rho\eta_j\,|w_j^\star| \ll |w_j^\star|.
\]
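The closed-form trajectory can be verified by direct simulation in the eigenbasis of \(\mathbf{H}\), where each coordinate evolves independently (the eigenvalues, \(\mathbf{w}^\star\), and \(\rho\) below are arbitrary choices satisfying \(|1-\rho\eta_j| < 1\)):

```python
import numpy as np

# Check the closed form w_j^(tau) = {1 - (1 - rho*eta_j)^tau} w_j^* for
# gradient descent on a quadratic error with diagonal Hessian.
eta = np.array([0.5, 2.0, 10.0])      # eigenvalues eta_j of H
w_star = np.array([1.0, -2.0, 0.5])
rho = 0.05                            # |1 - rho*eta_j| < 1 for all j
tau = 40

w = np.zeros(3)
for _ in range(tau):
    w = w - rho * eta * (w - w_star)  # update (365) in the eigenbasis

closed_form = (1 - (1 - rho * eta) ** tau) * w_star
print(np.allclose(w, closed_form))  # True
```

The coordinate with the largest eigenvalue converges fastest, as the analysis above predicts.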

9.7 Suppose that a set of weights \(w_1, \ldots, w_K\) are shared so that \(w_1 = w_2 = \cdots =
w_K = \lambda\). We can compute the derivative of the error function with respect to \(\lambda\) using
the chain rule of calculus:
\[
\frac{\partial E}{\partial\lambda} = \sum_{i=1}^K \frac{\partial E}{\partial w_i}\frac{\partial w_i}{\partial\lambda} = \sum_{i=1}^K \frac{\partial E}{\partial w_i} \tag{366}
\]
where we have used
\[
\frac{\partial w_i}{\partial\lambda} = 1. \tag{367}
\]
Hence, first run the standard backpropagation algorithm (or automatic differentia-
tion) to evaluate the individual gradients ∂E/∂wi for all weights. Then, for every

group of weights in a network that are shared, sum up the gradients over all of those
weights and then use this combined gradient to update the weights in that group. Note
that, as long as the weights in each group are initialized to the same value, this will
ensure that they remain equal after the update.
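A minimal sketch of this update rule for tied weights (the helper function and numerical values are illustrative, not from the book):

```python
import numpy as np

# Gradient handling for tied weights: sum the per-weight gradients over each
# shared group and apply the same update to every member of the group.
def apply_shared_update(w, grads, groups, lr=0.1):
    w = w.copy()
    for idx in groups:                 # idx: indices sharing one value
        g = grads[idx].sum()           # combined gradient, as in (366)
        w[idx] -= lr * g               # identical update keeps them tied
    return w

w = np.array([0.5, 0.5, -1.0])         # w[0] and w[1] are tied
grads = np.array([0.2, -0.05, 0.3])
w_new = apply_shared_update(w, grads, groups=[np.array([0, 1])])
print(w_new)   # the tied weights remain equal after the update
```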

9.8 From the formula
\[
p(w) = \sum_{j=1}^M \pi_j\,\mathcal{N}(w\,|\,\mu_j, \sigma_j^2) \tag{368}
\]
we can identify the following probabilities:
\[
p(j) = \pi_j \tag{369}
\]
\[
p(w\,|\,j) = \mathcal{N}(w\,|\,\mu_j, \sigma_j^2). \tag{370}
\]
Hence from Bayes' theorem we have
\[
p(j\,|\,w) = \frac{p(j)p(w\,|\,j)}{p(w)} = \frac{\pi_j\,\mathcal{N}(w\,|\,\mu_j,\sigma_j^2)}{\sum_k \pi_k\,\mathcal{N}(w\,|\,\mu_k,\sigma_k^2)}. \tag{371}
\]

9.9 This is easily verified by taking the derivative of (9.22), using (2.49) and standard
derivatives, yielding
\[
\frac{\partial\Omega}{\partial w_i} = \frac{1}{\sum_k \pi_k\,\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\sum_j \pi_j\,\mathcal{N}(w_i|\mu_j,\sigma_j^2)\,\frac{(w_i-\mu_j)}{\sigma_j^2}.
\]
Combining this with (9.23) and (9.24), we immediately obtain the second term of
(9.25).

9.10 Since the \(\mu_j\) only appear in the regularization term \(\Omega(\mathbf{w})\), from (9.23) we have
\[
\frac{\partial\widetilde{E}}{\partial\mu_j} = \lambda\frac{\partial\Omega}{\partial\mu_j}. \tag{372}
\]
Using (3.25), (9.22) and (9.24) and standard rules for differentiation, we can calculate
the derivative of \(\Omega(\mathbf{w})\) as follows:
\[
\frac{\partial\Omega}{\partial\mu_j}
= -\sum_i \frac{\pi_j\,\mathcal{N}\!\left(w_i|\mu_j,\sigma_j^2\right)}{\sum_{j'}\pi_{j'}\,\mathcal{N}\!\left(w_i|\mu_{j'},\sigma_{j'}^2\right)}\,\frac{w_i-\mu_j}{\sigma_j^2}
= -\sum_i \gamma_j(w_i)\,\frac{w_i-\mu_j}{\sigma_j^2}.
\]
Combining this with (372), we get (9.26).



9.11 Following the same line of argument as in Solution 9.10, we need the derivative of
\(\Omega(\mathbf{w})\) w.r.t. \(\sigma_j\). Again using (3.25), (9.22) and (9.24) and standard rules for differentiation,
we find this to be
\[
\begin{aligned}
\frac{\partial\Omega}{\partial\sigma_j}
&= -\sum_i \frac{\pi_j}{\sum_{j'}\pi_{j'}\,\mathcal{N}\!\left(w_i|\mu_{j'},\sigma_{j'}^2\right)}
\Biggl\{-\frac{1}{(2\pi)^{1/2}\sigma_j^2}\exp\left(-\frac{(w_i-\mu_j)^2}{2\sigma_j^2}\right)\\
&\qquad\qquad
+ \frac{1}{(2\pi)^{1/2}\sigma_j}\exp\left(-\frac{(w_i-\mu_j)^2}{2\sigma_j^2}\right)\frac{(w_i-\mu_j)^2}{\sigma_j^3}\Biggr\}\\
&= \sum_i \gamma_j(w_i)\left\{\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right\}.
\end{aligned}
\]
Combining this with (372), we get (9.28).


9.12 From the definition (9.30) we have
\[
\pi_k = \frac{\exp(\eta_k)}{\sum_l \exp(\eta_l)}. \tag{373}
\]
Taking the derivative then gives
\[
\frac{\partial\pi_k}{\partial\eta_j}
= \frac{\exp(\eta_k)}{\sum_l \exp(\eta_l)}\,\delta_{jk} - \frac{\exp(\eta_k)\exp(\eta_j)}{\left(\sum_l \exp(\eta_l)\right)^2}
= \delta_{jk}\pi_k - \pi_j\pi_k. \tag{374}
\]
From (9.22) and (??) we then have
\[
\frac{\partial\widetilde{E}}{\partial\eta_j}
= -\lambda\sum_i\sum_k \gamma_k(w_i)\,\frac{1}{\pi_k}\frac{\partial\pi_k}{\partial\eta_j}
= -\lambda\sum_i\sum_k \gamma_k(w_i)\left\{\delta_{jk} - \pi_j\right\}
= \lambda\sum_i\left\{\pi_j - \gamma_j(w_i)\right\} \tag{375}
\]
where we have used the fact that \(\sum_k \gamma_k(w_i) = 1\) for all \(i\).
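The softmax derivative (374) can be verified numerically against central differences:

```python
import numpy as np

# Numerical check of the softmax derivative (374):
# d pi_k / d eta_j = delta_jk * pi_k - pi_j * pi_k.
def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

eta = np.array([0.2, -1.0, 0.7])
pi = softmax(eta)
analytic = np.diag(pi) - np.outer(pi, pi)   # J[k, j] from (374)

eps = 1e-6
numeric = np.empty((3, 3))
for j in range(3):
    d = np.zeros(3); d[j] = eps
    numeric[:, j] = (softmax(eta + d) - softmax(eta - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))  # True
```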
9.13 The result is easily proved by substituting (9.36) into (9.37), and then substituting
(9.35) into the resulting expression, giving
\[
\begin{aligned}
\mathbf{y} &= \mathbf{F}_3(\mathbf{z}_2) + \mathbf{z}_2\\
&= \mathbf{F}_3(\mathbf{F}_2(\mathbf{z}_1) + \mathbf{z}_1) + \mathbf{F}_2(\mathbf{z}_1) + \mathbf{z}_1\\
&= \mathbf{F}_3\bigl(\mathbf{F}_2(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{F}_1(\mathbf{x}) + \mathbf{x}\bigr)
+ \mathbf{F}_2(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{F}_1(\mathbf{x}) + \mathbf{x}. \tag{376}
\end{aligned}
\]

9.14 Using (9.49), we can rewrite (9.47) as
\[
\begin{aligned}
E_{\mathrm{COM}} &= \mathbb{E}_{\mathbf{x}}\left[\left\{\frac{1}{M}\sum_{m=1}^M \epsilon_m(\mathbf{x})\right\}^2\right]\\
&= \frac{1}{M^2}\,\mathbb{E}_{\mathbf{x}}\left[\left\{\sum_{m=1}^M \epsilon_m(\mathbf{x})\right\}^2\right]\\
&= \frac{1}{M^2}\sum_{m=1}^M\sum_{l=1}^M \mathbb{E}_{\mathbf{x}}\left[\epsilon_m(\mathbf{x})\epsilon_l(\mathbf{x})\right]\\
&= \frac{1}{M^2}\sum_{m=1}^M \mathbb{E}_{\mathbf{x}}\left[\epsilon_m(\mathbf{x})^2\right] = \frac{1}{M}E_{\mathrm{AV}}
\end{aligned}
\]
where we have used (9.46) in the last step.
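A simulation sketch of this M-fold reduction, under the idealized assumption of uncorrelated, zero-mean member errors (the values below are arbitrary choices):

```python
import numpy as np

# With uncorrelated zero-mean member errors, the committee error is the
# average member error divided by M: E_COM = E_AV / M.
rng = np.random.default_rng(9)
M, samples = 8, 500_000
errors = rng.normal(0.0, 1.0, size=(samples, M))   # epsilon_m(x), uncorrelated

E_AV = (errors ** 2).mean()                        # average of E[eps_m^2]
E_COM = (errors.mean(axis=1) ** 2).mean()          # E[(mean of eps_m)^2]
print(E_AV, M * E_COM)                             # both close to 1.0
```

With correlated errors the reduction is smaller, which is why the following exercises establish only the weaker inequality \(E_{\mathrm{COM}} \leqslant E_{\mathrm{AV}}\) in general.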


9.15 We start by rearranging the r.h.s. of (9.46), by moving the factor \(1/M\) inside the sum
and the expectation operator outside the sum, yielding
\[
\mathbb{E}_{\mathbf{x}}\left[\sum_{m=1}^M \frac{1}{M}\,\epsilon_m(\mathbf{x})^2\right].
\]
If we then identify \(\epsilon_m(\mathbf{x})\) and \(1/M\) with \(x_i\) and \(\lambda_i\) in (2.102), respectively, and take
\(f(x) = x^2\), we see from (2.102) that
\[
\left(\sum_{m=1}^M \frac{1}{M}\,\epsilon_m(\mathbf{x})\right)^2 \leqslant \sum_{m=1}^M \frac{1}{M}\,\epsilon_m(\mathbf{x})^2.
\]
Since this holds for all values of \(\mathbf{x}\), it must also hold for the expectation over \(\mathbf{x}\),
proving (9.64).
9.16 If E(y(x)) is convex, we can apply (2.102) as follows:

    E_{\mathrm{AV}}
      = \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_x\left[E(y_m(x))\right]
      = \mathbb{E}_x\!\left[\sum_{m=1}^{M} \frac{1}{M}\,E(y_m(x))\right]
      \geqslant \mathbb{E}_x\!\left[E\!\left(\sum_{m=1}^{M} \frac{1}{M}\,y_m(x)\right)\right]
      = E_{\mathrm{COM}}

where \lambda_i = 1/M for i = 1, \ldots, M in (2.102) and we have implicitly defined versions of E_{\mathrm{AV}} and E_{\mathrm{COM}} corresponding to E(y(x)).

9.17 To prove that (9.67) is a sufficient condition for (9.66) we have to show that (9.66) follows from (9.67). To do this, consider a fixed set of ym(x) and imagine varying the αm over all possible values allowed by (9.67), and consider the values taken by yCOM(x) as a result. The maximum value of yCOM(x) occurs when αk = 1 for the k such that yk(x) ⩾ ym(x) for all m ≠ k, and hence all αm = 0 for m ≠ k. An analogous result holds for the minimum value. For other settings of α,

    ymin(x) < yCOM(x) < ymax(x),

since yCOM(x) is a convex combination of points ym(x) such that

    ∀m : ymin(x) ⩽ ym(x) ⩽ ymax(x).

Thus, (9.67) is a sufficient condition for (9.66).

Showing that (9.67) is a necessary condition for (9.66) is equivalent to showing that (9.66) is a sufficient condition for (9.67). The implication here is that if (9.66) holds for any choice of values of the committee members {ym(x)}, then (9.67) will be satisfied. Suppose, without loss of generality, that αk is the smallest of the α values, i.e., αk ⩽ αm for m ≠ k. Then consider yk(x) = 1, together with ym(x) = 0 for all m ≠ k. Then ymin(x) = 0 while yCOM(x) = αk, and hence from (9.66) we obtain αk ⩾ 0. Since αk is the smallest of the α values, it follows that all of the coefficients must satisfy αm ⩾ 0. Similarly, consider the case in which ym(x) = 1 for all m. Then ymin(x) = ymax(x) = 1, while yCOM(x) = Σm αm. From (9.66) it then follows that Σm αm = 1, as required.
9.18 From (3.2) the Bernoulli distribution for the elements of the dropout matrix can be written as

    \mathrm{Bern}(R_{ni} \mid \rho) = \rho^{R_{ni}} (1-\rho)^{1-R_{ni}}.    (377)

Hence we have

    \mathbb{E}[R_{ni}] = \sum_{R_{ni}\in\{0,1\}} \mathrm{Bern}(R_{ni} \mid \rho)\, R_{ni} = \rho.    (378)

Two elements R_{ni} and R_{nj} will be independent unless j = i. Hence, for j ≠ i we have

    \mathbb{E}[R_{ni}R_{nj}]
      = \sum_{R_{ni}\in\{0,1\}} \sum_{R_{nj}\in\{0,1\}} \mathrm{Bern}(R_{ni} \mid \rho)\,\mathrm{Bern}(R_{nj} \mid \rho)\, R_{ni} R_{nj}
      = \left(\sum_{R_{ni}\in\{0,1\}} \mathrm{Bern}(R_{ni} \mid \rho)\, R_{ni}\right)
        \left(\sum_{R_{nj}\in\{0,1\}} \mathrm{Bern}(R_{nj} \mid \rho)\, R_{nj}\right)
      = \rho^2

whereas if j = i we have

    R_{ni} R_{nj} = R_{ni}^2 = R_{ni}

and therefore

    \mathbb{E}[R_{ni}R_{nj}] = \mathbb{E}[R_{ni}] = \rho.

Combining these we obtain

    \mathbb{E}[R_{ni}R_{nj}] = \delta_{ij}\,\rho + (1-\delta_{ij})\,\rho^2.    (379)
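These moments are easily checked by sampling. The sketch below is an illustrative addition (ρ = 0.7 is an arbitrary choice) that estimates E[Rni] and E[Rni Rnj] empirically:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.7
n_samples = 200_000

# Sample Bernoulli dropout mask entries R_ni and R_nj for two distinct units i != j.
R_i = rng.binomial(1, rho, size=n_samples)
R_j = rng.binomial(1, rho, size=n_samples)

# (378): E[R_ni] = rho; (379): E[R_ni R_nj] = rho^2 for i != j, and rho for i = j.
assert abs(R_i.mean() - rho) < 0.01
assert abs((R_i * R_j).mean() - rho**2) < 0.01   # independent entries, i != j
assert abs((R_i * R_i).mean() - rho) < 0.01      # i = j case: R^2 = R
```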

To find the expected value of the error function (9.69) we first expand out the square to give

    E(\mathbf{W}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ t_{nk}^2
      - 2\, t_{nk} \sum_{i=1}^{D} w_{ki} R_{ni} x_{ni}
      + \left(\sum_{i=1}^{D} w_{ki} R_{ni} x_{ni}\right)\left(\sum_{j=1}^{D} w_{kj} R_{nj} x_{nj}\right) \right\}.

Next we take the expectation of the error and substitute for the expectations of the dropout matrix elements using (378) and (379) to give

    \mathbb{E}\left[E(\mathbf{W})\right]
      = \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ t_{nk}^2
        - 2\rho\, t_{nk} \sum_{i=1}^{D} w_{ki} x_{ni}
        + \rho^2 \left(\sum_{i=1}^{D} w_{ki} x_{ni}\right)^2 \right\}
        + \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{D} (\rho - \rho^2)\, w_{ki}^2 x_{ni}^2    (380)

      = \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ t_{nk} - \rho \sum_{i=1}^{D} w_{ki} x_{ni} \right\}^2
        + \rho(1-\rho) \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{D} w_{ki}^2 x_{ni}^2.    (381)
n=1 k=1 i=1

Finally, we can find a solution for the weights that minimize this expected error function by setting the derivatives with respect to w_{ki} equal to zero. For this it is more convenient to work with the expression (380), giving

    \frac{\partial}{\partial w_{ki}} \mathbb{E}\left[E(\mathbf{W})\right]
      = -2\rho \sum_{n=1}^{N} t_{nk} x_{ni}
        + 2\rho^2 \sum_{j=1}^{D} w_{kj} \left(\sum_{n=1}^{N} x_{nj} x_{ni}\right)
        + 2\rho(1-\rho)\, w_{ki} \left(\sum_{n=1}^{N} x_{ni}^2\right)
      = -2\rho \sum_{n=1}^{N} t_{nk} x_{ni}
        + 2 \sum_{j=1}^{D} w_{kj} \left\{ \rho^2 \left(\sum_{n=1}^{N} x_{nj} x_{ni}\right)
        + \delta_{ji}\,\rho(1-\rho) \left(\sum_{n=1}^{N} x_{ni}^2\right) \right\}.

This can be written in matrix form as

    0 = -\mathbf{B} + \mathbf{W}\mathbf{M}    (382)

where \mathbf{W} has elements w_{ki}, and the elements of \mathbf{B} and \mathbf{M} are given by

    B_{ki} = 2\rho \sum_{n=1}^{N} t_{nk} x_{ni}    (383)

    M_{ji} = 2\left\{ \rho^2 \left(\sum_{n=1}^{N} x_{nj} x_{ni}\right)
      + \delta_{ji}\,\rho(1-\rho) \left(\sum_{n=1}^{N} x_{ni}^2\right) \right\}.    (384)

Hence the minimizing weights are given by

    \mathbf{W}^\star = \mathbf{B}\mathbf{M}^{-1}.    (385)
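As an illustrative numerical check (not part of the original solution), we can build B and M from random data, form W⋆ = BM⁻¹, and confirm that it minimizes the expected error (381). All sizes and data here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 50, 5, 3                # data points, inputs, outputs (arbitrary sizes)
rho = 0.5
X = rng.standard_normal((N, D))   # inputs x_ni
T = rng.standard_normal((N, K))   # targets t_nk

# B and M as in (383)-(384): B_ki = 2 rho sum_n t_nk x_ni and
# M_ji = 2 { rho^2 sum_n x_nj x_ni + delta_ji rho (1 - rho) sum_n x_ni^2 }.
s = (X**2).sum(axis=0)
B = 2 * rho * T.T @ X
M = 2 * (rho**2 * X.T @ X + rho * (1 - rho) * np.diag(s))

W_star = B @ np.linalg.inv(M)     # (385); M is symmetric here

def expected_error(W):
    # Expected dropout error (381).
    resid = T - rho * X @ W.T
    return (resid**2).sum() + rho * (1 - rho) * np.sum(W**2 * s)

# W_star is the stationary point of a strictly convex quadratic, so any
# perturbation must increase the expected error.
E_star = expected_error(W_star)
for _ in range(5):
    W_pert = W_star + 0.01 * rng.standard_normal((K, D))
    assert expected_error(W_pert) > E_star
```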

Chapter 10 Convolutional Networks

10.1 We can impose the constraint \|x\|^2 = K by using a Lagrange multiplier λ and maximizing

    \mathbf{w}^{\mathrm{T}}\mathbf{x} + \lambda\left(\|\mathbf{x}\|^2 - K\right).    (386)

Taking the gradient with respect to x and setting this gradient to zero gives

    \mathbf{w} + 2\lambda\mathbf{x} = 0    (387)

which shows that x = αw where α = −1/(2λ).
10.2 Let us represent the input array as a vector x = (x_1, x_2, \ldots, x_5)^T. We will start with the case where there is only one convolutional filter, of width 3, whose weights we denote by the vector k = (k_1, k_2, k_3)^T. If we look at Figure 3, we can see that the three outputs, which we represent with the vector y = (y_1, y_2, y_3)^T, are given by

    \mathbf{y} = \begin{pmatrix}
      x_1 k_1 + x_2 k_2 + x_3 k_3 \\
      x_2 k_1 + x_3 k_2 + x_4 k_3 \\
      x_3 k_1 + x_4 k_2 + x_5 k_3
    \end{pmatrix}.    (388)

Now we wish to find a matrix K such that y = Kx. We can see that it must be a 3 × 5 matrix, with each entry K_{ij} given by the contribution that x_j makes to y_i. Therefore K is given by

    \mathbf{K} = \begin{pmatrix}
      k_1 & k_2 & k_3 & 0 & 0 \\
      0 & k_1 & k_2 & k_3 & 0 \\
      0 & 0 & k_1 & k_2 & k_3
    \end{pmatrix}.    (389)

This is an example of a Toeplitz matrix, in which each descending diagonal from left to right is constant. Convolution operations in 1D can always be represented as multiplication of the input array by a Toeplitz matrix.
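As an illustrative check with arbitrary numerical values for x and k, the Toeplitz matrix (389) reproduces the sliding-window computation (388):

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0, 3.0, 0.5])   # input x = (x1, ..., x5), arbitrary values
k = np.array([0.5, -1.0, 2.0])             # filter k = (k1, k2, k3), arbitrary values

# Toeplitz matrix K from (389): K[i, j] is the contribution of x_j to y_i.
K = np.array([
    [k[0], k[1], k[2], 0,    0],
    [0,    k[0], k[1], k[2], 0],
    [0,    0,    k[0], k[1], k[2]],
])

y = K @ x

# The same operation as a sliding window (cross-correlation, no padding, stride 1).
y_window = np.array([x[i:i + 3] @ k for i in range(3)])

assert np.allclose(y, y_window)
```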
10.3 Simple matrix multiplication shows that the convolution is given by

    \begin{pmatrix}
      32 & 14 & -18 \\
      -22 & -24 & 12 \\
      22 & 6 & 18
    \end{pmatrix}.
10.4 There are many possibilities for indexing the elements of I, K, and C. One choice is simply to choose the indices in K(l, m) to have the ranges 1 ⩽ l ⩽ L and 1 ⩽ m ⩽ M, giving

    C(j, k) = \sum_{l=1}^{L} \sum_{m=1}^{M} I(j + l, k + m)\, K(l, m).    (390)

Figure 3  Figure showing a convolution operation of a filter k over an input array x with output y.

If we similarly choose the two indices of I(·, ·) to run from 1, \ldots, J and 1, \ldots, K respectively, it follows that 0 ⩽ j ⩽ J − L and 0 ⩽ k ⩽ K − M. For the convolutional form we likewise have

    C(j, k) = \sum_{l=1}^{L} \sum_{m=1}^{M} I(j - l, k - m)\, K(l, m)    (391)

where now L + 1 ⩽ j ⩽ J + 1 and M + 1 ⩽ k ⩽ K + 1, as is easily verified. Finally, we can define λ = L − l + 1 and µ = M − m + 1, which allows us to rewrite (391) in the form

    C(j, k) = \sum_{\lambda=1}^{L} \sum_{\mu=1}^{M} \widetilde{I}(j + \lambda, k + \mu)\, \widetilde{K}(\lambda, \mu)    (392)

where we have defined

    \widetilde{I}(j + \lambda, k + \mu) = I(j - L + \lambda - 1, k - M + \mu - 1)    (393)
    \widetilde{K}(\lambda, \mu) = K(L - \lambda + 1, M - \mu + 1).    (394)

The convolution and cross-correlation representations differ in whether the index variables l and m that label the kernel elements run from low to high values or vice versa. In a machine learning application it is usually irrelevant which of these forms is used, since the algorithm will learn the same value for the kernel in the corresponding locations, so that learning with the inverted representation will lead to the same kernel but with its values in reverse order.
10.5 If we substitute z = x − y into (10.21) we obtain

    F(x) = \int_{-\infty}^{\infty} G(x - z)\, k(z)\, \mathrm{d}z.    (395)

If we now discretize the x and z variables into bins of width ∆ we can approximate this integral using

    F(j\Delta) \simeq \sum_{l=-\infty}^{\infty} G(j\Delta - l\Delta)\, k(l\Delta)\, \Delta.    (396)

This can now be written as a one-dimensional version of the convolutional layer defined by (10.19) in the form

    C(j) \simeq \sum_{l=1}^{L} I(j - l)\, K(l)    (397)

where we have defined C(j) = F(j∆), I(l) = G(l∆), and K(l) = k(l∆)∆.
10.6 We saw in Section 10.2.3 that convolving an image of dimensions J × K, with additional padding P, with a filter of dimensions M × M will yield a feature map of dimension (J + 2P − M + 1) × (K + 2P − M + 1). Now if we substitute P = (M − 1)/2, we see that the dimensions of the feature map are given by (J + 2(M − 1)/2 − M + 1) × (K + 2(M − 1)/2 − M + 1), which simplifies to J × K.
10.7 In the case with no padding and a stride of 1, an M × M kernel can be placed in J − M + 1 positions horizontally and K − M + 1 positions vertically, giving us (J − M + 1) × (K − M + 1) features. After applying padding P to each of the edges of the image, we have a new image of size (J + 2P) × (K + 2P), and hence the dimensionality of the feature layer would be (J + 2P − M + 1) × (K + 2P − M + 1). When we apply a stride, we divide the number of convolutions in each dimension by the stride and use the floor operator to account for the case where there is some remainder of the image in a given direction that is less than the stride. The initial 1 is not divided by the stride as it represents the first operation and is therefore unaffected by the stride. This gives us

    \left( \left\lfloor \frac{J + 2P - M}{S} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{K + 2P - M}{S} \right\rfloor + 1 \right)    (398)

features.
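Formula (398) is easily coded. The following sketch is an illustrative addition (the function name is ours) that checks it against the special cases discussed above:

```python
def conv_output_shape(J, K, M, P=0, S=1):
    # Feature-map dimensions (398) for a J x K image, an M x M kernel,
    # padding P on every edge, and stride S.
    return ((J + 2 * P - M) // S + 1,
            (K + 2 * P - M) // S + 1)

# No padding, stride 1: (J - M + 1) x (K - M + 1).
assert conv_output_shape(224, 224, 3) == (222, 222)
# "Same" padding P = (M - 1)/2 preserves the input size (Exercise 10.6).
assert conv_output_shape(224, 224, 3, P=1) == (224, 224)
# With a stride, the floor discards any leftover border smaller than S.
assert conv_output_shape(7, 7, 3, P=0, S=2) == (3, 3)
```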
10.8 For simplicity, we count connections between padding inputs and features as connections, and we have not included max pooling or activation function connections. As every convolutional layer in the VGG-16 network uses a 3 × 3 filter, a given node in a convolutional layer takes a number of inputs equal to 9 times the number of channels in the previous layer, plus 1 for the bias. The first convolutional layer therefore has 224 × 224 × (3 × 9 + 1) × 64 = 96,337,920 connections to the previous layer. The number of connections for a fully connected layer is equal to the number of input features plus 1 for the bias, multiplied by the number of nodes, which for the first fully connected layer is equal to (7 × 7 × 512 + 1) × 4096 = 102,764,544. The numbers of connections for the remaining layers are shown in Table 1.

The number of learnable parameters in a convolutional layer is independent of the height and width dimensions of that layer. For a layer with a 3 × 3 filter, the number of parameters for a given kernel is equal to 9 multiplied by the number of channels in the previous layer, plus 1 for the bias. For a given layer, the number of such kernels is the same as the number of channels. So, for example, the first convolutional layer has 64 × (3 × 3 × 3 + 1) = 1,792 learnable parameters. For a fully connected layer, the number of parameters is just equal to the number of connections as there are no shared weights. The numbers of learnable parameters for each layer are also shown in Table 1.

Layer                  Connections    Learnable Parameters
Convolution 1           96,337,920                   1,792
Convolution 2        1,852,899,328                  36,928
Convolution 3          926,449,664                  73,856
Convolution 4        1,852,899,328                 147,584
Convolution 5          926,449,664                 295,168
Convolution 6        1,852,899,328                 590,080
Convolution 7        1,852,899,328                 590,080
Convolution 8          926,449,664               1,180,160
Convolution 9        1,852,899,328               2,359,808
Convolution 10       1,852,899,328               2,359,808
Convolution 11         462,522,368               2,359,808
Convolution 12         462,522,368               2,359,808
Convolution 13         462,522,368               2,359,808
Fully Connected 1      102,764,544             102,764,544
Fully Connected 2       16,781,312              16,781,312
Fully Connected 3        4,097,000               4,097,000
Total               15,504,292,840             138,357,544

Table 1  The number of connections and learnable parameters in each layer of the VGG-16 network.
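The learnable-parameter counts in Table 1 can be reproduced with a few lines of code. This is an illustrative addition; the channel dimensions are the standard VGG-16 configuration:

```python
# Input/output channel counts for the 13 convolutional layers of VGG-16,
# all of which use 3x3 kernels.
in_ch  = [3, 64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512]
out_ch = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

# Each kernel has 3*3*C_in weights plus one bias, and there are C_out kernels.
conv_params = [c_out * (3 * 3 * c_in + 1) for c_in, c_out in zip(in_ch, out_ch)]

# Fully connected layers: (number of inputs + 1 bias) * number of nodes.
fc_params = [(7 * 7 * 512 + 1) * 4096, (4096 + 1) * 4096, (4096 + 1) * 1000]

assert conv_params[0] == 1_792          # first convolutional layer
assert conv_params[1] == 36_928
assert fc_params[0] == 102_764_544      # first fully connected layer
assert sum(conv_params) + sum(fc_params) == 138_357_544   # total in Table 1
```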

10.9 The convolution operation can be written as

    C(j, k) = \sum_{l=1}^{L} \sum_{m=1}^{M} I(j - l, k - m)\, K(l, m)    (399)

where j = 1, \ldots, J and k = 1, \ldots, K. The kernel K is swept across the image I, giving a total number of positions of (J − L + 1) × (K − M + 1), and for each position the number of operations is L × M. Thus the total number of operations is

    (J - L + 1)(K - M + 1)LM.    (400)

Now suppose that the kernel is separable, in other words that it factorizes in the form

    K(l, m) = F(l)G(m).    (401)


Substituting (401) into (399) we obtain

    C(j, k) = \sum_{l=1}^{L} F(l) \sum_{m=1}^{M} I(j - l, k - m)\, G(m).    (402)

Consider first the summation over m. This involves a one-dimensional kernel G(m) which must be swept over the image for a total number of J × (K − M + 1) positions, and in each position there are M operations to perform, giving a total number of operations equal to (K − M + 1)JM. This gives rise to an intermediate array of dimension J × (K − M + 1). Now the summation over l is performed, which is also a convolution, involving a one-dimensional kernel F(l). The number of positions for this kernel is given by (K − M + 1) × (J − L + 1), and in each position we have to perform L operations, giving a total of (K − M + 1)(J − L + 1)L. Overall, the total number of operations is therefore given by

    (K - M + 1)JM + (K - M + 1)(J - L + 1)L.    (403)

To see that this represents a saving in computation, consider the case where the image is large compared to the kernel size so that J ≫ L and K ≫ M. Then (400) is approximately given by JKLM whereas (403) is approximately given by JK(L + M). Note that as well as saving on compute, a separable kernel uses less storage. However, since it restricts the form of the kernel, it can lead to a significant reduction in generalization accuracy.
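The equivalence of the two orderings, and hence the correctness of (402), can be checked numerically. The sketch below is an illustrative addition (array sizes and values are arbitrary) comparing a direct 2D cross-correlation with two 1D passes for a separable kernel:

```python
import numpy as np

rng = np.random.default_rng(4)
J, K_dim, L, M = 32, 40, 3, 5        # image and kernel sizes (arbitrary)
I = rng.standard_normal((J, K_dim))
F = rng.standard_normal(L)           # column factor F(l)
G = rng.standard_normal(M)           # row factor G(m)
Kern = np.outer(F, G)                # separable kernel K(l, m) = F(l) G(m)

# Direct 2D valid cross-correlation with the full L x M kernel.
out_direct = np.array([[np.sum(I[j:j + L, k:k + M] * Kern)
                        for k in range(K_dim - M + 1)]
                       for j in range(J - L + 1)])

# Two 1D passes: first along rows with G, then along columns with F.
tmp = np.array([[I[j, k:k + M] @ G for k in range(K_dim - M + 1)]
                for j in range(J)])
out_sep = np.array([[tmp[j:j + L, k] @ F for k in range(K_dim - M + 1)]
                    for j in range(J - L + 1)])

assert np.allclose(out_direct, out_sep)
```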

10.10 The derivatives of a cost function with respect to an activation value can be evaluated using backpropagation, which corresponds to an application of the chain rule of calculus. This backpropagation starts with the derivatives of the cost function with respect to the local activations, which for the cost function defined by (10.12) are given by

    \delta_{ijk} = \frac{\partial F(\mathbf{I})}{\partial a_{ijk}} = 2 a_{ijk}    (404)

and hence, up to a factor of 2, are given by the activation values themselves. These values are then back-propagated through the network using (8.13) until the input layer is reached. This input layer represents the image, and the associated δ values correspond to derivatives of the cost function (10.12) with respect to the pixel values, as required.

10.11 Consider a 1-hot encoding scheme for the C object classes using binary variables y_i \in \{0, 1\} where i = 1, \ldots, C and y_i = 1 represents the presence of an object from class i. We then introduce an additional class with a binary variable y_{C+1} \in \{0, 1\} where y_{C+1} = 1 means there is no object from any of the given classes in the image. Since we assume that a given image either contains an object from one of the classes or no objects, the variables y_1, \ldots, y_{C+1} form a 1-hot encoding where all variables have the value 0 except for a single variable taking the value 1. We can train a model f(·) to take an image as input and to return a probability distribution over these variables. These probabilities must sum to one:

    \sum_{i=1}^{C+1} p_f(y_i = 1) = 1.    (405)

Now instead suppose we introduce a binary variable b \in \{0, 1\} such that b = 1 means an object (of any class) is present in the image and b = 0 means that no object is present. We can train a model g(·) to output a probability distribution over this variable, such that

    p_g(b = 1) + p_g(b = 0) = 1.    (406)

We also introduce binary variables z_1, \ldots, z_C \in \{0, 1\} to predict the class of the object, conditional on there being an object present. These variables have a 1-hot encoding. We then train an associated model h(·) to output a probability distribution satisfying

    \sum_{i=1}^{C} p_h(z_i = 1) = 1.    (407)

To relate these sets of probabilities we can use the product rule in the form

    p(\text{object present and class } i) = p(\text{class } i \mid \text{object present})\, p(\text{object present})    (408)

which gives the following results:

    p_f(y_i = 1) = p_h(z_i = 1)\, p_g(b = 1), \quad i = 1, \ldots, C    (409)
    p_f(y_{C+1} = 1) = p_g(b = 0).    (410)
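As a small illustrative check (the particular probability values are arbitrary), combining the outputs of g(·) and h(·) via (409) and (410) does yield a distribution satisfying (405):

```python
import numpy as np

rng = np.random.default_rng(5)
C = 4                                  # number of object classes (arbitrary)

# Hypothetical model outputs: g(.) gives the probability an object is present,
# h(.) gives a normalized distribution over classes given an object is present.
p_present = 0.8                        # p_g(b = 1)
p_class = rng.dirichlet(np.ones(C))    # p_h(z_i = 1), i = 1, ..., C; sums to one

# (409)-(410): combine into the distribution over y_1, ..., y_{C+1}.
p_y = np.append(p_class * p_present,   # p_f(y_i = 1) for i = 1, ..., C
                1.0 - p_present)       # p_f(y_{C+1} = 1) = p_g(b = 0)

assert np.isclose(p_y.sum(), 1.0)      # normalization (405) holds
```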

10.12 We first note that evaluating the scalar product between two N-dimensional vectors requires N multiplies and N − 1 additions, giving a total of 2N − 1 computational steps.

For the network in Figure 10.22, the number of computational steps required to calculate the first convolution operation is given by the number of features in the second layer, which is 4 × 4 = 16, multiplied by the number of steps needed to evaluate the output of the filter. Since the filter size is 3 × 3 = 9, each filter evaluation requires 9 × 8 = 72 steps. Hence the total number of computational steps for the first convolutional layer is 16 × 72 = 1,152. One evaluation of a 2 × 2 max pooling filter requires 3 computational steps, and this is multiplied by the number of such operations required for the max pooling layer, which is 4, giving 12 steps in total. The fully connected layer is equivalent to a scalar product between two vectors of dimensionality 4 and hence requires 4 + 3 = 7 operations. Therefore a single evaluation of this network requires a total of 1,152 + 12 + 7 = 1,171 computational steps.

For the network in Figure 10.23, there are 6 × 6 = 36 features in the second layer, each of which requires 9 × 8 = 72 computations in the convolutional layer, giving a total of 36 × 72 = 2,592 computational steps. The max pooling layer has 9 nodes and hence requires 9 × 3 = 27 computations. Finally, the fully connected layer requires 4 × 7 = 28 computations, giving a total of 2,592 + 27 + 28 = 2,647 operations for one pass through the whole network.

Therefore the improvement in efficiency from using one pass through the second network compared to 8 passes through the first network is equal to 1,171 × 8 / 2,647 = 3.54.
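The arithmetic above can be reproduced directly. This is an illustrative addition; the per-filter cost of 72 steps is taken from the solution text:

```python
# Counts quoted in the solution: each 3x3 filter evaluation is taken to cost
# 72 steps, a max over 4 values costs 3 comparisons, and the scalar product
# of two n-vectors costs 2n - 1 operations.
ops_small = 16 * 72 + 4 * 3 + (2 * 4 - 1)   # network of Figure 10.22
ops_large = 36 * 72 + 9 * 3 + 4 * 7         # network of Figure 10.23

assert ops_small == 1_171
assert ops_large == 2_647
assert round(ops_small * 8 / ops_large, 2) == 3.54   # efficiency ratio
```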
10.13 The padded input vector is given by

    \mathbf{x} = (0, x_1, x_2, x_3, x_4, 0)^{\mathrm{T}}.    (411)

For a filter with elements (w_1, w_2, w_3) and a stride of 2, the output vector will be two-dimensional and can be written as y = (y_1, y_2)^T, in which the elements are given by

    y_1 = w_2 x_1 + w_3 x_2    (412)
    y_2 = w_1 x_2 + w_2 x_3 + w_3 x_4.    (413)

We can write this convolution operation using matrix notation in the form

    \mathbf{y} = \mathbf{A}\mathbf{x}    (414)

where the matrix A is given by

    \mathbf{A} = \begin{pmatrix}
      w_1 & w_2 & w_3 & 0 & 0 & 0 \\
      0 & 0 & w_1 & w_2 & w_3 & 0
    \end{pmatrix}.    (415)

Now consider the up-sampling operation using a filter (w_1, w_2, w_3) operating on a vector z = (z_1, z_2)^T with a stride of 2. The resulting six-dimensional vector h = (h_1, h_2, h_3, h_4, h_5, h_6)^T has elements given by

    h_1 = w_1 z_1    (416)
    h_2 = w_2 z_1    (417)
    h_3 = w_3 z_1 + w_1 z_2    (418)
    h_4 = w_2 z_2    (419)
    h_5 = w_3 z_2    (420)
    h_6 = 0.    (421)

This can be written using matrix notation in the form

    \mathbf{h} = \mathbf{B}\mathbf{z}    (423)

where the matrix B is given by

    \mathbf{B} = \begin{pmatrix}
      w_1 & 0 \\
      w_2 & 0 \\
      w_3 & w_1 \\
      0 & w_2 \\
      0 & w_3 \\
      0 & 0
    \end{pmatrix}.    (424)

By inspection we see that B = A^T, and hence up-sampling can be seen as the transpose of convolution.
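As an illustrative check with arbitrary filter and input values, we can verify B = A^T and the outputs (412)–(413) numerically:

```python
import numpy as np

w1, w2, w3 = 0.5, -1.0, 2.0   # arbitrary filter weights

# Strided convolution over the zero-padded input, written as y = A x, (415).
A = np.array([
    [w1, w2, w3, 0,  0,  0],
    [0,  0,  w1, w2, w3, 0],
])

# Up-sampling (transposed-convolution) matrix B from (424).
B = np.array([
    [w1, 0],
    [w2, 0],
    [w3, w1],
    [0,  w2],
    [0,  w3],
    [0,  0],
])

assert np.array_equal(B, A.T)   # up-sampling is the transpose of convolution

# Check (412)-(413) on a concrete padded input x = (0, x1, x2, x3, x4, 0).
x1, x2, x3, x4 = 1.0, 2.0, 3.0, 4.0
x = np.array([0.0, x1, x2, x3, x4, 0.0])
y = A @ x
assert np.allclose(y, [w2 * x1 + w3 * x2, w1 * x2 + w2 * x3 + w3 * x4])
```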
