Concepts_in_Deep_Learning_Solutions_v1.0
Concepts_in_Deep_Learning_Solutions_v1.0
Bishop
with Hugh Bishop
Deep Learning
Foundations
and Concepts
Solutions to Exercises
Chapters 2 to 10 | Version 1.0
2
This is version 1.0 of the solutions manual for Deep Learning: Foundations
and Concepts by C. M. Bishop and H. Bishop (Springer, 2024) and contains worked
solutions for exercises in Chapters 2 to 10. A full solutions manual including solu-
tions to all exercises in the book will be released soon. The most recent version of
the solutions manual, along with a free-to-use digital version of the book as well as
downloadable versions of the figures in PDF and JPEG formats, can be found on the
book web site:
https://www.bishopbook.com
If you have any feedback on the book or associated materials including this solutions
manual, please send email to the authors at
[email protected]
Contents
Contents 3
Chapter 2: Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 3: Standard Distributions . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 4: Single-layer Networks: Regression . . . . . . . . . . . . . . . 44
Chapter 5: Single-layer Networks: Classification . . . . . . . . . . . . . . 49
Chapter 6: Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . 60
Chapter 7: Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 8: Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 9: Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Chapter 10: Convolutional Networks . . . . . . . . . . . . . . . . . . . . 103
3
4 Solutions 2.1–2.2
Chapter 2 Probabilities
2.2 Note that each of the numbers 0, 1, 2, 3, 4, 5, and 6 appears on only one of the dice,
which means that when we roll one die against another, there can never be a draw.
Look first at the red die, and notice that it has four copies of the number 2 and two
copies of the number 6. Two-thirds of the time, when we roll the red die it will
give a 2, and one third of the time it will give a 6. Therefore, if we roll the red die
against the yellow die (which always gives a 3), the yellow die will, on average, win
two-thirds of the time, and will lose one-third of the time.
Now look at the blue die, and notice that it has four copies of the number 4, and two
copies of the number 0. When we roll it against the yellow die, it will therefore give
a 4 two thirds of the time, in which case it wins, and a 0 one-third of the time, in
which case it loses.
Next consider the green die versus the blue die. The green die has three copies of
the number 1 and three copies of the number 5. To work out the probability that
the green die will win we first note that there is a probability of 1/2 that the green
die will give a 5, in which case it is certain to win against the blue die. Likewise,
there is a probability of 1/2 that the green die will give a 1, in which case there is a
probability of 1/3 that it will win. The overall probability that the green die will win
is then given by multiplying the probabilities:
1 1 1 2
×1 + × = . (4)
2 2 3 3
Finally, consider the probability of the red die winning against the green die. There
is a probability of 1/3 that the red die will produce a 6, in which case it is certain
that the red die will win. There is similarly a probability of 2/3 that the red die will
Solutions 2.3–2.5 5
produce a 2 in which case there is a 1/2 chance that the red die will win. The overall
probability of the red die winning is again obtained by multiplying the probabilities:
1 2 1 2
×1 + × = . (5)
3 3 2 3
2.3 Using the sum and product rules of probability we can write the desired distribution
in the form ZZ
p(y) = p(y|u, v)pu (u)pv (v) du dv. (6)
Substituting (7) into (6) allows us to perform the integration over v to give
Z
p(y) = pu (u)pv (y − u) du (8)
as required.
and hence this distribution is normalized. For the mean of the distribution we have
b b
x2 b2 − a2
Z
1 a+b
E[x] = x dx = = = .
a b−a 2(b − a) a 2(b − a) 2
b b
x3 b3 − a 3 a2 + ab + b2
Z
2 1 2
E[x ] = x dx = = =
a b−a 3(b − a) a 3(b − a) 3
a2 + ab + b2 (a + b)2 (b − a)2
var[x] = E[x2 ] − E[x]2 = − = .
3 4 12
6 Solutions 2.6–2.7
where we have used the change of variables y = λx. Hence, the exponential dis-
tribution is normalized. Likewise, if we integrate the Laplace distribution (2.35) we
obtain
Z ∞ Z ∞
1 |x − µ|
p(x|µ, λ) dx = exp − dx
−∞ −∞ 2γ γ
Z µ Z ∞
1 x−µ 1 x−µ
= exp dx + exp − dx
−∞ 2γ γ µ 2γ γ
Z 0 Z ∞
1 1
= exp (z) dz + exp (−z) dz
−∞ 2 0 2
0 ∞
1 1
= exp(z) + − exp(−z)
2 −∞ 2 0
1 1
= + =1 (11)
2 2
where we have made the substitution z = (x − µ)/γ in each of the two integrals.
Hence we see that the Laplace distribution is also normalized.
∞ N Z
1 X ∞
Z
p(x|D) dx = δ(x − xn ) dx
−∞ N n=1 −∞
N Z
1 X ∞
= δ(y) dy
N n=1 −∞
N
1 X
= 1=1 (12)
N n=1
2.7 If we substitute the empirical distribution (2.37) into the definition of the expectation
Solutions 2.8–2.10 7
where the final term will integrate to zero with respect to the factorized distribution
p(x)p(z). Hence
ZZ
var[x + z] = (x + z − E[x + z])2 p(x)p(z) dx dz
Z Z
= (x − E[x]) p(x) dx + (z − E[z])2 p(z) dz
2
For discrete variables the integrals are replaced by summations, and the same results
are again obtained.
2.11 Using the definition (2.39) of expectation we have
Z Z
Ey [Ex [x|y]] = p(y) p(x|y)x dx dy
ZZ
= p(x, y)x dx dy
Z
= p(x)x dx = E [x] (19)
where we have used the product rule of probability p(x|y)p(y) = p(x, y). Now we
make use of the result (2.46) to write
We now note that the second and third terms on the right-hand side of (20) cancel.
The first term on the right-hand side of (20) can be written as
Z Z
Ey Ex [x2 |y] = p(y) p(x|y)x2 dx dy
ZZ
= p(x, y)x2 dx dy
Z
= p(x)x2 dx = E x2 .
(21)
Likewise, we can again make use of the result (2.46) to write the fourth term on the
2
right-hand side of (20) in the form E [x] . Hence we have
2
Ey [varx [x|y]] + vary [Ex [x|y]] = E x2 − E [x] = var [x]
(22)
as required.
Solutions 2.12–2.13 9
We now note that in the factor (y + µ) the first term in y corresponds to an odd
integrand and so this integral must vanish (to show this explicitly, write the integral
as the sum of two integrals, one from −∞ to 0 and the other from 0 to ∞ and then
show that these two integrals cancel). In the second term, µ is a constant and pulls
outside the integral, leaving a normalized Gaussian distribution which integrates to
1, and so we obtain (2.52).
To derive (2.53) we first substitute the expression (2.49) for the normal distribution
into the normalization result (2.51) and re-arrange to obtain
Z ∞
1 2
1/2
exp − 2 (x − µ) dx = 2πσ 2 . (31)
−∞ 2σ
We now differentiate both sides of (31) with respect to σ 2 and then re-arrange to
obtain 1/2 Z ∞
1 1 2
exp − (x − µ) (x − µ)2 dx = σ 2 (32)
2πσ 2 −∞ 2σ 2
E[x2 ] − 2µE[x] + µ2 = σ 2 .
E[x2 ] − E[x]2 = µ2 + σ 2 − µ2 = σ 2 .
2.14 For the univariate case, we simply differentiate (2.49) with respect to x to obtain
d x−µ
N x|µ, σ 2 = −N x|µ, σ 2
.
dx σ2
Setting this to zero we obtain x = µ.
2.15 We use ` to denote ln p(X|µ, σ 2 ) from (2.56). By standard rules of differentiation
we obtain
N
∂` 1 X
= 2 (xn − µ).
∂µ σ n=1
Setting this equal to zero and moving the terms involving µ to the other side of the
equation we get
N
1 X 1
xn = 2 N µ
σ 2 n=1 σ
Solutions 2.16–2.17 11
Similarly we have
N
∂` 1 X N 1
= (xn − µ)2 −
∂σ 2 2(σ 2 )2 n=1 2 σ2
N
N 1 1 X
= (xn − µ)2 .
2 σ2 2(σ 2 )2 n=1
Multiplying both sides by 2(σ 2 )2 /N and substituting µML for µ we get (2.58).
Next we have
N
1 X
E[µML ] = E[xn ] = µ (34)
N n=1
using (2.52).
2
Finally, consider E[σML ]. From (2.57) and (2.58), and making use of (2.128), we
have
!2
N N
2 1 X 1 X
E[σML ] = E xn − xm
N n=1 N m=1
N
" N N X N
#
1 X 2 X 1 X
= E x2n − xn xm + 2 xm xl
N n=1 N m=1
N m=1
l=1
2 2 2 1 2 2 1 2
= µ +σ −2 µ + σ +µ + σ
N N
N −1
= σ2 (35)
N
as required.
12 Solutions 2.18–2.20
2.17 From the definition (2.61), and making use of (2.52) and (2.53), we have
" N
#
2 1 X 2
E σb =E (xn − µ)
N n=1
N
1 X 2
E xn − 2xn µ + µ2
=
N n=1
N
1 X 2
µ + σ 2 − 2µµ + µ2
=
N n=1
= σ2 (36)
as required.
2.18 Differentiating (2.66) with respect to σ 2 gives
N
∂ 2 1 X 2 N 1
2
ln p(t|x, w, σ ) = 4
{y(xn , w) − tn } − . (37)
∂σ 2σ n=1 2 σ2
as required.
2.19 If we assume that the function y = f (x) is strictly monotonic, which is necessary to
exclude the possibility for spikes of infinite density in p(y), we are guaranteed that
the inverse function x = f −1 (y) exists. We can then use (2.71) to write
df −1
p(y) = q(f −1 (y)) . (39)
dy
Since the only restriction on f is that it is monotonic, it can distribute the probability
mass over x arbitrarily over y. This is illustrated in Figure 2.12 on page 44, as a part
of Solution ??. From (39) we see directly that
q(x)
|f 0 (x)| = .
p(f (x))
2.20 The Jacobian matrix for the transformation from (x1 , x2 ) to (y1 , y2 ) is defined by
∂y1 ∂y1
∂x1 ∂x2
J= ∂y2 ∂y2 .
(40)
∂x1 ∂x2
Solutions 2.21–2.22 13
Since the right-hand side does not depend on i, this shows that the probabilities are
all equal. From (45) it then follows that p(xi ) = 1/M . Substituting this result into
(2.86) then shows that the value of the entropy at its maximum is equal to ln M .
2.23 The entropy of an M -state discrete variable x can be written in the form
M M
X X 1
H(x) = − p(xi ) ln p(xi ) = p(xi ) ln . (49)
i=1 i=1
p(xi )
The function ln(x) is concave_ and so we can apply Jensen’s inequality in the form
(2.102) but with the inequality reversed, so that
M
!
X 1
H(x) 6 ln p(xi ) = ln M. (50)
i=1
p(xi )
2.24 Obtaining the required functional derivative can be done simply by inspection. How-
ever, if a more formal approach is required we can proceed as follows using the
techniques set out in Appendix B. Consider first the functional
Z
I[p(x)] = p(x)f (x) dx.
and hence from (B.3) we deduce that the functional derivative is given by
δI
= f (x).
δp(x)
Similarly, if we define Z
J[p(x)] = p(x) ln p(x) dx
and hence
δJ
= p(x) + 1.
δp(x)
Solutions 2.25–2.26 15
Using these two results we obtain the following result for the functional derivative
− ln p(x) − 1 + λ1 + λ2 x + λ3 (x − µ)2 .
Re-arranging then gives (2.97).
To eliminate the Lagrange multipliers we substitute (2.97) into each of the three
constraints (2.93), (2.94) and (2.95) in turn. The solution is most easily obtained by
comparison with the standard form of the Gaussian, and noting that the results
1
λ1 = 1 − ln 2πσ 2
(51)
2
λ2 = 0 (52)
1
λ3 = (53)
2σ 2
do indeed satisfy the three constraints.
Note that there is a typographical error in the question, which should read ”Use
calculus of variations to show that the stationary point of the functional shown just
before (1.108) is given by (1.108)”.
For the multivariate version of this derivation, see Exercise 3.8.
2.25 Substituting the right hand side of (2.98) in the argument of the logarithm on the
right hand side of (2.91), we obtain
Z
H[x] = − p(x) ln p(x) dx
(x − µ)2
Z
1
= − p(x) − ln(2πσ 2 ) − dx
2 2σ 2
Z
1 2 1 2
= ln(2πσ ) + 2 p(x)(x − µ) dx
2 σ
1
ln(2πσ 2 ) + 1 ,
=
2
where in the last step we used (2.95).
2.26 The Kullback-Leibler divergence takes the form
Z Z
KL(pkq) = − p(x) ln q(x) dx + p(x) ln p(x) dx.
Differentiating this w.r.t. µ, using results from Appendix A, and setting the result to
zero, we see that
µ = E[x]. (55)
Similarly, differentiating (54) w.r.t. Σ−1 , again using results from Appendix A and
also making use of (55) and (2.48), we see that
Using (2.49) and (2.51)– (2.53), we can rewrite the first integral on the r.h.s. of (57)
as
(x − m)2
Z Z
2 1 2
− p(x) ln q(x) dx = N (x|µ, σ ) ln(2πs ) + dx
2 s2
Z
1 1
= ln(2πs2 ) + 2 N (x|µ, σ 2 )(x2 − 2xm + m2 ) dx
2 s
σ 2 + µ2 − 2µm + m2
1 2
= ln(2πs ) + . (58)
2 s2
The second integral on the r.h.s. of (57) we recognize from (2.91) as the negative
differential entropy of a Gaussian. Thus, from (57), (58) and (2.99), we have
σ 2 + µ2 − 2µm + m2
1 2 2
KL(pkq) = ln(2πs ) + − 1 − ln(2πσ )
2 s2
2
σ 2 + µ2 − 2µm + m2
1 s
= ln + −1 .
2 σ2 s2
1 − α2 = 1 − 1 + 2 − 2 = 2 + O(2 ). (61)
Solution 2.29 17
Substituting these expressions into the alpha divergence defined by (2.129) we obtain
Z o
4 n on
Dα (pkq) = 1 − p(x) 1 − ln p(x) 1 + ln q(x) dx + O()
2 2 2
Z
q(x)
= − p(x) ln dx + O() (62)
p(x)
where we have used Z
p(x) dx = 1. (63)
H(y|x) = H(y).
We now note that the right-hand side is independent of x and hence the left-hand side
must also be constant with respect to x. Using (2.110) it then follows that the mutual
information I[x, y] = 0. Finally, using (2.109) we see that the mutual information is
a form of KL divergence, and this vanishes only if the two distributions are equal, so
that p(x, y) = p(x)p(y) as required.
2.30 When we make a change of variables, the probability density is transformed by the
Jacobian of the change of variables. Thus we have
∂yi
p(x) = p(y) = p(y)|A| (69)
∂xj
(70)
as required.
2.31 The conditional entropy H(y|x) can be written
XX
H(y|x) = − p(yi |xj )p(xj ) ln p(yi |xj ) (71)
i j
which equals 0 by assumption. Since the quantity −p(yi |xj ) ln p(yi |xj ) is non-
negative each of these terms must vanish for any value xj such that p(xj ) 6= 0.
However, the quantity p ln p only vanishes for p = 0 or p = 1. Thus the quantities
p(yi |xj ) are all either 0 or 1. However, they must also sum to 1, since this is a
normalized probability distribution, and so precisely one of the p(yi |xj ) is 1, and
the rest are 0. Thus, for each value xj there is a unique value yi with non-zero
probability.
2.32 Consider (2.101) with λ = 0.5 and b = a + 2 (and hence a = b − 2),
Since this holds at all points, it follows that f 00 (x) > 0 everywhere.
To show the implication in the other direction, we make use of Taylor’s theorem
(with the remainder in Lagrange form), according to which there exist an x? such
that
1
f (x) = f (x0 ) + f 0 (x0 )(x − x0 ) + f 00 (x? )(x − x0 )2 .
2
Since we assume that f 00 (x) > 0 everywhere, the third term on the r.h.s. will always
be positive and therefore
Multiplying (72) by λ and (73) by 1 − λ and adding up the results on both sides, we
obtain
λf (a) + (1 − λ)f (b) > f (x0 ) = f (λa + (1 − λ)b)
as required.
2.33 From (2.101) we know that the result (2.102) holds for M = 1. We now suppose that
it holds for some general value M and show that it must therefore hold for M + 1.
Consider the left hand side of (2.102)
M +1
! M
!
X X
f λi xi = f λM +1 xM +1 + λi x i (74)
i=1 i=1
M
!
X
= f λM +1 xM +1 + (1 − λM +1 ) η i xi (75)
i=1
M
X +1
λi = 1 (78)
i=1
Then using (76) we see that the quantities ηi satisfy the property
M M
X 1 X
ηi = λi = 1. (80)
i=1
1 − λM +1 i=1
Thus we can apply the result (2.102) at order M and so (77) becomes
M +1
! M M +1
X X X
f λi x i 6 λM +1 f (xM +1 ) + (1 − λM +1 ) ηi f (xi ) = λi f (xi ) (81)
i=1 i=1 i=1
where the constant term is just the negative entropy of the fixed distribution p(x).
Substituting for p(x) in the first term using the empirical distribution (2.37), and
substituting for q(x) using the model distribution q(x|θ) gives
Z N
1 X
KL(pkq) = − δ(x − xn ) ln q(x|θ) dx + const.
N n=1
N
1 X
=− ln q(xn |θ) + const.. (83)
N n=1
2.36 We first evaluate the marginal and conditional probabilities p(x), p(y), p(x|y), and
p(y|x), to give the results shown in the tables below. From these tables, together
y y
x 0 2/3 0 1 0 1
1 1/3 1/3 2/3 x 0 1 1/2
1 0 1/2
p(x) p(y) p(x|y)
y
0 1
x 0 1/2 1/2
1 0 1
p(y|x)
and similar definitions for H(y) and H(y|x), we obtain the following results
(a) H(x) = ln 3 − 32 ln 2
(b) H(y) = ln 3 − 32 ln 2
2
(c) H(y|x) = 3
ln 2
2
(d) H(x|y) = 3
ln 2
(e) H(x, y) = ln 3
(f) I(x; y) = ln 3 − 43 ln 2
22 Solutions 2.37–2.39
where we have used (2.110) to evaluate the mutual information. The corresponding
diagram is shown in Figure 1.
K K
!1/K
1 X Y
x̄A = xk and x̄G = xk ,
K
k k
K
! K
1 X 1 X
ln x̄A = ln xk and ln x̄G = ln xk .
K K
k k
By matching f with ln and λi with 1/K in (2.102), taking into account that the
logarithm is concave rather than convex and the inequality therefore goes the other
way, we obtain the desired result.
2.38 From the product rule of probability we have p(x, y) = p(y|x)p(x), and so (2.109)
can be written as
ZZ ZZ
I(x; y) = − p(x, y) ln p(y) dx dy + p(x, y) ln p(y|x) dx dy
Z ZZ
= − p(y) ln p(y) dy + p(x, y) ln p(y|x) dx dy
= H(y) − H(y|x). (86)
where Z
z̄i = E[zi ] = zi p(zi ) dzi .
For y2 we have
p(y2 |y1 ) = δ(y2 − y12 ),
i.e., a spike of probability mass one at y12 , which is clearly dependent on y1 . With ȳi
defined analogously to z̄i above, we get
ZZ
cov[y1 , y2 ] = (y1 − ȳ1 )(y2 − ȳ2 )p(y1 , y2 ) dy1 dy2
ZZ
= y1 (y2 − ȳ2 )p(y2 |y1 )p(y1 ) dy1 dy2
Z
= (y13 − y1 ȳ2 )p(y1 ) dy1
= 0,
where we have used the fact that all odd moments of y1 will be zero, since it is
symmetric around zero.
2.40 [The original printing of Deep Learning: Foundations and Concepts has a typo in
this exercise in which the word ‘convex’ in the first sentence should read ‘concave’.
Note, however, that the exercise can equally well be solved with ‘convex’ by follow-
ing the same reasoning as given here.] We introduce a binary variable C such that
C = 1 denotes that the coin lands concave side up, and a binary variable Q for which
Q = 1 denotes that the concave side of the coin is heads. From the stated physical
properties of the coin, and from the assumed prior probability that the concave side
is heads, we have
The data set D = {x1 , . . . , x10 } consists of 10 observations x each of which takes
the value H for heads or T for tails. We assume that the data points are independent
and identically distributed, so the ordering does not matter. A particular coin flip can
land heads up either because it lands concave side up and the concave side is heads
24 Solution 2.40
or because it lands convex side up and the concave side is tails. Thus the probability
of landing heads is given by
p(x = H) = p(C = 1)p(Q = 1) + p(C = 0)p(Q = 0)
= 0.6 × 0.1 + 0.4 × 0.9
= 0.06 + 0.36
= 0.42 (89)
from which it follows that p(x = T ) = 0.58. The probability of observing 8 heads
and 2 tails is therefore
10
p(D) = × (0.42)8 × (0.58)2 (90)
8
where the coefficient in the binomial expansion is given by
N N!
= . (91)
K K!(N − K)!
The posterior probability that the concave side is heads is then given by Bayes’
theorem
p(D|Q = 1)p(Q = 1)
p(Q = 1|D) = (92)
p(D)
For the first term in the numerator on the right-hand side we note that, since this is
conditioned on the concave side being heads, we need to use the probabilities of the
coin landing concave side up and concave side down, to give
10
p(D|Q = 1) = × (0.6)8 × (0.4)2 . (93)
8
Substituting into Bayes’ theorem, and noting that the binomial coefficients cancel,
we obtain
(0.6)8 × (0.4)2 × 0.1
p(Q = 1|D) = ' 0.825 (94)
(0.42)8 × (0.58)2
and so we see this is a higher probability than the prior of 0.1, which is intuitively
reasonable since we saw a larger number of heads compared to tails in the data set.
The probability that the next flip will land heads up is then given by
p(H|D) = p(H|Q = 1, D)p(Q = 1|D) + p(H|Q = 0, D)p(Q = 0|D)
= p(H|Q = 1)p(Q = 1|D) + p(H|Q = 0)p(Q = 0|D)
' 0.6 × 0.825 + 0.4 × (1 − 0.825) ' 0.565 (95)
which we see is higher than the probability 0.42 of heads before we observed the
data. Again this is intuitively reasonable given the predominance of heads in the
data set.
Solution 2.41 25
2.41 If we substitute (2.115) into (2.114), we obtain (2.116). We now use (2.66) for the log
likelihood of the linear regressionPmodel, and note that ln p(t|x, w, σ 2 ) corresponds
to ln p(D|w). We also note that i wi2 = wT w. Hence we obtain (2.117) for the
regularized error function.
26 Solutions 3.1–3.2
= (1 − µ) + µ = 1
X
xp(x|µ) = 0.p(x = 0|µ) + 1.p(x = 1|µ) = µ
x∈{0,1}
X
(x − µ)2 p(x|µ) = µ2 p(x = 0|µ) + (1 − µ)2 p(x = 1|µ)
x∈{0,1}
= −(1 − µ) ln(1 − µ) − µ ln µ.
which completes the inductive proof. Finally, using the binomial theorem, the nor-
malization condition (3.198) for the binomial distribution gives
N N n
X N n X N µ
µ (1 − µ)N −n = (1 − µ) N
n=0
n n=0
n 1−µ
N
µ
= (1 − µ)N 1 + =1 (98)
1−µ
as required.
3.4 Differentiating (3.198) with respect to µ we obtain
N
X N n N −n n (N − n)
µ (1 − µ) − = 0.
n=1
n µ (1 − µ)
We now multiply through by µ2 (1 − µ)2 and re-arrange, making use of the result
(3.11) for the mean of the binomial distribution, to obtain
E[n2 ] = N µ(1 − µ) + N 2 µ2 .
Finally, we use (2.46) to obtain the result (3.12) for the variance.
∂ 1
N (x|µ, Σ) = − N (x|µ, Σ)∇x (x − µ)T Σ−1 (x − µ)
∂x 2
= −N (x|µ, Σ)Σ−1 (x − µ),
where we have used (A.19), (A.20), and the fact that Σ−1 is symmetric. Setting this
derivative equal to 0, and left-multiplying by Σ, leads to the solution x = µ.
3.6 First, we define y = Ax + b, and we assume that y has the same dimensionality
as x, so that A is a square matrix. We also assume that A is symmetric and has
an inverse. It follows that x = A−1 (y − b). From the sum and product rules of
probability we have
Z
p(y) = p(y|x)p(x) dx
Z
= δ (y − Ax − b) p(x) dx
Z
∝ δ x − A−1 (y − b) p(x) dx
(99)
where δ(·) is the Dirac delta function. Substituting for p(x) using the Gaussian
distribution we have
Z
p(y) ∝ δ x − A−1 (y − b) N (x|µ, Σ) dx
Z
1
∝ δ x − A−1 (y − b) exp − (x − µ)T Σ−1 (x − µ) dx
2
1 −1 T −1 −1
∝ exp − (A (y − b) − µ) Σ (A (y − b) − µ)
2
1 T −1 −1 −1
∝ exp − (y − b − Aµ) A Σ A (y − b − Aµ) (100)
2
Solutions 3.7–3.8 29
where we have used the property of a symmetric matrix that its inverse is also sym-
metric. By inspection we see that p(y) is a Gaussian distribution and that its mean
is Aµ + b and that its covariance is
−1
A−1 Σ−1 A−1 = AΣA. (101)
Using (3.26), (3.40), (3.42) and (3.46), we can rewrite the first integral on the r.h.s.
of (102) as
Z
− q(x) ln p(x) dx
Z
1
= N (x|µq , Σq ) D ln(2π) + ln |Σp | + (x − µp )T Σ− 1
p (x − µp ) dx
2
1
= D ln(2π) + ln |Σp | + Tr[Σ− 1 T
p (µq µq + Σq )]
2
−µp Σ− 1 T −1 T −1
p µq − µq Σp µp + µp Σp µp . (103)
The second integral on the r.h.s. of (102) we recognize from (2.92) as the negative
differential entropy of a multivariate Gaussian. Thus, from (102), (103) and (3.204),
we have
KL (q(x)kp(x))
1 |Σp | −1
T −1
= ln − D + Tr Σp Σq + (µp − µq ) Σp (µp − µq ) . (104)
2 |Σq |
3.8 We can make use of Lagrange multipliers to enforce the constraints on the maximum
entropy solution. Note that we need a single Lagrange multiplier for the normaliza-
tion constraint (3.201), a D-dimensional vector m of Lagrange multipliers for the
D constraints given by (3.202), and a D × D matrix L of Lagrange multipliers to
enforce the D2 constraints represented by (3.203). Thus we maximize
Z Z
H[p] = − p(x) ln p(x) dx + λ
e p(x) dx − 1
Z
+mT p(x)x dx − µ
Z
+Tr L p(x)(x − µ)(x − µ)T dx − Σ . (105)
We now find the values of the Lagrange multipliers by applying the constraints. First
we complete the square inside the exponential, which becomes
T
1 1 −1 1
λ − 1 + x − µ + L−1 m L x − µ + L m + µT m − mT L−1 m.
2 2 4
In the final parentheses, the term in y vanishes by symmetry, while the term in µ
simply integrates to µ by virtue of the normalization constraint (3.201) which now
takes the form
Z
T T 1 T −1
exp λ − 1 + y Ly + µ m − m L m dy = 1.
4
Substituting this into the final constraint (3.203), and making the change of variable
x − µ = z we obtain
Z
exp λ − 1 + zT Lz zzT dx = Σ.
3.9 From the definitions of the multivariate differential entropy (2.92) and the multivari-
ate Gaussian distribution (3.26), we get
Z
H[x] = − N (x|µ, Σ) ln N (x|µ, Σ) dx
Z
1
N (x|µ, Σ) D ln(2π) + ln |Σ| + (x − µ)T Σ−1 (x − µ) dx
=
2
1
D ln(2π) + ln |Σ| + Tr Σ−1 Σ
=
2
1 D
= ln |Σ| + (1 + ln(2π)) .
2 2
3.10 We have p(x1 ) = N (x1 |µ1 , τ1−1 ) and p(x2 ) = N (x2 |µ2 , τ2−1 ). Since x = x1 + x2
we also have p(x|x2 ) = N (x|µ1 + x2 , τ1−1 ). We now evaluate the convolution
integral given by (3.205) which takes the form
τ 1/2 τ 1/2 Z ∞ n τ τ2 o
1 2 1
p(x) = exp − (x − µ1 − x2 )2 − (x2 − µ2 )2 dx2 .
2π 2π −∞ 2 2
(107)
Since the final result will be a Gaussian distribution for p(x) we need only evaluate
its precision, since, from (2.99), the entropy is determined by the variance or equiv-
alently the precision, and is independent of the mean. This allows us to simplify the
calculation by ignoring such things as normalization constants.
We begin by considering the terms in the exponent of (107) which depend on x2
which are given by
1
− x22 (τ1 + τ2 ) + x2 {τ1 (x − µ1 ) + τ2 µ2 }
2
2 2
1 τ1 (x − µ1 ) + τ2 µ2 {τ1 (x − µ1 ) + τ2 µ2 }
= − (τ1 + τ2 ) x2 − +
2 τ1 + τ2 2(τ1 + τ2 )
where we have completed the square over x2 . When we integrate out x2 , the first
term on the right hand side will simply give rise to a constant factor independent
of x. The second term, when expanded out, will involve a term in x2 . Since the
precision of x is given directly in terms of the coefficient of x2 in the exponent, it is
only such terms that we need to consider. There is one other term in x2 arising from
the original exponent in (107). Combining these we have
τ1 2 τ12 1 τ1 τ2 2
− x + x2 = − x
2 2(τ1 + τ2 ) 2 τ1 + τ2
3.11 We can use an analogous argument to that used in the solution of Exercise ??. Con-
sider a general square matrix Λ with elements Λij . Then we can always write
Λ = ΛA + ΛS where
Λij + Λji Λij − Λji
ΛSij = , ΛA
ij = (108)
2 2
and it is easily verified that ΛS is symmetric so that ΛSij = ΛSji , and ΛA is antisym-
metric so that ΛA S
ij = −Λji . The quadratic form in the exponent of a D-dimensional
multivariate Gaussian distribution can be written
D D
1 XX
(xi − µi )Λij (xj − µj ) (109)
2 i=1 j =1
0 = (λ∗i − λi )u†i ui .
Hence λ∗i = λi and so λi must be real.
Now consider
uT
i uj λj = uT
i Σuj
T
= uT
i Σ uj
T
= (Σui ) uj
= λi uT
i uj ,
where we have used (3.28) and the fact that Σ is symmetric. If we assume that
0 6= λi 6= λj 6= 0, the only solution to this equation is that uT
i uj = 0, i.e., that ui
and uj are orthogonal.
Solutions 3.13–3.14 33
uα = aui + buj
uβ = cui + duj
such that uα and uβ are mutually orthogonal and of unit length. Since ui and uj are
orthogonal to uk (k 6= i, k 6= j), so are uα and uβ . Thus, uα and uβ satisfy (3.29).
Finally, if λi = 0, Σ must be singular, with ui lying in the nullspace of Σ. In this
case, ui will be orthogonal to the eigenvectors projecting onto the rowspace of Σ
and we can chose kui k = 1, so that (3.29) is satisfied. If more than one eigenvalue
equals zero, we can chose the corresponding eigenvectors arbitrily, as long as they
remain in the nullspace of Σ, and so we can chose them to satisfy (3.29).
3.13 We can write the r.h.s. of (3.31) in matrix form as
D
X
λi ui uT T
i = UΛU = M,
i=1
UT ΣU = UT ΛU = UT UΛ = Λ,
aT Σa = â1 uT T
1 + . . . + âD uD Σ (â1 u1 + . . . + âD uD )
â1 uT T
1 + . . . + âD uD (â1 λ1 u1 + . . . + âD λD uD ) .
Now, since uT
i uj = 1 only if i = j, and 0 otherwise, this becomes
â21 λ1 + . . . + â2D λD
and since a is real, we see that this expression will be strictly positive for any non-
zero a, if all eigenvalues are strictly positive. It is also clear that if an eigenvalue,
λi , is zero or negative, there exist a vector a (e.g. a = ui ), for which this expression
will be less than or equal to zero. Thus, that a matrix has eigenvectors which are all
strictly positive is a sufficient and necessary condition for the matrix to be positive
definite.
3.15 A D × D matrix has D2 elements. If it is symmetric then the elements not on the
leading diagonal form pairs of equal value. There are D elements on the diagonal
so the number of elements not on the diagonal is D2 − D and only half of these are
independent giving
D2 − D
.
2
If we now add back the D elements on the diagonal we get
D2 − D D(D + 1)
+D = .
2 2
3.17 Recall that the transformation (3.34) diagonalizes the coordinate system and that
the quadratic form (3.27), corresponding to the square of the Mahalanobis distance,
is then given by (3.33). This corresponds to a shift in the origin of the coordinate
system and a rotation so that the hyper-ellipsoidal contours along which the Maha-
lanobis distance is constant become axis aligned. The volume contained within any
one such contour is unchanged by shifts and rotations. We now make the further
1/2
transformation zi = λi yi for i = 1, . . . , D. The volume within the hyper-ellipsoid
then becomes
Z YD D Z Y D
1/2
Y
dyi = λi dzi = |Σ|1/2 VD ∆D
i=1 i=1 i=1
where we have used the property that the determinant of Σ is given by the product
of its eigenvalues, together with the fact that in the z coordinates the volume has
become a sphere of radius ∆ whose volume is VD ∆D .
3.18 Multiplying the left hand side of (3.60) by the matrix (3.208) trivially gives the iden-
tity matrix. On the right hand side consider the four blocks of the resulting parti-
tioned matrix:
upper left
upper right
lower left
CM − DD−1 CM = CM − CM = 0
lower right
Thus the right hand side also equals the identity matrix.
3.19 We first of all take the joint distribution p(xa , xb , xc ) and marginalize to obtain the
distribution p(xa , xb ). Using the results of Section 3.2.5 this is again a Gaussian
distribution with mean and covariance given by
µa Σaa Σab
µ= , Σ= .
µb Σba Σbb
From Section 3.2.4 the distribution p(xa , xb ) is then Gaussian with mean and co-
variance given by (3.65) and (3.66) respectively.
36 Solutions 3.20–3.23
3.20 Multiplying the left hand side of (3.210) by (A + BCD) trivially gives the identity
matrix I. On the right hand side we obtain
(A + BCD)(A−1 − A−1 B(C−1 + DA−1 B)−1 DA−1 )
= I + BCDA−1 − B(C−1 + DA−1 B)−1 DA−1
−BCDA−1 B(C−1 + DA−1 B)−1 DA−1
= I + BCDA−1 − BC(C−1 + DA−1 B)(C−1 + DA−1 B)−1 DA−1
= I + BCDA−1 − BCDA−1 = I
3.21 From y = x + z we have trivially that E[y] = E[x] + E[z]. For the covariance we
have
cov[y] = E (x − E[x] + y − E[y])(x − E[x] + y − E[y])T
as required.
3.24 Substituting the leftmost expression of (3.89) for R−1 in (3.91), we get
Λ−1 Λ−1 AT
Λµ − AT Sb
AΛ−1 S−1 + AΛ−1 AT Sb
Λ−1 Λµ − AT Sb + Λ−1 AT Sb
=
AΛ−1 Λµ − AT Sb + S−1 + AΛ−1 AT Sb
µ − Λ−1 AT Sb + Λ−1 AT Sb
=
Aµ − AΛ−1 AT Sb + b + AΛ−1 AT Sb
µ
= .
Aµ − b
3.25 Since y = x + z we can write the conditional distribution of y given x in the form
p(y|x) = N (y|µz + x, Σz ). This gives a decomposition of the joint distribution
of x and y in the form p(x, y) = p(y|x)p(x) where p(x) = N (x|µx , Σx ). This
therefore takes the form of (3.83) and (3.84) in which we can identify µ → µx ,
Λ−1 → Σx , A → I, b → µz and L−1 → Σz . We can now obtain the marginal
distribution p(y) by making use of the result (3.99) from which we obtain p(y) =
N (y|µx + µz , Σz + Σx ). Thus both the means and the covariances are additive, in
agreement with the results of Exercise 3.21.
3.26 The quadratic form in the exponential of the joint distribution is given by
1 1
− (x − µ)T Λ(x − µ) − (y − Ax − b)T L(y − Ax − b). (114)
2 2
38 Solution 3.26
We now extract all of those terms involving x and assemble them into a standard
Gaussian quadratic form by completing the square
1
= − xT (Λ + AT LA)x + xT Λµ + AT L(y − b) + const
2
1
= − (x − m)T (Λ + AT LA)(x − m)
2
1 T
+ m (Λ + AT LA)m + const (115)
2
where
m = (Λ + AT LA)−1 Λµ + AT L(y − b) .
We can now perform the integration over x which eliminates the first term in (115).
Then we extract the terms in y from the final term in (115) and combine these with
the remaining terms from the quadratic form (114) which depend on y to give
1
= − yT L − LA(Λ + AT LA)−1 AT L y
2
+yT L − LA(Λ + AT LA)−1 AT L b
+LA(Λ + AT LA)−1 Λµ .
(116)
We can identify the precision of the marginal distribution p(y) from the second order
term in y. To find the corresponding covariance, we take the inverse of the precision
and apply the Woodbury inversion formula (3.210) to give
−1
L − LA(Λ + AT LA)−1 AT L = L−1 + AΛ−1 AT
(117)
Now consider the two terms in the square brackets, the first one involving b and the
second involving µ. The first of these contribution simply gives b, while the term in
µ can be written
where we have used the general result (BC)−1 = C−1 B−1 . Hence we obtain
(3.93).
Solutions 3.27–3.28 39
3.27 To find the conditional distribution p(x|y) we start from the quadratic form (114)
corresponding to the joint distribution p(x, y). Now, however, we treat y as a con-
stant and simply complete the square over x to give
1 1
− (x − µ)T Λ(x − µ) − (y − Ax − b)T L(y − Ax − b)
2 2
1 T
= − x (Λ + AT LA)x + xT {Λµ + AL(y − b)} + const
2
1
= − (x − m)T (Λ + AT LA)(x − m)
2
where, as in the solution to Exercise 3.26, we have defined
m = (Λ + AT LA)−1 Λµ + AT L(y − b)
from which we obtain directly the mean and covariance of the conditional distribu-
tion in the form (3.95) and (3.96).
3.28 Differentiating (3.102) with respect to Σ we obtain two terms:
N
N ∂ 1 ∂ X
− ln |Σ| − (xn − µ)T Σ−1 (xn − µ).
2 ∂Σ 2 ∂Σ n=1
For the first term, we can apply (A.28) directly to get
N ∂ N T N
− ln |Σ| = − Σ−1 = − Σ−1 .
2 ∂Σ 2 2
For the second term, we first re-write the sum
N
X
(xn − µ)T Σ−1 (xn − µ) = N Tr Σ−1 S ,
n=1
where
N
1 X
S= (xn − µ)(xn − µ)T .
N n=1
Using this together with (A.21), in which x = Σij (element (i, j) in Σ), and proper-
ties of the trace we get
N
∂ X ∂
(xn − µ)T Σ−1 (xn − µ) = N Tr Σ−1 S
∂Σij n=1 ∂Σij
∂ −1
= N Tr Σ S
∂Σij
−1 ∂Σ −1
= −N Tr Σ Σ S
∂Σij
∂Σ −1 −1
= −N Tr Σ SΣ
∂Σij
= −N Σ−1 SΣ−1 ij
40 Solutions 3.29–3.30
where we have used (A.26). Note that in the last step we have ignored the fact that
Σij = Σji , so that ∂Σ/∂Σij has a 1 in position (i, j) only and 0 everywhere else.
Treating this result as valid nevertheless, we get
N
1 ∂ X N
− (xn − µ)T Σ−1 (xn − µ) = Σ−1 SΣ−1 .
2 ∂Σ n=1 2
Combining the derivatives of the two terms and setting the result to zero, we obtain
N −1 N −1
Σ = Σ SΣ−1 .
2 2
Re-arrangement then yields
Σ=S
as required.
3.29 The derivation of (3.46) follows directly from the discussion given in the text be-
tween (3.42) and (3.46). If m = n then, using (3.46) we have E[xn xT T
n ] = µµ + Σ,
whereas if n 6= m then the two data points xn and xm are independent and hence
E[xn xm ] = µµT where we have used (3.42). Combining these results we obtain
(3.213). From (3.42) and (3.46) we then have
N
" N
! N
!#
1 X 1 X T 1 X T
E [ΣML ] = E xn − xm xn − xl
N n=1 N m=1 N
l=1
N
" N N X N
#
1 X 2 X 1 X
= E x n xT
n − xn xTm+ 2
xm xT
l
N n=1 N m=1
N m=1 l=1
T T 1 T 1
= µµ + Σ − 2 µµ + Σ + µµ + Σ
N N
N −1
= Σ (118)
N
as required.
Similarly, we have
Finally
sin(A − B) = = exp{i(A − B)}
= = exp(iA) exp(−iB)
= =(cos A + i sin A)(cos B − i sin B)
= sin A cos B − cos A sin B.
3.31 Expressed in terms of ξ the von Mises distribution becomes
n o
p(ξ) ∝ exp m cos(m−1/2 ξ) .
= r̄.
42 Solutions 3.35–3.37
3.35 Starting from (3.26), we can rewrite the argument of the exponential as
1 1
− Tr Σ−1 xxT + µT Σ−1 x − µT Σ−1 µ.
2 2
The last term is indepedent of x but depends on µ and Σ and so should go into g(η).
The second term is already an inner product and can be kept as is. To deal with
the first term, we define the D2 -dimensional vectors z and λ, which consist of the
columns of xxT and Σ−1 , respectively, stacked on top of each other. Now we can
write the multivariate Gaussian distribution on the form (3.138), with
−1
Σ µ
η =
− 12 λ
x
u(x) =
z
h(x) = (2π)−D/2
−1/2 1 T −1
g(η) = |Σ| exp − µ Σ µ .
2
= E[u(x)u(x)T ] − E[u(x)]E[u(x)T ]
= cov[u(x)]
where we have used the result (3.172).
3.37 The value of the density p(x) at a point xn is given by hj (n) , where the notation j(n)
denotes that data point xn falls within region j. Thus the log likelihood function
takes the form
XN X N
ln p(xn ) = ln hj (n) .
n=1 n=1
We now need to take account of the constraint that p(x) must integrate to unity. Since
p(x) has the constantP
value hi over region i, which has volume ∆i , the normalization
constraint becomes i hi ∆i = 1. Introducing a Lagrange multiplier λ we then
minimize the function
N
!
X X
ln hj (n) + λ hi ∆i − 1
n=1 i
Solution 3.38 43
4.1 Substituting (1.1) into (1.2) and then differentiating with respect to wi we obtain
N
X XM
wj xjn − tn xin = 0. (119)
n=1 j =0
A
e ij = Aij + λIij . (120)
and therefore
T T T
(y − t) ϕj = (ΦwML − t) ϕj = tT Φ(ΦT Φ)−1 ΦT − I ϕj = 0
0 = −ΦT t + ΦT Φw + λw (122)
as required.
46 Solutions 4.7–4.8
4.7 We first write down the log likelihood function which is given by
N
N 1X
ln L(W, Σ) = − ln |Σ| − (tn − WT φ(xn ))T Σ−1 (tn − WT φ(xn )).
2 2 n=1
First of all we set the derivative with respect to W equal to zero, giving
N
X
0=− Σ−1 (tn − WT φ(xn ))φ(xn )T .
n=1
Multiplying through by Σ and introducing the design matrix Φ and the target data
matrix T we have
ΦT ΦW = ΦT T
Solving for W then gives (4.14) as required.
The maximum likelihood solution for Σ is easily found by appealing to the standard
result from Chapter ?? giving
N
1 X T T
Σ= (tn − WML φ(xn ))(tn − WML φ(xn ))T .
N n=1
as required. Since we are finding a joint maximum with respect to both W and Σ
we see that it is WML which appears in this expression, as in the standard result for
an unconditional Gaussian distribution.
4.8 The expected squared loss for a vectorial target variable is given by
ZZ
E[L] = ky(x) − tk2 p(t, x) dx dt.
Our goal is to choose y(x) so as to minimize E[L]. We can do this formally using
the calculus of variations to give
Z
δE[L]
= 2(y(x) − t)p(t, x) dt = 0.
δy(x)
Solving for y(x), and using the sum and product rules of probability, we obtain
Z
tp(t, x) dt Z
y(x) = Z = tp(t|x) dt
p(t, x) dt
which is the conditional average of t conditioned on x. For the case of a scalar target
variable we have Z
y(x) = tp(t|x) dt
4.9 We start by expanding the square in (??), in a similar fashion to the univariate case
in the equation preceding (4.39),
Following the treatment of the univariate case, we now substitute this into (4.64) and
perform the integral over t. Again the cross-term vanishes and we are left with
Z Z
2
E[L] = ky(x) − E[t|x]k p(x) dx + var[t|x]p(x) dx
from which we see directly that the function y(x) that minimizes E[L] is given by
E[t|x].
4.10 This exercise is just a repeat of Exercise 4.9.
4.11 To prove the normalization of the distribution (4.66) consider the integral
Z ∞ Z ∞
|x|q xq
I= exp − 2 dx = 2 exp − 2 dx
−∞ 2σ 0 2σ
and make the change of variable
xq
u= .
2σ 2
Using the definition (??) of the Gamma function, this gives
Z ∞ 2
2σ 2(2σ 2 )1/q Γ(1/q)
I=2 (2σ 2 u)(1−q)/q exp(−u) du =
0 q q
from which the normalization of (4.66) follows.
For the given noise distribution, the conditional distribution of the target variable
given the input variable is
|t − y(x, w)|q
2 q
p(t|x, w, σ ) = exp − .
2(2σ 2 )1/q Γ(1/q) 2σ 2
The likelihood function is obtained by taking products of factors of this form, over
all pairs {xn , tn }. Taking the logarithm, and discarding additive constants, we obtain
the desired result.
4.12 Since we can choose y(x) independently for each value of x, the minimum of the
expected Lq loss can be found by minimizing the integrand given by
Z
|y(x) − t|q p(t|x) dt (124)
48 Solution 4.12
for each value of x. Setting the derivative of (124) with respect to y(x) to zero gives
the stationarity condition
Z
q|y(x) − t|q−1 sign(y(x) − t)p(t|x) dt
Z y(x) Z ∞
q−1
= q |y(x) − t| p(t|x) dt − q |y(x) − t|q−1 p(t|x) dt = 0
−∞ y (x)
which can also be obtained directly by setting the functional derivative of (4.40) with
respect to y(x) equal to zero. It follows that y(x) must satisfy
Z y (x) Z ∞
q−1
|y(x) − t| p(t|x) dt = |y(x) − t|q−1 p(t|x) dt. (125)
−∞ y (x)
b T xn + w0 > 0 and the {αn } are all non-negative and sum to 1. However, by
since w
the corresponding argument
X X
b T z + w0 =
w b T y m + w0 =
βm w b T ym + w0 ) < 0,
βm ( w (132)
m m
which is a contradiction and hence {xn } and {ym } cannot be linearly separable if
their convex hulls intersect.
If we instead assume that {xn } and {ym } are linearly separable and consider a point
z in the intersection of their convex hulls, the same contradiction arise. Thus no such
point can exist and the intersection of the convex hulls of {xn } and {ym } must be
empty.
5.3 For the purpose of this exercise, we make the contribution of the bias weights explicit
in (5.14), giving
= aT t̄ = −b,
b T = aT (T − T)T = b(1 − 1)T = 0T .
since aT T
5.4 When we consider several simultaneous constraints, (5.97) becomes
Atn + b = 0, (142)
where A is a matrix and b is a column vector such that each row of A and element
of b correspond to one linear constraint.
If we apply (142) to (140), we obtain
T
Ay(x? ) = At̄ − AT bT X b † (x? − x̄)
= At̄ = −b,
b T = A(T − T)T = b1T − b1T = 0T . Thus Ay(x? ) + b = 0.
since AT
Solutions 5.5–5.6 51
NTP NTP
Precision × Recall = ×
NTP + NFP NTP + NFN
2
NTP
= .
(NTP + NFP )(NTP + NFN )
Similarly, taking a common denominator we have
NTP NTP
Precision + Recall = +
NTP + NFP NTP + NFN
NTP (NTP + NFN ) + NTP (NTP + NFP )
=
(NTP + NFP )(NTP + NFN )
NTP (2NTP + NFN + NFP )
= .
(NTP + NFP )(NTP + NFN )
Substituting these two results into (5.38) we obtain
2NTP
F = . (143)
2NTP + NFP + NFN
as required.
5.6 Since the square root function is monotonic for non-negative numbers, we can take
the square root of the relation a 6 b to obtain a1/2 6 b1/2 . Then we multiply both
sides by the non-negative quantity a1/2 to obtain a 6 (ab)1/2 .
The probability of a misclassification is given, from (??), by
Z Z
p(mistake) = p(x, C2 ) dx + p(x, C1 ) dx
ZR1 R2
Z
= p(C2 |x)p(x) dx + p(C1 |x)p(x) dx. (144)
R1 R2
Since we have chosen the decision regions to minimize the probability of misclassi-
fication we must have p(C2 |x) 6 p(C1 |x) in region R1 , and p(C1 |x) 6 p(C2 |x) in
region R2 . We now apply the result a 6 b ⇒ a1/2 6 b1/2 to give
Z
p(mistake) 6 {p(C1 |x)p(C2 |x)}1/2 p(x) dx
R1
Z
+ {p(C1 |x)p(C2 |x)}1/2 p(x) dx
R2
Z
= {p(C1 |x)p(x)p(C2 |x)p(x)}1/2 dx (145)
since the two integrals have the same integrand. The final integral is taken over the
whole of the domain of x.
52 Solutions 5.7–5.11
5.7 Substituting Lkj = 1 − δkj into (5.23), and using the fact that the posterior proba-
bilities sum to one, we find that, for each x we should choose the class j for which
1 − p(Cj |x) is a minimum, which is equivalent to choosing the j for which the pos-
terior probability p(Cj |x) is a maximum. This loss matrix assigns a loss of one if
the example is misclassified, and a loss of zero if it is correctly classified, and hence
minimizing the expected loss will minimize the misclassification rate.
5.8 From (5.23) we see that for a general loss matrix and arbitrary class priors, the ex-
pected loss is minimized by assigning an input x to class the j which minimizes
X 1 X
Lkj p(Ck |x) = Lkj p(x|Ck )p(Ck ) (146)
p(x)
k k
and so there is a direct trade-off between the priors p(Ck ) and the loss matrix Lkj .
5.9 We recognise the sum over data points in (5.100) as the finite-sample approximation
to an expectation, as seen in (2.40). Taking the limit N → ∞ we can use (2.39) to
write the expectation in the form
Z Z
E [p(Ck |x)] = p(Ck |x)p(x) dx = p(Ck , x) dx = p(Ck ) (147)
1 1 + e−a − 1
1 − σ(a) = 1 − =
1 + e−a 1 + e−a
−a
e 1
= −a
= a = σ(−a).
1+e e +1
Solutions 5.12–5.13 53
1
y = σ(a) =
1 + e−a
1
⇒ − 1 = e−a
y
1−y
⇒ ln = −a
y
y
⇒ ln = a = σ −1 (y).
1−y
5.12 Substituting (5.47) into (5.41), we see that the normalizing constants cancel and we
are left with
T
exp − 21 (x − µ1 ) Σ−1 (x − µ1 ) p(C1 )
a = ln
T
exp − 12 (x − µ2 ) Σ−1 (x − µ2 ) p(C2 )
1
= − xΣT x − xΣµ1 − µT T
1 Σx + µ1 Σµ1
2
p(C1 )
−xΣT x + xΣµ2 + µT T
2 Σx − µ2 Σµ2 + ln
p(C2 )
T 1 T −1 p(C1 )
= (µ1 − µ2 ) Σ−1 x − µ Σ µ1 − µT
2 Σµ2 + ln .
2 1 p(C2 )
Substituting this into the rightmost form of (5.40) we obtain (5.48), with w and w0
given by (5.49) and (5.50), respectively.
Summing both sides over k we find that λ = −N , and using this to eliminate λ we
obtain (5.101).
5.14 If we substitute (5.102) into (150) and then use the definition of the multivariate
Gaussian, (3.26), we obtain
ln p ({φn , tn }|{πk }) =
N K
1 XX
tnk ln |Σ| + (φn − µk )T Σ−1 (φ − µ) , (154)
−
2 n=1
k=1
we can use (A.24) and (A.28) to calculate the derivative w.r.t. Σ−1 . Setting this to
zero we obtain
N T
1 XX
tnk Σ − (φn − µn )(φn − µk )T = 0.
(157)
2 n=1
k
Again making use of (153), we can re-arrange this to obtain (5.104), with Sk given
by (5.105).
Note that, as in Exercise 3.28, we do not enforce that Σ should be symmetric, but
simply note that the solution is automatically symmetric.
Solutions 5.15–5.16 55
5.15 We assume that the training set consists of data points xn each of which is labelled
with the associated class Ck . This allows the parameters {µki } to be fitted for each
class independently. From (5.64) the log likelihood function for class Ck is then
given by
N X
X D
ln p(D|Ck ) = {xni ln µki + (1 − xni ) ln(1 − µki )} . (158)
n=1 i=1
which is the intuitively pleasing result that, for each class k and for each component
i, the value of µki is given by the average of the values of the corresponding compo-
nents xni of those data vectors that belong to class Ck . Since the xni are binary this
is just the fraction of data points for which the corresponding value of i is equal to
one.
5.16 The generative model for φ corresponding to the chosen coding scheme is given by
M
Y
p (φ | Ck ) = p (φm | Ck ) (161)
m=1
where
L
Y
p (φm | Ck ) = µφkml
ml
, (162)
l=1
where in turn {µkml } are the parameters of the multinomial models for φ.
Substituting this into (5.46) we see that
ak = ln p (φ | Ck ) p (Ck )
M
X
= ln p (Ck ) + ln p (φm | Ck )
m=1
M X
X L
= ln p (Ck ) + φml ln µkml , (163)
m=1 l=1
5.17 We denote the data set by D = {φnml } where n = 1, . . . , N . From the naive Bayes
assumption we can fit each class Ck separately to the training data. For class Ck the
log likelihood function takes the form
N X
X M X
L
ln p(D|Ck ) = φnml ln µkml . (164)
n=1 m=1 l=1
Note that the parameter µkml represents the probability that for class Ck the compo-
nent m will have its non-zero element in position l. In order to find the maximum
likelihood solution we need to take account of the constraint that these probabilities
must sum to one, separately for each value of m, so that
L
X
µkml = 1. (165)
l=1
We can handle this by introducing Lagrange multipliers, one per component, and
then maximize the modified likelihood function given by
N X M X L M L
!
X X X
φnml ln µkml + λm µkml − 1 . (166)
n=1 m=1 l=1 m=1 l=1
To find the Lagrange multipliers we substitute this result into the constraint (165)
and rearrange to give
N X
X L
λm = − φnml . (169)
n=1 l=1
We now use this to replace the Lagrange multiplier in (168) to give the final result
for the maximum likelihood solution for the parameters in the form
N
X
φnml
n=1
µkml = N XL
. (170)
X
φnml
n=1 l=1
Solutions 5.18–5.20 57
dσ e−a
= 2
da (1 + e−a )
−a
e
= σ(a)
1 + e−a
1 + e−a
1
= σ(a) −
1 + e−a 1 + e−a
= σ(a)(1 − σ(a)).
Finally, we have
∇an = φn (175)
where ∇ denotes the gradient with respect to w. Combining (173), (174) and (175)
using the chain rule, we obtain
N
X ∂E ∂yn
∇E = ∇an
n=1
∂yn ∂an
N
X
= (yn − tn )φn
n=1
as required.
5.20 If the data set is linearly separable, any decision boundary separating the two classes
will have the property
> 0 if tn = 1,
wT φn (176)
< 0 otherwise.
58 Solutions 5.21–5.23
Moreover, from (5.74) we see that the negative log-likelihood will be minimized
(i.e., the likelihood maximized) when yn = σ (wT φn ) = tn for all n. This will be
the case when the sigmoid function is saturated, which occurs when its argument,
wT φ, goes to ±∞, i.e., when the magnitude of w goes to infinity.
5.23 We consider the two cases where a > 0 and a < 0 separately. In the first case, we
can use (3.25) to rewrite (5.86) as
Z 0 Z a 2
1 θ
Φ(a) = N (θ|0, 1) dθ + √ exp − dθ
−∞ 0 2π 2
Z a/√2
1 1 √
= +√ exp −u2 2 du
2 2π 0
1 a
= 1 + erf √ , (178)
2 2
where, in the last line, we have used (5.87).
When a < 0, the symmetry of the Gaussian distribution gives
dΦ(λa)
= λN (0|0, 1)
da a=0
1
= λ√ .
2π
Setting this equal to (180), we see that
√
2π π
λ= or equivalently λ2 = . (181)
4 8
The comparison of the logistic sigmoid function and the scaled probit function is
illustrated in Figure 5.12.
60 Solutions 6.1–6.3
6.1 On the right-hand side of (6.51) we make the change of variables u = r2 to give
Z ∞
1 1
SD e−u uD/2−1 du = SD Γ(D/2) (182)
2 0 2
where we have used the definition (??) of the Gamma function. On the left hand side
of (6.51) we can use (2.126) to obtain π D/2 . Equating these we obtain the desired
result (6.53).
The volume of a sphere of radius 1 in D-dimensions is obtained by integration
Z 1
SD
VD = S D rD−1 dr = . (183)
0 D
4 3
S2 = 2π, S3 = 4π, V2 = πa2 , V3 = πa . (184)
3
6.2 The volume of the cube is (2a)D . Combining this with (6.53) and (6.54) we obtain
(6.55). Using Stirling’s formula (6.56) in (6.55) the ratio becomes, for large D,
6.3 Since p(x) is radially symmetric it will be roughly constant over the shell of radius
r and thickness . This shell has volume SD rD−1 and since kxk2 = r2 we have
Z
p(x) dx ' p(r)SD rD−1 (186)
shell
from which we obtain (6.58). We can find the stationary points of p(r) by differen-
tiation
r2
d h
D−2 D−1
r i
p(r) ∝ (D − 1)r +r − 2 exp − 2 = 0. (187)
dr σ 2σ
√
Solving for r, and using D 1, we obtain b
r ' Dσ.
Solution 6.4 61
We now expand p(r) around the point b r. Since this is a stationary point of p(r)
we must keep terms up to second order. Making use of the expansion ln(1 + x) =
x − x2 /2 + O(x3 ), together with D 1, we obtain (6.59).
Finally, from (6.57) we see that the probability density at the origin is given by
1
p(x = 0) =
(2πσ 2 )1/2
r2
1 1 D
exp − 2 = exp −
b
p(kxk = b r) =
(2πσ 2 )1/2 2σ (2πσ 2 )1/2 2
√
where we have used b r ' Dσ. Thus the ratio of densities is given by exp(D/2).
6.4 Using the definition of the tanh function we have
ea − e−a
tanh(a) =
ea + e−a
1 − e−2a
=
1 + e−2a
2 1 + e−2a
= −
1 + e−2a 1 + e−2a
2
= −1
1 + e−2a
= 2σ(2a) − 1 (189)
where we have made use of the definition of the sigmoid function in (6.60). Re-
arranging we obtain
1
σ(a) = (tanh(a/2) + 1) . (190)
2
For the case of a logistic sigmoid activation function, the argument of the output-unit
activation function in (6.11) is given by
M D
!
X (2)
X (1)
wkj σ wji xi . (191)
j =0 i=0
62 Solutions 6.5–6.6
Figure 2 The swish activation function plotted for values of β = 0.1, 1 and 10
M D
!
X (2)
X (1)
w
e kj tanh w
e ji xi (192)
j =0 i=0
(2) 1 (2)
w
e kj = w , j = 1, . . . , M (193)
2 kj
(2) 1 (2) 1
w
e k0 = wk 0 + (194)
2 2
(1) 1 (1)
w
e ji = wji (195)
2
6.5 Using the definition of the logistic sigmoid, the swish activation function can be
written as
x
sw(x) = . (196)
1 + exp(−βx)
Finally, to find the inverse (6.65) of the softplus function let y = ζ −1 (a), then
Rearranging we obtain
exp(a) = 1 + exp(y) (205)
and hence
y = ln(exp(a) − 1). (206)
64 Solutions 6.8–6.10
6.8 Differentiating the error (6.25) with respect to σ 2 and setting the derivative to zero
gives
N
1 X N 1
0=− 4 {y(xn , w) − tn }2 + . (207)
2σ n=1 2 σ2
Rearranging to solve for σ 2 we obtain
N
1 X
σ2 = {y(xn , w? ) − tn }2 (208)
N n=1
as required.
6.9 The likelihood function for an i.i.d. data set, {(x1 , t1 ), . . . , (xN , tN )}, under the
conditional distribution (6.28) is given by
N
Y
N tn |y(xn , w), β −1 I .
n=1
where ‘const’ comprises terms which are independent of w. The first term on the
right hand side is proportional to the negative of (6.29) and hence maximizing the
log-likelihood is equivalent to minimizing the sum-of-squares error.
6.10 In this case, the likelihood function becomes
N
Y
p(T|X, w, Σ) = N (tn |y(xn , w), Σ) ,
n=1
ln p(T|X, w, Σ)
N
N 1X
=− (ln |Σ| + K ln(2π)) − (tn − yn )T Σ−1 (tn − yn ), (209)
2 2 n=1
If we first treat Σ as fixed and known, we can drop terms that are independent of w
from (209), and by changing the sign we get the error function
N
1X
E(w) = (tn − yn )T Σ−1 (tn − yn ).
2 n=1
If we consider maximizing (209) w.r.t. Σ, the terms that need to be kept are
N
N 1X
− ln |Σ| − (tn − yn )T Σ−1 (tn − yn ).
2 2 n=1
Using results from Appendix ??, we can maximize this by setting the derivative w.r.t.
Σ−1 to zero, yielding
N
1 X
Σ= (tn − yn )(tn − yn )T .
N n=1
6.12 This simply corresponds to a scaling and shifting of the binary outputs, which di-
rectly gives the activation function, using the notation from (??), in the form
y = 2σ(a) − 1.
The corresponding error function can be constructed from (6.33) by applying the
inverse transform to yn and tn , yielding
N
X 1 + tn 1 + yn 1 + tn 1 + yn
E(w) = − ln + 1− ln 1 −
n=1
2 2 2 2
N
1X
= − {(1 + tn ) ln(1 + yn ) + (1 − tn ) ln(1 − yn )} + N ln 2
2 n=1
6.13 For the given interpretation of yk (x, w), the conditional distribution of the target
vector for a multiclass neural network is
K
Y
p(t|w1 , . . . , wK ) = yktk .
k=1
Taking the negative logarithm in order to derive an error function we obtain (6.36)
as required. Note that this is the same result as for the multiclass logistic regression
model, given by (5.80) .
∂E 1 ∂yn 1 ∂yn
= −tn + (1 − tn ) . (210)
∂an yn ∂an 1 − yn ∂an
Solution 6.15 67
∂E yn (1 − yn ) yn (1 − yn )
= −tn + (1 − tn )
∂an yn (1 − yn )
= yn − tn
as required.
6.15 Consider a specific data point n and, to minimize clutter, omit the suffix n on vari-
ables such as ak and yk . We can use the chain rule of calculus to write
K
∂E X ∂E ∂yj
= . (212)
∂ak j =1
∂yj ∂ak
For the derivative ∂yj /∂ak there are two contributions, one from the numerator and
one from the denominator, so that
∂yj exp(aj )δjk exp(aj ) exp(ak )
=P − P 2
∂ak l exp(al ) { l exp(al )}
= yj δjk − yj yk . (215)
which follows from 1-of-K coding scheme used for the {tj }.
68 Solutions 6.16–6.18
6.16 From standard trigonometric rules we get the position of the end of the first arm,
(1) (1)
x1 , x2 = (L1 cos(θ1 ), L1 sin(θ1 )) .
Similarly, the position of the end of the second arm relative to the end of the first arm
is given by the corresponding equation, with an angle offset of π (see Figure 6.16),
which equals a change of sign
(2) (2)
x1 , x2 = (L2 cos(θ1 + θ2 − π), L1 sin(θ1 + θ2 − π))
= − (L2 cos(θ1 + θ2 ), L2 sin(θ1 + θ2 )) .
Putting this together, we must also taken into account that θ2 is measured relative to
the first arm and so we get the position of the end of the second arm relative to the
attachment point of the first arm as
6.17 The interpretation of γnk as a posterior probability follows from Bayes’ theorem for
the probability of the component indexed by k, given observed data t, in which all
quantities are also conditioned on the input variable x. Therefore x simply appears as
a conditioning variable in the right-hand side of all quantities. From Bayes’ theorem
we have
p(t|k, x)p(k|x)
p(k|t, x) = (218)
p(t|x)
where, as usual, the denominator can be expressed as a marginalization over the
terms in the numerator, so that
X
p(t|x) = p(t|l, x)p(l|x). (219)
l
The quantities πk (x) defined by (6.40) satisfy (6.39) and hence meet the require-
ments to be viewed as probabilities, and so we equate p(k|x) = πk (x). Simi-
larly, the class-conditional distribution p(t|k, x) is given by the Gaussian Nnk =
N (tn |µk (xn ), σk2 (xn )). Substituting into (218) then gives
πk Nnk
p(k|tn , xn ) = P = γnk (220)
l πl Nnl
as required.
Note that because of the coupling between outputs caused by the softmax activation
function, the dependence on the activation of a single output unit involves all the
output units.
For the first factor inside the sum on the r.h.s. of (221), standard derivatives applied
to the nth term of (6.43) gives
∂En Nnj γnj
= − PK =− . (222)
∂πj l=1 πl Nnl
πj
For the for the second factor, we have from (5.78) that
∂πj
= πj (Ijk − πk ). (223)
∂aπk
L/2
ktn − µk k2
∂En 1 L L
= −P − L+1 exp −
∂σk k0 Nnk
0 2π σ 2σk2
ktn − µk k2 ktn − µk k2
1
+ L exp −
σ 2σk2 σk3
2
L ktn − µk k
= γnk − .
σk σk3
ktn − µk k2
∂En
= γnk L − .
∂aσk σk2
Z
E [t|x] = tp (t|x) dt
Z K
X
πk (x)N t|µk (x), σk2 (x) dt
= t
k=1
K
X Z
tN t|µk (x), σk2 (x) dt
= πk (x)
k=1
K
X
= πk (x)µk (x).
k=1
K
X
tk = µk (x) and t = πk (x)tk .
k=1
Solution 6.21 71
Using this together with (3.42), (3.46), (6.38) and (6.48), we get
Z
2 2
s (x) = E kt − E [t|x] k |x = kt − tk2 p (t|x) dt
Z K
X
T T T T
πk N t|µk (x), σk2 (x) dt
= t t−t t−t t+t t
k=1
K n o
T T T T
X
= πk (x) σk2 + tk tk − tk t − t tk + t t
k=1
K
X
πk (x) σk2 + ktk − tk2
=
k=1
2
K
X K
X
= πk (x) σk2 + µk (x) − πl µl (x) .
k=1 l
72 Solutions 7.1–7.2
as required.
7.2 From (7.8) and (7.10) we have
uT T
i Hui = ui λi ui = λi .
where we have used (7.8) and (7.9) along with (). Thus, if all of the eigenvalues are
positive, the Hessian matrix will be positive definite.
Solutions 7.3–7.4 73
7.3 From (7.12) we see that, if H is positive definite, then the second term in (7.7) will
be positive whenever (w − w? ) is non-zero. Thus the smallest value which E(w)
can take is E(w? ), and so w? is the minimum of E(w). Conversely, if w? is the
minimum of E(w), then, for any vector w 6= w? , E(w) > E(w? ). This will only
be the case if the second term of (7.7) is positive for all values of w 6= w? (since
the first term is independent of w). Since w − w? can be set to any vector of real
numbers, it follows from the definition (7.12) that H must be positive definite.
7.4 The first derivatives of the error function are given by
\[ \frac{\partial E}{\partial w} = \sum_n (y_n - t_n)x_n \tag{227} \]
\[ \frac{\partial E}{\partial b} = \sum_n (y_n - t_n) \tag{228} \]
and the second derivatives by
\[ \frac{\partial^2 E}{\partial w^2} = \sum_n x_n^2 \tag{229} \]
\[ \frac{\partial^2 E}{\partial w\,\partial b} = \frac{\partial^2 E}{\partial b\,\partial w} = \sum_n x_n = N\bar{x} \tag{230} \]
\[ \frac{\partial^2 E}{\partial b^2} = \sum_n 1 = N. \tag{231} \]
Note that this Hessian does not depend on the target values, nor indeed on the parameters w and b, reflecting the fact that the error function is quadratic for this simple model. For the corresponding logistic-regression model, in which y_n = σ(wx_n + b) and E is the cross-entropy error, the same calculation gives Hessian elements
\[ \frac{\partial^2 E}{\partial w^2} = \sum_n y_n(1-y_n)x_n^2, \qquad
\frac{\partial^2 E}{\partial w\,\partial b} = \sum_n y_n(1-y_n)x_n, \qquad
\frac{\partial^2 E}{\partial b^2} = \sum_n y_n(1-y_n). \]
This Hessian does not depend on the target values, but it is a function of w and b, corresponding to the fact that the error function is non-quadratic. Since the logistic sigmoid function satisfies 0 < σ(·) < 1, we see that y_n(1 − y_n) is always a positive quantity. The elements of the leading diagonal of the Hessian are therefore sums of positive terms and hence are themselves positive, and so the trace of the Hessian is positive. (We ignore the degenerate case in which all of the data points are identical, for which the determinant below vanishes, since this is of no practical interest.) For the determinant we first define c_n = y_n(1 − y_n) in order to keep the notation uncluttered. We then have
\[
\det \mathbf{H} = \left(\sum_n c_n x_n^2\right)\left(\sum_n c_n\right) - \left(\sum_n c_n x_n\right)^{\!2}
= \left(\sum_n c_n\right)\left\{ \sum_n c_n x_n^2 - \frac{\left(\sum_n c_n x_n\right)^{2}}{\sum_n c_n} \right\}. \tag{240}
\]
The term in braces equals $\sum_n c_n (x_n - \widehat{x})^2$, where $\widehat{x} = \left(\sum_n c_n x_n\right)/\left(\sum_n c_n\right)$, and is therefore a sum of positive terms, as is the factor $\sum_n c_n$, so the determinant is positive. Thus both the trace and the determinant are positive. Since the determinant is the product of the two eigenvalues of the Hessian, it follows that either both eigenvalues are positive or both are negative, and since the trace is the sum of the eigenvalues and is positive, both eigenvalues must be positive and hence the Hessian must be positive definite.
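A quick numerical illustration, assuming the logistic model y_n = σ(wx_n + b) discussed above with an arbitrary data set and parameter values, confirms that the trace, determinant and eigenvalues of the resulting 2 × 2 Hessian are all positive.

```python
# Numerical check that the 2x2 logistic-regression Hessian is positive definite.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)                  # arbitrary inputs
w, b = 0.7, -0.3                         # arbitrary parameter values
y = 1.0 / (1.0 + np.exp(-(w * x + b)))   # y_n = sigma(w x_n + b)
c = y * (1.0 - y)                        # c_n = y_n (1 - y_n)

H = np.array([[np.sum(c * x**2), np.sum(c * x)],
              [np.sum(c * x),    np.sum(c)]])
print(np.trace(H), np.linalg.det(H))     # both positive
print(np.linalg.eigvalsh(H))             # hence both eigenvalues positive
```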
7.6 We start by making the change of variable given by (7.10), which allows the error function to be written in the form (7.11). Setting the value of the error function E(w) to a constant value C we obtain
\[ E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i\alpha_i^2 = C. \]
Re-arranging gives
\[ \sum_i \lambda_i\alpha_i^2 = 2C - 2E(\mathbf{w}^\star) = \widetilde{C} \]
where $\widetilde{C}$ is also a constant. This is the equation for an ellipse whose axes are aligned with the coordinates described by the variables {αi}. The length of axis j is found by setting αi = 0 for all i ≠ j and solving for αj, giving
\[ \alpha_j = \left(\frac{\widetilde{C}}{\lambda_j}\right)^{1/2}. \]
7.7 A W × W matrix has W² elements. If it is symmetric then the elements not on the leading diagonal form pairs of equal value. There are W elements on the diagonal, so the number of elements not on the diagonal is W² − W, and only half of these are independent, giving
\[ \frac{W^2 - W}{2} \]
independent off-diagonal elements. If we now add back the W elements on the diagonal we get
\[ \frac{W^2 - W}{2} + W = \frac{W(W+1)}{2}. \]
Finally, we add the W elements of the gradient vector b to give
\[ \frac{W(W+1)}{2} + W = \frac{W(W+1) + 2W}{2} = \frac{W^2 + 3W}{2} = \frac{W(W+3)}{2}. \]
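The counting argument above can be expressed as a small helper function; the value chosen for W in the check is arbitrary.

```python
# The counts derived above, as a small helper (W is the number of weights).
def quadratic_approx_params(W: int) -> int:
    hessian_terms = W * (W + 1) // 2   # independent elements of a symmetric Hessian
    gradient_terms = W                 # elements of the gradient vector b
    return hessian_terms + gradient_terms

assert quadratic_approx_params(10) == 10 * 13 // 2   # W(W+3)/2 = 65
```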
7.8 From the property (2.52) of the Gaussian distribution we have, for a single data point,
\[ \mathbb{E}[x_n] = \mu, \]
while for a pair of data points
\[ \mathbb{E}[x_n x_m] = \delta_{nm}\sigma^2 + \mu^2. \]
For the sample mean it then follows that
\[ \mathbb{E}[\bar{x}] = \mu, \qquad
\mathbb{E}[\bar{x}^2] = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}[x_n x_m] = \frac{1}{N}\sigma^2 + \mu^2. \]
Hence
\[ \operatorname{var}[\bar{x}] = \mathbb{E}[\bar{x}^2] - \mathbb{E}[\bar{x}]^2 = \frac{\sigma^2}{N}. \]
Turning to the initialization of a network layer, each term $w_{ij}z_j^{(l-1)}$ in the sum defining the pre-activation $a_i^{(l)}$ has zero mean, since we are assuming that the weights $w_{ij}$ are drawn from a zero-mean Gaussian $\mathcal{N}(0,\epsilon^2)$. Since this is true for every value of j in the summation we have
\[ \mathbb{E}\!\left[a_i^{(l)}\right] = 0. \tag{241} \]
To find the variance of $a_i^{(l)}$ we note that it comprises the sum of M terms which are themselves independent random variables, and we know from (2.121) that the total variance of a sum of independent variables is the sum of the variances of the individual variables. Hence we have
\[
\operatorname{var}\!\left[a_i^{(l)}\right] = M\operatorname{var}\!\left[w_{ij}z_j^{(l-1)}\right]
= M\,\mathbb{E}_{w,z}\!\left[\left(w_{ij}z_j^{(l-1)}\right)^2\right] - M\left(\mathbb{E}_{w,z}\!\left[w_{ij}z_j^{(l-1)}\right]\right)^2
= M\,\mathbb{E}_{w,z}\!\left[\left(w_{ij}z_j^{(l-1)}\right)^2\right]
= M\,\mathbb{E}_{w}\!\left[w_{ij}^2\right]\mathbb{E}_{z}\!\left[\left(z_j^{(l-1)}\right)^2\right]
\]
since $\mathbb{E}_{w,z}[w_{ij}z_j^{(l-1)}] = 0$ as discussed above. Because the weights $w_{ij}$ are drawn from a Gaussian $\mathcal{N}(0,\epsilon^2)$ we have $\mathbb{E}[w_{ij}^2] = \epsilon^2$. To find $\mathbb{E}[(z_j^{(l-1)})^2]$ we note that
\[ z_j^{(l-1)} = \operatorname{ReLU}\!\left(a_j^{(l-1)}\right) \]
and therefore
\[ \left(z_j^{(l-1)}\right)^2 = \operatorname{ReLU}\!\left(a_j^{(l-1)}\right)^2. \]
When we take the square of the ReLU, the result will be zero if the argument is negative and will be the square of the argument if the argument is positive. The quantity $a_j^{(l-1)}$ has a distribution that is symmetric about zero since each of its terms is the product of a zero-mean Gaussian weight with the corresponding hidden-unit activation and therefore has a symmetric distribution, and the sum of symmetric distributions is itself symmetric. Thus, when we take the expectation of the squared ReLU, half of the probability mass contributes zero and the other half contributes the square of the argument, and hence
\[
\mathbb{E}\!\left[\left(z_j^{(l-1)}\right)^2\right]
= \mathbb{E}\!\left[\operatorname{ReLU}\!\left(a_j^{(l-1)}\right)^2\right]
= \frac{1}{2}\,\mathbb{E}\!\left[\left(a_j^{(l-1)}\right)^2\right]
= \frac{1}{2}\operatorname{var}\!\left[a_j^{(l-1)}\right]
= \frac{1}{2}\lambda^2
\]
where we have used the property that $a_j^{(l-1)}$ has zero mean. Combining these results gives
\[ \operatorname{var}\!\left[a_i^{(l)}\right] = \frac{M}{2}\,\epsilon^2\lambda^2 \]
as required. Finally, if the variance of $a_i^{(l)}$ is also to equal $\lambda^2$ then
\[ \frac{M}{2}\,\epsilon^2\lambda^2 = \lambda^2 \]
from which it follows that ε must be chosen to have the value
\[ \epsilon = \sqrt{\frac{2}{M}}. \tag{242} \]
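The following sketch illustrates the result empirically: with weights drawn from N(0, 2/M), the variance of the pre-activations remains approximately constant across a stack of ReLU layers. The layer width, depth, and sample count are arbitrary choices.

```python
# Empirical check that epsilon = sqrt(2/M) preserves the pre-activation variance.
import numpy as np

rng = np.random.default_rng(3)
M, n_layers, n_samples = 512, 10, 2000

a = rng.normal(0.0, 1.0, size=(n_samples, M))            # initial pre-activations, variance 1
for _ in range(n_layers):
    W = rng.normal(0.0, np.sqrt(2.0 / M), size=(M, M))   # He initialization
    z = np.maximum(a, 0.0)                               # ReLU
    a = z @ W.T
    print(a.var())                                       # stays close to 1
```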
\[ \nabla E = \mathbf{H}\!\left(\mathbf{w} - \mathbf{w}^\star\right). \tag{243} \]
\[
\Delta\mathbf{w}^{(\tau-1)} = -\eta\,\nabla E\!\left(\mathbf{w}^{(\tau-1)}\right)
- \eta\mu\,\nabla\nabla E\!\left(\mathbf{w}^{(\tau-1)}\right)\Delta\mathbf{w}^{(\tau-2)} + O(\mu^2)
+ \mu\,\Delta\mathbf{w}^{(\tau-2)}.
\]
If we now assume that $\eta = O(\epsilon)$ and $\mu = O(\epsilon)$, then we can neglect higher-order terms in the Taylor expansion and also omit the term with coefficient $\eta\mu$ since that is $O(\epsilon^2)$. Here we have assumed that the error surface is slowly varying and hence that the Hessian term $\nabla\nabla E$ is $O(1)$. We then obtain the standard formula for gradient descent with momentum defined by (7.31).
7.12 If we apply (7.66) recursively we obtain
\[
\begin{aligned}
\mu_n &= \beta\mu_{n-1} + (1-\beta)x_n \\
&= \beta\left(\beta\mu_{n-2} + (1-\beta)x_{n-1}\right) + (1-\beta)x_n \\
&= \beta^2\mu_{n-2} + \beta(1-\beta)x_{n-1} + (1-\beta)x_n \\
&= \beta^3\mu_{n-3} + \beta^2(1-\beta)x_{n-2} + \beta(1-\beta)x_{n-1} + (1-\beta)x_n \\
&\;\;\vdots \\
&= \beta^n\mu_0 + \sum_{k=1}^{n}\beta^{k-1}(1-\beta)\,x_{n-k+1}.
\end{aligned}
\]
We now set µ0 = 0, and then take the expectation of both sides with respect to the distribution of x, noting that the {xn} are independent, identically distributed samples from this distribution. This gives
\[
\mathbb{E}[\mu_n] = \sum_{k=1}^{n}\beta^{k-1}(1-\beta)\,\mathbb{E}[x_{n-k+1}]
= \sum_{k=1}^{n}\beta^{k-1}(1-\beta)\,\bar{x}
\]
where x̄ is the true mean of the distribution of x. Making use of the result (7.67) we can write this as
\[ \mathbb{E}[\mu_n] = (1-\beta^n)\,\bar{x}. \]
Thus we see that E[µn] ≠ x̄ and hence that the estimate µn is biased. This bias is easily corrected by using the estimator
\[ \widehat{\mu}_n = \frac{\mu_n}{1-\beta^n} \tag{250} \]
which has the property $\mathbb{E}[\widehat{\mu}_n] = \bar{x}$ and is therefore unbiased.
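The bias and its correction can be illustrated numerically; in the following sketch the samples are drawn from a Gaussian with true mean 1.0, and the values of β and the number of steps are arbitrary.

```python
# Illustration of the bias and its correction (250) for the exponential moving average.
import numpy as np

rng = np.random.default_rng(4)
beta, n_steps, n_runs = 0.99, 50, 20_000
x = rng.normal(1.0, 1.0, size=(n_runs, n_steps))          # i.i.d. samples, true mean 1.0

mu = np.zeros(n_runs)
for n in range(1, n_steps + 1):
    mu = beta * mu + (1 - beta) * x[:, n - 1]

print(mu.mean())                          # approx (1 - beta**n_steps) * 1.0, i.e. biased
print((mu / (1 - beta**n_steps)).mean())  # bias-corrected estimate, approx 1.0
```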
7.13 Setting the derivative of (7.69) with respect to λ equal to zero we have
\[ \frac{\partial}{\partial\lambda}E\!\left(\mathbf{w}^{(\tau)} + \lambda\mathbf{d}\right) = 0. \tag{251} \]
To evaluate this derivative we can use the chain rule of calculus. Define $\mathbf{v} = \mathbf{w}^{(\tau)} + \lambda\mathbf{d}$, with elements $\{v_i\}$. Then we have
\[
\frac{\partial}{\partial\lambda}E
= \sum_{i=1}^{M}\frac{\partial v_i}{\partial\lambda}\frac{\partial E}{\partial v_i}
= \sum_{i=1}^{M} d_i\,\frac{\partial E}{\partial v_i}
= \mathbf{d}^{\mathrm T}\nabla E \tag{253}
\]
where {di} are the components of d. This derivative vanishes for a particular value λ⋆, which then defines the new location in weight space
\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \lambda^\star\mathbf{d}. \]
We therefore have
\[ \mathbf{d}^{\mathrm T}\nabla E\!\left(\mathbf{w}^{(\tau+1)}\right) = 0, \]
and hence the gradient of the error function at the line-search minimum is orthogonal to the search direction.
7.14 Summing both sides of (7.50) over n to compute the sample mean we have
\[ \frac{1}{N}\sum_{n=1}^{N}\widetilde{x}_{ni} = \frac{1}{N\sigma_i}\left(\sum_{n=1}^{N}x_{ni} - N\mu_i\right) = 0. \tag{256} \]
Chapter 8 Backpropagation
which follows from the chain rule of calculus. Then (8.8) gives
\[ \delta_k = \frac{\partial E_n}{\partial a_k}, \tag{259} \]
where we have used (8.5) and (8.6). Substituting these results into (258) we obtain
\[ \delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k \tag{261} \]
as required.
8.2 The forward propagation equations in matrix notation are given by (6.19) in the form
\[ \mathbf{z}^{(l)} = h^{(l)}\!\left(\mathbf{W}^{(l)}\mathbf{z}^{(l-1)}\right) \]
where $\mathbf{W}^{(l)}$ is a matrix with elements $w^{(l)}_{jk}$ comprising the weights in layer l of the network, and the activation function $h^{(l)}(\cdot)$ acts on each element of its vector argument independently. If we define $\boldsymbol{\delta}^{(l)}$ to be the vector of errors with elements δj, then the backpropagation equations in matrix notation take the form
\[ \boldsymbol{\delta}^{(l-1)} = h^{(l-1)\prime}\!\left(\mathbf{a}^{(l-1)}\right) \odot \left\{\left(\mathbf{W}^{(l)}\right)^{\mathrm T}\boldsymbol{\delta}^{(l)}\right\} \tag{263} \]
where ⊙ denotes the Hadamard product, which comprises the element-wise multiplication of two vectors. Note that the forward propagation equation (8.5) involves a summation over the second index of w_{ji} whereas the backpropagation equation (8.13) involves a summation over the first index. Hence, when we write the backpropagation equation in matrix notation, it involves the transpose of the matrix that appears in the forward propagation equation.
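A minimal sketch of the matrix forward and backward passes, assuming tanh hidden units, a sum-of-squares error, and arbitrary layer sizes with biases omitted for brevity:

```python
# Sketch of (263) for a small fully connected network.
import numpy as np

rng = np.random.default_rng(5)
sizes = [4, 6, 5, 3]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(3)]   # W[l] connects layer l to l+1

x = rng.normal(size=4)
a, z = [], [x]
for Wl in W:                                             # forward propagation
    a.append(Wl @ z[-1])
    z.append(np.tanh(a[-1]))

t = rng.normal(size=3)
delta = (z[-1] - t) * (1 - np.tanh(a[-1]) ** 2)          # output-layer deltas
for l in range(len(W) - 1, 0, -1):                       # backpropagation, equation (263)
    delta = (1 - np.tanh(a[l - 1]) ** 2) * (W[l].T @ delta)
print(delta)                                             # deltas at the first hidden layer
```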
\[
E(w_{ij}+\epsilon) - E(w_{ij}-\epsilon)
= \left\{E(w_{ij}) + \epsilon E'(w_{ij}) + \frac{\epsilon^2}{2}E''(w_{ij}) + O(\epsilon^3)\right\}
- \left\{E(w_{ij}) - \epsilon E'(w_{ij}) + \frac{\epsilon^2}{2}E''(w_{ij}) + O(\epsilon^3)\right\}
= 2\epsilon E'(w_{ij}) + O(\epsilon^3).
\]
Note that the ε² terms cancel. Substituting this into (264) we get
\[ \delta = \frac{2\epsilon E'(w_{ij}) + O(\epsilon^3)}{2\epsilon} - E'(w_{ij}) = O(\epsilon^2). \tag{265} \]
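The O(ε²) behaviour of central differences is easy to verify numerically; here an arbitrary smooth function stands in for E(w):

```python
# Numerical illustration that the central-difference error scales as O(epsilon^2).
import numpy as np

E = np.sin                      # stand-in for the error as a function of one weight
dE = np.cos                     # its exact derivative
w = 0.8

for eps in [1e-1, 1e-2, 1e-3]:
    central = (E(w + eps) - E(w - eps)) / (2 * eps)
    print(eps, abs(central - dE(w)))   # error drops by ~100x when eps drops by 10x
```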
8.4 If we introduce skip-layer weights, U, into the model described in Section 8.1.3, this will only affect the last of the forward propagation equations, (8.20), which becomes
\[ y_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j + \sum_{i=1}^{D} u_{ki} x_i. \tag{266} \]
Note that there is no need to include the input bias. The derivative with respect to u_{ki} can be expressed using the output {δk} of (8.21),
\[ \frac{\partial E}{\partial u_{ki}} = \delta_k x_i. \tag{267} \]
8.5 The alternative forward propagation scheme takes the first line of (8.29) as its starting point. However, rather than proceeding with a 'recursive' definition of ∂yk/∂aj, we instead make use of a corresponding definition for ∂aj/∂xi. More formally,
\[ J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} \tag{268} \]
where ∂yk/∂aj is defined by (8.33) for logistic sigmoid output units, (8.34) for softmax output units, or simply as δkj for the case of linear output units. We define ∂aj/∂xi = wji if aj is in the first hidden layer, and otherwise
\[ \frac{\partial a_j}{\partial x_i} = \sum_l \frac{\partial a_j}{\partial a_l}\frac{\partial a_l}{\partial x_i} \tag{269} \]
where
\[ \frac{\partial a_j}{\partial a_l} = w_{jl}\,h'(a_l). \tag{270} \]
Thus we can evaluate J_{ki} by forward propagating ∂aj/∂xi, with initial value w_{ji}, alongside a_j, using (269) and (270).
8.6 Using the chain rule together with (8.5) and (8.77), we have
\[ \frac{\partial E_n}{\partial w^{(2)}_{kj}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w^{(2)}_{kj}} = \delta_k z_j. \tag{271} \]
Thus,
\[ \frac{\partial^2 E_n}{\partial w^{(2)}_{kj}\,\partial w^{(2)}_{k'j'}} = \frac{\partial(\delta_k z_j)}{\partial w^{(2)}_{k'j'}} \tag{272} \]
and since zj is independent of the second-layer weights,
\[
\frac{\partial^2 E_n}{\partial w^{(2)}_{kj}\,\partial w^{(2)}_{k'j'}}
= z_j\,\frac{\partial\delta_k}{\partial w^{(2)}_{k'j'}}
= z_j\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial w^{(2)}_{k'j'}}
= z_j z_{j'} M_{kk'},
\]
where we have again used the chain rule together with (8.5) and (8.77). If both weights are in the first layer, we again use the chain rule, this time together with (8.5), (8.12) and (8.13), to get
\[
\frac{\partial E_n}{\partial w^{(1)}_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w^{(1)}_{ji}}
= x_i\sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}
= x_i\,h'(a_j)\sum_k w^{(2)}_{kj}\,\delta_k.
\]
Thus we have
\[ \frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(1)}_{j'i'}}
= \frac{\partial}{\partial w^{(1)}_{j'i'}}\!\left( x_i\,h'(a_j)\sum_k w^{(2)}_{kj}\,\delta_k \right). \tag{273} \]
Now we note that $x_i$ and $w^{(2)}_{kj}$ do not depend on $w^{(1)}_{j'i'}$, while $h'(a_j)$ is only affected in the case where j = j′. Using these observations together with (8.5), we get
\[
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(1)}_{j'i'}}
= x_i x_{i'}\,h''(a_j)\,I_{jj'}\sum_k w^{(2)}_{kj}\,\delta_k
+ x_i\,h'(a_j)\sum_k w^{(2)}_{kj}\,\frac{\partial\delta_k}{\partial w^{(1)}_{j'i'}}. \tag{274}
\]
From (8.5), (8.12), (8.13), (8.77) and the chain rule, we have
\[
\frac{\partial\delta_k}{\partial w^{(1)}_{j'i'}}
= \sum_{k'}\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\frac{\partial a_{k'}}{\partial a_{j'}}\frac{\partial a_{j'}}{\partial w^{(1)}_{j'i'}}
= x_{i'}\,h'(a_{j'})\sum_{k'}w^{(2)}_{k'j'} M_{kk'}. \tag{275}
\]
Substituting this back into (274) gives the required expression for the second derivatives with respect to two first-layer weights. Finally, from (271) we have
\[ \frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(2)}_{kj'}} = \frac{\partial(\delta_k z_{j'})}{\partial w^{(1)}_{ji}} \tag{276} \]
so that
\[
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(2)}_{kj'}}
= z_{j'} x_i\,h'(a_j)\sum_{k'}w^{(2)}_{k'j} M_{kk'} + \delta_k I_{jj'}\,h'(a_j)\,x_i
= x_i\,h'(a_j)\left\{ \delta_k I_{jj'} + z_{j'}\sum_{k'}w^{(2)}_{k'j} M_{kk'} \right\}.
\]
8.7 If we introduce skip-layer weights into the model discussed in Section 8.1.3, three new cases are added to the three already covered in Exercise 8.6. The first derivative with respect to a skip-layer weight u_{ki} can be written
\[ \frac{\partial E_n}{\partial u_{ki}} = x_i\,\frac{\partial E_n}{\partial a_k} \tag{277} \]
and hence, for two skip-layer weights,
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial u_{k'i'}}
= x_i\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\,\frac{\partial a_{k'}}{\partial u_{k'i'}}
= M_{kk'}\,x_i x_{i'},
\]
where we have also used (8.77). When one weight is in the skip layer and the other weight is in the hidden-to-output layer, we can use (277), (8.5) and (8.77) to get
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial w^{(2)}_{k'j}}
= x_i\,\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\,\frac{\partial a_{k'}}{\partial w^{(2)}_{k'j}}
= M_{kk'}\,z_j x_i.
\]
Finally, if one weight is a skip-layer weight and the other is in the input-to-hidden layer, (277), (8.5), (8.12), (8.13) and (8.77) together give
\[
\frac{\partial^2 E_n}{\partial u_{ki}\,\partial w^{(1)}_{ji'}}
= \frac{\partial}{\partial w^{(1)}_{ji'}}\!\left( x_i\,\frac{\partial E_n}{\partial a_k} \right)
= x_i\sum_{k'}\frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}\,\frac{\partial a_{k'}}{\partial w^{(1)}_{ji'}}
= x_i x_{i'}\,h'(a_j)\sum_{k'}M_{kk'}\,w^{(2)}_{k'j}.
\]
and the second derivatives are given by
\[
\frac{\partial^2 E}{\partial w_i\,\partial w_j}
= \sum_{n=1}^{N}\left\{ \left(\frac{\partial\mathbf{y}_n}{\partial w_j}\right)^{\!\mathrm T}\frac{\partial\mathbf{y}_n}{\partial w_i}
+ (\mathbf{y}_n - \mathbf{t}_n)^{\mathrm T}\frac{\partial^2\mathbf{y}_n}{\partial w_j\,\partial w_i} \right\}. \tag{280}
\]
As for the univariate case, we again assume that the second term of the second derivative vanishes and we are left with
\[ \mathbf{H} = \sum_{n=1}^{N}\mathbf{B}_n\mathbf{B}_n^{\mathrm T}, \tag{281} \]
8.9 Taking the second derivatives of (8.78) with respect to two weights wr and ws we obtain
\[
\frac{\partial^2 E}{\partial w_r\,\partial w_s}
= \sum_k \int \frac{\partial y_k}{\partial w_r}\frac{\partial y_k}{\partial w_s}\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}
+ \sum_k \int \left( y_k(\mathbf{x}) - \mathbb{E}_{t_k}[t_k|\mathbf{x}] \right)\frac{\partial^2 y_k}{\partial w_r\,\partial w_s}\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{283}
\]
Using the result (4.37) that the outputs yk(x) of the trained network represent the conditional averages of the target data, we see that the second term in (283) vanishes. The Hessian is therefore given by an integral of terms involving only the products of first derivatives. For a finite data set, we can write this result in the form
\[ \frac{\partial^2 E}{\partial w_r\,\partial w_s} = \frac{1}{N}\sum_{n=1}^{N}\sum_k \frac{\partial y_{kn}}{\partial w_r}\frac{\partial y_{kn}}{\partial w_s}. \tag{284} \]
where we have used the result proved earlier in the solution to Exercise 6.14. Taking the second derivatives we have
\[
\nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\left\{ \frac{\partial y_n}{\partial a_n}\nabla a_n\left(\nabla a_n\right)^{\mathrm T} + (y_n - t_n)\nabla\nabla a_n \right\}.
\]
Dropping the last term and using the result (5.72) for the derivative of the logistic sigmoid function, proved in the solution to Exercise 5.18, we finally get
\[
\nabla\nabla E(\mathbf{w}) \simeq \sum_{n=1}^{N} y_n(1-y_n)\nabla a_n\left(\nabla a_n\right)^{\mathrm T}
= \sum_{n=1}^{N} y_n(1-y_n)\,\mathbf{b}_n\mathbf{b}_n^{\mathrm T}
\]
where bn ≡ ∇an.
8.11 Using the chain rule, we can write the first derivative of (6.36) as
\[ \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N}\sum_{k=1}^{K}\frac{\partial E}{\partial a_{nk}}\frac{\partial a_{nk}}{\partial w_i}. \tag{285} \]
For a trained model, the network outputs will approximate the conditional class probabilities and so the last term inside the parentheses will vanish in the limit of a large data set, leaving us with
\[ (\mathbf{H})_{ij} \simeq \sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{l=1}^{K} y_{nk}\left(I_{kl} - y_{nl}\right)\frac{\partial a_{nk}}{\partial w_i}\frac{\partial a_{nl}}{\partial w_j}. \tag{287} \]
8.12 Suppose we have already obtained the inverse Hessian using the first L data points. By separating off the contribution from data point L + 1 in (8.40), we obtain
\[ \mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1}\mathbf{b}_{L+1}^{\mathrm T}, \qquad \mathbf{b}_{L+1}\equiv\nabla a_{L+1}. \]
We now consider the matrix identity (8.80). If we now identify HL with M and bL+1 with v, we obtain
\[
\mathbf{H}_{L+1}^{-1} = \mathbf{H}_{L}^{-1}
- \frac{\mathbf{H}_{L}^{-1}\mathbf{b}_{L+1}\mathbf{b}_{L+1}^{\mathrm T}\mathbf{H}_{L}^{-1}}
{1 + \mathbf{b}_{L+1}^{\mathrm T}\mathbf{H}_{L}^{-1}\mathbf{b}_{L+1}}. \tag{289}
\]
In this way, data points are sequentially absorbed until L + 1 = N and the whole data set has been processed. This result therefore represents a procedure for evaluating the inverse of the Hessian using a single pass through the data set. The initial matrix H0 is chosen to be αI, where α is a small quantity, so that the algorithm actually finds the inverse of H + αI. The results are not particularly sensitive to the precise value of α.
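The sequential update (289) can be checked against a direct matrix inversion; in the following sketch the vectors standing in for ∇a_n are random and the values of W, N and α are arbitrary.

```python
# Sketch of the sequential inverse-Hessian update (289), checked against a direct inverse.
import numpy as np

rng = np.random.default_rng(6)
W, N, alpha = 5, 40, 1e-3
b = rng.normal(size=(N, W))                    # stand-ins for the gradient vectors

Hinv = np.eye(W) / alpha                       # inverse of the initial matrix H0 = alpha*I
for n in range(N):
    v = Hinv @ b[n]
    Hinv -= np.outer(v, v) / (1.0 + b[n] @ v)  # equation (289)

H_direct = alpha * np.eye(W) + b.T @ b         # outer-product Hessian plus alpha*I
print(np.allclose(Hinv, np.linalg.inv(H_direct)))   # True
```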
\[ h'(a) = \frac{\exp(a)}{1+\exp(a)}. \tag{291} \]
\[ \frac{\mathrm{d}z}{\mathrm{d}w_1} = \frac{\exp(w_1 x + b_1)}{1+\exp(w_1 x + b_1)}\,x. \tag{293} \]
\[ \frac{\mathrm{d}y}{\mathrm{d}z} = w_2\,\frac{\exp(w_2 z + b_2)}{1+\exp(w_2 z + b_2)}. \tag{294} \]
Finally, combining these derivatives using the chain rule, and then substituting for z using (8.44), we obtain the required derivative dy/dw1.
8.14 The evaluation trace equations are given directly from the definition of the logistic map:
\[
\begin{aligned}
L_1 &= x \\
L_2 &= 4L_1(1-L_1) \\
L_3 &= 4L_2(1-L_2) \\
L_4 &= 4L_3(1-L_3).
\end{aligned}
\tag{296--299}
\]
Written out explicitly as functions of x, these become
\[
\begin{aligned}
L_1(x) &= x \\
L_2(x) &= 4x(1-x) \\
L_3(x) &= 16x(1-x)(1-2x)^2 \\
L_4(x) &= 64x(1-x)(1-2x)^2(1-8x+8x^2)^2.
\end{aligned}
\tag{301--304}
\]
Taking derivatives of these expressions, without simplification, produces formulas whose complexity grows much faster than the complexity of the expressions for the corresponding functions.
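The growth in the complexity of the derivative expressions can be illustrated symbolically using SymPy (the exact operation counts depend on SymPy's automatic simplification):

```python
# Operation counts for the nested logistic map and its unsimplified derivative.
import sympy as sp

x = sp.symbols('x')
L = x
for depth in range(2, 5):
    L = 4 * L * (1 - L)                  # unsimplified composition, as in (297)-(299)
    dL = sp.diff(L, x)                   # derivative via the product rule, left unsimplified
    print(depth, sp.count_ops(L), sp.count_ops(dL))   # derivative counts grow faster
```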
The tangent variables can be evaluated using the recursion
\[ \dot{v}_i = \sum_{j\in\mathrm{pa}(i)}\frac{\partial v_i}{\partial v_j}\,\dot{v}_j \]
where pa(i) denotes the parents of node i in the evaluation trace diagram. Using the evaluation trace diagram in Figure 8.4, together with (8.50) to (8.56), we then have
\[
\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= 0 \\
\dot{v}_3 &= \frac{\partial v_3}{\partial v_1}\dot{v}_1 + \frac{\partial v_3}{\partial v_2}\dot{v}_2 = \dot{v}_1 v_2 + \dot{v}_2 v_1 \\
\dot{v}_4 &= \frac{\partial v_4}{\partial v_2}\dot{v}_2 = \dot{v}_2\cos(v_2) \\
\dot{v}_5 &= \frac{\partial v_5}{\partial v_3}\dot{v}_3 = \dot{v}_3\exp(v_3) \\
\dot{v}_6 &= \frac{\partial v_6}{\partial v_3}\dot{v}_3 + \frac{\partial v_6}{\partial v_4}\dot{v}_4 = \dot{v}_3 - \dot{v}_4 \\
\dot{v}_7 &= \frac{\partial v_7}{\partial v_5}\dot{v}_5 + \frac{\partial v_7}{\partial v_6}\dot{v}_6 = \dot{v}_5 + \dot{v}_6.
\end{aligned}
\tag{309--315}
\]
The adjoint variables can be evaluated using the recursion
\[ \bar{v}_i = \sum_{j\in\mathrm{ch}(i)}\bar{v}_j\,\frac{\partial v_j}{\partial v_i} \]
where ch(i) denotes the children of node i in the evaluation trace graph. Using the evaluation trace diagram in Figure 8.4, together with (8.50) to (8.56), and starting at the output of the graph and working backwards, we then have
\[
\begin{aligned}
\bar{v}_7 &= 1 \\
\bar{v}_6 &= \bar{v}_7\frac{\partial v_7}{\partial v_6} = \bar{v}_7 \\
\bar{v}_5 &= \bar{v}_7\frac{\partial v_7}{\partial v_5} = \bar{v}_7 \\
\bar{v}_4 &= \bar{v}_6\frac{\partial v_6}{\partial v_4} = -\bar{v}_6 \\
\bar{v}_3 &= \bar{v}_5\frac{\partial v_5}{\partial v_3} + \bar{v}_6\frac{\partial v_6}{\partial v_3} = \bar{v}_5 v_5 + \bar{v}_6 \\
\bar{v}_2 &= \bar{v}_3\frac{\partial v_3}{\partial v_2} + \bar{v}_4\frac{\partial v_4}{\partial v_2} = \bar{v}_3 v_1 + \bar{v}_4\cos(v_2) \\
\bar{v}_1 &= \bar{v}_3\frac{\partial v_3}{\partial v_1} = \bar{v}_3 v_2.
\end{aligned}
\tag{317--323}
\]
8.17 From (8.49) we have the following expression for the partial derivative:
\[ \frac{\partial f}{\partial x_1} = x_2 + x_2\exp(x_1 x_2). \tag{324} \]
Evaluating this for (x1, x2) = (1, 2) gives
\[ \left.\frac{\partial f}{\partial x_1}\right|_{x_1=1,\,x_2=2} = 2 + 2\exp(2). \tag{325} \]
Forward propagation through the evaluation trace equations gives the primal variables
\[
\begin{aligned}
v_1 &= 1 \\
v_2 &= 2 \\
v_3 &= 2 \\
v_4 &= \sin(2) \\
v_5 &= \exp(2) \\
v_6 &= 2 - \sin(2) \\
v_7 &= 2 + \exp(2) - \sin(2).
\end{aligned}
\tag{326--332}
\]
For the tangent variables we can then use (8.58) to (8.64) to give
\[
\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= 0 \\
\dot{v}_3 &= 2 \\
\dot{v}_4 &= 0 \\
\dot{v}_5 &= 2\exp(2) \\
\dot{v}_6 &= 2 \\
\dot{v}_7 &= 2 + 2\exp(2)
\end{aligned}
\tag{333--339}
\]
and so we see that v̇7 does indeed represent the correct value for the derivative given by (325). Similarly, we can use the equations (8.70) to (8.76) of reverse-mode automatic differentiation to evaluate the adjoint variables as follows:
\[
\begin{aligned}
\bar{v}_7 &= 1 \\
\bar{v}_6 &= 1 \\
\bar{v}_5 &= 1 \\
\bar{v}_4 &= -1 \\
\bar{v}_3 &= \exp(2) + 1 \\
\bar{v}_2 &= (\exp(2) + 1) - \cos(2) \\
\bar{v}_1 &= 2\exp(2) + 2.
\end{aligned}
\tag{340--346}
\]
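Evaluating the traces above numerically confirms that forward-mode and reverse-mode propagation both reproduce the derivative value (325):

```python
# Forward and reverse passes for f(x1, x2) = x1*x2 + exp(x1*x2) - sin(x2) at (1, 2).
import numpy as np

x1, x2 = 1.0, 2.0

# Primal variables
v1, v2 = x1, x2
v3 = v1 * v2
v4 = np.sin(v2)
v5 = np.exp(v3)
v6 = v3 - v4
v7 = v5 + v6

# Forward-mode tangents with (x1_dot, x2_dot) = (1, 0)
d1, d2 = 1.0, 0.0
d3 = d1 * v2 + d2 * v1
d4 = d2 * np.cos(v2)
d5 = d3 * np.exp(v3)
d6 = d3 - d4
d7 = d5 + d6

# Reverse-mode adjoints, starting from v7_bar = 1
b7 = 1.0
b6 = b7
b5 = b7
b4 = -b6
b3 = b5 * v5 + b6
b2 = b3 * v1 + b4 * np.cos(v2)
b1 = b3 * v2

print(d7, b1, 2 + 2 * np.exp(2))   # all three agree
```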
8.18 The vectors e1, . . . , eD form a complete orthonormal basis and so we can expand an arbitrary D-dimensional vector r in the form
\[ \mathbf{r} = \sum_{i=1}^{D}\alpha_i\mathbf{e}_i. \tag{348} \]
Taking the inner product with ej and using orthonormality gives
\[ r_j = \mathbf{e}_j^{\mathrm T}\mathbf{r} = \alpha_j \tag{349} \]
where rj is the jth component of r. Hence we can write the expansion in the form
\[ \mathbf{r} = \sum_{i=1}^{D} r_i\mathbf{e}_i. \tag{350} \]
where f (x) is the original network function with elements fk (x). This can be in-
terpreted as a single pass of forward-mode automatic differentiation in which the
tangent variables associated with the input variables are given by ẋi = ri .
One way to see this more clearly is to introduce a function g(z) where z is a scalar
variable and the elements of g are given by gi (z) = ri z. From the perspective of
a network diagram this can be viewed as introducing an extra layer from a single
input z to the original inputs {xi }. The overall composite function can be written
as f (g(z)), which is now a function with just one input whose Jacobian is therefore
a matrix with a single column which can therefore be evaluated in a single pass of
forward-mode automatic differentiation. The elements of this vector are given by
\[ \frac{\partial f_k}{\partial z} = \sum_{i=1}^{D}\frac{\partial f_k}{\partial x_i}\frac{\partial x_i}{\partial z} = \sum_{i=1}^{D} J_{ki} r_i = (\mathbf{J}\mathbf{r})_k \tag{352} \]
and are therefore the elements of the Jacobian-vector product, as required. The tangent variables at the inputs to the main network are then given by
\[ \dot{x}_i = \frac{\partial x_i}{\partial z} = r_i. \tag{353} \]
Thus, we see that if the tangent variable ẋi for each input i is set to the corresponding
element ri of r, then a single pass of forward-mode automatic differentiation will
compute the Jacobian-vector product as required.
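The following sketch illustrates the idea for a small two-layer network with arbitrary weights: the product Jr computed from the full Jacobian agrees with a single forward propagation of tangent variables initialized to r.

```python
# Jacobian-vector product Jr for a small two-layer tanh network.
import numpy as np

rng = np.random.default_rng(7)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(3, 6))

def f(x):
    return W2 @ np.tanh(W1 @ x)        # network outputs f_k(x)

x = rng.normal(size=4)
r = rng.normal(size=4)

# Full Jacobian J_ki = d f_k / d x_i, then J r
a1 = W1 @ x
J = W2 @ np.diag(1 - np.tanh(a1) ** 2) @ W1
Jr_full = J @ r

# Forward-mode pass: set the input tangents to r and propagate them
a1_dot = W1 @ r
z1_dot = (1 - np.tanh(a1) ** 2) * a1_dot
Jr_forward = W2 @ z1_dot

print(np.allclose(Jr_full, Jr_forward))   # True
```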
Chapter 9 Regularization
9.1 We will start by showing that the set of rotations by multiples of 90° forms a group, by verifying the four group axioms of closure, associativity, the existence of an identity element, and the existence of inverses. By showing that these four axioms are satisfied, we have shown that this is indeed a group. The same argument can then be applied to the set of translations of an object in a two-dimensional plane.
9.2 Let
\[
\widetilde{y}_n = w_0 + \sum_{i=1}^{D}w_i\left(x_{ni} + \epsilon_{ni}\right)
= y_n + \sum_{i=1}^{D}w_i\epsilon_{ni}
\]
where yn = y(xn, w) and εni ∼ N(0, σ²), and we have used (9.52). From (9.53) we then define
\[
\begin{aligned}
\widetilde{E} &= \frac{1}{2}\sum_{n=1}^{N}\left\{\widetilde{y}_n - t_n\right\}^2
= \frac{1}{2}\sum_{n=1}^{N}\left\{ \widetilde{y}_n^2 - 2\widetilde{y}_n t_n + t_n^2 \right\} \\
&= \frac{1}{2}\sum_{n=1}^{N}\left\{ y_n^2 + 2y_n\sum_{i=1}^{D}w_i\epsilon_{ni} + \left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)^{\!2}
- 2t_n y_n - 2t_n\sum_{i=1}^{D}w_i\epsilon_{ni} + t_n^2 \right\}.
\end{aligned}
\]
If we take the expectation of Ẽ under the distribution of εni, we see that the second and fifth terms inside the braces vanish, since E[εni] = 0, while for the third term we get
\[ \mathbb{E}\!\left[\left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)^{\!2}\right] = \sum_{i=1}^{D}w_i^2\sigma^2 \]
as required.
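This expectation can be verified by Monte Carlo simulation for fixed weights; the data set, weights, and noise level in the following sketch are arbitrary.

```python
# Monte Carlo check: expected error under input noise equals the clean error
# plus (N/2) * sigma^2 * sum_i w_i^2 for fixed weights.
import numpy as np

rng = np.random.default_rng(8)
N, D, sigma = 100, 3, 0.2
X = rng.normal(size=(N, D))
w0, w = 0.5, rng.normal(size=D)
t = rng.normal(size=N)

y = w0 + X @ w
E_clean = 0.5 * np.sum((y - t) ** 2)

S = 10_000                                                 # number of noise draws
noise = rng.normal(0.0, sigma, size=(S, N, D))
y_noisy = w0 + np.einsum('snd,d->sn', X + noise, w)
E_noisy = 0.5 * np.sum((y_noisy - t) ** 2, axis=1).mean()

print(E_noisy, E_clean + 0.5 * N * sigma**2 * np.sum(w**2))   # approximately equal
```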
9.3 We first write the gradient descent formula in terms of continuous time t in the form
\[ \mathbf{w}(t+\epsilon) = \mathbf{w}(t) - \epsilon\widetilde{\eta}\,\nabla\Omega(\mathbf{w}) \tag{354} \]
where ε represents some finite time step and we have defined $\widetilde{\eta} = \eta/\epsilon$. We now make a Taylor expansion of the left-hand side in powers of ε to give
\[ \mathbf{w}(t) + \epsilon\,\frac{\mathrm{d}\mathbf{w}(t)}{\mathrm{d}t} + O(\epsilon^2) = \mathbf{w}(t) - \epsilon\widetilde{\eta}\,\nabla\Omega(\mathbf{w}). \tag{355} \]
Similarly, with the transformed outputs, weights and biases, (9.7) becomes
\[ \widetilde{y}_k = \sum_j \widetilde{w}_{kj} z_j + \widetilde{w}_{k0}. \]
This constraint can be enforced by adding a term to the unregularized error E(w) using a Lagrange multiplier, which we denote by λ/2, to give
\[ E(\mathbf{w}) + \frac{\lambda}{2}\left( \sum_{j=1}^{M}|w_j|^q - \eta \right). \tag{361} \]
Since the term λη/2 is constant with respect to w, minimizing (361) is equivalent to minimizing
\[ E(\mathbf{w}) + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q. \tag{362} \]
\[
\begin{aligned}
w_j^{(\tau)} &= \mathbf{u}_j^{\mathrm T}\mathbf{w}^{(\tau)} \\
&= \mathbf{u}_j^{\mathrm T}\mathbf{w}^{(\tau-1)} - \rho\,\mathbf{u}_j^{\mathrm T}\mathbf{H}\!\left(\mathbf{w}^{(\tau-1)} - \mathbf{w}^\star\right) \\
&= w_j^{(\tau-1)} - \rho\eta_j\,\mathbf{u}_j^{\mathrm T}\!\left(\mathbf{w}^{(\tau-1)} - \mathbf{w}^\star\right) \\
&= w_j^{(\tau-1)} - \rho\eta_j\!\left(w_j^{(\tau-1)} - w_j^\star\right)
\end{aligned}
\tag{364, 365}
\]
where we have used (9.59). To show that
\[ w_j^{(\tau)} = \left\{1 - (1-\rho\eta_j)^{\tau}\right\}w_j^\star \]
we can use proof by induction. Noting that the weights are initialized to $\mathbf{w}^{(0)} = \mathbf{0}$, the result above gives $w_j^{(1)} = \rho\eta_j w_j^\star = \{1-(1-\rho\eta_j)\}w_j^\star$, so the result holds for τ = 1. Now we assume that the result holds for τ = N − 1 and then make use of (365):
\[
\begin{aligned}
w_j^{(N)} &= w_j^{(N-1)} - \rho\eta_j\!\left(w_j^{(N-1)} - w_j^\star\right) \\
&= w_j^{(N-1)}(1-\rho\eta_j) + \rho\eta_j w_j^\star \\
&= \left\{1-(1-\rho\eta_j)^{N-1}\right\}w_j^\star(1-\rho\eta_j) + \rho\eta_j w_j^\star \\
&= \left\{(1-\rho\eta_j) - (1-\rho\eta_j)^{N}\right\}w_j^\star + \rho\eta_j w_j^\star \\
&= \left\{1-(1-\rho\eta_j)^{N}\right\}w_j^\star
\end{aligned}
\]
as required.
For each group of weights in the network that are shared, we sum the gradients over all of the weights in that group and then use this combined gradient to update the weights in the group. Note that, as long as the weights in each group are initialized to the same value, this will ensure that they remain equal after the update.
\[ p(j) = \pi_j \tag{369} \]
\[ p(w|j) = \mathcal{N}\!\left(w|\mu_j, \sigma_j^2\right). \tag{370} \]
9.9 This is easily verified by taking the derivative of (9.22), using (2.49) and standard derivatives, yielding
\[
\frac{\partial\Omega}{\partial w_i}
= \sum_j \frac{\pi_j\,\mathcal{N}\!\left(w_i|\mu_j,\sigma_j^2\right)}{\sum_k \pi_k\,\mathcal{N}\!\left(w_i|\mu_k,\sigma_k^2\right)}\,\frac{(w_i-\mu_j)}{\sigma_j^2}.
\]
Combining this with (9.23) and (9.24), we immediately obtain the second term of (9.25).
9.10 Since the {µj} only appear in the regularization term Ω(w), from (9.23) we have
\[ \frac{\partial\widetilde{E}}{\partial\mu_j} = \lambda\,\frac{\partial\Omega}{\partial\mu_j}. \tag{372} \]
Using (3.25), (9.22) and (9.24) and standard rules for differentiation, we can calculate the derivative of Ω(w) as follows:
\[
\frac{\partial\Omega}{\partial\mu_j}
= -\sum_i \frac{\pi_j\,\mathcal{N}\!\left(w_i|\mu_j,\sigma_j^2\right)}{\sum_{j'}\pi_{j'}\,\mathcal{N}\!\left(w_i|\mu_{j'},\sigma_{j'}^2\right)}\,\frac{w_i-\mu_j}{\sigma_j^2}
= -\sum_i \gamma_j(w_i)\,\frac{w_i-\mu_j}{\sigma_j^2}.
\]
9.11 Following the same line of argument as in Solution 9.10, we need the derivative of Ω(w) with respect to σj. Again using (3.25), (9.22) and (9.24) and standard rules for differentiation, we find this to be
\[
\frac{\partial\Omega}{\partial\sigma_j}
= -\sum_i \frac{\pi_j}{\sum_{j'}\pi_{j'}\,\mathcal{N}\!\left(w_i|\mu_{j'},\sigma_{j'}^2\right)}
\left\{ -\frac{1}{(2\pi)^{1/2}\sigma_j^2}\exp\!\left(-\frac{(w_i-\mu_j)^2}{2\sigma_j^2}\right)
+ \frac{1}{(2\pi)^{1/2}\sigma_j}\exp\!\left(-\frac{(w_i-\mu_j)^2}{2\sigma_j^2}\right)\frac{(w_i-\mu_j)^2}{\sigma_j^3} \right\}
\]
\[
= \sum_i \gamma_j(w_i)\left\{ \frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3} \right\}.
\]
\[
\frac{\partial\widetilde{E}}{\partial\eta_j}
= -\lambda\sum_i\sum_k \gamma_k(w_i)\,\frac{1}{\pi_k}\,\frac{\partial\pi_k}{\partial\eta_j}
= -\lambda\sum_i\sum_k \gamma_k(w_i)\left\{\delta_{jk} - \pi_j\right\}
= \lambda\sum_i\left\{\pi_j - \gamma_j(w_i)\right\} \tag{375}
\]
where we have used the fact that $\sum_k \gamma_k(w_i) = 1$ for all i.
9.13 The result is easily proved by substituting (9.36) into (9.37), and then substituting (9.35) into the resulting expression, giving
\[
\begin{aligned}
\mathbf{y} &= \mathbf{F}_3(\mathbf{z}_2) + \mathbf{z}_2 \\
&= \mathbf{F}_3\!\left(\mathbf{F}_2(\mathbf{z}_1) + \mathbf{z}_1\right) + \mathbf{F}_2(\mathbf{z}_1) + \mathbf{z}_1 \\
&= \mathbf{F}_3\!\left(\mathbf{F}_2\!\left(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}\right) + \mathbf{F}_1(\mathbf{x}) + \mathbf{x}\right)
+ \mathbf{F}_2\!\left(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}\right)
+ \mathbf{F}_1(\mathbf{x}) + \mathbf{x}.
\end{aligned}
\tag{376}
\]
If we then identify εm(x) and 1/M with xi and λi in (2.102), respectively, and take f(x) = x², we see from (2.102) that
\[
\left( \sum_{m=1}^{M}\frac{1}{M}\,\epsilon_m(\mathbf{x}) \right)^{\!2} \leqslant \sum_{m=1}^{M}\frac{1}{M}\,\epsilon_m(\mathbf{x})^2.
\]
Since this holds for all values of x, it must also hold for the expectation over x, proving (9.64).
9.16 If E(y(x)) is convex, we can apply (2.102) as follows:
\[
\begin{aligned}
E_{\mathrm{AV}} &= \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}_{\mathbf{x}}\!\left[E\!\left(y_m(\mathbf{x})\right)\right]
= \mathbb{E}_{\mathbf{x}}\!\left[\sum_{m=1}^{M}\frac{1}{M}E\!\left(y_m(\mathbf{x})\right)\right] \\
&\geqslant \mathbb{E}_{\mathbf{x}}\!\left[E\!\left(\sum_{m=1}^{M}\frac{1}{M}y_m(\mathbf{x})\right)\right]
= E_{\mathrm{COM}}
\end{aligned}
\]
where λi = 1/M for i = 1, . . . , M in (2.102) and we have implicitly defined versions of EAV and ECOM corresponding to E(y(x)).
9.17 To prove that (9.67) is a sufficient condition for (9.66) we have to show that (9.66) follows from (9.67). To do this, consider a fixed set of ym(x) and imagine varying the αm over all possible values allowed by (9.67) and consider the values taken by yCOM(x) as a result. The maximum value of yCOM(x) occurs when αk = 1 where yk(x) > ym(x) for m ≠ k, and hence all αm = 0 for m ≠ k. An analogous result holds for the minimum value. For other settings of α, the value of yCOM(x) must lie between these two extremes, and hence (9.66) must be satisfied.
9.18 For the elements of the dropout matrix we have
\[ \mathbb{E}[R_{ni}] = \sum_{R_{ni}\in\{0,1\}}\mathrm{Bern}(R_{ni}|\rho)\,R_{ni} = \rho. \tag{378} \]
Two elements Rni and Rnj will be independent unless j = i. Hence, for j ≠ i we have
\[
\mathbb{E}[R_{ni}R_{nj}]
= \sum_{R_{ni}\in\{0,1\}}\sum_{R_{nj}\in\{0,1\}}\mathrm{Bern}(R_{ni}|\rho)\,\mathrm{Bern}(R_{nj}|\rho)\,R_{ni}R_{nj}
= \left(\sum_{R_{ni}\in\{0,1\}}\mathrm{Bern}(R_{ni}|\rho)\,R_{ni}\right)\left(\sum_{R_{nj}\in\{0,1\}}\mathrm{Bern}(R_{nj}|\rho)\,R_{nj}\right)
= \rho^2,
\]
whereas if j = i we have $R_{ni}R_{nj} = R_{ni}^2 = R_{ni}$ and therefore
\[ \mathbb{E}[R_{ni}R_{nj}] = \mathbb{E}[R_{ni}] = \rho. \tag{379} \]
To find the expected value of the error function (9.69) we first expand out the square to give
\[
E(\mathbf{W}) = \sum_{n=1}^{N}\sum_{k=1}^{K}\left\{ t_{nk}^2 - 2t_{nk}\sum_{i=1}^{D}w_{ki}R_{ni}x_{ni}
+ \sum_{i=1}^{D}\sum_{j=1}^{D}w_{ki}R_{ni}x_{ni}\,w_{kj}R_{nj}x_{nj} \right\}.
\]
Next we take the expectation of the error and substitute for the expectations of the dropout matrix elements using (378) and (379) to give
\[
\begin{aligned}
\mathbb{E}[E(\mathbf{W})] &= \sum_{n=1}^{N}\sum_{k=1}^{K}\left\{ t_{nk}^2 - 2\rho\,t_{nk}\sum_{i=1}^{D}w_{ki}x_{ni}
+ \rho^2\left(\sum_{i=1}^{D}w_{ki}x_{ni}\right)^{\!2} \right\}
+ \sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{i=1}^{D}(\rho-\rho^2)\,w_{ki}^2 x_{ni}^2 \qquad\text{(380)} \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}\left\{ t_{nk} - \rho\sum_{i=1}^{D}w_{ki}x_{ni} \right\}^{\!2}
+ \rho(1-\rho)\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{i=1}^{D}w_{ki}^2 x_{ni}^2. \qquad\text{(381)}
\end{aligned}
\]
Finally, we can find a solution for the weights that minimize this expected error function by setting the derivatives with respect to wki equal to zero. It is convenient to write the resulting equations in the matrix form
\[ \mathbf{0} = -\mathbf{B} + \mathbf{W}\mathbf{M} \tag{382} \]
where W has elements wki, and the elements of B and M are given by
\[ B_{ki} = 2\rho\sum_{n=1}^{N} t_{nk}x_{ni} \tag{383} \]
\[ M_{ji} = 2\left\{ \rho^2\sum_{n=1}^{N} x_{nj}x_{ni} + \delta_{ji}\,\rho(1-\rho)\sum_{n=1}^{N}x_{ni}^2 \right\}. \tag{384} \]
Solving for the weights then gives
\[ \mathbf{W}^\star = \mathbf{B}\mathbf{M}^{-1}. \tag{385} \]
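The expectation (381) can be verified by averaging the error over randomly drawn dropout masks; the sizes, targets, and weights in the following sketch are arbitrary.

```python
# Monte Carlo check of (381): average dropped-out error vs closed-form expectation.
import numpy as np

rng = np.random.default_rng(9)
N, D, K, rho = 50, 4, 2, 0.7
X = rng.normal(size=(N, D))
T = rng.normal(size=(N, K))
W = rng.normal(size=(K, D))

S = 10_000
R = rng.binomial(1, rho, size=(S, N, D))                    # dropout masks R_ni
Y = np.einsum('kd,snd,nd->snk', W, R, X)                    # sum_i w_ki R_ni x_ni
E_mc = np.sum((T - Y) ** 2, axis=(1, 2)).mean()

E_formula = (np.sum((T - rho * X @ W.T) ** 2)
             + rho * (1 - rho) * np.sum((X ** 2) @ (W ** 2).T))
print(E_mc, E_formula)                                      # approximately equal
```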
10.1 We can impose the constraint ‖x‖² = K by using a Lagrange multiplier λ and maximizing
\[ \mathbf{w}^{\mathrm T}\mathbf{x} + \lambda\left(\|\mathbf{x}\|^2 - K\right). \tag{386} \]
Taking the gradient with respect to x and setting this gradient to zero gives
\[ \mathbf{w} + 2\lambda\mathbf{x} = \mathbf{0} \tag{387} \]
which shows that x = αw where α = −1/(2λ).
10.2 Let us represent the input array as a vector x = (x1, x2, . . . , x5)ᵀ. We will start with the case where there is only one convolutional filter, of width 3, whose weights we denote by the vector k = (k1, k2, k3)ᵀ. If we look at Figure 3, we can see that the three outputs, which we represent with the vector y = (y1, y2, y3)ᵀ, are given by
\[
\mathbf{y} = \begin{pmatrix}
x_1 k_1 + x_2 k_2 + x_3 k_3 \\
x_2 k_1 + x_3 k_2 + x_4 k_3 \\
x_3 k_1 + x_4 k_2 + x_5 k_3
\end{pmatrix}. \tag{388}
\]
Now we wish to find a matrix K such that y = Kx. We can see that it must be a 3 × 5 matrix, and each entry Kij is given by the contribution that xj makes to yi. Therefore K is given by
\[
\mathbf{K} = \begin{pmatrix}
k_1 & k_2 & k_3 & 0 & 0 \\
0 & k_1 & k_2 & k_3 & 0 \\
0 & 0 & k_1 & k_2 & k_3
\end{pmatrix}. \tag{389}
\]
This is an example of a Toeplitz matrix, in which each descending diagonal from left to right is constant. Convolution operations in 1D can always be represented as multiplication of the input array by a Toeplitz matrix.
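The construction of the Toeplitz matrix (389) is easily checked numerically; the input and filter values below are arbitrary.

```python
# Check the Toeplitz matrix (389) against the sliding-window operation (388).
import numpy as np

x = np.array([1.0, 2.0, -1.0, 3.0, 0.5])
k = np.array([0.2, -0.4, 0.6])

K = np.array([[k[0], k[1], k[2], 0.0, 0.0],
              [0.0, k[0], k[1], k[2], 0.0],
              [0.0, 0.0, k[0], k[1], k[2]]])

y_matrix = K @ x
y_direct = np.array([x[i] * k[0] + x[i + 1] * k[1] + x[i + 2] * k[2] for i in range(3)])
print(np.allclose(y_matrix, y_direct))                       # True
# The same result is given by NumPy's sliding dot product:
print(np.allclose(y_matrix, np.correlate(x, k, mode='valid')))
```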
10.3 Simple matrix multiplication shows that the convolution is given by
\[
\begin{pmatrix}
32 & 14 & -18 \\
-22 & -24 & 12 \\
22 & 6 & 18
\end{pmatrix}.
\]
10.4 There are many possibilities for indexing the elements of I, K, and C. One choice is simply to choose the indices in K(l, m) to have the ranges 1 ⩽ l ⩽ L and 1 ⩽ m ⩽ M, giving
\[ C(j, k) = \sum_{l=1}^{L}\sum_{m=1}^{M} I(j+l, k+m)\,K(l, m). \tag{390} \]
Figure 3 Figure showing a convolution operation of a filter k over an input array x with output y.
If we now discretize the x and z variables into bins of width ∆, we can approximate this integral using
\[ F(j\Delta) \simeq \sum_{l=-\infty}^{\infty} G(j\Delta - l\Delta)\,k(l\Delta)\,\Delta, \tag{396} \]
which takes the form of a discrete convolution
\[ C(j) = \sum_{l=-\infty}^{\infty} I(j-l)\,K(l) \]
where we have defined C(j) = F(j∆), I(l) = G(l∆), and K(l) = k(l∆)∆.
10.6 We saw in Section 10.2.3 that convolving an image of dimensions J × K, with additional padding P, with a filter of dimensions M × M will yield a feature map of dimension (J + 2P − M + 1) × (K + 2P − M + 1). If we now substitute P = (M − 1)/2, we see that the dimensions of the feature map are given by (J + 2(M − 1)/2 − M + 1) × (K + 2(M − 1)/2 − M + 1), which simplifies to J × K.
10.7 In the case with no padding and a stride of 1, an M × M kernel can be placed in J − M + 1 positions horizontally and K − M + 1 positions vertically, giving us (J − M + 1) × (K − M + 1) features. After applying padding P to each of the edges of the image, we have a new image of size (J + 2P) × (K + 2P), and hence the dimensionality of the feature layer would be (J + 2P − M + 1) × (K + 2P − M + 1). When we apply a stride, we divide the number of convolutions in each dimension by the stride and use the floor operator to account for the case where there is some remainder of the image in a given direction that is less than the stride. The initial 1 is not divided by the stride as it represents the first position and is therefore unaffected by the stride. This gives us
\[ \left( \left\lfloor \frac{J + 2P - M}{S} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{K + 2P - M}{S} \right\rfloor + 1 \right) \tag{398} \]
features.
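Equation (398) can be wrapped in a small helper function; the example values are arbitrary, and the first call reproduces the same-padding case of Exercise 10.6.

```python
# Feature-map size according to (398).
def conv_output_size(J, K, M, P, S):
    """Spatial size of the feature map for a J x K image, M x M kernel,
    padding P on each edge, and stride S."""
    return ((J + 2 * P - M) // S + 1, (K + 2 * P - M) // S + 1)

print(conv_output_size(224, 224, 3, 1, 1))   # (224, 224): same-padding case
print(conv_output_size(224, 224, 7, 0, 2))   # (109, 109)
```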
10.8 We assume connections between padding inputs and features count as connections
for simplicity. We also haven’t included max pooling or activation function connec-
tions. As every convolutional layer in the VGG-16 network uses a 3×3 filter, a given
node in a convolutional layer takes a number of inputs equal to 9 times the number of
channels in the previous layer, plus 1 for the bias. The first convolutional layer therefore has 224 × 224 × (3 × 9 + 1) × 64 = 89,915,392 connections to the previous layer. The number of connections for a fully connected layer is equal to the number of input features plus 1 for the bias, multiplied by the number of nodes, which for the first fully connected layer is equal to (7 × 7 × 512 + 1) × 4096 = 102,764,544. The numbers of connections for the remaining layers are shown in Table 1.
Table 1 Table showing the number of connections and learnable parameters in each layer of the VGG-16
network.
Now suppose that the kernel is separable, in other words that it factorizes in the form
\[ K(l, m) = F(l)\,G(m). \]
Consider first the summation over m. This involves a one-dimensional kernel G(m) which must be swept over the image for a total of J × (K − M + 1) positions, and in each position there are M operations to perform, giving a total number of operations equal to (K − M + 1)JM. This gives rise to an intermediate array of dimension J × (K − M + 1). Now the summation over l is performed, which is also a convolution, involving a one-dimensional kernel F(l). The number of positions for this kernel is given by (K − M + 1) × (J − L + 1), and in each position we have to perform L operations, giving a total of (K − M + 1)(J − L + 1)L. Overall, the total number of operations is therefore given by
\[ (K - M + 1)JM + (K - M + 1)(J - L + 1)L. \tag{403} \]
To see that this represents a saving in computation, consider the case where the image is large compared to the kernel size, so that J ≫ L and K ≫ M. Then (400) is approximately given by JKLM, whereas (403) is approximately given by JK(L + M). Note that as well as saving on compute, a separable kernel uses less storage. However, since it restricts the form of the kernel, it can lead to a significant reduction in generalization accuracy.
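The equivalence underlying this saving, namely that convolution with a separable kernel K(l, m) = F(l)G(m) can be carried out as two successive one-dimensional convolutions, can be checked numerically using SciPy; the image and kernel sizes below are arbitrary.

```python
# Separable 2D convolution equals two successive 1D convolutions.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(10)
J, Kdim, L, M = 20, 24, 3, 5
img = rng.normal(size=(J, Kdim))
F = rng.normal(size=L)
G = rng.normal(size=M)

direct = convolve2d(img, np.outer(F, G), mode='valid')           # 2D kernel: LM ops per position
separable = convolve2d(convolve2d(img, G[None, :], mode='valid'),
                       F[:, None], mode='valid')                 # two 1D sweeps: L+M ops per position
print(np.allclose(direct, separable))                            # True
```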
10.10 The derivatives of a cost function with respect to an activation value can be evaluated using backpropagation, which corresponds to an application of the chain rule of calculus. This backpropagation starts with the derivatives of the cost function with respect to the local activations, which for the cost function defined by (10.12) are given by
\[ \delta_{ijk} = \frac{\partial F(I)}{\partial a_{ijk}} = 2a_{ijk} \tag{404} \]
and hence, up to a factor of 2, are given by the activation values themselves. These values are then back-propagated through the network using (8.13) until the input layer is reached. This input layer represents the image, and the associated δ values correspond to derivatives of the cost function (10.12) with respect to the pixel values, as required.
10.11 Consider a 1-hot encoding scheme for the C object classes using binary variables yi ∈ {0, 1}, where i = 1, . . . , C and yi = 1 represents the presence of an object from class i. We then introduce an additional class with a binary variable yC+1 ∈ {0, 1}, where yC+1 = 1 means there is no object from any of the given classes in the image. Since we assume that a given image either contains an object from one of the classes or no objects, the variables y1, . . . , yC+1 form a 1-hot encoding in which all variables have the value 0 except for a single variable taking the value 1. We can train a model f(·) to take an image as input and to return a probability distribution over these C + 1 values. To relate the resulting sets of probabilities we can use the product rule of probability.
10.12 We first note that evaluating the scalar product between two N -dimensional vectors
requires N multiplies and N − 1 additions giving a total of 2N − 1 computational
steps.
For the network in Figure 10.22, the number of computational steps required to cal-
culate the first convolution operation is given by the number of features in the second
layer, which is 4 × 4 = 16, multiplied by the number of steps needed to evaluate the
output of the filter. Since the filter size is 3 × 3 = 9 each filter evaluation requires
9 × 8 = 72 steps. Hence the total number of computational steps for the first convo-
lutional layer is 16 × 72 = 1,152. For one evaluation of a 2 × 2 max pooling filter, there are 3 computational steps required. This is multiplied by the number of such operations required for the max pooling layer, which is 4, giving 12 steps in total. The fully connected layer is equivalent to a scalar product between two vectors of dimensionality 4 and hence requires 4 + 3 = 7 operations. Therefore a single evaluation of this network requires a total of 1,152 + 12 + 7 = 1,171 computational steps. For the network in Figure 10.23, there are 6 × 6 = 36 features in the second layer, each of which requires 9 × 8 = 72 computations in the convolutional layer, giving a total of 36 × 72 = 2,592 computational steps. The max pooling layer has 9 nodes
and hence requires 9 × 3 = 27 computations. Finally, the fully connected layer is equivalent to a scalar product between two vectors of dimensionality 9 and hence requires 9 + 8 = 17 operations, giving a total of 2,592 + 27 + 17 = 2,636 computational steps for a single evaluation of this network.
10.13 With the padding included, the input vector becomes
\[ \mathbf{x} = (0, x_1, x_2, x_3, x_4, 0)^{\mathrm T}. \tag{411} \]
For a filter with elements (w1, w2, w3) and a stride of 2, the output vector will be two-dimensional and can be written as y = (y1, y2)ᵀ, in which the elements are given by
\[ y_1 = w_2 x_1 + w_3 x_2 \tag{412} \]
\[ y_2 = w_1 x_2 + w_2 x_3 + w_3 x_4. \tag{413} \]
We can write this convolution operation using matrix notation in the form
\[ \mathbf{y} = \mathbf{A}\mathbf{x} \tag{414} \]
where
\[
\mathbf{A} = \begin{pmatrix}
w_1 & w_2 & w_3 & 0 & 0 & 0 \\
0 & 0 & w_1 & w_2 & w_3 & 0
\end{pmatrix}.
\]
Now consider up-sampling a two-dimensional vector z = (z1, z2)ᵀ with the same filter and stride, which produces a six-dimensional vector h with elements
\[
\begin{aligned}
h_1 &= w_1 z_1 \\
h_2 &= w_2 z_1 \\
h_3 &= w_3 z_1 + w_1 z_2 \\
h_4 &= w_2 z_2 \\
h_5 &= w_3 z_2 \\
h_6 &= 0.
\end{aligned}
\tag{416--421}
\]
This can be written in matrix notation as
\[ \mathbf{h} = \mathbf{B}\mathbf{z} \tag{423} \]
where
\[
\mathbf{B} = \begin{pmatrix}
w_1 & 0 \\
w_2 & 0 \\
w_3 & w_1 \\
0 & w_2 \\
0 & w_3 \\
0 & 0
\end{pmatrix}. \tag{424}
\]
By inspection we see that B = AT and hence up-sampling can be seen as the trans-
pose of convolution.
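The relationship B = Aᵀ is easily confirmed numerically for arbitrary filter weights, where A is the strided-convolution matrix acting on the padded input given above.

```python
# Numerical check that the up-sampling matrix B equals the transpose of A.
import numpy as np

w1, w2, w3 = 0.3, -0.7, 1.1

A = np.array([[w1, w2, w3, 0.0, 0.0, 0.0],     # stride-2 convolution over the padded input
              [0.0, 0.0, w1, w2, w3, 0.0]])

B = np.array([[w1, 0.0],
              [w2, 0.0],
              [w3, w1],
              [0.0, w2],
              [0.0, w3],
              [0.0, 0.0]])

print(np.array_equal(B, A.T))                  # True

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 0.0])   # padded input (0, x1, ..., x4, 0)
print(A @ x)                                   # (w2*x1 + w3*x2, w1*x2 + w2*x3 + w3*x4)
```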