3. Least Mean Square (LMS) Algorithm
3.1 Spatial Filtering
Uses a single linear neuron and can be understood as adaptive filtering:
y = ∑k wkxk for k = 1 to p
error e = d − y where d = desired value
cost function = mean squared error: J = ½ e²
[Figure: single linear neuron — inputs x1 … xp with weights w1 … wp, a bias input of −1 with weight w0 = θ, summed to give output y]
3.2 Steepest descent
Setting ∂J/∂wk = 0 determines the optimum weights.
Adjust the weights iteratively, moving along the error surface towards the optimum value:
wk(n+1) = wk(n) − η (∂J(n)/∂wk)
i.e. the updated value is proportional to the negative of the gradient ∂J/∂w of the error surface.
[Figure: error surface J against a single weight, with minimum Jmin at the optimum weight w0]
Since ∂J/∂wk = −e xk,
∴ wk(n+1) = wk(n) + η e(n) xk(n)
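A minimal sketch of this update in Python/NumPy, assuming a training set of stimulus vectors `X` and desired responses `d`; the function and parameter names are illustrative, not from the original:

```python
import numpy as np

def lms(X, d, eta=0.01, epochs=10):
    """LMS: w_k(n+1) = w_k(n) + eta * e(n) * x_k(n)."""
    w = np.zeros(X.shape[1])          # initial weights
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - w @ x        # error e = d - y, with y = sum_k w_k x_k
            w += eta * e * x          # step along -gradient of J = e^2 / 2
    return w
```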
Properties of LMS:
• a stochastic gradient algorithm in that the gradient vector is ‘random’ in
contrast to steepest descent
• on average improves in accuracy for increasing values of n.
• reduces storage requirement to information present in its current set of
weights, and can operate in a nonstationary environment.
3.2.1 Convergence (proof not given)
• in the mean: the weight vector → optimum value as n → ∞; requires
0 < η < 2/λmax, where λmax is the max eigenvalue of the autocorrelation matrix Rx = E[x xᵀ]
• in the mean square: the mean-square of the error signal → a constant as n → ∞; requires
0 < η < 2/tr[Rx], where tr[Rx] = ∑k λk ≥ λmax
• Faster convergence is usually obtained by making η a function of n, for example η(n) = c/n for some constant c
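A small sketch of checking these two bounds numerically, assuming a sample matrix `X` of stimulus vectors (an illustrative setup, not from the original):

```python
import numpy as np

X = np.random.randn(1000, 4)             # 1000 sample input vectors (illustrative)
Rx = (X.T @ X) / len(X)                  # estimate of autocorrelation E[x x^T]
lam_max = np.linalg.eigvalsh(Rx).max()
print("eta bound (mean):       ", 2 / lam_max)
print("eta bound (mean square):", 2 / np.trace(Rx))  # tighter, since tr >= lam_max
```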
4. Multilayer Feedforward Perceptron Training
4.1 Back-propagation Algorithm
[Figure: signal-flow graph of neuron j — outputs yi of the previous layer are weighted by wji and summed to give υj, passed through ϕ(·) to give yj, which is compared (via a −1 multiplier) with the desired response dj to give the error ej]
Let wji be the weight connecting neuron i to neuron j
error signal: ej(n) = dj(n) − yj(n)
net internal sum: υj(n) = ∑i wji(n) yi(n) for i = 0 to p
output: yj(n) = ϕj(υj(n))
Instantaneous sum of squared errors: E(n) = ½ ∑j ej²(n), summed over all j in the o/p layer
For N patterns, average squared error: Eav = (1/N) ∑n E(n) for n = 1 to N
• Learning goal is to minimise Eav by adjusting the weights, but instead of Eav the estimate E(n) is used on a pattern-by-pattern basis
From the chain rule:
∂E(n)/∂wji(n) = (∂E(n)/∂ej(n)) (∂ej(n)/∂yj(n)) (∂yj(n)/∂υj(n)) (∂υj(n)/∂wji(n))
∴ weight correction: ∆wji(n) = −η ∂E(n)/∂wji(n), i.e. steepest descent
= η δj(n) yi(n), where δj(n) = −∂E(n)/∂υj(n) is the local gradient
Case 1: output node, local gradient easily calculated: δj(n) = ej(n) ϕj′(υj(n))
Case 2: hidden node, more complex; need to consider neuron j feeding neuron k, where the inputs to neuron j are yi
δj(n) = −(∂E(n)/∂yj(n)) ϕj′(υj(n)) = −ϕj′(υj(n)) ∑k ek(n) (∂ek(n)/∂yj(n))
∴ δj(n) = −ϕj′(υj(n)) ∑k ek(n) (∂ek(n)/∂υk(n)) (∂υk(n)/∂yj(n)) = ϕj′(υj(n)) ∑k δk(n) wkj(n)
(the minus sign is absorbed, since ek(n) ∂ek(n)/∂υk(n) = −δk(n))
• Thus δj(n) is computed in terms of δk(n) which is closer to the output. After
calculating the network output in a forward pass, the error is computed and
recursively back-propagated through the network in a backward pass.
weight correction = (learning rate) × (local gradient) × (i/p signal of neuron)
∆wji(n) = η δj(n) yi(n)
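A compact sketch of one pattern-by-pattern back-propagation step for a single hidden layer of logistic units, in Python/NumPy; the names (`W1`, `W2`, etc.) are illustrative and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, eta=0.1):
    """One forward/backward pass; returns updated weight matrices."""
    # forward pass
    y1 = sigmoid(W1 @ x)               # hidden-layer outputs
    y2 = sigmoid(W2 @ y1)              # network outputs
    e = d - y2                         # error signal e_j = d_j - y_j
    # backward pass: local gradients
    delta2 = e * y2 * (1 - y2)         # output layer: delta = e * phi'(v)
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)   # hidden: phi' * sum_k delta_k w_kj
    # weight corrections: eta * (local gradient) * (input signal)
    W2 = W2 + eta * np.outer(delta2, y1)
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2
```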
4.2 Back-propagation training
Activation function:
yj(n) = ϕj(υj(n)) = 1 / [1 + exp(−υj(n))]
∂yj(n)/∂υj(n) = ϕj′(υj(n)) = exp(−υj(n)) / [1 + exp(−υj(n))]² = yj(n) [1 − yj(n)]
Note that the max value of ϕj′(υj(n)) occurs at yj(n) = 0.5 and
the min value of 0 occurs at yj(n) = 0 or 1
Momentum term: add α ∆wji(n − 1) to the weight correction, with 0 ≤ |α| < 1
helps locate a more desirable local minimum in a complex error surface:
• no change in gradient sign ⇒ ∆wji(n) increases and descent is accelerated
• changes in gradient sign ⇒ ∆wji(n) decreases and oscillations are stabilised
• a large enough α can stop the process terminating in shallow local minima
• with momentum, η can be larger
[Figure: example error surface over a single weight, with shallow and deep local minima]
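A sketch of the momentum-augmented update as a stand-alone helper, using the quantities from the `backprop_step` sketch above; `alpha` and the function name are illustrative:

```python
import numpy as np

def momentum_update(W, dW_prev, delta, y_in, eta=0.1, alpha=0.9):
    """Generalised delta rule: dW(n) = alpha * dW(n-1) + eta * delta * y."""
    dW = alpha * dW_prev + eta * np.outer(delta, y_in)
    return W + dW, dW                  # keep dW for the next iteration
```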
4.3 Other perspectives for improving generalisation
4.3.1 Pattern vs Batch Mode
Choice depends on particular problem:
• randomly updating weights after each pattern requires very little storage and
leads to a stochastic search which is less likely to get stuck in local minima
• updating after presentation of all training samples (an epoch) provides a more accurate estimate of the gradient vector since it is based on the average squared error Eav
4.3.2 Stopping criteria
e.g. gradient vector threshold and/or change in average squared error per epoch
4.3.3 Initialisation
• default is uniform distribution inside a small range of values
• values that are too large can lead to premature saturation (neuron outputs close to their limits), which gives small weight adjustments even though the error is large
4.3.4 Training Set Size
worst-case formula N > W/ε where:
N = no. of examples, W = no. of synaptic weights,
ε = fraction of errors permitted on test
e.g. W = 100 weights with ε = 0.1 permitted requires more than 1000 examples
4.3.5 Cross-Validation
• measures generalisation on test set
• various parameters including no. of hidden nodes, learning rate and
training set size can be set based on cross-validation performance
4.3.6 Network Pruning by complexity regularisation
(two possibilities: network growing and network pruning)
goal is to find the weight vector that minimises R(w) = s(w) + λ c(w)
where s(w) is a standard error measure, e.g. mean squared error,
λ is the regularisation parameter, and
c(w) is a complexity penalty that depends on the network, e.g. ||w||²
• the regularisation term allows identification of weights having an insignificant effect
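A minimal sketch of this regularised cost with a squared-norm penalty, for a linear model in Python; `lam` and the function name are illustrative assumptions:

```python
import numpy as np

def regularised_cost(w, X, d, lam=0.01):
    """R(w) = mean squared error + lambda * ||w||^2 (weight-decay penalty)."""
    errors = d - X @ w                 # e = d - y for each pattern
    s = np.mean(errors ** 2)           # standard error measure s(w)
    c = np.sum(w ** 2)                 # complexity penalty c(w) = ||w||^2
    return s + lam * c
```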
4.3.7 Other ways of minimising cost function
• Back-propagation uses a relatively simple, quick approach to minimising the cost function by obtaining an instantaneous estimate of the gradient
• methods and techniques from nonlinear optimum filtering and nonlinear function optimisation have been used to provide more sophisticated approaches to minimising the cost function, e.g. Kalman filtering, the conjugate-gradient method
4.4 Universal Approximation Theorem
single hidden layer with suitable ϕ gets arbitrarily close to any continuous
function
• the logistic function satisfies the ϕ(⋅) definition
• single hidden layer sufficient, but no clue on synthesis
• single hidden layer is restrictive in that hierarchical features not supported
4.5 Example of learning XOR Problem
x1 x2 | target
 0  0 |   0
 0  1 |   1
 1  0 |   1
 1  1 |   0

[Figure: decision boundaries of hidden neurons a and b in the (x1, x2) plane — each line separates a region with out = 0 from a region with out = 1]

[Figure: two-layer network solving XOR — hidden neuron a (weights 1, 1 from x1, x2; threshold 1.5), hidden neuron b (weights 1, 1; threshold 0.5), output neuron c (weight −2 from a, weight 1 from b; threshold 0.5)]

x1 x2 | a b | target
 0  0 | 0 0 |   0
 0  1 | 0 1 |   1
 1  0 | 0 1 |   1
 1  1 | 1 1 |   0

[Figure: decision boundary of output neuron c in the (a, b) plane]
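A sketch verifying this network in Python with hard-limiting activations; the thresholds follow the figure above:

```python
step = lambda v: int(v > 0)            # hard-limiting activation

def xor_net(x1, x2):
    a = step(x1 + x2 - 1.5)            # neuron a: fires only for (1, 1)
    b = step(x1 + x2 - 0.5)            # neuron b: fires unless (0, 0)
    return step(-2 * a + b - 0.5)      # neuron c combines a and b

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0
```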
4.6 Example: vehicle navigation
[Figure: network architecture — video input retina, fully connected to 9 hidden units, fully connected to 45 output units spanning steering angles from sharp left to sharp right]
network computes steering angle
training examples from human driver
obstacles detected by laser range finder
5. Associative Memories
5.1 Linear associative memory
stimulus ak = [ak1, ak2, …, akp]ᵀ → response bk = [bk1, bk2, …, bkp]ᵀ

[Figure: single-layer linear network — inputs ak1 … akp feed output nodes 1 … p, which produce bk1 … bkp]

        | w11(k)  w12(k)  …  w1p(k) |
W(k) =  | w21(k)  w22(k)  …  w2p(k) |
        |   ⋮       ⋮            ⋮   |
        | wp1(k)  wp2(k)  …  wpp(k) |

response: bk = W(k) ak
Design of the weight matrix for storing q pattern associations ak → bk:
estimate of weight matrix Ŵ = ∑k bk akᵀ for k = 1 to q
(Hebbian learning principle)
where bk akᵀ is the outer product of the column vector [bk1, bk2, …, bkp]ᵀ with the row vector [ak1, ak2, …, akp]
Pattern recall:
For recall of a stimulus pattern aj: b = Ŵ aj = ∑k (akᵀaj) bk
assuming the key patterns have been normalised, ajᵀaj = 1, so
b = bj + vj, where vj = ∑k (akᵀaj) bk for k = 1 to q, k ≠ j
i.e. vj results from interference from all the other stimulus patterns
∴ akᵀaj = 0 for j ≠ k → perfect recall (orthonormal patterns)
Main features:
• distributed memory
• auto- and hetero-associative
• content addressable and resistant to noise and damage
• interaction between stored patterns may lead to error on recall
The max. no. of patterns reliably stored is p, the dimension of input space
which is also the rank (no. of independent columns or rows) of W
For an auto-associative memory ideally W ak = ak showing that
stimulus patterns are eigenvectors of W with all unity eigenvalues
Example: a1 = [1 0 0 0]ᵀ, a2 = [0 1 0 0]ᵀ, a3 = [0 0 1 0]ᵀ
b1 = [5 1 0]ᵀ, b2 = [−2 1 6]ᵀ, b3 = [−2 4 3]ᵀ

                              | 5  −2  −2  0 |   giving perfect recall
memory weight matrix Ŵ =      | 1   1   4  0 |   since the stimulus patterns
                              | 0   6   3  0 |   are orthonormal

noisy stimulus e.g. [0.8 −0.15 0.15 −0.2]ᵀ gives [4 1.25 −0.45]ᵀ,
which is closer to b1 than to b2 or b3
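A sketch reproducing this example in Python: Ŵ is built from outer products and recall is tested with the noisy stimulus:

```python
import numpy as np

A = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float)  # stimuli a1..a3
B = np.array([[5, 1, 0], [-2, 1, 6], [-2, 4, 3]], float)         # responses b1..b3

# Hebbian estimate: W = sum_k outer(b_k, a_k)
W = sum(np.outer(b, a) for a, b in zip(A, B))
print(W)                               # [[5 -2 -2 0], [1 1 4 0], [0 6 3 0]]

noisy = np.array([0.8, -0.15, 0.15, -0.2])
print(W @ noisy)                       # [4. 1.25 -0.45], closest to b1
```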
6. Radial Basis Functions
6.1 Separability of patterns
Separability theorem (Cover) states that if the mapping ϕ(x) is nonlinear and the hidden-unit space is of high dimension relative to the input space, then patterns are more likely to be linearly separable
[Figure: RBF network — inputs x1 … xp feed nonlinear hidden units ϕ1 … ϕp; a linear output neuron forms the weighted sum with weights w1 … wp and bias w0 from a fixed input of 1]
Example of an RBF is a Gaussian:
ϕ(x) = exp(−||x − t||²), t = centre of the Gaussian
the output neuron is a linear weighted sum
• ϕ(x) is nonlinear and the hidden-unit space [ϕ1(x), ϕ2(x), …, ϕp(x)] is usually of high dimension relative to the input space, so patterns are more likely to be separable
• a difficult nonlinear optimisation problem has been converted to a linear optimisation problem that can be solved by the LMS algorithm
• if a different RBF is centred on each training pattern, then the training set can be learned perfectly
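A sketch of this linear step in Python: one Gaussian is centred on each training pattern and the output weights follow from solving the resulting linear system (data and names are illustrative):

```python
import numpy as np

def rbf_design_matrix(X, centres):
    """Phi[i, j] = exp(-||x_i - t_j||^2), one Gaussian per centre."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

X = np.random.randn(20, 2)             # 20 training patterns (illustrative)
d = np.random.randn(20)                # desired outputs
Phi = rbf_design_matrix(X, X)          # a centre on every training pattern
w = np.linalg.solve(Phi, d)            # linear problem: Phi w = d
print(np.allclose(Phi @ w, d))         # training set learned (near) perfectly
```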
6.2 Example: XOR
use two hidden Gaussian functions:
ϕ1(x) = exp(−||x − t1||²), t1 = [1, 1]ᵀ
ϕ2(x) = exp(−||x − t2||²), t2 = [0, 0]ᵀ
[Figure: the four patterns in the (ϕ1, ϕ2) plane — (1,1) and (0,0) map to opposite corners, while (0,1) and (1,0) map onto a single point; a straight decision boundary separates the two classes]
x1 x2 | ϕ1(x)  ϕ2(x)
 0  0 |  e⁻²     1
 0  1 |  e⁻¹    e⁻¹
 1  0 |  e⁻¹    e⁻¹
 1  1 |   1     e⁻²
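A sketch computing this table in Python, confirming that (0,1) and (1,0) collapse onto the same point in ϕ-space:

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi = lambda x, t: np.exp(-np.sum((x - t) ** 2))   # Gaussian RBF

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    x = np.array(x, float)
    print(x, phi(x, t1), phi(x, t2))   # (0,1) and (1,0) give identical features
```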
6.3 Ill-posed Hypersurface Reconstruction
Inverse problem of finding unknown mapping F from domain X and range Y is
well-posed if:
1. for every x ∈ X there exists y ∈ Y (existence)
2. for every pair of inputs x, t ∈ X, F(x) = F(t) iff x = t (uniqueness)
3. mapping is continuous (continuity) X Y
x F(x)
Learning is ill-posed because of sparsity of information & noise in training set
Regularisation Theory for solving ill-posed problems (Tikhonov) uses a modified cost functional that includes a complexity term:
R(F) = s(F) + λ c(F)
where s(F) is the standard error term and c(F) is the regularising term
• one regularised solution is given by a linear superposition of multivariate Gaussian basis functions, with centres xi and widths σi:
F(x) = ∑i wi exp(−||x − xi||² / (2σi²)) for i = 1 to N
practical ways of regularising:
• reduce the number of RBFs
• change the σ of the RBFs
• choose the positions of the centres
6.4 RBF Networks vs. MLP
• MLP may have multiple hidden layers vs. RBF has a single hidden layer
• MLP nodes share a common computation model vs. RBF hidden & o/p layers are fundamentally different
• MLP layers are usually all nonlinear vs. RBF has a nonlinear hidden layer but a linear output
• MLP units compute the inner product of i/p vector & weight vector vs. RBF units compute the Euclidean norm between the i/p vector and the centre of the appropriate unit
• MLP forms a global approximation and is therefore good at extrapolation vs. RBF forms a local approximation with fast learning but poor extrapolation
6.5 Learning Strategies
variety of possibilities, since a nonlinear optimisation strategy for the hidden layer is combined with a linear optimisation strategy in the output layer. For the hidden layer the main choice involves how the centres are learned:
• Fixed centres selected at random, e.g. choose the Gaussian exp(−(M/d²) ||x − ti||²), where M = no. of centres and d = distance between them
• Self-organised selection of centres, e.g. k-NN or a self-organising NN
• Supervised selection of centres, e.g. error-correction learning with a suitable cost function using modified gradient descent
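A sketch of the fixed-centres strategy in Python, assuming training inputs `X`; taking d as the maximum distance between the chosen centres is an assumption here, and the names are illustrative:

```python
import numpy as np

def fixed_centres(X, M, rng=np.random.default_rng(0)):
    """Pick M random centres; width follows exp(-(M/d^2) * ||x - t||^2)."""
    centres = X[rng.choice(len(X), M, replace=False)]
    # d taken as the maximum distance between the chosen centres (assumption)
    d = max(np.linalg.norm(a - b) for a in centres for b in centres)
    phi = lambda x: np.exp(-(M / d**2) * ((x - centres) ** 2).sum(axis=1))
    return centres, phi
```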
6.6 Example: curve fitting
RBF for approximating (x − 2)(2x + 1)/(1 + x²) from 15 noise-free examples
15 Gaussian hidden units with the same σ
Three designs are generated, for σ = 0.5, σ = 1.0, σ = 1.5
output shown for 200 inputs uniformly sampled in the range [−8, 12]
[Figure: the 15 training points and the three fitted curves for σ = 0.5, σ = 1.0 and σ = 1.5]
best compromise is σ = 1.0
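A sketch reproducing this experiment in Python, under the assumptions above (15 noise-free samples, a shared σ, weights found by exact interpolation):

```python
import numpy as np

f = lambda x: (x - 2) * (2 * x + 1) / (1 + x ** 2)

x_train = np.linspace(-8, 12, 15)        # 15 noise-free examples
sigma = 1.0                              # try 0.5, 1.0, 1.5

# Gaussian design matrix with a centre on every training point
phi = lambda x: np.exp(-(x[:, None] - x_train[None, :]) ** 2 / (2 * sigma ** 2))
w = np.linalg.solve(phi(x_train), f(x_train))   # interpolate the 15 points

x_test = np.linspace(-8, 12, 200)
y_hat = phi(x_test) @ w                  # network output over the range
print(np.abs(y_hat - f(x_test)).max())   # approximation error depends on sigma
```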