MATHEMATICS OF NEURAL NETWORKS
Models, Algorithms and Applications
OPERATIONS RESEARCH/COMPUTER SCIENCE
INTERFACES
SERIES
Barth, Peter
Max-Planck-Institut fur Informatik, Germany
Logic-Based 0-1 Constraint Programming
Jones, Christopher V.
University of Washington
Visualization and Optimization
EDITED BY
Stephen W ELLACOTT
University of Brighton
United Kingdom
John C MASON
University of Huddersfield
United Kingdom
Iain J ANDERSON
University of Huddersfield
United Kingdom
A C.I.P. Catalogue record for this book is available from the Library of Congress.
PREFACE XXI
Part I INVITED PAPERS 1
1 N-TUPLE NEURAL NETWORKS
N. M. Allinson and A. R. Kolcz 3
2 INFORMATION GEOMETRY OF NEURAL
NETWORKS -AN OVERVIEW-
Shun-ichi Amari 15
3 Q-LEARNING: A TUTORIAL AND
EXTENSIONS
George Cybenko, Robert Gray and Katsuhiro Moizumi 24
4 ARE THERE UNIVERSAL PRINCIPLES OF
BRAIN COMPUTATION?
Stephen Grossberg 34
5 ON-LINE TRAINING OF MEMORY-DRIVEN
ATTRACTOR NETWORKS
Morris W. Hirsch 41
6 MATHEMATICAL PROBLEMS ARISING
FROM CONSTRUCTING AN ARTIFICIAL
BRAIN
J. G. Taylor 47
Part II SUBMITTED PAPERS 59
7 THE SUCCESSFUL USE OF PROBABILITY
DATA IN CONNECTIONIST MODELS
J. R. Alexander Jr. and J. P. Coughlin 61
8 WEIGHTED MIXTURE OF MODELS FOR
ON-LINE LEARNING
P. Edgar An 67
9 LOCAL MODIFICATIONS TO RADIAL BASIS
NETWORKS
I. J. Anderson 73
PSYCHOLOGY
Richard Filer and James Austin 186
31 REGULARIZATION AND REALIZABILITY IN
RADIAL BASIS FUNCTION NETWORKS
Jason A.S. Freeman and David Saad 192
32 A UNIVERSAL APPROXIMATOR
NETWORK FOR LEARNING CONDITIONAL
PROBABILITY DENSITIES
D. Husmeier, D. Allen and J. G. Taylor 198
33 CONVERGENCE OF A CLASS OF NEURAL
NETWORKS
Mark P. Joy 204
34 APPLICATIONS OF THE COMPARTMENTAL
MODEL NEURON TO TIME SERIES ANALYSIS
S. Kasderidis and J. G. Taylor 209
35 INFORMATION THEORETIC NEURAL
NETWORKS FOR CONTEXTUALLY GUIDED
UNSUPERVISED LEARNING
Jim Kay 215
36 CONVERGENCE IN NOISY TRAINING
Petri Koistinen 220
37 NON-LINEAR LEARNING DYNAMICS WITH
A DIFFUSING MESSENGER
Bart Krekelberg and John G. Taylor 225
38 A VARIATIONAL APPROACH TO
ASSOCIATIVE MEMORY
Abderrahim Labbi 230
39 TRANSFORMATION OF NONLINEAR
PROGRAMMING PROBLEMS INTO
SEPARABLE ONES USING MULTILAYER
NEURAL NETWORKS
Bao-Liang Lu and Koji Ito 235
40 A THEORY OF SELF-ORGANISING NEURAL
NETWORKS
S P Luttrell 240
41 NEURAL NETWORK SUPERVISED TRAINING
BASED ON A DIMENSION REDUCING
METHOD
G.D. Magoulas, M.N. Vrahatis, T.N. Grapsa and G.S. Androulakis 245
J. P. Coughlin
Towson State University,
Towson, MD 21032, USA.

Jack D. Cowan
Departments of Mathematics and
Neurology, The University of Chicago,
Chicago, IL 60637, USA.
Email: [email protected]

Carol G. Crawford
U.S. Naval Academy,
Department of Mathematics,
Annapolis MD 21402, USA.
Email: [email protected]

George Cybenko
Thayer School of Engineering,
8000 Cummings Hall, Dartmouth College,
Hanover, NH 03755, USA.
Email: [email protected]

Laurence C Dixon
Engineering Research and
Development Centre,
University of Hertfordshire,
College Lane, Hatfield,
Herts, AL10 9AB, UK.

Christopher Dehainaut
Phillips Laboratory,
Kirtland Air Force Base,
Albuquerque, New Mexico, USA.

A. Delgado
National University of Colombia,
Elec. Eng. Dept., AA No. 25268,
Bogota, DC, Colombia, SA.
Email: [email protected]

B. Doyon
Unite INSERM 230, Service de Neurologie,
CHU Purpan,
31059 Toulouse Cedex, France.

A. Easdown
School of Computing and
Mathematical Sciences,
University of Brighton,
Brighton BN1 4GJ, UK.
Email: [email protected]

Michael Eisele
The Salk Institute,
Computational Neurobiology Laboratory,
PO Box 85800, San Diego,
CA 92186-5800, USA.
Email: [email protected]

S.W. Ellacott
School of Computing and
Mathematical Sciences,
University of Brighton,
Brighton BN1 4GJ, UK.
Email: [email protected]

Stefano Fanelli
Dipartimento di Matematica,
Universita di Roma "Tor Vergata",
Via della Ricerca Scientifica,
00133 Roma, Italy.
Email: [email protected]

Alistair Ferguson
Engineering Research and
Development Centre,
University of Hertfordshire,
College Lane, Hatfield,
Herts, AL10 9AB, UK.
Email: [email protected]

Richard Filer
Advanced Computer Architecture Group,
Department of Computer Science,
University of York,
Heslington, YO1 5DD, UK.
Email: [email protected]
Jason A.S. Freeman
Centre for Neural Systems,
University of Edinburgh,
Edinburgh EH8 9LW, UK.
Email: [email protected]

Laurent Freyss
Centre d'Etudes et de
Recherches de Toulouse,
2 avenue Edouard Belin, BP 4025,
31055 Toulouse Cedex, France.

Stephen Grossberg
Boston University, Department of Cognitive
and Neural Systems, and
Center for Adaptive Systems,
677 Beacon Street, Boston, Massachusetts
02215, USA.

Linda M. Haines
University of Natal,
Pietermaritzburg, South Africa.
Email: [email protected]
Ansgar H. L. West
Neural Computing Research Group,
Aston University,
Aston Triangle, Birmingham B4 7ET, UK.
Email: [email protected]
Christopher K. I. Williams
Neural Computing Research Group,
Department of Computer Science and
Applied Mathematics,
Aston University, Birmingham B4 7ET, UK.
Email: [email protected]
Li-Qun Xu
Intelligent Systems Research Group,
BT Research Laboratories
Martlesham Heath,
Ipswich, IP5 7RE, UK.
Email: [email protected]
Dedication
This volume is dedicated to Professor Patrick Parks who died in 1995. Patrick was
famous both for his work in nonlinear stability and automatic control and for his
more recent contributions to neural networks, especially on learning procedures
for CMAC systems. Patrick was known to many as a good friend and colleague,
and as a gentleman, and he is greatly missed.
PREFACE
This volume of research papers comprises the proceedings of the first International
Conference on Mathematics of Neural Networks and Applications (MANNA), which
was held at Lady Margaret Hall, Oxford from July 3rd to 7th, 1995 and attended by
116 people. The meeting was strongly supported and, in addition to a stimulating
academic programme, it featured a delightful venue, excellent food and accommo-
dation, a full social programme and fine weather - all of which made for a very
enjoyable week.
This was the first meeting with this title and it was run under the auspices of the
Universities of Huddersfield and Brighton, with sponsorship from the US Air Force
(European Office of Aerospace Research and Development) and the London Math-
ematical Society. This enabled a very interesting and wide-ranging conference pro-
gramme to be offered. We sincerely thank all these organisations, USAF-EOARD,
LMS, and Universities of Huddersfield and Brighton for their invaluable support.
The conference organisers were John Mason (Huddersfield) and Steve Ellacott
(Brighton), supported by a programme committee consisting of Nigel Allinson
(UMIST), Norman Biggs (London School of Economics), Chris Bishop (Aston),
David Lowe (Aston), Patrick Parks (Oxford), John Taylor (King's College, Lon-
don) and Kevin Warwick (Reading). The local organiser from Huddersfield was
Ros Hawkins, who took responsibility for much of the administration with great
efficiency and energy. The Lady Margaret Hall organisation was led by their bursar,
Jeanette Griffiths, who ensured that the week was very smoothly run.
It was very sad that Professor Patrick Parks died shortly before the conference. He
made important contributions to the field and was to have given an invited talk at
the meeting.
Leading the academic programme at the meeting were nine invited speakers. Nigel
Allinson (UMIST), Shun-ichi Amari (Tokyo), Norman Biggs (LSE), George Cy-
benko (Dartmouth), Frederico Girosi (MIT), Stephen Grossberg (Boston), Morris
Hirsch (Berkeley), Helge Ritter (Bielefeld) and John Taylor (King's College, Lon-
don). The supporting programme was substantial; out of about 110 who submitted
abstracts, 78 delegates were offered and accepted opportunities to contribute pa-
pers. An abundance of relevant topics and areas were therefore covered, which was
indeed one of the primary objectives.
The main aim of the conference and of this volume was to bring together researchers
and their work in the many areas in which mathematics impinges on and contributes
to neural networks, including a number of key applications areas, in order to expose
current research and stimulate new ideas. We believe that this aim was achieved.
In particular the meeting attracted significant contributions in such mathematical
aspects as statistics and probability, statistical mechanics, dynamics, mathematical
biology and neural sciences, approximation theory and numerical analysis, algebra, geometry and combinatorics, and control theory. It also covered a considerable
range of neural network topics in such areas as learning and training, neural net-
work classifiers, memory based networks, self organising maps and unsupervised
learning, Hopfield networks, radial basis function networks, and the general area
of neural network modelling and theory. Finally, applications of neural networks
were considered in such topics as chemistry, speech recognition, automatic control,
nonlinear programming, medicine, image processing, finance, time series, and dy-
namics. The final collection of papers in this volume consists of 6 invited papers and
over 60 contributed papers, selected from the papers presented at the conference
following a refereeing procedure of both the talks and the final papers. We seriously
considered dividing the material into subject areas, but in the end decided that this
would be arbitrary and difficult - since so many papers addressed more than one
key area or issue.
We cannot conclude without mentioning some social aspects of the conference. The
reception in the atmospheric Old Library at LMH was accompanied by music from
a fine woodwind duo, Margaret and Richard Thorne, who had first met one of the
organisers during a sabbatical visit to Canberra, Australia! The conference dinner
was memorable for preliminary drinks in the lovely setting of the Fellows Garden,
excellent food, and an inspirational after-dinner speech by Professor John Taylor.
Finally the participants found some time to investigate the local area, including a
group excursion to Blenheim Palace.
We must finish by giving further broad expressions of thanks to the many staff at
Universities of Huddersfield and Brighton, US Air Force (EOARD), London Mathe-
matical Society, and Lady Margaret Hall who helped make the conference possible.
We also thank the publishers for their co-operation and support in the publication
of the proceedings. Finally we must thank all the authors who contributed papers
without whom this volume could not have existed.
PART I
INVITED PAPERS
N-TUPLE NEURAL NETWORKS
N. M. Allinson and A. R. Kolcz
Department of Electrical Engineering and Electronics, UMIST, Manchester, UK.
The N-Tuple Neural Network (NTNN) is a fast, efficient memory-based neural network capable
of performing non-linear function approximation and pattern classification. The random nature
of the N-tuple sampling of the input vectors makes precise analysis difficult. Here, the NTNN
is considered within a unifying framework of the General Memory Neural Network (GMNN) -
a family of networks which include such important types as radial basis function networks. By
discussing the NTNN within such a framework, a clearer understanding of its operation and
efficient application can be gained. The nature of the intrinsic tuple distances, and the resultant
kernel, is also discussed, together with techniques for handling non-binary input patterns. An
example of a tuple-based network, which is a simple extension of the conventional NTNN, is shown
to yield the best estimate of the underlying regression function, E(Y|x), for a finite training set.
Finally, the pattern classification capabilities of the NTNN are considered.
1 Introduction
The origins of the N-tuple neural network date from 1959, when Bledsoe and Brown-
ing [1] proposed a pattern classification system that employed random sampling of a
binary retina by taking N-bit long ordered samples (i.e., N-tuples) from the retina.
These samples form the addresses to a number of memory nodes, with each bit
in the sample corresponding to an individual address line. The N-tuple sampling
is sensitive to correlations occurring between different regions for a given class of
input patterns. Certain patterns will yield regions of the retina where the prob-
ability of a particular state of a selected N-tuple will be very high for a pattern
class (e.g., predominantly 'white' or 'black' if we are considering binary images of
textual characters). If a set of exemplar patterns is presented to the retina, each of
the N-tuple samples can be thought of as estimating the probability of occurrence
of its individual states for each class. A cellular neural network interpretation of N-
tuple sampling was provided by Aleksander [2], and as we attempt to demonstrate
in this paper, its architecture conforms to what we term the General Memory
Neural Network (GMNN). Though the N-tuple neural network is more commonly
thought of as a supervised pattern classifier, we will consider first the general prob-
lem of approximating a function, f, which exists in a D-dimensional real space,
R^D. This function is assumed to be smooth and continuous, and we possess a
finite number of sample pairs {(x_i, y_i) : i = 1, 2, ..., T}. We will further assume that
this training data is subject to a noise component, that is, y_i = f(x_i) + ε, where
ε is a random error term with zero mean. A variant of the NTNN for function
approximation was first proposed by Tattersall et al [3] and termed the Single-
Layer-Lookup-Perceptron (SLLUP). The essential elements of the SLLUP are the
same as the basic NTNN except that the nodal memories contain numeric weights.
A further extension of the basic NTNN, originally developed by Bledsoe and Bisson
[4], records the relative frequencies at which the various nodal memories are ad-
dressed during training. The network introduced in Section 4 combines aspects of
these two networks and follows directly from the consolidated approach presented
in Section 2.
We discuss, in Section 3, some details of the NTNN with particular reference to its
mapping between sampled patterns on the retina and the N-tuple distance metric,
and the transformation of non-binary element vectors onto the binary retina. The
form of the first mapping, which is an approximately exponential function, is the
kernel function of the NTNN - though due to the random nature of the sampling,
this must be considered in a statistical sense. Finally, a brief note is given on
a Bayesian approximation that indicates how these networks can be employed as
pattern classifiers.
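To make the addressing scheme concrete, here is a minimal sketch of random N-tuple sampling of a binary retina (our illustration, not code from the paper; the helper names `make_tuples` and `addresses` are hypothetical):

```python
import random

def make_tuples(retina_size, K, N, seed=0):
    """Randomly assign N distinct retina bit positions to each of K memory nodes."""
    rng = random.Random(seed)
    return [rng.sample(range(retina_size), N) for _ in range(K)]

def addresses(retina, tuples):
    """Read each node's N sampled bits as an N-bit address."""
    addrs = []
    for positions in tuples:
        addr = 0
        for p in positions:
            addr = (addr << 1) | retina[p]  # each sampled bit drives one address line
        addrs.append(addr)
    return addrs

# A 16-bit retina sampled by K = 4 nodes using 3-tuples:
tuples = make_tuples(16, K=4, N=3)
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
print(addresses(x, tuples))  # four addresses, each in {0, ..., 7}
```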
2 The General Memory Neural Network
Examples of GMNN include Radial Basis Function (RBF) [5] networks and the
General Regression Neural Network (GRNN) [6]. These networks can provide pow-
erful approximation capabilities and have been subject to rigorous analysis. A fur-
ther class of networks, of which the NTNN is one, have not been treated to such
detailed examination. However, these networks (together with others such as the
CMAC [7] and the Multi-Resolution Network [8]) are computationally very efficient
and better suited to hardware implementation. The essential architectural compo-
nents of GMNNs are a layer of memory nodes, arranged into a number of blocks,
and an addressing element that selects the set of locations participating in the
computation of the output response. An extended version of this section is given in
[9].
2.1 Canonical Form of the General Memory Neural Network
The GMNN can be defined in terms of the following elements:
• A set of K memory nodes, each possessing a finite number |A_k| of addressable
locations.
• An address generator which assigns an address vector
A(x) = [A_1(x), A_2(x), ..., A_K(x)]
to each input point x. The address generated for the kth memory node is
denoted by A_k(x).
• The network's output, g, is obtained by combining the contents of the selected
memory locations, collected as
m(x) = [m_1(A_1(x)), m_2(A_2(x)), ..., m_K(A_K(x))], (1)
where m_k(A_k(x)) is the content of the memory location selected at the kth
memory node by the address generated by x for that node (this will be iden-
tified as simply m_k(x)). No specific format is imposed on the nature of the
memories other than that the format is uniform for all K nodes.
• A learning procedure exists which permits the network to adjust the nodal
memory contents, in response to the training set, so that some error criterion,
π(f, g), is minimised. (A minimal code sketch of these elements follows this list.)
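The sketch below is our illustration only: dictionaries stand in for the |A_k| nodal memories, the address generator is passed in as a function, and the combination rule in `output` uses summation, which is one common choice rather than a prescription of the paper.

```python
class GMNN:
    """Skeleton of the canonical GMNN: K memory nodes plus an address generator."""
    def __init__(self, K, address_fn):
        self.K = K
        self.address_fn = address_fn              # x -> [A_1(x), ..., A_K(x)]
        self.nodes = [dict() for _ in range(K)]   # sparse nodal memories m_k

    def selected(self, x):
        """Contents of the K memory locations selected by x, as in (1)."""
        return [self.nodes[k].get(a, 0.0)
                for k, a in enumerate(self.address_fn(x))]

    def output(self, x):
        """Combine the selected contents; summation is one common choice."""
        return sum(self.selected(x))
```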
Each node of the network performs a simple vector quantization of the input space
into |A_k| cells. For each node, the address-generating element can be split into
an index generator and an address decoder. The index generator, I_k, selects a cell
for every x ∈ Ω and assigns it a unique index value, I_k(x) ∈ {1, 2, ..., |A_k|};
hence the index generator identifies the quantization cell to which the input point
belongs. The address decoder uses this generated index value to specify the physical
memory address within the relevant node k. Hence, A_k(x) = A_k(I_k(x)).
Therefore, a cell, R_i^k, can be defined as the set of all input points which result in
the selection of the address corresponding to the ith index of the kth node:
R_i^k = {x ∈ Ω : I_k(x) = i}. (2)
Each of the cells is closed and bounded, as the input space is compact in R^D. The
selection of a cell is given by the following operator or activation function:
S_i^k(x) = (I_k(x) = i) = { 1 if x ∈ R_i^k, 0 otherwise },  i = 1, ..., |A_k|. (3)
The quantization of Ω performed by the individual nodes is combined, through the
intersection of the K cells being superimposed, to yield a global quantizer. The
number of cells |A| is given by the number of all such distinct intersections:
|A| = Σ_{i_1=1}^{|A_1|} Σ_{i_2=1}^{|A_2|} ··· Σ_{i_K=1}^{|A_K|} ( ⋂_{k=1}^{K} {x ∈ Ω : I_k(x) = i_k} ≠ ∅ ), (4)
the upper bound being given by |A|_max = Π_{k=1}^{K} |A_k|. The address generation
element of the network is distributed across the nodes, so that the general structure
of Figure 1a emerges. Alternatively, the address generation can be considered at
the global level (Figure 1b). These two variants are equivalent.
The quantization of the input space by the network produces values that are con-
stant over each cell (we are ignoring, for the present, external kernel functions).
The value of f assigned to the ith cell is normally expressed as the average value
of f over the cell,
f(R_i) = arg min_c ∫_{R_i} d_f(f(x), c) dx, (5)
where d_f is the squared-error function, whose minimiser is precisely this average.
In most supervised learning schemes, this representation of f(R_i) is estimated
inherently through the minimisation of an error function.
For K = 1, the GMNN simply reduces to a VQ followed by a look-up
table. There would need to be at least one input point per quantization cell, and the
granularity of the quantization needs to be sufficient to meet the degree of approx-
imation performance appropriate for the required task. When there are multiple
nodes (K > 1), the quantization at the node level can be much coarser which, in
turn, increases the probability of input points lying inside a cell; the fine granu-
larity is achieved through the superposition of the nodal quantizers. Learning and
generalisation are only possible through the use of multiple nodes. Points that are
close to each other in Ω should share many addresses, and vice versa.
2.2 GMNN Distance Metrics
The address proximity function, which quantifies the proximity of points in Ω as
the number of identical generated nodal addresses, is given by
κ : Ω² → {0, 1, ..., K},  κ(x, y) = Σ_{k=1}^{K} (A_k(x) = A_k(y)) = Σ_{k=1}^{K} (I_k(x) = I_k(y)). (6)
[Figure 1: general structure of the GMNN, showing a co-ordinate transformation, a radial projection and the K nodes.]
The address distance function, defined as the number of different generated nodal
addresses, is given by
ρ : Ω² → {0, 1, ..., K},  ρ(x, y) = Σ_{k=1}^{K} (A_k(x) ≠ A_k(y)) = Σ_{k=1}^{K} (I_k(x) ≠ I_k(y)). (7)
The binary nodal incidence function, which returns 1 if two inputs share a common
address at a given network node and 0 otherwise, is defined as
M_k(x, y) = (I_k(x) = I_k(y)) = { 1 if I_k(x) = I_k(y), 0 if I_k(x) ≠ I_k(y). (8)
From these definitions, several properties directly follow:
∀(x, y ∈ Ω):  ρ(x, x) = 0,  ρ(x, y) = ρ(y, x),  κ(x, x) = K,  κ(x, y) = κ(y, x), (9)
κ(x, y) = K − ρ(x, y),  Σ_{k=1}^{K} M_k(x, y) = κ(x, y). (10)
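Given the nodal address vectors A(x) and A(y) (lists of K addresses, as produced by an address generator such as the earlier sketch), the proximity and distance functions are one-liners; this is our illustration of (6), (7) and (10):

```python
def kappa(ax, ay):
    """Address proximity (6): number of shared nodal addresses."""
    return sum(a == b for a, b in zip(ax, ay))

def rho(ax, ay):
    """Address distance (7): number of differing nodal addresses."""
    return sum(a != b for a, b in zip(ax, ay))

# Properties (9)-(10): for any pair, kappa(ax, ay) == len(ax) - rho(ax, ay).
```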
ε^i = y^i − Σ_{k=1}^{K} w_k(x^i), (11)
where w_k(x^i) is the value of the weight selected by x^i at the kth node. The par-
ticipating weights are modified by a value, Δ^i, proportional to this error so as to
reduce the error:
w_k(x^i) ← w_k(x^i) + Δ^i. (12)
Initially, all weight values are set to zero. As w_k(x^i) is shared by all points within
the neighbourhood N_k(x^i), this weight updating can affect the network response
for points from outside the training set, that is, within an input-space region given
by
N(x^i) = ∪_{k=1}^{K} N_k(x^i). (13)
The output of the network after the training is complete, for an arbitrary x ∈ Ω,
will depend on all training samples lying within the neighbourhood of x:
g(x) = Σ_{k=1}^{K} w_k(x) = Σ_{k=1}^{K} Σ_{i=1}^{T} M_k(x, x^i) Δ^i, (14)
where Δ^i denotes the accumulated updates contributed by sample i.
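A sketch of this training scheme, using the GMNN skeleton above (our illustration; the learning-rate constant `lr` and the division by K are our assumptions about how Δ is made proportional to the error):

```python
def train_epoch(net, data, lr=0.1):
    """One pass of error-proportional updates, in the spirit of (11)-(12)."""
    for x, y in data:
        err = y - net.output(x)        # residual error for this sample, cf. (11)
        delta = lr * err / net.K       # update proportional to the error
        for k, a in enumerate(net.address_fn(x)):
            net.nodes[k][a] = net.nodes[k].get(a, 0.0) + delta   # cf. (12)
```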
Figure 3 (a) The delineation of input space by the hard decision planes of each
tuple element's threshold. Each region is marked by its specific binary state of
the 3-tuple, t_1. (b) The thermometer coding inherent in N-tuple sampling. The
variable, x_1, is uniformly quantized into six discrete regions (L = 6). The indicated
2-tuple partitions this interval into three unequal quantization regions, with the
binary state of the 2-tuple indicated. (c) The delineation of the 3-dimensional input
space into tetrahedral regions through the use of a ranking code. The binary space
representation of the input space is also shown.
There is little difference in the general form of these two sampling methods, though
there may be crucial differences in performance for specific tasks. The distance
function depends exponentially on the ratio of the Hamming distance, H, between
patterns to the retinal size, R. The rate of decrease is proportional to the tuple
size.
3.2 Input Encoding
There is a direct and monotonic dependence of ρ_NTNN on the Hamming distance in
the binary space of the network's retina. For binary patterns, the N-tuple sampling
Allinson & Kolcz: N-Tuple Neural Networks 11
provides the desired mapping between the input and output addresses. For non-
binary input patterns, the situation is not so clear. One obvious solution is to use
a thermometer or bar-chart code, where one bit is associated with every level of an
input integer. This creates a linear array of 2^n bits for an n-bit long integer. This
can produce very large retinas if the input dimensionality and quantization level are
large. The use of the natural binary code or Gray code is not feasible. Though these
are compact codes, there is no monotonic relationship between input and pattern
distances. The concatenation of several Gray codes [3] offers an improvement over
a limited region and enhances the dynamic range over the binary and straight Gray
code. The exponential dependence of ρ_NTNN on the Hamming distance means
that strict proportionality is not required, but monotonicity is required within an
active region of Hamming distances.
The potential of CMAC encoding, and further aspects of input coding methods, are
discussed in Kolcz and Allinson [14]. Improvements in the input mapping, which in
turn produce a more isotropic kernel, are given in Kolcz and Allinson [15], where
rotation and hyper-sphere codings are described. Two further techniques will be
briefly introduced here in order to indicate the wide range of possible sampling and
coding schemes. Figure 3b shows one input variable, Xl, which is uniformly quan-
tized to six levels and this is sampled by the indicated 2-tuple. The corresponding
states of the resultant tuple for the three resulting sub-intervals indicate that ther-
mometer encoding can be inherent in tuple sampling. This concept can be extended
to the multivariate case. If the input space, Ω, is a D-dimensional hypercube and
each memory node distributes its N address lines among these dimensions, then
the space is effectively quantized into Π_{d=1}^{D} (N_d + 1) hyper-rectangular cells. This
assumes random sampling, such that there are N_d address lines per input dimen-
sion (where N_d = N/D). The placement of tuples can be very flexible (i.e., uniform
quantization is not essential) and the sampling process can take into account the
density of training points within the input space.
In rank-order coding, the non-binary N-tuple is transformed to a tuple of ranked
values (e.g., (20, 63, 40, 84, 122, 38) becomes, in ascending order, the ranked tuple
(0, 3, 2, 4, 5, 1)). Each possible ordering is assigned a unique consecutive ranking
number, which is converted to binary format and then used as the retinal input.
Rank-order coding produces an equal significance code. The use of these relation-
ships is equivalent to delineating the input space into hyper-tetrahedrons rather
than the usual hyper-rectangles (Figure 3c).
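The two codings just described are easy to state in code; the following sketch is our illustration (it assumes distinct tuple values for the rank ordering and builds the ranking-number table by brute force, which is only sensible for small tuples):

```python
from itertools import permutations

def thermometer(level, levels):
    """Thermometer code: one bit per quantization level, the lowest `level` bits set."""
    return [1 if i < level else 0 for i in range(levels)]

def rank_order(t):
    """Map a non-binary tuple to its rank tuple, then to a unique ranking number."""
    ranks = tuple(sorted(t).index(v) for v in t)   # (20,63,40,84,122,38) -> (0,3,2,4,5,1)
    table = {p: i for i, p in enumerate(permutations(range(len(t))))}
    return table[ranks]   # converted to binary format, this becomes the retinal input

print(thermometer(3, 6))                      # [1, 1, 1, 0, 0, 0]
print(rank_order((20, 63, 40, 84, 122, 38)))
```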
4 N-tuple Regression Network
The framework for the GMNN proposed earlier, together with the derivation of the tuple
distance metric, is employed here in the development of a modified NTNN which
operates as a non-parametric regression estimator. The formal derivation of this
network, and the proof that the N-tuple kernel is a valid one for estimating the underlying
probability function, is given in Kolcz and Allinson [16]. The purpose of this section
is to show the relative simplicity of this network compared with other implementa-
tions.
During the training phase, the network is presented with T data pairs, (x^i, y^i),
where x^i is the D-dimensional input vector and y^i is the corresponding scalar
output of the system under consideration. The input vector is represented by a
12 CHAPTER 1
unique selection of the K tuple addresses with their associated weight and counter
values.
{tl(X), t2(X)"", tK(X)}
x --+ { {Wl(X), W2(X)", ., WK(X)} (28)
{al (x), a2(x), ... , aK(x)}
During training, each addressed tuple location is updated according to
w_k(x^i) ← w_k(x^i) + y^i  and  a_k(x^i) ← a_k(x^i) + 1, (29)
for i = 1, 2, ..., T and k = 1, 2, ..., K.
Initially all weight and counter values are set to zero. After training, the network
output, ŷ(x), is obtained from
ŷ(x) = Σ_{k=1}^{K} w_k(x) / Σ_{k=1}^{K} a_k(x). (30)
An additional condition covers the case where all addressed locations are zero; here the
output is set to zero:
Σ_{k=1}^{K} a_k(x) = 0  →  ŷ(x) = 0. (31)
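Equations (28)-(31) translate almost line by line into code; the following is our sketch (tuple addressing is abstracted into `address_fn`, and dictionaries serve as the nodal memories):

```python
class NTupleRegression:
    """N-tuple regression network: a weight and a counter per addressed location."""
    def __init__(self, K, address_fn):
        self.address_fn = address_fn
        self.w = [dict() for _ in range(K)]   # summed target values
        self.a = [dict() for _ in range(K)]   # access counters

    def train(self, data):
        for x, y in data:                     # update rule (29)
            for k, adr in enumerate(self.address_fn(x)):
                self.w[k][adr] = self.w[k].get(adr, 0.0) + y
                self.a[k][adr] = self.a[k].get(adr, 0) + 1

    def predict(self, x):                     # output rules (30)-(31)
        addrs = list(enumerate(self.address_fn(x)))
        num = sum(self.w[k].get(adr, 0.0) for k, adr in addrs)
        den = sum(self.a[k].get(adr, 0) for k, adr in addrs)
        return num / den if den else 0.0
```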
Figure 4 shows the modifications needed to a conventional NTNN to form the N-
tuple regression network.

[Figure 4: Modifications to the nodes and output elements of the NTNN to yield
the N-tuple regression network.]

Expanding (30) by means of (29) and property (10), the trained network output can
be written as
ŷ(x) = [ Σ_{i=1}^{T} y^i (1 − ρ(x, x^i)/K) ] / [ Σ_{i=1}^{T} (1 − ρ(x, x^i)/K) ]. (32)
This suggests that the network output is an approximate solution of the gener-
alised regression function, E(Y|x), provided that the bracketed term in (32) is a
valid kernel function. This function is continuous, symmetrical, non-negative and
possesses finite support. These are all necessary conditions. A close approximation
(based on the exponential approximation of tuple distances) is also representable
as a product of univariate kernel functions. Taken together these provide sufficient
conditions for a valid kernel function [17]. A wide-ranging set of experiments on
chaotic time-series prediction and non-linear system modelling has been conducted
[16], which confirms the successful operation of this network. A major advantage of
the NTNN implementation over other approaches is its fast, and fixed, speed of
operation. Each recall operation involves addressing a fixed number of locations.
There is no need for preprocessing large data sets, through data clustering, as is
often the case for RBF networks [18].
5 Pattern Classification
So far we have restricted our considerations to the approximation properties of the
NTNN, but the other major application - namely, pattern classification - can
be discussed within this common framework. The training phase of a supervised
network provides estimates of the conditional probabilities of individual pattern
classes. The class membership probabilities can be formulated through the Bayes
relationship, i.e.,
P(x ∈ c) = P(x|c) P(c) / P(x), (33)
where c is the class label for a particular class {c = 1,2, ... , C}. The modified
NTNN discussed in Section 4 can be reformulated in terms of this classification.
The network through training approximates C indicator functions, which denote
membership to an individual class.
I_c(x) = { 1 if x ∈ c, 0 otherwise. (34)
Modifying (32), the indicator functions can be approximated, after training, by
Î_c(x) = [ Σ_{i=1}^{T} (x^i ∈ c)(1 − ρ(x, x^i)/K) ] / [ Σ_{i=1}^{T} (1 − ρ(x, x^i)/K) ]
       = Σ_{k=1}^{K} w_k^c / Σ_{k=1}^{K} a_k ≈ P(c) Σ_{k=1}^{K} P(t_k|c) / Σ_{k=1}^{K} P(t_k). (35)
This relationship gives the ratio of the cumulative summation of all training points
belonging to a class c which have an N-tuple distance of 0, 1, ..., (K − 1) from
x, to a similar cumulative summation for all training points. The decision surfaces
present in the K-dimensional weight space are described by Σ_{k=1}^{K} w_k^c = const,
and the winning class is given by c_winner = arg max_{c=1,2,...,C} Σ_{k=1}^{K} w_k^c.
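In code, the classifier is just one regression-style weight table per class plus an arg-max; a sketch under that assumption, reusing the NTupleRegression class above:

```python
def classify(nets, x):
    """Winning class: the largest per-class sum of selected weights.
    `nets` maps each class label c to an NTupleRegression-style net."""
    scores = {c: sum(net.w[k].get(adr, 0.0)
                     for k, adr in enumerate(net.address_fn(x)))
              for c, net in nets.items()}
    return max(scores, key=scores.get)
```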
6 Conclusions
The unifying approach proposed for a wide class of memory-based neural networks
means that practical, but poorly understood, networks (such as the NTNN) can
be considered in direct comparison with networks (such as RBF networks) that
possess a much firmer theoretical foundation. The random sampling inherent in
the N-tuple approach makes detailed analysis difficult so this link is all the more
important. The pragmatic advantages of NTNNs have been demonstrated in the
regression network described above, where large data-sets can be accommodated
with fixed computational overheads. The possible range of input sampling and
encoding strategies has been illustrated, but by no means exhaustively. There is
still a need to seek other strategies that will provide optimum kernel functions for
specified recognition or approximation tasks. The power and flexibility of Bledsoe
and Browning's original concept has not, as yet, been fully exploited.
REFERENCES
[1] Bledsoe W W and Browning I, Pattern recognition and reading by machine, IRE Joint Com-
puter Conference, 1959, 225-232.
[2] Aleksander I, Fused adaptive circuit which learns by example, Electronics Letters, 1965, 1,
173-174.
[3] Tattersall G D, Foster S and Johnston R D, Single-layer lookup perceptrons, IEE Proceedings
- F: Radar and Signal Processing, 1991, 138, 46-54.
[4] Bledsoe W W and Bisson C L, Improved memory matrices for the N-tuple pattern recognition
method, IRE Transactions on Electronic Computers, 1962, 11, 414-415.
[5] Broomhead D S and Lowe D, Multivariable functional interpolation and adaptive networks,
Complex Systems, 1988, 2, 321-355.
[6] Specht D F, A general regression neural network, IEEE Transactions on Neural Networks,
1991,2, 568-576.
[7] Albus J S, A new approach to manipulator control: the cerebellar model articulation controller
(CMAC), Journal of Dynamic Systems, Measurement and Control, 1975, 97,220-227.
[8] Moody J, Fast learning in multi-resolution hierarchies, in Advances in Neural Information
Processing 1 (Touretzky D S, ed.), 1989, Morgan Kaufmann: San Mateo, CA, 29-39.
[9] Kolcz A and Allinson N M, General Memory Neural Network - extending the properties
of basis networks to RAM-based architectures, 1995 IEEE International Conference on Neural
Networks, Perth, Western Australia.
[10] Park J and Sandberg, I W, Universal approximation using radial basis function networks,
Neural Computation, 1991, 3, 246-257.
[11] Kavli T, ASMOD - an algorithm for adaptive modelling of observational data, International
Journal of Control, 1993, 58, 947-967.
[12] Kolcz A and Allinson N M, Application of the CMAC-input encoding scheme in the N-tuple
approximation network, IEE Proceedings - E: Computers and Digital Techniques, 1994, 141,
177-183.
[13] Kolcz A and Allinson N M, Enhanced N-tuple approximators, Proceedings of Weightless
Neural Network Workshop 93, 1993, 38-45.
[14] Allinson N M and Kolcz A, The theory and practice of N-tuple neural networks, in Neural
Networks (Taylor J G, ed.), 1995, Alfred Waller, 53-70.
[15] Kolcz A and Allinson N M, Euclidean mapping in an N-tuple approximation network, Sixth
IEEE Digital Signal Processing Workshop, 1994, 285-289.
[16] Kolcz A and Allinson N M, N-tuple regression network, Neural Networks, 1996, 9, 855-870.
[17] Parzen E, On estimation of a probability density function and mode, Annals of Mathematical
Statistics, 1962, 33, 1065-1076.
[18] Moody J and Darken C J, Fast learning in networks of locally-tuned processing units, Neural
Computation, 1989, 1, 281-294.
INFORMATION GEOMETRY OF NEURAL
NETWORKS -AN OVERVIEW-
Shun-ichi Amari
University of Tokyo, Tokyo, Japan, RIKEN Frontier Research Program,
Wako-City, Saitama, Japan. Email: [email protected]
The set of all the neural networks of a fixed architecture forms a geometrical manifold where
the modifiable connection weights play the role of coordinates. It is important to study all such
networks as a whole rather than the behavior of each network in order to understand the capability
of information processing of neural networks. What is the natural geometry to be introduced in
the manifold of neural networks? Information geometry gives an answer, giving the Riemannian
metric and a dual pair of affine connections. An overview is given of the information geometry of
neural networks.
1 Introduction to Neural Manifolds
Let us consider a neural network of fixed architecture specified by parameters w =
(w_1, ..., w_p) which represent the connection weights and thresholds of the network.
The parameters are usually modifiable by learning. The set N of all such networks
is considered a p-dimensional neural manifold, where w is a coordinate system in
N. Because it includes all the possible networks belonging to that architecture, the
total capabilities of the networks are made clear by studying the manifold N itself.
To be specific, let N be the set of multilayer feedforward networks each of which
receives an input x and emits an output z. The input-output relation is described
as
z = f(x; w),
where the total output z depends on w, which describes all the connection weights
and thresholds of the hidden and output neurons. Let us consider the space S of
all the square integrable functions of x,
S = {k(x)},
and assume for the moment that f(x; w) is square integrable. The set S is the infinite-
dimensional L² space, and the neural manifold N is a part of it, that is, a p-
dimensional subspace embedded in S. This shows that not all functions are
realizable by neural networks.
Given a function k(x), we would like to find a neural network whose behavior
f(x; w) approximates k(x) as well as possible. The best approximation is given by
projecting k(x) onto N within the entire space S. The approximation power depends on
the shape of N in S. This shows that geometrical considerations are important.
When the behavior of a network is stochastic, it is given by the conditional proba-
bility p(z|x; w) of z conditioned on input x, where w is the network parameters. A
typical example is a network composed of binary stochastic neurons: the probability
of z = 1 for such a stochastic neuron is given by
Prob{z = 1; w} = exp{w · x} / (1 + exp{w · x}), (1)
where z = 0 or 1. Another typical case is a noise-contaminated network whose
output z is written as
z = f(x; w) + n, (2)
where n is a random noise independent of w. If n is subject to the normal distri-
bution with mean 0 and covariance σ²I, I being the identity matrix,
n ~ N(0, σ²I),
the conditional probability is given by
p(z|x; w) = const · exp{ −[z − f(x; w)]² / (2σ²) }. (3)
When input signals x are produced independently from a distribution q(x), the
joint distribution of (x, z) is given by
p(x, z; w) = q(x) p(z|x; w). (4)
Let S be the set of all the conditional probability distributions (or joint distri-
butions). Let q(z|x) be an arbitrary probability distribution in S which is to be
approximated by a stochastic neural network of the behavior p(z|x; w). The neural
manifold N consists of all the conditional probability distributions p(z|x; w) (or
the joint probability distributions p(x, z; w)) and is a p-dimensional submanifold of
S. A fundamental question arises: what is the natural geometry of S and N? How
should the distance between two distributions be measured? What is the geodesic
connecting two distributions? It is important to have a definite answer to these
problems, not only for studying stochastic networks but also for deterministic net-
works, which are not free of random noises and whose stochastic interpretation is
sometimes very useful. Information geometry ([3], [17]) answers all of these prob-
lems.
2 A Short Review of Information Geometry
Let us consider probability distributions p(y, θ) of a random variable y, where θ =
(θ_1, ..., θ_n) is the parameter vector specifying a distribution. When y is a scalar and
normally distributed, we have
p(y, θ) = (1/(√(2π) σ)) exp{ −(y − μ)² / (2σ²) },
where θ = (μ, σ) is the parameter pair specifying it. When y is discrete, taking values
on {0, 1, ..., n}, we have
p(y, θ) = Σ_{i=0}^{n} θ_i δ_i(y),
where δ_i(y) = 1 when y = i and is equal to 0 otherwise, and θ_i = Prob{y = i}.
Let S be the set of such distributions,
S = {p(y, θ)},
specified by θ. Then S can be regarded as an n-dimensional manifold, where θ is
a coordinate system. If we can introduce a distance measure between two nearby
points specified by θ and θ + dθ by the quadratic form
|dθ|² = Σ_{i,j} g_ij(θ) dθ_i dθ_j, (5)
the e- and m-affine connections Γ^(e)_ijk(θ) and Γ^(m)_ijk(θ), (11)
where T_ijk(θ) is the tensor defined by
T_ijk = E[ (∂/∂θ_i) log p · (∂/∂θ_j) log p · (∂/∂θ_k) log p ]. (12)
They are called the e- and m-connection, respectively, and are dual in the sense of
X⟨Y, Z⟩ = ⟨∇_X^(e) Y, Z⟩ + ⟨Y, ∇_X^(m) Z⟩. (13)
The duality of connections is a new concept brought about by information geometry.
When a manifold has a dually flat structure, in the sense that the e- and m-Riemann-
Christoffel curvatures vanish (but the Levi-Civita connection has non-vanishing cur-
vature in general), it has remarkable properties that are dual extensions of Euclidean
properties. Exponential families of probability distributions are proved to be du-
ally flat. Intuitively speaking, the set of all the probability distributions {p(y)} is
dually flat, and any parameterized model S = {p(y, θ)} is a flat or curved subman-
ifold embedded in it. Hence, it is important to study the properties of the dually flat
manifold.
When S is dually flat, the divergence function
D(θ : θ′) = ∫ p(y, θ) log [ p(y, θ) / p(y, θ′) ] dy (14)
is naturally and automatically introduced on S. This is an extension of the square
of the Riemannian distance, because it satisfies
D(θ : θ + dθ) = ½ Σ_{i,j} g_ij(θ) dθ_i dθ_j. (15)
The divergence satisfies D(θ : θ′) ≥ 0, with equality when and only when θ = θ′.
But it is not symmetric in general.
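For discrete distributions, (14) becomes a finite sum, and its asymmetry is easy to observe numerically; a small sketch (the smoothing constant `eps` is our addition for numerical safety):

```python
import numpy as np

def divergence(p, q, eps=1e-12):
    """Discrete form of (14): D(p : q) = sum_y p(y) log(p(y)/q(y))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(divergence(p, q), divergence(q, p))  # the two values differ: D is asymmetric
```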
We have the generalized Pythagoras theorem.
Theorem 1 Let θ_1, θ_2 and θ_3 be three distributions in a dually flat manifold. Then,
when the m-geodesic connecting θ_1 and θ_2 is orthogonal at θ_2 to the e-geodesic
connecting θ_2 and θ_3,
D(θ_1 : θ_3) = D(θ_1 : θ_2) + D(θ_2 : θ_3). (16)
The projection theorem is a consequence of this theorem.
These properties are applied to various fields of information sciences such as statis-
tics ([3], [11], [16]), information theory ([5], [9]), control systems theory ([4], [19]),
dynamical systems ([14], [18]), etc. It is also useful for neural networks ([6], [10], [7]).
We will show how it is applied to neural networks.
3 Manifold of Feedforward Networks and EM Algorithm
In the beginning, we show a very simple case: the manifold M of simple stochastic
perceptrons, or single stochastic neurons. Such a perceptron receives input x and emits
a binary output z stochastically, based on the weighted sum w·x of input x. The
conditional probability of z is written as
p(z|x; w) = e^{z w·x} / (1 + e^{w·x}). (17)
The joint distribution is given by
p(x, z; w) = q(x) p(z|x; w). (18)
Here, we assume that q(x) is the normal distribution N(0, I) with mean 0 and
covariance matrix I, the identity matrix.
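A short numerical check of (17) (our illustration):

```python
import numpy as np

def p_z_given_x(z, x, w):
    """Stochastic perceptron (17): p(z|x; w) = exp(z w.x) / (1 + exp(w.x)), z in {0, 1}."""
    s = float(np.dot(w, x))
    return np.exp(z * s) / (1.0 + np.exp(s))

w, x = np.array([0.5, -1.0]), np.array([1.0, 2.0])
print(p_z_given_x(1, x, w) + p_z_given_x(0, x, w))  # probabilities sum to 1.0
```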
Let M be the set of all the stochastic perceptrons. Since a perceptron is specified
by a vector w, M is regarded as a manifold homeomorphic to R^n, where n is the
dimension of x and w. Here, w is a coordinate system of M. We introduce a
natural Riemannian geometric structure to M. Then, it is possible to define the
distance between two perceptrons, the volume of M itself, and so on. This is done
through the Fisher information, since each point (perceptron) in M is regarded as
a probability distribution (18). Let G(w) = (g_ij(w)) be a matrix representing the
Riemannian metric tensor at point w. We define the Riemannian metric by the
Fisher information matrix,
g_ij(w) = E[ (∂/∂w_i) log p(z, x; w) · (∂/∂w_j) log p(z, x; w) ], (19)
where E denotes the expectation over (z, x) with respect to the distribution
p(z, x; w). In order to calculate the metric G explicitly, let e_w be the unit column
vector in the direction of w in the Euclidean space R^n,
e_w = w / |w|,
where |w| is the Euclidean norm, and e_w^T its transpose. We then have the following
theorem.
Theorem 3 The Fisher information metric is given by
G(w) = c_1(w) I + {c_2(w) − c_1(w)} e_w e_w^T, (20)
where w = |w| (Euclidean norm) and c_1(w) and c_2(w) are scalar functions of w. The tensor T_ijk
vanishes under the distribution q(x) ~ N(0, I). Hence, M is a self-dual Riemannian
manifold and no dual affine connections appear in this special case.
We now show some applications of the Riemannian structure.
The volume V_n of the perceptron manifold M is measured by
V_n = ∫ √|G(w)| dw, (23)
where |G(w)| is the determinant of G = (g_ij), which represents the volume density
of the Riemannian metric. Let us consider the problem of estimating w from exam-
ples. From the Bayesian viewpoint, we consider that w is chosen randomly subject
to a prior distribution p_pr(w). A natural choice of p_pr(w) is the Jeffreys prior, or
non-informative prior, given by
p_pr(w) = (1/V_n) √|G(w)|. (24)
When one studies learning curves and overtraining ([8]), it is important to take the
effect of p_pr(w) into account. The Jeffreys prior is calculated as follows.
Theorem 4 The Jeffreys prior and the volume of the manifold are given by
√|G(w)| = √( c_2(w) {c_1(w)}^{n−1} ), (25)
V_n = ∫ √( c_2(w) {c_1(w)}^{n−1} ) a_n w^{n−1} dw, (26)
respectively, where a_n is the volume of the unit n-sphere.
The gradient descent is a well-known learning method, which was proposed by
Widrow for the analog linear perceptron and extended to non-linear multilayer
networks with hidden units ([2], [20] and others). Let l(x, z; w) be the loss function
when the perceptron with weight w processes an input-output pair (x, z). In many
cases, the squared error
l(x, z; w) = ½ |z − f(w · x)|²
is used. When input-output examples (x_t, z_t), t = 1, 2, ..., are given, we train the
perceptron by the gradient-descent method:
w_{t+1} = w_t − c ∂l(x_t, z_t; w_t)/∂w, (27)
where w_t is the current value of the weight and it is modified to give w_{t+1} by using
the current input-output pair (x_t, z_t). It was shown that the learning constant c can be
replaced by any positive-definite matrix cK ([2]). The natural choice of this matrix
is the inverse of the Fisher information metric,
w_{t+1} = w_t − c G^{−1}(w_t) ∂l/∂w, (28)
from the geometrical point of view, since this is the invariant gradient under general
coordinate transformations. However, it is in general not easy to calculate G^{−1}, so
that this excellent scheme is not usually used (see [10]). We can calculate G^{−1}(w)
explicitly in the perceptron case.
Theorem 5 The inverse of the Fisher information metric is
G^{−1}(w) = (1/c_1(w)) I + ( 1/c_2(w) − 1/c_1(w) ) e_w e_w^T. (29)
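Theorem 5 makes the natural-gradient update (28) cheap: applying G^{−1} needs only a scaled copy of the gradient plus a rank-one correction. A sketch of one step (our illustration; c_1 and c_2 are passed in as callables, since only their role, not their explicit form, is needed here, and the step size `lr` is our assumption):

```python
import numpy as np

def natural_gradient_step(w, grad, c1, c2, lr=0.01):
    """One update w <- w - lr * G^{-1}(w) grad, with G^{-1} from (29)."""
    norm = np.linalg.norm(w)
    e_w = w / norm                         # unit vector in the direction of w
    a, b = 1.0 / c1(norm), 1.0 / c2(norm)
    g_inv_grad = a * grad + (b - a) * e_w * (e_w @ grad)   # (aI + (b-a) e e^T) grad
    return w - lr * g_inv_grad
```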
W_ij being the connection weights of the ith and jth neurons and W_i0 being the
biasing term of the ith neuron. All the other neurons are unchanged, x_j(t + 1) =
x_j(t).
The x(t), t = 1, 2, ..., is a Markov process, and its stationary distribution is explic-
itly given by
p(x, W) = exp{ ½ Σ_{i,j} W_ij x_i x_j − ψ(W) } (35)
when W_ij = W_ji, W_ii = 0 hold, where x_0 = 1. The set of all the Boltzmann machines
forms an n(n + 1)/2-dimensional manifold B, and W = {W_0i, W_ij, i < j} is a coor-
dinate system of the Boltzmann neural manifold. To each Boltzmann machine W
corresponds a probability distribution (35) and vice versa. Therefore, B is identified
with the set of all the probability distributions of form (35).
Let S be the set of all the distributions over the 2^n states x,
S = { q(x) | q(x) > 0, Σ_x q(x) = 1 }. (36)
Then, B is an n(n + 1)/2-dimensional submanifold embedded in the (2^n − 1)-
dimensional manifold S. We can show that B is a flat submanifold of S, and that
both S and B are dually flat, although they are curved from the Riemannian-metric
point of view [10].
Given a distribution q(x), we train the Boltzmann machine by modifying W_ij based
on independent examples x_1, x_2, ... from the distribution q, for the purpose of ob-
taining W such that p(x, W) approximates q(x) as well as possible. The degree
of approximation is measured by the divergence D[q(x) : p(x, W)]. The best ap-
proximation is given by m-projecting q onto B. We have an explicit solution since
the manifolds are flat. The stochastic gradient learning rule was proposed by Ackley,
Hinton and Sejnowski [1],
W_{t+1} = W_t − η ∂D[q : p]/∂W, (37)
where the gradient term is not in its expectation form but is evaluated at the random
example x_t. However, from the geometrical point of view, it is more natural to use
W_{t+1} = W_t − η G^{−1} ∂D[q : p]/∂W, (38)
where G^{−1} is the inverse of the Fisher information matrix. In this case, the expectation
of the trajectory W_t of learning is the e-geodesic of B (when the continuous-time
version is used); see [10]. It is shown that the convergence is much faster in this case,
although calculations of G^{−1} are sometimes not easy.
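For a small machine, rule (37) can be run in its expectation form; the gradient of D[q : p] with respect to W_ij is then the classic difference between the data ('clamped') and model ('free') correlations of [1]. A sketch under that reading (our illustration; the model expectation is computed by exhaustive enumeration, feasible only for small n):

```python
import numpy as np
from itertools import product

def model_corr(W):
    """<x_i x_j> under p(x, W) ~ exp(0.5 * sum W_ij x_i x_j), by enumeration."""
    n = W.shape[0]
    states = np.array(list(product([0, 1], repeat=n)), float)
    p = np.exp(0.5 * np.einsum('si,ij,sj->s', states, W, states))
    p /= p.sum()
    return np.einsum('s,si,sj->ij', p, states, states)

def boltzmann_step(W, data, eta=0.05):
    """One expectation-form step of (37): clamped minus free correlations."""
    data = np.asarray(data, float)
    data_corr = data.T @ data / len(data)      # <x_i x_j> under q, from samples
    W = W + eta * (data_corr - model_corr(W))
    np.fill_diagonal(W, 0.0)                   # keep W_ii = 0
    return W
```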
When hidden units exist, we divide x into
x = (x^V, x^H), (39)
where x^V represents the state of the visible part and x^H represents that of the
hidden part. The connection weights are also divided naturally into the three parts
W = (W^V, W^H, W^{VH}). The probability distribution of the visible part is given by
p(x^V; W) = Σ_{x^H} p(x^V, x^H; W). (40)
Let S^V and B^V be the manifolds of all the distributions over x^V and those realizable
by Boltzmann machines in the form of (40). Then, S^V is flat but B^V is not.
It is easier to treat the extended manifolds S^{V,H} and B^{V,H} including hidden units.
However, only a distribution q(x^V) is specified from the outside in the learning
process and its hidden part is not. Let us consider a submanifold
D = { q(x^V, x^H) | Σ_{x^H} q(x^V, x^H) = q(x^V) }. (41)
The submanifold D is m-flat. Because we want to approximate q(x^V) but we do
not care about q(x^H) nor the interaction of x^V and x^H, the problem reduces to
minimizing the divergence between the submanifolds D and B^{V,H}.
Q-LEARNING: A TUTORIAL AND
EXTENSIONS
George Cybenko, Robert Gray and Katsuhiro Moizumi
In the past decade, research in neurocomputing has been divided into two relatively well-defined
tracks: one track dealing with cognition and the other with behavior. Cognition deals with orga-
nizing, classifying and recognizing sensory stimuli. Behavior is more dynamic, involving sequences
of actions and changing interactions with an external environment. The mathematical techniques
that apply to these areas, at least from the point of neurocomputing, appear to have been quite
separate as well. The purpose of this paper is to give an overview of some recent powerful math-
ematical results in behavioral neurocomputing, specifically the concept of Q-learning due to C.
Watkins, and some new extensions. Finally, we propose ways in which the mathematics of cogni-
tion and the mathematics of behavior can move closer to build more unified systems of information
processing and action.
1 Introduction
The study of artificial neural networks has burgeoned in the past decade. Two
distinct lines of research have emerged: the cognitive and the behavioral. Cognitive
research deals with the biological phenomenon of recognition, the mathematics of
pattern analysis and statistics, and applications in automatic pattern recognition.
Behavioral research deals with the biological phenomena of planning and action, the
mathematics of time dependent processes, and applications in control and decision-
making.
To be mathematically precise, let us discuss simple formulations of each problem
type. A cognitive problem typically involves so-called feature vectors, x ∈ R^n. These
feature vectors are sensory stimuli or measurements and are presented to us in some
random way, that is, we cannot predict with certainty which stimuli will occur
next. These observations must be classified, and the classification is accomplished
by a function f : R^n → R^m. The problem is to build an estimate of the true
function, f, based on a sample set of the form S = {(x_i, y_i), i = 1, ..., N} which
is drawn according to a joint probability distribution on (x, y) ∈ R^{n+m}, say
μ(x, y). Call the estimate g(x). Typically, f(x) is the conditional expected value
of y: f(x) = ∫_y y dμ(x, y) / ∫_y dμ(x, y). A number of distinct situations in cognitive
neurocomputing arise depending on the type of information available for estimating
f. (See one of the numerous excellent textbooks on neural computing and machine
learning for more details.)
classes are formed. Unsupervised learning is closely related to clustering and
uses techniques from that area.
• REINFORCEMENT LEARNING - As in unsupervised learning, no "correct" re-
sponse is given but a reinforcement is available. A reinforcement can be thought
of as an error when g(z) is the estimated response for feature vector z but the
correct response f is not available. Reinforcement learning is thus between
supervised and unsupervised learning: some performance feedback is provided
but not in the form of the correct answer and the deviation of the response
from it. The difference between supervised and reinforcement learning has been
characterized as the difference between learning from a teacher and learning
from a critic. The teacher provides the correct response but the critic only
states how bad the system's response was. Reinforcements are often referred
to as "rewards" and "punishments" depending on their sizes and the context.
Σ_{j=0}^{τ−1} k(x_{i_j}, a_j(x_{i_j})).
However, since the transitions are stochastic, we must take an expectation over
all possible trajectories, starting with x_i and terminating in x_0 with probabil-
ities determined by the policy, π, and the corresponding transitions:
V(π, i) = E{ Σ_{j=0}^{τ−1} k(x_{i_j}, a_j(x_{i_j})) }.
• FIXED-TIME MINIMIZATION - As above, but the cost is computed up to a fixed
time as opposed to when a terminal state is reached.
• AVERAGE-COST MINIMIZATION - The average future cost is minimized (here
0 < r ≤ 1 is the discount factor):
V(π, i) = lim_{L→∞} (1/(L+1)) E{ Σ_{j=0}^{L} r^j k(x_{i_j}, a_j(x_{i_j})) }.
• DISCOUNTED COST MINIMIZATION - Here the cost of future actions is dis-
counted by a constant rate, 0 < r < 1, so that
V(π, i) = E{ Σ_{j=0}^{∞} r^j k(x_{i_j}, a_j(x_{i_j})) },
and the optimal action from state x_i is precisely the action a_ij which achieves the
minimum.
The beauty of Watkins' Q-learning is that we can adaptively estimate the Q-values
for the optimal policy using samples of the above type. The Q-learning algorithm
begins with a tableau, Q_0(x_i, a_ij), initialized arbitrarily. Using samples of the form
s_i = (η_i, ζ_i, a_i, k_i), i = 1, 2, 3, ...
we perform the following update on the Q-value tableau:
Q_i(η_i, a_i) = (1 − β_i) Q_{i−1}(η_i, a_i) + β_i (k_i + γ V(ζ_i)),
where V(ζ_i) = min_a {Q_{i−1}(ζ_i, a)}. All other elements of Q_i are merely copied from
Q_{i−1} without change. The parameters β_i → 0 as i → ∞.
Theorem 1 (Watkins [6, 7, 5]) - Let {i(x, a)} be the set of indices for which the
(x, a) entry of the Q-tableau is updated. If
Σ_{i(x,a)} β_{i(x,a)} = ∞  and  Σ_{i(x,a)} β²_{i(x,a)} < ∞,
then Q_i(x, a) → Q*(x, a) as i → ∞. Accordingly, a_i*(x) = arg min_a Q_i(x, a) con-
verges to the optimal action for state x.
Proof See [7, 5]. □
This result is remarkable in that it demonstrates that a simple update rule on the
Q-tableau results in a learning system which computes the optimal policy. In the
next section we show how this method can be embedded into an online system which
simultaneously uses the current Q-values to generate policies and uses the results
to update Q-values in such a way as to satisfy Watkins' theorem. Consequently, the
online system learns the optimal policy to which it ultimately converges.
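A tabular sketch of the update (our illustration; the decaying rate β = 1/(visit count) is one standard choice satisfying the theorem's two summability conditions):

```python
def q_learning(samples, n_states, n_actions, gamma=0.9):
    """Tabular Q-learning from samples s_i = (eta, zeta, a, k):
    state, successor state, action taken, observed cost."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for eta, zeta, a, k in samples:
        visits[eta][a] += 1
        beta = 1.0 / visits[eta][a]     # beta -> 0; sums diverge, squared sums converge
        V = min(Q[zeta])                # V(zeta) = min_a' Q_{i-1}(zeta, a')
        Q[eta][a] = (1 - beta) * Q[eta][a] + beta * (k + gamma * V)
    return Q
```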
4 Universal On-line Q-Learning
Online optimal learning and asymptotically optimal performance must be a compromise
between performing what the learning system estimates to be the optimal actions
and persistently exciting all possible state-action possibilities often enough to satisfy
the frequencies stipulated in Watkins' Q-learning theorem.
To begin, let us introduce a notion of asymptotically optimal performance. Suppose
we have an infinite sequence of state-action pairs that result from an actual realiza-
tion of the Markov decision process, say {(y_i, a_i), i = 0, 1, ...}, with (y_i, a_i) meaning
that at time i we are in state Yi and execute action ai, resulting in a transition to
state Yi+l at the next time. For such a sequence, we have a corresponding sequence
of costs Vi which represent the actual costs computed from the realized sequence,
Vi being the cost of following the sequence from time i onwards.
We now define asymptotic average optimality. Suppose that V_i* is the optimal expected
cost-to-go for state i. Let v_i(n_ij) be the observed cost-to-go values for state i when
the system is in state i at time n_ij. Let N_M(i) be the number of times the process
has visited state i up to time M. Then we say that the policy is asymptotically optimal
on average for state i if
(1/N_M(i)) Σ_{j=1}^{N_M(i)} v_i(n_ij) → V_i*  as M → ∞.
In order to establish this result, we need to construct an auxiliary Markov process
with states (x_i, a_ij) and transition probabilities
p((x_i, a_ij), (x_m, a_mn)) = p(x_i, a_ij, x_m) / J_m,
where J_m is the number of allowable actions in state x_m. This auxiliary process
has the following interpretation: the transition probabilities between states are de-
termined by the transition probabilities of the original controlled process with the
added ingredient that the actions corresponding to the resulting state are uniformly
randomized.
Theorem 2 - Suppose that there is at least one stationary policy under which the
states of the resulting Markov process constitute a single recurrent class (that is, the
process is ergodic). Then, with probability 1, there is a mixed nonstationary strategy
for controlling the Markov decision process with the following two properties:
1. The optimal policy is estimated by Q-learning asymptotically;
2. The control is asymptotically optimal on average for all states that have a nonzero
stationary occupancy probability under the process' optimal policy.
Proof - The proof consists of three parts. We begin by showing that under the
hypothesis that the original process has a stationary policy under which the Markov
30 CHAPTER 3
[Figure: states (x_i, a_ij) of the auxiliary process arranged as rows of the Q-tableau; transitions within a row are uniformly randomized.]
process consists of a single recurrent class, the corresponding auxiliary process also
has a single recurrent class. Let a(x) be the action for a state x which makes the
original process have a single recurrent set. Denote the corresponding transition
probabilities by p*(x, x') = p(x, a(x), x'). Let D be the maximal number of actions
per state: D ≥ |{a_ij}| for all i. Consider the auxiliary process' canonical partitioning
into subclasses of states in which each set of the partition consists of states which
are either a) transient or b) recurrent and communicate with each other. Now the
(x, a( x)) form a subset of the states of the auxiliary process.
Step 1. The auxiliary Process is Recurrent - We will show that any given state of
the auxiliary process communicates with every other state. Let (x, a) and (x', a' )
be any two states of the auxiliary process. We must show that
p^(n)((x, a), (x', a')) > 0
for some n. Pick any state y for which p(x, a, y) > 0. By assumption, there is an m
for which
p_*^(m)(y, x') > 0
under the special choice of actions, a(x). This means that there is a sequence of
states y_1 = y, y_2, ..., y_m = x' for which
p(y_i, a(y_i), y_{i+1}) > 0,
and so
p^(m+1)((x, a), (x', a')) ≥ p(x, a, y) Π_{i=1}^{m−1} [ p(y_i, a(y_i), y_{i+1}) / D ] > 0.
Since all states of the auxiliary process communicate with each other, all states
must be recurrent. Thus the auxiliary process consists of a single recurrent set of
states. Because the auxiliary process consists of recurrent states, the expected time
to visit all states of the auxiliary process is finite and all states will be visited in a
finite time with probability 1.
Step 2. A Mixed Policy - We next define a mixed time-dependent strategy for
switching between the auxiliary process and a stationary policy based on an esti-
mate of the optimal policy. The mixed strategy begins by choosing actions according
to the auxiliary process. At time n_k we switch from the auxiliary process to the
process controlled by the policy determined by the Q-table according to

a_j(n_k) = argmin_a Q_{n_k}(x_j, a).
At time m_k we switch back to the auxiliary process. For k = 1, 2, 3, ... we have
1 = m_0 < n_k < m_k < n_{k+1} < m_{k+1} < ... < ∞. Let N' be the largest expected time to
visit all states of the auxiliary process starting from any state, and let N = 2N'.
Consider the following experiment which is relevant to the discussion below. Start in
any state of the auxiliary process and follow it for N steps at which time we jump to
an arbitrary state and run for another N steps, repeating this process indefinitely.
Call each such run of N steps a frame, numbering them F_i for i = 1, 2, 3, .... We want
to concatenate the frames to produce a realization of the auxiliary process. To do
this, consider the final state of frame F_i. Since N was chosen to be twice as large as
the expected time to visit all states starting from any state, with probability 1 some
future frame, say F_{i+j}, includes a visit to the final state of F_i in the first N/2 = N'
states of that frame. Concatenate to the end of F_i the history in F_{i+j} following
the visit to the final state of F_i. Each step of this concatenation adds at least N'
consecutive steps of the auxiliary process so there are an infinite number of visits to
all states in the concatenation with probability 1. In the following construction, we
implicitly use this property of a concatenation of frames.
[Figure: successive frames of auxiliary process operation laid out along increasing time.]
The definition of n_k is as follows. At time m_{k-1}, we began following the auxiliary
process, which is recurrent.
We store state transitions of the form
(η, ζ, a, κ),

where η is a state x_i, a is an action taken by the auxiliary process, ζ is the state of
the original process to which the process transitioned, and κ is the observed cost of
that transition, in a list. Proceed in this manner until either N steps have been taken
in this way or until the list contains a sample for each element in the Q-table. If the
list to update the Q-table is completed first, we update the Q-table and compute
the current optimal actions. In this case, nk is defined as the time after which the
Q-table was updated. Otherwise, define n_k = m_{k-1} + N and continue using the
previous estimated optimal actions as a policy. In either case, m_k = n_k + kN, so
we operate the process using the estimated optimal policy for increasingly longer
periods of time kN. We then use this list to update the Q-table with the samples
in the list.
Assume that the Q-table has M entries. Then for updating the ith element for the
jth time in the Q-table from the sample list, use

β_{jM+i} = 1/(jM + i).

According to this scheme, element i is updated for j = 1, 2, 3, ... and the corresponding
subset of β's forms an arithmetic subsequence of the harmonic sequence; hence it
satisfies Watkins' criteria: the sum of the β's diverges while the sum of their squares converges.
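As a concrete illustration, the following Python fragment (the variable names are ours, not the authors') applies this schedule to a flat Q-table whose entries are indexed 1, ..., M; counts[i] records how often entry i has been updated.

# Sketch of the schedule: the i-th entry, updated for the j-th time, uses
# beta_{jM+i} = 1/(jM + i). Along each entry the rates form an arithmetic
# subsequence of the harmonic sequence, so their sum diverges while the sum
# of their squares converges, as Watkins' conditions require.
def update_entry(Q, counts, i, target, M):
    j = counts[i] + 1                  # this is the j-th update of entry i
    beta = 1.0 / (j * M + i)           # entries are indexed i = 1, ..., M
    Q[i] = (1.0 - beta) * Q[i] + beta * target
    counts[i] = j

Here target would be the observed transition cost κ plus the current minimal Q-value of the successor state ζ, as in the usual Q-learning update.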
By the discussion above about frames, this list is filled in a finite amount of time
infinitely often with probability 1, but we impose a deterministic limit on the time
spent in this mode generating a frame. By construction and Watkins' Theorem,
the Q-table values converge to the optimal Q-values and hence the optimal action
policy is eventually found.
Step 3. Asymptotic Optimal Convergence - To prove the asymptotic optimal conver-
gence, note that the Q-table updates are performed according to Watkins' criteria
so that in the limit the Q-table values determine an optimal stationary policy. More-
over, the time spent operating in the auxiliary mode becomes an asymptotically
small fraction of the time spent operating under the estimated optimal policy.
Hence the process converges asymptotically to optimality.
To make this formal, note that asymptotic average optimality depends only on what
happens in the limit. Specifically, if we can show that the average of the observed
cost-to-go values after some fixed time converges to the optimal value then we are
done.
To see this, run the process long enough so that the estimated optimal policy
determined by the Q-table is as close as desired to the optimal cost-to-go for the
process. This will happen at some finite time with probability 1 although the time
is not known a priori. Since we can also wait until k is arbitrarily large, the fraction
of time spent in the approximately optimal mode can be made as close to 1 as we
like. Now for a state that has nonzero stationary occupancy probability under the
optimal policy, the fraction of time spent operating under the estimated optimal
policy approaches 1 and so the empirical cost-to-go also approaches the optimal. For
states that have zero occupancy probability under the optimal policy, the empirical
cost-to-go will be dominated by the empirical values determined during the auxiliary
process operation which will not be optimal in general. Thus the average within
the mixed mode of operation is guaranteed to converge to the estimated cost-to-go
for the optimal policy but only for states that have nonzero stationary occupancy
probabilities under the optimal policy. □
5 Discussion
We have shown that there exists a strategy for operating a Markov Decision Process
(under simple conditions) in such a way that the optimal strategy is both learned
and the asymptotic operation of the system approaches optimality for states that
have nonzero occupancy probabilities under the computed optimal policy. This is
only one of many possible strategies for performing both of these simultaneously.
The question of which operating procedures are best for fastest convergence to the
optimal cost-to-go values is beyond the scope of the techniques that we use. It is
an important question to pursue in the future.
[Figure: a unified theory of learning spanning cognitive (categories) and behavioral (actions/controls) domains, much of it analytically unexplored.]
One of the weaknesses of the Q-Learning framework is that the states and actions for the
Q-Tableaux must be known a priori. This is restrictive in most dynamic learning
situations: it may not be appropriate or possible to select the actual states or
actions of a system without significant experimentation first. In general, we would
like to learn the states, the actions and the corresponding optimal policies
simultaneously. This subject has been investigated by various researchers, but with few
analytic results so far. There has been growing interest in dealing with systems that
have an infinite continuum of possible states and actions, requiring discretization
for Q-Learning. How these states can be clustered or otherwise organized to achieve
optimal operation is a challenging question that requires serious future research.
REFERENCES
[1] D. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA (1995).
[2] R. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA (1960).
[3] W.T. Miller III, R.S. Sutton and P.J. Werbos, Neural Networks for Control, MIT Press, Cambridge, MA (1990).
[4] R.S. Sutton, A.G. Barto and R.J. Williams, Reinforcement learning is direct adaptive control, IEEE Control Systems Magazine (April 1992), pp19-22.
[5] J. Tsitsiklis, Asynchronous stochastic approximation and Q-Learning, Machine Learning, Vol. 16 (1994), pp185-202.
[6] C.J.C.H. Watkins, Learning from Delayed Rewards, Ph.D. Dissertation, University of Cambridge (1989).
[7] C.J.C.H. Watkins and P. Dayan, Q-Learning, Machine Learning, Vol. 8 (1992), pp279-292.
Acknowledgements
Supported in part by the Air Force Office of Scientific Research research grant
F49620-93-1-0266.
ARE THERE UNIVERSAL PRINCIPLES OF BRAIN
COMPUTATION?
Stephen Grossberg
Boston University, Department of Cognitive and Neural Systems and
Center for Adaptive Systems, 677 Beacon Street, Boston, Massachusetts 02215 USA
1 Introduction
Are there universal computational principles that the brain uses to self-organize
its intelligent properties? This lecture suggests that common principles are used in
brain systems for early vision, visual object recognition, auditory source identifica-
tion, variable-rate speech perception, and adaptive sensory-motor control, among
others. These are principles of matching and resonance that form part of Adaptive
Resonance Theory, or ART. In particular, bottom-up signals in an ART system
can automatically activate target cells to levels capable of generating suprathresh-
old output signals. Top-down expectation signals can only excite, or prime, target
cells to subthreshold levels. When both bottom-up and top-down signals are si-
multaneously active, only the bottom-up signals that receive top-down support can
remain active. All other cells, even those receiving large bottom-up inputs, are in-
hibited. Top-down matching hereby generates a focus of attention that can resonate
across processing levels, including those that generate the top-down signals. Such
a resonance acts as a trigger that activates learning processes within the system.
In the examples described herein, these effects are due to a top-down nonspecific
inhibitory gain control signal that is released in parallel with specific excitatory
signals.
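As a toy illustration of this matching rule (our own caricature, not a model from the lecture), the sketch below uses binary signals, a fixed threshold, and an all-or-none nonspecific inhibition that is released whenever any top-down expectation is active:

import numpy as np

def art_match(bottom_up, top_down, gain=1.0, theta=0.5):
    # Nonspecific inhibitory gain control, released in parallel with the
    # specific top-down excitatory signals (an assumed all-or-none form).
    inhibition = gain * float(top_down.any())
    net = bottom_up + top_down - inhibition
    return (net > theta).astype(float)

b = np.array([1.0, 1.0, 0.0])
t = np.array([1.0, 0.0, 1.0])
print(art_match(b, np.zeros(3)))  # bottom-up alone is suprathreshold: [1, 1, 0]
print(art_match(np.zeros(3), t))  # top-down alone only primes:        [0, 0, 0]
print(art_match(b, t))            # together, only matched cells stay: [1, 0, 0]

In the last call the second cell receives a large bottom-up input but no top-down support, so it is inhibited, reproducing the attentional focusing described above.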
2 Neural Dynamics of Multi-Source Audition
How does the brain's auditory system construct coherent representations of acoustic
objects from the jumble of noise and harmonics that relentlessly bombards our
ears throughout life? Bregman [1] has distinguished at least two levels of auditory
organization, called primitive streaming and schema-based segregation, at which
such representations are formed in order to accomplish auditory scene analysis. The
present work models data about both levels of organization, and shows that ART
mechanisms of matching and resonance play a key role in achieving the selectivity
and coherence that are characteristic of our auditory experience.
In environments with multiple sound sources, the auditory system is capable of
teasing apart the impinging jumbled signal into different mental objects, or streams,
as in its ability to solve the cocktail party problem. With my colleagues Krishna
Govindarajan, Lonce Wyse, and Michael Cohen [5], a neural network model of
this primitive streaming process, called the ARTSTREAM model (Figure 1), has
been developed that groups different frequency components based on pitch and
spatial location cues, and selectively allocates the components to different streams.
The grouping is accomplished through a resonance that develops between a given
object's pitch, its harmonic spectral components, and (to a lesser extent) its spatial
location. Those spectral components that are not reinforced by being matched with
the top-down prototype read-out by the selected object's pitch representation are
suppressed, thereby allowing another stream to capture these components, as in the
"old-plus-new heuristic" of Bregman [1]. These resonance and matching mechanisms
are specialized versions of ART mechanisms.
[Figure 1: the ARTSTREAM model, with a pitch summation layer, a spectral stream layer, and the input signal at the bottom.]
The model is used to simulate data from psychophysical grouping experiments, such
as how a tone sweeping upwards in frequency creates a bounce percept by grouping
with a downward sweeping tone due to proximity in frequency, even if noise replaces
the tones at their intersection point. The model also simulates illusory auditory
percepts such as the auditory continuity illusion of a tone continuing through a
noise burst even if the tone is not present during the noise, and the scale illusion
of Deutsch whereby downward and upward scales presented alternately to the two
ears are regrouped based on frequency proximity, leading to a bounce percept. The
stream resonances provide the coherence that allows one voice or instrument to be
tracked through a multiple source environment.
3 Neural Dynamics of Variable-Rate Speech Categorization
What is the neural representation of a speech code as it evolves in real time?
With my colleagues Ian Boardman and Michael Cohen [6], a neural model of this
schema-based segregation process, called the ARTPHONE model (Figure 2), has
been developed to quantitatively simulate data concerning segregation and integra-
tion of phonetic percepts, as exemplified by the problem of distinguishing "topic"
from "top pick" in natural discourse. Psychoacoustic data concerning categorization
of stop consonant pairs indicate that the closure time between syllable final (VC)
and syllable initial (CV) transitions determines whether consonants are segregated,
i.e., perceived as distinct, or integrated, i.e., fused into a single percept. Hearing two
stops in a VC-CV pair that are phonetically the same, as in "top pick," requires
about 150 msec more closure time than hearing two stops in a VC1-C2V pair that
are phonetically different, as in "odd ball." As shown by Repp [10], when the
distribution of closure intervals over trials is experimentally varied, subjects' decision
boundaries between one-stop and two-stop percepts always occurred near the mean
closure interval.

Figure 2 The ARTPHONE model, mapping phone inputs to a rate-invariant phonetic
output. Working memory nodes (w) excite chunks (u) through previously learned
pathways. List chunks send excitatory feedback down to their item source nodes.
Bottom-up and top-down pathways are modulated by habituative transmitter gates
(filled squares). Item nodes receive input in an on-center off-surround anatomy.
Total input (I) is averaged to control an item rate signal (r) that adjusts the working
memory gain (g). Excitatory paths are marked with arrowheads, inhibitory paths
with small open circles.
The ARTPHONE model traces these properties to dynamical interactions between
a working memory for short-term storage of phonetic items and a list categoriza-
tion network that groups, or chunks, sequences of the phonetic items in working
memory. These interactions automatically adjust their processing rate to the speech
rate via automatic gain control. The speech code in the model is a resonant wave
that emerges after bottom-up signals from the working memory select list chunks
which, in turn, read out top-down expectations that amplify consistent working
memory items. This resonance may be rapidly reset by inputs, such as C2, that are
inconsistent with a top-down expectation, say of C1; or by a collapse of resonant
activation due to a habituative process that can take a much longer time to occur,
as illustrated by the categorical boundary between VCV and VC-CV. The catego-
rization data may thus be understood as emergent properties of a resonant process
that adjusts its dynamics to track the speech rate.
4 Neural Dynamics of Boundary and Surface Representation
With my colleagues Alan Gove and Ennio Mingolla [4], a neural network model,
called a FACADE theory model (Figure 3), has been developed to explain how
figurations. These examples illustrate how boundary and surface mechanisms can
generate percepts that are highly context-sensitive, including how illusory contours
can be amodally recognized without being seen, how model simple cells in V1 respond
preferentially to luminance discontinuities using inputs from both LGN ON
and OFF cells, how model bipole cells in V2 with two collinear receptive fields can
help to complete curved illusory contours, how short-range simple cell groupings
and long-range bipole cell groupings can sometimes generate different outcomes, and
how model double-opponent, filling-in and boundary segmentation mechanisms in
V4 interact to generate surface brightness percepts in which filling-in of enhanced
brightness and darkness can occur before the net brightness distribution is com-
puted by double-opponent interactions. Taken together, these results emphasize the
importance of resonant feedback processes in generating conscious percepts in the
visual brain.
Figure 4 The SACCART model for multimodal control of saccadic eye movements by
the superior colliculus. (Panel labels: reactive target, planned target, inhibitory gain.)
Acknowledgements
Supported in part by the Air Force Office of Scientific Research (AFOSR F49620-
92-J-0225), the Advanced Research Projects Agency (ONR N00014-92-J-4015), and
the Office of Naval Research (ONR N00014-91-J-4100 and ONR N00014-92-J-1309).
The author wishes to thank Diana Meyers and Cynthia Bradford for their valuable
assistance in the preparation of the manuscript.
ON-LINE TRAINING OF MEMORY-DRIVEN
ATTRACTOR NETWORKS
Morris W. Hirsch
Department of Mathematics, University of California at Berkeley, USA.
A rigorous mathematical analysis is presented of a class of continuous networks having rather
arbitrary activation dynamics, with input patterns classified by attractors by means of a special
training scheme. Memory adapts continually, whether or not a training signal is present. It is
shown that consistent input-output pairs can be learned perfectly provided every pattern is
repeated sufficiently often and the input patterns are nearly orthogonal.
1 Introduction
Most neural networks used for pattern classification have the following features:
• Patterns are classified by fixed points or limit cycles of the activation dynamics.
But biological neural networks (nervous systems) do not conform to these rules.
We learn while doing; we learn by doing, and if we don't do, we may forget. And
chaotic dynamics is the rule in many parts of the cerebral cortex (see the papers
by Freeman et al.).
Here we look at supervised learning in continuous time nets in which:
• The memory matrix is always adapting: the only difference between training
and testing is the presence of a training signal. Testing reinforces learning, lack
of testing can lead to forgetting, and retraining at any time is possible.
When the input patterns are sufficiently orthogonal (how orthogonal depends on the
number of patterns), system parameters can be chosen robustly so that the system
learns to give the correct output for each input pattern on which it has consistently trained.
The net comprises an input layer of d units which feeds into a recurrently connected
activation layer of n units. Inputs and activations can take any real number values.
Inputs are drawn from a fixed list of patterns, taken for convenience to be unit
vectors. With any input a training signal may be presented for a minimum time and
then shut off. These training signals need not be consistent, but those patterns
that are trained consistently and tested sufficiently frequently, will respond with
the correct output.
Rather than taking outputs to be the usual stable equilibria or limit cycles, we
consider outputs to be the attractors (more precisely, their basins) to which the
activation dynamics tends. (Compare [2], [4]). Each training signal directs the ac-
tivation dynamics into the basin of an attractor.
In some biological models it is the attractor as a geometric object, rather than
the dynamics in the attractor, that classifies the input. If the components of the
activation vector x represent firing rates of cells, then the attractor may correspond
to a particular cell assembly, and the useful information is which cell assembly is
active, with dynamical details being irrelevant. In this connection compare [8]; [1].
Before giving details of the dynamics, we first describe how the network looks from
the outside. From this black box point of view the activation dynamics appears to
be governed by a differential equation in Euclidean n-space R^n,

dx/dt = F(x) + I    (1)
      = -x + f(x) + I,    (2)

where f : R^n → R^n is bounded and I ∈ R^n is a constant training signal which may
be 0. Input patterns are chosen from a finite set of distinct vectors ξ^a ∈ R^d, a =
1, ..., m. Training signals are selected from a compact set S ⊂ R^n which contains
the origin.
At stage k of running the net, an input and a possibly null training signal are
presented simultaneously to the net for a certain time interval [t_{k-1}, t'_k] called the
kth instruction period, followed by the computation period [t'_k, t_k] during which
the training signal is removed (set to 0) while the same input is present. The activation
vector x then settles down to an attractor for the dynamics of the vector field F,
that is, for the differential equation

dx/dt = -x + f(x).    (3)
When this process is repeated with a different input and training signal, the acti-
vation variable x jumps instantaneously to a new initial position.
To each pattern ξ^a there is associated a special attractor A^a ⊂ R^n for (3), and a
target U^a, which is a positively invariant neighborhood of A^a in B(A^a). It is desired
that when ξ^a is input, x(t) goes into U^a and stays there (thus approaching A^a),
until the input is changed.
Training consists in using training signals from a special subset S^a ⊂ S of proper
training signals associated with ξ^a. Other nonzero training signals are improper.
It turns out that with sufficiently orthogonal input patterns and suitable parame-
ters, any pattern that is trained on a proper training signal at stage k will go to its
correct target at a later stage provided that at intervening stages the pattern was
presented sufficiently often and no improper training signals were used.
Now we explain how the memory evolves. The time-varying memory matrix M ∈
R^{n×d} maps static input patterns (column vectors) ξ to dynamic activation vectors
x by x(t) = M(t)ξ. An auxiliary fast variable y ∈ R^n provides a delay between
x(t) and M(t). The true dynamics is given by the following Main System, where
ξ^T denotes the transpose of ξ:

dM/dt = [-x + y + I] ξ^T,    (4)
λ dy/dt = f(x) - y,    (5)
x = Mξ.    (6)
The system's dynamics is thus driven by M(t); there is no independent activation
dynamics.
Computing dx/dt and using the fact that ξ^T ξ = ||ξ||^2 = 1, we obtain the following
system:

dx/dt = -x + y + I,    (7)
λ dy/dt = f(x) - y.    (8)
This system is independent of the input ξ; but the interpretation of x depends on
ξ, and x (but not y) jumps discontinuously when ξ is changed.
It is interesting to observe that System (7), (8) is equivalent to a single second order
equation in x (substitute y = dx/dt + x - I from (7) into (8)), namely

λ d²x/dt² + (λ + 1) dx/dt + x = f(x) + I,    (9)

similar to the second order activation dynamics used by W. Freeman et al. (see
References). The following fact is used:
Proposition 1 If x(t) obeys Equation (9) and ||f(x)|| ≤ ρ, then x(t) → N_ρ(I).
Standard dynamical system techniques (singular perturbations, invariant manifolds)
show that the dynamics of System (7), (8) is closely approximated by that
of the simpler system

dx/dt = -x + f(x) + I    (10)

provided λ is small enough.
Example 1 A simple but interesting system of this type has 1-dimensional activation
dynamics. For f we take any smooth (Lipschitz also works) sigmoid with
high gain, having limiting values ±1. The dynamics of (10) with I = 0
has stable equilibria at (approximately) x = ±1 and an unstable equilibrium at
x = 0. The 2-dimensional dynamics of System (7), (8) with I = 0 has two attracting
equilibria near (1, 1) and (-1, -1), and an unstable equilibrium near (0, 0).
Every trajectory tends to one of these three equilibria. As training signals we take
I_- = -1, I_+ = +1. When I takes either of these values I_a, (x, y) settles to the
global attractor near (2I_a, 2I_a). When I is then set to 0, keeping the same input
pattern, (x, y) relaxes toward (I_a, I_a). This system learns to sort any number
of sufficiently orthogonal input patterns into two arbitrary classes after a single
pass through the pattern set. A similar system is treated in detail in [9].
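The single-pass learning is easy to reproduce numerically. Below is a sketch that integrates the Main System (4)-(6) for Example 1 by Euler's method; the sigmoid gain, step sizes, stage durations, and the choice of standard basis vectors as (exactly orthogonal) input patterns are illustrative assumptions, not the author's values.

import numpy as np

def f(x, gain=20.0):
    return np.tanh(gain * x)          # smooth sigmoid with limiting values ±1

def run_stage(M, y, xi, I, lam=0.05, dt=0.001, T=5.0):
    for _ in range(int(T / dt)):
        x = M @ xi                                  # (6)
        M = M + dt * np.outer(-x + y + I, xi)       # (4)
        y = y + (dt / lam) * (f(x) - y)             # (5)
    return M, y

d = 4
M, y = np.zeros((1, d)), np.zeros(1)
labels = [+1.0, -1.0, +1.0, -1.0]     # two arbitrary classes, signals I = ±1
for a in range(d):                    # a single pass through the pattern set
    xi = np.eye(d)[a]
    M, y = run_stage(M, y, xi, I=np.array([labels[a]]))  # instruction period
    M, y = run_stage(M, y, xi, I=np.zeros(1))            # computation period
for a in range(d):
    print(labels[a], (M @ np.eye(d)[a])[0])   # activations settle near ±1

Each instruction period steers (x, y) toward the attractor selected by I_a, and the following computation period lets it relax into that attractor's basin, so testing any trained pattern reproduces its class.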
Higher dimensional examples can be obtained from this one by taking Cartesian
products of n such vector fields, yielding 2n attracting equilibria. More interesting
dynamics can be constructed by changing the vector field in small neighborhoods
of the attracting equilibria. In this way attractors with arbitrary dynamics can be
constructed.
2 Statement of Results
Notation. ||u|| = √(u^T u) denotes the Euclidean norm of the vector u. The closed ball
of radius ρ about a point z is denoted by N_ρ(z). The closed ρ-neighborhood of a
set A is ∪_{z∈A} N_ρ(z).
We start with the data: f, {A^a}, {U^a}, {S^a}.
We assume given an increasing sequence of times t_{k-1} < t'_k < t_k, k ∈ N. The Main
System is run during stage k as described above, taking as initial values at t_{k-1}
whatever the terminal values were in the previous stage (if k > 1).
We make the following assumptions, in terms of parameters ε > 0, T > 0, N ≥ 1,
R > 0, ρ > 0:
Hypothesis 2
ξ^a is consistently trained from stages k to l provided that at stage k
the input is ξ^a and the training signal is proper, while at each stage r, k < r ≤ l,
at which ξ^a is input, the training signal is either proper or 0 (both may occur at
different stages).
ξ^a tests successfully during stages k to l provided x^a(t_j) ∈ U^a whenever k ≤ j ≤ l
and the input for stage j is ξ^a.
Recall that λ is the time constant in Equation (5).
Theorem 3 There exist positive constants T_*, R_*, ε_*, λ_*, independent of N, ε, {ξ^a} and
computable from the data, with the following property. Assume 0 < λ < λ_* and
Hypothesis 2 holds with T ≥ T_*, R ≥ R_*, and in addition εN ≤ ε_*. Then every
pattern which is consistently trained from stages k to l tests successfully during
those stages.
The quantity

θ = εN(1 + ε)^N

plays a key role. Notice that θ → 0 as εN → 0.
Theorem 4 There exist positive constants T_*, R_*, λ_*, independent of ε, N, {ξ^a} and
computable from the data, with the following property. If 0 < λ < λ_* and Hypothesis
2 holds with T ≥ T_*, R ≥ R_*, then

||M(t)ξ^a|| < θR    (11)

for all t ≥ 0, a = 1, ..., m.
It is clear from (4) that if a vector η ∈ R^d is orthogonal to all the input patterns,
then M(t)η is constant. Therefore Theorem 4 implies that the entries of M(t)
are uniformly bounded.
The key to the proofs is the following lemma. Suppose the input is ξ = ξ^a in System
(4), (5), (6). Since x^b = Mξ^b, computing dx^b/dt = (dM/dt)ξ^b shows
dx^b/dt = (ξ^b · ξ^a) dx^a/dt, yielding the crucial estimate:

Lemma 5 If the input is ξ^a and b ≠ a, then for t_1 > t_0 ≥ 0 we have

|x^b(t_1) - x^b(t_0)| ≤ ε |x^a(t_1) - x^a(t_0)|.    (12)
MATHEMATICAL PROBLEMS ARISING FROM CONSTRUCTING AN
ARTIFICIAL BRAIN
J. G. Taylor

A beginning is being made on the hard problem of constructing an artificial brain, to try to bridge
the gap between neurophysiology and psychology. The approach especially uses the increasing
number of results obtained by means of non-invasive instruments (EEG, MEG, PET, fMRI) and
the related psychological tasks. The paper describes the program and some of the mathematical
problems that it is producing. In particular, the class of problems associated with the activities of
various coupled modules used in higher order control and attention will be discussed. These include
posterior attentional coupled systems, and the anterior motor/action/attention "ACTION" networks,
based on biological circuits. The relevance of these modules, and the associated problems,
for higher order cognition is emphasised.
1 Introduction
The Harvard neurologist Gerald Fischbach stated recently that "The human brain is the
most complicated object on earth". That does indeed seem to be the case, and for
a considerable period of time since the beginning of the scientific revolution this
fact seems to have led to despair of ever being able to have a reasonable compre-
hension of it. However there is now a strong sense of optimism in the field of brain
sciences that serious progress is being made in the search for comprehension of
the mechanisms used by the brain to achieve its amazing powers, even up to the
level of consciousness. Part of this optimism stems from the discovery and devel-
opment of new windows with which to look at the brain, with large groups now
being established to exploit the power of the available instruments. There are also
new theories of how neurons might combine their activities to produce mind-like
behaviour, based on recent developments in the field of artificial neural networks.
At the same time there is an ever increasing understanding of the subtleties of the
mechanisms used by living neurons. It would thus appear that the mind and brain
are being attacked at many levels. The paper attempts to contribute towards this
program by starting with a brief survey of the results of non-invasive instruments.
Then it turns to consider the inverse problem before discussing the nature of past
global brain models. An overview is then given of the manner in which neural
modelling may attempt to begin to achieve perception. Problems of neuropsychological
modelling are then considered, with the breakdown of tasks into component
sub-processes. Mathematical problems associated with constructing neural modules
to achieve the required functionality of sub-process are then discussed. Modules to
produce possible conscious processing are then considered, and the paper concludes
with a summary and discussion of further questions.
2 Non-Invasive Results
There are amazing new windows on the mind. These measure either the magnetic
or electric fields caused by neural activity, or the change in blood flow in active
regions of the brain to allow oxygen to be brought to support the neural activity.
The former techniques are termed magnetoencephalography (MEG) and electroen-
cephalography (EEG), the latter positron emission tomography (PET) and func-
tional magnetic resonance imaging (fMRI) respectively. The techniques of MEG,
EEG and fMRI are truly non-invasive since there is no substance penetrating the
body, whilst that is not so with PET, in which a subject imbibes a radioactive
material which will decay rapidly with emission of positrons (with their ensuing
decay to two photons).
The methods have different advantages and disadvantages. Thus EEG is a very
accurate measure of the time course of the average electrical current from an ag-
gregate of neurons firing in unison, but is contaminated with conduction currents
flowing through the conducting material of the head. Thus although temporal ac-
curacy of a few milliseconds can be obtained spatial accuracy is mainly limited to
surface effects, and deep currents are difficult to detect with certainty. MEG does
not have the defect of being broadened out spatially by the conduction currents,
but has the problem of spatial accuracy, which is slowly becoming resolved (but
also has a more difficult one of expense due to the need for special screening and
low temperature apparatus for the SQUID devices). fMRI also has considerable
expense and low temporal accuracy. Even though the activity from each slice of
brain may be assessed in an fMRI machine within about 100 or so milliseconds the
blood flow related to the underlying brain activity still requires about 7 seconds
to reach its peak. There has been a recent suggestion that hypo-activity from the
deoxygenated blood will 'peak' after only about 1 or 2 seconds of input [1], but
that is still to be confirmed. Finally, PET also has the same problems of slow blood
flow possessed by fMRI, but good spatial accuracy. Combining two or more of the
above techniques together makes it possible to follow the temporal development of
spatially well defined activity [2]. The results stemming from such a combination
are already showing clear trends.
The most important result being discovered by these techniques is that
1. activity of the brain for solving a given task is localised mainly in a distinct
network of modules,
2. the temporal flow of activity between the modules is itself very specific,
3. the network used for a given task is different, in general, from that used for
another task.
It is not proposed to give here a more detailed description of the enormous number
of results now being obtained by these techniques; the reader is referred, for example,
to the recent book [2], or the excellent discussion in [3]. However it is important
finally to note that the fact that there are distinctive signatures that can be picked
up by these instruments indicates that aggregates of nerve cells are involved in
solving the relevant tasks by the subjects. Thus population activity is involved,
and "grandmother" cells are not in evidence. Of course if they were important they
would slip through the net. However, that there are such characteristic signatures
seems to show that the net is sharp enough to detect relevant aggregated activity.
3 Inverse Problems
There has been much work done to decipher the detailed form of the underlying
neural activity that gives rise to a particular MEG or EEG pattern, the so-called
"inverse problem". Various approaches have been used to read off the relevant
neural processing, mainly in terms of the distribution of the underlying sources
of current flow. This latter might be chosen to be extremely discrete, as in the
case of the single dipole fits to some of the data. However more complete current
flow analyses have been performed recently, as in the case of MFT [4]. This leads
to a continuous distribution of current over the brain, in terms of the lead fields
describing the effectiveness of a given sensor to detect current at the position of the
field in question.
It has been possible to extend this approach to use cross entropy minimisation so as
to determine the statistics of the current density distribution in terms of the covari-
ance matrix of the measurements and the sensor mean results [5]. This approach is
presently being extended to attempt to incorporate the iteration procedure in the
MFT method of [4] so as to make the approach more effective [6].
There are further deep difficulties with the inverse problem. Thus one has to face
up to the question of single trials versus averaged values. Averaging has especially
been performed on the EEG data so as to remove the noise present. However in the
case of successful task performance the subject must have used activity of a set of
modules, in a single trial, in a way which avoided the noise. This is an important
question that will be taken up in the next section: how is successful performance
achieved in an apparently noisy or chaotic dynamical system such as the brain [7]?
Averaging only removes the crucial data indicating how the solution to this
task might be achieved. But single trials appear to be very, very noisy. How do
modules share in the processing so as to overcome the variability of neural activity
in each of them?
One way to begin to answer this, and the earlier problem about the inverse problem,
is to model the neural activity in a direct manner. In other words, one must attempt
to model the actual flow of nerve impulses and their resultant stimulation of other
neural modules. The resulting electric and magnetic fields of this neural activity must
then be calculated. This approach will allow more knowledge to be inserted in the
model being built than by use of methods like MFT, and so help constrain the
solution space better.
4 Neural Modules for Perception
It is possible to develop a program of modelling for the whole brain, so that the
spatial and temporal flow patterns that arise from given inputs can be simulated.
That also requires the simulation of reasonably accurate input sensors, such as the
retina, the nose or the cochlea. There are already simple neural models of such
input processors (such as in [8]), so that with care the effects of early processing
could already be included.
The order of work then to be performed is that of the development of simple neural
modules, say each with about a thousand neurons, and connected up to others in
a manner already being determined from known connectivities, such as that of the
macaque [9]. An important question of homology has to be solved here, but there is
increasing understanding of how to relate the monkey brain to that of the human,
so that problem may not be impossible. There is also the difficulty of relating the
resulting set of, say, a hundred simplified modules to their correct places on the
cortical surface (and to appropriate sub-cortical places) for a particular human
subject. Even assuming the cortical surface is known from an MRI scan, the actual
placing of the modules corresponding to different areas might be difficult. However
this may be tackled by considering it as a task of minimization of the mismatch
between the predictions of the model and the actual values of EEG and MEG signals
during a range of input processing, and including sub-cortical structures, such as
discussed in [10].
5 Neuropsychological Modelling
There is a big gap between neurons in the brain and the concepts which they sup-
port. The job of explaining how the latter arise from the former is formidable.
Indeed it is even harder to consider the ultimate of experiences supported, that of
consciousness itself. One approach has been to consider the brain as carrying out
simple computations. However there has recently been an attack on the notion of
the brain as a 'computational' organ, in the sense that "nervous systems represent
and they respond on the basis of the representations" [11]. It has been suggested,
instead, that brains do not perform such simple computations. Thus "It is shown
that simplified silicon nets can be thought of as computing, but biologically realistic
nets are non-computational. Rather than structure sensitive rules governed opera-
tions on symbolic representations, there is an evolution of self-organising non-linear
dynamic systems in a process of 'differing and deferring' " [7].
This claim of non-computationality of the brain is supported by appeals to the
discovery of chaos in EEG traces, and also that low-dimensional motion may be
involved in the production of percepts in ambiguity removal. However it is well
known that neural systems are non-linear, and that such systems can easily have
chaotic motions if forced into certain parameter regimes. Thus the dynamics of the
whole brain, written as the dynamical system
dX/dt = F(X,a)
in terms of the high dimensional state vector X, is expected to have chaotic motion
for some of the parameter range a, on which the vector field F depends. Thus X
could denote the set of compartmental membrane potentials, and F denotes the non-
linear Hodgkin-Huxley equations for the production of the nerve impulses. However
there are still expected to be "representations" brought about by the changes of the
function F by, for example, synaptic modifications caused by learning. If there is no
change in the transform function F there seems to be no chance for learning. Hence
the claim of [7] seems to be incorrect when the notion of representation is extended
to the more general transform function F. This uses the notion of coupled modules
in the whole brain, as are observed to be needed by the non-invasive techniques
mentioned earlier. Thus the possibility of chaos will not be regarded as problematic
to the program being proposed.
It is possible to build neural modules to model some of the important processes
observed in psychology. Thus the various sorts of memory and other processes may
be modelled as:
• Perception
• Movement
• Memory
• Thought
It is further seen that perception itself has at least the two component sub-tasks of
construction of codes (in the five separate modalities) and that of representations
(such as at feature, object, category, position, body matrix, lexical, phonological
and word level). There are numerous sub-divisions of memory (episodic, semantic,
implicit, priming, working, active) with considerable overlap with representations
(which are the objects of memory). There are functions which are strongly asso-
ciated with limbic regions, such as goals, values, drives and self representations.
There are also frontally-sited functions, as in attention, social behaviour, planning
and actions [13].
The above set of functions is performed by complex brain networks, as demonstrated
clearly in [2]. It is a very important but difficult task to attempt to reduce this
broad range to a smaller set of underlying elements, the component sub-processes
(CSPs) alluded to above. The following set is proposed as a preliminary version of
the required list:
SUBMITTED PAPERS
THE SUCCESSFUL USE OF PROBABILITY DATA IN
CONNECTIONIST MODELS
J. R. Alexander Jr. and J. P. Coughlin
Towson State University, Towson, USA.
Reggia [10] explored connectionist models which employed "virtual" lateral inhibition, and included
the activation of the receiving node in the equations for the flow of activation. Ahuja [1]
extended these concepts to include summing the total excitatory and inhibitory flow into a node.
He thus introduced the concept that the change of activation of a node depended on the integral
of the flow into that node and not just the present activation levels of the nodes to which it
is connected. Both Reggia's and Ahuja's models used probability data for the weights. Ahuja's
model was further extended by Alexander [2], [3], [4] in the RX model to allow both the weights
and the activations of Ahuja's model to be negative; further, Alexander's model included
the prior probabilities of all nodes. Section 1 of this paper contains a complete listing of the RX
equations and describes their development. The main result of this paper, the demonstration of
the convergence of the system is presented in Section 2. Section 3 briefly describes the experiments
testing the RX system and summarizes this article.
1 The RX Equations
The net is two-layered, with the lower level being the J input nodes and upper
level the N output nodes. The values of the upper level nodes, a_i(t), lie in [0, 1].
The prior probability of the existence of the feature associated with the ith node
is called ā_i. Values of a_i(t) greater than ā_i indicate a higher than average chance
of occurrence of the feature represented by node i, and those lower indicate a less
than average chance. The activation on the lower level nodes is computed from the
range of possible input values to a node. If the smallest observed value is Min_j,
the largest Max_j, the average Ave_j, and Observ_j is the observed value, then define:

m_j = (Observ_j - Ave_j)/(Max_j - Ave_j)   if Observ_j > Ave_j, and
m_j = (Observ_j - Ave_j)/(Ave_j - Min_j)   otherwise.    (1)
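Equation (1) amounts to a signed normalization of the observation against its extremes and its average; here is a direct transcription in Python (the function and argument names are ours):

def m_value(observ, min_j, max_j, ave_j):
    # Signed normalization of equation (1): +1 at Max_j (feature present with
    # probability one), -1 at Min_j (feature absent), 0 at Ave_j (no information).
    if observ > ave_j:
        return (observ - ave_j) / (max_j - ave_j)
    return (observ - ave_j) / (ave_j - min_j)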
Clearly m_j lies on [-1, 1]. When m_j assumes the value 1, the feature represented
by m_j is present with probability one. When m_j is -1 the feature is absent
with probability one. A value of zero indicates that no information exists concerning
the absence or presence of this feature. Two sets of weights exist. The first is
called w_ij, and in absolute value indicates the probability of occurrence (if positive,
and non-occurrence if negative) of the feature associated with the upper level (ith)
node, given the existence of activation on the lower level (jth) node. The second
weight is called v_ij, and in absolute value indicates the probability of occurrence (if
positive, and non-occurrence if negative) of the feature associated with the lower
level node, given the existence of activation on the upper level node.
Two auxiliary functions are associated with each a_i(t). The function which
conveys the excitatory activation is called Exc_i(t), and the one which conveys the
inhibitory is called Inh_i(t). Each is the sum of all the excitatory or inhibitory
flow of activation into node i. These functions are defined by their derivatives and,
since the latter are sums of everywhere positive terms, the former are monotone
increasing. Each of the terms used in defining the derivative is a function of all the
a_i(t) and is the variable forcing term. There will be one such function for each lower
level (j) node, hence a sum need be taken over all the j nodes connected to any
given i node. The bounded characteristic is achieved by including in the equation
a factor of the form [N - Exc_i(t)]. The choice of N as the bound which Exc_i(t)
approaches is somewhat arbitrary.
2 Convergence of the RX Equations
2.1 The RX Equations
Proof From the definition, Exc_i(t) and Inh_i(t) can be shown to be bounded monotone
functions [2], and therefore approach limits, LE_i and LI_i respectively (see
Buck [7], pg 26). The cases that may occur are:
(1) LE_i ≠ LI_i;
(2a) LE_i = LI_i = d_i (d_i < N); and
(2b) LE_i = LI_i = N.
Case (1) By examining separately the cases where LE_i > LI_i and LI_i > LE_i, it is
easily shown that a_i(t) approaches a limit in either case. (See [2].)
The remaining two cases are (2a), in which LE_i = LI_i = d_i < N, and (2b), in which
LE_i = LI_i = N. For the space of any triple a_i, Exc_i, Inh_i, a plane is formed
by Exc_i = Inh_i. This is pictured in Figure 1. The line formed by the intersection
of this plane with the plane a_i = ā_i constitutes the ith component of a CPL-type
critical point. Thus, this line forms a continuum of critical points.
The flow, in both Cases (2a) and (2b), is considered (1) outside a 2η band (η
positive and small) about the plane a_i = ā_i and (2) inside this band. It is not
difficult to analyze what happens outside the band, but pure analysis proofs could
not be obtained inside the band. Hence, within the 2η band about the ā_i plane, the
associated linear system was used to analyze the flow of activation.
[Figure 1: the flow near the plane a_i = ā_i, with Regions 1-4 separated by the planes a_i = ā_i + η, a_i = ā_i - η and Exc_i = Inh_i.]
Case (2a) LE_i = LI_i = d_i, where 0 < d_i < N.
(1) Flow outside the 2η band. Equations (8) furnished the result of formally
integrating equation (4) and equation (5). The results were:

Exc_i(t) = N - [N - Exc_i(t_0)] e^{-K_1 ā_i ∫_{t_0}^{t} OUT_i(s) ds},    (11)
Inh_i(t) = N - [N - Inh_i(t_0)] e^{-K_2 ā_i ∫_{t_0}^{t} out_i(s) ds}.    (12)
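Differentiating (11) recovers the flow equation dExc_i/dt = K_1 ā_i OUT_i(t)[N - Exc_i], so Exc_i is monotone increasing and saturates at N precisely when ∫ OUT_i dt diverges. A short numerical check of this agreement (the constants and the forcing term are invented for illustration):

import numpy as np

N, K1, abar, dt = 10.0, 0.5, 0.3, 0.01
exc, integral = 0.0, 0.0
for step in range(200000):
    out = 1.0 / (1.0 + step * dt)              # a forcing whose integral diverges
    integral += out * dt
    exc += dt * K1 * abar * out * (N - exc)    # Euler step of dExc/dt
closed_form = N - (N - 0.0) * np.exp(-K1 * abar * integral)
print(exc, closed_form)                        # agree up to the Euler step error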
Assume that after some time T, a_i(t) remains bounded away from the ā_i plane by
some arbitrarily small amount η. Therefore:

I_2 = ∫_T^∞ out_i(t) dt = ∫_T^∞ g_2 |a_i - ā_i| dt > ∫_T^∞ |g_2|_min η dt.    (14)

In equation (9), g_1 is w_ij/(∑_l w_lj |a_l - ā_l| + C_d), and g_2 is

(∑_{k≠i} v_kj |a_k - ā_k|)/(∑_l v_lj |a_l - ā_l| + C_d).
Not all the a_k(t) may approach their ā_k. Were this to occur, then I_2 would
approach a limit different from I_1. Therefore LE_i would be greater than LI_i and
would not converge to d_i, contrary to the assumption LE_i = LI_i = d_i.
Under the assumption that ∑_{k≠i} v_kj |a_k(t) - ā_k| ≠ 0 we assert that g_2 = O(g_1)
and g_1 = O(g_2) (O stands for "order of"). That is, g_1 and g_2 are of the same
order of magnitude (Olmstead [9], p141), since they are both ratios of fractions
whose denominators are not zero. The integrals I_1 and I_2 will therefore diverge,
and Exc_i(t) and Inh_i(t) will approach N and not d_i, contrary to assumption.
(2) The flow inside the 2η band. Elsewhere [2], the eigenvalues and eigenvectors
of the Jacobians on each side of the a_i(t) = ā_i hyperplane are calculated, and
solutions of the associated linear system discussed in detail. It suffices to say that
on one side of the hyperplane a_i(t) = ā_i the eigenvalues are imaginary, and cause
elliptical motion about the line Exc_i(t) = Inh_i(t) (at d_i), which is in the hyperplane
a_i(t) = ā_i. In Figure 1 the imaginary eigenvalues from the linear system characterize
the motion above the ā_i plane, and the real eigenvalues that below this plane. From
below the ā_i plane, trajectories in the region where Exc_i > Inh_i (called Region 1 in
Figure 1) approach the ā_i plane from below, and penetrate it. Above the ā_i plane
(Region 2, where Exc_i > Inh_i) trajectories travel an elliptic path about the line
Exc_i = Inh_i and enter Region 4 (where Inh_i > Exc_i). Two situations may occur.
First, they may leave the η band and go far enough above it that they may not
return. It is known that the integrals diverge in this region and hence Exc_i(t) and
Inh_i(t) do not converge to d_i. Second, the trajectories may move back towards the
hyperplane a_i(t) = ā_i. They penetrate this plane and pass through to the other side.
Once beneath the ā_i plane they are in Region 3, where a_i < ā_i and Inh_i > Exc_i.
The trajectory then moves away from the critical point. Since both Exc_i and Inh_i
are monotone non-decreasing, once past d_i they cannot approach this point again.
Therefore, unless trajectories actually encounter a CPL-type critical point, they are
driven away from it and into the region where |a_i - ā_i| ≥ η. The integrals I_1 and
I_2 therefore increase, and hence Exc_i and Inh_i, being monotone increasing, will
exceed d_i and not approach it as a limit.
Below the ā_i plane, the eigenvalues are real, but one of them is positive. Bellman
[6] (Theorem 3, p88) has shown that, under conditions which the RX equations
obey (i.e. differentiability, Rudin [11], pg 189), critical points possessing positive
eigenvalues are unstable. As such, CPL-type critical points are unstable equilibrium
points. Instability is not a strong property, and therefore the possibility of points
approaching a CPL-type critical point cannot be dismissed. Elsewhere [2] it is
demonstrated that a solution of the linear system does indeed approach ā_i.
We now move on to the last remaining possibility for Exc_i(t) and Inh_i(t): LE_i =
LI_i = N (Case (2b)).
(1) Flow outside the 2η band. In this case, as in the preceding case, it is fairly simple
to show that a_i(t) approaches a limit.
(2) Flow inside the 2η band. Flow inside the band is approximated by the linear
system, but all trajectories are now those for which Exc_i(t) and Inh_i(t) are within
ε of N. The approximating linear equations are presented in [2]. As before, the
trajectories will eventually reach Region 3, where, although the constants multiplying
it are small, the motion is still governed by a positive exponential and a_i(t) will be
directed away from ā_i. Thus, it would appear that when a_i(t) leaves the cube of
Figure 1, it will approach a limit. □
We conclude by proving a lemma.
Lemma 2 If lim_{t→∞} Exc_i(t) = Inh_i(t) = N, then a_i(t) will not attain ā_i in finite
time.
Proof The proof is by contradiction. Assume for t > t* that a_i(t) = ā_i. Then, for
t > t*, OUT_i(t) = out_i(t) = 0. Therefore:

∫_0^∞ OUT_i(t) dt = ∫_0^{t*} OUT_i(t) dt = α,    ∫_0^∞ out_i(t) dt = ∫_0^{t*} out_i(t) dt = β,    (15)

where α and β are finite numbers. Integrating (4) we have for a solution to Exc_i(t):

lim_{t→∞} Exc_i(t) = N - [N - Exc_i(t_0)] e^{-K_1 ā_i ∫_0^∞ OUT_i(t) dt}
                   = N - [N - Exc_i(t_0)] e^{-K_1 ā_i α} ≠ N.

The same can be shown for Inh_i(t). □
It would appear from this lemma, and from the fact that a_i(t) is very near the ā_i
plane, that the trajectories within the volume being considered are moving very
slowly. This is to be expected since all terms are small.
We have thus shown in all three cases (1), (2a) and (2b) that all trajectories of
the RX system approach limits, and hence the demonstration of the convergence of
a_i(t) is complete.
3 Summary
The RX equations have been tested both in associative memory and control appli-
cations. In associative memory abilities, a back-propagating net's performance only
slightly exceeded that of the RX net in a test involving classifying RADAR signals.
In control applications a two input, one output node RX net (three neuron con-
troller) performed as well or better than a fuzzy controller in the task of backing a
truck to a loading dock. Because of the apparent biological plausibility of the three
neuron controller, and the existence of elementary life forms with control circuitry
yet no memory circuitry, we offer the speculation that neural memory circuitry has
evolved from neural control circuitry [5]. Recall that the RX net uses probability
data as weights and that it involves no learning. In the future, we intend to study
more biologically plausible versions of the RX net both in memory and control
applications.
REFERENCES
[1] Ahuja S., Soh W. and Schwartz A., A Connectionist Processing Metaphor for Diagnostic Reasoning, International Journal of Intelligent Systems (1988).
[2] Alexander J.R., An Analysis of the General Properties and Convergence of a Connectionist Model Employing Probability Data as Weights and Parameters, PhD Dissertation (1991), University of Maryland Baltimore County, Baltimore MD.
[3] Alexander J.R., A Connectionist Model Employing Probability Data as Weights and Parameters, Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 2 (1992), ASME Press, New York NY.
[4] Alexander J.R., Calculating the Centroid of a Fuzzy Logic Controller Using an Artificial Neural Net, Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 3 (1993), ASME Press, New York NY.
[5] Alexander J.R. and Bradley J.B., The Three Neuron Controller - History, Theoretical Foundations, and Possible Employment in Adaptive Control, Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 5 (1995), ASME Press, New York NY.
[6] Bellman R., Stability Theory of Differential Equations (1953), New York: McGraw-Hill.
[7] Buck R.C., Advanced Calculus (1956), New York: McGraw-Hill.
[8] Hirsch M., Convergent Activation Dynamics in Continuous Time Networks, Neural Networks, Vol. 2 (1989), pp331-349.
[9] Olmstead J.M.H., Advanced Calculus (1961), New York: Appleton-Century-Crofts.
[10] Reggia J., Virtual Lateral Inhibition in Parallel Activation Models of Associative Memory, Proceedings Ninth International Conference on Artificial Intelligence (1985), pp244-248.
[11] Rudin W., Principles of Mathematical Analysis (1976), New York: McGraw-Hill.
WEIGHTED MIXTURE OF MODELS FOR ON-LINE
LEARNING
P. Edgar An
Department of Ocean Engineering,
Florida Atlantic University, Boca Raton, FL 33431, USA.
This paper proposes a weighted mixture of locally generalizing models in an attempt to resolve the
trade-off between model mismatch and measurement noise given a sparse set of training samples so
that the conditional mean estimation performance of the desired response can be made adequate
over the input region of interest. In this architecture, each expert model has its corresponding
variance model for estimating the expert's modeling performance. Parameters associated with
individual expert models are adapted in the usual least-mean-square sense, weighted by the
corresponding variance model output. The variance models, in turn, are adapted in such a way
that expert models of higher resolution (or greater modeling capability) are discouraged from
contributing, except when the local modeling performance becomes inadequate.
1 Background
Artificial neural networks have been widely used in modeling and classification ap-
plications because of their flexible nonlinear modeling capabilities over the input
region of interest [3]. The merit of individual networks is generally evaluated in
terms of cost criterion and learning algorithm by which the free parameters are
adapted, and the characteristics of the parametric model. These networks often
employ minimum sum-squared error criteria for parameter estimation such that
the desired solution is a conditional mean estimator for the given data set, pro-
vided that the model is sufficiently flexible [2]. The use of such criterion not only
provides a simple iterative procedure for parameter estimation but also the adapta-
tion follows the Maximum Likelihood (ML) principle when the model uncertainty is
a realization of independent Gaussian process. Neverth~less in many applications,
the generalization performance which utilizes the least-square criterion breaks down
when the conditional mean response is estimated using a small set of data samples
with an unknown degree of noise uncertainty. This is because the LS criterion max-
imizes only the overall fitting performance defined by the data set, rather than
evaluating the model mismatch in any local input region. Thus, a sufficiently flexi-
ble model often overfits undesirable noisy components in the data samples whereas
a model with restricted degrees of freedom is less likely to converge to the condi-
tional mean of the data response [4]. This paper focuses on an alternative approach
in dealing with the bias/variance dilemma, which is based on a variation of the
"Mixture-of-Experts" (or ME) [5].
2 Weighted Mixture of Experts (WME)
One alternative solution to the bias/variance dilemma is to incorporate a weighted
mixture of expert (or WME) models so that only a few experts contribute in any par-
ticular input region [5]. The internal structure in each of these experts is fixed, and
a separate variance model is used, one for each expert, to evaluate its corresponding
expert's performance. If the expert and variance models are chosen to be linear with
respect to their adaptable parameters, the resulting learning process can be formu-
lated as a linear optimization problem which enables rapid learning convergence.
67
68 CHAPTER 8
The learning process for the WME algorithm is based on the competitive nature
among experts in modeling an unknown mapping, and the overall WME output,
y(x), is simply a linear combination of individual expert model outputs weighted
by their variance model outputs, C¥j(x), at x
A() ~ exp( -C¥j(X)) A ~ A
Y X = L...J ",m ( ())Yj(X) = L...Jpj(X)Yj(X) (1)
j=l L..."k=l exp -C¥k x j=l
where Yj(x) is the jth expert model output, and C¥j is the corresponding variance
model output (associated with the jth expert output). Pj(x) can be interpreted as a
local fitness measure of the jth expert modeling performance at x, and is bounded
E [0, 1]. The error criterion for individual expert models is defined as
Ey = 2n1 t;
n [m
[;Pj(X;)(Y(X;) - 1
yj(x;))2 (2)
En = -;?=
1 n [m?= Ajpj(x;)(y(x;) - Yj(x;))21 (3)
.=1 J=l
where n is the number of data samples, and Y(Xi) is the data output for Xi. Ey
ensures that each of the experts is allowed to best approximate y(x;), weighted by
Pj (Xi), whereas En ensures that all_ experts are specialized in unique input regions
by assigning a smaller Pj (i) for a larger error variance, and vice versa. The term
Aj in (3) regulates the variance estimation among models of varying resolution so
that Ai is larger for higher-resolution experts, and vice versa. If this term was not
included, the resulting adaptation would lead to a WME model consisting mainly
of high-resolution experts. Ai can thus be interpreted as a form of regularization
which takes in prior knowledge about the set of experts for modeling, and can be
set inversely proportional to their degrees of freedom (e.g. total number of free
parameters). This criterion is thus different from that proposed in [5] as a result
of this regularizing factor A. In the case where the model structure of any expert
is non-uniform across the input region of interest (such as clustering-based radial-
basis-functions network), Ai can be modified in such a way that it varies from one
input segment to another.
Like in many artificial neural networks, the iterative procedure can be modified so
that the adaptation is carried out sample by sample, and the corresponding cost cri-
terion can be reduced by dropping the first summation term in (2,3). It is generally
desirable to maintain equal variance model outputs prior to on-line learning to avoid
bias toward any particular model structure, unless prior knowledge of the unknown
mapping is available (e.g. C¥j(.) = 0). As training proceeds, the variance models
are adapted to form appropriate prior knowledge for combining the experts. These
parameterized models can be chosen from a variety of existing locally-generalizing
models, such as radial-basis-functions, B-Splines or Cerebellar Model Articulation
Controller (or CMAC) networks. These networks are particularly well suited to the
WME model implementation because their internal structures are fixed and their
parameters are defined over a local region of support, thus enabling rapid learning
convergence. Also, the influence of the variance model is restricted to within the
An: Weighted Mixture of Models for On-line Learning 69
local extent of the training input, and equal contribution will be assigned to regions
of unseen data samples. If both variance and expert models generalize globally, the
conditional mean estimation performance is likely to deteriorate considerably as a
result of temporal instability.
3 Example
, .. ..
. ,
' .
10
I;!
, ..
- . . ...
~ ... - .. ...
--
_____ _ ___ J ____ _______________ _ '" ,'.
~~~ - - t~~;; --:lil~ -~::i --.~!~.- -~~rJ~ jll! -f;~~: --:~:~ --r~~
10' '" .. ,
...
........
10. 1
.. ,4
1011:; . ~....;::
;",= .. ::::;
"".=.;:;
....::
. ::;..,: :::;::: ... ::!:.. ,::;~
;;; ;:..:::: 10 ::'
.....
...," ..,,,..... .
,'" .. ...... , ....
10~'---'----J'---'----''-----'----'_ _'___'_...J 10~'---'--'---'-~'---'----J'---'-_'__...J
I 10 I 10
Figure 1 (a) Top: The RMS curves as Figure 2 (a) Similar configuration to
a function of training cycle number for those described in Figure 1 except that
five different models averaging over 10 each independent training set contains
independent sets of 500 noiseless (ran- 5000 samples. (b) bottom: same RMS
domly drawn) training samples. '-: single curves as described in (a) except that
CMAC (C =
2); '- -': single CMAC (C = the samples are contaminated with zero-
7); '+': additive CMACs (C = 2, 7); '0': mean Gaussian noise of variance 0.1.
WME CMACs (1:1 ,\ ratio, C = 2, 7); 'x':
WME CMACs (1:3'\ ratio, C = 2, 7). (b)
bottom: same RMS curves as described in
(a) except that the samples are contam-
inated with zero-mean Gaussian noise of
variance 0.1.
3 3
2 2
-~ 0 -I
0 -~ 0 -I
0
I I
Input2 Inpu12
Inputl Inpull
Additive CMACs: (C =2, 7) Variance-based CMACs: (C =2, 7)
3
2 2
-~ -----:0:-------
I -I
o -~ - - - - : 0 : 0 - - - -
I -I
o
Input2 Inpu12
Input I Inputt
bias/variance dilemma issue will remain unresolved. Having unique internal expert
structures also facilitates implicit model selection problems such that one does not
need to search for an optimal structure for a single model. It would appear that
having a mixture of experts increases the total number of free parameters for opti-
mization, and the resulting network will not be parsimonous. Quite differently, the
objective is not to reduce the number of physical parameters, as advocated in many
proposed architectures, but rather to reduce the number of effective parameters,
governed by the variance models, in such a way that the WME model can be more
robust with respect to noise and model mismatch.
REFERENCES
[1] An P.E., Miller W.T., Parks P.C. De~ign Improvementa in Auociative Memorie~ for Cerebel-
lar Model Articulation Contro/ler~ (CMAC), Proc. intI. Conf. on Artificial Neural Networks,
Helsinki, North Holland, Vol.2 (1991), pp.1207-1210.
[2] Bishop C. Mixture Denaity Network~, Neural Computing Research Group Report
NCRG/4288 (1994), Department of Computer Science, Aston University, UK.
[3] Brown M., Harris C.J. NeuroFuzzy Adaptive Modelling and Control (1994), Prentice-Hall.
[4] Geman S., Bienenstock E. Neural Networh and the Biaa/Variance Dilemma, Neural Com-
putation, Vol. 4 (1992), pp.1-58.
[5] Jacobs R., Jordan M., Nowlan S., Hinton G. Adaptive Mixture~ of Local Experts, Neural
Computation, Vol. 3 (1991), pp.79-87.
[6] Sjoberg J., Ljung L. Overtraining, Regularization, and Searching for Minimum in Neural
Networks, the 4th IFAC Symposium on Adaptive Systems in Control and Signal Processing,
France, (1992) pp.669-674.
LOCAL MODIFICATIONS TO RADIAL BASIS
NETWORKS
I. J. Anderson
School of Computing and Mathematics, University of Huddersfield,
Queensgate, Huddersfield HDl 3DH, UK.
The fonn of a radial basis network is a linear combination of translates of a given radial basis
function, 4>( r ). The radial basis method involves determining the values of the unknown parameters
within the network given a set of inputs, {xd, and their correspondingoutpl1ts, {fd. It is usual
for some of the parameters of the network to be fixed. If the positions of the centres of the basis
functions are known and constant, the radial basis problem reduces to a standard linear system of
equations and many techniques are available for calculating the values of the unknown coefficients
efficiently. However, if both the positions of the centres and the values of the coefficients are
allowed to vary, the problem becomes considerably more difficult. A highly non-linear problem is
produced and solved in an iterative manner. An initial guess for the best positions of the centres
is made and the coefficients for this particular choice of centres are calculated as before. For each
iteration, a small change to the position of the centres is made in order to improve the quality
of the network and the values of the coefficients for these new centre positions are determined.
The overall algorithm is computationally expensive and here we consider ways of improving the
efficiency of the method by exploiting the local stability of the thin plate spline basis function.
At each step of the iteration, only a small change is made to the positions of the centres and so
we can reasonably expect that there is only a small change to the values of the corresponding
coefficients. These small changes are estimated using local modifications.
1 Introduction
We consider the thin plate spline basis function
¢(r)=r 2 10gr.
This basis function has not been as popular as the Gaussian, ¢(r) = exp( _r2 j2u),
mainly due to its unbounded nature. The thinking behind this is that it is desirable
to use a basis function that has near compact support so that the approximating
function behaves in a local manner.
A suitable definition of local behaviour for an approximating function of the form
given in equation (1) is as follows. The ith coefficient, Ci, of the approximant should
be related to the values {fk} for values of k where IIxk - Ai II is small. Thus, the
value of the coefficient should be influenced only by the data ordinates whose ab-
scissae values are close to the centre of the corresponding basis function. From this
definition one deduces that it is advisable to use basis functions that decay, hence
the desire for basis functions with near compact support. However many authors
have shown that this deduction is unfounded and it is in fact easier to produce the
properties of local behaviour by using unbounded functions [4, 2, 3, 1].
1.1 The Approximation Problem
Given a set of m data points, (Xk,!k), for k = 1,2, ... , m, it is possible to produce
an approximation of the form
n
f(x) = E ci¢(lIx - AiID, (1)
i=l
73
74 CHAPTER 9
Figure 1 The positions of the data abscissae and the centres of the basis func-
tions, and the surface approximant to the data.
such that
f(x,,)~f,,·
It is usual for this approximation problem to be solved in the least-squares sense.
There are two important categories for such an approximation. The easiest approach
is to consider the positions of the centres to be fixed, in which case the approxima-
tion problem is linear and is therefore reasonably efficient to solve. The alternative
is to allow the positions of the centres to vary or indeed for the number of centres
to change. For example, once a preliminary approximation has been completed it
is worth examining the results to determine the quality of the approximation. It
may be decided that one (or more) regions are unsatisfactory and so extra basis
functions or an adjustment of the centres of the current basis functions would be
suitable.
Under such circumstances it is usual to recalculate the new coefficients for the
approximation problem with respect to the new positions of the centres. Repeating
this process too often can be computationally expensive and it is more appropriate
to modify the current approximation to take into account the small changes that
have been made to the centres.
2 Local Stability of the Thin Plate Spline
This local behaviour of the thin plate spline is demonstrated by the use of an
example that consists of approximating m = 6,864 data points using n = 320 basis
functions. The centres for these basis functions are produced using a clustering
algorithm and the data are fitted in the least-squares sense. Formally, let I be the
set of indices for the basis functions, {I, 2, .. . ,n}. We wish to calculate the values
of the coefficients {Ci} that minimize the f 2 -norm of the residual vector e which
has the components
=
e" flo - I: ci<l>(Ilx - '\ill),
iEI
=
for k 1, 2, ... , m. The data, centres and the resulting approximant are shown in
Figure 1.
Anderson: Local Modifications to Radial Basis Networks 75
..
0,
~I
... .
o : .... _ _ _ _ _ _ _ __
-<,I
-<,
U t.. I ,f '.1
..
,.. ,,,
Let j E I be the index for one of the centres. We now perturb this centre by a small
amount, 6, such that we produce a new centre;
=
Ai Ai + 6.
=
All of the other centres are left undisturbed. That is A; Ai for i E I , i ¥ j . Again
the data are fitted in the least-squares sense but this time we use the new set of
centres, {An. Thus we calculate the values of the coefficients ci, for each i E I,
that minimize the i 2 -norm of the residual vector e* which has the components
e; =!k - L c;¢(lIx - A;II),
iEI
for k = 1, 2, ... , m. The two approximants are visually indistinguishable and so
we compare the coefficients of the respective fits.
Let do be the distance between the centre of the ith basis function and the position
of the perturbed centre, di = IIA; - A; II. Figure 2 shows the differences between
the respective coefficients, {(Ci - cin of the two fits against the distances {dol.
Also shown is the logarithm of the absolute values of the differences between the
respective coefficients against the same set of distances.
It can be seen that the differences between the coefficients decay exponentially as
the distance between the corresponding centre and the perturbation increases. It is
in this sense that we say that the thin plate spline behaves locally.
3 Exploiting the Local Behaviour
Since the effects of perturbing the position of a given centre are only noticeable for
the coefficients which correspond to basis functions that are centred in the neigh-
76 CHAPTER 9
• Convergence. Since each iteration is so much faster than the global approach
we can afford to use the technique many times, using different sets of basis
functions. However it would be necessary to confirm that such an approach
would continually refine and improve the quality of the approximation.
• No attempt has been made to optimize the size or the positions of the subset
of data used for the local modification. It is envisaged that a small but signif-
icant improvement in the quality of the approach could be made using such a
technique.
• Similarly the size and position of the subset of centres could be optimized. Par-
ticular attention could be directed towards the effects of larger perturbations
and/or several small perturbations.
REFERENCES
[I] I. J. Anderson. Local modifications to radial basis networlcs. Technical Report RR9505, School
of Computing and Mathematics, University of Huddersfield, Huddersfield, UK, (1995).
[2] M. D. Buhmann and C. K. Chui. A note on the local stability of translates of radial basis
functions. Journal of Approximation Theory, Vol. 74 (1993), pp36-40.
[3] J. Levesley. Local stability of translates of polyharmonic splines in even space dimension.
Numer. FUnct. Anal. and Optimiz., Vo1.15 (1994) pp327-333.
[4] M. J. D. Powell. Radial basis function approximations to polynomials. In D. F. Griffiths
and G. A. Watson, editors, Numerical Analysis, (1987), pp 223-241. Longman Scientific and
Technical,1988.
Acknowledgements
The author wishes to thank C. K. Pink for his comments on earlier drafts of this
paper, and also the Department of Conservative Dentistry in the Dental School at
the London Hospital Medical College for supplying the data.
A STATISTICAL ANALYSIS OF THE MODIFIED NLMS
RULES
E.n. Aved'yan, M. Brown* and C.J. Harris*
Russian Academy oj Sciences, Institute oj Control Sciences, 65 ProJsoyuznaya str.,
Moscow 117342, Russia. Email: [email protected]
*ISIS research group, Dept oj Electronics and Computer Science, Southampton
University, Highfield, Southampton, S017 1Bl, UK. Email: [email protected]
This paper analyses the statistical convergence properties of the modified NLMS rules which
were formulated in an attempt to produce more robust and faster converging training algorithms.
However, the statistical analysis described in this paper leads us to the conjecture that the standard
NLMS rule is the only unconditionally stable modified NLMS training algorithm, and that the
optimal value of the learning rate and region of convergence for the modified NLMS rules is
generally less than for the standard NLMS rule.
I Adaptive Systems and the NLMS Algorithm
Nonlinear networks, such as the Cerebellar Model Articulation Controller (CMAC),
Radial Basis Functions (RBF) and B-splines have an "output layer" of linear param-
eters that can be directly trained using any of the linear learning algorithms that
have been developed over the past 40 years. So consider a linear in its parameter
vector network of the form:
n
y(t) L Xi(t) Wi(t - 1)
i=l
(1)
where y(t) is the system's output, w(t - 1) = (W1(t - 1), ... , wn(t - 1» is the n-
dimensional weight vector and x(t) = (X1(t), ... , xn(t» is the n-dimensional trans-
formed input vector at time t. This "input" vector x could possibly be a nonlinear
transformation of the network's original input measurements. For a linear system
described by equation 1, the Normalised Least Mean Squares (NLMS) learning
algorithm for a single training sample {x(t), y(t)} is given by:
.6.w(t) = J.L
(y(t) 2 x(t) (2)
IIx(t)1I2
=
where .6.w(t) w(t) - w(t - 1) is the weight vector update, (y(t) (y(t) - y(t» =
is the output error, y(t) is the desired network output at time t and J.L E [0,2] is
the learning rate. This simple learning rule has a long history, as it was originally
derived by Kaczmarz in 1937 [6] and was re-derived numerous times in the adaptive
control [7] and neural network literature. When the training data are generated by
an equivalent model with an unknown "optimal" parameter vector w (i.e. there
is no modelling error or measurement noise), the NLMS algorithm possesses the
property that:
(3)
where (w(t) = w- w(t) is the weight error at time t. Hence the estimates of the
weight vector approach (monotonically) the true values and this learning rule has
many other desirable properties [3].
78
Aved'yan, Brown (3 Harris: Analysis of Modified NLMS Rules 79
2 Modified NLMS algorithms
In 1971, a set of Modified NLMS (MNLMS) algorithms was proposed by Aved'yan
[1, 2] and they were again rediscovered over 20 years later by Douglas [5]. The
MNLMS learning algorithms are derived from the two conditions:
Find wet) such that:
yet) = xT(t)w(t) (4)
and lI.6.w(t)lIp is minimised.
for different values of p where 1 :$ p < 00.
These instantaneous learning rules are based on generating a new search direction
by minimising an alternative norm in weight space, and three special cases which
are worthy of attention are the 1, 2 and oo-norms. The 2-norm corresponds to
the standard NLMS rule which is denoted by std NLMS, but the Ll learning
algorithm is:
fy{t)
.6.wi{t) = Jl-() Oile (5)
x'" t '
where k = argmaxo IXi(t)l, and this will be referred to as the max NLMS rule.
The Loo training rule is given by:
.6. () fy{t)sgn(x{t))
w t = Jl IIx{t)lb
(6)
and this will be referred to as the sgn NLMS rule [7). It is fairly simple to show
that the a posteriori output error is always zero for these learning rules (with
Jl = 1), so the only difference between them is how they search the weight space.
The max NLMS algorithm always updates the weight vector parallel to an axis, and
the sgn NLMS rule causes the weight vector update to be at 45 0 to an axis. This is
in contrast to the std NLMS training procedure which always projects the weight
vector perpendicularly onto the solution hyperplane generated by the training data
[6].
3 A Statistical Analysis
A deterministic analysis of the MNLMS rules has already been completed and
it is shown in [4] that certain finite training sets can cause unstable learning in
the linear network for the sgn and max NLMS rules, irrespective of the size of
the (non-zero) learning rate. The statistical analysis of the MNLMS rules in this
paper is based on a model of the input vector where all components are mutually
independent random processes with zero mean values, each of which is a sequence
of independent identically symmetric distributed random variables. This proivdes
the conditions of convergence for the modified algorithms, the optimal value of the
learning rate, and the influence of a noise and the mean value of the input process
on the convergence conditions.
3.1 Convergence
The process wet) converges (in the statistical mean-square sense) to the optimal
weight vector, w, if it satisfies the condition limt-+oo E (f~(t)fw(t)) = 0, where EO
denotes the statistical expectation operator. The convergence depends greatly on
the properties of the input process x{t), for instance the standard NLMS algorithm
does not converge to its optimal values, when the process x{t) quickly settles to a
constant value or when it is varies very slowly: the input signal is not persistently
80 CHAPTER 10
exciting. This situation is even worse for both the max and sgn NLMS algorithms
as they search a restricted section of the parameter space and an optimal solution
for the training date does not always lie in this region [3].
It is possible to write the Euclidean squared norm of the error in the weight vector
£~(t)£w(t) as:
=
(I _
£~(t)£w(t)
where set) is the current search direction for the different learning rules, when the
training data contain no measurment or modelling errors.
=
Let us denote by V 2(t) E (£~(t)£w(t)) the statistical expectation ofthe Euclidean
squared norm of the error in the weight vectors in equation 7. Taking into account
the statistical properties of the input vector x(t) and the fact that the trace of
each matrix in these equations is equal to one, it is possible to generate first-order
relations:
=
V}(t) (1- 2n- 11-' + 13.1-'2) V.2(t -1) (8)
where a • is used instead of std, sgn and max, and:
f3std = n- 1 (9)
E ( xT (t)x(t) ) (10)
f3s gn
(x T (t)sgn(x(t)))2
n-1 E (xT(t)X(t)) (11)
13m ax = xHt)
For these parameters we have the following inequalities:
n- 1 :::; f3s gn :::; 1 (12)
n- :::; f3max :::; 1
1 (13)
The value of the parameters f3s gn and f3max depends on the probability distribution
function of the vector x(t) and can be calculated analytically in the special cases
shown below.
From equation 8, it follows that not only are the convergence conditions for the std,
sgn and max NLMS algorithms, respectively:
0< q. = (l-2n- 11-'+f3.1-'2) < 1 (14)
but also the optimal value of the learning rate j1 by which values of qstd, qsgn and
qmax are minimal:
(15)
and consequently:
(16)
These values determine the convergence time of the respective algorithms, and:
I-':gn:::; I-':td =1 (17)
I-'~ax:::; I-':td =1 (18)
q:gn ~ q:td = (1- n- 1) (19)
q~ax ~ q:td = (1- n- 1 ) (20)
Aved'yan, Brown {3 Harris: Analysis of Modified NLMS Rules 81
Hence it follows that optimal value of the learning rate and the region of convergence
for the sgn and max NLMS rules are less than for the std NLMS rule, and that
the convergence time for the std NLMS rule is less than for the other sgn and
max NLMS rules. However, it is possible to get the analytical expression for f3 in
equations 10 and 11 in certain special cases.
When Xi(t) has a Laplace distribution: p(x;) = 0'-1 exp (-lx;1 /0'), then f3s g n =
2/(n + 1) and Jisgn = (n + 1)/2n for the sgn NLMS rule. The corresponding region
of convergence is equal to 0 < J.lmax < (n + 1) in contrast with the std NLMS rule
where 0 < J.lstd < 2. Hence, for large n, the convergence rate of the sgn NLMS rule
is approximately twice as large as the std NLMS rule.
When x;(t) has a uniform distribution p(Xi) = 1/20', IXil < a, then f3max = (n +
2)/3n, and Jimax = 3/(n + 2). The corresponding region of convergence is 0 <
J.lmax < 6/(n + 2) and it follows that for large n, the time of convergence of the
std NLMS rule is equal to cn where c is a constant whereas for the max NLMS rule
this time is equal to cn(n + 2)/3.
These examples shows that the std NLMS rule has a larger convergence rate when
compared with the sgn and the max NLMS algorithm for these specific input distri-
butions. The MNLMS rules therefore represent a tradeoff between computational
simplicity and stability/convergence rate.
3.2 The Influence of Noise
Assuming that the output data are corrupted with an additive, statistically inde-
pendent, white noise sequence e(k), with variance ul,
and that the input vector has
the same statistical properties as were mentioned above, then:
V.2(t) = (1 - 2nJ.l + f3.J.l 2)V.2(t - 1) + J.l2Uld. (21)
where the disturbance terms d. are given by:
dstd E ((x T (t)x(t)r 1) (22)
nUe2dsgn 2 J.I
f.I (26)
- fJsgnJ.ln
2d J.I (27)
nUe max 2 - f3maxJ.ln
After the initial "transient" convergence, the mean value of the process w(t) will
be equal to wand the weight vector will "jitter" around the mean value with a
variance given in equations 25-27. This region of convergence was termed a minimal
capture zone by Parks and Militzer [8]. For a specific input distribution, the size of
the minimal capture zone can be made arbitrary small by choosing a small learning
rate, J.I, but this would also decrease the rate of convergence.
82 CHAPTER 10
1.5r-----~-----r-----_.___---__,----__.
1 :.
{~(t){w(t) ~
.................
0.5 \ \ ....
" ..~.:..;.;...... '\ ..... .
°0L-----~~10===~=-=~=~=·~=·~~·~-·-~·~~~~·~~~·-~·~·;L~-~-~.~-~.~~~~.~4~0------~50
Figure 1 The evolution of the magnitude of the error in the weight vector when
it is trained using the std NLMS rule (solid line), sgn NLMS rule (dashed line) and
the max NLMS rule (dotted line), with lL.td = 1 and ILsgn = ILma.x = 0.5.
Acknowledgements
The authors gratefully acknowledge the financial support of Lucas aerospace during
the preparation ofthis manuscript and Dr Aved'yan also acknowledges the financial
support of the Fundamental Research Foundation of Russia, grant 93-012-448. He
also wishes to thank the Royal Society of London for their support through a joint
project on "Overall Behaviour of Adaptive Control Systems Using Neural Networks
for Identification or Learning", (1993-1996), Ref. No. 638072.P622.
This paper is dedicated to the memory of Patrick Parks.
FINITE SIZE EFFECTS IN ON-LINE LEARNING OF
MULTI-LAYER NEURAL NETWORKS.
David Barber, Peter Sollich* and David Saad
Dept. Comp, Sci. and Appl. Maths., Aston University,
Birmingham B4 7ET, UK. Web: http://www.ncrg.aston.ac.uk/
* Department of Physics, University of Edinburgh, Mayfield Road,
Edinburgh EH9 3JZ, UK.
We extend the recent progress in thermodynamic limit analyses of mean on-line gradient descent
learning dynamics in multi-layer networks by calculating the fluctuations possessed by finite di-
mensional systems. Fluctuations from the mean dynamics are largest at the onset of specialisation
as student hidden unit weight vectors begin to imitate specific teacher vectors, and increase with
the degree of symmetry of the initial conditions. Including a term to stimulate asymmetry in the
learning process typically significantly decreases finite size effects and training time.
Recent advances in the theory of on-line learning have yielded insights into the
training dynamics of multi-layer neural networks. In on-line learning, the weights
parametrizing the student network are updated according to the error on a single
example from a stream of examples, {e, r(el')} , generated by a teacher network
r(· )[1]. The analysis of the resulting weight dynamics has previously been treated
by assuming an infinite input dimension (thermodynamic limit) such that a mean
dynamics analysis is exact[2]. We present a more realistic treatment by calculating
corrections to the mean dynamics induced by finite dimensional inputs[3].
We assume that the teacher network the student attempts to learn is a soft com-
mittee machine[l] of N inputs, and M hidden units, this being a one hidden layer
network with weights connecting each hidden to output unit set to +1, and with
each hidden unit n connected to all input units by Bn(n = l..M). Explicitly, for
the N dimensional training input vector e,
the output of the teacher is given by,
M
(I' = Eg(Bn.e), (1)
n=l
where g(:z:) is the activation function of the hidden units, and we take g(:z:) =
erf(:z:/V2). The teacher generates a stream of training examples (e, (1'), with input
components drawn from a normal distribution of zero mean, unit variance. The
student network that attempts to learn the teacher, by fitting the training examples,
is also a soft committee machine, but with K hidden units. For input e,
the student
output is,
K
u(J,e) = Lg(J;·e), (2)
;=1
where the student weights J = {Ji}(i = l..K) are sequentially modified to reduce
the error that the student makes on an input e,
f(J, e)
1
= "2 (u(J, e) - (1')2 ="2
1 (K M)2
trg(:z:f) - ~ g(y~) , (3)
84
Barber, Sollich & Saad: Finite Size Effects in On-line Learning 85
with the activations defined xr
= Ji'e'", and Y~ = Bn·e'". Gradient descent on the
error (3) results in an update of the student weight vectors,
JI'+l = JI' - ;O'/e'", (4)
,r ~ g'(xf) - (5)
and g' is the derivative of the activation function g. The typical performance of the
student on a randomly selected input example is given by the generalisation error,
<g = «(J,e)), (6)
where ( .. ) represents an average over the gaussian input distribution. One finds
that <g depends only on the overlap parameters, R;n = Ji' Bn, Qij = J i · J j ,
and Tnm = Bn ·Bm(i,j = l..K; n, m = l..M)[2J, for which, using (4), we derive
(stochastic) update equations,
I'+l RI' _ 7]
R in ,I' I'
(7)
- in - N Ui Yn ,
2
I'+1 QI' _
Q ik - ik -
7]
N
(,I' I'
ui Xj + uk<I' Xi1') + N2
7] < < /:1' /:1'
uiukro,. ' ... (8)
We average over the input distribution to obtain deterministic equations for the
mean values of the overlap parameters, which are self-averaging in the thermody-
namic limit. In this limit we treat III N = Q' as a continuous variable and form
differential equations for the thermodynamic overlaps, R?n, Q?k'
dR?n = 7] (0)
---a;;- i Yn , (9)
(10)
For given initial overlap conditions, (9,10) are integrated to find the mean dynamical
behaviour of a student learning a teacher with an arbitrary numbers of hidden
units[2] (see fig.(la)). Typically, <g decays rapidly to a symmetric phase in which
there is near perfect symmetry between the hidden units. Such phases exist in
learnable scenarios until sufficient examples have been presented to determine which
student hidden unit will mimic which teacher hidden unit. For perfectly symmetric
initial conditions, such specialisation is impossible in a mean dynamics analysis. The
more symmetric the initial conditions are, the longer the trapping in the symmetric
phase (see fig.(2a)). Large deviations from the mean dynamics can exist in this
symmetric phase, as a small perturbation from symmetry can determine which
student hidden unit will specialise on which teacher hidden unit[l].
We rewrite (7,8) in the general form
al'+l - a!' = !.L
N
(Fa + 7]G a), (11)
where Fa + 7]G a is the update rule for a general overlap parameter a. In order to
investigate finite size effects, we make the following ansaetze for the deviations of
86 CHAPTER 11
the update rules Fa (the same form is made for Ga ) and overlap parameters a from
their thermodynamic values, 1
(')1 (')1
'~:r:'~
o 20 40 60 80
one teacher hidden unit. Non zero initial one teacher hidden unit. Initially, Qll =
parameters: Qll =
0.2,Q22 Rll =0.1. = 0.1, with all other parameters set to zero.
(a) Thermodynamic generalisation error, (a) Thermodynamic generalisation error
~~. (b) 0 (N- I ) correction to the gen- ~~. (b) 0 (N-l) correction to the gener-
eralisation error , ~~. Simulation results alisation error, ~~.
for N = 10,7) =0.1 and (half standard
deviation) error bars are drawn.
are small. For a finite size correction of less than 10%, we would require an input
dimension of around N > 257]. For the more symmetric initial conditions (fig.(2a))
there is a very definite symmetric phase, for which a finite size correction of less
than 10% (fig.(2b)) would require an input dimension of around N > 50, 0007]. As
' ,]
the initial conditions approach perfect symmetry, the finite size effects diverge, and
the mean dynamical theory becomes inexact. Using the covariances, we can analyse
0.6r---r--r-..,.--.--r---r--':::J
0.4
0.2
,-----------
- Qt1
.......... QI1
-- .~';&::
'~rs:: ]
---Rl1
0.0
..... -'-R"
....
~.2
~.4
a 140 a 140
~.6L..---L_--'-_--'--_-'-_L..---L_-L.-'
Figure 3 (a) The normalised compo- Figure 4 Two student hidden units,
nents of the principal eigenvector for the one teacher hidden unit. The initial
isotropic teacher. M K= = 2, (Q22 = conditions are as in fig.(2).(a) Ther-
Qll,R22 = RIl). Non zero initial pa- modynamic generalisation error, ~~. (b)
rameters Qll =
0.2, Q22 =
0.1, RIl = o (N-I) correction to the generalisation
O.OOl,R22 = 0.001.
error, f~.
the way in which the student breaks out of the symmetric phase by specialising its
hidden units. For the isotropic teacher scenario Tnm = Dnm , and M = K = 2, learn-
88 CHAPTER 11
ing proceeds such that one can approximate, Q22 = Qll, R22 = Rll. By analysing
the eigenvalues of the covariance matrix (!:::.a!:::.b), we found that there is a sharply
defined principal direction, the components of which we show in fig.(3). Initially, all
components of the principal direction are similarly correlated, which corresponds to
the symmetric region. Then, around a = 20, as the symmetry breaks, Ru and R21
become maximally anti-correlated, whilst there is minimal correlation between the
Qu and Q12 components. This corresponds well with predictions from perturbation
analysis[2]. The symmetry breaking is characterised by a specialisation process in
which each student vector increases its overlap with one particular teacher weight,
whilst decreasing its overlap with other teacher weights. After the specialisation
has occured, there is a growth in the anti-correlation between the student length
and its overlap with other students. The asymptotic values of these correlations are
in agreement with the convergence fixed point, R2 = Q = l.
In light of possible prolonged symmetric phases, we break the symmetry of the
student hidden units by imposing an ordering on the student lengths, Qu ~ Q22 ~
... ~ QKK, which is enforced in a 'soft' manner by including an extra term to (3),
1 K-l
{t ="2 E
h(Qj+lj+l-QJJ), (19)
i=l
where h(x) approximates the step function,
REFERENCES
[I] M. Biehl and H. Schwarze, Learning by online gradient descent, Journal of Physics A Vol.28
(1995), pp643-656.
[2] D. Saad and S .Solla, Exact solution for online learning in multilayer neural networks. Phys-
ical Review Letters, Vol. 74(21) (1995). pp4337-4340.
[3] P. Sollich, Finite size effects in learning and generalization in linear perceptrons, Journal of
Physics A Vol.27 (1994), pp7771-7784.
[4] T. Heskes, Journal of Physics A Vol.27 (1994), pp5145-5160.
Acknowledgements
This work was partially supported by the EU grant ERB CHRX-CT92-0063.
CONSTANT FAN-IN DIGITAL NEURAL NETWORKS
ARE VLSI-OPTIMAL
V. Beiu
Los Alamos National Laboratory, Division NIS-i, Los Alamos,
New Mexico 87545, USA. Email: [email protected]
The paper presents a theoretical proof revealing an intrinsic limitation of digital VLSI technology:
its inability to cope with highly connected structures (e.g. neural networks). We are in fact able
to prove that efficient digital VLSI implementations (known as VLSI-optimal when minimising
the AT2 complexity measure - A being the area of the chip, and T the delay for propagating the
inputs to the outputs) of neural networks are achieved for small-constant fan-in gates. This result
builds on quite recent ones dealing with a very close estimate of the area of neural networks when
implemented by threshold gates, but it is also valid for classical Boolean gates. Limitations and
open questions are presented in the conclusions.
Keywords: neural networks, VLSI, fan-in, Boolean circuits, threshold circuits, F n,m functions.
1 Introduction
In this paper a network will be considered an acyclic graph having several input
nodes (inputs) and some (at least one) output nodes (outputs). The nodes are
characterised by fan-in (the number of incoming edges - denoted by .6.) and fan-
out (the number of outgoing edges), while the network has a certain size (the
number of nodes) and depth (the number of edges on the longest input to output
path). If with each edge a synaptic weight is associated and each node computes the
weighted sum of its inputs to which a non-linear activation function is then applied
(artificial neuron), the network is a neural network (NN):
89
90 CHAPTER 12
maximal ratio between the largest and the smallest weight. For simplification, in the
following we shall consider only NNs having n binary inputs and k binary outputs.
If real inputs and outputs are needed, it is always possible to quantize them up to
a certain number of bits such as to achieve a desired precision. The fan-in of a gate
will be denoted by .6. and all the logarithms are taken to base 2 except mentioned
otherwise. Section 2 will present previous results for which proofs have already been
given [2-7]. In section 3 we shall prove our main claim while also showing several
simulation results.
2 Background
A novel synthesis algorithm evolving from the decomposition of COMPARISON
has recently been proposed. We have been able to prove that [2, 3]:
Proposition 1 The computation of COMPARISON of two n-bit numbers can
be realised by a .6.-ary tree of size O( n/.6.) and depth O(log n/log.6.) for any integer
fan-in 2 ~ .6. ~ n.
A class of Boolean functions Hi' 6 having the property that V16 E Hi' 6 is lin-
early separable has afterwards been introduced as: ''the class of functions 16 of
.6. input variables, with .6. even, 16 = 16(g6/2-1, e6/2-1, ... , go, eo), and comput-
.mg f 6 de! V
= j=O gj I\. k=j+l ek . BY conventIOn,
6 / 2- 1 [ (1\6/2-1 )]" . we conSI'der 1\",-1 =
i=", ei de!
1. One restriction is that the input variables are pair-dependent, meaning that
we can group the .6. input variables in .6./2 pairs of two input variables each:
(g6/2-1, e6/2-d, ... , (gO, eo), and that in each such group one variable is 'dominant'
(i.e. when a dominant variable is 1, the other variable forming the pair will also be
1):
Hi' 6 d~ {hlh : {(O, 0), (0, 1), (1, 1)}6/ 2 _ {O, I}, .6./2 E IN·,
T(n, m,.6.) = r
lOgn - 11 +
10g.6. - 1
r lOgm + 11 = 0 (IOg(mn»)
log .6. log .6.
and
4n. 26
A(n, m,.6.) < 2m· ( ~ +
5(n _.6.).2 6 / 2 )
.6.(.6. _ 2)
2m -11r
+.6.'.6. _ 1 = 0
(mn. 26
.6.
)
For 2m > 26 the equations are much more intricate, while the complexity values
for area and for AT2 are only reduced by a factor (equal to the fan-in [6, 7]). If we
now suppose that a feed-forward NN of n inputs and k outputs is described by m
examples, it can be directly constructed as simultaneously implementing k different
functions from F n,m [4, 6, 7]:
Proposition 5 =
Any set of k functions fEF n,i, i 1,2, ... , m, i:5 m :5 26 - 1 can
be computed by a neural network with polynomially bounded integer weights (and
thresholds) having size O(m(2n + k)/.6.), depth O(log(mn)/log.6.) and occupying
an area of O( mn . 26 /.6. + mk) if 2m :5 26 , for all the values of the fan-in (.6.) in
the range 3 to O(log n).
The architecture has a first layer of COMPARISONs which can either be imple-
mented using classical Boolean gates (BGs) or - as it has been shown previously
- by TGs. The desired function can be synthesised either by one more layer of
TGs, or by a classical two layers AND-OR structure (a second hidden layer of
AND gates - one for each hypercube), and a third layer of k OR gates represents
the outputs. For minimising the area some COMPARISONs could be replaced by
AND gates (like in a classical disjunctive normal form implementation).
3 Which is the VLSI-Optimal Fan-In?
Not wanting to complicate the proofs, we shall determine the VLSI-optimal fan-in
when implementing COMPARISON (in fact: F n,1 functions) for which the solution
was detailed in Propositions 1 to 3. The same result is valid for F n,m functions as
can be intuitively expected either by comparing equations (3) and (4), or because:
• the area is determined by the same first layer of COMPARISONs (the ad-
ditional area for implementing the symmetric 'alternate addition' [4] can be
neglected).
16 24
- 5~3Iog~ + 10~2Iog~ - In2n~2Iog~ + In2n~ log~
24 10 2 32 2 88 48
- In 2 n log ~ + In 2 ~ log ~ - In 2 n~ + In 2 n~ - In 2 n
+~~2_~~)
In2 In2
which - unfortunately - involves transcendental functions of the variables in an
essentially non-algebraic way. If we consider the simplified 'complexity' version of
equation (3) we have:
d(AT2) ~ ~ (nlog2 n· 21::./2) = 2D./2 . (In 2 _ ..!. __2_)
d~ d~ ~log2 ~ ~log2 ~ 2 ~ ~In~
which when equated to zero leads to In~(~ln2 - 2) = 4 (also a transcendental
equation). This has ~ = 6 as 'solution' and as the weights and the thresholds are
bounded by 21::./2 (Proposition 4) the proof is concluded. D
The proof has been obtained using several successive approximations: neglecting
the ceilings and using a 'simplified' complexity estimate. That is why we present in
Figure 2 exact plots of the AT2 measure which support our previous claim. It can
be seen that the optimal fan-in 'constantly' lies between 6 and 9 (as ~optim 6 ...9, =
one can minimise the area by using COMPARISONs only if the group of ones
has a length of a ~ 64 - see [4-7]). Some plots in Figure 2 are also including
a TG-optimal solution denoted by SRK [14] and the logarithmic fan-in solution
(~= logn) denoted BJg [5] .
4 Conclusions
This paper has presented a theoretical proof for one of the intrinsic limitations of
digital VLSI technology: there are no 'optimal' solutions able to cope with highly
connected structures. For doing that we have proven the contrary, namely that
constant fan-in NNs are VLSI-optimal for digital architectures (either Boolean or
using TGs) . Open questions remain concerning 'if' and 'how' such a result could
be extended to purely analog or mixed analog/digital VLSI circuits.
Beiu: Constant Fan-in Digital Neural Networks are VLSI-optimal 93
12000 B_'
f ... 8_9
B_"
.' ,_--+::.-:~~::""u
......+~.B....5
+++ 0 00
a)
3x.10'5 ' ; ' = - - - - - - _ - - , - - - - - - ,
B_5
~
'.5
0.5
B_'
00 00
c) d)
,Xl01i
B_. B_.
BJg
.~:~~;~-~.~::-;~~,-::-6 :: ~, , 0 0
•
goo "'" 700 800 IKlO 1000 1100
e) f)
REFERENCES
[1] Y.S. Abu-Mostafa, Connectivity Versus Entropy, in: Neural Infonnation Processing Systems
(Proc. NIPS*87, Denver, Colorado), ed. D.Z. Anderson, American Institute of Physics, New
York, (1988) ppl-8.
[2] V. Beiu, J.A. Peperstraete, J. Vandewalle and R. Lauwereins, Efficient Decomposition of
COMPARISON and Its Applications, in: ESANN'93 (Proc. European Symposium on Artificial
Neural Networks '93, Brussels, Belgium), ed. M. Verley sen, Dfacto, Brussels, (1993) pp45-50.
94 CHAPTER 12
[3] V. Beiu, J.A. Peperstraete, J. Vandewalle and R. Lauwereins, COMPARISON and Threshold
Gate Decomposition, in: MicroNeuro '93 (Proc. International Conference on Microelectronics
for Neural Networks '93, Edinburgh, UK), eds. D.J. Myers and A.F. Murray, UnivEd Tech.
Ltd., Edinburgh, (1993) pp83-90.
[4] V. Beiu, J.A. Peperstraete, J. Vandewalle and R. Lauwereins, Learning from Examples and
VLSI Implementation of Neural Networks, in: Cybernetics and System Research '94 (Proc.
European Meeting on Cybernetics and System Research '94, Vienna, Austria), ed. R. Trappl,
World Scientific Publishing, Singapore, (1994) pp1767-1774.
[5] V. Beiu, J.A. Peperstraete, J. Vandewalle and R. Lauwereins, A rea- Time Performances of
Some Neural Computations, in: SPRANN '94 (Proc. International Symposium on Signal Pro-
cessing, Robotics and Neural Networks '94, Lille, France), eds. P. Borne, T. Fukuda and S.G.
Tzafestas, GERF EC, Lille, (1994) pp664-668.
[6] V. Beiu and J.G. Taylor, VLSI Optimal Neural Network Learning Algorithm, in: Artificial
Neural Nets and Genetic Algorithms (Proc. Int. Conf., Ales, France), eds. D.W. Pearson, N.C.
Steele and R.F. Albrecht, Springer-Verlag, Vienna, (1995) pp61-64.
[7] V. Beiu and J.G. Taylor, Area-Efficient Constructive Learning Algorithms, in Proc. CSCS-
10 (10th International Conference on Control System and Computer Science, Bucharest,
Romania), ed. I. Dumitrache, PUBucharest, Bucharest, (1995), pp293-310.
[8] J. Bruck and J. Goodman, On the Power of Neural Networks for Solving Hard Problems, in:
Neural Information Processing Systems (Proc. NIPS*87, Denver, Colorado), ed. D.Z. Ander-
son, American Institute of Physics, New York, (1988) pp137-143.
[9] D. Hammerstrom, The Connectivity Analysis of Simple Associations -or- How Many Con-
nections Do You Need, in: Neural Information Processing Systems (Proc. NIPS*87, Novem-
ber, Denver, Colorado), ed. D.Z. Anderson, American Institute of Physics, New York, (1988)
pp338-347.
[10] H. Klaggers and M. Soegtrop, Limited Fan-In Random Wired Cascade-Correlation, in: Mi-
croNeuro'93 (Proc. International Conference on Microelectronics for Neural Networks, Edin-
burgh '93, UK), eds. D.J. Myers and A.F. Murray, UnivEd Tech. Ltd., Edinburgh, (1993)
pp79--82.
[11] A.V. Krishnamoorthy, R. Paturi, M. Blume, G.D. Linden, L.H. Linden and S.C. Esener,
Hardware TradeoJJs for Boolean Concept Learning, in WCNN'94 (Proc. World Conference on
Neural Networks '94, San Diego, USA), Lawrence Erlbaum and INNS Press, Hillsdale, (1994)
Vol. 1, pp551-559.
[12] D.S. Phatak and I. Koren, Connectivity and Performance TredeoJJs in the Cascade-
Correlation Learning Architecture, IEEE Trans. on Neural Networks Vol. 5(6) (1994), pp930-
935.
[13] N.P. Red'kin, Synthesis of Threshold Circuits for certain Classes of Boolean Functions,
Kibernetica Vol. 5 (1970), pp6-9. Translated in Cybernetics Vol. 6(5) (1973), pp54Q-544.
[14] K.-Y. Siu, V. Roychowdbury and T. Kailath, Depth-Size TradeoJJsfor Neural Computations,
IEEE Trans. on Computers Vol. 40(12) (1991), pp1402-1412.
[15] R.C. Williamson, e-Entropy and the Complexity of Feedforward Neural Networks, in: Neural
Information Processing Systems (Proc. NIPS*90, Denver, Colorado), eds. R.P. Lippmann, J.E.
Moody and D.S. Touretzky, Morgan Kaufmann, San Mateo, (1991), pp946-952.
[16] B.-T. Zhang and H. Miihlenbein, Genetic Programming of Minimal Neural Networks Using
Occam'8 Razor, Technical Report: Arbeitspapiere der GMD 734, Schlofi Birlinghoven, Sankt
Augustin, Germany (1993).
Acknowledgements
This research work has been started while Dr. Beiu was with the Katholieke Uni-
versiteit Leuven, Belgium, and has been supported by a grant from the Con-
certed Research Action of the Flemish Community entitled: "Applicable Neural
Networks". The research has been continued under the Human Capital and Mo-
bility programme of the European Community as an Individual Research Training
Fellowship ERB4001GT941815: "Programmable Neural Arrays", under contract
ERBCHBICT941741. The scientific responsibility is assumed by the author, who is
on leave of absence from the "Politechnica" University of Bucharest, Romania.
THE APPLICATION OF BINARY ENCODED 2ND
DIFFERENTIAL SPECTROMETRY IN
PREPROCESSING OF UV-VIS ABSORPTION
SPECTRAL DATA
N Benjathapanun, W J 0 Boyle and K T V Grattan
Department of Electrical, Electronic and Information Engineering
City University, Northampton Square, London ECl V OHB, UK.
This paper describes classification of UV-Vis optical absorption spectra by binary encoding seg-
ments of the second derivative of the absorption spectra according to their shape. This allows
successful classification of spectra using the Back Propagation Neural Network analysis (BPNN)
algorithm where other preprocessing schemes have failed. It is also shown that once classified,
estimation of chemical species concentration using a further stage of BPNN is possible. Data for
the study are derived from laboratory-based measurements of UV-Vis optical absorption spectra
from mixtures of common chemical pollutants.
1 Introduction
This study has the goal of developing artificial intelligence methods, to analyse UV-
Vis spectra and hence determine actual chemical species and their concentrations in
real-time in-line monitor systems. Prior to the study, a wide range of NN methods
(BP, Radial Base, Kohonen, etc.), topologies and iteration conditions were evalu-
ated for their ability to classify and/or estimate components in the data. All these
methods were unsuccessful and pointed to the need for a more knowledge-based
approach. In the present work two approaches to preprocessing the data before
classification by BPNN are evaluated. The first method, 2nd derivative spectrome-
try, relies only on the spectral data and if successful would give self-classifying, self
learning solutions. The second method, a modification of 2nd derivative spectrom-
etry [1], depends on knowledge of the absorption spectra of expected constituent
species.
2 Experimental
Data for the study were obtained from laboratory-based UV-Vis optical absorption
spectra measurements taken from mixtures of three common chemical pollutants
in water prepared at three different concentrations. Stock solutions were prepared
from addition of Sodium Nitrate, Ammonia solution and Sodium Hypochlorite to
distilled water and these were mixed in all possible combinations to provide a set
of training data with 64 members. 64 samples were also mixed for a test set at
concentrations approximately 30% greater than those for the training set. Species
and concentrations are summarised in Table 1. The apparatus used in the exper-
iments consisted of a Hewlett-Packard 8452A diode array spectrometer equipped
with a 1 cm quartz cell operated remotely using proprietary software. All spectro-
scopic data were transferred to computer via a serial interface for analysis using
Microsoft Windows based software including the Neural Desk v2.1 Neural Network
software package[2]. UV intensity spectra of the solution were recorded from 190 to
820 nm with 2 nm interval. Absorption spectra were calculated from the intensity
spectra using equation (1) with spectra from distilled water as a reference. For data
95
96 CHAPTER 13
TRAINING SET
NH3-A - 105.83 mg/l Ch-A - 49.23 mg/l N0 3-A - 7.75 mg/l
NH3-B = 32.55 mg/l Ch-B = 24.62 mg/l N0 3-B = 3.88 mg/l
NH3-C = 11.18 mg/l Ch-C = 6.15 mg/l N03-C - 1.94 mg/l
TESTING SET
NH3-D = 137.7 mg/l Ch-D = 60.48 mg/l N0 3-D = 9.75 mg/l
NH3-E - 45.9 mg/l Ch-E - 30.24 mg/l N0 3-E - 4.88 mg/l
NH3-F = 15.3 mg/l Ch-F - 5.04 mg/l N0 3-F - 1.95 mg/l
Table 1 Chemical species and concentrations used to obtain the data sets.
analysis, data points 4nm apart were used, giving data arrays of 43 points.
A --IoglO ( darkcurrent )
hlank -
(1)
darkcurrent
[.ample -
Figure 1 shows absorption spectra obtained for the three individual species and the
spectra when these are mixed together. These spectra are typical of those obtained
for this study and they exemplify two important features of UV-Vis spectrometry.
Firstly, UV-Vis spectral peaks from water are in general broad and overlapping,
typically 30nm wide - this makes them difficult to discern. Secondly, the dynamic
range in absorption for the contaminant species is high. Consequently when two or
more species are mixed at high concentrations very little light is left for measure-
ment, and the signal to noise level is reduced. The experimental data set is also
subject to some systematic error due to base line shifts representative of drift in
the experimental apparatus over long time periods (days) and also because of unde-
termined chemistry due to reactions between components in the mixture. However
these factors should not limit the ability to classify the spectra, only the ability
to quantify, as each constituent in a sample should result in its own distinctive
adsorption peak, even if the relationship between peak height and concentration
contains some error and/or is non-linear.
Figure 1 The absorbance spectra of N03 7.75 mg/I, NH3 105.83 mg/I and Ch
49.23mg/1 and a mixture of these (solid line).
Table 2 Surrunary of conditions for the BPNN classifiers used after the prepro-
cessing steps.
D.'"
"
--
~~D~ '"
oi!'
I"
---
0.,
a
1
D.2.'l
ell 0.1 - - 2nddW
.... -""'''''
'0
11°::
----,
~
DM ',Itr--____,--,_""'_,,--,_
D~------------------
1 201 401 tol ." l00112011401110111012D01
opoca
Figure 2 Training error as a function of Figure 3 Testing error after every 200
epoch for preprocessing with PCA, 2nd training epochs as a function of epoch for
Derivative spectra, and Binary Encoded preprocessing with PCA, 2nd Derivative
2nd Derivative Spectra. spectra, and Binary Encoded 2nd Deriva-
tive Spectra.
8 Conclusions
In laboratory-based trials, 2nd-derivative analysis is found to be an ineffective pre-
processing method in BPNN classification of UV-Vis absorption from water sam-
ples. This follows on from earlier work in which various NN algorithms including
BPNN were evaluated for classification and estimation and also found ineffective.
Subsequently a more knowledge-based approach has been formulated which restricts
itself to determining the presence and concentration of a range of expected species.
The scheme involves a three stage process. In the first stage shape information
is derived by binary encoding segments of the second derivative of the absorp-
100 CHAPTER 13
~
~
Figure 4 Absorption spectra and 1st and 2nd derivative spectra for
monochloroamine and monochloroamine plus nitrate.
tion spectra according to their shape. The rationale of this stage is to reduce the
spectral information to shape sensitive factors. This is found significantly to ease
classification of the spectra by a second stage of BPNN analysis. For estimation of
concentration of species absorption data for the expected species is used to train
a second stage of BPNN, segmentation of the spectra and selection of relevant in-
puts for the second stage BPNN is determined from the absorption data for the
expected species, to give the best segmentation pattern and minimum number of
network inputs. The two-step approach taken to classification and then estimation
is better than a one step approach. The first-step network specifies which species
are likely to occur and the second-step network can then focus on a few inputs that
strongly correlate with the presence of the expected species. Also the second-step
provides a filter that compensates for classification of species at low concentration
levels or incorrect identification of species due to low level signals with noise.
REFERENCES
[1] Sommer, L., Analytical absorption spectrophotometry in the visible and ultraviolet: the prin-
ciples, (1989) Elsevir.
[2] Neural Desk: User's Guide, Neural Computer Sciences, (1992).
[3] Hammerstrom, D. M., Working with neural networks, IEEE Spectrum, July (1993), pp46-53.
[4] Gemperline, P. J., Long, J. R. and Gregoriou, V. J., Nonlinear Multivarate Calibration Using
Principal Components Regression and Artificial Neural Networks, Anal. Chern., Vol. 63 (1991),
pp2313-2323.
[5] Antonov, L. and Stoyanov, S., Analysis of the Overlapping Bands in UV- Vis Absorption
spectroscopy, Applied Spectroscopy, Vol. 41 (1993), no. 1, pp1030-1035.
A NON-EQUIDISTANT ELASTIC NET ALGORITHM
Jan van den Berg and Jock H. Geselschap
Department of Computer Science,
Erasmus University Rotterdam, The Netherlands.
Email: [email protected]
The statistical mechanical derivation by Simic ofthe Elastic Net Algorithm (ENA) from a stochas-
tic Hopfield neural network is criticized. In our view, the ENA should be considered a dynamic
penalty method. Using a linear distance measure, a Non-equidistant Elastic Net Algorithm (NENA)
is presented. Finally, a Hybrid Elastic Net Algorithm (HENA) is discussed.
1 Stochastic Hopfield and Elastic Neural Networks
Hopfield introduced the idea of an energy function into neural network theory [5].
Like Simic [8], we use Hopfield's energy expression multiplied by -1, i.e.
E(S) = ~ L WijSiSj + L liSi, (1)
ij
where S E {D, l}n and all Wij ~ D. Making the units stochastic, the network can be
analyzed applying statistical mechanics. We concentrate on the free energy [6, 6]
F (E(S)) - TS, = (2)
where T = 1/(3 is the temperature, where (E(S)) represents the average energy, and
where S equals the so-called entropy. A minimum of F corresponds to a thermal
equilibrium state [6]. We shall apply the next theorem [10, 11]:
Let S~ denote whether the salesman at time i occupies space-point p or not (S~ =1
or 0). Then the corresponding Hamiltonian may be stated as [8]
E(S) = ~ LLd;qS;(S;+l + S;-l) + ~ LLd;qS;S!. (5)
i pq i pq
The first term represents the sum of distance-squares, the second term penalizes
the simultaneous presence of the salesman at more than one position. The other
constraints, which should guarantee that every city is visited once, can be built-in
'strongly' using "Ii : L:j Sij = 1. Eventually, one finds [8, 12] the free energy
Ftsp(V) = -~ LLd;qV;(V;+l + V;-l) - ~ LLd;qV;V;- i pq
i pq
101
102 CHAPTER 14
On the other hand, the 'elastic net' algorithm [2] has the energy function

E_{en}(x) = \gamma \sum_i |x^{i+1} - x^i|^2 - T \sum_p \ln \sum_i \exp(-\beta\,|x_p - x^i|^2).   (7)

Here, x^i represents the i-th elastic net point and x_p represents the location of city p. Application of gradient descent on (7) yields the updating rule:

\Delta x^i = \gamma (x^{i+1} - 2x^i + x^{i-1}) + \alpha_1 \sum_p \Lambda_p(i)\,(x_p - x^i),   (8)
and using (4) (adapted to the TSP, with α = 1), we obtain [12]

x(i) is the stochastic and x^i the average position of the salesman at time i. Using the decomposition, Simic writes Σ_q d_{pq}^2 V_q^i = |x_p - x^i|^2. By this, he makes a crucial transformation from a linear function in V_q^i into a quadratic one in x^i. Substitution of the result in (12) (with α = β), neglect of the second term, and application of the decomposition (13) on the first term of (12) finally yield (7).

Objection 2. Careful analysis [12] shows that in general

\sum_q d_{pq}^2 V_q^i = \sum_q (x_p - x_q)^2 V_q^i \neq |x_p - x^i|^2.   □
Energy (6) is a special case of a generalized free energy of type (3), whose stationary points are solutions of

V_p^i = \exp\Big(-\beta \sum_{jq} W^{ij}_{pq} V_q^j\Big) \Big/ \sum_l \exp\Big(-\beta \sum_{jq} W^{ij}_{lq} V_q^j\Big).

Whatever the temperature, these stationary points are found at states where, on average, all strongly built-in constraints are fulfilled. Moreover, stationary points of a free energy of type (3) are often maxima [11, 12].
Objection 3. An analysis of the free energy of the ENA (section 3) yields a very different view: both terms of (7) create a set of minima. A competition takes place between feasibility and optimality, where the current temperature determines the overall effect. This corresponds to the classical penalty method. A difference from that approach is that here - as in the Hopfield-Lagrange model [9] - the penalty weights change dynamically. Consequently, we consider the ENA a dynamic penalty method. □
The last observation corresponds to the theory of so-called deformable templates [7, 13], where the corresponding Hamiltonian equals

E_{dt}(S, x) = \gamma \sum_i |x^{i+1} - x^i|^2 + \sum_{pj} S_{pj}\,|x_p - x^j|^2.   (14)

A statistical analysis [7, 13] of E_{dt} yields the free energy (7). A comparison between (7) and (14) clarifies that the first energy expression is derived from the second one by adding noise exclusively to the penalty terms.
3 An Analysis of the ENA
We can analyze the ENA by inspection of the energy landscape [3, 12]. The general behavior of the algorithm leads to large-scale, global adjustments early on. Later on, smaller, more local refinements occur. Equidistance is enforced by the first, 'elastic ring' term of (7), which corresponds to parabolic pits in the energy landscape. Feasibility is enforced by the second, 'mapping' term, corresponding to pits whose width and depth depend on T. Initially, the energy landscape appears to shelve slightly and is lowest in regions with high city density. On lowering the temperature a little, the mapping term becomes more important: it creates steeper pits around cities. By this, the elastic net starts to stretch out. We next consider two potential, nearly final states of a problem instance with 5 permanently fixed cities and 5 variable elastic net points, of which 4 are temporarily fixed. The energy landscape of the remaining elastic net point is displayed. Figure 1 shows the case where 4 of the 5 cities have already caught an elastic net point.
Figure 1  The energy landscape, a non-feasible state.
Figure 2  The energy landscape, an almost feasible state.
The landscape of the 5-th ring point exhibits a large pit situated above the only non-visited city. If the point is not too far away from the non-visited city, it can still be caught by it. This demonstrates that too rapid a lowering of the temperature may lead to a non-valid solution. In figure 2, an almost feasible final solution is shown, where 3 net points coincide with 3 cities. A 4-th elastic net point lies precisely in the middle between the two close cities. Now, the mapping term only produces some small pits. The elastic net term has become perceptible too. Hence, the remaining elastic net point is most probably forced to the middle of its neighbours, making the final state more or less equidistant, but not feasible! Thus, it is possible to end up in a non-feasible solution if (a) the parameter T is lowered too rapidly¹ or if (b) two close cities have caught the same net point.
4 Alternative Elastic Net Algorithms
In order to use a correct distance measure and at the same time, to get rid of the
equidistance property, we adopt a linear distance measure in (7):
F_{lin}(x) = \alpha_2 \sum_i |x^{i+1} - x^i| - T \sum_p \ln \sum_i \exp(-\beta\,|x_p - x^i|^2).   (15)
Applying gradient descent, the corresponding motion equations are found [3]. A straightforward analysis shows that, as in the original ENA, the elastic net forces try to push elastic net points onto a straight line. There is, however, an important difference: once a net point is situated at any point on the straight line between its neighbouring net points, it no longer feels an elastic net force. Equidistance is no longer pursued and the net points have more freedom to move towards cities. We therefore conjecture that the NENA will find feasible solutions more easily. Since the elastic net forces are normalized by the new algorithm, a tuning problem arises. To solve this problem, all elastic net forces are multiplied by the same factor. The final updating rule becomes:
\Delta x^i = \frac{\gamma}{m} \sum_j |x^{j+1} - x^j| \left( \frac{x^{i+1} - x^i}{|x^{i+1} - x^i|} + \frac{x^{i-1} - x^i}{|x^{i-1} - x^i|} \right) + \alpha_1 \sum_p \Lambda_p(i)\,(x_p - x^i).
Finally, we merged the ENA and the NENA into a hybrid one (HENA): the al-
gorithm starts using ENA (to get a balanced stretching out) and, after a certain
number of steps, switches to NENA (to try to guarantee feasibility).
¹ In optimal annealing [1], the temperature is decreased carefully in order to escape from local minima. Instead, here this is done to end up in a local (i.e., a constrained) minimum.
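To make the contrast concrete, here is a minimal sketch of a single ENA/NENA update step, using the symbols of the reconstructed equations (8) and (15) above; the vectorised form, parameter values, and annealing schedule are assumptions of this sketch, not the authors' implementation:

```python
import numpy as np

def elastic_net_step(x, cities, beta, gamma=0.1, alpha1=0.2, linear=False):
    """One update of the net ring x (m x 2) toward cities (n x 2).

    linear=False gives the quadratic ENA ring force of (8);
    linear=True gives the normalised NENA force, with all elastic
    forces scaled by a common factor (the mean segment length).
    """
    # Mapping term: soft assignment Lambda_p(i) of each city to net points.
    d2 = ((cities[:, None, :] - x[None, :, :]) ** 2).sum(-1)     # (n, m)
    w = np.exp(-beta * d2)
    w /= w.sum(axis=1, keepdims=True)
    map_force = alpha1 * (w[:, :, None] * (cities[:, None, :] - x[None, :, :])).sum(0)

    nxt, prv = np.roll(x, -1, axis=0), np.roll(x, 1, axis=0)
    if not linear:
        ring_force = gamma * (nxt - 2 * x + prv)                 # ENA
    else:
        eps = 1e-12                                              # guard zero-length segments
        scale = gamma * np.linalg.norm(nxt - x, axis=1).mean()   # common factor
        u1 = (nxt - x) / (np.linalg.norm(nxt - x, axis=1)[:, None] + eps)
        u2 = (prv - x) / (np.linalg.norm(prv - x, axis=1)[:, None] + eps)
        ring_force = scale * (u1 + u2)                           # NENA
    return x + ring_force + map_force

# Usage: anneal T downwards (beta = 1/T), as in the experiments of section 5.
rng = np.random.default_rng(0)
cities = rng.uniform(0, 1, (5, 2))
angles = np.linspace(0, 2 * np.pi, 10, endpoint=False)
ring = 0.5 + 0.1 * np.c_[np.cos(angles), np.sin(angles)]
for T in np.geomspace(0.2, 0.01, 50):
    for _ in range(10):
        ring = elastic_net_step(ring, cities, beta=1.0 / T)
```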
5 Experiments
We started with the 5-city configuration of section 3. Using 5 up to 12 elastic net points, the ENA produced only non-feasible solutions. Using 15 elastic net points, the optimal feasible solution was always found. Using 5 elastic net points, the NENA occasionally produced the optimal solution. A gradual increase of the number of elastic net points results in a rise of the percentage of optimal solutions found. Using only 10 elastic net points, we obtained a 100% score. Testing a 15-city problem, we had similar experiences. However, the picture started to change with 30-city problem instances. As a rule, both algorithms are equally disposed to find a valid solution, but the quality of the solutions of the original ENA is generally better. Trying even larger problem instances, the NENA more frequently found a non-valid solution: inspection shows a strong lumping effect of net points around cities, and sometimes a certain city is completely left out. At this point, the hybrid approach of HENA comes to mind. Up to 100 cities, however, we were unable to find parameters which yield better solutions than the original ENA.
6 Conclusions and Outlook
Elastic neural networks are dynamic penalty methods, and therefore always have a tuning problem. Contrary to simulated annealing, the network should end up in a local, constrained minimum. Trying the ENA, it may come up with a non-valid solution if two cities are close to each other. To guarantee feasibility more easily, we implemented a new algorithm with a linear distance measure. Its success is limited to small problem instances, showing that the quadratic distance measure is an essential ingredient of the original ENA. Trying a hybrid algorithm, we did not find parameters which yield a better performance. In future research, an alternative to HENA can be considered by realizing a gradual switch from the ENA to the NENA. Likewise, other formulations of penalty terms can be tested.
REFERENCES
[1] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines, A Stochastic Approach to Combinatorial Optimization and Neural Computing, John Wiley & Sons (1989).
[2] R. Durbin and D. Willshaw, An Analogue Approach to the Travelling Salesman Problem Using an Elastic Net Method, Nature, Vol. 326 (1987), pp689-691.
[3] J.H. Geselschap, Een Verbeterd 'Elastic Net' Algoritme (An Improved Elastic Net Algorithm), Master's thesis, Erasmus University Rotterdam, Comp. Sc. Dept. (1994).
[4] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley (1991).
[5] J.J. Hopfield, Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proceedings of the National Academy of Sciences, USA Vol. 79 (1982), pp2554-2558.
[6] G. Parisi, Statistical Field Theory, Addison-Wesley (1988).
[7] C. Peterson and B. Soderberg, Artificial Neural Networks and Combinatorial Optimization Problems, to appear in: Local Search in Combinatorial Optimization, E.H.L. Aarts and J.K. Lenstra eds., John Wiley & Sons.
[8] P.D. Simic, Statistical Mechanics as the Underlying Theory of 'Elastic' and 'Neural' Optimisations, Network, Vol. 1 (1990), pp88-103.
[9] J. van den Berg and J.C. Bioch, Constrained Optimization with the Hopfield-Lagrange Model, in: Proceedings of the 14th IMACS World Congress (1994), pp470-473.
[10] J. van den Berg and J.C. Bioch, On the (Free) Energy of Stochastic and Continuous Hopfield Neural Networks, in: Neural Networks: The Statistical Mechanics Perspective, J.-H. Oh, C. Kwon, S. Cho eds., World Scientific (1995), pp233-244.
[11] J. van den Berg and J.C. Bioch, Some Theorems Concerning the Free Energy of (Un)Constrained Stochastic Hopfield Neural Networks, in: Lecture Notes in Artificial Intelligence 904, EuroCOLT'95 (1995), pp298-312.
[12] J. van den Berg and J.H. Geselschap, An analysis of various elastic net algorithms, Technical Report EUR-CS-95-06, Erasmus University Rotterdam, Comp. Sc. Dept. (1995).
[13] A.L. Yuille, Generalized Deformable Models, Statistical Physics, and Matching Problems, Neural Computation, Vol. 2 (1990), pp1-24.
UNIMODAL LOADING PROBLEMS
Monica Bianchini, Stefano Fanelli*,
Marco Gori and Marco Protasi*
Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze,
Via di Santa Marta 3, 50139 Firenze, Italy.
Email: [email protected]
* Dipartimento di Matematica, Università di Roma "Tor Vergata",
Via della Ricerca Scientifica, 00133 Roma, Italy.
Email: [email protected]
This paper deals with optimal learning and provides a unified viewpoint of the most significant results in the field. The focus is on the problem of local minima in the cost function, which is likely to affect more or less any learning algorithm. We give some intriguing links between optimal learning and the computational complexity of loading problems. We exhibit a computational model in which the solution of all loading problems giving rise to unimodal error functions requires the same time, thus suggesting that they belong to the same computational class.
Keywords: Backpropagation, computational complexity, optimal learning, premature saturation,
spurious and structural local minima, terminal attractor.
1 Learning as Optimisation
Supervised learning in multilayered networks (MLNs) can be accomplished thanks
to Backpropagation (BP), which is used to minimise pattern misclassifications by
means of gradient descent for a particular nonlinear least squares fitting problem.
Unfortunately, BP is likely to be trapped in local minima and indeed many examples
of local extremes have been reported in the literature.
The presence of local minima derives essentially from two different reasons. First, they may arise because of an unsuitable joint choice of the functions which define the network dynamics and the error function. Second, local minima may be inher-
ently related to the structure of the problem at hand. In [5], these two cases have
been referred to as spurious and structural local minima, respectively. Problems of
sub-optimal solutions may also arise when learning with high initial weights, as a
sort of premature neuron saturation arises, which is strictly related to the neuron
fan-in. An interesting way of facing this problem is to use the "relative cross-entropy metric" [10], for which the erroneous saturation of the output neurons does not lead
to plateaux, but to very high values of the cost. When using the cross-entropy met-
ric, the repulsion from such configurations is much more effective, and underflow
errors are likely to be avoided.
There have also been attempts to provide theoretical conditions aimed at guarantee-
ing local minima free error surfaces. So far, however, only some sufficient conditions
have been identified that give rise to unimodal error surfaces. Examples are the case of pyramidal networks [8], commonly used in pattern recognition, radial ba-
sis function networks [2], and non-linear autoassociators [3]. The identification of
similar conditions ensures global optimisation just by using simple gradient descent.
Instead of looking for local algorithms like gradient descent, techniques that guar-
antee global optimisation may be explored. Of course, one of the main problems
to face is that most interesting tasks give rise to the optimisation of functions
with even several thousand variables. This makes it very unlikely that most classic
approaches [11] can be directly and successfully applied. Instead, the proposal of
successful algorithms has to face effectively the curse of dimensionality typical of
most interesting practical problems.
Statistical training methods have been previously proposed in order to alleviate the
local convergence problem. These methods introduce noise to connection weights
during training, but suffer from extremely slow convergence due to their probabilis-
tic nature.
Several numerical algorithms for global optimisation have also been presented, in
which BP is revisited from the viewpoint of dynamical system theory. Barhen et
al. [1] have proposed the TRUST algorithm (for Terminal Repeller Unconstrained
Subenergy Tunneling) that formulates global optimisation as the solution of a sys-
tem of deterministic differential equations, where E(W) is the function to be op-
timised, while the connection weights are the states of the system. The dynamics
used is achieved upon application of the gradient descent to a modified cost which
transforms each encountered local minimum into a maximum, so that the gradi-
ent descent can escape from it to a lower valley. A related algorithm, called Magic
Hair-Brushing, has been proposed in [6]. The system dynamics is now modified
through a deformation of the gradient field for eliminating the local minima, while
preserving the global structure of the function. All these algorithms exhibit a good
performance in many practical cases but, unfortunately, their optimal convergence
is not formally guaranteed, unless starting from a "good" initial point.
2 The Class of Unimodal Loading Problems
Most experiments with multilayer perceptrons and BP are performed in a sort of magic atmosphere where data are properly supplied to the network, which begins learning without knowing whether or not the experiment will be successful in terms of either optimal convergence or generalisation. A trial and error scheme is usually employed, aimed at adjusting the architecture in subsequent experiments so as to meet the desired requirements. To some extent, this way of performing experiments is inherently plagued by the suspicion that the numerical optimisation algorithm used might fail. Moreover, though optimal learning may be attained with networks having a growing number of hidden neurons [14], the generalisation to new examples is not guaranteed. The intuitive feeling that, in order to obtain a good convergence behaviour, generalisation must be sacrificed, may be effectively formalised in a sort of "uncertainty principle of learning" in which the variables representing optimal convergence and generalisation are like conjugate variables in Quantum Mechanics [7]. These potential sources of failure of learning algorithms give rise to a sort of suspiciousness that turns out to be the unpleasant companion of every experiment. This seems to be intertwined with the ambitious task of learning too general functions.
Let us focus on the complexity issues related to the loading of the weights indepen-
dently of the consequent generalisation to new examples. This makes sense once a consistent formulation of the learning problem, in terms of both the chosen examples and the neural architecture, is provided. We address the problem of establishing
the computational requirements of special cases in which the loading of the weights
can be expressed in terms of optimisation of unimodal error functions.
2.1 Canonical Form of Gradient Descent Learning
Let us consider the following learning equation:
\frac{dW}{dt} = -\gamma \nabla_W E = f(t, W),   (1)

where E(W) is the cost function and W ∈ ℝ^m is the weight vector. Let us choose γ ≡ Ψ(E)/‖∇_W E‖², Ψ being a non-negative continuous function of E. Based on this choice of the learning rate, the dynamics of the error function becomes

\frac{dE}{dt} = (\nabla_W E)^T \frac{dW}{dt} = -(\nabla_W E)^T \frac{\Psi(E)}{\|\nabla_W E\|^2} \nabla_W E = -\Psi(E),   (2)

which makes the cost function continuously decrease to zero. Those configurations for which ∇_W E = 0 are singular points that attract the learning trajectory [4].
Special cases of this reduction to a canonical structure, where the learning is forced by the function Ψ and is independent of the problem at hand, have been explored in the literature. White [13] has suggested introducing a varying learning rate so that the error dynamics evolves following the equation dE/dt = -Ψ(E) ≡ -αE, α > 0, whose solution is a decaying exponential, so that reaching the E = 0 attractor will theoretically take infinite time. In practice, this may not necessarily be a problem, as the attractor may be approached sufficiently closely in a reasonable amount of time, even if, for ill-conditioned systems, it can still be prohibitive to reach a satisfactory solution. Unfortunately, feedforward neural networks do often result in dynamical systems that are ill-conditioned or mathematically stiff, and thus the convergence is generally very slow.
In [12] the canonical reduction of equation (2) is based on choosing Ψ(E) ≡ E^k, 0 < k < 1, which leads to an error dynamics based on the differential equation dE/dt = -E^k, having a singularity at E = 0 violating the Lipschitz condition. If E_0 ≥ 0 is the initial value of the cost, then the closed form solution is

E(t) = \left( E_0^{1-k} - (1-k)t \right)^{\frac{1}{1-k}}, \quad t \leq t_e, \quad \text{where } t_e = \frac{E_0^{1-k}}{1-k}

(Fig. 1a). In the finite time t_e the transient beginning from E(0) = E_0 reaches the equilibrium point E = 0, which is a "terminal attractor."

In this paper, we are interested in finding terminal attractors and, particularly, in minimising the time t_e required to approach the optimal solution. The choice Ψ(E) ≡ η fulfils our needs. Consequently t_e = E_0/η and, in particular, when selecting η = E_0/σ, the terminal attractor is approached for t_e = σ (Fig. 1b), independently of the problem at hand, while the corresponding weight updating equation becomes

\frac{dW}{dt} = -\frac{E_0}{\sigma} \frac{\nabla E}{\|\nabla E\|^2}.   (3)
This way of forcing the dynamics leads to intriguing links between the concept of unimodal problems and their computational complexity. In fact, learning attractors in finite time is not only useful from a numerical point of view but, in the light of the considerations on the canonical equation (2), is interesting for the relationships that can be established between different problems giving rise to local minima free cost functions.
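A minimal sketch of equation (3) under these assumptions (Euler integration and a toy unimodal cost; not the authors' implementation):

```python
import numpy as np

def terminal_attractor_descent(E, grad_E, w0, sigma=1.0, n_steps=1000):
    """Euler integration of equation (3): dW/dt = -(E0/sigma) grad E / ||grad E||^2.

    With Psi(E) = eta = E0/sigma the error decays linearly, dE/dt = -eta,
    so E reaches 0 at the terminal time t_e = sigma -- assuming the error
    surface is unimodal, so the trajectory is not trapped beforehand.
    """
    w = np.asarray(w0, dtype=float)
    e0 = E(w)
    dt = sigma / n_steps
    for _ in range(n_steps):
        g = grad_E(w)
        g2 = float(g @ g)
        # stop within one Euler increment of E = 0, or at a stationary point
        if E(w) <= e0 * dt / sigma or g2 < 1e-20:
            break
        w -= dt * (e0 / sigma) * g / g2
    return w

# Usage on the unimodal cost E(w) = ||w||^2:
w_star = terminal_attractor_descent(lambda w: float(w @ w),
                                    lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)   # close to the global minimum at the origin
```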
2.2 Computational analyses
Let us introduce the following classes of loading problems [9].
Figure 1  Terminal attraction using (a) Ψ(E) ≡ E^k, 0 < k < 1, and (b) Ψ(E) ≡ η.
Because of the previous analysis on gradient descent the following result holds.

Theorem 3 UP ⊂ FT.

Proof If P ∈ UP then one can always formulate the loading according to differential equation (3) and the gradient descent is guaranteed not to get stuck in local minima. Because of equation (2) the learning process ends for t_e = σ. Hence, ∀T > 0, if we choose σ ≤ T we conclude that P ∈ FT. □
This theoretical result should not be overvalued, since it is based on a computational model that does not take into account problems due to limited energy. When choosing T arbitrarily small, the slope of the energy in Fig. 1b in fact goes to infinity.

Figure 2  The class of unimodal loading problems can be learned in constant time.
One may wonder whether problems can be found in FT that are not in UP (see
Fig. 2). This does not seem easy to establish and is an open problem that we think
deserves further attention.
3 Conclusions
The presence of local minima does not necessarily imply that a learning algorithm
will fail to discover an optimal solution, but we can think of their presence as a
boundary beyond which troubles for any learning technique are likely to begin.
In this paper we have proposed a brief review of results dealing with optimal learning, and we have discussed problems of sub-optimal learning. Most importantly,
when referring to a continuous computational model, we have shown that there are
some intriguing links between computational complexity and the absence of local
minima. Basically all loading problems that can be formulated as the optimisation
of unimodal functions are proven to belong to a unique computational class. Note
that this class is defined on the basis of computational requirements and, therefore,
seems to be of interest independently of the neural network context in which it has
been formulated.
We are confident that these theoretical results open the door to more thorough analyses involving discrete computations, which could shed light on the computational complexity based on ordinary models of Computer Science.
REFERENCES
[1] J. Barhen, J. W. Burdick, and B. C. Cetin, Terminal Repeller Unconstrained Subenergy
Tunneling (TRUST) for fast global optimization, Journal of Optimization Theory and Appli-
cations, Vol.77 (1993), pp97-126.
[2] M. Bianchini, P. Frasconi, and M. Gori, Learning without local minima in radial basis function
networks, IEEE Transactions on Neural Networks, Vol. 6, (1995), pp749-756.
[3] M. Bianchini, P. Frasconi, and M. Gori, Learning in multilayered networks used as autoasso-
ciators, IEEE Transactions on Neural Networks, Vol. 6, (1995), pp512-515.
[4] M. Bianchini, M. Gori, and M. Maggini, Does terminal attractor backpropagation guaran-
tee global optimization?, in International Conference on Artificial Neural Networks, Springer-
Verlag, (1994), pp377-380.
[5] M. Bianchini and M. Gori, Optimal learning in artificial neural networks: A review of theoretical results, Neurocomputing, Vol. 13 (October 1996), No. 5, pp313-346.
[6] J. Chao, W. Ratanasuwan, and S. Tsujii, How to find global minima in finite times of search for multilayer perceptrons training, in International Joint Conference on Neural Networks, IEEE Press, Singapore, (1991), pp1079-1083.
[7] P. Frasconi and M. Gori, Multilayered networks and the C-G uncertainty principle, in SPIE International Conference, Science of Artificial Neural Networks, Orlando, Florida, 1993, pp396-401.
[8] M. Gori and A. Tesi, On the problem of local minima in backpropagation, Transactions on Pattern Analysis and Machine Intelligence, Vol. 14 (1992), pp76-86.
[9] J. S. Judd, Neural Network Design and the Complexity of Learning, Cambridge (1990), London: The MIT Press.
[10] T. Samad, Backpropagation improvements based on heuristic arguments, in International Joint Conference on Neural Networks, IEEE Press, Washington DC, (1990), pp565-568.
[11] Törn and Žilinskas, Global Optimization, Lecture Notes in Computer Sciences, (1987).
[12] S. Wang and C. H. Hsu, Terminal attractor learning algorithms for backpropagation neural networks, in International Joint Conference on Neural Networks, IEEE Press, Singapore, 1991, pp183-189.
[13] H. White, The learning rate in backpropagation systems: an application of Newton's method, in International Joint Conference on Neural Networks, IEEE Press, Singapore, 1991, pp679-684.
[14] X. Yu, Can backpropagation error surface not have local minima?, IEEE Transactions on Neural Networks, Vol. 3 (1992), pp1019-1020.
Acknowledgements
We thank P. Frasconi, M. Maggini, F. Scarselli, and F. Schoen for fruitful discussions
and suggestions.
ON THE USE OF SIMPLE CLASSIFIERS FOR THE
INITIALISATION OF ONE-HIDDEN-LAYER NEURAL
NETS
Jan C. Bioch, Robert Carsouw and Rob Potharst
Department of Computer Science, Erasmus University Rotterdam,
The Netherlands. Email: [email protected]
Linear decision tree classifiers and LVQ-networks divide the input space into convex regions that
can be represented by membership functions. These functions are then used to determine the
weights of the first layer of a feedforward network.
Subject classification: AMS(MOS) 68T05, 92B20
Keywords: decision tree, feedforward neural network, LVQ-network, classification, initialisation
1 Introduction
In this paper we mainly discuss the mapping of a linear tree classifier (LTC) onto a
feedforward neural net classifier (NNC) with one hidden layer. According to Park
[9] such a mapping results in a faster convergence of the neural net and in avoiding
local minima in network training. In general these mappings are also interesting
because they determine an appropriate architecture of the neural net. The LTC
used here is a hierarchical classifier that employs linear functions at each node
in the tree. For the construction of decision trees we refer to [10, 5, 12]. Several
authors [11, 4, 9] discuss the mapping of an LTC onto a feedforward net with one
or two hidden layers, see also [3, 2]. A discussion of a mapping onto a net with two
hidden layers can be found in Sethi [11] and Ivanova & Kubat [4]. A mapping onto a
net with one hidden layer is discussed in Park [9]. In his approach the mapping is
based on representing the convex regions induced by an LTC by linear membership
functions. However, in Park [9] no explicit expression for the coefficients of the
membership functions is given. These coefficients depend on a parameter p that
in his paper has to be supplied by the user. In section 2 we show that in general
it is not possible to find linear membership functions that represent the convex
regions induced by an LTC. It is however possible to find subregions that can be
represented by linear membership functions. We derive explicit expressions for the aforementioned parameter p in section 3. This makes it possible to control the
approximation of the convex regions by membership functions and therefore of
the initialisation of the neural net. In section 4 we also briefly discuss the use of
LVQ-networks [6, 7] for such an initialisation.
2 Non-existence of Linear Membership Functions
Suppose we are given a multivariate decision tree (N, L, D). In this notation, N is the set of nodes of the tree, L is the set of leaves of the tree and D is the set of linear functions d_k : ℝ^n → ℝ, k ∈ N. In any node k of the tree, the linear function d_k is used to decide which branch to take. Specifically, we go left if d_k(x) > 0, right if d_k(x) ≤ 0, see figure 1. A decision tree induces a partitioning of ℝ^n: each leaf ℓ corresponds with a convex region R_ℓ, which consists of all points x ∈ ℝ^n that get assigned to leaf ℓ by the decision tree. For example, region R_5 consists of all x ∈ ℝ^n with d_1(x) ≤ 0 and d_3(x) ≤ 0.
We will now discuss the idea of using linear membership functions to represent the convex regions induced by an LTC, and we show that such functions in general do not exist.
In [9] the following 'theorem' is given without proof, though supplied with heuristic
reasoning for its plausibility, see also equation 2 in the next section:
Conjecture (Park [9]) For every decision tree (N, L, D) there exists a set of linear membership functions M = {m_ℓ, ℓ ∈ L}, such that for any ℓ, ℓ' ∈ L, with ℓ ≠ ℓ':

m_\ell(x) > m_{\ell'}(x), \quad \forall x \in R_\ell.   (1)
Theorem 1 Let (N, L, D) be a decision tree, with at least two non-parallel decision boundaries. Then the convex regions induced by this tree cannot be represented by a set of quadratic polynomials.

Proof Let R_1, R_2 and R_3 be regions induced by a decision tree such that the regions R_1 and R_2 ∪ R_3 are separated by the hyperplane d_1(x) = 0, x ∈ ℝ^n. We assume that d_1(x) > 0 on R_1 and d_1(x) < 0 on R_2 ∪ R_3. Similarly, R_2 and R_3 are separated by d_2(x) = 0, such that d_2(x) > 0 on R_2 and d_2(x) < 0 on R_3. Note that such regions will always be induced by a subtree of a decision tree, unless all decision boundaries are parallel. Let m_1, m_2 and m_3 respectively denote the membership functions of R_1, R_2 and R_3. By definition m_1 = 0 on the hyperplane d_1(x) = 0. Similarly, m_2 = 0 on d_2(x) = 0. (Note that we actually know only that m_2 is zero on half of the hyperplane d_2(x) = 0. However, using a simple result from algebraic geometry it follows that m_2 must be zero on the whole hyperplane d_2(x) = 0.)

Now, let D_12 = m_1 - m_2. Then D_12 is zero on d_1(x) = 0, because D_12 > 0 on R_1 and D_12 < 0 on R_2 ∪ R_3. As a consequence of Hilbert's Nullstellensatz, d_1 is a factor of D_12. Therefore, there exists a polynomial function e such that D_12 = d_1 e. Since D_12 is at most quadratic by assumption, we conclude that e is a constant or a linear function. However, since both d_1 e and d_1 are positive on R_1 and negative on R_2, the function e is positive on R_1 ∪ R_2. Since the degree of e is ≤ 1, e must be a positive constant. Similarly, we have D_13 = d_1 f, where f is a positive constant. Therefore D_23 = d_1(f - e). This contradicts the fact that D_23 is zero on d_2(x) = 0. □
Remark In [2] it is shown that, under the conditions of Theorem 1, the membership functions m_ℓ can be represented by multivariate polynomials, albeit of degree ≤ 5.
3 An Approximated Mapping of a Decision Tree onto a One-hidden-layer Neural Network
We will show in this section that the difficulties encountered in the preceding section may be circumvented by requiring that the points we consider are not too close to the hyperplanes associated with the decision tree.

Let (N, L, D) be a decision tree. We will restrict the regions R_ℓ by assuming that ∀k ∈ N : 0 < ε ≤ |d_k(x)| ≤ E, where ε and E are positive constants. The set of points in R_ℓ satisfying this condition will be denoted by S_ℓ. Hence, S_ℓ is a convex subregion of R_ℓ. Note also that R_ℓ can be approximated by S_ℓ with arbitrary precision, by varying the constants ε and E.

In [9] Park considers the following set of linear membership functions:

m_\ell(x) = \sum_{k \in P_\ell} s_{\ell k}\, c_k\, d_k(x),   (2)

where P_ℓ is the set of nodes on the path from the root to the leaf ℓ. The constants s_{ℓk} are defined such that:

s_{\ell k} = +1 \text{ if } \ell \text{ lies in the left subtree of node } k \text{ (where } d_k > 0\text{), and } s_{\ell k} = -1 \text{ otherwise, so that } s_{\ell k}\, d_k(x) > 0 \text{ for } x \in R_\ell.   (3)
The constants c_k > 0 are determined experimentally in [9]. Here we will derive an explicit expression for these constants. Since, as we have shown above, in general these constants cannot exist if x ∈ R_ℓ, ℓ ∈ L, we will now assume that x ∈ S_ℓ.

Theorem 2 Let (N, L, D) be a decision tree. Then there exists a set of linear functions M = {m_ℓ, ℓ ∈ L}, such that for any ℓ, ℓ' ∈ L, with ℓ ≠ ℓ':

\forall x \in S_\ell : m_\ell(x) > m_{\ell'}(x).   (4)

Proof Let T be the set of terminal nodes of the tree. An internal node t is called terminal if both children of t are leaves. Further, if n_1 and n_2 are two nodes, then we write n_1 < n_2 if n_1 is an ancestor of n_2. Suppose that n_a ∉ T, and ℓ, ℓ' are two leaves such that n_a is the last common ancestor of ℓ and ℓ'.

We decompose the function m_{ℓ'} as follows: m_{ℓ'}(x) = Σ_{n_i < n_a} s_{ℓi} c_i d_i(x) + Σ_{n_j ≥ n_a} s_{ℓ'j} c_j d_j(x), where n_i ∈ P_ℓ and n_j ∈ P_{ℓ'}. By applying (3) to the node n_a, it can be seen that s_{ℓa} = -s_{ℓ'a}. From this we conclude m_ℓ(x) - m_{ℓ'}(x) = 2 s_{ℓa} c_a d_a(x) + E_1 - E_2, where E_1 = Σ_{n_i > n_a} s_{ℓi} c_i d_i(x), with n_i ∈ P_ℓ, and E_2 = Σ_{n_j > n_a} s_{ℓ'j} c_j d_j(x), with n_j ∈ P_{ℓ'}. To assure that (4) holds, we require that (E_2 - E_1)/(2 s_{ℓa} d_a(x)) < c_a. Since E_1 is positive we can satisfy the last condition by taking E_2/(2 s_{ℓa} d_a(x)) < c_a, or (E/2ε) Σ_{n_j > n_a} c_j ≤ c_a. This yields the following sufficient condition for the constants c_j:

\frac{E}{2\varepsilon} \max_{\ell \in L_a} \Big\{ \sum_{n_j > n_a} c_j \Big\} \leq c_a,   (5)

where n_j ∈ P_ℓ and L_a is the set of leaves of the subtree with root n_a. Condition (5) implies: c_a ≥ (E/2ε) Σ_{n_j > n_a} c_j, where n_j ∈ P and P is the longest possible path from node n_a to some leaf. From condition (5) it also follows that the constants c_a are determined only up to a constant factor. It is easy to see that the constants can be determined recursively by choosing positive values c_t for the set of terminal nodes t ∈ T. □
Proof Using the formula for the partial sum of a geometric series this result can be obtained straightforwardly, see [2]. □
Another expression for the constants c_a can be obtained by using the fact that a decision tree (N, L, D) in practical situations is derived from a finite set of examples V ⊂ ℝ^n. See [2].
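A small sketch of the membership functions (2) on a hypothetical two-node tree, with signs s_{ℓk} as in (3) and constants c_k chosen to respect a condition of type (5); the tree layout and values are illustrative assumptions:

```python
import numpy as np

# Decision functions d_k are linear: d_k(x) = a_k @ x + b_k.
# Tree: node 0 separates leaf L0 (left) from node 1; node 1 separates L1 / L2.
a = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
b = {0: 0.0, 1: 0.0}
d = lambda k, x: a[k] @ x + b[k]

# Paths from root to leaf with signs s_{lk}: +1 on the left branch (d_k > 0),
# -1 on the right, following (3).
paths = {"L0": [(0, +1)], "L1": [(0, -1), (1, +1)], "L2": [(0, -1), (1, -1)]}

# Constants c_k determined recursively: the terminal node gets c_1 = 1 and
# the root a value large enough to satisfy a condition of type (5).
c = {1: 1.0, 0: 2.0}

def m(leaf, x):
    """Membership function (2): sum of s_{lk} * c_k * d_k(x) along the path."""
    return sum(s * c[k] * d(k, x) for k, s in paths[leaf])

x = np.array([-0.5, 0.8])                   # lies in region L1: d0 < 0, d1 > 0
print(max(paths, key=lambda l: m(l, x)))    # -> "L1"
```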
4 Another Initialisation Method
In this section we consider another well-known classifier: Learning Vector Quantisation (LVQ) networks [6, 7]. This method can be used to solve a classification problem with m classes and data-vectors x ∈ ℝ^n. It is known that the LVQ-network induces a so-called Voronoi tesselation of the input space, see [6] chapter 9. Training of an LVQ-network yields prototype vectors w_j ∈ ℝ^n, j = 1, ..., m, such that an input vector x belongs to class j iff the distance to w_j is smallest:

\forall i \neq j : \|w_j - x\| \leq \|w_i - x\| \Rightarrow x \in R_j.

It is easy to see that this is equivalent with

w_j^T x - \tfrac{1}{2} w_j^T w_j \geq w_i^T x - \tfrac{1}{2} w_i^T w_i.

Now define the linear membership function m_i as:

m_i(x) = w_i^T x - \tfrac{1}{2} w_i^T w_i.

Then

x \in R_i \Leftrightarrow m_i(x) \geq m_j(x), \quad \forall j \neq i.

Since an LVQ-network is a good classifier and can be trained relatively fast, it is a good alternative for the initialisation of a neural net using linear membership functions. In [2] we show that an LVQ-network cannot induce the convex regions induced by an LTC.
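A minimal sketch of this construction (the array layout and example prototypes are assumptions):

```python
import numpy as np

def lvq_membership_weights(prototypes):
    """Map LVQ prototype vectors w_j to the linear membership functions
    m_j(x) = w_j^T x - 0.5 * w_j^T w_j, returned as (weights, biases)
    that can initialise the first layer of a feedforward net."""
    W = np.asarray(prototypes, dtype=float)
    b = -0.5 * (W * W).sum(axis=1)          # -0.5 * ||w_j||^2
    return W, b

def classify(x, W, b):
    """x belongs to class i iff m_i(x) is largest (nearest prototype)."""
    return int(np.argmax(W @ np.asarray(x, dtype=float) + b))

# Example: three prototypes in the plane.
W, b = lvq_membership_weights([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(classify([1.8, 0.1], W, b))   # nearest prototype: class 1
```

The design point here is that maximising w_j^T x - 0.5 ||w_j||^2 is equivalent to minimising ||x - w_j||^2, so the Voronoi rule becomes linear in x and can be absorbed into the first-layer weights and biases.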
Discussion: We have proven that linear (or quadratic) membership functions rep-
resenting the convex regions of a linear decision tree in general do not exist. How-
ever, we give explicit formulae for the approximation of such functions. This allows
us to control the degree of approximation. Currently, we are investigating how to
determine an appropriate approximation in a given application. Furthermore, we
discussed the use of another simple classifier, namely an LVQ-network for initialis-
ing neural nets, see also [2].
REFERENCES
[1] J.C. Bioch, M. van Dijk and W. Verbeke, Neural Networks: New Tools for Data Analysis?,
in: M. Taylor and P. Lisboa (eds.) Proceedings of Neural Networks Applications and Tools,
IEEE Computer Society Press, pp28-38 (1994).
[2] J.C. Bioch, R. Carsouw and R. Potharst, On the Use of Simple Classifiers for the Initiali-
sation of One-Hidden-Layer Neural Nets, Technical Report eur-cs-95-08, Dept. of Computer
Science, Erasmus University Rotterdam (1995).
[3] R. Carsouw, Learning to Classify: Classification by neural nets based on decision trees, Mas-
terthesis (in Dutch), Dept. of Computer Science, Erasmus University Rotterdam, February
(1995).
[4] I. Ivanova, M. Kubat and G. Pfurtscheller, The System TBNN for Learning of 'Difficult'
Concepts, in: J.C. Bioch and S.H. Nienhuys-Cheng (eds), Proceedings of Benelearn94, Tech.
Rep. eur-09-94, Dept. of Computer Science, Erasmus University Rotterdam, pp230-241 (1994).
[5] U.M. Fayyad and K.B. Irani, On the Handling of Continuous-Valued Attributes in Decision
Tree Generation, Machine Learning, Vol. 8 (1992), pp88-102.
[6] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the theory of neural computation, Addison-
Wesley (1991).
[7] T. Kohonen, Self-Organization and Associative Memory, Berlin: Springer Verlag, 3rd edition
(1989).
[8] M.L. Minsky, S.A. Papert, Perceptrons, 2nd edition, MIT Press (1988).
[9] Y. Park, A Mapping From Linear Tree Classifiers to Neural Net Classifiers, Proceedings of
IEEE ICNN, Vol. I (1994), pp94-100, Orlando, Florida.
[10] Y. Park and J. Sklansky, Automated Design of Linear Tree Classifiers, Pattern Recognition,
Vol. 23, No. 12 (1990), pp1393-1412.
[11] I.K. Sethi, Entropy Nets: From Decision Trees to Neural Networks, Proceedings of the
IEEE, Vol. 78, No. 10 (1990), pp1606-1613.
[12] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann (1993).
MODELLING CONDITIONAL PROBABILITY
DISTRIBUTIONS FOR PERIODIC VARIABLES
Christopher M Bishop and Ian T Nabney
Neural Computing Research Group, Aston University,
Birmingham B4 7ET, UK. Email: [email protected]
Most conventional techniques for estimating conditional probability densities are inappropriate for
applications involving periodic variables. In this paper we introduce three related techniques for
tackling such problems, and test them using synthetic data. We then apply them to the problem
of extracting the distribution of wind vector directions from radar scatterometer data.
1 Introduction
Many applications of neural networks can be formulated in terms of a multi-variate
non-linear mapping from an input vector x to a target vector t: a conventional
network approximates the regression (i.e. average) of t given x. But for mappings
which are multi-valued, the average of two solutions is not necessarily a valid solu-
tion. This problem can be resolved by estimating the conditional probability p( t Ix).
In this paper, we consider techniques for modelling the conditional distribution of
a periodic variable.
2 Density Estimation for Periodic Variables
A commonly used technique for unconditional density estimation is based on mixture models of the form

p(t) = \sum_{i=1}^{m} \alpha_i\, \phi_i(t)   (1)

where α_i are called mixing coefficients, and the component functions, or kernels, φ_i(t) are typically chosen to be Gaussians [7, 9]. In order to turn this into a model for conditional density estimation, we simply make the mixing coefficients and the adaptive parameters of the kernels into functions of the input vector x using a neural network which takes x as input [4, 1, 5]. We propose three methods for modelling the conditional density.
2.1 Mixtures of Wrapped Normal Densities
The first technique transforms χ ∈ ℝ to the periodic variable θ ∈ [0, 2π) by θ = χ mod 2π. The transformation maps density functions p with domain ℝ into density functions p̃ with domain [0, 2π) as follows:

\tilde p(\theta \mid x) = \sum_{N=-\infty}^{\infty} p(\theta + 2\pi N \mid x).   (2)

This periodic function is normalized on the interval [0, 2π), since

\int_0^{2\pi} \tilde p(\theta \mid x)\, d\theta = \int_{-\infty}^{\infty} p(\chi \mid x)\, d\chi = 1.   (3)

Here we shall restrict attention to Gaussian φ:

\phi(t \mid x) = \frac{1}{(2\pi)^{c/2}\, \sigma_i(x)^c} \exp\left\{ -\frac{\|t - \mu_i(x)\|^2}{2\sigma_i(x)^2} \right\},   (4)

where t ∈ ℝ^c.
A mixture model with kernels as in equation (4) can approximate any density
function to arbitrary accuracy with suitable choice of parameters [7, 9]. We use a
standard multi-layer perceptron with a single hidden layer of sigmoidal units and an output layer of linear units. It is necessary that the mixing coefficients α_i(x) satisfy the constraints

\sum_{i=1}^{m} \alpha_i(x) = 1, \qquad 0 \leq \alpha_i(x) \leq 1.   (5)

This can be achieved by choosing the α_i(x) to be related to the corresponding network outputs by a normalized exponential, or softmax function [4]. The centres μ_i of the kernel functions are represented directly by the network outputs; this was motivated by the corresponding choice of an uninformative Bayesian prior [4]. The standard deviations σ_i(x) of the kernel functions represent scale parameters and so it is convenient to represent them in terms of the exponentials of the corresponding network outputs. This ensures that σ_i(x) > 0 and discourages σ_i(x) from tending to 0. Again, it corresponds to an uninformative prior. To obtain the parameters of the model we minimize an error function E given by the negative logarithm of the likelihood function, using conjugate gradients. (The maximum likelihood approach underestimates the variance of a distribution in regions of low data density [1]. For our application, this effect will be small since the number of data points is large.) In practice, we must restrict the value of N. We have taken the summation over 7 complete periods of 2π. Since the component Gaussians have exponentially decaying tails, this introduces negligible error provided the network is initialized so that the kernels have their centres close to 0.
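A sketch of the wrapped density in one dimension, truncated to 7 periods as described above (the vectorised layout is an assumption):

```python
import numpy as np

def wrapped_normal_pdf(theta, mu, sigma, n_periods=7):
    """Density (2) with a one-dimensional Gaussian kernel (4): the normal
    density on the real line wrapped onto [0, 2*pi), truncating the sum
    over N to n_periods complete periods."""
    Ns = np.arange(-(n_periods // 2), n_periods // 2 + 1)
    z = theta + 2 * np.pi * Ns[:, None] - mu       # (n_periods, len(theta))
    return (np.exp(-0.5 * (z / sigma) ** 2)
            / (np.sqrt(2 * np.pi) * sigma)).sum(axis=0)

# The truncation error is negligible when the centre lies near 0 and sigma
# is small compared with 2*pi; the wrapped density integrates to one:
grid = np.linspace(0.0, 2 * np.pi, 1000)
p = wrapped_normal_pdf(grid, mu=0.3, sigma=0.5)
print((p[:-1] * np.diff(grid)).sum())   # ~= 1.0
```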
2.2 Mixtures of Circular Normal Densities
The second approach is to make the kernels themselves periodic. Consider a velocity vector v in two-dimensional Euclidean space for which the probability distribution p(v) is a symmetric Gaussian. By using the transformation v_x = ‖v‖ cos θ, v_y = ‖v‖ sin θ, we can show that the conditional distribution of the direction θ, given the vector magnitude ‖v‖, is given by

\phi(\theta) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \psi) \}   (6)

which is known as a circular normal or von Mises distribution [6]. The normalization coefficient is expressed in terms of the modified Bessel function, I_0(m), and the parameter m (which depends on ‖v‖) is analogous to the inverse variance parameter in a conventional normal distribution. The parameter ψ gives the peak of the density function. Because of the I_0(m) term, care must be taken in the implementation of the error function to avoid overflow.
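A sketch of the negative log-likelihood for a mixture of circular normal kernels, using the exponentially scaled Bessel function to sidestep the overflow just mentioned (scipy's ive and the vector layout are assumptions of this sketch, not the authors' implementation):

```python
import numpy as np
from scipy.special import ive   # exponentially scaled Bessel: ive(0, m) = I0(m) * exp(-m)

def vonmises_mixture_nll(theta, alpha, psi, m):
    """Negative log-likelihood of angles `theta` under a mixture of circular
    normal kernels (6), with mixing coefficients alpha (softmax outputs),
    centres psi and concentrations m, all length-K vectors or scalars.

    Since I0(m) = ive(0, m) * exp(m), the log kernel is computed stably as
    log phi = m*(cos(d) - 1) - log(2*pi) - log(ive(0, m)).
    """
    theta = np.atleast_1d(theta)
    d = theta[:, None] - np.asarray(psi)[None, :]
    log_phi = m * (np.cos(d) - 1.0) - np.log(2 * np.pi) - np.log(ive(0, m))
    return -np.log(np.exp(log_phi) @ np.asarray(alpha)).sum()

# Example: two kernels; data drawn near the first centre.
print(vonmises_mixture_nll(np.array([0.1, -0.2, 3.0]),
                           alpha=[0.6, 0.4], psi=[0.0, np.pi], m=4.0))
```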
2.3 Expansion in Fixed Kernels
The third technique uses a model consisting of a fixed set of periodic kernels, again
given by circular normal functions as in equation (6). In this case the mixing propor-
tions alone are determined by the outputs of a neural network (through a softmax
activation function) and the centres ψ_i and width parameters m_i are fixed. We
selected a uniform distribution of centres, and mi = m for each kernel, where the
value for m was chosen to give moderate overlap between the component functions.
3 Application to Synthetic Data
We first consider some synthetic test data. It is generated from a mixture of two tri-
angular distributions where the centres and widths are taken to be linear functions
120 CHAPTER 17
Figure 1  (a) Scatter plot of the synthetic training data. (b) Contours of the conditional density p(θ|x) obtained from a mixture of adaptive circular normal functions as described in Section 2.2. (c) The distribution p(θ|x) for x = 0.5 (solid curve) from the adaptive circular normal model, compared with the true distribution (dashed curve) from which the data was generated.
of a single input variable x. The mixing coefficients are fixed at 0.6 and 0.4. Any values of θ which fall outside (-π, π) are mapped back into this range by shifting in multiples of 2π. An example data set generated in this way is shown in Figure 1.
Three independent datasets (for training, validation and testing) were generated,
each containing 1000 data points. Training runs were carried out in which the
number of hidden units and the number of kernels were varied to determine good
values on the validation set. Table 1 gives a summary of the best results obtained,
as determined from the test set. The mixture of adaptive circular normal functions
performed best. Plots of the distribution from the adaptive circular normal model
are shown in Figure 1.
has the lowest error on the validation data: however fewer centres are actually
required to model the conditional density function reasonably well.
5 Discussion
All three methods give reasonable results, with the adaptive-kernel approaches beat-
ing the fixed-kernel technique on synthetic data, and the reverse on the scatterom-
eter data. The two fully adaptive methods give similar results.
Note that there are two structural parameters to select: the number of hidden units
in the network and the number of components in the mixture model. The use of a
larger, fixed network structure, together with regularization to control the effective
model complexity, would probably simplify the process of model order selection.
REFERENCES
[1] C M Bishop, Mixture density networks. Technical Report NCRG/4288, Neural Computing
Research Group, Aston University, U.K. (1994).
[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press (1995).
Figure 2  Plots of the conditional distribution p(θ|x) obtained using all three methods. (a) and (b) show linear and polar plots of the distributions for a given input vector from the test set. The dominant alias at π is evident. In both plots, the solid curve represents method 1, the dashed curve represents method 2, and the curve with circles represents method 3.
[3] C. M. Bishop and C. Legleye, Estimating conditional probability distributions for periodic variables, in: D. S. Touretzky, G. Tesauro, and T. K. Leen, editors, Advances in Neural Information Processing Systems, Vol. 7 (1995), pp641-648, Cambridge MA, MIT Press.
[4] R A Jacobs, M I Jordan, S J Nowlan, and G E Hinton, Adaptive mixtures of local experts, Neural Computation, Vol. 3 (1991), pp79-87.
[5] Y Liu, Robust neural network parameter estimation and model selection for regression, in: Advances in Neural Information Processing Systems, Vol. 6 (1994), pp192-199, Morgan Kaufmann.
[6] K V Mardia, Statistics of Directional Data. Academic Press, London (1972).
[7] G J McLachlan and K E Basford, Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York (1988).
[8] S Thiria, C Mejia, F Badran, and M Crepon, A neural network approach for modeling nonlinear transfer functions: Application for wind retrieval from spaceborne scatterometer data, Journal of Geophysical Research, Vol. 98(C12) (1993), pp22827-22841.
[9] D M Titterington, A F M Smith, and U E Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley, Chichester (1985).
Acknowledgements
We are grateful to the European Space Agency and the UK Meteorological Office
for making available the ERS-l data. The contributions of Claire Legleye in the
early stages of this project are also gratefully acknowledged. We would also like
to thank Iain Strachan and Ian Kirk of AEA Technology for a number of useful
discussions relating to the interpretation of this data.
INTEGRO-DIFFERENTIAL EQUATIONS IN
COMPARTMENTAL MODEL NEURODYNAMICS
Paul C. Bressloff
Department of Mathematical Sciences, Loughborough University,
Leics. LE11 3TU, UK.
Most neural network models take a neuron to be a point processor by neglecting the extensive
spatial structure of the dendritic tree system. When such structure is included, the dynamics of
a neural network can be formulated in terms of a set of coupled nonlinear integro-differential equations. The kernel of these equations contains all information concerning the effects of the dendritic tree, and can be calculated explicitly. We describe recent results on the analysis of these integro-differential equations.
The local diffusive spread of electrical activity along a cylindrical region of a neu-
ron's dendrites can be described by the cable equation
\frac{\partial V}{\partial t} = -\epsilon V + D\, \frac{\partial^2 V}{\partial x^2} + I(x, t)   (1)

where x is the spatial location along the cable, V(x, t) is the local membrane potential at time t, ε is the decay rate, D is the diffusion coefficient and I(x, t) is any
induced by synaptic inputs are small; in the dendritic spines the full Nernst-Planck
equations must be considered [1]. A compartmental model replaces the cable equa-
tion by a system of coupled ordinary differential equations according to a space-
discretization scheme [2]. The complex topology of the dendrites is represented by a simply-connected graph or tree Γ. Each node of the tree α ∈ Γ corresponds to a small region of dendrite (compartment) over which the spatial variation of physical properties is negligible. Each compartment α can be represented by an equivalent circuit consisting of a resistor R_α and capacitor C_α in parallel, which is joined to its nearest neighbours ⟨β, α⟩ by junctional resistors R_{αβ}. We shall assume for simplicity that the tree Γ is coupled to a single somatic compartment via a junctional resistor R̂ from node α_0 ∈ Γ (Figure 1). The boundary conditions at the terminal nodes of the tree can either be open (membrane potential is clamped at zero) or closed (no current can flow beyond the terminal node).
An application of Kirchhoff's law yields the result

C_\alpha \frac{dV_\alpha}{dt} = -\frac{V_\alpha}{R_\alpha} + \sum_{\langle\beta,\alpha\rangle} \frac{V_\beta - V_\alpha}{R_{\alpha\beta}} + \delta_{\alpha,\alpha_0}\, \frac{U - V_{\alpha_0}}{\hat R} + I_\alpha(t), \quad \alpha \in \Gamma,   (2)

\hat C \frac{dU}{dt} = -\frac{U}{R_\sigma} + \frac{V_{\alpha_0} - U}{\hat R},   (3)

where V_α(t) is the membrane potential of compartment α ∈ Γ and U(t) is the membrane potential of the soma. It can be shown that there exists a choice of parameterisation (where all branches are uniform) for which equation (2) reduces to the matrix form [3]

\frac{dV}{dt} = 2\sigma Q V(t) - (\epsilon + 2\sigma) V(t) + U(t)\, a + I(t)   (4)
where a_α = δ_{α,α_0} and σ, ε are global longitudinal and transverse decay rates respectively. The matrix Q generates an unbiased random walk on the tree Γ. That is, Q = D^{-1}A, where A is the adjacency matrix of Γ, A_{αβ} = 1 if the nodes α and β are adjacent and A_{αβ} = 0 otherwise, and D = diag(d_α), where d_α is the coordination number of node α. Our choice of parameterisation is particularly useful since it allows one to study the effects of dendritic structure using algebraic graph theory (see below). More general choices of parameterisation, where each branch is nonuniform, say, can be handled using perturbation theory.
Since the output of the neuron is determined by the somatic potential U(t), we can view the dendritic potentials as auxiliary variables. In particular, using an integrating factor we can solve equation (4) for V(t) in terms of U(t) and I(t). Substitution into equation (3) then yields the integro-differential equation (assuming without loss of generality that V(0) = 0 and Ĉ = 1)

\frac{dU}{dt} = -\rho U(t) + \int_0^t G_{\alpha_0\alpha_0}(t - t')\, U(t')\, dt' + \int_0^t \sum_{\beta \in \Gamma} G_{\alpha_0\beta}(t - t')\, I_\beta(t')\, dt',   (5)

where ρ = 1/R_σ + 1/R̂ and

G_{\alpha\beta}(t) = e^{-(\epsilon + 2\sigma)t}\, \big[ e^{2\sigma t Q} \big]_{\alpha\beta}   (6)

is the dendritic membrane potential response of compartment α at time t due to a unit impulse stimulation of node β at time t = 0, in the absence of the term U(t)a on the right-hand side of equation (4). All details concerning the passive effects of dendritic structure are contained within G(t). Note that in deriving equation (5) we have assumed that the inputs I_α(t) are voltage-independent. Thus we are ignoring the effects of shunting and voltage-dependent ionic gates.
One way to calculate G_{αβ}(t), equation (6), would be to diagonalize the matrix Q to obtain (for a finite tree)

G_{\alpha\beta}(t) = e^{-(\epsilon + 2\sigma)t} \sum_r e^{2\sigma \lambda_r t}\, u_r(\alpha)\, u_r(\beta),   (7)

where {λ_r} forms the discrete spectrum of Q and u_r are the associated eigenvectors. It can be shown that the spectral radius ρ(Q) = 1, and an application of the Perron-Frobenius Theorem establishes that (a) λ = 1 is a nondegenerate eigenvalue and (b) eigenvalues appear in real pairs ±λ_r. However, such a diagonalization procedure is
rather cumbersome for large trees and does not explicitly yield the dependence on tree topology. An alternative approach is to exploit the fact that Q generates a random walk on Γ. That is, expand equation (6) to obtain

G_{\alpha\beta}(t) = e^{-\bar\epsilon t} \sum_{n \geq 0} \frac{(2\sigma t)^n}{n!}\, [Q^n]_{\alpha\beta},   (10)

where ε̄ = ε + 2σ. The resulting summation over trips can be performed explicitly to yield a closed expression for G(z) [5]. In the remainder of this paper we consider some examples illustrating the effects of dendritic structure on neurodynamics. We shall find that the Laplace transform G(z) plays a crucial role in determining the stability of these systems.
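For a finite tree, equation (6) can be evaluated directly with a matrix exponential; a minimal sketch (assuming scipy is available, and a chain of compartments as the example tree):

```python
import numpy as np
from scipy.linalg import expm

def dendritic_response(adjacency, eps, sigma, t):
    """G(t) = exp(-(eps + 2*sigma)*t) * expm(2*sigma*t*Q), equation (6),
    for a compartmental tree with the given 0/1 adjacency matrix.
    Q = D^{-1} A generates an unbiased random walk on the tree."""
    A = np.asarray(adjacency, dtype=float)
    d = A.sum(axis=1)                  # coordination numbers d_alpha
    Q = A / d[:, None]                 # Q = D^{-1} A
    return np.exp(-(eps + 2 * sigma) * t) * expm(2 * sigma * t * Q)

# Example: a chain of four compartments (path graph).
A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
G = dendritic_response(A, eps=1.0, sigma=0.5, t=0.2)
print(G[0, 3])   # response at node 0 to a unit impulse at node 3
```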
A major issue at the single neuron level is the effect of the coupling between soma and dendrites on the input-output response of a neuron satisfying equation (5), which may be rewritten in the form

\frac{dU}{dt} = -\rho U(t) + \int_0^t H(t - t')\, U(t')\, dt' + I(t),
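A sketch of a direct Euler scheme for this somatic equation; the exponentially decaying kernel H and the constant input are illustrative assumptions, not values from the paper:

```python
import numpy as np

def integrate_soma(H, I, rho, dt, T):
    """Euler scheme for dU/dt = -rho*U(t) + int_0^t H(t-t')U(t')dt' + I(t).
    H and I are callables of time; returns U sampled every dt."""
    n = int(T / dt)
    U = np.zeros(n + 1)
    Hk = np.array([H(k * dt) for k in range(n + 1)])    # kernel samples
    for k in range(n):
        # discretised memory integral: dt * sum_j H((k-j)dt) U(j dt)
        conv = dt * np.dot(Hk[:k + 1][::-1], U[:k + 1])
        U[k + 1] = U[k] + dt * (-rho * U[k] + conv + I(k * dt))
    return U

# Example with a hypothetical exponentially decaying kernel H(t) = exp(-2t):
U = integrate_soma(lambda t: np.exp(-2.0 * t), lambda t: 1.0,
                   rho=1.0, dt=0.01, T=5.0)
print(U[-1])
```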
\frac{dU_i}{dt} = -\rho U_i(t) + \int_0^t \Big[ \sum_{j \neq i} \sum_{\beta \in \Gamma} W_{ij}^\beta\, G_{\alpha_0\beta}(t - t')\, f(U_j(t')) \Big] dt'.   (13)

Note that for simplicity we are neglecting the feedback arising from the coupling between the soma and dendrites of the same neuron. Further, let the output function f satisfy f(x) = tanh(κx). Then f(0) = 0, so that U = 0 is a fixed point solution of (13). Linearization about the zero solution with f'(0) = κ gives

\frac{dU_i}{dt} = -\rho U_i(t) + \int_0^t \sum_{j \neq i} H_{ij}(t - t')\, U_j(t')\, dt'.   (14)
\lim G_{\alpha\beta}(z) = G(\xi, \xi', z) = \frac{1}{2\sqrt{D(z + \bar\epsilon)}} \sum_p b_p \exp\left( -\sqrt{\frac{z + \bar\epsilon}{D}}\, L_p \right)
REFERENCES
[1] Sejnowski TJ and Qian N. In: Single Neuron Computation (ed. T. McKenna, J. Davis and S. F. Zornetzer), San Diego: Academic Press (1992), pp117-139.
[2] Rall W. In: Neural Theory and Modeling (R. F. Reiss ed.), Stanford University Press, Stanford (1964), pp73-97.
[3] Bressloff PC and Taylor JG, Biol. Cybern. Vol. 70, pp199-207.
[4] Abbott LF, Farhi E and Gutmann S, Biol. Cybern. Vol. 66 (1991), pp49-60.
[5] Bressloff PC, Dwyer VM and Kearney MJ, J. Phys. A, Vol. 29 (1996), pp1881-1896.
[6] Bressloff PC, Physica D Vol. 80 (1995), p399.
[7] Burton TA, Volterra Integral and Differential Equations, Academic Press, London (1983).
[8] Bressloff PC, Phys. Rev. E Vol. 50 (1994), pp2308-2319.
[9] Bressloff PC, Biol. Cybern., Vol. 73 (1995), pp281-290.
[10] Shepherd GM ed., The Synaptic Organization of the Brain, Oxford University Press, Oxford (1990).
NONLINEAR MODELS FOR NEURAL NETWORKS
Susan Brittain and Linda M. Haines
University of Natal, Pietermaritzburg, South Africa. Email: [email protected]
The statistical principles underpinning hidden-layer feed-forward neural networks for fitting smooth curves to regression data are explained and used as a basis for developing likelihood- and bootstrap-based methods for obtaining confidence intervals for predicted outputs.
Keywords: Hidden-layer feed-forward neural networks; nonlinear regression; nonparametric re-
gression; confidence limits; predicted values.
1 Introduction
Hidden-layer feed-forward neural networks are used extensively to fit curves to re-
gression data and to provide surfaces from which classification rules can be deduced.
The focus of this article is on curve-fitting applications and two crucial statistical
insights into the workings of neural networks in this context are presented in Section
2. Approaches to developing confidence limits for predicted outputs are explored in
Section 3 and some conclusions given in Section 4.
2 Statistical Insights
Consider the following single hidden-layer feed-forward neural network used to model regression data of the form (x_i, y_i), i = 1, ..., n. The input layer comprises a neuron which inputs the x-variable and a second neuron which inputs a constant or bias term into the network. The hidden layer comprises two neurons with logistic activation functions and an additional neuron which inputs a bias term, and the output layer provides the predicted y-value through a neuron with a linear activation function. The connection weights, θ = (θ_1, ..., θ_7), are defined in such a way that the output, o, from the network corresponding to an input, x, can be written explicitly as

o = \theta_5 + \frac{\theta_6}{1 + e^{-(\theta_1 + \theta_2 x)}} + \frac{\theta_7}{1 + e^{-(\theta_3 + \theta_4 x)}}.   (1)
If in addition the network is trained to minimize the error sum-of-squares, Σ_{i=1}^n (y_i - o_i)², then it is clear that implementing this neural network is equivalent to using the method of least squares to fit the nonlinear regression model, y_i = o_i + ε_i, i = 1, ..., n, where the error terms, ε_i, are independently and identically distributed with zero mean and constant variance, to the data.
respond to parameters in the regression model, training corresponds to iteration in
an appropriate optimization algorithm, and generalization to prediction. In fact the
nonlinear regression model just described is not in any way meaningful in relation
to the data and the broad modelling procedure should rather be viewed as one of
smoothing. In the present example, two logistic functions are scaled and located so
that their sum, together with a constant term, approximates an appropriate smooth
curve. Overall therefore it is clear that the underlying mechanism of a hidden-layer
feed-forward neural network is that of nonlinear regression and that the broad principle underpinning such a network is that of nonparametric regression.
3 Confidence Limits for the True Predicted Values
Consider a hidden-layer feed-forward neural network represented formally as the nonlinear regression model,

y_i = \eta(x_i, \theta) + \epsilon_i, \quad i = 1, \ldots, n,   (2)

where y_i is the observed value at x_i, θ = (θ_1, ..., θ_p) is a p × 1 vector of unknown parameters, η(·) is a nonlinear function describing the network, and the error terms, ε_i, are uncorrelated with mean 0 and constant variance σ². Then the least squares estimator of θ, denoted θ̂, is that value of θ which minimizes the error sum-of-squares, S(θ) = Σ_{i=1}^n [y_i - η(x_i, θ)]², and the estimator of σ², denoted s², is given by S(θ̂)/(n - p). The mean predicted value at x_g for model (2), and hence for the underlying neural network, is given by η(x_g, θ̂), and confidence intervals for the corresponding true value, η(x_g, θ), can be derived from likelihood theory or by resampling methods.
3.1 Likelihood Theory
Linearization method: Suppose that the errors in model (2) are normally distributed, suppose further that this model can be satisfactorily approximated by a linear one, and let $\hat V(\hat\theta)$, the estimated asymptotic variance of the least squares estimator $\hat\theta$, be taken to be $s^2 \left[\sum_{i=1}^n g(x_i, \hat\theta)\, g(x_i, \hat\theta)^T\right]^{-1}$, where $g(x_i, \hat\theta) = \partial\eta(x_i, \theta)/\partial\theta$ evaluated at $\hat\theta$, $i = 1, \ldots, n$. Then the standard error of the mean predicted value at $x_g$ is given by $SE[\eta(x_g, \hat\theta)] = \sqrt{g(x_g, \hat\theta)^T\, \hat V(\hat\theta)\, g(x_g, \hat\theta)}$, where $g(x_g, \hat\theta) = \partial\eta(x_g, \theta)/\partial\theta$, and an approximate $100(1-\alpha)\%$ confidence interval for $\eta(x_g, \theta)$ can be expressed quite simply as
$$\eta(x_g, \hat\theta) \pm t^*\, SE[\eta(x_g, \hat\theta)], \qquad (3)$$
where $t^*$ denotes the appropriate critical t-value with $n - p$ degrees of freedom.
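A minimal sketch of interval (3), reusing eta from the previous sketch; the gradient g is approximated here by central finite differences rather than derived analytically, which is our implementation choice, not the paper's.

```python
import numpy as np
from scipy.stats import t as t_dist

def gradient(theta, x, h=1e-6):
    """Numerical d eta / d theta at a single input x (p-vector)."""
    g = np.zeros(theta.size)
    for i in range(theta.size):
        d = np.zeros(theta.size); d[i] = h
        g[i] = (eta(theta + d, x) - eta(theta - d, x)) / (2 * h)
    return g

def linearization_ci(theta_hat, x, y, xg, alpha=0.05):
    n, p = y.size, theta_hat.size
    s2 = np.sum((y - eta(theta_hat, x)) ** 2) / (n - p)      # s^2
    G = np.array([gradient(theta_hat, xi) for xi in x])      # n x p
    V = s2 * np.linalg.inv(G.T @ G)                          # estimated asy. variance
    se = np.sqrt(gradient(theta_hat, xg) @ V @ gradient(theta_hat, xg))
    tstar = t_dist.ppf(1 - alpha / 2, n - p)
    centre = eta(theta_hat, xg)
    return centre - tstar * se, centre + tstar * se          # interval (3)
```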
Example 1: Data were generated from model (2) with deterministic component (1) corresponding to the neural network described in Section 2 and with normally distributed error terms. The parameter values were taken to be $\theta = (0.5, 0.5, 1, -1, 0.1, 1, 1.5)$ and $\sigma = 0.01$; 25 x-values, equally spaced between -12 and 12, were selected, and simulated y-values, corresponding to these x-values, were obtained. The approximate 95% confidence limits to $\eta(x_g, \theta)$ for $x_g \in [-12, 12]$ were calculated using formula (3) and are summarized in the plots of $\pm t^*\, SE[\eta(x_g, \hat\theta)]$ against $x_g$ shown in Figure 1. The interesting, systematic pattern exhibited by these limits depends on the design points, $x_i$, $i = 1, \ldots, n$.
Profile likelihood method: Suppose again that the errors in model (2) are normally distributed and let $S(\theta|\eta_g^0)$ denote the minimum error sum-of-squares for a fixed value, $\eta_g^0$, of the true predicted value, $\eta(x_g, \theta)$. Then the profile log-likelihood for $\eta(x_g, \theta)$ is described, up to an additive constant, by the curve with generic point $(\eta_g^0, S(\theta|\eta_g^0))$, and a likelihood-based $100(1-\alpha)\%$ confidence interval for $\eta(x_g, \theta)$ comprises those values of $\eta_g^0$ for which
$$S(\theta|\eta_g^0) - S(\hat\theta) \le t^{*2} s^2. \qquad (4)$$
For Example 1 the requisite values of the conditional sum-of-squares, $S(\theta|\eta_g^0)$, for a particular x-value, $x_g$, were obtained by reformulating the deterministic component of the model as $\eta(x, \theta) = \eta_g + \eta_1(x, \theta) - \eta_1(x_g, \theta)$ with $\eta_1(x, \theta) = \frac{\theta_6}{1 + e^{-(\theta_1 + \theta_2 x)}} + \frac{\theta_7}{1 + e^{-(\theta_3 + \theta_4 x)}}$, and by minimizing the resultant error sum-of-squares for appropriate fixed values of the parameter, $\eta_g = \eta(x_g, \theta)$. For each value of $x_g \in [-12, 12]$
so considered, the profile log-likelihood for the true predicted value, $\eta(x_g, \theta)$, was approximately quadratic and the attendant 95% confidence limits were calculated as the two values of $\eta_g^0$ satisfying the equality condition in expression (4) by using the bisection method. A plot of these 95% confidence limits, centered at the maximum likelihood estimator, $\eta(x_g, \hat\theta)$, against $x_g$, is shown, together with those found by
the linearization method, in Figure 1. It is clear from this figure that the confidence
limits for $\eta(x_g, \theta)$ obtained from these two methods are very similar.

Figure 1  Linearization (solid line) and profile likelihood (dashed line) methods.

Figure 2  Bootstrap residuals (solid line) and linearization (dashed line) methods.

Figure 3  Bootstrap pairs (solid line) and linearization (dotted line) methods; fitted curve (dashed line) and data points (circles).
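The profile-likelihood limits can be sketched as follows, again assuming the eta of the Section 2 sketch; the reparameterization above drops theta_5, leaving six free parameters, and the equality in (4) is solved by bracketing and root-finding. All names, and the bracket width, are ours.

```python
import numpy as np
from scipy.optimize import least_squares, brentq

def eta1(th, x):
    """Two scaled logistics; th = (t1, t2, t3, t4, t6, t7) after reparameterization."""
    t1, t2, t3, t4, t6, t7 = th
    return (t6 / (1.0 + np.exp(-(t1 + t2 * x)))
            + t7 / (1.0 + np.exp(-(t3 + t4 * x))))

def S_profile(eta_g, xg, x, y, th0):
    """Conditional sum-of-squares S(theta | eta_g) with eta(xg, theta) held fixed."""
    fit = least_squares(lambda th: y - (eta_g + eta1(th, x) - eta1(th, xg)), th0)
    return np.sum(fit.fun ** 2)

def profile_limit(side, eta_hat_g, S_hat, tstar2_s2, xg, x, y, th0, width=0.05):
    """Solve S(theta | eta_g) - S(theta_hat) = t*^2 s^2 on one side (side = +-1);
    width must be chosen large enough to bracket the root."""
    f = lambda eg: S_profile(eg, xg, x, y, th0) - S_hat - tstar2_s2
    a, b = sorted((eta_hat_g, eta_hat_g + side * width))
    return brentq(f, a, b)
```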
3.2 Bootstrap Methods
Bootstrapping residuals: Suppose that the least squares estimate, $\hat\theta$, of the parameters in model (2) is available. Then confidence intervals for true predicted values, $\eta(x_g, \theta)$, can be obtained by bootstrapping the standardized residuals, $e_i = [y_i - \eta(x_i, \hat\theta)]\sqrt{\frac{n}{n-p}}$ for $i = 1, \ldots, n$, following the procedure for regression models outlined in [1]. For Example 1, the resultant 95% bootstrap confidence limits for predicted values, $\eta(x_g, \theta)$, with $x_g \in [-12, 12]$ were centered at the corresponding bootstrap means, and a plot of these limits against $x_g$ is shown in Figure 2, together with the corresponding limits obtained from the linearization method. It is clear from this figure that the broad pattern exhibited by the two sets of confidence limits is the same but that the limits are systematically displaced. An attempt to correct the bootstrap percentiles for bias by implementing the BCa method of [1] produced limits which were very different from the uncorrected ones.
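A sketch of the residual-bootstrap interval just described, assuming the eta and fit helpers from the Section 2 sketch; the number of replicates B and the percentile construction are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def bootstrap_residuals_ci(theta_hat, x, y, xg, B=999, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    n, p = y.size, theta_hat.size
    fitted = eta(theta_hat, x)
    e = (y - fitted) * np.sqrt(n / (n - p))        # standardized residuals
    e = e - e.mean()
    preds = np.empty(B)
    for b in range(B):
        y_star = fitted + rng.choice(e, size=n, replace=True)   # resample residuals
        theta_star = fit(x, y_star, theta0=theta_hat)           # refit the network
        preds[b] = eta(theta_star, xg)
    lo, hi = np.percentile(preds, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```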
Bootstrapping pairs: An alternative to bootstrapping the residuals is to bootstrap
the data pairs, (Xi, Yi), i = 1, ... , n, directly, following the procedure given in [1].
Approximate 95% confidence intervals for the true predicted values of Example 1
were obtained in this way and the results, in the form of plots of the confidence limits
centered about the bootstrap mean against x, are shown in Figure 3, together with
a plot of the corresponding centered limits obtained from the linearization method.
Clearly the confidence limits obtained by bootstrapping the data pairs are wildly
different from those found by the other methods investigated in this study. The
reason for this is clear from the suitably scaled plots of the data points and the
fitted curve which are superimposed on the plots of the confidence limits in Figure
3 so that the x-values coincide. In particular, only 4 of the 25 observations are taken
at x-values corresponding to the steep slope of the fitted curve between x = -1 and
x = 2.5 and the probability that at least one of these points is excluded from the
bootstrap sample is high, viz. 0.8463. As a consequence the bootstrap least squares
fitted curve is expected to be, and indeed is, extremely unstable in the region of
this slope.
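The quoted exclusion probability can be checked directly by inclusion-exclusion over the 4 points on the steep slope:

```python
# P(at least one of 4 particular points is absent from a bootstrap sample of size 25):
# by inclusion-exclusion, P(a fixed set of k points all excluded) = ((n - k)/n)**n.
from math import comb

n, m = 25, 4   # sample size, points on the steep slope
p = sum((-1) ** (k + 1) * comb(m, k) * ((n - k) / n) ** n for k in range(1, m + 1))
print(round(p, 4))   # -> 0.8464, matching the quoted 0.8463 up to rounding
```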
3.3 Comparison of Methods
It is clearly important to generalize the results found thus far. To this end, 400 data
sets were simulated from the model setting of Example 1 and, from these, coverage
probabilities with a nominal level of 95% for a representative set of true predicted
values were evaluated using the likelihood-based and the bootstrap methods of
the previous two subsections. The results are summarized in Table 1 and clearly
reinforce those obtained for Example 1. In particular, the coverage probabilities
provided by the linearization and the profile likelihood-based methods are close to
nominal, while those obtained by bootstrapping the residuals are consistently low
over the range of x-values considered and those obtained by bootstrapping the data
pairs are very erratic.
4 Conclusions
The aim of this present study has been to critically compare selected methods for
setting confidence limits to the predicted outputs from a neural network. Both of
the likelihood-based methods investigated produced surprisingly good results. In
particular, the linearization method proved quick and easy to use, while the profile
likelihood approach, which is more rigorous, was a little more difficult to implement. In contrast, the bootstrap methods for finding the required confidence limits were, necessarily, highly computer-intensive, and the results disappointing. On balance, it would seem that, to quote Wu [2], "The linearization method is a winner".

Table 1  Coverage probabilities for a nominal level of 95% and 400 simulations.
REFERENCES
[1] Efron, B. and Tibshirani, R.J., An Introduction to the Bootstrap, Chapman & Hall (1994), New York.
[2] Wu, C.F.J., Jackknife, bootstrap and other resampling methods in regression analysis, Annals of Statistics, Vol.14 (1986), pp1261-1295.
Acknowledgements
The authors wish to thank the University of Natal and the Foundation for Research
Development, South Africa, for funding this work.
A NEURAL NETWORK FOR THE TRAVELLING
SALESMAN PROBLEM WITH A WELL BEHAVED
ENERGY FUNCTION
Marco Budinich and Barbara Rosario
Dipartimento di Fisica & INFN, Via Valerio 2, 34127, Trieste, Italy.
Email: [email protected]
We present and analyze a Self Organizing Feature Map (SOFM) for the NP-complete problem of
the travelling salesman (TSP): finding the shortest closed path joining N cities. Since the SOFM
has discrete input patterns (the cities of the TSP) one can examine its dynamics analytically. We
show that, with a particular choice of the distance function for the net, the energy associated with the SOFM has its absolute minimum at the shortest TSP path. Numerical simulations confirm that this distance improves performance. It is curious that the distance function having this
property combines the distances of the neuron and of the weight spaces.
1 Introduction
Solving difficult problems is a natural arena for a would-be new computing paradigm like that of neural networks. One can delineate a sharper image of their potential with respect to the blurred image obtained on simpler problems.
Here we tackle the Travelling Salesman Problem (TSP, see [Lawler 1985], [Johnson
1990]) with a Self Organizing Feature Map (SOFM). This approach, proposed by
[Angeniol 1988] and [Favata 1991], started to produce respectable performances
with the elimination of the non-injective outputs produced by the SOFM [Budinich
1995]. In this paper we further improve its performances by choosing a suitable
distance function for the SOFM.
An interesting feature is that this net is open to analytical inspection down to a
level that is not usually reachable [Ritter 1992]. This happens because the input
patterns of the SOFM, namely the cities of the TSP, are discrete. As a consequence
we can show that the energy function, associated with SOFM learning, has its
absolute minimum in correspondence to the shortest TSP path.
In what follows we start with a brief presentation of the working principles of this
net and of its basic theoretical analysis (section 2). In section 3 we propose a new
distance function for the network and show its theoretical advantages while section
4 contains numerical results. The appendix contains the detailed description of
parameters needed to reproduce these results.
2 Solving the TSP with self-organizing maps
The basic idea comes from the observation that in one dimension the exact solution
to the TSP is trivial: always travel to the nearest unvisited city. Consequently, let
us suppose we have a smart map of the TSP cities onto a set of cities distributed on a circle; we can then easily find the shortest tour for these "image cities", and this will also give a path for the original cities. It is reasonable to conjecture that the better the distance relations are preserved, the better the approximate solution found will be. In this way, the original TSP is reduced to the search for a good neighborhood-preserving map: here we build it via unsupervised learning of a SOFM.
The TSP we consider consists of N cities randomly distributed in the plane
(actually in the (0,1) square). The net is formed by N neurons logically organized
in a ring. The cities are the input patterns of the network and the (0,1) square its
input space.
Each neuron receives the $(x, y) = \vec q$ coordinates of the cities and thus has two weights: $(w_x, w_y) = \vec w$. In this view both patterns and neurons can be thought of as points in two-dimensional space. In response to input $\vec q$, the r-th neuron produces output $o_r = \vec q \cdot \vec w_r$. Figure 1 gives a schematic view of the net while figure 2 represents both patterns and neurons as points in the plane.
Learning follows the standard algorithm [Kohonen 1984]: a city $\vec q_i$ is selected at random and proposed to the net; let $s$ be the most responding neuron (i.e. the neuron nearest to $\vec q_i$); then all neuron weights are updated with the rule:
$$\Delta \vec w_r = \epsilon\, h_{rs}\, (\vec q_i - \vec w_r), \qquad (1)$$
where $0 < \epsilon < 1$ is the learning constant and $h_{rs}$ is the distance function. This function determines the local deformations along the chain and controls the number of neurons affected by the adaptation step (1); thus it is crucial for the
evolution of the network and for the whole learning process (see figure 2).
Step (1) is repeated several times while $\epsilon$ and the width of the distance function are being reduced at the same time. A common choice for $h_{rs}$ is a Gaussian-like function such as $h_{rs} = e^{-(d_{rs}/\sigma)^2}$, where $d_{rs}$ is the distance between neurons $r$ and $s$ (the number of steps between $r$ and $s$) and $\sigma$ is a parameter which determines the number of neurons $r$ such that $\Delta \vec w_r \neq 0$; during learning $\epsilon, \sigma \to 0$, so that $\Delta \vec w_r \to 0$ and $h_{rs} \to \delta_{rs}$.
After learning, the network maps the two dimensional input space onto the one
dimensional space given by the ring of neurons and neighboring cities are mapped
onto neighboring neurons. For each city its image is given by the nearest neuron.
From the tour on the neuron ring one obtains the path for the original TSP¹.
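As an illustration of the working principles above (not the authors' code), the following sketch implements rule (1) with the Gaussian-like distance function and reads off the tour from the trained ring; all parameter values are placeholders.

```python
import numpy as np

def sofm_tsp(cities, epochs=2000, eps0=0.8, sigma0=14.0, alpha=0.999, beta=0.999, seed=0):
    rng = np.random.default_rng(seed)
    N = len(cities)
    w = rng.random((N, 2))                       # neuron weights = points in the plane
    eps, sigma = eps0, sigma0
    for _ in range(epochs):
        for _ in range(N):                       # one epoch = N updates with rule (1)
            q = cities[rng.integers(N)]          # random city
            s = np.argmin(np.sum((w - q) ** 2, axis=1))   # most responding neuron
            d = np.abs(np.arange(N) - s)
            d = np.minimum(d, N - d)             # ring distance d_rs (number of steps)
            h = np.exp(-(d / sigma) ** 2)        # Gaussian-like distance function
            w += eps * h[:, None] * (q - w)      # adaptation step (1)
        eps *= alpha
        sigma *= beta
    # each city is mapped to its nearest neuron; the ring order gives the tour
    tour = np.argsort([np.argmin(np.sum((w - c) ** 2, axis=1)) for c in cities])
    return tour
```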
The standard theoretical approach to these nets considers the expectation value $E[\Delta \vec w_r | \vec w_r]$ [Ritter 1992]. In general $E[\Delta \vec w_r | \vec w_r]$ cannot be treated analytically except when the input patterns have a discrete probability distribution, as happens for the TSP. In this case, $E[\Delta \vec w_r | \vec w_r]$ can be expressed as the gradient of an energy function, i.e. $E[\Delta \vec w_r | \vec w_r] = -\epsilon\, \nabla_{\vec w_r} V(W)$, where at the end of the learning process $V(W)$ is proportional to $L_{TSP^2}$, the length of the tour of the TSP considering the squares of the distances between cities. Thus the Kohonen algorithm for the TSP minimizes an energy function which, at the end of the learning process, is proportional to the sum of the squares of the distances. Numerical simulations confirm this result.
3 A New Distance Function
Our hypothesis is that we can obtain better results for the TSP using a distance function $h_{rs}$ such that, at the end of the process, $V(W)$ is proportional to the simple length of the tour $L_{TSP}$, namely $V(W) \propto \sum_s |\vec q_{i(s)} - \vec q_{i(s-1)}| = L_{TSP}$, since, in general, minimizing $L_{TSP^2}$ is not equivalent to minimizing $L_{TSP}$. We thus consider
¹The main weakness of this algorithm is that, in about half of the cases, the map produced is not injective. The definition of a continuous coordinate along the neuron ring solves this problem, yielding a competitive algorithm [Budinich 1995].
a function $h_{rs}$ depending both on the distance $d_{rs}$ and on another distance $D_{rs}$ defined in weight space:
$$D_{rs} = \sum_{j=r+1}^{s} |\vec w_j - \vec w_{j-1}|.$$
If we define
$$h_{rs} = \left(1 + \frac{D_{rs}}{\sigma}\right)^{-d_{rs}^2}, \qquad (5)$$
then, when $\sigma \to 0$, we get for $h_{s\pm1,s}$
$$h_{s\pm1,s} = \left(1 + \frac{D_{s\pm1,s}}{\sigma}\right)^{-1} \approx \frac{\sigma}{D_{s\pm1,s}} = \frac{\sigma}{|\vec w_s - \vec w_{s\pm1}|} = \frac{\sigma}{|\vec q_{i(s)} - \vec q_{i(s\pm1)}|},$$
and substituting this expression in (4) we obtain
$$V(W) \approx \frac{1}{2N} \sum_s \left[ \frac{\sigma}{|\vec q_{i(s)} - \vec q_{i(s-1)}|}\, (\vec q_{i(s)} - \vec q_{i(s-1)})^2 + \frac{\sigma}{|\vec q_{i(s)} - \vec q_{i(s+1)}|}\, (\vec q_{i(s)} - \vec q_{i(s+1)})^2 \right] = \frac{\sigma}{N} \sum_s |\vec q_{i(s)} - \vec q_{i(s+1)}| = \frac{\sigma}{N} L_{TSP}.$$
With this choice of $h_{rs}$ the minimization of the energy $V(W)$ is equivalent to the minimization of the TSP path. We remark that the introduction of the distance $D_{rs}$ between weights is a slightly unusual hypothesis for this kind of net, which usually keeps neuron and weight spaces well separated, in the sense that the distance function $h_{rs}$ depends only on the distance $d_{rs}$.
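A sketch of how the distance function (5) might be computed for a winning neuron s, assuming an (N, 2) array of ring weights; for simplicity D_rs is accumulated along the direct index path between r and s, as in the definition above, ignoring ring wrap-around.

```python
import numpy as np

def h_new(w, s, sigma):
    """Distance function (5): h_rs = (1 + D_rs/sigma) ** (-d_rs**2) for all r."""
    N = len(w)
    idx = np.arange(N)
    d = np.abs(idx - s)
    d = np.minimum(d, N - d)                                  # ring distance d_rs
    seg = np.linalg.norm(w - np.roll(w, 1, axis=0), axis=1)   # seg[j] = |w_j - w_{j-1}|
    # D_rs: accumulated weight-space path length between r and s (direct index path)
    D = np.array([np.sum(seg[min(r, s) + 1: max(r, s) + 1]) for r in idx])
    return (1.0 + D / sigma) ** (-(d ** 2.0))                 # h_ss = 1 for the winner
```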
4 Numerical Results
Since the performances of this kind of TSP algorithms are good for problems with
more than 500 cities and more critical in smaller problems [Budinich 1995], we began
testing the performances produced by the new distance function (5) in problems
with 50 cities.
City set   Min. length   [Durbin 1987]   [Budinich 1995]   This algorithm
   1         5.8358          2.47%            1.65%             0.96%
   2         5.9945          0.59%            1.66%             0.31%
   3         5.5749          2.24%            1.06%             1.05%
   4         5.6978          2.85%            1.37%             0.70%
   5         6.1673          5.23%            5.25%             0.43%
 Average                     2.68%            2.20%             0.69%

Table 1  Comparison of the best TSP solution obtained in 10 runs of the various algorithms. Rows refer to the 5 different problems, each of 50 cities randomly distributed in the (0,1) square. Column 2 reports the length of the best known solution for the given problem. Columns 3 to 5 contain the best lengths obtained by the three algorithms under study, expressed as percentage increments over the minimal length; the number of runs of the algorithms is respectively: unknown, 5 and 10. The last row gives the increment averaged over the 5 city sets.
We compared the quality of the TSP solutions obtained with this net to those of two other algorithms, both deriving from the idea of a topology-preserving map and both actually minimizing $L_{TSP^2}$: the elastic net of Durbin and Willshaw [Durbin 1987] and this same algorithm with the standard distance choice. As a test set, we considered the very same 5 sets of 50 randomly distributed cities used for the elastic net.
Table 1 contains a comparison of the best TSP path obtained in several runs of the different algorithms, expressed as percentage increments over the best known solution for the given problem. Another measure of the quality of the solution is the mean length obtained in the 10 runs. The percentage increment of these mean lengths, averaged over the 5 sets, was 2.49% for this algorithm, showing that even the averages found with the new distance function are better than the minima found with the elastic net.
These results clearly show that distance choice (5) gives better solutions in this SOFM application, thus supporting the guess that an energy $V(W)$ directly proportional to the length of the tour $L_{TSP}$ is better tuned to this problem. One could wonder whether adding weight-space information to the distance function could give interesting results also in other SOFM applications.
Appendix
Here we describe the network setting that produces the quoted numerical results.
Apart from the distance definition (5) we apply a standard Kohonen algorithm and exponentially decrease the parameters $\epsilon$ and $\sigma$ with the learning epoch $n_e$ (a learning epoch corresponds to $N$ weight updates with rule (1)):
$$\epsilon = \epsilon_0\, \alpha^{n_e}, \qquad \sigma = \sigma_0\, \beta^{n_e},$$
and learning stops when $\epsilon$ reaches $5 \cdot 10^{-3}$. Numerical simulations clearly indicate that the best results are obtained when the final value of $\sigma$ is very small ($\approx 5 \cdot 10^{-3}$) and when $\epsilon$ and $\sigma$ decrease together, reaching their final values at the same time. Consequently, given values for $\alpha$ and $\sigma_0$, one easily finds $\beta$. In other words there are just three free parameters to play with to optimize results, namely $\epsilon_0$, $\sigma_0$ and $\alpha$.
After some investigation we obtained the following values that produce the quoted results: $\epsilon_0 = 0.8$, $\sigma_0 = 14$ and $\alpha = 0.9996$.
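The constraint that epsilon and sigma reach their final values together fixes beta, as the following arithmetic sketch shows for the quoted settings:

```python
import math

eps0, alpha, sigma0 = 0.8, 0.9996, 14.0
eps_f = sigma_f = 5e-3                                # common final value

n_epochs = math.log(eps_f / eps0) / math.log(alpha)   # epochs until eps = eps_f
beta = (sigma_f / sigma0) ** (1.0 / n_epochs)         # sigma0 * beta**n = sigma_f
print(round(n_epochs), beta)                          # ~12685 epochs, beta ~ 0.99937
```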
REFERENCES
[1] B. Angeniol, G. de La Croix Vaubois and J.-Y. Le Texier, Self Organising Feature Maps and the Travelling Salesman Problem, Neural Networks Vol.1 (1988), pp289-293.
[2] M. Budinich, A Self-Organising Neural Network for the Travelling Salesman Problem that is Competitive with Simulated Annealing, to appear in: Neural Computation.
[3] R. Durbin and D. Willshaw, An Analogue Approach to the Travelling Salesman Problem using an Elastic Net Method, Nature Vol.336 (1987), pp689-691.
[4] F. Favata and R. Walker, A Study of the Application of Kohonen-type Neural Networks to the Travelling Salesman Problem, Biological Cybernetics Vol.64 (1991), pp463-468.
[5] D. S. Johnson, Local Optimization and the Traveling Salesman Problem, in: Proceedings of the 17th Colloquium on Automata, Languages and Programming, Springer-Verlag (1990), New York, pp446-461.
[6] T. Kohonen, Self-Organisation and Associative Memory, Springer-Verlag (1984, 3rd Ed. 1989), Berlin Heidelberg.
[7] E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan and D. B. Shmoys (editors), The Traveling Salesman Problem - A Guided Tour of Combinatorial Optimization, John Wiley & Sons, New York (1990), IV Reprint, p474.
[8] H. Ritter, T. Martinetz and K. Schulten, Neural Computation and Self Organising Maps, Addison-Wesley Publishing Company (1992), Reading, Massachusetts, p306.
SEMIPARAMETRIC ARTIFICIAL NEURAL
NETWORKS
Enrico Capobianco
Stanford University, Stanford, CA 94305, USA.
Email: [email protected]
In this paper Artificial Neural Networks are considered as an example of the semiparametric class
of models which has become very popular among statisticians and econometricians in recent years.
Modelling and learning aspects are discussed. Some statistical procedures are described in order
to learn with infinite-dimensional nuisance parameters, and adaptive estimators are presented.
Keywords: Semiparametric Artificial Neural Networks; Profile Likelihood; Efficiency Bounds;
Asymptotic Adaptive Estimators
1 Introduction
We look at the interface between statistics and artificial neural networks (ANN)
and study stochastic multilayer neural networks with unspecified functional com-
ponents. We call them Semiparametric Neural Networks (SNN). The fact that
no absolute requirements come from biological considerations represents an impor-
tant motivation for SNN. Many technical issues arise from the statistical inference
perspective; we (A) stress the importance of computing likelihood-based criterion
functions in order to exploit the large sample statistical properties pertaining to
the related estimators and (B) measure the asymptotic efficiency of the estimators
from given network architectures.
2 ANN and Likelihood Theory
A stochastic machine with a vector of synaptic weights $w$, input (output) vectors $x$ ($y$) (given a conditional density function $f(y|x)$) and a data set of dimension $T$ is used to compute $w$ through a learning algorithm based on the minimization of some fitting criterion function of estimated residuals or prediction errors. Consider a likelihood function $L(Y, \theta)$, where $Y = (y_1, \ldots, y_T)'$ is the sample observation vector and $\theta = (\theta_1, \ldots, \theta_p)'$ is a vector of parameters. These three elements can fully characterize a taxonomy of models, differentiated by the sample size and the dimension of the parameter space over which the optimization of the chosen criterion function must be done. For instance, in a network where all the activation functions are specified and the dimension of $\theta$ is $p < \infty$, i.e., in a fully parametric situation, the solution to the optimization problem is to maximize $l = \sum_{i=1}^{T} \ln L(y_i, \theta)$ over the weight space $\Theta$. On the other hand, the activation functions can reasonably be left unspecified, apart from some smoothness conditions. This choice permits a shift from a parametric to a more nonparametric setting, where the likelihood function now admits a modified form $L(Y, \lambda)$, with $\lambda = (w, g, h)$, where $w, g, h$ indicate respectively the weight vector and the unknown output and hidden layer activation functions. A more restricted context, but still semiparametric in nature, occurs when $g$ or $h$ is in fact specified; networks here closely resemble the semiparametric and nonparametric extensions of the Generalized Linear Models [7] described in [6]. As a result, the likelihood function is now more constrained, of the form $L(Y, \delta)$ with $\delta = (w, g, h)$ where one of $g$ and $h$ is specified. A stochastic one-hidden layer feedforward neural
network can be represented by
$$y = g\left(\sum_{j=1}^{q} c_j\, h(w_j' x)\right) + \epsilon.$$
An important aspect of SNN is related to the functional form of the I/O relation. The approximation ability of the network depends on the characteristics of the underlying functional relation; usually sigmoidal-type or RBF-type ANN work well, but when the activation functions are left unspecified, projection pursuit regression [3] and other highly parameterized techniques represent possible solutions.
We consider pure minimization and iterative estimation strategies; the former are based on the decomposition of the parameter vector, $\theta = (w, \eta)$, where $w$ represents the weights and the bias terms considered together and $\eta$ represents the infinite-dimensional nuisance parameter⁴, and the latter are Newton-Raphson (NR)-type procedures. We address the optimization problem in the Profile ML setting introduced before. But another challenging issue is the weight estimation accuracy, at least asymptotically. By working with an initially unrestricted likelihood-based performance measure we would like to obtain, in statistical terms, a lower bound [8, 9, 10] for the asymptotic variance (AV) of the parametric component of the semiparametric model, such that we are able to generalize the usual parametric bound given by the inverse of the Fisher Information Matrix (FIM). We equivalently calculate the Semiparametric Efficiency Bounds (SEB) for the parameters of interest, which quantify the efficiency loss resulting from a semiparametric model compared to a parametric one⁵.
4 Parametric and Nonparametric Estimation Theory
The discussion here draws mainly on [8], [10] and [11], where the concept of the marginal Fisher bound for asymptotic variances in parametric models is generalized. If it is true that a nonparametric problem is at least as difficult as any of the parametric problems obtained by assuming we have enough knowledge of the unknown state of nature to restrict it to a finite-dimensional set, it is important to look for a method that obtains a lower bound for the AV of the parametric component of the initially unrestricted model⁶. Since a one-dimensional sub-problem asymptotically as difficult as the original multidimensional problem often exists, we could express the parameter space as a union of these one-dimensional sub-problems (paths or directions along $w$, like $F_w : F_w \in \mathcal{F}$) and estimate the parameter of interest to select one of the sub-problems, proceeding as if the true parameter lay on this curve. The question is: which sub-problem should be selected? First, we should verify the existence of an estimator for a "curve" defined in the infinite-dimensional nuisance parameter space, corresponding to one of the possible one-dimensional
³These distributional assumptions make a substantial difference in terms of the global or local statistical efficiency that an estimator can achieve. By global efficiency we mean that an estimator is accurate regardless of the true underlying distributions, while for local efficiency the same holds only for some specific distributions related to the nonparametric component of the model.
⁴$\eta$ can stand for the unknown hidden layer activation functions, the noise density which randomly perturbs the input/output pattern data, the weights, or even the activation functions themselves.
⁵In the ANN context we could make comparisons between SNN and sigmoidal-type or other networks on the grounds of the statistical efficiency of the related learning algorithms.
⁶If a sequence of estimators $(\hat\theta_n)$ satisfies $\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, V)$, then $V$ is the asymptotic variance of $(\hat\theta_n)$, and for the ML estimator, under regularity conditions, it equals $I_\theta^{-1}$, where $I_\theta$ is the FIM.
sub-problems. Given $\theta = (w, \eta)$, consider the smooth curve $\theta(t)$, $a < t < b$, designed in the space $W \times T$ by the map $t \to \theta(t) = (w(t), \eta(t))$. According to a possible parameterization induced by $w(t) = t$, the map becomes $w \to (w, \eta_w)$ with $\eta_{w_0} = \eta_0$, and for each curve $\eta_w$ we can compute the associated score functions $U_w, U_\eta$ such that
$$\frac{d}{dw} l(w, \eta_w)\Big|_{w=w_0} = \frac{\partial l}{\partial w}(w_0, \eta_0) + \frac{\partial l}{\partial \eta}(w_0, \eta_0)\left(\frac{d}{dw}\eta_w\Big|_{w=w_0}\right)$$
(or simply $U_w + U_\eta$), and the information matrix $E_0\left(\frac{d}{dw} l(w, \eta_w)|_{w=w_0}\right)^2 = E(U_w + U)^2$, where $U \in \mathrm{span}(U_\eta)$. Repeating the procedure for all the possible $\eta_w$ we can find the minimum FIM, which is given by $\inf E_0(U_w + U)^2 = E_0(U_w + U^*)^2 = i_w$, where $U^*$ is the projection of $U_w$ onto $\mathrm{span}(U_\eta)$. In particular, we define the curve $\eta_w^*$ for which this minimum is achieved as the Least Favorable Curve (LFC). With a semiparametric model, in order for $\eta_w^*$ to be the LFC, the following relation must be satisfied:
$$E\left(\left[\frac{\partial l(w, \eta_w^*)}{\partial w}\right]_{w=w_0}\right)^2 \le E\left(\left[\frac{\partial l(w, \eta_w)}{\partial w}\right]_{w=w_0}\right)^2 \qquad (3)$$
for each curve $w \to \eta_w$ in the nuisance functional space $T$⁷. Following the parametric case, the marginal FIM for $w$ in a semiparametric model is given by:
$$i_w = \inf_\eta E\left(\left[\frac{\partial l(w, \eta_w)}{\partial w}\right]_{w=w_0}\right)^2. \qquad (4)$$
¹²Note that (A) $\hat w(D)$ that maximizes $\hat L(D(w))$ should behave like $w(D)$, where $w(D) = \arg\sup_w L(D(w))$ (a consistent estimator is required, given that $\hat L(D(w)) \to_{pr} L(D(w))$ uniformly in $w$); (B) given $AV = G^{-1}$, where according to [1] $G = (g_{ij})$ is the FIM and $g_{ij} = \beta^2 \int k(1-k)\, \frac{\partial f}{\partial \theta_i} \frac{\partial f}{\partial \theta_j}\, p(x)\, dx$ is the generic component of $G$, we must compare the AV with the inverse of (5).
¹³These estimators are asymptotically efficient after one iteration from a consistent initial estimate of the parameter of interest and with the functions of the likelihood consistently estimated.
solve moment equations of the form $\frac{1}{n}\sum_i m(z_i, \psi, \hat\eta) = 0$, given a general interest parameter $\psi$. The $m(\cdot)$ function, similar to the estimating functions adopted in [5], can represent the First Order Conditions for the maximum of a criterion function $Q$, and $\eta$ maximizes the expected value of the same function. This method is equivalent to GPL when $f$ is the density from a given distribution function $F$, $Q$ is the log-likelihood function, $m(\cdot)$ the score function and $\eta(\psi, F)$ is the limit, for the nonparametric component, that maximizes the expected value of $\ln f(z|\psi, \eta)$. With $\hat\psi = \arg\max_\psi \sum_i \ln f(z_i|\psi, \hat\eta_\psi)$ as the GP(Max)L estimator, where the estimation of $\eta$ does not affect the AV, and given $M = \frac{\partial E(m(z, \psi, \eta_0))}{\partial \psi}\big|_{\psi=\psi_0}$ nonsingular, we have $\tilde\psi(z) = M^{-1} m(z, \psi, \eta_0)$ and $\hat\psi = \bar\psi + \frac{1}{n}\sum_i \tilde\psi(z_i, \bar\psi)$, which is the one-step version of the GP(Max)L estimator.
6 Conclusions
We analyzed semiparametric neural networks, described a general model set-up
and discussed the related asymptotic estimation issues. The degree of success in
solving the bias/variance dilemma is often a case-dependent problem requiring a
reparameterization of the model in the hope of finding adaptive estimators. The SEB tell us about the asymptotic statistical efficiency of the chosen estimator.
REFERENCES
[1] S. Amari and N. Murata, Statistical Theory of Learning Curves under Entropic Loss Criterion, Neural Comput., Vol.5 (1993), pp140-153.
[2] P. Bickel, C.A.J. Klaassen, Y. Ritov and J.A. Wellner, Efficient and Adaptive Estimation of Semiparametric Models, The Johns Hopkins University Press (1993).
[3] J.H. Friedman and W. Stuetzle, Projection Pursuit Regression, JASA, Vol.76 (1981), pp817-823.
[4] S. Geman, E. Bienenstock and R. Doursat, Neural Networks and the bias/variance dilemma, Neural Comput., Vol.4 (1992), pp1-58.
[5] M. Kawanabe and S. Amari, Estimation of network parameters in semiparametric stochastic perceptron, Neural Comput., Vol.6 (1994), pp1244-1261.
[6] P.A. Jokinen, A nonlinear network model for continuous learning, Neurocomp., Vol.3 (1991), pp157-176.
[7] P. McCullagh and J.A. Nelder, Generalized Linear Models, Chapman and Hall (1989).
[8] W.K. Newey, Semiparametric efficiency bounds, J. Appl. Econometrics, Vol.5 (1990), pp99-135.
[9] W.K. Newey, The asymptotic variance of semiparametric estimators, Econometrica, Vol.62 (1994), pp1349-1382.
[10] T. Severini and W.H. Wong, Profile likelihood and conditionally parametric models, Ann. Stat., Vol.20 (1992), pp1768-1802.
[11] C. Stein, Efficient nonparametric testing and estimation, in: Proceedings Third Berkeley Symposium in Mathematics, Statistics and Probability, University of California Press, Berkeley (1956), Vol.1, pp187-196.
[12] A.S. Weigend and N.A. Gershenfeld, Time series prediction: forecasting the future and understanding the past, Santa Fe, Proc. Vol. XV (1994), Addison-Wesley.
[13] H. White, Artificial Neural Networks: Approximation and learning theory, Blackwell (1992).
Acknowledgements
The author thanks two anonymous referees for their suggestions.
AN EVENT-SPACE FEEDFORWARD NETWORK
USING MAXIMUM ENTROPY PARTITIONING WITH
APPLICATION TO LOW LEVEL SPEECH DATA
D.K.Y. Chiu, D. Bockus and J. Bradford*
University of Guelph, Ontario, Canada.
* Brock University, St. Catharines, Ontario, Canada.
This paper describes an event-space feedforward network based on partitioning of the input space
using maximum entropy criterion. It shows how primitives defined as partitioned hypercells (event
space) can be selected for the purpose of class discrimination. Class discrimination of a hypercell
is evaluated statistically. Observed primitives corresponding to observed characteristics in selected
hypercells are used as inputs to a feedforward network for classification. Preliminary experimental results using simulated data, and on speaker discrimination using low-level speech data, have shown very good classification rates.
1 Introduction
This paper proposes a feedforward network whose input layer is reconfigured during the training phase depending on the generation and selection of newly defined primitives. As the primitives are identified through partitioning of the input outcome space, so is the construction of the input nodes, which corresponds to defining the primitive set. The primitives are defined through the selection of partitioned hypercells corresponding to certain feature values of the data selected for classification. Thus, an observed primitive in a datum corresponds to an observed selected characteristic (or range of values) which eventually determines the datum's classification. Since the input layer in the network is reconfigured depending on the selected hypercells identified, we call it a self-configurable neural network [2]. The two processes of primitive generation and classification are integrated and "closely coupled".
2 Maximum Entropy Partitioning
When discretizing the outcome space based on partitioning of the data, the discretization process becomes a partitioning process. The Maximum Entropy Partitioning (MEP) process generates a set of hypercells through partitioning of the outcome space of $n$ continuous-valued variables (or features) [1,3,4,7]. Compared to the commonly used equal-width partitioning algorithm, this method bypasses the problem of non-uniform scaling for different variables in multivariate data, and minimizes the information loss after partitioning [3]. Given $n$ variables in $n$-dimensional space, the MEP process partitions a data set into $k^n$ hypercells based on the estimated probabilities of the data. The value $k$ represents the number of intervals that a dimension is divided into. Let $P$ be the probability distribution where the process produces a quantization of $P$ into:
$$P(R_i), \quad i = 1, 2, \ldots, k^n, \qquad (1)$$
where $R_i$ denotes a hypercell. To maximize the information represented by $R_i$, the partitioning scheme which maximizes Shannon's entropy function defined below is
used:
$$H(R) = -\sum_{i=1}^{k^n} P(R_i) \log P(R_i). \qquad (2)$$
The function $H(R)$ now becomes the objective function, and information is maximized by maximizing $H(R)$. With one variable, maximization occurs when the expected probability $P(R_i)$ is approximately $1/k$ of the training data with repeated observations. This creates narrower intervals where the probability density is higher [3]. In the proposed method, the partitioning process is based on the marginal probability distribution of each variable, to avoid combinatorial complexity.
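A sketch of marginal MEP, under the reading that each variable is cut at its empirical quantiles so that each of the k intervals holds roughly 1/k of the points; the function names are ours.

```python
import numpy as np

def mep_boundaries(X, k):
    """Per-variable boundary points: interior quantiles at levels 1/k, ..., (k-1)/k.
    X is an (M, n) array of points; returns a list of n boundary arrays."""
    qs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return [np.quantile(X[:, v], qs) for v in range(X.shape[1])]

def cell_index(X, bounds):
    """Map each point to its hypercell R_i (a tuple of interval indices, one per variable)."""
    return np.stack([np.searchsorted(b, X[:, v]) for v, b in enumerate(bounds)], axis=1)
```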
In our data representation (such as low-level speech data), the jth datum is represented by a set of $J(j)$ points denoted by $Z_j = \{x_i = (x_{1i}, x_{2i}, \ldots, x_{ni}) \mid i = 1, \ldots, J(j)\}$. That is, $Z_j$ consists of sets of $n$-vectors. A set of partition boundaries is determined by combining all the data into a single set that we call the "data universe", denoted by $U$. Hence, $U$ is defined as the union of all the training data:
$$U = \{Z_j \mid j = 1, \ldots, J\} = Z_1 \cup Z_2 \cup \ldots \cup Z_J. \qquad (3)$$
Each datum is assumed to have an assigned class label $C_m$ among $L_c$ classes, $1 \le m \le L_c$. In $n$-dimensional space, the set of hypercells generated after partitioning is then defined as $R = \{R_i \mid i = 1, 2, \ldots, k^n\}$, where each $R_i$ is bounded by intervals that partition the data universe. The intervals that bound $R_i$ are composed of boundary points determined by an algorithm [3].
3 Selection of the Partitioned Hypercells
Representation of an individual datum $Z_j$ is based on the same partitioning scheme generated from partitioning the data universe. Each hypercell generated from the partitioning cordons off a set of points in $Z_j$. Since our partitioning is based on the marginal probability distributions, a hypercell $R_i$ on datum $Z_j$ that has significantly more points than expected can be evaluated using the following statistic, also known as the standard residual [5]:
$$D(R_i, Z_j) = \frac{obs(R_i, Z_j) - exp(R_i, Z_j)}{\sqrt{exp(R_i, Z_j)}} \qquad (4)$$
where $exp(R_i, Z_j)$ is defined as the average number of points observed in a hypercell $R_i$, calculated as $M(Z_j)/k^n$, and $obs(R_i, Z_j)$ is the number of points in $Z_j$ observed in the same hypercell, given that $M(Z_j)$ is the total number of points in $Z_j$. Since the statistic follows a normal distribution, it is possible to evaluate a hypercell which has a significant characteristic in the data, based on the normal distribution according to a confidence level. If the expected value is calculated under the null hypothesis that each hypercell has equal probability of occurrence, then this statistic has the properties of an approximate normal distribution with a mean of 0 and a variance of 1. In cases where the asymptotic variance differs from 1, an adjustment to the standard residual is required in order to yield better approximations [5]. The significance of a hypercell is determined by comparing $D(R_i, Z_j)$ to the tabulated z-values of a predefined confidence level using the standard normal distribution. That is, the hypercell $R_i$ is selected as significant based on:
$$\delta(R_i, Z_j) = \begin{cases} 1 & \text{if } D(R_i, Z_j) \ge z \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
where $z$ is the tabulated z-value of a certain degree of confidence level. As each datum is partitioned using the same scheme generated from partitioning the data universe, the number of data $Z_j$ with significant $D(R_i, Z_j)$ (or $\delta(R_i, Z_j) = 1$) and class label $C_m$ is denoted $\eta(R_i, C_m)$ (i.e. $\eta(R_i, C_m) = \sum_{Z_j \in C_m} \delta(R_i, Z_j)$). Let $e(R_i)$ be the average number of data per class whose hypercell $R_i$ is significant, or:
$$e(R_i) = \frac{1}{L_c} \sum_{m=1}^{L_c} \eta(R_i, C_m), \qquad (6)$$
and let
$$S(R_i) = \sum_{m=1}^{L_c} \frac{[\eta(R_i, C_m) - e(R_i)]^2}{e(R_i)}, \quad \forall Z_j \in U, \qquad (7)$$
which reflects the extent of class discrimination for a hypercell $R_i$ in the data universe.
Since $S(R_i)$ has an asymptotic Chi-square distribution, the relevance of a hypercell can be evaluated by applying the Chi-square test. After $S(R_i)$ is calculated, it is compared to a tabulated $\chi^2$ value with $L_c - 1$ degrees of freedom based on a presumed confidence level. The following function indicates the hypercell's statistical relevance for class discrimination:
$$\beta(R_i) = \begin{cases} 1 & \text{if } S(R_i) \ge \chi^2_{L_c - 1} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
Hypercells that are not identified as statistically relevant are partitioned further, using the same criterion of maximum entropy, until there is an insufficient number of points or a predefined depth has been reached. We call this method hierarchical maximum entropy partitioning [1,3]. The rationale for partitioning iteratively is to identify useful characteristics at a more restricted interval when no relevant characteristic is found at a larger interval. The hypercells that surpass the threshold value are marked as having an acceptable degree of class discrimination with $\beta(R_i) = 1$, and these selected hypercells are relabeled as $\{R_i^s\}$, indicated by the superscript. The set $\{R^s\}$ corresponds to the set of input nodes in the feedforward network. When labeled as $(R_1^s, R_2^s, \ldots, R_S^s)$, they correspond to a set of data value characteristics selected to have acceptable class discrimination information.
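A sketch of the statistics (4)-(8), reusing the cell_index helper from the previous sketch; the per-class counts eta(R_i, C_m) are assumed to have been accumulated into an array beforehand.

```python
import numpy as np
from scipy.stats import chi2

def D_stat(Z, bounds, k):
    """Standard residuals (4) for the occupied hypercells of one datum Z (an (M_j, n) array)."""
    cells, n = cell_index(Z, bounds), Z.shape[1]
    exp = len(Z) / float(k ** n)                      # exp(R_i, Z_j) = M(Z_j) / k^n
    _, counts = np.unique(cells, axis=0, return_counts=True)
    return (counts - exp) / np.sqrt(exp)

def select_cells(eta_by_class, Lc, alpha=0.05):
    """Chi-square selection (6)-(8); eta_by_class is a (#cells, Lc) count array."""
    e = eta_by_class.mean(axis=1, keepdims=True)      # e(R_i), equation (6)
    S = np.sum((eta_by_class - e) ** 2 / np.where(e > 0, e, 1), axis=1)   # (7)
    return S >= chi2.ppf(1 - alpha, Lc - 1)           # beta(R_i) = 1, equation (8)
```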
4 Self Configurable Event-Space Feedforward Network
The number of inputs to the feedforward network depends on the number of iterations and selected hypercells. As more iterations of the partitioning process are performed, more hypercells are generated and identified as statistically relevant. This is analogous to using more refined characteristics of the data for class discrimination, if sampling reliability is not a concern. Given a datum $Z_j$ and a generated hypercell $R_s$, let $obs(R_s, Z_j)$ be the number of points $Z_j$ has in $R_s$ and $exp(R_s, Z_j) = M(Z_j)/k^n$ be the expected average number of points, where $M(Z_j)$ is the total number of points in $Z_j$. Substituting $obs(R_s, Z_j)$ and $exp(R_s, Z_j)$ into equation (4), we calculate the statistic for $Z_j$, denoted $D(R_s, Z_j)$. Define a binary decision function for all the selected hypercells $\{R^s\}$ for $Z_j$ as:
$$\alpha(R_i^s, Z_j) = \begin{cases} 1 & \text{if } D(R_i^s, Z_j) \ge z \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$
where $z$ is a tabulated z-value for a given confidence level. Then $Z_j$ is represented by a vector as:
$$W_j = (\alpha(R_1^s, Z_j)\theta(R_1^s), \alpha(R_2^s, Z_j)\theta(R_2^s), \ldots, \alpha(R_S^s, Z_j)\theta(R_S^s)) \qquad (10)$$
where $\alpha(R_i^s, Z_j)$ indicates whether a selected characteristic is observed in $Z_j$, that is, whether a selected primitive is observed, and $\theta(R_i^s)$ is the estimated relevance of $R_i^s$ based on the analysis from the data universe. A binary vector $V_j$ can be used as an approximation to $W_j$ for each datum where simplicity of inputs is desired. It is defined as:
$$V_j = (\alpha(R_1^s, Z_j)\beta(R_1^s), \alpha(R_2^s, Z_j)\beta(R_2^s), \ldots, \alpha(R_S^s, Z_j)\beta(R_S^s)) \qquad (11)$$
Each component $\alpha(R_i^s, Z_j)\beta(R_i^s)$ is the product of $\alpha(R_i^s, Z_j)$, which identifies significant characteristics in the datum $Z_j$, and $\beta(R_i^s)$, which identifies the hypercells (or primitives) based on the data universe. In other words, a component is 1 only if the primitive is statistically significant in $Z_j$ and statistically discriminating in $U$. This approximation does not provide the detailed information contribution to the class discrimination as the $W_j$ vector does. However, $V_j$ usually provides faster training times, at the cost of less detailed analysis information. $\beta(R_i^s)$ is defined as a binary element in the vector $V_j$ and is always equal to 1 for the selected hypercells. The $\beta(R_i^s)$ value replaces $\theta(R_i^s)$ in equation (10) so that it is now represented by a 1; thus $V_j$ can be rewritten as:
$$V_j = (\alpha(R_1^s, Z_j), \alpha(R_2^s, Z_j), \ldots, \alpha(R_S^s, Z_j)) \qquad (12)$$
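A sketch of assembling the binary input vector V_j of equation (12) for one datum, reusing cell_index from the earlier sketch; z = 1.645 corresponds to a one-sided 95% confidence level and is an illustrative default, not the paper's setting.

```python
import numpy as np

def input_vector(Z, selected_cells, bounds, k, z=1.645):
    """One component per selected hypercell: 1 iff the cell is significantly
    occupied in Z, i.e. alpha(R_i^s, Z_j) = 1 under equations (4) and (9)."""
    cells, n = cell_index(Z, bounds), Z.shape[1]
    exp = len(Z) / float(k ** n)
    V = np.zeros(len(selected_cells))
    for i, cell in enumerate(selected_cells):     # cell = tuple of interval indices
        obs = np.sum(np.all(cells == np.asarray(cell), axis=1))
        V[i] = 1.0 if (obs - exp) / np.sqrt(exp) >= z else 0.0
    return V
```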
5 Training and Classification of the Network
A network can be trained using the standard back-propagation algorithm with the supervised class label for each datum, where the vector $W_j$ or $V_j$ is the input. In the testing phase, a new datum $Z_j$ with an unknown class label is assumed to belong to one of the given classes; hence it is expected that the partitioning scheme generated in the training session can be applied as well. $Z_j$ is partitioned to the predefined levels according to the same scheme identified in training on the data universe. Then $Z_j$ is converted to the corresponding vector $W_j$ or $V_j$ using equation (10) or (12). These vectors are applied to the feedforward network. The output node with the highest activation which surpasses a threshold value identifies the class of the unknown input datum. If there is no output node with an activation surpassing the threshold, then the datum remains unclassified (or rejected).
6 Experimentation with Speech Data
To show how the algorithm performs on low level speech data, we performed an
experiment on its ability to identify relevant speech characteristics as well as its
ability to distinguish the speaker identity. The data used for these experiments
involved 3 speakers, on 3 words pronounced as "quit", "printer" and "end". Three
classes of data were defined, one corresponding to each speaker. Each of the three
words was spoken 10 times by each speaker, resulting in 30 speech sample data per
speaker, and a total of 90 samples. Each speech sample is represented by a set of
points composed of three variables as a 3-vector (time, frequency, amplitude). A
single utterance generated slightly over 12,000 data points.
The experiment used the "hold-out test" to evaluate the algorithm for speaker
identification. Each run consisted of selecting 5 samples on a word randomly from
each speaker class for training. The remaining 5 samples were used for testing. With
3 speakers, each run consists of 15 test data and 15 training data. With 3 words,
and performing 10 runs on each word, a total of 30 runs was done for a total of 450
test samples. The results from the experiments showed that the system performed
reasonably well. A total of 369 out of 450 were classified correctly. Of those that
were not classified, about half were rejected. Total success rate was about 82%.
7 Experimentation with Classifying Data Forming Interlocking
Spirals
These experiments illustrate the algorithm on classifying overlapping interlocking spirals of points [6]. In this set of experiments, the data are generated using different parameters to define two classes, so that each datum consists of a number of points forming spirals. Here, the points in a spiral were artificially generated so that the points in a spiral could overlap with points in another datum even though they may belong to different classes. Thus the classification of each datum has to depend on a large number of points jointly. Each data sample was composed of 96 points. To create a more difficult problem with probabilistic uncertainty, a degree of randomness was introduced so that each spiral, as well as each point in it, had its radius shifted by a random quantity. In total, 60 data samples were generated for use in all the experiments. Each experiment consisted of 10 runs, where each run was composed of 30 training data, 15 per class. The test set then used the remaining 30 unchosen samples, once again 15 per class. In total 89 runs were performed, using different confidence levels and numbers of intervals. The results, based on a total of 2670 test samples, were: correctly recognized 92.9%, rejected 5.0% and incorrectly classified 2.1%.
REFERENCES
[1] Bie C., Shen H. and Chiu D.K.Y., Hierarchical maximum entropy partitioning in texture image analysis, Pattern Recognition Letters, Vol.14 (1993), pp421-429.
[2] Chiu D.K.Y., Towards an event-space self-configurable neural network, Proc. 1993 IEEE Intern. Conference on Neural Networks (1993), pp956-961.
[3] Chiu D.K.Y., Information discovery through hierarchical maximum entropy discretization and synthesis, in: Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley (Eds), MIT/AAAI Press (1991), pp125-140.
[4] Shen H.C., Bie C. and Chiu D.K.Y., A texture-based distance measure for classification, Pattern Recognition, Vol.26 (1993), No.9, pp1429-1437.
[5] Haberman S.J., The analysis of residuals in cross-classified tables, Biometrics, Vol.29 (1973), pp205-209.
[6] Lang K.J. and Witbrock M.J., Learning to tell two spirals apart, Proc. 1988 Connectionist Models Summer School, Carnegie Mellon University, Morgan Kaufmann Publishers (1988), pp52-59.
[7] Wong A.K.C. and Chiu D.K.Y., Synthesizing statistical knowledge from incomplete mixed-mode data, IEEE Trans. PAMI, Vol.PAMI-9 (1987), No.6, pp796-805.
APPROXIMATING THE BAYESIAN DECISION
BOUNDARY FOR CHANNEL EQUALISATION USING
SUBSET RADIAL BASIS FUNCTION NETWORK
E.S. Chng, B. Mulgrew* , S. Chen** and G. Gibson***
National University of Singapore, Institute of System Science, Heng Mui Keng
Terrace, Kent Ridge, 119597 Singapore. Email: [email protected].
* Dept. of Electrical Eng., The University of Edinburgh, UK.
** Dept. of E.E.E., The University of Portsmouth, UK.
*** Biomathematics and Statistics, The University of Edinburgh, UK.
The aim of this paper is to examine the application of the radial basis function (RBF) network to realise the decision function of a symbol-decision equaliser for digital communication systems. The paper first studies the Bayesian equaliser's decision function to show that the decision function is nonlinear and has a structure identical to the RBF model. Implementing the full Bayesian equaliser using an RBF network, however, requires very high complexity, which is not feasible for practical applications. To reduce the implementation complexity, we propose a model selection technique to choose the important centres of the RBF equaliser. Our results indicate that a reduced-size RBF equaliser can be found with no significant degradation in performance if the subset models are selected appropriately.
Keywords: RBF network, Bayesian equaliser, neural networks.
1 Introduction
The transmission of digital signals across a communication channel is subject to noise and intersymbol interference (ISI). At the receiver, these effects must be compensated to achieve reliable data communications [1, 2]. The channel, consisting of the transmission filter, transmission medium and receiver filter, is modelled as a finite impulse response (FIR) filter with a transfer function $H(z) = \sum_{i=0}^{n_a - 1} a(i) z^{-i}$. The effect on the randomly transmitted signal $s(k) = \{\pm 1\}$ passing through the channel is described by
$$r(k) = \bar r(k) + n(k) = \sum_{i=0}^{n_a - 1} s(k - i)\, a(i) + n(k) \qquad (1)$$
where $r(k)$ is the corrupted signal of $s(k)$ received by the equaliser at sampling instant $k$, $\bar r(k)$ is the noise-free observed signal, $n(k)$ is the additive Gaussian white noise, $a(i)$ are the channel impulse response coefficients, and $n_a$ is the channel's memory length [1, 2]. Using a vector of the noisy received signal, $r(k) = [r(k), \ldots, r(k - m + 1)]^T$, the equaliser's task is to reconstruct the transmitted symbol $s(k - d)$ with the minimum probability of mis-classification, $P_E$. The integers $m$ and $d$ are known as the feedforward and delay order respectively. The measure of an equaliser's performance, $P_E$, more commonly expressed as the bit error rate (BER), $BER = \log_{10} P_E$, in the communication literature [1], is expressed with respect to the signal-to-noise ratio (SNR), where the SNR is defined by
$$SNR = \frac{E[\bar r^2(k)]}{E[n^2(k)]} = \frac{\sigma_s^2 \sum_{i=0}^{n_a - 1} a(i)^2}{\sigma_n^2} = \frac{\sum_{i=0}^{n_a - 1} a(i)^2}{\sigma_n^2} \qquad (2)$$
where $\sigma_s^2 = 1$ is the transmit symbol variance and $\sigma_n^2$ is the noise variance.
The transmitted symbols that affect the input vector $r(k)$ form the transmit sequence $s(k) = [s(k), \ldots, s(k - m - n_a + 2)]^T$. There are $N_s = 2^{m + n_a - 1}$ possible combinations
of these input sequences, i.e. $\{s_j\}$, $1 \le j \le N_s$ [2]. In the absence of noise, there are $N_s$ corresponding received sequences $\bar r_j(k)$, $1 \le j \le N_s$, which are referred to as channel states. The values of the channel states are defined by
$$c_j = \bar r_j(k) = F s_j, \quad 1 \le j \le N_s, \qquad (3)$$
where the matrix $F \in R^{m \times (m + n_a - 1)}$ is
$$F = \begin{bmatrix}
a(0) & a(1) & \cdots & a(n_a - 1) & 0 & \cdots & 0 \\
0 & a(0) & a(1) & \cdots & a(n_a - 1) & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & \cdots & 0 & a(0) & a(1) & \cdots & a(n_a - 1)
\end{bmatrix}. \qquad (4)$$
Due to the additive noise, the observed sequence $r(k)$ conditioned on the channel state $\bar r(k) = c_j$ has a multivariate Gaussian distribution with mean $c_j$,
$$p(r(k)|c_j) = (2\pi\sigma_n^2)^{-m/2} \exp\left(-\|r(k) - c_j\|^2 / (2\sigma_n^2)\right). \qquad (5)$$
The set of channel states $C_d = \{c_j\}_{j=1}^{N_s}$ can be divided into two subsets according to the value of $s(k - d)$, i.e.
$$C_d^{(+)} = \{\bar r(k) \mid s(k - d) = +1\}, \qquad (6)$$
$$C_d^{(-)} = \{\bar r(k) \mid s(k - d) = -1\}, \qquad (7)$$
where the subscript $d$ in $C_d$ denotes the equaliser's delay order applied.
To minimise the probability of a wrong decision, the optimum decision function is based on determining the maximum a posteriori probability $P(s(k - d) = s \mid r(k))$ [2] given the observed vector $r(k)$, i.e.,
$$\hat s(k - d) = \mathrm{sgn}\left( P(s(k - d) = +1 \mid r(k)) - P(s(k - d) = -1 \mid r(k)) \right) \qquad (8)$$
where $\hat s(k - d)$ is the estimated value of $s(k - d)$. It has been shown in [2] that the Bayesian decision function can be reduced to the following form,
$$f_B(r(k)) = \sum_{c_j \in C_d^{(+)}} \exp\left(-\frac{\|r(k) - c_j\|^2}{2\sigma_n^2}\right) - \sum_{c_k \in C_d^{(-)}} \exp\left(-\frac{\|r(k) - c_k\|^2}{2\sigma_n^2}\right). \qquad (9)$$

 j   s(k)  s(k-1)  s(k-2)   r̄(k)   r̄(k-1)
 1    1      1       1       1.5     1.5
 2    1      1      -1       1.5    -0.5
 3    1     -1       1      -0.5     0.5
 4    1     -1      -1      -0.5    -1.5
 5   -1      1       1       0.5     1.5
 6   -1      1      -1       0.5    -0.5

Figure 1  (a) Transmit sequences and channel states for channel H(z); (b) corresponding Bayesian decision boundaries for various delay orders.
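The channel states listed in Figure 1a can be reproduced with a few lines, assuming (an inference from the listed values, not stated explicitly in the surviving text) the channel H(z) = 0.5 + z^{-1} with m = 2:

```python
import numpy as np
from itertools import product

a = np.array([0.5, 1.0])                        # assumed channel taps a(0), a(1)
m, na = 2, len(a)
for s in product([1, -1], repeat=m + na - 1):   # all N_s = 2^(m+na-1) sequences
    s = np.array(s)                             # s = (s(k), s(k-1), s(k-2))
    rbar = [np.dot(a, s[i:i + na]) for i in range(m)]   # noise-free r(k), r(k-1)
    print(s, rbar)                              # matches the Figure 1a rows
```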
When $\sigma_n \to 0$, the sum on the l.h.s. of Eq. 11 becomes dominated by the centres closest to $r_0$, i.e.
$$U_d^+ = \arg\min_{c_j \in C_d^{(+)}} \{\|r_0 - c_j\|\}. \qquad (12)$$
This is because the contribution from the terms $\exp(-\|r_0 - c_j\|^2/(2\sigma_n^2))$ for centres $c_j \notin U_d^+$ converges much more quickly to zero when $\sigma_n \to 0$ than the terms for centres belonging to $U_d^+$. Similarly, the sum on the r.h.s. of Eq. 11 becomes dominated by the closest terms for centres belonging to $U_d^-$, where $U_d^- = \arg\min_{c_k \in C_d^{(-)}} \{\|r_0 - c_k\|\}$.
At very high SNR, the asymptotic decision boundaries are hyperplanes between pairs of channel states belonging to $U_d^+$ and $U_d^-$ [4]. However, not all channel states of $\{U_d^+, U_d^-\}$ are required to define the decision boundary. This can be observed from the example illustrated in Fig. 1b for the decision boundary realised using delay order $d = 2$. By visual inspection (Fig. 1b), it is obvious that $\{c_3, c_7\} \in U_d^+$ and $\{c_4, c_8\} \in U_d^-$. The decision boundaries formed using centres $\{c_3, c_4\}$ and $\{c_7, c_8\}$ are, however, the same. Therefore, in this case, only one pair of channel states, either $\{c_3, c_4\}$ or $\{c_7, c_8\}$, is sufficient to approximate that region of the decision boundary.
To find the set of important centres $\{U_{ds}^+, U_{ds}^-\}$ for the subset RBF equaliser, we propose the following algorithm:

For $c_j \in C_d^{(+)}$
  For $c_k \in C_d^{(-)}$
    $r_{j,k} = c_j + (c_k - c_j)/2$
    if [ $f_B(r_{j,k}) = 0$ and
         $c_j = \arg\min_{c_i \in C_d^{(+)}} \{\|r_{j,k} - c_i\|\}$ and
         $f_s(r_{j,k}) \neq 0$ ]
      $c_j \to U_{ds}^+$, $c_k \to U_{ds}^-$
  next $c_k$,
next $c_j$.

where $f_s(\cdot)$ is the RBF model formed using the currently selected channel states from $\{U_{ds}^+, U_{ds}^-\}$ as centres and $f_B(\cdot)$ is the full RBF Bayesian equaliser's decision function.
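A sketch of this selection loop, testing midpoints against a small numerical tolerance instead of exact zeros (an implementation choice of ours); the full decision function (9) and the growing subset model share one evaluation routine.

```python
import numpy as np

def f_rbf(r, centres_pos, centres_neg, sigma2):
    """RBF decision function of form (9) over two sets of centres."""
    g = lambda C: sum(np.exp(-np.sum((r - c) ** 2) / (2 * sigma2)) for c in C)
    return g(centres_pos) - g(centres_neg)

def select_subset(C_plus, C_minus, sigma2, tol=1e-3):
    U_plus, U_minus = [], []
    for cj in C_plus:
        for ck in C_minus:
            r = cj + (ck - cj) / 2.0              # midpoint r_jk
            on_boundary = abs(f_rbf(r, C_plus, C_minus, sigma2)) < tol
            nearest = np.allclose(cj, min(C_plus, key=lambda c: np.linalg.norm(r - c)))
            covered = len(U_plus) > 0 and abs(f_rbf(r, U_plus, U_minus, sigma2)) < tol
            if on_boundary and nearest and not covered:
                U_plus.append(cj); U_minus.append(ck)
    return U_plus, U_minus
```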
2.1 Subset Model Selection: Some Simulation Results
Simulations were conducted to select subset RBF equalisers from the full model. The following channels, which have the same magnitude but different phase responses, were used:
$$H_1(z) = 0.8745 + 0.4372 z^{-1} - 0.2098 z^{-2} \qquad (13)$$
$$H_2(z) = 0.2620 - 0.6647 z^{-1} - 0.6995 z^{-2} \qquad (14)$$
The feedforward order used was $m = 4$, resulting in a full model with $N_s = 2^{m + n_a - 1} = 64$ centres. Using an SNR of 16 dB, simulations were conducted to compare the performance of the subset RBF and full RBF equalisers for the two channels. The results are tabulated in Tables 1a and 1b respectively. The first column of each table indicates the delay order parameter, the second column indicates the number of channel states selected to form the subset model, the third and fourth columns list the BER performance of the two equalisers, and the last column indicates whether the channel states belonging to the different transmit symbols, i.e. $C_d^{(+)}$ and $C_d^{(-)}$, are linearly separable or not. Our results show that a reduced-size RBF equaliser with performance very similar to the full model's can usually be found for equalisation problems that are linearly separable.
Table 1  Comparing the performance of the full-size (64 centres) RBF equaliser and the subset RBF equaliser for Channel H1(z) (Table 1a) and Channel H2(z) (Table 1b) at SNR = 16 dB.
3 Conclusions
This paper examined the application of the RBF network to channel equalisation. It was shown that the optimum symbol-decision equaliser can be realised by an RBF model if the channel statistics are known. The computational complexity required to implement the full Bayesian function using the RBF network is, however, considerable. To reduce implementation complexity, a method of model selection to reduce the number of centres in the RBF model is proposed. Our results indicate that the model size, and hence implementation complexity, can be reduced without significantly compromising classification performance in some cases.
REFERENCES
[1] S.U.H. Qureshi, Adaptive equalization, Proc. IEEE, Vol.73 (1985), No.9, pp1349-1387.
[2] S. Chen, B. Mulgrew and P.M. Grant, A clustering technique for digital communications channel equalization using radial basis function networks, IEEE Trans. Neural Networks, Vol.4 (1993), No.4, pp570-579.
[3] M.J.D. Powell, Radial basis functions for multivariable interpolation: a review, in: Algorithms for Approximation, J.C. Mason and M.G. Cox (Eds), Oxford (1987), pp143-167.
[4] R.A. Iltis, A randomized bias technique for the importance sampling simulation of Bayesian equalisers, IEEE Trans. Communications, Vol.43 (1995), pp1107-1115.
APPLICATIONS OF GRAPH THEORY TO THE
DESIGN OF NEURAL NETWORKS FOR AUTOMATED
FINGERPRINT IDENTIFICATION
Carol G. Crawford
U.S. Naval Academy, Department of Mathematics, Annapolis, USA.
This paper presents applications of graph theory to the design of graph matching neural networks
for automated fingerprint identification. Given a sparse set of minutiae from a fingerprint image,
complete with locations in the plane and (optionally) other labels such as ridge angles, ridge
counts to nearby minutiae and so on, this approach to matching begins by constructing a graph-
like representation of the minutiae map, utilizing proximity graphs, such as the sphere-of-influence
graphs. These graph representations are more robust to noise such as translations, rotations and
deformations. This paper presents the role of these graph representations in the design of graph
matching neural networks for the matching and classification of fingerprint images.
1 Introduction
Matching the representations of two images has been the focus of extensive research
in computer vision and artificial intelligence. In particular, the problem of matching
fingerprint images has received wide attention and varying approaches to its solution. In this paper we present results of an ongoing collaborative research program
which combines techniques and methods from graph theory and neural science to
design algorithms for graph matching neural networks.
This collaborative approach with Eric Mjolsness, University of Southern California,
San Diego, and Anand Rangarajan, Center for Theoretical and Applied Neural
Science at Yale, is an outgrowth of an initial investigation for the Federal Bureau
of Investigation into their existing Integrated Automated Fingerprint Identification
System, IAFIS.
In this research program, algorithms and techniques from discrete mathematics,
graph theory and computer science are combined to develop methods and algo-
rithms for representing and matching fingerprints in a very large database, such as
the one at the FBI. The Federal Bureau of Investigation and National Institute of
Standards and Technology provided a small database of fingerprints. This database,
together with the software environment at the Center for Theoretical and Applied Neural Science at Yale University, has provided a test bed for these algorithms.
The following presents the background of the fingerprint problem and this research, with special emphasis on the role of proximity graphs in the design of
the graph matching neural networks.
2 Fingerprint Images, Minutiae and Graph Representations
2.1 Minutiae Maps
Fingerprint matching and identification dates back to 1901 when it was introduced
by Sir Edward Henry for Scotland Yard. After sorting the fingerprints into classes
such as whorls, loops and arches, matches are made according to comparisons of
minutiae. Minutiae include such indications as ridge endings, islands and bifurca-
tions, with fingerprints averaging 100 or more per print. Today fingerprints are
initially stored on computer as a raw minutiae map in the form of a list of minutiae
positions and ridge angles in a raster-scan order. In American courts a positive
matching of a dozen minutiae usually suffices for identification. However, for an
average computer to make these dozen matches, the process would entail locating every minutia in both prints and then comparing all ten-thousand-plus possible pairings of these minutiae. In addition, the minutiae map itself is very non-robust to likely noise such as translations, rotations and deformations, which can change every minutia's position. A subtler form of noise is the gradual increase in ridge width or image scale typically encountered in moving from the top of the image to the bottom. Thus, it is desirable to determine graph-like representations which are more robust to noise and less susceptible to problems with missing minutiae.
2.2 Graph Representations of Minutiae Maps
Given a sparse set of minutiae from one fingerprint image, complete with their
locations in the plane and (optionally) other labels such as ridge angles, ridge
counts to nearby minutiae and so on, we construct a graph-like representation of
the minutiae map. By considering relationships between pairs of minutiae such as
their geometric distance in the plane, or the number of intervening ridges between
them, we can begin to construct features which are robust against translations and
rotations at least. However, there still exists the very serious problem of reorderings
of the minutiae forced by rotation and by missing or extra minutiae. This problem must be addressed by defining a reordering-independent match metric between two such graphs.
Complete Graphs and Planar Distances
The simplest example of a labelled graph representation would be the complete
graph where every pair of minutiae are linked by an "edge" in the graph. Edges
would be labelled by the 2-d Euclidean distance between the minutiae. Note that
this graph would require a special definition of the match metric to handle miss-
ing, extra and reordered minutiae. Furthermore, most of the edges would connect
distant minutiae whose relationships, such as planar distance or ridge count, are
subject to noise and provide less real information than nearby edges. So for reasons
of robustness and computational cost, it makes sense to consider instead various
kinds of "proximity graphs" , which keep only the edges between minutiae that are
"neighbors" according to some criterion.
Sphere-of-Influence Graphs and Other Proximity Graph Representations
Sphere-of-Influence graphs comprise the first set of proximity graphs which we
considered in our goal of determining a better class of minutiae map representations.
First introduced by Toussaint [9], sphere-of-influence graphs provide a potentially robust representation for minutiae maps. These graphs are capable of capturing low-level perceptual structures of visual scenes consisting of dot patterns. A very
active group of researchers have developed a series of significant results dealing
with this class of graphs. We refer the reader to the work of M. Lipman, [4,5,6]; F.
Harary, [5,6]; M. Jacobson, [5,6]; T. S. Michael,[7,8]; and T. Quint, [7,8].
The following definition is that given by Toussaint [9]. Let V = {p_1, ..., p_n} be a finite set of points in the plane. For each point p in V, let r_p be the distance from p to the closest other point in the set, and let C_p be the circle of radius r_p centered at p. The sphere-of-influence graph, or SIG, is the graph on V with an edge between points p, q if and only if the circles C_p, C_q intersect in at least two
places. For various illustrations of sphere-of-influence graphs we refer the reader to
the excellent presentation by Toussaint in [9].
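As a concrete illustration (our own sketch, not code from the paper), the definition translates directly into a few lines: two points are joined exactly when their spheres of influence overlap, i.e. when their distance does not exceed the sum of their nearest-neighbour radii.

```python
import numpy as np

def sphere_of_influence_edges(points):
    """SIG of Toussaint: connect p and q iff the circles of radius
    r_p, r_q (distance to each point's nearest neighbour) intersect,
    i.e. dist(p, q) <= r_p + r_q."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    r = d.min(axis=1)                     # nearest-neighbour radii
    n = len(pts)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if d[i, j] <= r[i] + r[j]]

# Two tight clusters: the SIG keeps the within-cluster structure and,
# for well separated clusters, omits the long between-cluster edges.
minutiae = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(sphere_of_influence_edges(minutiae))
```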
One can note from the prior example that perceptually salient groups of dots be-
come even more distinct in the corresponding sphere-of-influence graph. However,
SIGs represent only one group of an even richer class of graphs we refer to as
proximity graphs. These graphs also offer various benefits in their potential for
providing robust representations of minutiae maps. Proximity graphs all share the property that they contain only edges between minutiae that are "neighbors" according to some given criterion. The graphs which have turned out to be the most promising representations include relative neighborhood graphs, Delaunay triangulations, Voronoi diagrams, minutiae distance graphs and generalized sphere-of-influence graphs. We refer the reader to the excellent survey article by Toussaint [9] for more details on proximity graphs.
Generalized Sphere-of-Influence Graph Representations
K-sphere-of-influence graphs (k-SIGs) are generalizations of SIGs, in which a vertex is connected to k nearby vertices depending only on their relative distances, not absolute distances (Guibas, Pach and Sharir [10]). Given a set V of n points in R^d, the kth sphere-of-influence of a point x in V is the smallest closed ball centered at x containing more than k points of V (including x). The case k = 1 yields the standard sphere-of-influence graph. The kth sphere-of-influence graph G_k(V) of V is the graph whose vertices are the points of V, and two points are connected by an edge if their kth spheres-of-influence intersect.
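A sketch of the generalized construction, under the same assumptions as the previous snippet; setting k = 1 recovers the ordinary SIG:

```python
import numpy as np

def k_sig_edges(points, k=2):
    """kth sphere-of-influence graph G_k(V): r_k(p) is the radius of the
    smallest closed ball about p containing more than k points of V
    (p itself plus its k nearest neighbours)."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    r = np.sort(d, axis=1)[:, k]   # column 0 is the zero distance to p itself
    n = len(pts)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if d[i, j] <= r[i] + r[j]]
```

Because each r_k(p) scales with the point set, rescaling all coordinates leaves the edge set unchanged, which is the scale-invariance property discussed next.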
K-SIGs are especially suitable for minutiae map representations because connections depend on relative distances. This property provides a form of scale invariance. Each edge can be labelled with the integer k, recording whether it connects nearby (k = 1) or more distant minutiae pairs. Unfortunately, this scale robustness is bought at the price of increased susceptibility to missing minutiae. When a minutia goes missing, not only is there the unavoidable effect on the graph of deleting the node, but there is a gratuitous "splash" effect on the edges between nearby pairs of the remaining minutiae: their k-numbers change despite the fact that their planar distances do not. This effect is mitigated by the match metric, which changes only gradually with k, but it is still undesirable.
Finally, we define yet another graph, which is a hybrid between planar distance graphs and k-SIGs. We begin by creating a k-SIG with an overly large value of k. However, we label the edges with the planar distance d. Next we find the local image scale factor by finding the best constant of proportionality between d and k in each image region. Dividing all d's by this coefficient turns them into less noisy, non-integer versions of k; this scaled version of d then becomes the importance rating for an edge. We then proceed to a graph and a match metric as in the planar distance example. For this hybrid representation, minutia deletion affects d only via the local scaling factor, which is determined by many different values of k and is therefore fairly robust, preserving scale invariance.
3 Graph-Matching Neural Network Implementation
Graph representations of minutiae maps provide only the first step in developing a
matching scheme for fingerprints. The second part of this research effort is devoted to the design of graph matching algorithms and their implementation as neural networks.
Acknowledgements
Research supported by The Office of Naval Research, contract N0001494WR23025. This work is an outgrowth of a cooperative program with Eric Mjolsness of The University of Southern California and Anand Rangarajan of The Center for Theoretical and Applied Neural Science at Yale University. This program was originally funded by The Federal Bureau of Investigation.
ZERO DYNAMICS AND RELATIVE DEGREE OF
DYNAMIC RECURRENT NEURAL NETWORKS
A. Delgado, C. Kambhampati* and K. Warwick*
National University of Colombia, Elec. Eng. Dept., AA No. 25268,
Bogota, DC, Colombia, SA. Email: [email protected]
*Cybernetics Department, University of Reading, UK.
In this paper the differential geometric control theory is used to define the key concepts of relative
degree and zero dynamics for a Dynamic Recurrent Neural Network (DRNN). It is shown that the
relative degree is the lower bound for the number of neurons and the zero dynamics are responsible
for the approximating capabilities of the network.
1 Introduction
Most of the current applications of neural networks to control nonlinear systems
rely on the classical NARMA approach [1,2]. This procedure, powerful in itself, has
some drawbacks [3]. On the other hand, a DRNN is described by a set of nonlinear
differential equations and can be analysed using differential geometric techniques
[4].
In this work, the important concepts of zero dynamics and relative degree from the
differential geometric control theory are formulated for a control affine DRNN.
2 Mathematical Preliminaries
Consider the nonlinear control affine system
ẋ_1 = x_2
ẋ_2 = x_3
...
ẋ_r = f_r(x_1, ..., x_r, x_{r+1}, ..., x_n) + g_r · u        (1)
ẋ_{r+1} = f_{r+1}(x_1, ..., x_r, x_{r+1}, ..., x_n) + g_{r+1} · u
...
ẋ_n = f_n(x_1, ..., x_r, x_{r+1}, ..., x_n) + g_n · u
y = x_1
these equations can be written in the compact form
ẋ = f(x) + g · u
y = h(x)        (2)
where x ∈ R^n, u ∈ R, y ∈ R, f(x) and g are vector fields, and h(x) is a scalar field. For
the system (1) there are two key concepts in the differential geometric framework,
that is the zero dynamics and the relative degree.
2.1 Zero Dynamics
The zero dynamics of the system (1) describe its behaviour when the output y(t)
is forced to be zero [4]. With the output zero, the initial state of the system must
be set to a value such that (Xl(O), ... , Xr(O)) are zero, whereas (Xr+l(O), ... , Xn(O))
can be chosen arbitrarily. In addition, the input u(t) must be the solution of the
equation
0 = f_r(0, ..., 0, x_{r+1}(t), ..., x_n(t)) + g_r · u        (3)
Solving for u(t) in (3) and replacing in the remaining equations, the zero dynamics are given by the set of differential equations
ẋ_i = f_i(0, ..., 0, x_{r+1}, ..., x_n) - (g_i/g_r) · f_r(0, ..., 0, x_{r+1}, ..., x_n),   i = r+1, ..., n        (4)
The zero dynamics play a role similar to that of the zeros of the transfer function
in a linear system. If the zero dynamics are stable then the system (1) is said to be
minimum phase.
2.2 Relative Degree
The relative degree of a dynamical system is defined as the number of times that the output y(t) must be differentiated with respect to time in order for the input u(t) to appear explicitly; equivalently, it is the number of state equations that the input u(t) must pass through in order to reach the output y(t).
The nonlinear system (2) is said to have relative degree r [3,4] if
L_g L_f^i h(x) = 0,   i = 0, ..., r - 2
L_g L_f^{r-1} h(x) ≠ 0
3 DRNN
A DRNN is described by the set of nonlinear differential equations
ẋ_i = -x_i + Σ_{j=1}^{N} w_ij · σ(x_j) + γ_i · u,   i = 1, ..., N        (5)
y = x_1
or in matrix form
ẋ = -x + W · Σ(x) + Γ · u        (6)
y = x_1
where x ∈ R^N, W ∈ R^{N×N}, Γ ∈ R^{N×1}, and Σ(x) = (σ(x_1), ..., σ(x_N))^T.
For control purposes it was demonstrated [3] that the network (6) can approximate nonlinear systems of the class (2); the resulting model can be analysed using the differential geometric framework. The aim of this paper is to propose a canonical structure for the network (6) in order to obtain any desired relative degree, with zero dynamics similar to (4).
Theorem 1 The DRNN (6) with the following matrices W and Γ can have any relative degree r ∈ [2, N]: W has the staircase pattern w_ij = 0 for j > i + 1,
W = [ w_11  w_12  0     0    ...  0
      w_21  w_22  w_23  0    ...  0
      ...
      w_N1  w_N2  ...   ...  w_NN ]
and Γ = (0, ..., 0, γ_r, ..., γ_N)^T.
This particular structure is called the staircase structure. Note that the minimum number of neurons is the relative degree r. For a desired relative degree r = 1, the coefficient γ_1 must be nonzero.
Proof Applying the definition of relative degree
y = x_1
ẏ = ẋ_1 = -x_1 + w_11 · σ(x_1) + w_12 · σ(x_2)
After the first derivative, every new derivative of y(t) introduces a new state equation (staircase structure). Then after r derivatives of the output y(t), the input u(t) appears explicitly. □
Theorem 2 The network of Theorem 1 with a sigmoid function satisfying σ(0) = 0 has zero dynamics described by the set of differential equations
ẋ_i = -x_i + Σ_{j=r+1}^{N} (w_ij - (γ_i/γ_r) · w_rj) · σ(x_j),   i = r+1, ..., N
Proof The zero dynamics are defined as the resulting dynamics when the output y(t) is constrained to be zero. In the structure proposed, y(t) = 0 means that the first r state variables are zero, x_1(t) = ... = x_r(t) = 0, and σ(x_1(t)) = ... = σ(x_r(t)) = 0. The equation for ẋ_r(t) becomes
0 = Σ_{j=r+1}^{N} w_rj · σ(x_j) + γ_r · u
Solving for u,
u = -(1/γ_r) Σ_{j=r+1}^{N} w_rj · σ(x_j)
Substituting this input into the state equations for i = r+1, ..., N gives the zero dynamics stated above. □
Example: Consider the following staircase DRNN
ẋ_1 = -x_1 + w_11 · σ(x_1) + w_12 · σ(x_2)
ẋ_2 = -x_2 + w_21 · σ(x_1) + w_22 · σ(x_2) + w_23 · σ(x_3) + γ_2 · u
ẋ_3 = -x_3 + w_31 · σ(x_1) + w_32 · σ(x_2) + w_33 · σ(x_3) + γ_3 · u
y = x_1
σ(0) = 0
Relative Degree: The first derivative of the output is
ẏ = ẋ_1 = -x_1 + w_11 · σ(x_1) + w_12 · σ(x_2)
The input u does not appear, so the relative degree is greater than one. The second derivative of the output is
ÿ = -ẋ_1 + w_11 · σ'(x_1) · ẋ_1 + w_12 · σ'(x_2) · ẋ_2
where σ'(x_i) = dσ(x_i)/dx_i. Replacing ẋ_1 and ẋ_2,
ÿ = (-1 + w_11 · σ'(x_1)) · (-x_1 + w_11 · σ(x_1) + w_12 · σ(x_2)) + w_12 · σ'(x_2) · (-x_2 + w_21 · σ(x_1) + w_22 · σ(x_2) + w_23 · σ(x_3) + γ_2 · u)
Notice that the input appears explicitly, so the relative degree is r = 2.
Zero Dynamics: The proposed network has three states and relative degree two, so the zero dynamics have one state.
Following the definition of zero dynamics [4], y = x_1 = 0. The first state equation reduces to
0 = -0 + w_11 · σ(0) + w_12 · σ(x_2)
which yields x_2 = 0. The second state equation is
0 = -0 + w_21 · σ(0) + w_22 · σ(0) + w_23 · σ(x_3) + γ_2 · u
Solving for the input gives u = -(w_23/γ_2) · σ(x_3); the third state equation then yields the one-dimensional zero dynamics ẋ_3 = -x_3 + (w_33 - (γ_3/γ_2) · w_23) · σ(x_3), in agreement with Theorem 2.
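A quick numerical check of this example is straightforward (our own sketch, with assumed weights; σ = tanh satisfies σ(0) = 0):

```python
import numpy as np

sigma = np.tanh
W = np.array([[0.5, 1.0, 0.0],
              [0.3, -0.4, 0.8],
              [0.2, 0.1, -0.6]])
g2, g3 = 1.0, 0.5          # gamma_2, gamma_3 (gamma_1 = 0)
dt = 1e-3

x = np.array([0.0, 0.0, 1.0])   # x1 = x2 = 0 keeps y = 0
z3 = x[2]                       # reduced zero-dynamics state
for _ in range(5000):
    u = -(W[1, 2] / g2) * sigma(x[2])          # input that holds y = 0
    x = x + dt * (-x + W @ sigma(x) + np.array([0.0, g2, g3]) * u)
    # one-dimensional zero dynamics from Theorem 2:
    z3 = z3 + dt * (-z3 + (W[2, 2] - (g3 / g2) * W[1, 2]) * sigma(z3))

print(x[:2])     # remains (numerically) at zero
print(x[2], z3)  # full model and reduced zero dynamics agree
```

Holding the input at its zero-dynamics value keeps y = x_1 at zero, and x_3 tracks the reduced one-dimensional equation.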
THE IRREGULAR SAMPLING APPROACH
A. Dzielinski and R. Zbikowski
Figure 1 Iterative map f: [0,1] × [0,1] → [0,1] defined by y_{k+1} = f(y_k, y_{k-1}).
Pairs (ξ_k, f(ξ_k)) are generated when f is the right-hand side (RHS) of a dynamic system. For simplicity, instead of dealing with a controlled system of type (1), we concentrate on the low-order autonomous case
y_{k+1} = f(y_k, y_{k-1})        (2)
with y ∈ [0,1]. Thus ξ_k = (y_k, y_{k-1}). We also assume that f is continuous; its domain, D = [0,1]^2 for (2), is compact (and connected).
The essential observation is that the y_k, i.e. the ξ_k of the pairs (ξ_k, f(ξ_k)), occur in the x_1x_2 plane in a nonuniform (irregular) way. They will be, in general, unevenly spaced, and their pattern of appearance will depend on the dynamics of (2), or the shape of f: [0,1] × [0,1] → [0,1].
The sample points ξ_k = (y_k, y_{k-1}) appear in the x_1x_2 plane according to the iterative process (2); see Fig. 1. If we start with ξ_1 = (y_1, y_0), where y_1 ∈ 0x_2 and y_0 ∈ 0x_1, then we can read out y_2 from the surface representing f. Then y_2 is reflected through x_3 = x_2 on the x_2x_3 plane, becoming a point on the 0x_2 axis. At the same time, y_1 is reflected through x_2 = x_1 on the x_1x_2 plane, becoming a point on the 0x_1 axis. This results in the point (y_1, y_2) in the x_1x_2 plane, corresponding to the sample ξ_2 = (y_2, y_1). We can now read out y_3 from the surface representing f and repeat the process for k = 3. Now y_3 'migrates' from 0x_3 to 0x_2 and y_2 from 0x_2 to 0x_1, generating the point (y_2, y_3) on the x_1x_2 plane corresponding to the sample ξ_3 = (y_3, y_2), etc.
Thus, we have to address the issue of nonuniform sampling and provide at least existence conditions for recovery of the function, to any degree of accuracy, from a sufficiently large finite number of irregular samples, thereby making the application of neural networks plausible.
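The sample-generation mechanism is easy to reproduce (a sketch under an assumed map f; any continuous f on [0,1]^2 with values in [0,1] would serve):

```python
import numpy as np

def f(yk, ykm1):
    # assumed example map, chosen so that y_k stays in [0, 1]
    return 0.5 * (3.9 * yk * (1.0 - yk) + ykm1)

y = [0.2, 0.7]                             # y0, y1
for _ in range(500):
    y.append(f(y[-1], y[-2]))

xi = np.array([(y[k], y[k - 1]) for k in range(1, len(y) - 1)])
fxi = np.array(y[2:])                      # f(xi_k) = y_{k+1}
# The xi_k are dictated by the dynamics, not by any uniform grid:
print(xi.min(axis=0), xi.max(axis=0))
```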
α^2(X) = ∫_{-X/2}^{X/2} |f(x)|^2 dx / ∫_{-∞}^{∞} |f(x)|^2 dx.
We want to determine how large α^2(X) can be for a given band-limited f; dually, one may define an appropriate measure of concentration of the amplitude spectrum of f(x), say β^2(Ω). Note that if f(x) were indeed space-limited to (-X/2, X/2), then α^2(X) would have its largest value, namely unity. To solve the problem of maximising α^2(X) we have to express f(x) in terms of its amplitude spectrum F(ω) and find the F(ω) for which α^2(X) achieves its maximal value. This is a classical
problem of mathematical physics (see [2]) and we know that the maximising F(w)
must satisfy the integral homogeneous Fredholm equation of the second kind
∫_{-Ω}^{Ω} [sin πX(ω' - ω'') / (π(ω' - ω''))] F(ω'') dω'' = α^2(X) F(ω'),   |ω'| ≤ Ω.        (3)
The solutions to equation (3) are known as prolate spheroidal wave functions (PSWF), and they provide a useful set of band-limited functions (see [5] for more details).
5 The 2XΩ Theorem
Our principal question was how well band-limited and space-limited functions (with the above-mentioned restrictions) are suited to modelling real-world dynamic automatic control systems. To shed more light on this, let us recall the so-called 2XΩ Theorem [4]. Its practical engineering formulation says that if XΩ is large enough, then the space of functions of space range X and 'bandwidth' Ω has dimension ⌈2XΩ⌉. To formulate it in a more rigorous manner we have to introduce the notion of space-limited and band-limited functions in a way avoiding their dependence on the detailed behaviour of the functions or their Fourier transforms at infinity.
So, we say that a function f(x) is space-limited to the interval (-X/2, X/2) at level ε if
∫_{|x|>X/2} |f(x)|^2 dx < ε,
i.e., if the energy outside this space region is less than what is, in some sense, essential for us. In the same way, we say that a function is band-limited with bandwidth Ω at level ε if
∫_{|ω|>Ω/2} |F(ω)|^2 dω < ε,
i.e., the energy outside the frequency range is less than the value we are interested in. Using these newly defined functions we may state that every function is both space-limited and band-limited at level ε (for some X and Ω), as opposed to only one function (f(x) ≡ 0) which is both space-limited and band-limited in the strict sense.
To complete the reformulation of the 2XΩ Theorem we need one more definition. We say that a set of functions F has an approximate dimension N at level ε on the interval (-X/2, X/2) if there exists a set of N = N(X, ε) functions φ_1, φ_2, ..., φ_N such that for each f(x) ∈ F there exist a_1, a_2, ..., a_N such that
∫_{-X/2}^{X/2} [f(x) - Σ_{j=1}^{N} a_j φ_j(x)]^2 dx < ε        (4)
and there is no set of N - 1 functions that will approximate every f(x) ∈ F in this way. This definition says, in other words, that every function in F can be approximated in -X/2 < x < X/2 by a function in the linear span of φ_1, φ_2, ..., φ_N, so that the difference between the function and its approximation is less than ε.
The restated version of the theorem now has the following form:
Theorem 1 Let F_ε be the set of functions space-limited to (-X/2, X/2) at level ε and band-limited to (-Ω, Ω) at level ε. Let N(Ω, X, ε, ε') be the approximate dimension of F_ε at level ε'. Then for every ε' > ε
lim_{X→∞} N(Ω, X, ε, ε')/X = 2Ω,   lim_{Ω→∞} N(Ω, X, ε, ε')/Ω = 2X.
In fact these limits do not depend on ε, and the set of functions which in the real world we must consider to be limited both in space and frequency will always be asymptotically 2XΩ-dimensional.
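A discrete analogue of this dimension count is easy to observe numerically. The sketch below (our own, with assumed sizes N and W) builds the classical band-limiting sinc kernel on N samples; its eigenvalues cluster near 1 and 0, and the number near 1 is approximately 2NW:

```python
import numpy as np

N, W = 128, 0.1                     # assumed number of samples and bandwidth
n = np.arange(N)
diff = n[:, None] - n[None, :]
safe = np.where(diff == 0, 1, diff)                 # avoid division by zero
K = np.where(diff == 0, 2.0 * W,
             np.sin(2.0 * np.pi * W * diff) / (np.pi * safe))

lam = np.linalg.eigvalsh(K)          # K is symmetric
print("2NW =", 2 * N * W)                            # 25.6
print("eigenvalues above 0.5:", int((lam > 0.5).sum()))   # close to 2NW
```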
6 Conclusions
In this paper we have shown how irregular samples are generated by a dynamic system. We argue that in general this is an intrinsic feature of the NARMA model, and that our attempts to reconstruct the function f should be set in the irregular sampling context. The most important question in this setting is that of space- and band-limitedness of the function under consideration. The result of paramount importance from the point of view of function approximation by finite linear combinations of functions is given in the form of the 2XΩ Theorem. Equation (4) stipulates the existence of a finite approximation of a given nonlinear function by a linear combination of functions. It also gives the lower bound for the number of these functions. This allows the application of a neural network with a known (finite) number of neurons.
On the other hand, the 2XΩ Theorem is also interesting from the point of view of function reconstruction from its irregularly spaced samples. In this case we have to assume that the function to be reconstructed is band-limited. For practical reasons it has to be space-limited as well. This problem is normally solved in the context of the theory of complex functions analytic in the entire domain (entire functions and their special types). Let us note that by applying the 2XΩ Theorem, instead of using entire functions of exponential type (as is usually the case in irregular sampling), which is quite restrictive, we deal with functions which are square-integrable, a condition that is easily fulfilled in most practical cases.
REFERENCES
[1] S. Chen and S. A. Billings, Representation of non-linear systems: the NARMAX model, International Journal of Control, Vol. 49 (1989), pp1013-1032.
[2] R. Courant and D. Hilbert, Methods of Mathematical Physics, Vol. 1, Interscience Publishers (1955), New York.
[3] R. M. Sanner and J.-J. E. Slotine, Gaussian networks for direct adaptive control, IEEE Transactions on Neural Networks, Vol. 3 (1992), pp837-863.
[4] D. Slepian, Some comments on Fourier analysis, uncertainty and modelling, SIAM Review, Vol. 25(3) (July 1983), pp379-393.
[5] D. Slepian and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, I, The Bell System Technical Journal, Vol. 40(1) (January 1961), pp43-64.
[6] R. Zbikowski, K. J. Hunt, A. Dzielinski, R. Murray-Smith and P. J. Gawthrop, A review of advances in neural adaptive control systems, Technical Report of the ESPRIT NACT Project TP-1, Glasgow University and Daimler-Benz Research (1994). (Available from FTP server ftp.mech.gla.ac.uk as PostScript file /nact/nact.1p1.ps.)
Acknowledgements
Work supported by ESPRIT III Basic Research Project No. 8039, Neural Adaptive Control Technology.
UNSUPERVISED LEARNING OF TEMPORAL
CONSTANCIES BY PYRAMIDAL-TYPE NEURONS.
Michael Eisele
The Salk Institute, Computational Neurobiology Laboratory,
PO Box 85800, San Diego, CA 92186 - 5800, USA. Email: [email protected]
An unsupervised learning principle is proposed for individual neurons with complex synaptic
structure and dynamical input. The learning goal is a neuronal response to temporal constancies:
If some input patterns often occur in close temporal succession, then the neuron should respond
either to all of them or to none. It is shown that linear threshold neurons can achieve this learning
goal, if each synapse stores not only a weight, but also a short-term memory trace. The online
learning process requires no biologically implausible interactions. The sequence of temporal asso-
ciations can be interpreted as a random walk on the state transition graph of the input dynamics.
In numerical simulations the learning process turned out to be robust against parameter changes.
1 Introduction
Many neocortical neurons show responses that are invariant to changes in position,
size, illumination, or other properties of their preferred stimuli (see review [5]).
These temporal "constancies" or "invariants" do not have to be inborn: In numerical
simulations [2, 3, 6] neurons could learn such responses by associating stimuli that
occur in close temporal succession. Similar temporal associations have also been
observed experimentally [4].
In most numerical simulations [2, 6] temporal associations were formed with the help
of "memory traces" which are running averages of the neuron's recent activity: If
the stimulus S. is often followed by the stimulus Sj, a strong response of the neuron
to S. causes a large memory trace at the time of stimulus Sj, which teaches the
neuron to respont to Sj, too. By presenting the stimuli in reverse temporal order
Sj -+ Si, the response to stimuli Si is likewise strengthened by the response to
stimulus Si.
This simple learning scheme no longer works if the input dynamics is irreversible, that is, if only the transition S_i → S_j occurs. In the following, a more
complex learning scheme will be presented, which can form associations in both
temporal directions of an irreversible input dynamics. For this purpose an addi-
tional memory trace will have to be stored in each synapse. Numerical results are
presented for neuronal input generated by a Markov process which is simpler than
naturally occurring input, but irreversible and highly stochastic. A temporal con-
stancy in such a dynamics consists of a set of states which are closely connected
to each other by temporal transitions. The learned synaptic excitation can be de-
rived from the neuronal firing pattern by interpreting the sequence of temporal
associations as a random walk between Markov states.
2 Neuron Model
The model neuron was chosen to have the simple activation dynamics of a linear
threshold unit, but a complicated learning dynamics. The input signal is generated
by a temporally discrete, autonomous, stochastic dynamics with a finite number
N of states S_i. Each state is connected with a sensory afferent. At any time t, the afferent of the present state R(t) ∈ {S_1, ..., S_N} is set active (x_j(t) = 1 for R(t) = S_j), while all the other afferents are set passive (x_i(t) = 0 for R(t) ≠ S_i). Each afferent forms one synapse with the model neuron. The sum of weighted inputs is called the neuron's activity, a(R(t)) = Σ_i w_i x_i(t). If the activity exceeds a threshold θ, the neuron's output y(R(t)) is 1; otherwise it is 0.
The memory trace <P>_r of any quantity P is defined as its exponentially weighted past average at time t. The subscript r indicates the time scale 1/η_r over which the average is taken (with 0 < η_r < 1). The online change of the synaptic weight w_i is
Δw_i(t) = η_w x_i(t) (α_p <z>_p(t) - p(t)) + η_w α_f z(t) q_i(t)        (6)
The decay term p(t) was defined above. The past and future prefactors α_p and α_f will be discussed below. The quantity z(t) models the postsynaptic depolarization that effects the learning process. It is defined as a sum z(t) := a(t) + γ·y(t) of the depolarization a caused by other synaptic inputs and the depolarization caused by the firing y of the postsynaptic neuron, weighted by a factor γ > 0.
This synaptic learning rule can be rewritten in a more comprehensible form by defining modified synaptic changes Δw̃_i:
Δw̃_i(t) = η_w x_i(t) ( α_p <z>_p(t) + α_f η_f Σ_{τ=1}^{∞} z(t + τ) (1 - η_f)^{τ-1} - p(t) )        (7)
By inserting definitions (1) and (5) into eq. (6) one can easily show that the modified synaptic changes cause approximately the same total changes in the long run: Σ_{t=1}^{T} Δw̃_i(t) ≈ Σ_{t=1}^{T} Δw_i(t) for large T. The approximation is good if the learning period T is much larger than the duration 1/η_f of the synaptic memory trace q_i. It is exact if the depolarization z(t) is zero for t < 1 and t > T.
The definition (7) of Δw̃_i resembles an unsupervised Widrow-Hoff rule, in which the desired activity of the neuron depends on an exponentially weighted average of past and future postsynaptic depolarizations z(t ± τ). Because the future depolarizations z(t + τ) are not yet known at time t, the memory traces q_i had to be introduced in order to construct the online learning rule (6) for Δw_i(t).
3 Interpretation of Temporal Associations as a Random Walk
After a sufficiently long time t_0, any successful learning process should reach a quasistationary state, in which synaptic changes cancel each other in the long run. In the following we will derive the neural activities a(S_i) in the quasistationary state from the neural outputs y(S_j) and the transition rules of the input dynamics. Even in the quasistationary state, the quantities w_i, θ, and p will fluctuate irregularly if the input is generated by a stochastic dynamics. We assume that the learning rates η_w, η_θ and η_p are so small that these fluctuations can be neglected.
Then the activity a, the output y, and the depolarization z = a + γy depend on time t only through the state R(t) of the input dynamics. Under the assumption that the input dynamics is ergodic, the temporal average Σ_{t=1}^{T} Δw̃_i(t + t_0)/T can be replaced by an average over all possible trajectories ... → R(-1) → R(0) → R(1) → ... of the input dynamics. By definition, quasistationarity has been reached if these average weight changes vanish for all synapses. Using definition (7) of Δw̃_i, the condition of quasistationarity now reads:
a(R(0)) = (α_f/p) Σ_{R(1),R(2),...} P_f(R(0)→R(1)→R(2)→...) η_f Σ_{τ=1}^{∞} z(R(τ)) (1 - η_f)^{τ-1}
        + (α_p/p) Σ_{...,R(-2),R(-1)} P_p(...→R(-2)→R(-1)→R(0)) η_p Σ_{τ=1}^{∞} z(R(-τ)) (1 - η_p)^{τ-1}        (8)
for any state R(0). Here P_f(R(0) → R(1) → ...) denotes the probability measure that a trajectory starting at R(0) will subsequently pass through R(1), R(2), .... Analogously, P_p(... → R(-1) → R(0)) denotes the relative probabilities of trajectories ending at R(0).
The right-hand side of eq. (8) can be interpreted as an average over random jumps between states of the input dynamics. Every jump starts at state R(0). It jumps with probability α_f/p into the future and with probability α_p/p into the past. The jump length τ is exponentially distributed, with a mean of 1/η_f (or 1/η_p) time steps being traversed. If the dynamics is stochastic, several end states may be reached after ±τ time steps. The end state R(±τ) is then chosen according to the transition probabilities P_f or P_p, which were defined above. The effective depolarization z = a + γy of the end state is averaged over all possible jumps. According to eq. (8), this average should equal the activity a(R(0)) at the start state R(0).
The activity a(R(0)) at the start state still depends on the unknown activity a(R(±τ)) at the end state. The latter can again be interpreted as an average over random jumps, which start from state R(±τ). By repeating this concatenation of
Figure 1 I_τ is the information which a first spike at time t transmits about the state of the input dynamics at time t + τ. A neuron responding to a randomly chosen set of 32 states would convey very little information (stars). A neuron with suitable learning parameters (see text) can improve its response. Circle: response to an optimal set of 32 states at t ≈ 10^6. Triangle: response to 33 states at t ≈ 2·10^6, if the "drift" is slightly positive (α_f = 0.55; α_p = 0.45; η_f = η_p = 0.8). Bowtie: response to 31 states at t ≈ 2.5·10^6, if a maximal drift (α_f^a = α_p^b = 0) is compensated by interactions of two different dendrites.
jumps, one forms a random walk (which should not be confused with a trajectory of the input dynamics). By averaging eq. (8) over all start states R(0) one can easily show that the total jump probability α_f/p + α_p/p ≈ 1/(1 + γȳ/ā) < 1, so that the random walk is of finite mean length ā/(γȳ). Thus one can sum γy over the end states of all jumps in a random walk and average this sum over all the random walks starting at S_j. By construction, this average should equal the activity a(S_j) once the learning process has reached quasistationarity.
4 Numerical Results
The learning algorithm was tested with input generated by a special Markov pro-
cess, whose high internal symmetry permits an accurate assessment of the learn-
ing process. Each of its 10 · 2^5 = 320 equally probable states is denoted by a digit sequence C B_1B_2B_3B_4B_5, with C ∈ {0, 1, ..., 9} and B_i ∈ {0, 1}. Each state C B_1B_2B_3B_4B_5 forms two equally probable transitions, to the states C' B_2B_3B_4B_5 0 and C' B_2B_3B_4B_5 1, with C' ≡ C + 1 mod 10.
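The input process is simple to simulate (our own sketch of the dynamics just described):

```python
import random

# State (C, B1..B5): C' = (C + 1) mod 10, the bits shift left and a
# fresh random bit enters on the right.
def step(state, rng=random):
    c, bits = state
    return ((c + 1) % 10, bits[1:] + (rng.randint(0, 1),))

state = (0, (0, 0, 0, 0, 0))
for t in range(8):
    state = step(state)
    print(t + 1, state)     # one of 10 * 2**5 = 320 states at each step
```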
With suitable learning parameters (ȳ = 0.1, γ = 0.1, α_f = α_p = 0.5, 1/η_f = 1/η_p = 1.25, η_w ≈ 0.04) and random initial synaptic weights, the neuron needed 1 million time steps of online learning to develop a quasistationary response. It responded to 32 states of the Markov process, namely the 8 states 0 B_1B_2B_3 01 (with B_1B_2B_3 ∈ {000, 001, ..., 111}), the 8 states 1 B_1B_2 01 B_5, the 8 states 2 B_1 01 B_4B_5, and the 8 states 3 01 B_3B_4B_5. According to the rules of the input dynamics, these four groups of states are always passed in the given order. Thus the neuron always responds for 4 consecutive time steps. By its first spike (y(t) = 1 after y(t-1) = 0) it transmits an information I_τ = log_2(320/8) ≈ 5.3 about the state R(t + τ) of the input dynamics, at the present moment τ = 0 and the next three time steps
τ = 1, 2 or 3 (see fig. 1). One can prove that this is one of the "most constant" responses to the input dynamics, in the sense that no neuron with mean firing rate ȳ = 0.1 can transmit more information I_τ by its first spike, or show more than 4 consecutive responses at some times without showing fewer than 4 consecutive responses at other times.
A deeper analysis of the properties of the random walk showed that the learning process is robust against almost all parameter changes (as long as η_f, η_p, η_w, η_θ, η_p and γ remain small enough). The one critical learning parameter is α_f/η_f - α_p/η_p, the mean drift of the random walk into the future direction per jump. In the extreme case α_p = 0 the learning goal (8) would require the neuron's strongest activity to precede its output y = 1 in time, which is inconsistent with the output being caused by the neuron's strongest activity. Only if the drift was rather small did the learning process converge to quasistationarity (triangles in fig. 1).
There is a way to make the learning process robust against changes in α_f and α_p. One constructs a neuron with two different dendrites, differing in the values p, <P>_p, <Σ_i δw_i>_p, and <a>_p and in the synaptic learning parameters α_f, α_p, η_f, η_p: in dendrite a the drift α_f^a/η_f^a - α_p^a/η_p^a is chosen negative; in dendrite b the drift α_f^b/η_f^b - α_p^b/η_p^b is chosen positive. This learning process always reached quasistationarity in numerical simulation. The early part of a neuronal response was caused by dendrite b, the late part by dendrite a. The worst choice of learning parameters (maximal drift α_f^a = α_p^b = 0) still produced a rather good neuronal response (bowties in fig. 1). The known anatomical connections in the neocortex suggest that dendritic type a might correspond to apical dendrites and dendritic type b to basal dendrites [1]. This speculative hypothesis might be tested by simulating networks of such neurons.
REFERENCES
[1] Eisele, M., Vereinfachung generierender Partitionen von chaotischen Attraktoren (German), PhD thesis, Appendix D, ISSN 0944-2952, Jül-report 3021, KFA-Jülich, Jülich (1995).
[2] Földiák, P., Learning invariance from transformation sequences, Neural Computation, Vol. 3 (1991), pp194-200.
[3] Mitchison, G. J., Removing time variation with the anti-Hebbian differential synapse, Neural Computation, Vol. 3 (1991), pp312-320.
[4] Miyashita, Y. & Chang, H.-S., Neuronal correlate of pictorial short-term memory in the primate temporal cortex, Nature, Vol. 331 (1988), pp68-70.
[5] Oram, M. W. & Perrett, D. I., Modeling visual recognition from neurobiological constraints, Neural Networks, Vol. 7 (1994), pp945-972.
[6] Wallis, G., Rolls, E. T., & Földiák, P., Learning invariant responses to the natural transformations of objects, Int. Joint Conf. on Neural Networks, Vol. 2 (1993), pp1087-1090.
Acknowledgements
The author thanks F. Pasemann for many useful discussions and E. Pöppel and G. Eilenberger for general support. This work was done at the KFA Jülich, Germany.
NUMERICAL ASPECTS OF MACHINE LEARNING IN
ARTIFICIAL NEURAL NETWORKS
S. W. Ellacott and A. Easdown
School of Computing and Mathematical Sciences, University of Brighton,
Brighton BN1 4GJ, UK. Email: [email protected]
In previous papers e.g. [5] the effect on the learning properties of filtering or other preprocessing
of input data to networks was considered. A strategy for adaptive filtering based directly on this
analysis will be presented. We focus in Section 2 on linear networks and the delta rule since
this simple case permits the approach to be easily tested. Numerical experiments on some simple
problems show that the method does indeed enhance the performance of the epoch or off-line method considerably. In Section 3, we discuss briefly the extension to non-linear networks and in particular to backpropagation. The algorithm in its simple form is, however, less successful, and
current research focuses on a practicable extension to non-linear networks.
1 Introduction
In previous papers e.g. [5] the effect on the learning properties of filtering or other
preprocessing of input data to networks was considered. Such filters are usually
constructed either on the basis of knowledge of the problem domain or on statistical
or other properties of the input space. However in the paper cited it was briefly
pointed out that it would be possible to construct filters designed to optimise the
learning properties directly by inferring spectral properties of the iteration matrix
during the learning process, permitting the process to be conditioned dynamically.
The resulting technique is naturally parallel and easily carried out alongside the
learning rule itself, incurring only a small computational overhead. Here we explore
this idea.
2 Linear Theory
Although not the original perceptron algorithm, the following method, known as the delta rule, is generally accepted as the best way to train a simple perceptron. Since there is no coupling between the rows we may consider the single output perceptron. Denote the required output for an input pattern x by y, and the weights by the vector w^T. Then
δw = η(y - w^T x) x
where η is a parameter to be chosen called the learning rate ([7], p.322). Thus, given a current iterate weight vector w_k,
w_{k+1} = w_k + η(y - w_k^T x) x = (I - η x x^T) w_k + η y x        (1)
since the quantity in the brackets is scalar. The bold subscript k here denotes the kth iterate, not the kth element. We will consider a fixed and finite set of input patterns x_p, p = 1 ... t, with corresponding outputs y_p. If we assume that the patterns are presented in repeated cyclic order, the presented x and corresponding y of (1) repeat every t iterations, and given a sufficiently small η, the corresponding weights go into a limit t-cycle: see [3] or [5]. Of course this is not the only or necessarily best possible presentation scheme [2], but other methods require a priori analysis of the data or dynamic reordering. Since we are assuming that we have a fixed and finite set of patterns x_p, p = 1 ... t, an alternative strategy is not to update the weight vector until the whole epoch of t patterns has been presented.
This idea is attractive since it actually generates the steepest descent direction for the least sum of squares error over all the patterns. We will call this the epoch method to distinguish it from the usual delta rule. (Other authors use the term off-line learning.) This leads to the iteration
w_{k+1} = Ω w_k + η Σ_{p=1}^{t} y_p x_p        (2)
where Ω = (I - η X X^T) = (I - η L). Here X is the n × t matrix whose columns are the
x_p. The k in (2) is, of course, not equivalent to that in (1), since it corresponds to a complete epoch of patterns. There is no question of limit cycling and, indeed, a fixed point will be a true least squares minimum w*. To see this, put w_{k+1} = w_k = w* and observe that (2) reduces to the normal equations for the least squares problem. Moreover, the iteration (2) is simply steepest descent for the least squares problem, applied with a fixed step length. Clearly L = X X^T is symmetric and positive semi-definite. In fact, provided the x_p span, it is (as is well known) strictly positive definite. The eigenvalues of Ω are 1 - η·(the corresponding eigenvalues of L), and hence for η sufficiently small σ(Ω) = ||Ω||_2 < 1. Unfortunately, however, (2) generally requires a smaller value of η than (1) to retain numerical stability [3], [5].
How can we improve the stability of (2), or indeed (1)? Since these are linear iterations it is only necessary to remove the leading eigenvalue of the iteration matrix. Specifically, we seek a matrix T such that if each input vector x_p is replaced by T x_p, more rapid convergence will result. We see from (2) that the crucial issue is the relationship between the unfiltered update matrix Ω = (I - η X X^T) and its filtered equivalent Ω' = (I - η T X X^T T^T), say. In general these operations may be defined on spaces of different dimension (see e.g. [5]), but here we assume T is n × n. To choose T we compute the largest eigenvalue and corresponding eigenvector of X X^T. This may be carried out by the power method ([6], p.147) at the same time as the ordinary delta rule or epoch iteration: the computation can be performed by running through the patterns one at a time, just as for the learning rule itself. We get a normalised eigenvector p_1 of X X^T corresponding to the largest eigenvalue λ_1 of X X^T. Set
T = I + (λ_1^{-1/2} - 1) p_1 p_1^T.        (3)
A routine calculation shows that T X X^T T^T has the same eigenvectors as X X^T, and the same eigenvalues but with λ_1 replaced by 1. Each pattern x_p should then be multiplied by T, and, since we are now iterating with different data, the current weight estimate w should be multiplied by
T^{-1} = I + (λ_1^{1/2} - 1) p_1 p_1^T.        (4)
We may then repeat the process to remove further eigenvalues. Basically the same idea can be used for the iteration with the weights updated after each pattern as in (1), but it is less convenient: for simplicity, our numerical experiments refer only to the application to (2), as in the sketch below. Two small examples are discussed: in each case four eigenvalues are removed, though in fact for the first example only the first eigenvalue is significant.
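The whole filtering step fits in a few lines (our own sketch; in practice the power method would run alongside the learning iteration, one pattern at a time, as described above):

```python
import numpy as np

def leading_eigenpair(L, iters=200, seed=0):
    """Power method for the leading eigenpair of the symmetric L = X X^T."""
    v = np.random.default_rng(seed).standard_normal(L.shape[0])
    for _ in range(iters):
        v = L @ v
        v /= np.linalg.norm(v)
    return v @ L @ v, v                    # lambda_1 and normalised p_1

X = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 0, 1]], float).T
lam1, p1 = leading_eigenpair(X @ X.T)

T    = np.eye(3) + (lam1 ** -0.5 - 1.0) * np.outer(p1, p1)   # eq. (3)
Tinv = np.eye(3) + (lam1 **  0.5 - 1.0) * np.outer(p1, p1)   # eq. (4)

Xf = T @ X                                 # filtered patterns
print(sorted(np.linalg.eigvalsh(Xf @ Xf.T)))  # lambda_1 replaced by 1
```

After filtering, the leading eigenvalue of X X^T is replaced by 1, so a considerably larger η can be tolerated by the epoch iteration.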
The first example is a 'toy' problem taken from [5], p.113. There are four patterns and n = 3: x_1 = (1,0,0)^T, x_2 = (1,1,0)^T, x_3 = (1,1,1)^T and x_4 = (1,0,1)^T. The corresponding outputs are y_1 = 0, y_2 = y_3 = y_4 = 1. Figure 1 shows the number of epochs required to obtain agreement in the weights (at the end of the epoch for the delta rule) to an error of 1E-8, for various values of η and for the three methods: the delta rule (1), the epoch or off-line method (2) and the Adaptive Filtering Method (AFT) using (3).
Figure 1 and Figure 2 Number of epochs to convergence against the learning rate η, for the toy problem (Figure 1) and the Balloon Database (Figure 2).
As a slightly more realistic example we also considered the Balloon Database B from the UCI Repository of Machine Learning and Domain Theories [1]. This is a set of sixteen 4-vectors which are linearly separable. Figure 2 compares the performance of the epoch and AFT methods. On this very well behaved database the epoch method actually performs better in terms of iterations than the delta rule, even though it requires a larger value of η. With η = 1, AFT requires only three epochs and good performance is obtained over a wide range, whereas the best obtainable with the epoch method is 28 epochs with η = 0.05, and this is a very sharp minimum: η = 0.03 requires 49 iterations, and η = 0.06 requires 45.
3 Non-linear Networks
The usefulness of linear neural systems is limited, since many pattern recognition
problems are not linearly separable. We will define a general nonlinear delta rule; the backpropagation rule ([7], pp.322-328) used in many neural net applications is a special case of this. For the linear network the dimension of the input space and the number of weights are the same: n in our previous notation. Now we will let M denote the total number of weights and n the input dimension. So the input patterns x to our network are in R^n, and we have a vector w of parameters in R^M describing the particular instance of our network, i.e. the vector of synaptic weights. For a single layer perceptron with m outputs, the vector w is the m × n weight matrix, and thus M = mn. For a multilayer perceptron, w is the cartesian product of the weight matrices in each layer. For brevity we consider just a single output. The network computes a function g: R^M × R^n → R. In [4] or [5] it is shown that the generalised delta rule becomes
δw = η(y - g(w, x)) ∇g(w, x)        (5)
∇g takes the place of x in (1). Observe that a change of weights in any given layer will cause a (linear) change in the input vector to the successive hidden layer. Thus the required gradient is obtained by i) differentiating the current layer with respect
Figure 3
to the weights in the layer and ii) multiplying this by the matrix representation of the Fréchet derivative with respect to the inputs in the succeeding layers. Thus, let the kth weight layer, k = 1, 2, ..., K, say, have weight matrix W_k; each row of these matrices forms part of the parameter vector w. On top of each weight layer is a (possibly) non-linear layer. At each of the m (say) hidden units we have an activation function: h_j for the jth unit. The function h whose jth co-ordinate function is h_j is a mapping R^m → R^m. However, for a multilayer perceptron it is rather special in that h_j only depends on the jth element of its argument: in terms of derivatives this means that the Jacobian H of h is diagonal. Let the H and h for the unit layer after the kth weight layer also be subscripted k. (Input units to the bottom layer just have identity activation, as is conventional.) Finally, suppose that the input to the kth weight layer (i.e. the output from the units of the previous layer) is denoted v_k, with v_1 = x. A small change δW_k in the kth weight matrix causes the input to the corresponding unit layer to change by δW_k v_k. The Fréchet derivative of a weight layer W_r v_r with respect to its input v_r is of course just W_r. Thus the output is changed by H_K W_K H_{K-1} W_{K-1} ... H_k δW_k v_k. Since this expression is linear in δW_k, it yields for each individual element that component of the gradient of g corresponding to the weights in the kth layer. To see this, recall that the gradient is actually just a Fréchet derivative, i.e. a linear map approximating the change in output. In fact we might as well split up W_k by rows and consider a change (δw_{i,k})^T in the ith row (only). This corresponds to δW_k = e_i (δw_{i,k})^T, so δW_k v_k = e_i (δw_{i,k})^T v_k = e_i v_k^T (δw_{i,k}) = V_{i,k} (δw_{i,k}), say, where V_{i,k} is the matrix with ith row v_k^T and zeros elsewhere. Thus, that section of ∇g in (5) which corresponds to changes in the ith row of the kth weight matrix is H_K W_K H_{K-1} W_{K-1} ... H_k V_{i,k}. The calculation is illustrated in the following ex-
is HK WK HK -1 WK -1 . . . Hk Vi,k. The calculation is illustrated in the following ex-
ample. Figure 3 shows the architecture. The shaded neurons represent bias neurons
with activation functions the identity, and the input to the bias neuron in the input
layer is fixed at 1. The smaller dashed connections have weights fixed at 1, and the
larger dashed ones have weights fixed at O. (This approach to bias has been adopted
to keep the matrix dimensions consistent.) All other neurons have the standard ac-
tivation function a = 1/(1 + eX). Other than the bias input, the data is the same
as the first example in Section 2. Let the initial weights be
between the inputs and hidden layers, numbering from the top in Figure 3. Thus no bias is applied (but the weights in rows 1 and 2 of the last column are trainable), and the bottom row of weights is fixed. Weights are
LEARNING IN RAM-BASED NEURAL NETWORKS
A. Ferguson et al.
1 Introduction
The driving force behind RAM-based neural networks is their ease of hardware
realisation. The desire to retain this property influences the design of learning algo-
rithms. Traditionally, this has led to the use of the reward-penalty algorithm, since
only a single scalar value needs to be communicated to every node [7]. The math-
ematical tool of reverse differentiation enables derivatives of an arbitrary function
to be obtained efficiently at an operational cost of less than three times the orig-
inal function. Using three progressively more complex problems, the performance
of three gradient-based algorithms and reward-penalty are compared.
2 HyperNet Architecture
HyperNet is the term used to denote the hardware model of a RAM-based neural architecture proposed by Gurney [5], which is similar to the pRAM of Gorse and Taylor [4]. A neuron is termed a multi-cube unit (MCU), and consists of a number of subunits, each with an arbitrary number of inputs. j and k reference nodes in the hidden and output layers respectively, with i = 1, ..., I indexing the subunits. μ denotes the site addresses, and is the set of bit strings μ_1, ..., μ_n, where n denotes the number of inputs to the subunit. z_c refers to the cth real-valued input, with z_c ∈ [0,1] and z̄_c = (1 - z_c). For each of the 2^n site store locations, two sets are defined: c ∈ M^{ij}_{μ0} if μ_c = 0; c ∈ M^{ij}_{μ1} if μ_c = 1. The access probability p(μ^{ij}) for location μ in subunit i of hidden layer node j is therefore p(μ^{ij}) = Π_{c ∈ M^{ij}_{μ0}} z̄_c · Π_{c ∈ M^{ij}_{μ1}} z_c. The subunit response (s) is then gained by summing the proportional site values, which are in turn accumulated to form the multi-cube activation (a). The node's output (y) is given by passing the MCU activation through a sigmoid transfer function (y = σ(a) = 1/(1 + e^{-a})). The complete forward pass of a two layer network of HyperNet nodes is given in Table 1, in the form of a computational graph.
Gradient-based algorithms require the extraction of error gradient terms. Reverse
differentiation [10] enables derivatives to be obtained efficiently on any architecture.
Functions are usually composed of simple operations, and this is the basis on which
reverse accumulation works. A computational graph with vertices 1 to N connected
by arcs is constructed. The first n vertices are the independent variables, with the results of subsequent basic operations f_u(·) stored in intermediate variables
Table 1 Forward pass (computational graph) and reverse accumulation for a two layer network of HyperNet nodes. The forward pass computes, per subunit i of node j,
p(μ^{ij}) = Π_{c ∈ M^{ij}_{μ0}} z̄_c · Π_{c ∈ M^{ij}_{μ1}} z_c,   s^{ij} = Σ_μ S^{ij}_μ p(μ^{ij}),   a^j = (1/I_j) Σ_{i=1}^{I} s^{ij},
and similarly for the output layer nodes k. The reverse accumulation passes the adjoints back through the same graph, e.g. s̄^{ij} = (1/I_j) ā^j and S̄^{ij}_μ = p(μ^{ij}) s̄^{ij}.
x_u = f_u(·), u = n+1, ..., N, with F(x) = x_N. The gradient vector g(x) = ∂F(x)/∂x can then be obtained by defining the adjoint x̄_u = ∂x_N/∂x_u for each vertex u and applying the chain rule. The process thus starts at vertex N, where x̄_N = 1, and percolates down. The reverse accumulation of a two layer HyperNet network is also given in Table 1.
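The mechanics are easiest to see on a toy graph (our own sketch; the HyperNet graph of Table 1 is handled in exactly the same way, vertex by vertex):

```python
import math

# Forward-evaluate intermediate variables x_u = f_u(.), seed the adjoint
# xbar_N = 1 at the top vertex, and percolate down via the chain rule.
# One reverse sweep yields the whole gradient at a small constant
# multiple of the forward cost.
def f_and_grad(x1, x2):
    x3 = x1 * x2                       # f_3
    x4 = math.tanh(x3)                 # f_4: plays the role of sigma(a)
    xbar4 = 1.0                        # seed at vertex N
    xbar3 = xbar4 * (1.0 - x4 * x4)    # d x4 / d x3
    xbar1 = xbar3 * x2                 # d x3 / d x1
    xbar2 = xbar3 * x1                 # d x3 / d x2
    return x4, (xbar1, xbar2)

print(f_and_grad(0.3, 0.8))            # F and its full gradient
```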
3 Learning Algorithms
The reward-penalty algorithm used is an adaptation of the P-model associative algorithm devised by Barto [2] and modified for HyperNet nodes by Gurney [5]. A single scalar reinforcement signal, calculated from a performance measure, is globally broadcast. The internal parameters are updated using this signal and local information. The performance metric used is the mean-squared error on the output layer, e = (1/N_k) Σ_{k=1}^{N_k} [σ(a^k) - y^{kt}]^2, where y^{kt} is the target output for node k and N_k is the number of output layer nodes. The binary reward signal is then probabilistically generated: r = 1 with probability (1 - e); r = 0
otherwise. Given a reward signal (r = 1), the node's internal parameters are modified so that the current output is more likely to occur from the same stimulus. A penalising step (r = 0) should have the opposite effect. The update is therefore ΔS_μ = α [r(ŷ - y) + r̄λ(1 - ŷ - y)], where α is the learning rate, λ is the degree of penalising, and ŷ = y^{kt} for output layer nodes and is the closest extreme for hidden layer nodes: ŷ = 1 if y > 0.5; ŷ = 0 if y < 0.5; or ŷ = 1 with probability 0.5 if y = 0.5.
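One site update under this rule might look as follows (a hedged sketch; the function and parameter names are ours):

```python
import random

def rp_update(S_mu, y, yhat, e, alpha=0.1, lam=0.05):
    """One reward-penalty step for a site store value S_mu, following
    dS_mu = alpha * [r*(yhat - y) + (1 - r)*lam*(1 - yhat - y)]."""
    r = 1 if random.random() < (1.0 - e) else 0    # binary reward signal
    if r == 1:
        return S_mu + alpha * (yhat - y)           # reinforce
    return S_mu + alpha * lam * (1.0 - yhat - y)   # penalise

print(rp_update(S_mu=0.0, y=0.8, yhat=1.0, e=0.04))
```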
Gradient descent is the simplest gradient technique, with the update ΔS = -α·δ, where α is the step size and δ is the gradient term. In the trials reported here, batched updating was used, where the gradients are accumulated over the training set (an epoch) before being applied. Gradient descent has also been applied to RAM-based nodes by Gurney [5], and by Gorse and Taylor [4].
Steepest descent represents one of the simplest learning rate adaptation techniques. A line search is employed to determine the minimum error along the search direction. The line search used was proposed by Armijo [1], and selects the largest power-of-two learning rate that reduces the network error.
Successive steps in steepest descent are inherently perpendicular [6], thus leading to a zig-zag path to the minimum. A better search direction can be obtained by incorporating some of the previous search direction; momentum is a crude example. Conjugate gradient utilises the previous direction with the update rule ΔS = α·δS = α(-δ + β·δS⁻), where δS⁻ refers to the change on the previous iteration. β is calculated to ensure that successive steps are conjugate, and was calculated using the Polak-Ribière rule [9].
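For reference, a Polak-Ribière direction update looks like this (our own sketch; the max(0, ·) restart safeguard is a common variant rather than something specified in the paper):

```python
import numpy as np

def pr_direction(grad, grad_prev, dir_prev):
    """New search direction: -grad + beta * previous direction, with
    beta from the Polak-Ribiere formula."""
    beta = max(0.0, float(grad @ (grad - grad_prev)) /
                    float(grad_prev @ grad_prev))
    return -grad + beta * dir_prev

g_prev = np.array([1.0, -2.0])
g = np.array([0.5, -1.0])
print(pr_direction(g, g_prev, -g_prev))
```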
4 Benchmark Applications
The 838 encoder is an auto-associative problem which has been widely used to demonstrate the ability of learning algorithms [6]. Both the input and output layers contain n = 8 nodes, with log_2 n = 3 hidden layer nodes. Presentation of n distinct patterns requires a unique encoding for each to be formulated at the hidden layer. Of the n^n possible encodings, n! are unique. Thus for the 838 encoder, only 40,320 unique encodings exist among the 16,777,216 possible, or 0.24%. The network is fully connected, with every MCU containing only one subunit. The training vectors consist of a single set bit, which is progressively shifted right.
The character set database is a subset of that utilised by Williams [11], and consists of twenty-four examples of each letter of the alphabet, excluding 'I' and 'O'. Each is sixteen by twenty-four binary pixels in size, and was generated from UK postcodes. Nineteen examples of each letter form the training set. A randomly generated configuration was utilised, with each pixel mapped only once. Eight inputs per subunit and seven subunits per MCU resulted in seven neurons in the hidden layer. The output layer contains twelve nodes, each with a single subunit fully connected to the hidden layer.
The particle scattering images were generated by a pollution monitoring instrument previously described by Kaye [8]. Data was collected on eight particle types, namely: long and short caffeine fibres; 12 μm and 3 μm silicon dioxide fibres; copper flakes; 3 μm and 4.3 μm polystyrene spheres; and salt crystals. The training set comprised fifty randomly selected images of each type, quantised to 16×16 5-bit images to reduce computational load. A fully connected network was used, with sixteen
hidden, and six output layer MCUs. Every MCU consisted of two subunits, each
with eight randomly generated connections. A more complete description of the
airborne particle application can be found in [3].
5 Experimental Results
The site store locations were randomly initialised for all but reward-penalty, where
they were simply set to zero. The convergence criterion was 100% classification for the
838 encoder, and 95% for the other applications. Parameter settings were gleaned
experimentally for the reward-penalty and gradient descent algorithms. The gradi-
ent descent setting of p was also used for steepest descent and conjugate gradient.
The results are averaged over ten networks for the simpler problems, and five for
the particle task. Table 2 summarises the parameter settings, and results for the
four algorithms on the three applications. "Maximum", "mean", "deviation", and
"coefficient" are in epochs, with the latter two being the standard deviation and
coefficient of variation (V = (σ/μ) · 100) respectively. * denotes unconverged networks,
and hence the maximum cycle limit. "CPU/cycle" is based on actual CPU time
(secs) required on a Sun SPARCstation 10 model 40. "Total time" is given by
"mean" × "CPU/cycle".
For the 838 encoder, conjugate gradient was the fastest, requiring marginally less
time than reward-penalty and almost six and a half times fewer cycles. The char-
acter recognition task highlights the difference in learning ability. Reward-penalty
was unable to converge, being more than three orders of magnitude away from the
desired error. Batched gradient descent again demonstrated consistent learning, but
was still the slowest of the gradient-based algorithms. The results for the particle
task exemplify the problem of steepest descent: every network became trapped and
required a relatively large number of cycles to be freed. The power of conjugate gra-
dient also becomes clear, needing five times fewer cycles than gradient or steepest
descent. Reward-penalty was not tried due to its failure on the simpler character
problem.
6 Conclusions
The performance of various learning algorithms applied to a RAM-based artificial
neural network has been investigated. Traditionally, reward-penalty has been ap-
plied to these nodes due to its inherent hardware amenability. The experiments
reported here suggest that reinforcement learning is best suited to simple prob-
lems. With respect to the gradient-based algorithms, gradient descent was consis-
tently the slowest algorithm. While steepest descent was the fastest on the character
recognition task, it had a tendency to become trapped. Conjugate gradient was by
far the best algorithm, being fastest on two of the three applications.
REFERENCES
[1] L. Armijo, Minimisation of functions having Lipschitz continuous first partial derivatives,
Pacific Journal of Mathematics, Vol. 16 (1966), pp1-3.
[2] A. G. Barto and M. I. Jordan, Gradient following without back-propagation in layered net-
works, in: Proc. Int. Conf. Neural Networks (ICNN '87), IEEE, Vol. II (1987), pp629-636.
[3] A. Ferguson, T. Sabisch, P.H. Kaye, L.C.W. Dixon and H. Bolouri, High-speed airborne par-
ticle monitoring using artificial neural networks, Advances in Neural Information Processing
Systems 8 (1996), MIT Press, pp980-986.
[4] D. Gorse and J.G. Taylor, Training strategies for probabilistic RAMs, Parallel Processing in
Neural Systems and Computers, Elsevier Science Publishers (1990), pp31-42.
[5] K. N. Gurney, Training nets of hardware realisable sigma-pi units, Neural Networks, Vol. 5
(1992), pp289-303.
[6] J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation,
Addison-Wesley (1991).
[7] T. K. L. Hui, P. Morgan, K.N. Gurney and H. Bolouri, A cascadable 2048-neuron VLSI ar-
tificial neural network with on-board learning, in: Artificial Neural Networks 2, North-Holland
(1992), pp647-651.
[8] P. H. Kaye, E. Hirst, J.M. Clark and M. Francesca, Airborne particle shape and size clas-
sification from spatial light scattering profiles, Journal of Aerosol Science, Vol. 23 (1992),
pp597-611.
[9] E. Polak, Computational Methods in Optimisation, Academic Press (1971).
[10] L. B. Rall, Automatic Differentiation, Springer-Verlag (1981).
[11] P. M. Williams, Some experiments using n-tuple techniques for alphanumeric pattern clas-
sification, report 575, The Post Office Research Centre, Ipswich, (1977).
ANALYSIS OF CORRELATION MATRIX MEMORY
AND PARTIAL MATCH-IMPLICATIONS FOR
COGNITIVE PSYCHOLOGY
Richard Filer and James Austin
Advanced Computer Architecture Group, Department of Computer Science,
University of York, Heslington, York YO1 5DD, UK. Email: [email protected]
This paper describes new work on partial match using Correlation Matrix Memory (CMM), a
type of binary associative neural network. It has been proposed that CMM can be used as an
inference engine for expert systems, and we suggest that a partial match ability is essential to
enable a system to deal with real world problems. Now, an emergent property of CMM is an
ability to perform partial match, which may make CMM a better choice of inference engine than
other methods that do not have partial match. Given this, the partial match characteristics of
CMM have been investigated both analytically and experimentally, and these characteristics are
shown to be very desirable. CMM partial match performance is also compared with a standard
database indexing method that supports partial match (Multilevel Superimposed Coding), which
shows CMM to compare well under certain circumstances, even with this heavily optimised method.
Parallels are drawn with cognitive psychology and human memory.
Keywords: Correlation Matrix Memory, Neural Network, Partial Match, Human Memory.
1 Introduction
Correlation Matrix Memory (CMM) [4] has been suggested as an inference engine,
possibly for use in an expert system [1]. Touretzky and Hinton [7] were the first to
implement successfully a reasoning system in a connectionist architecture, and there
have been several neural network based systems suggested since. However, only the
systems [1, 7] can be said to use a truly distributed knowledge representation (the
issue of localist versus distributed knowledge representation will not be discussed
here), for instance SHRUTI [6] is an elaborate system using temporal synchrony.
The SHRUTI model was developed with a localist representation and, although
a way of extending the model to incorporate a semi-distributed representation is
given, this is not a natural extension of the model; moreover, it is difficult to see
how learning could occur with either form of knowledge representation. Many of
the properties that have become synonymous with neural networks actually rely
on the use of a distributed knowledge representation. These are: (a) a graceful
degradation of performance with input data corruption; (b) the ability to interpolate
between input data and give a sensible output, and (c) a robustness to system
damage. Partial match ability is very much connected with (a) and (b), which
give neural network systems their characteristic flexibility. Here we suggest that a
certain flexibility in reasoning is invaluable in an inference engine because, in real
world problems, the input data is unlikely to be a perfect match for much of the
time. The CMM-based inference engine suggested by Austin [1] uses a distributed
knowledge representation, and therefore this system can, in principle, offer the
desired flexibility. The focus of this paper is a full analysis of the partial match
ability of CMM, including a comparison with a conventional database indexing
method that offers partial match. This conventional database indexing method is
Multilevel Superimposed Coding [3]. Section 2 contains a more detailed description
of CMM and partial match, including an analysis of the theoretical partial match
performance of CMM. Section 3 compares CMM partial match with Multilevel
Superimposed Coding, and Section 4 considers how these results may be relevant
to cognitive psychology and the study of human memory.
2 CMM and Partial Match
CMM is a type of binary associative memory, which can be thought of as a matrix of
binary weights. The fundamental process which occurs in CMM is the association
by Hebbian learning of binary vectors representing items of information, which
subsequently allows an item of information to be retrieved given the appropriate
input. CMM allows very fast retrieval and, particularly, retrieval in a time which
is independent of the total amount of information stored in the memory for given
CMM size. In [1], a symbolic rule such as X⇒Y can be encoded by associating a
binary vector code for X, the rule antecedent, with a binary vector code for Y, the
rule consequent. The rule can be recalled subsequently by applying the code for
X to the memory and retrieving the code for Y. Multiple antecedent items can be
represented by superimposing (bitwise OR) the binary vector codes for each item
prior to learning. A fixed weight, sparse encoding is used to generate codes with
a coding rate that optimizes storage w.r.t. the number of error bits set in the output,
which are bits set in error due to interactions between the binary representations
stored in the CMM [11]. CMM learning is given by:
M_{i+1} = M_i ⊕ I^T O (1)
where M_i is the m×m CMM at iteration i, I and O are m-bit binary vector codes
to be associated, and ⊕ means OR each corresponding element. Hence to obtain the
CMM at iteration i+1, I and O are multiplied and the result ORed with M_i. Note
that learning is accomplished in a single iteration. CMM retrieval is given by:
r = I M (2)
O = L-max(r, l) (3)
where I would be one of a pair of vectors previously associated in M, and the
function L-max(r, l) [2] selects the l highest integers in the integer vector, r, to be
the l set bits in the output, O.
the I set bits in the output, O. Storage wrt. error is maximised when [11]:
n = log2 m (4)
where n is the number of set bits in each m-bit binary vector; usually, we choose
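To make equations (1)-(4) concrete, here is a minimal sketch in Python (not the authors' implementation); it assumes the codes are row vectors so that the association is the outer product, and the sparse code generator is our own illustration:

```python
import numpy as np

def cmm_learn(M, I, O):
    """Equation (1): OR the outer product of the two codes into M."""
    return M | np.outer(I, O)

def cmm_recall(M, I, l):
    """Equations (2)-(3): integer response r = I M, then L-max
    thresholding keeps the l highest responses as the l set output bits."""
    r = I @ M
    out = np.zeros_like(r)
    out[np.argsort(r)[-l:]] = 1
    return out

m = 64
n = int(np.log2(m))                    # n = log2 m, eq. (4)
rng = np.random.default_rng(0)

def code():
    """Fixed-weight sparse code: exactly n of m bits set."""
    c = np.zeros(m, dtype=int)
    c[rng.choice(m, n, replace=False)] = 1
    return c

I, O = code(), code()
M = cmm_learn(np.zeros((m, m), dtype=int), I, O)
assert (cmm_recall(M, I, n) == O).all()   # full-match retrieval
```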
In a symbolic rule matching context, partial match would allow a rule that
has multiple items in the antecedent to fire when only a subset of the items are
present. In terms of the binary vector code representation being considered here
and if the input to the CMM contains z set bits, a full match input is characterised
by the greatest possible number of set bits (z = n), while a partial match has fewer
set bits than this (l ≤ z < n, where k is the number of antecedent items). Because
of the way in which multiple input data are superimposed, CMM partial match
is insensitive to the order in which items are presented as input. For example,
consider the following symbolic rule:
engine_stalled ∧ ignition_ok ∧ fuel_gauge_low ⇒ refuel
Normally, to check if a subset {engine_stalled, ignition_ok} is a partial match, it
would be necessary at least to look at all the possible orderings of the subset, but
with CMM this is not necessary. We term this emergent property Combinatorial
Partial Match. CMM could equally be used in frame based reasoning, where
partial match would allow an entire frame to be recalled given a subset of the
frame.
Partial Match Performance
If each input learned by the CMM, I, is obtained by hashing and superimposing
the attributes that identify each record, then the CMM can be used subsequently
to locate the record in memory, using I (or a partial match for I), by associating
I with an output code, 0, that represents the location of the record. Hence CMM
can be used as a database indexing method. However, if multiple matches occur
and multiple locations are represented in the CMM output, then output codes may
be so overlapped that it is impossible to identify individual codes. Therefore, a
method is needed to decode the CMM output but giving the minimum number
of false positives (records that appear to be a match in the CMM but are not
true matches) and no false negatives (records that appear not to be a match in
the CMM but are true matches). We suggest here a novel method of identifying
which output codes may be present in the CMM output, whereby output codes are
subdivided and stored in buckets in memory, according to where the middle set bit
(MB) appears in each code (Fig. 1). This approach allows us simply to identify all
set bits in the output that could be MBs, then if all the buckets corresponding to
these bits are searched, we are guaranteed no false negative matches.
[Figure 1: output codes subdivided and stored in buckets according to the position
of the middle set bit (MB) in each code.]
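A sketch of the bucket lookup; the helper names (and the representation of the buckets as a dictionary from MB position to stored codes) are our own illustration:

```python
import numpy as np

def middle_set_bit(code):
    """Index of the middle set bit (MB) of a binary code."""
    bits = np.flatnonzero(code)
    return bits[len(bits) // 2]

def buckets_to_search(output, buckets):
    """Every set bit of the superimposed CMM output could be the MB of some
    stored code, so searching all corresponding buckets guarantees no false
    negatives; false positives are filtered by comparing full codes."""
    return [buckets[b] for b in np.flatnonzero(output) if b in buckets]
```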
Note that, for a given p, if m is chosen such that q = 50%, then p ≈ m²/(3(ln m)²) [5] (i.e.
p may be expressed in terms of m). We can now write:
γ = (w₁ + w₂ + p n²)/m² (10)
This result has been verified experimentally (up to m = 1000; however, a lack of
space unfortunately does not permit the discussion of these experimental results
here).
3 A Comparison with Multilevel Superimposed Coding
Multilevel Superimposed Coding (MSC) [3] has been chosen as the conventional
database indexing method that supports partial match because of the similarities
of this method with our method; MSC is also used by several companies (e.g.
Daylight CIS Inc.) for indexing their large databases. In MSC (Fig. 2), records are
identified by hashing and superimposing attribute values into a binary code using
multiple hash functions to gain access at multiple levels to a binary tree. Note that
the resultant codes are of a different length, which is why multiple hash functions
are needed. As before, the idea is to search as little of the index as possible, and the
ideal state of affairs when locating a single record would be if at each node only one
branch were taken. However, if the index is poorly conditioned then all branches
may need to be searched, which would then be equivalent to performing a linear
search on the entire index.
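The hash-and-superimpose idea common to both methods can be sketched as follows; this is not Kim and Lee's multilevel scheme, just an illustration with an arbitrary hash of our own:

```python
import numpy as np

def signature(attributes, m=64, n_bits=3, seed=0):
    """Hash each attribute value to n_bits positions and superimpose
    (bitwise OR) into one m-bit record signature."""
    sig = np.zeros(m, dtype=int)
    for value in attributes:
        # illustrative hash; a real index would use fixed hash functions
        h = np.random.default_rng(hash((value, seed)) % (2**32))
        sig[h.choice(m, n_bits, replace=False)] = 1
    return sig

# A partial-match query signature is built the same way from a subset of
# attributes; record is a candidate if (query & record) equals query.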
1 Consider O comprising just one output code: O has n set bits, only one of which can be a
middle set bit (MB); therefore n − 1 of the set bits are not MBs (this generalises to the
case of O comprising multiple output codes).
Kim and Lee [3] consider an example case of an index for a database of 2²⁴ records,
for which MSC requires a 24 level tree with 24 hash functions. The analysis for
MSC relies on specifying t, the expected number of matches, and both methods
are allowed the same amount of storage (1.5 GB). The results are shown in Fig. 3,
and concern the fraction of each index that must be searched versus the expected
number of matches. Remembering that a full match input is characterised by the
greatest possible number of set bits (z = n), while a partial match has fewer set
bits than this (l ≤ z < n), we observe from Fig. 3 that CMM approaches the
performance achieved by MSC provided (1) that the input is well specified, and
(2) that few true matches exist in the database. Even with a less well specified
input and with several matches, CMM still performs reasonably well (especially
considering the relative simplicity of the CMM method in comparison with the
heavily optimised and much less straightforward MSC method).
4 Implications for Cognitive Psychology
In presenting this work to colleagues from differing backgrounds - some in psychol-
ogy - it has become clear that this work may have relevance to cognitive psychology
and the study of human memory. The Encoding Specificity Principle due to
Endel Tulving [10] states: "Only that can be retrieved that has been stored, and how
it can be retrieved depends on how it was stored." And in [8] Tulving states: "On
the unassailable assumption that cue A has more information in common with the
trace of the A-T [T for target word] pair of studied items than with the trace of
the B-T pair, we can say that the probability of successful retrieval of the target
item is a monotonically increasing function of the information overlap between the
information present at retrieval and the information stored in memory." The En-
coding Specificity Principle is by no means the only theory to explain this aspect
of human memory function, and a number of experiments have been performed by
protagonists of this or other theories. The work presented in this paper could be
taken in support of the Encoding Specificity Principle, but this is not the intention
of the authors. However, CMM could provide a model of the low level workings of
human memory and, as such, the results of those experiments performed in relation
to the Encoding Specificity Principle are most interesting. For instance, in [9], 674
subjects learned a list of 24 target words, (1) with one of two sets of cue words, or
(2) without cue words; the cue words were chosen to each have an association of 1%
with the corresponding target word. In retrieval, subjects were asked to remember
the list (a) with cues; (b) without cues, or (c) with wrong cues (a cue word taken
from the alternative list). The experimental results are given in terms of the mean
number of words remembered and are as follows, given in descending order of the
mean number of words remembered: (1a) 14.93; (2b) 10.62; (1b) 8.72; (2a) 8.52;
(1c) 7.45. What is interesting for CMM as a model of memory is the difference
between the results of experiments (1a) and (1b). If one postulates that the target
word and the cue word are somehow stored together, like the input to a CMM, and
that the output would be the equivalent of a pointer to the target word, then the
results of this paper can be interpreted in the light of the results in [9]. To do so, it
is necessary to envisage some indexing mechanism in the human brain for retrieving
information from memory that works best when there is little index memory to be
searched, which is perfectly plausible. Then (1a) would correspond to a full match
input, which would be expected to give best retrieval performance, as was found to
be the case in the experiments; similarly, (1b) would correspond to a partial match
input, which would be expected to give a rather worse retrieval performance, again
as was found to be the case.
5 Conclusions
We have analysed the partial match performance of CMM, in line with the proposed
use of CMM as an inference engine for an expert system that must solve real world
problems. The analysis has enabled a comparison with a conventional database
indexing technique that supports partial match, the results of which suggest CMM
is good at performing partial match when the input is well specified and few matches
exist. Interestingly, a similar behaviour is observed in human memory experiments
and, although we would not go so far as to suggest that human memory is so
simple as CMM, we observe that there are similarities that would support the use
of CMM, or CMM partial match, as a model of some of the low level workings of
human memory in further experiments in cognitive psychology.
REFERENCES
[1] J. Austin, Correlation Matrix Memories for Knowledge Manipulation, in: International Con-
ference on Neural Networks, Fuzzy Logic, and Soft Computing, Iizuka, Japan (1994).
[2] D. Casasent and B. Telfer, High Capacity Pattern Recognition and Associative Processors,
Neural Networks, Vol. 4(5) (1994), pp687-98.
[3] Y.M. Kim and D.L. Lee, An Optimal Multilevel Signature File for Large Databases, In Fourth
International Conference on Computing and Information, IEEE (1992).
[4] T. Kohonen and M. Ruohonen, Representation Of Associated Data By Matrix Operators,
IEEE Transactions on Computers, (July 1973).
[5] J-P. Nadal and G. Toulouse, Information Storage in Sparsely Coded Memory Networks,
Network, Vol. 1 (1990), pp61-74.
[6] L. Shastri and V. Ajjanagadde, From Simple Associations to Systematic Reasoning, Be-
havioural & Brain Sciences, Vol. 16(3) (1993), pp417-93.
[7] D.S. Touretzky and G.E. Hinton, A Distributed Connectionist Production System, Cognitive
Science, Vol. 12 (1988), pp423-66.
[8] E. Tulving, Relation Between Encoding Specificity and Levels of Processing, in: Levels of
Processing in Human Memory. John Wiley & Sons (1979), London.
[9] E. Tulving and S. Osler, Effectiveness of Retrieval Cues in Memory for Words, Journal of
Experimental Psychology, Vol. 77(4) (1968), pp593-601.
[10] E. Tulving and D.M. Thomson, Encoding Specificity and Retrieval Processes in Episodic
Memory, Psychological Review, Vol. 80 (1973), pp352-73.
[11] D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins, Non-Holographic Associative
Memory, Nature, Vol. 222 (1969), pp960-62.
REGULARIZATION AND REALIZABILITY IN
RADIAL BASIS FUNCTION NETWORKS
Jason A.S. Freeman and David Saad*
Centre for Neural Systems, University of Edinburgh,
Edinburgh EH8 9LW, UK. Email: [email protected]
* Department of Computer Science & Applied Mathematics,
University of Aston, Birmingham B4 7ET, UK. Email: [email protected]
Learning and generalization in a two-layer Radial Basis Function network (RBF) is examined
within a stochastic training paradigm. Employing a Bayesian approach, expressions for general-
ization error are derived under the assumption that the generating mechanism (teacher) for the
training data is also an RBF, but one for which the basis function centres and widths need not
correspond to those of the student network. The effects of regularization, via a weight decay term,
are examined. The cases in which the student has greater representational power than the teacher
(over-realizable), and in which the teacher has greater power than the student (unrealizable) are
studied. Finally, simulations are performed which validate the analytic results.
1 Introduction
When considering supervised learning in neural networks, a quantity of particular
interest is the generalization error, a measure of the average deviation between de-
sired and actual network output across the space of possible inputs. Generalization
error consists of two components: approximation error and estimation error. Given
a particular architecture, approximation error is the error made by the optimal
student of that architecture, and is caused by the architecture having insufficient
representational power to exactly emulate the teacher; it is an asymptotic quantity
as it cannot be overcome even in the limit of an infinite amount of training data.
If the approximation error is zero, the problem is termed realizable; if not, it is
termed unrealizable. Estimation error is the error due to not having selected an
optimal student of the chosen architecture; it is a dynamic quantity as it changes
during training and is caused by having insufficient data, noisy data or a learning
algorithm which is not guaranteed to reach an optimal solution in the limit of an
infinite amount of data. There is a trade-off between representational power and
the amount of data required to achieve a particular error value (the sample com-
plexity) in that the more powerful the student, the greater the ability to eliminate
the approximation error but the larger the amount of data required to find a good
student.
This paper employs a Bayesian scheme in which a probability distribution is derived
for the weights of the student; similar approaches can be found in [6, 1, 2] and [3].
In [7], a bound is derived for generalization error in RBFs under the assumption
that the training algorithm finds a global minimum in the error surface; regular-
ization is not considered. For the exactly realizable case, Freeman et al. calculate
generalization error for RBFs using a similar framework to that employed here [2].
2 The RBF Network and Generalization Error
The RBF architecture consists of a two-layer fully-connected network, and is a uni-
versal approximator for continuous functions given a sufficient number of hidden
units [4]. The hidden units will be considered to be Gaussian basis functions, param-
192
Freeman & Saad: Radial Basis Function Networks 193
eterised by a vector representing the position of the basis function centre in input
space and a scalar representing the width of the basis function. These parameters
are assumed to be fixed by a suitable process, such as a clustering algorithm. The
output layer computes a linear combination of the activations of the basis functions,
parameterised by the adaptive weights w between hidden and output layers.
Training examples consist of input-output pairs (ξ, ζ). The components of ξ are un-
correlated Gaussian random variables of mean 0, variance σ_ξ², while ζ is generated
by applying ξ to a teacher RBF and corrupting the output with zero-mean addi-
tive Gaussian noise, variance σ². The number, position and widths of the teacher
hidden units need not correspond to those of the student, allowing investigation of
over-realizable and unrealizable cases. The mapping implemented by the teacher is
denoted f_T, and that of the student f_S. Note it is impossible to examine generaliza-
tion error without some a priori belief in the teacher mechanism [9]. The training
algorithm for the adaptive weights is considered stochastic in nature; the selected
noise process leads to the following form for the likelihood:
P(D|w, β) ∝ exp(−β E_D) (1)
where ED is the training error. This form resembles a Gibbs distribution; it also
corresponds to the constraint that minimizing the training error is equivalent to
maximizing the likelihood [5]. This distribution can be realised by employing the
Langevin algorithm, which is simply gradient descent with an appropriate noise
term added to the weights at each update [8]. To prevent over-dependence of the
distribution of student weight vectors on the noise, it is necessary to introduce a
regularizing factor, which can be viewed as a Bayesian prior:
P(w|γ) ∝ exp(−γ E_W) (2)
where E_W is a penalty term based here on the magnitude of the student weight
vector. Employing Bayes' theorem, one can derive an expression for the probability
of a student weight vector given the training data and training parameters:
P(wID", j3) ex exp (-j3ED - ,Ew) (3)
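The Langevin realisation of this posterior can be sketched as follows; this is a generic illustration assuming the weight-decay penalty E_W = |w|²/2, not the authors' code:

```python
import numpy as np

def langevin_step(w, grad_ED, beta, gamma, eta=0.01, rng=None):
    """One Langevin update sampling from P(w|D) ~ exp(-beta*E_D - gamma*E_W):
    gradient descent on the exponent plus Gaussian noise of matched variance.
    Assumes the weight-decay penalty E_W = |w|**2 / 2, so grad E_W = w."""
    rng = rng or np.random.default_rng()
    grad = beta * grad_ED(w) + gamma * w
    return w - eta * grad + np.sqrt(2.0 * eta) * rng.standard_normal(w.shape)
```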
The common definition of generalization error is the average squared difference
between the target function and the estimating function:
E(w) = ⟨(f_S(ξ, w) − f_T(ξ))²⟩ (4)
where ⟨···⟩ denotes an average w.r.t. the input distribution. In practice, one only
has access to the test error, (1/P_TEST) Σ_p (ζ_p − f_S(ξ_p, w))². This quantity is an
approximation to the expected risk, defined as the expectation of (ζ − f_S(ξ, w))²
w.r.t. the joint distribution p(ξ, ζ). With an additive noise model, the expected
risk decomposes to E + σ², where σ² is the variance of the noise.
When employing stochastic training, two possibilities for average generalization
error arise. If one weight vector is selected from the ensemble, as is usually the case
in practice, equation (4) becomes:
Alternatively, one can take a Bayes-optimal approach which, for squared error,
requires taking the mean estimate of the network. It is computationally impractical
to find this, but it is interesting as it represents the result of the best guess, in an
average sense. In this case generalization error takes the form:
[Figures 3 and 4: generalization error against the number of examples p; legend:
Over-Realisable (Optimal Regularisation), Over-Realisable (Under-Regularised),
Student Matches Teacher.]
Figure 3 The over-realizable case: the dashed curve shows the over-realizable
case with training optimised as if the student matches the teacher (γ = 3.59,
β = 2.56), the solid curve illustrates the over-realizable case with training optimised
with respect to the true teacher (γ = 3.59, β = 1.44), while the dot-dash curve
is for the student matching the teacher (γ = 6.52, β = 4.39). All the curves
were generated with one teacher centre at (1,0); the over-realizable curves had
two student centres at (1,0) and (−1,0). Noise with variance 1 was employed.
Figure 4 The unrealizable case: the solid curve denotes the case where the
student is optimised as if the teacher is identical to it (γ = 2.22, β = 1.55);
the dashed curve demonstrates the student optimised with knowledge of the true
teacher (γ = 2.22, β = 3.05), while, for comparison, the dot-dash curve shows a
student which matches the teacher (γ = 2.22, β = 1.05). The curves were gener-
ated with two teacher centres at (1,0) and (−1,0); the unrealizable curves employed
a single student at (1,0); noise of variance 1 was utilised.
Acknowledgements
This work was supported by EU grant ERB CHRX-CT92-0063.
A UNIVERSAL APPROXIMATOR NETWORK FOR
LEARNING CONDITIONAL PROBABILITY
DENSITIES
D. Husmeier, D. Allen and J. G. Taylor
Centre for Neural Networks, Department of Mathematics,
King's College London, UK.
A general approach is developed to learn the conditional probability density for a noisy time series.
A universal architecture is proposed, which avoids difficulties with the singular low-noise limit. A
suitable error function is presented enabling the probability density to be learnt. The method is
compared with other recently developed approaches, and its effectiveness demonstrated on a time
series generated from a non-trivial stochastic dynamical system.
1 Introduction
Neural networks are used extensively to learn time series, but most of the ap-
proaches, especially those associated with the mean square error function whose
minimisation implies approximating the predictor x(t) --+ x(t + 1), will only give
information on a mean estimate of such a prediction. It is becoming increasingly
important to learn more about a time series, especially when it involves a consid-
erable amount of noise, as in the case of financial time series. This note is on more
recent attempts to learn the conditional probability distribution of the time series,
so the quantity
P(x(t + 1) | x(t)) (1)
where x(t) denotes the time-delayed vector with components (x(t), x(t−1), ..., x(t−
m)), for some suitable integer m.
An important question is as to the nature of a neural network architecture that can
learn such a distribution when both very noisy and nearly deterministic time series
have to be considered. A variety of approaches have independently been recently
proposed (Weigend and Nix, 1994; Srivastava and Weigend, 1994; Neuneier et al.,
1994), and this plethora of methods may lead to a degree of uncertainty as to
their relative effectiveness. We here want to extend the discussion of an earlier
paper (Allen and Taylor, 1994) and will derive the minimal structure of a universal
approximator for learning conditional probability distributions. We will then point
out the relation of this structure to the concepts mentioned above, and will finally
apply the method to a time series generated from a stochastic dynamical system.
2 The General ANN Approach
2.1 Minimal Required Network Structure
Let us formulate the problem in terms of the cumulative probability distribution
F(y|x(t)) = P(x(t+1) ≤ y | x(t)) = ∫_{−∞}^{y} p(y′|x(t)) dy′ (2)
first, as this function does not become singular in the noise-free limit of a deter-
ministic process, but reduces to
F(y|x(t)) = Θ(y − f(x(t))) (3)
(where Θ(·) is the threshold or Heaviside function, Θ(x) = 1 if x ≥ 0, 0 otherwise).
We want to derive the structure of a network that, firstly, is a universal approx-
198
Husmeier et al.: A Network for Learning Conditional Probability 199
imator for F(.) and, secondly, obtains the noise-free limit of (3). It is clear that
a difficulty would arise if the universal approximation theorem of neural networks
were applied directly to the function F(y, x(t)) so as to expand it in terms of the
output of a single hidden layer net,
F(y|x(t)) = Σ_i a_i S(Σ_{j=0}^{m} w_{ij} x_j + b_i y − ϑ_i), (4)
where x_j := x(t − j + 1), S(x) = 1/(1 + e^{−x}), and a_i, w_{ij}, b_i, ϑ_i are constants with
obvious interpretation in neural network terms (a_i output weights, ϑ_i thresholds,
w_{ij}, b_i weights between input and hidden layer). Trying to obtain the deterministic
case (3), the best one can get from (4) in the limit when a_i → δ_{i,i₀} (Kronecker
delta) and the weights of the hidden nodes become very large (so that each sigmoid
function S(·) entering in (4) becomes a threshold function Θ(·)), is the expression
with constants Cj, s. Thus at best it is only possible to approximate the deterministic
limit of a linear process, with f(x) a linear function of its variable.
To be able to handle the split between the single predicted variable y and the input
variable x, we proceed by developing a universal one-layer representation for y first,
and then do the same for the remaining variable x successively. Thus at the first
step there results
F(y|x(t)) = Σ_i a_i S(β_i(y − μ_i)), (6)
where β_i is the steepness of the sigmoid function, and μ_i a given threshold. We
then expand each μ_i by means of the universal approximation theorem in terms of
a further hidden layer network, resulting in
In order that the right hand side of (6) be a cumulative probability distribution, i.e.
F(−∞|x) = 0, F(∞|x) = 1 and F(y|x) monotone increasing in the variable y, the
following conditions are imposed on the coefficients entering eq. (6):
β_i > 0, a_i ≥ 0, Σ_i a_i = 1 (8)
This can be realized by taking β_i and a_i as functions of new variables β̃_i and γ_i, e.g.
β_i = (β̃_i)², and a_i = (γ_i)²/Σ_k (γ_k)² or a_i = exp(γ_i)/Σ_k exp(γ_k).
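A sketch of (6) with the constraints (8) built in through this reparameterisation; here the centres mu are a fixed vector, whereas in the full architecture each μ_i(x) is the output of a subnetwork:

```python
import numpy as np

def F(y, mu, beta_tilde, gamma):
    """Cumulative distribution (6) under the constraints (8):
    beta_i = beta_tilde_i**2 > 0, and a = softmax(gamma), so that
    a_i >= 0 and sum_i a_i = 1."""
    beta = beta_tilde ** 2
    a = np.exp(gamma - gamma.max())   # numerically stable softmax
    a /= a.sum()
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    return float(np.sum(a * sigmoid(beta * (y - mu))))
```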
It is now possible to see that the deterministic limit (3) arises smoothly from the
universal two-hidden-layer architecture of equations (6) and (7) when a_i → δ_{i,i₀}
and β_{i₀} → ∞, i.e., for one output weight one and all the others zero, and a large
steepness parameter for the unit connected to the non-vanishing weight.
2.2 Moments
One of the attractive properties of the use of sigmoid response functions is that it
is possible to give (after some algebra) a closed form for the moment generating
200 CHAPTER 32
M(t) := 100
-00
"a-(-rrt/{3:)exp(J.d)
exp(ty)p(Ylx) dy = L...J' . '
.
•
sm('trt/ (3i)
., (9)
from which all the moments arising from the conditional probability density (1) can
be calculated, e.g.
E(y) = lim_{t→0} dM(t)/dt = Σ_i a_i μ_i, (10)
E(y²) = Σ_i a_i (μ_i² + π²/(3β_i²)). (11)
For a_i → δ_{i,i₀}, β_{i₀} → ∞ the deterministic limit ensues, with E(y^m) = [E(y)]^m.
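Numerically, the first two moments follow directly from (10) and (11); a small helper of our own:

```python
import numpy as np

def mean_and_variance(a, beta, mu):
    """First two moments of the mixture implied by (9): each kernel is a
    logistic with mean mu_i and variance pi**2 / (3 * beta_i**2)."""
    Ey = np.sum(a * mu)                                     # eq. (10)
    Ey2 = np.sum(a * (mu**2 + np.pi**2 / (3 * beta**2)))    # eq. (11)
    return Ey, Ey2 - Ey**2
```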
2.3 The Error Function
It is now necessary to suggest a suitable error function in order to make the universal
expression given by (6) and (7) the true conditional cumulative distribution function
given by the available data. As the true value for x(t + 1) is not known, one cannot
use standard mean square error. Let us assume that the process is stationary. Then
the negative expectation value of the logarithm of the likelihood,
E = −⟨log(p(x(t+1)|x(t)))⟩ ≈ −(1/N) Σ_t log p(x(t+1)|x(t)), (12)
is the appropriate choice, based on Kullback's Lemma (see, e.g., Papoulis, 1984),
according to which (12) in terms of the true (unknown) probability density, Ptrue,
is always smaller than (12) in terms of the probability density predicted by the
network, p. Hence by minimising (12) one can hope to always "get closer" to p_true.
2.4 Regaining the Conditional Probability Density
In order to get back to the conditional probability density, we have to take the
derivative of the output of (6) with respect to the target, y, yielding
p(y|x) = ∂F(y|x)/∂y = Σ_i a_i ∂S(β_i(y − μ_i(x)))/∂y
       = Σ_i a_i β_i S(β_i(y − μ_i(x))) [1 − S(β_i(y − μ_i(x)))] (13)
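Equation (13) together with the error function (12) gives the training criterion; a compact sketch, assuming the centres mu_x = mu(x) have already been computed by the subnetwork:

```python
import numpy as np

def density(y, a, beta, mu_x):
    """Equation (13): p(y|x) as a mixture of 'Gaussian-shaped' kernels
    beta * S * (1 - S), with x-dependent centres mu_x = mu(x)."""
    S = 1.0 / (1.0 + np.exp(-beta * (y - mu_x)))
    return float(np.sum(a * beta * S * (1.0 - S)))

def nll(pairs, a, beta):
    """Error function (12): average negative log-likelihood over training
    pairs (mu(x(t)), x(t+1))."""
    return -np.mean([np.log(density(y, a, beta, mu_x)) for mu_x, y in pairs])
```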
This function is Gaussian-shaped, so our resulting network structure can be sum-
marized as follows: The output node, which predicts the conditional probability
density p(x(t + 1) = y|x(t)), is connected to a layer of RBF-like nodes. The out-
put of the ith "RBF"-unit is determined by its input, x(t + 1) (the same for all
nodes), and its centre, μ_i, the latter being an x-dependent function implemented in
a separate one-hidden-layer network with the usual sigmoidal hidden nodes. The
parameters a_i and β_i have obvious interpretations as a priori probability and inverse
kernel width, respectively. (Note that from (10) and (11) one gets σ_i = π/(√3 β_i) for
the standard deviation of the ith kernel, i.e. its "width"). The same structure can
basically be found in the other approaches mentioned above too, with the following
differences and simplifications.
3 Comparison with Other Approaches
The CDEN (Conditional Density Estimation Network) of Neuneier et al. (1994)
uses only one hidden layer for computing all of the μ_i(x), which corresponds to
simplifying (7) by making c_{ijk} and ϑ_{ij} independent of the subscript i. This simpli-
fication is certainly justified when the J.li(X) are of a similar functional form, with
Husmeier et al.: A Network for Learning Conditional Probability 201
Network "RBF" ker- /-Ii (1; ai
nel function x-dependent x-dependent x-dependent
As proposed 8'(.) yes, seperate no no
here hidden layers
eDEN Gaussian yes, 1 shared yes yes
hidden layer
Soft triangular no, from no, from yes
histograms preprocessing preprocessing
Weigend, 1 Gaussian yes yes unity
Nix
similar a priori probabilities, but may cause difficulties when this assumption is
strictly violated. On the other hand, the CDEN includes a further generalisation
of the architecture proposed here by making the kernel widths, σ_i, and the output
weights, a_i, x-dependent, computing them as outputs of separate one-hidden-layer
networks. This is not necessary in order for the architecture to be a universal ap-
proximator, but may lead to a considerable reduction of the "RBF"-layer size, the
latter, though, at the expense of additional complexity in other parts of the net-
work. It thus depends on the specific problem under consideration if this further
generalisation is an advantage.
The soft histogram approach proposed by Srivastava and Weigend (1994) is identical
to a mixture of triangular-shaped "RBF"-nodes, P(y|x) = Σ_i a_i(x) R_i(y), with
R_i(y) = h_i(y − μ_{i−1})/(μ_i − μ_{i−1}) if μ_{i−1} ≤ y ≤ μ_i, R_i(y) = h_i(μ_{i+1} − y)/(μ_{i+1} − μ_i)
if μ_i ≤ y ≤ μ_{i+1}, and R_i(y) = 0 otherwise. The "heights" h_i are related to the
kernel widths σ_i := μ_{i+1} − μ_{i−1} by h_i = 2/σ_i in order to ensure that the
normalisation condition ∫_{−∞}^{∞} P(y|x) dy = 1 is satisfied. The kernel centres result
from a preprocessing "binning" procedure (described in [Srivastava, Weigend 1994]).
The output of the network is thus similar to the CDEN and the model proposed
in this paper, with the difference that now only the output weights a_i(x) are x-
dependent, whereas the kernel centres, μ_i, are held constant.
Finally the architecture introduced by Weigend and Nix (1994) reduces the size
of the "RBF" layer to only one node, assuming a Gaussian probability distribu-
tion, which is completely specified by μ and σ, both of which are given as the
(x-dependent) output of a separate network branch. Obviously, this parametric
approximation reduces the network complexity, but leads to a biased prediction.
4 Simulation
We tested the performance of our method on an artificial time series generated from
a stochastic coupling of two stochastic dynamical systems,
x(t + 1) = Θ(ε − ϑ) a x(t)[1 − x(t)] + (1 − Θ(ε − ϑ))[1 − x^κ(t)], (14)
where Θ(·) symbolizes the Heaviside function, as before, and the parameters a, κ
and ε are random variables drawn from a uniform distribution, ε ∈ [0,1], a ∈ [3,4],
κ ∈ [0.5,1.25].
[Figure 1: Centres of the "RBF"-kernels, μ_i(x(t)), after training (black lines);
the grey dots show a state-space plot of the time series. Figure 2: Cross-section
of the true (narrow graph) and predicted (bold graph) conditional probability
density, P(x(t+1) = y | x(t) = 0.6).]
The prior probabilities of the two processes are determined by the
threshold constant ϑ, which was chosen such that we got a ratio of 2:1 in favour of the
left process (i.e., ϑ = 1/3). We applied a network with ten "RBF"-nodes to learn
the conditional probability density of this time series. Figure 1 shows a state-space
plot of the time series (dots) and the centre positions of the "RBF"-kernels, μ_i(x(t))
(black lines). Figure 2 shows a cross-section of the conditional probability density for
x(t) = 0.6, P(x(t + 1) = y | x(t) = 0.6), and compares the correct function (narrow
grey line) with the one predicted by the network (bold black line). Apparently the
network has captured the relevant features of the stochastic process and correctly
predicts the existence of two clusters. Note that a conventional network for time
series prediction, which only predicts the conditional mean of the process, would be
completely inappropriate in this case as it would predict a value between the two
clusters, which actually never occurs. The same holds for the network proposed by
Weigend and Nix, which would predict a value for x(t + 1) between the two clusters,
too, and a much too large error bar resulting from the assumption of a Gaussian
distribution.
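For reference, the benchmark series (14) is easy to regenerate; a sketch under our reading of the parameter ranges above:

```python
import numpy as np

def simulate(T=1000, theta=1/3, x0=0.5, seed=0):
    """Generate the benchmark series (14): with probability 1 - theta a
    logistic-map step a*x*(1-x), otherwise the step 1 - x**kappa, with
    a ~ U[3,4] and kappa ~ U[0.5,1.25] redrawn at every step."""
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = x0
    for t in range(T - 1):
        eps = rng.uniform(0, 1)
        a = rng.uniform(3, 4)
        kappa = rng.uniform(0.5, 1.25)
        # Theta(eps - theta) = 1 selects the logistic (left) process
        x[t + 1] = a * x[t] * (1 - x[t]) if eps >= theta else 1 - x[t] ** kappa
    return x
```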
5 Conclusions
A general approach to the learning of conditional probability densities for station-
ary stochastic time series has been presented, which overcomes the limitation of
restricted reduction to the noise-free case. The minimal architecture required for
the network to be a universal approximator and to contain the non-restricted noise-
free case was found to be of a two-hidden-layer structure. We have shown that the
other recently developed approaches to the same problem are of a very similar form,
differing basically only with respect to the x-dependence of the parameters and the
functional form of the "RBF" -kernels. Which architecture one should finally adopt
depends on the specific problem under consideration. While the structure proposed
by Weigend and Nix is suitable for a computationally cheap, but biased approxi-
mation of p(x(t + 1)|x(t)) (approximation by a single Gaussian), the CDEN, the
method of soft histograms and the network proposed here offer a more accurate,
but also computationally more expensive determination of the whole conditional
Husmeier et al.: A Network for Learning Conditional Probability 203
probability density. It is an important subject of further research to assess the rel-
ative merits and drawbacks of the latter three models by carrying out comparative
simulations on a set of benchmark problems.
REFERENCES
[1] Allen DW and Taylor JG , Learning Time Series by Neural Networks, Proc ICANN (1994),
ed Marinaro M and Morasso P, Springer, pp529-532.
[2] Neuneier R, Hergert F, Finnoff W and Ormoneit D, Estimation of Conditional Densities: A
Comparison of Neural Network Approaches, Proc ICANN (1994), ed Marinaro M and Morasso
P, Springer, pp689-692.
[3] Papoulis, A., Probability, Random Variables and Stochastic Processes, McGraw-Hill (1984).
[4] Srivastava AN and Weigend AS, Computing the Probability Density in Connectionist Regres-
sion, Proc ICANN (1994), ed Marinaro M and Morasso P, Springer, pp685-688.
[5] Weigend AS and Nix DA, Predictions with Confidence Intervals (Local Error Bars), Proc
ICONIP (1994), ed Kim M-W and Lee S-Y, Korea Advanced Institute of Technology, pp847-
852.
CONVERGENCE OF A CLASS OF NEURAL
NETWORKS
Mark P. Joy
Centre for Research in Information Engineering,
School of Electrical, Electronic and Information Engineering,
South Bank University, 103 Borough Rd., London SE1 0AA, UK.
Email: [email protected]
The additive neural network model is a nonlinear dynamical system and it is well-known that if
either the weight matrices are symmetric or the dynamical system is cooperative and irreducible
(with isolated equilibria) then the net exhibits convergent activation dynamics. In this paper we
present a convergence theorem for additive neural nets with ramp sigmoid activation functions
having a row-dominant weight matrix. Of course, such nets are not, in general, cooperative, nor
do they possess symmetric weight matrices. We also indicate the application of this theorem to a new class
of neural networks - the Cellular Neural Networks - and consider its usefulness as a practical
result in image processing applications.
1 Introduction
In the last few years the definition of the neural network dynamical system has
expanded rapidly. This paper concerns neural networks associated with the set of
ODEs:
n
(1)
j=l
where x E IR n , describes the activity levels of n 'neurons', bij is a real constant
representing the connection strength from neuron j to neuron i, f is a sigmoid
function and I represents a clamped input. In order to facilitate the analysis of (1)
we define f as a ramp sigmoid:
f(x) = ½(|x + 1| − |x − 1|). (2)
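A sketch of the ramp sigmoid and a forward-Euler step of (1), assuming the standard additive form dx_i/dt = −x_i + Σ_j b_ij f(x_j) + I_i (consistent with the vector field (4) used later):

```python
import numpy as np

def ramp(x):
    """The ramp sigmoid (2): linear on [-1, 1], saturated outside."""
    return 0.5 * (np.abs(x + 1) - np.abs(x - 1))

def euler_step(x, B, I, dt=0.01):
    """One forward-Euler step of the additive network (1):
    dx/dt = -x + B f(x) + I."""
    return x + dt * (-x + B @ ramp(x) + I)
```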
We wish to investigate the activation dynamics of these nets operating as CAMs.
Upon presentation of an input the net should 'flow', eventually settling at a station-
ary point of the dynamical system which represents output in the form of n reals.
With this dynamic in mind we wish to prevent, by judicious design, any of the more
exotic dynamical behaviour exhibited by nonlinear systems. What we seek to prove
is that the union of the basins of attraction of equilibria comprises the whole phase
space. It is natural then to make the following definition:
Definition 1 A dynamical system is said to be convergent if for every trajectory
Φ_t(x) we have:
lim_{t→∞} Φ_t(x) = η,
where η is an equilibrium point.
It is possible to relax this definition in different ways and still have workable def-
initions of the behaviour we require. For more details, see the excellent paper by
Hirsch, [1].
It is easy enough to show that solutions, x(t, x₀) to (1), have bounded forward
closure if ‖x₀‖ ≤ K, K > 0, in the sup norm. We will assume boundedness of
initial conditions throughout, so that the phase space Ω is a compact subset of ℝⁿ.
By standard theory, all trajectories can be extended to have domain of definition
equal to the whole of ℝ.
Since f is linear in the three segments defined by |x| ≤ 1, x > 1 and x < −1,
the CNN vector field is piecewise linear. The regions of Ω on which the vector field
is constant will be called partial or total saturation regions depending on whether
some cells are in their linear region or not. It is clear (by piecewise linearity) that
if any total saturation region contains a stable equilibrium it is a trapping region;
whilst the absence of an equilibrium means that all trajectories will leave such a
region. Finally, Jacobians are constant on the various partial saturation regions.
The analysis of large-scale interconnected systems often proceeds by investigating
the behaviour of the system thought of as a collection of subunits. When we can
describe the dynamical behaviour of the subunits successfully we may make progress
towards the description of the behaviour of the system as a whole by supposing
that the connections between units are so weak that the dynamical behaviour of
the isolated units dominate. In [2] an analysis of a type of additive neural net
was presented which presumed a strict row-diagonally dominant condition on the
feedback matrix B = (b_{ij}); namely:
b_{ii} − 1 > Σ_{j≠i} |b_{ij}|. (3)
Roughly speaking (3) says that in terms of the dynamics, the self-feedback of each
CNN cell dominates over the summed influences from its neighbours and may there-
fore be viewed as an example of the above large-scale system analysis. In section
III of [2], the authors conjectured that (3) is sufficient to ensure convergence of the
associated net; furthermore they outlined a method of proof. In this note we show
that this conjecture is true by giving a rigorous argument built on their ideas.
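Condition (3) is easy to test numerically; a small check of our own:

```python
import numpy as np

def is_row_dominant(B):
    """Check condition (3): b_ii - 1 > sum_{j != i} |b_ij| for every row,
    i.e. self-feedback dominates the summed neighbour influences."""
    off_diag = np.sum(np.abs(B), axis=1) - np.abs(np.diag(B))
    return bool(np.all(np.diag(B) - 1 > off_diag))
```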
2 Convergence of the Dynamical System.
Of prime importance to our result is the fact that if B satisfies (3), then each
total saturation region contains a (unique) equilibrium point; this is the substance
of:
Lemma 2 If B is strictly row-diagonally dominant, then every total saturation
region contains an equilibrium.
Proof We borrow some notation from [3]; let S ⊂ {1, ..., n} be of cardinality m
and define the following sets:
Λ_m = {ε = (ε_1, ..., ε_n) | ε_i = 0, i ∈ S, and ε_i ∈ {−1, 1} for i ∉ S},
where:
Λ = {(ε_1, ..., ε_n) | ε_i ∈ {−1, 1}}.
Then for each ε ∈ Λ define:
C(ε) = {(x_1, ..., x_n) ∈ Ω | |x_i| < 1 if ε_i = 0, x_i ≥ 1 if ε_i = 1, x_i ≤ −1 otherwise}.
Firstly we consider the total saturation region C(ε*), for some ε* ∈ Λ₀. Suppose
that an equilibrium x̄ for the linear system of equations restricted to C(ε*) satisfies
f(x̄_i) = 1; then x̄_i ≥ 1,
provided that (3) is satisfied. Similarly, if f(x̄_i) = −1, x̄_i ≤ −1. Thus each total
saturation region contains a (unique) equilibrium. □
Much more than Lemma 2 is true; in fact it is possible to show that there exists
an equilibrium in each partial and total saturation region, if and only if B
satisfies (3). In this case it is possible, by a linearly invertible change of variables
(Grossberg's S-exchange), to consider the dynamical system operating on a closed
hypercube, a so-called BSB system ("Brain-State-in-a-Box"); the existence of
equilibria at each corner of the hypercube is then shown in [4].
Our main theorem is:
Proof By Gerschgorin's Circle theorem (see [5]), if (3) is satisfied all the Ger-
schgorin discs lie in the right half-plane. Thus any unsaturated variable, corre-
sponding to a cell operating in its linear region, has an associated eigenvalue with
positive real part. It follows that trajectories decay from partial saturation regions
and the linear region since at least one eigenvalue of the Jacobian has positive real
part there (unless, of course a trajectory lies in the stable manifold of an unstable
equilibrium point, in which case it converges to an equilibrium anyway).
Once a hyperplane x_i = ±1 has been crossed by a trajectory φ(t), say at time
t₀, then at no future time t > t₀ will this trajectory satisfy |φ_i(t)| < 1. To see this
consider the vector field along the hyperplane H_i, defined by x_i = 1. We have:
ẋ_i = b_{ii} − 1 + Σ_{j≠i} b_{ij} f(x_j), (4)
and it follows that ẋ_i > 0 along H_i since the maximum value that the summation
term in (4) attains equals:
Σ_{j≠i} |b_{ij}|,
because |f(x)| ≤ 1 for all x. (An identical argument along x_i = −1 holds.) Thus
the vector field points 'outwards', i.e. in the direction of increasing |x_i|, along the
hyperplanes forming the boundary of the individual linear regions, and therefore no
trajectory can decay across any H_i with |φ_i| decreasing.
If we define:
p(φ(t, x)) = #(saturated components of φ(t, x)),
then the argument in the previous paragraph shows that p(t) is a non-decreasing
function (defined on {x} × [0,∞)), which is bounded above by N. Thus lim_{t→∞} p(t)
exists and actually must equal N (if not, given some T > 0, there is at least one
unsaturated cell, i.e. there exists i_k, say, such that |φ_{i_k}(x, t)| < 1 for all t ≥ T
and 1 ≤ k ≤ N. However, since the real parts of the corresponding eigenvalues are
positive, there exists a t₀ > T such that |φ_{i_k}(x, t₀)| > 1, which is a contradiction).
Since all total saturation regions contain a stable equilibrium point by Lemma 2,
and the basin of attraction of such an equilibrium contains the total saturation
region, all trajectories converge. □
By an identical argument we are able to prove the more general result:
In this paper we discuss an extended model neuron in ANN. It is the compartmental model which
has already been developed to model living neurons. This type of neuron belongs to the class of
temporal processing neurons. Learning of these neurons can be achieved by extending the model
of Temporal Backpropagation. The basic assumption behind the model is that the single neuron
is considered as possessing finite spatial dimension and not being a point processor (integrate and
fire). Simulations and numerical results are presented and discussed for three applications to time
series problems.
1 Introduction
Most models of a neuron used in ANN neglect the spatial structure of a neuron's
extensive dendritic tree system. However, from a computational viewpoint there are
a number of reasons for taking into account such a structure. (i) The passive mem-
brane properties of a neuron's dendritic tree lead to a spread of activity through
the tree from the point of stimulation at a synapse. Hence spatial structure in-
fluences the temporal processing of synaptic inputs. (ii) The spatial relationship of
synapses relative to each other and the soma is important. (iii) The geometry of the
dendritic tree ensures that different branches can function almost independently of
one another. Moreover, there is growing evidence that quite complex computations
are being performed within the dendrites prior to subsequent processing at the
soma.
The (linear) cable theory models try to explain the creation of the action potential
[1]. These models belong to the class of Hodgkin-Huxley derived models. The one-
dimensional cable theory describes current flow in a continuous passive dendritic
tree using P.D.E.s. These equations have straightforward solutions for an idealised
class of dendritic trees that are equivalent to unbranched cylinders. For cases of
passive dendritic trees with a general branching structure the solutions are more
complicated. When the membrane properties are voltage dependent then the ana-
lytical approach using linear cable theory is no longer valid. One way to account
for the geometry of complex neurons has been explored by Abbott et al. [2]. Using
path-integral techniques, they construct the membrane potential response function
(Green's function) of a dendritic tree described by a linear cable equation. The re-
sponse function determines the membrane potential arising from the instantaneous
injection of a unit current impulse at a given point on the tree.
A complementary approach to modelling a neuron's dendritic tree is to use a com-
partmental model in which the dendritic system is divided into sufficiently small
regions or compartments such that spatial variations of the electrical properties
within a region are negligible. The P.D.E.s of cable theory then simplify to a sys-
tem of first order O.D.E.s. From these equations we can then calculate the response
function, i.e. the law with which the action potential is created, see [3], [4], [5]. Using
the previous ideas, we construct an artificial neuron through appropriate simplifica-
tions in the structure of the neurobiological compartmental model and its response
function.
2 The Compartmental Model Neuron
2.1 The Transfer Function
The artificial neuron is composed of a set of compartments, at which connections
from other neurons of the network arrive. The transfer function for neuron i
in the layer l in time instance t has the form:
u^β(n) = Σ_{j=1}^{M₀^β} w^β_j(n) y_{(l−1)j}(t − n) (2)
where S is any suitable nonlinear function, usually the sigmoid, V_i^0(t) is the net
potential at "soma" (compartment 0) at time t, T is the number of time steps used
to keep a history of the neuron, m is the number of compartments in the neuron,
M₀^β is the number of incoming weights to compartment β (note that this number
is compartment related), u^β(n) is the net potential at compartment β at time
instance n, and ν^β(n) is a kernel function corresponding to the response function
of the respective neurobiological model and has the following general form:
Figure 1 Logistic Time Series. Logistic map, λ = 3.8. T = 1, G = 2. Network 1-15-1.
TDimH = 3, TDimO = 1. Multistep prediction region 900-932.
[Figure 2: original series (solid), Temporal Backpropagation (dotted) and
Compartmental (dashed) predictions over the range 0-1000.]
Again the Compartmental and Temporal Backpropagation models were used for
comparison. The architecture that was employed was a 1-5-1 network with 3 tapped
delays for the hidden neurons and 1 delay line for the output neuron. The T and
G parameters were 1.0 and 1.0 respectively. The incoming connections to the output
neuron were connected to the "soma". Again 400 points were used, here though
from the range 400-800, because we wanted to avoid the initial transient period.
The prediction that was sought was in the range of 800-1000, and 150 points were
used for single step prediction and the remaining 50 points were generated by a
multi-step prediction scheme.
The solid line corresponds to the original series, the sparse and dense dashed lines
represent the Compartmental and Temporal Backpropagation model predictions
respectively.
Fig. 2 shows that a good approximation of the underlying map was achieved by
both models even though a fairly simple architecture was employed and the training
set was not the most representative.
3.3 Solid State Series
In Fig. 3 we see a series that is describing the chaotic voltage oscillations in the
NDR region of the crystal.
Figure 3 Solid State Physics Time Series. T = 1, G = 2. Network 1-5-1. TDimH =
8, TDimO = 2. Connections to "soma". Network 1-5-1. TDimH = TDimO = 1; 5
comps/neuron. Fully connected.
Again the Compartmental and Temporal Backpropagation models were used for comparison. The architecture employed was a 1-5-1 network with 8 tapped delays for the hidden neurons and 2 tapped delays for the output neuron. The T and Γ parameters were 1.0 and 2.0 respectively. The incoming connections to the output neuron were connected to the "soma". For training, 450 points were used, from the range 1-450; 200 points were used for validation in the range 500-700. The prediction sought was in the range 700-1000; 200 points were used for single-step prediction and the remaining 100 points were generated by a multi-step prediction scheme.
Here we also tackle the problem of how the incoming weights to a neuron should be distributed among its compartments. Instead of connecting just to one compartment, we tested the idea of connecting to all the compartments. For this purpose we used a 1-5-1 network structure with 1 tapped delay line per hidden and output neuron, but with neurons consisting of 5 compartments each, to every one of which the incoming signal from every other neuron is connected. The key point to note is that the value of the incoming signal is the same for all five compartments, but a different weight exists for every connection to a specific compartment.
In Fig. 3 the solid line corresponds to the original series, and the flat sparse and exponentially decaying dense dashed lines represent the Compartmental and Temporal Backpropagation model predictions respectively. The dense dashed line is the Compartmental model with full connectivity to all compartments.
Fig. 3 shows that a good approximation of the underlying map was achieved by both models even though a fairly simple architecture was employed. We were surprised that the best approximation to the underlying map was achieved by the fully connected model.
Finally, we note that, for comparison, the parameters of the nets were chosen to produce exactly the same number of free parameters (weights), namely fifty, for all three models.
4 Conclusion
From these initial simulations we see that in general the Compartmental model is
at least as successful as the Temporal Backpropagation model for the time series
involved. Even though fairly simple architectures were used, the underlying mapping
was approximated reasonably well, as the single-step predictions show. For the multi-step predictions further simulations are needed in order to specify a more appropriate architecture and parameter range that leads to better performance. Further research is now being carried out to investigate the complex couplings of the parameters that control the behaviour of the model. A major issue of the model is also the scheme by which the incoming connections of each neuron are assigned to its compartments [11].
REFERENCES
[1] Koch, C. and Segev, I., editors, Methods in Neuronal Modeling, MIT Press (1989).
[2] Abbott, L. F., Farhi, E., Gutmann, S., The path integral for dendritic trees, Biological Cybernetics, Vol. 66 (1991), pp61-70.
[3] Bressloff, P. C., Dynamics of a compartmental model integrate-and-fire neuron with somatic potential reset, Physica D, submitted.
[4] Bressloff, P. C., Taylor, J. G., Compartmental response function for dendritic trees, Biological Cybernetics, Vol. 70 (1993), pp199-207.
[5] Bressloff, P. C., Taylor, J. G., Spatio-temporal pattern processing in a compartmental model neuron, Physical Review E, Vol. 47 (1993), pp2899-2912.
[6] Wan, E., Finite impulse response neural networks for autoregressive time series prediction, in: Predicting the Future and Understanding the Past, eds. A. Weigend and N. Gershenfeld, SFI Studies in the Sciences of Complexity, Proc. Vol. XVII (1993), Addison-Wesley.
[7] Kleidis, K., Varvoglis, H., Papadopoulos, D., Interaction of charged particles with gravitational waves of various polarizations and directions of propagation, Astronomy and Astrophysics, Vol. 275 (1993), pp309-317.
[8] Karakotsou, C., Anagnostopoulos, A., Kambas, K., Spyridelis, J., Chaotic voltage oscillations in the negative-differential-resistance region of the I-U curves of V2O5 crystals, Physical Review B, Vol. 46 (1992), No. 24, p16144.
[9] Kasderidis, S., Taylor, J. G., The compartmental model neuron and its application to time series analysis, in: Proceedings of IEEE Workshop on Nonlinear Signal and Image Processing, Neos Marmaras, Greece (June 20-22 1995), p58.
[10] Kasderidis, S., Taylor, J. G., The compartmental model neuron, in: Proceedings of WCNN (July 17-21 1995), Washington D.C., USA.
[11] Kasderidis, S., Taylor, J. G., King's College preprint (to appear).
Acknowledgements
During this work the first author was supported by a grant of the Greek State
Scholarship Foundation.
INFORMATION THEORETIC NEURAL NETWORKS
FOR CONTEXTUALLY GUIDED UNSUPERVISED
LEARNING
Jim Kay
Dept. of Statistics, University of Glasgow, Mathematics Building,
Glasgow G12 8QW, UK. Email: [email protected]
The purpose of this article is to describe some new applications of information-theoretic concepts
in unsupervised learning with particular emphasis on the implementation of contextual guidance
during learning and processing.
1 Introduction
Building on earlier work by Linsker [8] and by Becker and Hinton [2], Kay and Phillips [7] used the concept of three-way mutual information to define a new class of objective functions designed to maximise the transfer of that part of the information shared between a set of inputs (the receptive field) and the outputs which is predictably related to the context (the contextual field) in which the learning occurs, and termed one member of this new class Coherent Infomax. In addition they introduced a new activation function which combines information flowing from the receptive and contextual fields in a novel way within local processors. Two illustrations of the role of contextual guidance are given in [7] and further demonstrations are provided in [9]. This work, however, considered only the case where the local processors have binary output units. In this article the methodology proposed in [7] is extended to the case of general multivariate Gibbs distributions, but will be described, for simplicity, in the particular case of multivariate binary outputs. The
article proceeds as follows. In section 2 notation will be described and probabilistic
modelling of the multivariate outputs considered. The definition of suitable objec-
tive functions will be discussed in section 3 and various local objective functions
introduced. The learning rules will be presented in section 4 and finally in section
5 computational issues will be briefly considered.
2 Probabilistic Modelling
We consider a local processor having multiple outputs. This processor receives input from two distinct sources, namely its receptive field inputs, $R = \{R_1, R_2, \ldots, R_m\}$, and its contextual field inputs, $C = \{C_1, C_2, \ldots, C_n\}$, and produces its outputs, $X = \{X_1, X_2, \ldots, X_p\}$, where $R$, $C$ and $X$ are taken to be random vectors and we adopt the usual device of denoting a random variable by a capital letter and its observed value by the corresponding lower-case letter. In order to allow explicitly for the possibility of incomplete connectivity we define connection neighbourhoods for the $i$th output unit $X_i$. Let $\partial i(r)$, $\partial i(c)$ and $\partial i(x)$ denote, respectively, the set of indices of the RF input units, the set of indices of the CF inputs and the set of indices of the outputs that are connected to the $i$th output unit $X_i$. The corresponding random variables are denoted by $R_{\partial i}$, $C_{\partial i}$, $X_{\partial i}$ respectively, and the set of all components of $X$ excluding the $i$th component is $X_{-i}$. The weights on the connections into the $i$th output unit are given by $w_{ij}$, $v_{ij}$ and $u_{ij}$ for the $j$th RF input, $j$th CF input and the $j$th output unit respectively, and we assume that the weights connecting
the output units to each other are symmetric. We now define the integrated fields
in relation to the ith output unit.
$s_i(r) = \sum_{j \in \partial i(r)} w_{ij} R_j - w_{i0}$ is the RF integrated field.
$s_i(c) = \sum_{j \in \partial i(c)} v_{ij} C_j - v_{i0}$ is the CF integrated field.
$s_i(x) = \sum_{j \in \partial i(x)} u_{ij} X_j$ is the output integrated field.
The weights $w_{i0}$ and $v_{i0}$ are biases.
The activation function at the ith output unit is now a function of three integrated
fields and we shall take it to have the following form
$$A_i = A(s_i(r), s_i(c)) + s_i(x) \equiv a_i + s_i(x), \quad (1)$$
although we shall derive the learning rules in the general case. This particular way
of incorporating the integrated output has been chosen so that this local activation
function at the ith output unit is consistent with the definition of a multivariate
model for X. The activation function A which binds the activity of the RF and CF
integrated fields is that proposed in [7], defined by
$$A(s_1, s_2) = \tfrac{1}{2}\, s_1 \left(1 + \exp(2 s_1 s_2)\right). \quad (2)$$
We assume that the output vector $X$ follows a binary Markov random field model [3] conditional on the RF and CF inputs, with probability mass function
$$\Pr(X = x \mid R = r, C = c) = \frac{\exp\left(\sum_{i=1}^{p} a_i x_i + \frac{1}{2} \sum_{i=1}^{p} \sum_{j \in \partial i(x)} u_{ij} x_i x_j\right)}{Z(a, u)} \quad (3)$$
where Z(a, u) is the normalisation constant (i.e. not a function of x) required to
ensure that the probabilities sum to unity. This model is a regression model in two
distinct senses. Firstly, via the terms $\{a_i\}$, which are general nonlinear functions of the RF and CF inputs, it is a nonlinear regression of the outputs with respect to all of the inputs. Secondly, when written in conditional form in equation (3),
it expresses an auto-regression for each output unit in terms of the other output
units in its neighbourhood. The formulation developed in the above model has the
advantage of interfacing a feed-forward network between layers with a recurrent
network structure within the output layer within a single coherent probabilistic
framework. Not only that but it is also possible to connect the multiple output local
processors themselves in a multi-layered network structure in a probabilistically
coherent manner.
From this model the local conditional distributions may be derived. As we shall
see, using these distributions provides a local structure to the learning rules and,
under the restrictions on the output connection weights, the Hammersley-Clifford
theorem [3] ensures that working locally with the conditional models is equivalent
to assuming a coherent global model for the outputs. However, in the case where the output units are fully, mutually connected, the equations presented here hold without any necessity to invoke this theorem to ensure probabilistic coherence, and are derived using the basic rules of probability.
The local conditional distribution for the $i$th output is
$$\theta_i \equiv \Pr(X_i = 1 \mid R_{\partial i} = r_{\partial i},\, C_{\partial i} = c_{\partial i},\, X_{\partial i} = x_{\partial i}) = 1/(1 + \exp(-A_i)). \quad (4)$$
Here $A_i = a_i + s_i(x)$, where $a_i$ is any general differentiable function of the integrated RF and CF fields.
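For concreteness, a small sketch of the forward computation of equations (1)-(4) for a single output unit (full connectivity, the array sizes and the random weights are illustrative assumptions):

```python
import numpy as np

def A(s1, s2):
    # equation (2): binds the RF and CF integrated fields
    return 0.5 * s1 * (1.0 + np.exp(2.0 * s1 * s2))

def theta(r, c, x_others, w, w0, v, v0, u):
    s_r = w @ r - w0            # RF integrated field s_i(r)
    s_c = v @ c - v0            # CF integrated field s_i(c)
    s_x = u @ x_others          # output integrated field s_i(x)
    A_i = A(s_r, s_c) + s_x     # equation (1)
    return 1.0 / (1.0 + np.exp(-A_i))   # equation (4): Pr(X_i = 1 | ...)

rng = np.random.default_rng(0)
r = rng.integers(0, 2, size=4).astype(float)   # binary RF inputs
c = rng.integers(0, 2, size=3).astype(float)   # binary CF inputs
x = rng.integers(0, 2, size=2).astype(float)   # neighbouring outputs
p = theta(r, c, x, rng.normal(size=4), 0.0,
          rng.normal(size=3), 0.0, rng.normal(size=2))
```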
3 Global and Local Objective Functions
We now consider a global objective function based on the joint distribution of all
outputs, RF inputs and CF inputs. In the case of multivariate outputs, we consider
the general version of the objective function introduced in [7] which is
$$F = I(X; R; C) + \phi_1 I(X; R|C) + \phi_2 I(X; C|R) + \phi_3 H(X|R, C). \quad (5)$$
Here the term $I(X; R; C)$ is the three-way mutual information between the random vectors $X$, $R$ and $C$, given by
$$I(X; R; C) = I(X; R) - I(X; R|C) \quad (6)$$
and
$$I(X; R) = H(X) - H(X|R) \quad (7)$$
is the mutual information shared between the random vectors $X$ and $R$; the symbol $H$ denotes Shannon's entropy. For further details see [4]. The objective function in equation (5) is based on the three-way mutual information and the three possible conditional measures of (two-way) mutual information.
For the purposes of modelling this is expressed as
$$F = H(X) - \psi_1 H(X|R) - \psi_2 H(X|C) - \psi_3 H(X|R, C). \quad (8)$$
In terms of the output, and indeed the input, units this is a global objective
function and general learning rules have been derived in the general case of Gibbs
distributions. However these lead to learning rules that are global and also compu-
tationally challenging in their exact form and so we now describe a particular local
approximation to the global objective function F; other definitions of locality are
possible [5]. In this multiple output case, it is natural to think of the processing
locally in terms of each output unit using the information available to it from its
RF, CF and output neighbourhoods. This suggests that one might focus in turn on
the joint distribution of, say, the ith output and its RF and CF inputs given the
neighbouring output units. From this perspective it then seems natural to intro-
duce the concept of the conditional three-way mutual information shared by the
ith output and its RF and CF inputs given the neighbouring outputs denoted by
$$I(X_i; R_{\partial i}; C_{\partial i} | X_{\partial i}) = I(X_i; R_{\partial i} | X_{\partial i}) - I(X_i; R_{\partial i} | C_{\partial i}, X_{\partial i}) \quad (9)$$
It is possible to decompose the global three-way mutual information as follows.
$$I(X; R; C) = I(X_i; R_{\partial i}; C_{\partial i} | X_{\partial i}) + I(X_{-i}; R; C) \quad (10)$$
This decomposition may be repeated recursively and is of particular relevance when
the output units represent some one-dimensional structure such as a time series; then the well-known general factorisation of a joint probability into a product of marginal and conditional distributions allows the general three-way mutual information to be written as a sum of local conditional three-way mutual information
terms. However such simplicity is not possible here, although this first-step decom-
position shows that the conditional three-way information is a part of the global
three-way information in a well-defined sense. The same conditioning idea may be
applied to the other components of information within the objective function F and
this leads to the specification of a local objective function for the ith output unit
defined as follows
$$F_i = I(X_i; R_{\partial i}; C_{\partial i} | X_{\partial i}) + \phi_1 I(X_i; R_{\partial i} | C_{\partial i}, X_{\partial i}) + \phi_2 I(X_i; C_{\partial i} | R_{\partial i}, X_{\partial i}) + \phi_3 H(X_i | R_{\partial i}, C_{\partial i}, X_{\partial i}), \quad (11)$$
and we express $\{F_i\}$ in the more useful form
$$F_i = H(X_i | X_{\partial i}) - \psi_1 H(X_i | R_{\partial i}, X_{\partial i}) - \psi_2 H(X_i | C_{\partial i}, X_{\partial i}) - \psi_3 H(X_i | R_{\partial i}, C_{\partial i}, X_{\partial i}). \quad (12)$$
This means that we envisage each output unit working to maximise $F_i$, and because mutually distinct sets of weights connect into each of the outputs this is equivalent to maximising the sum of the $F_i$s. We view this sum as a locally-based approximation to the global objective function $F$. In the extreme case where the outputs are conditionally independent this sum is equal to $F$. Obviously the approximation will be better the smaller the sizes of the output neighbourhood sets are relative to the number of outputs.
neighbourhood sets relative to the number of outputs. We now provide formulae for
the local entropic terms and the components of local information for the ith output
unit.
$$H(X_i | R_{\partial i}, C_{\partial i}, X_{\partial i}) = -\left\langle \theta_i \log \theta_i + (1 - \theta_i) \log(1 - \theta_i) \right\rangle_{R_{\partial i}, C_{\partial i}, X_{\partial i}} \quad (13)$$
$$H(X_i | R_{\partial i}, X_{\partial i}) = -\left\langle E^{(i)}_{R_{\partial i}, X_{\partial i}} \log E^{(i)}_{R_{\partial i}, X_{\partial i}} + (1 - E^{(i)}_{R_{\partial i}, X_{\partial i}}) \log(1 - E^{(i)}_{R_{\partial i}, X_{\partial i}}) \right\rangle_{R_{\partial i}, X_{\partial i}} \quad (14)$$
$$H(X_i | C_{\partial i}, X_{\partial i}) = -\left\langle E^{(i)}_{C_{\partial i}, X_{\partial i}} \log E^{(i)}_{C_{\partial i}, X_{\partial i}} + (1 - E^{(i)}_{C_{\partial i}, X_{\partial i}}) \log(1 - E^{(i)}_{C_{\partial i}, X_{\partial i}}) \right\rangle_{C_{\partial i}, X_{\partial i}} \quad (15)$$
$$H(X_i | X_{\partial i}) = -\left\langle E^{(i)}_{X_{\partial i}} \log E^{(i)}_{X_{\partial i}} + (1 - E^{(i)}_{X_{\partial i}}) \log(1 - E^{(i)}_{X_{\partial i}}) \right\rangle_{X_{\partial i}} \quad (16)$$
It follows that the components of local information at the $i$th output unit are as follows:
$$I(X_i; R_{\partial i}; C_{\partial i} | X_{\partial i}) = (16) - (15) - (14) + (13),$$
$$I(X_i; R_{\partial i} | C_{\partial i}, X_{\partial i}) = (15) - (13),$$
$$I(X_i; C_{\partial i} | R_{\partial i}, X_{\partial i}) = (14) - (13),$$
$$H(X_i | R_{\partial i}, C_{\partial i}, X_{\partial i}) = (13).$$
The $\{E\}$ terms are averages of the output probabilities at the $i$th unit and are defined at the end of section 4.
4 The Learning Rules
Using the locally-based objective functions developed in section 3 and the formulation so far, it turns out that the learning rules for all weights have the same general structure as those introduced in [7].
We describe the gradient ascent learning rules in relation to the ith output unit.
$$\frac{\partial F_i}{\partial w_{is}} = \left\langle (\psi_3 A_i - \bar{\theta}_i)\, \theta_i (1 - \theta_i)\, \frac{\partial A_i}{\partial s_i(r)}\, R_s \right\rangle_{R_{\partial i}, C_{\partial i}, X_{\partial i}} \quad (17)$$
for each RF input s which connects into the ith output unit.
$$\frac{\partial F_i}{\partial v_{is}} = \left\langle (\psi_3 A_i - \bar{\theta}_i)\, \theta_i (1 - \theta_i)\, \frac{\partial A_i}{\partial s_i(c)}\, C_s \right\rangle_{R_{\partial i}, C_{\partial i}, X_{\partial i}} \quad (18)$$
for each CF input s which connects into the ith output.
$$\frac{\partial F_i}{\partial u_{is}} = \left\langle (\psi_3 A_i - \bar{\theta}_i)\, \theta_i (1 - \theta_i)\, \frac{\partial A_i}{\partial s_i(x)}\, X_s \right\rangle_{R_{\partial i}, C_{\partial i}, X_{\partial i}} \quad (19)$$
for each output unit s which connects into the ith output. Note that these learning
rules are local and this results from the decision to separately maximise the local
objective functions $\{F_i\}$ (or, equivalently, to maximise the sum of the $\{F_i\}$). The
dynamic average for the ith output unit is
$$\bar{\theta}_i = \log \frac{E^{(i)}_{X_{\partial i}}}{1 - E^{(i)}_{X_{\partial i}}} - \psi_1 \log \frac{E^{(i)}_{R_{\partial i}, X_{\partial i}}}{1 - E^{(i)}_{R_{\partial i}, X_{\partial i}}} - \psi_2 \log \frac{E^{(i)}_{C_{\partial i}, X_{\partial i}}}{1 - E^{(i)}_{C_{\partial i}, X_{\partial i}}}. \quad (20)$$
Here the dynamic averages are more complicated than in the single output case, and their calculation involves storing the average probability at the $i$th output unit for each pattern of the other outputs that connect into the $i$th output unit, for the combination of each of the neighbouring output and RF input patterns, and for the combination of each of the neighbouring output and CF input patterns. However, various approximations are possible [5]. The various averages of the probability at the $i$th output unit are defined as follows: $E^{(i)}_{X_{\partial i}} = \langle \theta_i \rangle_{R_{\partial i}, C_{\partial i} | X_{\partial i}}$, $E^{(i)}_{R_{\partial i}, X_{\partial i}} = \langle \theta_i \rangle_{C_{\partial i} | R_{\partial i}, X_{\partial i}}$, and $E^{(i)}_{C_{\partial i}, X_{\partial i}} = \langle \theta_i \rangle_{R_{\partial i} | C_{\partial i}, X_{\partial i}}$; the required partial derivatives may be easily calculated.
5 Some Computational Details
The computation may be performed using on-line learning as was the case in [7] or,
alternatively, using batch learning. In the case of on-line learning the weight change
rules are applied with the averaging brackets removed and the required conditional
averages of output probabilities may be updated dynamically during learning after
the presentation of each pattern via recursive computation. In particular recursive
computation may be used to avoid the need to explicitly calculate the statistics of
the input data and then employ a two-stage approach to the learning. The method-
ology has been applied in a number of computational experiments described in [6]
which demonstrate the feasibility of the approach. It is shown there that the differ-
ences between the Infomax and Coherent Infomax computational goals described
by [7] hold in this more general set up and that the multiple output unit is capable
of representing its inputs by means of population codes. Further experimentation
is currently in progress and this will address the computational feasibility of scal-
ing up to large multiple output units and evaluate various approximations for the
conditional dynamic averages.
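As an illustration of the on-line regime, the following sketch performs one weight update under equation (17) with the averaging brackets removed, together with a recursive update of a conditional average; the learning rate `eta` and averaging constant `rho` are illustrative assumptions:

```python
import numpy as np

def online_rf_update(w, R, theta_i, A_i, dA_dsr, theta_bar, psi3, eta=0.05):
    # equation (17) without the averaging brackets (on-line form):
    # gradient ascent on F_i for all RF weights of unit i at once
    grad = (psi3 * A_i - theta_bar) * theta_i * (1.0 - theta_i) * dA_dsr * R
    return w + eta * grad

def recursive_average(E, theta_i, rho=0.01):
    # running estimate of a conditional average of theta_i, updated after
    # each pattern presentation, avoiding a two-stage pass over the data
    return (1.0 - rho) * E + rho * theta_i
```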
REFERENCES
[1] M. W. Hirsch, Activation dynamics in continuous time nets, Neural Networks, Vol. 2 (1989), pp331-349.
[2] Becker S and Hinton G E, Self-organizing neural network that discovers surfaces in random-dot stereograms, Nature, Vol. 355 (1992), pp161-163.
[3] Besag J E, Spatial interaction and the statistical analysis of lattice systems (with discussion), J. R. Statist. Soc. Ser. B, Vol. 36 (1974), pp192-236.
[4] Hamming R W, Coding and Information Theory, Englewood Cliffs, NJ: Prentice-Hall (1980).
[5] Kay J W, Information-theoretic neural networks for the contextual guidance of learning and processing: mathematical and statistical considerations, Technical Report, Biomathematics and Statistics Scotland (1994). (See http://www.stats.gla.ac.uk/limf)
[6] Kay J W, Floreano D and Phillips W A, Contextually guided unsupervised learning using multivariate binary local processors, submitted to Neural Networks (1996).
[7] Kay J W and Phillips W A, Activation functions, computational goals and learning rules for local processors with contextual guidance, Neural Computation, in press (1997).
[8] Linsker R, Self-organization in a perceptual network, IEEE Computer, March (1988), pp105-117.
[9] Phillips W A, Kay J W and Smyth D, The discovery of structure by multi-stream networks of local processors with contextual guidance, Network, Vol. 6 (1995), pp225-246.
CONVERGENCE IN NOISY TRAINING
Petri Koistinen
Rolf Nevanlinna Institute, P. O. Box 4, FIN-00014 University of Helsinki,
Finland. Email: [email protected]
A minimization estimator minimizes the empirical risk associated with a given sample. Sometimes
one calculates such an estimator based not on the original sample but on a pseudo sample obtained
by adding noise to the original sample points. This may improve the generalization performance of
the estimator. We consider the convergence properties (consistency and asymptotic distribution)
of such an estimation procedure. Subject classification: AMS(MOS)62F12
1 Introduction
In backpropagation training one usually tries to minimize the empirical risk
$$r_n(t) := \frac{1}{n} \sum_{i=1}^{n} \| y_i - g(x_i, t) \|^2 = \min!,$$
where $x \mapsto g(x, t)$ denotes the input/output mapping of a multilayer perceptron whose weights comprise the vector $t$. When the training sequence $\{Z_i\}$, $Z_i = (X_i, Y_i)$, consists of i.i.d. (independent and identically distributed) random vectors, it is known that under general conditions the parameter $T_n$ minimizing the empirical risk is strongly consistent for the minimizer set of the population risk $r(t) = E\|Y_1 - g(X_1, t)\|^2$. Further, if the sequence $\{T_n\}$ converges towards an isolated minimizer $t^*$ of $r$, then under reasonable conditions, $\sqrt{n}(T_n - t^*)$ converges in distribution to a normal distribution with mean zero; see White [8].
Here we consider what happens in the above procedure when instead of the original
data one uses data generated by adding noise to the original sample points. Such a
practice has been suggested by several authors, see [4] for references. The relation-
ship of this procedure to regularization has been investigated in many recent papers,
see [6, 3, 1, 7]. In the present paper we outline both consistency and asymptotic
distribution results for noisy training. The results are based on the doctoral thesis
of the author [5], where the interested reader can find rigorous proofs and results
which are more general than what can be covered here. The consistency results ob-
tained in the thesis are much stronger than the previous results of Holmstrom and
Koistinen [4]. Asymptotic distribution results in our setting have not been available
previously.
2 The Statistical Setting
The original sample is part of a sequence $Z_1, Z_2, \ldots$ of random vectors taking values in $\mathbb{R}^k$. The noisy sample is generated by replacing each original sample point $Z_1, \ldots, Z_n$ with $s \ge 1$ pseudo sample points
$$Z^s_{ij} = Z_i + h_n N_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, s. \quad (1)$$
Here the $h_n$'s are positive random variables called the smoothing parameters. We assume that the noise vectors $N_{ij}$ are i.i.d. and independent of the $Z_i$'s and the $h_n$'s. The smoothing parameters are allowed to depend on the $\{Z_i\}$-sequence. We need to let $h_n$ converge to zero to ensure consistency, and therefore the pseudo sample points are not i.i.d. This prevents us from using standard convergence results for minimization estimators to analyze the convergence properties of noisy training.
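A minimal sketch of the pseudo-sample construction (1); the sample sizes, the distributions and the decay rate of $h_n$ are illustrative assumptions (the rate $n^{-0.3}$ also decays faster than $n^{-1/4}$, as required in section 4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, k = 100, 5, 2
Z = rng.standard_normal((n, k))        # original sample Z_1, ..., Z_n
h_n = float(n) ** -0.3                 # smoothing parameter, h_n -> 0
N = rng.standard_normal((n, s, k))     # i.i.d. noise vectors, mean zero
Z_pseudo = (Z[:, None, :] + h_n * N).reshape(n * s, k)   # equation (1)
```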
We next define empirical measures associated with the original and the pseudo observations. Let $\delta_x$ denote the probability measure with mass one at $x \in \mathbb{R}^k$, and define the probability measure
$$\mu_n := n^{-1} \sum_{i=1}^{n} \delta_{Z_i},$$
i.e., $\mu_n$ places mass $1/n$ at each of the observations $Z_1, \ldots, Z_n$. We call $\mu_n$ the empirical measure of the first $n$ observations. Similarly, we define the empirical measure $\mu^s_n$ of the pseudo observations $Z^s_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, s$, as the probability measure which places mass $1/(ns)$ at each of these $ns$ points. The measures $\mu_n$ and $\mu^s_n$ are examples of what are called random probability measures; the randomness arises from the fact that the measures depend on the observed values of random vectors.
For the consistency results, we need to assume that the empirical measures $\mu_n$ converge weakly, almost surely, towards some probability measure $\mu$ in $\mathbb{R}^k$; in symbols, $\mu_n \xrightarrow{w^*} \mu$ a.s. The definition of this mode of convergence for any sequence $\{\Lambda_n\}$ of random probability measures in $\mathbb{R}^k$ is as follows:
$$\Lambda_n \xrightarrow{w^*} \lambda \ \text{(a.s.)} \quad \Longleftrightarrow \quad \int f \, d\Lambda_n \to \int f \, d\lambda \ \text{almost surely, for each } f \in C_b, \quad (2)$$
where $C_b$ is the set of bounded continuous functions $\mathbb{R}^k \to \mathbb{R}$. An argument due to Varadarajan [2, Th 11.4.1] then implies that, for measures in $\mathbb{R}^k$, the empirical measures of an i.i.d. sequence with common distribution $\mu$ satisfy $\mu_n \xrightarrow{w^*} \mu$ a.s. Suppose now that we want to find a parameter $t \in T$ for which the risk
$$r(t) := \int \ell(z, t) \, \mu(dz), \quad t \in T \quad (3)$$
is minimal. Here the measure $\mu$ is supposed to be unknown, so we cannot solve
the problem directly. Instead we may try to minimize either the empirical risk
associated with the original observations
$$r_n(t) := \frac{1}{n} \sum_{i=1}^{n} \ell(Z_i, t) = \int \ell(z, t) \, \mu_n(dz) \quad (4)$$
or the empirical risk associated with the noisy observations
$$r^s_n(t) := \frac{1}{ns} \sum_{i=1}^{n} \sum_{j=1}^{s} \ell(Z^s_{ij}, t) = \int \ell(z, t) \, \mu^s_n(dz). \quad (5)$$
3 Consistency
If $A \subset T$, define the distance of $t \in T$ from $A$ by $d(t, A) := \inf\{\|t - y\| : y \in A\}$, with the convention that $d(t, \emptyset) = \infty$. We write $\operatorname{argmin}_T r$ for the (possibly empty) set of points that minimize $r$ on $T$. We seek conditions guaranteeing for our estimators $\theta_n$ that
$$d(\theta_n, \operatorname{argmin}_T r) \xrightarrow{a.s.} 0.$$
If this holds, we say that $\theta_n$ is strongly consistent for the set $\operatorname{argmin}_T r$.
The following result is proved in [5].
Theorem 1 Let $s \ge 1$. If $h_n \xrightarrow{a.s.} 0$ and, for some probability measure $\mu$, $\mu_n \xrightarrow{w^*} \mu$ a.s., then also $\mu^s_n \xrightarrow{w^*} \mu$ a.s., as $n \to \infty$.
This motivates the following approach. Let $\{\Lambda_n\}$ be a sequence of random probability measures and $\mu$ a nonrandom probability measure such that $\Lambda_n \xrightarrow{w^*} \mu$ a.s., and let
$$R_n(t) := \int \ell(z, t) \, \Lambda_n(dz). \quad (6)$$
Suppose that $\theta_n$ is a random vector with values in $T$ such that
$$R_n(\theta_n) = \inf_T R_n + o(1) \quad \text{(a.s.)}. \quad (7)$$
Under certain conditions, $\theta_n$ is then strongly consistent for the set $\operatorname{argmin}_T r$, as is formulated in the following theorem. Hence we obtain a consistency theorem for any estimator $T_n$ coming close enough to minimizing $r_n$. Provided $h_n \xrightarrow{a.s.} 0$, we also obtain a consistency theorem for any estimator $T^s_n$ coming close enough to minimizing $r^s_n$.
Theorem 2 Let the parameter set $T$ be compact and let $\Lambda_n \xrightarrow{w^*} \mu$ a.s. Suppose $\ell$ is continuous on $\mathbb{R}^k \times T$ and dominated by a continuous, $\mu$-integrable function, i.e.,
$$|\ell(z, t)| \le L(z), \quad z \in \mathbb{R}^k, \ t \in T,$$
where $L \ge 0$ is continuous and $\int L \, d\mu < \infty$. If, in addition, $\int L \, d\Lambda_n \xrightarrow{a.s.} \int L \, d\mu$, then any random vector satisfying (7) is strongly consistent for the set $\operatorname{argmin}_T r$.
In practice, the most useful dominating functions are the scalar multiples of the powers $|z|^p$, $p \ge 1$. If $\int |z|^p \, \mu_n(dz) \xrightarrow{a.s.} \int |z|^p \, \mu(dz) < \infty$ and $E|N|^p < \infty$, then it can be shown that also $\int |z|^p \, \mu^s_n(dz) \xrightarrow{a.s.} \int |z|^p \, \mu(dz)$. This facilitates checking the conditions of the previous theorem.
4 Asymptotic Distribution
A consistency result does not tell how quickly the estimator converges. One way to characterize this rate is to give an asymptotic distribution for the estimator. Our asymptotic distribution results are of the form $\sqrt{n}(\theta_n - t^*) \xrightarrow{d} N(0, C)$, i.e., they state that $\sqrt{n}(\theta_n - t^*)$ converges in distribution to a normal law with mean zero and covariance matrix $C$. Here $t^*$ denotes a minimizer of $r$. Such a result says that the law of $\theta_n$ collapses towards a point mass at $t^*$ at a very specific rate, e.g., in the sense that for any $\epsilon > 0$, $n^{1/2 - \epsilon}|\theta_n - t^*|$ converges to zero in probability and $n^{1/2 + \epsilon}|\theta_n - t^*|$ converges to infinity in probability.
Henceforth we assume that the original observations $Z_1, Z_2, \ldots$ are i.i.d. and that the noise vectors have mean zero. The effect of noisy training in linear regression is relatively straightforward to analyze. This is the case where $z$ is the pair $(x, y)$, $x \in \mathbb{R}^{k-1}$, $y \in \mathbb{R}$, and $\ell(z, t) = (x^T t - y)^2$. Denote the minimizer of $r_n$ by $T_n$ and the minimizer of $r^s_n$ by $T^s_n$. If $h_n = o_P(n^{-1/4})$, i.e., if $n^{1/4} h_n$ converges in probability to zero, then it turns out that $T^s_n$ and $T_n$ have the same asymptotic distribution. If $h_n$ converges to zero at some slower rate, then the situation is more complicated; when it can be obtained, the asymptotic distribution of $T^s_n$ then typically depends on the sequence $\{h_n\}$.
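A quick numerical check of this claim in linear regression (a sketch; the data-generating model and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, t_true = 2000, 10, np.array([1.5, -0.7])
X = rng.standard_normal((n, 2))
y = X @ t_true + 0.1 * rng.standard_normal(n)

T_n = np.linalg.lstsq(X, y, rcond=None)[0]     # minimizer of r_n

h_n = float(n) ** -0.3                         # decays faster than n^{-1/4}
Xs = np.repeat(X, s, axis=0) + h_n * rng.standard_normal((n * s, 2))
ys = np.repeat(y, s) + h_n * rng.standard_normal(n * s)
T_ns = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # minimizer of r_n^s

# the two estimates agree to within the O(n^{-1/2}) sampling error
print(T_n, T_ns)
```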
The same kind of results hold also more generally. Let now $t^* \in T$ satisfy $\nabla r(t^*) = 0$, i.e., $t^*$ is a stationary point of the risk $r$. We assume that $t^*$ is an isolated stationary point in the interior of $T$. Further, we assume that $\ell$ is a $C^3$-function on $\mathbb{R}^k \times \mathbb{R}^m$, and that $Z_1$ has a compactly supported law. Then the matrices
$$A := E[\nabla_t^2\, \ell(Z_1, t^*)], \qquad B := \operatorname{Cov}[\nabla_t\, \ell(Z_1, t^*)]$$
are well defined. Here $\nabla_t$ and $\nabla_t^2$ denote the gradient and Hessian operators with respect to $t$, respectively. We assume that $A$ is invertible and that $B$ is nonzero. Further, we assume that the noise vectors have a compactly supported law and that the smoothing parameters satisfy $0 \le h_n \le M$ for some constant $M$.
Under these assumptions, let $h_n = o_P(n^{-1/4})$. Let $\{T_n\}$ be a sequence of $T$-valued random vectors such that
$$T_n \xrightarrow{P} t^*, \quad \text{and} \quad \int \nabla_t\, \ell(z, T_n) \, \mu_n(dz) = o_P(n^{-1/2}),$$
and let $\{T^s_n\}$ be a sequence of $T$-valued random vectors such that
$$T^s_n \xrightarrow{P} t^*, \quad \text{and} \quad \int \nabla_t\, \ell(z, T^s_n) \, \mu^s_n(dz) = o_P(n^{-1/2}).$$
Then $\sqrt{n}(T_n - t^*)$ and $\sqrt{n}(T^s_n - t^*)$ converge in distribution to the same normal law $N(0, A^{-1} B A^{-1})$.
5 Conclusions
We have outlined new results for the convergence properties of minimization estimators in noisy training. The main conditions in the consistency result are that the empirical measures associated with the original sample converge weakly, almost surely, towards some measure $\mu$, and that the smoothing parameters $h_n \to 0$, almost surely.
The main conditions for the asymptotic distribution result are the following: the original sample points are i.i.d., the noise vectors have zero mean, and $h_n = o_P(n^{-1/4})$.
Under certain additional conditions we then have that the asymptotic distributions of $T_n$ and $T^s_n$ are identical, where $T_n$ denotes the minimizer of the empirical risk associated with the original sample and $T^s_n$ the minimizer of the empirical risk associated with the noisy sample. This implies that the asymptotic distributions of $r(T_n)$ and $r(T^s_n)$ coincide, and hence additive noise can have only a higher-order effect on the performance of a minimization estimator as the sample size goes to infinity.
However, numerical evidence indicates that additive noise sometimes does improve the generalization performance of a minimization estimator, at least with small sample sizes. It remains to be seen whether this effect can be quantified by analyzing the distribution of $r(T^s_n)$ using more refined asymptotic methods.
REFERENCES
[1] C.M. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Compu-
tation, Vol. 7 (1995), pp108-116.
[2] R.M. Dudley, Real Analysis and Probability, Wadsworth & Brooks/Cole, Pacific Grove, CA,
(1989).
[3] Y. Grandvalet and S. Canu, Comments on "noise injection into inputs in back propagation
learning", IEEE Trans. Systems, Man, and Cybernetics, Vol. 25 (1995), pp678-681.
[4] L. Holmstrom and P. Koistinen, Using additive noise in back-propagation training, IEEE
Trans. Neural Networks, Vol. 3 (1992), pp24-38.
[5] P. Koistinen, Convergence of minimization estimators trained under additive noise, Research
reports A 12, Rolf Nevanlinna Institute, Helsinki University of Technology (1995), (Doctoral
thesis).
[6] K. Matsuoka, Noise injection into inputs in back-propagation learning, IEEE Trans. Systems,
Man, and Cybernetics, Vol. 22 (1992), pp436-440.
[7] R. Reed, R.J. Marks II, and S. Oh, Similarities of error regularization, sigmoid gain scaling,
target smoothing, and training with jitter, IEEE Trans. Neural Networks, Vol. 6 (1995), pp529-
538.
[8] H. White, Learning in artificial neural networks: A statistical perspective, Neural Computa-
tion, Vol. 1 (1989), pp425-464.
NON-LINEAR LEARNING DYNAMICS WITH A
DIFFUSING MESSENGER
Bart Krekelberg and John G. Taylor
Centre for Neural Networks, King's College London,
London WC2R 2LS, UK. Email: [email protected]
The diffusing messenger Nitric Oxide plays an important role in the learning processes in the
brain. This diffusive learning mechanism adds a non-linear and non-local effect to the standard
Hebbian learning rule. A diffusive learning rule can lead to topographic map formation but also has
a strong tendency to homogenise the synaptic strengths. We derive a non-linear integral equation
that describes the fixed point and show which parameter regimes lead to non-trivial solutions.
Keywords: Nitric Oxide, self-organization, topographic maps
Introduction
Most learning rules used in neurobiological modelling are based on Hebb's postulate
that a synaptic connection is strengthened if and only if there is both a post-synaptic
and pre-synaptic depolarization at that particular synapse. Recent physiological
research, however, shows that this assumption is not always warranted. Experiments
in areas varying from rat hippocampus to cats' visual cortex show that there is a
non-local effect in learning.
This non-local learning is often thought to be mediated by retrograde messengers.
Nitric Oxide (NO) is a candidate for such a messenger which is produced post-
synaptically (both in the dendrites and the soma [1]), then diffuses through the
intracellular space and is taken up pre-synaptically. The chemical properties of
NO allow it to diffuse over relatively long distances in the cortex; with a diffusion constant of $3.3 \times 10^{-5}\ \mathrm{cm}^2/\mathrm{s}$ and a half-life in the intracellular fluid of 4-6 seconds, it has an effective range of at least 150 $\mu$m. If NO were an ordinary neurotransmitter
it would be hard to understand how the specificity of neuronal connections in the
brain on a scale smaller than this could be achieved. An explanation has been found
in experimental setups with locally modifiable concentration of NO: physiologists
have been able to demonstrate that low levels of NO lead to depression of synaptic
strengths (LTD) and high levels lead to Long Term Potentiation (LTP) (For recent
reviews see [3, 8]). This non-linearity in the dependence of the synaptic change on
the NO concentration is crucial to attain specificity in a network with a diffusing
messenger.
Detailed simulations of such a non-linear and non-local learning rule have been performed in [2, 7]. In these simulations pattern formation in the weights was seen to occur but, as the input patterns and the initial (partly topographic) weight distributions were already organized, it is difficult to tell how general these results are. In [5] we analysed networks with a diffusing messenger in a linearised reaction-diffusion framework and found that homogeneous weights were the dominant solutions of the dynamics. Following up a suggestion in [4] that Nitric Oxide could underlie a mechanism similar to the neighbourhood function in the SOFM, we showed in [6] that a diffusing messenger can indeed support the development of topographic maps without the need for a Mexican Hat lateral interaction.
Here we extend our previous analysis to arrive at a fuller account of the non-linear
effects. We derive a general fixed point equation for the weights and determine
which parameter regimes admit non-zero homogeneous solutions.
1 The Model
We consider a neural field consisting of leaky integrator neurons with activity $u$, all sampling their input with a synaptic density function $r$ from a field of synapses with strengths $s$. Note that this implies that synapses do not "belong" to a neuron; a neuron just takes its input from the synapses within its reach (determined by the function $r$). Inputs $a(x, x_0)$ are Gaussian shaped with the maximum at $x_0$ and spread $\sigma$. The inputs are presented with equal probability at each position $x_0$. Nitric Oxide is produced in the dendrites at a rate proportional to the local depolarization, decays at unit rate and has diffusion constant $\kappa$. Learning depends non-linearly on the local NO level through the function $f$, which captures the qualitative effects described above (see also equation (2)). Given an input centred at $x_0$, the dynamical equations can be written as:
$$\dot{u}(x, x_0) = -u(x) + \int dx'\, r(x, x')\, s(x')\, a(x', x_0),$$
$$\dot{n}(x, x_0) = -n(x) + s(x)\, a(x, x_0) + \kappa \nabla^2 n(x, x_0),$$
$$\dot{s}(x) = -s(x) + f[n(x, x_0)]\, a(x, x_0).$$
We assume (as in [7]) that all learning is dependent on the local NO level and that there is no direct dependence on the post-synaptic depolarization (see Discussion). The local NO level that determines the change in synaptic strength depends on the details of the post-synaptic production of NO and the pre-synaptic measurement process. As the details of this process are as yet unknown, we use a Gaussian Green's function $G(x, x')$ to model the spatial distribution of NO. This approximation is exact if the NO is produced in a short burst and measured some fixed period of time later. The real process of production and measurement will presumably be more complicated, but we expect that at least the qualitative features are well described by a Gaussian kernel. With these approximations, and averaging over all patterns, the fixed point equation for the learning dynamics can be written in the form of the following non-linear Fredholm integral equation:
$$s(x) = \int dx_0\, a(x, x_0)\, f\!\left[\int dx'\, G(x, x')\, s(x')\, a(x', x_0)\right]. \quad (1)$$
In section 2 we substitute the simplest non-linear function that still captures the qualitative features observed in experiments:
$$f(n) = -n(n - D)(n - P) \quad (2)$$
with $D$ and $P$ two parameters that can be determined directly from experiments. The unphysical values of $f$ for negative NO concentration merely stabilise the trivial fixed point, and the decrease of $f$ after its positive maximum can either be interpreted as toxicity of a high concentration of NO or as an easy way of modelling a saturation of the NO production. The important features of the function are its negative values for small NO concentration (LTD) and positive values (LTP) for high NO concentration.
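A one-dimensional sketch of these dynamics (Python/NumPy): the periodic grid, time step, rates and the values of $D$ and $P$ are illustrative assumptions, and only the fields $n$ and $s$ are integrated since the learning rule depends on them alone:

```python
import numpy as np

# 1-D field of synapses on a periodic grid
M, dx, dt = 128, 1.0, 0.05
kappa, D, P, sigma_in = 2.0, 0.2, 0.8, 5.0
x = np.arange(M) * dx
s = np.full(M, 0.5)                     # synaptic strengths
n = np.zeros(M)                         # NO concentration

def f(n):
    # equation (2): depression (LTD) at low NO, potentiation (LTP) at high NO
    return -n * (n - D) * (n - P)

rng = np.random.default_rng(0)
for _ in range(5000):
    x0 = rng.uniform(0.0, M * dx)                        # input centre
    a = np.exp(-((x - x0) ** 2) / (2 * sigma_in ** 2))   # Gaussian input
    lap = (np.roll(n, 1) + np.roll(n, -1) - 2 * n) / dx ** 2
    n += dt * (-n + s * a + kappa * lap)    # production, decay, diffusion
    s += dt * (-s + f(n) * a)               # non-local learning rule
```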
2 Analysis
In appendix 4 we use the assumptions about the non-linearity, the spread of NO ($\rho$) and the width ($\sigma$) of the Gaussian inputs in equation (1) to derive the following expression in Fourier space for the fixed point of the dynamics:
$$\tilde{s}(k) = -DP\, N_1\, \tilde{s}(k)\, e^{-\beta_1 k^2} + \cdots$$
[Figure: left- and right-hand sides (RHS) of the fixed point equation for wide, critical and narrow inputs; caption lost.]
model and the more standard Hebbian learning rules could operate concurrently. In
that case the interaction between the two learning rules needs to be investigated.
Furthermore, we assumed that all NO is produced in the dendrites. There is evidence, however, that the production can take place in the soma and even in the axons [1]. Our previous analyses [5, 6] considered production of NO in the soma, and showed behaviour different from that described here. A full model would include all the sources of NO and the interaction between them.
Lastly, we have modelled all time-dependent effects of the diffusion with the single parameter $\rho$. Interesting dynamics could result from including the time dependence explicitly, especially if the inputs are temporally correlated. We will address this issue in future work.
4 Appendix
Starting from equation (1) we substitute the Gaussian Green's function with spread $\rho$ and the inputs to derive the fixed point equation. First we derive the NO concentration at position $x$ when a stimulus is presented at $x_0$. The weights are written as an integral of Fourier components $\tilde{s}(k)$:
$$n(x, x_0) = \int dk'\, \tilde{s}(k') \int dx'\, e^{ik'x'}\, a_0\, e^{-\sigma(x' - x_0)^2}\, e^{-\rho(x - x')^2}.$$
For the averaged weights' fixed point equation we have to average over all the patterns:
$$s(x) = \int dx_0\, a_0\, e^{-\sigma(x - x_0)^2}\, f[n(x, x_0)].$$
For each of the terms in $f$, the integral over the distribution of patterns has to be done separately. Here we do the linear case; the quadratic and cubic terms follow by analogy but will include interactions between the different Fourier components. Substituting the non-linearity $f$ from equation (2), the integral over the pattern positions $x_0$ in the linear term ($s^{(1)}$) is another Fourier integral and gives:
$$-\frac{a_0 D P}{2} \sqrt{\frac{\pi}{\rho + \sigma}} \sqrt{\frac{\pi}{\sigma + \frac{\rho\sigma}{\rho + \sigma}}} \int dk'\, \tilde{s}(k')\, e^{-ik'x}\, e^{-k'^2 / 4(\rho + \sigma)}\, e^{-\sigma^2 k'^2 / ((\rho + \sigma)(\sigma^2 + 2\rho\sigma))}.$$
Acknowledgements
B. Krekelberg was supported by a grant from the European Human Capital and
Mobility Programme.
A VARIATIONAL APPROACH TO ASSOCIATIVE
MEMORY
Abderrahim Labbi
Dept. of Computer Science, University of Geneva,
24 Rue General Dufour, 1211 Geneva 4, Switzerland. Email: [email protected]
The purpose of this paper is to show how fundamental results from variational approximation
theory can be exploited to design recurrent associative memory networks. We begin by stating the
problem of learning a set of given patterns with an associative memory network as a hypersurface
construction problem, then we show that the associated approximation problem is well-posed.
Characterizing and determining such a solution will lead us to introduce the desired associative
memory network which can be viewed as a recurrent radial basis function (RBF) network which
has as many attractor states as there are fundamental patterns (no spurious memories).
Subject classification: AMS(MOS) 65F10, 65B05.
Keywords: Associative Memory, Variational Approximation, RBF Networks.
1 Introduction
Learning from a set of input-output patterns, and sometimes from additional a priori knowledge, in neural networks usually amounts to determining a set of parameters relative to a given network.
may be equivalent to determining a continuous mapping. Such a problem can be
stated in the framework of multivariate approximation theory which, in some cases,
is closely linked to regularization [11]. Concerning learning in recurrent associative
memory networks, one usually has to determine a set of parameters connected with
the dynamics of a given network so that the patterns (memories) to be stored are
stable states of such a network [1, 3, 7]. Complete characterization of the network
dynamics may be achieved by defining a global Lyapunov function (energy) which
decreases during the network evolution (the recalling process) and whose local min-
ima are the network's stable states [3, 6, 7].
The procedure followed here for the design of an associative memory network is the
reverse of the general approach: using methods from variational approximation,
we determine a function (or a hypersurface) which has as many local maxima as
the number of patterns to be memorized, and then define a dynamics on such a
hypersurface (actually, the gradient dynamics) which has its stable states close to
the patterns to be memorized.
We begin by proving that the problem is well posed from the approximation point
of view, and that the desired function can be explicitly computed when choosing
convenient constraints and parameters in the variational formulation of the prob-
lem. We finally show that the derived gradient dynamics can be implemented by an
associative memory network. Such a network can be viewed as a recurrent radial
basis function network which has as many stable states as there are fundamental
patterns to be stored. We also discuss how basins of attraction can be shaped so
that no spurious stable states can be encountered during the recalling process.
2 Statement of the Problem
Let us consider a set of patterns, $A = \{S_1, S_2, \ldots, S_m\} \subset \Omega \subset \mathbb{R}^n$. The first stage of the proposed approach is to construct a hypersurface defined by a bounded smooth function $F: \mathbb{R}^n \to \mathbb{R}$ whose unique local maxima are close to the patterns $S_i$, $i = 1, \ldots, m$. Therefore, we will reduce the problem to a functional minimization
under constraints which aim to translate the following assumptions: i) $F$ should interpolate the data $(S_i, y_i)$, $i = 1, \ldots, m$, where $y_i = F(S_i)$ are large positive real values, and $F$ should tend to zero away from the data; this constraint is reduced here to the minimization of the $L^2$ norm of $F$. ii) $F$ should not oscillate between the data; this constraint is reduced to surface tension minimization, which usually amounts to minimizing the $L^2$ norm of the gradient of $F$. iii) $F$ should be smooth enough; this constraint is usually imposed by minimizing the $L^2$ norms of smoothing differential operators ($D^p$), to be introduced in the following.
Let us define $C_y = \{f: \mathbb{R}^n \to \mathbb{R},\ f \in C^r(\mathbb{R}^n),\ r > 0,\ \text{and } f(S_i) = y_i\}$. Determining an approximation of $F$ regarding the given three constraints can be reduced to the minimization of a cost functional $J$ over $C_y$,
$$J(f) = \lambda_0 \|D^0 f\|^2 + \lambda_1 \|D^1 f\|^2 + \cdots + \lambda_p \|D^p f\|^2, \quad f \in C_y. \quad (1)$$
The operators $D^k$ are defined by $D^k f(X) = \sum_{i_1 + \cdots + i_n = k} \frac{\partial^k}{\partial x_1^{i_1} \cdots \partial x_n^{i_n}} f(X)$, $k \ge 0$.
To solve the minimization problem, we can either use standard methods from the calculus of variations by means of the Euler-Lagrange equation [4, 11], or use direct methods based on functional analysis. Herein, we use the latter method, which is closely related to the reproducing kernels method [5, 15].
First, let us consider the Hilbert space $H^p(\mathbb{R}^n) = \{f \in L^2(\mathbb{R}^n);\ \sum_{k=0}^{p} \|D^k f\|^2 < \infty\}$ endowed with the inner product $\langle f, g \rangle_{H^p} = \sum_{k=0}^{p} \langle f, g \rangle_k$, where $\langle f, g \rangle_k = \langle D^k f, D^k g \rangle$. When $p$ and $n$ satisfy the condition $p > \frac{n}{2} + r$, $r \ge 0$, then $H^p(\mathbb{R}^n)$ is a subset of $C^r(\mathbb{R}^n)$ [12]. We assume this condition is satisfied in the following.
To show that the problem is well-posed (i.e. it has a unique solution in $C_y$), we consider the following vector space: $H^p_\lambda(\mathbb{R}^n) = \{f \in L^2(\mathbb{R}^n),\ \sum_{k=0}^{p} \lambda_k \|D^k f\|^2 < \infty,\ \lambda_k > 0\}$. It can easily be shown [8, 12] that $H^p_\lambda(\mathbb{R}^n)$ endowed with the inner product $\langle f, g \rangle_{H^p_\lambda} = \sum_{k=0}^{p} \lambda_k \langle f, g \rangle_k$ is a Hilbert space whose norm is $\|f\|_{H^p_\lambda} = (\langle f, f \rangle_{H^p_\lambda})^{1/2}$. Moreover, $H^p$ and $H^p_\lambda$ are topologically equivalent since their norms are equivalent.
Now, observe that the norm of $H^p_\lambda(\mathbb{R}^n)$ is identical to the functional $J$ to be minimized, so the minimization problem can be rewritten as
$$\text{(P)} \quad \text{Minimize } J(f) = \|f\|^2_{H^p_\lambda}, \quad f \in C_y.$$
Theorem 1 Given strictly positive real parameters $\lambda_k$, $k = 1, \ldots, p$, the problem (P) has a unique solution $F \in H^p_\lambda(\mathbb{R}^n)$, which is characterized by
$$\langle F, u \rangle_{H^p_\lambda} = \sum_{i=1}^{m} \beta_i \langle \delta_i, u \rangle, \quad \forall u \in H^p_\lambda,$$
where the $\beta_i$ are real parameters and the $\langle \delta_i, \cdot \rangle$ are Dirac operators supported by $S_i$, $i = 1, \ldots, m$.
Proof The proof is based on the projection theorem for the existence and the characterization of the best approximation in a linear manifold of a Hilbert space [8, 9, 14]. In order to use such results, we first show that $C_y$ is a linear manifold of $H^p_\lambda$. This is achieved by considering the set $C_0$ defined as $C_0 = \{f \in H^p_\lambda,\ f(S_i) = 0,\ i = 1, \ldots, m\}$.
Theorem 2 Consider the functions (kernels) $G(\cdot, S_1), G(\cdot, S_2), \ldots, G(\cdot, S_m)$ as respective solutions of the partial differential equations (in the distributional sense)
$$P\, G(X, S_i) = \delta(X - S_i), \quad i = 1, \ldots, m, \ \forall X \in \mathbb{R}^n.$$
Then there exist unique parameters $\beta_1, \beta_2, \ldots, \beta_m$ such that the solution $F$ of the problem (P) is
$$F(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i),$$
where $(\beta_1, \ldots, \beta_m)$ is the solution of the linear system $F(S_j) = \sum_{i=1}^{m} \beta_i\, G(S_j, S_i) = y_j$, $j = 1, \ldots, m$.
Proof First, let us show that any function $f(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i)$, with arbitrary $\beta_i$'s, satisfies the characterization of theorem (1). If we consider any pair of functions $\phi$ and $\psi$ in $H^p_\lambda$, an integration by parts in the sense of distributions gives [13]
$$\langle \phi, \psi \rangle_k = (-1)^k \langle \Delta^k \phi, \psi \rangle.$$
This relation allows us to write, for any $u \in H^p_\lambda$,
$$\langle F, u \rangle_{H^p_\lambda} = \sum_{i=1}^{m} \beta_i \langle G(X, S_i), u \rangle_{H^p_\lambda} = \sum_{i=1}^{m} \beta_i \left\langle \sum_{k=0}^{p} (-1)^k \lambda_k \Delta^k G(X, S_i), u \right\rangle = \sum_{i=1}^{m} \beta_i \langle \delta_i, u \rangle.$$
So the function $F$ given in the theorem satisfies the characterization stated in theorem (1). Finally, to show that $F$ is the solution of the problem (P), we have to show that $F \in C_y$, which amounts to showing that there is a unique solution (in the $\beta_i$'s) to the linear system $F(S_j) = \sum_{i=1}^{m} \beta_i\, G(S_j, S_i) = y_j$, $j = 1, \ldots, m$. Since $P$ is a linear combination of iterated Laplacians, it is rotation and translation invariant [5, 13], and the kernels $G(\cdot, S_i)$ are radial, centred respectively on $S_i$, and can be written as $G(X, S_i) = G(\|X - S_i\|)$.
To show that the linear system has a unique solution, it suffices to show that the function $G(t)$ is positive definite [10]. Since $G$ verifies the equations
$$\lambda_0 G(\|X - S_i\|) - \lambda_1 \Delta G(\|X - S_i\|) + \cdots + (-1)^p \lambda_p \Delta^p G(\|X - S_i\|) = \delta(\|X - S_i\|),$$
it can be shown that $G(t)$ can be written as [11]
$$G(t) = \int_{\mathbb{R}} \frac{\exp(it\omega)}{\lambda_0 + \lambda_1 \omega^2 + \cdots + \lambda_p \omega^{2p}}\, d\omega = \int_{\mathbb{R}} \exp(it\omega)\, dV(\omega),$$
where $V$ is a bounded nondecreasing function. $G$ is then written in the form required by the Bochner theorem [2], which characterizes positive definite functions. Therefore, $F(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i)$ is the solution of (P). $\square$
Poggio & Girosi [11] considered a similar variational problem for learning continuous mappings under a priori assumptions. They derived similar kernels $G(t)$ using the Euler-Lagrange equation. Setting the $\lambda_k$ to particular values gives the illustrations of the solution $F$ in figure 1.
Figure 1 The solution $F$: (left) for $\lambda_0 = \sigma^2 > 0$, $\lambda_1 = 1$, and $\lambda_k = 0$, $k \ge 2$, we have $G_1(t) = \frac{1}{2\sigma} \exp(-\sigma|t|)$, which is not differentiable at the origin; (right) for $\lambda_0 = \sigma^4$, $\lambda_1 = 2\sigma^2$, $\lambda_2 = 1$, and $\lambda_k = 0$, $k \ge 3$, we have $G_2(t) = G_1(t) * G_1(t)$ (the convolution product), which is differentiable.
If we let $p$ tend to infinity in the problem (P) as in [11] (considering only functions with infinite smoothness degree), and choose $\lambda_k = \frac{\sigma^{2k}}{k!\, 2^k}$, then, following the same reasoning as in the previous sections, the kernel $G$ becomes a Gaussian and the solution $F$ is implemented by a radial basis function network with Gaussian units.
[Figure 2: network with HIDDEN and INPUT/OUTPUT layers; caption lost.]
REFERENCES
[1] Amari S., Statistical Neurodynamics of Associative Memory, Neural Networks, Vol. 1 (1988).
[2] Bochner S., Lectures on Fourier Integrals, Annals of Mathematical Studies, No. 42 (1959), Princeton Univ. Press.
[3] Cohen M.A., Grossberg S., Absolute Stability of Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks, IEEE Trans. on SMC, Vol. SMC-13 (1983).
[4] Courant R., Hilbert D., Methods of Mathematical Physics, Interscience, London (1962).
[5] Duchon J., Splines Minimizing Rotation-Invariant Semi-norms in Sobolev Spaces, in: Constructive Theory of Functions of Several Variables, W. Schempp & K. Zeller (eds), Lecture Notes in Mathematics, No. 571 (1977), Springer-Verlag, Berlin.
[6] Goles E., Lyapunov Functions Associated to Automata Networks, in: Automata Networks in Computer Science, F. Fogelman et al. (eds), Manchester Univ. Press (1987).
[7] Hopfield J.J., Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proc. of the Nat. Acad. of Sciences, USA, Vol. 79 (1982).
[8] Labbi A., On Approximation Theory and Dynamical Systems in Neural Networks, Ph.D. Dissertation in Applied Mathematics, INPG, Grenoble (1993).
[9] Laurent P.J., Approximation et Optimisation, Hermann, Paris (1973).
[10] Micchelli C.A., Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions, Constr. Approx., Vol. 2 (1986).
[11] Poggio T., Girosi F., Networks for Approximation and Learning, Proc. IEEE, Vol. 78 (1990).
[12] Rudin W., Functional Analysis, McGraw-Hill, New York, St Louis, San Francisco (1991).
[13] Schwartz L., Théorie des Distributions, Hermann, Paris (1966).
[14] Singer I., Best Approximation in Normed Linear Spaces by Elements of Linear Subspaces, Springer-Verlag, New York, Heidelberg, Berlin (1970).
[15] Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, Vol. 59 (1990), SIAM, Philadelphia.
TRANSFORMATION OF NONLINEAR
PROGRAMMING PROBLEMS INTO SEPARABLE
ONES USING MULTILAYER NEURAL NETWORKS
Bao-Liang Lu and Koji Ito*
The Institute of Physical and Chemical Research (RIKEN), 3-8-31 Rokuban,
Atsuta-ku, Nagoya 456, Japan. Email: [email protected]
* Interdisciplinary Graduate School of Science and Engineering,
Tokyo Institute of Technology, 4259, Nagatsuda, Midori-ku, Yokohama 226, Japan.
Email: [email protected]
In this paper we present a novel method for transforming nonseparable nonlinear programming (NLP) problems into separable ones using multilayer neural networks. This method is based on a useful feature of multilayer neural networks, i.e., any nonseparable function can be approximately expressed as a separable one by a multilayer neural network. By use of this method, the nonseparable objective and (or) constraint functions in NLP problems can be approximated by multilayer neural networks, and therefore any nonseparable NLP problem can be transformed into a separable one. The importance of this method lies in the fact that it provides us with a promising approach to using modified simplex methods to solve general NLP problems.
Keywords: separable nonlinear programming, linear programming, multilayer neural network.
1 Introduction
Consider the following NLP problem:
$$\text{Minimize } p(x) \ \text{ for } x \in \mathbb{R}^n \quad (1)$$
$$\text{subject to } g_i(x) \ge 0 \ \text{ for } i = 1, 2, \ldots, m,$$
$$h_j(x) = 0 \ \text{ for } j = 1, 2, \ldots, r,$$
where $p(x)$ is called the objective function, $g_i(x)$ is called an inequality constraint and $h_j(x)$ is called an equality constraint.
NLP problems are widespread in the mathematical modeling of engineering design problems such as VLSI chip design, mechanical design, and chemical design. Unfortunately, for general NLP problems, computer programs are not available for problems of very large size. For a class of NLP problems known as separable [4, 1], some variation of the simplex method, a well-developed and efficient method for solving linear programming (LP) problems, can be used as a solution procedure. A separable nonlinear programming (SNLP) problem is an NLP problem in which the objective function and the constraint functions can be expressed as sums of functions of a single variable. A SNLP problem can be expressed as follows:
$$\text{Minimize } \sum_{k=1}^{n} p_k(x_k) \quad (2)$$
$$\text{subject to } \sum_{k=1}^{n} g_{ik}(x_k) \ge 0 \ \text{ for } i = 1, 2, \ldots, m,$$
$$\sum_{k=1}^{n} h_{jk}(x_k) = 0 \ \text{ for } j = 1, 2, \ldots, r.$$
An important problem in mathematical programming is to generalize the simplex
method to solve NLP problems. In this paper we discuss how to use multilayer
neural networks to transform nonseparable NLP problems into SNLP problems.
2 Transformation of Nonseparable Functions
Let $q(z)$ be a multivariable function. Given a set of training data sampled over $q(z)$, we can train a three-layer network¹ to approximate $q(z)$ [2]. After training, the mapping $\hat{q}(z)$ formed by the network can be regarded as an approximation of $q(z)$ and expressed as follows:
$$\hat{q}(z) = \alpha + \beta \left( f\left( \sum_{j=1}^{N_2} w_{31j}\, f\left( \sum_{i=1}^{N_1} w_{2ji}\, x_i + bias_{2j} \right) + bias_{31} \right) - \gamma \right), \quad (3)$$
where $x_j$ is the input of the $j$th unit in the input layer, $w_{kji}$ is the weight connecting the $i$th unit in layer $(k-1)$ to the $j$th unit in layer $k$, $bias_{kj}$ is the bias of the $j$th unit in layer $k$, $f$ is the sigmoidal activation function, $N_k$ is the number of units in layer $k$, and $\alpha$, $\beta$ and $\gamma$ are three constants which are determined by the formula used for normalizing the training data.
Introducing auxiliary variables $b_{31}, b_{21}, b_{22}, \ldots, b_{2N_2}$ into Eq. (3), we can obtain the following simultaneous equations:
$$\hat{q}(b_{31}) = \alpha + \beta (f(b_{31}) - \gamma)$$
$$b_{31} - \sum_{j=1}^{N_2} w_{31j}\, f(b_{2j}) = bias_{31}$$
$$b_{21} - \sum_{i=1}^{N_1} w_{21i}\, x_i = bias_{21} \quad (4)$$
$$\vdots$$
$$b_{2N_2} - \sum_{i=1}^{N_1} w_{2N_2 i}\, x_i = bias_{2N_2},$$
where $b_{kj}$ for $k = 2, 3$, $j = 1, 2, \ldots, N_k$, and $x_i$ for $i = 1, 2, \ldots, N_1$, are variables.
We see that all of the functions in Eq. (4) are separable. The importance of Eq. (4) lies in the fact that it provides us with an approach to approximately expressing nonseparable functions as separable ones by multilayer neural networks, as the sketch below illustrates.
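A small numerical check of this equivalence (a sketch; the network weights are random placeholders rather than a trained approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 2, 4
W2, bias2 = rng.normal(size=(N2, N1)), rng.normal(size=N2)
w3, bias3 = rng.normal(size=N2), rng.normal()
alpha, beta, gamma = 0.0, 1.0, 0.0
f = lambda t: 1.0 / (1.0 + np.exp(-t))          # sigmoid

def q_hat(x):
    """Direct (nonseparable) evaluation, equation (3)."""
    return alpha + beta * (f(w3 @ f(W2 @ x + bias2) + bias3) - gamma)

def q_hat_separable(x):
    """Evaluation through the auxiliary variables of equation (4):
    each constraint couples one b-variable to sums of single-variable
    terms, so every function involved is separable."""
    b2 = W2 @ x + bias2          # b_2j - sum_i w_2ji x_i = bias_2j
    b31 = w3 @ f(b2) + bias3     # b_31 - sum_j w_31j f(b_2j) = bias_31
    return alpha + beta * (f(b31) - gamma)

x = rng.normal(size=N1)
assert np.isclose(q_hat(x), q_hat_separable(x))
```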
In comparison with conventional function approximation problems, the training task mentioned above is easier to deal with. The reasons are that (a) an arbitrarily large number of sample data for training and testing can be obtained from $q(z)$, and (b) the goal of training is to approximate the given function $q(z)$, so the performance of the trained network can be easily checked.
3 Transformation of Nonseparable NLP problems
According to the locations of nonseparable functions in NLP problems, nonsepa-
rable NLP problems can be classified into three types: (I) only the objective func-
tion is nonseparable function; (II) only the constraint functions are nonseparable
¹For simplicity of description, we consider only three-layer networks throughout the paper. The results can be extended to M-layer (M > 3) networks easily.
and (III) both the objective and constraint functions are nonseparable.
For Type I nonseparable NLP problems, we only need to transform the objective function into a separable one. Replacing the objective function with its approximation in Eq. (4), we can transform a Type I nonseparable NLP problem into a SNLP problem as follows:
$$\text{Minimize } \alpha + \beta\, (f(b^{o}_{31}) - \gamma) \quad (5)$$
$$\text{subject to } b^{o}_{31} - \sum_{j=1}^{N^{o}_2} w^{o}_{31j}\, f(b^{o}_{2j}) = bias^{o}_{31},$$
$$b^{o}_{21} - \sum_{i=1}^{N_1} w^{o}_{21i}\, x_i = bias^{o}_{21},$$
$$\vdots$$
$$b^{o}_{2N^{o}_2} - \sum_{i=1}^{N_1} w^{o}_{2N^{o}_2 i}\, x_i = bias^{o}_{2N^{o}_2},$$
$$\sum_{k=1}^{n} g_{ik}(x_k) \ge 0 \ \text{ for } i = 1, 2, \ldots, m,$$
$$\sum_{k=1}^{n} h_{jk}(x_k) = 0 \ \text{ for } j = 1, 2, \ldots, r,$$
where the superscript $o$ marks quantities of the network that approximates the objective function.
[Table 1 (body lost): The numbers of variables and constraints in the original and transformed problems.]
it may become more difficult for a smaller network (e.g., one with fewer hidden units) to approximate a nonseparable function. There exists a trade-off between the complexity of the transformed SNLP problems and the approximating capability of the neural networks.
5 Simulation Results
Consider the following simple NLP problem:
$$\text{Minimize } 2 - \sin^2 x_1 \sin^2 x_2 \quad (6)$$
$$\text{subject to } 0.5 \le x_1 \le 2.5,$$
$$0.5 \le x_2 \le 2.5.$$
Clearly, this is a Type I nonseparable NLP problem. A three-layer perceptron with two input, ten hidden, and one output units is used to approximate the objective function. The training data set consists of 524 input-output data which are gathered by sampling the input space $[0.5, 2.5] \times [0.5, 2.5]$ on a uniform grid. The network is trained by the back-propagation algorithm [6]. In this simulation, the learning is considered complete when the sum of squared errors between the target and actual outputs falls below 0.05. Replacing the objective function with its approximation formed by the network, we obtain a SNLP problem.
Approximating the sigmoidal activation function in the SNLP problem over the interval [-16, 16] via 14 grid points, we obtain an approximate SNLP problem. Solving this problem with the simplex method with the restricted basis entry rule [4], we obtain the solution x_1^* = 1.531488 and x_2^* = 1.595567. If the sigmoidal activation function is approximated over the interval [-16, 16] via 40 grid points, we obtain a better solution: x_1^* = 1.564015 and x_2^* = 1.569683. Solving the problem directly with the Powell method [5], we obtain a more accurate solution: x_1^* = 1.57079 and x_2^* = 1.57079. It should be noted that there exists a trade-off between the accuracy of the approximation of SNLP problems and the number of grid points for each variable.
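The grid-point trade-off can be quantified directly. A minimal sketch, assuming a logistic sigmoid and uniformly spaced grid points (the paper does not specify the grid placement), measures the worst-case error of the piecewise-linear approximation over [-16, 16] used above:

import numpy as np

f = lambda u: 1.0 / (1.0 + np.exp(-u))   # sigmoidal activation
u = np.linspace(-16.0, 16.0, 10001)      # dense evaluation points

for n_grid in (14, 40):
    knots = np.linspace(-16.0, 16.0, n_grid)   # assumed uniform grid
    err = np.max(np.abs(np.interp(u, knots, f(knots)) - f(u)))
    print(n_grid, "grid points: max piecewise-linear error", err)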
In general, several local solutions may exist in a SNLP problem, but only one solution is obtained by solving the approximate SNLP problem using the simplex method with the restricted basis entry rule. It has been shown that if the objective function is strictly convex and all the constraint functions are convex, the solution obtained by the modified simplex method can be made arbitrarily close to the global optimal solution of the original problem by choosing a sufficiently fine grid. Unfortunately, the SNLP problems produced by our method are non-convex, since the sigmoidal activation function is non-convex. In such cases, even though optimality of the solution cannot be guaranteed under the restricted basis entry rule, good solutions can be obtained [1].
6 Conclusion and Future Work
We have demonstrated how multilayer neural networks can be used to transform
nonseparable functions into separable ones. Applying this useful feature to nonlinear
programming, we have proposed a novel method for transforming nonseparable NLP
problems into separable ones. This result opens up a way for solving general NLP
problems by some variation of the simplex method, and makes connection between
multilayer neural networks and mathematical programming techniques. As future
work we will perform simulations on large-scale nonseparable NLP problems.
REFERENCES
[1] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, 2nd Edition, John Wiley & Sons, Inc. (1993).
[2] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, Vol. 2 (1989), pp359-366.
[3] J. Lee, A novel design method for multilayer feedforward neural networks, Neural Computation, Vol. 6 (1994), pp885-901.
[4] C. E. Miller, The Simplex Method for Local Separable Programming, in: Recent Advances in Mathematical Programming, R. L. Graves and P. Wolfe eds., McGraw-Hill (1963), pp89-100.
[5] M. J. D. Powell, A fast algorithm for nonlinearly constrained optimization calculations, in: Lecture Notes in Mathematics No. 630, G. A. Watson ed., Springer-Verlag, Berlin (1978).
[6] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning representations by back-propagating errors, Nature, Vol. 323 (1986), pp533-536.
[7] X. H. Yu, G. A. Chen and S. X. Cheng, Dynamic learning rate optimization of the backpropagation algorithm, IEEE Transactions on Neural Networks, Vol. 6 (1995), pp669-677.
[8] Z. Wang, C. D. Massimo, M. T. Tham, and A. J. Morris, A procedure for determining the topology of multilayer feedforward neural networks, Neural Networks, Vol. 7 (1994), pp291-300.
Acknowledgements
We would like to thank Steve W. Ellacott and the reviewer for valuable comments
and suggestions on the manuscript.
A THEORY OF SELF-ORGANISING NEURAL
NETWORKS
S P Luttrell
Defence Research Agency, Malvern, Worcs, WR14 3PS, UK.
Email: [email protected]
The purpose of this paper is to present a probabilistic theory of self-organising networks based
on the results published in [1]. This approach allows vector quantisers and topographic mappings
to be treated as different limiting cases of the same theoretical framework. The full theoretical
machinery allows a visual cortex-like network to be built.
1 Introduction
The purpose of this paper is to present a generalisation of the probabilistic approach
to the static analysis of self-organising neural networks that appeared in [1]. In the
simplest case the network has two layers: an input and an output layer. An input
vector is used to clamp the pattern of activity of the nodes in the input layer,
and the resulting pattern of individual "firing" events of the nodes in the output
layer is described probabilistically. Finally, an attempt is made to reconstruct the
pattern of activity in the input layer from knowledge of the location of the firing
events in the output layer. This inversion from output to input is achieved by
using Bayes' theorem to invert the probabilistic feed-forward mapping from input
to output. A network objective function is then introduced in order to optimise
the overall network performance. If the average Euclidean error between an input
vector and its corresponding reconstruction is used as the objective function, then
many standard self-organising networks emerge as special cases [1, 2].
In section 2 the network objective function is introduced, in section 3 a simpler
form is derived which is an upper bound to the true objective function, and in
section 4 the derivatives with respect to various parameters of this upper bound
are derived. Finally, in section 5 various standard neural networks are analysed
within this framework.
2 Objective Function
The basic mathematical object is the objective function D, which is defined as
D = \sum_{y_1, y_2, \ldots, y_n = 1}^{m} \int\!\!\int dx\, dx'\, \Pr(x)\, \Pr(y_1, y_2, \ldots, y_n \mid x)\, \Pr(x' \mid y_1, y_2, \ldots, y_n)\, \| x - x' \|^2    (1)
where x is the input vector and x' is its reconstruction, (y_1, y_2, ..., y_n) are the locations in the output layer of n firing events, and ||x - x'||² is the squared Euclidean distance between the input vector and its reconstruction. The various probabilities arise as follows: ∫ dx Pr(x) (...) integrates over the training set of input vectors, Pr(y_1, y_2, ..., y_n | x) is the joint probability of n firing events at locations (y_1, y_2, ..., y_n), Pr(x' | y_1, y_2, ..., y_n) is the Bayes' inverse probability that input vector x' is inferred as the cause of the n firing events, and Σ_{y_1, y_2, ..., y_n = 1}^{m} (...) sums over all possible locations (on an assumed rectangular lattice of size m) of the n firing events. The order in which the n firing events occur is assumed not to be observed, so that Pr(y_1, y_2, ..., y_n | x) is a symmetric function of (y_1, y_2, ..., y_n).
A simplifying assumption will be made whereby the observed firing events are statistically independent, so that

\Pr(y_1, y_2, \ldots, y_n \mid x) = \Pr(y_1 \mid x)\, \Pr(y_2 \mid x) \cdots \Pr(y_n \mid x)    (2)

Normally, a simple form for Pr(y|x) is used, such as
\Pr(y \mid x) = Q(x \mid y) \Big/ \sum_{y'=1}^{m} Q(x \mid y')    (3)

[Equations (4) and (5), which introduce lateral inhibition and probability leakage and are referred to below, are not recoverable here.]

D = 2 \int dx\, \Pr(x) \sum_{y_1, y_2, \ldots, y_n = 1}^{m} \Pr(y_1, y_2, \ldots, y_n \mid x)\, \| x - x'(y_1, y_2, \ldots, y_n) \|^2    (6)

D_2 = \frac{2(n-1)}{n} \int dx\, \Pr(x) \sum_{y_1, y_2 = 1}^{m} \Pr(y_1, y_2 \mid x)\, (x - x'(y_1)) \cdot (x - x'(y_2))

D_3 = 2 \sum_{y_1, y_2, \ldots, y_n = 1}^{m} \Pr(y_1, y_2, \ldots, y_n)\, \Big\| x'(y_1, y_2, \ldots, y_n) - \frac{1}{n} \sum_{i=1}^{n} x'(y_i) \Big\|^2
To obtain this result, Pr(y_1, y_2, ..., y_n | x) has been assumed to be symmetric under permutation of the locations y_1, y_2, ..., y_n (i.e. the locations, but not the order of occurrence, of the n firing events are known). If the independence assumption in (2) is now invoked, then D_2 may be simplified to

D_2 = \frac{2(n-1)}{n} \int dx\, \Pr(x)\, \Big\| x - \sum_{y=1}^{m} \Pr(y \mid x)\, x'(y) \Big\|^2
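The simplification of D_2 under the independence assumption is a purely algebraic identity, which the following sketch verifies numerically for one fixed input x (all probabilities and reference vectors are randomly generated, purely for illustration):

import numpy as np

rng = np.random.default_rng(5)
m, dim = 4, 3
P = rng.random(m); P /= P.sum()          # Pr(y|x) for one fixed input x
xp = rng.normal(size=(m, dim))           # reference vectors x'(y)
x = rng.normal(size=dim)

# Double sum over (y1, y2) with independent firing probabilities ...
lhs = sum(P[y1] * P[y2] * (x - xp[y1]) @ (x - xp[y2])
          for y1 in range(m) for y2 in range(m))
# ... equals the squared distance to the mean reconstruction.
rhs = np.sum((x - P @ xp) ** 2)
assert np.isclose(lhs, rhs)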
The dependence of D_3 on the n-argument reference vectors x'(y_1, y_2, ..., y_n) is inconvenient, because the total number of such reference vectors is O(|m|^n), where |m| is the total number of output nodes (|m| = m_1 m_2 ... m_d for a d-dimensional rectangular lattice of size m). However, the positivity of D_3, together with D = D_1 + D_2 - D_3, will be used to obtain an upper bound on D as D ≤ D_1 + D_2, which depends only on 1-argument reference vectors x'(y). The total number of 1-argument reference vectors is equal to |m|.
4 Differentiating the Objective Function
In order to implement an optimisation algorithm the derivatives of D1 and D2
with respect to the various parameters must be obtained. The expressions that are
encountered when differentiating are rather cumbersome, but they have a simple
structure which can be made clear by the introduction of the following notation
L_{y,y'} ≡ Pr(y'|y)                           P_{y,y'} ≡ Pr(y'|x; y)
p_y ≡ Σ_{y'∈N⁻¹(y)} P_{y',y}                  (Lᵀp)_y ≡ Σ_{y'∈C⁻¹(y)} L_{y',y} p_{y'}
d_y ≡ x - x'(y)                               (Ld)_y ≡ Σ_{y'∈C(y)} L_{y,y'} d_{y'}
(PLd)_y ≡ Σ_{y'∈N(y)} P_{y,y'} (Ld)_{y'}      (PᵀPLd)_y ≡ Σ_{y'∈N⁻¹(y)} P_{y',y} (PLd)_{y'}
e_y ≡ ||x - x'(y)||²                          (Le)_y ≡ Σ_{y'∈C(y)} L_{y,y'} e_{y'}
(PLe)_y ≡ Σ_{y'∈N(y)} P_{y,y'} (Le)_{y'}      (PᵀPLe)_y ≡ Σ_{y'∈N⁻¹(y)} P_{y',y} (PLe)_{y'}
d ≡ Σ_{y=1}^{m} p_y d_y   or   d ≡ Σ_{y=1}^{m} (PLd)_y                    (9)
which allows (5) to be written as Pr(y|x) → (Lᵀp)_y. D_1 and D_2 may be differentiated with respect to x'(y) to obtain

\frac{\partial D_1}{\partial x'(y)} = -\frac{4}{nM} \int dx\, \Pr(x)\, (L^{\mathrm T} p)_y\, d_y    (10)

\partial D_1 = \frac{2}{nM} \int dx\, \Pr(x) \sum_{y=1}^{m} \partial \log Q(x \mid y) \left( p_y (Le)_y - (P^{\mathrm T} P Le)_y \right)    (11)

\frac{\partial D_1}{\partial (w(y), b(y))} = \frac{2}{nM} \int dx\, \Pr(x) \left( p_y (Le)_y - (P^{\mathrm T} P Le)_y \right) \left( 1 - Q(x \mid y) \right) \binom{x}{1}    (12)
The gradients in (10) and (12) may be used to implement a gradient descent algorithm for optimising D_1 + D_2, which then leads to a least upper bound on the full objective function D (= D_1 + D_2 - D_3).
5 Special Cases
Various standard results that are special cases of the model presented above are
discussed in the following subsections.
5.1 Vector Quantiser and Topographic Mapping
Assume n = 1, so that only one firing event is observed and hence D_2 = D_3 = 0; assume y ∈ N(y') ∀ y, y', so that the neighbourhood embraces all of the output nodes; and allow probability leakage of the type given in (5). Then D reduces to D = 2 ∫ dx Pr(x) Σ_{y=1}^{m} Pr(y|y(x)) ||x - x'(y)||² [1], which leads to behaviour that is very similar to the topographic mappings described in [5], where Pr(y|y') now corresponds to the topographic neighbourhood function. In the limit where Pr(y|y(x)) = δ_{y,y(x)} this reduces to the criterion for optimising a vector quantiser [4].
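The n = 1, winner-take-all limit is easy to illustrate numerically. In the minimal sketch below (data set, code-book size and the deterministic posterior Pr(y|y(x)) = δ_{y,y(x)} are all illustrative assumptions), the objective D is exactly twice the usual vector-quantiser distortion:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))            # training set standing in for Pr(x)
codebook = rng.normal(size=(8, 2))       # reference vectors x'(y)

d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # ||x - x'(y)||^2
winners = d2.argmin(axis=1)              # y(x) = argmin_y ||x - x'(y)||^2
D = 2.0 * d2[np.arange(len(X)), winners].mean()
print("objective D (twice the mean vector-quantiser distortion):", D)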
5.2 Visual Cortex Network
A "visual cortex" -like network can be built if the full theoretical machinery pre-
sented earlier is used. This network has many of the emergent properties of the
mammalian visual cortex, such as orientation maps, centre-on/surround-off detec-
tors, dominance stripes, etc (see e.g. [7] for a review of these phenomena). There is
an input layer with a pattern of activity representing the input vector, an output
layer with nodes firing in response to feed-forward connections (i.e. a "recogni-
tion" model), and a mechanism for reconstructing the input from the firing events
via feed-back connections (i.e. a "generative" model). The output layer has lateral
inhibition implemented as in (4); the neighbourhood of node y' has an associated inhibition factor 1/Σ_{y''∈N(y')} Q(x|y''), and the overall inhibition factor for node y is Σ_{y'∈N⁻¹(y)} ( 1/Σ_{y''∈N(y')} Q(x|y'') ), which is the sum of the inhibition factors over all nodes y' that have node y in their neighbourhood. This scheme for introducing lateral inhibition is discussed in greater detail in [3]. The leakage introduced by Pr(y|y') induces topographical ordering as usual.
In the limit n → 1, where D_1 is dominant, this network behaves like a topographic mapping network, except that the output layer splits up into a number of "domains", each of which is typically a lateral inhibition length in size, and each of which forms a separate topographic mapping. These domains are seamlessly joined together, so no domain boundaries are actually visible. In the limit n → ∞, where D_2 is dominant, this network approximates its input as a superposition of reference vectors Σ_{y=1}^{m} Pr(y|x) x'(y) (see (8)). Thus the network is capable of explaining the input in terms of multiple causes.
6 Conclusions
A single theoretical framework has been shown to describe a number of standard
self-organising neural networks. This makes it easy to understand the relationship
between these neural networks, and it provides a useful framework for analysing
their properties.
REFERENCES
[1] Luttrell S P, A Bayesian analysis of self-organising maps, Neural Computation, Vol. 6(5)
(1994), pp767-794.
[2] Luttrell S P, Designing analysable networks, Handbook of Neural Computation, OUP (1996).
[3] Luttrell S P, Partitioned mixture distribution: an adaptive Bayesian network for low-level image processing, IEE Proc. Vision Image Signal Processing, Vol. 141(4) (1994), pp251-260.
[4] Linde Y, Buzo A and Gray R M, An algorithm for vector quantiser design, IEEE Trans.
COM, Vol. 28(1) (1980), pp84-95.
[5] Kohonen T, Self organisation and associative memory, Springer-Verlag (1984).
[6] Luttrell S P, Derivation of a class of training algorithms, IEEE Trans. NN, Vol. 1(2) (1990),
pp229-232.
[7] Goodhill G J, Correlations, competition, and optimality: modelling the development of to-
pography and ocular dominance, CSRP 226 (1992), Sussex University.
Acknowledgements
© British Crown Copyright 1995/DRA. Published with the permission of the Con-
troller of Her Britannic Majesty's Stationery Office.
NEURAL NETWORK SUPERVISED TRAINING
BASED ON A DIMENSION REDUCING METHOD
G.D. Magoulas, M.N. Vrahatis*,
T.N. Grapsa* and G.S. Androulakis*
Department of Electrical and Computer Engineering, University of Patras,
GR-261.10, Patras, Greece. Email: [email protected]
* Department of Mathematics, University of Patras,
GR-261.10 Patras, Greece. Email: [email protected]
In this contribution a new method for supervised training is presented. This method is based on a recently proposed root-finding procedure for the numerical solution of systems of non-linear algebraic and/or transcendental equations in ℝⁿ. The new method reduces the dimensionality of the problem in such a way that it can lead to an iterative approximate formula for the computation of n - 1 connection weights. The remaining connection weight is evaluated separately using the final approximations of the others. This reduced iterative formula generates a sequence of points in ℝ^{n-1} which converges quadratically to the proper n - 1 connection weights. Moreover, it requires neither a good initial guess for one connection weight nor accurate error function evaluations. The new method is applied to some test cases in order to evaluate its performance.
Subject classification: AMS(MOS) 65K10, 49D10, 68T05, 68G05.
Keywords: Numerical optimization methods, feed forward neural networks, supervised training,
back-propagation of error, dimension-reducing method.
1 Introduction
Consider a feed forward neural network (FNN) with L layers, l ∈ [1, L]. The error is defined as e_k(t) = d_k(t) - y_k^L(t), for k = 1, 2, ..., K, where d_k(t) is the desired response at the kth neuron of the output layer for input pattern t, and y_k^L(t) is the output of the kth neuron of the output layer L. If there is a fixed, finite set of input-output cases, the square error over the training set, which contains T representative cases, is:

E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T} \sum_{k=1}^{K} e_k^2(t).    (1)
The most common supervised training algorithm for FNNs with sigmoidal non-linear neurons is Back-Propagation (BP) [1]. BP minimizes the error function E using Steepest Descent (SD) with fixed step size and computes the gradient using the chain rule on the layers of the network. BP converges too slowly and often yields suboptimal solutions. The quasi-Newton method (BFGS) [2] converges much faster than BP, but the storage and computational requirements of the Hessian for very large FNNs make its use impractical on most current machines.
In this paper, we derive and apply a new training method for FNNs named the Dimension Reducing Training Method (DRTM). DRTM is based on the methods studied in [3] and incorporates the advantages of the Newton and SOR algorithms (see [4]).
2 Description of the DRTM
Throughout this paper ℝⁿ is the n-dimensional real space of column weight vectors w with components w_1, w_2, ..., w_n; (y; z) represents the column vector with components y_1, y_2, ..., y_m, z_1, z_2, ..., z_k; ∂_i E(w) denotes the partial derivative of E(w) with respect to the ith variable w_i; g(w) = (g_1(w), ..., g_n(w)) defines the gradient ∇E(w) of the objective function E at w, while H = [H_ij] defines the Hessian ∇²E(w) of E at w; Ā denotes the closure of the set A; and E(w_1, ..., w_{i-1}, ·, w_{i+1}, ..., w_n) defines the error function obtained by holding w_1, ..., w_{i-1}, w_{i+1}, ..., w_n fixed.
The problem of training is treated as an optimization problem in the FNN's weight space (i.e., the n-dimensional Euclidean space). In other words, we want to find the proper weights that satisfy the following system of equations:

g_i(w) = 0,   i = 1, \ldots, n.    (2)
In order to solve this system iteratively, we want a sequence of weight vectors {w^p}, p = 0, 1, ..., which converges to the minimizer w* = (w*_1, ..., w*_n) ∈ 𝒟 ⊂ ℝⁿ of the function E. First, we consider the sets B_i to be those connected components of g_i^{-1}(0) containing w* on which ∂_n g_i ≠ 0, for i = 1, ..., n respectively. Next, applying the Implicit Function Theorem (see [4, 3]) for each one of the components g_i, we can find open neighborhoods A_i ⊂ ℝ^{n-1} and A_{n,i} ⊂ ℝ of the points y* = (w*_1, ..., w*_{n-1}) and w*_n respectively, such that for any y = (w_1, ..., w_{n-1}) ∈ A_i there exist unique mappings φ_i, defined and continuous in A_i, such that w_n = φ_i(y) ∈ A_{n,i} and g_i(y; φ_i(y)) = 0, i = 1, ..., n. Moreover, the partial derivatives ∂_j φ_i, j = 1, ..., n - 1, exist in A_i for each φ_i, they are continuous in A_i, and they are given by:

\partial_j \varphi_i(y) = -\frac{\partial_j g_i(y; \varphi_i(y))}{\partial_n g_i(y; \varphi_i(y))},   i = 1, \ldots, n,\; j = 1, \ldots, n-1.    (3)
Working exactly as in [3], we utilize Taylor's formula to expand φ_i(y) about y^p. By straightforward calculations, utilizing approximate values for g_i(·) and ∂_j g_i(·) ≡ ∂²_{ij}E (see [5], where error estimates for these approximations can also be found), we obtain the following iterative scheme for the computation of the n - 1 components of w*:

y^{p+1} = y^p - A_p^{-1} v_p,   p = 0, 1, \ldots,    (4)

where y^p = [w_i^p], v_p = [v_i] = [w_n^{p,i} - w_n^{p,n}], and the elements of the matrix A_p = [a_{ij}] are:

a_{ij} = \frac{g_i(y^p + h e_j;\, w_n^{p,i}) - g_i(y^p;\, w_n^{p,i})}{g_i(y^p;\, w_n^{p,i} + h) - g_i(y^p;\, w_n^{p,i})} - \frac{g_n(y^p + h e_j;\, w_n^{p,n}) - g_n(y^p;\, w_n^{p,n})}{g_n(y^p;\, w_n^{p,n} + h) - g_n(y^p;\, w_n^{p,n})},    (5)

with w_n^{p,i} = φ_i(y^p), h a small quantity, and e_j the jth unit vector. After a desired number of iterations of (4), say p = m, the nth component of w* can be approximated by means of the following relation:
w_n^{m+1} = w_n^{m,n} - \sum_{j=1}^{n-1} \left( w_j^{m+1} - w_j^m \right) \frac{g_n(y^m + h e_j;\, w_n^{m,n}) - g_n(y^m;\, w_n^{m,n})}{g_n(y^m;\, w_n^{m,n} + h) - g_n(y^m;\, w_n^{m,n})}.    (6)
Note that the iterative formula (4) uses the matrix A_p and the vector v_p. The matrix A_p constitutes the reduced Hessian of our network; its components incorporate components of the Hessian but are evaluated at different points. The vector v_p uses only the points w_n^{p,i} (i = 1, ..., n - 1) and w_n^{p,n} instead of the gradient values employed in Newton's method. A proof of the convergence of (4) and (6) can be found in [6].
Related procedures for obtaining w* can be constructed by replacing w_n with any one of the components w_1, ..., w_{n-1}. The above described method
Table 1 Results for Example 1; w⁰ denotes the initial weights and F a failure to converge.

w⁰             FR         PR         BFGS        DRTM
               IT   FE    IT   FE    IT   FE     IT   FE   ASG
(0.3, 0.4)     F    F     F    F     F    F      5    20   100
(-1, -2)       F    F     F    F     14   274    7    28   140
(-1, 10)       F    F     F    F     14   285    7    28   140
(0.2, 0.2)     F    F     F    F     F    F      5    20   100
(2, 1)         F    F     F    F     13   298    5    20   100
(0.3, 0.3)     F    F     F    F     F    F      5    20   100
(-1.2, 1.2)    F    F     F    F     F    F      7    28   140
does not require the expressions φ_i but only the values w_n^{p,i}, which are given by the solution of the one-dimensional equations g_i(w_1^p, ..., w_{n-1}^p, ·) = 0. So, by holding y^p = (w_1^p, ..., w_{n-1}^p) fixed, we can solve the equations g_i(y^p; r_i^p) = 0, i = 1, ..., n, for an approximate solution r_i^p in the interval (a, b) within an accuracy δ. In order to solve the one-dimensional equations, we employ a modified bisection method, described in [3, 7] and given by the following formula:

w^{p+1} = w^p + \mathrm{sgn}\,\psi(w^p)\, q / 2^{p+1},   p = 0, 1, \ldots,    (7)

with w⁰ = a and q = sgn ψ(a)(b - a), where sgn denotes the well-known sign function. This method computes a root with certainty when sgn ψ(a) sgn ψ(b) = -1 (see [7]). It is evident from (7) that the only computable information required by this method is the algebraic sign of the function ψ.
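As an illustration of (7), the following sketch locates a root from sign information alone; the test function and bracketing interval are hypothetical, and sgn ψ(a) sgn ψ(b) = -1 is assumed as stated above.

import math

def modified_bisection(psi_sign, a, b, tol=1e-10):
    """Sign-only root location, Eq. (7): w^{p+1} = w^p + sgn(psi(w^p)) q / 2^{p+1}.

    psi_sign(w) returns the algebraic sign of psi(w); it is assumed that
    sgn(psi(a)) * sgn(psi(b)) = -1, so that a root is bracketed in (a, b).
    """
    q = psi_sign(a) * (b - a)
    w = a
    for p in range(math.ceil(math.log2((b - a) / tol))):
        w += psi_sign(w) * q / 2 ** (p + 1)
    return w

# Illustration: the root of psi(w) = w^3 - 2 in (0, 2) from signs alone.
root = modified_bisection(lambda w: 1 if w ** 3 - 2 >= 0 else -1, 0.0, 2.0)
print(root)   # close to 2 ** (1/3) = 1.2599...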
A high-level description of the new algorithm can be found in [8].
3 Simulation Results
Here we present and compare the behavior of DRTM and other popular methods in some artificially created but characteristic situations. For example, it is
common in FNN training to take minimization steps that increase some weights by
large amounts pushing the output of the neuron into saturation. Moreover, in vari-
ous small and large scale neural network applications the error surface has flat and
steep regions. It is well known that the BP is highly inefficient in locating minima
in such surfaces. In the following examples, the gradient is evaluated using finite
differences for the DRTM and analytically for all the other methods.
Example 1. The objective function's surface has flat and steep regions:

E(w) = \sum_{i=1}^{10} g_i^2(w).    (8)

Function (8), a well-known test case (the Jennrich and Sampson function; see [9]), has a global minimum at w_1 = w_2 = 0.2578.... In Table 1 we present results
obtained by applying the nonlinear conjugate gradient methods Fletcher-Reeves
(FR) and Polak-Ribiere (PR) and the quasi-Newton Broyden-Fletcher-Goldfarb-
Shanno (BFGS) method, together with the corresponding numerical results of DRTM. In this table, IT indicates the total number of iterations required to obtain w* (iteration limit = 500); FE the total number of function evaluations (including derivatives); and ASG the total number of algebraic signs of the components of the gradient required for applying the iterative scheme (7). Because of the difficulty of the problem, FR and PR failed to converge in all cases (marked with an F in the table). The results for BFGS are mixed: in particular, when close to the minimum, BFGS leaves the appropriate region and moves in a wrong direction while attempting to minimize the objective function.
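For readers wishing to reproduce Example 1: in the standard test set of [9] the residuals of the Jennrich and Sampson function take the form g_i(w) = 2 + 2i - (e^{i w_1} + e^{i w_2}). The sketch below (using scipy's BFGS with numerically estimated gradients, purely for illustration) minimises (8); it need not reproduce the failures reported in Table 1, which depend on the particular implementations used there.

import numpy as np
from scipy.optimize import minimize

def E(w):
    i = np.arange(1, 11)
    g = 2.0 + 2.0 * i - (np.exp(i * w[0]) + np.exp(i * w[1]))  # residuals g_i
    return np.sum(g ** 2)

res = minimize(E, x0=np.array([0.3, 0.4]), method='BFGS')
print(res.x, res.fun)   # expect w1 = w2 near 0.2578, E near 124.36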
Example 2 The objective function's surface is oval shaped and bent.
We can artificially create such a surface by training a single neuron with sigmoid non-linearity using the patterns {-6, 1}, {-6.1, 1}, {-4.1, 1}, {-4, 1}, {4, 1}, {4.1, 1}, {6, 1}, {6.1, 1} for input and {0}, {0}, {0.97}, {0.99}, {0.01}, {0.03}, {1}, {1} for output. The weights w_1, w_2 take values in the interval [-3, 3] × [-7.5, 7.5]. The global minimum is located at the center of the surface and there are two valleys that lead to local minima. The step size for BP was 0.05. The initial weights were formed by spanning the interval [-3, 3] in steps of 0.05 and the interval [-7.5, 7.5] in steps of 0.125.
The behavior of the methods is exhibited in Table 2, where MN indicates the mean number of iterations for simulations that reached the global minimum; STD the standard deviation of iterations; SUC the percentage of success in locating the global minimum; and MAS the mean number of algebraic signs required for applying the iterative scheme (7). Note that for DRTM, since finite differences are used, two error function evaluations are required in each iteration. BP succeeds in locating the global minimum when the initial weights take values in the intervals w_1 ∈ [-0.8, 1.5] and w_2 ∈ [-2.5, 2.5]. On the other hand, DRTM is less affected by the initial weights. In this case we exploit the fact that we are able to isolate the weight-vector component most responsible for unstable behavior by reducing the dimension of the problem. Therefore, DRTM is very fast and possesses a high percentage of success.
4 Conclusion and Further Improvements
This paper describes a new training method for FNNs. Although the proposed
method uses reduction to simpler one-dimensional equations, it converges quad-
ratically to n - 1 components of an optimal weight vector, while the remaining
weight is evaluated separately using the final approximations of the others. Thus,
it does not require a good initial estimate for one component of an optimal weight
vector. Moreover, the user is free to choose which component will be the remaining
weight, according to the problem. Since it uses the modified one-dimensional bisec-
tion method, it requires only that the algebraic signs of the function and gradient
values be correct. It is also possible to use this method in training with blocks of weights, using different remaining weights. In this case, the method can lead to a
network training and construction algorithm. This issue is currently under devel-
opment and we hope to address it in a future communication.
Note that in general the matrix of our reduced system is not symmetric. It is
possible to transform it to a symmetric one by using proper perturbations [6]. If
the matrix is symmetric and positive definite the optimal weight vector minimizes
the objective function. Furthermore, DRTM appears particularly useful when it is
difficult to evaluate the gradient values accurately, as well as when the Hessian at
the optimum is singular or ill-conditioned [8].
REFERENCES
[1] D. E. Rumelhart and J. L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press (1986), pp318-362.
[2] R. L. Watrous, Learning algorithms for connectionist networks: applied gradient methods of non-linear optimization, in: Proc. IEEE Int. Conf. Neural Networks, San Diego, CA, Vol. 2 (1987), pp619-627.
[3] T. N. Grapsa, M. N. Vrahatis, A dimension-reducing method for solving systems of non-linear equations in ℝⁿ, Int. J. Computer Math., Vol. 32 (1990), pp205-216.
[4] J. M. Ortega, W. C. Rheinboldt, Iterative Solution of Non-linear Equations in Several Variables, Academic Press, New York (1970).
[5] J. E. Dennis, R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Non-linear Equations, Prentice-Hall, Englewood Cliffs, NJ (1983).
[6] T. N. Grapsa, M. N. Vrahatis, A dimension-reducing method for unconstrained optimization, J. Comp. Appl. Math., Vol. 66 (1996), pp239-253.
[7] M. N. Vrahatis, Solving systems of non-linear equations using the non zero value of the topological degree, ACM Trans. Math. Software, Vol. 14 (1988), pp312-329.
[8] G. D. Magoulas, M. N. Vrahatis, T. N. Grapsa, G. S. Androulakis, A dimension-reducing training method for feed-forward neural networks, Tech. Rep. CSL-1095, Department of Electrical & Computer Engineering, University of Patras (1995).
[9] J. J. Moré, B. S. Garbow, K. E. Hillstrom, Testing unconstrained optimization software, ACM Trans. Math. Software, Vol. 7 (1981), pp17-41.
A TRAINING METHOD FOR DISCRETE
MULTILAYER NEURAL NETWORKS
G.D. Magoulas, M.N. Vrahatis*,
T.N. Grapsa* and G.S. Androulakis*
Department of Electrical and Computer Engineering, University of Patras,
GR-261.10, Patras, Greece. Email: [email protected]
* Department of Mathematics, University of Patras,
GR-261.10 Patras, Greece. Email: [email protected]
In this contribution a new training method is proposed for neural networks based on neurons with discrete output states. This method minimises the well known least squares criterion using information concerning only the signs of the error function and inaccurate gradient values. The algorithm is based on a modified one-dimensional bisection method, and it treats supervised training in networks of neurons with discrete output states as a problem of minimisation based on imprecise values.
Subject classification: AMS(MOS) 65K10, 49D10, 68T05, 68G05.
Keywords: Numerical analysis, imprecise function and gradient values, hard-limiting threshold
units, feed forward neural networks, supervised training, back-propagation of error.
1 Introduction
Consider a Discrete Multilayer Neural Network (DMNN) consisting of L layers, in which the first layer denotes the input, the last one, L, is the output, and the intermediate layers are the hidden layers. It is assumed that the (l-1)-th layer has N_{l-1} units. These units operate according to the following equations:

net_j^l = \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l}\, y_i^{l-1} + \theta_j^l,    y_j^l = \sigma(net_j^l),

where net_j^l is the net input to the jth unit at the lth layer, w_{ij}^{l-1,l} is the connection weight from the ith unit at the (l-1)-th layer to the jth unit at the lth layer, y_i^l denotes the output of the ith unit belonging to the lth layer, θ_j^l denotes the threshold of the jth unit at the lth layer, and σ is the activation function. In this paper we consider units where σ(net_j^l) is a discrete activation function. We especially focus on units with two output states, usually called binary or hard-limiting units [1], i.e. σ(net_j^l) = "true" if net_j^l ≥ 0, and "false" otherwise.
Although units with discrete activation functions have been superseded to a large extent by the computationally more powerful units with analog activation functions, DMNNs are still important in that they can handle many of the inherently binary tasks that neural networks are used for. Their internal representations are clearly interpretable, they are computationally simpler to understand than networks with sigmoid units, and they provide a starting point for the study of neural network properties. Furthermore, when using hard-limiting units we can better understand the relationship between the size of the network and the complexity of training [2]. In [3] it has been demonstrated that DMNNs with only one hidden layer can create any decision region that can be expressed as a finite union of polyhedral sets when there is one unit in the input layer. Moreover, artificially created examples were given where these networks create non-convex and disjoint decision regions.
Finally, discrete activation functions facilitate neural network implementations in
digital hardware and are much less costly to fabricate.
The most common feed forward neural network (FNN) training algorithm, back-propagation (BP) [4], makes use of gradient descent and cannot be applied directly to networks of units with discrete output states, since discrete activation functions (such as hard limiters) are non-differentiable. However, various modifications of gradient descent have been presented [5, 6, 7]. In [8] an approximation to gradient descent, the so-called pseudo-gradient training method, was proposed. This method uses the gradient of a sigmoid as a heuristic hint instead of the true gradient. Experimental results validated the effectiveness of this approach.
In this paper, we derive and apply a new training method for DMNNs that makes
use of the gradient approximation introduced in [8]. Our method exploits the impre-
cise information regarding the error function and the approximated gradient, like
the pseudo-gradient method does, but it has an improved convergence speed and
has potential to train DMNNs in situations where, according to our experiments,
the pseudo-gradient method fails to converge.
2 Problem Formulation and Proposed Solution
We consider units with two discrete output states and we shall use the convention f (or -f) for "false" and t (or +t) for "true", where f, t are real positive numbers and f < t, instead of the classical 0 and 1 (or -1 and +1). Real positive values prevent units from saturating, give the logical "false" some power of influence over the next layer of the DMNN, and help justify the approximated gradient value which we shall employ.
First, let us define the error for a discrete unit as follows: e_j(t) = d_j(t) - y_j^L(t), for j = 1, 2, ..., N_L, where d_j(t) is the desired response of the jth neuron of the output layer for input pattern t, and y_j^L(t) is the output of the jth neuron of the output layer L. For a fixed, finite set of input-output cases, the square error over the training set, which contains T representative cases, is:

E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T} \sum_{j=1}^{N_L} e_j^2(t),

where the back-propagating error signal δ_j^L for the output layer is δ_j^L = (d_j - s(net_j^L)) · s'(net_j^L), and for the hidden layers (l ∈ [2, L-1]) it is δ_j^l = s'(net_j^l) Σ_n w_{jn}^{l,l+1} δ_n^{l+1}. In these relations s'(net_j^l) is the derivative of the analog activation function.
function.
By using real positive values for "true" and "false" we ensure that the pseudo-
gradient will not reduce to zero when the output is "false". Note also that we do
not use (1" which is zero everywhere and non-existent at zero. Instead, we use s'
which is always positive, so 6j gives an indication of the direction (and magnitude
of a step up or down as a function of net~ in the error surface E.
However, as pointed out in [8], the value of the pseudo-gradient is not accurate
enough, so gradient descent based training in DMNNs is considerably slow when
compared with BP training in FNNs.
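A minimal sketch of the pseudo-gradient idea of [8] for a single hard-limiting unit (the architecture, data, output states and learning rate are all illustrative assumptions): the forward pass uses the discrete activation, while the error signal is formed with the derivative of an analog sigmoid s.

import numpy as np

rng = np.random.default_rng(2)
f_val, t_val = 0.2, 0.8                   # real positive "false"/"true" states
hard = lambda u: np.where(u >= 0.0, t_val, f_val)   # discrete activation
s = lambda u: 1.0 / (1.0 + np.exp(-u))              # analog surrogate
ds = lambda u: s(u) * (1.0 - s(u))                  # s' is always positive

X = rng.normal(size=(40, 2))                        # toy separable task
d = np.where(X[:, 0] + X[:, 1] > 0.0, t_val, f_val) # desired responses

w, bias, lr = rng.normal(size=2), 0.0, 0.5
for epoch in range(200):
    net = X @ w + bias
    delta = (d - hard(net)) * ds(net)    # pseudo-gradient error signal
    w += lr * X.T @ delta / len(X)
    bias += lr * delta.mean()
print("final squared error:", np.sum((d - hard(X @ w + bias)) ** 2))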
In order to alleviate this problem we propose an alternative to the pseudo-gradient training procedure. To be more specific, we propose to solve the one-dimensional equation:

E(w_1, w_2^0, \ldots, w_n^0) - E(w_1^0, w_2^0, \ldots, w_n^0) = 0,

for w_1, keeping all other components of the weight vector at their constant values. Now, if w̄_1 is the solution of the above equation, then the point defined by the vector (w̄_1, w_2^0, ..., w_n^0) possesses the same error function value as the point w^0, so it belongs to the same contour line as w^0. Assuming that the error function curves up from w* in all directions, we can claim that any point on the line segment with endpoints w^0 and (w̄_1, w_2^0, ..., w_n^0) possesses a smaller error function value than these endpoints. With this fact in mind we can now choose such a point, say w_1^1 = w_1^0 + γ(w̄_1 - w_1^0), γ ∈ (0, 1), and solve the one-dimensional equation:

E(w_1^1, w_2, w_3^0, \ldots, w_n^0) - E(w_1^1, w_2^0, w_3^0, \ldots, w_n^0) = 0,

for w_2, keeping all other components at their constant values. If w̄_2 is the solution of this equation, then we can obtain a better approximation for this component by taking w_2^1 = w_2^0 + γ(w̄_2 - w_2^0), γ ∈ (0, 1).
Continuing in a similar way with the remaining components of the weight vector, we obtain the new vector w^1 = (w_1^1, ..., w_n^1) and replace the initial vector w^0 by w^1. The procedure can then be repeated to compute w^2, and so on, until the final estimated point is computed to a predetermined accuracy. So, in general, we want to find the parameter x (a weight or threshold) that satisfies:

E(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x, x_{i+1}^k, \ldots, x_n^k) - E(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, \ldots, x_n^k) = 0,

by applying the modified bisection (see [12, 13]) in the interval (a_i, b_i) within accuracy d:

x_i^{p+1} = x_i^p + c\, \mathrm{sgn}\big( E(z^p) - E(z^0) \big) / 2^{p+1},   p = 0, 1, \ldots, \lceil \log_2((b_i - a_i)\, d^{-1}) \rceil,

where ⌈·⌉ denotes the smallest integer not less than the real number quoted, z^0 = (x_1^{k+1}, ..., x_{i-1}^{k+1}, a_i, x_{i+1}^k, ..., x_n^k), z^p = (x_1^{k+1}, ..., x_{i-1}^{k+1}, x_i^p, x_{i+1}^k, ..., x_n^k), c = sgn E(z^0)(b_i - a_i), a_i = x_i^k - {1 + sgn ∂_i E(x^k)} h_i, and b_i = a_i + h_i. If an iteration of the algorithm fails, we switch to the pseudo-gradient training method. So the justification of the new procedure rests on the heuristic justification of the pseudo-gradient, which can be found in any one of [8, 9, 10]. A formal justification of the proposed procedure in the case of differentiable objective functions can be found in [11].
Table 1 Comparative results for (a) the XOR problem and (b) the 1-10-1 approximation of sin x cos 2x.

         BP                        New method
     MN      STD     MNE      MN    STD    MNE        MAS     SAS
(a)  561     550.4   0.0396   40.6  4.2    0.0000008  239.6   54.12
(b)  18121   3048.7  0.49     28.5  13.43  0.45       20673   9310.9
3 Experimental Results
Here we present and compare the behaviour of the new training method with BP [4] and the pseudo-gradient training method [8] on the XOR problem and on training a 1-10-1 network to approximate the function sin x cos 2x (Table 1). In all problems γ = 0.5, d = 10^{-10}, h_i = 10, and no pseudo-gradient subprocedure has been applied with the proposed method, in order to obtain a fairer evaluation. MN indicates the mean number of iterations; STD the standard deviation of iterations; MNE the mean value of the error; MAS the mean number of algebraic signs required for the bisection scheme; and SAS the standard deviation of the required algebraic signs. The results are for 10 simulation runs with the same initial weights; the maximum number of iterations was set to 2000, the weights were initialised in the interval [-10, 10], and the step size for BP was set to the standard value 0.75. For the XOR problem the thresholds were set as follows: "true" = 0.8 and "false" = 0.2. Under the same conditions the pseudo-gradient training needed more than 2000 iterations to converge. The frequency with which the algorithm became trapped in local minima seems to be about the same as for BP on binary tasks. We also used the new method to train DMNNs to learn smooth functions. One hidden layer of hard-limiting units and one output unit with a linear activation function was used in all our experiments. We did not manage to train DMNNs using the pseudo-gradient training method, due to oscillations, although various step sizes and different discrete activation functions were tried. With the new algorithm, and discrete activation states such as 0.5 for "true" and -0.5 for "false", DMNNs were trained as fast as, and often faster than, BP-trained FNNs until E ≤ 0.5 (over 21 input/output cases). Beyond this error bound, the convergence speed was reduced due to saturation problems.
However, it is worth noting the difference in behaviour between BP and the new method. Back-propagation-trained FNNs exhibit a greater tendency to fit closely data with high variation than data with low variation. On the other hand, although DMNNs do not produce smooth functions, they learn the general trend of the data values and therefore might be more useful than FNNs when there is noise in the data and the error goal can be set high enough that the network does not have to fit all the target values perfectly. Situations like this usually occur in system identification and control (see [14]).
4 Conclusion and Further Improvements
This paper describes a new training method for DMNNs. The method does not directly perform gradient evaluations. Since it uses the modified one-dimensional bisection method, it requires only that the algebraic signs of the function and gradient values be correct; so it can be applied to problems with imprecise function and gradient values. The method can also be used in training with blocks of network parameters, for example training the entire network, then the weights to the output layer and the thresholds of the hidden units, etc. We have tested such configurations and the results were very promising, providing faster training.
REFERENCES
[1] W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, Vol. 5 (1943), pp115-133.
[2] S. E. Hampson, D. J. Volper, Representing and learning boolean functions of multivalued features, IEEE Trans. Systems, Man & Cybernetics, Vol. 20 (1990), pp67-80.
[3] G. J. Gibson, C. F. N. Cowan, On the decision regions of multi-layer perceptrons, Proc. IEEE, Vol. 78 (1990), pp1590-1594.
[4] D. E. Rumelhart and J. L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press (1986), pp318-362.
[5] B. Widrow, R. Winter, Neural nets for adaptive filtering and adaptive pattern recognition, IEEE Computer (March 1988), pp25-39.
[6] D. J. Toms, Training binary node feed forward neural networks by back-propagation of error, Electronics Letters, Vol. 26 (1990), pp1745-1746.
[7] E. M. Corwin, A. M. Logar, W. J. B. Oldham, An iterative method for training multilayer networks with threshold functions, IEEE Trans. Neural Networks, Vol. 5 (1994), pp507-508.
[8] R. Goodman, Z. Zeng, A learning algorithm for multi-layer perceptrons with hard-limiting threshold units, in: Proc. IEEE Neural Networks for Signal Processing (1994), pp219-228.
[9] Z. Zeng, R. Goodman, P. Smyth, Learning finite state machines with self-clustering recurrent networks, Neural Computation, Vol. 5 (1993), pp976-990.
[10] Z. Zeng, R. Goodman, P. Smyth, Discrete recurrent neural networks for grammatical inference, IEEE Trans. Neural Networks, Vol. 5 (1994), pp320-330.
[11] M. N. Vrahatis, G. S. Androulakis, G. E. Manoussakis, A new unconstrained optimization method for imprecise function and gradient values, J. Mathematical Analysis & Applications, Vol. 197 (1996), pp586-607.
[12] M. N. Vrahatis, Solving systems of non-linear equations using the non zero value of the topological degree, ACM Trans. Math. Software, Vol. 14 (1988), pp312-329.
[13] M. N. Vrahatis, CHABIS: A mathematical software package for locating and evaluating roots of systems of non-linear equations, ACM Trans. Math. Software, Vol. 14 (1988), pp330-336.
[14] H. J. Sira-Ramirez, S. H. Zak, The adaptation of perceptrons with applications to inverse dynamics identification of unknown dynamic systems, IEEE Trans. Systems, Man & Cybernetics, Vol. 21 (1991), pp634-643.
LOCAL MINIMAL REALISATIONS OF TRAINED
HOPFIELD NETWORKS
S. Manchanda and G.G.R. Green*
Dept. of Chemical and Process Engineering,
University of Newcastle upon Tyne, Newcastle upon Tyne, NEl 7RU, UK.
* Dept. of Physiological Sciences, Email: [email protected]
A methodology for investigating the invariant structural characteristics of the different approx-
imations produced by Hopfield networks is presented. The technique exploits the description of
the dynamics of a network as a Generating series which relates the output of a network to the
past history of inputs. Truncations of a Hopfield Generating series are approximations to unknown
dynamics to a specified order. As a truncated series has finite Lie rank, a local minimal realisation
can be formulated. This realisation has a dimension whose lower bound is determined by the
relative order of the network and whose upper bound is determined by the order of truncation.
The maximal dimension of the minimal realisation is independent of the number of nodes in the
network.
Keywords: Hopfield networks, nonlinear dynamics, realisations, minimality.
1 Introduction
Trained recurrent networks are commonly used to provide models of an unknown
nonlinear dynamic system. The representations are in the form of state-space models
which are usually characterised as sets of nonlinear differential equations. However,
different combinations of network weight parameters often produce comparable
approximation capabilities. Thus a fundamental problem is the interpretation of
these representations. One approach to this problem is to attempt to reduce the
state-space model to a minimal form as this is the description to which all other
representations are related by diffeomorphisms.
In practice, network model building is concerned with producing a suitable ap-
proximation of the unknown dynamic system, thus what is required is a method
for specifying the order of the approximation and for producing the corresponding
minimal realisation which has the same approximation capability. The input-output
behaviour of a trained network can be formulated as a formal power series in non-
commutative variables. This formal power series, the Generating series [2], has a
minimal realisation if its Lie rank is finite [1]. If the infinite Hopfield Generating
series which, in general, is not of finite Lie rank can be truncated, it can be used to
produce minimal realisations whose input-output behaviour match that of the net-
work up to a specified arbitrary order. This is equivalent to producing the minimal
realisation of the unknown dynamics to a specified order of approximation.
In this article the tools for constructing minimal realisations of truncated Gener-
ating series are applied to Hopfield recurrent networks. These tools were originally
described by Fliess [1]. The approach was further developed by Jacob & Oussous
[3].
2 Hopfield Networks and their Local Solutions
The Lie derivatives, A_0 and A_1, that define the state-space of the single-input (m = 1) single-output (SISO) Hopfield RNN are

A_0 = \sum_{i=1}^{N} \Big( -\kappa_i x_i + \sum_{j=1}^{N} w_{ij}\, \sigma(x_j) \Big) \frac{\partial}{\partial x_i},    A_1 = \sum_{i=1}^{N} \Gamma_i\, \frac{\partial}{\partial x_i},
where σ(x_i) is the output of the ith node in a single layer of N hidden nodes, κ_i acts as a time constant of the ith node, w_ij is the weight between nodes i and j, and Γ_i u is the weighted input u into node i. It is assumed that the σ nonlinearity is the tanh function and that the output of the network is y = h(x) = σ(x_1).
The Generating series solution of this specific linear analytic system is formed by
the Peano-Baker iteration of the state-space differential equations. It can be shown
to be [2]
y(t) = S = h|_{x_0} + \sum_{\nu \ge 0} \; \sum_{j_1, \ldots, j_\nu = 0}^{m} A_{j_1} A_{j_2} \cdots A_{j_\nu}(h)\big|_{x_0}\, z_{j_\nu} \cdots z_{j_2} z_{j_1},

where S is a mapping from the free monoid Z*, constructed from the alphabet set {z_0, z_1}, into ℝ, and the subscript |_{x_0} means evaluation at the initial conditions of the state vector. Each word z_{j_1} z_{j_2} \cdots z_{j_k} corresponds to the iterated integral

\int_0^t u_{j_1}(\tau_1) \int_0^{\tau_1} u_{j_2}(\tau_2) \cdots \int_0^{\tau_{k-1}} u_{j_k}(\tau_k)\, d\tau_k \cdots d\tau_2\, d\tau_1.
Table 1

             ε                       z_0          z_1
z_0          A_0(h)                  A_0^2(h)     A_1A_0(h)
z_1          A_1(h)                  A_0A_1(h)    A_1^2(h)
[z_0, z_1]   A_1A_0(h) - A_0A_1(h)   0            0
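The entries of Table 1 can be generated symbolically. A sketch using sympy (the two-node network, its weights, time constants and input weighting are hypothetical) computes A_0(h), A_1(h) and the bracket term A_1A_0(h) - A_0A_1(h) for y = h(x) = σ(x_1):

import sympy as sp

# Hypothetical two-node Hopfield network (weights, time constants, input weighting).
x1, x2 = sp.symbols('x1 x2')
xs = [x1, x2]
kappa = [1, 2]
Gamma = [1, 0]                                   # input drives node 1 only
W = [[sp.Rational(1, 2), -sp.Rational(3, 10)],
     [sp.Rational(4, 5),  sp.Rational(1, 5)]]
sigma = sp.tanh

f0 = [-kappa[i] * xs[i] + sum(W[i][j] * sigma(xs[j]) for j in range(2))
      for i in range(2)]                         # drift vector field (A_0)
f1 = [Gamma[i] for i in range(2)]                # input vector field (A_1)

def lie(f, h):
    """Lie derivative of the scalar h along the vector field f."""
    return sp.simplify(sum(f[i] * sp.diff(h, xs[i]) for i in range(2)))

h = sigma(x1)                                    # network output y = h(x)
A0h, A1h = lie(f0, h), lie(f1, h)
bracket = sp.simplify(lie(f1, A0h) - lie(f0, A1h))   # A1A0(h) - A0A1(h)
print(A0h, A1h, bracket, sep='\n')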
Table 2

order   rank
1       1
2       3
3       5
4       8
4 Discussion
This article is concerned with local approximations to unknown nonlinear dynamics
up to a specified order. The approximation order is in terms of the length of the
Generating series that represents the input-output behaviour of a trained network.
A trained recurrent network can closely approximate, in a local sense, an unknown
dynamic when the network Generating series is similar to that of the unknown
system. It should be noted that the unknown dynamic system may not necessarily
be represented by a suitable global model and therefore one should seek local models
in the first instance. The Generating series, like the Taylor series, is a local functional expansion.
In this article only an upper and lower bound on the rank of the Lie-Hankel matrix
is described. The upper bound is determined by the length of the truncated Gen-
erating series and is directly related to the order of approximation of the unknown
dynamics by the Hopfield network.
The rank determines the dimension of the minimal realisation. The dimension of
the minimal realisation is independent of the number of hidden nodes in the Hopfield
network. The network trajectory evolves locally on a submanifold of the Hopfield
state-space.
The local minimal realisation of a truncated Generating series of a Hopfield network
is a set of polynomial differential equations and an output which is polynomial
in the states. The state-space dynamics are fixed and do not depend upon the
position in the original Hopfield state space. The minimal state space dynamics
are input driven, with zero initial conditions. The state space dynamics reflect the
local influence of the system input. However, the output map is dependent upon
the position in the Hopfield state space. These observations on the form of the local
minimal realisations imply that any two networks which are used to approximate
an unknown system have the same minimal state-space dynamics and differ only
in the form of the output function.
Acknowledgements
We gratefully acknowledge support from the UK EPSRC and the Research Com-
mittee of the University of Newcastle upon Tyne, UK.
REFERENCES
[1] M. Fliess, Realisation locale des systemes non lineaires, algebres de lie jiltrees transitives et
series generatrices non commutatives, Invent. Math., Vol. 71 (1983), pp521-537.
[2] M. Fliess, M. Lamnabhi, and F. Lamnabhi-Lagarrigue, An algebraic approach to nonlinear
functional expansions, IEEE Transactions on Circuits and Systems, Vol. 30 (1983) pp554-570.
[3] G. Jacob and N. Oussous, Local and minimal realisation of nonlinear dynamical systems
and lyndon words, in: A. Isidori, editor, IFAC symposium: Nonlinear Control Systems Design
(1989), pp155-160.
[4] G. Viennot. Aigebre de Lie libres et monoides libres. Lecture Notes in Mathematics, Vol. 691
(1978), Springer-Verlag.
DATA DEPENDENT HYPERPARAMETER
ASSIGNMENT
Glenn Marion and David Saad *
Department of Statistics and Modelling Science, Livingstone Tower,
26 Richmond Street, Glasgow G1 1XH, UK. Email: [email protected]
* Department of Computer Science and Mathematics, Aston University,
The Aston Triangle, Birmingham B4 7ET, UK. Email: [email protected]
We show that in supervised learning from a particular data set Bayesian model selection, based
on the evidence, does not optimise generalization performance even for a learnable linear prob-
lem. This is achieved by examining the finite size effects in hyperparameter assignment from the
evidence procedure and its effect on generalisation. Using simulations we corroborate our ana-
lytic results and examine an alternative model selection criterion, namely cross-validation. This
numerical study shows that in the learnable linear case for finite sized systems leave one out
cross-validation estimates correlate more strongly with optimal performance than do those of the
evidence.
1 Introduction
The problem of supervised learning, or learning from examples, has been much
studied using the techniques of statistical physics ( see e.g. [7]). A major advantage
of such studies over the usual approach in the statistics community is that one can
examine the situation where the fraction (a) of the number of examples (p) to the
number of free parameters (N) is finite. This contrasts with the asymptotic (in a)
treatments found in the statistics literature (see e.g. [6]). However, one draw back of
the statistical physics approach is that it is based on the, so called, thermodynamic
limit where one allows N and p to approach infinity whilst keeping a constant. A
quantity is said to be self averaging if its variance over data sets of examples tends to
zero in the thermodynamic limit. We show that in Bayesian model selection based on
the evidence, conclusions drawn from the thermodynamic results are qualitatively
at odds with the finite size behaviour.
In the supervised learning scenario one is presented with a set of data

D = { (y_t(x^μ), x^μ) : μ = 1..p }

consisting of p examples of an otherwise unknown teacher mapping denoted by the distribution P(y_t | x). Furthermore, we assume that the N dimensional input space is sampled with probability P(x). The learning task is to use this data D to set the N_s parameters w of some model (or student) such that its output, y_s(x), generalizes to examples not contained in the training data D. Often this is achieved by minimising a weighted sum, βE_w(D) + γC(w), of the quadratic error of the student on the training examples, E_w(D), and some cost function, C(w), which penalises over-complex models. Provided γ is non-zero this serves to alleviate the problem of overfitting. It is the setting of the so-called hyperparameters β and γ which we will examine in this presentation.
In terms of practical methods for hyperparameter assignment there are essentially two choices. Firstly, one can attempt to estimate the generalisation error (e.g. by cross-validation [6]) and then optimise this measure with respect to the hyperparameters. However, such an approach can be computationally expensive. Secondly, one can optimise some other measure and hope that the resulting assignments produce low generalisation error. In particular, MacKay [6] advocates the evidence as such a measure. Model selection based on the evidence, in the case of a linear student and teacher, has been studied by Bruce and Saad [1] in the thermodynamic limit. Their results show that optimising the average, over all possible data sets D, of the log evidence with respect to the hyperparameters optimises the average generalization error. An average case analysis of an unlearnable scenario can be found in [3] and shows that in general the evidence need not be optimal on average.
In this paper we examine hyperparameter assignment from the evidence based on
an individual data set, in the learnable linear case. In the next section we review
the evidence framework and introduce the generalization error. In section 3, we
show that the evidence procedure is unbiased and that the evidence and generaliza-
tion error are self averaging. In section 4 we examine hyperparameter assignment
from the evidence based on a particular data set. First order corrections to the
performance measures show that in general the evidence procedure does not lead
to optimal performance. Finally, we corroborate these conclusions using a numeri-
cal study which, furthermore, reveals that even leave one out cross-validation is a
superior model selection criterion to the evidence in the learnable linear case for
small systems.
2 Objective Functions
2.1 The Evidence
Since E_w(D) is the sum squared error then, if we assume that our data is corrupted by Gaussian noise with variance 1/(2β), the probability, or likelihood, of the data D being produced given the model w and β is P(D | β, w) ∝ e^{-βE_w(D)}. The complexity cost can also be incorporated into this Bayesian scheme by assuming that the a priori probability of a rule is weighted against 'complex' rules, P(w | γ) ∝ e^{-γC(w)}. Multiplying the likelihood and the prior together we obtain the post-training or student distribution, P(w | D, γ, β) ∝ e^{-βE_w(D) - γC(w)}. It is clear that the most probable model w* is given by minimizing the composite cost function βE_w(D) + γC(w) with respect to w.
The evidence, P(D | γ, β), is the normalisation constant for the post-training distribution; that is, the probability of (or evidence for) the data set D given the hyperparameters β and γ. Throughout this paper we refer to the evidence procedure as the process of fixing the hyperparameters to the values that simultaneously maximize the evidence for a given data set.
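For a linear student the evidence is a Gaussian integral and can be evaluated in closed form. The sketch below uses the standard expression for Bayesian linear regression with noise variance 1/(2β) and prior ∝ e^{-γwᵀw} (the data, dimensions and hyperparameter grid are purely illustrative, and the normalisation conventions are assumptions rather than the paper's exact ones), and scans for the evidence-maximising hyperparameters:

import numpy as np

rng = np.random.default_rng(3)
N, p, sigma2 = 5, 30, 0.25                    # input dim, examples, noise var
Phi = rng.normal(size=(p, N))
y = Phi @ rng.normal(size=N) + rng.normal(scale=np.sqrt(sigma2), size=p)

def log_evidence(gamma, beta):
    """ln P(D | gamma, beta) for noise variance 1/(2 beta), prior exp(-gamma w.w)."""
    B = beta * Phi.T @ Phi + gamma * np.eye(N)
    w_star = np.linalg.solve(B, beta * Phi.T @ y)         # most probable weights
    M = beta * np.sum((y - Phi @ w_star) ** 2) + gamma * w_star @ w_star
    logdet = np.linalg.slogdet(2.0 * B)[1]                # ln det of the Hessian
    return (0.5 * p * np.log(beta / np.pi) + 0.5 * N * np.log(gamma / np.pi)
            + 0.5 * N * np.log(2.0 * np.pi) - M - 0.5 * logdet)

grid = np.exp(np.linspace(-4.0, 4.0, 81))
best = max((log_evidence(g, b), g, b) for g in grid for b in grid)
print("evidence peak: gamma=%.3f beta=%.3f (true beta=%.1f)"
      % (best[1], best[2], 1.0 / (2.0 * sigma2)))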
2.2 The Generalization Error
We will use the notation ⟨f(z)⟩_{P(z)} to denote the average of the quantity f(z) over the distribution P(z). However, we will use the shorthand ⟨·⟩_w to mean the average over the post-training distribution P(w | D, γ, β). As our performance measure we choose the expected difference, over the input distribution P(x), between the average student and the average teacher; that is, the data dependent generalisation error,

ε_g(D) = ⟨( ⟨y_t(x)⟩_{P(y_t|x)} - ⟨y_s(x)⟩_w )²⟩_{P(x)}.

If we were to average over all possible data sets of fixed size then this would correspond to the generalization error studied in [1].
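The cross-validation alternative examined later can be computed cheaply for the linear student: the leave-one-out residuals of a weight-decay (ridge) estimator follow from a single fit. A minimal sketch, with illustrative data and an arbitrary weight-decay grid:

import numpy as np

rng = np.random.default_rng(4)
N, p = 5, 30
Phi = rng.normal(size=(p, N))
y = Phi @ rng.normal(size=N) + rng.normal(scale=0.5, size=p)

def loo_error(lam):
    """Closed-form leave-one-out squared error for ridge with weight decay lam."""
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T)  # hat matrix
    r = (y - H @ y) / (1.0 - np.diag(H))    # LOO residuals from a single fit
    return np.mean(r ** 2)

lams = np.exp(np.linspace(-6.0, 4.0, 51))
print("LOO-optimal weight decay:",
      lams[int(np.argmin([loo_error(l) for l in lams]))])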
3 Finite System Size
Since the student is linear, with output y(x) = w·x/√N, we have N_s = N. We also assume that the teacher mapping is linear, with weights w⁰, and corrupted by zero mean Gaussian noise of variance σ². Thus, P(y_t | x^μ) ∝ e^{-(y_t^μ - w⁰·x^μ/√N)²/2σ²}. Further, we assume P(x) is N(0, σ_x)¹ and adopt weight decay as our regularization procedure, that is, C(w) = wᵀw. In this case we can explicitly calculate the evidence, or rather the normalised log of the evidence, f(D) = -(1/N) ln P(D | λ, β), where we have introduced the weight decay parameter λ = γ/β. The generalisation error and the consistency can be calculated from f(D) by averaging appropriate expressions over the input distribution P(x). Details of these calculations will appear in a subsequent paper [4].
3.1 Consistency, Unbiasedness and Self Averaging
Firstly, we examine the free energy, f(D), and the generalisation error in the limit of large amounts of data (i.e. as p → ∞ with N fixed). Using the central limit theorem we can show that, in this limit, to first order the generalisation error is independent of the weight decay, whilst f is optimised by λ_ev = λ₀ ≡ σ²/(σ_w² σ_x²) and β_ev = β₀ ≡ 1/(2σ²). As we shall see later in the context of large N, this insensitivity of the generalisation error to the value of the weight decay is associated with a divergence in the variance of the optimal weight decay as the number of examples grows large. This asymptotic insensitivity to the weight decay is a reflection of the fact that our linear student is mean square consistent. We will thus focus on the following quantity when assessing the evidence procedure's performance:

K_{ε_g}(λ_ev) = ( ε_g(λ_ev(D)) - ε_g(λ_opt(D)) ) / ε_g(λ_opt(D)).    (1)
Secondly, since ⟨g_ij⟩_{P({x^μ: μ=1,...,p})} ∝ δ_ij, it can be shown that the resulting average
free energy, f̄ = ⟨f(D)⟩_{P(D)}, is extremised by λ = λ₀ and β = β₀. Similarly, the
average generalisation error is optimised by λ₀. This corresponds to the average
case result obtained for the thermodynamic limit in [1], but is valid for all N and
p. Thus the particular conclusion of the thermodynamic average case analysis
for the learnable linear scenario, that the evidence procedure optimises average
performance, is valid for all N, and in this sense the procedure is unbiased.
Finally, using results of [5],² one can show that the variance, over possible realisations
of the data set, of the free energy f(D) is of order O(1/N) as we approach
the thermodynamic limit; it is a self-averaging quantity. Similarly, it can also be
shown that the generalization error is self-averaging. This means that in the thermodynamic
limit the behaviour exhibited by the system for any particular data set
will correspond to the average case behaviour; that is, the fluctuations around the
average vanish. Thus the average case analysis of [1] corresponds to the case for a
particular data set, because their results were obtained in the thermodynamic limit.
4 Data Dependent Hyperparameter Assignment
Having now established that the evidence procedure is unbiased and that the free
energy and performance measures are self-averaging, we now wish to examine the
system behaviour for particular data sets of finite size. This is clearly the regime
¹ Here N(x, σ) denotes a normal distribution with mean x and variance σ².
² Alternatively, one can show this result using diagrammatic methods.
Figure 1. The variance in the true optimal weight decay, Var(λ_opt), for various noise
levels, (i) λ₀ = 0.04, (ii) λ₀ = 0.25 and (iii) λ₀ = 4/9, is shown in the left-hand
graph. Notice the linear divergence in α, which corresponds to our result in section
3.1 that, for sufficiently large p, the generalization error is independent of λ. The
variance in the evidence optimal weight decay, Var(λ_ev), is shown in the right-hand
graph for the same noise levels. The O(1/α) decay of this quantity is a
reflection of the fact that for large p, λ_ev(D) = λ₀.
of interest to real world applications, since one is then in the business of optimising
performance based on a particular data set. To obtain the hyperparameter assignments
made by the evidence procedure we must simultaneously solve ∂_λ f(D) = 0
and ∂_β f(D) = 0, where ∂_θ f ≡ ∂f/∂θ. We can linearize these equations, close to the
thermodynamic limit, by expanding around λ = λ₀ and β = β₀. Similarly, we can
also expand the true optimal weight decay about its thermodynamic limit value,
λ₀.
We find that the (co-)variances of these quantities are O(1/N). Figure 1 shows, to first
order, the scaled variances³ in the evidence optimal weight decay, Var(λ_ev), and
in the true optimal weight decay, Var(λ_opt). The asymptotic O(1/α) decay
of the former reflects the fact that, as discussed in section 3.1, lim_{α→∞} λ_ev(D) =
λ₀. Similarly, the divergence of the latter is indicative of the insensitivity of the
generalization error to the weight decay for large α. The divergence of both curves
for small α is of order O(1/(Nα)) and reflects the breakdown of the thermodynamic
limit when the number of examples p does not scale with the system size N.
Similarly, we find that the average squared distance between the evidence assignment
and the optimal, ⟨(λ_opt(D) − λ_ev(D))²⟩_{P(D)}, is of order O(1/N). This distance
is non-zero, except for α > 1 in the noiseless limit. Further, in the large α limit
this distance diverges, revealing the inconsistency of the evidence weight decay
assignment.
4.1 Effects on Performance
We now examine the effects on performance of this sub-optimal hyperparameter
assignment. Firstly, to order O(1/√N) the optimal performance, ε_g(λ_opt, D), and
that resulting from use of the evidence procedure, ε_g(λ_ev, D), are the same. However,
³ i.e. N times the true variances.
to order O(1/N) they differ; thus we can write the correlation between them, somewhat
suggestively, as 1 − O(1/N). Unfortunately, we are unable to calculate this
correlation to O(1/N). Therefore, we examine ⟨κ_{ε_g}⟩_{P(D)}, which tends to 1/N in
the limit of large α, reflecting the inconsistency of the evidence weight decay assignments.
In the limit of no noise, for α > 1 we find that ⟨κ_{ε_g}⟩_{P(D)} = (α + 1)/(N(α − 1)),
revealing that even for small noise levels the evidence procedure is sub-optimal.
4.2 Comparison with Cross-validation
Given that the evidence procedure is sub-optimal, it is natural to ask if another
model selection criterion could do better. Here we compare the evidence procedure
with leave-one-out cross-validation using simulations of a 1-dimensional system.
That is, we set the weight decay using the cross-validatory estimate and the evidence
estimate, and compare the resulting generalisation error to the optimal. The
results, averaged over 1000 realisations of the data set for each value of p, are plotted
in figure 2. These show that the evidence weight decay assignment results in
sub-optimal performance, with ε_g(λ_ev, D) not fully correlated with ε_g(λ_opt, D). Moreover,
the left-hand graph shows that the resulting error from the cross-validatory
estimate correlates more strongly with the optimal generalisation error than does
that resulting from the evidence estimate. In addition, the right-hand graph shows
that the fractional increase in the generalisation error is considerably larger for the
evidence procedure than for cross-validation.
Figure 2. 1-D simulation results. The left-hand graph shows the correlation between
the optimal generalization error and those obtained using the evidence (solid)
and cross-validation (chain), with λ₀ = 1.0. The right-hand graph shows the fractional
increase in generalization error κ_{ε_g}(λ) = (ε_g(λ) − ε_g(λ_opt))/ε_g(λ_opt), where λ is set
by the evidence (dashed) and by cross-validation (chain) for λ₀ = 1.0. For λ₀ = 0.01
the evidence case is the solid curve and cross-validation the dotted curve. In the latter
case the error bars are not shown for the sake of clarity.
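For a linear student with weight decay, the leave-one-out estimate used in this comparison does not require p retrainings: the standard hat-matrix identity gives all leave-one-out residuals from a single fit. A rough sketch under that assumption (our construction, not the authors' code):

```python
import numpy as np

def loo_error(Phi, t, lam):
    # Leave-one-out squared error for ridge regression with weight decay lam,
    # using the identity e_loo = e / (1 - h_nn), h_nn the hat-matrix diagonal.
    N = Phi.shape[1]
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T)  # hat matrix
    resid = t - H @ t
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def cv_weight_decay(Phi, t, grid):
    # Cross-validatory weight decay: scan candidate values and keep the best.
    return min(grid, key=lambda lam: loo_error(Phi, t, lam))
```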
5 Conclusion
We have shown that, despite thermodynamic and average case results to the contrary,
model selection based on the evidence does not optimise performance, even
in the learnable linear case. In addition, numerical studies indicate that for small
systems cross-validation comes closer to optimal performance than the evidence
procedure. However, for large systems the evidence might still be a reasonable alternative
to the computationally expensive cross-validation.
REFERENCES
[1] Bruce A D and Saad D, Statistical mechanics of hypothesis evaluation, J. Phys. A: Math. Gen., Vol. 27 (1994), pp. 3355-3363.
[2] MacKay D J C, Bayesian interpolation, Neural Comp., Vol. 4 (1992), pp. 415-447.
[3] Marion G and Saad D, A statistical mechanical analysis of a Bayesian inference scheme for an unrealizable rule, J. Phys. A: Math. Gen., Vol. 28 (1995), pp. 2159-2171.
[4] Marion G and Saad D, Finite size effects in Bayesian model selection and generalization, J. Phys. A: Math. Gen., Vol. 29 (1996), pp. 5387-5404.
[5] Sollich P, Finite-size effects in learning and generalization in linear perceptrons, J. Phys. A: Math. Gen., Vol. 27 (1994), pp. 7771-7784.
[6] Stone M, An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, J. Roy. Statist. Soc. Ser. B, Vol. 39 (1977), pp. 44-47.
[7] Watkin T L H, Rau A and Biehl M, The statistical mechanics of learning a rule, Reviews of Modern Physics, Vol. 65 (1993), pp. 499-556.
Acknowledgements
The authors wish to thank the Physics Department, University of Edinburgh where
this work was conducted.
TRAINING RADIAL BASIS FUNCTION NETWORKS
BY USING SEPARABLE AND ORTHOGONALIZED
GAUSSIANS
J. C. Mason, I. J. Anderson, G. Rodriguez* and S. Seatzu*
School of Computing and Mathematics, University of Huddersfield,
Queensgate, Huddersfield HD1 3DH, UK
* Department of Mathematics, University of Cagliari, viale Merello 92,
09123 Cagliari, Italy.
Radial basis function (RBF) networks have the great advantage that they may be determined
rapidly by solving a linear least squares problem, assuming that good positions for the radial
centres may be selected in advance. Here it is shown that, if there is some structure in the data,
for example if the data lie on lines, then variables in Gaussian RBFs may be separated and a
near-optimal least squares solution may be obtained rather efficiently. Second, it is shown that
a system of Gaussian RBFs with structured or scattered centres may be orthogonalized over a
continuum or discrete data set, and thereafter the least squares solution is immediate.
Keywords: Gaussians, orthogonalization, RBF, separability.
1 Introduction
Suppose that there are m input data x^{(1)}, ..., x^{(m)}, and that each datum x has d
components x_1, ..., x_d. Suppose that we adopt a radial basis function network with
the centres w^{(1)}, ..., w^{(n)}, given by

w^{(i)} = \bigl(w_1^{(i)}, w_2^{(i)}, \ldots, w_d^{(i)}\bigr)^{\mathrm T}, \qquad i = 1, \ldots, n.

The components w_j^{(i)} of the centres are "weights" in the network, between the input
and hidden layers. Frequently good choices of centres may be made by clustering
techniques [3], and so we assume the w_j^{(i)} to be fixed. Suppose that coefficients c_i, for
i = 1, ..., n, are associated with radial basis (transfer) functions φ_i(r) applied to
the argument r_i = ‖x − w^{(i)}‖.
2.1 Data on a Mesh
For Gaussian RBFs the variables separate: exp(−‖x − w‖²) = exp(−(x₁ − w₁)²) ··· exp(−(x_d − w_d)²).
For two-dimensional data on a rectangular mesh {(x_1^{(k)}, x_2^{(ℓ)}) : k = 1, ..., m_x, ℓ = 1, ..., m_y},
with centres on an n_x × n_y grid, the least squares problem therefore splits into

\sum_{i=1}^{n_x} b_i^{(\ell)}\,\phi\bigl(x_1^{(k)} - w_1^{(i)}\bigr) = f\bigl(x_1^{(k)}, x_2^{(\ell)}\bigr) \qquad (2)

for k = 1, ..., m_x, where, for each i = 1, ..., n_x, we have the (overdetermined)
system

b_i^{(\ell)} = \sum_{j=1}^{n_y} c_{i,j}\,\phi\bigl(x_2^{(\ell)} - w_2^{(j)}\bigr). \qquad (3)

In matrix form (2) reads A_x b^{(ℓ)} = f^{(ℓ)}, where b^{(ℓ)} is the vector of the elements b_i^{(ℓ)} for
i = 1, 2, ..., n_x, and f^{(ℓ)} is the vector of the elements f(x_1^{(k)}, x_2^{(ℓ)}) for k = 1, 2, ..., m_x.
We note that the matrix A_x is independent of ℓ and so it is only necessary to factorize it once.
Similarly, we may express the system (3) as

A_y c^{(i)} = b^{(i)}

for i = 1, 2, ..., n_x. Here, the matrix A_y is independent of i and so, again, it is
only necessary to factorize this matrix once. Thus, the solution of (2) and (3) by
QR factorization can be achieved in O(m_y n_x(m_x + n_y)) operations. This is a great
saving over the O(m_x m_y n_x² n_y²) operations that would be required if the system (1)
of m_x m_y equations were solved by least squares without exploiting any structure.
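The two-stage solution is short in code. The sketch below is our own illustration for d = 2, assuming values f[k, ℓ] on an m_x × m_y mesh, centres on an n_x × n_y grid, and the Gaussian factorisation exp(−‖x − w‖²) = exp(−(x₁ − w₁)²)·exp(−(x₂ − w₂)²):

```python
import numpy as np

def separable_gaussian_fit(x1, x2, f, w1, w2):
    # x1: (m_x,) and x2: (m_y,) mesh coordinates; f: (m_x, m_y) data values;
    # w1: (n_x,) and w2: (n_y,) centre components in each variable.
    Ax = np.exp(-(x1[:, None] - w1[None, :]) ** 2)   # matrix of system (2)
    Ay = np.exp(-(x2[:, None] - w2[None, :]) ** 2)   # matrix of system (3)
    # Stage 1: A_x b^(l) = f^(l) for every line l; one call handles all
    # right-hand sides since A_x does not depend on l.
    B, *_ = np.linalg.lstsq(Ax, f, rcond=None)       # (n_x, m_y)
    # Stage 2: A_y c^(i) = b^(i) for every i; again one call for all i.
    C, *_ = np.linalg.lstsq(Ay, B.T, rcond=None)     # (n_y, n_x)
    return C.T                                       # coefficients c_{i,j}
```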
2.2 Data on Lines
If the data are less structured than a mesh, but do form lines, then fast methods
may still be adopted. Consider lines of data, where the x₂ values are fixed but the x₁
values are scattered (in different positions on each line). The data abscissae may
be written as (x_1^{(k,ℓ)}, x_2^{(ℓ)}), for k = 1, ..., m_ℓ and ℓ = 1, ..., m_y, so the x₁ values
now vary with both k and ℓ. Equations (2) and (3) are still valid, but with x_1^{(k)} and
m_x replaced by x_1^{(k,ℓ)} and m_ℓ. We may therefore still solve (2) and (3) in the least
squares sense. However, the solution is no longer the true least squares solution
of (1). Indeed, the solution of (2) now involves a different matrix for each value of ℓ,
and this also means that the solution is less efficient: O(m_x m_y n_x²) operations are
needed to solve (2), although only O(m_y n_y n_x) operations are still required for (3).
The method we describe here is the Gaussian RBF analogue of the methods of
Clenshaw and Hayes [2] for polynomial approximation and of Anderson, Cox and
Mason [1] for spline approximation of data on lines.
3 Orthogonalized Gaussians
Gaussians can be orthogonalized in a number of ways. Suppose that for data {x} =
{(x₁, ..., x_d)ᵀ} we have centres w^{(i)} for i = 1, ..., n, and Gaussian RBFs φ_i(x) =
exp(−‖x − w^{(i)}‖²) for i = 1, ..., n.
Then a general orthogonalization technique, based on the Gram-Schmidt procedure,
is to form a new basis ψ₁, ..., ψ_n as

\psi_i = \phi_i - \sum_{j=1}^{i-1} \frac{\langle \phi_i, \psi_j \rangle}{\langle \psi_j, \psi_j \rangle}\,\psi_j, \qquad i = 1, \ldots, n.
3.1 Continuum
Let us adopt an inner product over ℝᵈ ≡ (−∞, ∞)ᵈ and define

V \equiv \int_{\mathbb{R}^d} \exp(-\|x\|^2)\,dS \equiv \int_{\mathbb{R}^d} \exp(-\|x - w\|^2)\,dS

for any fixed w. Now the inner product t_{i,j} of two Gaussians is readily calculated by using the following
formula (whose derivation follows from the cosine rule),

\|x - w^{(i)}\|^2 + \|x - w^{(j)}\|^2 = 2\,\bigl\|x - \bar w^{(ij)}\bigr\|^2 + \tfrac12\,\bigl\|w^{(i)} - w^{(j)}\bigr\|^2, \qquad (10)

where \bar w^{(ij)} = (w^{(i)} + w^{(j)})/2. The best least squares approximation

F(x) = \sum_{i=1}^{n} c_i\,\psi_i(x)

is then defined by setting

c_i = \langle F, \psi_i \rangle / n_i, \qquad n_i = \langle \psi_i, \psi_i \rangle.
3.2 Scattered Discrete Data
We might also adopt this inner product for a discrete data set, since it has the
advantage of being data independent. The snag is that we do not then obtain a
diagonal system for determining a least squares approximation on the data set, and
must solve

[F, \psi_j] = \sum_{i=1}^{n} c_i\,[\psi_i, \psi_j]

for j = 1, ..., n, where [f, g] denotes the inner product over the discrete data set.
However, we would expect the matrix with entries [ψ_i, ψ_j] to be "close to diagonal".
Alternatively, if we define a discrete inner product over the data {x^{(k)}},

\langle f, g \rangle \equiv \sum_k f\bigl(x^{(k)}\bigr)\,g\bigl(x^{(k)}\bigr),

and write s_{k,i} = ψ_i(x^{(k)}), then the least squares problem reduces to the normal
equations for the matrix S,
where S is the matrix of the s_{k,i}. Thus we are effectively forming the components of
the normal matrix. The calculation is similar to that of Section 3.1, except that
SᵀS is now used rather than S (hence requiring a more complicated calculation).
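As a concrete illustration of the discrete-data route, the sketch below (ours, not the authors' implementation) orthogonalizes the Gaussian basis over a scattered data set by Gram-Schmidt applied to the columns of S, after which the least squares coefficients are immediate:

```python
import numpy as np

def orthogonalized_fit(X, y, W):
    # X: (m, d) data points; y: (m,) values to fit; W: (n, d) Gaussian centres.
    S = np.exp(-np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2))  # S[k, i] = phi_i(x^(k))
    Psi = S.copy()
    for i in range(Psi.shape[1]):        # Gram-Schmidt over the discrete inner product
        for j in range(i):
            Psi[:, i] -= (Psi[:, j] @ Psi[:, i]) / (Psi[:, j] @ Psi[:, j]) * Psi[:, j]
    # With an orthogonal basis the normal equations are diagonal:
    c = (Psi.T @ y) / np.sum(Psi ** 2, axis=0)
    return Psi, c                        # fitted values: Psi @ c
```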
4 Results
The validity of the procedure of Section 3.2 has been successfully tested by using
orthogonalized RBFs to recognize the ten digits 0, ... ,9 from their pixel patterns -
we do not have sufficient space here to give details, but note that the procedure is
very fast compared with using back propagation procedures and sigmodal approxi-
mations. We have also calculated condition numbers for a variety of RBF matrices
which occur in data fitting, and there are apparent advantages for conditioning in
using orthogonality on a continuum rather than on a discrete data set. However
further work is needed to develop an orthogonalization algorithm which consistently
improves conditioning compared with the use of a conventional Gaussian basis.
REFERENCES
[1] I. J. Anderson, M. G. Cox and J. C. Mason, Tensor-product spline interpolation to data on or near a family of lines, Numerical Algorithms, Vol. 5 (1993), pp. 193-204.
[2] C. W. Clenshaw and J. G. Hayes, Curve and surface fitting, J. Inst. Math. Appl., Vol. 1 (1965), pp. 164-183.
[3] J. D. Mason, R. J. Craddock, J. C. Mason, P. C. Parks and K. Warwick, Towards a stability and approximation theory for neuro-controllers, Control 94, IEE Pub. 389 (1994), pp. 100-103.
ERROR BOUNDS FOR DENSITY ESTIMATION BY
MIXTURES
Ronny Meir and Assaf J. Zeevi
Electrical Engineering Department, Technion, Haifa 32000, Israel.
We consider the problem of estimating a density function from a sequence of N independent and
identically distributed observations x_i taking values in ℝᵈ. The estimation procedure constructs
a convex mixture of 'basis' densities and estimates the parameters using the maximum likelihood
method. Viewing the error as a combination of two terms, the approximation error measuring the
adequacy of the architecture, and the estimation error resulting from the finiteness of the sample
size, we derive upper bounds to the total error, thus obtaining bounds for the rate of convergence.
These results then allow us to derive explicit expressions relating the sample complexity and model
complexity needed for learning.
1 Introduction
There have traditionally been two principal approaches to density estimation, namely
the parametric approach which makes stringent assumptions about the density, and
the nonparametric approach which is essentially distribution free. In recent years,
a new approach to density estimation, often referred to as the method of sieves [2]
has emerged. In this latter approach, one considers a family of parametric models,
where each member of the family is assigned a 'complexity' index in addition to
the parameters. In the process of estimating the density one usually sets out with a
simple model (low complexity index) slowly increasing the complexity of the model
as the need may be. This general strategy seems to exploit the benefits of both the
parametric and the nonparametric approaches, namely fast convergence rates
and universal approximation ability, while not suffering from the drawbacks of
either. In this paper we consider the representation of densities as convex
combinations of 'basis' densities, thus permitting an interpretation as a mixture
model. We split the problem into two separate issues, namely approximation and
estimation, which are discussed at more length in section 2.
2 Statement of the Problem
The problem of density approximation by convex combinations of densities can be
phrased as follows: we wish to approximate a class of density functions, by a convex
combination of 'basis' densities. In this work we consider the class, F, of compactly
supported and continuous a.e density functions. We thus seek linear combinations
of densities of the form
n n
270
Meir f3 Zeevi: Error Bounds for Density Estimation by Mixtures 271
estimation, one usually considers a fixed value of n, assuming that the true density
is a member of the class. In our formulation n is a parameter, whose magnitude
will be bounded.
Another line of recent work is that of function approximation through linear com-
binations of non-linearly parameterized 'basis' functions (for example neural net-
works). The novel feature concerning the representation given in eq. (1), as com-
pared with the function approximation literature, is that we demand the coefficients
α_i to be nonnegative and sum to one, and moreover require the functions φ(x; θ)
to be densities, i.e. φ(x; θ) ≥ 0 and ∫ φ(x; θ) dx = 1.
As discussed above, the establishment of the existence of a good approximating
density f* is only the first step in the estimation procedure. One still needs to
consider an effective procedure, whereby the optimal function can be obtained, at
least in the limit of an infinite amount of data. Assuming the estimation is based on
the finite data set {x;};;' 1 , and denoting the estimated density by in,N,the minimal
requirement (referred to as consistency) is that in,N -+ f* as N -+ 00, where
the convergence takes place in some well defined probabilistic sense. Since we are
interested in this paper in convergence rates, we will in fact make use of stronger
results [4] which actually characterize the rates at which the above convergence
takes place (see section 4).
In summary then, the basic issue we address in this work is related to the rela-
tionship between the approximation and estimation errors and (i) the dimension
of the data, d, (ii) the sample size, N, and (iii) the complexity of the model class
parameterized by n.
3 Preliminaries
In order to discuss approximation of densities we must define an appropriate dis-
tance measure, d(f, g), between densities f and g. A commonly used measure of
discrepancy between densities is the so-called Kullback-Leibler (KL) divergence,
given by d_K(f‖g) = ∫ f log(f/g). As is obvious from the definition, the KL divergence
is not a true distance function, since it is not symmetric. To circumvent this problem
one often resorts to an alternative definition of distance, namely the squared
Hellinger distance d_H²(f, g) = ∫ (√f − √g)², which can be shown to be a true
metric and is particularly useful for problems of density estimation.
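For densities tabulated on a common grid, both discrepancy measures reduce to a few lines of quadrature; the following sketch (ours) makes the definitions concrete:

```python
import numpy as np

def kl_divergence(f, g, dx):
    # d_K(f || g) = int f log(f/g); not symmetric, hence not a true distance.
    mask = f > 0
    return np.sum(f[mask] * np.log(f[mask] / g[mask])) * dx

def hellinger_sq(f, g, dx):
    # d_H^2(f, g) = int (sqrt(f) - sqrt(g))^2; its square root is a true metric.
    return np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx
```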
Using the results of [2] concerning the method of sieves we conclude that under
appropriate conditions on φ, the target density f belongs to the closure of the
convex hull of the set of basis densities Φ = {φ(·; θ)}. The question arises, however,
as to how many terms are needed in the convex combination in order to achieve an
approximation error smaller than some arbitrary ε > 0. The answer to this question
can be obtained using a remarkable lemma of Maurey, which is proved for example
in [1].
4 Main Results
Using the results of Maurey referred to above, it can be shown that given any ε > 0
one can construct a convex combination of densities φ ∈ Φ in such a way that
the total error between an arbitrary density and the model is smaller than ε. We
consider now the problem of estimating a density function from a sequence of d-dimensional
samples {x_i}, i = 1, 2, ..., N, which will be assumed throughout to
be independent and identically distributed according to f(x). As in eq. (1) we let
n denote the number of components in the convex combination. The total number
of parameters will be denoted by m, which in the problem studied here is O(nd).
In the remainder of this section we consider the problem of estimating the pa-
rameters of the density through a specific estimation scheme, namely maximum
likelihood, corresponding to the optimization problem θ̂_{n,N} = argmax_θ L(x^N; θ),
where L(x^N; θ) is the likelihood function, L(x^N; θ) = ∏_k f_n^θ(x_k), with x^N = {x_i}_{i=1}^N
and f_n^θ(x) = Σ_{i=1}^n α_i φ(x; θ_i). We denote the value of f_n^θ evaluated at the maximum
likelihood estimate by f̂_{n,N}. Now, for a fixed value of n, the finite mixture
model f_n^θ may not be sufficient to approximate the density f to the required
accuracy. Thus the model for finite n falls into the so-called class of misspecified
models [4], and the procedure of maximizing L should more properly be referred
to as quasi maximum likelihood estimation; θ̂_{n,N} should accordingly be referred to as
the quasi maximum likelihood estimator. Since the data are assumed to be i.i.d.,
it is clear from the strong law of large numbers that (1/N) log L(x^N; θ) → E log f_n^θ(x)
almost surely as N → ∞, where the expectation is taken with respect to the true
(but unknown) density f(x) generating the examples. From the trivial equality
E log f_n^θ(x) = −d_K(f‖f_n^θ) + E log f(x) we see that the maximum likelihood estimator
θ̂_{n,N} is asymptotically given by θ_n*, where θ_n* = argmin_θ d_K(f‖f_n^θ). For further
discussion see [4].
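In practice the (quasi) maximum likelihood estimate for a mixture of Gaussian 'basis' densities is usually computed with the EM iteration, which keeps the coefficients α_i on the simplex automatically. A minimal one-dimensional sketch (ours; the paper itself does not prescribe an algorithm):

```python
import numpy as np

def em_gaussian_mixture(x, n, iters=200, seed=0):
    # Quasi-ML fit of f(x) ~ sum_i alpha_i N(x; mu_i, s_i^2) by EM.
    rng = np.random.default_rng(seed)
    mu, s, alpha = rng.choice(x, n), np.full(n, x.std()), np.full(n, 1.0 / n)
    for _ in range(iters):
        # E-step: responsibilities r[k, i] = P(component i | x_k)
        dens = alpha * np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate coefficients, centres and widths
        Nk = r.sum(axis=0)
        alpha = Nk / len(x)                              # nonnegative, sums to one
        mu = (r * x[:, None]).sum(axis=0) / Nk
        s = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-12
    return alpha, mu, s
```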
Now, the quantity of interest in density estimation is the distance between the
true density, f, and the density obtained from a finite sample of size N. Using the
previous notation and the triangle inequality for the metric d(·, ·), we have d(f, f̂_{n,N}) ≤
d(f, f_n*) + d(f_n*, f̂_{n,N}). This inequality stands at the heart of the derivation which
follows. We will show that the first term, namely the approximation error, is small.
This demonstration utilizes Maurey's lemma as well as several inequalities. In order
to evaluate the second term, the estimation error, we make use of the results of
White [4] concerning the asymptotic distribution of the quasi maximum likelihood
estimator θ̂_{n,N}. The splitting of the error into two terms in the triangle inequality
is closely related to the expression of the mean squared error in regression as the
sum of the bias (related to the approximation error) and the variance (akin to the
estimation error).
As mentioned above, Maurey's lemma provides us with an existence proof, in the
sense that there exists a parameter value θ₀ such that the error of the combination
(1) is smaller than c/n. Since we are dealing here with a specific estimation scheme,
namely maximum likelihood, which asymptotically approaches a particular parameter
value θ_n*, the question we ask, however, is whether the parameter θ_n* obtained
through the maximum likelihood procedure also gives rise to an approximation error
of the same order as that of θ₀. The answer to this question is affirmative, as
we demonstrate in the next theorem, which is the first main result of this section.
Theorem 2 (Error bound). For sample size N sufficiently large, and given appropriate
smoothness assumptions (see [4]), the expected estimation error
E_D d_H²(f, f̂_{n,N}),
obtained from the quasi maximum likelihood estimator f̂_{n,N}, is bounded as follows:

E_D\bigl[d_H^2(f, \hat f_{n,N})\bigr] \le O\!\Bigl(\frac{c_{\mathcal F,\phi}^2}{n}\Bigr) + O\!\Bigl(\frac{m^*}{N}\Bigr),

where d denotes the data dimension, and m* = Tr(C*I*) with C* and I* given above.
Proof (sketch). The initial step in the proof is to expand d_K(f_n*, f̂_{n,N}) in a first-order
Taylor series with remainder. Some simple algebraic manipulations yield an
approximation in terms of the quasi maximum likelihood estimator and the pseudo-information
matrix. Taking the expectation with respect to the data, we obtain the
bound O(m*/N), utilizing the asymptotic results of White (1982). The final derivation
follows by the triangle inequality and the approximation bound. □
If we take n, N → ∞ so that n/N → 0, the matrix C* converges to the inverse of
the 'true' density Fisher information matrix, which we shall denote by I⁻¹(θ), and
the pseudo-information matrix I* converges to the Fisher information I(θ). This
argument follows immediately from Theorem 2, which ensures the convergence of
the misspecified model to the 'true' underlying density. Therefore their product
converges to the identity matrix, whose dimension is that of the parameter vector,
m = n(d + 2). The bound on the estimation error will therefore be given asymptotically
by

E_D\bigl[d_H^2(f, \hat f_{n,N})\bigr] \le O\!\Bigl(\frac{c_{\mathcal F,\phi}^2}{n}\Bigr) + O\!\Bigl(\frac{m}{N}\Bigr),

which is valid in the limit (N → ∞, n → ∞, n/N → 0). The optimal complexity
index n may then be obtained by minimizing this asymptotic bound with respect to n.
Acknowledgements
The authors thank Robert Adler, Paul Feigin and Allan Pinkus for helpful discussions.
ON SMOOTH ACTIVATION FUNCTIONS
H.N.Mhaskar
Department of Mathematics, California State University
Los Angeles, California, 90032, USA.
We had earlier constructed neural networks which are capable of providing optimal approximation
rates for smooth target functions. The activation functions evaluated by the principal elements
of these networks were infinitely many times differentiable. In this paper, we prove that the
parameters of any network with these two properties must satisfy certain lower bounds. Our results
can also be thought of as providing a rudimentary test for the hypothesis that the unknown target
function belongs to a Sobolev class.
1 Introduction
The notion of a generalized translation network was introduced by Girosi, Jones and
Poggio [4] (generalized regularization networks in their terminology). Let 1 ≤ d ≤ s
be integers, and let φ : ℝᵈ → ℝ be a fixed function. A generalized translation network
with n neurons evaluates a function of the form

x \mapsto \sum_{k=1}^{n} a_k\,\phi(A_k x + b_k), \qquad x \in \mathbb{R}^s, \qquad (1)

where the a_k's are real numbers, the b_k's are vectors in ℝᵈ, and the A_k's are d × s real
matrices. This mathematical model includes the traditional neural networks, where
d = 1, as well as the radial basis function networks, where d = s, the A_k's are all equal
to the s × s identity matrix, and φ is a radial function. Girosi, Jones, and Poggio
have discussed the importance of the more general networks for applications in
computer graphics, robotics, control, image processing, etc., as well as emphasized
the need for a theoretical investigation of the capabilities of these networks for
function approximation.
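The form (1) is immediate to evaluate; the sketch below (ours) implements a generalized translation network with a Gaussian φ, and shows how the choice d = s with identity matrices recovers a radial basis function network:

```python
import numpy as np

def gtn(x, a, A, b, phi=lambda z: np.exp(-np.sum(z ** 2, axis=-1))):
    # Evaluate sum_k a_k phi(A_k x + b_k).
    # x: (s,) input; a: (n,); A: (n, d, s) matrices; b: (n, d) vectors.
    z = np.einsum('kds,s->kd', A, x) + b        # A_k x + b_k for every neuron k
    return np.sum(a * phi(z))

rng = np.random.default_rng(0)
s, n = 3, 5
a, b = rng.normal(size=n), rng.normal(size=(n, s))
A = np.repeat(np.eye(s)[None, :, :], n, axis=0)  # d = s, A_k = I: an RBF network
print(gtn(rng.normal(size=s), a, A, b))
```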
One important reason for using generalized translation networks for function ap-
proximation is to obtain a concrete, trainable model for the typically unknown
target function. Some of the features required of the model are the following. Of
course, the model should approximate the target function within a given margin
of error, utilizing as few neurons as possible. It is also desirable that the function
evaluated by the model be smoother than the target function. Moreover, it is also
expected that the parameters a_k, A_k, b_k of the model remain bounded as the margin
of error approaches zero. In this paper, we prove that these goals are incompatible
with each other: the parameters of a model that evaluates a good approximation,
with φ smoother than the target function, must tend to infinity as the margin of
error tends to zero. In order to motivate this result further, we first review certain
known "positive" results.
It is well known [1], [2], [5], [6] that an arbitrary continuous function of s variables
can be approximated arbitrarily closely on an arbitrary compact subset of ℝˢ by
neural networks. A similar result was established in [8] for the case of generalized
translation networks. A deeper problem in this context is to determine the number
of neurons required to approximate all functions in a given function class within
a given margin of error. Equivalently, one seeks to estimate the degree of approx-
imation of the target function in terms of the number of neurons, n. Since the
target function is typically not known in advance, it is customary to assume that
it belongs to some known function class. A simple assumption is that the function
possesses continuous derivatives up to order r, where r ≥ 1 is some integer. It is
well known [12] that for any function satisfying this condition, there is an algebraic
polynomial of degree not exceeding m which gives a uniform approximation to this
function on [−1, 1]ˢ, with an order of accuracy O(m^{−r}). In terms of the number n
of parameters involved, this order is O(n^{−r/s}). According to a result of DeVore,
Howard, and Micchelli [3], this is asymptotically the best order of approximation
that can possibly be achieved for the entire class of target functions, using any
"reasonable" approximation process involving n parameters. It is an open problem
to determine whether the same degree of approximation can be achieved with generalized
translation networks. One expects that the actual degree of approximation
should depend upon certain growth and smoothness characteristics of the activation
function φ.
This author and Micchelli investigated this problem in detail in [10], starting with
the case when both the target function and the activation function are 2π-periodic.
When φ is a 2π-periodic function, we were able to approximate the trigonometric
monomials by generalized translation networks, with a bound on the accuracy of
approximation in terms of the trigonometric degree of approximation of φ. This
turned out to be a very fruitful idea, enabling us to establish a connection between
the degree of approximation provided by generalized translation networks on one
hand, and the degree of trigonometric polynomial approximation to the target function
and to the activation function φ on the other hand. The general theorem was
applied to the case when φ is not periodic, establishing degree of approximation
theorems for a very wide class of activation functions. As far as we are aware, our
estimates on the degree of approximation by radial basis functions were the first
of their kind, in that the estimates were in terms of the number of function evalu-
ations, rather than a scaling factor. These results were announced in [11]. In [10],
we constructed networks to provide to an optimal recovery of functions, as well as
networks to provide simultaneous approximation of a function and its derivatives.
The idea was also applied in [9] to obtain certain dimension-independent bounds.
Both in [9] and [10], we give explicit constructions of the networks. The results
indicated that both the growth and smoothness of the activation function play a
role in the complexity problem.
In [7], we concentrated on activation functions that are infinitely often differentiable
in an open set, and for which there is at least one point in this open set at which
none of the partial derivatives is zero. Using the ideas in the paper [6] of Leshno, Lin,
Pinkus, and Schocken, we proved that generalized translation networks with such
activation functions provide the optimal degree of approximation for all smooth
functions. We also obtained estimates for the approximation of analytic functions.
The activation functions to which our results are applicable include the squashing
function (1 + e^{−x})^{−1} when d = 1, and the Gaussian function, the thin plate
splines, and multiquadrics when 1 ≤ d ≤ s. We give explicit constructions, and
the matrices A_k and "thresholds" b_k in the networks thus obtained are uniformly
bounded, independently of the degree of approximation desired. Unfortunately, the
coefficients a_k in the networks may grow exponentially fast as the desired degree of
approximation tends to zero.
In this paper, we demonstrate that this phenomenon cannot be avoided if the acti-
vation function is a bounded analytic function in a poly-strip; the coefficients and
the matrices cannot all be bounded independently of the desired degree of approxi-
mation. This fact persists even if φ satisfies less stringent conditions. In particular,
all the special functions φ mentioned above necessarily have this drawback. For the
sake of simplicity, we have presented our results in the context of uniform approx-
imation. They are equally valid when the approximation is considered in an LP
space.
2 Main Results
Let k, m ≥ 1 be integers, and let B be a closed subcube of ℝᵐ (not necessarily
compact). The class of all functions f : ℝᵐ → ℝ having continuous and bounded
partial derivatives of order up to (and including) k on B will be denoted by C_B^k.
In this section, we prove that if the activation function φ ∈ C^ℓ_{ℝᵈ} for some integer
ℓ ≥ 1, and the target function f is not infinitely often differentiable on [−1, 1]ˢ,
then the coefficients a_k in any generalized translation network of the form (1) that
provides a uniform approximation of f on [−1, 1]ˢ must satisfy certain lower bounds.
These lower bounds will be obtained in terms of the norms of the matrices A_k. If
x = (x_1, ..., x_m) ∈ ℝᵐ, we define

|x|_m := \max_{1 \le j \le m} |x_j|. \qquad (2)

If d, s ≥ 1 are integers, and A is a d × s matrix, we define its norm by

\|A\| := \max_{|x|_s \le 1} |Ax|_d. \qquad (3)
In the sequel, c, c₁, ... will denote positive constants which may depend upon fixed
quantities, such as φ, d, and s, and other indicated parameters only.
We prove the following theorem.
Theorem 1. Let 1 ≤ d ≤ s and ℓ, r ≥ 1 be integers, and let α, ε be positive real numbers.
Suppose that φ ∈ C^ℓ_{ℝᵈ}, that f : [−1, 1]ˢ → ℝ, and that f ∉ C^r_{[a,b]} for some [a, b] ⊂ (−1, 1).
Suppose further that for every integer n ≥ 1 there exists a generalized translation network

\mathcal{N}_n(x) := \sum_{k=1}^{n} a_{k,n}\,\phi(A_{k,n}x + b_{k,n}) \qquad (4)

such that

|f(x) - \mathcal{N}_n(x)| \le c\,n^{-\alpha}, \qquad x \in [-1, 1]^s, \qquad (5)

where c is a positive constant that may depend upon f, φ, d, s, and α but is independent
of n. Then there exists a subsequence Λ of integers and a positive constant
c₁ depending on f, φ, d, s, α, and ε such that

\sum_{k=1}^{n} |a_{k,n}| \ge c_1\,\frac{n^{\alpha(1/\ell - 1/(r+\ell))}}{M_n}, \qquad n \in \Lambda, \qquad (6)

where

M_n := \max_{1 \le k \le n} \|A_{k,n}\|. \qquad (7)
A stronger conclusion holds when φ extends to a poly-strip as a bounded, analytic
function; in that case the estimate (6) can be sharpened further.
REGULARISATION BY GAUSSIAN CONVOLUTION
Molina and Niranjan
1 Introduction
Nowadays generalisation and regularisation are two of the most challenging topics
in neural computation. By generalisation it is generally understood that, from a
given training set consisting of input-output observations {x_i, y_i}, i = 1, ..., n,
of an underlying event F*, such that F*(x_i) = y_i, it is desired to construct an
estimated map F which, for a new test set of input observations {x_j}, will provide
a good prediction for the unobserved output observations {y_j}.
One way of achieving generalisation, known as regularisation or smoothing, is to
find the mapping F by means of a neural network, subject to some constraints on
the solution. One possible constraint is to limit the number of units in the neural
network, or to employ pruning techniques during learning, in order to limit the
degrees of freedom of F and thus avoid overfitting of the observations (see [1] for a
survey). A second class of regularisation techniques involves the determination of a
mapping F in the d-dimensional Hilbert space H of functions with continuous first
derivatives and square integrable second derivatives which minimises

\mathrm{MSE} = \frac{1}{n} \sum_{j=1}^{n} \bigl(y_j - F(x_j)\bigr)^2 \qquad (1)

together with the smoothness functional

\int \bigl\| \nabla^2 F(x) \bigr\|^2\,dx \qquad (2)

for different values of a regularisation parameter κ (see [2] and references contained
therein). Equation (1) is known as the Mean Squared Error (MSE) of the training
set, and (2) as the regularisation constraint, which is equal to the energy of the
second derivative of the mapping F. If κ is made large then F will just interpolate
the training data and, conversely, if it is chosen very small then the mapping, although
smooth, will not represent the underlying event F*. A good regularisation is
achieved by adapting κ to reach a convenient tradeoff between the fit and the
smoothness of the solution. Normally this is achieved by choosing different values
for κ, training under these constraints, and testing the generalisation error results
on a test set by cross-validation. Although this solution may generalise well, it is
computationally expensive and needs retraining for each of the possible values of κ.
Moreover, the final results may be biased by the capacity of the neural network or
by the ability of the training rule to escape from local minima. In this paper we propose
a new regularisation technique, which does not need retraining, based on convolving
a Gaussian Radial Basis Function (GaRBF) network, after training, with gaussian filters.
2 Convolution of a GaRBF Network with Gaussian Filters
Writing the Fourier transforms of a GaRBF unit and of the gaussian filter g(x, σ)
as (5) and (6), and taking the inverse Fourier transform of their product, a new
GaRBF of the same location but different amplitude and width is obtained (7).
The convolution of a whole GaRBF network with the filter g(x, σ) is a straightforward
consequence of equation (7): it is equal to a new GaRBF network with the same
number of units, located at the same positions, but with different amplitudes and
widths. In the particular context
of regularisation by convolution, it is of great importance to keep the convolved
function h(τ) as close as possible to the original F(x), except at those points where
high frequencies have been filtered out. This may be achieved by requiring the L1-norm
of the GaRBF network to remain constant. A good alternative is to calculate the
L1-norm of the gaussian filter,

\int_{\mathbb{R}^d} e^{-\|x\|^2/\sigma^2}\,dx = \sigma^d\,\pi^{d/2}, \qquad (9)
and then convolve the network with the correspondingly normalised filter, which
leads to the same result as normalising the GaRBF network itself. Thus the L1-normalised
convolution F(x) ⊗ g(x, σ) of a GaRBF network F(x) and a gaussian filter g(x, σ),
which retains the norm of the original network, may be stated as

F(x) \otimes g(x, \sigma) \doteq \frac{1}{\sigma^d\,\pi^{d/2}} \int_{-\infty}^{\infty} F(x)\,g(\tau - x, \sigma)\,dx. \qquad (10)
3 The Regularising Technique
This section describes the regularising technique based on the convolution of GaRBF
units with gaussian filters, developed in the previous section. The technique is
applied to a pre-trained GaRBF network as defined in equation (3), and uses the
Root Mean Squared (RMS) error obtained on a cross-validation set after regularisation
as a performance criterion. The best regularising gaussian filter is found by
binary search, under the assumption that the RMS error has no local minima in
the search space. Reasonable bounds for the binary search are easily obtained
from the bounded space on which the underlying function F* lies: the minimal
bound for the filter width is zero and the maximal bound is the maximal distance
between two points of the bounded space H.
The structure of the related regularisation algorithm and its initialisation is sketched
below.
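A minimal sketch of such a search (ours; it assumes a routine rms_error(sigma) that convolves the trained network with the filter of width σ, as above, and evaluates the RMS error on the cross-validation set). We use a golden-section search, a close cousin of the binary search described, which is the standard way to minimise a function assumed to have a single minimum on an interval:

```python
def best_filter_width(rms_error, lo, hi, tol=1e-3):
    # Golden-section search for the filter width minimising the RMS error
    # on the cross-validation set, assuming no local minima in [lo, hi].
    g = 0.5 * (5 ** 0.5 - 1)
    a, b = lo, hi
    x1, x2 = b - g * (b - a), a + g * (b - a)
    f1, f2 = rms_error(x1), rms_error(x2)
    while b - a > tol:
        if f1 < f2:                       # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - g * (b - a)
            f1 = rms_error(x1)
        else:                             # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + g * (b - a)
            f2 = rms_error(x2)
    return 0.5 * (a + b)
```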
4 Numerical Results
In order to illustrate the performance of regularisation by convolution, we have
applied the algorithm to Wahba's synthetic problem [2]. This problem consists of
the regression of a noisy function generated synthetically according to the model
F*(x) = 4.26(e^{−x} − 4e^{−2x} + 3e^{−3x}) + ν, where ν is normally distributed random
noise with zero mean and standard deviation 0.2. In the original problem, 100 noisy
observations were generated and used as the learning set for the training of a sigmoid
feed-forward neural network. Wahba performed regularisation during training using
equation (2), and the κ value which provided the lowest RMS error was obtained by
cross-validation.
1 Introduction
We will study here the problem of the prediction of nonlinear dynamical systems
from a set of sampled measurements of the state of the system. The nonlinear
systems which we will consider are governed by a multidimensional ordinary dif-
ferential equation. Although the dynamics of the system is continuous in time, the
samples that we will use for prediction are the state variables measured at discrete
time-intervals only. We are therefore led to postulate a discrete-time form for our
prediction: x(k + 1) = F(x(k)). In the first section, we will show - using the fact
that the underlying dynamics of the system is time-continuous - that the function
F used for prediction should be invertible. The problem then becomes that of build-
ing an approximation scheme specifically designed for invertible maps. The main
drawback of standard approximation schemes is that they are based on summing
basis functions to build an approximation. However, this operation does not pre-
serve invertibility. Using the operation of composition instead of addition, we are
able to guarantee invertibility. We will thus build approximation schemes based
on composition of invertible basis functions. In the second section, we will analyze
those compositions within the framework of Lie algebra theory. Starting from a
method from numerical analysis we will be able to derive a new approach to the
prediction of nonlinear systems. In section three, this new method will be presented
in a neural-like form and illustrated by an example. We will also present a general
model for approximation which we call "MLP in dynamics space". Finally, we will
conclude.
1.1 Invertibility of Dynamical Systems
The systems considered here are defined by an ordinary differential equation (ODE)
on some n-dimensional manifold M. We follow the differential geometric description
of dynamical systems [2] and assume the standard smoothness assumptions. For
each initial condition x₀, we can solve the initial value problem. We then collect all
those solutions in one function Φ(x₀, t), which gives the solution of the ODE as a
function of the initial condition and of time.
Since standard approximation schemes are based on the summation of some basis
functions, this operation does not preserve the property of being a diffeomorphism. On
the other hand, composition does preserve this property, so we propose to build an
approximation scheme based on composing basis functions instead of adding them.
2 Lie Algebra Theory
Lie algebra theory has a long history in physics, mostly in the areas of classical
mechanics [1] and partial differential equations [8]. It is also an essential part of
nonlinear system theory [5] . Our approach to system prediction can be best cast in
this framework. A Lie algebra is a vector space where we define a supplementary
operation: the bracket [., .] of two elements of the algebra. The bracket is an oper-
ation which is bilinear, antisymmetric and satisfies the Jacobi identity [8]. In the
case where the Lie algebra is the vector space of all C^r vector fields, the time-t map
Φ_t plays a very important role. If the vector field is A, we define the exponential
map as follows: exp(t·A) ≡ Φ_t. This notation is an extension of the case where the
vector field is linear in space and where the solution is given by exponentiation of
the matrix. The exponential is thus a mapping from the manifold into itself, and
this map depends on the underlying vector field A and changes over time. From
here on, the product of exponentials will denote the composition of the maps. One
can then define the Lie bracket by
[A, B] = \frac{\partial^2}{\partial s\,\partial t}\Big|_{t=s=0} \exp(-sB)\,\exp(-tA)\,\exp(sB)\,\exp(tA) \qquad (7)
In the case where the manifold is ℝⁿ, the bracket can be further particularized to

[A, B]_i = \sum_{k=1}^{n} \Bigl( B_k\,\frac{\partial A_i}{\partial x_k} - A_k\,\frac{\partial B_i}{\partial x_k} \Bigr). \qquad (8)
The one-step prediction is then the time-δt flow, x(k + 1) = exp(δt·X)[x(k)]. This
problem has recently been the focus of much attention in the field of numerical
analysis for the integration of Hamiltonian differential equations [6], [7]. Suppose
that the vector field X is of the form X = A + B, where we can integrate A and B
directly. We can use the BCH formula to produce a first-order approximation to
the exponential map:

\mathrm{BCH:}\quad \exp(\delta t\,X) = \exp(\delta t\,A)\,\exp(\delta t\,B) + O(\delta t^2) \qquad (10)
This is the essence of the method, as it shows that one can approximate an exponential
map (that is, the map arising from the solution of an ODE) by composing
simpler maps. By repeated use of the BCH formula, we can show that the following
leapfrog scheme is second-order:

\mathrm{Leapfrog:}\quad \exp(\delta t\,X) = \exp\Bigl(\frac{\delta t}{2}A\Bigr)\,\exp(\delta t\,B)\,\exp\Bigl(\frac{\delta t}{2}A\Bigr) + O(\delta t^3) \qquad (11)

Using this leapfrog scheme as a basis element for further leapfrog schemes, Yoshida
[9] showed that it was possible to produce an approximation to exp(δt·X) up to
any order. Forest and Ruth [3] showed that approximations could be built for more
than two vector fields. Combining the two, we can state that it is possible to build an
approximation to the solution of a linear combination of vector fields as a product
of exponential maps:

X = \sum_{i=1}^{m} a_i A_i \;\Longrightarrow\; \exists\, w_{ij}:\ \exp(\delta t\,X) = \prod_{j}\prod_{i=1}^{m} \exp(w_{ij}\,\delta t\,A_i) + O(\delta t^{p+1}). \qquad (12)
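To make the composition concrete, the sketch below (ours) applies the leapfrog scheme (11) to the toy split X = A + B with A: (x, v) ↦ (v, 0) and B: (x, v) ↦ (0, −x), whose exponential maps are known exactly; composing the exact sub-flows reproduces the harmonic oscillator flow to second order:

```python
import numpy as np

def exp_A(state, t):      # exact flow of A: dx/dt = v, dv/dt = 0
    x, v = state
    return np.array([x + t * v, v])

def exp_B(state, t):      # exact flow of B: dx/dt = 0, dv/dt = -x
    x, v = state
    return np.array([x, v - t * x])

def leapfrog(state, dt):  # exp(dt/2 A) exp(dt B) exp(dt/2 A), cf. eq. (11)
    return exp_A(exp_B(exp_A(state, dt / 2), dt), dt / 2)

state = np.array([1.0, 0.0])
for _ in range(1000):                        # integrate up to t = 10
    state = leapfrog(state, 0.01)
print(state, [np.cos(10.0), -np.sin(10.0)])  # close to the exact solution
```

In the network of section 3, fixed sub-flows such as exp_A and exp_B play the role of the basis maps, and the step weights w_{ij} multiplying δt are the tunable parameters.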
3 Network Implementation
Such an approximation scheme can easily be implemented as a multilayer network,
as can be seen in Fig. 2. The problem of predicting the system
can now simply be solved by minimizing the prediction error of the model by tuning
the weights w_{ij} using some gradient-based optimization technique.
3.1 Example
We will now present an example to illustrate how this method can be applied. We
will do this by looking at the Van der Pol oscillator. This system can be written as
a first-order system in state-space form, which can be seen as a linear combination
of two vector fields A₁, A₂, each of which can be solved analytically.
Acknowledgements
This research work was carried out at the ESAT laboratory of the Katholieke Uni-
versiteit Leuven, in the framework of the Concerted Action Project of the Flemish
Community GOA-MIPS, and within the framework of the Belgian program on
inter-university attraction poles (IUAP-17) initiated by the Belgian State, Prime
Minister's Office for Science. The scientific responsibility rests with its authors.
Yves Moreau is a research assistant of the N.F.W.O. (Belgian National Fund for
Scientific Research).
STOCHASTIC NEURODYNAMICS AND THE SYSTEM
SIZE EXPANSION
Toru Ohira and Jack D. Cowan*
Sony Computer Science Laboratory 3-14-13 Higashi-gotanda, Shinagawa,
Tokyo 141, Japan. Email: [email protected]
* Departments of Mathematics and Neurology, The University of Chicago,
Chicago, IL 60637, USA. Email: [email protected]
We present here a method for the study of stochastic neurodynamics in the master equation frame-
work. Our aim is to obtain a statistical description of the dynamics of fluctuations and correlations
of neural activity in large neural networks. We focus on a macroscopic description of the network
via a master equation for the number of active neurons in the network. We present a systematic
expansion of this equation using the "system size expansion". We obtain coupled dynamical equations
for the average activity and for the fluctuations around this average. These equations exhibit
non-monotonic approaches to equilibrium, as seen in Monte Carlo simulations.
Keywords: stochastic neurodynamics, master equation, system size expansion.
1 Introduction
The correlated firing of neurons is considered to be an integral part of informa-
tion processing in the brain [12, 2]. Experimentally, cross-correlations are used to
study synaptic interactions between neurons and to probe for synchronous net-
work activity. In theoretical studies of stochastic neural networks, understanding
the dynamics of correlated neural activity requires one to go beyond the mean field
approximation that neglects correlations in non- equilibrium states[8, 6]. In other
words, we need to go beyond the simple mean-field approximation to study the
effects of fluctuations about average firing activities.
Recently, we have analyzed stochastic neurodynamics using a master equation [5, 8].
A network comprising binary neurons with asynchronous stochastic dynamics is
considered, and a master equation is written in "second quantized form" to take
advantage of the theoretical tools that then become available for its analysis. A
hierarchy of moment equations is obtained and a heuristic closure at the level of
second moment equations is introduced. Another approach based on the master
equation via path integrals, and the extension to neurons with a refractory state
are discussed in [9, 10].
In this paper, we introduce another master equation based approach to go beyond
the mean field approximation. We concentrate on the macroscopic behavior of a
network of two-state neurons, and introduce a master equation for the number of
active neurons in the network at time t. We use a more systematic expansion of
the master equation than hitherto, the "system size expansion" [11]. The expansion
parameter is the inverse of the total number of the neurons in the network. We
truncate the expansion at second order and obtain an equation for fluctuations
about the mean number of active neurons, which is itself coupled to the equation
for the average number of active neurons at time t. These equations show non-
monotonic approaches to equilibrium values near critical points, a feature which is
not seen in the mean field approximation. Monte Carlo simulations of the master
equation itself show qualitatively similar non-monotonic behavior.
2 Master Equation and the System Size Expansion
We first construct a master equation for a network comprising N binary elements
with two states, "active" or firing and "quiescent" or non-firing. The transitions
between these states are probabilistic, and we assume that the transition rate from
active to quiescent is a constant α for every neuron in the network. We do not
make any special assumption about network connectivity, but assume that it is
"homogeneous", i.e., all neurons are statistically equivalent with respect to their
activities, which depend only on the proportion of active neurons in the network.
More specifically, the transition rate from quiescent to active is given as a function
φ of the number of active neurons in the network. Taking the firing time to be
about 2 ms, we have α ≈ 500 s⁻¹. For the transition rate from quiescent to active,
the range of the function φ is ≈ 30-100 s⁻¹, reflecting empirically observed firing
rates of cortical neurons. With these assumptions, one can write a master equation
as follows:
-\frac{\partial}{\partial t} P_N[n, t] = \alpha\bigl(n P_N[n, t] - (n+1) P_N[n+1, t]\bigr)
+ \frac{n}{N}\Bigl(1 - \frac{n}{N}\Bigr)\phi\Bigl(\frac{n}{N}\Bigr) P_N[n, t]
- \frac{n-1}{N}\Bigl(1 - \frac{n-1}{N}\Bigr)\phi\Bigl(\frac{n-1}{N}\Bigr) P_N[n-1, t], \qquad (1)
where P_N[n, t] is the probability that the number of active neurons is n at time t.
(We have absorbed the parameter representing the total synaptic weight into the function
φ.) This master equation can be deduced from the second quantized form cited
earlier, which will be discussed elsewhere. The standard form of this equation can
be rewritten by introducing the "step operator" E, defined by the following action on
an arbitrary function of n:

\mathcal{E} f(n) = f(n+1), \qquad \mathcal{E}^{-1} f(n) = f(n-1) \qquad (2)

In effect, E and E⁻¹ shift n by one; using such step operators, Eq. (1) can be rewritten
in a compact standard form.
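The master equation (1) is also easy to simulate directly, which is how the Monte Carlo comparisons mentioned in the introduction can be carried out. A Gillespie-style sketch (ours, with an assumed sigmoidal φ and the activation flux taken simply as (N − n)φ(n/N); the chapter's eq. (1) uses a slightly different weighting):

```python
import numpy as np

def simulate_network(N=100, alpha=1.0, beta=15.0, theta=0.5, t_max=15.0, seed=0):
    # Event-driven simulation of the two-state network: deactivation at rate
    # alpha per active neuron, activation at a rate depending on n/N.
    rng = np.random.default_rng(seed)
    phi = lambda u: 1.0 / (1.0 + np.exp(-beta * (u - theta)))   # assumed form
    n, t, traj = N // 2, 0.0, []
    while t < t_max:
        r_down = alpha * n                   # active -> quiescent
        r_up = (N - n) * phi(n / N)          # quiescent -> active
        total = r_down + r_up
        t += rng.exponential(1.0 / total)    # exponential waiting time
        n += -1 if rng.random() < r_down / total else 1
        traj.append((t, n / N))
    return traj
```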
Figure 1. Comparison of solutions of (A) the macroscopic equation, and (B) equations (23)
and (24). The parameters are set at β = 15, θ = 0.5 and α = (a) 0.2, (b) 0.493,
and (c) 0.9. The initial conditions are X = 0.5, and η = (A) 0.01 and (B) 0.
3 Discussion
We have here outlined an application of the system size expansion to a master
equation for stochastic neural network activity. It produced a dynamical equation
for the fluctuations about mean activity levels, the solutions of which showed a non-monotonic
approach to such levels near a critical point. This has been seen in model
networks with low connectivity. Two issues raised by this approach require further
comment: (1) In this work we have used the number of neurons in the network
as the expansion parameter. Given the observation that the overall connectedness
affects the stochastic dynamics, a parameter representing the average connectivity
per neuron may be better suited as an expansion parameter. We note that this
parameter is typically small for biological neural networks. (2) There are many
studies of Hebbian learning in neural networks [1, 4, 7].
UPPER BOUND ON BAYESIAN ERROR BARS
Qazaz et al.
1 Introduction
Many applications of neural networks are concerned with the prediction of one or
more continuous output variables, given the values of a number of input variables.
As well as predictions for the outputs, it is also important to provide some measure
of uncertainty associated with those predictions.
The Bayesian view of regression leads naturally to two contributions to the error
bars. The first arises from the intrinsic noise on the target data, while the second
comes from the uncertainty in the values of the model parameters as a consequence
of having a finite training data set [1, 2]. There may also be a third contribution
which arises if the true function is not contained within the space of models under
consideration, although we shall not discuss this possibility further.
In this paper we focus attention on a class of universal non-linear approximators
constructed from linear combinations of fixed non-linear basis functions, which we
shall refer to as generalized linear regression models. We first review the Bayesian
treatment of learning in such models, as well as the calculation of error bars [3].
Then, by considering the contributions arising from individual data points, we pro-
vide insight into the nature of the error bars and their dependence on the location
of the data in input space. This in turn leads to the key result of the paper which is
an upper bound on the true error bars expressed in terms of the single-data-point
contributions. Our analysis is very general and is independent of the particular form
of the basis functions.
2 Bayesian Error Bars
We are interested in the problem of predicting the value of a noisy output variable
t given the value of an input vector x. Throughout this paper we shall restrict
attention to regression for a single variable t since all of the results can be extended
in a straightforward way to multiple outputs. To set up the Bayesian formalism we
begin by defining a model for the distribution of t conditional on x. This is most
commonly chosen to be a Gaussian function in which the mean is governed by the
output y(x; w) of a network model, where w is a vector of adaptive parameters
(weights and biases). Thus we have
p(t \mid x, w) = \frac{1}{(2\pi\sigma_\nu^2)^{1/2}} \exp\Bigl\{ -\frac{(y(x; w) - t)^2}{2\sigma_\nu^2} \Bigr\} \qquad (1)
We can then combine the likelihood function and the prior using Bayes' theorem to
obtain the posterior distribution of weights, given by p(w|D) = p(D|w)p(w)/p(D).
The predictive distribution of t given a new input x can then be written in terms
of the posterior distribution in the form

p(t \mid x, D) = \int p(t \mid x, w)\,p(w \mid D)\,dw. \qquad (4)

We consider models in which the output is a linear combination of fixed basis functions,

y(x; w) = \sum_{j=1}^{M} w_j\,\phi_j(x) = w^{\mathrm T}\phi(x), \qquad (5)

which we shall call generalized linear regression models. Here the φ_j(x) are a set
of fixed non-linear basis functions, with generally one of the basis functions φ₁ =
1, so that w₁ plays the role of a bias parameter. Such models possess universal
approximation capabilities for reasonable choices of the <Pj (x), while having the
advantage of being linear in the adaptive parameters w.
Since (5) is linear in w, both the noise model p(t|x, w) and the posterior distribution
p(w|D) will be Gaussian functions of w. It therefore follows that, for a Gaussian
prior of the form (2), the integral in (4) will be Gaussian and can be evaluated
analytically to give a predictive distribution p(t|x, D) which will be a Gaussian
function of t. The mean of this distribution is given by y(x; w_MP), where w_MP is
found by minimizing the regularized error function

\frac{1}{2\sigma_\nu^2} \sum_{n=1}^{N} \bigl(\phi^{\mathrm T}(x^n)w - t^n\bigr)^2 + \frac{1}{2}\,w^{\mathrm T}Sw \qquad (6)

and is therefore given by the solution of the following linear equations

A\,w_{\mathrm{MP}} = \sigma_\nu^{-2}\,\Phi^{\mathrm T}t, \qquad (7)

where t is a column vector with elements tⁿ, and A is the Hessian matrix given by

A = \frac{1}{\sigma_\nu^2} \sum_{n=1}^{N} \phi(x^n)\,\phi^{\mathrm T}(x^n) + S = \frac{1}{\sigma_\nu^2}\,\Phi^{\mathrm T}\Phi + S \qquad (8)

and Φ is the N × M design matrix with elements Φ_{nj} = φ_j(xⁿ). Solving for w_MP
and substituting into (5) we obtain the following expression for the corresponding
network output

y(x; w_{\mathrm{MP}}) = \sigma_\nu^{-2}\,\phi^{\mathrm T}(x)\,A^{-1}\Phi^{\mathrm T}t. \qquad (9)
The covariance matrix for the posterior distribution p(wID) is given by the inverse
of the Hessian matrix. Together with (4) this implies that the total variance of the
output predictions is given by
\sigma^2(x) = \sigma_\nu^2 + \sigma_w^2(x) = \sigma_\nu^2 + \phi^T(x) A^{-1} \phi(x) \qquad (10)
Here the first term represents the intrinsic noise on the target data, while the second
term arises from the uncertainty in the weight values as a consequence of having a
finite data set.
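These quantities are straightforward to evaluate numerically. The following minimal sketch, assuming Gaussian basis functions of width 0.07 (as in the Figure 1 caption) and an isotropic prior S = I (our own choice; the paper does not specify S), computes the Hessian (8), the most probable weights (7) and the total variance (10):

```python
import numpy as np

M = 30                              # number of basis functions, as in Figure 1
sigma_nu, width = 0.1, 0.07         # noise level (illustrative) and basis width
centres = np.linspace(0.0, 1.0, M)

def phi(x):
    """Design vector of M Gaussian basis functions evaluated at x."""
    return np.exp(-(x - centres) ** 2 / (2 * width ** 2))

X = np.array([0.3, 0.5])            # the two data points of Figure 1
t = np.sin(2 * np.pi * X)           # arbitrary targets (they do not affect the bars)

Phi = np.stack([phi(x) for x in X])                   # design matrix, Phi[n, j] = phi_j(x_n)
S = np.eye(M)                                         # isotropic prior (assumption)
A = Phi.T @ Phi / sigma_nu ** 2 + S                   # Hessian, equation (8)
w_mp = np.linalg.solve(A, Phi.T @ t) / sigma_nu ** 2  # equation (7)

xs = np.linspace(0.0, 1.0, 200)
Phis = np.stack([phi(x) for x in xs])
y_mean = Phis @ w_mp                                  # predictive mean, equation (9)
A_inv = np.linalg.inv(A)
var = sigma_nu ** 2 + np.einsum('ij,jk,ik->i', Phis, A_inv, Phis)  # equation (10)
```

The variance curve reproduces the qualitative behaviour of Figure 1: it is pulled down towards the noise level near x = 0.3 and x = 0.5 and returns to the prior level away from the data.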
3 An Upper Bound on the Error Bars
We first consider the behaviour of the error bars when the data set consists of a
single data point. As well as providing important insights into the nature of the
error bars, it also leads directly to an upper bound on the true error bars.
In the absence of data, the variance is given from (8) and (10) by
\sigma^2(x) = \sigma_\nu^2 + \phi^T(x) S^{-1} \phi(x) \qquad (11)
where the second term, due to the prior, is typically much larger than the noise
term \sigma_\nu^2. If we now add a single data point located at x^0 then the Hessian becomes
S + \sigma_\nu^{-2} \phi(x^0) \phi^T(x^0). To find the inverse of the Hessian we make use of the identity
(M + v v^T)^{-1} = M^{-1} - \frac{(M^{-1} v)(v^T M^{-1})}{1 + v^T M^{-1} v} \qquad (12)
which is easily verified by multiplying both sides by (M + v v^T). The variance at
an arbitrary point x for a single data point at x^0 is then given by
\sigma^2(x) = \sigma_\nu^2 + C(x, x) - \frac{C(x, x^0)^2}{\sigma_\nu^2 + C(x^0, x^0)} \qquad (13)
where we have defined the prior covariance function
C(x, x') = \phi^T(x) S^{-1} \phi(x') \qquad (14)
The first two terms on the right hand side of (13) represent the variance due to
the prior alone, and we see that the effect of the additional data point is to reduce
the variance from its prior value, as illustrated for a toy problem in Figure 1. From
(13) we see that the length scale of this reduction is related to the prior covariance
function C(x, x').
If we evaluate \sigma^2(x) in (13) at the point x^0 then we can show that the error
bars satisfy the upper bound \sigma^2(x^0) \leq 2\sigma_\nu^2. Since the noise level is typically much
less than the prior variance level, we see that the error bars are pulled down very
substantially in the neighbourhood of the data point. Again, this is illustrated in
Figure 1.
We now extend this analysis to provide an upper bound on the error bars. Suppose
we have a data set consisting of N data points (at arbitrary locations) and we add
an extra data point at x^{N+1}. Using (8) the Hessian A_{N+1} for the N + 1 data points
can be written in terms of the corresponding Hessian A_N for the original N data
points in the form
A_{N+1} = A_N + \sigma_\nu^{-2} \phi(x^{N+1}) \phi^T(x^{N+1}) \qquad (15)
Using the identity (12) we can now write the inverse of AN+l in the form
A_{N+1}^{-1} = A_N^{-1} - \frac{A_N^{-1} \phi(x^{N+1}) \phi^T(x^{N+1}) A_N^{-1}}{\sigma_\nu^2 + \phi^T(x^{N+1}) A_N^{-1} \phi(x^{N+1})} \qquad (16)
Substituting this result into (10) we obtain
\sigma_{N+1}^2(x) = \sigma_N^2(x) - \frac{\left[ \phi^T(x^{N+1}) A_N^{-1} \phi(x) \right]^2}{\sigma_\nu^2 + \phi^T(x^{N+1}) A_N^{-1} \phi(x^{N+1})} \qquad (17)
From (8) we see that the Hessian A_N is positive definite, and hence its inverse will
be positive definite. It therefore follows that the second term on the right hand side
of (17) is negative, and so we obtain
\sigma_{N+1}^2(x) \leq \sigma_N^2(x) \qquad (18)
This represents the intuitive result that the addition of an extra data point cannot
lead to an increase in the magnitude of the error bars. Repeated application of this
result shows that the error bars due to a set of data points will never be larger than
the error bars due to any subset of those data points.
It can also be shown that the average change in the error bars resulting from the
addition of an extra data point satisfies the bound

\langle \Delta \sigma^2(x) \rangle \equiv \frac{1}{N} \sum_{n=1}^{N} \left[ \sigma_{N+1}^2(x_n) - \sigma_N^2(x_n) \right] \geq -\frac{\sigma_\nu^2}{N} \qquad (19)
A further corollary of the result (18) is that, if we consider the error bars due
to each of a set of N data points individually, then the envelope of those error
bars constitutes an upper bound on the true error bars. This is illustrated with a
toy problem in Figure 1. The contributions from the individual data points are
easily evaluated using (13) and (14) since they depend only on the prior covariance
function and do not require evaluation or inversion of the Hessian matrix.
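Since the single-point bars (13) involve only the prior covariance (14), the envelope bound is cheap to compute. A minimal sketch, under the same assumptions as the previous one:

```python
import numpy as np

M = 30
centres = np.linspace(0.0, 1.0, M)
width, sigma_nu = 0.07, 0.1
S_inv = np.eye(M)                   # S = I, so S^{-1} = I (assumption as before)

def phi(x):
    return np.exp(-(x - centres) ** 2 / (2 * width ** 2))

def C(x, xp):
    """Prior covariance function, equation (14)."""
    return phi(x) @ S_inv @ phi(xp)

def single_point_var(x, x0):
    """Variance (13) when the data set consists of the single point x0."""
    return sigma_nu ** 2 + C(x, x) - C(x, x0) ** 2 / (sigma_nu ** 2 + C(x0, x0))

data = [0.3, 0.5]
xs = np.linspace(0.0, 1.0, 200)
# By repeated application of (18), the pointwise minimum over the single-point
# bars upper-bounds the error bars computed from the full Hessian.
envelope = np.min([[single_point_var(x, x0) for x in xs] for x0 in data], axis=0)
```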
4 Summary
In this paper we have explored the relationship between the magnitude of the
Bayesian error bars and the distribution of data in input space. For the case of a
single isolated data point we have shown that the error bar is pulled down close to
the noise level, and that the length scale over which this effect occurs is characterized
by the prior covariance function. From this result we have derived an upper bound
on the error bars, expressed in terms of the contributions from individual data
points.
Figure 1 A simple example of error bars for a one-dimensional input space and
a set of 30 equally spaced Gaussian basis functions with standard deviation 0.07.
There are two data points at x = 0.3 and x = 0.5 as shown by the crosses. The
solid curve at the top shows the variance \sigma^2(x) due to the prior, the dashed curves
show the variance resulting from taking one data point at a time, and the lower
solid curve shows the variance due to the complete data set. The envelope of the
dashed curves constitutes an upper bound on the true error bars, while the noise
level (shown by the lower dashed curve) constitutes a lower bound.
REFERENCES
[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, (1995).
[2] D. J. C. MacKay. Bayesian interpolation. Neural Computation, Vol. 4(3) (1992), pp415-447.
[3] C. K. I. Williams, C. Qazaz, C. M. Bishop, and H. Zhu. On the relationship between Bayesian
error bars and the input data density, In Proceedings Fourth IEE International Conference on
Artificial Neural Networks (1995), pp160-165, Cambridge, UK, IEE.
Acknowledgements
This work was supported by EPSRC grant GR/K51792 Validation and verification
of neural network systems.
CAPACITY BOUNDS FOR STRUCTURED NEURAL
NETWORK ARCHITECTURES
Peter Rieper, Sabine Kröner* and Reinhard Moratz**
FB Mathematik, Universität Hamburg, Bundesstr. 55,
D-20146 Hamburg, Germany.
* Technische Informatik I, TU Hamburg-Harburg, Harburger Schloßstr. 20,
D-21071 Hamburg, Germany. Email: [email protected]
** AG Angewandte Informatik, Universität Bielefeld, Postfach 100131,
D-33501 Bielefeld, Germany.
1 Introduction
Structured multi-layer feedforward neural networks are gaining more and more importance
in speech and image processing applications. Their characteristic is that a-priori
knowledge about the task to be performed is already built into their architecture
by use of nodes with shared weight vectors. Examples are time delay neural net-
works [10] and networks for invariant pattern recognition [4, 5]. One problem in
the training of neural networks is the estimation of the number of training sam-
ples needed to achieve good generalization. In [1] it is shown that for feedforward
architectures this number is correlated with the capacity or Vapnik-Chervonenkis
dimension of the architecture. So far an upper bound for the capacity has been de-
rived for two-layer feedforward architectures with independent weights: it depends
as O((w/a) \cdot \ln q) on the number w of connections in the architecture with q nodes
and a output elements.
and a output elements. In this paper we focus on the calculation of upper bounds
for the capacity of structured multi-layer feedforward neural architectures. First we
give some definitions and introduce a new general terminology for the description
of structured neural networks. In section 3 we apply this terminology to structured
feedforward architectures first with one layer then with multiple layers. We show
that they can be transformed into equivalent conventional multi-layer feedforward
architectures. By extending known estimates for the capacity we obtain upper
bounds for the capacity of structured neural architectures which increase with the
number of independent network parameters. This means that weight sharing in a
fixed neural architecture leads to a significant reduction of the upper bound of the
capacity. The capacity mainly depends on the number of free parameters, analogously
to the case with independent weights. At the end we comment on the results.
2 Definitions
A layered feedforward network architecture N^r_{e,a} is a directed acyclic graph with a
sequence of e input nodes, r - 1 (r \in \mathbb{N}) intermediate (hidden) layers of nodes, and
a final output layer with a nodes. Every node is connected only to nodes in the
next layer. To every node k with indegree n \in \mathbb{N} a triplet (w_k, s_k, f_k) is assigned,
consisting of a weight vector w_k \in \mathbb{R}^n, a threshold value s_k \in \mathbb{R}, and an activation
function f_k : \mathbb{R} \to \{0, 1\}. The activation function for all but the input nodes
is the hard limiter function, and without loss of generality we choose s_k = 0 for
the threshold values of all nodes. We define an architecture N^r_{e,a} with given triplets
(w, s, f) for all nodes as a net \tilde{N}^r_{e,a}. With the net itself a function F : \mathbb{R}^e \to \{0, 1\}^a
is associated.
Let S be a fixed (m \times e)-input matrix for N^r_{e,a}. All nets \tilde{N}^r_{e,a} that map S onto the
same (m \times a)-output matrix T are grouped in a net class of N^r_{e,a} related to S. \Delta(S)
is the number of net classes of N^r_{e,a} related to S. The growth function g(m) of an
architecture N^r_{e,a} with m input vectors is the maximum number of net classes over
all (m \times e)-input matrices S. Now we consider the nodes of the architecture N^r_{e,a}
within one layer (except the input layer) with the same indegree d \in \mathbb{N}. All nodes
k whose weight vectors w_k \in \mathbb{R}^d can be permuted through a
permutation \pi_k : \mathbb{R}^d \to \mathbb{R}^d so that \pi_k(w_k) = w for all k, for some vector w \in \mathbb{R}^d, are
elements of the same node class K_w. We call an architecture N^r_{e,a} structured if at
least one node class has more than one element. The architecture with b node
classes K_{w_i} (i = 1, ..., b) is then denoted N^r_{e,a}(K_{w_1}, ..., K_{w_b}).
The Vapnik-Chervonenkis dimension d_{VC} [9] of a feedforward architecture is defined
by d_{VC} := \sup\{m \in \mathbb{N} \mid g(m) = 2^{ma}\}. Let Q := \{m \in \mathbb{N} \mid g(m) \geq 2^{ma-1}\}. Then
c := \sup Q for Q \neq \emptyset, or c := 0 for Q = \emptyset, is an upper bound for the Vapnik-
Chervonenkis dimension and is also defined as the capacity in [2, 7].
3 Upper Bounds for the Capacity
In this section it is shown how structured architectures can be transformed into
conventional architectures with independent weights. The upper bounds for the
capacity of these conventional architectures are then applied to the structured
architectures. A basic transformation needed in the following derivations is the
transformation of structured one-layer architectures N(K_{w_1}, ..., K_{w_b}) with input
nodes of outdegree \geq 1 and input vectors x_l into structured one-layer architectures
N'(K_{w_1}, ..., K_{w_b}) with input nodes of outdegree 1 only and dependent input vec-
tors x_l' (l = 1, ..., m): every input node with outdegree z > 1 is replaced by
z copies of that input node. The outgoing edges of the input node are assigned
to the copies in such a way that every copy has outdegree 1. The elements of
the input vectors are duplicated in the same way. By permuting the input nodes
and the corresponding components of the input vectors we get the architecture
N''(K_{w_1}, ..., K_{w_b}) without any intersecting edges.
3.1 Structured One-layer Architectures
I) First we focus on structured one-layer architectures N^1_{e,a}(K_w) with a set
I := \{u_1, ..., u_e\} of e input nodes and the output layer K := \{k_1, ..., k_a\}. Let
K_w := K be the only node class. All nodes in K_w = K have the same indegree
d \in \mathbb{N}.
Theorem 1 Let a structured one-layer architecture N^1_{e,a}(K_w) with only one node
class K_w = K be given. Suppose d \in \mathbb{N} is the indegree of all nodes in K_w. The
number of input nodes is e \leq a \cdot d. For m input vectors of length e an upper bound
for the growth function g(m) of the structured one-layer architecture N^1_{e,a}(K_w) is
given by

g(m) \leq C(m \cdot a, d) := 2 \sum_{i=0}^{d-1} \binom{m \cdot a - 1}{i}
Proof At first we examine structured one-layer architectures with outdegree 1 for
every input node, equivalent to architectures N^1_{e,a}(K_w) with e = a \cdot d input nodes.
By permuting the input nodes and the corresponding components of the input
vectors we get the architecture N''(K_w). Without loss of generality we consider
the permutation \pi of the node class K_w as the identity function. Thus we have
w = w(k_1) = ... = w(k_a) \in \mathbb{R}^d for the a weight vectors (cf. Figure 1 a)). For m
fixed input vectors x_l := (x_l^1, ..., x_l^a) \in \mathbb{R}^{ad} (x_l^i \in \mathbb{R}^d, l = 1, ..., m, i = 1, ..., a)
let S := (x_1, ..., x_m)^T be an (m \times a \cdot d)-input matrix for N^1_{e,a}(K_w). A given weight
vector w_1 \in \mathbb{R}^d defines a function F_1 : \mathbb{R}^{ad} \to \{0, 1\}^a, or a net, on N^1_{e,a}(K_w). From
S we construct the (m \cdot a \times d)-input matrix for N^1_{d,1}

\tilde{S} := (x_1^1, ..., x_1^a, x_2^1, ..., x_m^a)^T \qquad (1)
On N^1_{d,1} the weight vector w_1 (w_2) defines a function F_1 : \mathbb{R}^d \to \{0, 1\} (F_2 : \mathbb{R}^d \to
\{0, 1\}) or a net \tilde{N}_1 (\tilde{N}_2), respectively. Because of F_1(x_s) \neq F_2(x_s) for at least one
input vector x_s (s \in \{1, ..., m\}) and definition (1), the nets \tilde{N}_1 and \tilde{N}_2 are elements
of different net classes of N^1_{d,1} related to the input matrix \tilde{S}. Summarizing we get:
if two nets of the architecture N^1_{e,a}(K_w) are different related to any input matrix
S, we can define an input matrix \tilde{S} for N^1_{d,1} by (1) such that the corresponding nets
are different, too. For the number of net classes this yields

\Delta_{N^1_{e,a}(K_w)}(S) \leq \Delta_{N^1_{d,1}}(\tilde{S}) \qquad (2)
With the results of [7] the growth function of the architecture N^1_{d,1} is given by
C(m \cdot a, d). From (2) it also follows that this is an upper bound for the growth func-
tion of the structured one-layer architecture N''(K_w) or N^1_{e,a}(K_w), respectively:
g(m) \leq C(m \cdot a, d). The inequality g(m) \geq C(m \cdot a, d) can easily be verified in a
similar way, so g(m) = C(m \cdot a, d) for the growth function of structured
one-layer architectures N^1_{e,a}(K_w) with outdegree 1 for every input node.
Now we consider structured one-layer architectures N^1_{e,a}(K_w) with outdegree z > 1
for some input nodes. These architectures can be transformed into structured one-
layer architectures N''(K_w) with e = a \cdot d input nodes, all with outdegree 1. But
the input vectors of the input matrix for the transformed architecture N''(K_w)
cannot be chosen totally independently. Thus, C(m \cdot a, d) is an upper bound for the
growth function of structured one-layer architectures N^1_{e,a}(K_w) with exactly one
node class K_w = K. \Box
Remark 1 With [7] we find 2d/a as an upper bound for the capacity of structured
one-layer architectures N^1_{e,a}(K_w) with exactly one node class K_w.
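Assuming C(m, d) is Cover's counting function [2], as used in Theorem 1, both the growth function and the capacity of the single-class architecture are easy to evaluate exactly. A small sketch (our own code, not from the paper):

```python
from math import comb

def C(m, d):
    """Cover's counting function: dichotomies of m points in general
    position realizable by a threshold unit with d weights."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

def capacity(a, d):
    """Largest m with C(m*a, d) >= 2**(m*a - 1): the capacity of a
    one-layer architecture whose a output nodes share one weight vector."""
    m = 1
    while C((m + 1) * a, d) >= 2 ** ((m + 1) * a - 1):
        m += 1
    return m

# a = 1 recovers the classical capacity 2d of a single threshold unit;
# sharing the d weights across a output nodes divides the capacity by a.
print(capacity(1, 10), capacity(2, 10), capacity(5, 10))   # 20 10 4
```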
II) Second we focus on structured one-layer architectures N^1_{e,a}(K_{w_1}, ..., K_{w_b}) with
b (2 \leq b < a) node classes K_{w_1}, ..., K_{w_b}. These classes form the set K of the a
output nodes: K = K_{w_1} \cup ... \cup K_{w_b}.
Theorem 2 Assume a structured one-layer architecture N^1_{e,a}(K_{w_1}, ..., K_{w_b})
with e \leq \sum_{i=1}^{b} \alpha_i d_i input nodes, a = \sum_{i=1}^{b} \alpha_i output nodes, and b \in \mathbb{N} (2 \leq b \leq a)
node classes K_{w_i} (i = 1, ..., b). Let \alpha_i := |K_{w_i}| be the sizes of the node classes
K_{w_i}, and d_i the indegrees of the nodes in K_{w_i} (i = 1, ..., b). For m input vectors
the product

\prod_{i=1}^{b} C(m \cdot \alpha_i, d_i)

is an upper bound for the growth function g(m).
Theorem 3 With \bar{a} := \max\{\alpha_1, ..., \alpha_b\}, d := \max\{d_1, ..., d_b\}, and t \geq 2, the
capacity of N^1_{e,a}(K_{w_1}, ..., K_{w_b}) satisfies

c = O\left( \frac{b \cdot d}{a} \cdot \ln t \right)
Proof For the growth function g(m) of the architecture N^1_{e,a}(K_{w_1}, ..., K_{w_b}) we
get with the above definitions and Theorem 2:

g(m) \leq \prod_{i=1}^{b} C(m \cdot \alpha_i, d_i) \leq \prod_{i=1}^{b} C(m \cdot \bar{a}, d) = C(m \cdot \bar{a}, d)^b

and solving the capacity condition C(m \cdot \bar{a}, d)^b \geq 2^{ma-1} for m yields

m \leq \mathrm{const} \cdot \frac{b \cdot d}{a} \cdot \ln(t).
For details and further information see [8]. \Box
3.2 Structured Multi-layer Architectures
Consider a structured r-layer architecture with e input nodes, a_j nodes in the
hidden layers H_j (j = 1, ..., r-1) and a nodes in the output layer K. Let
the layers H_j be the disjoint union of the b_j \leq a_j node classes H_{w_1^j}, ..., H_{w_{b_j}^j},
and the output layer the disjoint union of the node classes K_{w_i} (i = 1, ..., b).
The number of node classes is \sum_{j=1}^{r-1} b_j + b =: \beta. The structured architecture is
denoted by N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}). A structured r-layer feedforward architecture
N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}) can be regarded as a combination of r structured one-layer
feedforward architectures, since the output matrices of each layer are the input ma-
trices for the following layer. Thus, we get an upper bound for the growth function
g(m) of N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}) by multiplying the growth functions of the r struc-
tured one-layer architectures (refer to Theorem 2):

g(m) \leq \prod_{j=1}^{r-1} \prod_{i=1}^{b_j} C(m \cdot a_i^j, d_i^j) \cdot \prod_{i=1}^{b} C(m \cdot \alpha_i, d_i) \qquad (3)

where a_i^j and d_i^j denote the sizes and the indegrees of the hidden-layer node classes.
With the maximum size \bar{a} := \max\{\alpha_1, ..., \alpha_b, a_1^1, ..., a_{b_{r-1}}^{r-1}\} of the \beta node classes,
and the maximum indegree d of all nodes of the architecture
N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}), C(m \cdot \bar{a}, d)^{\beta} is an upper bound for (3).
Theorem 4 Let N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}) be a structured r-layer feedforward architec-
ture with \beta \geq 2 node classes H_{w_1^1}, ..., K_{w_b}, d \geq 2 the maximum indegree of all
nodes, \bar{a} the maximum size of all \beta node classes, and t \geq 2. For the capacity
of N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}) we get

c = O\left( \frac{\beta \cdot d}{a} \cdot \ln(t \cdot \bar{a}) \right)
Proof Analogous to the proof of Theorem 3. \Box
An architecture N^r_{e,a}(H_{w_1^1}, ..., K_{w_b}) with \sum_{j=1}^{r-1} b_j + b = \sum_{j=1}^{r-1} a_j + a node classes
is equivalent to an architecture N^r_{e,a} in which every node class has size 1. So the
above upper bounds for the capacity also hold for conventional r-layer feedforward
architectures.
4 Conclusion
By transforming architectures with shared weight vectors into equivalent conven-
tional feedforward architectures, and by extending the definitions of the growth
function and the capacity to multi-layer feedforward architectures, we obtain esti-
mates for the upper bounds of the capacity of structured multi-layer architectures.
These upper bounds depend as O((p/a) \cdot \ln t) on the number p of free parameters
in the structured neural architecture, with maximum size \bar{a} of the \beta node classes,
t \geq 2, and a nodes in the output layer. So weight sharing in a fixed neural
architecture leads to a reduction of the upper bound of the capacity. The amount of
the reduction increases with the extent of the weight sharing. With \bar{a} = 1 the upper
bounds hold for conventional feedforward networks with independent weights,
too. It is known that the generalization ability of a feedforward neural architecture
too. It is known that the generalization ability of a feedforward neural architecture
improves within certain limits with a reduction of the capacity for a fixed number of
training samples. As a consequence of our results a better generalization ability can
be derived for structured neural architectures compared to the same unstructured
ones. This is a theoretic justification for the generalization ability of structured
neural architectures observed in experiments [5]. Further investigations focus on
an improvement of the upper bounds, on the determination of capacity bounds
for special structured architectures, and on the derivation of capacity bounds for
structured architectures of nodes with continuous transfer functions [3, 6].
REFERENCES
[1] E. B. Baum, D. Haussler, What Size Net gives Valid Generalization?, Advances in Neural
Information Processing Systems, D. Touretzky, (Ed.), Morgan Kaufmann, (1989).
[2] T. M. Cover, Geometrical and Statistical Properties of Systems of Linear Inequalities with
Applications in Pattern Recognition, IEEE Trans. on Electronic Computers, Vol. 14 (1965),
pp326-334.
[3] P. Koiran, E.D. Sontag, Neural Networks with Quadratic VC Dimension, NeuroCOLT Tech-
nical Report Series, NC-TR-95-044, London (1995).
[4] S. Kröner, R. Moratz, H. Burkhardt: An adaptive invariant transform using neural network
techniques, Proc. of 7th Europ. Sig. Proc. Conf., Holt et al. (Eds.), Vol. III (1994), pp1489-1491,
Edinburgh.
[5] Y. le Cun, Generalization and Network Design Strategies, Connectionism in Perspective,
R. Pfeiffer et al. (Eds.), Elsevier Science Publishers B.V. North-Holland (1989), pp143-155.
[6] W. Maass, Vapnik-Chervonenkis Dimension of Neural Nets, Preprint, Techn. Univ. Graz,
(1994).
[7] G. J. Mitchison, R. M. Durbin, Bounds on the Learning Capacity of Some Multi-Layer
Networks, Biological Cybernetics, Vol. 60 No. 5 (1989), pp345-356.
[8] P. Rieper, Zur Speicherfähigkeit vorwärtsgerichteter Architekturen künstlicher neuronaler
Netze mit gekoppelten Knoten, Diplomarbeit, Universität Hamburg, (1994).
[9] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, Berlin
(1982).
[10] A. Waibel, Modular Construction of Time-Delay Neural Networks for Speech Recognition,
Neural Computation, Vol. 1 (1989), pp39-46.
ON-LINE LEARNING IN MULTILAYER NEURAL
NETWORKS
David Saad and Sara A. Solla*
Dept. of Computer Science and Applied Mathematics,
University of Aston, Birmingham B4 7ET, UK.
* CONNECT, The Niels Bohr Institute, Blegdamsvej 17,
Copenhagen 2100, Denmark.
We present an analytic solution to the problem of on-line gradient-descent learning for two-layer
neural networks with an arbitrary number of hidden units in both teacher and student networks.
The technique, demonstrated here for the case of adaptive input-to-hidden weights, becomes exact
as the dimensionality of the input space increases.
Layered neural networks are of interest for their ability to implement input-output
maps [1]. Classification and regression tasks formulated as a map from an N-
dimensional input space \xi onto a scalar \zeta are realized through a map \zeta = h(\xi),
which can be modified through changes in the internal parameters \{J\} specifying
the strength of the interneuron couplings. Learning refers to the modification of
these couplings so as to bring the map h implemented by the network as close
as possible to a desired map f. Information about the desired map is provided
through independent examples (\xi^\mu, \zeta^\mu), with \zeta^\mu = f(\xi^\mu) for all \mu. A recently in-
troduced approach investigates on-line learning [2]. In this scenario the couplings
are adjusted to minimize the error after the presentation of each example. The re-
sulting changes in {J} are described as a dynamical evolution, with the number of
examples playing the role of time. The average that accounts for the disorder in-
troduced by the independent random selection of an example at each time step can
be performed directly. The result is expressed in the form of dynamical equations
for order parameters which describe correlations among the various nodes in the
trained network as well as their degree of specialization towards the implementa-
tion of the desired task. Here we obtain analytic equations of motion for the order
parameters in a general two-layer scenario: a student network composed of N input
units, K hidden units, and a single linear output unit is trained to perform a task
defined through a teacher network of similar architecture except that its number
M of hidden units is not necessarily equal to K. Two-layer networks with an ar-
bitrary number of hidden units have been shown to be universal approximators [1]
for N -to-one dimensional maps. Our results thus describe the learning of tasks of
arbitrary complexity (general M). The complexity of the student network is also
arbitrary (general K, independent of M), providing a tool to investigate realizable
(K = M), over-realizable (K > M), and unrealizable (K < M) learning scenarios.
In this paper we limit our discussion to the case of the soft-committee machine
[2], in which all the hidden units are connected to the output unit with positive
couplings of unit strength, and only the input-to-hidden couplings are adaptive.
Consider the student network: hidden unit i receives information from input unit
r through the weight J_{ir}, and its activation under presentation of an input pattern
\xi = (\xi_1, ..., \xi_N) is x_i = J_i \cdot \xi, with J_i = (J_{i1}, ..., J_{iN}) defined as the vector of
incoming weights onto the i-th hidden unit. The output of the student network is
\sigma(J, \xi) = \sum_{i=1}^{K} g(J_i \cdot \xi), where g is the activation function of the hidden units,
taken here to be the error function g(x) \equiv \mathrm{erf}(x/\sqrt{2}), and J \equiv \{J_i\}_{1 \leq i \leq K} is the set
of input-to-hidden adaptive weights. Training examples are of the form (\xi^\mu, \zeta^\mu). The
components of the independently drawn input vectors \xi^\mu are uncorrelated random
variables with zero mean and unit variance. The corresponding output \zeta^\mu is given
by a deterministic teacher whose internal structure is that of a network similar to
the student except for a possible difference in the number M of hidden units. Hid-
den unit n in the teacher network receives input information through the weight
vector B_n = (B_{n1}, ..., B_{nN}), and its activation under presentation of the input
pattern \xi^\mu is y_n^\mu = B_n \cdot \xi^\mu. The corresponding output is \zeta^\mu = \sum_{n=1}^{M} g(B_n \cdot \xi^\mu).
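As a concrete illustration of this setup, the following minimal sketch (our own notation and parameter values; the paper contains no code) draws an approximately isotropic teacher and evaluates the student output and the teacher label for a single input pattern:

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
N, K, M = 1000, 3, 3                           # input size, student and teacher hidden units

J = rng.standard_normal((K, N)) / np.sqrt(N)   # student weights J_i (small random start)
B = rng.standard_normal((M, N))                # teacher weights B_n
B /= np.linalg.norm(B, axis=1, keepdims=True)  # unit norms, so T_nm ~ delta_nm for large N

g = lambda x: erf(x / np.sqrt(2))              # hidden-unit activation function

xi = rng.standard_normal(N)                    # one input pattern
x, y = J @ xi, B @ xi                          # student and teacher activations
sigma, zeta = g(x).sum(), g(y).sum()           # student output and teacher label
```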
We will use indices i, j, k, l, ... to refer to units in the student network, and n, m, ...
for units in the teacher network. The error made by a student with weights J on a
given input \xi is given by the quadratic deviation

\epsilon(J, \xi) \equiv \frac{1}{2} \left[ \sigma(J, \xi) - \zeta \right]^2 = \frac{1}{2} \left[ \sum_{i=1}^{K} g(x_i) - \sum_{n=1}^{M} g(y_n) \right]^2 \qquad (1)

Averaging over all possible inputs defines the generalization error \epsilon_g(J) \equiv \langle \epsilon(J, \xi) \rangle.
For large N the activations x_i = J_i \cdot \xi and y_n = B_n \cdot \xi are correlated Gaussian
variables, with covariances given by the overlaps Q_{ik} \equiv J_i \cdot J_k, R_{in} \equiv J_i \cdot B_n, and
T_{nm} \equiv B_n \cdot B_m; their joint distribution is

P(x, y) = \frac{1}{\sqrt{(2\pi)^{K+M} |C|}} \exp\left\{ -\frac{1}{2} (x, y)^T C^{-1} (x, y) \right\}, \quad \text{with} \quad C = \begin{pmatrix} Q & R \\ R^T & T \end{pmatrix} \qquad (2)
The averaging yields an expression for the generalization error in terms of the order
parameters Q_{ik}, R_{in}, and T_{nm}. For g(x) \equiv \mathrm{erf}(x/\sqrt{2}) the result is:

\epsilon_g(J) = \frac{1}{\pi} \left\{ \sum_{i,k} \arcsin \frac{Q_{ik}}{\sqrt{1 + Q_{ii}} \sqrt{1 + Q_{kk}}} + \sum_{n,m} \arcsin \frac{T_{nm}}{\sqrt{1 + T_{nn}} \sqrt{1 + T_{mm}}} - 2 \sum_{i,n} \arcsin \frac{R_{in}}{\sqrt{1 + Q_{ii}} \sqrt{1 + T_{nn}}} \right\} \qquad (3)
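Equation (3) is straightforward to transcribe into code; the sketch below (our own transcription) also checks the perfectly specialized configuration Q = R = T = I, for which the generalization error vanishes:

```python
import numpy as np

def eg(Q, R, T):
    """Generalization error of equation (3) from the order parameters."""
    qi = np.sqrt(1 + np.diag(Q))
    tn = np.sqrt(1 + np.diag(T))
    term_q = np.arcsin(Q / np.outer(qi, qi)).sum()
    term_t = np.arcsin(T / np.outer(tn, tn)).sum()
    term_r = np.arcsin(R / np.outer(qi, tn)).sum()
    return (term_q + term_t - 2 * term_r) / np.pi

I = np.eye(3)
print(eg(I, I, I))   # 0.0 at the perfect-generalization fixed point
```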
The parameters Tnm are characteristic of the task to be learned and remain fixed,
while the overlaps Qik and Rin are determined by the student weights J and evolve
during training. A gradient descent rule for the update of the student weights results
in J_i^{\mu+1} = J_i^{\mu} + \frac{\eta}{N} \delta_i^{\mu} \xi^{\mu}, where the learning rate \eta has been scaled with the input
size N, and \delta_i^{\mu} \equiv g'(x_i^{\mu}) \left[ \sum_{n=1}^{M} g(y_n^{\mu}) - \sum_{j=1}^{K} g(x_j^{\mu}) \right] is defined in terms of both
the activation function g and its derivative g'. The time evolution of the overlaps
R_{in} and Q_{ik} can be explicitly written in terms of similar difference equations. The
dependence on the current input \xi is only through the activations x and y, and the
corresponding averages can be performed using the joint probability distribution
(2). In the thermodynamic limit N \to \infty the normalized example number \alpha = \mu/N
can be interpreted as a continuous time variable, leading to the equations of motion:

\frac{dR_{in}}{d\alpha} = \eta \langle \delta_i y_n \rangle, \qquad \frac{dQ_{ik}}{d\alpha} = \eta \langle \delta_i x_k \rangle + \eta \langle \delta_k x_i \rangle + \eta^2 \langle \delta_i \delta_k \rangle \qquad (4)
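These averaged equations can be checked against a direct finite-N simulation of the on-line rule itself. A minimal sketch (illustrative parameters; see [3, 4] for the analytic treatment):

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
N, K, M, eta = 1000, 3, 3, 0.1

g = lambda x: erf(x / np.sqrt(2))
g_prime = lambda x: np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)

B = rng.standard_normal((M, N))
B /= np.linalg.norm(B, axis=1, keepdims=True)       # T_nm ~ delta_nm
J = 1e-3 * rng.standard_normal((K, N))              # nearly zero initial overlaps

for mu in range(20 * N):                            # alpha = mu / N runs up to 20
    xi = rng.standard_normal(N)
    x, y = J @ xi, B @ xi
    delta = g_prime(x) * (g(y).sum() - g(x).sum())  # delta_i of the text
    J += (eta / N) * np.outer(delta, xi)

Q, R = J @ J.T, J @ B.T                             # order parameters at alpha = 20
```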
The two multivariate Gaussian integrals I_3 \equiv \langle g'(u) \, v \, g(w) \rangle and I_4 \equiv
\langle g'(u) \, g'(v) \, g(w) \, g(z) \rangle represent averages over the probability distribution (2). The
averages can be performed analytically for the choice g(x) = \mathrm{erf}(x/\sqrt{2}). Arguments
assigned to I_3 and I_4 are to be interpreted following our convention to distinguish
student from teacher activations. For example, I_3(i, n, j) \equiv \langle g'(x_i) \, y_n \, g(x_j) \rangle, and
the average is performed using the three-dimensional covariance matrix C3 which
results from projecting the full covariance matrix C of Eq. (2) onto the relevant
subspace. For I_3(i, n, j) the corresponding matrix is:
C_3 = \begin{pmatrix} Q_{ii} & R_{in} & Q_{ij} \\ R_{in} & T_{nn} & R_{jn} \\ Q_{ij} & R_{jn} & Q_{jj} \end{pmatrix}
I_3 is given in terms of the components of the C_3 covariance matrix by

I_3 = \frac{2}{\pi \sqrt{\Lambda_3}} \, \frac{C_{23} (1 + C_{11}) - C_{12} C_{13}}{1 + C_{11}} \qquad (5)
with \Lambda_3 = (1 + C_{11})(1 + C_{33}) - C_{13}^2. The expression for I_4 in terms of the components
of the corresponding C_4 covariance matrix is

I_4 = \frac{4}{\pi^2 \sqrt{\Lambda_4}} \arcsin\left( \frac{\Lambda_0}{\sqrt{\Lambda_1 \Lambda_2}} \right) \qquad (6)
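The closed form (5), as reconstructed above, can be verified against a Monte Carlo estimate of the defining average; C3 below is an arbitrary admissible covariance matrix (our choice):

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
C3 = np.array([[1.0, 0.3, 0.2],
               [0.3, 1.0, 0.4],
               [0.2, 0.4, 1.0]])

g = lambda x: erf(x / np.sqrt(2))
g_prime = lambda x: np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)

u, v, w = rng.multivariate_normal(np.zeros(3), C3, size=1_000_000).T
i3_mc = np.mean(g_prime(u) * v * g(w))          # defining average <g'(u) v g(w)>

lam3 = (1 + C3[0, 0]) * (1 + C3[2, 2]) - C3[0, 2] ** 2
i3_closed = (2 / np.pi) * (C3[1, 2] * (1 + C3[0, 0]) - C3[0, 1] * C3[0, 2]) \
            / (np.sqrt(lam3) * (1 + C3[0, 0]))
print(i3_mc, i3_closed)   # should agree to Monte Carlo accuracy
```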
Figure 1 The overlaps and the generalization error as a function of \alpha for a three-
node student learning an isotropic teacher (T_{nm} = \delta_{nm}). Results for \eta = 0.1 are
shown for (a) student-student overlaps Q_{ik}, (b) student-teacher overlaps R_{in}, and
(c) the generalization error. The generalization error for different values of the
learning rate \eta is shown in (d).
teacher nodes decreases and eventually decays to zero. We thus distinguish between
a growing overlap R between a given student node and the teacher node it begins
to imitate, and decaying secondary overlaps S between the same student node and
the remaining teacher nodes. Further specialization involves the decay to zero of
the student-student correlations C and the growth of the norms Q of the student
vectors. The student nodes can be relabeled so as to bring the matrix of student-
teacher overlaps to the form R_{in} = R \delta_{in} + S(1 - \delta_{in}); the matrix of student-student
overlaps is of the form Q_{ik} = Q \delta_{ik} + C(1 - \delta_{ik}). The subsequent evolution of the
system converges to an optimal solution with perfect generalization, characterized
by a fixed point at (R^*)^2 = Q^* = 1 and S^* = C^* = 0, with \epsilon_g^* = 0. Linearization
of the full equations of motion around the asymptotic fixed point results in four
eigenvalues, of which only two control convergence. An initially slow mode is char-
acterized by a negative eigenvalue that decreases monotonically with \eta, while an
initially faster mode is characterized by an eigenvalue that eventually increases and
becomes positive at \eta_{max} = \pi/(\sqrt{3} K), to first order in 1/K. Exponential conver-
gence of R, S, C, and Q to their optimal values is guaranteed for all learning rates
in the range (0, \eta_{max}); in this regime the generalization error decays exponentially
to \epsilon_g = 0, with a rate controlled by the slowest decay mode.
REFERENCES
[1] G. Cybenko, Approximation by superposition of sigmoidal functions, Math. Control Signals
and Systems Vol. 2 (1989), pp303-314.
[2] M. Biehl and H. Schwarze, Learning by online gradient descent, J. Phys. A Vol. 28 (1995),
pp643-656.
[3] D. Saad and S. A. Solla, Exact solution for on-line learning in multilayer neural networks,
Phys. Rev. Lett. Vol. 74 (1995), pp4337-4340.
[4] D. Saad and S. A. Solla, On-line learning in soft committee machines, Phys. Rev. E Vol. 52
(1995), pp4225-4243.
Acknowledgements
Work supported by EU grant ERB CHRX-CT92-0063.
SPONTANEOUS DYNAMICS AND ASSOCIATIVE
LEARNING IN AN ASYMMETRIC RECURRENT
RANDOM NEURAL NETWORK
M. Samuelides, B. Doyon*, B. Cessac** and M. Quoy***
Centre d 'Etudes et de Recherches de Toulouse, 2 avenue Edouard Belin,
BP 4025, 31055 Toulouse Cedex, France. Email: [email protected].
* Unité INSERM 230, Service de Neurologie, CHU Purpan,
31059 Toulouse Cedex, France.
** University of New Mexico, Department of Electrical and
Computer Engineering, Albuquerque, NM 87131, USA.
*** Universität Bielefeld, BiBoS, Postfach 100131,
33501 Bielefeld, Germany.
Freeman's investigations on the olfactory bulb of the rabbit showed that its dynamics is chaotic,
and that recognition of a learned pattern is linked to a dimension reduction of the dynamics onto
a much simpler attractor (near limit cycle). We address here the question whether this behaviour
is specific to this particular architecture or whether it is a general property of chaotic neural
networks using a Hebb-like learning rule. In this paper, we use a mean-field theoretical statement
to determine the spontaneous dynamics of an asymmetric recurrent neural network. In particular
we determine the range of random weight matrices for which the network is chaotic. We are able
to explain the various changes observed in the dynamical regime when static random patterns are
presented. We propose a Hebb-like learning rule to store a pattern as a limit cycle or strange
attractor. We numerically show the dynamics reduction of a finite-size chaotic network during
learning and recognition of a pattern. Though associative learning is actually performed, the low
storage capacity of the system leads to the consideration of more realistic architectures.
1 Introduction
Most studies on recurrent neural networks assume sufficient conditions of
convergence. Relaxation to a stable network state is then simply interpreted as a stored
pattern. Models with symmetric synaptic connections have relaxation dynamics.
Networks with asymmetric synaptic connections lose this convergence property and
can have more complex dynamics. However, as pointed out by Hirsch [8], it might
be very interesting, from an engineering point of view, to investigate non convergent
networks because their dynamical possibilities are much richer for a given number
of units.
Moreover, the real brain is a highly dynamic system. Recent neurophysiological
findings have focused attention on the rich temporal structures (oscillations) of
neuronal processes [7, 6] which might play an important role in information pro-
cessing. Chaotic behavior has been found in the nervous system [2] and might
be implicated in cognitive processes [9]. Freeman's paradigm [9] is that the basic
dynamics of a neural system is chaotic and that a particular pattern is stored as an
attractor of lower dimension than the initial chaotic one. The learning procedure
thus leads to the creation of such an attractor. During the recognition process, first,
the network explores a large region of its phase space through a chaotic dynamics.
When the stimulus is presented, the dynamics is reduced and the system fol-
lows the lower dimensional attractor which has been created during the learning
process. The question arises whether this paradigm, which has been simulated in [11]
using an artificial neural network, is due to a very specific architecture or whether it is a
general phenomenon for recurrent networks.
The first step to address this problem was to determine the conditions for the exis-
tence of chaotic dynamics among the various architectures of recurrent neural net-
works. A major theoretical advance in that direction was achieved by Sompolinsky
et al. [10]. They established strong theoretical results concerning the occurrence of chaos for
fully connected asymmetric random recurrent networks in the thermodynamic limit
by using dynamical mean field theory. In their model, neurons are formal neurons
with activation state in [-1, 1], with a symmetric transfer function and no threshold.
The authors show that the system exhibits chaotic dynamics. These results were
extended by us in [5] to the case of diluted networks with discrete time dynamics.
One can ask whether such results remain valid in a more general class of neural net-
works with no reversal symmetry, i.e. with activation state in [0, 1] and thresholds.
The presence of thresholds is biologically interesting. Moreover, it allows us to study
the behaviour of the network when submitted to an external input. In this paper,
we describe this model and report the main results about spontaneous dynamics in
section 2. In section 3, we define a Hebbian learning rule. We study the reduction of
the dynamics during the learning process and the recognition of a learned stimulus.
We then discuss the results and conclude (4).
2 Spontaneous Dynamics of Random Recurrent Networks with
Thresholds
The neuron states are continuous variables x_i, i = 1, ..., N. The network dynamics
is given by:

x_i(t+1) = f\left( \sum_{j=1}^{N} J_{ij} x_j(t) - \theta_i \right) \qquad (1)

where J_{ij} is the synaptic weight between the neuron j and the neuron i. The
J_{ij}'s are independent identically distributed random variables with expectation
E(J_{ij}) = \bar{J}/N and variance \mathrm{Var}(J_{ij}) = J^2/N. The thresholds \theta_i are independent
identically distributed gaussian random variables of expectation E(\theta_i) = \bar{\theta} and
variance \mathrm{Var}(\theta_i) = \sigma_\theta^2. Our numerical studies are made with the sigmoidal function
f(x) = 1/(1 + e^{-2gx}), so that the neuron states x_i(t) belong to [0, 1]. Thus x_i(t) may be
seen as the mean firing rate of a neuron.
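A direct simulation of (1) is straightforward; the following sketch uses the sigmoid quoted above and illustrative parameter values (our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, g_gain, J_bar, J_std = 256, 3.0, 0.0, 1.0     # gain g and weight statistics
theta_bar, sigma_theta = 0.0, 0.1

J = rng.normal(J_bar / N, J_std / np.sqrt(N), size=(N, N))  # E = J_bar/N, Var = J^2/N
theta = rng.normal(theta_bar, sigma_theta, size=N)

f = lambda x: 1.0 / (1.0 + np.exp(-2.0 * g_gain * x))       # f(x) = 1/(1+e^{-2gx})

x = rng.random(N)
for t in range(1000):
    x = f(J @ x - theta)                                    # dynamics (1)

m, q = x.mean(), (x ** 2).mean()     # finite-N versions of the order parameters (2)
```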
This kind of system is known to present the "propagation of chaos" property in the
thermodynamic limit. This approach was initiated for neural networks by Amari
[1]. Though the denomination of "chaos" for this basic property of vanishing finite-
size correlations is quite confusing and has nothing to do with the deterministic chaos
which will be considered afterwards, we shall keep it. Namely, the intra-correlations
between finite sets of neuron activation states, and between neurons and weights,
vanish, and each neuronal activation state process converges in law in the thermo-
dynamic limit towards independent gaussian processes. This statement allows us to
derive mean field equations governing the limits for N \to \infty of

m_N(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t), \qquad q_N(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t)^2 \qquad (2)
These mean field equations are

m(t+1) = \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}} \, e^{-h^2/2} \, f\left[ h \sqrt{J^2 q(t) + \sigma_\theta^2} + \bar{J} m(t) - \bar{\theta} \right] \qquad (3)

q(t+1) = \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}} \, e^{-h^2/2} \, f^2\left[ h \sqrt{J^2 q(t) + \sigma_\theta^2} + \bar{J} m(t) - \bar{\theta} \right]
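For \bar{J} = 0 the equation for q closes on itself and can be iterated numerically, e.g. with Gauss-Hermite quadrature. A minimal sketch with illustrative parameters:

```python
import numpy as np

g_gain, J_std, theta_bar, sigma_theta = 3.0, 1.0, 0.0, 0.1
nodes, weights = np.polynomial.hermite_e.hermegauss(80)  # weight function exp(-h^2/2)
weights = weights / np.sqrt(2 * np.pi)                   # normalize to a Gaussian measure

f = lambda x: 1.0 / (1.0 + np.exp(-2.0 * g_gain * x))

q = 0.25
for _ in range(200):                                     # iterate the second equation of (3)
    arg = nodes * np.sqrt(J_std ** 2 * q + sigma_theta ** 2) - theta_bar
    q = np.sum(weights * f(arg) ** 2)
print(q)   # self-consistent second moment at the fixed point
```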
We shall now restrict ourselves for numerical studies to the case where \bar{J} = 0.
The second mean-field equation is then self-consistent and we are able to determine the
critical surface in the space (\bar{\theta}, gJ, \sigma_\theta) between a single stable fixed point and
two stable fixed points (saddle-node bifurcation). Consider a fixed point x^* of the
system (1). Its components (x_i^*) are i.i.d. gaussian random variables; their moments
are given by the previous mean-field equations. To determine the stability of x* in
(1), we compute the Jacobian matrix of the evolution operator

D(x^*) = 2g L(x^*) J \qquad (4)

where L(x^*) is the diagonal matrix with components (x_i^* - (x_i^*)^2) and where J is the
connection weight matrix. The spectral radius of this random matrix can be com-
puted and this computation determines a stability boundary surface as shown below
in figure 1. In the thermodynamic limit, it is possible to use the "propagation of
chaos" property to compute the evolution of the quadratic distance between two
trajectories of the system coming from arbitrarily close initial conditions [3]. 0 is a fixed
point for this recurrence, and the critical condition for its destabilization is exactly
the same as the previous condition for the first destabilization of the fixed point
of the network. This proves that, in the thermodynamic limit, one observes a sharp
transition between a fixed-point regime and a chaotic regime. So in the bifurcation
map of figure 1 there are four different regions:
- region 1: there is only one stable fixed point;
- region 2: there are two stable fixed points (actually region 2 is a very small cusp);
- region 3: one stable fixed point and one strange attractor coexist; one may converge towards one or the other depending on the initial conditions;
- region 4: there is only one strange attractor.
Regions 2 and 3 shrink when \sigma_\theta increases. When \sigma_\theta = 0.2, only regions 1 and 4
remain. For finite size systems, the destabilization and apparition of chaos boundaries
are different. There exists an intermediate zone where the system is periodic or
quasiperiodic. This corresponds to the quasiperiodicity route to chaos we observe
in the system when gJ is increased [5, 4]. The simulations confirm with accuracy
all these theoretical predictions for medium-sized networks (from 128 units up to
512 units), showing the progressive stretching of the transition zone where periodic
and almost-periodic attractors take place.
[Figure 1: bifurcation map, showing the saddle-node boundary.]
Hence the input is equivalent to a threshold. The resulting global threshold has a
mean value \bar{\theta} - \bar{I} and a modified variance. We can then interpret the presentation of an input
as a change of the parameters of the system. Its effect can therefore be predicted
from the bifurcation map.
If the configuration of the network stands in the chaotic region near the boundary
of this region, then the presentation of an input will tend to reduce the dynamics of
the system onto a limit cycle by crossing the boundary, falling into the limit-cycle and
T^2 torus area. The same input may lead to different answers by different networks
with the same statistical parameters. This modification of the parameters by a
pattern presentation creates a new system (with a different attractor).
3.2 Auto-associative Learning.
The learning procedure we define will enable us to associate a limit cycle or a
strange attractor with a presented pattern. We modify the connection strengths by a
biologically motivated Hebb-like rule:
If x_j(t) > 0.5 then J_{ij}(t+1) = J_{ij}(t) + \frac{\alpha}{N} \left[ x_j(t) - 0.5 \right] \left[ x_i(t+1) - 0.5 \right],
else J_{ij}(t+1) = J_{ij}(t).
We add the constraint that a weight cannot change its sign. \alpha is the learning rate.
The learning procedure is applied at each step once the network has reached
the stationary regime.
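One possible implementation of this rule is sketched below; the sign constraint is enforced here by clipping a weight to zero whenever the update would flip it, which is our reading of the constraint:

```python
import numpy as np

def hebb_step(J, x_t, x_t1, alpha):
    """One Hebb-like update from state x(t) to x(t+1), applied once the
    network has reached its stationary regime."""
    N = J.shape[0]
    dJ = (alpha / N) * np.outer(x_t1 - 0.5, x_t - 0.5)
    dJ[:, x_t <= 0.5] = 0.0                      # rule applies only when x_j(t) > 0.5
    J_new = J + dJ
    J_new[np.sign(J_new) != np.sign(J)] = 0.0    # a weight cannot change its sign
    return J_new
```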
In all our simulations the learning procedure reduces the fractal dimension of the
chaotic attractors of the dynamics. Eventually the system follows an inverse quasi-
periodicity route towards a fixed point. We have chosen to stop the procedure on a
limit cycle, thus associating this cycle with the pattern learned.
In fact, one can speak of learning if the network has a selective response for the
learned pattern, and if the learning procedure does not affect the dynamical be-
haviour when another pattern is presented. In order to study this selectivity
property, we perform the following simulation. We learn one prototype pattern (i.e.
we iterate the learning dynamics until reaching a limit cycle). Then we present 30
other random patterns (drawn randomly with the same statistics as the learned
pattern), and we compare the attractors before and after learning. For all patterns
but one, the dynamics before learning was chaotic. For all these patterns, the dy-
namics was still chaotic after learning (the pattern whose response was periodic
still had a periodic attractor). Hence the network reduces its dynamics only for the
learned prototype pattern.
3.3 Retrieval Property
In order to study the retrieval property of this associative learning process, we add
noise to a pattern previously learned, and show how it affects the recognition. A
pattern is a vector of gaussian random variables (for the simulations, each com-
ponent has a mean zero and standard deviation 0.1). We add to each component
a gaussian noise of mean zero and standard deviation 0.01, and 0.02, which thus
corresponds to a level of noise of 10% and 20% on the pattern. We recall that the
presentation of a noisy pattern changes the parameters of the system, so recog-
nition has to be defined by some similarity between the attractors. In order to
quantify the similarity between cycles, we compute their centre of gravity, their mean
radius and their winding (rotation) number, and we define a criterion of similarity based
on these numerical indexes.
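A sketch of these cycle descriptors follows. The paper does not specify in which plane the winding number is counted, so a projection onto the two leading principal components is used here as one possible choice:

```python
import numpy as np

def cycle_indexes(cycle):
    """Centre of gravity, mean radius and winding number of an attractor.
    cycle: array of shape (T, N), one period of network states."""
    centre = cycle.mean(axis=0)
    dev = cycle - centre
    radius = np.linalg.norm(dev, axis=1).mean()
    # project onto the two leading principal directions and count turns
    _, _, vt = np.linalg.svd(dev, full_matrices=False)
    u, v = dev @ vt[0], dev @ vt[1]
    angles = np.unwrap(np.arctan2(v, u))
    winding = (angles[-1] - angles[0]) / (2 * np.pi)
    return centre, radius, winding
```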
To estimate the recognition rate, we learn one prototype pattern. Then we present
30 noisy patterns (derived from the prototype one), and compute the similarity
values. With a 10% level of noise the recognition rate after 7 learning steps is 83%;
with 20% noise it falls to 27%.
4 Discussion and Conclusion
Our model reproduces Freeman's observations concerning the dimension re-
duction of the system attractor upon recognition of a learned pattern, in a model
of the olfactory bulb [11]. The system does not need to wait until a fixed point
is reached to perform recognition: in our experiments, convergence onto an
attractor is very fast. This gives an insight into the mechanism leading to such a
phenomenon by the extraction of the few relevant parameters related to it.
However this model suffers severe drawbacks concerning the control of the learning
process and the storage capacity. Moreover, the modifications undergone by the
weights during the learning process are difficult to interpret theoretically. These
limitations suggest introducing at the next step architectural and functional dif-
ferentiation into the network (inhibitory and excitatory neurons, multiple layers,
random geometry-dependent connectivity and time delays, modular architecture
of chaotic oscillators).
Coding by dynamical attractors is also particularly suited to the learning of temporal
signals. For the moment, we have only focused on the learning of static patterns. However,
we have performed some preliminary simulations, with interesting results, on the presentation
of temporal sequences. This could lead to connecting different chaotic networks in
order to perform recognition tasks using the synchronization processes highlighted
by Gray [7].
REFERENCES
[1] Amari S.I., Characteristics of random nets of analog neuron-like elements, IEEE
Trans. Syst. Man Cyb., Vol. 2 (1972), pp643-657.
[2] Babloyantz A., Nicolis C., Salazar J.M., Evidence of chaotic dynamics of brain activity during
the sleep cycle, Phys. Lett. Vol. 111A (1985), pp152-156.
[3] Cessac B., Increase in complexity in random neural networks, J. Phys. I France, Vol. 5 (1995),
pp409-432.
[4] Cessac B., Doyon B., Quoy M., Samuelides M., Mean-field equations, bifurcation map and
route to chaos in discrete time neural networks, Physica D, Vol. 74 (1994), pp24-44.
[5] Doyon B., Cessac B., Quoy M., Samuelides M., Chaos in Neural Networks With Random
Connectivity, International Journal Of Bifurcation and Chaos, Vol. 3 (1993), No.2, pp279-
291.
[6] Eckhorn R., Bauer R., Jordan W., Brosch M., Kruse W., Munk M., Reitboeck H.J., Coher-
ent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and
correlation analysis in the cat, BioI. Cybernet. Vol. 60 (1988), pp121-130.
[7] Gray C.M., Koenig P., Engel A.K., Singer W. Oscillatory responses in cat visual cortex
exhibit intercolumnar synchronisation which reflects global stimulus properties, Nature Vol.
338 (1989), pp334-337.
[8] Hirsch M.W., Convergent Activation Dynamics in Continuous Time Networks, Neural Net-
works Vol. 2 (1989), pp331-349.
[9] Skarda C.A., Freeman W.J., How brains make chaos in order to make sense of the world,
Behav. Brain Sci. Vol. 10 (1987), pp161-195.
[10] Sompolinsky H., Crisanti A., Sommers H.J., Chaos in random neural networks, Phys. Rev.
Lett. Vol. 61 (1988), pp259-262.
[11] Yao Y., Freeman W.J., Model of biological pattern recognition with spatially chaotic dynam-
ics, Neural Networks Vol. 3 (1990), pp153-170.
Acknowledgements
This research has been supported by a grant from DRET (the French National
Defense Agency) and by the COGNISCIENCE research program of the C.N.R.S.
M. Quoy is supported by a Lavoisier grant of the French Foreign Office and B.
Cessac by a Cognisciences grant of the C.N.R.S.
A STATISTICAL MECHANICS ANALYSIS OF
GENETIC ALGORITHMS FOR SEARCH AND
LEARNING
Jonathan L. Shapiro, Adam Prügel-Bennett* and Magnus Rattray
Department of Computer Science, University of Manchester,
Manchester, M13 9PL, UK.
* NORDITA, Blegdamsvej 17, DK-2100 Copenhagen Ø, Denmark.
Statistical mechanics can be used to derive a set of equations describing the evolution of a genetic
algorithm involving crossover, mutation and selection. This paper gives an introduction to this
work. It is shown how the method can be applied to very simple problems, for which the
dynamics of the genetic algorithm can be reduced to a set of nonlinear coupled difference equations.
Good results are obtained when the equations are truncated to four variables.
Keywords: Genetic Algorithms, Statistical Mechanics, Fitness Distributions.
1 Introduction
Genetic Algorithms (GA) are a class of search techniques which can be used to find
solutions to hard problems. They have been applied in a range of domains: opti-
misation problems, machine learning, training neural networks or evolving neural
network architectures, and many others. (For an introduction, see Goldberg [1].)
They can be naturally applied to discrete problems for which other techniques are
more difficult to use, and they parallelise well. Most importantly, they have been
found to work in many applications.
Although GAs have been widely studied empirically, they are not well understood
theoretically. Unlike gradient descent search and simulated annealing, genetic al-
gorithms are not based on a well-understood process. The goal of the research de-
scribed here is to develop a formalism which allows the study of genetic algorithm
dynamics for problems of realistic size and finite population sizes. The problems
we have studied are clearly toy problems; however, they contain some realistic as-
pects, and we hope their consideration will be a stepping stone to the study of more
realistic problems.
2 The Statistical Mechanics Approach
Ideally, one would like to solve the dynamics of genetic algorithms exactly. This
could be done, in principle, either by studying the stochastic equation directly, or
by using a Markov Chain formulation to analyse the deterministic equation for the
probability of being in a given population (see [2, 3]). However, it is very difficult
to make progress in this way, because one must solve a high-dimensional, strongly
interacting, nonlinear system which is extremely difficult to analyse.
In these exact approaches, the precise details of which individuals are in the popu-
lation are considered. In our approach much less information is assumed to be known.
We consider only statistical properties of the distribution of fitnesses in the pop-
ulation; from the distribution of fitnesses at one time, we predict the distribution
of fitnesses at the next time step. Iterating this from the initial distribution, we
propose to predict the fitness distribution for all times.
This distribution tells you what you want to know - for example the evolution of
the best member of the population can be inferred. But is it possible, in principle,
to predict the fitness at later times based on the fitness at earlier times?
The answer is no. Although it is possible to predict the effect of selection based solely
on the fitness distribution, mutation and crossover depend upon the configurations
of the strings in the population which cannot be inferred from their fitnesses. We
use a statistical mechanics assumption to bridge this gap.
We characterise the distribution by its cumulants. Cumulants are statistical prop-
erties of a distribution which are like moments, but are more stable and robust. The
first two cumulants are the mean and the variance respectively. The third cumulant is
related to the skewness; it measures the asymmetry of the distribution about the mean.
The fourth cumulant is related to the kurtosis; it measures whether the distribution
falls off faster or more slowly than a Gaussian.
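For later reference, the first four cumulants of a finite fitness sample can be estimated from central moments; a small sketch (our own code):

```python
import numpy as np

def cumulants(F):
    """First four cumulants of a sample of fitnesses F (biased large-P
    estimators via central moments)."""
    mu = F.mean()
    c = F - mu
    k2 = (c ** 2).mean()
    k3 = (c ** 3).mean()                 # related to the skewness
    k4 = (c ** 4).mean() - 3 * k2 ** 2   # related to the kurtosis
    return mu, k2, k3, k4

print(cumulants(np.random.default_rng(5).standard_normal(100000)))
# approximately (0, 1, 0, 0) for Gaussian data
```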
3 Test Problems
The method has been applied to four problems. The simplest task, and one which
will be used throughout this paper to illustrate the method, is the optimisation
of a linear, spatially inhomogeneous function, counting random numbers. In this
problem, each bit of the string contributes individually an arbitrary amount. The
fitness is

F(\tau) = \sum_{i=1}^{N} J_i \tau_i \qquad (1)

Here \tau is a string of \pm 1 of length N which the GA searches over. The J_i's are fixed
random numbers.
Other problems to which the method has been applied include: the task of max-
imising the energy of a spin-glass chain [4, 5, 6], the subset sum problem [7] (an
NP hard problem in the weak sense), and learning in an Ising perceptron [8]. For
the first two problems, the dynamics can be solved, and the formalism predicts ac-
curately the evolution of the genetic algorithm. The latter problem has been much
more difficult to solve, however.
4 The Genetic Operators
We now discuss how our approach is applied to the study of three specific GA
operators: selection, mutation, and crossover. We will use the counting random
numbers task to illustrate how the calculations are done.
4.1 Selection
Selection is the operation whereby better members of the population are replicated
and less good members are removed. The most frequently used method is to choose
the members of the next population probabilistically using a weighting function
R(F^\alpha). We have studied Boltzmann selection
R(F) \propto \exp(\beta F)
where \beta is a parameter controlling the selection rate. Fitness proportional selection
(R(F) = F), culling selection (choose the n best), and other forms can also be
handled by the approach.
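A minimal sketch of Boltzmann selection acting on a finite population (function and variable names are our own):

```python
import numpy as np

def boltzmann_select(pop, F, beta, rng):
    """Resample a population with probabilities proportional to exp(beta*F)."""
    w = np.exp(beta * (F - F.max()))   # subtract the maximum for numerical stability
    idx = rng.choice(len(pop), size=len(pop), p=w / w.sum())
    return pop[idx]
```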
In an infinite population, the new distribution of fitness in terms of the old p(F)
would simply be R(F)p(F) up to a normalisation factor. In a finite population,
there will be fluctuations around this. The way to treat this is to draw P levels
from p. The generating function for the cumulants of the resulting finite population
can then be computed; this problem is analogous to the Random Energy Model
proposed and studied by Derrida [9], and can be computed by using the same
methods developed for that problem.
For Boltzmann selection, the cumulant expansion for the distribution after selection
in terms of the distribution before selection can be found as an expansion in the
selection parameter \beta; for the third cumulant, for example,

K_3^s = \left(1 - \frac{3}{P}\right) K_3 + \beta \left[ \left(1 - \frac{6}{P}\right) K_4 - \frac{6}{P} K_2^2 \right] + \cdots

Alternatively, the cumulants can be found for arbitrary \beta numerically.
For many problems, the initial distribution will be nearly Gaussian. Selection in-
troduces a negative third cumulant - the low-fitness tail is more occupied than the
high-fitness one. An optimal amount of selection is indicated. The improvement of
the mean increases with \beta initially, but saturates for large \beta. Selection also decreases
the variance; this effect is small for small \beta but becomes important for increasing
values. The trade-off between these two effects of selection - increase in mean but
decrease in genetic diversity - is balanced for intermediate values of \beta. This has
been discussed in earlier work [5, 6].
4.2 Mutation
Mutation causes small random changes to the individual bits of the string. Thus, it
causes a local exploration, in common with many other search algorithms. It acts
on each member of the population independently.
To study the effect of mutation, we introduce a set of mutation variables, m_i^\alpha, one for
each site of each string, which take the value 1 if the site is mutated and 0 otherwise.
Let m be the mutation probability. In terms of these variables, the fitness after
mutation is (for counting random numbers)

F = \sum_i J_i \left(1 - 2 m_i^\alpha\right) \tau_i^\alpha \qquad (2)
Averaging over all variables gives the cumulants after mutation. This yields

K_1^m = (1 - 2m) K_1

K_2^m = K_2 + \left(1 - (1 - 2m)^2\right)(N - K_2) \qquad (3)

K_3^m = (1 - 2m)^3 K_3 - 2(1 - 2m)\left(1 - (1 - 2m)^2\right) \sum_i J_i^3 \langle \tau_i \rangle

where \langle \tau \rangle denotes the population average. (The fourth cumulant can be calculated
in a similar fashion.)
The first two equations express obvious effects. Mutation brings the mean and
variance back towards the values of a random population - the mean decreases to zero
and the variance increases toward N. The equation for the third cumulant is more
interesting. The first term decreases the third cumulant, but the second increases
its magnitude if it is negative (which it typically is). This is the part which depends
upon the configuration average.
4.3 Crossover
Crossover can be treated in a similar manner to mutation. Introduce a set of
crossover variables, x_i^{\alpha\beta}, which are 1 if string \alpha is crossed with string \beta at site i and
0 otherwise; write equations for the cumulants in terms of these variables and average.
One finds that both the mean and the variance remain unchanged. The third and
higher cumulants are reduced and brought to natural values which depend upon
configurational averages.
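Putting the three operators together gives one full generation. The sketch below uses the parameter values quoted in the Figure 1 caption (N = 127, P = 50, β = 0.1, m = 1/2N); the uniform crossover used here is our own choice of crossover operator:

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, beta, m_rate = 127, 50, 0.1, 1.0 / (2 * 127)

J = rng.standard_normal(N)                    # fixed random couplings J_i
pop = rng.choice([-1, 1], size=(P, N))        # strings tau of +-1

for gen in range(200):
    F = pop @ J                                              # fitness (1)
    w = np.exp(beta * (F - F.max()))
    pop = pop[rng.choice(P, size=P, p=w / w.sum())]          # Boltzmann selection
    partners = pop[rng.permutation(P)]
    mask = rng.random((P, N)) < 0.5                          # uniform crossover
    pop = np.where(mask, pop, partners)
    flip = rng.random((P, N)) < m_rate                       # mutation
    pop = np.where(flip, -pop, pop)

F = pop @ J
# the optimum is sum(|J_i|), about 101 on average for N = 127 standard
# normal couplings, matching the ground state quoted in Figure 1
print(F.mean(), F.var(), F.max())
```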
4.4 Using the Statistical Mechanics Ansatz to Infer Configurational Averages from Fitness Distribution Averages
In the previous section we showed that the effect of mutation on the cumulants
depends upon the fitness distribution (i.e. on the cumulants) and upon properties
of the configuration of the strings. As we have argued, one cannot, in principle,
determine the configuration from the fitness distribution. We invoke a statistical
mechanics ansatz. We assume that the string variables are free to fluctuate subject
to the constraint that the cumulants are given. In other words, our assumption
is that of all the configurations which have the same distribution of fitnesses, the
more numerous ones are more likely to describe the actual one. This can be seen
as a maximum entropy assumption. We use a simple statistical mechanics model
to implement the proposed relationship between fitness statistics and configuration
statistics. Details are presented in [6].
5 Discussion and Future Work
The equations for each of the genetic operators can be put together to predict the
whole dynamics. Typical curves are shown in figure 1. The theoretical curves were
produced by calculation of the initial distribution theoretically, and iterating the
equations repeatedly. No experimental input was used. Although, the agreement
between experiment and theory is not perfect, these results are as accurate as any
approach of which we are aware.
This gives a picture of the role of crossover for this simple problem. Selection im-
proves the average fitness, but decreases variance and introduces a negative skew-
ness. Mutation increases the variance, introducing genetic diversity; however, it also
decreases the mean. Crossover has no effect on the first two cumulants. It reduces
the magnitude of the skewness, however. This replaces some of the low-fitness tail
with some high-fitness tail, which improves the best member of the population.
Since, for this problem, crossover has no effect on the mean, there is no cost in
doing this and crossover is helpful. For a realistic problem, however, crossover will
introduce a deleterious effect to the mean; whether this effect dominates over the
improvement due to the decrease of the third cumulant may determine whether the
crossover operator is a useful one for the problem in question.
Figure 1 The solid curves show the evolution of the first two cumulants and the fitness of the best member of the population for N = 127, P = 50, $\beta = 0.1$ and m = 1/2N. The ground state is at $F_{\mathrm{best}} \approx 101$. The dashed curve shows the theoretical prediction. The overscore indicates averaging over 10000 runs. From reference [6].
REFERENCES
[1] David E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, Reading, Mass. (1989).
[2] A. Nix and Michael D. Vose, Modelling genetic algorithms with Markov chains, Annals of Mathematics and Artificial Intelligence, Vol. 5 (1991), pp79-88.
[3] Darrell Whitley, An executable model of a simple genetic algorithm, in: L. Darrell Whitley, ed., Foundations of Genetic Algorithms 2, Morgan Kaufmann, San Mateo (1993).
[4] A. Prügel-Bennett and J. L. Shapiro, An analysis of genetic algorithms using statistical mechanics, Phys. Rev. Letts., Vol. 72(9) (1994), pp1305-1309.
[5] J. L. Shapiro, A. Prügel-Bennett, and M. Rattray, A statistical mechanical formulation of the dynamics of genetic algorithms, Lecture Notes in Computer Science, No. 864 (1994), pp17-27.
[6] A. Prügel-Bennett and J. L. Shapiro, Dynamics of genetic algorithms for simple random Ising systems, Physica D, in press (1997).
[7] L. M. Rattray, An analysis of a genetic algorithm with stabilizing selection, Complex Systems, Vol. 9 (1995), pp213-234.
[8] L. M. Rattray and J. L. Shapiro, The dynamics of a genetic algorithm for a simple learning problem, Journal of Physics A, Vol. 29(23) (1996), pp7451-7473.
[9] B. Derrida, Random-energy model: An exactly solvable model of disordered systems, Phys. Rev., Vol. B24 (1981), pp2613-2626.
VOLUMES OF ATTRACTION BASINS IN RANDOMLY
CONNECTED BOOLEAN NETWORKS
Sergey A. Shumsky
P.N.Lebedev Physics Institute, Leninsky pr.53, Moscow, Russia.
The paper presents the distribution function of the volumes of attraction basins in phase portraits of Boolean networks with random interconnections for a special class of uniform nets, built from unbiased elements. The distribution density at large volumes tends to the universal power law $\mathcal{F} \propto V^{-3/2}$.
1 Introduction
Randomly Connected Boolean Networks (RCBN) represent a wide class of connectionist models which attempt to understand the behavior of systems built from a large number of richly interconnected elements [1].
Since the early studies of Kauffman [2], RCBN have become a classical model for studying the dynamical properties of random networks. The most intriguing feature of RCBN is the phase transition from the stochastic regime to ordered behavior, first observed in simulations [2] and later explained analytically [3]-[5].
In the chaotic phase the phase trajectories are attracted by very long cycles, whose lengths grow exponentially with the size of the system N [4]. Thus, even for systems that are not very large it is practically impossible to identify these cycles, and their phase trajectories resemble random walks in the phase space.
On the contrary, in the ordered phase short cycles dominate [4]. In fact, this paper will present numerical evidence that almost the whole of the system's phase space belongs to the attraction basins of the fixed points. The distribution of the
volumes of these attraction basins is an important characteristic, since it gives the
probabilities of different kinds of asymptotic behavior of the system.
So far, the distribution of attraction basins is known only for fully connected Boolean networks, namely the Random Map Model (RMM) [6], which represents the chaotic RCBN. The onset of the ordered phase implies that the mean number of inputs per element, K, does not exceed some critical value, which depends only on the type of the Boolean elements [5]. Thus, this phase occurs only in the case of extremely sparse interconnections: $K/N \to 0$ as $N \to \infty$.
In the present paper we calculate the distribution of attraction basins in the ordered phase, using the fact that in sparsely connected networks there exists a correlation between the numbers of logical switchings at consecutive time steps, which is absent in the RMM. In Sec. 2 we formulate our model as a straightforward extension of the RMM which takes this correlation into account. Sec. 3 presents the solution of our problem, found by means of the theory of branching processes. Sec. 4 compares the theoretical predictions with simulation results for Ising neural networks. Sec. 5 gives some concluding remarks.
2 Model Description
The networks under consideration consist of N two-state automata receiving their
inputs from the outputs (states) of other automata, connected with the given one.
These connections are attached at random when the network is assembled and then
fixed. The parallel dynamics is governed by the map:
$$\mathbf{x} \Rightarrow \boldsymbol{\Phi}(\mathbf{x}), \qquad (\mathbf{x},\ \boldsymbol{\Phi} \in \{\pm 1\}^N). \qquad (1)$$
The function $\boldsymbol{\Phi}(\mathbf{x})$ depends vacuously on some of its arguments.
In a phase portrait of a network there is a unique phase vector, beginning at each
phase state. However, there are no restrictions on the number of vectors coming to
that state. Thus, any phase portrait represents a forest of the trees with their roots
belonging to the attractor set. For the ordered phase this attractor set is represented
mainly by fixed points.
We shall be interested in the statistical properties of this forest, namely the distribution of the volumes of the trees. These statistical properties, of course, should characterize not an individual map, but some ensemble of maps. The ensemble of RCBN contains all possible variants of interconnections among N automata, chosen at random from some infinite basic set of automata. We assume that this set of automata is unbiased, that is, the statistical properties of the phase vectors do not depend on the state point. Such ensembles will be further referred to as uniform ones.
Such is, for example, the RMM, where a phase vector starting in any state may come to each state with equal probability. Since there are n such possibilities for each state point (where $n = 2^N$ is the phase volume of the system), this ensemble consists of $n^n$ maps, corresponding to all possible variants of fully connected Boolean networks.
In the absence of correlations between consecutive vectors, one can characterize the uniform ensemble by only one statistical characteristic, the distribution of vector lengths (in the Hamming sense), $W_m$, which is binomial:
$$W_m = \binom{N}{m} (1 - w_0)^m\, w_0^{N-m}, \qquad (2)$$
with $w_0 = (1 + v_0)/2$ being the mean fraction of fixed points in the phase portraits of automata from the basic set [5]. For the RMM, for example, $v_0 = 0$ and $w_0 = 1/2$.
But in the absence of self-excitable automata in the basic automata set (i.e. those that oscillate for fixed values of their inputs), $v_0 \ge 0$. This leads to an exponentially large number of fixed points:
$$n_0 = n\, w_0^N = (1 + v_0)^N. \qquad (3)$$
In the general case there exists a correlation between the lengths of consecutive vectors. This correlation is more pronounced in diluted networks. Indeed, recall that the vector's length is the number of automata which change state at the corresponding time step. In diluted networks the probability that a given automaton will change state is proportional to the probability that at least one of its inputs has changed its value. As a consequence, the mean number of automata switchings at the next step is proportional to that at the previous step.
To take this fact into account we introduce the conditional probability $P_{ml}$ that, in phase trajectories from a given ensemble, a vector of length $m$ will be followed by a vector of length $l$. This is a straightforward generalization of the RMM, for which $P_{ml} = W_l$.
Summarizing, we will deal with an ensemble of phase portraits (1) with the statistical characteristics of phase trajectories given by $W_m$ and $P_{ml}$. We are interested, however, not in the characteristics of trajectories, but rather in those of the random trees
in phase portraits from our ensemble. One such characteristic is the mean number $\nu_{lm}$ of vectors of length $m$ which precede a vector of length $l$:
$$\nu_{lm} = W_m P_{ml} / W_l, \qquad (m = 1, \dots, N;\ l = 0, \dots, N). \qquad (4)$$
Zero-length vectors do not precede any state, and are thus excluded: $\nu_{l0} \equiv 0$. The normalization condition $\sum_{l=0}^{N} P_{ml} = 1$ may be rewritten as:
$$\sum_{l=0}^{N} W_l\, \nu_{lm} = W_m, \qquad (m = 1, \dots, N). \qquad (5)$$
To complete the description of the random forest one needs not only the average number $\nu_{lm}$, but the whole distribution function for the number of different vectors preceding a given vector of length $l$. In the limit $n \to \infty$ this distribution tends to a Poisson one with the generating function:
$$f_l(\mathbf{s}) = \exp\Big[\sum_m \nu_{lm}(s_m - 1)\Big]. \qquad (6)$$
The latter is a function of the formal parameter $\mathbf{s}$:
$$f_l(\mathbf{s}) = \sum \tilde{\nu}_{l;\, m_1, \dots, m_N}\; s_1^{m_1} \cdots s_N^{m_N},$$
with $\tilde{\nu}_{l;\, m_1, \dots, m_N}$ being the probability that a vector of length $l$ is preceded by $m_1$ vectors of length 1, ..., $m_N$ vectors of length N. In the remainder of this paper all running indices range from 1 to N.
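The bookkeeping in (4) and (5) is easy to check numerically. The sketch below builds $\nu_{lm}$ from the binomial distribution (2) and an arbitrary, made-up row-stochastic conditional matrix P, then verifies the normalization (5); all concrete values are illustrative assumptions.

```python
import math
import numpy as np

N = 10
w0 = 0.6                                  # mean fraction of fixed outputs (illustrative)
W = np.array([math.comb(N, k) * (1 - w0) ** k * w0 ** (N - k)
              for k in range(N + 1)])     # eq. (2)

rng = np.random.default_rng(0)
P = rng.random((N + 1, N + 1))
P /= P.sum(axis=1, keepdims=True)         # P[m, l]: made-up conditional probabilities

nu = np.zeros((N + 1, N + 1))             # nu[l, m] = W_m P_{ml} / W_l, eq. (4)
for l in range(N + 1):
    for m in range(1, N + 1):             # zero-length vectors excluded: nu_{l0} = 0
        nu[l, m] = W[m] * P[m, l] / W[l]

assert np.allclose((W @ nu)[1:], W[1:])   # normalization (5)
```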
3 Distribution of the Basins Volumes
Now, when all the characteristics of the random trees are determined, one can use the well-known results of the theory of branching processes. For the sake of clarity we will not supply the details here, relegating the involved calculations to a more extensive publication.
3.1 General Solution
The theory of branching processes allows one to find the distribution of the volumes of random trees, provided the generating function $f(\mathbf{s})$ is known [7]. For the one given by (6) the fraction of trees of volume V is:
$$\mathcal{F}(V) = \frac{\exp\left(-\sum_i k_i \delta_i / 2\right)}{\sqrt{2\pi \sum_i k_i \rho_i^2}}, \qquad (7)$$
where $k$, $\rho$ and $\delta$ satisfy the saddle point equations:
$$\sum_j \pi_{ij}\, \delta_j = -\beta, \qquad (8)$$
(9)
Here $\rho_i = -\delta_i/\beta = \sum_j \pi^{-1}_{ij}$, and the matrix $\pi_{ij} \equiv \delta_{ij} - \nu_{ij}$ has eigenvalues $\gamma_i = 1 - \xi_i$, the relaxation decrements of the initial Markov process ($\xi_i$ are the eigenvalues of the matrices $P$ and $\hat{\nu}$).
The solution of this system of equations reads:
$$k_i = \sum_a \frac{l_i^a\, k^a}{\lambda^a - \beta}, \qquad (10)$$
with $\lambda^a$, $l^a$ and $r^a$ being the eigenvalues and the left and right eigenvectors of the generalized eigenvalue problem:
$$\sum_j \pi_{ij}\, r_j^a = \lambda^a \rho_i\, r_i^a. \qquad (11)$$
Coefficients $k^a$ and $\rho^a$ in (10) represent the spectral expansion of the known vectors: $k_i = \sum_a l_i^a k^a / \lambda^a$, $\rho_i = \sum_a r_i^a \rho^a / \lambda^a$. The Lagrange parameter $\beta$ is found from the equation:
(12)
[Figure 4: distribution $\mathcal{F}(V)$ of basin volumes, plotted on logarithmic axes.]
reached their attractors. The results showed that for sufficiently large N > 25 all
phase trajectories converge to some fixed point.
To obtain the distribution of the basins we examined the whole phase space. Since
the computational time grows exponentially with the size of the network, we chose
the networks of relatively small size, N = 11 (the measured probability of con-
vergence was 0.97). For K = 2, 25000 random networks were generated, and each
phase portrait was examined point by point. Fig. 4 shows the obtained distribution of the volumes of the basins of fixed points. The dotted line on this plot represents the asymptotic power law $\mathcal{F} \propto V^{-3/2}$. The correspondence is good over a wide range of volumes, except at the very tail of the distribution. The observed cut-off of the power law is due to the finiteness of the state space, ignored by the branching process approach.
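For readers who wish to reproduce this kind of experiment, the following sketch (not the author's code) assembles a small RCBN with K random inputs and a random truth table per automaton, exhausts its phase space, and tallies the basin volume of each fixed-point attractor. The parameters are illustrative; the paper used N = 11, K = 2 and 25000 networks.

```python
import itertools
import random
from collections import Counter

def random_rcbn(N, K, rng):
    """Random Boolean network: each automaton reads K others through a random truth table."""
    inputs = [rng.sample(range(N), K) for _ in range(N)]
    tables = [[rng.choice((-1, 1)) for _ in range(2 ** K)] for _ in range(N)]
    def step(state):
        nxt = []
        for i in range(N):
            idx = 0
            for j in inputs[i]:
                idx = (idx << 1) | (state[j] > 0)
            nxt.append(tables[i][idx])
        return tuple(nxt)
    return step

def basin_volumes(N, step):
    """Follow every state to its attractor; return {attractor representative: basin volume}."""
    label = {}
    for s0 in itertools.product((-1, 1), repeat=N):
        path, seen, s = [], set(), s0
        while s not in label and s not in seen:
            path.append(s)
            seen.add(s)
            s = step(s)
        rep = label[s] if s in label else min(path[path.index(s):])  # canonical cycle state
        for t in path:
            label[t] = rep
    return Counter(label.values())

rng = random.Random(0)
N = 10
step = random_rcbn(N, K=2, rng=rng)
vols = basin_volumes(N, step)
fixed_point_basins = {r: v for r, v in vols.items() if step(r) == r}
print(sorted(fixed_point_basins.values(), reverse=True))
```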
5 Conclusions
The paper presented the distribution of the attraction basins of randomly connected
Boolean networks, which generalize the Kauffman NK-model. The obtained results
may be useful, for example, for studying the relaxation processes in nonergodic
systems. At finite temperature the volumes of the basins play a role similar to that of the energy levels of the valleys in spin glasses.
REFERENCES
[1] J. Doyne Farmer, A Rosetta stone for connectionism, Physica D, Vol. 42 (1990), pp153-187.
[2] S. A. Kauffman, Metabolic stability and epigenesis in randomly connected nets, J. Theor. Biol., Vol. 22 (1969), pp437-467.
[3] B. Derrida and Y. Pomeau, Random networks of automata: a simple annealed approximation, Europhys. Lett., Vol. 1 (1986), pp45-49.
[4] K. E. Kürten, Critical phenomena in model neural networks, Phys. Lett. A, Vol. 129 (1988), pp157-160.
[5] S. A. Shumsky, Phase portrait characteristics of random logical networks, J. Moscow Phys. Soc., Vol. 2 (1992), pp263-281.
[6] B. Derrida and H. Flyvbjerg, The random map model: a disordered model with deterministic dynamics, J. Phys. France, Vol. 48 (1987), pp971-978.
[7] T. E. Harris, The Theory of Branching Processes, Springer-Verlag, Berlin (1963).
EVIDENTIAL REJECTION STRATEGY FOR NEURAL
NETWORK CLASSIFIERS
A. Shustorovich
Business Imaging Systems, Eastman Kodak Company, Rochester, USA.
In this paper we present an approach to estimating the confidences of competing classification decisions, based on the Dempster-Shafer Theory of Evidence. It allows us to utilize the information contained in the activations of all output nodes of a neural network, as opposed to just the one or two highest, as is usually done in the current literature. On a test set of 8,000 ambiguous patterns, a rejection strategy based on this confidence measure achieves up to a 30% reduction in error rates as compared to traditional schemes.
1 Introduction
Neural network classifiers usually have as many output nodes as there are com-
peting classes. A trained network produces a high output activation of the node
corresponding to the desired class and low activations of all the other nodes. In this
type of encoding, the best guess corresponds to the output node with the highest
activation. Our confidence in the chosen decision depends on the numeric value of
the highest activation as well as on the difference between it and the others, es-
pecially the second highest. Depending on the desired level of reliability, a certain
percentage of classification patterns can be rejected, and the obvious strategy is to
reject patterns with lower confidence first. The rejection schemes most commonly
used in literature are built around two common-sense ideas: the confidence increases
with the value of the highest activation and with the gap between the two highest
activations.
In some systems, only one of these values is used as the measure of confidence;
in others, some ad-hoc combination of both. For example, Battiti and Colla [1]
reject patterns with the highest activation (HA) below a fixed threshold, and also
those with the difference between the highest and the second highest activation
(DA) below a second fixed threshold. Martin et al. [5] use only DA in their Sac-
cade System, whereas Gaborski [4] uses the ratio of the highest and the second
highest activation (RA) as the preferred measure of confidence. Fogelman-Soulie
et al. [3] propose a distance-based (DI) rejection criterion in addition to HA and
DA schemes; namely they compare the Euclidean distance between the activation
vector and the target activation vectors of all classes and reject the pattern if the
smallest of these distances is greater than a fixed threshold. Bromley and Denker
[2] mention experiments conducted by Yann Le Cun in support of their choice of
the DA strategy.
Logically, one expects the two highest activations to provide almost all the informa-
tion for the rejection decision, because usually ambiguities involve a choice between
the right classification and, at most, one predominant alternative. Still, ideally, we
would prefer a formula combining all the output activations into a single measure of
confidence. Such a formula can be derived if we view this problem as one of integra-
tion of information from different sources in the framework of the Dempster-Shafer
Theory of Evidence [6]. We can treat the activation of each output node as the
evidence in favor of the corresponding classification hypothesis and combine them
into the measure of confidence of the final decision.
2 The Dempster-Shafer Theory of Evidence
The Dempster-Shafer Theory of Evidence is a tool for representing and combining measures of evidence. This theory is a generalization of Bayesian reasoning, but it is more flexible than the Bayesian approach when our knowledge is incomplete and, therefore, we have to deal with uncertainty. It allows us to represent only actual knowledge, without forcing us to make a conclusion when we are ignorant.
Let $\Theta$ be a set of mutually exhaustive and exclusive atomic hypotheses, $\Theta = \{\theta_1, \dots, \theta_N\}$. $\Theta$ is called the frame of discernment. Let $2^\Theta$ denote the set of all subsets of $\Theta$. A function m is called a basic probability assignment if
$$m : 2^\Theta \to [0, 1], \quad m(\emptyset) = 0, \quad \text{and} \quad \sum_{A \subseteq \Theta} m(A) = 1. \qquad (1)$$
[Figure 1: in the orthogonal sum, the probability mass $m_1(X_i) \cdot m_2(Y_j)$ is committed to the intersection $X_i \cap Y_j$.]
where $\phi_n$ is a monotonic function of $o_n$. We shall discuss a specific form of this function later.
Now we have to combine these evidences into their orthogonal sum according to eqn. (2). Let us start with the case in which N = 2. The first simple evidence function $m_1$, with its focal element $\theta_1$, has the degree of support $\phi_1$. Its only other nonzero value is $m_1(\Theta) = 1 - \phi_1$. Similarly, $\theta_2$ is the focal element of $m_2$, with a degree of support of $\phi_2$, and $m_2(\Theta) = 1 - \phi_2$. When we produce the orthogonal sum $m = m_1 \oplus m_2$, we have to compute beliefs for a total of four subsets of $\Theta$: $\Theta$, $\theta_1$, $\theta_2$, and $\emptyset$ (Figure 2). The value $m(\Theta)$ should be proportional to $m_1(\Theta) \cdot m_2(\Theta)$
Figure 2 Orthogonal sum of two simple evidence functions with atomic foci.
because this is the only way to obtain $\Theta$ as the intersection. Similarly, $m(\theta_1)$ should be proportional to $m_1(\theta_1) \cdot m_2(\Theta)$, while $m(\theta_2)$ should be proportional to $m_2(\theta_2) \cdot m_1(\Theta)$. The only way we can produce $\emptyset$ as an intersection of a subset with nonzero $m_1$-evidence and another subset with nonzero $m_2$-evidence is $\emptyset = \theta_1 \cap \theta_2$. The corresponding product, $\phi_1 \cdot \phi_2$, should be excluded from the normalization constant K. Finally,
$$m(\theta_1) = K \phi_1 (1 - \phi_2), \quad m(\theta_2) = K \phi_2 (1 - \phi_1), \quad m(\Theta) = K (1 - \phi_1)(1 - \phi_2), \quad K = \frac{1}{1 - \phi_1 \phi_2}. \qquad (4)$$
It is obvious that if both $\phi_1 = 1$ and $\phi_2 = 1$, our formula cannot work because it involves division of $0/0$. However, this is quite appropriate, because it indicates that both $m_1$ and $m_2$ are absolutely confident and flatly contradict each other. In such a case, there is no hope of reconciling the differences. If only one of the two degrees of support equals 1, for example $\phi_1 = 1$, then $m(\theta_1) = 1$, $m(\theta_2) = 0$, and $m(\Theta) = 0$. In this case, $m_2$ cannot influence the absolutely confident $m_1$.
If neither of the degrees of support equals 1, we can denote $a_n = \frac{\phi_n}{1 - \phi_n}$ and transform eqn. (4) into a more convenient form:
$$m(\theta_1) = \frac{a_1}{1 + a_1 + a_2}, \quad m(\theta_2) = \frac{a_2}{1 + a_1 + a_2}, \quad m(\Theta) = \frac{1}{1 + a_1 + a_2}. \qquad (5)$$
Using mathematical induction, one can prove that when we have to combine N simple evidence functions with atomic focal elements, their orthogonal sum becomes the natural extension of eqn. (5): $m(\theta_n) = a_n / (1 + \sum_j a_j)$ and $m(\Theta) = 1 / (1 + \sum_j a_j)$.
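A minimal sketch of the resulting confidence measure follows. Purely for illustration it takes the degree of support $\phi_n$ to be the output activation itself, clipped away from 1; this specific monotonic mapping and the rejection threshold are assumptions, not the paper's calibrated choices.

```python
import numpy as np

def evidential_confidence(activations, eps=1e-6):
    """Combine all output activations into one belief per class (N-focus form of eq. (5))."""
    phi = np.clip(np.asarray(activations, dtype=float), 0.0, 1.0 - eps)
    a = phi / (1.0 - phi)            # a_n = phi_n / (1 - phi_n)
    z = 1.0 + a.sum()
    m = a / z                        # m(theta_n); the residual 1/z is m(Theta), total ignorance
    return int(np.argmax(m)), float(m.max()), 1.0 / z

cls, belief, ignorance = evidential_confidence([0.93, 0.20, 0.15, 0.05])
print("reject" if belief < 0.5 else "class %d, belief %.3f" % (cls, belief))
```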
[Figure: error rate versus rejection rate for the EV, HA, DA, RA and DI rejection strategies.]
points $\lambda_0, \lambda_1, \dots, \lambda_M$ at which the function f(x) changes its character are termed break points or knots in the theory of splines (other authors prefer 'joints' or 'nodes'). The polynomial pieces are joined together with certain smoothness conditions. The smoothness conditions are $\lim_{x \to \lambda_i^-} f^{(j)}(x) = \lim_{x \to \lambda_i^+} f^{(j)}(x)$, $(j = 1, \dots, k-1)$. In other words, a spline function of degree $k \ge 0$, having as break points the strictly increasing sequence $\lambda_j$, $(j = 0, 1, \dots, M+1;\ \lambda_0 = 0,\ \lambda_{M+1} = d)$, is a piecewise polynomial on the intervals $[\lambda_j, \lambda_{j+1}]$ of degree k at most, with continuous derivatives up to order $k-1$ on $[0, d]$. It is known that a uniformly continuous function can be approximated by polynomial splines to arbitrary accuracy; see for instance de Boor [2]. This concept can be naturally extended to higher dimensions. The set of spline functions is endowed with the natural structure of a finite-dimensional vector space. The dimension of the vector space of spline functions of degree k with M interior break points is $m = M + k + 1$. A neural network-like basis of the space of spline functions is the basis of B-splines. Each spline function can be written as a unique linear combination of basis functions. In this paper we shall use the basis of B-splines, which may be defined, for example, in a recursive way (cf. de Boor [2]). Let us define $\{\lambda_{-k} = \lambda_{-k+1} = \dots = \lambda_0 = 0 < \lambda_1 < \lambda_2 < \dots < \lambda_M < d = \lambda_{M+1} = \dots = \lambda_{M+k+1}\}$, an extended set of $M + 2k + 2$ knots associated with the interval $[0, d]$. Each B-spline basis function (a unit in neural network terminology) $B_j^k(x)$ is completely described by a set of $k+2$ knots $\{\lambda_{j-k-1}, \lambda_{j-k}, \dots, \lambda_j\}$. There are two important properties of these units [2]: they are nonnegative and vanish everywhere except on a few contiguous subintervals, $B_j^k(x) > 0$ if $\lambda_{j-k-1} \le x \le \lambda_j$ and $B_j^k(x) = 0$ otherwise, and they define a partition of unity on $[0, d]$, $\sum_{j=1}^m B_j^k(x) = 1$ for all $x \in [0, d]$. We shall mainly deal with cubic (k = 3) and quadratic (k = 2) B-splines because of their simplicity and importance in applications. Higher-degree splines are used whenever more smoothness is needed in the approximating function. For a chosen number m of units, the resulting polynomial spline is a function $f_m$ from the set $S_m^k = \{f_m : [0, d] \to \mathbb{R} \mid f_m(x) = \sum_{j=1}^m w_j B_j^k(x)\}$. By $S^k = \cup_m S_m^k$ we denote the set of all univariate splines of degree k (order k+1). All these properties allow us to think of spline functions as neural networks. The numbers $w_j$ are called weights (in statistics and neural network theory) or control points (in computer-aided geometric design).
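The recursive definition referred to above is the Cox-de Boor recursion; the sketch below evaluates $B_j^k$ on the extended knot set and checks the partition of unity at one point. Knot placement and degree are illustrative choices.

```python
import numpy as np

def bspline_basis(j, k, knots, x):
    """Degree-k B-spline supported on knots[j], ..., knots[j + k + 1] (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left = right = 0.0
    if knots[j + k] > knots[j]:
        left = (x - knots[j]) / (knots[j + k] - knots[j]) \
               * bspline_basis(j, k - 1, knots, x)
    if knots[j + k + 1] > knots[j + 1]:
        right = (knots[j + k + 1] - x) / (knots[j + k + 1] - knots[j + 1]) \
                * bspline_basis(j + 1, k - 1, knots, x)
    return left + right

d, k, M = 1.0, 2, 3                                  # quadratic splines, M interior knots
interior = np.linspace(0.0, d, M + 2)[1:-1]
knots = np.concatenate([[0.0] * (k + 1), interior, [d] * (k + 1)])  # M + 2k + 2 knots
vals = [bspline_basis(j, k, knots, 0.37) for j in range(M + k + 1)]  # m = M + k + 1 units
print(sum(vals))                                     # partition of unity: ~1.0
```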
2.2 Feedforward Neural Networks
The virtue of neural networks in our approach is their capability to capture more information about the true system structure than, e.g., polynomials can. There are a number of methods for estimating the network parameters; for example, incremental learning algorithms are described in Fahlman and Lebiere [4], or Jones's approach in [5]. We will assume that the neural network parameters have been estimated by one of the available learning algorithms. In the case of B-spline networks, the simplest approach computes the weights by the least squares method and, eventually, iterates the placement of knots in order to optimize it. Neural networks do not
provide immediate insight into how the modeled mechanism works. This insight
can be gained by solving an extraction problem. What we mean by an extraction
problem for a neural net is how to extract the underlying structure of dependence
in terms of a local polynomial regression model. This framework is useful when a
simple explicit description (e.g. in terms of the low-degree polynomial regression) is
required. Also, this technique is more feasible than a 'direct' nonlinear polynomial regression for systems where the generator $g$ is changing in time.
is from the family of quadratic maps that is known to be chaotic and ergodic. The time series generated by this map cannot be predicted many steps ahead, but the generator of this series can be identified from the time series. We used a quadratic spline $\hat{f}_{mn}$ with 40 interior break points to identify sets of polynomial coefficients. Pieces of the spline $\hat{f}_{mn}$ are represented by local polynomials $a_j(x - \lambda_j)^2 + b_j(x - \lambda_j) + c_j$, $(j = 0, 1, \dots, 39)$. Figures 1, 2 and 3 show the polynomial coefficients $a_j(0)$, $b_j(0)$, $c_j(0)$. These coefficients were recalculated from the local coefficients at the common break point $\lambda_0 = 0$. Figure 4 shows the piecewise quadratic map. Figure 5 shows the error of the spline fit. Figure 6 shows 400 points of the chaotic time series generated by the piecewise quadratic map and additive Gaussian noise $N(0, \sigma^2 = 0.0001)$. The dynamical problem then becomes the following approximation test: a domain is randomly partitioned into a finite number of subintervals. On each subinterval data are generated by a noisy polynomial. The aim of the analysis is to find those polynomials and/or to find all break points.
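A minimal sketch of this extraction idea, using a generic smoothing-spline fit from scipy rather than the authors' estimator: the local polynomial around a point $x_0$ is read off from the Taylor coefficients of the fitted spline. The generator g and all numeric settings are made-up stand-ins.

```python
import numpy as np
from scipy.interpolate import splrep, splev

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 2000))
g = lambda t: 4.0 * t * (1.0 - t)              # stand-in quadratic generator
y = g(x) + rng.normal(0.0, 0.01, x.size)       # noisy observations

tck = splrep(x, y, k=2, s=x.size * 0.01 ** 2)  # quadratic smoothing spline

# Taylor expansion at x0: f(t) ~ c + b (t - x0) + a (t - x0)^2
x0 = 0.3
c = splev(x0, tck)
b = splev(x0, tck, der=1)
a = 0.5 * splev(x0, tck, der=2)
print(a, b, c)   # for g above: a ~ -4, b ~ 4 - 8 * x0, c ~ g(x0)
```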
7 Concluding Remarks
We have proposed a method employing the B-spline network for estimation of an unknown response function from noisy data, and for extraction of a potential piecewise polynomial submodel. It has been shown how the first terms of the Taylor expansion of the estimator, when expanded at different points of its domain, can be utilized for revealing the degree of the underlying polynomial model, and for the detection of changes in the model. This approach can be adapted for multidimensional and discrete dynamical systems (3) or (4). In these cases, we have to work with the partial derivatives. However, the multivariate case leads to the use of multivariate approximating functions. In such a situation, the estimation becomes less effective (a much greater amount of data is needed), and even the theoretical rate of convergence is slower. Many authors (cf. Breiman [1], Stone [8]) recommend the use of additive approximations with low-dimensional components.
REFERENCES
[1] Breiman L., Fitting additive models to regression data: diagnostics and alternative views, Comp. Statistics and Data Analysis, Vol. 15 (1993), pp13-46.
[2] de Boor C., A Practical Guide to Splines, Springer Verlag (1978).
[3] Hall C.A. and Meyer W.W., Optimal error bounds for cubic spline interpolation, Journal of Approximation Theory, Vol. 16, No. 2 (February 1976).
[4] Fahlman S.E. and Lebiere C., The cascade correlation learning architecture, Research Report CMU-CS-90-100 (1990).
[5] Kurkova V. and Smid J., An incremental architecture algorithm for feedforward neural nets, in: Computer Intensive Methods in Control and Signal Processing, Preprints of the European IEEE Workshop, Prague (1994).
[6] Marsden M.J., Quadratic Spline Interpolation, Bulletin of the American Mathematical Society, Vol. 80, No. 5 (September 1974).
[7] Smid J. and Volf P., Dynamics retrieval and characterization from nonlinear time series, Neural Network World, to appear (1995).
[8] Stone C.J., Additive regression and other nonparametric models, The Annals of Statistics, Vol. 13 (1985), pp689-705.
Acknowledgements
This research is supported by NASA/GSFC under Grants NAG 5-2686 and NCC 5-127, and by the Czech Academy of Sciences Grant No. 275 408.
QUERY LEARNING FOR MAXIMUM INFORMATION
GAIN IN A MULTI-LAYER NEURAL NETWORK
Peter Sollich
Department of Physics, University of Edinburgh
Edinburgh EH9 3JZ, UK. Email: [email protected]
In supervised learning, the redundancy contained in random examples can be avoided by learning
from queries, where training examples are chosen to be maximally informative. Using the tools of
statistical mechanics, we analyse query learning in a simple multi-layer network, namely, a large
tree-committee machine. The generalization error is found to decrease exponentially with the
number of training examples, providing a significant improvement over the slow algebraic decay
for random examples. Implications for the connection between information gain and generalization
error in multi-layer networks are discussed, and a computationally cheap algorithm for constructing
approximate maximum information gain queries is suggested and analysed.
1 Introduction
In supervised learning of input-output mappings, the traditional approach has been
to study generalization from random examples. However, random examples contain
redundant information, and generalization performance can thus be improved by
query learning, where each new training input is selected on the basis of the existing
training data to be most 'useful' in some specified sense. Query learning corresponds
closely to the well-founded statistical technique of (sequential) optimal experimental
design. In particular, we consider in this paper queries which maximize the expected
information gain, which are related to the criterion of (Bayes) D-optimality in op-
timal experimental design. The generalization performance achieved by maximum
information gain queries is by now well understood for single-layer neural networks
such as linear and binary perceptrons [1, 2, 3]. For multi-layer networks, which are
much more widely used in practical applications, several heuristic algorithms for
query learning have been proposed (see e.g., [4, 5]). While such heuristic approaches
can demonstrate the power of query learning, they are hard to generalize to sit-
uations other than the ones for which they have been designed, and they cannot
easily be compared with more traditional optimal experimental design methods.
Furthermore, the existing analyses of such algorithms have been carried out within
the framework of 'probably approximately correct' (PAC) learning, yielding worst
case results which are not necessarily close to the potentially more relevant average
case results. In this paper we therefore analyse the average generalization perfor-
mance achieved by query learning in a multi-layer network, using the powerful tools
of statistical mechanics. This is the first quantitative analysis of its kind that we
are aware of.
2 The Model
We focus our analysis on one of the simplest multi-layer networks, namely, the tree-
committee machine (TCM). A TCM is a two-layer neural network with N input
units, K hidden units and one output unit. The 'receptive fields' of the individual
hidden units do not overlap, and each hidden unit calculates the sign of a linear
combination (with real coefficients) of the N / K input components to which it is
connected. The output unit then calculates the sign of the sum of all the hidden
unit outputs. A TCM therefore effectively has all the weights from the hidden to
the output layer fixed to one. Formally, the output y for a given input vector x is
$$y = \operatorname{sgn}\Big(\sum_{i=1}^{K} \sigma_i\Big), \qquad \sigma_i = \operatorname{sgn}(\mathbf{x}_i^{T}\mathbf{w}_i), \qquad (1)$$
where the $\sigma_i$ are the outputs of the hidden units, $\mathbf{w}_i$ their weight vectors, and $\mathbf{x}^T = (\mathbf{x}_1, \dots, \mathbf{x}_K)$ with $\mathbf{x}_i$ containing the N/K (real-valued) inputs to which hidden unit i is connected. The N (real) components of the K (N/K)-dimensional hidden unit weight vectors $\mathbf{w}_i$, which we denote collectively by $\mathbf{w}$, form the adjustable parameters of a TCM. Without loss of generality, we assume the weight vectors to be normalized to $\mathbf{w}_i^2 = N/K$. We shall restrict our analysis to the case
where both the input space dimension and the number of hidden units are large
(N -+ 00, K -+ 00), assuming that each hidden unit is connected to a large number
of inputs, i.e., N / K » 1. As our training algorithm we take (zero temperature)
Gibbs learning, which generates at random any TCM (in the following referred to as a 'student') which predicts all the training outputs in a given set of p training examples $\Theta^{(p)} = \{(\mathbf{x}^\mu, y^\mu),\ \mu = 1 \dots p\}$ correctly. We take the problem to be perfectly learnable, which means that the outputs $y^\mu$ corresponding to the inputs $\mathbf{x}^\mu$ are generated by a 'teacher' TCM with the same architecture as the student but with different, unknown weights $\mathbf{w}^0$. It is further assumed that there is no noise on the training examples. For learning from random examples, the training inputs $\mathbf{x}^\mu$ are sampled randomly from a distribution $P_0(\mathbf{x})$. Since the output (1) of a TCM is independent of the length of the hidden unit input vectors $\mathbf{x}_i$, we assume this distribution $P_0(\mathbf{x})$ to be uniform over all vectors $\mathbf{x}^T = (\mathbf{x}_1, \dots, \mathbf{x}_K)$ which obey the spherical constraints $\mathbf{x}_i^2 = N/K$. For query learning, the training inputs $\mathbf{x}^\mu$ are chosen to maximize the expected information gain of the student, as follows. The information gain is defined as the decrease in the entropy S in the parameter space of the student. The entropy for a training set $\Theta^{(p)}$ is given by
$$S(\Theta^{(p)}) = -\int d\mathbf{w}\, P(\mathbf{w}|\Theta^{(p)}) \ln P(\mathbf{w}|\Theta^{(p)}). \qquad (2)$$
For the Gibbs learning algorithm considered here, $P(\mathbf{w}|\Theta^{(p)})$ is uniform on the 'version space', the space of all students which predict all training outputs correctly (and which satisfy the assumed spherical constraints on the weight vectors, $\mathbf{w}_i^2 = N/K$), and zero otherwise. Denoting the version space volume by $V(\Theta^{(p)})$, the entropy can thus simply be written as $S(\Theta^{(p)}) = \ln V(\Theta^{(p)})$. When a new training example $(\mathbf{x}^{p+1}, y^{p+1})$ is added to the existing training set, the information gain is $I = S(\Theta^{(p)}) - S(\Theta^{(p+1)})$. Since the new training output $y^{p+1}$ is unknown, only the expected information gain, obtained by averaging over $y^{p+1}$, is available for selecting a maximally informative query $\mathbf{x}^{p+1}$. As derived in Ref. [2], the probability distribution of $y^{p+1}$ given the input $\mathbf{x}^{p+1}$ and the existing training set $\Theta^{(p)}$ is $P(y^{p+1} = \pm 1 | \mathbf{x}^{p+1}, \Theta^{(p)}) = v_\pm$, where $v_\pm = V(\Theta^{(p+1)})|_{y^{p+1} = \pm 1} / V(\Theta^{(p)})$. The expected information gain is therefore
$$\langle I \rangle_{P(y^{p+1}|\mathbf{x}^{p+1}, \Theta^{(p)})} = -v_+ \ln v_+ - v_- \ln v_- \qquad (3)$$
and attains its maximum value ln 2 (= 1 bit) when $v_\pm = \frac{1}{2}$, i.e., when the new input $\mathbf{x}^{p+1}$ bisects the existing version space. This is intuitively reasonable, since $v_\pm = \frac{1}{2}$ corresponds to maximum uncertainty about the new output and hence to maximum information gain once this output is known.
Due to the complex geometry of the version space, the generation of queries which achieve exact bisection is in general computationally infeasible. The 'query by committee' algorithm proposed in Ref. [2] provides a solution to this problem by first sampling a 'committee' of 2k students from the Gibbs distribution $P(\mathbf{w}|\Theta^{(p)})$ and then using the fraction of committee members which predict +1 or -1 for the output y corresponding to an input $\mathbf{x}$ as an approximation to the true probability $P(y = \pm 1 | \mathbf{x}, \Theta^{(p)}) = v_\pm$. The condition $v_\pm = \frac{1}{2}$ is then approximated by the requirement that exactly k of the committee members predict +1 and -1, respectively. An approximate maximum information gain query can thus be found by sampling (or filtering) inputs from a stream of random inputs until this condition is met. The procedure is then repeated for each new query. As $k \to \infty$, this algorithm approaches the exact bisection algorithm, and it is on this limit that we focus in the following.
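The filtering step is simple to state in code. The toy sketch below replaces Gibbs sampling from the version space with an arbitrary stand-in committee of perceptron weight vectors; only the filtering logic itself reflects the algorithm described above, and all sizes are illustrative.

```python
import numpy as np

def committee_filter(committee, rng, dim, max_tries=100000):
    """Draw random inputs until the committee splits evenly (approximate bisection)."""
    for _ in range(max_tries):
        x = rng.standard_normal(dim)
        votes = sum(1 for w in committee if np.dot(w, x) > 0)
        if votes == len(committee) // 2:   # exactly k of the 2k members predict +1
            return x
    raise RuntimeError("no balanced query found")

rng = np.random.default_rng(0)
dim = 20
committee = [rng.standard_normal(dim) for _ in range(10)]   # stand-in for 2k Gibbs students
query = committee_filter(committee, rng, dim)
```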
3 Exact Maximum Information Gain Queries
The main quantity of interest in our analysis is the generalization error $\epsilon_g$, defined as the probability that a given student TCM will predict the output of the teacher incorrectly for a random test input sampled from $P_0(\mathbf{x})$. It can be expressed in terms of the overlaps $R_i = \frac{K}{N}\mathbf{w}_i^T \mathbf{w}_i^0$ of the student and teacher hidden unit weight vectors $\mathbf{w}_i$ and $\mathbf{w}_i^0$ [6]. In the thermodynamic limit, the $R_i$ are self-averaging, and can be obtained from a replica calculation of the average entropy S as a function of the normalized number of training examples, $\alpha = p/N$; details will be reported in a forthcoming publication [7]. The resulting average generalization error is plotted in Figure 1; for large $\alpha$, one can show analytically that $\epsilon_g \propto \exp(-\frac{\alpha}{2}\ln 2)$. This exponential decay of the generalization error $\epsilon_g$ with $\alpha$ provides a marked
[Figure 1: generalization error and entropy versus $\alpha$ for exact maximum information gain queries, the constructive algorithm, and random examples.]
improvement over the $\epsilon_g \propto 1/\alpha$ decay achieved by random examples [6]. The effect of maximum information gain queries is thus similar to what is observed for a binary perceptron learning from a binary perceptron teacher, but the decay constant c in $\epsilon_g \propto \exp(-c\alpha)$ is only half of that for the binary perceptron [2]. This means that asymptotically, twice as many examples are needed for a TCM as for a binary perceptron to achieve the same generalization performance, in agreement
with the results for random examples [6]. Since maximum information gain queries lead to an entropy $s = -\alpha \ln 2$ in both networks, we can also conclude that the relation $s \approx \ln \epsilon_g$ for the binary perceptron [2] has to be replaced by $s \approx \ln \epsilon_g^2$ for the tree committee machine. Figure 1 shows that, as expected, this relation holds independently of whether one is learning from queries or from random examples.
4 Constructive Query Selection Algorithm
We now consider the practical realization of maximum information gain queries in
the TCM. The query by committee approach, which in the limit $k \to \infty$ is an exact algorithm for selecting maximum information queries, filters queries from a stream of random inputs. This leads to an exponential increase of the query filtering time with the number of training examples that have already been learned [3]. As a cheap alternative we propose a simple algorithm for constructing queries, which is based on the assumption of an approximate decoupling of the entropies of the different hidden units, as follows. Each individual hidden unit of a TCM can be viewed as a binary perceptron. The distribution $P(\mathbf{w}_i|\Theta^{(p)})$ of its weight vector $\mathbf{w}_i$ given a set of training examples $\Theta^{(p)}$ has an entropy $S_i$ associated with it, in analogy to the entropy (2) of the full weight distribution $P(\mathbf{w}|\Theta^{(p)})$. Our 'constructive algorithm' for selecting queries then consists in choosing, for each new query $\mathbf{x}^{p+1}$, the inputs $\mathbf{x}_i^{p+1}$ to the individual hidden units in such a way as to maximize the decrease in their entropies $S_i$. This can be achieved simply by choosing each $\mathbf{x}_i^{p+1}$ to be orthogonal to $\bar{\mathbf{w}}_i = \langle \mathbf{w}_i \rangle_{P(\mathbf{w}|\Theta^{(p)})}$ (and otherwise random, i.e., distributed according to $P_0(\mathbf{x})$) [7], thus avoiding the cumbersome and time-consuming filtering from a random input stream. In practice, one would of course approximate $\bar{\mathbf{w}}_i$ by an average of 2k (say) samples from the Gibbs distribution $P(\mathbf{w}|\Theta^{(p)})$; these samples would have been needed anyway in the query by committee approach.
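A sketch of the construction step, under the stated assumptions (spherical constraint $\mathbf{x}_i^2 = N/K$, mean weight vectors $\bar{\mathbf{w}}_i$ already estimated); names and sizes are illustrative.

```python
import numpy as np

def construct_query(w_bar, rng):
    """Each hidden unit's input: random, but orthogonal to the unit's mean weight vector."""
    parts = []
    for wb in w_bar:
        d = wb.size                           # d = N/K inputs per hidden unit
        x = rng.standard_normal(d)
        wb_hat = wb / np.linalg.norm(wb)
        x -= np.dot(x, wb_hat) * wb_hat       # project out the mean-weight direction
        x *= np.sqrt(d) / np.linalg.norm(x)   # restore the spherical constraint x_i^2 = d
        parts.append(x)
    return np.concatenate(parts)

rng = np.random.default_rng(1)
K, d = 5, 8                                   # K hidden units, N/K = d inputs each
w_bar = [rng.standard_normal(d) for _ in range(K)]
x_query = construct_query(w_bar, rng)
```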
The generalization performance achieved by this constructive algorithm can again be calculated by the replica method; as shown in Figure 1, it is actually slightly superior to that of exact maximum information gain queries. The $\alpha$-dependence of the entropy, $s = -\alpha \ln 2$, turns out to be the same as for maximum information gain queries; this indicates that the correlations between the individual hidden units become sufficiently small for $K \to \infty$, so that queries selected to minimize the individual hidden units' entropies also minimize the overall entropy of the TCM.
5 Conclusions
We have analysed query learning for maximum information gain in a large tree-committee machine (TCM). Our main result is the exponential decay of the generalization error $\epsilon_g$ with the normalized number of training examples $\alpha$, which demonstrates that query learning can yield significant improvements over learning from random examples (for which $\epsilon_g \propto 1/\alpha$ for large $\alpha$) in multi-layer neural networks. The fact that the decay constant c in $\epsilon_g \propto \exp(-c\alpha)$ differs from that calculated for single-layer nets such as the binary perceptron raises the question of how large c would be in more complex multi-layer networks. Combining the worst-case bound in [3] in terms of the VC-dimension with existing storage capacity bounds, one would estimate that c could be as small as $O(1/\ln K)$ for networks with a large number of hidden units K. This contrasts with our result $c \to \mathrm{const.}$ for $K \to \infty$, and further work is clearly needed to establish whether there are realistic networks which saturate the lower bound $c = O(1/\ln K)$.
We have also analysed a computationally cheap algorithm for constructing (rather
than filtering) approximate maximum information gain queries, and found that it
actually achieves slightly better generalization performance than exact maximum
information gain queries. This result is particularly encouraging considering the
practical application of query learning in more complex multi-layer networks. For
example, the proposed constructive algorithm can be modified for query learning
in a fully-connected committee machine (where each hidden unit is connected to all
the inputs), by simply choosing each new query to be orthogonal to the subspace
spanned by the average weight vectors of all K hidden units. As long as K is
much smaller than the input dimension N, and assuming that for large enough
K the approximate decoupling of the hidden unit entropies still holds for fully
connected networks, one would expect this algorithm to yield a good approximation
to maximum information gain queries. The same conclusion may also hold for a
general two-layer network with threshold units (where, in contrast to the committee
machine, the hidden-to-output weights are free parameters), which can approximate
a large class of input-output mappings. In summary, our results therefore suggest
that the drastic improvements in generalization performance achieved by maximum
information gain queries can be made available, in a computationally cheap manner,
for realistic neural network learning problems.
REFERENCES
[1] P. Sollich, Query construction, entropy, and generalization in neural network models,
Phys. Rev. E, Vol. 49 (1994), pp4637-4651.
[2] H. S. Seung, M. Opper, and H. Sompolinsky, Query by committee, in: Proc. 5th Workshop
on Computational Learning Theory (COLT '92), ACM, New York, (1992), pp287-294.
[3] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, Information, prediction, and query by com-
mittee, in: Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan,
and C. Lee Giles eds., Morgan Kaufmann, San Mateo, CA (1993), pp483-490.
[4] E. Baum, Neural network algorithms that learn in polynomial time from examples and queries, IEEE Trans. Neural Netw., Vol. 2 (1991), pp5-19.
[5] J.-N. Hwang, J. J. Choi, S. Oh, and R. J. Marks II, Query-based learning applied to partially trained multilayer perceptrons, IEEE Trans. Neural Netw., Vol. 2 (1991), pp131-136.
[6] H. Schwarze and J. Hertz, Generalization in a large committee machine, Europhys. Lett., Vol. 20 (1992), pp375-380.
[7] P. Sollich, Learning from minimum entropy queries in a large committee machine, Phys.
Rev. E, Vol.53 (1996), ppR2060-R2063.
Acknowledgements
This work was partially supported by European Union grant no. ERB CHRX-
CT92-0063.
SHIFT, ROTATION AND SCALE INVARIANT
SIGNATURES FOR TWO-DIMENSIONAL CONTOURS,
IN A NEURAL NETWORK ARCHITECTURE
David McG. Squire and Terry M. Caelli
School of Computing, Curtin University of Technology, Perth, Western Australia, Australia. Email: [email protected]@cs.curtin.edu.au
A technique for obtaining shift, rotation and scale invariant signatures for two-dimensional contours is proposed and demonstrated. An invariance factor is calculated at each point by comparing the orientation of the tangent vector with vector fields corresponding to the generators of Lie transformation groups for shift, rotation and scaling. The statistics of these invariance factors over the contour are used to produce an invariance signature. This operation is implemented in a Model-Based Neural Network (MBNN), in which the architecture and weights are parameterised by the constraints of the problem domain. The end result after constructing and training this system is the same as a traditional neural network: a collection of layers of nodes with weighted connections. The design and modeling process can be thought of as compiling an invariant classifier into a neural network. We contend that these invariance signatures, whilst not unique, are sufficient to characterise contours for many pattern recognition tasks.
1 Introduction
1.1 The Model-Based Approach to Building Neural Networks
The MBNN approach aims to retain the advantages of Traditional Neural Networks
(TNNs), i.e. parallel data-processing, but to constrain the process by which the
architecture of the network and the values of the weights are determined, so that the
designer can use expert knowledge of the problem domain. MBNNs were introduced
in [1]. In that paper we proposed networks in which the weights were functions of the
relative positions of nodes, and several, possibly shared, parameters. This reduced
the dimensionality of the space searched during training, and the size of the training
set required, since the network was guaranteed to respond only to certain features.
The resultant network was just a collection of nodes and weighted connections,
exactly as in a TNN. The key notion was that neural networks with highly desirable
properties could be produced by using expert knowledge to constrain the weight
determination process. The MBNN approach departs from the TNN view in that
the operation is not determined entirely by the training set supplied.
1.2 Model-Based Neural Networks and Invariant Pattern
Recognition
That the parameters of TNNs are entirely learnt can be a limitation. To achieve
good performance, the training set must be sufficiently large and varied to span
the input space. Collecting this data and training the network can be very time-
consuming. The MBNN formulation allows the creation of networks guaranteed to
respond to particular features in, and to be invariant under certain transformations
of, the input image. A data set containing various shifted, distorted or otherwise
transformed versions of the input patterns, as has long been a common approach
to invariant pattern recognition using neural networks [2], is not required. The
concept of MBNNs is here extended to include modular networks. Each module
has a well-defined functionality. The weights in each module can be arrived at by
any technique at all: some may be set by the designer, others by training the module
for a specific task. This approach allows the network designer great flexibility.
Separately trained modules can perform data processing tasks independent of the
final classification of the input pattern.
2 Invariance Signatures
2.1 Functions Invariant With Respect To Lie Transformation
Groups
We wish to determine the invariance of a function F(x, y) with respect to a Lie transformation group.
$$\mathbf{G}(x, y) = [\,g_x(x, y)\ \ g_y(x, y)\,]^T \qquad (1)$$
is the vector field corresponding to the generator of the group. F is constant with respect to the action of the generator if
$$\nabla F \cdot \mathbf{G}(x, y) = 0, \quad \text{i.e.} \quad \frac{\partial F}{\partial x} g_x + \frac{\partial F}{\partial y} g_y = 0. \qquad (2)$$
Consider a contour parameterised by t on which F is constant:
$$F(x(t), y(t)) = C \quad \forall t. \qquad (3)$$
Since F is constant on the contour, we have:
$$\frac{dF}{dt} = \frac{\partial F}{\partial x}\frac{dx}{dt} + \frac{\partial F}{\partial y}\frac{dy}{dt} = 0, \quad \text{so that} \quad \frac{\partial F / \partial x}{\partial F / \partial y} = -\frac{dy}{dx}. \qquad (4)$$
Thus the invariance condition given in equation 2 holds if
$$\frac{dy}{dx} = \frac{g_y}{g_x}. \qquad (5)$$
The vector field for dilation invariance is similar. For the translation invariance case, the vector field $\mathbf{v}_{\mathrm{trans}}(x, y)$ is constant for all (x, y). The direction is that of
the dominant eigenvector of the coordinate covariance matrix of the contour. The
invariance signature of an image consists of histograms of the invariance measures
for all the "on" points. It is invariant under rotations, dilations, translations and
reflections of the image. This encoding is not unique, unlike some previous integral
transform representations [3].
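A rough numerical sketch of these invariance factors for a sampled contour: the rotation and dilation generator fields about the centroid are taken as $(-y, x)$ and $(x, y)$ respectively, the invariance factor at each point is the magnitude of the cosine between tangent and field, and binning the factors gives the signature histograms. The bin count and test contour are illustrative.

```python
import numpy as np

def invariance_signature(points, n_bins=10):
    pts = points - points.mean(axis=0)              # centre the contour
    tang = np.gradient(pts, axis=0)                 # finite-difference tangents
    tang /= np.linalg.norm(tang, axis=1, keepdims=True)
    fields = {"rotation": np.stack([-pts[:, 1], pts[:, 0]], axis=1),
              "dilation": pts.copy()}
    sig = {}
    for name, g in fields.items():
        g /= np.linalg.norm(g, axis=1, keepdims=True)
        inv = np.abs(np.sum(tang * g, axis=1))      # invariance factor per point
        sig[name], _ = np.histogram(inv, bins=n_bins, range=(0.0, 1.0))
    return sig

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(invariance_signature(circle)["rotation"])     # mass piles into the top bin
```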
3 A Neural Network System For Computing Invariance
Signatures
A MBNN was constructed to compute invariance signatures and classify input pat-
terns on this basis. This MBNN, consisting of a system of modules, some hand-
coded and some trained, is shown in Figure 1. Whilst this system appears complex,
it retains the basic characteristics of a TNN¹. Each node i computes the sum of its weighted inputs, $\mathrm{net}_i = \sum_j w_{ij} x_j$. This is used as the input to the transfer function f, which is either linear, $f(\mathrm{net}_j) = \mathrm{net}_j$, or the standard sigmoid, $f(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}}$. The only departure from a TNN is that some of the weights are dynamic: the weight is calculated by a node higher up in the network. This allows the MBNN to compute dot products², and some nodes to act as gates. Since
weights in any neural network implementation are just references to a stored value,
this should not present any difficulty. There is insufficient space here to describe
all the modules in the system. Descriptions of the Local Orientation Extractor and
the Binning Unit are given as examples of the way the modules were constructed.
¹ With the exception of the Dominant Image Orientation Unit, for which a neural network solution is still being developed.
² The calculation of dot products is achieved by using the outputs of one layer as the weights on connections to another layer.
3.1 Local Orientation Extraction
A simple and robust estimate of the tangent vector at a point is the dominant
eigenvector of the covariance matrix of a square window centred on that point.
The vector magnitudes are weighted by the strength of the orientation. Let the eigenvalues be $\lambda_1$ and $\lambda_2$, $\lambda_1 \ge \lambda_2$. The corresponding unit eigenvectors are $\mathbf{e}_1$ and $\mathbf{e}_2$. The weighted tangent vector estimate $\mathbf{s}$ is:
$$\mathbf{s} = \begin{cases} \left(1 - \frac{\lambda_2}{\lambda_1}\right)\mathbf{e}_1 & \lambda_1 \ne 0, \\ \mathbf{0} & \lambda_1 = 0. \end{cases} \qquad (9)$$
A TNN with a 3 x 3 input layer, 20 hidden nodes, and 2 output nodes was trained
using backpropagation [4] to produce this output for all possible binary input im-
ages with a centre value of 1. Training was stopped after 6000000 iterations, when
96.89% of the training set variance was fitted. This problem is similar to edge ex-
traction, except that edge extraction is usually performed on greyscale gradients
rather than thin binary contours. Srinivasan et al. [5] have developed a neural net-
work edge detector which produces a weighted vector output like that in equation
9. We intend to produce a more compact and accurate tangent estimator using a
MBNN incorporating Gabor weighting functions, as used in [1].
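The target output of eq. (9) can also be computed directly; the sketch below is a stand-in for the trained 3 x 3 TNN, not the network itself.

```python
import numpy as np

def weighted_tangent(window_coords):
    """Covariance-based tangent estimate of eq. (9) for the 'on' pixels in a window."""
    if len(window_coords) < 2:
        return np.zeros(2)
    cov = np.cov(np.asarray(window_coords, dtype=float).T)
    evals, evecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    lam1, lam2 = evals[1], evals[0]
    if lam1 <= 0.0:
        return np.zeros(2)
    return (1.0 - lam2 / lam1) * evecs[:, 1]        # s = (1 - lam2/lam1) e1

segment = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])   # straight contour piece
print(weighted_tangent(segment))                    # unit-length vector along (1, 1)
```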
3.2 The Binning Unit
The weights for the binning unit in Figure 2 were determined by hand. There is a binning unit for each of the n bins in each invariance signature histogram. Each binning unit is connected to all nodes in the invariance image, and inputs to it are gated by the input image, so that only the N nodes corresponding to ones in the input image contribute. The bins have width $\frac{1}{n-1}$, since the first bin is centred on 0 and the last on 1³. Nodes A and B only have an output of 1 when the input x is within bin i. This condition is met when:
$$\frac{2i - 1}{2(n - 1)} < x < \frac{2i + 1}{2(n - 1)}. \qquad (10)$$
To detect this condition, the activations of nodes A and B are set to:
$$\mathrm{net}_A = \alpha x - \frac{\alpha (2i - 1)}{2(n - 1)} \qquad (11)$$
³ A bin ending at 0 or 1 would miss contributions from the extrema, since the edge of the neural bin is not vertical.
$$\mathrm{net}_B = -\alpha x + \frac{\alpha (2i + 1)}{2(n - 1)} \qquad (12)$$
where $\alpha$ is a large number, causing the sigmoid to approximate a step function. Here, $\alpha = 1000$ was used. Node C acts as an AND gate. Node D sums the contributions
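The following sketch assembles one binning unit from eqs. (10)-(12), realizing node C's AND gate as a third steep sigmoid (one plausible implementation; the weights for nodes C and D are not given in the surviving text).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped to avoid overflow

def bin_activation(x, i, n, alpha=1000.0):
    """Output ~1 only when x lies inside bin i of n (eqs. 10-12)."""
    net_a = alpha * x - alpha * (2 * i - 1) / (2 * (n - 1))   # eq. (11)
    net_b = -alpha * x + alpha * (2 * i + 1) / (2 * (n - 1))  # eq. (12)
    a, b = sigmoid(net_a), sigmoid(net_b)
    return sigmoid(alpha * (a + b - 1.5))                     # node C: approximate AND

x = np.linspace(0.0, 1.0, 11)
print(np.round(bin_activation(x, i=3, n=10), 2))              # ~1 only inside bin 3
```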
Acknowledgements
David Squire was funded by an Australian Postgraduate Award.
⁴ More than one example of each letter was required, since slightly different signatures were generated by different orientations, due to the residual error in the Local Orientation Extractor.
5 all rotations were by multiples of ¥radians.
FUNCTION APPROXIMATION BY THREE-LAYER
ARTIFICIAL NEURAL NETWORKS
Shin Suzuki
Information Science Research Laboratory, NTT Basic Research Laboratories,
3-1 Morinosato-Wakamiya, Atsugi-Shi, Kanagawa Pref., 243-01 Japan.
Tel: +81 462 40 3574, Fax: +81 462 40 4721, Email: [email protected]
This paper presents constructive approximations by three-layer artificial neural networks with (1)
trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer units. These approximations
provide (a) approximating-network equations, (b) specifications for the numbers of hidden-layer
units, (c) approximation error estimations, and (d) saturations of the approximations.
Keywords: Constructive Approximation, Saturation.
1 Introduction
Previous studies on function approximation by artificial neural networks show only the existence of approximating networks, by non-constructive methods [1][2], and thus contribute almost nothing to developing the networks and to specifying the properties of the approximations. This paper presents constructive approximations by networks with (1) trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer units. These approximations provide (a) approximating-network equations, (b) specifications for the numbers of hidden-layer units, (c) approximation error estimations, and (d) saturations of the approximations.
2 Preliminaries
Let $\mathbb{N}$ and $\mathbb{R}$ be the sets of natural and real numbers, and $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. Let $|r| = \sum_{i=1}^m r_i$ for $r = (r_i)_{i=1}^m \in \mathbb{N}_0^m$ and $\|t\| = (\sum_{i=1}^m t_i^2)^{1/2}$ for $t = (t_i)_{i=1}^m \in \mathbb{R}^m$. For $p \ge 1$, we denote by $L_{2\pi}^p(\mathbb{R}^m)$ the space of functions defined on $\mathbb{R}^m$ to $\mathbb{R}$ which are $2\pi$-periodic in each variable and pth-order Lebesgue-integrable, with $L_{2\pi}^p$-norm $\|f\|_{L_{2\pi}^p} = \{(2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} |f(x)|^p\, dx\}^{1/p}$, and by $C_{2\pi}(\mathbb{R}^m)$ the space of $2\pi$-periodic continuous functions defined on $\mathbb{R}^m$ to $\mathbb{R}$ with $C_{2\pi}$-norm $\|f\|_{C_{2\pi}} = \sup\{|f(x)|;\ |x_i| \le \pi,\ i = 1, \dots, m\}$. Let $\Psi = L_{2\pi}^p(\mathbb{R}^m)$ or $C_{2\pi}(\mathbb{R}^m)$ throughout this paper. For $f, g \in \Psi$, $(f(t), g(t)) = (2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} f(t)\, g(t)\, dt$ and the convolution $f * g(x) = (2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} f(t)\, g(x - t)\, dt$. Let the sigmoidal function $\mathrm{sig}(x) = \{1 + \exp(-x)\}^{-1}$. Let $f \in \Psi$ and $\delta \ge 0$. We introduce the modulus of continuity of f in $\Psi$: $\omega_\Psi(f, \delta) = \sup\{\|f(\,\cdot\, + t) - f(\,\cdot\,)\|_\Psi;\ t \in \mathbb{R}^m,\ \|t\| \le \delta\}$, where $\|f(\,\cdot\, + t) - f(\,\cdot\,)\|_{L_{2\pi}^p} = \{(2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} |f(x + t) - f(x)|^p\, dx\}^{1/p}$ and $\|f(\,\cdot\, + t) - f(\,\cdot\,)\|_{C_{2\pi}} = \sup\{|f(x + t) - f(x)|;\ |x_i| \le \pi,\ i = 1, \dots, m\}$.
We say f satisfies a Lipschitz condition with constant $M > 0$ and exponent $\nu > 0$ in $\Psi$ if
$$\|f(\,\cdot\, + t) - f(\,\cdot\,)\|_\Psi \le M \|t\|^\nu$$
for $t \in \mathbb{R}^m$. Let $\mathrm{Lip}_M(\Psi; \nu)$ be the set of functions satisfying this condition, and the Lipschitz class of order $\nu$, $\mathrm{Lip}\,\nu = \cup\{\mathrm{Lip}_M(\Psi; \nu);\ M > 0\}$. We notice that if $f \in \Psi$ satisfies a Lipschitz condition with constant M and exponent $\nu$ in $\Psi$, then $\omega_\Psi(f, \delta) \le M \delta^\nu$.
3 Results
3.1 Constructive Approximations and approximation error
estimations
Let $b_r^\lambda = \prod_{i=1}^m \sin\frac{(r_i + 1)\pi}{\lambda + 2}$ for $r = (r_i)_{i=1}^m \in \mathbb{N}_0^m$ and $B^\lambda = \left(\frac{2}{\lambda + 2}\right)^m$ for $\lambda \in \mathbb{N}$ in this section.
Theorem 1 (trigonometric case) Let $f = (f_i)_{i=1}^n$ be a function defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ such that $f_i \in \Psi$. For an arbitrary parameter $\lambda \in \mathbb{N}$, a three-layer network with trigonometric hidden-layer units $TN[f]^\lambda = (TN[f_i]^\lambda)_{i=1}^n$ approximates f in $\Psi$-norm, where
$$TN[f_i]^\lambda(x) = \theta[f_i]^\lambda + \sum_{(p, q)}^{0 \le p_i, q_i \le \lambda} \left\{ \alpha[f_i]^\lambda(p, q) \cos(p - q)x + \beta[f_i]^\lambda(p, q) \sin(p - q)x \right\}, \qquad (1)$$
where $\theta[f_i]^\lambda = (f_i(t), 1)$, $\alpha[f_i]^\lambda(p, q) = 2 B^\lambda b_p^\lambda b_q^\lambda\, (f_i(t), \cos(p - q)t)$, and $\beta[f_i]^\lambda(p, q) = 2 B^\lambda b_p^\lambda b_q^\lambda\, (f_i(t), \sin(p - q)t)$. (The above summation is over combinations of $p = (p_i)_{i=1}^m$, $q = (q_i)_{i=1}^m \in \mathbb{N}_0^m$ such that $p \ne q$, $0 \le p_i, q_i \le \lambda$; i.e., if the summation includes the case (p, q), it does not also include the case (q, p). This notation for the summation is used throughout this paper.) Then $TN[f]^\lambda$ has $(2\lambda + 1)^m - 1$ hidden-layer units. Also, the approximation error of each coordinate in $\Psi$-norm, for $i = 1, \dots, n$, is estimated by
$$\| f_i - TN[f_i]^\lambda \|_\Psi \le \left( 1 + \frac{\pi^2}{2} \sqrt{m} \right) \omega_\Psi\big(f_i, (\lambda + 2)^{-1}\big). \qquad (2)$$
The next corollary means that $TN[f]^\lambda$ can approximate f with any degree of accuracy if $\lambda$ increases, i.e., the number of hidden-layer units increases.
Corollary 1 $\| f_i - TN[f_i]^\lambda \|_\Psi \to 0$ as $\lambda \to \infty$ $(i = 1, \dots, n)$.
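Theorem 1 can be checked numerically for m = n = 1. The sketch below uses the factors $b_r^\lambda$ and $B^\lambda$ as defined at the start of this section (our reading of the definitions, so they should be treated as assumptions), evaluates $TN[f]^\lambda$ by quadrature, and prints the sup-error for a Lipschitz test function.

```python
import numpy as np

def tn_approximation(f, lam, n_grid=4096):
    """Build TN[f]^lambda of eq. (1) for m = n = 1 by numerical quadrature."""
    t = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    ft = f(t)
    theta = ft.mean()                                  # theta = (f(t), 1)
    b = np.sin((np.arange(lam + 1) + 1) * np.pi / (lam + 2))
    B = 2.0 / (lam + 2)
    def approx(x):
        out = np.full_like(x, theta)
        for p in range(lam + 1):
            for q in range(p):                         # combinations with p != q
                r = p - q
                alpha = 2 * B * b[p] * b[q] * np.mean(ft * np.cos(r * t))
                beta = 2 * B * b[p] * b[q] * np.mean(ft * np.sin(r * t))
                out += alpha * np.cos(r * x) + beta * np.sin(r * x)
        return out
    return approx

f = lambda x: np.abs(np.sin(x))                        # 2*pi-periodic Lipschitz function
x = np.linspace(-np.pi, np.pi, 1000)
for lam in (2, 4, 8, 16):
    print(lam, np.max(np.abs(f(x) - tn_approximation(f, lam)(x))))
```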
Theorem 2 (piecewise linear case) Let $f = (f_i)_{i=1}^n$ be a function defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ such that $f_i \in L_{2\pi}^p(\mathbb{R}^m)$. For two arbitrary independent parameters $\lambda, u \in \mathbb{N}$, a three-layer network with piecewise linear hidden-layer units $PN[f]^{\lambda, u} = (PN[f_i]^{\lambda, u})_{i=1}^n$ approximates f in $L_{2\pi}^p$-norm, where
$$PN[f_i]^{\lambda, u}(x) = \theta[f_i]^{\lambda, u} + \sum_{(p, q)}^{0 \le p_i, q_i \le \lambda}\ \sum_{k=0}^{4|p - q|u - 1} \alpha[f_i]^{\lambda, u}(p, q, k)\, PL_k\big((p - q)x\big), \qquad (3)$$
where
$$PL_k(rx) = \begin{cases} 0 & \left(rx \le -|r|\pi + \frac{k\pi}{2u}\right), \\ \frac{2u}{\pi} rx + 2|r|u - k & \left(-|r|\pi + \frac{k\pi}{2u} < rx < -|r|\pi + \frac{(k + 1)\pi}{2u}\right), \\ 1 & \left(rx \ge -|r|\pi + \frac{(k + 1)\pi}{2u}\right), \end{cases}$$
$$\theta[f_i]^{\lambda, u} = (f_i(t), 1) + 2 B^\lambda \sin\frac{\pi}{4u} \sum_{(p, q)}^{0 \le p_i, q_i \le \lambda} (-1)^{|p - q|}\, b_p^\lambda b_q^\lambda\, (f_i(t), \cos(p - q)t),$$
$$\alpha[f_i]^{\lambda, u}(p, q, k) = 4 (-1)^{|p - q|} B^\lambda b_p^\lambda b_q^\lambda \sin\frac{\pi}{4u} \times \left\{ (f_i(t), \sin(p - q)t) \cos\frac{(2k + 1)\pi}{4u} - (f_i(t), \cos(p - q)t) \sin\frac{(2k + 1)\pi}{4u} \right\}.$$
Then $PN[f]^{\lambda, u}$ has $2 m u \lambda (\lambda + 1)(2\lambda + 1)^{m - 1}$ hidden-layer units. Also the approximation error of each coordinate in $L_{2\pi}^p$-norm is estimated by
(4)
Theorem 3 (sigmoidal case) Let $f = (f_i)_{i=1}^n$ be a function defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ such that $f_i \in L_{2\pi}^p(\mathbb{R}^m)$. For two arbitrary independent parameters $\lambda, u \in \mathbb{N}$, a three-layer network with sigmoidal hidden-layer units $SN[f]^{\lambda, u} = (SN[f_i]^{\lambda, u})_{i=1}^n$ approximates f in $L_{2\pi}^p$-norm, where
$$SN[f_i]^{\lambda, u}(x) = \theta[f_i]^{\lambda, u} + \sum_{(p, q)}^{0 \le p_i, q_i \le \lambda}\ \sum_{k=0}^{4|p - q|u - 1} \alpha[f_i]^{\lambda, u}(p, q, k)\, SG_k\big((p - q)x\big), \qquad (5)$$
where $SG_k(rx) = \mathrm{sig}\left(\frac{8u}{\pi} rx + 8|r|u - 4k - 2\right)$, and $\theta[f_i]^{\lambda, u}$ and $\alpha[f_i]^{\lambda, u}(p, q, k)$ are the same as in Theorem 2. Then $SN[f]^{\lambda, u}$ has $2 m u \lambda (\lambda + 1)(2\lambda + 1)^{m - 1}$ hidden-layer units. Also the approximation error of each coordinate in $L_{2\pi}^p$-norm is estimated by
(6)
L 2" L 2"
The approximation errors in (4) and (6) are almost determined by their first terms, which are the same as Inequality (2), when $u$ is large enough relative to $\lambda$. The following two corollaries give conditions on $u$ in terms of $\lambda$ under which the approximation errors approach 0 as $\lambda$ increases. Under these conditions $PN[f]^{\lambda,u}$ and $SN[f]^{\lambda,u}$ can approximate $f$ to any degree of accuracy as the number of hidden-layer units increases.

Corollary 2 If $u$ is a higher-order infinity than $\lambda^{m/p}$, i.e., $u=u(\lambda)\to\infty$ and $\lambda^{m/p}/u\to 0$ as $\lambda\to\infty$, then $\|f_i-PN[f_i]^{\lambda,u}\|_{L^p_{2\pi}}\to 0$ as $\lambda\to\infty$ $(i=1,\dots,n)$.

Corollary 3 If $u$ is a higher-order infinity than $\lambda^{m/p}$, i.e., $u=u(\lambda)\to\infty$ and $\lambda^{m/p}/u\to 0$ as $\lambda\to\infty$, then $\|f_i-SN[f_i]^{\lambda,u}\|_{L^p_{2\pi}}\to 0$ as $\lambda\to\infty$ $(i=1,\dots,n)$.
[Plot: actual and estimated errors of the trigonometric, piecewise linear and sigmoidal networks against $\lambda=2,4,6,8,10$; error axis from 0 to 0.2.]
Figure 2 Approximating networks with three kinds of hidden-layer units at $u=60$. (When $\lambda$ is the same, the three kinds of networks give almost the same figures.)
and $7.2\times10^3$, $4.3\times10^4$, $1.3\times10^5$, $2.9\times10^5$, $5.5\times10^5$ in the piecewise linear and sigmoidal cases. The actual errors with $L^p_{2\pi}$-norm are numerically calculated from the left-hand members of Inequalities (2), (4), and (6), and the estimated errors are calculated from their right-hand members.
5 Conclusion
This paper presents constructive approximations by three-layer artificial neural networks with (1) trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer units to $2\pi$-periodic $p$th-order Lebesgue integrable functions defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ for $p\ge 1$ with $L^p_{2\pi}$-norm. (In case (1), networks with trigonometric hidden-layer units can also approximate $2\pi$-periodic continuous functions defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ with $C_{2\pi}$-norm at the same time.) The approximations provide (a) approximating-network equations, (b) specifications for the numbers of hidden-layer units, (c) approximation error estimates, and (d) saturations of the approximations. These results can easily be applied to non-periodic functions defined on a bounded subset of $\mathbb{R}^m$.
Acknowledgements
The author would like to thank Dr. Masaaki Honda and other members of the HONDA Research Group of the Information Science Research Laboratory for many useful and helpful discussions.
NEURAL NETWORK VERSUS STATISTICAL
CLUSTERING TECHNIQUES: A PILOT STUDY IN A
PHONEME RECOGNITION TASK
G. Tambouratzis, T. Tambouratzis* and D. Tambouratzis
Dept. of Mathematics, Agricultural Univ. of Athens,
Iera Odos 75, Athens 118 55, Greece.
* Institute of Nuclear Technology - Radiation Protection,
NRCPS "Demokritos", Aghia Paraskevi 153 10, Greece.
In this article, two neural network clustering techniques are compared to classical statistical tech-
niques. This is achieved by examining the results obtained when applying each technique to a
real-world phoneme recognition task. An analysis of the phoneme datasets exposes the clusters
which exist in the pattern space. The study of the similarity of the patterns which are clustered
together allows an objective evaluation of the clustering efficiency of these techniques. It also gives
rise to a revealing comparison of the way each technique clusters the dataset.
1 Introduction
Clustering algorithms attempt to organise unclassified patterns into groups in such
a way that patterns within each group are more similar than patterns belonging to
different groups. In classical statistics, there exists a wide range of agglomerative and
divisive clustering techniques [1], which use as distance measures either Euclidean-
type distances or other metrics suitable for binary data. Recently, techniques based
on neural networks have been developed and have been found well-suited to cluster-
ing large, high-dimensional pattern spaces. In this article, the clustering potential
of two fundamentally different neural network models is studied and compared to
that of statistical techniques. The network models are: (a) the Self-Organising Logic
Neural Network (SOLNN) [2], which is based on the n-tuple sampling method and
(b) the Harmony Theory Network (HTN) [3] which constitutes a derivative of the
Hopfield network and a variant of the Boltzmann machine.
In an effort to evaluate the effectiveness of the two network models when discovering
the clusters which exist in the pattern space, a comparison is made to a number of
established statistical methods. This comparison is performed in a series of experi-
ments which use real-world phoneme data. A number of phonemes are selected and
used as prototypes. Various levels of noise are injected into the prototypes, resulting
in different datasets, each consisting of well defined phoneme-classes. The varying
levels of noise cause each dataset to have fundamentally different characteristics.
The behaviour of the clustering techniques is then studied for each dataset.
2 Overview of the Clustering Techniques
The Self-Organising Logic Neural Network (SOLNN) shares the main structure of
the discriminator network [4]. It is consequently based on the decomposition of the
input pattern into tuples of n pixels and the comparison of these tuples to the corre-
sponding tuples of training patterns. In the SOLNN model, the basic discriminator
structure has been extended by allocating m bits to each tuple combination rather
than a single one. This allows the network to store information concerning the fre-
quency of occurrence of the corresponding tuple combination during learning and
355
356 CHAPTER 62
is instrumental to the SOLNN's ability to learn in the absence of external super-
vision. The SOLNN has been shown to successfully perform clustering tasks in an
unsupervised manner [2,5]. The SOLNN model is characterised by the distribution
constraint mechanism [5] which enables the user to determine the desired radius
of the SOLNN clusters. This mechanism is similar to the vigilance parameter of
Adaptive Resonance Theory (ART) [6].
The Harmony Theory Network (HTN) [3] consists of binary nodes arranged in
exactly two layers. For this task, the lower layer encodes the features of the unclas-
sified patterns and the upper layer encodes the candidate patterns of the clustering
task. Each connection between a feature and a classified pattern specifies the posi-
tive or negative relation between them, i.e. whether or not the pattern contains the
feature. No training is required to adapt the weights, which depend on the local
connectivity of the HTN [3]. During clustering, each candidate pattern is input to
the lower layer of the HTN; the activated nodes of the upper layer constitute the
patterns to which the candidate pattern is clustered (also see [7] for a more detailed
description). Both the required degree of similarity between clustered patterns and
the desired number of clusters are monitored by the parameter k of Harmony The-
ory, which resembles the vigilance parameter of ART [6] and the radius of RBF
(Radial-Basis Function) networks [8]. Its value is changed, in a uniform manner for
all candidate patterns, in the search for optimal clustering results.
Hierarchical statistical clustering techniques [1] of the agglomerative type are used
for clustering in this article. The techniques employed are (i) the single linkage, (ii)
the complete linkage, (iii) the median cluster, (iv) the centroid cluster and (v) the
average cluster.
3 Description of the Clustering Experiments
The data employed in the experiments are real-world phonemes which have been
obtained from the dataset of the LVQ-PAK simulation package [9]. The phonemes
in this package have been pre-processed so that each phoneme consists of 20 real-
valued features. The selected phonemes (called prototypes) correspond to the letters
"A", "0", "N", "I", "M", "U" and "S". Since both the SOLNN and the HTN net-
works require binary input patterns, the phonemes have been digitised, by encoding
each real-valued feature into four bits via the thermometer-coding technique [10].
Consequently, the resulting prototypes are 80-dimensional binary patterns.
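A thermometer code maps a bounded real value to a run of 1s whose length grows with the value. A minimal Python sketch follows (my illustration; the exact bin boundaries used in the paper are not specified, so the [-1, 1] range here is an assumption):

    import numpy as np

    def thermometer(x, lo, hi, bits=4):
        """Encode a real value in [lo, hi] as `bits` binary digits: 0000, 1000, 1100, ..."""
        level = int(np.clip((x - lo) / (hi - lo) * bits, 0, bits))
        return np.array([1] * level + [0] * (bits - level))

    # A 20-feature phoneme becomes an 80-dimensional binary pattern:
    phoneme = np.random.default_rng(0).uniform(-1, 1, 20)
    pattern = np.concatenate([thermometer(f, -1.0, 1.0) for f in phoneme])
    print(pattern.shape)  # (80,)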
Each of the prototypes has been used to generate a number of noisy patterns by
adding random noise of a certain level, namely 2.5, 5, 7.5 and 10%. A different
dataset (experiment-dataset) is created for each level of noise. Every experiment-
dataset consists of groups of noisy patterns whose centroids coincide with, or
are situated very near to, the prototypes. The different levels of noise cause the
experiment-datasets to occupy varying portions of the input space and the groups
of noisy patterns to overlap to a different extent. The prototype and the noisy pat-
terns for each level of noise constitute a phoneme class. An analysis of the phoneme
classes in each experiment-dataset indicates that the patterns of each phoneme class
are closer to other patterns of the same class than to patterns of other phoneme
classes. As the noise level increases, each phoneme class occupies a larger portion
of the pattern space and the distance between phonemes from different classes is
reduced, while the probability of an overlap occurring between different classes increases.
Phoneme   Average distance   Minimum distance   Maximum distance   Minimum distance
class     within class       within class       within class       between classes
A         16.11%             10.00%             20.00%             22.50% (A & O)
O         16.67%             10.00%             20.00%             22.50% (O & A)
N         16.22%              7.50%             20.00%             23.75% (N & O)
I         16.33%             10.00%             20.00%             20.00% (I & M)
M         16.22%             10.00%             20.00%             20.00% (M & I)
U         16.06%             10.00%             20.00%             23.75% (U & O)
S         16.22%             10.00%             20.00%             26.25% (S & I)

Table 1 Characteristics of the pattern space used for the 10% noise level. Distances
are calculated as percentages of the total number of pixels.
The 10%-noise dataset is probably the most interesting one since, in that case, the minimum distance between two phonemes from different classes (more specifically, classes "I" and "M") becomes equal to the maximum distance between patterns within any phoneme class. Due to this fact, it is expected to be the most difficult experiment-dataset to cluster successfully. Its characteristics are summarised in Table 1 to allow for a detailed evaluation of the clustering results.
The task consists of grouping the patterns of each experiment-dataset so that the
phoneme classes are uncovered. The results obtained by each of the three clustering
techniques are evaluated by taking into account the characteristics and topology
of each experiment-dataset, i.e. the pixel-wise similarity between the patterns and
the clusters in which they have been organised by each technique. This enables (i)
a comparison of the way in which each technique operates for various data distri-
butions and (ii) an evaluation of the effect that the relation between the maximum
distance of patterns of the same phoneme class and the minimum distance of pat-
terns of different phoneme classes has on clustering.
Additionally, a statistical analysis of the pattern space created by each experiment-
dataset has been performed to investigate how effective the clustering techniques
actually are. This investigation is based on the similarity between classes in the
pattern space. The comparison of the two neural network techniques and the sta-
tistical methods, together with an in-depth analysis of the pattern space, provides
an accurate insight into the capabilities and limitations of each of the techniques
studied.
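The within-class and between-class figures of the kind reported in Table 1 can be reproduced for any labelled binary dataset; a small Python sketch of the pairwise Hamming analysis follows (my illustration, assuming the patterns are rows of a 0/1 matrix):

    import numpy as np

    def hamming_stats(patterns, labels):
        """Pairwise Hamming distances as percentages of the number of pixels."""
        d = (patterns[:, None, :] != patterns[None, :, :]).mean(-1) * 100
        same = labels[:, None] == labels[None, :]
        off_diag = ~np.eye(len(patterns), dtype=bool)
        return (d[same & off_diag].mean(),   # average within-class distance
                d[same & off_diag].max(),    # maximum within-class distance
                d[~same].min())              # minimum between-class distance

    rng = np.random.default_rng(1)
    pats = rng.integers(0, 2, size=(20, 80))       # 20 random 80-pixel patterns
    labs = np.repeat(np.arange(4), 5)              # 4 classes of 5 patterns
    print(hamming_stats(pats, labs))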
4 Experimental Results
The different clustering techniques are applied to the clustering task described in
the previous paragraph. The results obtained are summarised in Table 2, where the
following information is contained:
(i) the proportion of dataset phonemes that are correctly classified, that is of pat-
terns assigned to a cluster representing their phoneme class;
(ii) the number of multi-phoneme clusters, that is clusters containing patterns from
more than one phoneme class;
Noise level       Criterion                             SOLNN      HTN    Statistical
2.5%, 5%, 7.5%    Correct classification                100%       100%   100%
                  Multi-phoneme clusters formed         0          0      0
                  Phonemes in multi-phoneme cluster     1          1      1
                  Number of created clusters            7          7      7
                  Number of clusters per phoneme        1          1      1
10%               Correct classification               86%/100%    97%    100%
                  Multi-phoneme clusters formed         4/0        1      0
                  Phonemes in multi-phoneme cluster     4/1        2      1
                  Number of created clusters            7/10       21     7
                  Number of clusters per phoneme        4/2        4      1

Table 2 Comparative results of the three methods for the different noise levels.
In the case of the SOLNN, for 10% noise, two sets of results are noted, the first
corresponding to a 7-discriminator network and the second to a 10-discriminator
network. In the case of the statistical methods, the results are obtained using the
Hamming distance metric.
5 Conclusions
Both the neural network (SOLNN and HTN) and the statistical techniques have
been found to perform the selected clustering task satisfactorily. For low noise
levels, all techniques cluster the dataset successfully, by forming exclusively single-
phoneme clusters. For higher noise levels, the statistical methods always generate
the optimum clustering according to the Hamming distance metric, as witnessed by
the study of the dataset structure. The quality of the clustering generated by the
two neural network models is slightly inferior to that of the statistical techniques.
This is indicated by the creation of multiple clusters for a few phoneme classes.
However, the vast majority of clusters consist of patterns from a single phoneme
class (see Table 2), thus producing successful clustering.
It is worth noting that the study of the Hamming distances between the differ-
ent phoneme patterns in the dataset indicates that the clustering behaviour of all
three techniques is justified. In particular, for the noise level of 10% when the most
differences between the clustering results of the three methods are detected, the
minimum distance between the phoneme classes is equal to the maximum distance
between phonemes of the same class. This allows for more than one possible clus-
tering result of almost the same quality.
REFERENCES
[1] Kaufman, L., Rousseeuw, P.J., Finding Groups in Data, Wiley, New York (1990).
[2] Tambouratzis, G., Tambouratzis, D., Self-Organisation in Complex Pattern Spaces Using a Logic Neural Network, Network: Computation in Neural Systems, Vol. 5 (1994), pp599-617.
[3] Smolensky, P., Information processing in dynamical systems: foundations of Harmony Theory, in: Rumelhart, D.E., McClelland, J.L. (eds.), Parallel Distributed Processing, Vol. 1: Foundations, MIT Press, Cambridge MA (1986), pp194-281.
[4] Aleksander, I., Morton, H., An Introduction to Neural Computing, Chapman and Hall, London, England (1990).
[5] Tambouratzis, G., Optimising the Topology-Preservation Characteristics of a Discriminator-Based Logic Neural Network, Pattern Recognition Letters, Vol. 15 (1994), pp1019-1028.
[6] Carpenter, G.A., Grossberg, S., The ART of Adaptive Pattern Recognition by a Self-Organising Neural Network, IEEE Computer (March 1988), pp77-88.
[7] Tambouratzis, T., Tambouratzis, D., Optimal Training Pattern Selection Using a Cluster-Generating Artificial Neural Network, in: Proceedings of the 1995 International Conference on Artificial Neural Networks and Genetic Algorithms, Mines d'Ales, France (April 1995), Springer-Verlag, pp472-475.
[8] Chen, S., Cowan, C.F.N., Grant, P.M., Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks, IEEE Transactions on Neural Networks, Vol. 2 (1991), pp302-309.
[9] Kohonen, T., Kangas, J., Laaksonen, J., Torkkola, K., LVQ-PAK: The Learning Vector Quantisation Program Package (1992).
[10] Aleksander, I., Stonham, T.J., Guide to Pattern Recognition using Random-Access Memories, Computers and Digital Techniques, Vol. 2 (1979), pp29-40.
MULTISPECTRAL IMAGE ANALYSIS USING PULSED
COUPLED NEURAL NETWORKS
Gregory L. Tarr, Xavier Clastres*, Laurent Freyss*,
Manuel Samuelides*, Christopher Dehainaut and William Burckel
Phillips Laboratory, Kirtland Air Force Base Albuquerque,
New Mexico, USA.
* Centre d'Etudes et de Recherches de Toulouse, Toulouse, France.
Pulsed oscillatory neural networks are examined for application to analysis and segmentation of
multispectral imagery from the Satellite Pour l'Observation de la Terre (SPOT). These networks
demonstrate a capacity to segment images with better performance against many of the resolution
uncertainty effects caused by local area adaptive filtering. To enhance synchronous behavior, a
reset mechanism is added to the model. Previous work suggests that a reset activation pulse is
generated by saccadic motor commands. Consequently, an algorithm is developed which behaves
similarly to adaptive histogram techniques. These techniques appear both biologically plausible and
more effective than conventional techniques. Using the pulse time-of-arrival as the information
carrier, the image is reduced to a time signal which allows an intelligent filtering using feedback.
1 Introduction
Histogram image analysis may play an important role in biological vision image
processing. Structures of artificial neurons can be used to create histogram-like sig-
nals quickly. In this paper, we will examine how algorithms based on fast histogram
processing may offer advantages for computer vision systems.
In a biological vision system, the signals detected at the retina are passed to the LGN and then to the visual cortex, where neighborhood-preserving maps of the retina are repeated at least 15 times over the surface of the cortex. For every path forward, there are several neurological pathways in the reverse direction. Reverse-direction transmission of information suggests feedback signals. Using appropriate feedback, the image processing can be controlled using recurrence. Pulses generated by dynamic neuronal models suggest a method for building recurrence into vision models.
1.1 Dynamic Neural Networks
Dynamic neural networks, first examined by Stephen Grossberg, Maas and others [3], were an attempt to construct models closer to their biological counterparts. In this model, the basic computational unit is not the usual static matrix multiplier with a non-linear transfer function between layers, but a leaky integrator with a pulsed oscillatory output. This work is sometimes referred to as Integrate and Fire Networks.
To understand the role of synchrony in the cat visual cortex, Eckhorn devised a
model which would replicate some of this behavior. The Eckhorn [1] dynamic model
represents a visual neuron as a multi-input element with a single output. The output
is made up of oscillatory electrical pulses in time. Three types of connections make
up the input section: the feeding, the linking and the direct inputs. The direct
input provides pixel information. The feeding inputs provide local information in
the region of the direct inputs. The combination of the direct and feeding input
provide texture information in a local region around the direct input. The linking
field provides a global connection for all sub-regions in the image.
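A minimal discrete-time leaky integrate-and-fire unit of the kind described here, in Python (a generic sketch, not the Eckhorn model itself; the decay and threshold constants are arbitrary assumptions):

    import numpy as np

    def integrate_and_fire(inputs, decay=0.9, threshold=1.0):
        """Leaky integrator with pulsed output: returns a 0/1 spike train."""
        v, spikes = 0.0, []
        for x in inputs:
            v = decay * v + x          # leaky integration of the input
            if v >= threshold:         # pulse when the potential crosses threshold
                spikes.append(1)
                v = 0.0                # reset after firing
            else:
                spikes.append(0)
        return np.array(spikes)

    # Stronger inputs fire earlier and more often, so the pulse time-of-arrival
    # carries the intensity information, as described in the text.
    print(integrate_and_fire(np.full(20, 0.25)))
    print(integrate_and_fire(np.full(20, 0.60)))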
Figure 2
Figure 3
Figure 4
This image was segmented as part of a larger project to perform terrain categoriza-
tion using a linear approximation to the Eckhorn model developed by Freyss. For
a relatively complex image such as this, most of the common techniques perform
relatively poorly. The advantage with this new technique is that there are almost
no parameters to be set and no buried decisions being made by the operator.
3 Conclusions and Future Work
The model we developed demonstrates effective image segmentation over a wide
variety of image class types. Although many of the results demonstrated here could
be duplicated using standard techniques, these methods offer a simple modular
approach to the image analysis, and are easily implemented in silicon devices.
Our work suggests that the group behavior of clusters of neurons provides a tech-
nique which encodes images into one dimensional signals in time. Using the temporal
encoded group output as a control signal will add a large measure of robustness to
a visual system.
An important aspect of this work is the new approach to image processing; that is
expectation filtering for particular characteristics with feedback to enhance desired
qualities in the signal. Donohoe's [2] work on histogram equalization approximation
for manufacturing processes could easily be implemented using this architecture.
REFERENCES
[1] R. Eckhorn, H. Reitboeck, M. Arndt and P. Dicke, Feature linking via synchronization among distributed assemblies: simulations of results from cat visual cortex, Neural Computation, Vol. 1(2) (1990), pp293-307.
[2] G.W. Donohoe and C. Jeong, A combined analog-digital technique for normalizing video signals for the detection of moving objects, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (March 23-26).
[3] Stephen Grossberg, Nonlinear neural networks: Principles, mechanisms, and architectures, Neural Networks, Vol. 1 (1988), pp17-61.
REASONING NEURAL NETWORKS
Rua-Huan R. Tsaih
Department of Management Information Systems,
National Chengchi University, Taipei, Taiwan, ROC.
The Reasoning Neural Network (RN) has a learning algorithm belonging to the weight-and-structure-change category: it puts only one hidden node in the initial network structure, and recruits and prunes hidden nodes during the learning process. Empirical results show that learning of the RN is guaranteed to be completed, that the number of required hidden nodes is reasonable, that the speed of learning is much faster than for back-propagation networks, and that the RN is able to develop good internal representations.
1 Introduction
Intuitively, human learning consists of cramming and reasoning at a high level
of abstraction [5]. This observation has suggested a learning algorithm as shown
in Figure 1. This learning algorithm belongs to the weight-and-structure-change
category, because it puts only one hidden node initially, and will recruit and prune
hidden nodes during the learning process. Our learning algorithm is guaranteed to achieve the goal of learning perfectly. There are some similar learning algorithms [7, 4, 1, 2]; however, most of them have more complex pruning strategies.
2 The RN's Network Architecture
The RN adopts the layered feedforward network structure. Suppose that the network has three layers with $m$ input nodes at the bottom, $p$ hidden nodes, and $q$ output nodes at the top. Let $B_c\in\{-1,1\}^m$ be the $c$th given stimulus input, $b_{cj}$ the stimulus value received in the $j$th input node when $B_c$ is presented to the network, $w_{ij}$ the weight of the connection between the $j$th input node and the $i$th hidden node, $\theta_i$ the negative of the threshold value of the $i$th hidden node, $W_i\equiv(w_{i1},w_{i2},\dots,w_{im})^t$ the vector of weights of the connections between all input nodes and the $i$th hidden node, where the superscript $t$ indicates transposition, $X_i^t\equiv(\theta_i,W_i^t)$, and $X^t\equiv(X_1^t,X_2^t,\dots,X_p^t)$. Then, given the stimulus $B_c$, the activation value of the $i$th hidden node is computed:
$$h(B_c,X_i)\equiv\tanh\Big(\theta_i+\sum_{j=1}^m w_{ij}\,b_{cj}\Big).$$
Let $h(B_c,X)\equiv(h(B_c,X_1),h(B_c,X_2),\dots,h(B_c,X_p))^t$ be the activation value vector of all hidden nodes when $B_c$ is presented to the network, $r_{li}$ the weight of the connection between the $i$th hidden node and the $l$th output node, $r_l\equiv(r_{l1},r_{l2},\dots,r_{lp})^t$ the vector of weights of the connections between all hidden nodes and the $l$th output node, $s_l$ the negative of the threshold value of the $l$th output node, $Y_l^t\equiv(s_l,r_l^t)$, $Y^t\equiv(Y_1^t,Y_2^t,\dots,Y_q^t)$, and $Z^t\equiv(Y^t,X^t)$. The activation value of the $l$th output node is computed after $h(B_c,X)$:
$$O(B_c,Y_l,X)\equiv\tanh\Big(s_l+\sum_{i=1}^p r_{li}\,h(B_c,X_i)\Big).$$
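Written out in Python, the forward pass just defined is only a few lines (my sketch, not the author's code; the weight values below are placeholders):

    import numpy as np

    def hidden_activations(B_c, W, theta):
        """h(B_c, X_i) = tanh(theta_i + sum_j w_ij * b_cj) for all hidden nodes."""
        return np.tanh(theta + W @ B_c)

    def output_activations(h, R, s):
        """O(B_c, Y_l, X) = tanh(s_l + sum_i r_li * h_i) for all output nodes."""
        return np.tanh(s + R @ h)

    # Toy dimensions: m inputs, p hidden nodes, q outputs.
    m, p, q = 4, 2, 1
    rng = np.random.default_rng(0)
    W = rng.normal(size=(p, m))      # hidden weights w_ij
    theta = rng.normal(size=p)       # negatives of the hidden thresholds
    R = rng.normal(size=(q, p))      # output weights r_li
    s = rng.normal(size=q)           # negatives of the output thresholds

    B_c = np.array([1, -1, 1, 1])    # a stimulus in {-1, 1}^m
    print(output_activations(hidden_activations(B_c, W, theta), R, s))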
Figure 1 The block diagram of the learning algorithm. The details of the thinking
box, the reasoning box and the cramming box are shown in Figure 2 and Figure 3.
Figure 2 The Generalised Delta Rule part, the thinking part and the reasoning part. The values of the given constants in the Generalised Delta Rule part are tiny.
(LSC), the thinking mechanism, the cramming mechanism, and the reasoning mechanism. These aspects are explained in the following.
With respect to each output node, the network is used as a classifier which learns
to distinguish if the stimulus is a member of one class of stimuli, called class 1, or
$p+1\to p$, then adds the new $p$th hidden node with $W_p=B_k$, $\theta_p=1-m$, and $r_{lp}=0$ for every $l\in L$ and, for every $l\notin L$,

Figure 3 The Cramming part. Suppose that $l\in L$ if, before implementing the cramming mechanism, the LSC with respect to the $l$th output node is satisfied.
of a different class, called class 2, by being presented with exemplars of each class. With respect to the $l$th output node, let $K\equiv K_{l1}\cup K_{l2}$, where $K_{l1}$ and $K_{l2}$ are the sets of indices of all given training stimuli in classes 1 and 2, respectively; and let $d_{cl}$ be the desired output value of the $l$th output node for the $c$th stimulus, with 1.0 and $-1.0$ being respectively the desired output values of classes 1 and 2. Learning seeks $Z$ such that, for all $l$, requirement (1) holds for all $c\in K$, where $0<v<1$. With respect to the $l$th output node, let the LSC be that
$$\min_{c\in K_{l1}} O(B_c,Y_l,X)>\max_{c\in K_{l2}} O(B_c,Y_l,X).$$
When the LSC with respect to the $l$th output node is satisfied, requirement (1) with respect to the $l$th output node can be achieved by merely adjusting $Y_l$ [5].
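In code, the LSC for the $l$th output node is a one-line comparison (continuing the forward-pass sketch above; `outputs` is assumed to hold $O(B_c,Y_l,X)$ for every stimulus $c$):

    import numpy as np

    def lsc_satisfied(outputs, class1_idx, class2_idx):
        """LSC: the smallest class-1 output exceeds the largest class-2 output."""
        return outputs[class1_idx].min() > outputs[class2_idx].max()

    outputs = np.array([0.9, 0.7, -0.8, -0.6])      # O(B_c, Y_l, X) for c = 1..4
    print(lsc_satisfied(outputs, [0, 1], [2, 3]))   # True: the classes separate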
At the learning stage, the training stimuli are presented one by one. At the $k$th given stimulus, the objective function is defined as
$$E(Z)\equiv\sum_{c=1}^{k}\sum_{l=1}^{q}\big(O(B_c,Y_l,X)-d_{cl}\big)^2,$$
and let $K(k)\equiv\{1,\dots,k\}$ and $K(k)=K_{l1}(k)\cup K_{l2}(k)$, where $K_{l1}(k)$ and $K_{l2}(k)$ are, respectively, the sets of indices of the first $k$ training stimuli in classes 1 and 2, with respect to the $l$th output node. Then the thinking mechanism is implemented, in which the momentum version of the generalized delta rule (with automatic adjustment of the learning rate) is adopted. Learning might converge to the bottom of a very shallow steep-sided valley [3], where the magnitude of the gradient will be tiny and the consecutive adaptive learning rates will also be tiny. Therefore, as shown in the generalized delta rule part of Figure 2, these two criteria are adopted to detect if the learning hits the neighborhood of an undesired attractor.

The desired solution is not required to render requirement (1) satisfied or to be a stationary point in which $\nabla_Z E(Z)=0$. Thus the magnitude of $\|\nabla_Z E(Z)\|$ before hitting the desired solution is not necessarily tiny, and the learning time is rather short compared with conventional stopping criteria (for example, small $E(Z)$ or $\|\nabla_Z E(Z)\|=0$).
The thinking mechanism does not guarantee that the LSC with respect to all output
nodes will be satisfied. Two ideas could render the learning capable of escaping
from the undesired attractor: add a hidden node and alter the objective function.
By adding a hidden node, the dimension of the weight space is increased; while
altering the objective function will change its function surface on the weight space
such that the trapped attractor could be no more an attractor. These two ideas are
implemented in our learning algorithm via the cramming mechanism and that the
objective function is altered by introducing a new training stimulus.
The cramming mechanism can be viewed as follows: first a new hidden node with the near-threshold activation function is added; then the softening mechanism of [6] is used immediately to render the activation function of the new hidden node a
tanh one. The Lemma in [6] shows that the mechanism of adding a new hidden node
with the near threshold activation function and a finite value of the gain parameter
can immediately render the LSC with respect to all output nodes satisfied, if the
training set has no internal conflicts (different outputs for the same input).
However, the number of required hidden nodes may become too large and the generalization ability of the network may be poor. Thus it is necessary to adopt the reasoning
mechanism for the purpose of rendering the network more compact. The reasoning
mechanism includes the thinking mechanism and the pruning mechanism of remov-
ing irrelevant hidden nodes. In a Z, the ith hidden node is said to be irrelevant to
the LSC with respect to the lth output node if the LSC is still satisfied with the
same Z except r/i = 0; and a hidden node is irrelevant if it is irrelevant to the LSC
with respect to all output nodes [5].
4 The Performance of the RN
We report three experiments. In each simulation, there are 100 testing cases, each with a different input sequence of training stimuli. One experiment is the m-bits parity learning problem. In Figure 4a, the numbers of used hidden nodes during the 6-bits parity learning process are plotted. The variance of p is due to the different input sequences of training stimuli. Figure 4b shows the summary of the simulation
results, and it shows that the average value p of the m-bits parity problems is merely a little bigger than m. However, it is surprising to see that the RN can solve the m-bits parity problems with fewer than m hidden nodes.
The output/hidden experiment is used to identify the relationship between the
number of required hidden nodes and the number of used output nodes. The number
of input nodes is fixed to be 8. The training stimuli and their input sequence are
randomized; but the number of used output nodes is varied from 1 to 6. In Figure
4c, the simulation result of the "m = 8, q = 3, and K = 100" problem is plotted.
Figure 4d shows the summary of the simulation results, and it shows that the
relationship between the average value p of the RN and the value of q is rather a
linear one.
One significant phenomenon of the above simulations is that the value of q influences the value of p more significantly than the values of K and m do. Another interesting phenomenon is that if there is no correlation within the current given training stimuli, the RN tends to cram them (in other words, memorize them individually) by using many more hidden nodes. But when there is correlation within the current given training stimuli, it seems that the RN will figure out a smart way to classify the given training stimuli using fewer hidden nodes. In other words, the RN has the ability to develop a good internal representation for the given training stimuli.
Figure 4 a) The simulation result of the 6-bits parity problem. b) The summary of simulation results of the m-bits parity problem. c) The simulation result of the "m = 8, q = 3, and K = 100" problem. d) The summary of simulation results of the output/hidden problems. e) The simulation result of the 5-p-5 problem.
The third experiment is the 5-p-5 problem (Figure 4e), in which the training stimuli
are the same as those of the 5-bits parity problem and the desired output vector is
the same as the stimulus input vector. Somewhat surprisingly, as shown in Figure
5, each hidden node of one final RN has only one strong connection strength from
input nodes, and each output node has only one strong connection strength from
Tsaih: Reasoning Neural Networks 371
Figure 5 One final RN of the 5-p-5 encoder problem, where we show only the
connections with weights of large magnitude, and the signs of their weights. Note
that the signs of two connected strongest connections are the same.
hidden nodes. In addition, different hidden nodes have strongest connections from
different input nodes, different output nodes have strongest connections from a
different hidden node, and the signs of connected strongest connections are the
same. It seems that, after learning the full set of training stimuli, the RN had
learned to use the hidden nodes to bypass the input stimulus, rather than to encode
them.
5 Discussions and Future work
The empirical results show that the learning of the RN is much faster than the back-propagation learning algorithm, and that the RN is able to develop good internal representations with good generalization. The RN has flexibility in its learning algorithm; different algorithms have been obtained by integrating the prime mechanisms in different ways. These yield different simulation results: it seems that we should manage the RN differently for different application problems.
REFERENCES
[1] Fahlman, S. & C. Lebiere, The Cascade-Correlation Learning Architecture, in: Touretzky, D. (ed.), Advances in Neural Information Processing Systems II, Denver (1989), San Mateo, Morgan Kaufmann.
[2] Frean, M., The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks, Neural Computation, Vol. 2 (1990), pp198-209.
[3] McInerny, J., K. Hainer, S. Biafore & R. Hecht-Nielsen, Back Propagation Error Surfaces Can Have Local Minima, Proceedings of the International Joint Conference on Neural Networks (1989), Vol. II, p627.
[4] Mezard, M. & J. Nadal, Learning in Feedforward Layered Networks: The Tiling Algorithm, Journal of Physics A, Vol. 22 (1989), pp2191-2204.
[5] Tsaih, R., The Softening Learning Procedure, Mathematical and Computer Modelling, Vol. 18 (1993), No. 8, pp61-64.
[6] Tsaih, R., The Softening Learning Procedure for the Networks with Multiple Output Nodes, MIS Review, Vol. 4 (1994), pp89-93, Taipei.
[7] Watanabe, E. & H. Shimizu, Algorithm for Pruning Hidden Nodes in Multi-Layered Neural Network for Binary Pattern Classification Problem, Proceedings of the International Joint Conference on Neural Networks (1993), Vol. I, pp327-330.
CAPACITY OF THE UPSTART ALGORITHM
Ansgar H. L. West* and David Saad
Neural Computing Research Group, Aston University,
Aston Triangle, Birmingham B4 7ET, UK.
Email: [email protected]@aston.ac.uk
* Also affiliated to: Department of Physics, University of Edinburgh,
Mayfield Road, Edinburgh EH9 3JZ, UK.
The storage capacity of multilayer networks with overlapping receptive fields is investigated for a constructive algorithm within a one-step replica symmetry breaking (RSB) treatment. We find that the storage capacity increases logarithmically with the number of hidden units K, without saturating the Mitchison-Durbin bound. The slope of the logarithmic increase decays exponentially with the stability with which the patterns have been stored.
1 Introduction
Since the ground breaking work of Gardner [1] on the storage capacity of the per-
ceptron, the replica technique of statistical mechanics has been successfully used
to investigate many aspects of the performance of simple neural network mod-
els. However, progress for multilayer feedforward networks has been hampered by
the inherent difficulties of the replica calculation. This is especially true for ca-
pacity calculations, where replica symmetric (RS) treatments [2] violate the upper
Mitchison-Durbin bound [3] derived by information theory. Other efforts [4] break
the symmetry of the hidden units explicitly prior to the actual calculation, but
the resulting equations are approximations and difficult to solve for large networks.
This paper avoids these problems by addressing the capacity of a class of networks
with variable architecture produced by a constructive algorithm. In this case, re-
sults derived for simple binary perceptrons above their saturation limit [5] can be
applied iteratively to yield the storage capacity of two-layer networks.
Constructive algorithms (e.g., [6, 8]) are based on the idea that in general it is a
priori unknown how large a network must be to perform a certain classification
task. It seems appealing therefore to start off with a simple network, e.g., a binary
perceptron, and to increase its complexity only when needed. This procedure has
the added advantage that the training time of the whole network is relatively short,
since each training step consists of training the newly added hidden units only,
whereas previously constructed weights are kept fixed. Although constructive al-
gorithm seem therefore rather appealing, their properties are not well understood.
The aim of this paper is to analyse the performance of one constructive algorithm,
the upstart algorithm [8], in learning random dichotomies, usually referred to as
the capacity problem.
The basic idea of the upstart algorithm is to start with a binary perceptron unit
with possible outputs {I,O}. Further units are created only if the initial perceptron
to the output unit. The original upstart algorithm produces a hierarchical network
where the number of hidden units tends to increase exponentially with each
generation. Other versions of the upstart algorithm [8] build a two-layer architecture
and show only a linear increase of the number of units with each generation, which
is in general easier to implement.
We have therefore analysed a non-hierarchical version of the upstart algorithm. Within a one-step replica symmetry breaking (RSB) treatment [9], networks constructed by the upstart algorithm show a logarithmic increase of the capacity with the number of nodes, in agreement with the Mitchison-Durbin bound ($\alpha_c\propto\ln K/\ln 2$), whereas the simpler RS treatment violates this bound. Furthermore, the algorithm does not saturate the Mitchison-Durbin bound for zero stability. We further find that the slope of the logarithmic increase of the capacity against network size decreases exponentially with the stability.
2 Model Description and Framework
2.1 Definition of the Upstart Algorithm
The upstart algorithm first creates a binary perceptron (or unit) $V_0$ which learns a synaptic weight vector $W\in\mathbb{R}^N$ and a threshold $\theta$ which minimize the error on a set of $p$ input-output mappings $\xi^\mu\in\{-1,1\}^N\to\zeta^\mu\in\{0,1\}$ ($\mu=1,\dots,p$) from an $N$-dimensional binary input space to binary targets. The output $\sigma^\mu$ of the binary perceptron is determined by thresholding $W\cdot\xi^\mu$ at $\theta$.

                  $\zeta=1$                          $\zeta=0$
$\sigma=1$        CORRECTLY ON                       WRONGLY ON
                  $V^+$: *    $V^-$: 0               $V^+$: 0    $V^-$: 1
$\sigma=0$        WRONGLY OFF                        CORRECTLY OFF
                  $V^+$: 1    $V^-$: 0               $V^+$: 0    $V^-$: *

Table 1 The targets of the upstart II algorithm depending on the requested target $\zeta$ and the actual output $\sigma$ of the output unit O. The target "*" means that the pattern is not included in the training set of $V^\pm$.

We define the algorithm upstart II by the following steps, which are applied recursively until the task is learned:
Step 0: Follow the above procedure for the original unit $V_0$ and the creation of the output unit O. Evaluate the number of WRONGLY OFF and WRONGLY ON errors.
Step 1: If the output unit O of the upstart network of $i$ generations makes more WRONGLY OFF than WRONGLY ON errors, a new unit $V_{i+1}^+$ is created and trained on the training set and targets given in Table 1. If there are more WRONGLY ON than WRONGLY OFF errors, a new unit $V_{i+1}^-$ is created, with training set and targets also given in Table 1. If both kinds of errors occur equally often, two units $V_{i+1}^+$ and $V_{i+1}^-$ are created with training sets and targets as above.
Step 2: The new units are trained on their training sets and their weights are frozen. The units $V_{i+1}^+$, $V_{i+1}^-$ are then connected with positive, negative weights to the output unit, respectively. The moduli of the weights are adjusted so that $V_{i+1}^\pm$ overrules any previous decision if active. The total number of WRONGLY OFF and WRONGLY ON errors of the upstart network of generation $i+1$ is then re-evaluated. If the network still makes errors, the algorithm goes back to Step 1.

The algorithm will eventually converge, as a daughter unit will always be able to correct at least one of the previously misclassified patterns without upsetting any already correctly classified examples.
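As a concrete illustration of this loop, here is a small Python sketch of upstart II (my reading, not the authors' code): train_perceptron is a stand-in pocket-algorithm perceptron, the geometrically growing output weights implement the "overrule" rule of Step 2, and on a tie only a $V^+$ unit is created, where the paper creates both.

    import numpy as np

    def train_perceptron(X, y, epochs=100, seed=0):
        """Best-effort {0,1} perceptron via the pocket algorithm; returns (w, theta)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, th = rng.normal(size=d), 0.0
        best, best_err = (w.copy(), th), n + 1
        for _ in range(epochs):
            for i in rng.permutation(n):
                pred = float(X[i] @ w - th > 0)
                w += (y[i] - pred) * X[i]
                th -= y[i] - pred
            err = int(np.sum(((X @ w - th) > 0).astype(float) != y))
            if err < best_err:
                best, best_err = (w.copy(), th), err
        return best

    def upstart2(X, y, max_units=10):
        """Non-hierarchical upstart II: each daughter corrects current errors."""
        units = [train_perceptron(X, y)]
        signs = [1.0]                  # V+ units carry +, V- units carry - weights
        for gen in range(1, max_units):
            H = np.array([((X @ w - th) > 0).astype(float) for w, th in units]).T
            # Later units overrule earlier ones: output weights grow geometrically.
            out = ((H * np.array(signs)) @ (2.0 ** np.arange(len(units))) > 0).astype(float)
            wrongly_off = (y == 1) & (out == 0)
            wrongly_on = (y == 0) & (out == 1)
            if not (wrongly_off.any() or wrongly_on.any()):
                break                  # task learned
            if wrongly_off.sum() >= wrongly_on.sum():
                keep = ~((out == 1) & (y == 1))   # '*': drop CORRECTLY ON patterns
                target, sign = wrongly_off[keep].astype(float), +1.0   # a V+ unit
            else:
                keep = ~((out == 0) & (y == 0))   # '*': drop CORRECTLY OFF patterns
                target, sign = wrongly_on[keep].astype(float), -1.0    # a V- unit
            units.append(train_perceptron(X[keep], target, seed=gen))
            signs.append(sign)
        return units, signs

    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)       # XOR needs daughter units
    units, signs = upstart2(X, y)
    print(len(units), signs)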
2.2 Statistical Mechanics Framework for Calculating the
Capacity Limit
Since the upstart algorithm trains only perceptrons, we can apply knowledge of
the capacity limit and of the error rate of perceptrons above saturation derived in
a statistical mechanics framework to calculate the capacity limit of the upstart II
algorithm for an arbitrary number of generations. Below, we briefly review this
statistical mechanics calculation and refer the reader to [5] and to previous work [1]
for a more detailed treatment.
In the capacity problem the aim is to find the maximum number $p$ of random input-output mappings of binary $N$-dimensional input vectors $\xi^\mu$ to targets $\zeta^\mu\in\{0,1\}$ which can be realized by a network on average. We assume that each component of the input vectors $\xi^\mu$ is drawn independently with equal probability from $\{-1,1\}$. The distribution of targets is taken to be pattern independent with a possible bias $b$: $P(\zeta)=\frac{1}{2}(1+b)\,\delta(1-\zeta)+\frac{1}{2}(1-b)\,\delta(\zeta)$. We will here only consider an unbiased output distribution for the initial perceptron. The target distributions for daughter units, however, will in general be biased.
Each binary perceptron is trained stochastically and we only allow weight vector
solutions with the minimal achievable error. The error rate, i.e., the number of er-
rors divided by the total number of examples, is assumed to be self-averaging with
respect to the randomness in the training set in the thermodynamic limit N --+ 00.
In this limit the natural measure for the number of examples p is a = piN. With
increasing a the weight space of possible solutions shrinks, leaving a unique solu-
tion at the capacity limit of the binary perceptron. Above the capacity limit many
different weight space solutions with the same error are possible. In general the so-
lution space will be disconnected as two solutions can possibly missclassify different
patterns. As a diverges, the solution space becomes increasingly fragmented.
The replica trick is used to calculate the solution space and the minimal error rate
averaged over the randomness of the training set. This involves the replication of
the perceptron weight vector, each replica representing a different possible solution
to the same storage problem. In order to make significant progress, one has further
to assume some kind of structure in the replica space. Below the capacity limit,
the connectedness of the solution space is reflected by the correctness of a replica
symmetric (RS) ansatz. Above the capacity, the disconnectedness of the solution
space breaks the RS to some degree. We have restricted ourselves to a one-step
replica symmetry breaking (RSB) calculation, which is expected to be at least
sufficient for small error rates. The form of the equations for the error rate resulting
from the RS and one-step RSB calculations are quite cumbersome and will be
reported elsewhere [5, 11]. For the perceptron, the error rate is a function of the
output bias b and the load a only.
3 Results of the Upstart Algorithm
The capacity of an upstart network with K hidden units can now be calculated. The initial perceptron is trained with an example load of $\alpha$ and an unbiased output distribution $b=0$. The saddlepoint equations and the WRONGLY ON and WRONGLY OFF error rates are calculated numerically. These error rates determine the load and bias for the unit(s) to be created in the next generation. Now its (their) error rates and the errors of the output unit can in turn be calculated by solving the saddlepoint equations. This is iterated until K units have been built. If the output unit still makes errors, we are above the capacity limit of the upstart net with K hidden units and $\alpha$ has to be decreased. On the other hand, if the output unit makes no errors, $\alpha$ can be increased. The maximal $\alpha$ for which the output unit makes no errors defines the saturation point of the network. The capacity limit, defined here
Figure 1 (a) Within the one-step RSB theory, the capacity $\alpha_c$ increases logarithmically with the number of hidden units K for large K for the stabilities $\kappa=0$ (0.1), i.e., $\alpha_c\propto 0.3595\,(0.182)\ln K$ (see superimposed asymptotics). The RS theory violates the Mitchison-Durbin bound (third asymptotic: $\alpha_c\propto\ln K/\ln 2$) for $K\gtrsim 180$. (b) The slope $\gamma$ of the logarithmic increase of the capacity decreases exponentially with the stability $\kappa$.
as the maximal number of examples per adjustable weight of the network, then becomes simply $\alpha_c(K)=\alpha/K$.
In Fig. 1a we present the storage capacity as a function of the number of hidden units for both a one-step RSB and an RS treatment at zero stability of the patterns ($\kappa=0$). Whereas one-step RSB predicts a logarithmic increase $\alpha_c(K)\propto\ln(K)$ for large networks, in agreement with the Mitchison-Durbin bound, the results for the RS theory violate this upper bound¹, i.e., the RS theory fails to predict the qualitative behaviour correctly.

In Fig. 1a we also show that the storage capacity still increases logarithmically with the number of units K for non-zero stability, but with a smaller slope $\gamma$. Fig. 1b shows the dependence of the slope $\gamma$ on the stability $\kappa$ for one-step RSB. The maximal slope for zero stability, $\gamma=0.3595\pm0.0015$, does not saturate the Mitchison-Durbin bound $\gamma=1/\ln 2\approx 1.4427$, but is about four times lower. With increasing stability $\kappa$ this slope decreases exponentially, $\gamma\propto\exp(-(6.77\pm0.02)\,\kappa)$.
4 Summary and Discussion
The objective of this work has been to calculate the storage capacity of multilayer
networks created by the constructive upstart algorithm in a statistical mechanics
framework using the replica method. We found that the RS-theory fails to predict
the correct results even qualitatively. The one-step RSB theory yields qualitatively
and quantitatively correct results over a wide range of network sizes and stabilities.
In the one-step RSB treatment, a logarithmic increase with slope $\gamma$ of the capacity of the upstart algorithm with the number of units K was found for all stabilities. The slope decreases exponentially [$\gamma\propto\exp(-6.77\kappa)$] with the stability $\kappa$. It would be
interesting to investigate if this result carries over to other constructive algorithms
or even to general two-layer networks.
¹The violation occurs for $K\gtrsim 180$, and the largest networks in the RS case were K = 999.
For zero stability the slope of this increase is around four times smaller than the
upper bound (1/ In 2) predicted by information theory. We suggest that this indi-
cates that the upstart algorithm uses its hidden units less effectively than a general
two-layer network. We think this is due to the fact that the upstart algorithm uses
the hidden units to overrule previous decisions, resulting in an exponential increase
of the hidden layer to output unit weights. This is in contrast to general two-layer
networks which usually have hidden-output weights of roughly the same order and
can therefore explore a larger space of internal representations. For the upstart
algorithm a large number of internal representations are equivalent and others can-
not be implemented as they are related to erroneous outputs. However, it would
be interesting to investigate how other constructive algorithms (e.g., [6]) perform
in comparison. A systematic investigation of the storage capacity of constructive
algorithms may ultimately lead to a better understanding, and thus possibly to
novel, much improved algorithms.
REFERENCES
[1] E. Gardner, The space of interactions in neural network models, J. Phys. A, Vol. 21 (1988), pp257-270.
[2] E. Barkai, D. Hansel and H. Sompolinsky, Broken symmetries in multilayered perceptrons, Phys. Rev. A, Vol. 45 (1992), pp4146-4161.
[3] G. J. Mitchison and R. M. Durbin, Bounds on the learning capacity of some multi-layer networks, Biological Cybernetics, Vol. 60 (1989), pp345-356.
[4] D. Saad, Explicit symmetries and the capacity of multilayer neural networks, J. Phys. A, Vol. 27 (1994), pp2719-2734.
[5] A. H. L. West and D. Saad, Threshold induced phase transitions in perceptrons, to appear in J. Phys. A (March 1997).
[6] M. Mezard and J.-P. Nadal, Learning in feed-forward layered networks: the tiling algorithm, J. Phys. A, Vol. 22 (1989), pp2191-2203.
[7] J.-P. Nadal, Study of a growth algorithm for a feed-forward network, International Journal of Neural Systems, Vol. 1 (1989), pp55-59.
[8] M. Frean, The upstart algorithm: a method for constructing and training feed-forward neural networks, Neural Computation, Vol. 2 (1990), pp198-209.
[9] For an overview, see e.g., M. Mezard, G. Parisi and M. G. Virasoro, Spin Glass Theory and Beyond, World Scientific, Singapore (1987).
[10] M. Frean, A thermal perceptron learning rule, Neural Computation, Vol. 4 (1992), pp946-957, and references therein.
[11] A. H. L. West and D. Saad, Statistical mechanics of constructive algorithms, in preparation (1997).
Acknowledgements
AHLW would like to acknowledge gratefully financial support by the EPSRC. This
work has been supported by EU grant ERB CHRX-CT92-0063.
REGRESSION WITH GAUSSIAN PROCESSES
Christopher K. I. Williams
Neural Computing Research Group, Department of Computer Science
and Applied Mathematics, Aston University, Birmingham B4 7ET, UK.
Email: [email protected]
The Bayesian analysis of neural networks is difficult because the prior over functions has a complex
form, leading to implementations that either make approximations or use Monte Carlo integration
techniques. In this paper I investigate the use of Gaussian process priors over functions, which
permit the predictive Bayesian analysis to be carried out exactly using matrix operations. The
method has been tested on two challenging problems and has produced excellent results.
1 Introduction
In the Bayesian approach to neural networks a prior distribution over the weights
induces a prior distribution over functions. This prior is combined with a noise
model, which specifies the probability of observing the targets t given function
values y, to yield a posterior over functions which can then be used for predictions.
For neural networks the prior over functions has a complex form which means
that implementations must either make approximations [4] or use Monte Carlo
approaches to evaluating integrals [6].
As Neal [7] has argued, there is no reason to believe that, for real-world problems,
neural network models should be limited to nets containing only a "small" number
of hidden units. He has shown that it is sensible to consider a limit where the
number of hidden units in a net tends to infinity, and that good predictions can be
obtained from such models using the Bayesian machinery¹. He has also shown that
a large class of neural network models will converge to a Gaussian process prior
over functions in the limit of an infinite number of hidden units.
Although infinite networks are one method of creating Gaussian processes, it is
also possible (and computationally easier) to specify them directly using paramet-
ric forms for the mean and covariance functions. In this paper I investigate using
Gaussian processes specified parametrically for regression problems², and demon-
strate very good performance on the two test problems I have tried. The advantage
of the Gaussian process formulation is that the integrations, which have to be ap-
proximated for neural nets, can be carried out exactly (using matrix operations) in
this case. I also show that the parameters specifying the Gaussian process can be
estimated from training data, and that this leads naturally to a form of "Automatic
Relevance Determination" [4], [7].
2 Prediction with Gaussian Processes
A stochastic process is a collection of random variables $\{Y(x)\,|\,x\in X\}$ indexed by a set $X$. Often $X$ will be a space such as $\mathbb{R}^d$ for some dimension $d$, although it could be more general. The stochastic process is specified by giving the probability distribution for every finite subset of variables $Y(x_1),\dots,Y(x_k)$ in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function $\mu(x)=E[Y(x)]$ and its covariance function $C(x,x')=E[(Y(x)-\mu(x))(Y(x')-\mu(x'))]$; any finite set of points will have a joint multivariate Gaussian distribution.
¹Large networks cannot be successfully used with maximum likelihood training because of the overfitting problem.
²By regression problems I mean those concerned with the prediction of one or more real-valued outputs, as compared to classification problems.
Below I consider Gaussian processes which have $\mu(x)\equiv 0$. This is the case for many neural network priors [7], and otherwise assumes that any known offset or trend in the data has been removed. A non-zero $\mu(x)$ can be incorporated into the framework, but leads to extra notational complexity.
Given a prior covariance function $C_P(x,x')$, a noise process $C_N(x,x')$ (with $C_N(x,x')=0$ for $x\neq x'$) and data $\mathcal{D}=((x_1,t_1),(x_2,t_2),\dots,(x_n,t_n))$, the prediction for the distribution of $Y$ corresponding to a test point $x$ is obtained simply by marginalizing the $(n+1)$-dimensional joint distribution to obtain the mean and variance
$$\hat y(x)=k_P^t(x)\,(K_P+K_N)^{-1}\,t,\qquad(1)$$
$$\sigma_{\hat y}^2(x)=C_P(x,x)+C_N(x,x)-k_P^t(x)\,(K_P+K_N)^{-1}\,k_P(x),\qquad(2)$$
where $[K_\alpha]_{ij}=C_\alpha(x_i,x_j)$ for $\alpha=P,N$, $k_P(x)=(C_P(x,x_1),\dots,C_P(x,x_n))^t$ and $t=(t_1,\dots,t_n)^t$. $\sigma_{\hat y}^2(x)$ gives the "error bars" of the prediction. In the work below the noise process is assumed to have a variance $\sigma_\nu^2$ independent of $x$, so that $K_N=\sigma_\nu^2 I$.
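Equations (1) and (2) translate directly into a few lines of linear algebra. The Python sketch below is my illustration, with a squared-exponential covariance standing in for $C_P$ (the particular covariance and data are assumptions, not the paper's experiments):

    import numpy as np

    def sq_exp_cov(A, B, w=1.0, v0=1.0):
        """A simple stand-in prior covariance C_P: v0 * exp(-0.5 * w * |x - x'|^2)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return v0 * np.exp(-0.5 * w * d2)

    def gp_predict(X, t, x_star, cov, noise_var):
        """Predictive mean (Eq. 1) and variance (Eq. 2) at a test point x_star."""
        K = cov(X, X) + noise_var * np.eye(len(X))   # K_P + K_N with K_N = s^2 I
        k = cov(X, x_star[None, :])[:, 0]            # k_P(x)
        mean = k @ np.linalg.solve(K, t)             # Eq. (1)
        var = (cov(x_star[None, :], x_star[None, :])[0, 0] + noise_var
               - k @ np.linalg.solve(K, k))          # Eq. (2)
        return mean, var

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(30, 1))
    t = np.sin(X[:, 0]) + 0.05 * rng.normal(size=30)
    m, v = gp_predict(X, t, np.array([0.5]), sq_exp_cov, noise_var=0.05**2)
    print(m, np.sqrt(v))   # close to sin(0.5), with small error bars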
The Gaussian process view provides a unifying framework for many regression methods. ARMA models used in time series analysis and spline smoothing (e.g. [10]) correspond to Gaussian process prediction with a particular choice of covariance function³, as do generalized linear regression models ($y(x)=\sum_i w_i\phi_i(x)$, with $\{\phi_i\}$ a fixed set of basis functions) for a Gaussian prior on the weights $\{w_i\}$. Gaussian processes have also been used in the geostatistics field (e.g. [3], [1]), where they are known as "kriging", but this literature has concentrated on the case where $x\in\mathbb{R}^2$ or $\mathbb{R}^3$, rather than considering more general input spaces. Regularization networks (e.g. [8], [2]) provide a complementary view of Gaussian process prediction in terms of a Fourier space view, which shows how high-frequency components are damped out to obtain a smooth approximator.
2.1 Adapting Covariance Functions and ARD
Given a covariance function $C=C_P+C_N$, the log probability $l$ of the training data is given by
$$l=-\frac{1}{2}\log\det K-\frac{1}{2}\,t^t K^{-1}t-\frac{n}{2}\log 2\pi,\qquad(3)$$
where $K=K_P+K_N$ and the parameters of the covariance function form a vector $\theta$⁴. For example, in a $d$-dimensional input space we may choose
$$C(x,x')=v_0\exp\Big\{-\frac{1}{2}\sum_{i=1}^{d}w_i\,(x_i-x_i')^2\Big\}+v_1,\qquad(4)$$
where $v_0$, $v_1$ and the $\{w_i\}$ are adjustable. In MacKay's terms [4], $l$ is the log "evidence", with the parameter vector $\theta$ roughly corresponding to his hyperparameters $\alpha$ and $\beta$; in effect the weights have been exactly integrated out.

One reason for constructing a model with variable $w$'s is to express the prior belief that some input variables might be irrelevant to the prediction task at hand, and

³Technically splines require generalized covariance functions.
⁴See section 4 for a discussion of the hierarchical Bayesian approach.
we would expect that the w's corresponding to the irrelevant variables would tend
to zero as the model is fitted to data. This is closely related to the Automatic
Relevance Determination (ARD) idea of MacKay and Neal [5], [7].
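The covariance of Equation (4) and the log evidence (3) are equally direct to implement; in this Python sketch (my illustration; the data and constants are made up) the per-dimension $w_i$ play the ARD role:

    import numpy as np

    def ard_cov(w, v0, v1):
        """Covariance of Eq. (4): v0 * exp(-0.5 * sum_i w_i (x_i - x'_i)^2) + v1."""
        w = np.asarray(w)
        def cov(A, B):
            d2 = (w * (A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return v0 * np.exp(-0.5 * d2) + v1
        return cov

    def log_evidence(X, t, cov, noise_var):
        """l = -0.5 log det K - 0.5 t^t K^{-1} t - (n/2) log 2 pi   (Eq. 3)."""
        K = cov(X, X) + noise_var * np.eye(len(X))
        _, logdet = np.linalg.slogdet(K)
        return (-0.5 * logdet - 0.5 * t @ np.linalg.solve(K, t)
                - 0.5 * len(X) * np.log(2 * np.pi))

    rng = np.random.default_rng(0)
    X = np.column_stack([rng.uniform(-3, 3, 40), rng.normal(size=40)])  # 2nd input irrelevant
    t = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)
    for w2 in (1.0, 0.01):
        print(w2, log_evidence(X, t, ard_cov([1.0, w2], 1.0, 0.1), 0.05**2))
    # The evidence is typically higher when w_2 is small, i.e. when the
    # irrelevant input is effectively switched off.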
3 Experiments with Gaussian Process Prediction
Prediction with Gaussian processes and maximum likelihood training of the covariance function has been tested on two problems: (i) a modified version of MacKay's robot arm problem and (ii) the Boston housing data set.
For both datasets I used a covariance function of the form given in equation 4 and a gradient-based search algorithm for exploring θ-space; the derivative vector ∂l/∂θ was fed to a conjugate gradient routine with a line search⁵.
3.1 The Robot Arm Problem
I consider a version of MacKay's robot arm problem introduced by Neal (1995). The standard robot arm problem is concerned with the mappings

    y_1 = r_1 cos x_1 + r_2 cos(x_1 + x_2)    y_2 = r_1 sin x_1 + r_2 sin(x_1 + x_2)    (5)

The data was generated by picking x_1 uniformly from [−1.932, −0.453] and [0.453, 1.932] and picking x_2 uniformly from [0.534, 3.142]. Neal added four further inputs, two of which were copies of x_1 and x_2 corrupted by additive Gaussian noise of standard deviation 0.02, and two further irrelevant Gaussian-noise inputs with zero mean and unit variance. Independent zero-mean Gaussian noise of variance 0.0025 was then added to the outputs y_1 and y_2. I used the same datasets as Neal and MacKay, with 200 examples in the training set and 200 in the test set.
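A sketch of the data-generating process just described; the link lengths r1 and r2 are not stated in this chapter, so the values below are assumptions:

```python
import numpy as np

def make_robot_arm_data(n, r1=2.0, r2=1.3, seed=0):
    """Six-input robot arm data as described above (r1, r2 illustrative)."""
    rng = np.random.default_rng(seed)
    # x1 uniform on [-1.932, -0.453] or [0.453, 1.932]; x2 on [0.534, 3.142]
    x1 = rng.choice([-1.0, 1.0], n) * rng.uniform(0.453, 1.932, n)
    x2 = rng.uniform(0.534, 3.142, n)
    y1 = r1 * np.cos(x1) + r2 * np.cos(x1 + x2) + rng.normal(0.0, 0.05, n)
    y2 = r1 * np.sin(x1) + r2 * np.sin(x1 + x2) + rng.normal(0.0, 0.05, n)
    X = np.column_stack([
        x1, x2,
        x1 + rng.normal(0.0, 0.02, n),   # corrupted copy of x1
        x2 + rng.normal(0.0, 0.02, n),   # corrupted copy of x2
        rng.normal(0.0, 1.0, n),         # irrelevant input
        rng.normal(0.0, 1.0, n),         # irrelevant input
    ])
    return X, y1, y2
```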
The theory described in section 2 deals only with the prediction of a scalar quantity Y, so I constructed predictors for the two outputs separately, although a joint prediction is possible within the Gaussian process framework (see co-kriging, §3.2.3 in [1]). Two experiments were conducted, the first using only the two "true" inputs, and the second using all six inputs. For each experiment ten random starting positions were tried. The log(v)'s and log(w)'s were all chosen uniformly from [−3.0, 0.0], and were adapted separately for the prediction of y_1 and y_2. The conjugate gradient search algorithm was allowed to run for 100 iterations, by which time the likelihood was changing very slowly. Results are reported for the run which gave the highest probability of the training data, although in fact all runs performed very similarly. The results are shown in Table 1⁶ and are encouraging, as they indicate that the Gaussian process approach is giving very similar performance to two well-respected techniques. All of the methods obtain a level of performance which is quite close to the theoretical minimum error level of 1.0. It is interesting to look at the values of the w's obtained after the optimization; for the y_2 task the values were 0.243, 0.237, 0.0650, 1.7 × 10⁻⁴, 2.6 × 10⁻⁶, 9.2 × 10⁻⁷, and v_0 and v_1 were 7.920 and 0.0022 respectively. The w values show nicely that the first two inputs are the most important, followed by the corrupted inputs and then the irrelevant inputs.

⁵In fact the parameterization log θ was used in the search to ensure that the v's and w's stayed positive.
⁶The bottom three lines of the table were obtained from [7]. The MacKay result is the test error for the net with highest "evidence".
3.2 Boston Housing Data
The Boston Housing data has been used by several authors as a real-world regression problem (the data is available from ftp://lib.stat.cmu.edu/datasets). For each of the 506 census tracts within the Boston metropolitan area (in 1970) the data gives 13 input variables, including per capita crime rate and nitric oxides concentration, and one output, the median housing price for that tract.
A ten-fold cross-validation method was used to evaluate the performance, as detailed in [9]. The dataset was divided into ten blocks of near-equal size and distribution of class values (I used the same partitions as in [9]). For each block in turn the parameters of the Gaussian process were trained on the remaining blocks and then used to make predictions for the hold-out block. For each of the ten experiments the input variables and targets were linearly transformed to have zero mean and unit variance, and five random start positions were used, choosing the log(v)'s and log(w)'s uniformly from [−3.0, 0.0]. In each case the search algorithm was run for 100 iterations. In each experiment the run with the highest evidence was used for prediction, and the test results were then averaged to give the entry in Table 2.
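The evaluation protocol can be summarised in a short skeleton; `fit_fn` and `predict_fn` stand in for the covariance-parameter search and the Gaussian process predictor, and the standardization details below are our assumptions:

```python
import numpy as np

def ten_fold_cv(X, t, blocks, fit_fn, predict_fn):
    """For each hold-out block: standardize, fit on the rest, predict.

    blocks: list of index arrays partitioning the data into ten blocks.
    """
    errors = []
    for hold in blocks:
        train = np.setdiff1d(np.arange(len(t)), hold)
        mu_x, sd_x = X[train].mean(0), X[train].std(0)
        mu_t, sd_t = t[train].mean(), t[train].std()
        Xs, ts = (X - mu_x) / sd_x, (t - mu_t) / sd_t
        params = fit_fn(Xs[train], ts[train])   # best of several random starts
        pred = predict_fn(Xs[hold], params) * sd_t + mu_t
        errors.append(np.mean((pred - t[hold]) ** 2))
    return np.mean(errors)
```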
The fact that the Gaussian process result beats the best result obtained by Quinlan (who made a reasonably sophisticated application of existing techniques) is very encouraging. It was observed that different solutions were obtained from the different random starting points, and this suggests that a hierarchical Bayesian approach, as used in Neal's neural net implementation and described in section 4, may be useful in further increasing performance.
4 Discussion
I have presented a Gaussian process framework for regression problems and have shown that it produces excellent results on the two test problems tried.
In section 2 I described maximum likelihood training of the parameter vector θ. Obviously a hierarchical Bayesian analysis could be carried out for a model M, using a prior P(θ|M) to obtain a posterior P(θ|D, M). The predictive distribution for a test point and the "model evidence" P(D|M) are then obtained by averaging the conditional quantities over the posterior. Although these integrals would have to be performed numerically, there are typically far fewer parameters in θ than weights and hyperparameters in a neural net, so these integrations should be easier to carry out. Preliminary experiments in this direction with the Hybrid Monte Carlo method [7] are promising.
I have also conducted some experiments on the approximation of neural nets (with a finite number of hidden units) by Gaussian processes, although space limitations do not allow me to describe these here. Other directions currently under investigation include (i) the use of Gaussian processes for classification problems by softmaxing the outputs of k regression surfaces (for a k-class classification problem), and (ii) using non-stationary covariance functions, so that C(x, x') ≠ C(|x − x'|).
REFERENCES
[1] N. A. C. Cressie, Statistics for Spatial Data, Wiley (1993).
[2] F. Girosi, M. Jones, and T. Poggio, Regularization Theory and Neural Networks Architectures,
Neural Computation, Vol. 7(2) (1995), pp219-269.
[3] A. G. Journel and Ch. J. Huijbregts, Mining Geostatistics, Academic Press (1978).
[4] D. J. C. MacKay, A Practical Bayesian Framework for Backpropagation Networks, Neural
Computation, Vol. 4(3) (1992), pp448-472.
[5] D. J. C. MacKay, Bayesian Methods for Backpropagation Networks, In J. L. van Hemmen,
E. Domany, and K. Schulten, editors, Models of Neural Networks II, Springer (1993).
[6] R. M. Neal, Bayesian Learning via Stochastic Dynamics, in: S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., Advances in Neural Information Processing Systems, Vol. 5 (1993), pp475-482, Morgan Kaufmann, San Mateo, CA.
[7] R. M. Neal, Bayesian Learning for Neural Networks, PhD thesis, Dept. of Computer Science,
University of Toronto (1995).
[8] T. Poggio and F. Girosi, Networks for approximation and learning, Proceedings of IEEE,
Vol. 78 (1990), pp1481-1497.
[9] J. R. Quinlan, Combining Instance-Based and Model-Based Learning, in: P. E. Utgoff, ed.,
Proc. ML'93, Morgan Kaufmann, San Mateo, CA (1993).
[10] G. Wahba, Spline Models for Observational Data, Society for Industrial and Applied Math-
ematics, CBMS-NSF Regional Conference series in applied mathematics (1990).
Acknowledgements
I thank Radford Neal and David MacKay for many useful discussions and for generously providing data used in this paper, Chris Bishop, Peter Dayan, Radford Neal
and Huaiyu Zhu for comments on earlier drafts, George Lindfield for making his
implementation of the scaled conjugate gradient search routine available and Carl
Rasmussen for his C implementation which runs considerably faster than my MAT-
LAB version. This work was supported by EPSRC grant GR/J75425.
STOCHASTIC FORWARD-PERTURBATION, ERROR
SURFACE AND PROGRESSIVE LEARNING IN
NEURAL NETWORKS
Li-Qun Xu
Intelligent Systems Research Group, BT Research Laboratories
Martlesham Heath, Ipswich, IP5 7RE, UK.
We address the issue of progressive learning of neural networks, focusing on the situation in which the environment or the training set for a learner is of fixed size¹. We concentrate on the phenomenon of the change in shape of the error surface of a neural network (defined over the weight space) as a result of presenting it with a range of intermediate tasks during the course of learning, and a goal-driven mechanism is therefore proposed and analysed. The stochastic gradient smoothing algorithm (SGSA) [14, 16] is found to be effective in implementing this idea, and experiments demonstrate the usefulness of our approach to progressive learning.
1 Active Learning: Non-fixed vs Fixed Data Set
Findings in cognitive science have shown that a learning process is normally a bidirectional process involving interactive actions (information exchange) between a learner and its surrounding environment [3]. In active learning of neural networks, the learner is a neural network of specified configuration A parameterised by some connection weight vector W ∈ ℝ^N, learning to perform a particular task such as function mapping [8], pattern classification [11] and robot control [13], among many others. The environment is characterised by a training data set of fixed or non-fixed size, or some exploratory space with certain underlying structures.
Current research on active learning of neural networks has focused on the design of various well-defined information and/or generalisation criteria [6, 8, 13, 9, 4] by which a new data example can be selected from among a set of available data examples such that, when added to the previous training set and learned, it allows the trained network to achieve the maximum accuracy of fit to the data and improved generalisation performance [10]. Notably, this strategy imposes no restrictions on the size of the data set.
Other approaches to active learning, however, given that the training data set is of fixed size, follow a slightly different trend which essentially emphasises the principle of progressive learning [11]. This is usually achieved by letting the neural network learn a succession of varied subsets, each representing a different level of abstraction of the final complex task. Subsequently, the response of the learner to the changeable environment - the particular subsets engaged - is monitored, and the performance, in terms of model misspecification and network variance, thus far achieved will determine what aspects of the environment the learner is to face next time.
There are various ways whereby the whole environment can be decomposed: one can divide the entire data set according to sample size, from small to large [1], or the degree of difficulty, from low to high [5], [3], etc. This way of dividing the entire data set is by and large based on the results of empirical studies and the understanding of individual experimenters of the tasks at hand. In this paper we propose an alternative way of performing progressive learning which is free from the difficulties of artificially decomposing the environment. In the following section we introduce the view of progressive learning in terms of dealing with the change in shape of error surfaces. In section 3 we propose to use the stochastic gradient smoothing algorithm (SGSA) to implement this idea. In section 4 experiments are conducted to demonstrate the usefulness of the approach for active learning of neural networks. The paper is concluded in section 5.

¹Which is often the case in practical applications of neural networks, especially pattern recognition.
2 Progressive Learning - a Goal-driven Perspective
It is instructive to interpret the progressive way of learning an entire environment by a neural network in terms of the change in shape of its error surface over the connection weights. Given a neural network A specified by a connection weight vector W ∈ ℝ^N and a training set Θ(S) = {x_j, y_j}_{j=1}^{|S|} of |S| pairs of labelled examples describing the whole environment, an error function can then be expressed as

    f_{Θ(S)}(W) : ℝ^N → ℝ¹,                                               (1)

where the notation Θ(S) is meant to show that the shape of the error surface in the weight space W ∈ ℝ^N is entirely determined by the training set. (Figure 1 (a) shows a two-dimensional profile of such an error surface when a final convergence has been reached.) Consequently, the strategy of using a varied data subset in session k, Θ(S_k), S_k ⊂ S, for the training of A will invariably lead to a change in shape of its error surface, f_{Θ(S_k)}(W). This is what we mean by the data-driven mechanism detailed in Figure 2 (a). With Θ(S_k), S_k ⊂ S, being properly constructed, it is expected that the shape of the error surface will change from an initially coarse and more or less smooth terrain to the final one bearing all the details but with fewer regularities.
We are interested, however, in exploring the idea from the opposite perspective. In fact, we can manipulate the shape of the error surface by subjecting it to mathematical smoothing operations, generating at will a sequence of intermediate error surfaces that achieve the same effect as the previous strategy.
Figure 2 (b) illustrates this idea, which we call the goal-driven mechanism, where the sequence of error surfaces at the output end can be equally viewed as those ranging between Figure 1 (b), a smooth shallow surface with established prominent characteristics, and Figure 1 (a), a surface with all the details (and idiosyncrasies). It is from this viewpoint that we argue that the stochastic forward-perturbation algorithms [15] - the category of learning algorithms rooted in the theory of stochastic approximation of nonlinear dynamical processes - can be employed to accomplish the task of progressive learning without having to decide how to divide the training set.
3 Mathematical Analysis
As opposed to the back-propagation (BP) algorithm, the operations of updating the weight vector W by stochastic forward-perturbation algorithms generally follow an F-P-F-G pattern (Figure 3), i.e. one forward propagation of the training set through A to measure the error function as it currently stands; one perturbation of the weight vector with some controllable noise P ∈ ℝ^N; and another forward propagation to measure the response of A due to the perturbation just introduced; based on these measurements a gradient estimate can be obtained by employing an intuitive difference approximation or more subtle correlation methods. The estimated gradient can then be incorporated into stochastic approximation algorithms to update the weight vector.
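The F-P-F-G pattern admits a compact sketch; the symmetric (central-difference) form below is one common choice, not necessarily the exact estimator of [14, 16]:

```python
import numpy as np

def perturbation_gradient(f, w, beta, n_probe=10, seed=0):
    """Forward-perturbation gradient estimate of a beta-smoothed error
    surface: perturb the weights, measure the error response, and
    correlate the response difference with the perturbation."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(w)
    for _ in range(n_probe):
        delta = rng.normal(0.0, beta, size=w.shape)   # controllable noise P
        g += (f(w + delta) - f(w - delta)) / (2.0 * beta ** 2) * delta
    return g / n_probe
```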
In the following, the advanced algorithm called the stochastic gradient smoothing algorithm [16] is employed. According to Figure 2 (b), the following smoothing operation is invoked (note that the data-dependent notation Θ(S) has been dropped):

    f̃(W, β) = f(W) ⊗ [G(−W, β) + G(W, β)]
             = ∫_{ℝ^N} G(Δ, β) [f(W + Δ) + f(W − Δ)] dΔ                   (2)

where the symbol '⊗' denotes the convolution operator. Following some mathematical manipulations, a stochastic gradient estimator ξ̂_k is obtained:

    ξ̂_k ≡ ∇̂_W f̃(W_k, β_k) = (1 / (2 β_k n_k)) Σ_{i=1}^{n_k} Δ_i δ_i f(W_k)   (3)

where δ_i f(W_k) denotes the difference of the measured responses of f at W_k under the perturbation Δ_i. The estimate is then applied through the updates

    d_k = ρ_k d_{k−1} + (1 − ρ_k) ξ̂_k                                     (4)
    W_{k+1} = W_k − η_k d_k                                               (5)
where the direction vector d_k = (d_{k1}, d_{k2}, ..., d_{kN})^T ∈ ℝ^N represents a running average over the current gradient estimate ξ̂_k and the previous direction vector d_{k−1}. The other two important parameters appearing in the above formulas are η_k, the learning rate (or adaptive step size), and ρ_k, the accumulating factor. It is desirable that as the iteration k → ∞, η_k → 0 and ρ_k → 1. The parameters η_k and ρ_k are locally adaptable.
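Putting the pieces together, one iteration of an SGSA-style update might look as follows; this is a sketch based on our reconstruction of (3)-(5) above, and the schedules for eta and rho are illustrative:

```python
import numpy as np

def sgsa_iteration(f, w, d, beta, eta, rho, n_probe, rng):
    """One F-P-F-G iteration: estimate the gradient of the beta-smoothed
    error surface, then apply direction update (4) and weight update (5)."""
    g = np.zeros_like(w)
    for _ in range(n_probe):
        delta = rng.normal(0.0, beta, size=w.shape)
        g += (f(w + delta) - f(w - delta)) / (2.0 * beta ** 2) * delta
    g /= n_probe
    d = rho * d + (1.0 - rho) * g        # eq. (4): running-average direction
    w = w - eta * d                      # eq. (5): weight update
    return w, d

# Progressive learning: sweep beta through a decreasing schedule such as
# [2.6, 1.0, 0.5, 0.25, 0.1] (the values used in Figure 4), calling
# sgsa_iteration repeatedly at each level.
```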
4 Experimental Results
A simulated two-class pattern classification problem, of which the details can be found in [7], is used to demonstrate our approach to active learning of neural networks. The distributions of the two classes (C_1 and C_0) overlap, so a complete separation of their examples is impossible even for an optimal classifier. In the experiments, the training set consists of 200 examples with 100 drawn from each class, while the test set contains 1000 examples with 500 belonging to each class. The neural network used for this task has a fully-connected three-layer 2-4-1 structure, amounting to 17 weights including biases. The single output of the trained network should ideally be a "1" for examples in C_1 and a "0" for examples in C_0 when shown the unseen data in the testing phase.
Figure 4 (a) A typical learning trajectory achieved by the SGSA. The vertical dashed lines mark the boundaries where a new intermediate error surface, characterised by a different value of β_k shown along the lines, is encountered by the learner. In this case, there are 4 intermediate error surfaces (where β_k assumes a value of 2.6, 1.0, 0.5 and 0.25 in order) to be learned before the learner faces the final task (where β_k = 0.1), which approximates the error surface imposed by the complete environment, for which β_k = 0. (b) A set of 10 learning trajectories for the test problem by the SGSA with different initial weight vectors.
For the problem, 10 trials are performed, each starting with a different weight vector W_0 having random values ranging between ±0.5. Figure 4 (a) shows a typical learning trajectory, the mean squared error vs the number of iterations, achieved by the SGSA.
Table 1 summarises the average performance over ten trials of the SGSA, where E_tr denotes the mean squared error (MSE) for the training set; n_tr gives, up to the indicated iterations, the number of examples yet to be correctly learned in the training set (200 examples in total); n_te shows the generalisation performance measured by the number of examples that are still misclassified, up to the given training iterations, in the test set (1000 examples in total). An important result is that the generalisation performance given by n_te tends to saturate rather than deteriorate with more iterations, which is not the case in experiments using the deterministic Quick-prop algorithm (an advanced version of the BP algorithm) [15].
ALGO.   Iter.   700    800    900    1000   1100   1200   1300   1400   1500
        E_tr    2.658  2.551  2.477  2.410  2.367  2.313  2.290  2.261  2.229
SGSA    n_tr    11.4   9.9    9.7    9.6    8.9    8.6    8.3    8.4    8.3
        n_te    66.9   66.0   64.6   62.2   62.1   62.5   63.2   61.9   61.9

Table 1 The average performance over 10 trials of the SGSA at selected iterations when applied to the simulated two-class pattern classification problem.
Thus the SGSA compares favourably with Quick-prop in this regard. Figure 4 (b) shows the corresponding set of 10 learning trajectories obtained.
5 Summary
An alternative way has been explored for active learning of neural networks, focusing on the viewpoint of interpreting the error surface imposed by the whole training set (a certain task) in terms of a range of its intermediate versions, or a set of intermediate tasks. We analysed two different perspectives, called respectively the data-driven mechanism and the goal-driven mechanism, that give rise to such a range of error surfaces. We argued that the latter approach could well be an effective way to control the amount of information about a complex environment accessible to a learner at any one time, thereby fulfilling the same objective (of progressive learning) as the former without explicitly decomposing the environment in a hard fashion. The stochastic gradient smoothing algorithm (SGSA) was employed to implement this idea. Experiments have been conducted to support our claims. Further theoretical studies of this issue are needed.
REFERENCES
[1] C. Cachin, Neural Networks, Vol. 7(1) (1994).
[2] D.A. Cohn, Z. Ghahramani, M.I. Jordan, in: [12].
[3] J.E. Elman, Technical Report CRL-9101, UCSD (1991).
[4] A. Krogh, J. Vedelsby, in: [12].
[5] J. Ludik, I. Cloete, Proc. of ESANN (April 1994), Brussels.
[6] D.J.C. MacKay, Neural Computation, Vol. 4(4) (1992).
[7] L. Niles, H. Silverman et al., Proc. of ICASSP'89, Glasgow (1989).
[8] M. Plutowski, H. White, Technical Report CS91-180, UCSD (1991).
[9] P. Sollich, Physical Review E, Vol. 49 (1994).
[10] E.M. Strand, W.T. Jones, Proc. of IJCNN Vol. I, Baltimore (1992).
[11] N. Szilas, E. Ronco, Proc. of ECCS'95, Saint-Malo, France (1995).
[12] G. Tesauro, D.S. Touretzky, and T.K. Leen, eds., Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA (1995).
[13] S.B. Thrun, in: D.A. White and D.A. Sofge, eds., Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, Van Nostrand Reinhold, Kentucky (1993).
[14] L.Q. Xu, T.J. Hall, Proc. of ICANN'94, Naples, Italy (1994).
[15] L.Q. Xu, Proc. of EANN'95, Helsinki, Finland (1995).
[16] L.Q. Xu, T.J. Hall, submitted to Neural Networks (May 1995).
Acknowledgements
This research was conducted at the University of Abertay, Dundee, and was funded
in part by a grant from the Nuffield Foundation.
DYNAMICAL STABILITY OF A HIGH-DIMENSIONAL
SELF-ORGANIZING MAP
Howard Hua Yang
Lab for Information Representation, FRP, RIKEN, 2-1 Hirosawa,
Wako-Shi, Saitama 351-01, Japan. Email: [email protected]
The convergence and the convergence rate of one self-organization algorithm for the
high-dimensional self-organizing map in the linearized Amari model are studied in the context
of stochastic approximation. The conditions on the neighborhood function and the learning rate
are given to guarantee the convergence. It is shown that the convergence of the algorithm can be
accelerated by averaging.
Keywords: high-dimensional self-organizing map, neighborhood function, convergence, stochastic
approximation, acceleration by averaging.
1 Introduction
Self-organizing neural networks arose in the modeling of self-organized phenomena in nervous systems. The typical models are those given by Willshaw and von der Malsburg (1976) [14], Grossberg (1976) [8], Amari (1980) [1], and Kohonen (1982) [9]. These self-organization systems can alter their internal connections to represent the statistical features and topological relations of an input space.
The self-organizing maps (SOMs) are expressed by the weights of the self-organizing networks. They are very effective in modeling the topographic mapping between neural fields. Although SOMs are widely used in neural modeling and applications, the theory of SOMs is far from complete. The question of whether self-organizing maps converge to topologically correct mappings is still open, especially in the high-dimensional case.
There are many models for SOMs. Due to space limits we only consider Amari's nerve field model [1, 2]. Another important SOM model is the feature map [9]. The stability of the feature map is discussed in [3, 6, 7, 10, 12].
The stability of the SOM in the nerve field model is analyzed in [1, 5, 13, 15]. The existing convergence results are mostly for the one-dimensional model. It is very difficult to analyze the stability of the SOM in high dimensions. The results in [5, 15] are only valid for some special cases of the linearized Amari model. We shall formulate the linearized Amari model in arbitrary dimension and discuss not only the convergence but also the convergence rate of the self-organization algorithm for updating the weights. The approach used can also be applied to analyze the stability of the feature map.
2 High-dimensional Self-organizing Maps
Let the K-dimensional grid C_K = {0, 1, ..., N}^K of (N + 1)^K neurons be the presynaptic field in the Amari model. A neuron in C_K is denoted by J = (i_1, ..., i_K) ∈ C_K. Let W_K be the space of all mappings W = W(J) : C_K → Θ_n ⊆ ℝ^n where each W(J) is a column vector in ℝ^n. The weight vectors in W_K are labeled by the neurons in C_K.
2.1 High Dimensional Topographic Map
Let us consider a system consisting of a presynaptic field C_K and a postsynaptic field Θ_n. The neighborhood relation between each pair of neurons in C_K is determined by their relative positions in C_K. A high dimensional topographic map is an ordered mapping in W_K under which the neighborhood relation in C_K is preserved in Θ_n.
To achieve a topographic map, we choose a random mapping in W_K, then use the following algorithm to update the mapping recursively:

    W_{t+1}(J) =
      (1 − λ_t) W_t(J) + (λ_t / c_t) Σ_{J' ∈ N_c(t)} W_t(J'),   J ∈ N_c(t) ∩ C_K⁰
      W_t(J),                                                   J ∈ C_K⁰ − N_c(t)     (1)
      B(J) (boundary condition),                                J ∈ ∂C_K

where λ_t is a learning rate, C_K⁰ = {1, ..., N − 1}^K is the inner lattice of C_K, ∂C_K = C_K − C_K⁰, N_c(t) = σ_t(I_t) is a neighborhood set, I_t = (i_1(t), ..., i_K(t)) is a random stimulus process on C_K, σ_t(I) is a neighborhood set around I such that σ_t(I) ⊆ C_K for each I ∈ C_K and t, and c_t is the number of points in the set N_c(t).
Equation (1) is a linearized version of the learning equation in [1]. It does not update the mapping on the boundary of C_K. But the boundary condition will affect the map on C_K⁰ (the inner lattice) whenever the neighborhood set touches the boundary of the grid C_K, and eventually shape the topographic map.
Let h_t(I, J) = 1_{(J ∈ σ_t(I) ∩ C_K⁰)} be the characteristic function of the set σ_t(I) ∩ C_K⁰, in terms of which equation (1) can be rewritten in a standard stochastic approximation form.
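One recursion of (1) is straightforward to implement; the sketch below uses our own data structures (weights held as a dictionary over grid points, which may be NumPy vectors):

```python
def som_step(W, inner, nbhd, B, lam):
    """One recursion of eq. (1).

    W     : dict mapping grid point J to its weight vector W_t(J)
    inner : set of inner-lattice points C_K^0
    nbhd  : the neighborhood set N_c(t) = sigma_t(I_t) for this stimulus
    B     : dict of fixed boundary values B(J)
    lam   : learning rate lambda_t
    """
    nbhd_mean = sum(W[Jp] for Jp in nbhd) / len(nbhd)
    W_next = {}
    for J in W:
        if J in inner and J in nbhd:
            W_next[J] = (1.0 - lam) * W[J] + lam * nbhd_mean
        elif J in inner:
            W_next[J] = W[J]            # outside the neighborhood: unchanged
        else:
            W_next[J] = B[J]            # boundary condition
    return W_next
```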
Theorem 1 Under the assumptions A1 and A2, if Q > 0 (positive definite) and the learning rate λ_t > 0 satisfies the conditions

    Σ_t λ_t = ∞,   Σ_t λ_t² < ∞,                                          (6)

then for i = 1, ..., n, w_t^i → m^i a.s. (almost sure convergence).
Stronger conditions (7) and (8) on the learning rate, involving the eigenvalues {μ_i(Q)} of Q, give a convergence rate. Note the learning rate λ_t ∝ t^{−α} with 0 < α < 1 satisfies condition (8).
The next theorem gives the convergence rate of the averaged weight w̄_t. It also shows that w̄_t converges to M in mean square.
Theorem 2 Let Q > 0, let the learning rate satisfy (7) or (8), and let the following conditions hold:
A3 γ > 1/2, Q_t − Q = O(t^{−γ}), and b_t^i − b^i = O(t^{−γ});
A4 for each J, sup_t Σ_k |h_t(J_k, J) − h̄_t(J)|^β P(J_k) < ∞, where β > 2 and h̄_t(J) = Σ_k h_t(J_k, J) P(J_k);
A5 for each J and J', sup_t Σ_k |h_t(J_k, J) h_t(J_k, J') − h̄_t(J, J')|^β P(J_k) < ∞, where h̄_t(J, J') = Σ_k h_t(J_k, J) h_t(J_k, J') P(J_k) < ∞;
A6 lim_{t→∞} E[ε_t^i (ε_t^i)^T] = S^i > 0.
Denote V = Q^{−1} S^i Q^{−1}. Then for w̄_t^i = (1/t) Σ_{s=0}^{t−1} w_s^i, we have

    √t (w̄_t^i − m^i) →_D N(0, V),
    lim_{t→∞} E[ t (w̄_t^i − m^i)(w̄_t^i − m^i)^T ] = V.                   (9)
Proof. From A4 and A5, we have sup_t E[|Q̂_t − Q_t|^β] < ∞ and sup_t E[|b̂_t^i − b_t^i|^β] < ∞. Therefore,

    sup_t E[|ε_t^i|^β] < ∞.                                               (10)

Because w_t^i is bounded, we have sup_t E[|ε_t^i|² | F_{t−1}] < ∞. From (10), we have

    lim_{C→∞} lim_{t→∞} E[ |ε_t^i|² 1_{(|ε_t^i| > C)} | F_{t−1} ] = 0, a.s.

So Assumptions 2.1-2.5 in [11] are satisfied. However, we cannot apply the results there directly because of the term η_t^i in the system (5). Using the same approach as in [11], we can decompose √t (w̄_t^i − m^i) into a sum of Q^{−1}-weighted terms in (ε_s^i + η_s^i) (11). Since

    (1/t) Σ_s η_s^i (η_s^i)^T → 0   and   (1/√t) Σ_s Q^{−1} η_s^i → 0   as t → ∞,

we can apply the proofs in [11] to (11) and conclude that √t (w̄_t^i − m^i) is asymptotically normal with zero mean and covariance matrix V, and that (9) holds. □
Theorem 2 allows us to use a small constant learning rate, or even the learning rate λ_t ∝ t^{−α} with 0 < α < 1, which tends to zero more slowly than 1/t. The assumption β > 2 in A4 and A5 can be relaxed to β = 2 if we only want to obtain (9).
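The acceleration by averaging that Theorem 2 justifies is simple to implement: run the algorithm with a slowly decaying learning rate and keep a running average of the iterates. A sketch, with an illustrative choice of α:

```python
import numpy as np

def averaged_training(update, w0, n_steps, alpha=0.7):
    """Run a stochastic approximation update with learning rate
    lambda_t = t**(-alpha), alpha in (0, 1), and return the average of
    the iterates (the quantity whose convergence rate Theorem 2 gives)."""
    w = np.asarray(w0, dtype=float)
    w_bar = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        w = update(w, t ** (-alpha))     # one step of the SOM algorithm
        w_bar += (w - w_bar) / t         # running average of the first t iterates
    return w_bar
```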
Note that we have shown the convergence of the SOM, but we have not shown that the stationary states are topographic mappings which preserve the topology. Based on the simulation results in [5, 13] we believe that the SOM converges to a topographic map, or to a micro-structure consisting of several topographic sub-maps.
3 Conclusion
The convergence of the learning algorithm for updating the SOM is studied. The
conditions on the neighborhood function and the learning rate have been found to
guarantee the convergence. The learning algorithm can be accelerated by averaging
the weight vectors in the training history. For the learning algorithm with averaging,
the convergence rate has been found.
REFERENCES
[1] S.-I. Amari, Topographic organization of nerve fields, Bulletin of Mathematical Biology, Vol. 42 (1980), pp339-364.
[2] S.-I. Amari, Field theory of self-organizing neural nets, IEEE Trans. on Systems, Man and Cybernetics, Vol. 13(5) (September/October 1983), pp741-748.
[3] M. Budinich and J. G. Taylor, On the ordering conditions for self-organising maps, Neural Computation, Vol. 7(2) (March 1995).
[4] Han-Fu Chen, Recursive Estimation and Control for Stochastic Systems, John Wiley & Sons, Inc. (1985).
[5] M. Cottrell and J. C. Fort, A stochastic model of retinotopy: A self organizing process, Biological Cybernetics, Vol. 53 (1986), pp405-411.
[6] M. Cottrell, J. C. Fort, and G. Pages, Two or three things that we know about the Kohonen algorithm, in: Proc. ESANN'94 Conf., De Facto Ed., Brussels (April 1994).
[7] E. Erwin, K. Obermayer, and K. Schulten, Self-organizing maps: Ordering, convergence properties and energy functions, Biological Cybernetics, Vol. 67 (1992), pp47-55.
[8] S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors, Biological Cybernetics, Vol. 22 (1976), pp121-134.
[9] T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, Vol. 43 (1982), pp59-69.
[10] Z.-P. Lo and B. Bavarian, On the rate of convergence in topology preserving neural networks, Biological Cybernetics, Vol. 65 (1991), pp55-63.
[11] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization, Vol. 30(4) (July 1992), pp838-855.
[12] H. Ritter and K. Schulten, Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability, and dimension selection, Biological Cybernetics, Vol. 60 (1988), pp59-71.
[13] A. Takeuchi and S.-I. Amari, Formation of topographic maps and columnar microstructures, Biological Cybernetics, Vol. 35 (1979), pp63-72.
[14] D. J. Willshaw and C. von der Malsburg, How patterned neural connections can be set up by self-organization, Proceedings of the Royal Society of London, Vol. B 194 (1976), pp431-445.
[15] Hua Yang and T. S. Dillon, Convergence of self-organizing neural algorithms, Neural Networks, Vol. 5 (1992), pp485-493.
MEASUREMENTS OF GENERALISATION BASED ON
INFORMATION GEOMETRY
Huaiyu Zhu* and Richard Rohwer**
Neural Computing Research Group, Department of Computer Science and
Applied Mathematics, Aston University, Birmingham B4 7ET, UK
* Current address: Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA. Email: [email protected]
** Current address: Prediction Co., 320 Aztec Street, Suite B, Santa Fe, NM 87501, USA.
Neural networks are statistical models and learning rules are estimators. In this paper a theory
for measuring generalisation is developed by combining Bayesian decision theory with information
geometry. The performance of an estimator is measured by the information divergence between
the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the
majority of error measures currently in use. The optimal estimators also reveal some intricate
interrelationships among information geometry, Banach spaces and sufficient statistics.
1 Introduction
A neural network (deterministic or stochastic) can be regarded as a parameterised statistical model P(y|x, w), where x ∈ X is the input, y ∈ Y is the output and w ∈ W is the weight. In an environment with an input distribution P(x), it is also equivalent to P(z|w), where z := [x, y] ∈ Z := X × Y denotes the combined input and output as data [11]. Learning is the task of inferring w from z. It is a typical statistical inference problem in which a neural network model acts as a "likelihood function", a learning rule as an "estimator", the trained network as an "estimate" and the data set as a "sample". The set of probability measures on the sample space Z forms a (possibly infinite dimensional) differential manifold P [2, 16]. A statistical model forms a finite-dimensional submanifold Q, composed of representable distributions, parameterised by weights w acting as coordinates.
To infer w from z requires additional information about w. In a Bayesian framework such auxiliary information is represented by a prior P(p), where p is the true but unknown distribution from which z is drawn. This is then combined with the likelihood function P(z|p) to yield the posterior distribution P(p|z) via the Bayes formula P(p|z) = P(z|p)P(p)/P(z).
An estimator T : Z → Q must, for each z, fix one q ∈ Q which in a sense approximates p.¹ This requires a measure of "divergence" D(p, q) between p, q ∈ P defined independently of parameterisation. General studies of divergences between probability distributions are provided by the theory of information geometry (see [2, 3, 7] and further references therein). The main thesis of this paper is that generalisation error should be measured by the posterior expectation of the information divergence between the true distribution and the estimate. We shall show that this retains most of the mathematical simplicity of mean squared error theory while being generally applicable to any statistical inference problem.

¹Some Bayesian methods give the entire posterior P(p|z) instead of a point estimate q as the answer. They will be shown later to be a special case of the current framework.
2 Measurements of Generalisation
The most natural "information divergence" between two distributions p, q ∈ P is the δ-divergence defined as [2]²

    D_δ(p, q) = (1 / (δ(1 − δ))) ∫ ( δ p + (1 − δ) q − p^δ q^{1−δ} ).
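For discrete distributions the δ-divergence is a one-liner; the sketch below (our naming) also records its limiting Kullback-Leibler behaviour at the endpoints:

```python
import numpy as np

def delta_divergence(p, q, delta):
    """delta-divergence between discrete distributions, 0 < delta < 1.
    Tends to KL(q, p) as delta -> 0 and to KL(p, q) as delta -> 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(delta * p + (1.0 - delta) * q
                  - p**delta * q**(1.0 - delta)) / (delta * (1.0 - delta))
```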
Armed with the δ-divergence, we now define the generalisation error of a learning rule r as the posterior expectation

    E_δ(r) = ⟨ D_δ(p, q) ⟩_{P(p,z)},    E_δ(q|z) = ⟨ D_δ(p, q) ⟩_{P(p|z)},

where p is the true distribution, r is the learning rule, z is the data, and q = r(z) is the estimate. A learning rule r is called δ-optimal if it minimises E_δ(r). A probability distribution q is called a δ-optimal estimate, or simply a δ-estimate, from data z, if it minimises E_δ(q|z). The following theorem is a special case of a standard result from Bayesian decision theory.
Theorem 1 (Coherence) A learning rule r is δ-optimal if and only if for any data z, excluding a set of zero probability, the result of training q = r(z) is a δ-estimate.
Definition 2 (δ-coordinate) Let μ := 1/δ, ν := 1/(1 − δ). Let L_μ be the Banach space of μth-power integrable functions. Then L_μ and L_ν are dual to each other as Banach spaces. Let p ∈ P. Its δ-coordinate is defined as l_δ(p) := p^δ/δ ∈ L_μ for δ > 0, and l_0(p) := log p [2]. Denote by l_δ^{-1} the inverse of l_δ.
The entities |α| for the multinomial model and n for the Gaussian model are effective previous sample sizes, a fact known since Fisher's time. In a restricted model, the sample size might not be well reflected, and some ancillary statistics may be used for information recovery [2].
Example 3 In some Bayesian methods, such as the Monte Carlo method [10], no estimator is explicitly given. Instead, the posterior is directly used for sampling p. This produces a prediction distribution on test data which is the posterior marginal distribution. Therefore these methods are implicitly 1-estimators.
Example 4 Multilayer neural networks are usually not δ-convex for any δ, and there may exist local optima of E_δ(·|z) on Q. A practical learning rule is usually a gradient descent rule which moves w in the direction which reduces E_δ(q|z). The 1-divergence can be minimised by a supervised learning rule, the Boltzmann machine learning rule [1]. The 0-divergence can be minimised by a reinforcement learning rule, the simulated annealing reinforcement learning rule for stochastic networks [13]:

    min_q K(p, q)  ⟺  Δw ∝ ⟨∂_w l_0(q)⟩_p − ⟨∂_w l_0(q)⟩_q               (11)
    min_q K(q, p)  ⟺  Δw ∝ ⟨∂_w l_0(q), l_0(p) − Λ_0(q)⟩_q               (12)
5 Conclusions
The problem of finding a measurement of generalisation is solved in the framework of Bayesian decision theory, with machinery developed in the theory of information geometry.
Working in the Bayesian framework ensures that the measurement is internally coherent, in the sense that a learning rule is optimal if and only if it produces optimal estimates for almost all the data. Adopting an information geometric measurement of divergence between distributions ensures that the theory is independent of parameterisation. This resolves the controversy in [8, 12, 9].
To guarantee a unique and well-defined solution to the learning problem, it is necessary to generalise the concept of information divergence to the space of finite positive measures. This development reveals certain elegant relations between information geometry and the theory of Banach spaces, showing that the dually-affine geometries of statistical manifolds are in fact intricately related to the dual linear geometries of Banach spaces.
In a computational model, such as a classical statistical model or a neural network, the optimal estimator is the projection of the ideal estimator onto the model. This theory generalises the theory of linear Gaussian regression to general statistical estimation and function approximation problems. Further research may lead to Kalman filter type learning rules which are not restricted to linear and Gaussian models.
REFERENCES
[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cog. Sci., Vol. 9 (1985), pp147-169.
[2] S. Amari, Differential-Geometrical Methods in Statistics, Vol. 28 of Springer Lecture Notes in Statistics, Springer-Verlag, New York (1985).
[3] S. Amari, Differential geometrical theory of statistics, in: Amari et al. [4], Ch. 2, pp19-94.
[4] S. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen, and C. R. Rao, eds., Differential Geometry in Statistical Inference, Vol. 10 of IMS Lecture Notes Monograph, IMS, Hayward, CA (1987).
[5] S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., Advances in Neural Information Processing Systems, Vol. 5 (1993), Morgan Kaufmann, San Mateo, CA.
[6] R. E. Kass, Canonical parameterization and zero parameter effects curvature, J. Roy. Stat. Soc. B, Vol. 46 (1984), pp86-92.
[7] S. L. Lauritzen, Statistical manifolds, in: Amari et al. [4], Ch. 4, pp163-216.
[8] D. J. C. MacKay, Bayesian Methods for Adaptive Models, PhD thesis, California Institute of Technology, Pasadena, CA (1992).
[9] D. J. C. MacKay, Hyperparameters: Optimise, or integrate out?, Technical report, Cambridge (1993).
[10] R. M. Neal, Bayesian learning via stochastic dynamics, in: Hanson et al. [5], pp475-482.
[11] H. White, Learning in artificial neural networks: A statistical perspective, Neural Computation, Vol. 1(4) (1989), pp425-464.
[12] D. H. Wolpert, On the use of evidence in neural networks, in: Hanson et al. [5], pp539-546.
[13] H. Zhu, Neural Networks and Adaptive Computers: Theory and Methods of Stochastic Adaptive Computations, PhD thesis, Dept. of Stat. & Comp. Math., Liverpool University (1993), ftp://archive.cis.ohio-state.edu/pub/neuroprose/Thesis/zhu.thesis.ps.Z.
[14] H. Zhu and R. Rohwer, Bayesian invariant measurements of generalisation for continuous distributions, Technical Report NCRG/4352, Dept. Comp. Sci. & Appl. Math., Aston University (August 1995), ftp://cs.aston.ac.uk/neural/zhuh/continuous.ps.Z.
[15] H. Zhu and R. Rohwer, Bayesian invariant measurements of generalisation for discrete distributions, Technical Report NCRG/4351, Dept. Comp. Sci. & Appl. Math., Aston University (August 1995), ftp://cs.aston.ac.uk/neural/zhuh/discrete.ps.Z.
[16] H. Zhu and R. Rohwer, Information geometric measurements of generalisation, Technical Report NCRG/4350, Dept. Comp. Sci. & Appl. Math., Aston University (August 1995), ftp://cs.aston.ac.uk/neural/zhuh/generalisation.ps.Z.
Acknowledgements
We are grateful to Prof. S. Amari for clarifying a point of information geometry. We would like to thank many people in the Neural Computing Research Group, especially C. Williams, for useful comments and practical help. This work was partially supported by EPSRC grant GR/J17814.
TOWARDS AN ALGEBRAIC THEORY OF NEURAL
NETWORKS: SEQUENTIAL COMPOSITION
R. Zimmer
Computer Science Department, Brunel University,
Uxbridge, Middx. UB8 3PH, UK. Email: [email protected]
This paper marks a step towards an algebraic theory of neural networks. In it, a new notion of
sequential composition for neural networks is defined, and some observations are made on the
algebraic structure of the class of networks under this operation. The composition is shown to
reflect a notion of composition of state spaces of the networks. The paper ends on a very brief
exposition of the usefulness of similar composition theories for other models of computation.
1 What is an Algebraic Study?
The view of mathematics that underlies this work is one forged by Category Theory (see for example [1]). In Category Theory scrutiny is focused less on objects than on the mappings between them. Thus the natural question is: "What are the mappings?" In this kind of study the answer tends to be: "Functions (or sometimes partial functions) that preserve the relevant operations." So before we can start looking at the category of neural networks we need to answer the question: "What are the relevant operations?"
For example, recall that a group is a set, G, with a binary operation (*), a constant (e), and a unary operation (·)^{-1}. To be a group, it is also necessary that these operations satisfy some equations to the effect that * is associative, e is a two-sided identity for * and (·)^{-1} forms inverses, but these are the defining group operations. A group homomorphism is a function, φ, from one group to another satisfying

    φ(a * b) = φ(a) * φ(b),   φ(e) = e′   and   φ(a^{-1}) = (φ(a))^{-1}.

That is, a group homomorphism is defined to be a function that preserves all of the defining operations of a group.
2 What are the Defining Operations of Neural Nets?
A synchronous neural net is a set of nodes, each of which computes a function; the input for the function at a node at a given time is the output of some nodes at the previous time step (the nodes that supply the input to a given node are fixed for all time and determined by the interconnection structure of the net). This structure is captured diagrammatically by drawing a circle for each node and an arrow from node N to node M if M acts as an input for N. For example, the following is a picture of a two-noded net, each of whose nodes takes an input from node f, and whose node g takes another input from itself:
Figure 1
The letters indicate that the node on the left computes the function f(n) and the node on the right computes the function g(n, m). So a neural net is a finite directed graph
each of whose nodes is labelled by a function. We will call the underlying unlabelled
graph the shape of the net.
As well as the shape (the interconnection structure), there is a recursive computa-
tional structure generated by the network. In this example we get the two recursive
functions:
    f_{n+1} = f(f_n),
    g_{n+1} = g(f_n, g_n),

where, for example, f_i denotes the state of the node labelled by f at time i. Once f and g are known, these equations entirely determine the dynamics of the system: that is, the sequence of values of the pair ⟨f_i, g_i⟩.
3 The State Space of a Boolean Net
In this section we restrict ourselves to nets whose nodes compute boolean functions. A state of such a net is a vector of 0's and 1's, representing the states of the nodes of the net. When we say that a net is in state vect, we mean that for all i, the ith node of the net has most recently computed the ith value of vect. Since the next-state behaviour of a net is completely determined by the recursion equations, we can completely compute a graph of the possible states of the net. In the example above, if f is invert and g is and, then the state graph is given below:
Figure 2
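The state graph is mechanical to compute: enumerate all boolean state vectors and apply the recursion equations once. A sketch for the example net (f = invert, g = and), with our own encoding of states as tuples:

```python
from itertools import product

def state_graph(step, n_nodes):
    """Map every boolean state of the net to its successor state."""
    return {s: step(s) for s in product((0, 1), repeat=n_nodes)}

# The example net: the left node computes invert, the right node and.
def example_step(state):
    f, g = state
    return (1 - f, f & g)     # f_{n+1} = ~f_n,  g_{n+1} = f_n AND g_n

print(state_graph(example_step, 2))
# {(0, 0): (1, 0), (0, 1): (1, 0), (1, 0): (0, 0), (1, 1): (0, 1)}
```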
This exemplifies a mapping from boolean neural nets to state spaces and points to an algebraic relationship whose desirability can guide us in setting out our algebraic theory: we should ensure that the space of boolean neural networks is fibred over the space of boolean state spaces. For a detailed discussion of fibred categories see [2]; for now it will suffice to give an informal definition.
Recall that a category is simply a collection of objects and maps between them. In essence, what it means for a category, E, to be fibred over a category, C, is that there is a projection from E to C (the part of E that gets projected to a particular C-object C constitutes the fibre over C) such that: for every C-mapping f : C → C′ and E-object, E, in the fibre over C, there is a unique lifting of f to a map starting at E and projecting onto f. The idea is that what happens in E gets reflected down to C, and, if we are lucky, what happens in C is liftable to E. This can hold for maps and operations. We want to ensure that the category of neural networks is in this relationship to the category of state spaces, and that the state space composition is liftable.
4 Composition in the State Space
We wish to define a notion of composition of nets that is a lifting of state space composition. The sequential composition of two graphs with the same nodes is quite well-known: an arc in the composite graph is an arc in the first graph followed by an arc in the second. Two nets with the same nodes automatically yield state graphs with the same nodes. Our first lifting of state-space composition will take advantage of this fact and only be defined for networks on the same (perhaps after renaming) nodes. We shall see that this lifting is all we need to explain the closure properties of state spaces under composition.
To continue with our example, consider the net given by the equations

    f_{n+1} = f_n * g_n
    g_{n+1} = ~f_n.

Notice that the shape of this net differs from the other. The shape is shown by:
Figure 3
The state space and the result of composing it with the graph above are pictured
below:
Figure 4
The observation that initiated this work is that the composite state space (and all others that can be similarly derived) is the state space of a neural network. That is, if SS denotes the function that sends a neural net to its state space, then for any two boolean nets, N and M, the composition SS(N) * SS(M) turns out to be SS(something). We will now describe an operation on nets such that SS(N) * SS(M) = SS(N * M).
5 Simple Composition of Nets
The first form of neural network composition is direct functional composition on the recursion equations. The concept is most readily understood in terms of an example. Thus, to continue the example above:

    First net:      f_{n+1} = ~f_n   and   g_{n+1} = f_n * g_n
    Second net:     f_{n+1} = f_n * g_n   and   g_{n+1} = ~f_n

The composition consists of the second net using the results of the first net:

    Composite net:  f_{n+2} = f_{n+1} * g_{n+1} = ~f_n * f_n * g_n = 0
                    g_{n+2} = ~f_{n+1} = ~~f_n = f_n

Note that the composite net works with n + 2 as one step, since its one step is a sequence of one step in each of the two nets. With that rewriting in mind (n + 1 for n + 2), the equations for the composite net yield the composite state space above. And it is not hard to see that this will hold for any two boolean nets with the same set of nodes.
The set of nets with the same nodes is a monoid under this operation: the composition is associative and has an identity, as the sketch below illustrates.
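Simple composition is just function composition of next-state maps. A sketch using the running example (our encoding; states are boolean tuples):

```python
def net1(state):                  # first net:  f' = ~f,    g' = f * g
    f, g = state
    return (1 - f, f & g)

def net2(state):                  # second net: f' = f * g, g' = ~f
    f, g = state
    return (f & g, 1 - f)

def compose(n1, n2):
    """Sequential composition: one composite step is a step of n1
    followed by a step of n2."""
    return lambda state: n2(n1(state))

net12 = compose(net1, net2)
print({s: net12(s) for s in [(0, 0), (0, 1), (1, 0), (1, 1)]})
# {(0,0): (0,0), (0,1): (0,0), (1,0): (0,1), (1,1): (0,1)}
# i.e. f_{n+2} = 0 and g_{n+2} = f_n, matching the composite net above.
```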
6 Two More Complex Notions of Sequential Composition
More interesting notions of composition allow us to compose nets with different nodes. The first of these compositions requires a function φ from the nodes of the second net to the nodes of the first. The composition is then very like the simple one, except that instead of simply using the same names to say, for example,

    f_{n+2} = f_{n+1} * g_{n+1} = ~f_n * f_n * g_n

there is an extra dimension: we take, say,

    f_{n+2} = f_{n+1} * g_{n+1}

in the second net and translate it to

    φ(f)_{n+1} * φ(g)_{n+1}

in the first net.
This is obviously more general (the simple version is just the special case in which φ is the identity) and we believe it will lead to rich composition theories.
An even more general composition comes from taking the union of these compositions over all functions φ. This composition is related to the concept of wreath product for groups and automata (see [3]).
7 What use is a Composition Theory?
A proper composition theory can be used to explain how complex things can and do get built from simple ones. The theory can be used to break things down as well. For example, there could well be a set of primitive objects and a theorem that all objects can be built from them. A famous instance of such a theorem was given by Krohn and Rhodes:
Theorem [Krohn-Rhodes 4] Given a finite monoid, M, M divides a wreath product of simple groups and switches each of which divides M.
And an instance that we have been using is given by:
Theorem [5] A finite concrete category, C, divides a wreath product of primitive minimal categories each of which divides C.
This paper marks only the beginning of a composition theory of neural nets. The Krohn-Rhodes theorem has fostered a whole branch of algebra that has shed much light on automata and semigroups, while the other theorem has found application in various fields of computing. We would like our neural net composition to be useful in both of these fashions. The mathematics will consist of a study of the category of networks as a fibred category, and some projects that naturally arise are: uncover a set of primitives, study the ordering determined by the decomposition, devise a homomorphism theorem, and extend the work to other operations, such as parallel composition. And to give some idea of the kinds of application we envisage, here are two applications of the categorical composition theory:
Designing Chips: We take an algebraic description of a chip design and use the decomposition to implement it using pre-defined primitive modules (see [6]).
Problem Solving: We turn difficult problems into sequences of problems that are more easily solved (see [7]).
We hope to find applications that are akin to these. The problem solving work automatically builds hierarchies of search spaces. Could we form similar neural network hierarchies? Moreover, if one wanted to design a net with a particular behaviour, this might be a modular way to do it. And to finish with an interesting, entirely open question: how can we apply a theory like this to nets as they are trained?
REFERENCES
[1] Saunders MacLane, Categories for the Working Mathematician, Graduate Texts in Mathematics 5, Springer-Verlag, Berlin (1971).
[2] John Gray, Fibred and Cofibred Categories, in: Proceedings of the Conference on Categorical Algebra, S. Eilenberg et al., eds., Springer-Verlag, Heidelberg (1965).
[3] Samuel Eilenberg, Automata, Languages and Machines, Vol. B (1976), Academic Press, New York.
[4] K. Krohn and J. Rhodes, Algebraic Theory of Machines, I. Prime Decomposition Theorem for Finite Semigroups and Machines, Trans. Amer. Math. Soc., Vol. 116 (1965), pp450-464.
[5] Robert Zimmer, Decomposing Categories, to appear in Categorical Structures in Computer Science.
[6] R. Zimmer, A. MacDonald and R. Holte, Automated Representation Changing for Problem Solving and Electronic CAD, in: Applications of Artificial Intelligence 1993: Knowledge-Based Systems in Aerospace and Industry, Usama M. Fayyad, Ramasamy Uthurusamy, eds., Proc. SPIE (1993), pp126-137.
[7] R. Holte, C. Drummond, R. Zimmer, Alan MacDonald, Speeding Up Problem-Solving by Abstraction: A Graph-Oriented View, to appear (1996).