RAM-based Neural Networks
RAM-Based
Neural Networks
Editor
James Austin
PROGRESS IN NEURAL PROCESSING
Series Advisors
Alan Murray (University of Edinburgh)
Lionel Tarassenko (University of Oxford)
Andreas S. Weigend (Leonard N. Stern School of Business, New York University)
RAM-Based
Neural Networks
Editor
James Austin
University of York
World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not
required from the publisher.
ISBN 981-02-3253-5
Printed in Singapore.
Preface
This book aims to show the state of the art in RAM based networks and to introduce the reader to this mature and developing area of artificial neural networks. It is based on presentations given at the Weightless Neural Network Workshop, Canterbury, in 1995. The authors of the chapters represent the majority of active researchers in RAM based networks throughout the world. While some of the papers contain in-depth research results which may only be of interest to specialists undertaking research in RAM based networks, many chapters contain introductory material that describes the RAM based methods at a level suitable for workers with a basic knowledge of neural networks.
The area that the book covers is popularly known by at least three names, RAM based networks, weightless networks and N tuple networks, which can be used interchangeably. Other terms that can relate to some forms of the networks in this book are Sigma-Pi networks and Binary networks. The reason for choosing RAM based networks as the title is that all the networks have some form of Random Access Memory (RAM) structure in their architecture, where a RAM is the universal storage system used in current computer systems and is the main implementation model for the networks. It also reflects the major strength of RAM based systems, which is their simple implementation in dedicated and high speed hardware. The term weightless networks reflects the fact that RAM based networks can be expressed purely as a logic expression with no 'weights' in the conventional neural network sense. In this case learning is seen as determining the logic function required to map the inputs to the outputs of the network under the conditions laid down in the training set. The term N tuple networks comes from the origins of the methods given by Bledsoe and Browning and their approach to selecting inputs for the network.
RAM based networks have existed for some time. Originally termed N tuple methods, they were developed by Bledsoe and Browning in 1959 and studied for some years before losing favour in the late 1960s. They were then picked up by Igor Aleksander in the UK in the early 1970s, where their simple implementation in Random Access Memories was identified and explored. His work led to a number of new concepts which some of his students explored and developed. In the 1980s neural networks emerged from the dark ages of the subject and the general interest in all forms of neural networks took off, including research in RAM based systems. The area has mostly been studied in the UK, with one or two notable exceptions, and is now an accepted and mature subject with a number of successful commercial applications.
In terms of theory, N tuple methods have had a hard time, mainly due to the difficulty in dealing with the combinatorial explosion found in most mathematical approaches to the analysis of the methods. This, along with their unique construction, has put off many researchers from studying them. However, this has not stopped others taking a more pragmatic approach to the area, showing how useful problems can be solved and high performance systems can be implemented. In addition, the book contains many useful pointers to the formal analysis of RAM based networks.
The book is structured into three sections, each introduced by an overview of the papers in the section. Section 1 contains chapters which introduce and/or compare methods known as RAM based. Section 2 presents work that extends the basic RAM based understanding and methods. Finally, Section 3 contains chapters that are mainly concerned with applications of RAM based methods. However, many of these also extend the basic techniques and show, for example, how the methods may be implemented.
I would like to thank a number of people who helped in the production of this book. In particular David Bisset and the original panel who reviewed the papers for the Weightless Neural Network Conference, where many of the chapters saw their first drafts. Thanks go to Aaron Turner, who checked all the LaTeX of the chapters, my secretary, Christine Linfoot, for chasing all the authors, and finally the members of the Advanced Computer Architecture Group for reading and helping with comments on the chapters.
Jim Austin,
York, May 1997
This section introduces the reader to some of the major RAM based methods. The first paper presents an overview of the RAM based methods up to 1994. It covers the most well known methods in RAM based networks. This paper is followed by an overview of the MAGNUS system, introduced by Igor Aleksander, who has been one of the major influences in the development of RAM based systems throughout the last 30 years. The paper shows how RAM based systems in the form of WISARD are related to MAGNUS, a system designed to explore the possibility of systems that react in an intelligent way to sensory data. The paper by De Carvalho, Fairhurst and Bisset describes a form of RAM learning called GSN which allows training of multi-layered RAM based systems, and aims to compare the various forms of the GSN methods. The final paper introduces the AURA RAM based network, which extends the RAM based method for use in rule based systems, an unusual application for neural networks but one which exploits the speed and flexibility of the RAM based method for this task.
RAM BASED NEURAL NETWORKS, A SHORT HISTORY
J. AUSTIN
Advanced Computer Architecture Group,
Department of Computer Science,
University of York, York YO1 5DD, UK
This chapter describes the interrelationship between the different types of RAM based
neural networks. From their origins in the N tuple networks of Bledsoe and Browning
it describes how each network architecture differs and the basic function of each. It discusses the MRD, ADAM, AURA, PLN, pRAM, GSN and TIN architectures. As such,
the chapter introduces many of the networks discussed later in the book.
The typical model of a neuron used in a large number of neural networks is based on the McCulloch and Pitts neuron, which can be described by equations (1) and (2). These specify a linear weighted sum of the inputs, followed by a non-linear activation function.
u = \sum_{j=1}^{n} w_j x_j \qquad (1)

y = \frac{1}{1 + e^{-au}} \qquad (2)
where u is the activation of the neuron, w are the weights, x are the inputs, a controls the shape of the output sigmoid, and y is the output. When equation (2) is replaced by a Heaviside function, the neuron is called a Linear Threshold Unit (LTU).
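As a point of reference, the following is a minimal sketch of equations (1) and (2) in Python (an illustration only; the function and variable names are not taken from the chapter):

```python
# A minimal sketch of equations (1) and (2): a linear weighted sum of the
# inputs followed by a sigmoid, and the Heaviside variant that makes the
# unit a Linear Threshold Unit (LTU).
import math

def activation(weights, inputs):
    """Equation (1): u = sum_j w_j * x_j."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid_output(u, a=1.0):
    """Equation (2): y = 1 / (1 + exp(-a*u)); a controls the sigmoid shape."""
    return 1.0 / (1.0 + math.exp(-a * u))

def ltu_output(u, threshold=0.0):
    """Heaviside output: the unit fires when the activation exceeds the threshold."""
    return 1 if u > threshold else 0

u = activation([0.5, -0.2, 0.8], [1, 1, 0])
print(sigmoid_output(u), ltu_output(u))   # 0.574..., 1
```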
When given the appropriate activation function and used in networks with three layers (MLN), they have been shown to be universal function approximators2. In addition, they have very good generalization abilities. However, this universality comes at a cost. To train an MLN on an arbitrarily complex problem requires repeated presentation of training examples, which often results in very long learning times.
The most popular approach to training an MLN is the generalized delta rule and its derivatives3, which require both a forward propagation phase and a backward (back error) propagation phase. The complexity of the training algorithm has limited
its use to applications which allow enough time for training the network.
2 N tuple method
The RAM based systems were not originally designed with a consideration of the limitations of MLNs, but they do provide solutions to these problems. They originate in the work of Bledsoe and Browning4, who invented a method of pattern recognition commonly termed the N tuple method. The basic principle behind the N tuple method is that learning to recognize an image can be thought of as building a set of logic functions that describe the problem. The logic functions will evaluate true for all images which belong to the class that the logic function represents and evaluate false for all other classes.
This is shown in the simple example in Fig. 1. Each class of image has a set of logic functions that relate to it. For an unknown image, the set of logic functions with the majority of functions evaluating to true indicates the class of the image. The image of a T can be recognized by using the logic function

R = A \cdot B \cdot C + \bar{D} \cdot E \cdot \bar{F} + \bar{G} \cdot H \cdot \bar{I} \qquad (3)

The image of an I shown in Fig. 1 can be recognized by using the logic function

R = \bar{A} \cdot B \cdot \bar{C} + \bar{D} \cdot E \cdot \bar{F} + \bar{G} \cdot H \cdot \bar{I} \qquad (4)

Each minterm ANDs together the pixels of one tuple; the minterms are then arithmetically summed to give a count of the number of terms that are true.
In the example in Fig. 1 there are 3 'tuples', each of size 3. These form the minterms in the equations above. They are:
Tuple 1: pixels A, B, C
Tuple 2: pixels D, E, F
Tuple 3: pixels G, H, I
How to learn the logic functions that represent the data belonging to a given class was originally shown by Aleksander and Stonham5. The approach was based upon the structure shown in Fig. 2. The problem involves remembering which logic terms would be needed for specific classes of image. This was most easily achieved by using a logical 1-in-N decoder followed by a set of binary storage locations for each term, and using one such unit for each tuple. The logical decoders compute all possible logical functions of the N inputs they connect to. When presented with a piece of data, the various decoders will indicate the functions required. To recognize an unknown piece of data, the same approach is used, only now the storage locations are accessed and summed.
The typical definition of a RAM node used in this book is one tuple of N inputs, followed by a logical 1-in-N decoder, followed by a set of storage locations and a summing device.
Input A   Input B   Output lines
   0         0         1000
   1         0         0100
   0         1         0010
   1         1         0001

Table 1: Activation table for a 1-in-N decoder
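As an illustration of the structure just described, the following sketch shows one way such a node might be coded (an assumption for illustration, not code from the book): the decoder and bit storage are collapsed into a set of stored tuple addresses, training writes the addressed locations, and testing sums the locations that are set.

```python
# A minimal sketch (assumed, not from the book) of a single-class N tuple
# discriminator: each RAM node is the 1-in-2^N decoder plus its bit storage,
# represented here as the set of tuple addresses whose storage bit is 1.
class NTupleDiscriminator:
    def __init__(self, tuple_map):
        # tuple_map: pixel indices sampled by each tuple, e.g. [(0,1,2), (3,4,5), (6,7,8)]
        self.tuple_map = tuple_map
        self.rams = [set() for _ in tuple_map]

    def _address(self, image, pixels):
        return tuple(image[p] for p in pixels)      # the N-bit tuple pattern

    def train(self, image):
        for ram, pixels in zip(self.rams, self.tuple_map):
            ram.add(self._address(image, pixels))   # set the addressed storage bit

    def response(self, image):
        return sum(self._address(image, pixels) in ram
                   for ram, pixels in zip(self.rams, self.tuple_map))

# The 3x3 'T' of Fig. 1, pixels A..I stored row by row:
T = [1, 1, 1,  0, 1, 0,  0, 1, 0]
I_img = [0, 1, 0,  0, 1, 0,  0, 1, 0]
d = NTupleDiscriminator([(0, 1, 2), (3, 4, 5), (6, 7, 8)])
d.train(T)
print(d.response(T))       # 3: every tuple pattern is recognised
print(d.response(I_img))   # 2: only the top-row tuple differs from the trained 'T'
```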
2.1 RAM based units compared to LTU
RAM based units are most similar to 'higher order' networks6, which combine the input data prior to its application to a system composed of LTUs. However, where higher order units combine continuous values using non-linear functions (powers etc.), RAM units combine the inputs using logical functions (AND and OR). In effect this is the same approach, only one is continuous and the other is binary.
Figs. 3 and 4 show the relation between a network of RAM units and a higher order network. Fig. 3 shows what is commonly called an N tuple network. One such network can be used to identify the similarity of an unknown image to one trained. A group of such networks, used to recognize one of a number of classes, is called a Multi-RAM Discriminator (MRD). Fig. 4 shows a higher order network with a structure equivalent to an N tuple network. The first layer consists of a set of decoders followed by a single layer of units with binary weights.
Another important difference is that RAM units always take sub-samples from the image in the form of tuples. Each tuple is made up of N samples of input data, fed to its own set of storage cells. In RAM based systems the size of a tuple, N, is an important parameter as it affects the classification ability of the network.
The major advantage of the RAM based approach is that the decoder/storage cell combination is a random access memory (RAM), allowing very simple and direct implementation in cheap and readily available components. This was shown by Aleksander et al.7 in the WISARD pattern recognition machine.
2.2 Limitations of the N tuple method
Although the basic N tuple approach is powerful in terms of its learning speed and simple implementation, the method has a major limitation. This relates to the learning capacity of a given N tuple network. By inspection it may be obvious that a given N tuple network cannot implement all possible functions of the data inputs.
To show this, a simple problem (the intra-exor problem) is shown in Table 2.
Inputs (A B C D)   Output
1 0 1 0              1
0 1 0 1              1
1 0 0 1              0
0 1 1 0              0

Table 2: The intra-exor problem
The tuple distribution given in Fig. 5 shows a two-tuple system that cannot solve the intra-exor problem shown in Table 2. However, by altering the placement of each tuple as shown in Fig. 6, the problem is solvable. This is unlike the EXOR problem on a single layer network of LTUs, which cannot solve that problem for any ordering of the inputs.
To solve this problem, one could use a tuple size equal to the input data size. However, this loses any generalization ability of the network, and results in RAM nodes with a very large memory requirement.
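The effect of tuple placement can be reproduced with a small sketch (an illustrative assumption, not code from the chapter): two single-class discriminators are trained on the patterns of Table 2, once with tuples (A,B),(C,D) and once with tuples (A,C),(B,D).

```python
# A small sketch (assumed) of the intra-exor problem of Table 2: the same
# two-discriminator system is built with two different tuple placements.
def train(patterns, tuples):
    rams = [set() for _ in tuples]                  # one set of seen values per tuple
    for p in patterns:
        for ram, idx in zip(rams, tuples):
            ram.add(tuple(p[i] for i in idx))
    return rams

def response(rams, tuples, p):
    return sum(tuple(p[i] for i in idx) in ram for ram, idx in zip(rams, tuples))

class1 = [(1, 0, 1, 0), (0, 1, 0, 1)]               # output 1 in Table 2
class0 = [(1, 0, 0, 1), (0, 1, 1, 0)]               # output 0 in Table 2

for tuples in [((0, 1), (2, 3)),                    # tuples (A,B) and (C,D)
               ((0, 2), (1, 3))]:                   # tuples (A,C) and (B,D)
    d1, d0 = train(class1, tuples), train(class0, tuples)
    print(tuples, [(response(d1, tuples, p), response(d0, tuples, p))
                   for p in class1 + class0])
# With (A,B),(C,D) every pattern scores 2 against both discriminators, so the
# classes cannot be separated; with (A,C),(B,D) each pattern scores 2 only
# against its own class and 0 against the other.
```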
The N tuple method was used in the WISARD pattern recognition machine7, capable of recognizing images at about 25 frames per second, and more recently in the C-NNAP system. To overcome the problem given here, the training and testing method involved moving the image around whilst training and testing. This effectively ensured that most of the time the EXOR type problem did not arise (it also allowed good generalization).
For many problems, such as image processing, the intra-exor problem is not an issue, as training methods overcome the limitation. The method has been successfully applied to many problems, such as monitoring crowded underground railway platforms, face recognition, satellite image recognition10 and character recognition11, as well as the applications described in section 3.
Figure 5: An arrangement of tuples that will not solve the intra-exor problem
The N tuple method was implemented in dedicated hardware in the form of WISARD in the early 1980s7. The machine was taken up commercially by Computer Recognition Systems of Woking (UK) and marketed for a number of years. The commercial system used conventional digital processor methods and achieved a recognition rate of 12.5 frames a second for 512×512 images, using sparse sampling (i.e. not all pixels used).
More recently the method has been used in a parallel image processing system
(C-NNAP14), where the method has been extended to form an associative memory15.
This system uses Field Programmable Gate Arrays to implement the memory scanning
and training processes.
3 Overcoming the limitations of N tuple; RAM pyramids
As the intra-exor problem has shown, the use of the N tuple method can be rather hit-or-miss: some training examples will be separable, while others will not.
To increase the robustness of the method, the functional capacity needs to be raised whilst maintaining the generalization ability, the training speed, and the simple hardware implementation.
To achieve this, it is important to understand how the method is capable of generalizing on unseen data. The method operates by a set membership classification process. Each image is broken up into a number of tuples. In the N tuple method, for a given set of patterns, the binary patterns appearing in each tuple are recorded during training. During testing, each tuple is checked by its RAM node to see if it contains a known bit pattern, and the number of RAM units that recognize their input tuple pattern is counted and output as a recognition figure. In effect each RAM unit, which processes the tuple data, is looking for patterns that belong to the set of known patterns presented during training. The generalization comes from the way tuple patterns are allowed to be mixed between training examples. The larger the tuple size, the smaller the generalization set size.
Unfortunately, the size of the generalization set is not set by training (apart from the number of examples given), but by the tuple size, N. Thus, getting a good balance between classification success and generalization requires experimentation (see chapter 2.1 for an evaluation of this).
The problems are caused by the linear combination of the results of the RAM units. In the N tuple method these are summed, which results in the intra-exclusive-OR problem given in section 2.2.
The obvious solution is to combine the outputs of the RAM units non-linearly. This can be done in two ways: use a multi-layer network of linear threshold units, or use more RAM units. The former method has been called a Hybrid network8. This is very similar to the higher order networks, except that it uses logical functions to combine data. The approach results in a solution to the problem, but at the cost of longer (iterative) training. The latter methods are more popular, and use the RAM based approach exclusively.
The typical form of the multi-layer RAM net (MLRN) is to combine the results of one layer of RAM units using subsequent layers. Because each RAM unit has a limited number of inputs, it is necessary to have many layers of RAMs for large images. These networks can implement any subset of logical combinations of the input data, thus overcoming the intra-exclusive-OR problem given earlier. However, the
solution requires a new training method, for the same reason that multi-layer networks of LTUs required the introduction of the generalized delta rule (GDR). As was shown by the GDR, performing back propagation requires soft-limited output functions on each neuron. The function implemented by RAM units is not continuously differentiable, thus MLRNs cannot be trained using the GDR when implemented using the basic RAM node. This problem has forced a number of researchers to investigate how MLRNs can be trained. Notable examples are the Probabilistic Logic Node (PLN)17, the probabilistic RAM (pRAM)18, the Goal Seeking Neuron (GSN)19 (and chapter 1.3) and Time Integrating Neuron (TIN) networks20 (and chapter 2.7), and their derivatives.
All these networks provide solutions to training MLRNs.
For each storage location three situations can be distinguished during training: (1) the pattern input caused a correct output, (2) the pattern input caused an incorrect output, and (3) the pattern did not occur on the input. By allowing tri-state storage in the RAM node, reinforcement learning could be used. This was implemented by Aleksander in the PLN21. In practice the PLN represents this information as the values 1, 0 and 'u' (unknown) in each storage location.
The use of the 'u', 'don't know' state required an extension of the learning algorithm. When a 'u' state is accessed in the RAM, the output of the RAM is set to 1 or 0 with a probability of 0.5. Thus any input pattern will allow the propagation of a result to the output of the network.
The approach taken in the PLN requires (1) iterative learning and (2) three-state storage locations. As a result, direct implementation in standard RAM components is not possible and training is slower. However, training is still more rapid than other approaches using LTUs.
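A minimal sketch of such a tri-state node follows (an assumed coding, not the published PLN implementation): each storage location holds 0, 1 or 'u', an addressed 'u' emits 0 or 1 with probability 0.5, and reinforcement fixes or resets the addressed value.

```python
# A minimal sketch (assumed) of a PLN-style node: each storage location holds
# 0, 1 or the 'don't know' value 'u'; an addressed 'u' outputs 0 or 1 at random.
import random

class PLNNode:
    def __init__(self):
        self.store = {}                    # address -> 0, 1 or 'u' (unwritten acts as 'u')

    def output(self, inputs):
        value = self.store.get(tuple(inputs), 'u')
        if value == 'u':
            return random.randint(0, 1)    # propagate a result with probability 0.5
        return value

    def reward(self, inputs, produced):
        self.store[tuple(inputs)] = produced   # fix the value that led to a correct network output

    def penalise(self, inputs):
        self.store[tuple(inputs)] = 'u'        # reset a value that led to an incorrect output

node = PLNNode()
out = node.output([1, 0, 1])               # random while the location holds 'u'
node.reward([1, 0, 1], out)                # reinforcement after a correct network response
```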
3.2 Weighting the storage locations, N tuple RAM extensions and the PLN
The approach used by the PLN nets and by the N tuple approach only permits each node to record the presence or absence of a particular input pattern. This approach can be softened by allowing the system to record the frequency of occurrence of the input patterns to a RAM node. This allows the important features of a piece of data to be weighted in preference to other features.
The original N tuple method described by Bledsoe and Browning suggested this. However, because of implementation difficulties it was not used in the hardware implementations of the N tuple method. It has been shown that the approach can improve recognition accuracy4. To overcome the implementation problem, Austin and Smith used a weighted scheme during training and converted this to a binary representation for later implementation in RAM based nodes22. This approach retained the improved recognition accuracy while partially maintaining the implementation efficiency.
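The idea can be sketched as follows (an assumed simple scheme in the spirit of the text; the exact conversion used by Austin and Smith is not reproduced here): tuple-pattern frequencies are accumulated during training and then thresholded back to single bits for a RAM implementation.

```python
# A small sketch (assumed) of frequency-weighted training followed by
# conversion to a binary representation via a simple threshold.
from collections import Counter

def train_weighted(images, tuples):
    counts = [Counter() for _ in tuples]
    for img in images:
        for c, idx in zip(counts, tuples):
            c[tuple(img[i] for i in idx)] += 1   # record how often each tuple pattern occurs
    return counts

def binarise(counts, threshold):
    # keep only tuple patterns seen at least 'threshold' times
    return [{addr for addr, n in c.items() if n >= threshold} for c in counts]

tuples = [(0, 1), (2, 3)]
images = [(1, 0, 1, 0), (1, 0, 0, 1), (1, 1, 1, 0)]
print(binarise(train_weighted(images, tuples), threshold=2))
# frequent tuple patterns survive for a RAM implementation; rare ones are dropped
```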
In the case of MLRNs, Mayers21 with the w-PLN and Gorse and Taylor18 with the pRAM both realised the benefit of using a weighted scheme in MLRNs. The pRAM is described in the next section. Mayers developed the w-PLN, which generalized the 3-state storage used in the PLN to w states, and showed the improved recognition accuracy which resulted. The reinforcement algorithm was extended to take this into account. This approach has not been implemented, but has been used to model delay learning in invertebrates21.
Nevill and Stonham23 provide a description of a PLN with the addition of a sigmoid activation function in a w-state PLN. In this version, the storage locations hold values with w states. However, after a value is addressed and read out, it is passed through a sigmoid normalization function which gives a continuous value between 0 and 1. This value is interpreted as the probability that the unit will fire with a 1. The same reinforcement algorithm is used.
Gorse and Taylor24 fully extended the design of a RAM unit to a probabilistic framework. The probabilistic RAM holds the probability of a given input pattern occurring, instead of just holding a normalized value as in the PLN approach. By using probabilities, the likelihood of the input pattern belonging to a particular class can be calculated, rather than a boolean yes/no class membership decision.
In addition, pRAMs pass the probability accessed by any input pattern to other nodes. Thus the approach is purely probabilistic in its operation.
To implement this approach, the King's College team have used a fully probabilistic approach, where information is passed between the nodes probabilistically. To do this they introduced a pulse-coded method of communication between RAM nodes25, which simplifies the operation and hardware implementation of the node. For each cycle of operation of the pRAM, the node (1) determines if pulses are present at its inputs, (2) forms a binary pattern of the bits that are on and off, (3) accesses the memory location that is addressed by that bit pattern, (4) reads out the probability, u, that the node fires from the memory location, and (5) sets a one or a zero at the output of the node depending on u. Thus, over time, the node will access a number of memory locations and fire with a mean rate depending on those locations' values. This is an entirely probabilistic system, including the hardware implementation, which makes it unique amongst neural network systems. The pRAM uses reinforcement learning to update the probabilities.
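One cycle of the pulse-coded operation described above might be sketched as follows (an illustrative assumption, not the pRAM hardware or its published code):

```python
# A minimal sketch (assumed) of one pRAM cycle: sample the input pulses, form
# the address, read the stored firing probability and emit a single pulse.
import random

def pram_cycle(memory, input_pulses):
    address = tuple(1 if pulse else 0 for pulse in input_pulses)   # steps (1) and (2)
    u = memory[address]                                            # steps (3) and (4): stored P(fire)
    return 1 if random.random() < u else 0                         # step (5): probabilistic output

# Over many cycles the node fires at a mean rate set by the stored probabilities:
memory = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.2}
pulses = [pram_cycle(memory, (0, 1)) for _ in range(1000)]
print(sum(pulses) / len(pulses))   # close to 0.9
```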
Another way of implementing a probabilistically based RAM node is not to use continuous firing rates, but to pass the probability of the node firing using a boolean value. This method, used in the GSN19, allows the system to work statically, i.e. at any moment the exact probability of a node's output can be obtained. In the pRAM, obtaining the probability of the output requires an averaging of the output pulses from the neuron. In addition, the GSN does not use reinforcement learning, but a method that allows learning of an unknown pattern in one presentation. The method is reviewed in chapter 1.3. Although the GSN operates in 'one pass', it embodies a search process that is similar to a depth-first search with backtracking in AI. This search process has a large worst-case search time. So, although the training set is only presented once, each pattern can take some time to learn. Furthermore, the network may not train in one pass, as the ordering of the patterns may result in the system not finding a solution.
The sigma-pi unit3 has been shown to be equivalent to a RAM based node by Gurney20. He shows how a RAM based node can be made equivalent to the sigma-pi unit and how, with the addition of a sigmoid activation on the output of a node, the network can use back propagation learning. He shows that a node can be described by

y = \sigma(a) \qquad (6)

where a is the node's activation, Sm is the range covered by the values stored at each addressed location, n is the number of input values (the tuple size), Su is the value held in addressed location u, u is one of the RAM addresses, and x is the input tuple pattern. The term \prod_{j=1}^{n}(1 + u_j x_j) effectively activates a given address when the tuple matches the indexed address, and the term \sum_u indexes all addresses in the RAM. The final equation is the sigmoid activation function \sigma, and the output of the neuron is given by y.
The operation of a RAM can thus be made continuous over its inputs and its outputs. As he points out, the practicality of this model is bounded by the size of n (the number of bits in the input). He then proposes a stochastic version of this model (the TIN) which reduces the computational complexity. The approach is similar to that taken in the pRAM, using bit streams as inputs and outputs of the unit. However, his model incorporates a sigmoid output function which makes the node continuous. The result is a node that can be interpreted as a continuous system and, as Gurney shows, can be trained using a version of back error propagation. However, it is not clear if any advantage is gained from this in terms of hardware implementation or speed.
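A sketch of the continuous RAM node follows (an interpretation, not Gurney's published formulation): inputs and addresses are coded as ±1 so that the product term is non-zero only for the matching address, the stored values are normalised by their range, and the result is passed through a sigmoid.

```python
# A sketch (assumed, not Gurney's published formulation) of a RAM node made
# continuous by a sigmoid output. Addresses u and inputs x are coded as +/-1,
# so prod_j(1 + u_j*x_j) is 2**n when u matches x and 0 otherwise; the stored
# values S_u and the normalisation by their range s_max are illustrative.
import math
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def continuous_ram_output(stored, x, s_max=1.0):
    """stored: dict mapping +/-1 address tuples to values S_u; x: +/-1 input tuple."""
    n = len(x)
    a = 0.0
    for u, s_u in stored.items():
        match = 1.0
        for u_j, x_j in zip(u, x):
            match *= (1.0 + u_j * x_j)
        a += (s_u / s_max) * match / (2 ** n)   # non-zero only for the matching address
    return sigmoid(a)

# The sum over all 2**n addresses reduces to a table look-up through the sigmoid:
stored = {u: 0.25 * i for i, u in enumerate(product((-1, 1), repeat=3))}
x = (-1, 1, 1)
assert abs(continuous_ram_output(stored, x, s_max=2.0)
           - sigmoid(stored[x] / 2.0)) < 1e-9
```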
The hardware implementation of this node type has been described by Hui, Morgan, Gurney and Bolori27. The most recent chip implements 10240 neurons in ES2 1.5 µm double-metal CMOS.
4 Concluding remarks
The development of the RAM based approach has come a long way since the original
concept was proposed by Bledsoe and Browning in 1959. The original RAM imple-
5 Summary
This chapter has briefly described the work in the area of RAM based neural networks. It has shown that these neurons all have a non-linear function prior to the weights, which provides the rapid training characteristics. In addition, when the binary version of the nodes is used, the implementation of the decoder and storage is simply a logical decoder and a set of bit storage locations.
References
IGOR ALEKSANDER
Department of Electrical and Electronic Engineering,
Imperial College of Science Technology and Medicine, London, UK
This chapter reviews a progression of weightless systems from the WISARD single-layer pattern recogniser to recent work on MAGNUS, a state machine designed to store sensory experience in its state structure. The stress is on algorithmic effects, that is, the effect of mapping these systems into conventional processors as virtual neural machines. The chapter briefly reviews the changes from the generalisation of discriminators in WISARD to the generalising RAM (G-RAM) as currently used in MAGNUS systems. This leads to the introduction of MACCON (the Machine Consciousness Toolbox), a flexible version of MAGNUS which runs on current PC operating systems.
1 Introduction
There is no need to rehearse the history of the various steps that led to the WISARD and MAGNUS systems, as the full story may be found elsewhere1. The point stressed here is that the most important application of weightless systems may still lie in the future: the cognitive behaviour of neural state machines, rather than the need for ever improving pattern recognition systems, which was the central driving force in the past. Generalisation is at the centre of concern in pattern recognition, while control and partitioning of state space is important in recursive systems which, in the case of weightless systems, are neural state machines. An interesting trend in the technology of such developments is the optimality or otherwise of hardware construction. While the original WISARD built by Bruce Wilkie was made entirely in hardware, the subsequent commercial version3 made use of a standard architecture with a wide bus. More recent developments in programming technology have brought this work entirely into the algorithmic domain, the original MAGNUS being a C++ program under UNIX, which in its latest MACCON format (Machine Consciousness Toolbox) is in C++ under Windows 95 or NT. It is therefore the algorithmic nature of the weightless paradigm which is stressed in this chapter.
Does this mean that the learning advantages of neural systems are lost to the algorithmic methodology? Not at all. It is felt that the important change is that from real to virtual machine; the virtual machine retains all the important properties of cellularity and notional parallelism which go with the earlier weightless systems, but imports the extraordinary property of virtuality, which in the case of MACCON turns a laptop into a neural network with considerable cognitive power and flexibility. This leads to a level of rapid system prototyping which is currently inconceivable in special-purpose hardware.
The other trend is the exploitation of the power of state space in neural state machines. It will be argued that the associative powers of a single state, coupled with the links achieved through state transitions, provide an easily accessible knowledge store which takes neural systems beyond the mere labelling of patterns into schemes that can become artificially conscious of the characteristics of environments sensed through inputs. The controversial words 'become artificially conscious' are chosen with considerable care: they imply creating a scheme which has a memory of the past, a perception of the present and a control model which allows the net to understand its environment in a predictive way.
2 Discriminator algorithms
The WISARD contains several discriminators, each containing k RAMs with n address inputs, such that the totality of kn inputs are connected to a W-point binary input pattern interface: on a one-to-one basis when kn = W, in an oversampled manner when kn > W and in an undersampled manner when kn < W. Starting with all RAM locations set to 0, the system is trained to store 1s, having assigned each discriminator to a given class. Only the appropriate discriminator is trained for each training pattern. An unknown pattern is assigned to the class of the discriminator which responds with the highest number of 1s. In algorithmic form this is well known to be identical to the n-tuple recognition algorithm of Bledsoe and Browning developed in 19594.
It has been shown1 that to a first approximation the response of each discriminator is of the order of R_j = (A_j)^n, where A_j is the overlap similarity (that is, [W − Hamming distance]/W) between the unknown pattern and that most similar to it in the training set for discriminator j. This means that the WISARD is merely an assessor of Hamming distance between the unknown and all the patterns in the training set. In the first instance the correctness of that assessment is independent of the value of n. However, a key property of such systems is that they provide a confidence level C which is defined as

C = \frac{R_{jmax} - R_{knext}}{R_{jmax}}

where R_jmax is the response of the leading discriminator and R_knext that of the next strongest one. Clearly n features in this, as C = 1 − (A_knext / A_jmax)^n, where A_jmax is the similarity leading to the response R_jmax, etc.
To summarise the WISARD algorithm, therefore: an unknown pattern is assigned to the class of the nearest (in Hamming distance) pattern of the training set, with a confidence level which depends on the ratio of the two highest class similarities taken to the power of n. The advantages of this are that a high degree of discrimination can be obtained with a high n, and that a training set for one class can contain a variety of differing patterns, giving correct recognition if the unknown is nearer to any element of that set than to any element of any other set. The disadvantage is that the total memory cost of the system is Dk2^n [where D is the number of discriminators], introducing a penalty for high levels of discrimination, both the discrimination and the cost being exponential functions of n.
In recent work the generalisation of weightless systems has been ensured through storage of all the address patterns seen by a node and the application of a simple nearest-neighbour algorithm to any unknown pattern. (Any suitable similarity measure could be introduced; Hamming distance is nevertheless convenient, and is assumed throughout this chapter.) This is known as the VG-RAM (Virtual Generalising RAM). The algorithm goes as follows.
• During training, each node stores the (tuple, class) pairs (tm, vm) it is shown.
• For an unknown pattern, the stored tuple tm nearest in Hamming distance to the node's input tuple is found.
• The output of that node is assigned the class vm, simply by reading the second half of the pair (tm, vm).
It may be shown that, given a single layer of VG-RAMs each sampling an n-tuple, the performance of this as a recogniser is identical to that of the WISARD algorithm. The way that an overall response is computed is to add the labels within each class given in response to an unknown input. Each class tally becomes equivalent to a discriminator response (to a first approximation) and follows the R = (A)^n rule (R now being the tally for the class). The advantages are that the storage penalty is now only Tkn, where T is the total number of training patterns (irrespective of class), while the adjustment of discrimination is still an exponential function of n. There is, however, a time penalty, in the sense that if the average number of training patterns per class is Tc then, in very broad terms, the VG-RAM algorithm can be expected to run Tc times longer than the WISARD one. The background to these considerations is given in1.
In summary, the VG-RAM is favoured because of its reasonable storage requirements with respect to WISARD-like methods in pattern recognition. It will be shown next that the VG-RAM is also well suited to being the building brick of neural state machines such as the MAGNUS structure.
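A minimal sketch of a VG-RAM node follows (an assumed coding, not the authors' implementation): training stores (t_m, v_m) pairs and recall returns the v_m of the stored t_m nearest in Hamming distance.

```python
# A minimal sketch (assumed) of a VG-RAM node: store (t_m, v_m) pairs during
# training; on recall return the v_m of the stored t_m nearest in Hamming
# distance to the unknown tuple.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

class VGRAMNode:
    def __init__(self):
        self.pairs = []                              # stored (tuple, class) pairs

    def train(self, t, v):
        self.pairs.append((tuple(t), v))

    def recall(self, t):
        nearest_t, nearest_v = min(self.pairs, key=lambda pair: hamming(pair[0], t))
        return nearest_v                             # read the second half of the pair

node = VGRAMNode()
node.train((1, 1, 0, 0), 'vertical')
node.train((0, 0, 1, 1), 'horizontal')
print(node.recall((1, 0, 0, 0)))                     # 'vertical' (distance 1 versus 3)
```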
4.1 Structure
The real promise for neural networks, particularly weightless ones, may be in areas
where knowledge storage is required - that is, recursive systems. The simplest of these
is the single neural state machine, which may be described as follows in algorithmic
terms. A state at time t is defined as

S(t) = (s_1(t), s_2(t), \ldots, s_w(t))

s_j(t) being the jth binary state variable at time t, there being w state variables altogether. The input at time t is similarly defined as

I(t) = (i_1(t), i_2(t), \ldots, i_v(t))

i_j(t) being the jth binary input variable of a v-variable input pattern. The next state is computed from S(t-1) and I(t-1) by a generalising function G built from VG-RAM nodes of the kind described in section 3, and the notation is used to indicate that each state variable is computed on the basis of n random samples from the input and f random samples from the state variables themselves.
4.2 Training
Clearly G can be accomplished only once some training has been done. Several general training algorithms may be deployed. A supervised algorithm will be mentioned, followed by three unsupervised ones.
This is a rather obvious form of training where some target S(t) exists for a given S(t-1) and I(t-1). Knowledge of these vectors is sufficient for the training of the state variables to take place according to the connections in 4.1. More substantial is the notion that given any S(t), S(t-1) and I(t-1), three training procedures may be defined: transfer mode, attractor creation and input habituation.
Transfer mode
One step of training is applied to link S(t) to S(t-1) and I(t-1). Note that the generalisation of the node will do more than just create the trained transition. Say that there are T training triplets

S_1(t-1), I_1(t-1) → S_1(t)
S_2(t-1), I_2(t-1) → S_2(t)
...
S_T(t-1), I_T(t-1) → S_T(t)

Then, given some previously unseen S(t-1) and I(t-1), and given a high connectivity, the computed S(t) will be that which has been learned for the vector S_j(t-1), I_j(t-1) which is nearest in Hamming distance, that is, S_j(t). As the connectivity is reduced, it becomes possible to output a mixture of learned state variables (and only a mixture of learned ones). Braga has studied this problem5 and its effect on the retrievability of information in state machines.
The main property of transfer training is that it can create an inner state irrespective of the current inner state. That is, assuming high connectivity, if S_j(t-1), I_j(t-1) → S_j(t) is the jth training step, then S_u(t-1), I_j'(t-1) will lead to the state S_j(t) if I_j'(t-1) is closest in Hamming distance to I_j(t-1) and S_u(t-1) is arbitrary, in the sense that it is roughly orthogonal to any trained state. Of course, departures from the ideals assumed in this result lead to noise in the generation of the desired state.
Attractor creation
This consists of extending training step j (that is, S_j(t-1), I_j(t-1) → S_j(t)) as follows:

S_j(t), I_j(t) → S_j(t+1)
Input habituation
This is a further extension of training step j, beyond the attractor creation level, to

S_j(t+1), *(t+1) → S_j(t+2)

where * denotes an arbitrary input pattern.
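The three procedures can be summarised by the training triplets (previous state, input, next state) each one adds; the sketch below is an interpretation (not the MAGNUS code), in which attractor creation is taken to re-enter the trained state and the arbitrary pattern '*' is represented by a noise input.

```python
# An interpretation (not the MAGNUS code) of the three procedures, expressed
# as the (previous state, input, next state) triplets each one adds for a
# training step S_j(t-1), I_j(t-1) -> S_j(t). Attractor creation is assumed
# to re-enter the trained state; the arbitrary pattern '*' is a noise input.
import random

def transfer_mode(s_prev, i_prev, s_next):
    return [(s_prev, i_prev, s_next)]

def attractor_creation(s_prev, i_prev, s_next):
    return transfer_mode(s_prev, i_prev, s_next) + [(s_next, i_prev, s_next)]

def input_habituation(s_prev, i_prev, s_next, v=8):
    noise = tuple(random.randint(0, 1) for _ in range(v))
    return attractor_creation(s_prev, i_prev, s_next) + [(s_next, noise, s_next)]

# Each list of triplets would then be taught to the VG-RAM-based state machine.
```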
Unsupervised learning
Three methods of unsupervised learning come to mind: localised region growing, random state selection and iconic learning. Our work has mainly centred on the last of these three. The algorithmic issue is how one chooses the internal state to accompany an external input.
Most of the recent work in weightless systems at Imperial College has focused on cognitive modelling using neural state machines. The key requirement here is that the internal state structure should be a recallable representation of experience. The methodology used in this context is called Iconic Learning, which was first analysed in 19917 and discussed more fully in terms of its cognitive implications in 19938 and as a contribution to neural models of consciousness in 1996. It is also reviewed in1. Its algorithmic nature is simply stated.
Taking the general form of training seen earlier,
S_j(t-1), I_j(t-1) → S_j(t)

a relationship is struck between S_j(t) and I_j(t-1) so that the binary variables of S_j(t) sample the variables of I_j(t-1). This is written as:

S_j(t) = \mu(I_j(t-1))
5.1 Raison d'être
The MAGNUS (Multi-Automata General Neural Unit Structure) concept was a development of the GNU, based on the realisation that multiple state machine architectures are important in achieving cognitive tasks. For example, the representation of duration, sequentiality and output sequence generation would require more than one state machine. The system was designed to be constructible in as flexible a way as possible. An example of progressively structured architectures is given in the next section.
MAGNUS allows the user to create a virtual machine designed to study various aspects of cognition. One of its designers, Richard Evans, has done this to some effect by creating a 'virtual robot' which finds its way around a virtual 'kitchen world'10. This system included a way for the output of the MAGNUS to move its window around the kitchen world image, altering its size if required. Evans' results include object finding and object naming in the kitchen world.
An input image is first defined, over which a window may be positioned, the content of the window being a sensory input to the neural system. Next, one or more neural state machines can be configured. A simple example is shown in Fig. 1.
The number of neurons in the block called MAGNUS (which is a neural state machine) and the number of inputs per neuron can be selected at will during the configuration phase. The display shows the previous state which, together with the input, forms the interface sampled by the neurons in the net to produce the new state at the arrival of a clock pulse (controlled by the user). On clicking a 'run' button, the contents of the STATE window are first shifted to the PREVIOUS STATE window before the new state is computed. An interesting feature of the system is that it can cope with colour images, as discussed in the Appendix. An example of the use of this simple configuration follows below to illustrate some of the concepts introduced earlier in this chapter.
A small network of 30×30 neurons with 64 inputs each (32 feedback and 32 from the input) was set up with MACCON in colour mode and in the configuration shown in Fig. 1. The training set consisted of only two images, as shown in Fig. 2.
These consist of a red vertical bar on a blue background and a green horizontal bar, also on a blue background. These patterns have been chosen because they have simple overlap regions, which facilitates a theoretical discussion. The object of the tests is to elucidate the difference between the three types of iconic transfer: transfer mode, attractor creation and input habituation.
b. Memory: where the ability to retain the recognised state even when the
input is replaced by an arbitrary noisy pattern is measured, and
Memory
  Transfer Mode (% OK):   97     50     50     ...  50
  Attractor Ctn. (% OK):  100    90     87     ...  71
  Input Habtn. (% OK):    100    98     97     ...  95

Attention
  Transfer Mode (% OK):   N/A    N/A    N/A    ...  N/A
  Attractor Ctn. (% OK):  100/0  22/66  25/67  ...  23/68
  Input Habtn. (% OK):    100/0  5.5/94 0/100  ...  0/100

(Columns give results at successive time steps; the Attention entries show the percentage of each training state present in the current state.)
In the recognition tests above, the first pattern is an arbitrary noise state. In the transfer mode the resulting error is due to several factors. First, a particular neuron could be atypically connected. This means that the connections from the input are so unevenly distributed that the neuron does not discriminate between elements of the training set. A particular input connection could be in the blue area common to the two training patterns (probability = 1/4), or it could be insensitive to both of the colours used in the two training patterns (probability = (7/8)×(7/8)). The total probability of a single neuron input being atypically connected is approximately 0.842.
Consequently, the probability of all 32 inputs (from the non-feedback part of the overall input to a neuron) being atypically connected is

(0.842)^{32} ≈ 0.00407
The rest of the error may be accounted for by the possibility that, in testing, the arbitrary feedback state may be closer to one of the incorrect training patterns. This is made relatively easy by the fact that the Hamming distance on the input side is at most 32/4. Bearing in mind that it is most likely that 1/4 of the inputs will be in the common blue area, the Hamming distance is more likely to have a maximum of 24/4, that is, 6. This additional error is hard to calculate, as it involves taking into account all the possible distributions of inputs into the four quadrants of the input space. The results indicate, however, that this is a greater contribution than that likely from atypical connections.
The object of these tests is to show that feedback and attractor creation remove the error. This is borne out by both the attractor creation and input habituation results. This is due to the fact that, while the error calculated above is still present in the next step, the second step removes the ambiguity in the feedback state. So only the atypicality error is left, reduced (because it applies to both input and feedback variables) to (0.00407 × 0.00407), that is 0.0000166, which is too small to be noticed.
In the memory tests, the transfer mode would not be expected to have any memory, and the results confirm a return to an arbitrary state as soon as the input is applied.
In fact the arbitrary state is not purely random, but a mixture of the two trained patterns
as might be expected from the training algorithm. The attractor creation, on the other
hand, demonstrates something akin to short term memory as might be expected from
the dependence of the output on the previous state which is deteriorating due to the
lack of control from the input. The result shows a situation where the input is in a false
minimum which still retains some of the memory of the original state. False minima
also exist which are mixtures of the two trained states, that is, there is no long term
memory. Long term memory is clearly indicated in the case of input habituation, the
error being due to the same causes as in input transfer, but this time generated by the
feedback state rather than the input.
Attention tests indicate that a system in one memory state may be switched into another by a single application of the alternative input. The results are displayed to show the percentage of each training state present in the current state. Initially a perfect memory state is present. The alternative state is presented for one step at the input and the response shown under t=1. The result at t=2 is with the input replaced by an arbitrary noise state. As expected, the attractor creation training remains influenced by an unrecognised input, and the full alternative state is not achieved. The input habituated system, on the other hand, achieves a full switch to the appropriate state. Further tests show that, starting in an arbitrary state, the habituated system falls fully into one of the two attractors, showing the usefulness of this form of training: the training experience is retrievable with high accuracy through short exposures at the input. Distorted inputs, too, cause falls into the nearest perfectly remembered states.
4 Conclusions
In much of the weightless neural systems literature it is often stated that a major advantage of the technique is its ease of implementation in hardware. In this chapter it has been argued that an appreciation of the algorithms used by weightless methods leads to an easy creation of virtual machines which take neural systems forward into cognitive computation. Starting with the Bledsoe and Browning n-tuple algorithm (which, in hardware, became the WISARD), it has been recalled that the RAM function may be augmented (G-RAM) by a node generalisation algorithm, through a spread of stored training content to addresses close in Hamming distance to the training addresses. The neural state machine was introduced to indicate the way in which spreading helps in the creation of various state space structures. MAGNUS and its portable version MACCON are virtual machines in which neural state machines may be interconnected to study hypotheses related to cognition. Of course, if cognitive behaviour is required in a system, then such machines may be used as cognitive engines.
Appendix
In colour mode, each pattern interface pixel can be in one of a set C of eight 'colours'. Any input of the neuron is 'dedicated' to one of these colours; that is, if and only if the dedication of the input matches the colour of the pixel, the input is set to 1. So a neuron connected to some monochromatic field, say red, is expected to have 1/8 of its inputs set to 1. A change of colour in the monochromatic field would lead to a different set of 1/8 of the inputs being at 1, the set being disjoint from the first one. It is still possible to calculate the Hamming distance characteristics for any two colour images. Say that the two images have different colours in P of their pixels. As approximately 1/8 of the P pixels would be coded as 1 in the first image, and a different 1/8 in the second image, the effective Hamming distance H between the two is most likely to be (1/8 + 1/8)P, that is,

H ≈ P/4
Training proceeds by associating the input vector with one of the colour messages. Clashes can still occur if an attempt is made to associate more than one colour with the same input vector. In this case the output message stored is 'u' but, in contrast with the 0/1 case, 'u' causes one of the eight colours to be generated at random (where in the 0/1 case it is 0 or 1 which is arbitrarily selected).
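The 1-of-8 colour dedication and the resulting effective Hamming distance can be checked with a small sketch (an assumed illustration; the colour set and image sizes are placeholders):

```python
# A small sketch (assumed) of the 1-of-8 colour dedication: each input is set
# to 1 only when its pixel has the colour it is dedicated to, so two images
# differing in P pixels differ in roughly P/4 of the coded bits.
import random

COLOURS = range(8)

def code(image, dedications):
    return [1 if colour == dedication else 0
            for colour, dedication in zip(image, dedications)]

pixels = 10000
dedications = [random.choice(COLOURS) for _ in range(pixels)]
img1 = [random.choice(COLOURS) for _ in range(pixels)]
img2 = [c if random.random() < 0.5 else random.choice(COLOURS) for c in img1]

P = sum(a != b for a, b in zip(img1, img2))
H = sum(a != b for a, b in zip(code(img1, dedications), code(img2, dedications)))
print(P, H, P / 4)   # H comes out close to P/4
```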
References
A. C. P. L. F. DE CARVALHO
Computing Department, University of Sao Paulo at Sao Carlos,
Sao Carlos, SP CP 668, CEP 13560-970, Brazil
M. C. FAIRHURST, D. L. BISSET
Electronic Engineering Laboratories, University of Kent at Canterbury,
Canterbury, Kent, CT2 7NT, England
1 Introduction
2 GSN networks
The goal in this state is to determine the values that the node can learn in the next learn phase without disrupting previously stored information. In order to discover this, the input to be learnt is fed into the input terminals. A function is applied to the values stored in the GSN's addressed memory contents to determine its validating value. This validating value is propagated through the pyramid to its apex. The validating value produced by the node at the apex will be the pyramid validating value. If the output is equal to u (undefined) then it can learn any desired output. If, on the other hand, the output is a defined value (0 or 1), only this value can be taught. The output of a node n_i in the validating state is given by Equation 1 below:
o_i = \begin{cases} 0 & \text{iff } \forall a_{im} \in A_i,\ C_i[a_{im}] = 0 \\ 1 & \text{iff } \forall a_{im} \in A_i,\ C_i[a_{im}] = 1 \\ u & \text{iff } \exists a_{im} \in A_i \mid C_i[a_{im}] = u \text{, or } \exists a_{im}, a_{il} \in A_i \mid C_i[a_{im}] \neq C_i[a_{il}] \end{cases} \qquad (1)
of the address a_im. Thus each node n_i receives its desired output, learns, and provides desired outputs to nodes in the previous layer.
o_i = \begin{cases} 0 & \text{iff } \|A_{i/0}\| > \|A_{i/1}\| \\ 1 & \text{iff } \|A_{i/1}\| > \|A_{i/0}\| \\ u & \text{iff } \|A_{i/0}\| = \|A_{i/1}\| \end{cases} \qquad (3)
In this equation, the node will generate as output a value 1 only if the number of 1's in the addressable locations is greater than the number of 0's. If the inverse occurs (i.e., the number of 0's is greater than the number of 1's), the output will be equal to 0. It does not matter how many undefined values exist in the addressable locations. When the addressed memory contents store the same number of defined values 0 and 1, the output will be an undefined value. This rule has the effect of minimizing the propagation of undefined values.
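The rule of Equation 3 can be sketched directly (an assumed coding, with A_{i/1} and A_{i/0} taken as the addressed contents equal to 1 and 0):

```python
# A small sketch (assumed) of the output rule of Equation 3: compare how many
# of the addressed memory contents hold 1 and how many hold 0; ties give 'u'.
def gsn_output(addressed_contents):
    ones = sum(1 for c in addressed_contents if c == 1)
    zeros = sum(1 for c in addressed_contents if c == 0)
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return 'u'                            # equal numbers of defined values

print(gsn_output([1, 'u', 1, 0]))    # 1 (undefined values are ignored)
print(gsn_output([0, 1, 'u', 'u']))  # 'u'
```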
3 Learning algorithms
The C and D learning algorithms were the first strategies developed to teach GSN neural networks. Their only difference is the approach taken for selecting the memory content to store the desired output value when more than one option exists. While the C algorithm chooses the memory position at random, the D algorithm chooses a memory position whose address has a larger number of bits in common with the defined value to be learned. According to7, the purpose of the D algorithm is to make the internal representations of patterns from the same class as similar as possible, and vice versa. As a consequence, the network increases the discrimination between patterns of different classes and reduces the discrimination between members of the same class.
By using these learning strategies, GSN has achieved very good recognition performance when applied to character recognition6,5,7. Its one-shot learning characteristic makes it particularly suitable for pattern recognition.
4 Experiments
Experiments have been carried out to evaluate the influence of each of these algorithms on the performance of GSN, considering a number of benchmarks found to be of importance in practical systems. The parameters of comparison adopted in this comparative study are the saturation rate of each pyramid layer, the training time and the recognition performance. For this, GSN networks comprising 100 three-layer pyramids have been used. In order to make the results achieved more representative, they are shown for connectivities 2, 4 and 6. A machine-printed data set with numerals extracted from mail envelopes has been used. The networks were trained with 10 samples of each numeral and tested with 100 additional samples of each numeral. All the results represent the average of 10 similar experiments run with different input mappings.
4.2 Saturation
One of the main problems faced by Boolean neural architectures is the saturation of their nodes' memory contents during the learning phase. This problem depends on the relation between the storage capacity and the number of patterns per class or the number of classes. The teaching of each new pattern reduces the number of memory contents available, and this is even more critical when low connectivity is used, since a small number of memory contents are accessible. If not enough memory contents are available, there will be a point where just a few, or even no, new patterns can be learned. When one-shot learning is used, this problem is even worse because the earlier patterns will be better represented than the later ones. Figure 2, Figure 3 and Figure 4 show the saturation rates for each pyramid layer when connectivities 2, 4 and 6 are used.
With connectivity 2, it can be seen that, for the C and Clazy cases, the first and the last layers have a much higher saturation than the intermediate layer. With D, Dlazy and P, the three layers present more similar saturation rates. The third layer yields a higher level of saturation than the second layer
for all the cases. According to9, the ideal situation is one where the saturation rates are evenly distributed among all the layers. When connectivity 4 is used, all five algorithms present a lower saturation rate than in the previous case, which is due to the larger number of memory contents available. However, while the saturation rates for the C and Clazy algorithms keep similar relative proportions in all three layers for different connectivities, when D, Dlazy and P are used the saturation rates for the second layer become relatively higher than in the first layer.
The saturation rates for the second layer become higher than in the first because of the larger increase in the number of memory contents in the first layer when the overall connectivity is increased. This unbalanced increase in the number of memory contents affects the C and Clazy algorithms to a smaller extent. By using connectivity 6, the saturation rates of all layers, in particular the last layer, are decreased. This decrease is expected because of the higher number of memory contents available.
• The same pattern in the area covered by P_i being associated with different desired outputs.
strategies save memory contents for future patterns, allowing more patterns to
be learned. As the connectivity increases, the percentage of conflicts produced
by each learning strategy clearly decreases. Larger connectivity means more
memory contents available, and therefore lower probability of not having a
memory content available when one is needed.
5 Conclusion
In this chapter five learning algorithms available to train GSN neural networks are described and investigated. The algorithms have been investigated with respect to a number of important criteria: recognition performance, memory saturation and training time. It has been found that, usually, the larger the connectivity, the better the recognition performance. The exceptions are the cases where the C and Clazy algorithms have been used. The results also show that an increase in connectivity decreases the saturation but increases the training time. The lowest saturation rates are obtained by the C and Clazy cases, the shortest training time by the P and Clazy cases and the best recognition performances by the Dlazy and P cases.
References
The ADAM binary neural network, which has been used for image analysis applications, is constructed around a central component termed a Correlation Matrix Memory (CMM). A recent re-examination of the CMM has led to the development of the Advanced Uncertain Reasoning Architecture (AURA). AURA inherits many useful characteristics from ADAM, but is intended for applications requiring the manipulation of symbolic knowledge. This chapter shows how the AURA architecture has been developed from ADAM and explains its method of operation. The chapter also outlines the use of AURA in symbolic processing applications, and highlights some of the ways in which the AURA approach is superior to other methods.
1 Introduction
The ADAM neural network1 is a binary network used for image analysis applications which is constructed around a central component known as a Correlation Matrix Memory (CMM) network. Our recent work has examined the CMM networks in ADAM and has developed a more thorough understanding of the way the network operates. Through this, we have been able to develop the Advanced Uncertain Reasoning Architecture (AURA), which is intended for use in knowledge based systems tasks3. AURA is similar to ADAM in its ability to deal with uncertain data and its simple implementation in hardware. AURA is also expected to share the ability of ADAM to operate at speed on very large amounts of data. In Section 2 we review the basic ADAM architecture and examine the central role played by the Correlation Matrix Memory networks in ADAM. In Section 3 we describe the structure, operation, and concepts of the AURA architecture. Concluding comments appear in Section 4.
2 Background
The basic ADAM architecture is shown in Figure 1 and comprises a number of distinct
stages: an N-tuple decoder, a first stage CMM network, a class separator, and finally
a second stage CMM network. Use of two CMM networks allows the storage capacity
of the system to be determined independently of the input and output pattern sizes.
Details of the ADAM architecture can be found in .
Our recent work has focused on the way the CMM networks operate within ADAM.
The binary CMM network is a key functional element of the ADAM system which provides the network's associative capability. A CMM network can be regarded as a
single-layer weightless neural network, but is most easily seen in terms of a binary ma-
trix W. Thus each element W_ij of the matrix W takes the value 0 or 1 to represent the absence or presence (respectively) of an unweighted synaptic connection between the corresponding input and output units.
Figure 2 shows an example of a small CMM network in both train and test modes.
In Figure 2a, training is accomplished by creating an association between two binary
patterns using a simple Hebbian learning procedure. The two patterns are applied to
the rows and columns of the CMM network simultaneously. An element of the matrix
is set to 1 corresponding with the intersection of active rows and columns. (A row or
column is active when a bit is set in the applied pattern at that position.) In mathemat-
ical terms this is equivalent to computing the outer product of the two patterns to ob-
tain the matrix W (i.e. W = uv^T, where u is the (input) column vector and v is the
(transposed, output) row vector).
In Figure 2b a pattern is applied to the rows of the CMM network trained in Fig-
ure 2a. This pattern is similar to the pattern used in training except that the bit applied
to the first row is no longer set (this could have occurred due to noise, for example).
Recall is accomplished by summing the active row connections in each vertical col-
umn. An appropriate thresholding function then recovers the original binary pattern,
first applied to the columns in Figure 2a. Thus a CMM network can be made robust
against corrupted input patterns during recall by a suitable choice of thresholding
function (both L-max and Willshaw thresholding are used in AURA3). This property
is exploited in AURA to provide a powerful partial matching mechanism for incom-
plete inputs (see Section 3.6).
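As a concrete illustration of the train and recall operations just described, the minimal sketch below (all names are illustrative, with a NumPy matrix standing in for the CMM) stores one association by an outer-product update and then recovers the output pattern from a corrupted input by column summation and thresholding. The fixed threshold used here is a simplification of the L-max and Willshaw schemes used in AURA.

```python
import numpy as np

def train_cmm(pairs, n_in, n_out):
    """Train a binary CMM: OR together the outer products of the
    (input, output) pattern pairs (simple Hebbian learning)."""
    W = np.zeros((n_in, n_out), dtype=np.uint8)
    for u, v in pairs:
        W |= np.outer(u, v).astype(np.uint8)   # set W[i,j]=1 where u[i]=v[j]=1
    return W

def recall_cmm(W, u, threshold):
    """Recall: sum the active rows in each column, then threshold the sums
    to recover a binary output pattern."""
    sums = u @ W
    return (sums >= threshold).astype(np.uint8)

u = np.array([1, 0, 1, 1, 0, 0], dtype=np.uint8)   # input pattern (rows)
v = np.array([0, 1, 0, 1], dtype=np.uint8)         # output pattern (columns)
W = train_cmm([(u, v)], len(u), len(v))

u_noisy = u.copy(); u_noisy[0] = 0                 # one active bit lost, as in Figure 2b
# Willshaw-style choice: threshold at the number of set input bits
print(recall_cmm(W, u_noisy, threshold=u_noisy.sum()))   # -> [0 1 0 1]
```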
It is clear that the input to a CMM network need not be N tuple states as in AD-
AM, but any pre-processed data. CMM networks have been used to represent frame-
based knowledge, and to implement multiple-hypothesis generation, similarity match-
ing, and hypothesis selection in an experimental system based on the original ADAM
architecture7. Building on these ideas, we have developed methods for pre-processing
and presenting rule based data3 which have led to the new AURA architecture.
The AURA architecture can be used to form the basis of high performance knowledge
based systems. AURA provides facilities to enable high-speed matching between en-
vironment variables and a knowledge base consisting of rules that typically have the
form: antecedent-expression -> consequent-expression.
The architecture exploits particular properties of CMM networks to support evaluation of the antecedent-expression, which generally contains bound symbolic references to environment variables. In addition, a powerful form of partial matching is performed with incomplete
expressions (in which some environment variables are not available). However, in this
case consequent-expression should be treated as a tentative, uncertain conclusion, to
be confirmed by subsequent processing.
The system is first trained with a knowledge-base, i.e. a set of rules or predicates.
These specify particular pre-conditions and (via class separator patterns) the conse-
quential actions to be taken if the pre-conditions are satisfied. In operation, values be-
come bound to environment variables, forming the input to AURA. Thus rules in the
knowledge-base become eligible to be "fired" when the configuration of variables sat-
isfies the pre-conditions of those rules.
The principal components of the AURA architecture are shown in Figure 3. The ar-
chitecture is based on the use of an array of CMM weightless neural networks, sup-
ported by mechanisms which: a) convert lexical tokens into binary pattern vectors
with exactly k bits set (where k is a constant for a given CMM network), b) perform
binding of variable-names to values, c) form superimposed codings of sets of bound
variables, d) route the superimposed sets to appropriate CMM networks, and e) re-
solve multiple network outputs (which occur in the form of superimposed class sepa-
rator patterns). The lexical converter replaces the N tuple pre-processing used in
ADAM, converting symbolic and numeric data to fixed weight binary patterns. This
is, in effect, what N tuple pre-processing does to image data.
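The sketch below illustrates the idea of the lexical converter: mapping a symbolic token to a fixed-weight binary vector with exactly k bits set. The hash-seeded choice of bit positions is an assumption made for the example only, not AURA's actual conversion scheme.

```python
import hashlib
import random

def lexical_to_kofn(token, n_bits=64, k=4):
    """Map a lexical token to a reproducible binary vector of length n_bits
    with exactly k bits set (a fixed-weight code).  Seeding a PRNG from a
    hash of the token is an illustrative choice only."""
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    positions = rng.sample(range(n_bits), k)   # choose k distinct bit positions
    vec = [0] * n_bits
    for p in positions:
        vec[p] = 1
    return vec

print(sum(lexical_to_kofn("engine_temperature")))  # always k = 4 bits set
```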
Space does not permit a description of the discussion surrounding this binding problem here, but the approach taken in AURA appears to overcome some of the problems encountered previously. The approach taken is similar to that of4, but uses binary instead of real-valued tensors.
In the AURA architecture, tensor products are used to bind variables to values.
Tensor product (TP) vectors can be formed in two steps for a pair of binary patterns
representing the symbolic name and the value to which it is bound. The first step in-
volves calculation of a TP matrix and is equivalent (in the two pattern case) to the out-
er product computation used to store a pair of patterns in a CMM network. In the
second step, the TP vector is obtained by simply concatenating the rows of the matrix
obtained in the first step.
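A minimal sketch of the two-step binding just described: the binary tensor-product matrix of the name and value patterns is formed, and its rows are concatenated into the TP vector. Names are illustrative.

```python
import numpy as np

def bind(name_vec, value_vec):
    """Bind a variable name to a value: form the binary tensor-product
    (outer product) matrix and flatten it row by row into a TP vector."""
    tp_matrix = np.outer(name_vec, value_vec)  # step 1: TP matrix
    return tp_matrix.reshape(-1)               # step 2: concatenate the rows

name  = np.array([1, 0, 1, 0], dtype=np.uint8)   # pattern for the variable name
value = np.array([0, 1, 1, 0], dtype=np.uint8)   # pattern for the bound value
print(bind(name, value))    # length-16 TP vector, ones where both bits are set
```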
The CMM network used to store a given rule is determined according to rule arity. The use of multiple networks allows full control over the matching process: a unique CMM network is allocated for the storage of rules of each required arity. Thus a rule such as A.B.C → S, which has an arity of three, would be stored in the CMM allocated for rules of arity three.
A major aim in the design of the AURA architecture has been the provision of a spe-
cial form of partial match, not generally found in other systems. The usual form of par-
tial or incomplete match which is provided by many systems (including AURA)
involves matching one particular combination of terms (n) from the full set of terms
or attributes (m) in a previously stored rule (or record).
The AURA architecture extends this idea, providing a second, more general ver-
sion of the partial match which provides much greater flexibility in the matching proc-
ess. In this version, we can specify that any combination of n terms from m is
permitted in a successful match. For example, we can choose to accept a rule of arity
five which matches on any three of the five boolean product terms in the rule. To perform this second type of match, most systems would require mCn (m choose n) separate match operations, whereas in AURA this is performed in a single match operation.
Since partial matching is achieved by manipulating the threshold in AURA, it is
simple to dynamically relax the criteria for a successful match, in effect by reducing
the specified value of n. This combines the two versions of partial match described
above, and can be useful (for example) when the initial value of n fails to produce a
successful match.
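The following sketch suggests how the "any n of m terms" match can be obtained in a single operation by thresholding summed column responses. It assumes disjoint k-bit codes for the antecedent terms and ignores crosstalk between superimposed patterns; it is an illustration, not the AURA implementation itself.

```python
import numpy as np

N_IN, N_OUT, K = 32, 8, 4          # input width, separator width, bits per term

def store_rule(W, term_vectors, separator):
    """Store a rule: superimpose (OR) its antecedent term vectors and
    associate the result with the rule's class-separator pattern."""
    antecedent = np.bitwise_or.reduce(term_vectors)
    W |= np.outer(antecedent, separator).astype(np.uint8)

def partial_match(W, term_vectors, n_required):
    """Accept any column whose summed response reaches n_required * K,
    i.e. a match on any n_required of the presented terms, in one pass."""
    presented = np.array(term_vectors).sum(axis=0)
    return (presented @ W) >= n_required * K

def k_of_n(positions):
    v = np.zeros(N_IN, dtype=np.uint8); v[list(positions)] = 1; return v

# Rule A.B.C -> S, with disjoint 4-bit codes for A, B and C (no crosstalk).
A, B, C = k_of_n(range(0, 4)), k_of_n(range(4, 8)), k_of_n(range(8, 12))
S = np.zeros(N_OUT, dtype=np.uint8); S[0] = 1
W = np.zeros((N_IN, N_OUT), dtype=np.uint8)
store_rule(W, [A, B, C], S)

print(partial_match(W, [A, B], n_required=2)[0])   # True: 2 of 3 terms present
print(partial_match(W, [A],    n_required=2)[0])   # False: only 1 term present
```

Relaxing the match criterion, as described above, amounts to lowering n_required and re-testing the same summed responses.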
3.7 Applications
The AURA architecture is primarily aimed towards applications requiring rapid ma-
nipulation of large knowledge bases. A subsidiary aim is to provide a degree of sup-
port for uncertainty management in such applications by exploiting the partial match
capability of the architecture.
An important source of motivation for this work is the requirements of mission
management systems as found, for example, in future generations of aircraft. Such
systems constantly demand performance improvements in command and control func-
tions including data fusion, situation assessment, and sensor management. In most
cases, the performance of these functions can be considerably enhanced with support
from high-speed rule evaluation. For example, in the situation assessment function,
generic situations could be recognised rapidly using rules. The special partial match
abilities of AURA could prove extremely useful in "filtering out" small variations be-
tween similar situations.
4 Conclusions
The main objective of the work described here is to develop novel architectures for
knowledge manipulation. One of the most important aims of the work is to develop
systems that can be implemented efficiently in hardware. Towards this end, we have
developed an architectural framework based on weightless correlation matrix memory
neural networks. A set of dedicated support chips is under development to enable the
construction of high-performance systems based on AURA concepts.
The major benefits of the architecture described here are: 1) efficient rule match-
ing in knowledge bases, 2) powerful partial match mechanism, 3) simple hardware im-
plementation. The system is being applied to rapid reasoning in aircraft systems,
where it is necessary to support all these features.
References
The following set of chapters describe extensions to the methods used in RAM based
systems. As with all neural network methods, there is a continual aim to improve the
performance of methods and undertake comparisons with other techniques.
The first five chapters describe new methods for the analysis of N tuple systems
that allow the networks to be used more effectively. The first chapter by Morciniec
and Rohwer presents a thorough comparison of the RAM based methods and other
neural networks which clearly demonstrates that RAM based networks are at least as
good as a wide range of other networks and statistical methods on a range of complex
and well known benchmark problems. The next chapter shows that RAM based net-
works, although commonly thought of as binary networks, are capable of using continuous inputs in the domain of image processing. The chapter by Howells, Bisset and Fairhurst describes, in general terms, how RAM based networks that use the GSN
learning methods may be compared with and integrated with other RAM based meth-
ods. Jorgensen, Christensen and Liisberg show how the well known cross validation
methods and information techniques can be used to reduce the size of the RAM net-
works and in the process improve the accuracy of the networks. Finally, a very valuable insight into the calculation of the storage capacity of a wide section of RAM based
networks is given by Adeodato and Taylor. The general solution permits the capacity
of G-RAM, pRAM and GSN networks to be estimated.
The final three chapters in section 2 describe new RAM methods which extend
the basic ability of the networks. The chapter by Morciniec and Rohwer shows how to
deal with zero weighted locations in weighted forms of RAM based networks. Nor-
mally these are dealt with in an ad hoc fashion. Although a principled approach is
presented (based on the Good-Turing density estimation method), it is shown that us-
ing very small default values is a good method. It also contrasts binary and weighted
RAM based approaches. The next chapter by Neville shows how a version of the Back
propagation algorithm can be used to train RAM networks, allowing the RAM meth-
ods to be closely related to weighted neural network systems, and showing how Back
propagation methods can be accelerated using RAM based methods. The chapter by
Jorgensen shows how the use of negative weights in the storage locations allows rec-
ognition success to be improved for handwritten text classification. Finally, Howells,
Bisset, and Fairhurst explain how the BCN architecture can be improved by allowing
each neuron to hold more information about patterns it is classifying (which results in
the GCN architecture) and by allowing a degree of confidence to be added (which results in the PCN architecture).
BENCHMARKING N-TUPLE CLASSIFIER WITH STATLOG DATASETS
M. MORCINIEC*, R. ROHWER§,
Neural Computing Research Group,
Aston University, Birmingham, B4 7ET, UK
The n-tuple recognition method was tested on 11 large real-world data sets and its per-
formance compared to 23 other classification algorithms. On 7 of these, the results show
no systematic performance gap between the n-tuple method and the others. Evidence
was found to support a possible explanation for why the n-tuple method yields poor re-
sults for certain datasets. Preliminary empirical results of a study of the confidence in-
terval (the difference between the two highest scores) are also reported. These suggest
a counter-intuitive correlation between the confidence interval distribution and the
overall classification performance of the system.
1 Introduction
The n-tuple classification system is one of the oldest neural network pattern recogni-
tion methods, and there have been many reports of its successful application in various domains4,7,10,11,12. The major advantage of the method is its lightning-fast training speed.
Learning is accomplished by recording features of patterns in a random-access mem-
ory, which requires just one presentation of the training set to the system. Similarly,
recognition of a pattern is achieved by checking memory contents at addresses given
by the pattern.
It is prudent to suspect that relatively poor performance will accompany the
speed and simplicity of the n-tuple algorithm. We therefore carried out a large-scale
experiment8 in which the n-tuple method was tested on 11 real-world datasets previ-
ously used by the European Community ESPRIT StatLog project9 in a comparison of
23 other classification algorithms including the most popular neural network methods.
The results, reviewed below, show the n-tuple method to be a strong performer, except
in a few cases for which we can offer explanations.
Statistics were also recorded on the confidence intervals (the differences between
the two highest scores). Preliminary results suggest that two types of distribution oc-
cur, and performance is correlated to the distribution type.
* Current address: Hewlett-Packard Labs, Filton Rd., Stoke Gifford, Bristol BS12 6QZ, UK
§ Current address: Prediction Company, 320 Aztec St., Suite B, Santa Fe, NM, 87510, USA, email: [email protected]
The StatLog project was designed to carry out comparative testing and evalua-
tion of classification algorithms on large scale applications. About 20 data sets
were used to estimate the performance of 23 procedures. These are described
in detail in 9 . This study used 11 large data sets, selected as described in 8 . A
specific random division into training and test sets was supplied for each data
set.
The attributes of the patterns in the StatLog data sets are mostly real
numbers or integers. Therefore each attribute was rescaled into an integer
interval, quantised, and converted into a bit string by the method of Kolcz
and Allinson 2 , 3 based on CMAC and Gray coding techniques.
The prescription for encoding integer x is to concatenate K bit strings, the j'th of which (counting from 1) is (x + j)/K, rounded down and expressed as a Gray code. The Gray code of an integer i can be obtained as the bitwise exclusive-or of i (expressed as an ordinary base 2 number) with i/2 (rounded down). This provides a representation in aK bits of the integers between 0 and (2^a - 1)K inclusive, such that if integers x and y differ arithmetically by K or less, their codes differ by Hamming distance |x - y|, and if their arithmetic distance is K or more, their corresponding Hamming distance is at least K. The resulting bit strings are concatenated together, producing an input vector of length L = aKA, where A is the number of attributes.
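A small sketch of this encoding follows, under the assumption that the j-th substring encodes (x + j) divided by K and rounded down; the exact range and parameter conventions of the original method may differ.

```python
def gray(i, width):
    """Gray code of integer i (i XOR i>>1), as a list of `width` bits."""
    g = i ^ (i >> 1)
    return [(g >> b) & 1 for b in reversed(range(width))]

def cmac_gray_encode(x, K, a):
    """Encode integer x as K concatenated a-bit Gray codes; the j-th
    substring (j = 1..K) encodes (x + j) // K, so nearby integers map to
    codes at small Hamming distance, as described in the text."""
    bits = []
    for j in range(1, K + 1):
        bits += gray((x + j) // K, a)
    return bits

def hamming(u, v):
    return sum(b1 != b2 for b1, b2 in zip(u, v))

# Values differing by at most K differ by a Hamming distance equal to
# their arithmetic distance:
c5, c7 = cmac_gray_encode(5, K=4, a=4), cmac_gray_encode(7, K=4, a=4)
print(hamming(c5, c7))   # -> 2
```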
3 Benchmarking results
Key to Figure 1:
RAM nets: (•) n-tuple recogniser.
Discriminators: 1-hidden-layer MLP; Radial Basis Functions; Cascade Correlation; SMART (projection pursuit); Dipol92 (pairwise linear discriminant); Logistic discriminant; Quadratic discriminant; Linear discriminant.
Methods related to density estimation: CASTLE (probabilistic decision tree); k-NN (k nearest neighbours); LVQ (Kohonen); Kohonen topographic map; Naive Bayes (independent attributes); ALLOC80 (kernel functions).
Decision trees: NewID; AC2; Cal5; CN2; C4.5; CART; IndCART (CART variation); BayesTree; ITrule.
4 Counting Hypercubes
It is well established that the tuple distance between two patterns (the expected number of tuples on which they differ) decays exponentially with Hamming distance H according to

\rho(H) \approx N(1 - e^{-\alpha H})    (1)

where N is the number of tuples.

a < n.    (2)
Figure 1: Results for N-tuple (•) and other algorithms. Classification error rates increase from left to right, and are scaled separately for each data set, so that they equal 1 at the error rate of the trivial method of always guessing the class with the highest prior probability, ignoring the input pattern. The arrows indicate performance which was worse than this.
Figure 2: The number of hypercubes required to cover the space occupied by data. The
datasets on which n-tuple classifier performed poorly are printed in bold face. A star denotes
the existence of skewed priors. The DNA dataset had a highly redundant representation of
its attributes and most of the data for Technical was concentrated in one hypercube.
Only the eigenvalues smaller than the edge length were rounded up, i.e., they were dropped from the product. Neglecting to round the others does not affect the order of magnitude of the result.
5 Confidence intervals
The n-tuple system classifies a pattern to the class c that yields maximal
tuple proximity (score) with discriminator Dc. The difference of maximal and
next maximal score (the confidence interval) varies with the pattern. The
mean confidence interval as a function of tuple size n is plotted in Figure 3A.
The n-tuple system with 100 tuples was applied to the classification of several
StatLog datasets. Two of them, Belgian II and Cut20, pose problems to the
classifier (see section 3). The confidence interval increases with n, as higher-
order correlations become available to the classifier. However, there seems to
be no correlation between the size of the confidence interval and the percentage
of correct classifications made.
Figure 3B presents the distribution of confidence intervals for the tuple
size n = 12. It appears that the distributions tend to follow two forms: one
approximately symmetrical, with a very low count of small confidence intervals,
the other asymmetric with a considerable number of small score differences.
The datasets on which the n-tuple classifier scores poorly seem to possess the
symmetrical distribution.
This preliminary data seems to suggest, oddly, that the n-tuple classifier
gives correct classifications with small confidence intervals, whereas mistakes
are made "confidently".
6 Conclusions
A large set of comparative experiments shows that the n-tuple method is highly
competitive with other popular methods, and other neural network methods
in particular, except on data sets of high volume relative to the volumes nat-
urally associated with the n-tuple method. Preliminary results suggest that
confidence interval distributions fall into two categories, and that these are
correlated with classification performance.
Acknowledgement
The authors are grateful to Louis Wehenkel of Université de Liège for useful correspondence and permission to report results on the BelgianI and BelgianII
data sets, Trevor Booth of the Australian CSIRO Division of Forestry for
permission to report results on the Tsetse data set, and Reza Nakhaeizadeh
of Daimler-Benz, Ulm, Germany for permission to report on the Technical,
Cut20 and Cut50 data sets.
Figure 3: A) Average Confidence intervals as a function of tuple size n for several StatLog
datasets. B) distribution of relative confidence intervals for tuple size n = 12.
References
Weightless neural networks have been used in pattern recognition vision systems for
many years. The operation of these networks requires that binary values be produced
from the input data, and the simplest method of achieving this is to generate a logic ' 1'
if a given sample from the input data exceeds some threshold value, and a logic '0' oth-
erwise. If, however, the lighting of the scene being observed changes, then the input
data 'appears' very different. Various methods have been proposed to overcome this
problem, but so far there have been no detailed comparisons of these methods indicating
their relative performance and practicalities. In this chapter the results are given of some
initial tests of the different methods using real world data.
1 Introduction
The use of weightless networks for pattern recognition is well known and the first
commercial hardware neural network vision system, WISARD1, used weightless net-
works. These networks require binary values for forming the 'tuples' used to address
their RAM neurons. In a simple system, the input to the network comprises a set of
boolean values, and the tuples are formed by sampling these values. If however the
input comprises 'grey level' data, the values used to form the tuples must be obtained
by processing the input data in some manner.
The simplest method, 'thresholding', is to generate a logic '1' if the input value exceeds a specified threshold value, and a logic '0' otherwise. One drawback of
thresholding is that the tuples so formed can be very different when the d.c. level of
the input data changes, which might occur in a vision system due to changes in the am-
bient light of a scene as viewed by a video camera.
This problem can be partly reduced by automatic thresholding, that is, by deter-
mining the threshold value of each input data. This can be achieved by automatic con-
trols on a video camera, but if that option is not available suitable processing of the
input data is required. Ideally the threshold value should be the median of the input
data so that half of the data are logic ' 1'. However, calculating the median is relatively
difficult, often involving sorting the data2. Therefore, as a compromise, the mean of
the input data is often used, but this reduces the performance of the system.
One robust method of processing grey level data is to use 'thermometer coding'3.
This involves converting each sampled point of the input into an array of Boolean val-
ues, where the greater the amplitude of the source, the more Boolean values are true.
In a 4-bit thermometer code, the input data are quantised into five values, represented
by 0000, 0001, 0011, 0111 and 1111; similarly, a 16-bit thermometer code quantises
the data into seventeen values. This coding is equivalent to maintaining many thresh-
old systems, each with a different threshold value. A major drawback of the method
is that, for a t-bit thermometer code, t times as much memory is required for the RAM
neurons than for thresholding.
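For illustration, a t-bit thermometer code can be produced as in the sketch below; the quantisation boundaries are an assumption of the example.

```python
def thermometer(value, t=4, max_value=255):
    """Quantise a grey-level `value` into t+1 levels and return a t-bit
    thermometer code: the higher the value, the more bits are set
    (the ordering of the bits is immaterial)."""
    level = min(t, value * (t + 1) // (max_value + 1))   # 0..t
    return [1] * level + [0] * (t - level)

print(thermometer(0))     # [0, 0, 0, 0]
print(thermometer(130))   # [1, 1, 0, 0]
print(thermometer(255))   # [1, 1, 1, 1]
```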
Therefore, a different technique has been developed at the University of Reading,
called Minchinton cells4. These cells are general purpose simple pre-processing ele-
ments which are placed between the input data and the RAM neurons and which can
make the system more tolerant of changes in the d.c. level of the data without the need
for any extra memory and for very little extra processing. The method has been used
successfully, for example, in a hybrid weightless system which attempted to find the
position of eyes within images of faces , and as part of an auto-associative weightless
neural network .
Three types of Minchinton cells have been defined, all of which process values from the non-Boolean input data and produce a Boolean result. In the following, let I be the input data and I[x] be the value at position x within these data; typically x is chosen randomly. Each cell type is defined by the Boolean function it computes on these sampled values.
The first of these cell types is thresholding, which is thus one form of the general Minchinton cell. As thermometer coding can be achieved with multiple thresholds, such coding can also be implemented using Minchinton cells.
The Type 0 cell returns a Boolean true if the value at position x1 in the input data
exceeds that at position x2: this type gives better tolerance to changes in overall light-
ing for the following reason. Suppose the lighting of the input data increases. It is like-
ly therefore, if saturation effects are ignored, that the values at the two positions will
increase by about the same amount, hence the difference between the values will hard-
ly change and so the output of the cell is likely to be unchanged. Any changes in the
cell outputs that do occur can, of course, be compensated for by the generalisation properties of the network.
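The contrast between plain thresholding and a Type 0 cell can be seen in the short sketch below, where a uniform drop in scene brightness flips the thresholded output but leaves the Type 0 output unchanged. The sample positions are fixed here for clarity, although in practice they are chosen randomly.

```python
def threshold_cell(I, x, T):
    """Plain thresholding: logic 1 if the sample at position x exceeds T."""
    return int(I[x] > T)

def minchinton_type0(I, x1, x2):
    """Type 0 Minchinton cell: logic 1 if the value at x1 exceeds that at x2."""
    return int(I[x1] > I[x2])

image  = [40, 90, 60, 120, 30]
darker = [v - 25 for v in image]          # uniform drop in ambient light

print(threshold_cell(image, 1, 80), threshold_cell(darker, 1, 80))    # 1 0 (flips)
print(minchinton_type0(image, 1, 2), minchinton_type0(darker, 1, 2))  # 1 1 (unchanged)
```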
2 The Tests
The aim of the research is to investigate the response of weightless networks in recog-
nising and discriminating different objects which are subject to changes in how they
are illuminated over time. Although software simulations could be used, it was felt
that using real world data would add verisimilitude to the work. As vision systems are
often used on production lines, it was decided to choose some suitable products and
record their appearance under different lighting conditions using a standard video
camera. The products chosen were four disk boxes as they were convenient and of ap-
propriate shape and size.
These boxes were placed under a video camera such that all four were visible at
one time and then the image from the camera was captured every 10 minutes over a
24 hour period and stored for later processing. Each complete image comprised
512x512 8-bit values, so the image of each disk box was of size 256x256 bytes. Then,
when each image was tested, the data for the four disk boxes were separated and proc-
essed by VISIWIN8, a Windows™-based product in which various configurations of
weightless networks can be simulated.
Figure 1 shows four of the images of data used in the tests. Each image comprises
the four diskette boxes, and each image was taken at a different time of the day. As
can be seen from these images, the system is being asked to recognise images subject
to different amounts of illumination and to shadows. In the tests four discriminators
were used: the Fuji MF2HD box was taught in the first discriminator, the 3M box to
the second, the Fuji MF2DD to the third, and the Dysan box to the fourth.
Figure I: Four images, each of four diskette boxes, taken during the 24 hour period
3 Results
A multi discriminator system seeks to recognise and discriminate data. The best meas-
ure for this purpose is relative confidence[9], defined as follows, where the response
of a given discriminator is the number of RAM neurons which report a ' 1' when the
data are presented:
Note, where a network misrecognised the input image, the relative confidence
measure was set to 0.
In the tests done, the three Type 0 Minchinton Cell networks responded almost
identically, as did the three Type 1 Minchinton Cell networks, and the 4-bit thermom-
eter code network was significantly better than the 16-bit thermometer code network.
The graphs below thus show the relative confidence of the four discriminators for the
network using thresholding, the 4-bit thermometer code network, and the Type 0 and
Type 1 Minchinton cells with no oversampling.
Figure 2 shows graphs of Relative Confidence for the four discriminators when
the networks were trained on data at the start of the sequence, that is test A; and figure
3 shows the responses of the discriminators when the network was trained on data
once every hour, that is test B.
4 Discussion
The graphs clearly show that the Type 0 and Type 1 Minchinton cells are better than
thermometer coding which itself is better than thresholding. Usually the Type 0 cell is
better than the Type 1 cell, however, in some of the graphs in test B, the Type 1 cell
performs best. In test B, the network is trained on more representative data: thus it
seems that the Type 0 cell is better at generalisation.
As mentioned earlier, the 4-bit thermometer code has a better relative confidence
than the 16-bit code. This is because the responses of all discriminators using the 16-bit thermometer code were consistently higher than those when the 4-bit code was
used and hence the relative difference between the 'best' and the 'next best' discrim-
inator is smaller.
The period on each graph labelled between 20 and 70 is at night, when the disk
boxes were illuminated very poorly. As one would expect, the performance of all net-
works was reduced. However, Thermometer coding performed well on image 1 be-
cause most of it was dark.
In the period between 100 and 110, significant shadows occurred in the images,
which is the cause of the sharp dips in the performance of all networks. On the A set
of data, the Type 0 cells performed best over this interval; on the B set, the Type 1 cells
were slightly better.
5 Conclusion
These results clearly demonstrate that the best method for pre-processing continuous
data for use by a weightless network is the Type 0 Minchinton cell. In general this
gives higher relative confidence, generalises better and is both more computationally
and memory efficient than the other methods investigated.
References
1 GSN
clamped. Each neuron maps its input values to its output values as described above.
Recognition occurs when the outputs of all pyramids coincide with the desired outputs
for a given stored pattern. Detailed descriptions are given in2.
2 Formalising GSN
The primary idea of the formal framework is to associate pattern identifiers with the
different processes associated with GSN network operations. In such a way, it be-
comes clear which patterns are under consideration by which neurons at any one time
during network operation with the eventual emergence of a single unique pattern in-
dicating successful recognition of that pattern by the network.
Initially, an individual network node is considered. Since there are only two de-
fined values ("0" and "1") which may be output by a GSN, each pattern learnt by the
network will cause one of these two values to be produced. It is thus possible to con-
sider the output of a neuron to be one of two sets (called discrimination sets), one con-
taining all the patterns which cause a "0" to be produced, and the other containing all
the patterns which cause a " 1 " to be produced.
During recognition, a pattern may be presented to the neuron which does not address a location containing either defined value. In this case an undefined value 'u' is produced by the neuron. This is analogous to both discrimination sets being output by
the neuron. Note that this is not the same as any arbitrary pattern being produced by
the neuron. In the next layer of the pyramid the "u" will be interpreted as being either
a "0" or a " 1 " when the addressable set is calculated. This is analogous to using either
of the two discrimination sets of the neuron, not any arbitrary set of patterns. The out-
put of a single neuron may thus be modelled as a set of sets containing one or two el-
ements. The constituent sets will be one or both of the discrimination sets for a neuron.
For a first layer neuron, a pattern presented for recognition will either correspond
exactly with a number of stored patterns in the region under consideration by the neu-
ron (two or more patterns may be identical within a given region, especially if neuron
connectivity is low) or fail to match any stored pattern. The input to a neuron thus may
again be considered to be a (possibly empty) set of patterns. In order to distinguish
which of the two discrimination sets for the neuron should be considered as relevant
to the input set, it is necessary to split each of the discrimination sets into subsets. Each
subset will be analogous to a memory location and the number of elements within a
discrimination set will be equal to the number of memory locations containing the
symbol associated with the discrimination set. The output of the neuron will be the dis-
crimination set (with its elements flattened into a single set) of which the input set of
the neuron is an element. If the input set is not a member of either set, both are output
(a "u" will have been referenced).
Second and subsequent pyramid layers are presented with a set of patterns from
each input neuron. The input set for a given neuron will be the intersection of the out-
put sets of each input neuron. The output is found, as above, by the discrimination set
of which the input set is a member. (To understand this, think of a neuron of connec-
tivity two where one input identifies the pattern as A or B and the other identifies the
pattern as B or C. The input set will be the pattern B.)
A further operation is required if both discrimination sets from a previous layer
neuron have been presented as input (a "u" was output). It is necessary here to form
the cartesian product of the input sets before taking the intersection of each element of
the product—analogous to each alternative address being tried if a "u" exists on the
input. The intersection will now be a set of sets and the discrimination set produced
by the neuron will be the one sharing the most elements in common with the intersection set; this corresponds to the greater of the number of "0"s and "1"s within the addressable set.
This simple framework allows reasoning to be performed on GSN networks sim-
ply using operations on sets of patterns.
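As a small illustration of this, the sketch below reproduces the connectivity-two example above with ordinary Python sets, including the cartesian-product step used when a 'u' is received.

```python
from itertools import product

out_neuron_1 = {"A", "B"}    # first input neuron: pattern is A or B
out_neuron_2 = {"B", "C"}    # second input neuron: pattern is B or C

# A second-layer neuron takes the intersection of its input sets:
print(out_neuron_1 & out_neuron_2)           # -> {'B'}

# If one input produced "u", both of its discrimination sets are passed on;
# the cartesian product of the alternatives is intersected element-wise:
alts_1 = [{"A", "B"}, {"C"}]                 # a "u": either set may apply
alts_2 = [{"B", "C"}]
candidates = [s1 & s2 for s1, s2 in product(alts_1, alts_2)]
print(candidates)                            # -> [{'B'}, {'C'}]
```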
3 Generalisation
The above model allows precise reasoning to be performed about GSN networks.
With some abstraction however, the model may be generalised to allow reasoning
about all RAM-based neural network architectures (for examples see ).
The model consists of a formal mathematical framework for RAM-based neural
network architectures which for the first time has enabled such architectures to be for-
mally analysed and compared. The framework has already proved invaluable both in
giving an intuitive understanding of the operation of existing architectures (something
which often proves opaque when considering neural networks) and in aiding the de-
velopment of new architectures. In the latter case, aspects of the behaviour of a new
network architecture may be accurately predicted without the need for possibly
lengthy architecture construction or simulation.
At the intuitive level, the theory operates by manipulating sets of patterns in such
a way as to associate each possible neuron address location and output with a set of
patterns (either training or target) which the particular contents of the location or out-
put represents (for example, if neuron x produced an output y, the theory would give meaning to symbol y by associating it with, again for example, the pattern classes with which it was consistent, that is, output y means the neuron has 'recognised' the given
set of pattern classes). Each neuron is also associated with discrimination sets which
define sets of patterns between which it is unable to distinguish. The theory further as-
sociates the sets of patterns associated with the memory locations of a neuron with the
discrimination sets for the neuron.
The sets employed by the theory (which also acts to define a RAM-based neural
network) are defined as follows:-
• a set of neurons N, where each n ∈ N is a pair of the form ⟨i, S_n⟩ in which i is a unique natural number and S_n is defined to be:-
• P, the union of the set of training patterns and the set of target patterns.
Note that the set of symbols is expressed by the relationship between memory lo-
cations and discrimination sets—no explicit symbols are employed. Figure 1 provides
a pictorial representation of the structure of the neuron set which should aid under-
standing.
The above structures are employed to represent a network having been presented
with a particular training set (rules for which are not included here). Further sets may
be used to model the behaviour of the network during recognition. A formal set of
rules defining the conditions necessary for membership of the various sets employed
to model recognition given the definition of the network introduced above are now
presented:-
where sample_i is the i'th symbol of the sample pattern sample and OUT is the set of neuron outputs whose elements are pairs of the form ⟨n, {D_1 ... D_x}⟩, where n ∈ N or n ∈ I and D_1 ... D_x are the discrimination sets produced by neuron n as output (x may vary depending on the symbol produced by the neuron; recall that differing symbols may produce a differing number of discrimination sets).
The rule thus merely states that the output of a virtual input neuron is defined by
applying the function IN to the corresponding symbol to the neuron within a sample
pattern presented to the network.
The inputs to a particular neuron n ∈ N within the network may be deduced via the following rule:-
(neuron-input)  ⟨n', n⟩ ∈ C,  ⟨n', {D_{n'1} ... D_{n'x}}⟩ ∈ OUT  ⊢  {D_{n'1} ... D_{n'x}} ∈ IN_n
where IN_n is the set of output discrimination sets of neurons providing inputs to neuron n.
The address set A_n of a neuron n ∈ N may be deduced by the following rule using the architecture dependent function ADDRESS:-
The relationship between the outputs of neuron n ∈ N and the address set A_n may be deduced by:-
where the partial-solution set represents the set of partial solutions produced by neuron n for each a ∈ A_n (to understand this rule, compare with calculating set intersections in GSN).
The total (or complete) output for a neuron n ∈ N may be deduced from the partial solutions using the definition of the COMPRESS function as follows:-
The rule states that the total output is found by applying COMPRESS to the set of partial outputs. Note that any existing output ⟨n, o⟩ for the neuron must be ignored. The set OUT may only contain one element for each neuron.
The network outputs may be deduced from the set of outputs OUT by looking at the inputs to the virtual output neurons i ∈ O.
The deduction of the sets of patterns above allows proofs about the behaviours of
given networks to be constructed. The sets also help to give a more intuitive under-
standing of the meaning of various network outputs.
4 Conclusions
A framework has been introduced to allow reasoning about RAM-based neural net-
works. The framework allows:-
Note that the framework abstracts away from many architecture features. For ex-
ample, whether neurons are arranged in regular structures such as layers or pyramids
plays no part in the framework. All information regarding neuron connections is con-
tained within the set C whether these connections form a regular structure or not.
References
It is shown that it is simple to perform a cross-validation test on a training set when us-
ing RAM based Neural Networks. This is relevant for measuring the network generali-
sation capability (robustness). An information measure combining an extended concept
of cross-validation with Shannon information is proposed. We describe how this meas-
ure can be used to select the input connections of the network. The task of recognising
handwritten digits is used to demonstrate the capability of the selection strategy.
1 Introduction
When an artificial neural network (NN) is being trained to a given task, one would like
to achieve a robust network in the sense that the performance of the network is rela-
tively unaffected by redrawing a specific example from the training set. If one counts
the number of examples in the training set that can be correctly classified by training
the network on the remaining examples, it corresponds to the so-called leave-one-out
cross-validation (CV) test1. The error rate obtained with a cross-validation procedure
on the training set is a nearly unbiased estimator of the test error. Accordingly, as the
size of the training set increases the mean value of the cross-validation error will ap-
proach the mean of the test error. It follows that when optimising a neural network ar-
chitecture one could try to optimise the leave-one-out cross-validation error. It is
however not simple to perform a leave-one-out cross-validation test on a conventional
feed-forward NN architecture, as it is computationally expensive due to replicated
training sessions3.
If one uses a RAM-based NN4,5, it is straightforward, as we illustrate in this chapter, to perform a leave-one-out cross-validation test. By selecting input connections
based on CV error one obtains a network with a better generalisation performance than
a network of the same size trained with random connections. Instead of counting the
number of misclassifications in a CV test a more informative measure is the average
information (in the Shannon sense) obtained per example by a leave-one-out crossval-
idating test. We describe how it is possible to calculate the average mutual information
gained from the network in a leave-one-out CV test.
Often there will exist severe differences between the distribution of the training
set and the "real" distribution, especially when the training set size is small. In such
situations the CV error can be a poor estimator of the test error. A potential problem
with a leave-one-out crossvalidation test is the following. If each pattern of the train-
ing set occurs more than once, the CV-error for a RAM based NN will be 0 (ignoring
ambiguous decisions). Since it is very possible for many recognition tasks that specific
examples occur more than once, it is necessary to deal with this problem. We have
therefore extended the above mentioned information measure to deal with a more ro-
bust cross-validation concept. We have denoted the information measure cogentropy
(as it COmputes the GENeralisation capability by use of enTROPY). We show that
use of the cogentropy measure solves the problem of having multiple occurrences of
the same example in the training set.
In order to evaluate the applicability of the cogentropy measure we have tested
its use on the task of recognising hand-written digits.
The RAM based neural network can be considered as a number of Look Up Tables
(LUTs). This is illustrated in Fig. 1. Each LUT probes a subset of the binary input data.
In the conventional scheme the bits to be used are selected at random. The sampled bit
sequence is used to construct an address. This address corresponds to a specific entry
(column) in the LUT. The number of rows in the LUT corresponds to the number of
possible classes. For each class the output can take on the values 0 or 1. A value of 1
corresponds to a vote on that specific class. The output vectors from all LUTs are add-
ed, and subsequently a winner takes all decision is made to perform the classification.
In order to perform a simple training of the network the output values are initially set
to 0. For each example in the training set the following steps are then carried out:
• Present the input and the target class to the network.
• For all LUTs calculate their corresponding column entries.
• Set the output value of the target class to 1 in all the "active" columns.
By use of this training strategy it is guaranteed that each training pattern always
obtains the maximum number of votes. As a result the network makes no misclassifi-
cation on the training set (ambiguous decisions might occur).
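A minimal sketch of this LUT structure and training scheme follows (random input connections as in the conventional scheme; the visit counts are stored per column so that the cross-validation test of the next section can reuse them; all names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

class LUTNet:
    """RAM net as a set of Look Up Tables.  Each LUT samples a fixed random
    subset of the input bits; each column stores, per class, how many
    training examples addressed it (the counts n_jm used later for CV)."""

    def __init__(self, n_inputs, n_classes, n_luts, n_bits_per_lut):
        self.conn = [rng.choice(n_inputs, n_bits_per_lut, replace=False)
                     for _ in range(n_luts)]
        self.counts = [np.zeros((n_classes, 2 ** n_bits_per_lut), dtype=int)
                       for _ in range(n_luts)]

    def _address(self, x, lut):
        bits = x[self.conn[lut]]                       # sampled bit sequence
        return int("".join(map(str, bits)), 2)         # column entry a_m

    def train(self, X, y):
        for x, c in zip(X, y):
            for lut in range(len(self.conn)):
                self.counts[lut][c, self._address(x, lut)] += 1

    def classify(self, x):
        votes = sum((self.counts[lut][:, self._address(x, lut)] > 0).astype(int)
                    for lut in range(len(self.conn)))
        return int(np.argmax(votes))                   # winner takes all

X = rng.integers(0, 2, size=(20, 16))                  # toy binary data
y = rng.integers(0, 3, size=20)
net = LUTNet(n_inputs=16, n_classes=3, n_luts=8, n_bits_per_lut=4)
net.train(X, y)
print(net.classify(X[0]))   # the true class always obtains the maximum number of votes
```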
Figure 1: Illustration of the RAM-based neural network structure. The number of training examples from class c_j that generate the address a_m is denoted n_jm. If n_jm ≥ 1, the output value o_j is set to 1, else 0.
2.2 Cross-validation
The idea of a cross-validating procedure is to test whether each of the training patterns
can be learned by the NN if it is trained on the remaining patterns. During the normal
recall process of the LUT contents one obtains an output value of 1 if the correspond-
ing column entry is visited by one or more training patterns. In order to perform the
CV-test it is necessary to know the actual number, n_jm, of training examples that have visited the cell corresponding to column a_m and class c_j. Accordingly, these numbers
are stored in the LUTs (Fig. 1). A CV-test is now performed in the following manner:
• If (n_jm > 1 and j = k) or (n_jm > 0 and j ≠ k), then set the output value o_j of class c_j to 1, where k denotes the target class of the example under test.
• If the class obtaining most votes differs from the target class, then increment the CV-error rate (a code sketch of this test is given after this list).
• The best LUT (low CV-error) is chosen as the first LUT (LUT 0) of the network.
• The best remaining LUT (low CV-error in combination with LUT 0) is chosen as the second LUT (LUT 1) of the network.
When selecting the input connections for the LUT candidates it can often be ad-
vantageous to restrict the possible connections in different ways. One way is to force
different LUTs to pick connections from subspaces obtained from principal compo-
nent analysis.
The advantage of the above described selection procedure is that one obtains a
network with a better generalisation performance than a network of the same size
trained without a selection scheme. In addition the use of a cross-validation procedure
makes it possible to estimate the number of input connections to be used per LUT. The
determination of the number of input connections corresponds to evaluating the ap-
propriate number of hidden units in a conventional feed-forward architecture. This
stems from the fact that the number of LUT columns increases with the number of in-
puts, so that the number of columns with non-zero values (equivalent to hidden units)
increases too. For a conventional feed-forward net it is quite difficult, on the basis of the training examples, to determine the number of hidden units to use. As a leave-one-out
cross-validation test is easy to perform on RAM based nets the corresponding task be-
comes much simpler for these architectures.
A problem with a leave-one-out CV-test is the following. If each pattern of the
training set occurs more than once the CV-error for the RAM based NN will be 0 (ig-
noring ambiguous decisions). Since it is very possible for many recognition tasks that
specific examples occur more than once, it is preferable to deal with this problem. The
cogentropy-measure described in the following section is able to handle this problem.
In the following we will describe how the average mutual information gained from the
network in a leave-one-out CV-test can be calculated. Then we extend the measure to
deal with a more robust cross-validation concept. We have denoted the extended
measure cogentropy (as it COmputes the GENeralisation capability by use of the in-
formation enTROPY).
The mutual information I(a_m, c_j) between the column entry a_m and the class c_j is defined as follows:

I(a_m, c_j) = \log \frac{p(c_j \mid a_m)}{p(c_j)}    (1)
where p(c_j | a_m) is the probability that an object generating the column entry a_m belongs to the class c_j, and p(c_j) is the probability that an object belongs to class c_j. In order to estimate these probabilities it is tempting to use the feature distribution of the training set:

p(c_k \mid a_m) = \frac{n_{km}}{\sum_j n_{jm}} ,    p(c_k) = \frac{\sum_m n_{km}}{\sum_{j,m} n_{jm}}    (2)
However, for these estimates to be reliable a large and representative training set is required. Instead, the following two arguments are used:
• If a given feature exists in k out of the P classes in the training set, then there is
certainly a probability greater than zero for this feature to exist in each of
these k classes. However, since the actual distribution obtained on the
training set cannot be associated with the general distribution, we attribute
to each of the k classes an equal probability of having the actual feature
(rectangular distribution). In this way we do not a priori favour any of the k
classes with respect to this specific feature.
• If a given feature does not exist in a specific class in the training set, then
the possibility exists that this feature never appears for objects belonging
to this class. In any case, we have no instance in support of claiming that
there is a probability larger than zero for objects belonging to this class to
possess the given feature. Accordingly, we associate a small probability e
(e « 1) for these cases.
These two arguments explain why it is sensible to let the LUT cell output values, o_j, be either 0 (ε) or 1. According to these arguments the estimate of p(c_j | a_m) becomes:

p(c_j \mid a_m) = \frac{Q(n_{jm})}{\sum_k Q(n_{km})}  if \sum_k n_{km} > 0 ,    p(c_j \mid a_m) = \varepsilon  if \sum_k n_{km} = 0    (3)

where

Q(x) = 1 if x \geq 1 ,    Q(x) = 0 if x = 0    (4)

and p(c_j) = \frac{1}{P}, where P is the number of classes.
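A short sketch of these estimates follows, computing Eq. (3), Eq. (4) and the mutual information of Eq. (1) from a single column of stored counts. The value of ε, and the replacement of zero estimates by ε in line with the "0 (ε) or 1" interpretation above, are assumptions of the example; the names are illustrative.

```python
import numpy as np

def q(x):
    """Q(x) of Eq. (4): 1 if the cell has been visited at least once, else 0."""
    return 1 if x >= 1 else 0

def p_class_given_cell(counts_col, j, eps=1e-6):
    """Estimate p(c_j | a_m) as in Eq. (3) from the column of counts n_km."""
    if counts_col.sum() > 0:
        return q(counts_col[j]) / sum(q(n) for n in counts_col)
    return eps

def mutual_information(counts_col, j, n_classes, eps=1e-6):
    """I(a_m, c_j) of Eq. (1) with the uniform prior p(c_j) = 1/P; zero
    estimates are replaced by eps, as discussed in the text."""
    p_cond = max(p_class_given_cell(counts_col, j, eps), eps)
    return np.log(p_cond / (1.0 / n_classes))

col = np.array([3, 0, 1])               # visits to one cell from classes 0..2
print(mutual_information(col, 0, 3))    # feature seen in 2 of 3 classes: log(3/2)
print(mutual_information(col, 1, 3))    # never seen for class 1: log(3*eps) < 0
```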
Consider now the case where an example has been extracted from the training set.
By use of Eq. (1) we want to calculate the amount of information that can be obtained
about this specific example by using a specific LUT. Two cases must be considered:
• njm > 1 before extraction: In this situation the values of the probabilities
defined in Eq. (3) do not change by extracting the considered example.
By extending the above ideas to deal with combination of LUTs one can calculate
the average information contained in the network about a training example. The de-
tails of this will be described elsewhere.
In order to increase the robustness of our information measure we now consider
a generalised type of cross-validation. Let r denote the information gained about a
specific example by making a winner takes all classification. Now for each example
one can calculate what we denote the critical number of examples. The critical number
of examples is defined as the smallest number of examples (belonging to the same
class as the training example in question) that (by an intelligent selection) must be re-
drawn from the training distribution in order for the classification to become false or
ambiguous. The critical number, ncrit, is a measure of the robustness of the classifica-
tion. The example information contained in the network after removal of the critical
examples is denoted Ia. The information loss obtained by extracting the critical exam-
ples from the training set can be expressed as:
I_{loss} = I_b - I_a    (5)

where I_b denotes the corresponding information contained in the network before the critical examples are removed. The average loss per critical example is

I_{averageloss} = \frac{I_{loss}}{n_{crit}}  if n_{crit} > 0 ,    I_{averageloss} = 0  if n_{crit} = 0    (6)
The generalised cross-validation information measure is then defined as

I_{genCV} = \frac{I_a}{n_{crit}} + I_b\left(1 - \frac{1}{n_{crit}}\right) = I_b - I_{averageloss}  if n_{crit} > 0 ,    I_{genCV} = I_a  if n_{crit} = 0    (7)
3 Results
In order to illustrate the capabilities of the selection strategies we have chosen the task
of recognising hand-written digits. The digits used are from the NIST database . The
digits were scaled into a 16x16 format and centred.
The appropriate number of input connections per LUT can be determined by performing a leave-one-out cross-validation test on the training set. A system containing 50 LUTs is trained with an increasing number of input connections per LUT. For each
case the cross-validation error is calculated. The minimum of the resulting perform-
ance curve is normally rather flat (see Fig. 2) allowing the number of input connec-
tions to vary within this range.
Several RAM nets were composed by selecting LUT connections based on the
cogentropy measure or the cross-validation error rate. Table 1 compares these results
with the results obtained by picking all connections at random. The decrease in error
rate by using the cogentropy is also noted. The decrease from 4.0% to 3.5% might
seem small, but it is actually the small percentages of errors that are difficult to avoid.
Table 1 shows that the use of the cogentropy measure reduces the required
number of LUTs considerably for a given performance demand. It is seen that 100
LUTs selected by using the cogentropy measure performs better than 807 LUTs cho-
sen randomly (507 of these LUTs were restricted to pick their connections in varying
local receptive fields of size 4 x 4). As the recall time and the memory requirements
are proportional to the number of LUTs in the system this would speed up the recall
time of the system by a factor of 8 and, furthermore, reduce the memory requirements by a factor of 8.
Figure 2: Variation of cross-validation error with the number of input bits sampled per LUT. The RAM net consisted of 50 LUTs and was trained on 500 handwritten digits.
Figure 3: Cumulative distribution of n_crit among 14095 training examples from the NIST database.
4 Conclusion
We have shown that it is simple to perform a leave-one-out cross-validation test on a
RAM based network. We have also introduced an information measure (denoted the
cogentropy) that incorporates a generalised crossvalidating test. The information
measure is used to evaluate the quality of a given combination of LUTs. The tech-
niques were tested on the task of recognising hand-written digits. The results show
that by using the information measure it is possible to obtain a considerable reduction in the number of Look Up Tables required for a given performance demand. Furthermore, the use of cross-validation makes it possible to determine the number of input connections to use per LUT.
References
1. M. Stone, J. of Royal Stat. Soc. B, 36, 111 (1974).
2. B. Efron, J. Amer. Statist. Assoc., 78, 1 (1968).
3. L.K. Hansen and J. Larsen, Advances in Computational Mathematics, 5, 269 (1996).
4. I. Aleksander and T.J. Stonham, Computers and Digital Techniques, 2, 29 (1979).
5. I. Aleksander, An Introduction to Neural Computing (Chapman and Hall, London, 1990).
6. J. Moody, in Proceedings of the First IEEE Workshop on Neural Networks for Signal Processing, ed. B.H. Juang and S.Y. Kung (Piscataway, New Jersey, 1991).
7. See e.g. F.T.S. Yu, Optics and Information Theory (John Wiley and Sons, New York, 1976).
8. T.M. Jorgensen, "Information Processing in Optical Measuring Systems," Ph.D. thesis (Optics and Fluid Dynamics Department, Risø National Laboratory, 1992).
9. National Institute of Standards and Technology: NIST Special Data Base 3, Handwritten Segmented Characters of Binary Images, HWSC Rel. 4-1.1 (1992).
10. T.M. Jorgensen, "Optimisation of RAM nets using inhibition between classes" (accompanying chapter).
A MODULAR APPROACH TO STORAGE CAPACITY
P. J. L. ADEODATO
Laboratório de Computação Inteligente, Departamento de Informática,
Universidade Federal de Pernambuco, Cx. Postal 7851, 50 732-970,
Recife - PE, Brazil
J. G. TAYLOR
Centre for Neural Networks, Department of Mathematics
King's College London, The Strand WC2R 2LS,
London, United Kingdom
1 Introduction
Here we answer the question in the context of binary input and binary out-
put spaces with discriminant functions implementable on RAM-based neural
networks.
Wong and Sherrington1,6 analysed the storage capacity of large and dilute (n ≫ 1 and n ≪ log N) general neural units (GNUs). Penny and Stonham2
2 Context and Definitions
This work deals with multidimensional binary input and 1-dimensional binary
output spaces. The neurons have permanent connections in the network. Input
vectors (input patterns) address memory positions (sites) to store or retrieve
(train or recall) the desired response (target class). Training is supervised.
The neurons have no temporal processing power. The 2^n combinations of the n input variables connected to the n input terminals (fan-in) of the neuron address its 2^n sites. Take X as the discrete sample space named input space, which consists of the values {x^1, x^2, ..., x^{2^n}}. Take Y as the binary sample space target class, which consists of the values y, for class-1, and ȳ, for class-2.
The training set consists of m training examples (loosely referred to as patterns), (x_i, y_i), each drawn from the stationary joint probability distribution P{X = x, Y = y} = p(x, y). Take s^m = ((x_1, y_1), ..., (x_m, y_m)) to denote a training data set. Each training set is a sample (s^m) of our compound sample space (S^m = {X × Y}^m) of m examples. Considering that the examples are drawn independently,

p((x_i, y_i), (x_j, y_j)) = p(x_i, y_i)\, p(x_j, y_j)   for i \neq j,  i, j \in \mathbb{N}    (1)
For the single neuron, let C be the binary random variable named collision, which assumes the value c when there is at least one pair of training examples ((x_i, y_i), (x_j, y_j)) with x_i = x_j and y_i ≠ y_j, and the value c̄ otherwise. The value c̄ indicates error-free storage. This definition of collision is the same as that of an inconsistent training set, although it varies according to the neuron model or the network. For a network, the definition of collision is more complex as it may involve more than a pair of training examples. In broad terms, it takes the value c when the network is unable to store (learn) any of the mappings f : X → Y which fit all examples from the training set.
The term probability of collision measures the risk of disruption of informa-
tion whereas probability of error-free storage measures the level of confidence
in the storage process. Storage capacity is the maximum number of training
examples (m) randomly drawn from a uniform distribution that can be stored
in a neuron or network at a given risk of disruption of information (probability
of collision).
For uniformly distributed examples, the probability of error-free storage (p(c̄)) is defined as the ratio between the number of training sets correctly stored in the network and the total number of training sets of size m:

p(c̄) = (number of training sets correctly stored) / (total number of training sets of size m)   (2)

Using the symbol "| · |" to denote the cardinality of the set "·", we can start substituting values in Equation 2. The denominator is simply (|X| |Y|)^m, the number of elements of the sample space S^m = {X × Y}^m. Omitting the functional dependence of p(c̄) and of the numerator on |X| and m, and considering |Y| = 2 for the binary output space, the previous equation can be re-written as:

p(c̄) = (number of training sets correctly stored) / (2|X|)^m   (3)
This definition of p(c̄) is valid both for single neurons (where |X| = 2^n) and for networks (where |X| = 2^N) in multidimensional binary input and 1-dimensional binary output spaces. Despite its generality, this equation is only useful for the single neuron, since the numerator is too complex to be calculated for each architecture. Hence, the next section presents the model for the single neuron and the following sections show how to use that result to express p(c̄_net) = f[p(c̄_neu)].
This section presents the theoretical results for assessing the probability of error-free storage and the storage capacity of the single neuron. Emphasis is given to concepts, since the formalism has been presented elsewhere.¹⁰
Returning to Equation 3, its numerator is simply the number of consistent training data sets (|S_c̄|). Therefore Equation 4 gives the probability of error-free storage (p(c̄)) for the single neuron. The calculation is still very complex. The next subsections present the exact solution for p(c̄), an approximation (p_a(c̄)) with tight error bounds, and the storage capacity of the neuron model.
p(c̄) = |S_c̄(m, |X|)| / (2|X|)^m,   for m ∈ ℕ   (4)
"In how many ways can we configure the m patterns in a consistent fashion?"
The exact value for |S c (m, |X|)| is reached by a recursive method. Even elim-
inating recursion, the expressions remained very complex. Equations 5 and 6
are 2 non-recursive compact forms for expressing p(c); one as a series along
the input space dimension (|X|) and the other along the training set size di-
mension (m). The full derivation is available in Adeodato and Taylor10.
We will refer to this as the exact model since it was developed without
any approximation. There are two drawbacks in this model however. First,
Equations 5 and 6 are expensive to compute, even for n < 5. Second, they are
not invertible for expressing the storage capacity explicitly (m = f\p(c),n]).
p(c̄) = [ Σ_{j=1}^{|X|} (−1)^(|X|−j) C(|X|, j) 2^j j^m ] / (2|X|)^m,   for m ∈ ℕ   (5)
p(c̄) = [ Σ_{i=1}^{m} C(|X|, i) 2^i i! S(m, i) ] / (2|X|)^m,   for m ∈ ℕ   (6)

where C(·, ·) denotes the binomial coefficient and S(m, i) the Stirling number of the second kind.
p(c̄₂) = 1 − (2|X|)^(−1),   where |X| = 2^n for RAM-based neurons   (7)
Figure 1a shows the curves for the approximate and the exact models. The fit is striking, particularly for high fan-in. In fact, the approximate model is a first-order approximation of p(c̄). Taking Δp(c̄) = p(c̄) − p_a(c̄) as the approximation error, its maximum value is bounded¹⁰ by 0.125/|X|, an exponential decrease with the fan-in. Figure 1b shows the error curves: non-negative, with a single maximum which is less than 1% for n > 4. For these error bounds,
p_a(c̄) ≤ p(c̄) ≤ p_a(c̄) + 1/(8|X|)   (9)
Figure 1: (a) Superimposition of the probability of error-free storage for the exact and the approximate models against the training set size (m) for n = 1, 2, 3, 4. (b) Error (Δp(c̄)) between the models.
Equations 7 and 8 are invertible, hence allowing the explicit expression of the storage capacity (m) as a function of the probability of error-free storage and the fan-in (m = f[p(c̄), n]). The bounds defined by Equation 9 can be converted into bounds for the maximum number (m) of storable training examples. This number (m) represents the storage capacity of the neuron for a given confidence level (p(c̄)); a standard measure in designing learning systems.
1/2 + (1/2) √( 1 + 8 ln[p(c̄)] / ln[p(c̄₂)] )  ≤  m  ≤  1/2 + (1/2) √( 1 + 8 ln[p(c̄) − Δp(c̄)] / ln[p(c̄₂)] )   (10)
For a RAM-based neuron, Equation 10 defines the interval for the maximum number (m) of storable training examples.
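To make the quantities above concrete, the following is a minimal numerical sketch in Python, assuming the reconstructed forms of Equations 5, 7, 9 and 10; the function names and the use of p(c̄₂) as the pairwise error-free probability are illustrative assumptions, not code from the chapter.

```python
# Sketch of the single-neuron storage model (illustrative; not the authors' code).
from math import comb, log, sqrt

def p_exact(m, n):
    """Exact probability of error-free storage for a RAM neuron, per Eq. 5, with |X| = 2**n."""
    X = 2 ** n
    num = sum((-1) ** (X - j) * comb(X, j) * (2 ** j) * (j ** m) for j in range(1, X + 1))
    return num / (2 * X) ** m

def p_approx(m, n):
    """First-order approximation: no collision among the m(m-1)/2 example pairs (Eq. 7)."""
    p_c2 = 1.0 - 1.0 / (2 * 2 ** n)           # pairwise error-free probability, Eq. 7
    return p_c2 ** (m * (m - 1) / 2)

def capacity_bounds(p_target, n):
    """Interval for the storage capacity m at confidence level p_target (Eq. 10)."""
    X = 2 ** n
    p_c2 = 1.0 - 1.0 / (2 * X)
    dp = 0.125 / X                             # bound on the approximation error (Eq. 9)
    lo = 0.5 + 0.5 * sqrt(1 + 8 * log(p_target) / log(p_c2))
    hi = 0.5 + 0.5 * sqrt(1 + 8 * log(p_target - dp) / log(p_c2))
    return lo, hi

print(p_exact(5, 4), p_approx(5, 4))   # the two models agree closely for moderate fan-in
print(capacity_bounds(0.5, 4))         # storage capacity interval at 50% confidence
```

By construction, the exact and approximate values returned above differ by no more than 0.125/|X|, in line with the error bound quoted in Equation 9.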
p(c̄_gnu) = p(c̄_neu)^N   (12)

p(c̄_gnu) = ( n/N + (1 − n/N) p(c̄) )^N   (13)
The inversion of Equation 13 gives the explicit expression for the storage
capacity of the GNU (Equation 14) which is substituted in the storage capacity
Figure 2: Probability of collision in an 8-dimensional GNU for varying fan-in and number of stored patterns, checked against Braga's actual data.³
Figure 3: log₂[storage capacity] × fan-in (n) for a confidence level of 50% (p(c̄) = 0.5) for several models. (a) 8-dimensional GNU with our model's interval (c0) and Braga's model (c1). (b) 50-dimensional GNU with our model's interval (c0), Braga's model⁷ (c1), Wong and Sherrington's model (c2) and Adeodato and Taylor's old models⁴ (c3).
The probability of collision in the pyramid (p(c_pyr)) is the effect of neuron collisions propagated from the input to the output layer (L). For uniform pyramids (equal neurons), it simplifies to:
Inverting Equation 16 we get the explicit Equation 17 for the storage capacity of the pyramid as a function of the desired p(c_pyr), which is then substituted in the storage capacity of the neuron (m, in Section 3). With cost measured as the total amount of memory in the network, Figure 5b presents the ratio of log₂[storage capacity] to log₂[cost] as a function of p(c_pyr). The reduction in storage capacity is bigger than the reduction in cost when we reduce the fan-in of the neurons. This "extra" reduction is caused by the multiplicity of representation of the functions, explained by Al-alawi and Stonham.¹² This is another result coherent with previous architecture-specific techniques.
p(c) = 2^( log₂ p(c_pyr) / (n(L−1) + 1) )   (17)

p(c_pyr) = ( p(c_gnupyr) )^(1/N)   (19)
7 Conclusions
References
M. MORCINIEC*, R. ROHWER§
Neural Computing Research Group,
Aston University, Birmingham, B4 7ET, UK
1 Introduction
The frequentist n-tuple system can be obtained from the original, binary system by setting the tally truncation threshold θ to ∞ instead of the more usual 1. This allows one to use full tallies to estimate low-order conditional feature densities and apply a Bayesian framework to the classification problem.
Given p(c|a), the probability of class c conditioned on feature vector a (the set of all memory locations addressed by an unknown pattern), optimal classification results can be obtained by assigning the unknown pattern to the most probable class. Because estimates of conditional feature densities arise naturally in the n-tuple system, Bayes' rule is applied to obtain class probabilities. The likelihood and evidence for the full feature vector are impossible to compute directly, but these can be estimated from low-order densities using independence assumptions. The most common approach assumes that the p(a_i|c) as well as the p(a_i) are independent*, where a_i is the address of the pattern in n-tuple i. The conditional class density can then be approximated by
p(c|a) ≈ p(c) ∏_i [ p(a_i|c) / p(a_i) ]   (1)
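The following sketch shows how Equation 1 might be applied for classification once tallies have been accumulated; the array layout, the priors argument and the small constant added to avoid zero tallies are all assumptions made for illustration.

```python
# Sketch of frequentist n-tuple classification via Eq. 1 (illustrative only).
import numpy as np

def classify(tallies, addresses, priors, eps=1e-12):
    """tallies[i, a, c]: number of class-c training patterns producing address a in tuple i.
    addresses[i]: address produced by the unknown pattern in tuple i.
    Returns the class maximising p(c) * prod_i p(a_i|c) / p(a_i)."""
    log_score = np.log(np.asarray(priors, dtype=float) + eps)
    for i, a in enumerate(addresses):
        per_class = tallies[i].sum(axis=0) + eps                 # training patterns per class seen by tuple i
        p_a_given_c = (tallies[i, a, :] + eps) / per_class       # estimate of p(a_i | c)
        p_a = (tallies[i, a, :].sum() + eps) / per_class.sum()   # estimate of p(a_i)
        log_score += np.log(p_a_given_c) - np.log(p_a)           # accumulate Eq. 1 in log space
    return int(np.argmax(log_score))
```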
* Current address: Hewlett-Packard Labs, Filton Rd., Stoke Gifford, Bristol BS12 6QZ, UK
§ Current address: Prediction Company, 320 Aztec St., Suite B, Santa Fe, NM, 87510, USA, email: [email protected]
It often goes unnoticed that it turns out to be highly restrictive to demand both of these condi-
tions together, a difficulty we presume to be dwarfed by the inaccuracy of each
assumption individually.
However implausible this assumption may appear, there have been reports of reasonable results obtained with this method³,². The major advantage of frequentist systems is that they do not suffer from saturation. This makes them superior for small n-tuple sizes n, but the advantage tends to disappear as n is increased, due to worsening probability estimates based on diminishing tallies in each of the increasingly numerous memory locations¹². It would be desirable to modify the frequentist system in such a way as to retain its robustness for any tuple size n.
The maximum likelihood estimate (MLE) has been routinely³,¹¹,¹³ applied for the frequentist n-tuple system. In this approach, the estimate p̂ of the true probability p of an event is approximated as the ratio of the event's tally r to the sample size N: p̂ = r/N. Under the assumption that each tally value is binomially distributed (with unknown probability p that the feature is present in a pattern of class c and 1 − p that it is not), the ratio r/N is the maximum likelihood estimate of p. The uncertainty of the tally can be defined as its standard deviation, which can be estimated as

δr = √(N p(1 − p)) ≈ √(N p̂(1 − p̂)) = √(r(1 − p̂))   (2)

Since p̂ is typically small,

δr ≈ √r   (3)
Equation 3 shows that the accuracy of the MLE is limited for events with small tallies. The relative tally uncertainty δr/r grows with diminishing tallies and becomes undefined for a zero tally.
It should be noted that the fact that a tally r = 0 for some event doesn't imply that the probability of the event is also zero. It merely states that the event has not taken place in a finite sample of size N. This problem is known in the literature¹⁴ as the "Zero Frequency Problem". Various unprincipled, ad hoc techniques exist which try to rectify it. The most common one is to add an arbitrary small constant to each zero tally. However, the choice of a particular constant is difficult to justify formally. We make some experimental observations concerning this Maximum Likelihood system with zero tally correction (MLZ) in section 5.
N = Σ_{r≥1} r n_r   (4)
4 Smoothing GTEs
The major problem with the Good-Turing theorem is that the distribution {n₁, n₂, n₃, ...} tends to be sparse and requires smoothing. Moreover, for large values of r there are "gaps" in the distribution of n_r. This suggests that we should average a non-zero n_r value with the zero n_r values surrounding it. We use the transform proposed by Church and Gale⁵

z_r = 2 n_r / (t − q)   (6)
where q, r, t are successive indices of non-zero n_r. Averaging occurs for larger values of r only, because if there are no "gaps" the transformation has no effect.
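A sketch of the averaging transform of Equation 6 is given below, with n_r supplied as a mapping from tally value to the number of features having that tally; the boundary conventions (q = 0 before the first non-zero index, t extrapolated after the last) are assumptions, as are the names used.

```python
# Sketch of the z_r averaging transform (Eq. 6); illustrative only.
def averaged_zr(n_r):
    """n_r: dict {tally r: number of features with tally r}. Returns z_r for the non-zero entries."""
    rs = sorted(r for r, count in n_r.items() if count > 0)
    z = {}
    for k, r in enumerate(rs):
        q = rs[k - 1] if k > 0 else 0                    # previous non-zero index (assumed 0 at the start)
        t = rs[k + 1] if k + 1 < len(rs) else 2 * r - q  # next non-zero index (assumed extrapolation at the end)
        z[r] = 2.0 * n_r[r] / (t - q)                    # z_r = 2 n_r / (t - q)
    return z

# Where there are no gaps, t - q = 2 and z_r = n_r, so the transform has no effect, as noted above.
```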
After averaging we still have to smooth the z_r. This is accomplished by fitting a log polynomial onto the data. Unlike Church and Gale, who used a polynomial of order one (a straight line), we found that polynomials of higher
σ²(r^GT) ≈ (r + 1)² (n_{r+1} / n_r²) (1 + n_{r+1} / n_r)   (8)
The probability estimates computed using the corrected tallies have to be normalised because two different methods (GTE and SGTE) of estimation are employed. We compute the normalised probability p_norm for the tally r from the unnormalised probabilities p*.
We used several real-world datasets which have been used in the European Community StatLog project¹⁰. The attributes of the data are in most cases real-valued, and pre-processing techniques have been applied¹,⁹,⁸ to provide binary input for the classifier.
In order to obtain probabilities p(a_i|c) normalised within a tuple node one would have to apply Good-Turing estimation for each tuple in each class c separately. This is hardly possible because the distribution n_r is very sparse, especially for small tuple sizes n. Therefore, the estimation has been carried
Figure 2: A) Relative adjusted tally r* for class 0 of the "tsetse" dataset. B) Illustration of the smoothed tally r' combined from the tallies r^GT and r^SGT. The error bars on r^GT are 1.65 × σ(r^GT). The switch from GTE to SGTE takes place at r = 3.
out collectively for all T tuples within a class c, i.e., for the population of T·2^n features. Consequently, the probabilities p(a_i|c) are normalised within each discriminator c, and each zero tally is smoothed by the same amount regardless of the tuple which generated it.
Figure 2A shows the relative composite smoothed tally r'/r computed for the first discriminator of the n-tuple system trained on the "tsetse" dataset¹⁰.
Figure 3: A) Performance of binary and frequentist systems with 100 tuples on the "tsetse" test set. The test set comprises 1499 patterns. Systems were trained using 3500 training samples. The error bars are of size one standard deviation computed for 10 random tuple mappings. B) Performance of the frequentist Good-Turing system compared with the MLZ system using zero tally correction ε = 10⁻¹⁵⁰.
tuple systems against each other. The benchmarking studies were carried out for several StatLog databases¹⁰. Figure 3A shows a representative plot for a run on the "tsetse" database. Both frequentist systems perform better than the binary version for small values of n, because they do not suffer from the saturation effect. Unlike the frequentist system with MLE, the GT version retains its performance with increasing n. However, it eventually becomes inferior to the binary system. It seems that for n large enough any technique other than zero tally counting (which is equivalent to setting the tally truncation threshold θ to one) is less effective.
We also compared the performance of the GT system to that of MLZ, which is technically an ML system with zero tallies substituted by an arbitrarily chosen constant ε. Preliminary experimental results plotted in Figure 3B suggest that if ε is small enough then MLZ will outperform the GT system, especially for large n. This can be explained by observing that MLZ with ε → 0 will make exactly the same classification decision as the binary system, except for the patterns that are tied (have the same score) in the binary version. For large n, the saturation is very low, as is the probability of a tie. Consequently, the performance of MLZ must be equal to the performance of a binary system within a margin of D_tied/D, where D is the test set size and D_tied the number of tied patterns.
6 Conclusions
Acknowledgements
The authors are grateful to Trevor Booth of the Australian CSIRO Division of
Forestry for permission to report results on the Tsetse data set, and William
Gale of AT&T Bell Laboratories for private communication.
References
R. NEVILLE
Division of Elec. Eng., School of Engineering, University of Hertfordshire,
Hatfield Campus, College Lane, Hatfield, Herts. AL10 9AB, UK
The chapter outlines recent research that enables one to train digital "Higher Order" sigma-pi artificial neural networks using pre-calculated constrained look-up tables of Backpropagation delta changes. By utilising these digital units, which have sets of quantised site-values (i.e. weights), one may also quantise the sigmoidal activation-output function, so that the output function may also be pre-calculated. The research presented shows that by utilising weights quantised to 128 levels these units can achieve accuracies of better than one percent for target output functions in the range Y ∈ [0, 1]. This is equivalent to an average Mean Square Error (MSE) over all training vectors of 0.0001, or an error modulus of 0.01. The sigma-pi units are RAM-based and as such are hardware-realisable units which may be implemented in microelectronic technology. The chapter presents a development of a sigma-pi node which enables one to provide high-accuracy outputs utilising the cubic node's methodology of storing quantised weights (site-values) in locations held in RAM-based units. The networks presented are trained with the Backpropagation training regime, which may be implemented on-line in hardware. One of the novelties of this work is that it shows how one may utilise the bounded, quantised site-values (weights) of sigma-pi nodes to make training of these neurocomputing systems relatively simple and very fast.
1 Introduction
Most of the current work with Artificial Neural Networks uses nodes whose activation
is defined via a linear weighted sum of the inputs, and whose output is then obtained
by passing this through a sigmoidal squashing function. The linearity required imme-
diately places restriction on the node functionality and, although it is possible to im-
plement any continuous function using two layers of such nodes, the resources
required in terms of hardware and time may be prohibitive. Further, biological nets
make use of non-linear activation components in the form of axo-axonic synapses per-
forming presynaptic inhibition. The simplest way of modelling such synapses and in-
troducing increased node complexity is to use multi-linear activation, that is the nodes'
activation is now the sum of product terms or is in 'Sigma-pi' form1. Units of this type
are also designated 'higher order' in that they contain polynomial terms of order greater
than one (linear). Our Sigma-pi units are 'higher order' nodes and as such make use of
non-linear activation components. Recent research by Lenze2, has shown how to make
tions, Grozinger14.
2 Rationale
The sigma-pi model¹⁵,¹⁶,¹⁷ in the case of this research takes the form of a Stochastic Model (S-Model), as the site-values are interpreted as probabilities. The new model that has been developed is termed a Real-Valued Activation Time Integration Node (RVA-TIN), and contains a hypercube of sites which are averaged over time to provide a real-valued activation, which is then interpolated through an activation-output function; the RVA-TIN is shown in Figure 1.
The site-values, S_μ, are stored in an array (n-tuple), which may be stored sequentially in locations in RAM. The input x_i may also be interpreted as the probability of a "1" appearing at the i-th input to the node. The output y is defined by a probabilistic process which is presented in the following paragraphs. A Time Integration Node stores site-values in a hypercube, which is addressed by an n-bit input vector. The input vector addresses a site μ which contains a site-value S_μ storing a value S ∈ {−S_m, ..., S_m}, which is interpreted as a quantised number for reasons of hardware implementation, but which may also be a real number. The site-value is then used to estimate the activation, which is defined in a similar manner to Barto et al.¹⁸. The activations, ā, are averages, where the bar denotes an exponential average over time. The relationship relates the next activation estimate to the present average, ā_(t), and the present instantaneous activation, a_(t), in an exponential manner, which is a recurrence relationship. The activation value is then passed through an activation-output function (which may be linear or sigmoidal) in order to produce the output. The Analogue-Model (A-Model) RVA-TIN's instantaneous real-valued activation is:
a_(t) = ( 1 / (S_m 2^n) ) Σ_μ S_μ ∏_{i=1}^{n} ( 1 + μ̄_i x̄_i )   (1)
hence it is termed a sigma-pi model, where the input probability distribution defines the probability of the input. The S-Model's instantaneous activation is calculated on every forward pass as
and the average real-valued activation is defined as a function of the inputs and site-values by:

ā_(t+1) = λ a_(t) + (1 − λ) ā_(t)   (3)
where λ determines the activation's rate of increase and decay. The output is a function of the average activation, y = f(ā), hence

y = σ(ā_(t+1))   (4)

σ(ā) = 1 / ( 1 + e^(−ρ ā) )   (6)
The RVA-TIN may be utilised for continuous-valued inputs, where the real-valued input defines the probability (x_i) of presenting a "1" on input i of a node. One should note that in the depiction of the RVA-TIN the site-values are quantised and the activations are quantised, hence the sigmoid is quantised into multiple levels; this is due to the quantisation of ā_(t+1), which allows these models to be realised in hardware. When the RVA-TIN sigma-pi unit is configured into a multi-layered net topology, a change in the input probabilities means that the new outputs at the final layer will be estimated after L time steps, where L is the length of the output bit-stream. The sigmoidal function may not be required if, for example, one were using an RVA-TIN to perform an approximation to a function, e.g. a polynomial function; if one required a linear unit one would utilise a linear output function.
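A minimal sketch of an S-Model forward pass as described above is given below: each real-valued input is compared with its own random number to give an independent instantaneous bit, the addressed site-value supplies the instantaneous activation, this is exponentially averaged (Equation 3) and passed through the sigmoid (Equation 6). The normalisation of the instantaneous activation by S_m, and all class and parameter names, are assumptions made for illustration only.

```python
# Sketch of an RVA-TIN forward pass (illustrative; assumed activation scaling).
import numpy as np

class RvaTin:
    def __init__(self, n_inputs, s_max=64, lam=0.75, rho=24.0, seed=0):
        self.n = n_inputs
        self.s_max = s_max                         # site-values quantised to {-s_max, ..., s_max}
        self.lam = lam                             # exponential-averaging constant (lambda)
        self.rho = rho                             # sigmoid slope
        self.sites = np.zeros(2 ** n_inputs, dtype=int)
        self.a_bar = 0.0                           # running average activation
        self.rng = np.random.default_rng(seed)

    def forward(self, x):
        """x: real-valued inputs in [0, 1], interpreted as probabilities of a '1' on each line."""
        bits = (self.rng.random(self.n) < np.asarray(x)).astype(int)  # independent bit per input line
        mu = int("".join(map(str, bits)), 2)                          # address of the selected site
        a_inst = self.sites[mu] / self.s_max                          # instantaneous activation in [-1, 1]
        self.a_bar = self.lam * a_inst + (1 - self.lam) * self.a_bar  # Eq. 3
        return 1.0 / (1.0 + np.exp(-self.rho * self.a_bar))           # Eq. 6
```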
In this section I define an implementation of Backpropagation that has its roots in the original research carried out by Gurney¹⁰; however, I have developed the algorithm so it is faster and simpler. One should note that this work was not possible until Milligan introduced a weight model for these units. The research into Back-Propagation of errors for sigma-pi units by Gurney¹⁰ was derived from the standard semi-linear Back-Propagation of Werbos²⁰ and then Rumelhart, Hinton and Williams¹. This is based on the back-propagation of a delta term which is derived from the previous node's output error times the weight of the connection associated with it. Gurney's breakthrough was the interpretation of these connection weights as probabilities, which he derived from bit-streams, to provide what he terms 'dynamic weights'. I, however, do not use Gurney's bit-stream methodology to derive the weights on the connections of the nodes, as bit-streams may have associated problems. Instead I define an instantaneous dynamic weight that may be derived for the present forward pass and does not have to be assimilated over multiple feedforward operations, as was previously the case with Gurney's dynamic weights derived from bit-streams. By utilising these dynamic weights the Back-Propagation paradigm of the semi-linear units can then be directly interpreted into a sigma-pi notation of digital sigma-pi (probabilistic) logic nodes. The benefit of utilising the sigma-pi nodal model with this regime is that it uses stochastic state-space search methods and the nodal model itself has an inherent ability to introduce noise while it is being trained. The Backpropagation training regime is dependent on the output error term, which is defined as
e = (1/2) Σ_{t=1}^{T} ( ŷ^t − y^t )²   (7)
where ŷ^t is the target output value. The actual output, y^t, is defined as an expectation that is assimilated in an output bit-stream of length L. The training algorithm presented first carries out L feedforward passes to fill the bit-stream. On the (L−1)th feedforward pass the instantaneous dynamic weights and the output error (delta term) are calculated. The weight update is made at the penultimate feedforward pass, utilising the instantaneous dynamic weight, as

ΔS_μ^j = α δ^j   (8)

where the delta change is made to the j-th unit in the net. For visible or output units the delta change to the j-th unit in the output, given input address μ, is

ΔS_μ^j = α ( ŷ^j − y^j ) σ'(a^j)   (9)
For hidden units the delta change to the j-th unit in the hidden layer, given input address μ, is

ΔS_μ^j = α σ'(a^j) Σ_k δ^k w_k(j)   (10)

The i-th instantaneous dynamic weight is defined as

w_i = ( S_{μ[1_i]} − S_{μ[0_i]} + 2 S_m ) / (4 S_m)   (11)
where μ[·_i] is the n-tuple partial address formed by placing a "don't care" or wild card in the i-th bit position of μ ∈ {μ_1, μ_2, μ_3, ..., μ_n}. This in fact means that each input line to the cube or n-tuple is endowed with the ability of toggling its binary value and hence toggling the address this defines; if, for example, the second bit is toggled in a 3-tuple unit one obtains the two partial addresses μ_1, μ_2, μ_3 and μ_1, ¬μ_2, μ_3, where ¬μ_2 implies an inverse or complement of the bit. Hence μ[1_i] sets the i-th bit position in address μ to 1 and μ[0_i] sets the i-th bit position in address μ to 0, in order to calculate the i-th instantaneous dynamic weight. To train networks of cubic nodes one requires polarised weights, because if the connection's weight has an inhibiting effect (i.e. it is a negative quantity) then the backpropagated delta term must reflect this. Hence
w'_i = S_m ( 2 w_i − 1 )   (12)
This may be done because the sigma-pi neuron model may view its activation in the polarised notation as a ∈ {−1, ..., 1}, and these values may be interpreted in turn as a set of integer elements a_q ∈ {−a_m, ..., a_m}, which are represented in the machine as a set of integer values.
Table 1: Pre-calculated activation-output function values for quantised activations (3-bit example).

Binary   a_{m−q}   σ(a_{m−q})   σ'(a_{m−q})   ασ'(a_{m−q})   α(2/S_m)σ'(a_{m−q})
000₂     −2        0.09         0.08          0.17           0.35
001₂     −1        0.25         0.18          0.37           0.74
010₂      0        0.50         0.25          0.5            1.0
011₂      1        0.75         0.18          0.37           0.74
100₂      2        0.9          0.08          0.17           0.35
6 Experimental Work
In order to illustrate the learning accuracy of the RVA-TIN, simulation results are pre-
sented for units/nets trained on three different experiments. The tasks are either mul-
tiple input or single input and single output. Task 1 maps the XOR function,
{*„ x2} -> {y} , {0.1,0.1; 0.1,0.9; 0.9,0.1; 0.9,0.9} -> {0.1,0.9,0.9,0.1} using a two-
input hidden node and a three-input visible node. Two inputs to the output node are
118
connected to the two hidden node inputs and the third is connected to the output of the
hidden unit. Task 2 maps the function x -> x , using a single layer eight input unit.
Task 3 maps the function x -> x , using a network made up of eight three-input hid-
den units and an eight-input visible unit. The function is mapped using twelve input
output stimuli-pairs spaced equidistant across the function xe [0, 1]. One should
note in task 2 each of the eight-inputs and task 3 each of the three-inputs of the hidden
units were allocated the same analogue input value. Each has its own random number
in order to obtain independent instantaneous binary inputs. This is necessary in order
to obtain multiple binary inputs, if this is not done only addresses zero and the maxi-
mum input address are selected and the cube effectively becomes a 1-cube. Hence to
obtain input addresses that utilise the full range of cube locations each input line's in-
put bit is generated independently of the others For all the experiments the constants
utilised were a = 32.0, p = 24.0, X = 0.75, L = 1024, Sm = 64 and ~a~m = 128
The graphs show the average error modulus over 20 nets for a given training epoch versus the number of epochs. Each epoch is defined as the presentation of all four/twelve input-output stimuli pairs. The graph for Task 1 (XOR) shows that the majority of the nets converged to an error modulus of 0.01 in less than 200 epochs. Task 2 (x → x²), mapping the squared function, is shown to be a harder task and takes at least 800 epochs for the single-layer nets to converge to an error modulus of 0.01. However, by introducing a hidden layer in Task 3 (x → x²) the nets converge to an error modulus of 0.01 in 300 epochs, which shows the significance of a hidden layer with regards to expanding the internal state space of the system.
For comparison purposes only, the following paragraphs discuss the speed of convergence of our RVA-TIN units trained with Back-propagation and the SRV (Stochastic Real-Valued) units of Gullapalli²¹,²² trained with deterministic Reinforcement. Of course, one should note that if Gullapalli utilised a supervised regime he would have faster learning, but by utilising the same task as he does in his simulations we can compare the convergence times to show how a supervised methodology is more efficient. I also compare our RVA-TIN units trained with Back-propagation and the semi-linear Threshold Logic Units (TLU) of Gurney¹⁰ trained with System Identification. The results of the three tasks are tabulated below to show the learning efficiency of the RVA-TIN units when compared to SRV units and TLU units.
Table 2 shows a comparison of learning efficiency of the RVA-TIN units when compared to SRV units and TLU units when tested on the three tasks.
Task 1:   75 (0.01)   2400 (0.01)   -   0.00122
(only the row for Task 1 of Table 2 is legible in this reproduction)
The above table shows that Backpropagation training of RVA-TINs is more efficient than either of the other regimes. However, we are not presenting the results in light of their efficiency, but just as comparisons. The more important point to note is that RAM-based sigma-pi nets allow one to partially pre-calculate the weight changes for Backpropagation training.
7 Concluding Remarks
The chapter presents new techniques which enable real-valued functions to be mapped onto RAM-based RVA-TIN sigma-pi nets utilising Backpropagation training. The study shows that one may utilise quantised weights (site-values, S_μ ∈ {−64, ..., 64}), which are stored in 8-bit binary code, and quantised activations (ā ∈ {−128, ..., 128}), which are stored in 9-bit binary code, and map real-valued functions to an accuracy of 1%. The research work makes use of the quantised nature of the RVA-TIN node's activation to calculate the sigmoidal activation-output function and the sigma-primed terms by utilising look-up table methodologies. This addresses one of the problems inherent in most neural networks, which have very serious overheads relating to complex and time-consuming mathematical function calculations normally performed during the feed-forward and training phases of these neural systems. This research enables sigma-pi structures to be efficiently trained and mapped onto RAM-based technologies. By using a methodology of constrained look-up tables one can pre-calculate the sigma-pi's activation-output function, σ(ā), and the partial weight (site-value) updates. This means that computationally intensive mathematical functions, such as the sigmoidal function (6), are pre-calculated. Previous work using semi-linear units with Backpropagation of Error training (Grozinger¹⁴) utilised an approximated sigmoid, using piece-wise linear functions. The RVA-TIN sigma-pi has bounded activations and site-values, which aid the high-speed training of a neural system, as one implements look-up tables for the majority of the mathematical computations. Due to the functionality of sigma-pi units one does not carry out the Σ_i x_i w_ij calculations that are used by semi-linear units, which require one to perform n multiply operations (where n is the number of inputs to node j) and then sum all the products; these are time-consuming tasks, hence the computational requirements of sigma-pi units are less than those of semi-linear units.
To put this research in perspective, and to counterbalance the above comments, one does have to provide storage space for the pre-calculated look-up tables to store σ(ā), σ'(ā), ασ'(ā) and α(2/S_m)σ'(ā). In the experiments a_m = 128; this means that the system requires 4 × (2a_m + 1) memory locations, which is approximately 1k of memory, to implement the pre-calculated look-up tables. But when one utilises large artificial neural networks this becomes an insignificant requirement, as one is now able to pre-calculate complex functions which do not have to be calculated in real time while the system is training.
The use of RAM-based sigma-pi nets is not widespread; however, in this chapter I have presented research that shows they have some advantages over the more commonly utilised semi-linear units. The major advantage they have is that they are implementable in cheap RAM-based hardware. It is interesting to note that one can now obtain 4Gb dynamic RAM chips. However, they may also be implemented on Massively Parallel Processors (Neville et al.¹³).
References
T.M. JØRGENSEN
Risø National Laboratory, P.O. Box 49,
DK-4000 Roskilde, Denmark
A strategy for adding inhibitory weights to RAM-based nets has been developed. As a result a more robust net with lower error rates can be obtained. In the chapter we describe how the inhibition factors can be learned with a one-shot learning scheme. The main strategy is to obtain inhibition values that minimise the error rate obtained in a cross-validation test performed on the training set. The inhibition technique has been tested on the task of recognising handwritten digits. The results obtained match the best error rates reported in the literature.
1 Introduction
With respect to the learning of classification tasks the conventional N-tuple nets or
RAM-nets such as the WISARD system 1 have many nice properties such as fast train-
ing and recall rates. Furthermore, the architecture is very easy to interpret. With P
classes to distinguish the set of N-tuples can be considered as a number of Look Up
Tables (LUT). This is illustrated in figure 1. Each LUT probes a subset of the binary
input data. The sampled bit sequence is used to construct an address. This address cor-
responds to a specific entry in the LUT. The number of output values (rows) from a
given column entry is equal to the number of possible classes. For each class the out-
put can take on the values 0 or 1. A value of 1 corresponds to a vote on that specific
class. The output vectors from all LUTs are added, and subsequently a winner takes
all decision is made to perform the classification.
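A minimal sketch of the LUT voting scheme just described is given below, assuming binarised input patterns and fixed random tuple mappings; the class and method names are illustrative and not taken from the chapter.

```python
# Sketch of a basic N-tuple / LUT RAM net (illustrative only).
import numpy as np

class LutNet:
    def __init__(self, n_inputs, n_classes, n_luts, tuple_size, seed=0):
        rng = np.random.default_rng(seed)
        # Each LUT probes a fixed random subset of the binary input.
        self.maps = [rng.choice(n_inputs, tuple_size, replace=False) for _ in range(n_luts)]
        self.cells = np.zeros((n_luts, 2 ** tuple_size, n_classes))   # 0/1 votes per column and class

    def _address(self, pattern, lut):
        bits = np.asarray(pattern)[self.maps[lut]]
        return int("".join(map(str, bits)), 2)

    def train(self, pattern, label):
        for lut in range(len(self.maps)):
            self.cells[lut, self._address(pattern, lut), label] = 1.0  # one-shot learning

    def classify(self, pattern):
        votes = sum(self.cells[lut, self._address(pattern, lut)] for lut in range(len(self.maps)))
        return int(np.argmax(votes))                                   # winner takes all
```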
The number of LUT addresses shared by any two examples belonging to different classes should be as small as possible. However, in many situations two different classes might only differ in a few of their features. In such a case an example (that has not been used for training) has a high risk of sharing most of its features with an incorrect class. In this situation the RAM net will have an unacceptably high error rate. In order to circumvent this situation it is desirable to weight different features differently for a given class.
The CMAC² is an example of an architecture that allows real weights to be stored in the RAM net. These weights can then be trained with a perceptron-like learning rule. We present an alternative solution introducing negative weights into the LUT cells. With this approach the modified architecture bears a close resemblance to the
Figure 1: Architecture of the RAM net. The number of training examples from class c_i that generate address a_m is denoted n_{i,m}. If n_{i,m} ≥ 1, the output value o_i is set to 1, else 0.
In order to illustrate the need for inhibition we consider the case of distinguishing between the digits 4 and 9. The main difference between 4's and 9's is the appearance or non-appearance of an upper horizontal bar, as illustrated in figure 2. If such a bar is met in a test example it is desirable that the network inhibits class 4. Otherwise there is a risk that too much emphasis is put on the other parts of the test pattern. In general there can be several LUT columns in the network that correspond to variations of such a bar. These columns will be characterised by not "voting" on class 4 but possibly on class 9. Due to the limited size of the training set it is, however, not necessarily so that columns voting on class 9 but not on class 4 (and vice versa) represent real distinguishing features. Accordingly, a strategy is needed for selecting those columns that are the most likely candidates for inhibition.
Figure 2: An example of two digit classes that are very close with exception of the bar (or non-bar structure)
at the top.
The first step in locating column candidates for inhibition is to locate the training examples having low confidence, as well as those being misclassified in a cross-validation test (see the accompanying chapter³ on using cross-validation tests in connection with RAM nets). For each of these examples all LUT columns voting on the true class but not on the competing class are found. Afterwards a small inhibition term, −α_inhib, is added to the weight value of the competing class for each detected column. A typical value of α_inhib is inversely proportional to N_LUTs, where N_LUTs denotes the total number of LUTs in the architecture.
The inhibition factor is calculated so that the confidence after inhibition corresponds to a desired level. Inevitably this technique will add inhibition to some LUT columns that do not represent relevant distinguishing features. However, the LUT columns being sought are likely to be visited by many of the low-confidence training examples, and accordingly their corresponding cells will obtain larger inhibition factors than the rest. From this argumentation it is also evident that α_inhib should be sufficiently small, in the sense that the effect of inhibiting "wrong" cells then becomes negligible.
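Built on a LUT net like the one sketched earlier, the inhibition step could look roughly as follows; the identification of low-confidence and misclassified examples by cross-validation is assumed to have been done already, and the function and argument names are illustrative.

```python
# Sketch of adding inhibitory weights to LUT cells (illustrative only).
def add_inhibition(net, hard_examples, alpha_inhib):
    """hard_examples: (pattern, true_class, competing_class) triples found by a
    cross-validation pass over the training set. alpha_inhib should be small,
    e.g. scaled inversely with the number of LUTs, as discussed above."""
    for pattern, true_c, comp_c in hard_examples:
        for lut in range(len(net.maps)):
            addr = net._address(pattern, lut)
            votes = net.cells[lut, addr]
            # Columns voting on the true class but not on the competitor get the inhibition term.
            if votes[true_c] >= 1.0 and votes[comp_c] <= 0.0:
                net.cells[lut, addr, comp_c] -= alpha_inhib
```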
An advantage of using the above-described inhibition technique is that essential features of the basic RAM net model are preserved: all output values that would be set to 1 in the simple system are also set to 1 in the modified version. However, some of the cells containing 0's in the simple system will have their contents changed to negative output values in the modified net. In other words, the conventional net is extended so that inhibition from one class to another is allowed.
The traditional RAM net architecture can also be viewed as a constrained feed-forward architecture. This is illustrated in figure 3. Every LUT column with non-zero weights corresponds to a neuron in the hidden layer. As only one column per LUT can be addressed at a given time, the number of active neurons in the hidden layer is limited to the number of LUTs. As shown in figure 3, the traditional architecture does not allow any inhibition from the hidden units to the output units, but such inhibitory weights are introduced by the above-described technique.
Figure 3: The figure illustrates how the RAM net architecture can be viewed as a constrained feed-forward
architecture. The three neurons shown in the hidden layer correspond to three columns of the depicted LUT.
Introducing inhibition between classes corresponds to allowing negative weights on the connections be-
tween hidden neurons and the output layer.
4 Results
Table 1: Test results obtained for the task of recognising handwritten digits.
In order to get an idea of the impact of the inhibition technique on the architecture we evaluated the number of negative weights that were added for the architecture with 935 LUTs. In total around 36000 negative weights have been added to the cells. This corresponds to an average of around 40 negative weights per LUT. For comparison, the number of positive weights (all having value 1) is 1.8 million in total (around 2000 per LUT). The number of inhibitory weights is seen to be small compared to the number of positive weights; nevertheless they have a large impact on the performance.
As described in Jorgensen³ it is possible to calculate what we denote a critical example number, n_crit, which estimates the number of examples from the training set that support a given classification. The n_crit value is suited as a confidence measure, and accordingly it can be used as part of, or as, the rejection criterion. This is illustrated in figure 4 (the curves correspond to the RAM net with 935 LUTs listed in table 1). The upper curve shows the amount of errors that are accepted for different acceptance levels of n_crit, whereas the lower curve shows the amount of accepted classifications among the examples being correctly classified. As an example, it can be seen that using an n_crit value of 2 as acceptance level implies an error rate of 0.1%.
Figure 4: The two curves illustrate the amount of errors and correctly classified test examples that are accepted for different acceptance levels of n_crit.
5 Conclusion
A one-shot learning technique that introduces inhibition into a RAM net architecture has been described. This new technique was tested on the task of recognising handwritten digits, where it leads to a significant performance improvement. The obtained error rate matches the best results reported in the literature. The obtained error rate (without rejection) on a test set of 13521 digits was 1.2%.
Acknowledgments
I thank Steen Sloth Christensen for stimulating discussions and for providing me with
his basic software implementation of a RAM neural net.
References
This chapter introduces a novel networking strategy for RAM-based Neurons which
significantly improves the training and recognition performance of such networks
whilst maintaining the generalisation capabilities achieved in previous network config-
urations. A number of different architectures are introduced each using the same under-
lying principles.
Initially, features which are common to all architectures are described illustrating the
basis of the underlying paradigm. Three architectures are then introduced illustrating
different techniques of employing the paradigm to meet differing performance speci-
fications.
This chapter will introduce a family of neural architectures which collectively encompass a basic set of common attributes. The network architectures introduced are all RAM-based networks employing layers of neurons, each of which is independently attached to a given sample pattern, whose distinct outputs are merged (by means of a further layer) to produce an output matrix which is equal in dimension to the original sample pattern. Each architecture may employ a varying number of groups of such layers arranged sequentially, and the general structural pattern for a group is shown in Figure 1. Each of the architectures introduced in this chapter shares the following common features:
• The layers comprising the network are arranged in a fixed number of groups,
the exact number of such groups being architecture dependent.
• A Merge layer exists after each group whose function is to combine the corre-
sponding outputs of the constituent layers of the group. The connectivity of
the neurons comprising a Merge layer is equal to the number of layers within
the group to which it pertains.
• The number of layers within each group may be varied depending on the rec-
ognition performance required from the network.
• Neurons within a given layer possess the same connectivity pattern relative to
their position within the matrix.
The connectivity pattern for neurons of differing layers can take various forms. Since, for each neuron, the input set is calculated modulo the dimensions of the pattern, neurons which, for example, are situated at the right edge of a pattern and are virtually clamped to the 'x' bits on their right will actually be clamped to 'x' bits
The Boolean Convergent Network (BCN)¹ is the most basic network architecture in the series and the structure from which the other architectures were developed. It is a RAM-based neural network where the inputs and outputs of all the component neurons are taken from the symbols '0', '1' and the undefined value 'u'. The inputs to a neuron form an addressable set incorporating all memory locations which may be formed by treating any undefined value within the input as either a '0' or a '1'. The output of a neuron will be any defined value which occurs exclusively within the memory locations included within the addressable set. If the addressable set contains either no defined value or examples of both defined values, then the undefined value 'u' is output.
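The recall rule just described can be summarised in a few lines; representing the memory as a dictionary keyed by binary address tuples is an assumption made purely for illustration.

```python
# Sketch of a BCN neuron's output rule over the symbols '0', '1' and 'u' (illustrative only).
from itertools import product

def bcn_output(memory, inputs):
    """memory: dict mapping tuples over ('0', '1') to '0', '1' or 'u'.
    inputs: tuple over ('0', '1', 'u'); each 'u' expands to both '0' and '1'."""
    choices = [('0', '1') if s == 'u' else (s,) for s in inputs]
    addressed = {memory.get(addr, 'u') for addr in product(*choices)}
    defined = addressed & {'0', '1'}
    # Output a defined value only if it occurs exclusively within the addressable set.
    return defined.pop() if len(defined) == 1 else 'u'
```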
The performance of the BCN architecture in a typical pattern classification task, where the sample patterns were taken from digitised postcode data corresponding to the numerals 0 through 9, with ten example patterns of each numeral presented to the network for training, is shown in Table 2. Three hundred sample patterns of each numeral were presented for processing. The same training and sample patterns were used for all the experiments represented.
The various columns indicate the performance achieved by varying the number
of layers within the groups. Two differing connectivity patterns for the 13 layer option
are shown (labelled 13A and 13B). It is clear that there is scope for optimisation of the
network in terms of the number of layers and set of connectivity patterns for a given
dataset.
These results demonstrate the potential classification performance of such a net-
work, even though the experiments were not exhaustive.
Class   Layer variations
        6     9     13A   13B   17
0       92    95    99    99    98
1       78    78    80    82    83
2       92    97    91    91    90
3       93    97    96    95    94
4       95    96    95    96    94
5       75    74    75    73    70
6       77    86    84    83    83
7       95    96    96    96    95
8       93    91    95    95    96
9       85    91    91    89    94
Although the BCN architecture is simple to implement and train, it is, in practice, suitable only for the processing of datasets in which the constituent classes are relatively tightly clustered, as is the case, for example, in the processing of machine-printed character datasets. More complex datasets involving handwritten characters demand a higher performance specification, and for this it is necessary to consider how the BCN architecture may be improved. In order to identify how to achieve this, it is first necessary to identify precisely where its deficiencies lie.
A major problem with the BCN architecture, and indeed a common problem with Boolean RAM-based networks, is that a particular neuron output represents a set of pattern classes when the input to the neuron specifically indicates a particular class. For example, a neuron may output the value '0' when an exact match is found with either of two separate pattern classes. The '0' output does not indicate which of the two pattern classes has been identified and, more worryingly, indicates that the input is consistent with a second pattern class when in fact it is not. This may lead to increased difficulty in recognition tasks and increase the probability of incorrect recognition.
The problem arises because neuron outputs are restricted to a small set of symbols, often just '0' and '1'. These symbols must then be overloaded in order to represent all the pattern classes under consideration. It is possible to remove the overloading by increasing the symbol set available to the neurons. This may be achieved in the following two ways.
Firstly, the symbol set may be extended to allow a symbol to represent each pattern class under consideration. For example, if ten pattern classes representing numerals were being considered, the symbols '0' through '9' could be employed
The performance of the GCN architecture has been examined using a number of datasets. Some typical results are shown in Table 1. The results were achieved using an architecture employing 9 layers within both the Pre and Main groups, with all neurons of connectivity 8.
The first results column shows the performance achieved using 300 sample machine-printed characters per class with only 10 training patterns employed for each class. It thus provides a comparison with the BCN results shown in Table 2.
The second and third columns indicate results achieved with handwritten characters using two sample datasets respectively. The results were achieved using a varying number of training and sample patterns per class (the entire dataset was employed) and illustrate the potential of the network to perform handwritten character recognition.
The final two columns indicate the results achieved using a standard handwritten dataset with firstly the entire training set (C1) and secondly the same training set with potentially confusing training patterns manually removed (C2). The results thus illustrate the potential improvement expected where it is possible to optimise a training set.
Class     Machine   Handwritten
                    A      B      C1     C2
0         98        94     89     93     97
1         98        88     72     61     61
2         99        74     94     98     100
3         98        87     80     87     90
4         99        90     95     92     95
5         99        80     64     83     86
6         94        75     73     90     94
7         99        65     79     77     81
8         99        96     66     97     98
9         99        76     72     88     91
average   98.0      82.5   78.4   86.6   89.3
The GCN architecture produces a much improved performance over that of the simpler BCN architecture, at the expense of increased complexity. It does, however, produce only a simple 'recognition' of a sample pattern. It provides no indication of how certain it is of its conclusion or of how closely the sample pattern matches other pattern classes. For some applications, this information may be required (humans, after all, can provide this information when considering patterns). To provide this capability, a further architecture enhancement is necessary. As was done when developing the GCN, an examination of where the deficiencies lie within the GCN architecture is performed in order to determine where further modifications may be made.
A given storage location within a Pre-group GCN neuron indicates which training sets (and therefore which pattern class) contain example patterns which generate an address corresponding to that location during the training process. For example, if examples of classes 1 and 2 addressed the given location within the neuron, the location would contain the symbol '12', correctly indicating that if a sample pattern addressed the location it would be consistent with it being an example of pattern class 1 or 2. However, it does not indicate what proportion of the members of the respective training sets addressed the location. For example, almost all the members of the training set for class 1 may address the location while only one of the members of the training set for class 2 does so. The symbol stored in the location would still be '12' even though a sample pattern addressing this location would be more likely to be an example of a member of class 1 than of class 2. The question is: "Is there a method of allowing a neuron to indicate that a sample pattern is more likely to be a member of class 1 while still allowing the possibility that it is a member of class 2 if other neurons take this view?" The Probabilistic Convergent Network (PCN) has been designed to achieve this aim.
The Probabilistic Convergent Network (PCN)³ extends the architectures described above but still retains the same abstract structure as the BCN and GCN described earlier. Two groups of layers are employed by the network, with a feedback for the Main group. There are, however, a number of important differences from the GCN architecture. Firstly, the number of symbols is increased to allow for a measure of the proportion of the training patterns pertaining to each class addressing a given location. The number of divisions for each class is network dependent and will be defined so as to provide the necessary recognition performance.
Secondly, whereas the GCN typically converges onto a single base symbol indicating the pattern class which has been identified, the PCN typically converges onto a compound symbol indicating the relative probabilities of the sample pattern being a member of each pattern class. This is an important advantage of the PCN, as it allows the network to communicate that a sample pattern is, for example, most likely to belong to class x but is also quite close to class y, whereas it is extremely unlikely to belong to class z. Typically, of course, the same pattern classes will appear to be similar, due to the similar attributes possessed by the differing pattern classes (a '6' has more features in common with an '8' than with a '7', for example). The PCN does, however, have the capacity to give an indication of the degree of confidence which can be attached to its decision, and this is a feature which may be important in a number of practical applications.
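The idea behind the PCN's extended symbol set can be sketched as follows: each location keeps, for every class, the proportion of that class's training patterns which addressed it, quantised into a fixed number of divisions. The class name, the data layout and the quantisation step below are assumptions for illustration only.

```python
# Sketch of PCN-style locations storing quantised per-class proportions (illustrative only).
import numpy as np

class PcnLocations:
    def __init__(self, n_addresses, n_classes, n_divisions=10):
        self.counts = np.zeros((n_addresses, n_classes))
        self.class_totals = np.zeros(n_classes)
        self.n_divisions = n_divisions            # number of divisions per class

    def train(self, address, label):
        self.counts[address, label] += 1
        self.class_totals[label] += 1

    def recall(self, address):
        proportions = self.counts[address] / np.maximum(self.class_totals, 1)
        # Quantise each proportion into one of n_divisions levels: the "compound symbol".
        return np.round(proportions * self.n_divisions) / self.n_divisions
```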
(results table; column headings lost in extraction)
2     265   66   62   44   117   86    94    78   79   109
3     288   72   61   62   82    91    142   43   92   67
4     332   45   61   75   71    100   112   43   89   72
5     248   65   57   46   102   119   99    81   92   91
6     253   55   74   61   97    115   78    81   85   101
7     145   87   72   49   174   69    120   95   84   106
8     262   62   74   56   114   86    105   63   87   91
9     294   51   69   72   71    104   80    62   94   104
10    275   56   78   81   86    93    79    69   85   98
of '0' characters was 92% and the average for all ten pattern classes was 82.1 %.
6 Conclusions
• The ability to regularly modify the architecture implies that the simplest net-
work possible meeting a given performance specification for an application
may be employed.
• The existing benefits of RAM-based networks are retained, in that the layers comprising the network require at most one-shot learning and are trained independently.
References
For any approach to be worth studying, demonstrable proof of its utility on practical problems is essential. This section contains a number of practical studies. All the applications are found in image processing, the traditional area for the successful use of RAM-based methods (as in their original use). The main reason for this is that the methods scale well to the large input data sizes needed for image analysis problems. The final paper examines the implementation of ADAM, a RAM-based network for image analysis, on a parallel system of transputers.
The first paper, by O'Keefe and Austin, shows their use in finding features in fax images, a problem that makes use of the potentially fast processing and noise-tolerant properties of the networks, as the method is applied to faxes sent via typical fax machines. In addition, it illustrates how RAM-based methods compare with traditional object recognition methods.
Texture recognition is examined by Hepplewhite and Stonham, who introduce a novel pre-processing method and compare a number of existing N-tuple pre-processing methods for this task.
RAM-based networks are particularly suitable for small mobile robots, as shown by Bishop, Keating and Mitchell, who demonstrate that a compound, insect-like eye can be created and used to control a simple robot.
Feature analysis is a vital part of machine vision and is explored by Clarkson and Ding. They show how a pRAM-based network can be used to find features in a fingerprint recognition system. In addition, they show how noise injection can be used to improve the performance of the method.
The use of colour in the detection of danger labels is investigated by Linneberg, Andersen, Jorgensen and Christensen, where the power of the N-tuple method to solve real problems is demonstrated.
The problems involved in exploring complex images using saccadic image scanning methods are explored by Ntourntoufis and Stonham. They extend the MAGNUS network presented in section one to deal with multiple objects in a 'Kitchen World' scene, illustrating that the iconic internal representations used in MAGNUS can be used to control image-understanding systems.
Finally, handwritten text is examined in the chapter by De Carvalho and Bisset, where the SOFT and GSN RAM-based methods are combined in a modular approach to a difficult classification problem.
CONTENT ANALYSIS OF DOCUMENT IMAGES USING THE ADAM ASSOCIATIVE MEMORY
S. E. M. O'KEEFE, J. AUSTIN
Advanced Computer Architecture Group, Department of Computer Science,
University of York, York YO1 5DD, UK
1 Introduction
The accumulation of evidence for an object depends upon the location and
identification of the features which comprise the object. If the object in the
image does not match the object template perfectly because of noise, clutter
or other distortions, then the probability of detecting and correctly identi-
fying each feature is reduced, and the probability of identifying the entire
object is therefore also reduced. The accumulation of discrete pieces of evi-
dence requires the quantisation of the parameters spanning the accumulator.
This quantisation makes the algorithm sensitive to small changes in features
and their measurement. Thus, noise and clutter may affect features, which leads to the dispersion of evidence throughout the accumulator rather than the formation of a discrete peak. Grimson and Huttenlocher⁴ have analysed the Generalised Hough Transform in detail and produced a formal model of
the Generalised Hough Transform in detail and produced a formal model of
the errors introduced by the quantisation and measurement processes. They
model the effect of inaccuracies in measurement, and predict the effect of these
errors on the accumulation of evidence. The results of their analysis show the
method to be sensitive to the number of features which comprise an object
model. Using a simple model of noise and clutter in the image, they also show
the method to be prone to the generation of false positives.
In the GHT, then, the object model is in the form of a template which
describes the features comprising the object, and their relationship to each
other. Matching a model to the image data involves a search through all
available models to decide which model best fits the data. This search may
present a computational problem when there are a large number of models.
This is the case when analysing real images. The computational problem is
solved by implementing the Generalised Hough Transform using a fast associa-
tive memory, ADAM (Advanced Distributed Associative Memory) (Austin and
Stonham 5, O'Keefe and Austin 6,7). ADAM is a neural network specifically
designed for image analysis tasks (Austin et al. 8). Put simply, the network
allows for the association of input and output patterns in a compact form, so
that the presentation of an input pattern stimulates the output of the associ-
ated pattern without a serial search of the pattern memory. The network has
binary weights, and therefore may be held in memory in a compact form, with
only one bit per weight. Thus, the network is ideal for performing the sorts of
template matching involved in the GHT.
The network is also able to generalise about the features it has been trained
to recognise, so that it is able to recognise noisy or corrupted versions of them.
It can return a measure of confidence in the recognition of each feature, and
can thus be made robust to corruption of the image by noise at the feature
level.
single CMM associating input and output would be impractical. ADAM uses
an intermediate code or "class" which is much smaller than either of the input
or output arrays. The first CMM learns associations between the contents of
the input array and some intermediate code, selected to be unique for that
input, and the second CMM learns the association between the intermediate
code and the output. In this way, the total size of the matrices is reduced.
The input array is tupled. This preprocessing of the input splits it into
groups of binary elements ("tuples"). Each tuple decoder interprets the values
in its tuples as representing a state. The decoder has a separate output for each
state, and sets one, and only one, output corresponding to the state of the tuple.
As a consequence of this tupling of the input, linearly inseparable patterns may
be classified; the input to the first CMM is much more sparse than the original
data, thereby reducing saturation of the memory and increasing capacity; and
the number of bits set in the input to the CMM is fixed, making thresholding
of the output from the CMM simpler. The properties of the memory can be
controlled by selecting the size of tuples in the input and output, the size of
the intermediate class, and the number of bits set in the class.
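As a concrete illustration of this tupling step, the following Python sketch maps a binary input array onto randomly chosen tuples and one-hot decodes the state of each tuple; the grouping, sizes and names here are illustrative and are not taken from the ADAM implementation itself.

import numpy as np

def tuple_preprocess(binary_input, tuple_size, rng):
    # Split the binary input into random tuples and one-hot decode each tuple's
    # state, so that exactly one output bit is set per tuple (a sketch only).
    n = len(binary_input)
    assert n % tuple_size == 0
    groups = rng.permutation(n).reshape(-1, tuple_size)   # random tuple mapping
    out = np.zeros((len(groups), 2 ** tuple_size), dtype=np.uint8)
    for g, idx in enumerate(groups):
        state = int("".join(str(int(binary_input[i])) for i in idx), 2)
        out[g, state] = 1                                  # one state per tuple
    return out.ravel()                                     # sparse, fixed weight

rng = np.random.default_rng(0)
coded = tuple_preprocess(rng.integers(0, 2, size=64), tuple_size=4, rng=rng)
print(int(coded.sum()))   # always 64/4 = 16 bits set, regardless of the input

The fixed number of set bits in the coded input is what makes the simple thresholding of the CMM output described above possible.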
4 Implementation
An object recognition tool for use in the analysis of images has been im-
plemented, using the ADAM neural network to implement the GHT. A block
diagram of the structure of the recogniser is shown in figure 3.
Figure 4: Calculation of object position from feature position and recalled parameters
To locate an object in an image, the recogniser must search for the m x m
blocks of pixels which constitute the features which have been learned. When
a feature is found in the image, the corresponding vector/label pairs are found
from the template. From the position of the feature, and the recalled position of
the object centre, the location of the putative object in the image is calculated
(figure 4). At this position, the label for the object is accumulated. This is
repeated for all vector/label pairs that have been recalled for the feature, and
this is repeated for every feature found in the image.
When the whole image has been scanned for features, the accumulated
evidence is examined. When an object is actually present in the image, and
all or most of its features are found, a large number of labels are accumulated
for that object at the position of the object centre. This is apparent as a peak
in accumulated evidence. Each object recognised is represented by a peak, and
there will also be peaks due to clutter and noise in the image.
The recogniser is trained by teaching the ADAM to associate features
(m x m blocks of pixels) with object labels and positions. Once the ADAM
has been trained, the recogniser can search for the object in any image. At
every position, an m x m block of pixels is extracted. If the ADAM recognises
this block of pixels as a known feature, it recalls the associated grid structure
containing the vector/label pairs. Recognition of a feature is determined by
the confidence with which the first CMM produces the intermediate code — a
high confidence implies recognition of a feature, and low confidence implies the
feature is not recognised. Object labels appear on the grid points corresponding
to the object position relative to the feature. This may be performed in parallel
by an array of ADAM networks, although it is currently simulated in software.
The contents of this recalled grid are added into an accumulator grid.
The accumulator grid covers the whole image, with cells in the accumulator
grid mapping onto points in the image. The grid structure associated with a
feature, and recalled from the memory, is added into the accumulator at the
position of the feature in the image. Thus, at each position in the accumulator
some number of labels is accumulated, corresponding to putative objects in
the image, for which there is some feature evidence belonging to the object.
Once the whole image has been scanned for features, and the evidence has
been accumulated, object positions are determined. This is done by looking
at every cell in the accumulator and determining whether the accumulated
evidence in that cell exceeds the threshold for an object. Assuming that the
ADAM has been trained to recognise more than one object, then each cell of
the accumulator may hold labels for more than one class of object. This is
illustrated in figure 5. To determine which is the dominant class of object at
this location, the contents of the accumulator cell are N-point thresholded; that
is, the N largest elements in the cell are selected. This gives us the object label
"001001". To determine the level of confidence in this label, the difference
between the "strength" of this label and the other labels in the cell (which
are treated as noise) is compared with the response expected from the object
model.
Figure 5: Labels for more than one class of object accumulated in one cell
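As an illustration, the following Python sketch performs this N-point thresholding on a single accumulator cell; the simple signal-to-noise gap used as a confidence here stands in for the comparison with the object model response, and the names are illustrative.

import numpy as np

def n_point_threshold(cell, n_points):
    # Select the n_points largest elements of one accumulator cell and return a
    # binary label plus the gap between the selected responses and the rest,
    # which are treated as noise.
    cell = np.asarray(cell, dtype=float)
    order = np.argsort(cell)[::-1]
    label = np.zeros(len(cell), dtype=np.uint8)
    label[order[:n_points]] = 1
    signal = cell[order[:n_points]].mean()
    noise = cell[order[n_points:]].mean() if len(cell) > n_points else 0.0
    return label, signal - noise

label, confidence = n_point_threshold([3, 14, 2, 1, 15, 13], n_points=3)
print(label, confidence)   # [0 1 0 0 1 1] with a signal-to-noise gap of 12.0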
5 Results
Figure 6: Test images (a) fragment of a fax (b) fragment after addition of noise
The higher the peak on the map, the higher the confidence that the object is
at that point (confidences are scaled to a maximum of 256). It can be seen
in figure 7(a) that both instances of object "A" have produced peaks in the
accumulator. However, the detection of the objects is far from satisfactory.
The left-hand peak in figure 7(a) is actually a double peak, indicating that
the evidence for the object has been spread over two adjacent cells in the
accumulator. This is due to the coarseness of the quantisation of the vector.
The right-hand peak is lower than the left, showing a lower confidence in the
detection of this object. There is also a considerable amount of noise in the
accumulator. This is the result of detection and recognition of features which
are not part of an object.
Figure 7(b) shows the large reduction in confidence which occurs when
noise is added to the image. Both objects are barely distinguishable.
The presence of clutter (extraneous features) and noise complicates the
recognition process. In order to overcome these problems, changes to the pa-
rameterisation used to model objects and the introduction of feedback into the
recognition process have been implemented. The object will be modelled
in terms of the position of each feature relative to all others within a certain
distance. Recognition of a feature will result in the recall of a grid structure
containing labels for features, as well as labels for objects. The feature infor-
mation will be fed back to the recogniser, so the recognition process at any
position will be informed by the image data at that position, and the sup-
porting evidence from nearby features. Where there is insufficient supporting
evidence, the feature will not be recognised. In this way, clutter will be sup-
pressed, and noisy features which are part of an object are more likely to be
recognised, improving the recognition confidence.
Another step in the development is the transition to specialised hardware.
The C-NNAP processor (Austin et al. 10, and chapter 3.9) is designed to imple-
ment ADAM directly in hardware, and to run ADAMs in parallel. Migration
to this hardware will give two orders of magnitude increase in the speed with
which images are analysed.
Figure 7: Results of processing. Each map shows the confidence with which the class of
objects has been located at each point in the image. Confidences are scaled to a maximum
of 256.
6 Conclusions
References
L.HEPPLEWHITE, T.J.STONHAM
Neural Networks and Pattern Recognition Research Group,
Department of Electronics and Electrical Engineering,
Brunel University, Uxbridge, Middlesex,
UB8 3PH, UK
A novel approach to real time texture recognition, derived from the n-tuple method
of Bledsoe and Browning, is presented for use in industrial applications. A wealth
of texture recognition methods is currently available; however, few have the com-
putational tractability needed in an automated environment. Methods based on
the nth order co-occurrence spectrum are discussed together with their shortcom-
ings before a new method which uses nth order co-occurrence methods to describe
texture edge maps instead of pixel intensity values is presented. The resulting co-
occurrence representation of the texture can be classified by established statistical
methods or weightless neural networks. Finally the new method is applied to the
problems of texture classification and segmentation.
1 Introduction
mθ(i) = P(x + k(i − n/2) cos θ, y + k(i − n/2) sin θ);   i = 0..n    (1)
Hence n-tuple samples extracted from a given texture can reflect micro-
texture information of various spatial frequencies, spatial extent and orien-
tation by suitable selection of the tuning parameters: tuple size, n, interpixel
spacing, k, and orientation, θ. Each oriented n-tuple sample taken from an im-
age containing g intensity levels corresponds to an n-tuple state, Sθ ∈ [0, g^n − 1],
from the universe of possible states, assigned as follows:
Sθ = Σ_{i=0}^{n} g^i · mθ(i)    (2)
Subsequently a macro-texture description of the texture can be obtained
by recording the relative occurrence of these n-tuple states, Sθ, within a region
of the texture. Thus, discontinuities in texture (see fig. la) are reflected by
differences in the relative occurrence of n-tuple states.
Figure 2: Intensity profile taken from Tree Bark Texture and its Pass-Band decomposition
Figure 3: Rank coding of 4-tuples and the luminance profiles they represent
The TCS methods can sample at these lower spatial frequencies through
the interpixel spacing, k, and tuple size, n. However, as demonstrated, lower
spatial frequencies remain in each n-tuple sample, biasing the high frequency
samples and making parameter selection difficult. In short, intensity changes
cannot be represented at all scales simultaneously and the TCS method must
be 'tuned' to a particular scale (see fig. 2, right).
∇²G(x, y) = (1 − (x² + y²)/(2σ²)) e^{−(x² + y²)/(2σ²)}    (3)

The second characteristic of the LoG filter, the gradient operator, is per-
formed by the Laplacian operator. This performs the second derivative of the
band limited texture, thus representing edges at this scale as zero-crossings in
the resulting filter output (see fig. 4). Since orientational features are extracted
by the subsequent n-tuple operators, the Laplacian is chosen as the lowest or-
der isotropic derivative operator. An important property of the resulting LoG
filter is its balance between positive and negative values. This means that the
response to dark/light edges is exactly opposite to that of light/dark edges.
The image resulting from convolution of this filter with a texture defines edges
within the texture as zero crossings in the new image (see fig. 5).
Figure 4: Intensity change in image gives rise to a peak in the first derivative and a zero
crossing in the second derivative
f_upper = √(2 ln 2) / (2π σ_upper)    (4)

f_lower = √(2 ln 2) / (2π σ_lower) = f_upper / 1.6    (5)
The BTCS can now sample the filter output, recording the co-occurrence of
zero crossings. Hence the optimal interpixel spacing, k, is given by the Nyquist
sampling theorem:
k = 1 / (2 f_upper)    (6)
Consequently the n-tuple size, n, can be defined by the maximum number
of zero crossings a single n-tuple is to represent, z_max, and the lowest frequency
at which they occur, f_lower:

n ≈ z_max / (2 k f_lower)    (7)
Hence in a controlled industrial environment, where a small discrete set of
scales will be important, the dominant scale can be chosen by setting the centre
frequency of the LoG filter. This automatically sets the above parameters
of the new texture method termed the Zero Crossing Texture Co-occurrence
Spectrum (ZCTCS).
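As a rough illustration, the Python sketch below derives the interpixel spacing k and tuple size n from the chosen LoG scale using the relations of equations (4)-(7) as given above; the constants and names are indicative only and do not come from the authors' implementation.

import math

def zctcs_parameters(sigma, z_max):
    # Derive the ZCTCS sampling parameters from the LoG scale sigma and the
    # maximum number of zero crossings a tuple should represent.
    f_upper = math.sqrt(2 * math.log(2)) / (2 * math.pi * sigma)   # eq. (4)
    f_lower = f_upper / 1.6                                        # eq. (5)
    k = 1.0 / (2 * f_upper)                                        # eq. (6)
    n = round(z_max / (2 * k * f_lower))                           # eq. (7)
    return max(1, round(k)), max(1, n)

print(zctcs_parameters(sigma=2.0, z_max=4))   # e.g. (5, 6)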
4 Texture Classification
Figure 5: Brodatz textures (top row), LoG filtered (middle), binarized LoG (bottom).
Textures (L to R): pressed cork (d4), expanded mica x3 (d5), woven aluminium wire (d6),
grass (d9), herringbone weave (d16), herringbone weave x4 (d17), beach sand x4 (d28), beach
sand (d29), water (d38), straw matting (d85)
Method    k = 1   k = 2   k = 3   dimen.   miss-classified textures
BTCS      79 %    -       -       4 x 16   d5, d6, d28, d29, d38
TUTS      97 %    -       -       6561     d17, d28, d29
GLTCS     98 %    97 %    93 %    4 x 24   d28, d29, d38
ZCTCS     100 %   100 %   97 %    4 x 16   d5, d28
ZCTCS*    95 %    100 %   92 %    16       d5, d28
* denotes cross operator
Also shown in the results is the performance of the ZCTCS with a simple
cross operator instead of the four orientational masks. As the results demon-
strate, the performance of the ZCTCS is comparable with other methods.
However, the important advantage of the method lies in which samples are mis-
classified. Any misclassifications made by the ZCTCS are exclusively associ-
ated with textures d5 and d28. Analogous to the current filter which discrim-
inates the perceptually similar d4, d9 and d29, another filter would be needed
for the d5 and d28 pair.
5 Texture Segmentation
Figure 7: Texture collage segmentation results using GLTCS and ZCTCS methods
6 Discussion
The new Zero Crossing Texture Co-occurrence Spectrum method has been
shown to extend the usefulness of n-tuple pattern recognition methods for tex-
ture recognition. Results presented for texture classification and texture seg-
mentation demonstrate how the n-tuple methods can now be tuned to certain
spatial frequencies in the texture. By tuning the n-tuple methods to spatial
frequency, the ZCTCS also addresses the previously ad hoc selection of param-
eters. The n-tuple size, n, and interpixel spacing, k, have been shown to be
related to the spatial frequency to which the Laplacian of Gaussian is tuned.
Due to the low dimensionality of the method's representation, particularly
the cross operator, several spatial frequency bands could be described at once
without degrading performance significantly. In particular, by approximating
the Laplacian of Gaussian by a Difference of Gaussians a computationally
efficient filter bank can be created using the X-Y separability of the Gaussian.
The ZCTCS thus retains the computational efficiency and simplicity of the
binary n-tuple methods making real-time texture segmentation possible.
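A minimal Python sketch of such a separable Difference-of-Gaussians approximation is given below; the kernel radius and the 1.6 scale ratio are conventional choices, and the names are illustrative rather than those of the authors' filter bank.

import numpy as np

def gaussian_kernel_1d(sigma):
    # 1-D Gaussian kernel, normalised to unit sum.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x * x / (2 * sigma * sigma))
    return g / g.sum()

def dog_filter(image, sigma, ratio=1.6):
    # Approximate the Laplacian of Gaussian by a Difference of Gaussians,
    # computed with separable 1-D convolutions along rows and then columns.
    def blur(img, s):
        k = gaussian_kernel_1d(s)
        rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
        return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)
    return blur(image, sigma) - blur(image, ratio * sigma)

img = np.random.default_rng(1).random((64, 64))
zero_crossing_map = dog_filter(img, sigma=2.0) > 0   # edges lie between the two regions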
Acknowledgments
References
The Department of Cybernetics has recently developed some simple robot insects
which can move around an environment they perceive through simple sensors. Suitable
sensors currently implemented include proximity switches, active and passive infra red
detectors, ultrasonics and a simple compound eye. This chapter describes how such
an eye linked to a simple weightless neural network can be used to give an estimate of
position within a complex environment. Such information could be used by the insect
to generate more intricate behaviours.
1 Introduction
There is much interest in the development of intelligent machines which can learn
from their environment. Various machines have been developed which have many
sensors, sometimes including high resolution video, which require a great deal of
computing power to process the information coming in to the machine. More process-
ing is then required to determine suitable action in response to this information. Re-
searchers in the Department of Cybernetics at the University of Reading believe that
it is best to start with much simpler systems. Also, we believe that there is much to
learn from the behaviour of simple organisms like insects. Therefore, a number of sim-
ple robotic insects have been developed which are small and which can operate
rapidly. The first systems to be built have few sensors which the insects use to deter-
mine a limited (though not trivial) behaviour. However, the insects were designed so
that extra sensors could be added to allow more complex behaviour.
The simple insects have two ultrasonic sensors, which enable the insect to detect
how far the nearest object is in front of a sensor, and two motors, each of which can
be set to move forwards or backwards at a given speed. The actions of the insect are
determined using a simple weightless network pre-programmed into an EEROM. A
binary pattern, corresponding to the data from the ultrasonic sensors, is passed to the
address lines of the EEROM, and the value at the addressed location specifies the
speed and direction of each motor. Thus the operation of the network defines one set of
simple insect behaviours. Different behaviours can be selected by using DIL switches,
providing extra address lines to the EEROM.
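A minimal Python sketch of this kind of look-up-table control is shown below; the sensor encoding, the behaviour rule and all sizes are invented for illustration and are not those programmed into the actual insects.

SENSOR_BITS, BEHAVIOUR_BITS = 4, 2

def program_rom():
    # Pre-compute the memory contents: the DIL-selected behaviour and the binary
    # sensor pattern form the address; the stored value is the motor command.
    rom = {}
    for behaviour in range(2 ** BEHAVIOUR_BITS):
        for sensors in range(2 ** SENSOR_BITS):
            address = (behaviour << SENSOR_BITS) | sensors
            left_near = sensors & 0b0011            # 2-bit left range reading
            right_near = (sensors >> 2) & 0b0011    # 2-bit right range reading
            rom[address] = ("left" if right_near > left_near else
                            "right" if left_near > right_near else "forward")
    return rom

rom = program_rom()
print(rom[(0 << SENSOR_BITS) | 0b0110])   # behaviour 0, sensor pattern 0110 -> "right"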
The ultrasonic sensors are controlled by a programmable logic array, PLA, and
associated analog electronics. This PLA first causes an ultrasonic signal to be emitted
by the transmitters for both eyes and then it waits for a signal to be reflected back from
an obstacle to either receiver. The time taken before the reflection comes back (assum-
ing one does return) indicates the distance of the obstacle away from the nearest eye.
The time taken is determined by the PLA.
An extra microprocessor circuit has been added to the insects which implements
a Hopfield type neural network. This circuit allows the insect to learn suitable behav-
iour, for example to move around an environment avoiding obstacles. This it achieves
despite, or perhaps because of, its very simple sensory system. Therefore it was decid-
ed that a more advanced sensor was needed and the type of sensor chosen was deter-
mined by the fact that insects were being modelled.
Given the complex behaviours prevalent in the insect world, an investigation was
made into their visual organs with the intention of building an electronic analogue. It
was noted that many insects possess a pair of compound eyes, bulging out of the head
with a wide field of vision. The eyes cannot move, have a short visual range and are
of fixed focus. Each compound eye may contain upwards of 10000 simple pho-
toreceptive cells, each with its own lens. Most insect eyes respond to light between
wavelengths of 300 nm and 600 nm (orange), with near ultraviolet light being the most
significant region for phototaxis (the insect's visually related behaviours), with signals
from the compound eye being directly related to such behaviours.
To investigate how useful a simple vision system might be to the Cybernetic in-
sects, a small compound eye was built (see figure 1). A typical output image (normal-
ised to the interval [0..255]) is shown in figure 2.
The eye consists of a three dimensional array of fifteen light dependent resistors
(LDRs) mounted on the top fifteen faces of a truncated icosahedron. The input circuit-
ry produces an output voltage dependent on the light falling on any one of the LDRs.
An A/D converter gives a digital output of this voltage whilst an analogue multiplexer
selects which LDR is currently being measured.
The half angle of the LDRs used is 120°, which gives good coverage but greater
than optimal overlap between sensors. The discrimination of the eye could be im-
proved by using sensors with a 60° half angle. The summed response of all sensors
versus angle in the horizontal plane, along with typical responses of two adjacent sen-
sors are shown in figure 3.
Figure 3a shows the response of the sensors used (120° half angle), figure 3b
gives the response for a 60° half angle whilst figure 3c is for a 30° half angle. The ar-
rows on figures 3a and 3b show the increased discrimination of the 60° sensor com-
pared to the 120°. The summed response in figure 3c shows that reducing the receptive
field of the sensors too much results in uneven coverage and hence increased sensitivity.
The output from the compound eye consists of an array of fifteen values digitised to
the interval [0..255]. Attempts to use a standard weightless network with thresholding
proved unsuccessful, so the matrix was sampled by a large array of Minchinton Cells
whose output formed the retina of 1350 binary values presented to a 150 RAM multi-
discriminator network.
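The Python sketch below shows one common form of Minchinton cell sampling, in which each cell outputs 1 when one randomly chosen analogue input exceeds another; the pair selection and retina size are illustrative and may differ from the arrangement actually used here.

import numpy as np

def minchinton_retina(eye_values, n_cells, rng):
    # Each cell compares a randomly chosen pair of analogue inputs and outputs
    # 1 if the first exceeds the second, producing a binary retina.
    values = np.asarray(eye_values)
    a = rng.integers(0, len(values), size=n_cells)
    b = rng.integers(0, len(values), size=n_cells)
    return (values[a] > values[b]).astype(np.uint8)

rng = np.random.default_rng(2)
eye = rng.integers(0, 256, size=15)                    # fifteen LDR readings in [0..255]
retina = minchinton_retina(eye, n_cells=1350, rng=rng)
print(retina.sum(), "of", retina.size, "bits set")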
To quantise an environment to one of n positions, n discriminators were used,
with each discriminator being taught to respond strongly to the light patterns charac-
teristic of a particular position within a 4x2m section of the lab. First experiments used
sixteen discriminators in a [4x4] array, spaced at 0.5m intervals along the x-axis and
1m intervals along the y-axis. To achieve rotational invariance, each discriminator
was taught at five rotations of 72° around this point (see figure 4).
A tuple size of nine was selected to give a sharp discriminator response. Each dis-
criminator was trained with eleven sets of data collected at random intervals through-
out a week. No attempt was made to normalise laboratory lighting, window blind
positions or other activity in the lab. Hence the total number of patterns each discrim-
inator was taught (after training rotations) was 55 (11x5), at which point average dis-
criminator saturation was 7%.
This process was then repeated using the same data, but this time training four
discriminators in a [2x2] regular array, spaced at 1m intervals along the x-axis and 2m
intervals along the y-axis.
4 Results
The post training responses for an array of [4x4] and [2x2] discriminators are given
in figures 5a and 5b. Each graph maps a discriminator response across the entire train-
ing area. Light areas correspond to regions of high response, dark areas, low response.
Thus in figure 5a, where the experimental space is quantised over four regularly
spaced discriminators (in the positions shown in figure 4a), the discriminator trained
in the region at the top left of the area (Disc23) responds most strongly when in that
location and its response decreases monotonically with distance from the training po-
sition. Similarly Disc22 responds best in the top right position, Disc21 in the bottom
left and Disc20 in the bottom right. Comparable results (figure 5b) were observed for
the sixteen discriminator system shown in figure 4b.
5 Discussion
The results presented here have demonstrated that a simple compound eye can be used
to resolve position in a complex environment. The initial experiments described here-
in have shown that for a given size of retina, there is a limit to the large scale spatial
resolution that can be achieved by the system. In our trials, the Relative Confidence
values for the four discriminator system were consistently higher than those of the six-
teen discriminator system. This is because the pattern on the binary retina does not
change much over a range of 0.5m, hence there is little useful information over which
the discriminators can differentiate. This is due to the fact that:
a. The pattern of light falling on the eye does not change much over a small
range.
b. An LDR half angle of 120° is not optimal for discrimination (see figure 3).
seen data. Thus an important constraint in the design of an insect eye must be the spa-
tial range over which it is required to operate.
The results obtained from the experiments reported here have led to the design of
an improved eye with 32 light sensitive cells, each with a receptive field of 60°. This
new eye will be used by the Cybernetic insects to guide themselves to the approximate
location of an infra-red beacon (analogous to a food scent) from anywhere within the
lab. Signals from the beacon then guide the insect to link up to a power supply and
hence allow it to feed itself autonomously when it gets hungry!
References
1. R.J.Mitchell, D.A.Keating and C.Kambhampati, 1994, Proc. Control 94, pp: 492-
497.
2. R.J.Mitchell, D.A.Keating and C.Kambhampati, 1994, Proc. EURISCON 94, pp:
78-85.
3. J.M.Bishop, P.R.Minchinton & R.J.Mitchell, 1991, Proc IMechE Conf, Euro-
Tech Direct 91, pp: 187-199.
EXTRACTING DIRECTIONAL INFORMATION FOR THE RECOGNITION
OF FINGERPRINTS BY pRAM NETWORKS
T. G. CLARKSON, Y. DING
Department of Electronic & Electrical Engineering, King's College London,
Strand, London WC2R 2LS, U.K.
The directional image is calculated from the original grey level fingerprint image and
represents the local orientation of the ridge or valley of the fingerprint. To extract the
directional image, the original grey level image G (size M x M) is divided into several
small blocks (of size m x m). Obviously the number of blocks is M/m x M/m. In our
experiment M is 512 and m is selected as 16 because the average ridge thickness is
around eight pixels in a standard NIST fingerprint image captured at a resolution of
512 dpi. A block size of 16 pixels will contain at least one ridge or one valley and
hence has a characteristic direction. The orientation D(i, j) of each block centred at
(i, j) is the direction for which the sum of differences in grey values along a direction
n is a minimum. That is:
D(i, j) = min_n Σ_{k=1}^{L} | Gn(ik, jk) − mn |;   n = 0, 1, ..., N    (1)
where Gn(ik, jk); k = 1, 2, ..., L indicates the grey values of consecutive points in one of
the N+1 directions (N = 7) as shown in Figure 1. L is the size of the area (in pixels)
over which the directional image is measured and here L = 17. Note that L is an odd
number to keep the neighbouring points symmetric about the central point (i, j). mn is
the mean grey value of pixels along each direction. Therefore from the grey level im-
age G (512 x 512), a directional image D (32 x 32) is obtained by finding the predom-
inant direction in each block of pixels.
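A Python sketch of this directional-image calculation is given below. The exact sampling pattern of Figure 1 is not reproduced here, so the L points along each direction are simply rounded positions on a line through the block centre; treat it as an approximation of equation (1) rather than the authors' implementation.

import numpy as np

def directional_image(grey, m=16, L=17, n_dirs=8):
    # For each m x m block, sample L pixels along each of n_dirs directions
    # through the block centre and pick the direction whose grey values deviate
    # least from their mean, as in equation (1).
    M = grey.shape[0]
    D = np.zeros((M // m, M // m), dtype=int)
    offsets = np.arange(L) - L // 2
    for bi in range(M // m):
        for bj in range(M // m):
            ci, cj = bi * m + m // 2, bj * m + m // 2
            costs = []
            for d in range(n_dirs):
                theta = d * np.pi / n_dirs
                ii = np.clip(np.round(ci + offsets * np.sin(theta)).astype(int), 0, M - 1)
                jj = np.clip(np.round(cj + offsets * np.cos(theta)).astype(int), 0, M - 1)
                vals = grey[ii, jj]
                costs.append(np.abs(vals - vals.mean()).sum())
            D[bi, bj] = int(np.argmin(costs))
    return D

grey = np.random.default_rng(3).integers(0, 256, size=(512, 512)).astype(float)
print(directional_image(grey).shape)   # (32, 32)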
The grey image is first converted into a binary image. A dynamic thresholding algorithm
is designed to extract the binary image of the fingerprint, using the median value of a
mask of size 17 x 17 as the threshold. Thresholding is performed by moving the mask
over the image, and at each new position the new threshold value is determined for the
pixel at the centre of the mask. In this way the blurring problem caused by a non-
uniform background or different contrasts is avoided.
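The thresholding step might be sketched in Python as follows; this is a direct (and deliberately slow) implementation of the moving-window median rule described above, with illustrative names.

import numpy as np

def dynamic_threshold(grey, window=17):
    # Binarise the image by comparing each pixel with the median of the
    # window x window neighbourhood centred on it.
    h, w = grey.shape
    r = window // 2
    padded = np.pad(grey, r, mode="edge")
    binary = np.zeros((h, w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            binary[i, j] = 1 if grey[i, j] > np.median(padded[i:i + window, j:j + window]) else 0
    return binary

fp = np.random.default_rng(8).integers(0, 256, size=(64, 64))
print(dynamic_threshold(fp).mean())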
A two-layer pyramidal neural net was designed and trained as a classifier to im-
plement a local valley and ridge orientation identifier for each of the eight directions
in Figure 1. Note that ridges and valleys have the same significance in determining the
directional image. The top two patterns in Figure 2 represent features which fall ex-
actly along the axis sampled by the inputs of an identifier net. As shown in Figure 3,
the first layer of one identifier net contains two 8-pRAM neurons; the output layer is
a 2-pRAM neuron whose firing frequency is in the range of 0 to 1. Some standard pat-
terns which are vectors in the {0,1} hyperspace, are used to train the pRAM net to
output the corresponding code as indicated in Figure 2.
3 Training Algorithm
There are several training algorithms possible for pRAM neural nets, but we normally
employ reinforcement training with noise enhancement1,2. In reinforcement training,
an individual node only receives information about the quality of the performance of
the network as a whole, and nodes have to discover for themselves how to change their
behaviour so as to improve their performance3. The pRAM version of reinforcement
algorithm is achieved by the update rule:

Δα_u = ρ [ (a − α_u) r(E) + λ ((1 − a) − α_u) p(E) ] δ_{u,u(t)}

where
ρ - reward rate ∈ [0,1] (typically 0.1),
λ - reward to penalty ratio ∈ [0,1] (typically 0.05),
a - output of the pRAM ∈ {0,1},
α_u - content of the memory location addressed by u,
E - error in the network's output signal,
r(E) - reward signal ∈ [0,1],
p(E) - penalty signal ∈ [0,1].
The Kronecker delta δ_{u,u(t)} ensures that the update only occurs at the location
which was accessed at time t. The above equation produces a new connection weight
which is still in the normalised range [0, 1], i.e., α_u^new ∈ [0, 1].
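A Python sketch of a single update under this rule, as reconstructed above, is given below; the parameter values follow the typical settings listed, and the function name is illustrative.

def pram_reinforcement_update(alpha, u, a, r, p, rho=0.1, lam=0.05):
    # Update only the memory location u addressed at this time step: pull it
    # towards the output a when rewarded and towards 1 - a when penalised.
    # The result remains in the normalised range [0, 1].
    alpha = list(alpha)
    alpha[u] += rho * ((a - alpha[u]) * r + lam * ((1 - a) - alpha[u]) * p)
    return alpha

memory = [0.5, 0.5, 0.5, 0.5]                       # a 2-input pRAM has 4 locations
memory = pram_reinforcement_update(memory, u=2, a=1, r=1.0, p=0.0)
print(memory)   # location 2 moves towards 1, the others are unchanged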
Given a training pattern I, if the output vector falls into the hypersphere with
Hamming distance D from its centre vc, where vc has the maximal addressing proba-
bility from pattern I, then the reward signal is given to the network as a whole; other-
wise the penalty signal is given.
It has been demonstrated that the generalisation ability of pRAM nets is greatly
increased by injecting random noise into the training patterns, and optimum basins
of attraction are formed. As the noise is injected, the memory contents of all nearest
neighbour vectors of any given pattern are modified so that they are more likely to fire
or not to fire for that pattern or a similar one, which is how the generalisation behav-
iour is achieved. A training program in which the noise level starts from zero and is
increased in steps of 5% was used. During training, the noise is increased until the
success rate for any pattern falls below 80% so that maximum generalisation is ob-
tained.
The firing rate of the neuron in the output layer is employed to judge the local finger-
print valley or ridge orientation in a 17 x 17 window. As shown in Figure 1, the binary
value of pixels in each of the eight directions, excluding the central pixel value, forms a 16-
dimensional vector which is input into the net. The firing rate of the vector in each di-
rection is denoted Fr(i), i = 0, 1, ..., 7, where i indicates the direction as in Figure 1.
Suppose the two largest values of Fr(i) are Fr(t) and Fr(r) respectively, with Fr(t) > Fr(r).
dir denotes the orientation in the block; dir is set to t if Fr(t) − Fr(r) > 0.15 and Fr(t) > 0.70. That
means the firing rate in direction t is much greater than that in any other direction and
there is sufficient confidence to judge t as the direction of the local valley or ridge.
Otherwise the directional vectors in the eight neighbouring windows are estimated and
summed for each direction and dir is assigned the orientation given by the maximum
of the eight summed responses.
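The decision rule can be sketched in Python as follows; the thresholds are those quoted above, while the treatment of the neighbouring windows is simplified to a straight sum of their firing-rate vectors.

import numpy as np

def decide_direction(fr, neighbour_frs, gap=0.15, floor=0.70):
    # Accept the strongest of the eight directions when it is confident enough,
    # otherwise fall back on the summed responses of the neighbouring windows.
    fr = np.asarray(fr, dtype=float)                      # shape (8,)
    order = np.argsort(fr)[::-1]
    t, r = order[0], order[1]
    if fr[t] - fr[r] > gap and fr[t] > floor:
        return int(t)
    summed = np.sum(np.asarray(neighbour_frs, dtype=float), axis=0)
    return int(np.argmax(summed))

fr = [0.2, 0.9, 0.6, 0.1, 0.1, 0.2, 0.3, 0.1]
print(decide_direction(fr, neighbour_frs=np.zeros((8, 8))))   # 1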
Any fingerprint recognition system must start with identifying a pair of fingerprint im-
ages. The system outputs an estimated value of the probability that the two images
originate from the same finger. However we must first decide the matching RoI of a
pair of fingerprint images, in which the effective features indicating the difference be-
tween the two fingerprints are extracted and input to the pRAM classifier. For a pair
of grey level images (we call the first fingerprint image the target image and the sec-
ond the test image), the directional images of size 32 x 32 are calculated
by the above method. Then the 15 x 15 central region of the target directional image,
denoted D1, is used to search for the same size optimum matching region in the test
directional image, denoted D2, within a larger 24 x 24 patch by estimating the
matching value ζ pixel by pixel. ζ can be estimated by the following three methods:
We define ζ to be the correlation coefficient between the two directional images and
decide the area giving the best match by finding the maximum value of ζ:
ζ = [ Σ_{j=0}^{14} Σ_{i=0}^{14} D1(8+i, 8+j) · D2(3+I0+i, 3+J0+j) ] /
    √[ ( Σ_{j=0}^{14} Σ_{i=0}^{14} D1(8+i, 8+j)² ) · ( Σ_{j=0}^{14} Σ_{i=0}^{14} D2(3+I0+i, 3+J0+j)² ) ]    (3)
where D1(i, j) and D2(i, j) are the orientation values of pixels in D1 and D2 at address
(i, j) respectively, and ζ is between 0 and 1. The bigger the value of ζ, the more sim-
ilar the RoI given by the top left vertex (3+I0, 3+J0) in D2 is to the central RoI decided
by the top left vertex (8, 8) in D1, and vice versa.
We define ζ as the sum of the orientation differences between the corresponding pixels in
the RoI of the two directional images and decide the area of best match with respect
to the top left vertex at (3+I0, 3+J0) in D2 by finding the minimum value of ζ:
ζ = Σ_{j=0}^{14} Σ_{i=0}^{14} [ D1(8+i, 8+j) − D2(3+I0+i, 3+J0+j) ]²    (4)

I0 = 0, 1, ..., 10;   J0 = 0, 1, ..., 10    (5)
The 15 x 15 pixel RoIs in both D1 (with the top left vertex at (8, 8)) and D2 (with the
top left vertex at (3+I0, 3+J0)) are divided into nine equal size squares in which the
directional histograms H_j^k(i); i = 0, ..., 7; j = 0, 1, ..., 8; k = 1, 2 are generated, where i
denotes the eight directions, j indicates the nine areas and k indicates the histogram in
D1 or D2.
We define

ζ = Σ_{j=0}^{8} Σ_{i=0}^{7} [ H_j^1(i) − H_j^2(i) ]²    (6)

and the best matching areas are decided by finding the minimum value of ζ.
Our experiments have shown that the Directional Histogram method is the opti-
mum approach, both by visual inspection of the matching RoI and by the final
recognition result. One example is shown in Figure 4, where the central directional
RoI in D1 (the top directional image) is used to find the best matching region in D2
(the bottom directional image). Sometimes the matching value of non-matching pairs
is lower than that of matching pairs (by the Directional Histogram method). Therefore
non-linear recognition by a neural network algorithm is necessary.
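A Python sketch of this histogram-based search is given below; it follows the spirit of the Directional Histogram method (nine 5 x 5 sub-squares, offsets I0 and J0 from 0 to 10) but uses 0-based indexing and illustrative names.

import numpy as np

def directional_histograms(roi, n_dirs=8):
    # Nine direction histograms, one per 5 x 5 sub-square of a 15 x 15 RoI.
    hists = []
    for bi in range(3):
        for bj in range(3):
            block = roi[5 * bi:5 * bi + 5, 5 * bj:5 * bj + 5]
            hists.append(np.bincount(block.ravel(), minlength=n_dirs))
    return np.array(hists)                      # shape (9, n_dirs)

def best_match(D1, D2):
    # Slide the central 15 x 15 RoI of D1 over the 24 x 24 search patch of D2
    # and return the offset minimising the histogram difference.
    target = directional_histograms(D1[8:23, 8:23])
    best, best_cost = None, np.inf
    for I0 in range(11):
        for J0 in range(11):
            cand = directional_histograms(D2[3 + I0:18 + I0, 3 + J0:18 + J0])
            cost = np.sum((target - cand) ** 2)
            if cost < best_cost:
                best, best_cost = (I0, J0), cost
    return best, best_cost

rng = np.random.default_rng(4)
print(best_match(rng.integers(0, 8, size=(32, 32)), rng.integers(0, 8, size=(32, 32))))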
Feature selection is not a trivial matter. Classification is often more accurate when the
pattern is simplified, represented by important features or properties. In our feature
detector, only the variability of the orientation difference statistics is considered. As
shown in Figure 4, the RoIs in both the target and tested directional images are divided
into nine equal size areas in which the histograms
H_j^k(i); i = 0, 1, ..., 7; j = 0, 1, ..., 8; k = 1, 2 are generated. So the 9-dimensional
feature vector is defined as follows:
f(j) = F(j)/a;   F(j) = Σ_{i=0}^{7} [ H_j^1(i) − H_j^2(i) ]²;   j = 0, 1, ..., 8    (7)
where a = max_j F(j)/15 scales by the maximum value of F(j); j = 0, 1, ..., 8. Then the input fea-
tures f(j) are converted into 4-bit binary numbers in proportion to their value. There-
fore the total input for a sample to the net is a 36-bit vector in the {0,1}
hyperspace. It has been found that using features based on the difference of histograms
in a pair of directional images makes the algorithm robust, noise tolerant and compu-
tationally efficient.
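The feature computation of equation (7) and the 4-bit encoding might be sketched in Python as follows; the treatment of the scaling constant a follows the reading given above.

import numpy as np

def histogram_features(H1, H2):
    # Build the 9-dimensional feature vector from the two sets of directional
    # histograms (shape (9, 8)) and encode each value as 4 bits.
    F = np.sum((np.asarray(H1, float) - np.asarray(H2, float)) ** 2, axis=1)
    a = F.max() / 15.0 if F.max() > 0 else 1.0
    f = np.round(F / a).astype(int)                        # values in 0..15
    bits = [int(b) for v in f for b in format(v, "04b")]
    return np.array(bits, dtype=np.uint8)                  # 36-bit input vector

rng = np.random.default_rng(5)
print(histogram_features(rng.integers(0, 10, size=(9, 8)),
                         rng.integers(0, 10, size=(9, 8))).size)   # 36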
6.2 Classifier design
P = Σ_i [ p(o_i) o_i + p(ō_i) ō_i ]    (8)

where p(o_i) is the firing rate of the ith neuron in the output layer of the neural network.
48 target images were used in this test to search the matching pairs in a database which
includes 400 samples. The order of the matching values (from the largest to the small-
est) of the desired pairs are reported in Fig. 5. We found that the matching values of
correctly-classified pairs were always listed in the ten best matches and in 68.7% of
cases, the P value of a correct pair was the highest output. This result has value in prac-
tical applications as it greatly decreases the amount of manual inspection required
while maintaining a high accuracy rate. This demonstrates that the neuron firing rate
is an effective way to estimate the similarity of the patterns. In the practical system only
the central (24 x 24) directional images need to be stored in the database so the costs
in terms of both memory space and time are greatly reduced. To reduce the error rate
in identifying matching pairs, one possible way is to input another 9-dimensional vector
which is extracted by a similar approach but uses a pair of RoIs which possess mirror
symmetry to the original pair of RoIs. Suppose the original (15 x 15) RoI pair is ref-
erenced to the top left vertex (8, 8) in D1 and (I, J) in D2; the mirror-symmetric RoI pair
can then be determined by the top left vertex (8 + (8 − I), 8 + (8 − J)) in D1 and (8, 8) in
D2. In this way, the binary input will be a vector of twice the length in the {0,1} hyperspace. However, the resulting
neural nets will be more complicated. Research continues into increasing the number
of features or selecting new features and making the system robust by testing distorted
fingerprint images in a larger database.
References
1. Y Guan, T G Clarkson, D Gorse and J G Taylor, Noisy reinforcement training for
pRAM nets, Neural Networks, 7, 1994, 523-538.
2. T G Clarkson, Y Guan, D Gorse and J G Taylor, Generalisation in probabilistic
RAM nets, IEEE Transactions on Neural Networks, 4, 2, 1993, 360-364.
3. Y Ding, Y Guan, T G Clarkson and R P Clark, The Back Thermal Symmetry
The detection of spatial and temporal relations in a scene is best illustrated by consid-
ering a mobile robot which can change the visual image it receives by moving the posi-
tion of its visual sensor. The problem is then to show how such a robot can utilise its
actions, moving the sensor, to get proper indexing of the visual information and so en-
code spatial relationships linguistically. The preliminary problem addressed in this
chapter is a simpler one. The robot's visual sensor is directed to specific locations in a
two-dimensional scene where various objects are located. For each object, a linguistic
label is provided, describing the class of the object in question. The robot's task is to
leam to search in the scene for objects by name.A novel two-phase configuration of a
MAGNUS (Multi-Automata of General Neural UnitS) weightless neural system is used
to carry out the investigation. A training procedure which enables the network to per-
form the given task optimally is presented.
The MAGNUS is a general weightless neural system used in areas such as language
and two-dimensional representations of three-dimensional scenes. It is a learning and
generalising device in which the training set forms attractors or attractor trajectories
in state space. It is able, for example, to relate language-like symbol strings to internal
iconic representations and respond to environmental input with appropriate actions. 1
This capability of internal representation of informational structures distinguishes
MAGNUS from earlier learning devices 2 which merely performed pattern recogni-
tion.
The neural processing elements of the system are Generalising Random Access
Memories (G-RAMs). A G-RAM accepts binary inputs and produces a binary output.
During training, input-output associations are recorded. Generalisation in the G-RAM
is provided by some spreading procedure. 4 During recall, the G-RAM uses the stored
t Current address: Anite Systems Ltd, Finance Division, Gavrelle House, 2-14,
Bunhill Row, London EC1 8LP, UK
information to produce outputs, not only for a trained input pattern but also for other
input patterns that are closer to it, in Hamming distance, than to any other training pat-
tern. Input patterns which are equidistant from training patterns and which require dif-
ferent outputs will produce an output 1 or 0 with equal probability.
The core structure of the MAGNUS is that of a neural state machine (NSM),
with well defined sensory inputs. In this chapter, two types of sensory inputs are used,
visual and "linguistic". The training of the network is iconic 5, enabling the creation
of internal states that are directly linked to sensory experience. The NSM structure
provides the system with the ability to create an internal state structure which is rep-
resentative of the system's environmental experience.
MAGNUS interacts with TableTop, a near-real-world model consisting of a two-
dimensional projection of a three-dimensional scene, containing cups, glasses, mugs
and plates. The resulting image (which, in the present case, has dimensions 512x512)
has a wrap-around topology, eliminating the problem of borders. Visual sensory input
to MAGNUS is provided by a central retinal window (RW), acting on the visual scene.
Functions performing panning (x), scrolling (y) and zooming (z) 1 of the RW are im-
plemented in the TableTop model. Fig. 1 shows the structures of the MAGNUS sys-
tem and the TableTop near-real-world model.
The MAGNUS system is characterised by a multi-field neuronal structure. Five
such fields can be distinguished, corresponding to the parts of the network coding for
the visual (Fv) and linguistic (FL) internal representations, and for the 3 motor outputs
of the network (Fx, FY and Fz). The latter determine the position and size of the RW
at the next moment in time. A thermometer coding scheme is used to transform the
outputs of Fx, Fy and Fz into integer RW co-ordinates x, y and z.
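A minimal Python sketch of thermometer coding and decoding is shown below; the field size is illustrative.

import numpy as np

def thermometer_encode(value, n_bits):
    # The first `value` bits are set; the remainder are zero.
    code = np.zeros(n_bits, dtype=np.uint8)
    code[:value] = 1
    return code

def thermometer_decode(code):
    # Recover an integer co-ordinate by counting the set bits; small amounts of
    # noise in the code perturb the result only slightly.
    return int(np.sum(code))

print(thermometer_decode(thermometer_encode(200, n_bits=512)))   # 200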
Two types of feed-back exist in the network: an internal feed-back, between the
neural field outputs and the network inputs, and an external feed-back, between the mo-
tor outputs of the network and the RW. This second level of feed-back characterises

ing to that of an arbitrary object of class Cc. The visual input to the network corre-
sponding to object j of class Cc is denoted Ijc. The patterns coding the co-ordinates
(x, y, z) of object j of class Cc are denoted Xjc, Yjc and Zjc, respectively.
Figure 1: The structures of the MAGNUS system and the TableTop near-real-world model.
3 Network Training
All network fields are trained by storing input-output associations in their G-RAMs.
In the case of a field with partial feed-back from its own output, any training associa-
tion can be conveniently expressed as a triplet {E, P, S}, where E represents the exter-
nal input, P the previous state and S the current state. In the case of stable state
creation, P and S are identical and such a training association is denoted {E, (S)}. Fur-
thermore, if the training is iconic, E and S are identical and the resulting association is
denoted [S].
The training is divided into 3 stages. During the first stage, each object of each
class, in association with the corresponding linguistic input, is trained iconically so as
to form a stable state representation at the output of fields Fv and FL:
Fv FL :  ∀j, ∀c :  {[(Ijc, Lc)]}    (1)
In the second stage, each object name is trained iconically so as to form a stable
state at the output of FL, when the visual input represents an object belonging to a dif-
ferent class. This stage enables the network to switch its internal representations be-
tween object classes:
FL :  ∀j, c, d | c ≠ d :  {(Ijc, [Ld])}    (2)
During the third stage, the previously trained internal representations are associ-
ated with thermometer codes of co-ordinates of objects belonging to the same class,
through the training of the fields Fx, FY and Fz:
Fx FY Fz :  ∀c, j, k :  {(Ijc, Xkc, Ykc, Zkc)}    (3)
4 Recall
During a network run or recall, a linguistic input Lc is presented to the network and
remains unchanged during the duration of the run. Instead of recalling the outputs of
all fields synchronously, fields Fy and FL are recalled a number of times first. This is
followed by a number of recalls of fields Fx, Fy and Fz. Before the first recall of fields
Fx, Fy and Fz, noise is added to their feed-back inputs by setting the X, Y and Z states
to thermometer codes representing random co-ordinates. Finally, after the recalls of
fields Fx, Fy and Fz, a move of the RW is performed.
The asynchronous updates of the network fields enable greater flexibility in
the specification of the network dynamics than in the case of a fully synchronous sys-
tem, yielding an improved state structure. In the present experiment, fields Fv and FL
are updated five times. This is followed by five updates of fields Fx, FY and Fz. This
scheme allows the visual and linguistic internal states to settle into stable states before
they are seen by fields Fx, Fy and Fz. Similarly, the states X, Y and Z are allowed to
settle into stable ones before an actual retinal move is operated.
Fig. 2 shows characteristic inputs and internal states during a run of the network.
The network contains 5872 64-input G-RAMs: fields Fv and FL contain each 2304
G-RAMs; fields Fx, FY and Fz contain 512, 512 and 240 G-RAMs, respectively. The
initial internal states Sv and SL are chosen to be random. The word "CUP" is present-
ed at the linguistic input terminal. After several recalls of fields Fv and FL, the states
Sv and SL have settled to the visual and linguistic internal representations of a cup. This is
followed by several recalls of fields Fx, FY and Fz which enable the states X, Y and
Z to settle to thermometer codes representing the location of a randomly chosen cup
in TableTop. This location is not necessarily that of the cup which produced the cur-
rent state Sy. This is a consequence of training Eq. (3) and of the initial random ther-
mometer codes X, Y and Z.
Figure 2: (a) Initial visual and linguistic inputs. States Sv and SL, (b) before any recall; and after (c) one
and (d) five recalls of the fields Fv and FL. States X, Y and Z after (e) a random move, (f) one and (g)
five recalls of fields Fx, Fy and Fz. (h) Final visual input.
Fig. 3 shows characteristic inputs and internal states during a subsequent run of the net-
work. The word "CUP" is replaced by the word "PLATE" at the linguistic input ter-
minal. After several recalls of fields Fv and FL, the states Sv and SL have settled to
the visual and linguistic internal representations of a plate. This is followed by several
recalls of fields Fx, FY and Fz which enable the states X, Y and Z to settle to thermom-
eter codes representing the location of a randomly chosen plate in TableTop.
Figure 3: (a) Initial retinal and linguistic inputs. States Sv and 5^, (b) before any recall; and after (c) one
and (d) two recalls of the fields Fv and FL
5 Conclusions
Acknowledgements
This research was supported by the UK Engineering and Physical Sciences Research
Council under grant no. GR/J15032. The authors would like to thank the other mem-
bers of the MAGNUS project: Igor Aleksander, Thomas Clarke, Richard Evans, Nick
Sales and Manissa Wilson.
References
A. C. P. L. F. DE CARVALHO
Computing Department, University of Sao Paulo at Sao Carlos,
Sao Carlos, SP CP 668, CEP 13560-970, Brazil
M. C. FAIRHURST, D. L. BISSET
Electronic Engineering Laboratories, University of Kent at Canterbury,
Canterbury, Kent, CT2 7NT, England
This chapter describes and evaluates a completely integrated Boolean neural net-
work architecture, where a self-organising Boolean neural network (SOFT) is used
as a front-end processor to a feedforward Boolean neural network based on goal-
seeking principles (GSNf). It discusses the advantages of the integrated
SOFT-GSNf over GSNf by showing its increased effectiveness in the classification
of postcode numerals extracted from mail envelopes.
1 Introduction
GSN' uses goal-seeking neurons (GSNs), which are similar to Random Access
Memory devices. Each node (the term node will be used instead of neuron
throughout this chapter) uses the input values presented to its input terminals
to access one or more of its 2^N memory contents, where N is the number of
input terminals.
GSN was designed to overcome many of the problems found in other
Boolean models, maximising the efficiency of storage of values in its memory,
storing new information without disrupting previously absorbed information,
and employing one-shot (single-pass) learning. The GSN can accept, store and
generate values of 0, 1 and u, where u represents an "undefined" value. If there
is at least one undefined value on the input terminals of a GSN unit then a set
of memory contents, rather than a single storage location, will be addressed,
and this is a principal factor in the efficient distribution of stored information
across a trained network 10 .
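The addressing behaviour described above can be sketched in Python as follows, where an input vector containing undefined values 'u' addresses the whole set of locations obtained by expanding each 'u' to both 0 and 1; the bit ordering is illustrative.

from itertools import product

def addressed_locations(inputs):
    # A single location is addressed when all inputs are 0/1; every undefined
    # input doubles the set of addressed locations.
    choices = [(0, 1) if v == "u" else (int(v),) for v in inputs]
    return [sum(bit << k for k, bit in enumerate(combo)) for combo in product(*choices)]

print(addressed_locations([1, 0, 1]))        # one location
print(addressed_locations([1, "u", "u"]))    # four locations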
GSN nodes have been employed in three different architectures, each one
the training set are situated near to the boundary. The boundary is defined by
the addresses whose Hamming distances to both attractors are similar. In order
to reduce the influence of the initial model structure in the modelling of the
input distribution, the SOFT d structure can be dynamically redefined during
its training by eliminating and replicating nodes, according to the function
they perform.
Figure 4 shows an example of how the attractors might divide a particular
input space into two halves if connectivity 3 were used. A white circle is drawn
for the input patterns that are nearer, in Hamming distance, to the attractor 0,
and a black circle is drawn to represent the input patterns nearer, in Hamming
distance, to the attractor 1. The larger the circle, the higher the frequency
with which that particular input pattern occurs in the training set.
Immediately after the attractors for a particular node have been chosen,
the value to be stored in each of its memory contents will be defined through
competition between the attractors. This competition will determine which
would be the node's most "natural" output. Once the architecture has been
trained it can then be used to extract primitive features.
A good measure of the extent to which the pre-processed inputs generated
by SOFT d are likely to be more readily classifiable than the original input
patterns, is provided by considering how SOFT d affects the similarity, in terms
of proportion of pixels with the same value, between patterns from the same
class and between patterns distributed in different classes. The ideal situation
is one where the similarity in the same class is kept high while the similarity
among different classes is low.
Experimental evaluation shows that, using Hamming distance as a mea-
sure, SOFT d achieves a useful data compression, reducing the pattern repre-
sentations from 384 pixels to 96 pixels and increasing the average similarity
between patterns from the same class and the average differences between
patterns belonging to different classes, as can be seen from Table 1 and Ta-
ble 2. While Table 1 illustrates the average similarities among patterns from
the same class, Table 2 shows the similarities among patterns belonging to
different classes.
According to the figures shown in Table 1 and Table 2, SOFT d reduces the
average similarity between patterns belonging to different classes and increases
the average similarity between patterns from the same class. This should result
in an improvement of the discrimination and generalization abilities of the
Boolean classifier connected to SOFT d . The following section illustrates the
integration of SOFT d and GSN' in a fully integrated Boolean architecture.
4 Integrated Architecture
The design of modular neural architectures involves the division of the task into
subtasks, the assignment of a neural module to each of the subtasks and the
communication among the modules.
As stated in 13, most of the approaches taken to implement modular neural
networks are based on the divide and conquer philosophy, where the task to
be solved is first divided into simpler, smaller tasks, which are then handled by
different modules.
Regarding the communication among the modules, one of the main pre-
requisites for a modular architecture is the compatibility of its internal inputs
and outputs. In the SOFT d -GSN^ modular architecture, this condition is sat-
isfied, since the SOFT d output is directly compatible with the GSN' input. In
the integrated architecture described in this chapter, the input image is first
pre-processed by an already trained SOFT d to be then directly used, without
further pre-processing, as input by GSN'.
SOFT d and GSN-^ are integrated by connecting the input terminals of
GSN' first layer nodes to the output terminals of SOFT d last layer nodes,
making them work like a single entity. Figure 5 illustrates the integrated
architecture.
The training of the integrated architecture comprises two stages. In the
first stage SOFT d is trained using the original training set images as inputs.
The second stage involves the training of GSN' using as input the output
provided by the propagation of the original input image through SOFT d , using
SOFT d 's recall phase. A pre-trained SOFT d could also be linked to GSN'. The
recall phase of the integrated architecture is carried out by propagating the
original input image first through SOFT d and then through GSN' using their
respective recall functions.
To evaluate the contribution of SOFT d to the recognition
performance achieved by GSN', both the basic GSN' architecture and the
modular architecture SOFT d -GSN' were trained to recognize numeral char-
acters extracted from mail envelopes. In both cases, each GSN' network uses
three layers of nodes, where each node has connectivity 4.
The GSN' networks were trained and tested with the original data and
with the pre-processed data. In order to estimate, on a comparative basis, the
effectiveness of SOFT d pre-processing, the classification accuracy was mea-
sured after training the networks with three different learning algorithms pre-
viously used with the GSN-f classifier, which is to be used as the heart of
the proposed modular architecture. These algorithms are the Conventional
eaz 14oonvei
' o z)y14) 1 4 , the Deterministic lazy algorithm
C azy
lazy algorithm ((C' ithm (D'
(D,az
lazyy)
y)
)
,lazy\
14
14
14 iand
a
the
1155
Progressive algorithm ( P ) ..
Table 3 shows the correct recognition performance achieved by GSN' with-
out pre-processing when different numbers of pyramids are used. Table 4
presents how the use of SOFT d pre-processed data affects these results.
The results presented in Table 3 and Table 4 are the average of 10 different
simulations. These results show that the use of pre-processed input leads to
an overall improvement in the classification accuracy achieved by GSN'.
It is interesting to notice that the smaller the number of pyramids used, the
larger the difference in the correct recognition rates achieved by SOFT d -GSN^
compared to GSN^.
The larger difference between patterns from different classes and similarity
between patterns from the same class has more importance for the recognition
performance when fewer pyramids are used. This characteristic is particularly
beneficial when hardware or time considerations limit the number of pyramids
that can be used. It can also be seen that when 50 pyramids are used, GSN-^
with pre-processed inputs accomplishes a recognition performance similar to
that achieved by GSN' with 100 pyramids using the original input patterns.
It is also clear from these results that the largest improvement in classi-
fication accuracy occurs when GSN' is trained with the C_lazy algorithm. The
reason is that, because this algorithm stores values in its memory contents in
a less ordered way, it is more affected than the others by the larger variety of
pixel distributions in the original patterns.
Table 4: SOFT d -GSN' classification accuracy.

                Learning strategy
Pyramids    C_lazy    D_lazy    P
10          64.93     85.38     86.25
20          73.35     89.07     89.23
30          84.17     91.43     91.25
40          87.05     92.54     92.66
50          88.15     93.04     93.37
60          90.71     93.85     93.41
70          91.01     93.52     93.14
80          92.62     94.38     94.52
90          92.50     94.26     94.24
100         93.10     94.63     94.54

5 Conclusion
References
An image processing system that automatically locates danger labels on the rear of con-
tainers is presented. The system uses RAM based neural networks to locate and classify
labels after a pre-processing step involving non-linear filtering and RGB to HSV con-
version. Results on images recorded at the container terminal in Esbjerg are presented.
1 Introduction
Figure 1: Examples of the different types of danger labels that must be classified. The different categories
are distinguished by their colour. The major colours used are red, green, yellow, and blue.
2 Concept
• Move the camera fovea to obtain a high resolution image of the object.
• Use the colour and pictogram information of the objects to perform a clas-
sification.
Both when locating the labels and when performing the final classification the
image data are initially pre-processed to create a binary pattern. These patterns are fed
into RAM based neural networks for classification.
It is of course essential that regions with labels are actually detected. Accepting
some false positive classifications reduces the chance of a false negative. Within the
first task it is therefore acceptable that a few label-free regions are classified as regions
of interest. The following classification step must then detect that no labels are actu-
ally present.
The first step has been tested on real-world images while the latter part is still a
laboratory set-up.
Initially the camera is zoomed out to record a full image of the container rear. The next
step is then to find objects with label-like shapes. This is done by scanning the image
with a search window. The dimensions of the window are slightly larger than the di-
mensions of the labels. For each window position a RAM based system is used to de-
termine whether a label shaped object is present within the window.
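The window scan could be sketched roughly as follows. The classify_window callback standing in for the RAM based decision, and the step size, are assumptions made purely for illustration.

```python
# Sketch of scanning the container-rear image with a search window.
# classify_window() is a hypothetical wrapper around the RAM based network.

def scan_for_labels(image, win_h, win_w, step, classify_window):
    """Slide a window over a 2-D image array and collect label-like positions."""
    hits = []
    rows, cols = image.shape[:2]
    for y in range(0, rows - win_h + 1, step):
        for x in range(0, cols - win_w + 1, step):
            window = image[y:y + win_h, x:x + win_w]
            if classify_window(window):          # RAM net says "label present"
                hits.append((y, x))
    return hits
```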
A pre-processing of the data within the search window is required since the RAM
based system requires binary input. The labels are characterised by edges oriented at
45° and 135° with respect to the horizontal line. Non-linear edge detection filters were
developed to detect the contour of a label.
Filter kernels designed to detect tilted edges of 45° and 135° were used. The filter re-
sult is a binary image where each pixel classified as an edge pixel is assigned the value
1. The operation of the filter is based on the hypothesis that a pixel can be considered
an edge pixel. The average intensities are calculated for the background and object.
The hypothesis is now tested using these two average intensities and their correspond-
ing variances. For further description of this non-linear edge detection filter see Jor-
gensen et al. Figure 2 illustrates how this filtering scheme extracts the edge
information and creates a binary example from an image of a label. In order to locate
the striped label (see figure 1), a vertical edge detecting filter was added to the pre-
processing scheme.
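As a rough illustration of this kind of hypothesis test (not the actual filter of Jorgensen et al.), the sketch below splits a pixel's neighbourhood along a 45° diagonal and compares the two average intensities against their pooled variation; the window size and the factor k are assumptions.

```python
# Illustrative 45-degree edge test in the spirit of the description above.
# Assumes (y, x) lies at least `half` pixels away from the image border.

import numpy as np

def edge_45(image, y, x, half=3, k=2.0):
    """Test the hypothesis that pixel (y, x) lies on a 45-degree edge."""
    win = image[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    r, c = np.indices(win.shape)
    upper = win[r + c < 2 * half]      # one side of the 45-degree diagonal
    lower = win[r + c > 2 * half]      # the other side (object vs. background)
    m1, m2 = upper.mean(), lower.mean()
    pooled = np.sqrt((upper.var() + lower.var()) / 2.0) + 1e-6
    # Accept the edge hypothesis if the two average intensities differ
    # by more than k pooled standard deviations.
    return abs(m1 - m2) > k * pooled
```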
Figure 2: This example shows how a label is converted to a binary pattern. Four line detectors are used. For
each orientation separate detectors are used to detect light to dark and dark to light transitions.
The output of the edge filters is fed into a RAM based NN system. The purpose of the
NN is to mark regions of interest in the image. A search window is centred on exam-
ples of labels and the system is trained to detect the presence of a label.
The translation invariance needed to make the scheme robust is partially obtained
during the filter step and partially by assigning different classes to different positions.
For each label used for training a range of examples is produced. Each of these exam-
ples corresponds to a specific (small) shift within the search window. Each shift is as-
signed a specific class to avoid interference. Producing these different classes causes
the system to become invariant to small shifts of the labels' position within the search
window. Consequently the container rear can be scanned with a larger step size. Dark
labels on light background and vice versa, as well as the striped label, are treated as sep-
arate classes in order to simplify the classification task for the neural network.
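A sketch of how such shifted training examples might be generated is given below; the shift range, window placement and class numbering are assumptions, not the authors' exact scheme.

```python
# Generate (window, class_id) training pairs for small shifts of one label.
# Assumes the label centre is well inside the image.

def shifted_examples(image, centre, win_h, win_w, max_shift=2):
    """Yield one training example per shift, each with its own class."""
    cy, cx = centre
    examples = []
    class_id = 0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            y0 = cy + dy - win_h // 2
            x0 = cx + dx - win_w // 2
            window = image[y0:y0 + win_h, x0:x0 + win_w]
            examples.append((window, class_id))   # one class per shift
            class_id += 1
    return examples
```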
The network also needs examples of label free regions. The examples to be used
must correspond to regions having features that can be confused with the ones desired.
1. Store a range of label examples in the RAM net together with an example of a
totally empty region.
2. Scan the training images with this initial network and record any label-free regions
that are wrongly classified as containing labels (false positives).
3. Use the false positive detected regions as examples of label free regions.
4 Camera set-up
The image used to locate potential danger labels does not provide sufficient resolution
for proper identification of the danger labels found. In order to overcome this limita-
tion an active vision system was set up. The use of a motorised camera with control-
lable pan, tilt and zoom makes it possible for the system to zoom in on a region of
interest. One can then obtain a high resolution image of the potential label and thereby
gain the information needed to perform a final identification of the object. The camera
used is a standard surveillance camera capable of 170° pan, 110° tilt and zoom from
4.5° to 38°. The camera controller developed runs as a separate task, allowing the
localisation and identification system to perform further processing while the camera
is moving.
5 Label Identification
The RAM based neural network produced by the training procedure outlined in sec-
tion 3.2 outputs two votes, numbers corresponding to a label being present and a label-
free region, respectively. Using a Winner-Takes-All decision the votes are used to
mark regions of interest in the image, i.e. regions that are likely to contain a label. Hav-
ing detected the regions of interest the next step is to verify and classify the potential
label. First the camera is zoomed in onto the object to obtain a high resolution image.
At this stage the edge detection filters are used to re-locate the label.
From the detected edge pixels (see figure 3) we calculate the equations for the
four line segments forming the contour of the label. An example of the result of this
can be seen in figure 4.
Danger labels are colour coded. Accordingly it seems sensible to use this colour
information for the classification of the labels. A way of handling colour information
is to use the hue, saturation and intensity representation4. For the upper and lower
halves of the label a hue histogram is calculated. The colour of the upper and lower
half of the label will give the main class of the label. In the cases where the colour in-
formation does not give sufficient information the pictogram in the upper half of the
label is classified by another neural network.
Figure 4: High resolution image of a potential danger label. The rectangle indicates the position of the label
(calculated at lower resolution, hence the offset).
To obtain the colour of the upper and the lower half of the label a hue histogram is cal-
culated. The hue circle is divided into the overlapping colour regions red, orange, yellow,
green, cyan, blue, violet and pink, as well as a special case for no colour, defined as pixels
with a low saturation or being nearly black or white. See figure 5.
For several of the labels this information will be sufficient to perform a segmen-
tation into different main groups. The remaining information is provided by the picto-
gram in the upper half of the label.
Figure 5: Hue histogram for the upper and lower half of the label. The colours are Re(d), Or(ange),
Ye(llow), Gr(een), Cy(an), Bl(ue), Vi(olet), Pi(nk), and the special case "No colour"
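The colour decision for one half of a label could look roughly like the sketch below. The hue bin boundaries and the saturation cut-off are assumptions, and the bins are shown non-overlapping for simplicity, whereas the chapter uses overlapping regions.

```python
# Rough sketch of the hue-histogram colour decision for one half of a label.
# Hue is assumed to be in degrees (0-360); bin limits are illustrative only.

import numpy as np

COLOUR_BINS = {
    "red": (345, 15), "orange": (15, 45), "yellow": (45, 75),
    "green": (75, 165), "cyan": (165, 200), "blue": (200, 260),
    "violet": (260, 300), "pink": (300, 345),
}

def dominant_colour(hue_deg, sat, sat_min=0.2):
    """Return the most frequent colour label for arrays of HSV pixel values."""
    hue = hue_deg[sat >= sat_min]                 # low saturation -> "no colour"
    if hue.size == 0:
        return "no colour"
    counts = {}
    for name, (lo, hi) in COLOUR_BINS.items():
        if lo < hi:
            counts[name] = int(np.sum((hue >= lo) & (hue < hi)))
        else:                                      # wrap-around range (red)
            counts[name] = int(np.sum((hue >= lo) | (hue < hi)))
    return max(counts, key=counts.get)
```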
Whenever the colour data does not contain sufficient information to perform a full
identification of the danger label, classification of the pictogram at the top of the label
will provide the missing information. For a few types of labels ambiguity will still ex-
ist, e.g. "poison" versus "poison gas" or "inflammable gas" versus "inflammable liq-
uid." A solution to this ambiguity could be to apply an OCR reader to the text line in
the middle of the label and/or to the number at the bottom of the label.
From the high resolution image of the label we cut a square region containing the
pictogram (see figure 6). The pictograms are pre-processed and passed to a RAM
based neural network for classification. The pictogram image is divided into local ar-
eas for each of which a grey scale histogram is calculated. These histograms are re-
duced to only 3 grey levels in order to reduce the input space to the neural classifier.
Translation invariance is obtained by the use of local histograms and by variation in
the training set.
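A possible form of this pre-processing is sketched below; the grid size and the two grey-level boundaries are assumptions made for the example.

```python
# Divide a grey-scale pictogram into local areas and reduce each area's
# histogram to 3 grey levels, as described above (grid and levels assumed).

import numpy as np

def local_three_level_histograms(grey, grid=(8, 8), edges=(85, 170)):
    """Return a (rows, cols, 3) array of 3-level local histograms."""
    rows, cols = grid
    h, w = grey.shape
    out = np.zeros((rows, cols, 3), dtype=int)
    for i in range(rows):
        for j in range(cols):
            patch = grey[i * h // rows:(i + 1) * h // rows,
                         j * w // cols:(j + 1) * w // cols]
            levels = np.digitize(patch, bins=edges)        # 0, 1 or 2
            out[i, j] = np.bincount(levels.ravel(), minlength=3)
    return out
```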
Figure 7: Example of a container rear end containing 4 labels. The white squares indicate the regions iden-
tified by the system as regions containing labels. All four labels are correctly located.
6 Results
In order to obtain realistic images for training and testing purposes a video camera was
installed at the DFDS container terminal in Esbjerg. Each time a container entered the
terminal a video sequence of 10 seconds was recorded onto a VHS tape. These video
sequences were subsequently digitised for further processing.
The processing time on a 486-66 MHz machine for one container rear is around
10 seconds for the label localisation. An example of how the algorithm successfully
locates several labels on a container rear without any errors is shown in
figure 7. The active vision and label identification part of this project is still to be test-
ed in a real world environment, but from our laboratory tests this scheme appears to
be robust.
7 Conclusion
We have presented a system for automatic localisation and classification of danger la-
bels on the rear of containers. The system uses a combination of traditional image
processing methods and RAM based neural networks. For the label localisation sys-
tem the test results were obtained on images recorded under real conditions.
Acknowledgments
Part of this work was funded by the EUREKA project HERA. The partners were Ram-
bøll, Hannemann & Højlund (DK), Applied Bio Cybernetics (DK), BICC (UK) and
Risø (DK).
References
T. MACEK
Department of Computers, Faculty of Electrical Engineering,
Czech Technical University, Karlovo namesti 13,
121 35 Praha 2, Czech Republic
G. MORGAN, J. AUSTIN
Advanced Computer Architecture Group, Department of Computer Science,
University of York,
York YO1 5DD, United Kingdom
This text proposes an efficient method for the implementation of the ADAM binary
neural network on a message passing parallel computer. Performance examples
obtained by testing an implementation on a Transputer based system are presented.
It is shown that simulation of an ADAM network can be efficiently sped-up by
parallel computation. It is also shown how the overlapping of the computation
and communication positively influences performance issues.
1 Introduction
The speed of simulation of Artificial Neural Networks is one of the crucial prob-
lems slowing their more widespread application. Conventional workstations are
not fast enough and special purpose parallel hardware is often unacceptable
on cost grounds. The alternative solution, of using a general purpose paral-
lel computer, is more cost effective, leads to a shorter development time and
results in a more maintainable system.
Most neural networks have long teaching and recall times due to the use of
evaluation functions which are slow to compute. For example, weights based
upon floating point values result in the use of slow floating point computations.
Binary neural networks (where the weight is only either 0 or 1) rely upon
much faster logical operations. Therefore teaching and recall in such neural
networks is much faster. However, much larger networks are also used in many
applications. Therefore we have focused on the development of methods for
the simulation of binary weighted neural networks on parallel computers.
In this text we describe our experience in implementing the ADAM neural
network on a message passing computer system. We present the techniques
used in our implementation as well as some of the results obtained by running
experiments on the Transputer based Meiko machine with 32 processors.
The first results of this work were presented in 1. We focus here on the im-
provements which have been obtained by a higher level of parallelism in the
communication and by overlapping the computation and the communication.
In the next section we briefly describe the structure of an ADAM network.
The third section is devoted to the issues which we considered important in the
design of the parallel implementation. This is followed by the fourth section
which is devoted to the presentation of results. The final section of the text is
devoted to a summary of the results.
2 ADAM
In this section we will describe very briefly the ADAM system (see 2 for more
details). ADAM stands for Advanced Distributed Associative Memory. It is
able to store and retrieve associations between pairs of binary vectors, even
if the vectors are incomplete or corrupted. The structure of ADAM is depicted
in figure 1.
Before use an ADAM is taught the associations required of it. This is
called Teaching. When in use an ADAM system is used for Recalling the data
taught to it. We now describe teaching and recalling in more detail.
During teaching the association between binary vectors applied to the in-
puts and outputs is stored. The structure of ADAM is based on the intercon-
nection of two binary correlation memories. The input vector (after n-tuple
preprocessing, see 3) is associated with a so-called class vector in the first corre-
lation memory. The output vector (after n-tuple preprocessing) is associated with
the class vector in the second correlation memory.
3 Parallelisation Issues
Each part of the class vector can include a different number of bits set
to one and moreover this number changes with the class. If the vertical slicing
method is used, whole class vectors are processed at each of the processors.
By altering the number of the tuples we can adjust the computation load at
the processor.
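The load-balancing idea can be illustrated by a simple partition of tuple indices over processors; the contiguous assignment below is only one possible scheme, not necessarily the one used by the authors.

```python
# Split the n-tuples of one correlation matrix across processors so that
# each processor handles a nearly equal share of the computation.

def partition_tuples(n_tuples, n_processors):
    """Return a list of contiguous tuple-index groups, one per processor."""
    base, extra = divmod(n_tuples, n_processors)
    groups, start = [], 0
    for p in range(n_processors):
        size = base + (1 if p < extra else 0)   # spread the remainder evenly
        groups.append(list(range(start, start + size)))
        start += size
    return groups
```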
The two correlation matrices are stored and executed in two separate
branches of a tree, as shown in figure 3. The root of the tree is a processor
which receives the system's input and transmits it to the subnetwork handling
the first correlation matrix. When this subnetwork has finished working on its
result (the class vector), it is returned to the root which then communicates it
to the second correlation matrix. The recalled pattern from the second CM is
then returned to the root transputer for output from the system.
In the case of teaching the two CMs can be evaluated in parallel; in the
case of recalling, pipelining can be used. For these reasons it is better
to allocate an independent subnetwork for processing each correlation matrix.
The number of the processors in each subnetwork is determined according to
the number of tuples in the CM concerned. Therefore the two CM networks
need not be the same size.
Figure 4 shows the proposed topology of the networks of the whole system.
The structure is based on interconnection of two ternary trees. The choice of
the topology of subnetworks was determined by the type of the communications
required. It is not necessary to communicate between any arbitrary pair of
the processors but the operations for broadcasting, scattering the data to all
transputers and gathering the results must be implemented efficiently. For this
purpose a star or a tree are the optimal structures. We used a ternary tree
because of the number of links on the transputers.
Table 1: The set of commands used by the CM subtree processors.

Command         Function
RECALL-CM       recall operation for one input pattern.
TEACH.CM        teach one association.
WRITE-M.CM      distribute correlation matrix to the processors.
READ-M.CM       collect matrix from the processors.
MAKE-MIXED-CM   create ADAM in the network.
INIT.TREE-CM    initiate particular processor.
EXIT-CM         finish processing of whole network.
READ-VAR.CM     reading of the internal variables (for debugging).
All processors in the CM subtrees behave similarly since they are driven by
the same program. It is based upon a command interpreter loop consisting of
reading a command from the parent processor and executing the corresponding
operation. The set of the commands used is shown in table 1. By way of
an example we will describe the commands for software initialisation of the
network and for the recall operation.
The commands for software initialisation of the network are the first which
should be sent to the CM subtrees. The command INIT.TREE.CM (see table 1)
is sent to each processor separately. This is followed by the address of the
processor and the parameters of that slice of the CM, including the number of
tuples to be allocated at that particular processor.
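The command interpreter loop might be sketched as follows; recv_from_parent and the handler table are hypothetical, and the command strings simply echo the names listed in Table 1.

```python
# Sketch of the command-interpreter loop run on each CM subtree processor.
# recv_from_parent() and `handlers` are placeholder abstractions.

def command_loop(recv_from_parent, handlers):
    """Repeatedly read a command from the parent processor and execute it."""
    while True:
        command, payload = recv_from_parent()
        if command == "EXIT-CM":          # finish processing of whole network
            break
        handlers[command](payload)        # e.g. RECALL-CM, TEACH.CM, ...
```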
When a processor receives a command it checks the address field. If the
The smallest number of the tokens in one of the input places of RECEIVE
represents the length of the empty part of the buffer. Similarly the smallest
number of tokens in the input places of the SEND transitions represents
the number of data items which should be sent to the previous layer. The SEND
is enabled if the results have been received from the subtrees (UP1, UP2,
UP3) and the local computation (COMPUTE) has also finished. The DOWN
transitions are enabled if there is at least one item in the input buffer. The
UP transitions are enabled if there is at least one empty place in the output
buffer.
If a processor is the leaf of the tree or if it has less than three subtrees,
the corresponding processes DOWN and UP are not placed at the processor.
Implementation of the teaching is similar, but the additional processes UP
and SEND are not used.
In this section we have described the main elements of our parallel imple-
mentation. We now present some results illustrating how well the implemen-
tation works.
4 Evaluation of results
Figure 7 illustrates how the speed-up depends upon the technique used in
the implementation. Three techniques are compared. The first curve (marked
serial) corresponds to our first version of the program where just one process
was running on each processor. The second (marked parallel with buffer=l)
used the method described above with communication and computation in
parallel but with no buffering. This means that data were distributed down
the tree in parallel with computation but the next datum was not received at
the same time from the previous layer. The third curve (marked parallel with
buffer=2) corresponds to the fully parallel version described in this text.
It can be seen that the performance has been improved significantly by
parallel computation and communication.
5 Conclusions
We have briefly presented here a method for the parallel simulation of the
ADAM neural network. It has been shown that the simulation of the ADAM
neural network can be efficiently sped-up by parallel computation.
Acknowledgements
We would like to express our thanks to the British Council and Amstrad, which
have been sponsoring this research in the framework of the Excalibur Scheme.
We would also like to thank our employers, The Department of Computer
Science, The University of York and The Department of Computers at FEL,
CVUT in Prague, who made it possible to do this work.
References
1 Introduction
The primary focus of neural network research has been into real-valued networks: net-
works whose connections have weights that are represented using floating point num-
bers. Unfortunately, this does cause problems as a floating point calculation usually
requires multiple clock cycles, resulting in a time consuming training and recall oper-
ation. This is exacerbated in some research domains such as image processing systems
which often have input data with high dimensionality. To overcome some of the prob-
lems of real-valued networks, an area of active research is the field of binary neural
networks, sometimes described as Weightless Neural Networks. Binary networks
have the advantage of single pass training and recall using boolean operations. There
is extensive ongoing research into the theory, implementation and application of bi-
nary networks. The uses of binary neural networks range from address database
deduplication11, rule-based systems5, probability density estimations, time-series
analysis of financial data9, to molecule matching14. For problems of a reasonable size,
such as an input image of 512 pixels, or a 25 million address database, there is a sig-
nificant amount of data to be processed. To enable this data to be processed in real-
time, e.g. camera frame rates for image work, then it is cost effective to use dedicated
accelerating hardware. The hardware that the group has developed and is currently us-
ing is the Cellular Neural Network Associative Processor, C-NNAP.
Section 2 of this chapter describes the operations required to train and recall from
a binary neural network. Section 3 describes the hardware design of the C-NNAP ar-
chitecture.
A binary neural network contains a binary matrix, often called a Correlation Matrix
Memory (CMM), that stores relationships between two binary vectors. They are ad-
vantageous as they can be taught or tested in a single pass over the data using boolean
operations; this results in simple and fast processing. The networks do have draw-
backs: they may not generalise as well as an MLP (Multi Layer Perceptron) network,
nor can they solve as many problems as an MLP. However, Morciniec stated that
"RAMnets are often capable of delivering results which are competitive with those ob-
tained by more sophisticated, computationally expensive models." These models in-
cluded back propagation and radial-basis functions, which require many hours of
training time. There are therefore many situations where a slightly reduced function-
ality is acceptable as the processing time is of primary importance.
Historically, binary neural networks originate from work done in the 1950's, such
as that done on the N-tuple system. The N-tuple system was developed by Bledsoe and
Browning for the recognition of alphanumeric patterns, and popularised by
Aleksander. Bledsoe and Browning's system had a binary matrix for storing data, and
used the N-tuple system for pre-processing the image data. Further work was done on
the binary matrix system by Willshaw. Willshaw's thresholding mechanism, howev-
er, is not tolerant of noise in the input data. In this scenario it is better to use L-max
thresholding which is used in the two-layer CMM network of the Advanced Distrib-
uted Associative Memory, or ADAM . The ADAM network was specifically de-
signed for use in scene analysis.
The interesting feature of the above networks is that they are a superset of the ba-
sic correlation matrix memory (CMM). The training and recall from a CMM is now
explained.
Training a CMM network requires two binary vectors. The first vector, known as the
training pattern, is the output from the pre-processing stage. The pre-processing could
be grey-scale n-tupling4, CMAC conversion1, or any technique that converts an input
to a binary vector. The second binary vector required is the output pattern, and is
known as the separator pattern (sometimes known as the class pattern). The separator
patterns are as near to mutually orthogonal as possible, which reduces clustering of
data in the network; this in turn reduces the post-processing required.
Training is described by Equation 1, where $p_k$ is the $k$-th training pattern and
$s_k$ the corresponding separator pattern:

$M_0 = 0, \qquad M_k = M_{k-1} \vee (s_k \otimes p_k), \quad k \in \mathbb{N}$   (1)
Recall from the network is described by Equation 2. This is an inner product operation,
where pk is pre-processed input vector k, after undergoing a transpose. M is the binary
matrix. The output from this operation is a vector of integer values v.
$v = M p_k^{T}$   (2)
In architectural terms, the inner product means that each input "1" activates a col-
umn of weights in the matrix. These columns are then integer accumulated, resulting
in the summed values of vector, v. This is shown in Figure 2(a) for a network contain-
ing a single correlation. Figure 2(b) shows the recall from a network that has been
taught many correlations.
Figure 2: Recall from the correlation matrix: (a) a network containing a single correlation; (b) a network that has been taught many correlations.
The vector of summed values, vk, is thresholded using either L-max thresholding
or Willshaw thresholding. L-max thresholding sets the L highest summed values to
"1"; all other summed values are set to "0". Willshaw thresholding is described by
Equation 3. This sets to "1" the summed values in vector v that equal W, where W is
the number of bits set in the input pattern; the remaining summed values are set to "0".
An often used variant of Willshaw thresholding sets all values in vector v that are
greater than or equal to W to "1", all others are set to "0" (this provides for better control
of noisy input data). The output from thresholding, vector t, is the original separator pat-
tern.
$t_j = \begin{cases} 1 & \text{if } v_j = W \\ 0 & \text{otherwise} \end{cases}$   (3)
It should be noted that the thresholded vector may actually contain more than one
separator pattern and that further processing would then be required to extract the in-
dividual patterns. The technique used is called Middle Bit Indexing (MBI), and is de-
scribed in7 and in greater detail in .
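A few lines of numpy can exercise the training, recall and thresholding operations just described; the vector sizes and the use of the "greater than or equal" Willshaw variant are choices made for this example only, not part of any particular implementation.

```python
# Compact numpy sketch of the CMM operations of Equations 1-3.

import numpy as np

def train(M, p, s):
    """Equation 1: OR the outer product of separator s and training pattern p into M."""
    return M | np.outer(s, p)

def recall(M, p):
    """Equation 2: integer sums of the matrix columns selected by the set input bits."""
    return M.astype(np.int32) @ p.astype(np.int32)     # vector v of summed values

def willshaw_threshold(v, p):
    """Equation 3 (>= variant): set to 1 the sums reaching W, the number of set input bits."""
    W = int(p.sum())
    return (v >= W).astype(np.uint8)

# Example: store and recover a single association.
p = np.array([1, 0, 1, 0, 1, 0, 0, 0], dtype=np.uint8)   # training pattern
s = np.array([0, 1, 0, 0, 1, 0], dtype=np.uint8)          # separator pattern
M = np.zeros((s.size, p.size), dtype=np.uint8)
M = train(M, p, s)
print(willshaw_threshold(recall(M, p), p))                 # recovers s
```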
In many binary neural network applications, the training and recall data needs to be
processed at fast rates. For image work, camera frame-rates sometimes need to be
maintained (typically 25 frames a second in the UK). To process 25 million addresses
in a day also requires a processing rate of at least 17,400 addresses a minute. In order
to make this possible it is necessary to use dedicated hardware: the C-NNAP platform.
C-NNAP is a dedicated platform for processing binary neural networks, and is avail-
able to all users on the department network.
C-NNAP is based on a standard VME bus system which makes the architecture
available to a wide range of computer platforms. The system's host-computer, the S-
Node, is a Silicon Graphics Challenge DM server with an integral VME bus. The host
computer is required to pre-process the data and control data movements within the
system. The VME bus of the S-node hosts the C-NNAP nodes, or C-nodes, which are
tasked with the training and recall of the CMMs. A large CMM can be distributed over
multiple C-nodes; on completion of C-node processing the S-node combines the re-
sults from all the boards. The architecture of a C-node is shown in Figure 3, and a pho-
tograph of a completed board is shown in Figure 4.
The C-node hosts a dedicated processor called the SAT processor (Sum And
Threshold). The SAT recalls sixteen bits of active binary matrix in a single 175 nano-
second clock cycle. It is described fully in Section 4. The C-node also has three 25ns
SRAM memories. The DSP memory is used solely for DSP code variable storage. The
buffer memories are two memories that can be switched between buses. Memory one
is accessible from either the DSP, or the S-node over the VME bus. Memory two is
accessible from the SAT processor. The weights memory is a single memory that can
be switched to either the DSP or VME bus, or to the SAT processor. The VME inter-
face is provided through the VIC & VAC chipset.
The pre-processed data values applied to train and recall from the binary network
are termed the index values, so termed because they index into the CMM. The index
values for the binary pattern "01001000" are one and four as bits one and four are
set (the first bit is classed as bit 0). The index values are generated on the S-node and
then the DSP is informed that a transfer is required. The DSP configures the VIC &
VAC as VME bus master which then perform a block transfer of the index values from
the host computer's memory into the buffer memory. During the training phase the
DSP then uses these index values to train the CMM stored in the weights memory.
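Deriving the index values from a pre-processed pattern is straightforward; the sketch below reproduces the "01001000" example, with bit 0 taken as the first bit.

```python
# Convert a binary pattern into the index values that address the CMM.

def index_values(pattern):
    """Return the positions of the set bits in a binary pattern string."""
    return [i for i, bit in enumerate(pattern) if bit == "1"]

print(index_values("01001000"))   # [1, 4]
```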
During a recall from the network the memories are switched so that the SAT
processor can access the index values in the buffer memory, and can also access the
weights memory. It is then interrupted by the DSP, after which the SAT will start re-
calling data from the matrix, placing results in the buffer memory. On completion it
interrupts the DSP which swaps the buffer memories back. At this point the host com-
puter will have generated new index values which are transferred to the second buffer
memory. By this process there will always be a new set of index values in the buffer
memory so that the SAT processor will continuously process data every time the buff-
er memories are swapped. The flow of data and control is shown in Figure 5, where
the system consists of the S-node and three C-nodes.
The SAT processor has been designed to accumulate sixteen bits of the CMM in a sin-
gle cycle. SAT performance is vital to the processing rate as the complexity of the re-
call using standard software techniques results in slow execution rates. The SAT has
been designed so that it can process both single stage CMMs and two stage CMM net-
works, such as the ADAM network. It can also perform either Willshaw thresholding
or L-max thresholding. It is currently implemented using two Actel FPGA devices.
The hardware of the SAT consists of four state machines, sixteen counters for
summing the data from the weights memory, and other hardware for memory access
and control. To recall data, the SAT is given an offset value within the control data
which indicates the first address in the weights matrix. To this value the SAT adds the
index value which results in the address to be accessed in memory. In Figure 6 the
SAT is shown accessing the sixteen bits of matrix indicated by the first index value.
The sixteen bits it is accessing are then passed to individual counters which are
clocked when the data is valid. This has the effect of incrementing those counters
whose input is a "1". This entire cycle takes 175 nanoseconds. When all the active
lines in a sixteen-bit row of the matrix have been summed the sixteen summed values
are stored in the buffer memory and the process repeated on the next row of sixteen
bits after zeroing the counters.
When all the summed values for all the rows have been written to the buffer
memory, the SAT will either Willshaw or L-max threshold them. L-max thresholding
begins by inspecting all the summed values to find the maximum value stored; this
then becomes the current threshold value. All of the summed values are then re-
checked. When the SAT processor finds values that equal the threshold value it saves
the bit addresses of those values. If insufficient separator bits were found after the first thresholding iter-
ation then the operation is repeated by finding the next highest summed value and us-
ing this as the new threshold value. This is repeated until all L separator bits have been
recovered.
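The iterative procedure can be mirrored in a few lines of Python; the tie handling at the final threshold is an assumption, since the SAT simply records every matching bit address it finds.

```python
# Iterative L-max thresholding: repeatedly take the highest remaining summed
# value as the threshold and record the matching bit addresses until L bits
# of the separator pattern have been found.

def l_max_threshold(sums, L):
    """Return the bit addresses of the L highest summed values."""
    remaining = dict(enumerate(sums))
    addresses = []
    while len(addresses) < L and remaining:
        threshold = max(remaining.values())            # current threshold value
        hits = [i for i, v in remaining.items() if v == threshold]
        addresses.extend(hits)
        for i in hits:
            del remaining[i]                           # move on to the next highest
    return addresses[:L]

print(l_max_threshold([3, 7, 2, 7, 5, 1], L=3))        # [1, 3, 4]
```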
At the end of the thresholding operation, the results area of the buffer memory
will contain a series of integers which identify which bits in the separator pattern were
set. This is shown in Figure 7 where the values 0, 3, 9, etc. are stored in the buffer mem-
ory instead of the entire thresholded separator pattern. This is an improvement on ver-
sion 1 of the SAT8 which stored the entire bit pattern.
5 Performance
For a single CMM, the SAT processor has two equations to determine its execution
speed which can be used to determine the speed of different applications. The equa-
tions were derived from the state machine diagrams. Equation 4 calculates S which in
seconds is the SAT start-up time and summing of the data. Equation 5 calculates time
T in seconds, which is the time for thresholding the data. This equation should be omit-
ted if global thresholding is done across multiple boards by the S-node. In the equa-
tions, P is the number of CMM index values. For grey scale work this is the number
of pixels per image, and for binary tupling it is (pixels/n) × 2^n. The separator size is
a. For thresholding, i is the number of iterations required to find all L bits of the sep-
arator pattern, and t is how often a summed value equals the stored threshold value per
iteration.
$S = 50\times10^{-9}\left[\frac{3.5\,a\,P + 34}{16} + 27.5\right]$   (4)
The above equations have been verified against the SAT processor running the
specified 40 MHz clock.
By formula transformation the maximum number of index values, l, can be cal-
culated for a given length of time τ and separator size. This is useful when calculating
the quantity of data that can be processed at real-time rates. This is shown in Equation
6 for thresholding, and Equation 7 for the recall operation.
$l = \frac{16(\tau - 1.48125\times10^{-6} - T)}{50\times10^{-9} \times 3.5 \times a}$   (6)

$l = \frac{16(\tau - 1.48125\times10^{-6})}{50\times10^{-9} \times 3.5 \times a}$   (7)
In a typical scenario the separator pattern will be 64 bits with 6 bits set, all of
which will be found on the first thresholding iteration. This data yields the information
that 57140 index values can be processed in 40ms. If n-tupling of size four is used as
a pre-processor, for example, then the number of input bits will be 228560, a 478 by
478 binary pixel image.
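These figures can be checked against Equation 7 in the form reconstructed above; the constants used below follow that reconstruction rather than a verified copy of the published equation.

```python
# Back-of-envelope check of the 57140 / 228560 figures quoted above,
# using Equation 7 as reconstructed in this chapter.

tau = 40e-3             # available processing time, 40 ms
a = 64                  # separator size in bits
startup = 1.48125e-6    # fixed start-up term from Equation 7
slice_time = 50e-9 * 3.5  # 175 ns to accumulate sixteen bits of the matrix

l = 16 * (tau - startup) / (slice_time * a)
print(int(l))           # 57140 index values in 40 ms
print(int(l) * 4)       # 228560 input bits with 4-tupling (roughly a 478 x 478 image)
```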
(Table: processing-time comparison between a C-NNAP C-node and an SGI Indy workstation.)
For larger problems than this it will be necessary to upgrade the relevant system.
For example, doubling the processing rate of C-NNAP can be achieved by the addition
of a further C-node, which currently costs in the region of £3000. If this value
is cost C, it will take 2.1C to purchase a further Indy to achieve double the processing
rate. Upgrading the four processors of the Challenge to 200 MHz R10000-processors
would currently cost 30C. Therefore both of these options are significantly more ex-
pensive than a C-node upgrade.
6 Summary
This chapter has shown that binary neural networks are advantageous where either the
processing rates are important, or where the input data is of high dimensionality. The
chapter described in detail the C-NNAP architecture which is a general purpose proc-
essor of binary neural networks. It was shown that a single C-node C-NNAP can proc-
ess data at nearly eight times the rate of a powerful workstation, at a fraction of the
cost. It has therefore been shown that C-NNAP provides a powerful, yet cost effective
platform for processing binary neural networks.
7 References
Dr. P J L Adeodato
Universidade Federal de Pernambuco - Brazil
Dr. I Aleksander
Imperial College of Science, Technology and Medicine - UK
Dr. J Austin
Mr. J Kennedy
Mr. K Lees
Dr. G Morgan
Dr. S E M O'Keefe
University of York - UK
Dr. J M Bishop
Mr. S K Box
Mr. J F Hawker
Dr. D A Keating
Dr. R J Mitchell
The University of Reading - UK
Dr. D L Bisset
Professor M C Fairhurst
Dr. G Howells
University of Kent at Canterbury - UK
Professor A C P L F De Carvalho
University of Sao Paulo at Sao Carlos - Brazil
Dr. S S Christensen
Dr. T M Jorgensen
Dr. C Liisberg
Risø National Laboratory - Denmark
Dr. T G Clarkson
Dr. Y Ding
King's College London - UK
Dr. L Hepplewhite
Dr. T J Stonham
Dr. P Ntourntoufis
Brunel University - UK
Dr. T Macek
Czech Technical University - Czech Republic
Dr. M Morciniec
Hewlett-Packard Laboratories - UK
Dr. R Rohwer
Prediction Company - USA
Dr. J G Taylor
King's College London - UK