
Chapter 5

LEARNING WITH NEURAL NETWORKS (NN)

5.1 TOWARDS COGNITIVE MACHINE

Human intelligence possesses robust attributes with complex sensory, control, affective (emotional
processes), and cognitive (thought processes) aspects of information processing and decision
making. There are over a hundred billion biological neurons in our central nervous system (CNS),
playing a key role in these functions. The CNS obtains information from the external environment via
numerous natural sensory mechanisms—vision, hearing, touch, taste, and smell. With the help of
cognitive computing, it assimilates the information and offers the right interpretation. The cognitive
process then progresses towards some attributes, such as learning, recollection, and reasoning,
which results in proper actions via muscular control.
Recent progress in information technology has widened the capabilities
and applications of computers. If we wish a machine (computer) to exhibit certain cognitive
functions, such as learning, remembering, reasoning and perceiving, that humans are known
to exhibit, we need to define ‘information’ in a general manner and develop new mathematical
techniques and hardware with the ability to handle the simulation and processing of cognitive
information. Mathematics, in its present form, was developed to comprehend physical processes,
but cognition, as a process, does not essentially follow these mathematical laws. So what exactly is
cognitive mathematics then? The question is rather difficult. However, scientists have converged to
the understanding that if certain ‘mathematical aspects’ of our process of thinking are re-examined
along with the ‘hardware aspects’ of ‘the neurons’—which are the primary components of the brain—
we may, to a certain level, be able to successfully emulate the process.
Biological neuronal processes are rather complex [76], and the advancement made in terms
of understanding the field through experiments is raw and inadequate. However, with the
help of this limited understanding of the biological processes, it has been possible to emulate some
human learning behaviors, via the fields of mathematics and systems science. Neuronal information
processing involves a range of complex mathematical processes and mapping functions, which
serve as a parallel-cascade computing structure in synergism. The aim of system scientists is to
create an intelligent cognitive system on the basis of this limited understanding of the brain—a
system that can help human beings to perform all kinds of tasks requiring decision making. Various
new computing theories have been developing in the neural networks field which, it is hoped,
will be capable of providing a thinking machine. Given that they are based on neural network
architectures, they should be able to yield the low-level cognitive machine that scientists
have been trying to build for so long.
The subject of cognitive machines is in an exciting state of research and we believe that we are
slowly progressing towards the development of truly cognitive machines.

5.1.1 From Perceptrons to Deep Networks


Historically, research in neural networks was inspired by the desire to produce artificial systems
capable of sophisticated ‘intelligent’ processing similar to the human brain. The perceptron is the
earliest of the artificial neural networks paradigms. Frank Rosenblatt built this learning machine
device in hardware in 1958. In 1959, Bernard Widrow and Marcian Hoff developed a learning
rule, sometimes known as the Widrow-Hoff rule, for ADALINE (ADAptive LINear Elements). Their
learning rule was simple and yet elegant.
Affected by the predominantly rosy outlook of the time, some people exaggerated the potential
of neural networks. Biological comparisons were blown out of proportion. In 1969, significant
limitations of perceptrons, a fundamental building block for more powerful models, were highlighted
by Marvin Minsky. This brought much of the activity in neural network research to a halt.
Nevertheless, a few dedicated scientists, such as Teuvo Kohonen and Stephen Grossberg,
continued their efforts. In 1982, John Hopfield introduced a recurrent-type neural network that
was based on the interaction of neurons through a feedback mechanism. The back-propagation
learning rule arrived on the neural-network scene at approximately the same time from several
independent sources (Werbos; Parker; Rumelhart, Hinton and Williams). Essentially a refinement
of the Widrow-Hoff learning rule, the back-propagation learning rule provided a systematic means
for training multilayer feedforward networks, thereby overcoming the limitations presented by
Minsky. Research in the 1980s triggered a boom in the scientific community. New and better
models have been proposed. A number of today’s technological problems are in the areas where
neural-network technology has demonstrated potential.
As the research in neural networks is evolving, more and more types of networks are being
introduced. For reasonably complex problems, neural networks with back-propagation learning
have serious limitations. The learning speed of these feedforward neural networks is, in general, far
slower than required and it has been a major bottleneck in their applications. Two reasons behind
this limitation may be: (i) the slow gradient-based learning algorithms extensively used to train
neural networks, and (ii) all the parameters of the network are tuned iteratively by using learning
algorithms. A new learning algorithm, called Extreme Learning Machine (ELM), was proposed
in 2006 by Guang-Bin Huang et al. [77] for single-hidden-layer feedforward neural networks; in it,
the weights connecting inputs to hidden nodes are randomly chosen and never updated, and the
weights connecting hidden nodes to outputs are analytically determined. Experimental results based
on real-world benchmarking function approximation (regression) and classification problems show
that ELM can produce the best generalization performance in some cases and can be thousands of times faster than
traditional popular learning algorithms for feedforward neural networks.
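To make the idea concrete, here is a minimal sketch (in Python with NumPy) of ELM-style training for a single-hidden-layer network; the function names, sigmoid hidden units, and least-squares solve are our illustrative choices, not prescriptions from [77].

```python
import numpy as np

def elm_train(X, y, m, seed=0):
    """ELM-style fit: random hidden layer, analytic output weights.
    X: (N, n) inputs; y: (N,) targets; m: number of hidden units."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))      # input-to-hidden weights: random, never updated
    b = rng.normal(size=m)                    # hidden biases: random, never updated
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # hidden-layer output matrix (sigmoid units)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights: least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

Because only the linear output weights are fitted, training reduces to a single linear least-squares problem, which is where the reported speed advantage comes from.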
Novel research investigations in ELM and related areas have produced a suite of machine
learning techniques for (single and multi-) hidden layer feedforward networks in which hidden
neurons need not be tuned. ELM theories argue that random hidden neurons capture the essence of
some brain learning mechanisms. ELM has a great potential as a viable alternative technique for
large-scale computing and AI (Artificial Intelligence).
The ‘traditional’ neural networks are based on what we might interpret as ‘shallow’ learning; in
fact, this learning methodology bears very little resemblance to the brain, and one might argue that it
would be fairer to regard such networks simply as a branch of statistics.
The subject of cognitive machines is in an exciting state of research and we believe that we are
slowly progressing towards the development of truly intelligent systems. A step towards realizing
strong AI has been taken through the recent research in ‘deep learning’. Considering the far-reaching
applications of AI, coupled with the awareness that deep learning is evolving as one of its most
powerful methods, today it is not possible for one to enter the machine learning community without
any knowledge of deep networks.
Deep learning algorithms are in sharp contrast to shallow learning algorithms in terms of the
number of parameterized transformations a signal encounters as it spreads from the input layer to
the output layer. A parameterized transformation refers to a processing unit containing trainable
parameters—weights and thresholds. A chain of transformations from input to output is a Credit
Assignment Path (CAP), which describes potentially causal connections from input to output and
may have varied lengths. In the case of a feedforward neural network, the depth of the CAPs, and
therefore the depth of the network, is the number of hidden layers plus one (the output layer is also
parameterized). For example, a network with two hidden layers and a parameterized output layer has depth three.
Today, deep learning, based on learning representations of data, is a significant member of
the family of machine learning techniques. It is possible to represent an observation, for instance
an image, in various ways, such as a vector of intensity values per pixel or in a more abstract
way as a set of edges, regions of particular shape, and so on. Some representations make learning
tasks simpler. Deep learning aims to replace hand-crafted features with efficient algorithms for
supervised or unsupervised feature learning, and hierarchical feature extraction.
The field of deep learning has been characterized in several ways. These definitions have in
common:
(i) multiple layers of nonlinear processing units
(ii) the supervised/unsupervised learning of feature representations in each layer, with the layers
giving rise to a hierarchy from low-level to high-level characteristics.
What a layer of nonlinear processing units, employed in a deep learning algorithm, consists of
depends on the problem that needs to be solved.
Deep learning is linked closely to a category of brain-development theories published by
cognitive neuroscientists in the early 1990s. Some of the deep-learning representations are inspired
by progress in neuroscience and are roughly based on interpretation of information processing
and communication patterns in a nervous system, such as neural coding which tries to describe the
relationship between a range of stimuli and related neuronal responses in the brain.
The term ‘deep learning’ gained traction in the mid-2000s, after a publication by Geoffrey
Hinton and Ruslan Salakhutdinov. They showed how a many-layered feedforward network could
be effectively pre-trained one layer at a time. Since its resurgence, deep learning has become part
of many state-of-the-art systems in various disciplines, particularly computer vision and speech
recognition. The real impact of deep learning in industry began in large-scale speech recognition
around 2010. Recent useful references on the subject are [78–82].
Our focus in this chapter will be on traditional neural networks [83, 84]. These networks are
being used today for many real-world regression and classification problems. Also, a sound
understanding of these networks is a prerequisite to learn ELM and deep learning algorithms. The
two recent research developments are outside the scope of this book.
The terms ‘Neural Networks (NN)’ and ‘Artificial Neural Networks (ANN)’ are both commonly
used in the literature for the same field of study. We will use the term ‘Neural Networks’ in this
book.
Broadly speaking, AI (Artificial Intelligence) is any computer program that does something
smart [2, 5]. Machine learning is a subfield of AI. That is, all machine learning counts as AI,
but not all AI counts as machine learning. For example, rule-based expert systems, frame-based
expert systems, knowledge graphs, and evolutionary algorithms could all be described as AI, but none of
them is machine learning. Deep learning may be considered a subfield of machine learning.
Deep neural networks are a set of algorithms setting new records in accuracy for many important
problems. ‘Deep’ is a technical term; it refers to the number of layers in a neural network. Multiple
hidden layers allow deep neural networks to learn features of the data in a hierarchy.
Deep learning may share elements of traditional machine learning, but some researchers feel that
it will emerge as a class by itself, as a subfield of AI.

5.2 NEURON MODELS

A discussion of anthropomorphism to introduce neural network technology may be worthwhile,
as it helps explain the terminology of neural networks. However, anthropomorphism can lead to
misunderstanding when the metaphor is carried too far. We give here a brief description of how
the brain works; many details of the complex electrical and chemical processes that go on in the
brain have been ignored. A pragmatic justification for such a simplification is that by starting with
a simple model of the brain, scientists have been able to achieve very useful results.

5.2.1 Biological Neuron


To the extent a human brain is understood today, it seems to operate as follows: bundles of neurons,
or nerve fibers, form nerve structures. There are many different types of neurons in the nerve
structure, each having a particular shape, size and length depending upon its function and utility
in the nervous system. While each type of neuron has its own unique features needed for specific
purposes, all neurons have two important structural components in common. These may be seen
in the typical biological neuron shown in Fig. 5.1. At one end of the neuron are a multitude of tiny,
filament-like appendages called dendrites, which come together to form larger branches and trunks
where they attach to the soma, the body of the nerve cell. At the other end of the neuron is a single
filament leading out of the soma, called an axon, which has extensive branching on its far end.
These two structures have special electrophysiological properties which are basic to the function of
neurons as information processors, as we shall see next.
Figure 5.1 A typical biological neuron (labels: dendrites, cell body (soma), nucleus, axon, synapse, synaptic gap, synaptic terminals)

Neurons are connected to each other via their axons and dendrites. Signals are sent through the
axon of one neuron to the dendrites of other neurons. Hence, dendrites may be represented as the
inputs to the neuron, and the axon as its output. Note that each neuron has many inputs through its
multiple dendrites, whereas it has only one output through its single axon. The axon of each neuron
forms connections with the dendrites of many other neurons, with each branch of the axon meeting
exactly one dendrite of another cell at what is called a synapse. Actually, the axon terminals do not
quite touch the dendrites of the other neurons, but are separated by a very small distance of between
50 and 200 angstroms. This separation is called the synaptic gap.
A conventional computer is typically a single processor acting on explicitly programmed
instructions. Programmers break tasks into tiny components, to be performed in sequence rapidly.
On the other hand, the brain is composed of ten billion or so neurons. Each nerve cell can interact
directly with up to 200,000 other neurons (though 1000 to 10,000 is typical). In place of explicit
rules that are used by a conventional computer, it is the pattern of connections between the neurons
in the human brain, that seems to embody the ‘knowledge’ required for carrying out various
information-processing tasks. In the human brain, there is no equivalent of a CPU that is in overall
control of the actions of all the neurons.
The brain is organized into different regions, each responsible for different functions. The largest
parts of the brain are the cerebral hemispheres, which occupy most of the interior of the skull.
They are layered structures; the most complex being the outer layer, known as the cerebral cortex,
where the nerve cells are extremely densely packed to allow greater interconnectivity. Interaction
with the environment is through the visual, auditory and motion control (muscles and glands) parts
of the cortex.
In essence, neurons are tiny electrophysiological information-processing units which
communicate with each other through electrical signals. The synaptic activity produces a voltage
pulse on the dendrite which is then conducted into the soma. Each dendrite may have many synapses
acting on it, allowing massive interconnectivity to be achieved. In the soma, the dendrite potentials
are added. Note that neurons are able to perform more complex functions than simple addition on
the inputs they receive, but considering a simple summation is a reasonable approximation.
When the soma potential rises above a critical threshold, the axon will fire an electrical signal.
This sudden burst of electrical energy along the axon is called the action potential and has the form
of an electrical impulse or spike that lasts about 1 msec. The magnitude of the action potential is
constant and is not related to the electrical stimulus (soma potential). However, neurons typically
respond to a stimulus by firing not just one but a barrage of successive action potentials. What varies
is the frequency of axonal activity. Neurons can fire between 0 and 1500 times per second. Thus,
information is encoded in the nerve signals as the instantaneous frequency of action potentials and
the mean frequency of the signal.
A synapse couples the axon with another cell’s dendrite. It discharges chemicals known as
neurotransmitters when its potential is raised enough by the action potential. The triggering of
the synapse may require the arrival of more than one spike. The neurotransmitters emitted by the
synapse diffuse across the gap, chemically activating gates on the dendrites, which, on opening,
permit the flow of charged ions. This flow of ions changes the dendritic potential and generates a
voltage pulse on the dendrite, which is then conducted into the neuron body. At the synaptic junction,
the number of gates that open on the dendrite is dependent on the number of neurotransmitters
emitted. It seems that some synapses excite the dendrites they impact, while others act in a way that
inhibits them. This results in changing the local potential of the dendrite in a positive or negative
direction.
Synaptic junctions alter the effectiveness with which the signal is transmitted; some synapses
are good junctions and pass a large signal across, whilst others are very poor, and allow very little
through.
Essentially, each neuron receives signals from a large number of other neurons. These are the
inputs to the neuron which are ‘weighted’. That is, some signals are stronger than others. Some
signals excite (are positive), and others inhibit (are negative). The effects of all weighted inputs are
summed. If the sum is equal to or greater than the threshold for the neuron, the neuron fires (gives
output). This is an ‘all-or-nothing’ situation. Because the neuron either fires or does not fire, the rate
of firing, not the amplitude, conveys the magnitude of information.
The ease of transmission of signals is altered by activity in the nervous system. The neural pathway
between two neurons is susceptible to fatigue, oxygen deficiency, and agents like anesthetics. These
events create a resistance to the passage of impulses. Other events may increase the rate of firing.
This ability to adjust signals is a mechanism for learning.
After carrying a pulse, an axon fiber is in a condition of complete non-excitability for a specific
time period known as the refractory period. During this interval, the nerve conducts no signals,
irrespective of how intense the excitation is. Therefore, we could segregate the time scale into
successive intervals, each equal to the length of the refractory period. This will permit a discrete-time
description of the neurons’ performance in terms of their states at discrete-time instances.

5.2.2 Artificial Neuron


Artificial neurons bear only a modest resemblance to real things. They model approximately three
of the processes that biological neurons perform (there are at least 150 processes performed by
neurons in the human brain).
An artificial neuron
(i) evaluates the input signals, determining the strength of each one;
(ii) calculates a total for the combined input signals and compares that total to some threshold
level; and
(iii) determines what the output should be.

Input and Outputs


Just as there are many inputs (stimulation levels) to a biological neuron, there should be many
input signals to our artificial neuron (AN). All of them should come to our AN simultaneously. In
response, a biological neuron either ‘fires’ or ‘doesn’t fire’ depending upon some threshold level.
Our AN will be allowed a single output signal, just as is present in a biological neuron: many inputs,
one output (Fig. 5.2).
Figure 5.2 Many inputs, one output model of a neuron

Weighting Factors
Each input will be given a relative weighting, which will affect the impact of that input (Fig.
5.3). This is something like varying synaptic strengths of the biological neurons—some inputs
are more important than others in the way they combine to produce an impulse. Weights are
adaptive coefficients within the network, that determine the intensity of the input signal. In fact,
this adaptability of connection strength is precisely what provides neural networks their ability to
learn and store information, and, consequently, is an essential element of all neuron models.

Figure 5.3 A neuron with weighted inputs (inputs x1, …, xn; connection weights w1, …, wn; total input Σj wj xj)
Excitatory and inhibitory inputs are represented simply by positive or negative connection
weights, respectively. Positive inputs promote the firing of the neuron, while negative inputs tend
to keep the neuron from firing.
Mathematically, we could look at the inputs and the weights on the inputs as vectors.
The input vector
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{5.1a}$$

and the connection weight vector

$$\mathbf{w}^T = [w_1 \; w_2 \; \cdots \; w_n] \tag{5.1b}$$
The total input signal is the inner product of these vectors. The result is a scalar:

$$\sum_{j=1}^{n} w_j x_j = \mathbf{w}^T \mathbf{x} \tag{5.1c}$$
Activation Functions
Although most neuron models sum their input signals in basically the same manner, as described
above, they are not all identical in terms of how they produce an output response from this input.
Artificial neurons use an activation function, often called a transfer function, to compute their
activation as a function of total input stimulus. Several different functions may be used as activation
functions, and, in fact, the most distinguishing feature between existing neuron models is precisely
which function they employ.
We will, shortly, take a closer look at the activation functions. We first build a neuron model,
assuming that the transfer function has a threshold behavior, which is, in fact, the type of response
exhibited by biological neurons: when the total stimulus exceeds a certain threshold value θ, a
constant output is produced, while no output is generated for input levels below the threshold.
Figure 5.4a shows this neuron model. In this diagram, the neuron has been represented in such a
way that the correspondence of each element with its biological counterpart may be easily seen.
Equivalently, the threshold value can be subtracted from the weighted sum and the resulting
value compared to zero; if the result is positive, then output a 1, else output a 0. This is shown
in Fig. 5.4b; note that the shape of the function is the same but now the jump occurs at zero. The
threshold effectively adds an offset to the weighted sum.
An alternative way of achieving the same effect is to take the threshold out of the body of the
model neuron, and connect it to an extra input value that is fixed to be ‘on’ all the time. In this
case, rather than subtracting the threshold value from the weighted sum, the extra input of +1 is
multiplied by a weight and added in a manner similar to other inputs—this is known as biasing the
neuron. Figure 5.4c shows a neuron model with a bias term. Note that we have taken constant input
‘1’ with an adaptive weight ‘w0’ in our model.
The first formal definition of a synthetic neuron model, based on the highly simplified
considerations of the biological neuron, was formulated by McCulloch and Pitts (1943). The
two-port model (inputs—activation value—output mapping) of Fig. 5.4 is essentially the MP
neuron model. It is important to look at the features of this unit—which is an important and popular
neural network building block.

Figure 5.4 The MP neuron model: (a) the weighted sum Σwj xj compared against threshold θ (synapses, cell body, axons/dendrites labeled to show the biological correspondence); (b) the threshold subtracted from the weighted sum, with the jump at zero; (c) the threshold replaced by a bias weight w0 on a constant input of 1, giving activation value a = Σwj xj + w0

It is a basic unit, thresholding a weighted sum of its inputs to get an output. It does not particularly
consider the complex patterns and timings of actual nervous activity in real neural systems. It
does not have any of the complex characteristics existing in the body of biological neurons. It is,
therefore, a model, and not a copy of a real neuron.
The MP artificial neuron model involves two important processes:
(i) Forming net activation by combining inputs. The input values are amalgamated by a weighted
additive process to achieve the neuron activation value a (refer to Fig. 5.4c).
(ii) Mapping this activation value a into the neuron output ŷ . This mapping from activation to
output may be characterized by an ‘activation’ or ‘squashing’ function.
For the activation functions that implement input-to-output compression or squashing, the range
of the function is less than that of the domain. There is some physical basis for this desirable
characteristic. Recall that in a biological neuron, there is a limited range of output (spiking
frequencies). In the MP model, where DC levels replace frequencies, the squashing function serves
to limit the output range. The squashing function shown in Fig. 5.5a limits the output values to {0,
1}, while that in Fig. 5.5b limits the output values to {–1, 1}. The activation function of Fig. 5.5a
is called unipolar, while that in Fig. 5.5b is called bipolar (both positive and negative responses of
neurons are produced).

Figure 5.5 (a) Unipolar squashing function (output values in {0, 1}); (b) bipolar squashing function (output values in {–1, 1})

5.2.3 Mathematical Model


From the earlier discussion, it is evident that the artificial neuron is really nothing more than a
simple mathematical equation for calculating an output value from a set of input values. From now
onwards, we will be more on a mathematical footing; the reference to biological similarities will
be reduced. Therefore, names like a processing element, a unit, a node, a cell, etc., may be used for
the neuron. A neuron model (a processing element/a unit/a node/a cell of our neural network), will
be represented as follows:
The input vector
x = [x1 x2 … xn]T;
the connection weight vector
wT = [w1 w2 … wn];
the unity-input weight w0 (bias term), and the output ŷ of the neuron are related by the following
equation:
$$\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + w_0) = \sigma\left(\sum_{j=1}^{n} w_j x_j + w_0\right) \tag{5.2}$$

where σ(·) is the activation function (transfer function) of the neuron.


The weights are always adaptive. We can simplify our diagram as in Fig. 5.6a; adaptation need
not be specifically shown in the diagram.

Figure 5.6 Mathematical model of a neuron (perceptron): (a) with explicit bias weight w0 on a constant input 1; (b) with the bias absorbed into the input vector as x0 = 1

The bias term may be absorbed in the input vector itself as shown in Fig. 5.6b.
$$\hat{y} = \sigma(a) = \sigma\left(\sum_{j=0}^{n} w_j x_j\right); \quad x_0 = 1 \tag{5.3a}$$

$$\hat{y} = \sigma\left(\sum_{j=1}^{n} w_j x_j + w_0\right) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) \tag{5.3b}$$
In the literature, this model of an artificial neuron is also referred to as a perceptron (the name
was given by Rosenblatt in 1958).
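A direct transcription of Eqns (5.2)–(5.3) into code may help fix the notation; this is a minimal sketch in Python/NumPy, with a hard-limiting (MP-style) activation supplied as one possible choice of σ(·).

```python
import numpy as np

def neuron_output(x, w, w0, sigma):
    """Single neuron: y_hat = sigma(w^T x + w0), as in Eqn (5.2)."""
    a = np.dot(w, x) + w0          # activation value a
    return sigma(a)

# Hard-limiting (unipolar) activation, as in Fig. 5.5a:
step = lambda a: 1.0 if a >= 0 else 0.0

y_hat = neuron_output(x=np.array([0.5, -1.2]),
                      w=np.array([0.8, 0.4]),
                      w0=0.1, sigma=step)   # -> 1.0, since 0.8*0.5 + 0.4*(-1.2) + 0.1 = 0.02 >= 0
```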
The expressions for the neuron output ŷ are referred to as the cell recall mechanism. They
describe how the output is reconstructed from the input signals and the values of the cell parameters.
The artificial neural systems under investigation and experimentation today, employ a variety of
activation functions that have more diversified features than the one presented in Fig. 5.5. Below,
we introduce the main activation functions that will be used later in this chapter.
The MP neuron model shown in Fig. 5.4 used the hard-limiting activation function. When
artificial neurons are cascaded together in layers (discussed in the next section), it is more common
to use a soft-limiting activation function. Figure 5.7a shows a possible bipolar soft-limiting
semilinear activation function. This function is, more or less, the ON-OFF type, as before, but has a
sloping region in the middle. With this smooth thresholding function, the value of the output will be
practically 1 if the weighted sum exceeds the threshold by a huge margin and, conversely, it will be
practically –1 if the weighted sum is much less than the threshold value. However, if the threshold
and the weighted sum are almost the same, the output from the neuron will have a value somewhere
between the two extremes. This means that the output from the neuron can be related to its inputs in
a more useful and informative way. Figure 5.7b shows a unipolar soft-limiting function.
Figure 5.7 Soft-limiting activation functions: (a) bipolar (output between –1 and 1); (b) unipolar (output between 0 and 1)

For many training algorithms (discussed in later sections), the derivative of the activation
function is needed; therefore, the activation function selected must be differentiable. The logistic
or sigmoid function, which satisfies this requirement, is the most commonly used soft-limiting
activation function. The sigmoid function (Fig. 5.8a):

$$\sigma(a) = \frac{1}{1 + e^{-\lambda a}} \tag{5.4}$$

is continuous and varies monotonically from 0 to 1 as a varies from –∞ to ∞. The gain of the
sigmoid, λ, determines the steepness of the transition region. Note that as the gain approaches
infinity, the sigmoid approaches a hard-limiting nonlinearity. One of the advantages of the sigmoid
is that it is differentiable. This property had a significant impact historically, because it made it
possible to derive a gradient search learning algorithm for networks with multiple layers (discussed
in later sections).
Figure 5.8 (a) Sigmoid functions for gains λ = 1 and λ = 0.2; (b) hyperbolic tangent function (λ = 1)

The sigmoid function is unipolar. A bipolar function with similar characteristics is a hyperbolic
tangent (Fig. 5.8b):
$$\sigma(a) = \frac{1 - e^{-\lambda a}}{1 + e^{-\lambda a}} = \tanh\left(\tfrac{1}{2}\lambda a\right) \tag{5.5}$$

The biological basis of these activation functions can easily be established. It is known that
neurons located in different parts of the nervous system have different characteristics. The neurons
of the ocular motor system have a sigmoid characteristic, while those located in the visual area
have a Gaussian characteristic. As we said earlier, anthropomorphism can lead to misunderstanding
when the metaphor is carried too far. It is now a well-known result in neural network theory that a
two-layer neural network is capable of solving any classification problem. It has also been shown
that a two-layer network is capable of solving any nonlinear function approximation problem [3,
83]. This result does not require the use of sigmoid nonlinearity. The proof assumes only that
nonlinearity is a continuous, smooth, monotonically increasing function that is bounded above and
below. Thus, numerous alternatives to sigmoid could be used, without a biological justification. In
addition, the above result does not require that the nonlinearity be present in the second (output)
layer. It is quite common to use linear output nodes since this tends to make learning easier. In other
words,
$$\sigma(a) = \lambda a; \quad \lambda > 0 \tag{5.6}$$
is used as an activation function in the output layer. Note that this function does not ‘squash’
(compress) the range of output.
Our focus in this chapter will be on two-layer perceptron networks with the first (hidden) layer
having log-sigmoid
$$\sigma(a) = \frac{1}{1 + e^{-a}} \tag{5.7a}$$

or tan-sigmoid

$$\sigma(a) = \frac{1 - e^{-a}}{1 + e^{-a}} \tag{5.7b}$$
activation function, and the second (output) layer having linear activation function
$$\sigma(a) = a \tag{5.8}$$
The log-sigmoid function has historically been a very popular choice, but since it is related to the
tan-sigmoid by the simple transformation
$$\sigma_{\text{log-sigmoid}} = (\sigma_{\text{tan-sigmoid}} + 1)/2 \tag{5.9}$$
both of these functions are in use in neural network models.
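The following sketch collects these activation functions and checks the transformation of Eqn (5.9) numerically; a minimal sketch, with the gain λ left as a parameter.

```python
import numpy as np

def log_sigmoid(a, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * a))                        # Eqn (5.4): range (0, 1)

def tan_sigmoid(a, lam=1.0):
    return (1.0 - np.exp(-lam * a)) / (1.0 + np.exp(-lam * a))   # Eqn (5.5): range (-1, 1)

a = np.linspace(-5.0, 5.0, 11)
assert np.allclose(tan_sigmoid(a), np.tanh(0.5 * a))             # Eqn (5.5): equals tanh(a/2)
assert np.allclose(log_sigmoid(a), (tan_sigmoid(a) + 1) / 2)     # Eqn (5.9)
```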
We have so far described two classical neuron models:
• perceptron—a neuron with sigmoidal activation function (sigmoidal function is a softer
version of the original perceptron’s hard limiting or threshold activation function); and
• linear neuron—a neuron with linear activation function.

5.3 NETWORK ARCHITECTURES

In the biological brain, a huge number of neurons are interconnected to form the network and
perform advanced intelligent activities. The artificial neural network is built by neuron models.
Many different types of artificial neural networks have been proposed, just as there are many
theories on how biological neural processing works. We may classify the organization of the neural
networks largely into two types: a feedforward net and a recurrent net. The feedforward net has a
hierarchical structure that consists of several layers, without interconnection between neurons in
each layer, and signals flow from input to output layer in one direction. In the recurrent net, multiple
neurons in a layer are interconnected to organize the network. In the following, we give typical
characteristics of the feedforward net and the recurrent net, respectively.

5.3.1 Feedforward Networks


A feedforward network consists of a set of input terminals which feed the input patterns to a layer
or subgroup of neurons. The layer of neurons makes independent computations on data that it
receives, and passes the results to another layer. The next layer may, in turn, make its independent
computations and pass on the results to yet another layer. Finally, a subgroup of one or more
neurons determines the output from the network. This last layer of the network is the output layer.
The layers that are placed between the input terminals and the output layer are called hidden layers.
Some authors refer to the input terminals as the input layer of the network. We do not use
that convention since we wish to avoid ambiguity. Note that each neuron in a network makes its
computation based on the weighted sum of its inputs. There is one exception to this rule: the role of
the ‘input layer’ is somewhat different as units in this layer are used only to hold input data, and to
distribute the data to units in the next layer. Thus, the ‘input layer’ units perform no function—other
than serving as a buffer, fanning out the inputs to the next layer. These units do not perform any
computation on the input data, and their weights, strictly speaking, do not exist.
The network outputs are generated from the output layer units. The output layer makes the
network information available to the outside world. The hidden layers are internal to the network
and have no direct contact with the external environment. There may be from zero to several hidden
layers. The network is said to be fully connected if every output from a single node is channeled to
every node in the next layer.
The number of input and output nodes needed for a network will depend on the nature of the
data presented to the network, and the type of the output desired from it, respectively. The number
of neurons to use in a hidden layer, and the number of hidden layers required for processing a task,
is less obvious. Further comments on this question will appear later.

A Layer of Neurons
A one-layer network with n inputs and M neurons is shown in Fig. 5.9. In the network, each input xj;
j = 1, 2, …, n is connected to the qth neuron input through the weight wqj; q = 1, 2, …, M. The qth
neuron has a summer that gathers its weighted inputs to form its own scalar output

$$\sum_{j=1}^{n} w_{qj} x_j + w_{q0}; \quad q = 1, 2, \ldots, M$$

Finally, the qth neuron outputs ŷq through its activation function σ(·):

$$\hat{y}_q = \sigma\left(\sum_{j=1}^{n} w_{qj} x_j + w_{q0}\right); \quad q = 1, 2, \ldots, M \tag{5.10a}$$

$$= \sigma(\mathbf{w}_q^T \mathbf{x} + w_{q0}); \quad q = 1, 2, \ldots, M \tag{5.10b}$$

where the weight vector wq is defined as

$$\mathbf{w}_q^T = [w_{q1} \; w_{q2} \; \cdots \; w_{qn}] \tag{5.10c}$$
Figure 5.9 A one-layer network: n inputs feeding M neurons, with detail of the qth neuron forming Σj wqj xj + wq0 and passing it through σ(·) to give ŷq

Note that it is common for the number of inputs to be different from the number of neurons
(i.e., n ≠ M). A layer is not constrained to have the number of its inputs equal to the number of its
neurons.
The layer shown in Fig. 5.9 has the M × 1 output vector

$$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_M \end{bmatrix}, \tag{5.11a}$$

n × 1 input vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \tag{5.11b}$$

M × n weight matrix

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{Mn} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_M^T \end{bmatrix} \tag{5.11c}$$

and M × 1 bias vector

$$\mathbf{w}_0 = \begin{bmatrix} w_{10} \\ w_{20} \\ \vdots \\ w_{M0} \end{bmatrix} \tag{5.11d}$$
Note that the row indices on the elements of matrix W indicate the destination neuron for the
weight, and the column indices indicate which source is the input for that weight. Thus, the index
wqj says that the signal from jth input is connected to the qth neuron.
The activation vector is

$$\mathbf{W}\mathbf{x} + \mathbf{w}_0 = \begin{bmatrix} \mathbf{w}_1^T \mathbf{x} + w_{10} \\ \mathbf{w}_2^T \mathbf{x} + w_{20} \\ \vdots \\ \mathbf{w}_M^T \mathbf{x} + w_{M0} \end{bmatrix} \tag{5.11e}$$

The outputs are

$$\begin{aligned} \hat{y}_1 &= \sigma(\mathbf{w}_1^T \mathbf{x} + w_{10}) \\ \hat{y}_2 &= \sigma(\mathbf{w}_2^T \mathbf{x} + w_{20}) \\ &\;\;\vdots \\ \hat{y}_M &= \sigma(\mathbf{w}_M^T \mathbf{x} + w_{M0}) \end{aligned} \tag{5.11f}$$
The input-output mapping is of the feedforward and instantaneous type since it involves no time
delay between the input x and the output ŷ.
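In code, the whole layer of Fig. 5.9 collapses to a single matrix expression; a minimal NumPy sketch of Eqns (5.11e)–(5.11f), assuming sigmoid units.

```python
import numpy as np

def layer_forward(x, W, w0):
    """One layer of M neurons: y_hat = sigma(W x + w0), Eqns (5.11e)-(5.11f).
    W: (M, n) weight matrix; x: (n,) input vector; w0: (M,) bias vector."""
    a = W @ x + w0                   # activation vector, Eqn (5.11e)
    return 1.0 / (1.0 + np.exp(-a))  # sigmoid applied elementwise, Eqn (5.11f)
```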
Consider a neural network with a single output node. For a dataset with n attributes, the output
node receives x1, x2, …, xn, takes a weighted sum of these, and applies the σ(·) function. The output
of the neural network is therefore $\sigma\left(\sum_{j=1}^{n} w_j x_j + w_0\right)$.

First consider a numerical output y (i.e., y ∈ ℝ). If σ(·) is a linear activation function (Eqn (5.8)),
the output is simply
$$\hat{y} = \sum_{j=1}^{n} w_j x_j + w_0$$

This is exactly equivalent to the formulation of linear regression given earlier in Section 3.6
(refer to Eqn (3.70)).
Now consider a binary output variable y. If σ(·) is the log-sigmoid function (Eqn (5.7a)), the output
is simply

$$\hat{y} = \frac{1}{1 + \exp\left(-\left(\sum_{j=1}^{n} w_j x_j + w_0\right)\right)} = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0)}}$$
which is equivalent to the logistic regression formulation given in Section 3.7 (refer to Eqn (3.84)).
Note that here ŷ takes continuous values in the interval (0, 1) and represents the probability of
belonging to Class 1, i.e., ŷ = P(Class 1|x), and P(Class 2|x) = 1 – ŷ.
In both cases, although the neural network models are equivalent to the linear and logistic
regression models, the resulting estimates for the weights in neural network models will be different
from those in linear and logistic regression. This is because the estimation methods are different. As
we will shortly see, the neural network estimation method is different from maximum likelihood
method used in logistic regression, and may be different from least-squares method used in linear
regression.
We will use multiple output nodes ŷq ; q = 1, …, M, for multiclass discrimination problems
(detailed in Section 5.8). For regression (function approximation) problems, multiple output nodes
correspond to multiple response variables we are interested in for numeric prediction. In this case,
a number of regression problems are learned at the same time. An alternative is to train separate
networks for separate regression problems (with one output node). In this chapter, we will focus on
this alternative approach. Our focus is justified on the ground that in many real-life applications, we
are interested in only one response variable, i.e., scalar output variable.

Multi-Layer Perceptrons
Neural networks normally have at least two layers of neurons, with the first layer neurons having
nonlinear and differentiable activation functions. Such networks, as we will see, can approximate
any nonlinear function. In real life, we are faced with nonlinear problems, and multilayer neural
network structures have the capability of providing solutions to these problems.
Figure 5.10 shows a two-layer NN, with n inputs and two layers of neurons. The first of these
layers has m neurons feeding into the second layer possessing M neurons. The first layer or the hidden
layer, has m hidden-layer neurons; the second or the output layer, has M output-layer neurons. It is
not uncommon for different layers to have different numbers of neurons. The outputs of the hidden
layer are inputs to the following layer (output layer); and the network is fully connected. Neural
networks possessing several layers are known as Multi-Layer Perceptrons (MLP); their computing
power is meaningfully improved over the one-layer NN.
All continuous functions, which display certain smoothness, can be approximated to any desired
accuracy with a network of one hidden layer of sigmoidal hidden units, and a layer of linear output
units [83]. Does this mean that it is not required to employ more than one hidden layer and/or mix
different kinds of activation functions? In fact, the accuracy may be enhanced with the help of
network architectures with more hidden layers/mixing activation functions. Especially when the
mapping to be learned is highly complicated, there is a likelihood of performance improvement.
However, as the implementation and training of the network become increasingly complex with
sophisticated network architectures, it is normal to apply only a single hidden layer of similar
activation functions, and an output layer of linear units. We will focus on two-layer feedforward
neural networks with sigmoidal/hyperbolic tangent hidden units and linear output units for function
approximation problems. For classification problems, the linear output units will be replaced with
sigmoidal units. These are widely used network architectures, and work very well in many practical
applications.
Figure 5.10 A two-layer network: inputs x1, …, xn feed m hidden neurons (outputs z1, …, zm, each with σ(·) activation and a bias input of 1), whose outputs feed M output neurons (outputs ŷ1, …, ŷM, each with a bias input of 1)

Defining the input terminals as xj; j = 1, …, n, and the hidden-layer outputs as zl allows one to
write

$$z_l = \sigma\left(\sum_{j=1}^{n} w_{lj} x_j + w_{l0}\right) = \sigma(\mathbf{w}_l^T \mathbf{x} + w_{l0}); \quad l = 1, 2, \ldots, m \tag{5.12a}$$

where

$$\mathbf{w}_l^T \triangleq [w_{l1} \; w_{l2} \; \cdots \; w_{ln}]$$

are the weights connecting the input terminals to the hidden layer.
Defining the output-layer nodes as ŷq, one may write the NN output as

$$\hat{y}_q = \sum_{l=1}^{m} v_{ql} z_l + v_{q0} = \mathbf{v}_q^T \mathbf{z} + v_{q0}; \quad q = 1, \ldots, M \tag{5.12b}$$

where

$$\mathbf{v}_q^T \triangleq [v_{q1} \; v_{q2} \; \cdots \; v_{qm}]$$

are the weights connecting the hidden layer to the output layer.


For the multiclass discrimination problems, our focus will be on two-layer feedforward neural
networks with sigmoidal/hyperbolic tangent hidden units (outputs of hidden units given by (5.12a)),
and sigmoidal output units. The NN output of this multilayer structure may be written as,

$$\hat{y}_q = \sigma\left(\sum_{l=1}^{m} v_{ql} z_l + v_{q0}\right); \quad q = 1, \ldots, M \tag{5.12c}$$
The inputs to the output-layer units (refer to Eqns (5.12b)-(5.12c)) are the nonlinear basis
function values zl; l = 1, …, m, computed by the hidden units. It can be said that the hidden units
make a nonlinear transformation from the n-dimensional input space to the m-dimensional space
spanned by the hidden units and in this space, the output layer implements a linear/logistic function.
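Chaining Eqns (5.12a)–(5.12c) gives the complete forward pass of the two-layer network of Fig. 5.10; a minimal sketch, where the output activation is the identity for regression (Eqn (5.12b)) or a sigmoid for classification (Eqn (5.12c)).

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, w0, V, v0, output_sigma=lambda a: a):
    """Two-layer forward pass.
    W: (m, n), w0: (m,)  -- input-to-hidden weights and biases (Eqn 5.12a)
    V: (M, m), v0: (M,)  -- hidden-to-output weights and biases (Eqns 5.12b/c)"""
    z = sigmoid(W @ x + w0)            # hidden-layer outputs z_l
    return output_sigma(V @ z + v0)    # network outputs y_hat_q

# For multiclass discrimination, pass output_sigma=sigmoid (Eqn 5.12c).
```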

5.3.2 Recurrent Networks


The feedforward networks (Figs 5.9–5.10) implement fixed-weight mappings from the input
space to the output space. Because the networks have fixed weights, the state of any neuron is
solely determined by the input to the unit, and not the initial and past states of the neurons. This
independence of initial and past states of the network neurons limits the use of such networks
because no dynamics are involved. The maps implemented by the feedforward networks of the type
shown in Figs 5.9–5.10, are static maps.
To allow initial and past state involvement along with serial processing, recurrent neural
networks utilize feedback. Recurrent neural networks are also characterized by use of nonlinear
processing units; thus, such networks are nonlinear dynamic systems (Networks of the form shown
in Figs 5.9–5.10 are nonlinear static systems).
The architectural layout of a recurrent network takes diverse forms. Feedback may come from
the output neurons of a feedforward network to the input terminals. Feedback may also come from
the hidden neurons of the network to the input terminals. In case the feedforward network possesses
two or more hidden layers, the likely forms of feedback expand further. Recurrent networks possess
a rich collection of architectural layouts.
It often turns out that several real-world problems, which are thought to be solvable only through
recurrent architectures, are solvable with feedforward architectures also. A multilayer feedforward
network, which realizes a static map, is capable of representing the input/output behavior of a
dynamic system. To make this possible, the neural network has to be provided with information
regarding the system history—delayed inputs and outputs (refer to Section 1.4.1). The amount
of history required is dependent on the level of accuracy sought, and the resulting computational
complexity. A large number of inputs increases the number of weights in the network, which may result
in higher accuracy, but it may also significantly increase the training time. Trial-and-error on the
number of inputs, as well as on the network structure, is the search process, as in other machine
learning systems (later sections will give more details); a sketch of this input construction follows.
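As a concrete illustration of supplying system history to a static network, the sketch below assembles lagged inputs and outputs into a training matrix; the lag counts nu and ny are illustrative assumptions, chosen by trial-and-error in practice.

```python
import numpy as np

def lagged_features(u, y, nu=2, ny=2):
    """Build rows [u(k-1)...u(k-nu), y(k-1)...y(k-ny)] with target y(k),
    so a static feedforward net can model a dynamic system's input/output behavior."""
    k0 = max(nu, ny)
    X = [np.r_[u[k-nu:k][::-1], y[k-ny:k][::-1]] for k in range(k0, len(y))]
    return np.array(X), y[k0:]
```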
From several practical applications published over the past decade, there seems to be considerable
evidence that multilayer feedforward networks have an extraordinary capability to do quite well in
most cases.
We will focus on two-layer feedforward neural networks with sigmoidal or hyperbolic tangent
hidden units and linear/sigmoidal output units. This, in all likelihood, is the most popular network
architecture as it works well in many practical applications.
The rest of this chapter is organized as follows: We first consider principles of design for the
primitive units that make up artificial neural networks (perceptrons, linear units, and sigmoid units),
along with learning algorithms for training single units. We then present the BACKPROPAGATION
algorithm for training multilayer networks of such units, and several general issues related to the
algorithm. We conclude the chapter with our discussion on RBF networks.

5.4 PERCEPTRONS

Classical NN systems are based on units called PERCEPTRON and ADALINE (ADAptive
Linear Element). The perceptron was developed in 1958 by Frank Rosenblatt, a researcher in
neurophysiology, to perform a kind of pattern recognition task. In mathematical terms, it resulted from
the solution of the classification problem. ADALINE was developed by Bernard Widrow and Marcian
Hoff; it originated from the field of signal processing, or more specifically, from the adaptive noise
cancellation problem. It resulted from the solution of the regression problem, the regressor having
the properties of a noise canceller (linear filter).
The perceptron takes a vector of real-valued inputs, calculates a linear combination of these
inputs; then outputs +1 if the result is greater than the threshold and –1 otherwise (refer to Fig. 4.2).
The ADALINE in its early stage consisted of a neuron with a linear activation function (Eqn
5.8), a hard limiter (a thresholding device with a signum activation function) and the Least Mean
Square (LMS) learning algorithm. We focus on the two most important parts of ADALINE—its
linear activation function and the LMS learning rule. The hard limiter is omitted, not because it is
irrelevant, but for being of lesser importance to the problems to be solved. The words ADALINE
and linear neuron are both used here for a neural processing unit with a linear activation function
and a corresponding learning rule (not necessarily LMS).
The roots of both the perceptron and the ADALINE were in the linear domain. However, in real
life, we are faced with nonlinear problems, and the perceptron was superseded by more sophisticated
and powerful neuron and neural network structures (multilayer neural networks). What is the type
of unit to be used to construct multilayer networks? Firstly, we may be tempted to select linear
units. But multiple layers of cascaded linear units still produce only linear functions, and
we prefer networks capable of representing highly nonlinear functions. Another likely selection
could be perceptron unit. However, because of its discontinuous threshold, it is not differentiable,
and therefore, not suited to the gradient descent approach for optimizing the performance criterion.
What is required is a unit whose output is a nonlinear function of its inputs, and which is also a
differentiable function of its inputs. One solution is the sigmoid unit, a unit similar
to the perceptron, but based on a smoothened, differentiable threshold function (Fig. 5.8; Eqns (5.7)).
These activation functions are nothing but softer versions of original perceptron’s hard-limiting
threshold functions. In literature, these softer versions are also referred to as perceptrons, and the
multilayer neural networks are also referred to as Multi-Layer Perceptron (MLP) Networks.
In the following, we discuss principles of perceptron learning for classification tasks. The next
section gives the principles for linear-neuron learning. Thereafter, principles of ‘soft’ perceptron
(sigmoid unit) learning will be presented.

5.4.1 Limitations of Perceptron Algorithm for Linear Classification Tasks


The roots of the Rosenblatt’s perceptron were in the linear domain. It was developed as the simplest
yet powerful classifier providing the linear separability of class patterns or examples. In Section
4.3, we have presented a detailed account of perceptron algorithm. It was observed that there is a
major problem associated with this algorithm for real-world solutions: datasets are almost certainly
not linearly separable, while the algorithm finds a separating hyperplane only for linearly separable
data. When the dataset is linearly inseparable, the test of the decision surface will always fail
for some subset of training points regardless of the adjustments we make to the free parameters,
and the algorithm will loop forever. So, an upper bound needs to be imposed on the number of
iterations. Thus, when the perceptron algorithm is applied in practice, we have to live with errors—
true outputs will not always be equal to the desired ones.
History has proved that limitations of Rosenblatt’s perceptron can be overcome by neural
networks. The perceptron criterion function is based on misclassification error (number of samples
misclassified), and gradient procedures for minimization are not applicable. Neural networks, which
primarily solve regression problems, are based on the minimum squared-error criterion (Eqn
(3.71)) and employ gradient procedures for minimization. The algorithms for separable—as well
as inseparable—data classification are first developed in the context of regression problems and
then adapted for classification problems. Some methods for minimization of squared-error criterion
were discussed in Section 3.6; the gradient procedures will be discussed in the present chapter.

5.4.2 Linear Classification using Regression Techniques


In the following, we present basic concepts of linear classification using regression techniques
employing classical hard-limiting perceptrons.
In real-life, we are faced with nonlinear classification problems. The perceptron was superseded
by more sophisticated and powerful neuron and neural network structures. A popular network
used today—the multilayer network, has hidden layers of neurons with sigmoidal activations
(discussed in later sections). These activation functions are nothing but softer versions of original
perceptron’s hard-limiting threshold functions. In literature, these softer versions are also referred
to as perceptrons. Using a Multi-Layer Perceptron (MLP) network for nonlinear classification is not
radically different; it directly follows from the concepts for linear classification discussed below.
The regression techniques discussed in Section 3.6, and also later in this chapter, can be used for
linear classification with a careful choice of the target values associated with classes. Let the set of
training (data) examples D be
$$\mathcal{D} = \{x_j^{(i)}, y^{(i)}\}; \; i = 1, \ldots, N; \; j = 1, \ldots, n = \{\mathbf{x}^{(i)}, y^{(i)}\} \tag{5.13}$$
where $\mathbf{x}^{(i)} = [x_1^{(i)} \; x_2^{(i)} \; \cdots \; x_n^{(i)}]^T$ is an n-dimensional input vector (pattern with n features) for the ith
example in a real-valued space; y(i) is its class label (output value), and y(i) ∈ {+1, –1}, where +1 denotes
Class 1 and –1 denotes Class 2. To build a linear classifier, we need a linear function of the form
$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 \tag{5.14}$$
so that the input vector x(i) is assigned to Class 1 if g(x(i)) > 0, and to Class 2 if g(x(i)) < 0, i.e.,

$$\hat{y}^{(i)} = \begin{cases} +1 & \text{if } \mathbf{w}^T \mathbf{x}^{(i)} + w_0 > 0 \\ -1 & \text{if } \mathbf{w}^T \mathbf{x}^{(i)} + w_0 < 0 \end{cases} \tag{5.15}$$

w = [w1 w2 …wn]T is the weight vector and w0 is the bias. In terms of regression, we can view this
classification problem as follows.
Given a vector x(i), the output of the summing unit (linear combiner) will be wTx(i) + w0 (decision
hyperplane), and thresholding this output through a sgn function gives us the perceptron output
ŷ(i) = ±1 (refer to Fig. 5.11).

Figure 5.11 Linear classification using regression technique: a summing unit (linear combiner) computing Σj wj xj + w0 = wTx + w0, followed by a hard limiter giving ŷ = ±1

The sum of error squares for the classifier becomes (Eqn (3.71))

$$E = \sum_{i=1}^{N} (e^{(i)})^2 = \sum_{i=1}^{N} (y^{(i)} - \hat{y}^{(i)})^2 \tag{5.16}$$

We require E to be a function of (w, w0) to design the linear function g(x) = wTx + w0 that
minimizes E. To obtain E(w, w0), we replace the perceptron outputs ŷ(i) by the linear combiner
outputs wTx(i) + w0; this gives us the error function

$$E(\mathbf{w}, w_0) = \sum_{i=1}^{N} \left(y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + w_0)\right)^2 \tag{5.17}$$
Consider pattern i. If x(i) ∈ Class 1, the desired output is y(i) = +1. With summing-unit output
wTx(i) + w0 > 0 (ŷ(i) = +1; refer to Fig. 5.11), the contribution of the correctly classified pattern i to E(w, w0)
is small compared with that of a wrongly classified pattern (wTx(i) + w0 < 0; ŷ(i) = –1).
The error function E(w, w0) in Eqn (5.17) is a continuous, differentiable function; therefore,
gradient descent approach (discussed later in this sub-section) for minimization of E(w, w0) will be
applicable. The training algorithm based on this E(w, w0) can be seen as the training algorithm of
a linear neuron without the nonlinear (signum) activation function. Nonlinearity is ignored during
training; after training and once the weights have been fixed, the model is the perceptron model
with the hard limiter following the linear combiner.
• If the unthresholded output wTx(i) + w0 can be trained to fit the desired values y(i) = ±1 in a
perfect way, then the thresholded output will fit them as well (because sgn (1) = 1 and sgn
(–1) = –1). Even when the target values cannot be fit perfectly in the unthresholded case, the
thresholded value will correctly fit the ±1 target value whenever the unthresholded output has
the correct sign. Note, however, that while gradient descent procedure will learn weights that
minimize the error in the unthresholded output, these weights will not necessarily minimize
the number of training examples misclassified by the thresholded output.
• The perceptron training rule (Section 4.3) converges after a finite number of iterations to
a weight vector that perfectly classifies the training data provided the training examples
are linearly separable. The gradient descent rule converges only asymptotically toward the
minimum-error weight vector, possibly requiring unbounded time, but converges regardless
of whether the training data is linearly separable or not.

5.4.3 Standard Gradient Descent Optimization Scheme: Steepest Descent


Gradient descent serves as the basis for learning algorithms that search the hypothesis space
of possible weight vectors to find the weights that best fit the training examples. The gradient
descent training rule for a single neuron is important because it provides the basis for the
BACKPROPAGATION algorithm, which can learn networks with many interconnected units.
For linear classification using regression techniques, the task is to train an unthresholded perceptron
(it corresponds to the first stage of the perceptron, without the threshold; Fig. 5.11), for which the output
is given by

$$g(\mathbf{x}) = \sum_{j=1}^{n} w_j x_j + w_0 = \mathbf{w}^T\mathbf{x} + w_0 \qquad (5.18)$$

Let us define a single augmented weight vector w̄ for the weights (w, w0):

$$\bar{\mathbf{w}} = [w_0\ w_1\ w_2\ \ldots\ w_n]^T \qquad (5.19)$$

In terms of the weight vector w̄, the output is
$$g(\mathbf{x}) = \bar{\mathbf{w}}^T\bar{\mathbf{x}} \qquad (5.20a)$$
where
$$\bar{\mathbf{x}} = [x_0\ x_1\ x_2\ \ldots\ x_n]^T;\quad x_0 = 1 \qquad (5.20b)$$

The unthresholded output w̄ᵀx̄(i) is to be trained to fit the desired values y(i), minimizing the error
(Eqn (5.17))

$$E(\bar{\mathbf{w}}) = \frac{1}{2}\sum_{i=1}^{N} \big(y^{(i)} - \bar{\mathbf{w}}^T\bar{\mathbf{x}}^{(i)}\big)^2 \qquad (5.21)$$

This error function is a continuous, differentiable function; therefore, the gradient descent approach
for minimization of E(w̄) will be applicable (the constant ½ is used for computational convenience
only; it gets cancelled out by the differentiation required in the error minimization process).
To understand the gradient descent algorithm, it is helpful to visualize the error space of possible
weight vectors and the associated values of the performance criterion (cost function E). For the
unthresholded unit (a linear weighted combination of inputs), the error surface is parabolic with
a single global minimum. The specific parabola will depend, of course, on the particular set of
training examples.
How can we calculate the direction of steepest descent along the error surface? This direction
can be found by computing the derivative of E with respect to each component of the vector w̄. This
vector-derivative is called the gradient of E with respect to w̄, written ∇E(w̄):

$$\nabla E(\bar{\mathbf{w}}) = \Big[\frac{\partial E}{\partial w_0}\ \ \frac{\partial E}{\partial w_1}\ \cdots\ \frac{\partial E}{\partial w_n}\Big]^T \qquad (5.22)$$

When interpreted as a vector in weight space, the gradient specifies the direction that produces
the steepest increase in E. The negative of this vector, therefore, gives the direction of steepest
decrease. Therefore, the training rule for gradient descent is,
$$\bar{\mathbf{w}} \leftarrow \bar{\mathbf{w}} + \Delta\bar{\mathbf{w}} \qquad (5.23a)$$
where
$$\Delta\bar{\mathbf{w}} = -\eta\,\nabla E(\bar{\mathbf{w}}) \qquad (5.23b)$$
Here η is a positive constant (less than one), called the learning rate, which determines the step
size in the gradient descent search. This training rule can also be written in its component form:
$$w_j \leftarrow w_j + \Delta w_j;\quad j = 0, 1, 2, \ldots, n \qquad (5.24a)$$
where
$$\Delta w_j = -\eta\,\frac{\partial E}{\partial w_j} \qquad (5.24b)$$
which shows that steepest descent is achieved by altering each component wj of w̄ in proportion
to ∂E/∂wj.
Gradient descent search helps determine a weight vector that minimizes E by starting with an
arbitrary initial weight vector and then altering it again and again in small steps. At each step, the
weight vector is changed in the direction producing the steepest descent along the error surface.
The process goes on till the global minimum error is attained.
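Before deriving the gradient analytically, it is worth noting that any such derivation can be spot-checked against a finite-difference estimate. A minimal sketch follows (assuming a NumPy environment; the helper name is our own, not from the text):

```python
import numpy as np

def numerical_gradient(E, w, h=1e-6):
    """Central-difference estimate of dE/dw_j for each component of w.

    E is any scalar cost function of the weight vector w; comparing its
    output against an analytic gradient is a standard sanity check.
    """
    g = np.zeros_like(w)
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += h
        wm[j] -= h
        g[j] = (E(wp) - E(wm)) / (2.0 * h)
    return g
```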
To build a practical algorithm for repeated weight updates according to Eqn (5.24), we require an
effective technique to calculate the gradient at each step. Fortunately, this is quite easy. The gradient
with respect to weight wj; j = 1, …, n, can be obtained by differentiating E from Eqn (5.21) as,

$$\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j}\Bigg[\frac{1}{2}\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\Big)\Big)^2\Bigg]$$

The error e(i) for the ith sample of data is given by $e^{(i)} = y^{(i)} - \big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\big)$. It follows that

$$\frac{\partial E}{\partial w_j} = \frac{1}{2}\sum_{i=1}^{N}\Big[\frac{\partial}{\partial w_j}(e^{(i)})^2\Big] = \sum_{i=1}^{N} e^{(i)}\,\frac{\partial e^{(i)}}{\partial w_j}$$

with

$$\frac{\partial e^{(i)}}{\partial w_j} = -x_j^{(i)}$$

Therefore,

$$\frac{\partial E}{\partial w_j} = -\sum_{i=1}^{N} e^{(i)}\, x_j^{(i)} = -\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\Big)\Big)\, x_j^{(i)}$$

The gradient with respect to the bias,

$$\frac{\partial E}{\partial w_0} = -\sum_{i=1}^{N} e^{(i)} = -\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\Big)\Big)$$

Therefore, the weight update rule for gradient descent becomes

$$w_j \leftarrow w_j + \eta\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\Big)\Big)\, x_j^{(i)} \qquad (5.25a)$$

$$w_0 \leftarrow w_0 + \eta\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\Big)\Big) \qquad (5.25b)$$

An epoch is a complete run through all the N associated pairs. Once an epoch is completed, the
pair (x(1), y(1)) is presented again, and a run is performed through all the pairs once more. After several
epochs, the output error is expected to be sufficiently small.
The iteration index k corresponds to the number of times the set of N pairs has been presented and the
cumulative error computed. That is, k corresponds to the epoch number.

In terms of iteration index k, the weight update equations are


$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)\Big)\Big)\, \mathbf{x}^{(i)} \qquad (5.26a)$$

$$w_0(k+1) = w_0(k) + \eta\sum_{i=1}^{N}\Big(y^{(i)} - \Big(\sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)\Big)\Big) \qquad (5.26b)$$
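The epoch-based updates of Eqns (5.26) translate directly into code. The following is a minimal sketch, assuming a NumPy environment and our own hypothetical names (X for the N × n pattern matrix, y for the vector of ±1 labels); it is illustrative, not the book's implementation.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, epochs=100):
    """Steepest descent for the unthresholded perceptron, Eqns (5.26).

    Errors are summed over all N training pairs before each update,
    so the iteration index k here counts epochs.
    """
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for k in range(epochs):
        e = y - (X @ w + w0)        # e(i) = y(i) - (w^T x(i) + w0), all i
        w = w + eta * (X.T @ e)     # Eqn (5.26a): errors summed over i
        w0 = w0 + eta * e.sum()     # Eqn (5.26b)
    return w, w0
```

Thresholding X @ w + w0 through sgn then recovers the classifier of Fig. 5.11.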

5.5 LINEAR NEURON AND THE WIDROW-HOFF LEARNING RULE

The perceptron (in its ‘softer’ version, the sigmoid unit) has been a fundamental building block in
present-day neural models. Another important building block has been ADALINE (ADAptive
LINear Element), developed by Bernard Widrow and Marcian Hoff in 1959. It originated from
the field of signal processing, or more specifically, from the adaptive noise cancellation problem.
It resulted from the solution of a regression problem, the regressor having the properties of a noise
canceller (linear filter). Its power in the linear domain is still in full service, and despite being a
simple neuron, it is present (without a thresholding device) in almost all neural models for
regression functions. The terms ADALINE and linear neuron are both used here for a neural
processing unit with a linear activation function, shown in Fig. 5.12a. The neuron labeled with
the summation sign only (Fig. 5.12b) is equivalent to the linear neuron of Fig. 5.12a.
[Figure 5.12 Neural processing unit with a linear activation function: (a) inputs x1, …, xn with weights w1, …, wn and bias w0 feed a summing unit (linear combiner) producing a = wᵀx + w0, followed by a linear activation function σ(a) of slope 1, giving ŷ; (b) the equivalent neuron drawn with the summation sign only, ŷ = wᵀx + w0]

In the last section, we discussed the gradient descent optimization scheme to determine the
optimum setting of the weights (w, w0) that minimizes the criterion function given by Eqn (5.21).
Note that this ‘sum of error squares’ criterion function is deterministic, and the gradient descent

scheme gives a deterministic algorithm for minimization of this function. Now we explore a digression
from this criterion function. Consider the problem of computing weights (w, w0) so as to minimize the
Mean Square Error (MSE) between desired and true outputs, defined as follows:
$$E(\mathbf{w}, w_0) = \mathbb{E}\Bigg[\frac{1}{2}\sum_{i=1}^{N}\Big(y^{(i)} - (\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\Big)^2\Bigg] \qquad (5.27)$$

where 𝔼[·] is the statistical expectation operator.
The solution to this problem requires the computation of the autocorrelation matrix 𝔼[x xᵀ] of the
set of feature vectors, and the cross-correlation matrix 𝔼[x y] between the desired response and the
feature vector. This presupposes knowledge of the underlying distributions, which, in general, are
not known. Thus, our major goal becomes to see if it is possible to solve this optimization problem
without having this statistical information.
The Least Mean Square (LMS) algorithm, originally formulated by Widrow and Hoff, is a
stochastic gradient algorithm that iterates the weights (w, w0) of the regressor after each presentation
of a data sample, unlike standard gradient descent, which iterates the weights after presentation of the
whole training dataset. That is, the kth iteration in standard gradient descent means the kth epoch, or
the kth presentation of the whole training dataset, while the kth iteration in stochastic gradient descent
means the presentation of the kth single training data pair (drawn in sequence or randomly). Thus, the
calculation of the weight change Δw̄, or of the gradient needed for it, is pattern-based, not epoch-based
(Δw̄ = −η∇E(w̄); Eqn (5.23b)).
LMS is called a stochastic gradient algorithm because the gradient vector is chosen at ‘random’
and not, as in the steepest descent case, precisely derived from the shape of the total error surface.
‘Random’ here means the instantaneous value of the gradient, which is then used as an estimator of
the true quantity.
The design of the LMS algorithm is very simple, yet a detailed analysis of its convergence
behavior is a challenging mathematical task. It turns out that under mild conditions, the solution
provided by the LMS algorithm converges in probability to the solution of the sum-of-error-squares
optimization problem.

5.5.1 Stochastic Gradient Descent


While the standard gradient descent training rule of Eqns (5.26) calculates weight updates after
summing errors over all the training examples in the given dataset 𝒟, the concept behind stochastic
gradient descent is to approximate this gradient descent search by updating the weights incrementally,
following the calculation of the error for each individual example. This modified training rule is
like the training rule given by Eqns (5.26), except that as we iterate through each training example,
we update the weights according to the gradient with respect to the distinct error function,

$$E(k) = \tfrac{1}{2}\big[y^{(i)} - \hat{y}(k)\big]^2 = \tfrac{1}{2}[e(k)]^2 \qquad (5.28a)$$

$$\hat{y}(k) = \sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k) \qquad (5.28b)$$

where k is the iteration index. Note that the input components xj(i) and the desired output y(i) are
not functions of the iteration index. Training pairs (x(i), y(i)), drawn in sequence or randomly, are
presented to the network at each iteration. The gradients with respect to weights and bias are
computed as follows:
$$\frac{\partial E(k)}{\partial w_j(k)} = e(k)\,\frac{\partial e(k)}{\partial w_j(k)} = -e(k)\,\frac{\partial \hat{y}(k)}{\partial w_j(k)} = -e(k)\, x_j^{(i)}$$

$$\frac{\partial E(k)}{\partial w_0(k)} = -e(k)$$

The stochastic gradient descent algorithm becomes,


$$w_j(k+1) = w_j(k) + \eta\, e(k)\, x_j^{(i)} \qquad (5.29a)$$
$$w_0(k+1) = w_0(k) + \eta\, e(k) \qquad (5.29b)$$
In terms of vectors, this algorithm may be expressed as,
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\, e(k)\, \mathbf{x}^{(i)} \qquad (5.30a)$$
$$w_0(k+1) = w_0(k) + \eta\, e(k) \qquad (5.30b)$$
The stochastic gradient training algorithm iterates over the training examples i = 1, 2, …, N (drawn
in sequence or randomly), altering the weights at each iteration as per the above equations. The
sequence of these weight updates, iterated over all the training examples, gives a reasonable
approximation to the gradient with respect to the entire set of training data. By making the value of
η small enough, stochastic gradient descent can be made to approximate standard gradient descent
(steepest descent) arbitrarily closely.
At each presentation of a data pair (x(i), y(i)), one step of the training algorithm is performed, updating
both the weights and the bias. Note that teaching the network one fact at a time from one data pair
does not work: all the weights and the bias set so meticulously for one fact could be drastically
altered in learning the next fact. The network has to learn everything together, finding the best
weight and bias settings for the total set of facts. Therefore, with incremental learning, training
should stop only after an epoch has been completed.
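As a sketch, the LMS iteration of Eqns (5.29)–(5.30) differs from the batch version given earlier only in where the update sits: inside the loop over patterns. Again, the NumPy setting and the names are our own assumptions, not from the text.

```python
import numpy as np

def lms(X, y, eta=0.01, epochs=50, seed=0):
    """Widrow-Hoff LMS: one weight/bias update per training pair."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):               # stop only at epoch boundaries
        for i in rng.permutation(N):      # pairs drawn randomly
            e = y[i] - (w @ X[i] + w0)    # instantaneous error e(k)
            w = w + eta * e * X[i]        # Eqn (5.30a)
            w0 = w0 + eta * e             # Eqn (5.30b)
    return w, w0
```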

5.6 THE ERROR-CORRECTION DELTA RULE

In this section, the gradient descent strategy for adapting the weights of a single neuron having a
differentiable activation function is demonstrated. This is just a small (nonlinear) deviation
from the derivation of the adaptive rule for the linear activation function given in the previous section.
Including this small deviation is a natural step towards deriving the gradient-descent based algorithm
for multilayer neural networks (we will derive this algorithm in the next section).
A neural unit with any differentiable activation function σ(a) is shown in Fig. 5.13. It first computes a
linear combination of its inputs (activation value a), then applies the nonlinear activation function σ(a)
to the result. The output ŷ of the nonlinear unit is a continuous function of its input a. More precisely,
the nonlinear unit computes its output as,
$$\hat{y} = \sigma(a) \qquad (5.31a)$$
$$a = \sum_{j=1}^{n} w_j x_j + w_0 \qquad (5.31b)$$

[Figure 5.13 Neural unit with any differentiable activation function: inputs x0 = 1, x1, …, xn with weights w0, w1, …, wn feed a summing unit producing a = wᵀx + w0 = w̄ᵀx̄, followed by the differentiable activation function σ(a), giving ŷ]

The problem is to find the expression for the learning rule for adapting weights using a training
set of pairs of input and output patterns; the learning is in stochastic gradient descent mode, as in
the last section. We begin by defining the error function E(k):

$$E(k) = \tfrac{1}{2}\big(y^{(i)} - \hat{y}(k)\big)^2 = \tfrac{1}{2}[e(k)]^2 \qquad (5.32a)$$

$$e(k) = y^{(i)} - \hat{y}(k) \qquad (5.32b)$$


$$\hat{y}(k) = \sigma\Big(\sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)\Big) \qquad (5.32c)$$
For each training example i, the weights wj; j = 1, …, n (and bias w0) are updated by adding Δwj
(and Δw0):

$$\Delta w_j(k) = -\eta\,\frac{\partial E(k)}{\partial w_j(k)} \qquad (5.33a)$$

$$w_j(k+1) = w_j(k) - \eta\,\frac{\partial E(k)}{\partial w_j(k)} \qquad (5.33b)$$

$$w_0(k+1) = w_0(k) - \eta\,\frac{\partial E(k)}{\partial w_0(k)} \qquad (5.33c)$$
Note that E(k) is now a nonlinear function of the weights, and the gradient cannot be calculated
following the equations derived in the last section for a linear neuron. Fortunately, the calculation
of the gradient is straightforward in the nonlinear case as well. For this purpose, the chain rule gives,

$$\frac{\partial E(k)}{\partial w_j(k)} = \frac{\partial E(k)}{\partial a(k)}\,\frac{\partial a(k)}{\partial w_j(k)} \qquad (5.34)$$

where the first term on the right-hand side is a measure of the error change due to the activation
value a(k) at the kth iteration, and the second term shows the influence of the weights on that
particular activation value a(k). Applying the chain rule again, we get,

$$\frac{\partial E(k)}{\partial w_j(k)} = \frac{\partial E(k)}{\partial e(k)}\,\frac{\partial e(k)}{\partial \hat{y}(k)}\,\frac{\partial \hat{y}(k)}{\partial a(k)}\,\frac{\partial a(k)}{\partial w_j(k)} = e(k)\,[-1]\,\frac{\partial \sigma(a(k))}{\partial a(k)}\, x_j^{(i)} = -e(k)\,\sigma'(a(k))\, x_j^{(i)} \qquad (5.35)$$
The learning rule can be written as,
$$w_j(k+1) = w_j(k) + \eta\, e(k)\,\sigma'(a(k))\, x_j^{(i)} \qquad (5.36a)$$
$$w_0(k+1) = w_0(k) + \eta\, e(k)\,\sigma'(a(k)) \qquad (5.36b)$$
This is the most general learning rule valid for a single neuron having any nonlinear and
differentiable activation function whose input is formed as a product of the pattern and weight
vectors. It parallels the LMS algorithm for a linear neuron presented in the last section, which was
an early powerful strategy for adapting weights using data pairs only.
This rule is also known as the delta learning rule, with delta defined as,
$$\delta(k) = e(k)\,\sigma'(a(k)) = \big(y^{(i)} - \hat{y}(k)\big)\,\sigma'(a(k)) \qquad (5.37)$$
In terms of δ(k), the weight-update equations become
$$w_j(k+1) = w_j(k) + \eta\,\delta(k)\, x_j^{(i)} \qquad (5.38a)$$
$$w_0(k+1) = w_0(k) + \eta\,\delta(k) \qquad (5.38b)$$
It should be carefully noted that the δ(k) in these equations is not the error but the error change
−∂E(k)/∂a(k) due to the input a(k) to the nonlinear activation function at the kth iteration:

$$-\frac{\partial E(k)}{\partial a(k)} = -\frac{\partial E(k)}{\partial e(k)}\,\frac{\partial e(k)}{\partial \hat{y}(k)}\,\frac{\partial \hat{y}(k)}{\partial a(k)} = -e(k)\,[-1]\,\sigma'(a(k)) = e(k)\,\sigma'(a(k)) = \delta(k) \qquad (5.39)$$

Thus, δ(k) will generally not be equal to the error e(k). We will use the term error signal for δ(k),
keeping in mind that, in fact, it represents the error change.
In the world of neural computing, the error signal δ(k) is of the highest importance. After a hiatus
of about 20 years in the development of learning rules for multilayer networks, an adaptation rule
based on the delta rule made a breakthrough in 1986 and was named the generalized delta learning
rule. Today, this rule is also known as the error backpropagation learning rule (discussed in the
next section).

Interestingly, for a linear activation function (Fig. 5.12),
$$\sigma(a(k)) = a(k)$$
Therefore,
$$\sigma'(a(k)) = 1$$
and
$$\delta(k) = e(k)\,\sigma'(a(k)) = e(k) \qquad (5.40)$$
That is, delta represents the error itself. Therefore, the delta rule for a linear neuron is the same as
the LMS learning rule presented in the previous section.
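The generality of the delta rule is easy to see in code. Below is a sketch of one stochastic update per Eqns (5.37)–(5.38), with the activation σ and its derivative σ′ passed in as functions; sigma, sigma_prime, and delta_rule_step are our own hypothetical names. With sigma the identity, sigma_prime ≡ 1 and the step reduces to the LMS step, as Eqn (5.40) states.

```python
import numpy as np

def delta_rule_step(x, y, w, w0, eta, sigma, sigma_prime):
    """One delta-rule update for a neuron with a differentiable activation."""
    a = w @ x + w0                            # activation value, Eqn (5.31b)
    delta = (y - sigma(a)) * sigma_prime(a)   # error signal, Eqn (5.37)
    return w + eta * delta * x, w0 + eta * delta   # Eqns (5.38a), (5.38b)
```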

5.6.1 Sigmoid Unit: Soft-Limiting Perceptron


In the neural unit of Fig. 5.13, any nonlinear, smooth, differentiable, and preferably nondecreasing
function can be used. The requirement that the activation function be differentiable is essential for
the error backpropagation algorithm. On the other hand, the requirement that a nonlinear activation
function should monotonically increase is not so strong; it is connected with the desirable
property that its derivative does not change sign.
The activation functions most commonly used in multilayer neural networks are the
squashing sigmoidal functions. The sigmoidal unit is very much like a perceptron, but is based on
a smoothed, differentiable threshold function.
The neural unit shown in Fig. 5.13 becomes a sigmoidal unit when σ(·) represents the sigmoidal
nonlinearity illustrated in Fig. 5.8. The sigmoidal unit first computes a linear combination of its
inputs (activation value a), and then applies a threshold to the result. The thresholded output ŷ =
σ(a) is a continuous function of its input. More precisely, the sigmoid unit computes its output as,
$$\hat{y} = \sigma(a) \qquad (5.41a)$$
$$= \sigma\Big(\sum_{j=1}^{n} w_j x_j + w_0\Big) \qquad (5.41b)$$
Because a sigmoid unit maps a very large input domain onto a small range of outputs, it is often
referred to as a squashing function.
The most common squashing sigmoidal functions are the unipolar logistic function (Fig. 5.8a, Eqn
(5.7a)) and the bipolar sigmoidal function (related to the hyperbolic tangent; Fig. 5.8b, Eqn (5.7b)).
The unipolar logistic function, henceforth referred to as log-sigmoid, squashes the inputs to outputs
between 0 and 1, while the bipolar function, henceforth referred to as tan-sigmoid, squashes the
inputs to outputs between –1 and +1.
Log-sigmoid:
$$\sigma(a) = \frac{1}{1 + e^{-a}} \qquad (5.42a)$$
Tan-sigmoid:
$$\sigma(a) = \frac{1 - e^{-a}}{1 + e^{-a}} \qquad (5.42b)$$

The sigmoidal unit has the useful property that its derivative is easily expressed in terms of its output:

Log-sigmoid:
$$\frac{d\sigma(a)}{da} = \frac{d}{da}\Big[\frac{1}{1+e^{-a}}\Big] = \frac{e^{-a}}{(1+e^{-a})^2} = \frac{1}{1+e^{-a}}\Big[1 - \frac{1}{1+e^{-a}}\Big]$$
$$= \sigma(a)\,[1 - \sigma(a)] \qquad (5.43a)$$
$$= \hat{y}\,(1 - \hat{y}) \qquad (5.43b)$$

Tan-sigmoid:
$$\frac{d\sigma(a)}{da} = \frac{d}{da}\Big[\frac{1-e^{-a}}{1+e^{-a}}\Big] = \frac{2e^{-a}}{(1+e^{-a})^2} = \frac{1}{2}\Big[1 - \frac{1-e^{-a}}{1+e^{-a}}\Big]\Big[1 + \frac{1-e^{-a}}{1+e^{-a}}\Big]$$
$$= \tfrac{1}{2}\,\big(1 - \sigma(a)\big)\big(1 + \sigma(a)\big) \qquad (5.44a)$$
$$= \tfrac{1}{2}\,(1 - \hat{y})(1 + \hat{y}) \qquad (5.44b)$$
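These output-based forms of the derivatives are easy to spot-check numerically. A small sketch (our own throwaway script, assuming NumPy) compares each closed form with a central-difference estimate:

```python
import numpy as np

a = np.linspace(-4.0, 4.0, 9)   # sample points along the activation axis
h = 1e-6                        # finite-difference step

log_sig = lambda a: 1.0 / (1.0 + np.exp(-a))
tan_sig = lambda a: (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))

# Eqn (5.43a): d sigma/da = sigma(a) [1 - sigma(a)]
num = (log_sig(a + h) - log_sig(a - h)) / (2.0 * h)
assert np.allclose(num, log_sig(a) * (1.0 - log_sig(a)))

# Eqn (5.44a): d sigma/da = (1/2) (1 - sigma(a)) (1 + sigma(a))
num = (tan_sig(a + h) - tan_sig(a - h)) / (2.0 * h)
assert np.allclose(num, 0.5 * (1.0 - tan_sig(a)) * (1.0 + tan_sig(a)))
```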

As we shall see, gradient descent learning makes use of these derivatives.
The most general learning rule, valid for a single neuron having any nonlinear and
differentiable activation function, is given by Eqns (5.37)–(5.38). For the specific case of the sigmoidal
(log-sigmoid) nonlinearity, we have,

$$\sigma'(a(k)) = \frac{d}{da(k)}\,\sigma(a(k)) = \sigma(a(k))\,[1 - \sigma(a(k))] = \hat{y}(k)\,[1 - \hat{y}(k)]$$

Therefore,

$$\delta(k) = e(k)\,\sigma'(a(k)) = \big(y^{(i)} - \hat{y}(k)\big)\,\hat{y}(k)\,[1 - \hat{y}(k)]$$
The weight-update equations become,

$$w_j(k+1) = w_j(k) + \eta\,\delta(k)\, x_j^{(i)} \qquad (5.45a)$$

$$w_0(k+1) = w_0(k) + \eta\,\delta(k) \qquad (5.45b)$$

$$\delta(k) = \big[y^{(i)} - \hat{y}(k)\big]\,\hat{y}(k)\,\big[1 - \hat{y}(k)\big] \qquad (5.45c)$$
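A complete training loop for the log-sigmoid case of Eqns (5.45) is sketched below, again in Python/NumPy with hypothetical names of our own; the targets are assumed to lie in (0, 1) to match the log-sigmoid output range.

```python
import numpy as np

def train_log_sigmoid_unit(X, y, eta=0.5, epochs=200):
    """Stochastic delta-rule training of a log-sigmoid unit, Eqns (5.45)."""
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(N):
            y_hat = 1.0 / (1.0 + np.exp(-(w @ X[i] + w0)))  # Eqn (5.42a)
            delta = (y[i] - y_hat) * y_hat * (1.0 - y_hat)  # Eqn (5.45c)
            w = w + eta * delta * X[i]                      # Eqn (5.45a)
            w0 = w0 + eta * delta                           # Eqn (5.45b)
    return w, w0
```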

We construct multilayer networks using sigmoid units (the next section will describe commonly used
structures). Initially, we may be tempted to select the linear units discussed earlier. But multiple
layers of cascaded linear units still produce only linear functions, and we favor networks
capable of representing highly nonlinear functions. The (hard-limiting) perceptron unit
is another likely selection, but its discontinuous threshold makes it non-differentiable and, therefore,
unsuited to gradient descent learning.
