Human intelligence possesses robust attributes with complex sensory, control, affective (emotional processes), and cognitive (thought processes) aspects of information processing and decision making. Over a hundred billion biological neurons in our central nervous system (CNS) play a key role in these functions. The CNS obtains information from the external environment via numerous natural sensory mechanisms—vision, hearing, touch, taste, and smell. With the help of cognitive computing, it assimilates the information and arrives at the right interpretation. The cognitive process then progresses through attributes such as learning, recollection, and reasoning, which result in appropriate actions via muscular control.
Recent progress in information technology has widened the capabilities and applications of computers. If we wish a machine (computer) to exhibit certain cognitive functions that humans are known to exhibit, such as learning, remembering, reasoning, and perceiving, we need to define 'information' in a general manner and develop new mathematical techniques and hardware capable of simulating and processing cognitive information. Mathematics, in its present form, was developed to comprehend physical processes, but cognition, as a process, does not necessarily follow these mathematical laws. So what exactly is cognitive mathematics? The question is rather difficult. However, scientists have converged on the understanding that if certain 'mathematical aspects' of our process of thinking are re-examined along with the 'hardware aspects' of neurons—the primary components of the brain—we may, to a certain level, be able to successfully emulate the process.
Biological neuronal processes are rather complex [76], and our experimentally derived understanding of the field is raw and inadequate. However, even with this limited understanding of the biological processes, it has been possible to emulate some human learning behaviors via the fields of mathematics and systems science. Neuronal information processing involves a range of complex mathematical processes and mapping functions, which operate in synergism as a parallel-cascade computing structure. The aim of system scientists is to
create an intelligent cognitive system on the basis of this limited understanding of the brain—a
system that can help human beings to perform all kinds of tasks requiring decision making. Various new computing theories have been developing in the field of neural networks, and it is hoped that they will be capable of providing a thinking machine. Since they are based on neural network architectures, they should be able to yield the low-level cognitive machine that scientists have been trying to build for so long.
The subject of cognitive machines is in an exciting state of research and we believe that we are
slowly progressing towards the development of truly cognitive machines.
Novel research investigations in ELM (Extreme Learning Machines) and related areas have produced a suite of machine learning techniques for (single and multi) hidden layer feedforward networks in which hidden neurons need not be tuned. ELM theories argue that random hidden neurons capture the essence of some brain learning mechanisms. ELM has great potential as a viable alternative technique for large-scale computing and AI (Artificial Intelligence).
The 'traditional' neural networks are based on what we might interpret as 'shallow' learning; in fact, this learning methodology bears very little resemblance to the brain, and one might argue that it would be fairer to regard such networks simply as a discipline under statistics.
A step towards realizing
strong AI has been taken through the recent research in ‘deep learning’. Considering the far-reaching
applications of AI, coupled with the awareness that deep learning is evolving as one of its most
powerful methods, today it is not possible for one to enter the machine learning community without
any knowledge of deep networks.
Deep learning algorithms stand in sharp contrast to shallow learning algorithms in terms of the number of parameterized transformations a signal encounters as it propagates from the input layer to the output layer. A parameterized transformation refers to a processing unit containing trainable parameters—weights and thresholds. A chain of transformations from input to output is a Credit Assignment Path (CAP), which describes potentially causal connections from input to output and may have varied lengths. In the case of a feedforward neural network, the depth of the CAPs, and therefore the depth of the network, is the number of hidden layers plus one (the output layer is also parameterized).
Today, deep learning, based on learning representations of data, is a significant member of the family of machine learning techniques. It is possible to represent an observation, for instance an image, in various ways: as a vector of intensity values per pixel, or more abstractly as a set of edges, regions of particular shape, and so on. Some representations make learning tasks simpler. Deep learning aims to replace hand-crafted features with efficient algorithms for supervised or unsupervised feature learning, and hierarchical feature extraction.
The field of deep learning has been characterized in several ways. These definitions have in
common:
(i) multiple layers of nonlinear processing units
(ii) the supervised/unsupervised learning of feature representations in each layer, with the layers
giving rise to a hierarchy from low-level to high-level characteristics.
What a layer of nonlinear processing units employed in a deep learning algorithm consists of depends on the problem that needs to be solved.
Deep learning is linked closely to a category of brain-development theories published by cognitive neuroscientists in the early 1990s. Some deep-learning representations are inspired by progress in neuroscience and are roughly based on interpretations of information processing and communication patterns in a nervous system, such as neural coding, which tries to describe the relationship between a range of stimuli and the related neuronal responses in the brain.
The term 'deep learning' gained traction in the mid-2000s, after a publication by Geoffrey Hinton and Ruslan Salakhutdinov. They showed how a many-layered feedforward network could be effectively pre-trained one layer at a time. Since its resurgence, deep learning has become part of the state-of-the-art systems in various disciplines, particularly computer vision and speech recognition. The real impact of deep learning in industry began in large-scale speech recognition around 2010. Recent useful references on the subject are [78–82].
Our focus in this chapter will be on traditional neural networks [83, 84]. These networks are
being used today for many real-world regression and classification problems. Also, a sound
understanding of these networks is a prerequisite to learn ELM and deep learning algorithms. The
two recent research developments are outside the scope of this book.
The terms ‘Neural Networks (NN)’ and ‘Artificial Neural Networks (ANN)’ are both commonly
used in the literature for the same field of study. We will use the term ‘Neural Networks’ in this
book.
Broadly speaking, AI (Artificial Intelligence) is any computer program that does something smart [2, 5]. Machine learning is a subfield of AI. That is, all machine learning counts as AI, but not all AI counts as machine learning. For example, rule-based expert systems, frame-based expert systems, knowledge graphs, and evolutionary algorithms could all be described as AI, but none of them is machine learning. Deep learning may be considered a subfield of machine learning. Deep neural networks are a set of algorithms setting new records in accuracy for many important problems. 'Deep' is a technical term; it refers to the number of layers in a neural network. Multiple hidden layers allow deep neural networks to learn features of the data in a hierarchy.
Deep learning may share elements of traditional machine learning, but some researchers feel that it will emerge as a class by itself, as a subfield of AI.
[Figure: a biological neuron, showing its nucleus, dendrites, and synaptic terminals]
Neurons are connected to each other via their axons and dendrites. Signals are sent through the
axon of one neuron to the dendrites of other neurons. Hence, dendrites may be represented as the
inputs to the neuron, and the axon as its output. Note that each neuron has many inputs through its
multiple dendrites, whereas it has only one output through its single axon. The axon of each neuron
forms connections with the dendrites of many other neurons, with each branch of the axon meeting
exactly one dendrite of another cell at what is called a synapse. Actually, the axon terminals do not
quite touch the dendrites of the other neurons, but are separated by a very small distance of between
50 and 200 angstroms. This separation is called the synaptic gap.
A conventional computer is typically a single processor acting on explicitly programmed
instructions. Programmers break tasks into tiny components, to be performed in sequence rapidly.
On the other hand, the brain is composed of ten billion or so neurons. Each nerve cell can interact
directly with up to 200,000 other neurons (though 1000 to 10,000 is typical). In place of explicit
rules that are used by a conventional computer, it is the pattern of connections between the neurons
in the human brain, that seems to embody the ‘knowledge’ required for carrying out various
information-processing tasks. In the human brain, there is no equivalent of a CPU that is in overall
control of the actions of all the neurons.
The brain is organized into different regions, each responsible for different functions. The largest
parts of the brain are the cerebral hemispheres, which occupy most of the interior of the skull.
They are layered structures; the most complex being the outer layer, known as the cerebral cortex,
where the nerve cells are extremely densely packed to allow greater interconnectivity. Interaction
with the environment is through the visual, auditory and motion control (muscles and glands) parts
of the cortex.
In essence, neurons are tiny electrophysiological information-processing units which
communicate with each other through electrical signals. The synaptic activity produces a voltage
pulse on the dendrite which is then conducted into the soma. Each dendrite may have many synapses
acting on it, allowing massive interconnectivity to be achieved. In the soma, the dendrite potentials
are added. Note that neurons are able to perform more complex functions than simple addition on
the inputs they receive, but considering a simple summation is a reasonable approximation.
When the soma potential rises above a critical threshold, the axon will fire an electrical signal. This sudden burst of electrical energy along the axon is called the action potential and has the form of an electrical impulse or spike that lasts about 1 msec. The magnitude of the action potential is constant and is not related to the strength of the electrical stimulus (soma potential). However, neurons typically respond to a stimulus by firing not just one but a barrage of successive action potentials. What varies is the frequency of axonal activity. Neurons can fire between 0 and 1500 times per second. Thus, information is encoded in the nerve signals as the instantaneous frequency of action potentials and the mean frequency of the signal.
A synapse pairs the axon with another cell's dendrite. It discharges chemicals known as neurotransmitters when its potential is raised sufficiently by the action potential. The triggering of the synapse may require the arrival of more than one spike. The neurotransmitters emitted by the synapse diffuse across the gap, chemically activating gates on the dendrites which, on opening, permit the flow of charged ions. This flow of ions changes the dendritic potential and generates a voltage pulse on the dendrite, which is then conducted into the neuron body. At the synaptic junction, the number of gates that open on the dendrite depends on the number of neurotransmitters emitted. Some synapses excite the dendrites they impact, while others act in an inhibitory way. This shifts the local potential of the dendrite in a positive or negative direction.
Synaptic junctions alter the effectiveness with which the signal is transmitted; some synapses
are good junctions and pass a large signal across, whilst others are very poor, and allow very little
through.
Essentially, each neuron receives signals from a large number of other neurons. These are the
inputs to the neuron which are ‘weighted’. That is, some signals are stronger than others. Some
signals excite (are positive), and others inhibit (are negative). The effects of all weighted inputs are
summed. If the sum is equal to or greater than the threshold for the neuron, the neuron fires (gives
output). This is an ‘all-or-nothing’ situation. Because the neuron either fires or does not fire, the rate
of firing, not the amplitude, conveys the magnitude of information.
The ease of transmission of signals is altered by activity in the nervous system. The neural pathway
between two neurons is susceptible to fatigue, oxygen deficiency, and agents like anesthetics. These
events create a resistance to the passage of impulses. Other events may increase the rate of firing.
This ability to adjust signals is a mechanism for learning.
After carrying a pulse, an axon fiber is in a condition of complete non-excitability for a specific
time period known as the refractory period. During this interval, the nerve conducts no signals,
irrespective of how intense the excitation is. Therefore, we could segregate the time scale into
successive intervals, each equal to the length of the refractory period. This will permit a discrete-time
description of the neurons’ performance in terms of their states at discrete-time instances.
An artificial neuron
(i) evaluates the input signals, determining the strength of each one;
(ii) calculates a total for the combined input signals and compares that total to some threshold
level; and
(iii) determines what the output should be.
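These three steps translate directly into a few lines of code. Below is a minimal sketch in Python; the weights, inputs, and threshold are illustrative values of our own, not taken from the text:

```python
import numpy as np

def artificial_neuron(x, w, threshold):
    """(i) weight the input signals, (ii) total the combined input and
    compare it to a threshold, (iii) decide the output: fire (1) or not (0)."""
    total = np.dot(w, x)                   # step (ii): combined input signal
    return 1 if total >= threshold else 0  # step (iii): output decision

# Illustrative values only: three inputs with relative weightings (step (i))
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, 0.9, 0.3])
print(artificial_neuron(x, w, threshold=0.5))  # 0.7 >= 0.5, so prints 1
```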
Weighting Factors
Each input will be given a relative weighting, which will affect the impact of that input (Fig. 5.3). This is analogous to the varying synaptic strengths of biological neurons—some inputs are more important than others in the way they combine to produce an impulse. Weights are adaptive coefficients within the network that determine the intensity of the input signal. In fact, this adaptability of connection strength is precisely what provides neural networks their ability to learn and store information and, consequently, is an essential element of all neuron models.
[Figure: inputs x1, …, xn with connection weights w1, …, wn feeding a summing unit that forms the total input Σj wj xj]
Figure 5.3 A neuron with weighted inputs
Excitatory and inhibitory inputs are represented simply by positive or negative connection
weights, respectively. Positive inputs promote the firing of the neuron, while negative inputs tend
to keep the neuron from firing.
Mathematically, we could look at the inputs and the weights on the inputs as vectors.
The input vector is

\mathbf{x} = [x_1 \;\; x_2 \;\; \cdots \;\; x_n]^T \qquad (5.1a)

and the weight vector is

\mathbf{w} = [w_1 \;\; w_2 \;\; \cdots \;\; w_n]^T \qquad (5.1b)

The total input is then

\sum_{j=1}^{n} w_j x_j = \mathbf{w}^T \mathbf{x} \qquad (5.1c)
j =1
Activation Functions
Although most neuron models sum their input signals in basically the same manner, as described
above, they are not all identical in terms of how they produce an output response from this input.
Artificial neurons use an activation function, often called a transfer function, to compute their
activation as a function of total input stimulus. Several different functions may be used as activation
functions, and, in fact, the most distinguishing feature between existing neuron models is precisely
which function they employ.
We will, shortly, take a closer look at the activation functions. We first build a neuron model,
assuming that the transfer function has a threshold behavior, which is, in fact, the type of response
exhibited by biological neurons: when the total stimulus exceeds a certain threshold value θ, a
constant output is produced, while no output is generated for input levels below the threshold.
Figure 5.4a shows this neuron model. In this diagram, the neuron has been represented in such a
way that the correspondence of each element with its biological counterpart may be easily seen.
Equivalently, the threshold value can be subtracted from the weighted sum and the resulting
value compared to zero; if the result is positive, then output a 1, else output a 0. This is shown
in Fig. 5.4b; note that the shape of the function is the same but now the jump occurs at zero. The
threshold effectively adds an offset to the weighted sum.
An alternative way of achieving the same effect is to take the threshold out of the body of the
model neuron, and connect it to an extra input value that is fixed to be ‘on’ all the time. In this
case, rather than subtracting the threshold value from the weighted sum, the extra input of +1 is
multiplied by a weight and added in a manner similar to other inputs—this is known as biasing the
neuron. Figure 5.4c shows a neuron model with a bias term. Note that we have taken constant input
‘1’ with an adaptive weight ‘w0’ in our model.
The first formal definition of a synthetic neuron model, based on the highly simplified
considerations of the biological neuron, was formulated by McCulloch and Pitts (1943). The
two-port model (inputs—activation value—output mapping) of Fig. 5.4 is essentially the MP
neuron model. It is important to look at the features of this unit—which is an important and popular
neural network building block.
[Figure 5.4 Neuron models: (a) inputs x1, …, xn with connection weights w1, …, wn feeding a summing unit; the total input Σj wj xj passes through a threshold (θ) activation to give the output ŷ, with the biological counterparts (synapses, cell body, axons, dendrites from other neurons) indicated; (b) the threshold subtracted from the weighted sum, so the jump occurs at zero; (c) the bias form, with activation value a = Σj wj xj + w0 and constant input 1 carrying the adaptive weight w0]
It is a basic unit, thresholding a weighted sum of its inputs to get an output. It does not particularly
consider the complex patterns and timings of the real nervous activity in real neural systems. It
does not have any of the complex characteristics existing in the body of biological neurons. It is,
therefore, a model, and not a copy of a real neuron.
The MP artificial neuron model involves two important processes:
(i) Forming net activation by combining inputs. The input values are amalgamated by a weighted
additive process to achieve the neuron activation value a (refer to Fig. 5.4c).
(ii) Mapping this activation value a into the neuron output ŷ . This mapping from activation to
output may be characterized by an ‘activation’ or ‘squashing’ function.
For the activation functions that implement input-to-output compression or squashing, the range
of the function is less than that of the domain. There is some physical basis for this desirable
characteristic. Recall that in a biological neuron, there is a limited range of output (spiking
frequencies). In the MP model, where DC levels replace frequencies, the squashing function serves
to limit the output range. The squashing function shown in Fig. 5.5a limits the output values to {0,
1}, while that in Fig. 5.5b limits the output values to {–1, 1}. The activation function of Fig. 5.5a
is called unipolar, while that in Fig. 5.5b is called bipolar (both positive and negative responses of
neurons are produced).
[Figure 5.5 Squashing functions: (a) unipolar, with output values limited to {0, 1}; (b) bipolar, with output values limited to {–1, 1}]
\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + w_0) = \sigma\!\left(\sum_{j=1}^{n} w_j x_j + w_0\right) \qquad (5.2)
[Figure 5.6 Mathematical model of a neuron (perceptron): (a) with an explicit bias input 1 weighted by w0; (b) with the bias absorbed into the input vector as x0 = 1, weighted by w0]
The bias term may be absorbed in the input vector itself as shown in Fig. 5.6b.
\hat{y} = \sigma(a) = \sigma\!\left(\sum_{j=0}^{n} w_j x_j\right); \quad x_0 = 1 \qquad (5.3a)

\;\;\;\: = \sigma\!\left(\sum_{j=1}^{n} w_j x_j + w_0\right) = \sigma(\mathbf{w}^T\mathbf{x} + w_0) \qquad (5.3b)
In the literature, this model of an artificial neuron is also referred to as a perceptron (the name
was given by Rosenblatt in 1958).
The expressions for the neuron output ŷ are referred to as the cell recall mechanism. They
describe how the output is reconstructed from the input signals and the values of the cell parameters.
The artificial neural systems under investigation and experimentation today, employ a variety of
activation functions that have more diversified features than the one presented in Fig. 5.5. Below,
we introduce the main activation functions that will be used later in this chapter.
The MP neuron model shown in Fig. 5.4 used the hard-limiting activation function. When
artificial neurons are cascaded together in layers (discussed in the next section), it is more common
to use a soft-limiting activation function. Figure 5.7a shows a possible bipolar soft-limiting
semilinear activation function. This function is, more or less, the ON-OFF type, as before, but has a
sloping region in the middle. With this smooth thresholding function, the value of the output will be
practically 1 if the weighted sum exceeds the threshold by a huge margin and, conversely, it will be
practically –1 if the weighted sum is much less than the threshold value. However, if the threshold
and the weighted sum are almost the same, the output from the neuron will have a value somewhere
between the two extremes. This means that the output from the neuron can be related to its inputs in
a more useful and informative way. Figure 5.7b shows a unipolar soft-limiting function.
[Figure 5.7 Soft-limiting activation functions: (a) bipolar, saturating at –1 and 1; (b) unipolar, saturating at 0 and 1]
For many training algorithms (discussed in later sections), the derivative of the activation
function is needed; therefore, the activation function selected must be differentiable. The logistic
or sigmoid function, which satisfies this requirement, is the most commonly used soft-limiting
activation function. The sigmoid function (Fig. 5.8a):
\sigma(a) = \frac{1}{1 + e^{-\lambda a}} \qquad (5.4)
is continuous and varies monotonically from 0 to 1 as a varies from –∞ to ∞. The gain of the sigmoid, λ, determines the steepness of the transition region. Note that as the gain approaches infinity, the sigmoid approaches a hard-limiting nonlinearity. One of the advantages of the sigmoid
is that it is differentiable. This property had a significant impact historically, because it made it
possible to derive a gradient search learning algorithm for networks with multiple layers (discussed
in later sections).
[Figure 5.8 (a) Sigmoid functions for gains λ = 1 and λ = 0.2; (b) hyperbolic tangent function]
The sigmoid function is unipolar. A bipolar function with similar characteristics is a hyperbolic
tangent (Fig. 5.8b):
\sigma(a) = \frac{1 - e^{-\lambda a}}{1 + e^{-\lambda a}} = \tanh\!\left(\tfrac{1}{2}\lambda a\right) \qquad (5.5)
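As an illustration, both functions and the effect of the gain are easily checked numerically. The following Python sketch (the function names are our own) also verifies the relation between the λ = 1 forms of the two functions, recorded later as Eqn (5.9):

```python
import numpy as np

def sigmoid(a, lam=1.0):
    """Log-sigmoid of Eqn (5.4); the gain lam sets the steepness."""
    return 1.0 / (1.0 + np.exp(-lam * a))

def tansig(a, lam=1.0):
    """Bipolar function of Eqn (5.5); equals tanh(lam * a / 2)."""
    return (1.0 - np.exp(-lam * a)) / (1.0 + np.exp(-lam * a))

a = np.linspace(-5.0, 5.0, 11)
print(sigmoid(a, lam=100.0))      # high gain: approaches the hard limiter
assert np.allclose(tansig(a), np.tanh(0.5 * a))
# Relation of Eqn (5.9), for the lam = 1 forms of Eqns (5.7):
assert np.allclose(sigmoid(a), (tansig(a) + 1.0) / 2.0)
```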
The biological basis of these activation functions can easily be established. It is known that neurons located in different parts of the nervous system have different characteristics. The neurons of the ocular motor system have a sigmoid characteristic, while those located in the visual area have different characteristics.

In this chapter, we will mostly be concerned with networks with the first (hidden) layer having log-sigmoid

\sigma(a) = \frac{1}{1 + e^{-a}} \qquad (5.7a)

or tan-sigmoid

\sigma(a) = \frac{1 - e^{-a}}{1 + e^{-a}} \qquad (5.7b)

activation function, and the second (output) layer having linear activation function

\sigma(a) = a \qquad (5.8)
The log-sigmoid function has historically been a very popular choice, but since it is related to the tan-sigmoid by the simple transformation

\sigma_{\text{log-sigmoid}} = \frac{\sigma_{\text{tan-sigmoid}} + 1}{2} \qquad (5.9)
both of these functions are in use in neural network models.
We have so far described two classical neuron models:
• perceptron—a neuron with sigmoidal activation function (sigmoidal function is a softer
version of the original perceptron’s hard limiting or threshold activation function); and
• linear neuron—a neuron with linear activation function.
In the biological brain, a huge number of neurons are interconnected to form the network and
perform advanced intelligent activities. The artificial neural network is built by neuron models.
Many different types of artificial neural networks have been proposed, just as there are many
theories on how biological neural processing works. We may classify the organization of the neural
networks largely into two types: a feedforward net and a recurrent net. The feedforward net has a
hierarchical structure that consists of several layers, without interconnection between neurons in
each layer, and signals flow from input to output layer in one direction. In the recurrent net, multiple
neurons in a layer are interconnected to organize the network. In the following, we give typical
characteristics of the feedforward net and the recurrent net, respectively.
A Layer of Neurons
A one-layer network with n inputs and M neurons is shown in Fig. 5.9. In the network, each input xj;
j = 1, 2, …, n is connected to the qth neuron input through the weight wqj; q = 1, 2, …, M. The qth
neuron has a summer that gathers its weighted inputs to form its own scalar output
\sum_{j=1}^{n} w_{qj}\, x_j + w_{q0}; \quad q = 1, 2, \ldots, M

Finally, the qth neuron outputs ŷq through its activation function σ(·):

\hat{y}_q = \sigma\!\left(\sum_{j=1}^{n} w_{qj}\, x_j + w_{q0}\right) \qquad (5.10a)

\;\;\;\;\: = \sigma(\mathbf{w}_q^T\mathbf{x} + w_{q0}); \quad q = 1, 2, \ldots, M \qquad (5.10b)
where the weight vector wq is defined as

\mathbf{w}_q^T = [w_{q1} \;\; w_{q2} \;\; \cdots \;\; w_{qn}] \qquad (5.10c)
[Figure 5.9 A layer of M neurons: each input xj is connected to every neuron; the qth neuron forms the total input Σj wqj xj + wq0 and passes it through σ(·) to produce ŷq]
Note that it is common for the number of inputs to be different from the number of neurons (i.e., n ≠ M). A layer is not constrained to have the number of its inputs equal to the number of its neurons.
The layer shown in Fig. 5.9 has the M × 1 output vector

\hat{\mathbf{y}} = [\hat{y}_1 \;\; \hat{y}_2 \;\; \cdots \;\; \hat{y}_M]^T, \qquad (5.11a)

the n × 1 input vector

\mathbf{x} = [x_1 \;\; x_2 \;\; \cdots \;\; x_n]^T, \qquad (5.11b)

the M × n weight matrix

\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & & & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{Mn} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_M^T \end{bmatrix}, \qquad (5.11c)

and the M × 1 bias vector

\mathbf{w}_0 = [w_{10} \;\; w_{20} \;\; \cdots \;\; w_{M0}]^T \qquad (5.11d)
Note that the row indices on the elements of matrix W indicate the destination neuron for the weight, and the column indices indicate which source is the input for that weight. Thus, the index on wqj says that the signal from the jth input is connected to the qth neuron.
The activation vector is

\mathbf{W}\mathbf{x} + \mathbf{w}_0 = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x} + w_{10} \\ \mathbf{w}_2^T\mathbf{x} + w_{20} \\ \vdots \\ \mathbf{w}_M^T\mathbf{x} + w_{M0} \end{bmatrix} \qquad (5.11e)

and the layer outputs are

\hat{y}_1 = \sigma(\mathbf{w}_1^T\mathbf{x} + w_{10}), \;\; \ldots, \;\; \hat{y}_M = \sigma(\mathbf{w}_M^T\mathbf{x} + w_{M0})
The input-output mapping is of the feedforward and instantaneous type since it involves no time
delay between the input x and the output ŷ.
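In code, this recall is a single matrix-vector product followed by an elementwise squashing. A minimal NumPy sketch; the sizes and values are illustrative, and the tanh activation is just one admissible choice:

```python
import numpy as np

def layer_forward(W, w0, x, sigma=np.tanh):
    """Layer recall of Eqns (5.10)-(5.11): form the activation vector
    Wx + w0, then apply the activation function sigma elementwise."""
    return sigma(W @ x + w0)

# Illustrative sizes: n = 3 inputs, M = 2 neurons; values are arbitrary
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))     # M x n weight matrix; row q holds w_q^T
w0 = rng.normal(size=2)         # M x 1 bias vector
x = np.array([0.5, -1.0, 2.0])
print(layer_forward(W, w0, x))  # M x 1 output vector y-hat
```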
Consider a neural network with a single output node. For a dataset with n attributes, the output node receives x1, x2, …, xn, takes a weighted sum of these, and applies the σ(·) function. The output of the neural network is therefore \sigma\!\left(\sum_{j=1}^{n} w_j x_j + w_0\right).
First consider a numerical output y (i.e., y ∈ ℝ). If σ(·) is a linear activation function (Eqn (5.8)), the output is simply

\hat{y} = \sum_{j=1}^{n} w_j x_j + w_0
This is exactly equivalent to the formulation of linear regression given earlier in Section 3.6
(refer to Eqn (3.70)).
Now consider a binary output variable y. If σ(·) is the log-sigmoid function (Eqn (5.7a)), the output is simply

\hat{y} = \frac{1}{1 + \exp\!\left(-\left(\sum_{j=1}^{n} w_j x_j + w_0\right)\right)} = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0)}}
which is equivalent to logistic regression formulation given in Section 3.7 (refer to Eqn (3.84)).
Note that here ŷ takes continuous values in the interval (0, 1) and represents the probability of belonging to Class 1, i.e., ŷ = P(Class 1|x), and P(Class 2|x) = 1 – ŷ.
In both cases, although the neural network models are equivalent to the linear and logistic
regression models, the resulting estimates for the weights in neural network models will be different
from those in linear and logistic regression. This is because the estimation methods are different. As
we will shortly see, the neural network estimation method is different from maximum likelihood
method used in logistic regression, and may be different from least-squares method used in linear
regression.
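To make the correspondence concrete, here is a minimal sketch of a single log-sigmoid neuron computing the class probabilities above; the parameter values are arbitrary illustrations, not estimates from any dataset:

```python
import numpy as np

def logistic_neuron(w, w0, x):
    """Single log-sigmoid neuron; same functional form as the logistic
    regression model: returns P(Class 1 | x)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

w, w0 = np.array([2.0, -1.0]), 0.5   # arbitrary illustrative parameters
x = np.array([1.0, 3.0])
p1 = logistic_neuron(w, w0, x)
print(p1, 1.0 - p1)                  # P(Class 1 | x) and P(Class 2 | x)
```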
We will use multiple output nodes ŷq ; q = 1, …, M, for multiclass discrimination problems
(detailed in Section 5.8). For regression (function approximation) problems, multiple output nodes
correspond to multiple response variables we are interested in for numeric prediction. In this case,
a number of regression problems are learned at the same time. An alternative is to train separate
networks for separate regression problems (with one output node). In this chapter, we will focus on
this alternative approach. Our focus is justified on the ground that in many real-life applications, we
are interested in only one response variable, i.e., scalar output variable.
Multi-Layer Perceptrons
Neural networks normally have at least two layers of neurons, with the first layer neurons having
nonlinear and differentiable activation functions. Such networks, as we will see, can approximate
any nonlinear function. In real life, we are faced with nonlinear problems, and multilayer neural
network structures have the capability of providing solutions to these problems.
Figure 5.10 shows a two-layer NN, with n inputs and two layers of neurons. The first of these
layers has m neurons feeding into the second layer possessing M neurons. The first layer or the hidden
layer, has m hidden-layer neurons; the second or the output layer, has M output-layer neurons. It is
not uncommon for different layers to have different numbers of neurons. The outputs of the hidden
layer are inputs to the following layer (output layer); and the network is fully connected. Neural
networks possessing several layers are known as Multi-Layer Perceptrons (MLP); their computing
power is meaningfully improved over the one-layer NN.
All continuous functions, which display certain smoothness, can be approximated to any desired
accuracy with a network of one hidden layer of sigmoidal hidden units, and a layer of linear output
units [83]. Does this mean that it is not required to employ more than one hidden layer and/or mix
different kinds of activation functions? In fact, the accuracy may be enhanced with the help of
network architectures with more hidden layers/mixing activation functions. Especially when the
mapping to be learned is highly complicated, there is a likelihood of performance improvement.
However, as the implementation and training of the network become increasingly complex with
sophisticated network architectures, it is normal to apply only a single hidden layer of similar
activation functions, and an output layer of linear units. We will focus on two-layer feedforward
neural networks with sigmoidal/hyperbolic tangent hidden units and linear output units for function
approximation problems. For classification problems, the linear output units will be replaced with
sigmoidal units. These are widely used network architectures, and work very well in many practical
applications.
[Figure 5.10 A two-layer neural network: inputs x1, …, xn feed a hidden layer of m neurons with outputs z1, …, zm, which in turn feed an output layer of M neurons producing ŷ1, …, ŷM; each summing unit has a bias input 1]
Defining the input terminals as xj; j = 1, …, n, and the hidden-layer outputs as zl, allows one to write

z_l = \sigma\!\left(\sum_{j=1}^{n} w_{lj}\, x_j + w_{l0}\right) = \sigma(\mathbf{w}_l^T\mathbf{x} + w_{l0}); \quad l = 1, 2, \ldots, m \qquad (5.12a)

where

\mathbf{w}_l^T \triangleq [w_{l1} \;\; w_{l2} \;\; \cdots \;\; w_{ln}]

are the weights connecting the input terminals to the hidden layer.
Defining the output-layer nodes as ŷq, one may write the NN output as

\hat{y}_q = \sum_{l=1}^{m} v_{ql}\, z_l + v_{q0} = \mathbf{v}_q^T\mathbf{z} + v_{q0}; \quad q = 1, \ldots, M \qquad (5.12b)

for linear output units, where

\mathbf{v}_q^T \triangleq [v_{q1} \;\; v_{q2} \;\; \cdots \;\; v_{qm}]

or, for sigmoidal output units,

\hat{y}_q = \sigma\!\left(\sum_{l=1}^{m} v_{ql}\, z_l + v_{q0}\right); \quad q = 1, \ldots, M \qquad (5.12c)
The inputs to the output-layer units (refer to Eqns (5.12b)-(5.12c)) are the nonlinear basis
function values zl; l = 1, …, m, computed by the hidden units. It can be said that the hidden units
make a nonlinear transformation from the n-dimensional input space to the m-dimensional space
spanned by the hidden units and in this space, the output layer implements a linear/logistic function.
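The two-layer recall of Eqns (5.12) is simply two layer evaluations in sequence. A minimal sketch; the dimensions and the choice of tanh hidden units are illustrative assumptions:

```python
import numpy as np

def mlp_forward(W, w0, V, v0, x, hidden=np.tanh, output=lambda a: a):
    """Two-layer recall of Eqns (5.12): z = sigma(Wx + w0) in the hidden
    layer, then y-hat = output(Vz + v0). A linear `output` suits regression
    (Eqn 5.12b); pass a sigmoid for classification (Eqn 5.12c)."""
    z = hidden(W @ x + w0)     # m nonlinear basis-function values
    return output(V @ z + v0)  # M network outputs

# Illustrative sizes: n = 4 inputs, m = 5 hidden units, M = 2 outputs
rng = np.random.default_rng(1)
W, w0 = rng.normal(size=(5, 4)), rng.normal(size=5)
V, v0 = rng.normal(size=(2, 5)), rng.normal(size=2)
print(mlp_forward(W, w0, V, v0, rng.normal(size=4)))
```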
regarding the system history—delayed inputs and outputs (refer to Section 1.4.1). The amount of history required depends on the level of accuracy sought and the resulting computational complexity. A large number of inputs increases the number of weights in the network, which may result in higher accuracy, but may also significantly increase the training time. Trial-and-error on the number of inputs, as well as on the network structures, is the search process, as in other machine learning systems (later sections will give more details).
From several practical applications published over the past decade, there seems to be considerable
evidence that multilayer feedforward networks have an extraordinary capability to do quite well in
most cases.
We will focus on two-layer feedforward neural networks with sigmoidal or hyperbolic tangent
hidden units and linear/sigmoidal output units. This, in all likelihood, is the most popular network
architecture as it works well in many practical applications.
The rest of this chapter is organized as follows: We first consider principles of design for the
primitive units that make up artificial neural networks (perceptrons, linear units, and sigmoid units),
along with learning algorithms for training single units. We then present the BACKPROPAGATION
algorithm for training multilayer networks of such units, and several general issues related to the
algorithm. We conclude the chapter with our discussion on RBF networks.
5.4 PERCEPTRONS
Classical NN systems are based on units called the PERCEPTRON and the ADALINE (ADAptive LINear Element). The perceptron was developed in 1958 by Frank Rosenblatt, a researcher in neurophysiology, to perform pattern recognition tasks. In mathematical terms, it resulted from the solution of the classification problem. The ADALINE was developed by Bernard Widrow and Marcian Hoff; it originated from the field of signal processing, or more specifically, from the adaptive noise cancellation problem. It resulted from the solution of the regression problem, the regressor having the properties of a noise canceller (linear filter).
The perceptron takes a vector of real-valued inputs, calculates a linear combination of these
inputs; then outputs +1 if the result is greater than the threshold and –1 otherwise (refer to Fig. 4.2).
The ADALINE in its early stage consisted of a neuron with a linear activation function (Eqn
5.8), a hard limiter (a thresholding device with a signum activation function) and the Least Mean
Square (LMS) learning algorithm. We focus on the two most important parts of ADALINE—its
linear activation function and the LMS learning rule. The hard limiter is omitted, not because it is
irrelevant, but for being of lesser importance to the problems to be solved. The words ADALINE
and linear neuron are both used here for a neural processing unit with a linear activation function
and a corresponding learning rule (not necessarily LMS).
The roots of both the perceptron and the ADALINE were in the linear domain. However, in real life, we are faced with nonlinear problems, and the perceptron was superseded by more sophisticated and powerful neuron and neural network structures (multilayer neural networks). What type of unit should be used to construct multilayer networks? At first, we may be inclined to select linear units. But multiple layers of cascaded linear units still produce only linear functions, and we prefer networks capable of representing highly nonlinear functions. Another likely selection could be the perceptron unit. However, because of its discontinuous threshold, it is not differentiable, and therefore not suited to the gradient descent approach for optimizing the performance criterion.
What is required is a unit whose output is a nonlinear function of its inputs, and which is also a differentiable function of its inputs. One solution is the sigmoid unit, a unit similar to the perceptron, but based on a smoothed, differentiable threshold function (Fig. 5.8; Eqns (5.7)). These activation functions are simply softer versions of the original perceptron's hard-limiting threshold function. In the literature, these softer versions are also referred to as perceptrons, and multilayer neural networks are also referred to as Multi-Layer Perceptron (MLP) networks.
In the following, we discuss principles of perceptron learning for classification tasks. The next section gives the principles of linear-neuron learning. Thereafter, principles of 'soft' perceptron (sigmoid unit) learning will be presented.
The regression techniques discussed in Section 3.6, and also later in this chapter, can be used for
linear classification with a careful choice of the target values associated with classes. Let the set of
training (data) examples D be
D = \{x_j^{(i)}, y^{(i)}\} = \{\mathbf{x}^{(i)}, y^{(i)}\}; \quad i = 1, \ldots, N; \; j = 1, \ldots, n \qquad (5.13)
where x^{(i)} = [x_1^{(i)} \; x_2^{(i)} \; \cdots \; x_n^{(i)}]^T is an n-dimensional input vector (pattern with n features) for the ith example in a real-valued space; y^{(i)} is its class label (output value), and y^{(i)} ∈ {+1, –1}, where +1 denotes Class 1 and –1 denotes Class 2. To build a linear classifier, we need a linear function of the form
g(x) = wTx + w0 (5.14)
so that the input vector x^{(i)} is assigned to Class 1 if g(x^{(i)}) > 0, and to Class 2 if g(x^{(i)}) < 0, i.e.,

\hat{y}^{(i)} = \begin{cases} +1 & \text{if } \mathbf{w}^T\mathbf{x}^{(i)} + w_0 > 0 \\ -1 & \text{if } \mathbf{w}^T\mathbf{x}^{(i)} + w_0 < 0 \end{cases} \qquad (5.15)
w = [w1 w2 … wn]^T is the weight vector and w0 is the bias. In terms of regression, we can view this classification problem as follows: given a vector x^{(i)}, the output of the summing unit (linear combiner) will be w^T x^{(i)} + w0 (decision hyperplane), and thresholding this output through a sgn function gives us the perceptron output ŷ^{(i)} = ±1 (refer to Fig. 5.11).
[Figure 5.11 Linear classification using regression technique: a summing unit (linear combiner) forms w^T x + w0 from inputs x1, …, xn with weights w1, …, wn and bias w0, followed by a hard limiter giving ŷ = ±1]
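A minimal sketch of this recall mechanism, assuming illustrative weights of our own:

```python
import numpy as np

def perceptron_classify(w, w0, x):
    """Decision rule of Eqn (5.15): threshold the linear combiner
    output g(x) = w^T x + w0 through sgn to get y-hat = +1 or -1."""
    g = np.dot(w, x) + w0
    return 1 if g > 0 else -1

w, w0 = np.array([1.0, -2.0]), 0.25                       # arbitrary weights
print(perceptron_classify(w, w0, np.array([3.0, 1.0])))   # g = 1.25 -> +1
```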
The sum of error squares for the classifier becomes (Eqn (3.71))

E = \sum_{i=1}^{N} (e^{(i)})^2 = \sum_{i=1}^{N} \left(y^{(i)} - \hat{y}^{(i)}\right)^2 \qquad (5.16)
We require E to be a function of (w, w0) to design the linear function g(x) = w^T x + w0 that minimizes E. To obtain E(w, w0), we replace the perceptron outputs ŷ^{(i)} by the linear combiner outputs w^T x^{(i)} + w0; this gives us the error function

E(\mathbf{w}, w_0) = \sum_{i=1}^{N} \left(y^{(i)} - (\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\right)^2 \qquad (5.17)
Consider pattern i. If x^{(i)} ∈ Class 1, the desired output is y^{(i)} = +1 with summing unit output w^T x^{(i)} + w0 > 0 (ŷ^{(i)} = +1, refer to Fig. 5.11); the contribution of a correctly classified pattern i to E(w, w0) is small when compared with that of a wrongly classified pattern (w^T x^{(i)} + w0 < 0; ŷ^{(i)} = –1).
The error function E(w, w0) in Eqn (5.17), is a continuous, differentiable function; therefore,
gradient descent approach (discussed later in this sub-section) for minimization of E(w, w0) will be
applicable. The training algorithm based on this E(w, w0) can be seen as the training algorithm of
a linear neuron without the nonlinear (signum) activation function. Nonlinearity is ignored during
training; after training and once the weights have been fixed, the model is the perceptron model
with the hard limiter following the linear combiner.
• If the unthresholded output w^T x^{(i)} + w0 can be trained to fit the desired values y^{(i)} = ±1 perfectly, then the thresholded output will fit them as well (because sgn(1) = 1 and sgn(–1) = –1). Even when the target values cannot be fit perfectly in the unthresholded case, the thresholded value will correctly fit the ±1 target value whenever the unthresholded output has the correct sign. Note, however, that while the gradient descent procedure will learn weights that minimize the error in the unthresholded output, these weights will not necessarily minimize the number of training examples misclassified by the thresholded output.
• The perceptron training rule (Section 4.3) converges after a finite number of iterations to
a weight vector that perfectly classifies the training data provided the training examples
are linearly separable. The gradient descent rule converges only asymptotically toward the
minimum-error weight vector, possibly requiring unbounded time, but converges regardless
of whether the training data is linearly separable or not.
Let us define a single weight vector w̄ = [w0 w1 … wn]^T for the weights (w, w0), with the input vectors correspondingly augmented as x^{(i)} ← [1 x_1^{(i)} … x_n^{(i)}]^T. The unthresholded output w̄^T x^{(i)} is to be trained to fit the desired values y^{(i)}, minimizing the error (Eqn (5.17)):
E(\bar{\mathbf{w}}) = \tfrac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \bar{\mathbf{w}}^T\mathbf{x}^{(i)}\right)^{2} \qquad (5.21)
This error function is a continuous, differentiable function; therefore, the gradient descent approach for minimization of E(w̄) will be applicable (the constant ½ is used for computational convenience only; it gets cancelled out by the differentiation required in the error minimization process).
To understand the gradient descent algorithm, it is helpful to visualize the error space of possible
weight vectors and the associated values of the performance criterion (cost function E). For the
unthresholded unit (a linear weighted combination of inputs), the error surface is parabolic with
a single global minimum. The specific parabola will depend, of course, on the particular set of
training examples.
How can we calculate the direction of steepest descent along the error surface? This direction can be found by computing the derivative of E with respect to each component of the vector w̄. This vector derivative is called the gradient of E with respect to w̄, written ∇E(w̄).
\nabla E(\bar{\mathbf{w}}) = \left[\frac{\partial E}{\partial w_0} \;\; \frac{\partial E}{\partial w_1} \;\; \cdots \;\; \frac{\partial E}{\partial w_n}\right]^T \qquad (5.22)
When interpreted as a vector in weight space, the gradient specifies the direction that produces
the steepest increase in E. The negative of this vector, therefore, gives the direction of steepest
decrease. Therefore, the training rule for gradient descent is,
\bar{\mathbf{w}} \leftarrow \bar{\mathbf{w}} + \Delta\bar{\mathbf{w}} \qquad (5.23a)

where

\Delta\bar{\mathbf{w}} = -\eta\,\nabla E(\bar{\mathbf{w}}) \qquad (5.23b)
Here η is a positive constant (less than one), called the learning rate, which determines the step size in the gradient descent search. This training rule can also be written in its component form:
w_j \leftarrow w_j + \Delta w_j; \quad j = 0, 1, 2, \ldots, n \qquad (5.24a)

where

\Delta w_j = -\eta\,\frac{\partial E}{\partial w_j} \qquad (5.24b)

which shows that steepest descent is achieved by altering each component wj of w̄ in proportion to ∂E/∂wj.
Gradient descent search helps determine a weight vector that minimizes E by starting with an
arbitrary initial weight vector and then altering it again and again in small steps. At each step, the
weight vector is changed in the direction producing the steepest descent along the error surface.
The process goes on till the global minimum error is attained.
To build a practical algorithm for repeated updating of the weights according to (5.24), we require an effective technique to calculate the gradient at each step. Luckily, this is quite easy. The gradient with respect to weight wj; j = 1, …, n, can be obtained by differentiating E from Eqn (5.21):
\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j}\left[\frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right)\right)^{2}\right]

The error e^{(i)} for the ith sample of data is given by e^{(i)} = y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right). It follows that

\frac{1}{2}\sum_{i=1}^{N}\left[\frac{\partial}{\partial w_j}(e^{(i)})^2\right] = \sum_{i=1}^{N} e^{(i)}\,\frac{\partial e^{(i)}}{\partial w_j}, \qquad \frac{\partial e^{(i)}}{\partial w_j} = -x_j^{(i)}

so that

\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j}\left[\frac{1}{2}\sum_{i=1}^{N}(e^{(i)})^2\right] = -\sum_{i=1}^{N} e^{(i)}\, x_j^{(i)} = -\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right)\right) x_j^{(i)}

The gradient with respect to the bias is

\frac{\partial E}{\partial w_0} = -\sum_{i=1}^{N} e^{(i)} = -\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right)\right)

The gradient descent updates (5.24) therefore become

w_j \leftarrow w_j + \eta\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right)\right) x_j^{(i)} \qquad (5.25a)

w_0 \leftarrow w_0 + \eta\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j x_j^{(i)} + w_0\right)\right) \qquad (5.25b)
An epoch is a complete run through all the N associated pairs. Once an epoch is completed, the pair (x^{(1)}, y^{(1)}) is presented again and a run is performed through all the pairs again. After several epochs, the output error is expected to be sufficiently small.
The iteration index k corresponds to the number of times the set of N pairs has been presented and the cumulative error compounded. That is, k corresponds to the epoch number.
In terms of the epoch index k, the weight updates are

w_j(k+1) = w_j(k) + \eta\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)\right)\right) x_j^{(i)} \qquad (5.26a)

w_0(k+1) = w_0(k) + \eta\sum_{i=1}^{N}\left(y^{(i)} - \left(\sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)\right)\right) \qquad (5.26b)
The perceptron (in its 'softer' version, the sigmoid unit) has been a fundamental building block in present-day neural models. Another important building block has been the ADALINE (ADAptive LINear Element), developed by Bernard Widrow and Marcian Hoff in 1959. It originated from the field of signal processing, or more specifically, from the adaptive noise cancellation problem. It resulted from the solution of the regression problem, the regressor having the properties of a noise canceller (linear filter). All its power in the linear domain is still in full service, and despite being a simple neuron, it is present (without a thresholding device) in almost all neural models for regression functions. The words ADALINE and linear neuron are both used here for a neural processing unit with the linear activation function shown in Fig. 5.12a. The neuron labeled with a summation sign only (Fig. 5.12b) is equivalent to the linear neuron of Fig. 5.12a.
[Figure 5.12 Neural processing unit with a linear activation function: (a) a summing unit (linear combiner) forming a = Σj wj xj + w0 = w^T x + w0, followed by the linear activation function σ(a) = a (slope = 1); (b) the equivalent neuron labeled with a summation sign only, ŷ = Σj wj xj + w0 = w^T x + w0]
In the last section, we discussed the gradient descent optimization scheme to determine the optimum setting of the weights (w, w0) that minimizes the criterion function given by Eqn (5.21). Note that this 'sum of error squares' criterion function is deterministic, and the gradient descent scheme gives a deterministic algorithm for minimization of this function. Now we explore a digression from this criterion function. Consider the problem of computing weights (w, w0) so as to minimize the Mean Square Error (MSE) between desired and true outputs, defined as follows.
E(\mathbf{w}, w_0) = \mathcal{E}\left[\tfrac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - (\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\right)^{2}\right] \qquad (5.27)

where \mathcal{E} is the statistical expectation operator.
The solution to this problem requires the computation of the autocorrelation matrix \mathcal{E}[\mathbf{x}\mathbf{x}^T] of the set of feature vectors, and the cross-correlation matrix \mathcal{E}[\mathbf{x}y] between the desired response and the feature vector. This presupposes knowledge of the underlying distributions which, in general, is not available. Thus, our major goal becomes to see if it is possible to solve this optimization problem without this statistical information.
The Least Mean Square (LMS) algorithm, originally formulated by Widrow and Hoff, is a stochastic gradient algorithm that iterates the weights (w, w0) in the regressor after each presentation of a data sample, unlike the standard gradient descent that iterates weights after presentation of the whole training dataset. That is, the kth iteration in standard gradient descent means the kth epoch, or the kth presentation of the whole training dataset, while the kth iteration in stochastic gradient descent means the presentation of the kth single training data pair (drawn in sequence or randomly). Thus, the calculation of the weight change Δw̄, or of the gradient needed for it, is pattern-based, not epoch-based (Δw̄ = –η∇E(w̄); Eqn (5.23b)).
LMS is called a stochastic gradient algorithm because the gradient vector is chosen at 'random' and not, as in the steepest descent case, precisely derived from the shape of the total error surface. 'Random' here means the instantaneous value of the gradient, which is then used as the estimator of the true quantity.
The design of the LMS algorithm is very simple, yet a detailed analysis of its convergence
behavior is a challenging mathematical task. It turns out that under mild conditions, the solution
provided by the LMS algorithm converges in probability to the solution of the sum-of-error-squares
optimization problem.
\hat{y}(k) = \sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k) \qquad (5.28b)
where k is the iteration index. Note that the input components x_j^{(i)} and the desired output y^{(i)} are not functions of the iteration index. Training pairs (x^{(i)}, y^{(i)}), drawn in sequence or randomly, are presented to the network at each iteration. The gradients with respect to the weights and bias are computed as follows:

\frac{\partial E(k)}{\partial w_j(k)} = e(k)\,\frac{\partial e(k)}{\partial w_j(k)} = -e(k)\,\frac{\partial \hat{y}(k)}{\partial w_j(k)} = -e(k)\, x_j^{(i)}

\frac{\partial E(k)}{\partial w_0(k)} = -e(k)
In this section, the gradient descent strategy for adapting the weights of a single neuron having a differentiable activation function is demonstrated. This is just a small (nonlinear) deviation from the derivation of the adaptive rule for the linear activation function given in the previous section. Including this small deviation is a natural step towards deriving the gradient-descent based algorithm for multilayer neural networks (in the next section, we will derive this algorithm).
A neural unit with any differentiable function σ(a) is shown in Fig. 5.13. It first computes a linear combination of its inputs (activation value a), then applies the nonlinear activation function σ(a) to the result. The output ŷ of the nonlinear unit is a continuous function of its input a. More precisely, the nonlinear unit computes its output as
\hat{y} = \sigma(a) \qquad (5.31a)

a = \sum_{j=1}^{n} w_j x_j + w_0 \qquad (5.31b)
[Figure 5.13 A neural unit with a differentiable activation function: inputs x0 = 1, x1, …, xn with weights w0, w1, …, wn feed a summing unit forming a = Σj wj xj + w0; the output is ŷ = σ(a)]
The problem is to find the expression for the learning rule for adapting the weights using a training set of pairs of input and output patterns; the learning is in stochastic gradient descent mode, as in the last section. We begin by defining the error function E(k):

E(k) = \tfrac{1}{2}\,e^2(k) = \tfrac{1}{2}\left(y^{(i)} - \hat{y}(k)\right)^2 \qquad (5.32)

The gradient descent updates are

w_j(k+1) = w_j(k) - \eta\,\frac{\partial E(k)}{\partial w_j(k)} \qquad (5.33b)

w_0(k+1) = w_0(k) - \eta\,\frac{\partial E(k)}{\partial w_0(k)} \qquad (5.33c)
Note that E(k) is a nonlinear function of the weights now, and the gradient cannot be calculated following the equations derived in the last section for a linear neuron. Fortunately, the calculation of the gradient is straightforward in the nonlinear case as well. For this purpose, the chain rule gives

\frac{\partial E(k)}{\partial w_j(k)} = \frac{\partial E(k)}{\partial a(k)}\,\frac{\partial a(k)}{\partial w_j(k)} \qquad (5.34)
where the first term on the right-hand side is a measure of the error change due to the activation value a(k) at the kth iteration, and the second term shows the influence of the weights on that particular activation value a(k). Applying the chain rule again, we get

\frac{\partial E(k)}{\partial w_j(k)} = \frac{\partial E(k)}{\partial e(k)}\,\frac{\partial e(k)}{\partial \hat{y}(k)}\,\frac{\partial \hat{y}(k)}{\partial a(k)}\,\frac{\partial a(k)}{\partial w_j(k)} = e(k)\,[-1]\,\frac{\partial \sigma(a(k))}{\partial a(k)}\, x_j^{(i)} = -e(k)\,\sigma'(a(k))\, x_j^{(i)} \qquad (5.35)
The learning rule can be written as

w_j(k+1) = w_j(k) + \eta\, e(k)\,\sigma'(a(k))\, x_j^{(i)} \qquad (5.36a)

w_0(k+1) = w_0(k) + \eta\, e(k)\,\sigma'(a(k)) \qquad (5.36b)
This is the most general learning rule that is valid for a single neuron having any nonlinear and
differentiable activation function and whose input is formed as a product of the pattern and weight
vectors. It follows the LMS algorithm for a linear neuron presented in the last section, which was
an early powerful strategy for adapting weights using data pairs only.
This rule is also known as the delta learning rule, with delta defined as

\delta(k) = e(k)\,\sigma'(a(k)) = \left(y^{(i)} - \hat{y}(k)\right)\sigma'(a(k)) \qquad (5.37)

In terms of δ(k), the weight-update equations become

w_j(k+1) = w_j(k) + \eta\,\delta(k)\, x_j^{(i)} \qquad (5.38a)

w_0(k+1) = w_0(k) + \eta\,\delta(k) \qquad (5.38b)
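In code, one delta-rule iteration looks as follows; the tanh activation used in the example is just one admissible differentiable choice, and the values are arbitrary:

```python
import numpy as np

def delta_rule_step(w, w0, x, y, eta, sigma, sigma_prime):
    """One iteration of Eqns (5.37)-(5.38) for a neuron with any
    differentiable activation sigma (derivative sigma_prime)."""
    a = np.dot(w, x) + w0
    delta = (y - sigma(a)) * sigma_prime(a)       # error signal d(k)
    return w + eta * delta * x, w0 + eta * delta  # Eqns (5.38a, b)

# Example with tanh as the differentiable activation (values arbitrary)
sigma = np.tanh
sigma_prime = lambda a: 1.0 - np.tanh(a) ** 2
w, w0 = delta_rule_step(np.zeros(2), 0.0, np.array([1.0, -1.0]), 1.0,
                        eta=0.1, sigma=sigma, sigma_prime=sigma_prime)
print(w, w0)
```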
It should be carefully noted that the δ(k) in these equations is not the error but the error change –∂E(k)/∂a(k) due to the input a(k) to the nonlinear activation function at the kth iteration:

-\frac{\partial E(k)}{\partial a(k)} = -\frac{\partial E(k)}{\partial e(k)}\,\frac{\partial e(k)}{\partial \hat{y}(k)}\,\frac{\partial \hat{y}(k)}{\partial a(k)} = -e(k)\,[-1]\,\sigma'(a(k)) = e(k)\,\sigma'(a(k)) = \delta(k) \qquad (5.39)
Thus, d(k) will generally not be equal to the error e(k). We will use the term error signal for d(k),
keeping in mind that, in fact, it represents the error change.
In the world of neural computing, the error signal δ(k) is of highest importance. After a hiatus of about 20 years in the development of learning rules for multilayer networks, an adaptation rule based on the delta rule made a breakthrough in 1986 and was named the generalized delta learning rule. Today, the rule is also known as the error backpropagation learning rule (discussed in the next section).
The sigmoidal unit has the useful property that its derivative is easily expressed in terms of its output:

Log-sigmoid

\frac{d\sigma(a)}{da} = \frac{d}{da}\left[\frac{1}{1+e^{-a}}\right] = \frac{e^{-a}}{(1+e^{-a})^2} = \frac{1}{1+e^{-a}}\left[1 - \frac{1}{1+e^{-a}}\right]

= \sigma(a)\,[1 - \sigma(a)] \qquad (5.43a)

= \hat{y}\,(1 - \hat{y}) \qquad (5.43b)

Tan-sigmoid

\frac{d\sigma(a)}{da} = \frac{d}{da}\left[\frac{1-e^{-a}}{1+e^{-a}}\right] = \frac{2e^{-a}}{(1+e^{-a})^2} = \frac{1}{2}\left[1 - \frac{1-e^{-a}}{1+e^{-a}}\right]\left[1 + \frac{1-e^{-a}}{1+e^{-a}}\right]

= \tfrac{1}{2}\,(1 - \sigma(a))\,(1 + \sigma(a)) \qquad (5.44a)

= \tfrac{1}{2}\,(1 - \hat{y})\,(1 + \hat{y}) \qquad (5.44b)
As we shall see, the gradient descent learning makes use of these derivatives.
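The identities (5.43) and (5.44) can be confirmed numerically; the following sketch checks the derivative-from-output forms against the directly differentiated expressions:

```python
import numpy as np

a = np.linspace(-3.0, 3.0, 7)

# Log-sigmoid: Eqn (5.43b), derivative from the output alone
y = 1.0 / (1.0 + np.exp(-a))
assert np.allclose(y * (1.0 - y), np.exp(-a) / (1.0 + np.exp(-a)) ** 2)

# Tan-sigmoid: Eqn (5.44b), derivative from the output alone
y = (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))
assert np.allclose(0.5 * (1.0 - y) * (1.0 + y),
                   2.0 * np.exp(-a) / (1.0 + np.exp(-a)) ** 2)
print("derivative identities verified")
```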
The most general learning rule valid for a single neuron having any nonlinear and differentiable activation function is given by Eqns (5.37)–(5.38). For the specific case of the sigmoidal (log-sigmoid) nonlinearity, we have

\sigma'(a(k)) = \frac{d\sigma(a(k))}{da(k)} = \sigma(a(k))\,[1 - \sigma(a(k))] = \hat{y}(k)\,[1 - \hat{y}(k)]

Therefore,

\delta(k) = e(k)\,\sigma'(a(k)) = \left(y^{(i)} - \hat{y}(k)\right)\hat{y}(k)\,[1 - \hat{y}(k)]

The weight-update equations become

w_j(k+1) = w_j(k) + \eta\left(y^{(i)} - \hat{y}(k)\right)\hat{y}(k)\,[1 - \hat{y}(k)]\, x_j^{(i)}

w_0(k+1) = w_0(k) + \eta\left(y^{(i)} - \hat{y}(k)\right)\hat{y}(k)\,[1 - \hat{y}(k)]
We construct multilayer networks using sigmoid units (the next section will describe commonly used structures). Initially, we might be tempted to select the linear units discussed earlier. But multiple layers of cascaded linear units still produce only linear functions, and we favor networks possessing the ability to represent highly nonlinear functions. The (hard-limiting) perceptron unit is another likely selection, but its discontinuous threshold makes it undifferentiable and therefore,