Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)
RECURRENT NEURAL NETWORKS FOR PREDICTION
WILEY SERIES IN ADAPTIVE AND LEARNING SYSTEMS FOR
SIGNAL PROCESSING, COMMUNICATIONS, AND CONTROL
Editor: Simon Haykin
Beckerman/ADAPTIVE COOPERATIVE SYSTEMS
Chen and Gu/CONTROL-ORIENTED SYSTEM IDENTIFICATION: An H∞ Approach
Cherkassky and Mulier/LEARNING FROM DATA: Concepts, Theory and Methods
Diamantaras and Kung/PRINCIPAL COMPONENT NEURAL NETWORKS:
Theory and Applications
Haykin and Puthusserypady/CHAOTIC DYNAMICS OF SEA CLUTTER
Haykin/NONLINEAR DYNAMICAL SYSTEMS: Feedforward Neural Network
Perspectives
Haykin/UNSUPERVISED ADAPTIVE FILTERING, VOLUME I: Blind Source
Separation
Haykin/UNSUPERVISED ADAPTIVE FILTERING, VOLUME II: Blind
Deconvolution
Hines/FUZZY AND NEURAL APPROACHES IN ENGINEERING
Hrycej/NEUROCONTROL: Towards an Industrial Control Methodology
Krstic, Kanellakopoulos, and Kokotovic/NONLINEAR AND ADAPTIVE
CONTROL DESIGN
Mann/INTELLIGENT IMAGE PROCESSING
Nikias and Shao/SIGNAL PROCESSING WITH ALPHA-STABLE
DISTRIBUTIONS AND APPLICATIONS
Passino and Burgess/STABILITY ANALYSIS OF DISCRETE EVENT SYSTEMS
Sanchez-Peña and Sznaier/ROBUST SYSTEMS THEORY AND APPLICATIONS
Tao and Kokotovic/ADAPTIVE CONTROL OF SYSTEMS WITH ACTUATOR
AND SENSOR NONLINEARITIES
Van Hulle/FAITHFUL REPRESENTATIONS AND TOPOGRAPHIC MAPS:
From Distortion- to Information-Based Self-Organization
Vapnik/STATISTICAL LEARNING THEORY
Werbos/THE ROOTS OF BACKPROPAGATION: From Ordered Derivatives to
Neural Networks and Political Forecasting
Yee and Haykin/REGULARIZED RADIAL-BASIS FUNCTION NETWORKS:
Theory and Applications
RECURRENT NEURAL NETWORKS FOR PREDICTION
LEARNING ALGORITHMS, ARCHITECTURES AND STABILITY
Danilo P. Mandic
School of Information Systems,
University of East Anglia, UK
Jonathon A. Chambers
Department of Electronic and Electrical Engineering,
University of Bath, UK
Preface xiv
1 Introduction 1
1.1 Some Important Dates in the History of Connectionism 2
1.2 The Structure of Neural Networks 2
1.3 Perspective 4
1.4 Neural Networks for Prediction: Perspective 5
1.5 Structure of the Book 6
1.6 Readership 8
2 Fundamentals 9
2.1 Perspective 9
2.1.1 Chapter Summary 9
2.2 Adaptive Systems 9
2.2.1 Configurations of Adaptive Systems Used in Signal
Processing 10
2.2.2 Blind Adaptive Techniques 12
2.3 Gradient-Based Learning Algorithms 12
2.4 A General Class of Learning Algorithms 14
2.4.1 Quasi-Newton Learning Algorithm 15
2.5 A Step-by-Step Derivation of the Least Mean Square (LMS)
Algorithm 15
2.5.1 The Wiener Filter 16
2.5.2 Further Perspective on the Least Mean Square (LMS)
Algorithm 17
2.6 On Gradient Descent for Nonlinear Structures 18
2.6.1 Extension to a General Neural Network 19
2.7 On Some Important Notions From Learning Theory 19
2.7.1 Relationship Between the Error and the Error Function 19
2.7.2 The Objective Function 20
2.7.3 Types of Learning with Respect to the Training Set
and Objective Function 20
2.7.4 Deterministic, Stochastic and Adaptive Learning 21
2.7.5 Constructive Learning 21
References 267
Index 281
Preface
Acknowledgements
Danilo Mandic acknowledges Dr M. Razaz for providing a home from home in the
Bioinformatics Laboratory, School of Information Systems, University of East Anglia.
Many thanks to the people from the lab for creating a congenial atmosphere at work.
The Dean of the School of Information Systems, Professor V. Rayward-Smith and
his predecessor Dr J. Glauert, deserve thanks for their encouragement and support.
Dr M. Bozic has done a tremendous job on proofreading the mathematics. Dr W. Sher-
liker has contributed greatly to Chapter 10. Dr D. I. Kim has proofread the mathe-
matically involved chapters. I thank Dr G. Cawley, Dr M. Dzamonja, Dr A. James
and Dr G. Smith for proofreading the manuscript in its various phases. Dr R. Harvey
has been of great help throughout. Special thanks to my research associates I. Krcmar
and Dr R. Foxall for their help with some of the experimental results. H. Graham
has always been at hand with regard to computing problems. Many of the results
presented here have been achieved while I was at Imperial College, where I greatly
benefited from the unique research atmosphere in the Signal Processing Section of the
Department of Electrical and Electronic Engineering.
Jonathon Chambers acknowledges the outstanding PhD researchers with whom he
has had the opportunity to interact, they have helped so much towards his orientation
in adaptive signal processing. He also acknowledges Professor P. Watson, Head of
the Department of Electronic and Electrical Engineering, University of Bath, who
has provided the opportunity to work on the book during its later stages. Finally,
he thanks Mr D. M. Brookes and Dr P. A. Naylor, his former colleagues, for their
collaboration in research projects.
Danilo Mandic
Jonathon Chambers
List of Abbreviations
ACF Autocorrelation function
AIC Akaike Information Criterion
ANN Artificial Neural Network
AR Autoregressive
ARIMA Autoregressive Integrated Moving Average
ARMA Autoregressive Moving Average
ART Adaptive Resonance Theory
AS Asymptotic Stability
ATM Asynchronous Transfer Mode
BIC Bayesian Information Criterion
BC Before Christ
BIBO Bounded Input Bounded Output
BP Backpropagation
BPTT Backpropagation Through Time
CM Contraction Mapping
CMT Contraction Mapping Theorem
CNN Cellular Neural Network
DC Direct Current
DR Data Reusing
DSP Digital Signal Processing
DVS Deterministic Versus Stochastic
ECG Electrocardiogram
EKF Extended Kalman Filter
ERLS Extended Recursive Least Squares
ES Exponential Stability
FCRNN Fully Connected Recurrent Neural Network
FFNN Feedforward Neural Network
FIR Finite Impulse Response
FPI Fixed Point Iteration
GAS Global Asymptotic Stability
GD Gradient Descent
HOS Higher-Order Statistics
HRV Heart Rate Variability
i.i.d. Independent Identically Distributed
IIR Infinite Impulse Response
IVT Intermediate Value Theorem
KF Kalman Filter
Introduction
Artificial neural network (ANN) models have been extensively studied with the aim
of achieving human-like performance, especially in the field of pattern recognition.
These networks are composed of a number of nonlinear computational elements which
operate in parallel and are arranged in a manner reminiscent of biological neural inter-
connections. ANNs are known by many names such as connectionist models, parallel
distributed processing models and neuromorphic systems (Lippmann 1987). The ori-
gin of connectionist ideas can be traced back to the Greek philosopher, Aristotle, and
his ideas of mental associations. He proposed some of the basic concepts such as that
memory is composed of simple elements connected to each other via a number of
different mechanisms (Medler 1998).
While early work in ANNs used anthropomorphic arguments to introduce the meth-
ods and models used, today neural networks used in engineering are related to algo-
rithms and computation and do not question how brains might work (Hunt et al.
1992). For instance, recurrent neural networks have been attractive to physicists due
to their isomorphism to spin glass systems (Ermentrout 1998). The following proper-
ties of neural networks make them important in signal processing (Hunt et al. 1992):
they are nonlinear systems; they enable parallel distributed processing; they can be
implemented in VLSI technology; they provide learning, adaptation and data fusion
of both qualitative (symbolic data from artificial intelligence) and quantitative (from
engineering) data; they realise multivariable systems.
The area of neural networks is nowadays considered from two main perspectives.
The first perspective is cognitive science, which is an interdisciplinary study of the
mind. The second perspective is connectionism, which is a theory of information pro-
cessing (Medler 1998). The neural networks in this work are approached from an
engineering perspective, i.e. to make networks efficient in terms of topology, learning
algorithms, ability to approximate functions and capture dynamics of time-varying
systems. From the perspective of connection patterns, neural networks can be grouped
into two categories: feedforward networks, in which graphs have no loops, and recur-
rent networks, where loops occur because of feedback connections. Feedforward net-
works are static, that is, a given input can produce only one set of outputs, and hence
carry no memory. In contrast, recurrent network architectures enable the informa-
tion to be temporally memorised in the networks (Kung and Hwang 1998). Based
on training by example, with strong support of statistical and optimisation theories
(Cichocki and Unbehauen 1993; Zhang and Constantinides 1992), neural networks
are becoming one of the most powerful and appealing nonlinear signal processors for
a variety of signal processing applications. As such, neural networks expand signal
processing horizons (Chen 1997; Haykin 1996b), and can be considered as massively
interconnected nonlinear adaptive filters. Our emphasis will be on dynamics of recur-
rent architectures and algorithms for prediction.
In the early 1940s the pioneers of the field, McCulloch and Pitts, studied the potential
of the interconnection of a model of a neuron. They proposed a computational model
based on a simple neuron-like element (McCulloch and Pitts 1943). Others, like Hebb
were concerned with the adaptation laws involved in neural systems. In 1949 Donald
Hebb devised a learning rule for adapting the connections within artificial neurons
(Hebb 1949). A period of early activity extends up to the 1960s with the work of
Rosenblatt (1962) and Widrow and Hoff (1960). In 1958, Rosenblatt coined the name
‘perceptron’. Based upon the perceptron (Rosenblatt 1958), he developed the theory
of statistical separability. The next major development is the new formulation of
learning rules by Widrow and Hoff in their Adaline (Widrow and Hoff 1960). In
1969, Minsky and Papert (1969) provided a rigorous analysis of the perceptron. The
work of Grossberg in 1976 was based on biological and psychological evidence. He
proposed several new architectures of nonlinear dynamical systems (Grossberg 1974)
and introduced adaptive resonance theory (ART), which is a real-time ANN that
performs supervised and unsupervised learning of categories, pattern classification and
prediction. In 1982 Hopfield pointed out that neural networks with certain symmetries are analogous to spin glasses.
A seminal book on ANNs is by Rumelhart et al. (1986). Fukushima explored com-
petitive learning in his biologically inspired Cognitron and Neocognitron (Fukushima
1975; Widrow and Lehr 1990). In 1971 Werbos developed a backpropagation learn-
ing algorithm which he published in his doctoral thesis (Werbos 1974). Rumelhart
et al. rediscovered this technique in 1986 (Rumelhart et al. 1986). Kohonen (1982) introduced self-organised maps for pattern recognition (Burr 1993).
Figure 1.1 Connections within a node: inputs x1, . . . , xN are weighted by w1, . . . , wN, a bias weight w0 is applied to a constant input +1, and the node output is y = Φ(Σi xi wi + w0).
NN: R^N → R^M. (1.3)
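To make the operation of such a node concrete, a minimal Python sketch is given below; the function name, the choice of a logistic sigmoid for Φ and the numerical values are illustrative assumptions rather than anything prescribed in the text.

```python
import numpy as np

def neuron_output(x, w, w0):
    """Output of the node of Figure 1.1: y = Phi(sum_i x_i w_i + w0).

    Phi is taken here to be the logistic sigmoid; any other nonlinear
    activation function could be substituted.
    """
    v = np.dot(x, w) + w0              # weighted sum of the inputs plus the bias w0
    return 1.0 / (1.0 + np.exp(-v))    # sigmoid activation Phi

# Example: three inputs mapped to a single output, a NN: R^3 -> R^1 case as in (1.3)
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w, w0=0.2))
```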
1.3 Perspective
Before the 1920s, prediction was undertaken by simply extrapolating the time series
through a global fit procedure. The beginning of modern time series prediction was
in 1927 when Yule introduced the autoregressive model in order to predict the annual
number of sunspots. For the next half century the models considered were linear, typ-
ically driven by white noise. In the 1980s, the state-space representation and machine
learning, typically by neural networks, emerged as new potential models for prediction
of highly complex, nonlinear and nonstationary phenomena. This was the shift from
rule-based models to data-driven methods (Gershenfeld and Weigend 1993).
Time series prediction has traditionally been performed by the use of linear para-
metric autoregressive (AR), moving-average (MA) or autoregressive moving-average
(ARMA) models (Box and Jenkins 1976; Ljung and Soderstrom 1983; Makhoul 1975),
the parameters of which are estimated either in a block or a sequential manner with
the least mean square (LMS) or recursive least-squares (RLS) algorithms (Haykin
1994). An obvious problem is that these processors are linear and are not able to
cope with certain nonstationary signals, and signals whose mathematical model is
not linear. On the other hand, neural networks are powerful when applied to prob-
lems whose solutions require knowledge which is difficult to specify, but for which
there is an abundance of examples (Dillon and Manikopoulos 1991; Gent and Shep-
pard 1992; Townshend 1991). As time series prediction is conventionally performed
entirely by inference of future behaviour from examples of past behaviour, it is a suit-
able application for a neural network predictor. The neural network approach to time
series prediction is non-parametric in the sense that it does not need to know any
information regarding the process that generates the signal. For instance, the order
and parameters of an AR or ARMA process are not needed in order to carry out the
prediction. This task is carried out by a process of learning from examples presented
to the network and changing network weights in response to the output error.
Li (1992) has shown that the recurrent neural network (RNN) with a sufficiently
large number of neurons is a realisation of the nonlinear ARMA (NARMA) process.
RNNs performing NARMA prediction have traditionally been trained by the real-
time recurrent learning (RTRL) algorithm (Williams and Zipser 1989a) which pro-
vides the training process of the RNN ‘on the run’. However, for a complex physical process, several difficulties encountered by RNNs induced a search for other, more suitable schemes for RNN-based predictors: the high degree of approximation involved in the RTRL algorithm for a high-order MA part of the underlying NARMA process, the high computational complexity of O(N^4), with N being the number of neurons in the RNN, an insufficient degree of nonlinearity, and relatively low robustness.
In addition, in time series prediction of nonlinear and nonstationary signals, there
is a need to learn long-time temporal dependencies. This is rather difficult with con-
ventional RNNs because of the problem of vanishing gradient (Bengio et al. 1994).
A solution to that problem might be NARMA models and nonlinear autoregressive
moving average models with exogenous inputs (NARMAX) (Siegelmann et al. 1997)
realised by recurrent neural networks. However, the quality of performance is highly
dependent on the order of the AR and MA parts in the NARMAX model.
The main reasons for using neural networks for prediction rather than classical time
series analysis are (Wu 1995)
• they are computationally at least as fast, if not faster, than most available
statistical techniques;
• they are self-monitoring (i.e. they learn how to make accurate predictions);
• they are as accurate if not more accurate than most of the available statistical
techniques;
• they provide iterative forecasts;
• they are able to cope with nonlinearity and nonstationarity of input processes;
• they offer both parametric and nonparametric prediction.
Many signals are generated from an inherently nonlinear physical mechanism and have
statistically non-stationary properties, a classic example of which is speech. Linear
structure adaptive filters are suitable for the nonstationary characteristics of such
signals, but they do not account for nonlinearity and associated higher-order statistics
(Shynk 1989). Adaptive techniques which recognise the nonlinear nature of the signal
should therefore outperform traditional linear adaptive filtering techniques (Haykin
1996a; Kay 1993). The classic approach to time series prediction is to undertake an
analysis of the time series data, which includes modelling, identification of the model
and model parameter estimation phases (Makhoul 1975). The design may be iterated
by measuring the closeness of the model to the real data. This can be a long process,
often involving the derivation, implementation and refinement of a number of models
before one with appropriate characteristics is found.
In particular, the most difficult systems to predict are
• those with non-stationary dynamics, where the underlying behaviour varies with
time, a typical example of which is speech production;
• those which deal with physical data which are subject to noise and experimen-
tation error, such as biomedical signals;
• those which deal with short time series, providing few data points on which to
conduct the analysis, such as heart rate signals, chaotic signals and meteorolog-
ical signals.
In all these situations, traditional techniques are severely limited and alternative
techniques must be found (Bengio 1995; Haykin and Li 1995; Li and Haykin 1993;
Niranjan and Kadirkamanathan 1991).
On the other hand, neural networks are powerful when applied to problems whose
solutions require knowledge which is difficult to specify, but for which there is an
abundance of examples (Dillon and Manikopoulos 1991; Gent and Sheppard 1992;
Townshend 1991). From a system theoretic point of view, neural networks can be
considered as a conveniently parametrised class of nonlinear maps (Narendra 1996).
There has been a recent resurgence in the field of ANNs caused by new net topolo-
gies, VLSI computational algorithms and the introduction of massive parallelism into
neural networks. As such, they are both universal function approximators (Cybenko
1989; Hornik et al. 1989) and arbitrary pattern classifiers. From the Weierstrass The-
orem, it is known that polynomials, and many other approximation schemes, can
approximate arbitrarily well a continuous function. Kolmogorov’s theorem (a neg-
ative solution of Hilbert’s 13th problem (Lorentz 1976)) states that any continuous
function can be approximated using only linear summations and nonlinear but contin-
uously increasing functions of only one variable. This makes neural networks suitable
for universal approximation, and hence prediction. Although sometimes computation-
ally demanding (Williams and Zipser 1995), neural networks have found their place
in the area of nonlinear autoregressive moving average (NARMA) (Bailer-Jones et
al. 1998; Connor et al. 1992; Lin et al. 1996) prediction applications. Comprehensive
survey papers on the use and role of ANNs can be found in Widrow and Lehr (1990),
Lippmann (1987), Medler (1998), Ermentrout (1998), Hunt et al. (1992) and Billings
(1980).
Only recently, neural networks have been considered for prediction. A recent compe-
tition by the Santa Fe Institute for Studies in the Science of Complexity (1991–1993)
(Weigend and Gershenfeld 1994) showed that neural networks can outperform conven-
tional linear predictors in a number of applications (Waibel et al. 1989). In journals,
there has been an ever increasing interest in applying neural networks. A most com-
prehensive issue on recurrent neural networks is the issue of the IEEE Transactions on
Neural Networks, vol. 5, no. 2, March 1994. In the signal processing community, there
has been a recent special issue ‘Neural Networks for Signal Processing’ of the IEEE
Transactions on Signal Processing, vol. 45, no. 11, November 1997, and also the issue
‘Intelligent Signal Processing’ of the Proceedings of IEEE, vol. 86, no. 11, November
1998, both dedicated to the use of neural networks in signal processing applications.
Figure 1.2 shows the frequency of the appearance of articles on recurrent neural net-
works in common citation index databases. Figure 1.2(a) shows the number of journal and
conference articles on recurrent neural networks in IEE/IEEE publications between
1988 and 1999. The data were gathered using the IEL Online service, and these publi-
cations are mainly periodicals and conferences in electronics engineering. Figure 1.2(b)
shows the frequency of appearance for the BIDS/ATHENS database between 1988 and 2000, which also includes non-engineering publications. From Figure 1.2, there is a
clear growing trend in the frequency of appearance of articles on recurrent neural
networks. Therefore, we felt that there was a need for a research monograph that
would cover a part of the area with up to date ideas and results.
Figure 1.2 Appearance of articles on RNNs in major citation databases. (a) Number of journal and conference papers on recurrent neural networks in IEE/IEEE publications (IEL database) per year, 1988–1999. (b) Number of journal and conference papers on recurrent neural networks in the BIDS database per year, 1988–2000.
Chapter 4 contains a detailed discussion of activation functions and new insights are
provided by the consideration of neural networks within the framework of modu-
lar groups from number theory. The material in Chapter 5 builds upon that within
Chapter 3 and provides more comprehensive coverage of recurrent neural network
architectures together with concepts from nonlinear system modelling. In Chapter 6,
neural networks are considered as nonlinear adaptive filters whereby the necessary
learning strategies for recurrent neural networks are developed. The stability issues
for certain recurrent neural network architectures are considered in Chapter 7 through
the exploitation of fixed point theory and bounds for global asymptotic stability are
derived. A posteriori adaptive learning algorithms are introduced in Chapter 8 and
the synergy with data-reusing algorithms is highlighted. In Chapter 9, a new class
of normalised algorithms for online training of recurrent neural networks is derived.
The convergence of online learning algorithms for neural networks is addressed in
Chapter 10. Experimental results for the prediction of nonlinear and nonstationary
signals with recurrent neural networks are presented in Chapter 11. In Chapter 12,
the exploitation of inherent relationships between parameters within recurrent neural
networks is described. Appendices A to J provide background to the main chapters
and cover key concepts from linear algebra, approximation theory, complex sigmoid
activation functions, a precedent learning algorithm for recurrent neural networks, ter-
minology in neural networks, a posteriori techniques in science and engineering, con-
traction mapping theory, linear relaxation and stability, stability of general nonlinear
systems and deseasonalising of time series. The book concludes with a comprehensive
bibliography.
1.6 Readership
This book is targeted at graduate students and research engineers active in the areas
of communications, neural networks, nonlinear control, signal processing and time
series analysis. It will also be useful for engineers and scientists working in diverse
application areas, such as artificial intelligence, biomedicine, earth sciences, finance
and physics.
Fundamentals
2.1 Perspective
Adaptive systems are at the very core of modern digital signal processing. There are many reasons for this; foremost amongst them is that adaptive filtering, prediction or identification does not require explicit a priori statistical knowledge of the input data.
Adaptive systems are employed in numerous areas such as biomedicine, communica-
tions, control, radar, sonar and video processing (Haykin 1996a).
Figure 2.1 Block diagram of an adaptive system: a filter structure processes the input signal, a comparator forms the error as the difference between the desired response and the filter output, and the control algorithm uses this error to adjust the filter.
1 The aim is to minimise some function of the error e. If E[e²] is minimised, we consider minimum mean squared error (MSE) adaptation; the statistical expectation operator, E[ · ], is required due to the random nature of the inputs to the adaptive system.
Figure 2.2 Configurations of adaptive systems used in signal processing: (a) system identification, where the adaptive filter and an unknown system are driven by the same input x(k) and the error e(k) is formed between the unknown system output d(k) and the filter output y(k); (b) adaptive noise cancellation, where a reference input x(k) (noise N1(k) correlated with the corruption) is filtered and subtracted from the primary input s(k) + N0(k); (c) adaptive prediction; (d) inverse system modelling.
Typical applications of the noise cancellation configuration include the suppression of noise in acoustic environments and estimation of the foetal ECG from the mixture of the maternal and foetal ECG (Widrow and Stearns 1985).
In the adaptive prediction configuration, the desired signal is the input signal
advanced relative to the input of the adaptive filter, as shown in Figure 2.2(c). This
configuration has numerous applications in various areas of engineering, science and
technology and most of the material in this book is dedicated to prediction. In fact,
prediction may be considered as a basis for any adaptation process, since the adaptive
filter is trying to predict the desired response.
The inverse system configuration, shown in Figure 2.2(d), has an adaptive system
cascaded with the unknown system. A typical application is adaptive channel equal-
isation in telecommunications, whereby an adaptive system tries to compensate for
the possibly time-varying communication channel, so that the transfer function from
the input to the output of Figure 2.2(d) approximates a pure delay.
In most adaptive signal processing applications, parametric methods are applied
which require a priori knowledge (or postulation) of a specific model in the form of
differential or difference equations. Thus, it is necessary to determine the appropriate
model order for successful operation, which will underpin data length requirements.
On the other hand, nonparametric methods employ general model forms of integral
Figure 2.3 Structure of a blind equaliser: the desired response is generated internally by passing the equaliser output through a zero-memory nonlinearity.
The presence of an explicit desired response signal, d(k), in all the structures shown
in Figure 2.2 implies that conventional, supervised, adaptive signal processing tech-
niques may be applied for the purpose of learning. When no such signal is available,
it may still be possible to perform learning by exploiting so-called blind, or unsuper-
vised, methods. These methods exploit certain a priori statistical knowledge of the
input data. For a single signal, this knowledge may be in the form of its constant mod-
ulus property, or, for multiple signals, their mutual statistical independence (Haykin
2000). In Figure 2.3 the structure of a blind equaliser is shown; notice that the desired response is generated from the output of a zero-memory nonlinearity. This nonlinear-
ity is implicitly being used to test the higher-order (i.e. greater than second-order)
statistical properties of the output of the adaptive equaliser. When ideal convergence
of the adaptive filter is achieved, the zero-memory nonlinearity has no effect upon
the signal y(k) and therefore y(k) has identical statistical properties to that of the
channel input s(k).
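As a rough illustration of this idea, the sketch below generates the 'desired' response by passing the equaliser output through a zero-memory nonlinearity, here a hard sign function under the assumption of a binary (±1) source; the structure is a generic Bussgang-type update, not a particular algorithm from the text.

```python
import numpy as np

def blind_equaliser_step(w, x, eta=0.01):
    """One coefficient update of a simple blind equaliser (illustrative sketch).

    The teaching signal is generated internally by a zero-memory
    nonlinearity (sign) applied to the equaliser output, so no explicit
    desired response d(k) is required.
    """
    y = np.dot(w, x)          # equaliser output y(k)
    d_hat = np.sign(y)        # zero-memory nonlinearity provides the 'desired' value
    e = d_hat - y             # error between the nonlinearity output and y(k)
    return w + eta * e * x    # LMS-style update of the equaliser coefficients
```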
The aim of adaptation is to adjust the weights so that a chosen error measure J(w) decreases from one iteration to the next, i.e.
J(w + ∆w) < J(w), (2.1)
where ∆w represents the change in w from one iteration to the next. This will generally ensure that after training, an adaptive system has captured the relevant properties of the unknown system that we are trying to model. Using a Taylor series expansion
Figure 2.4 Example of a filter with widely differing weights: four inputs x1, . . . , x4 are weighted by w1 = 10, w2 = 1, w3 = 0.1 and w4 = 0.01 and summed to form the output y.
J(w) + ∆w ∂J(w)/∂w + O(∆w²) < J(w). (2.2)
This way, with the assumption that the higher-order terms in the left-hand side of
(2.2) can be neglected, (2.1) can be rewritten as
∆w ∂J(w)/∂w < 0. (2.3)
From (2.3), an algorithm that would continuously reduce the error measure on the
run, should change the weights in the opposite direction of the gradient ∂J (w)/∂w,
i.e.
∆w = −η ∂J/∂w, (2.4)
where η is a small positive scalar called the learning rate, step size or adaptation
parameter.
Examining (2.4), if the gradient of the error measure J (w) is steep, large changes
will be made to the weights, and conversely, if the gradient of the error measure J (w)
is small, namely a flat error surface, a larger step size η may be used. Gradient descent
algorithms cannot, however, provide a sense of importance or hierarchy to the weights
(Agarwal and Mammone 1994). For example, the value of weight w1 in Figure 2.4 is
10 times greater than w2 and 1000 times greater than w4 . Hence, the component of
the output of the filter within the adaptive system due to w1 will, on the average,
be larger than that due to the other weights. For a conventional gradient algorithm,
however, the change in w1 will not depend upon the relative sizes of the coefficients,
but the relative sizes of the input data. This deficiency provides the motivation for
certain partial update gradient-based algorithms (Douglas 1997).
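In code, the update (2.4) amounts to a single line; the quadratic error measure used in the example below is an assumption made only so that the gradient has a closed form.

```python
import numpy as np

def gradient_descent_step(w, grad_J, eta=0.01):
    """Apply the weight change Delta w = -eta * dJ/dw of (2.4)."""
    return w - eta * grad_J

# Assumed example: J(w) = ||Xw - d||^2, so that dJ/dw = 2 X^T (Xw - d)
X = np.array([[1.0, 0.5], [0.2, -1.0], [0.7, 0.3]])
d = np.array([1.0, -0.5, 0.8])
w = np.zeros(2)
for _ in range(200):
    grad = 2.0 * X.T @ (X @ w - d)
    w = gradient_descent_step(w, grad, eta=0.05)
print(w)   # approaches the least-squares solution of Xw = d
```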
It is important to notice that gradient-descent-based algorithms inherently forget old
data, which leads to a problem called vanishing gradient and has particular importance
for learning in filters with recursive structures. This issue is considered in more detail
in Chapter 6.
To introduce a general class of learning algorithms and explain in very crude terms
relationships between them, we follow the approach from Guo and Ljung (1995). Let
us start from the linear regression equation,
y(k) = xT (k)w(k) + ν(k), (2.5)
where y(k) is the output signal, x(k) is a vector comprising the input signals, ν(k)
is a disturbance or noise sequence, and w(k) is an unknown time-varying vector of
weights (parameters) of the adaptive system. Variation of the weights at time k is
denoted by n(k), and the weight change equation becomes
w(k) = w(k − 1) + n(k). (2.6)
Adaptive algorithms can track the weights only approximately, hence for the following
analysis we use the symbol ŵ. A general expression for weight update in an adaptive
algorithm is
ŵ(k + 1) = ŵ(k) + ηΓ (k)(y(k) − xT (k)ŵ(k)), (2.7)
where Γ (k) is the adaptation gain vector, and η is the step size. To assess how far an
adaptive algorithm is from the optimal solution we introduce the weight error vector,
w̆(k), and a sample input matrix Σ(k) as
w̆(k) = w(k) − ŵ(k), Σ(k) = Γ (k)xT (k). (2.8)
Equations (2.5)–(2.8) yield the following weight error equation:
w̆(k + 1) = (I − ηΣ(k))w̆(k) − ηΓ (k)ν(k) + n(k + 1). (2.9)
For different gains Γ (k), the following three well-known algorithms can be obtained
from (2.7). 3
1. The least mean square (LMS) algorithm:
Γ (k) = x(k). (2.10)
3. Kalman filter (KF) algorithm (Guo and Ljung 1995; Kay 1993):
Γ(k) = P(k − 1)x(k) / (R + ηx^T(k)P(k − 1)x(k)), (2.13)
P(k) = P(k − 1) − ηP(k − 1)x(k)x^T(k)P(k − 1) / (R + ηx^T(k)P(k − 1)x(k)) + ηQ. (2.14)
3 Notice that the role of η in the RLS and KF algorithm is different to that in the LMS algorithm.
For RLS and KF we may put η = 1 and introduce a forgetting factor instead.
The KF algorithm is the optimal algorithm in this setting if the elements of n(k)
and ν(k) in (2.5) and (2.6) are Gaussian noises with a covariance matrix Q > 0 and a
scalar value R > 0, respectively (Kay 1993). All of these adaptive algorithms can be
referred to as sequential estimators, since they refine their estimate as each new sample
arrives. On the other hand, block-based estimators require all the measurements to
be acquired before the estimate is formed.
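The general update (2.7) together with the LMS gain (2.10) and the Kalman-filter gain (2.13)–(2.14) can be sketched as follows; the scalar values chosen for R, Q and η, and the helper names, are assumptions made purely for illustration.

```python
import numpy as np

def lms_gain(x, state=None):
    """LMS adaptation gain, Gamma(k) = x(k), as in (2.10)."""
    return x, state

def kf_gain(x, state, eta=1.0, R=1.0, Q=1e-4):
    """Kalman-filter gain of (2.13) with the covariance update (2.14)."""
    P = state
    Px = P @ x
    denom = R + eta * x @ Px
    gamma = Px / denom
    P = P - eta * np.outer(Px, Px) / denom + eta * Q * np.eye(len(x))
    return gamma, P

def adapt(x_seq, y_seq, gain_fn, n_weights, eta, state=None):
    """Sequential update w(k+1) = w(k) + eta*Gamma(k)*(y(k) - x^T(k)w(k)), as in (2.7)."""
    w = np.zeros(n_weights)
    for x, y in zip(x_seq, y_seq):
        gamma, state = gain_fn(x, state)
        w = w + eta * gamma * (y - x @ w)
    return w

# Usage sketch (data X, d and all parameter values are arbitrary assumptions):
# w_lms = adapt(X, d, lms_gain, n_weights=4, eta=0.05)
# w_kf  = adapt(X, d, kf_gain,  n_weights=4, eta=1.0, state=np.eye(4))
```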
Although the most important measure of quality of an adaptive algorithm is gen-
erally the covariance matrix of the weight tracking error E[w̆(k)w̆T (k)], due to the
statistical dependence between x(k), ν(k) and n(k), precise expressions for this covari-
ance matrix are extremely difficult to obtain.
To undertake statistical analysis of an adaptive learning algorithm, the classical
approach is to assume that x(k), ν(k) and n(k) are statistically independent. Another
assumption is that the homogeneous part of (2.9)
are exponentially stable in stochastic and deterministic senses (Guo and Ljung 1995).
The quasi-Newton learning algorithm utilises the second-order derivative of the objective function to adapt the weights. If the change in the objective function between iterations in a learning algorithm is modelled with a Taylor series expansion, we have
E(w + ∆w) ≈ E(w) + ∇w E(w)^T ∆w + ½ ∆w^T H∆w,
where H denotes the Hessian of the objective function. After setting the differential with respect to ∆w to zero, the weight update equation
becomes
∆w = −H −1 ∇w E(w). (2.18)
The Hessian H in this equation determines not only the direction but also the step
size of the gradient descent.
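A small sketch of the Newton-type step (2.18) is given below; the quadratic objective is an assumed example chosen so that both gradient and Hessian are available in closed form.

```python
import numpy as np

def quasi_newton_step(w, grad, hessian):
    """Delta w = -H^{-1} * grad, as in (2.18); solve() avoids forming H^{-1} explicitly."""
    return w - np.linalg.solve(hessian, grad)

# Assumed quadratic objective E(w) = 0.5 w^T A w - b^T w, so grad = A w - b and H = A
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2)
w = quasi_newton_step(w, A @ w - b, A)    # a single Newton step reaches the minimiser
print(w, np.linalg.solve(A, b))           # both equal A^{-1} b
```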
To conclude: adaptive algorithms mainly differ in their form of adaptation gains.
The gains can be roughly divided into two classes: gradient-based gains (e.g. LMS,
quasi-Newton) and Riccati equation-based gains (e.g. KF and RLS).
Figure 2.5 Structure of a finite impulse response filter
which represents an energy measure. In that case, function F is most often just the
inner product F = xT (k)w(k) and corresponds to the operation of a linear FIR filter
structure. As before, the goal is to find an optimisation algorithm that minimises the
cost function J(w). The common choice of the algorithm is motivated by the method
of steepest descent, and generates a sequence of weight vectors w(1), w(2), . . . , as
w(k + 1) = w(k) − ηg(k), (2.21)
where g(k) is the gradient vector of the cost function J(w) at the point w(k),
g(k) = ∂J(w)/∂w |_{w=w(k)}. (2.22)
Suppose the system shown in Figure 2.1 is modelled as a linear FIR filter (shown in Figure 2.5); then we have F(x, w) = x^T w, dropping the k index for convenience. Con-
sequently, the instantaneous cost function J(w(k)) is a quadratic function of the
weight vector. The Wiener filter is based upon minimising the ensemble average of this instantaneous cost function, i.e.
JWiener = E[e²(k)], (2.23)
assuming d(k) and x(k) are zero mean and jointly wide sense stationary. To find
the minimum of the cost function, we differentiate with respect to w and obtain
∂JWiener/∂w = −2E[e(k)x(k)], (2.24)
where e(k) = d(k) − xT (k)w(k).
At the Wiener solution, this gradient equals the null vector 0. Solving (2.24) for
this condition yields the Wiener solution,
w = R_{x,x}^{-1} r_{x,d}, (2.25)
where Rx,x = E[x(k)xT (k)] is the autocorrelation matrix of the zero mean input
data x(k) and r x,d = E[x(k)d(k)] is the crosscorrelation between the input vector
and the desired signal d(k). The Wiener formula has the same general form as the
block least-squares (LS) solution, when the exact statistics are replaced by temporal
averages.
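A sketch of (2.25) with the ensemble statistics replaced by temporal averages is given below; the synthetic data and the chosen dimensions are assumptions used only to exercise the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 5000, 4
w_true = np.array([0.6, -0.3, 0.2, 0.1])

X = rng.standard_normal((N, n))                  # zero-mean input vectors x(k)
d = X @ w_true + 0.05 * rng.standard_normal(N)   # desired response d(k)

R_xx = (X.T @ X) / N                     # temporal estimate of E[x(k) x^T(k)]
r_xd = (X.T @ d) / N                     # temporal estimate of E[x(k) d(k)]
w_wiener = np.linalg.solve(R_xx, r_xd)   # w = R_xx^{-1} r_xd, as in (2.25)
print(w_wiener)                          # close to w_true
```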
The RLS algorithm, as in (2.12), with the assumption that the input and desired
response signals are jointly ergodic, approximates the Wiener solution and asymptot-
ically matches the Wiener solution.
More details about the derivation of the Wiener filter can be found in Haykin
(1996a, 1999a).
Figure 2.6 Structure of a nonlinear FIR filter: a standard FIR filter in cascade with a memoryless nonlinearity Φ, producing the output y(k).
Following the same procedure as for the general gradient descent algorithm, we obtain
∂e(k)/∂w(k) = −x(k), (2.29)
and finally
∂J(k)/∂w(k) = −e(k)x(k). (2.30)
The set of equations that describes the LMS algorithm is given by
y(k) = Σ_{i=1}^{N} x_i(k)w_i(k) = x^T(k)w(k),
e(k) = d(k) − y(k), (2.31)
w(k + 1) = w(k) + ηe(k)x(k).
The LMS algorithm is a very simple yet extremely popular algorithm for adaptive
filtering. It is also optimal in the H ∞ sense which justifies its practical utility (Hassibi
et al. 1996).
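The three relations in (2.31) translate directly into a short loop. The sketch below applies them to one-step prediction of a synthetic signal; the prediction setting, the filter length and the data are assumptions for illustration only.

```python
import numpy as np

def lms_predict(signal, n_taps=4, eta=0.01):
    """LMS adaptive one-step predictor implementing the equations in (2.31)."""
    w = np.zeros(n_taps)
    errors = []
    for k in range(n_taps, len(signal)):
        x = signal[k - n_taps:k][::-1]   # tap-delay input vector x(k)
        y = x @ w                        # filter output y(k) = x^T(k) w(k)
        e = signal[k] - y                # prediction error e(k) = d(k) - y(k)
        w = w + eta * e * x              # weight update w(k+1) = w(k) + eta e(k) x(k)
        errors.append(e)
    return w, np.array(errors)

# Example on a noisy AR(2) signal (purely illustrative data):
rng = np.random.default_rng(1)
s = np.zeros(2000)
for k in range(2, len(s)):
    s[k] = 1.2 * s[k - 1] - 0.8 * s[k - 2] + 0.1 * rng.standard_normal()
w, e = lms_predict(s)
print(w, np.mean(e[-500:] ** 2))   # learned weights and final mean squared error
```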
Adaptive filters and neural networks are formally equivalent; in fact, the structures of neural networks are generalisations of linear filters (Maass and Sontag 2000; Nerrand
et al. 1991). Depending on the architecture of a neural network and whether it is used
online or offline, two broad classes of learning algorithms are available:
• techniques that use a direct computation of the gradient, which is typical for
linear and nonlinear adaptive filters;
• techniques that involve backpropagation, which is commonplace for most offline
applications of neural networks.
Backpropagation is a computational procedure to obtain gradients necessary for
adaptation of the weights of a neural network contained within its hidden layers and
is not radically different from a general gradient algorithm.
As we are interested in neural networks for real-time signal processing, we will
analyse online algorithms that involve direct gradient computation. In this section we
introduce a learning algorithm for a nonlinear FIR filter, whereas learning algorithms
for online training of recurrent neural networks will be introduced later. Let us start
from a simple nonlinear FIR filter, which consists of the standard FIR filter cascaded
with a memoryless nonlinearity Φ as shown in Figure 2.6. This structure can be seen
as a single neuron with a dynamical FIR synapse. This FIR synapse provides memory
to the neuron. The output of this filter is given by
y(k) = Φ(x^T(k)w(k)). (2.32)
The nonlinearity Φ( · ) after the tap-delay line is typically a sigmoid. Using the ideas
from the LMS algorithm, if the cost function is given by
J(k) = ½ e²(k), (2.33)
we have
e(k) = d(k) − Φ(x^T(k)w(k)), (2.34)
w(k + 1) = w(k) − η∇w J(k), (2.35)
where e(k) is the instantaneous error at the output neuron, d(k) is some teach-
ing (desired) signal, w(k) = [w1 (k), . . . , wN (k)]T is the weight vector and x(k) =
[x1 (k), . . . , xN (k)]T is the input vector.
The gradient ∇w J(k) can be calculated as
∂J(k)/∂w(k) = e(k) ∂e(k)/∂w(k) = −e(k)Φ'(x^T(k)w(k))x(k), (2.36)
where Φ'( · ) represents the first derivative of the nonlinearity Φ( · ), and the weight update Equation (2.35) can be rewritten as
w(k + 1) = w(k) + ηe(k)Φ'(x^T(k)w(k))x(k). (2.37)
This is the weight update equation for a direct gradient algorithm for a nonlinear
FIR filter.
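The update (2.37) differs from plain LMS only through the derivative of the nonlinearity. The sketch below assumes Φ = tanh (so that Φ'(v) = 1 − tanh²(v)); both the choice of activation and the function name are assumptions for illustration.

```python
import numpy as np

def nonlinear_fir_step(w, x, d, eta=0.01):
    """One adaptation step of the nonlinear FIR filter of Figure 2.6, as in (2.37).

    Phi is assumed to be tanh, hence Phi'(v) = 1 - tanh(v)**2.
    """
    v = x @ w                   # linear FIR part x^T(k) w(k)
    y = np.tanh(v)              # filter output y(k) = Phi(x^T(k) w(k))
    e = d - y                   # instantaneous output error e(k)
    return w + eta * e * (1.0 - y ** 2) * x   # w(k+1) = w(k) + eta e(k) Phi'(.) x(k)
```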
When deriving a direct gradient algorithm for a general neural network, the network
architecture should be taken into account. For large networks for offline processing,
classical backpropagation is the most convenient algorithm. However, for online learn-
ing, extensions of the previous algorithm should be considered.
In this section we discuss in more detail the inter-relations between the error, error
function and objective function in learning theory.
The error at the output of an adaptive system is defined as the difference between
the output value of the network and the target (desired output) value. For instance,
or as an average value
Ē(N) = (1/(N + 1)) Σ_{i=0}^{N} e²(i). (2.40)
The objective function is a function that we want to minimise during training. It can
be equal to an error function, but often it may include other terms to introduce con-
straints. For instance in generalisation, too large a network might lead to overfitting.
Hence the objective function can consist of two parts, one for the error minimisa-
tion and the other which is either a penalty for a large network or a penalty term for
excessive increase in the weights of the adaptive system or some other chosen function
(Tikhonov et al. 1998). An example of such an objective function for online learning
is
J(k) = (1/N) Σ_{i=1}^{N} (e²(k − i + 1) + G(‖w(k − i + 1)‖₂²)), (2.41)
where G is some linear or nonlinear function. We often use symbols E and J inter-
changeably to denote the cost function.
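As an illustration of an objective of the form (2.41), the sketch below chooses G as a simple weight-decay penalty, G(u) = λu; both this choice and the parameter value are assumptions, since the text leaves G general.

```python
import numpy as np

def sliding_window_objective(errors, weights, lam=1e-3):
    """Objective of the form (2.41) over the last N samples.

    errors  : array of the errors e(k - i + 1), i = 1, ..., N
    weights : corresponding weight vectors w(k - i + 1)
    G(u) = lam * u is an assumed penalty on the squared weight norm.
    """
    N = len(errors)
    penalty = np.array([lam * np.dot(w, w) for w in weights])   # G(||w||_2^2)
    return float(np.sum(np.asarray(errors) ** 2 + penalty) / N)
```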
2.7.3 Types of Learning with Respect to the Training Set and Objective Function
Batch learning
Batch learning is also known as epochwise, or offline learning, and is a common
strategy for offline training. The idea is to adapt the weights once the whole training
set has been presented to an adaptive system. It can be described by the following
steps.
1. Initialise the weights
2. Repeat
• Pass all the training data through the network
• Accumulate the total error over the training set
• Update the weights once, based upon the total error
• Until a chosen stopping criterion (for instance a sufficiently small error) is satisfied
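The two strategies differ only in where the weight update sits relative to the loop over the data, as the sketch below shows; the linear model and squared error are assumed stand-ins for 'the network' and its error function.

```python
import numpy as np

def batch_epoch(w, X, D, eta):
    """One epoch of batch learning: accumulate the gradient over the whole
    training set, then update the weights once."""
    grad = np.zeros_like(w)
    for x, d in zip(X, D):
        e = d - x @ w
        grad += -e * x            # gradient contribution of this pattern
    return w - eta * grad         # single update per epoch

def pattern_epoch(w, X, D, eta):
    """One sweep of pattern (online) learning: update after every example."""
    for x, d in zip(X, D):
        e = d - x @ w
        w = w + eta * e * x       # immediate update after each pattern
    return w
```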
The choice of the type of learning is very much dependent upon application. Quite
often, for networks that need initialisation, we perform one type of learning in the
initialisation procedure, which is by its nature an offline procedure, and then use some
other learning strategy while the network is running. Such is the case with recurrent
neural networks for online signal processing (Mandic and Chambers 1999f).
Network growing starts from a small network and, if the learning error is too big, new hidden units are added to the network, training resumes, and so on. The most used algorithm based upon network growing is the so-called cascade-
correlation algorithm (Hoehfeld and Fahlman 1992). Network pruning starts from a
large network and if the error in learning is smaller than allowed, the network size is
reduced until the desired ratio between accuracy and network size is reached (Reed
1993; Sum et al. 1999).
Systems of this type arise in a wide variety of situations. For a linear σ, we have a
linear system. If the range of σ is finite, the state vector of (2.42) takes values from
a finite set, and dynamical properties can be analysed in time which is polynomial in
the number of possible states. Throughout this book we are interested in functions, σ,
and combination matrices, A, which would guarantee a fixed point of this mapping.
Neural networks are commonly of the form (2.42). In such a context we call σ the
activation function. Results of Siegelmann and Sontag (1995) show that saturated
linear systems (piecewise linear) can represent Turing machines, which is achieved by
encoding the transition rules of the Turing machine in the matrix A.
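As a toy illustration of a mapping of this form possessing a fixed point, the sketch below iterates x ← σ(Ax) with σ = tanh and a small-norm (contractive) matrix A; both choices are assumptions made for the example.

```python
import numpy as np

A = 0.4 * np.array([[0.5, -0.2],
                    [0.3,  0.4]])    # small-norm matrix, so the map is a contraction
x = np.array([1.0, -1.0])

for _ in range(100):
    x_new = np.tanh(A @ x)           # one application of the mapping x -> sigma(A x)
    if np.linalg.norm(x_new - x) < 1e-10:
        break
    x = x_new
print(x)   # converges to the unique fixed point (the origin, since tanh(0) = 0)
```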
2. Rescaling, which means transforming the input data by multiplying/dividing them by a constant and also adding/subtracting a constant from the data. 5
4. Principal component analysis (PCA) represents the data by a set of unit norm
vectors called normalised eigenvectors. The eigenvectors are positioned along
the directions of greatest data variance. The eigenvectors are found from the
covariance matrix R of the input dataset. An eigenvalue λi , i = 1, . . . , N , is
associated with each eigenvector. Every input data vector is then represented
by a linear combination of eigenvectors.
As pointed out earlier, standardising input variables has an effect on training, since
steepest descent algorithms are sensitive to scaling due to the change in the weights
being proportional to the value of the gradient and the input data.
5 In real life a typical rescaling is transforming the temperature from Celsius into Fahrenheit scale.
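The standardisation and PCA steps listed above can be sketched as follows; the synthetic data are an assumption, while the eigen-decomposition of the covariance matrix follows the description above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.0, 0.2, 0.3]])   # correlated inputs

# Standardisation: zero mean and unit variance for every input variable
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA: unit-norm eigenvectors of the covariance matrix point along the
# directions of greatest data variance; each input vector is then
# represented by its coordinates in that eigenvector basis
R = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues lambda_i and eigenvectors
order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
X_pca = X_std @ eigvecs[:, order]         # data expressed in the PCA basis
```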
The training of a network makes use of two sequences, the sequence of inputs and the
sequence of corresponding desired outputs. If the network is first trained (with a train-
ing sequence of finite length) and subsequently used (with the fixed weights obtained
from training), this mode of operation is referred to as non-adaptive (Nerrand et al.
1994). Conversely, the term adaptive refers to the mode of operation whereby the net-
work is trained permanently throughout its application (with a training sequence of
infinite length). Therefore, the adaptive network is suitable for input processes which
exhibit statistically non-stationary behaviour, a situation which is normal in the fields
of adaptive control and signal processing (Bengio 1995; Haykin 1996a; Haykin and
Li 1995; Khotanzad and Lu 1990; Narendra and Parthasarathy 1990; Nerrand et al.
1994).
The computation of the coefficients during training aims at finding a system whose
operation is optimal with respect to some performance criterion which may be either
qualitative, e.g. (subjective) quality of speech reconstruction, or quantitative, e.g.
maximising signal to noise ratio for spatial filtering. The goal is to define a positive
training function which is such that a decrease of this function through modifications
of the coefficients of the network leads to an improvement of the performance of the
system (Bengio 1995; Haykin and Li 1995; Nerrand et al. 1994; Qin et al. 1992). In the
case of non-adaptive training, the training function is defined as a function of all the
data of the training set (in such a case, it is usually termed as a cost function). The
minimum of the cost function corresponds to the optimal performance of the system.
Training is an optimisation procedure, conventionally using gradient-based methods.
In the case of adaptive training, it is impossible, in most instances, to define a
time-independent cost function whose minimisation leads to a system that is optimal
with respect to the performance criterion. Therefore, the training function is time
dependent. The modification of the coefficients is computed continually from the
gradient of the training function. The latter involves the data pertaining to a time
window of finite length, which shifts in time (sliding window) and the coefficients are
updated at each sampling time.
A supervised learning algorithm performs learning by using a teaching signal, i.e. the
desired output signal, while an unsupervised learning algorithm, as in blind signal
processing, has no reference signal as a teaching input signal. An example of a super-
vised learning algorithm is the delta rule, while unsupervised learning algorithms are,
for example, the reinforcement learning algorithm and the competitive rule (‘winner
takes all’) algorithm, whereby there is some sense of concurrency between the elements
of the network structure (Bengio 1995; Haykin and Li 1995).
Updating the network weights by pattern learning means that the weights of the
network are updated immediately after each pattern is fed in. The other approach is
to take all the data as a whole batch, and the network is not updated until the entire
batch of data is processed. This approach is referred to as batch learning (Haykin and
Li 1995; Qin et al. 1992).
It can be shown (Qin et al. 1992) that while considering feedforward networks
(FFN), after one training sweep through all the data, the pattern learning is a first-
order approximation of the batch learning with respect to the learning rate η. There-
fore, the FFN pattern learning approximately implements the FFN batch learning
after one batch interval. After multiple sweeps through the training data, the dif-
ference between the FFN pattern learning and FFN batch learning is of the order 6 O(η²). Therefore, for small training rates, the FFN pattern learning approximately
implements FFN batch learning after multiple sweeps through the training data. For
recurrent networks, the weight updating slopes for pattern learning and batch learn-
ing are different 7 (Qin et al. 1992). However, the difference could also be controlled
by the learning rate η. The difference will converge to zero as quickly as η goes to
zero 8 (Qin et al. 1992).
The hierarchical levels in neural network architectures are synapses, neurons, layers
and neural networks, and will be discussed in Chapter 5. The next step would be
combinations of neural networks. In this case we consider modular neural networks.
Modular neural networks are composed of a set of smaller subnetworks (modules),
each performing a subtask of the complete problem. To depict this problem, let us have recourse to the case of linear adaptive filters described by a transfer function in the
6 In fact, if the data being processed exhibit highly stationary behaviour, then the average error
calculated after FFN batch learning is very close to the instantaneous error calculated after FFN
pattern learning, e.g. the speech data can be considered as being stationary within an observed frame.
That forms the basis for use of various real-time and recursive learning algorithms, e.g. RTRL.
7 It can be shown (Qin et al. 1992) that for feedforward networks, the updated weights for both
pattern learning and batch learning adapt at the same slope (derivative dw/dη) with respect to the
learning rate η. For recurrent networks, this is not the case.
8 In which case we have a very slow learning process.
Another Random Document on
Scribd Without Any Related Topics
to state the reasons which led me to think no fight would take place, for
doing so would have been to betray confidence. And so we parted company
—they to feast their eyes on a bombardment—and if they only are near
enough to see it they will heartily regret their curiosity, or I am mistaken—
and we to return to Mobile.
It was dark before the Diana was well down off Fort Pickens again, and,
as she passed out to sea between it and Fort M’Rae, it was certainly to have
been expected that one side or other would bring her to. Certainly our friend
Mr. Brown in his clipper Oriental would overhaul us outside, and there lay a
friendly bottle in a nest of ice waiting for the gallant sailor who was to take
farewell of us according to promise. Out we glided into night and into the
cool sea breeze, which blew fresh and strong from the north. In the distance
the black form of the Powhatan could be just distinguished; the rest of the
squadron could not be made out by either eye or glass, nor was the schooner
in sight. A lantern was hoisted by my orders, and was kept aft for some time
after the schooner was clear of the forts. Still no schooner. The wind was
not very favorable for running toward the Powhatan, and it was too late to
approach her with perfect confidence from the enemy’s side. Besides, it was
late; time pressed. The Oriental was surely lying off somewhere to the
westward, and the word was given to make sail, and soon the Diana was
bowling along shore, where the sea melted away in a fiery line of foam so
close to us that a man could, in nautical phrase, “shy a biscuit” on the sand.
The wind was abeam, and the Diana seemed to breathe it through her sails,
and flew along at an astonishing rate through the phosphorescent waters
with a prow of flame and a bubbling wake of dancing meteor-like streams
flowing from her helm, as though it were a furnace whence boiled a stream
of liquid metal. “No sign of the Oriental on our lee-bow?” “Nothin’ at all in
sight, sir.” The sharks and huge rays flew off from the shore as we passed
and darted out seaward, marking their runs in brilliant trails of light. On
sped the Diana, but no Oriental came in sight.
I was tired. The sun had been very hot; the ride through the batteries, the
visits to quarters, the excursion to Pickens, had found out my weak places,
and my head was aching and legs fatigued, and so I thought I would turn in
for a short time, and I dived into the shades below, where my comrades
were already sleeping, and kicking off my boots, lapsed into a state which
rendered me indifferent to the attentions no doubt lavished upon me by the
numerous little familiars who recreate in the well-peopled timbers. It never
entered into my head, even in my dreams, that the captain would break the
blockade if he could—particularly as his papers had not been indorsed, and
the penalties would be sharp and sure if he were caught. But the confidence
of coasting captains in the extraordinary capabilities of their craft is a
madness—a hallucination so strong that no danger or risk will prevent their
acting upon it whenever they can. I was assured once by the “captain” of a
Billyboy, that he could run to windward of any frigate in Her Majesty’s
service, and there is not a skipper from Hartlepool to Whitstable who does
not believe his own Mary Ann or Three Grandmothers is, on certain “pints,”
able to bump her fat bows and scuttle-shaped stern faster through the seas
than any clipper which ever flew a pendant. I had been some two hours and
a half asleep, when I was awakened by a whispering in the little cabin.
Charley, the negro cook, ague-stricken with terror, was leaning over the
bed, and in broken French was chattering through his teeth: “Monsieur,
Monsieur, nous sommes perdus! Le batement de guerre nous poursuit. Il n’a
pas encore tiré. Il va tirer bientot! Oh, mon Dieu! mon Dieu!” Through the
hatchway I could see the skipper was at the helm, glancing anxiously from
the compass to the quivering reef-points of his mainsail. “What’s all this we
hear, captain?” “Well, sir, there’s been somethin’ a runnin’ after us these
two hours” (very slowly). “But I don’t think he’ll keech us up no how this
time.” “But, good heavens! you know it may be the Oriental, with Mr.
Brown on board.” “Ah, wall—may bee. But he kept quite close up on me in
the dark—it gave me quite a stark when I seen him. May be, says I, he’s a
privateerin’ chap, and so I draws in on shore close as I cud,—gets mee
centre-board in, and, says I, I’ll see what yer med of, mee boy. He an’t a
gaining much on us.” I looked, and sure enough, about half or three-
quarters of a mile astern, and somewhat to leeward of us, a vessel, with sails
and hull all blended into a black lump, was standing on in pursuit. I strained
my eyes and furbished up the glasses, but could make out nothing definite.
The skipper held grimly on. The shore was so close we could have almost
leaped into the surf, for the Diana, when her centre-board is up, does not
draw much over four feet. “Captain, I think you had better shake your wind,
and see who he is. It may be Mr. Brown.” “Meester Brown or no I can’t
help carrine on now. I’d be on the bank outside in a minit if I didn’t hold my
course.” The captain had his own way; he argued that if it was the Oriental
she would have fired a blank gun long ago to bring us to; and as to not
calling us when the sail was discovered he took up the general line of the
cruelty of disturbing people when they’re asleep. Ah! captain, you knew
well it was Mr. Brown, as you let out when we were off Fort Morgan. By
keeping so close in shore in shoal water the Diana was enabled to creep
along to windward of the stranger, who evidently was deeper than
ourselves. See there! Her sails shiver! so one of the crew says; she’s struck!
But she’s off again, and is after us. We are just within range, and one’s eyes
become quite blinky, watching for the flash from the bow, but, whether
privateer or United States schooner, she was too magnanimous to fire. A
stern chase is a long chase. It must now be somewhere about two in the
morning. Nearer and nearer to shore creeps the Diana. “I’ll lead him into a
pretty mess, whoever he is, if he tries to follow me through the Swash,”
grins the skipper. The Swash is a very shallow, narrow, and dangerous
passage into Mobile Bay, between the sand-banks on the east of the main
channel and the shore. The Diana is now only some nine or ten miles from
Fort Morgan, guarding the entrance to Mobile. Soon an uneasy dancing
motion welcomes her approach to the Swash. “Take a cast of the lead,
John!” “Nine feet.” “Good! Again!” “Seven feet.” “Good—Charley, bring
the lantern.” (Oh, Charley, why did that lantern go out just as it was wanted,
and not only expose us to the most remarkable amount of “cussin’,”
imprecation, and strange oaths our ears ever heard, but expose our lives and
your head to more imminent danger?) But so it was, just at the critical
juncture when a turn of the helm port or starboard made the difference,
perhaps, between life and death, light after light went out, and the captain
went dancing mad after intervals of deadly calmness, as the mate sang out,
“Five feet and a half! seven feet—six feet—eight feet—five feet—four feet
and a half—(Oh, Lord!)—six feet,” and so on, through a measurement of
death by inches, not at all agreeable. And where was Mr. Brown all this
time? Really, we were so much interested in the state of the lead-line, and in
the very peculiar behavior of the lanterns which would not burn, that we
scarcely cared much when we heard from the odd hand and Charley that she
had put about, after running aground once or twice, they thought, as soon as
we entered the Swash, and had vanished rapidly in the darkness. It was little
short of a miracle that we got past the elbow, for just at the critical moment,
in a channel not more than a hundred yards broad, with only six feet of
water, the binnacle light, which had burned steadily for a minute, sank with
a sputter into black night. When the passage was accomplished, the captain
relieved his mind by chasing Charley into a corner, and with a shark, which
he held by the tail, as the first weapon that came to hand, inflicting on him
condign punishment, and then returning to the helm. Charley, however,
knew his master, for he slyly seized the shark and flung his defunct corpse
overboard before another fit of passion came on, and by the morning the
skipper was good friends with him, after he had relieved himself, by a series
of castigations of the negligent lamplighter with every variety of
Rhadamanthine implement.
The Diana had thus distinguished her dirty little person by breaking a
blockade, and giving an excellent friend of ours a great deal of trouble (if it
was, indeed, Mr. Brown), as well as giving us a very unenviable character
for want of hospitality and courtesy; and, for both, I beg to apologize with
this account of the transaction. But she had a still greater triumph. As she
approached Fort Morgan, all was silence. The morning was just showing a
gray streak in the east. “Why, they’re all asleep at the fort,” observed the
indomitable captain, and, regardless of guns or sentries, down went his
helm, and away the Diana thumped into Mobile Bay, and stole off in the
darkness toward the opposite shore. There was, however, a miserable day
before us. When the light fairly broke we had got only a few miles inside, a
stiff northerly wind blew right in our teeth, and the whole of the blessed day
we spent in tacking backward and forward between one low shore and
another low shore, in water the color of pea-soup, so that temper and
patience were exhausted, and we were reduced to such a state that we took
intense pleasure in meeting with a drowning alligator. He was a nice-
looking young fellow about ten feet long, and had evidently lost his way,
and was going out to sea bodily, but it would have been the height of
cruelty to take him on board our ship miserable as he was, though he passed
within two yards of us. There was to be sure the pleasure of seeing Mobile
in every possible view, far and near, east and west, and in a lump and run
out, but it was not relished any more than our dinner, which consisted of a
very gamy Bologna sausage, pig who had not decided whether he would be
pork or bacon, and onions fried in a terrible preparation of Charley the
cook. At five in the evening, however, having been nearly fourteen hours
beating about twenty-seven miles, we were landed at an outlying wharf, and
I started off for the Battle House and rest. The streets are filled with the
usual rub-a-dub-dubbing bands, and parades of companies of the citizens in
grotesque garments and armament, all looking full of fight and secession. I
write my name in the hotel book at the bar as usual. Instantly young
Vigilance Committee, who has been resting his heels high in air, with one
eye on the staircase and the other on the end of his cigar, stalks forth and
reads my style and title, and I have the satisfaction of slapping the door in
his face as he saunters after me to my room, and looks curiously in to see
how a man takes off his boots. They are all very anxious in the evening to
know what I think about Pickens and Pensacola, and I am pleased to tell the
citizens I think it will be a very tough affair on both whenever it comes. I
proceed to New Orleans on Monday.
It is reported that the patrols are strengthened, and I could not help hearing
a charming young lady say to another, the other evening, that “she would
not be afraid to go back to the plantation, though Mrs. Brown Jones said she
was afraid her negroes were after mischief.”
There is a great scarcity of powder, which is one of the reasons, perhaps,
why it has not yet been expended as largely as might be expected from the
tone and temper on both sides. There is no sulphur in the States; nitre and
charcoal abound. The sea is open to the North. There is no great overplus of
money on either side. In Missouri, the interest on the state debt, due in July,
will be used to procure arms for the state volunteers to carry on the war. The
South is preparing for the struggle by sowing a most unusual quantity of
grain; and in many fields corn and maize have been planted instead of
cotton. “Stay laws,” by which all inconveniences arising from the usual
dull, old-fashioned relations between debtor and creditor are avoided (at
least by the debtor), have been adopted in most of the seceding states. How
is it that the state legislatures seem to be in the hands of the debtors and not
of the creditors?
There are some who cling to the idea that there will be no war after all, but
no one believes that the South will ever go back of its own free will, and the
only reason that can be given by those who hope rather than think in that
way is to be found in the faith that the North will accept some mediation,
and will let the South go in peace. But could there—can there be peace?
The frontier question—the adjustment of various claims—the demands for
indemnity, or for privileges or exemptions, in the present state of feeling,
can have but one result. The task of mediation is sure to be as thankless as
abortive. Assuredly the proffered service of England would, on one side at
least, be received with something like insult. Nothing but adversity can
teach these people its own most useful lessons. Material prosperity has
puffed up the citizens to an unwholesome state. The toils and sacrifices of
the old world have been taken by them as their birthright, and they have
accepted the fruits of all that the science, genius, suffering, and trials of
mankind in time past have wrought out, perfected, and won as their own
peculiar inheritance, while they have ignorantly rejected the advice and
scorned the lessons with which these were accompanied.
May 23.—The Congress at Montgomery, having sat with closed doors
almost since it met, has now adjourned till July the 20th, when it will
reassemble at Richmond, in Virginia, which is thus designated, for the time,
capital of the Confederate States of America. Richmond, the principal city
of the Old Dominion, is about one hundred miles in a straight line south by
west of Washington. The rival capitals will thus be in very close proximity
by rail and by steam, by land and by water. The movement is significant. It
will tend to hasten a collision between the forces which are collected on the
opposite sides of the Potomac. Hitherto, Mr. Jefferson Davis has not
evinced all the sagacity and energy, in a military sense, which he is said to
possess. It was bad strategy to menace Washington before he could act. His
secretary of war, Mr. Walker, many weeks ago, in a public speech,
announced the intention of marching upon the capital. If it was meant to do
so, the blow should have been struck silently. If it was not intended to seize
upon Washington, the threat had a very disastrous effect on the South, as it
excited the North to immediate action, and caused General Scott to
concentrate his troops on points which present many advantages in the face
of any operations which may be considered necessary along the lines either
of defence or attack. The movement against the Norfolk navy-yard
strengthened Fortress Monroe, and the Potomac and Chesapeake were
secured to the United States. The fortified ports held by the Virginians and
the Confederate States troops are not of much value as long as the streams
are commanded by the enemy’s steamers; and General Scott has shown that
he has not outlived either his reputation or his vigor by the steps, at once
wise and rapid, he has taken to curb the malcontents in Maryland, and to
open his communications through the city of Baltimore. Although immense
levies of men may be got together, on both sides, for purposes of local
defence or for state operations, it seems to me that it will be very difficult to
move these masses in regular armies. The men are not disposed for regular,
lengthened service, and there is an utter want of field trains, equipment, and
commissariat, which cannot be made good in a day, a week, or a month.
The bill passed by the Montgomery Congress, entitled “An act to raise an
additional military force to serve during the war,” is, in fact, a measure to
put into the hands of the government the control of irregular bodies of men,
and to bind them to regular military service. With all their zeal, the people
of the South will not enlist. They detest the recruiting sergeant, and Mr.
Davis knows enough of war to feel hesitation in trusting himself in the field
to volunteers. The bill authorizes Mr. Davis to accept volunteers who may
offer their services, without regard to the place of enlistment, “to serve
during the war, unless sooner discharged.” They may be accepted in
companies, but Mr. Davis is to organize them into squadrons, battalions, or
regiments, and the appointment of field and staff officers is reserved
especially to him. The company officers are to be elected by the men of the
company, but here again Mr. Davis reserves to himself the right of veto, and
will only commission those officers whose election he approves.
The absence of cavalry and the deficiency of artillery may prevent either
side obtaining any decisive results in one engagement; but, no doubt, there
will be great loss whenever these large masses of men are fairly opposed to
each other in the field. Of the character of the Northern regiments I can say
nothing more from actual observation; nor have I yet seen, in any place,
such a considerable number of the troops of the Confederate States, moving
together, as would justify me in expressing any opinion with regard to their
capacity for organized movements, such as regular troops in Europe are
expected to perform. An intelligent and trustworthy observer, taking one of
the New York state militia regiments as a fair specimen of the battalions
which will fight for the United States, gives an account of them which leads
me to the conclusion that such regiments are much superior, when furnished
by the country districts, to those raised in the towns and cities. It appears, in
this case at least, that the members of the regular militia companies in
general send substitutes to the ranks. Ten of these companies form the
regiment, and, in nearly every instance, they have been doubled in strength
by volunteers. Their drill is exceedingly incomplete, and in forming the
companies there is a tendency for the different nationalities to keep
themselves together. In the regiment in question the rank and file often
consists of quarrymen, mechanics, and canal boatmen, mountaineers from
the Catskill, bark peelers, and timber cutters—ungainly, square-built,
powerful fellows, with a Dutch tenacity of purpose crossed with an English
indifference to danger. There is no drunkenness and no desertion among
them. The officers are almost as ignorant of military training as their men.
The colonel, for instance, is the son of a rich man in his district, well
educated, and a man of travel. Another officer is a shipmaster. A third is an
artist; others are merchants and lawyers, and they are all busy studying
“Hardee’s Tactics,” the best book for infantry drill in the United States. The
men have come out to fight for what they consider the cause of the country,
and are said to have no particular hatred of the South, or of its inhabitants,
though they think they are “a darned deal too high and mighty, and require
to be wiped down considerably.” They have no notion as to the length of
time for which their services will be required, and I am assured that not one
of them has asked what his pay is to be.
Reverting to Montgomery, one may say without offence that its claims to
be the capital of a republic which asserts that it is the richest, and believes
that it will be the strongest in the world, are not by any means evident to a
stranger. Its central position, which has reference rather to a map than to the
hard face of matter, procured for it a distinction to which it had no other
claim. The accommodations which suited the modest wants of a state
legislature vanished or were transmuted into barbarous inconveniences by
the pressure of a central government, with its offices, its departments, and
the vast crowd of applicants which flocked thither to pick up such crumbs
of comfort as could be spared from the executive table. Never shall I forget
the dismay of myself, and of the friends who were travelling with me, on
our arrival at the Exchange Hotel, under circumstances with some of which
you are already acquainted. With us were men of high position, members of
Congress, senators, ex-governors, and General Beauregard himself. But to
no one was greater accommodation extended than could be furnished by a
room held, under a sort of ryot-warree tenure, in common with a
community of strangers. My room was shown to me. It contained four large
four-post beds, a ricketty table, and some chairs of infirm purpose and
fundamental unsoundness. The floor was carpetless, covered with litter of
paper and ends of cigars, and stained with tobacco juice. The broken glass
of the window afforded no ungrateful means of ventilation. One gentleman
sat in his shirt sleeves at the table reading the account of the marshalling of
the Highlanders at Edinburgh in the Abbottsford edition of Sir Walter Scott;
another, who had been wearied, apparently, by writing numerous
applications to the government for some military post, of which rough
copies lay scattered around, came in, after refreshing himself at the bar, and
occupied one of the beds, which, by the bye, were ominously provided with
two pillows apiece. Supper there was none for us in the house, but a search
in an outlying street enabled us to discover a restaurant, where roasted
squirrels and baked opossums figured as luxuries in the bill of fare. On our
return we found that due preparation had been made in the apartment by the
addition of three mattresses on the floor. The beds were occupied by
unknown statesmen and warriors, and we all slumbered and snored in
friendly concert till morning. Gentlemen in the South complain that
strangers judge of them by their hotels, but it is a very natural standard for
strangers to adopt, and in respect to Montgomery it is almost the only one
that a gentleman can conveniently use, for if the inhabitants of this city and
its vicinity are not maligned, there is an absence of the hospitable spirit
which the South lays claim to as one of its animating principles, and a little
bird whispered to me that from Mr. Jefferson Davis down to the least
distinguished member of his government there was reason to observe that
the usual attentions and civilities offered by residents to illustrious
stragglers had been “conspicuous for their absence.” The fact is, that the
small planters who constitute the majority of the land-owners are not in a
position to act the Amphytrion, and that the inhabitants of the district can
scarcely aspire to be considered what we would call gentry in England, but
are a frugal, simple, hog-and-hominy living people, fond of hard work and,
occasionally, of hard drinking.