Abdelwaheb Hannachi
Patterns
Identification
and Data Mining
in Weather and
Climate
Springer Atmospheric Sciences
The Springer Atmospheric Sciences series seeks to publish a broad portfolio
of scientific books, aiming at researchers, students, and everyone interested in
this interdisciplinary field. The series includes peer-reviewed monographs, edited
volumes, textbooks, and conference proceedings. It covers the entire area of
atmospheric sciences including, but not limited to, Meteorology, Climatology,
Atmospheric Chemistry and Physics, Aeronomy, Planetary Science, and related
subjects.
Patterns Identification
and Data Mining in Weather
and Climate
Abdelwaheb Hannachi
Department of Meteorology, MISU
Stockholm University
Stockholm, Sweden
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To the memory of my father and mother who
taught me big principles, and to my little
family Houda, Badr, Zeid and Ahmed for
their patience.
Preface
Weather and climate constitute a fascinating system, which affects our daily lives and is
closely interlinked with the environment, society and infrastructure. They have a large
impact on our lives and activities, climate change being a typical example. It is a
high-dimensional, highly complex system involving nonlinear interactions between
very many modes or degrees of freedom. This made weather and climate look
mysterious in ancient societies. Complex high-dimensional systems are difficult to
comprehend with our three-dimensional concept of the physical world. Humans have
sought out patterns in order to describe the workings of the world around us. This task,
however, has proved difficult and challenging.
In the climate context, the quest to identify patterns is driven by the desire to
find structures embedded in state space, which can lead to a better understanding
of the system dynamics, and eventually learn its behaviour and predict its future
state. With the advent of computers and observing systems, massive amounts of data
from the atmosphere and ocean are obtained, which beg for exploration and analysis.
Pattern identification in atmospheric science has a long history. It began in the 1920s
with Gilbert Walker, who identified the southern oscillation and the atmospheric
component of ENSO teleconnection, although the latter concept seems to have been
mentioned for the first time by Ångström in the mid-1930s. The correlation analysis
of Gilbert Walker used to identify the southern oscillation is akin to the iterative
algorithm used to compute empirical orthogonal functions. However, the earliest
known eigenanalysis in atmospheric science goes back to the former USSR school
of Obukhov and Bagrov, around the late 1940s and early 1950s,
respectively. But it was Ed. Lorenz who coined the term ‘empirical orthogonal
functions’ (EOFs) in the mid-1950s. Since then, research on the topic has been
expanding, and a number of textbooks have been written, notably by Preisendorfer
in the late 1980s, followed by texts by Thiebaux, and von Storch and Zwiers about
a decade later, and another one by Jolliffe in the early 2000s. These texts did an
excellent job in presenting the theory and methods, particularly those related to
eigenvalue problems in meteorology and oceanography.
Weather and climate data analysis has witnessed a fast growth in the last few
decades both in terms of methods and applications. This growth was driven by the
need to analyse and interpret the fast-growing volume of climate data using both
linear and nonlinear methods. In this book, I attempt to give an up-to-date text by
presenting linear and nonlinear methods that have been developed in the last two
decades, in addition to including conventional ones.
The text is composed of 17 chapters. Apart from the first two introductory and
setting up chapters, the remaining 15 chapters present the different methods used to
analyse spatio-temporal data from atmospheric science and oceanography. The EOF
method, a cornerstone of eigenvalue problems in meteorology and oceanography, is
presented in Chap. 3. The next four chapters present derivatives of EOFs, including
eigenvalue problems involved in the identification of propagating features. A whole
chapter is devoted to predictability and predictable patterns, and another one on
multidimensional scaling, which discusses various dissimilarity measures used in
pattern identification, followed by a chapter on factor analysis. Nonlinear methods
of space-time pattern identification, with different perspectives, are presented in the
next three chapters. The previous chapters deal essentially with discrete gridded
data, as is usually the case, with no explicit discussion of the continuous case, such
as curves or surfaces. This topic is presented and discussed in the next chapter.
Another whole chapter is devoted to presenting and discussing the topic of coupled
patterns using conventional and newly developed approaches. A number of other
methods are not presented in the previous chapters. Those methods are collected
and presented in the penultimate chapter. Finally, and to take into account the recent
interest in automatic methods, the last chapter presents and discusses a few commonly
used methods in machine learning. To make it a stand-alone text, a number of
technical appendices are given at the end of the book.
This book can be used in teaching data analysis in atmospheric science, or
other topics such as advanced statistical methods in climate research. Apart from
Chap. 15, in the context of coupled patterns and regression, and Appendix C, I
did not discuss explicitly statistical modelling/inference. This topic of statistical
inference in climate science is covered in a number of other books reported in
the reference list. To help students and young researchers in the field explore the
topics, I have included a number of small exercises, with hints, embedded within
the different chapters, in addition to skeleton Matlab codes for some basic methods.
Full Matlab codes can be obtained from the author upon request.
A list of software links is also given at the end of the book.
Over the last few decades, we have amassed an enormous amount of weather and climate
data of which we have to make sense now. Pattern identification methods and modern data
mining approaches are essential in better understanding how the atmosphere and the climate
system works. These topics are not traditionally taught in meteorology programmes. This
book will prove a valuable source for students as well as active researchers interested
in these topics. The book provides a broad overview over modern pattern identification
methods and an introduction to machine learning.
– Christian Franzke, ICCP, Pusan National University
The topic of EOFs and associated pattern identification in space-time data sets has gone
through an extraordinary fast development, both in terms of new insights and the breadth
of applications. For this reason, we need a text approximately every 10 years to summarize
the field. Older texts by, for instance, Jolliffe and Preisendorfer need to be succeeded by
an up-to-date new text. We welcome this new text by Abdel Hannachi who not only has a
deep insight in the field but has himself made several contributions to new developments in
the last 15 years.
– Huug van den Dool, Climate Prediction Center, NCEP, College Park, MD
Now that weather and climate science is producing ever larger and richer data sets, the topic
of pattern extraction and interpretation has become an essential part. This book provides an
up-to-date overview of the latest techniques and developments in this area.
– Maarten Ambaum, Department of Meteorology, University of Reading, UK
The text is very ambitious. It makes considerable effort to collect together a number
of classical methods for data analysis, as well as newly emerging ones addressing the
challenges of the modern huge data sets. There are not many books covering such a
wide spectrum of techniques. In this respect, the book is a valuable companion for many
researchers working in the field of climate/weather data analysis and mining. The author
deserves congratulations and encouragement for his enormous work.
– Nickolay Trendafilov, Open University, Milton Keynes
This nicely and expertly written book covers a lot of ground, ranging from classical
linear pattern identification techniques to more modern machine learning methodologies,
all illustrated with examples from weather and climate science. It will be very valuable both
as a tutorial for graduate and postgraduate students and as a reference text for researchers
and practitioners in the field.
– Frank Kwasniok, College of Engineering, Mathematics and Physical Sciences,
University of Exeter
We will show them Our signs in the horizons and within
themselves until it becomes clear to them that it is the truth
Holy Quran Ch. 41, V. 53
Acknowledgements
This text is a collection of work I have been conducting over the years on weather
and climate data analysis, in collaboration with colleagues, complemented with
other methods from the literature. I am especially grateful to all my teachers,
colleagues and students, who contributed directly or indirectly to this work. I would
like to thank, in particular, Zoubeida Bargaoui, Bernard Legras, Keith Haines, Ian
Jolliffe, David B. Stephenson, Nickolay Trendafilov, Christian Franzke, Thomas
Önskog, Carlos Pires, Tim Woollings, Klaus Fraedrich, Toshihiko Hirooka, Grant
Branstator, Lesley Gray, Alan O’Neill, Waheed Iqbal, Andrew Turner, Andy Heaps,
Amro Elfeki, Ahmed El-Hames, Huug van den Dool, Léon Chafik and all my MISU
colleagues, and many other colleagues I did not mention by name. I acknowledge the
support of Stockholm University and the Springer team, in particular Robert Doe,
executive editor, and Neelofar Yasmeen, production coordinator, for their support
and encouragement.
Contents
1 Introduction
  1.1 Complexity of the Climate System
  1.2 Data Exploration, Data Mining and Feature Extraction
  1.3 Major Concern in Climate Data Analysis
    1.3.1 Characteristics of High-Dimensional Space Geometry
    1.3.2 Curse of Dimensionality and Empty Space Phenomena
    1.3.3 Dimension Reduction and Latent Variable Models
    1.3.4 Some Problems and Remedies in Dimension Reduction
  1.4 Examples of the Most Familiar Techniques
2 General Setting and Basic Terminology
  2.1 Introduction
  2.2 Simple Visualisation Techniques
  2.3 Data Processing and Smoothing
    2.3.1 Preliminary Checking
    2.3.2 Smoothing
    2.3.3 Simple Descriptive Statistics
  2.4 Data Set-Up
  2.5 Basic Notation/Terminology
    2.5.1 Centring
    2.5.2 Covariance Matrix
    2.5.3 Scaling
    2.5.4 Sphering
    2.5.5 Singular Value Decomposition
  2.6 Stationary Time Series, Filtering and Spectra
    2.6.1 Univariate Case
    2.6.2 Multivariate Case
References
Index
Chapter 1
Introduction
1 The total mass m of the atmosphere is of the order of $O(10^{22})$ gr. The total number of molecules $(m/m_a)N_a$ is of the order $O(10^{45})$. The constants $N_a$ and $m_a$ are respectively the Avogadro number $6.023 \times 10^{23}$ and the molar air mass, i.e. the mass of $N_a$ molecules (29 gr).
2 The quotation appears in the section “More from the Notebooks of Lazarus Long” of Robert A. Heinlein’s novel. Some sources, however, attribute the quotation to the American writer and lecturer Samuel L. Clemens, known by his pen name Mark Twain (1835–1910), although this seems unlikely, since the concept of climate as the average of weather was not readily available around 1900.
Fig. 1.1 Illustration of a highly simplified paradigm for the weather/climate system. The shape of the marbles in the containers describes the probability density function of the system
The weather/climate system of our rotating Earth consists of the evolution of the
coupled Atmosphere–Land–Ocean–Ice system driven by radiative forcing from the
Sun as well as from the earth’s interior, e.g. volcanoes. The climate, as a complex
nonlinear dynamical system, varies on a myriad of interacting space/time scales. It
is characterised by its high number of degrees of freedom (dof) and their complex
nonlinear interactions. It also displays nontrivial sensitivity to initial as well as
boundary conditions. Weather and climate at one location can be related to those at
another distant location, that is, weather and climate are not local but have a global
character. This is known as teleconnections in atmospheric science, and represents
links between climate anomalies occurring at one specific location and at large
distances. They can be seen as patterns connecting widely separated regions, such as
the El-Niño Southern Oscillation (ENSO), the North Atlantic Oscillation (NAO)
and the Pacific North America (PNA) pattern. More discussions on teleconnections
are given in the beginning of Chap. 3.
In practice, various climate variables, such as sea level pressure, wind field and
ozone concentrations are measured at different time intervals and at various spatial
locations. In general, however, these measurements are sparse both in space and
time. Climate models are usually used, via data assimilation techniques, to produce
regular data in space and time, known as the “reanalyses”. The analysis of climate
data is not solely restricted to reanalyses but also includes other observed records,
e.g. balloon measurements, satellite irradiances, in situ recordings such as rain
gauges, ice cores for carbon dating, etc. Model simulations are also extensively used
for research purposes, e.g. for investigating/understanding physical mechanisms,
studying anthropogenic climate change and climate prediction, and also for climate
model validation, etc.
The recent explosive growth in the amount of observed and simulated atmo-
spheric data that are becoming available to the climate scientist has created an ever
increasing need for new mathematical data analysis tools that enable researchers
to extract the most out of these data in order to address key questions relating to
major concerns in climate dynamics. Understanding weather and climate involves
a genuine investigation of the nature of nonlinearities and causality in the system.
Predicting the climate is also another important reason that drives climate research.
Because of the high dimensionality involved in the weather/climate system, advanced
tools are required to analyse and hence understand various aspects of the system. A
major objective in climate research is the identification of major climate modes,
patterns, regimes, etc. This is precisely one of the most challenging problems in
data mining/feature extraction.
In climate research and other scientific fields we are faced with large datasets,
typically multivariate time series with large dimensionality where the objective
is to identify or find out interesting or more prominent patterns of variability.
A basic step in the analysis of multivariate data is exploration (Tukey 1977).
Exploratory data analysis (EDA) provides tools for hypothesis formulation and
feature selection (Izenman 2008). For example, according to Seltman (2018) one
reads: “Loosely speaking, any method of looking at data that does not include formal
statistical modeling and inference falls under the term exploratory data analysis.”
For moderately large to large multivariate data, simple EDA, such as scatter plots
and boxplots, may not be possible, and “advanced” EDA is needed. This includes for
instance the identification (or extraction) of structures, patterns, trends, dimension
reduction, etc. For some authors (e.g. Izenman 2008) this may be categorised as
“descriptive data mining”, to distinguish it from “predictive data mining”, which is
based on model building including classification, regression and machine learning.
The following quotes show examples of excerpts of data mining from various
sources. For instance, according to the glossary of data mining3 we read:
• “data mining is an information extraction activity whose goal is to discover
hidden facts contained in data bases. Using a combination of machine learning,
statistical analysis, modelling techniques and database technology, data mining
finds patterns and subtle relationships in data and infers rules that allow the
prediction of future state”.
3 http://twocrows.com/data-mining/dm-glossary/.
4 http://www.pcc.qub.ac.uk/tec/courses/datamining.
5 https://home.kku.ac.th/wichuda/DMining/ClementineUsersGuide_11.1.pdf.
$$\sum_{k=1}^{d} x_k^2 = r^2. \qquad (1.1)$$
Note that in the Euclidean three-dimensional space the hypersphere is simply the
usual sphere. The definition of hyperspheres allows another set of coordinates,
namely spherical coordinates, in which a point $(x_1, x_2, \ldots, x_d)$ can be defined using
an equivalent set of coordinates $(r, \theta_1, \ldots, \theta_{d-1})$, where $r \ge 0$, $-\frac{\pi}{2} \le \theta_k \le \frac{\pi}{2}$
for $1 \le k \le d-2$, and $0 \le \theta_{d-1} \le 2\pi$. The relationship between the two sets of
coordinates is given by:
6 That is, $\left|\left(\frac{\partial x_k}{\partial \theta_l}\right)_{kl}\right|$, for $k = 1, \ldots, d$ and $l = 0, \ldots, d-1$, where $\theta_0 = r$.
$$V_d^\circ(r) = \int_0^r \int_{-\pi/2}^{\pi/2} \cdots \int_{-\pi/2}^{\pi/2} \int_0^{2\pi} \rho^{d-1} \cos^{d-2}\theta_1 \cdots \cos^2\theta_{d-3}\, \cos\theta_{d-2}\; d\theta_{d-1}\, d\theta_{d-2} \cdots d\theta_1\, d\rho \qquad (1.3)$$

and yields

$$V_d^\circ(r) = C_\circ(d)\, r^d, \qquad (1.4)$$

where

$$C_\circ(d) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} = \frac{2\pi}{d}\, C_\circ(d-2). \qquad (1.5)$$
Table 1.1 shows the values of $C_\circ(d)$ for the first 8 values of d.
Comparing the volume $V_d(2r) = 2^d r^d$ of the hypercube of side 2r to that of the
hypersphere $V_d^\circ(r)$, we see that both hypervolumes depend exponentially on the
linear scale r but with totally different coefficients. For instance, the coefficient
$C_\circ(d)$ in Eq. (1.5) is not monotonic and tends to zero rapidly for large dimensions.
The volume of a hypersphere of any radius will decrease towards zero when increasing
the dimension d from the moment $d > \pi r^2$. The decrease is sharper when the space
dimension is even. The corresponding coefficient $2^d$ for the hypercube, on the
other hand, increases monotonically with d.
The previous result seems paradoxical, and one might ask what happens to the
content of the hyperspheres in high dimensions, and where does it go? To answer
this question, let us first look at the concentration of the hypercube content. Consider
the d-dimensional hypercube of side 2r and the inscribed hypersphere of radius r
(Fig. 1.2). The fraction of the residual volume $V_d(2r) - V_d^\circ(r)$ to the hypercube
volume goes to one, i.e.

$$\lim_{d \to \infty} \frac{V_d(2r) - V_d^\circ(r)}{V_d(2r)} = \lim_{d \to \infty} \left(1 - \frac{C_\circ(d)}{2^d}\right) = 1.$$
Hence the content of the hypersphere becomes more and more concentrated close
to its surface. A direct consequence of this is that uniformly distributed data in the
hypersphere or hypercube are mostly concentrated near their edges. In other words, to
sample uniformly in a hypercube or hypersphere (with large dimensions) we need
extremely large sample sizes.
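As a quick numerical illustration of Eqs. (1.4)–(1.5), the following Matlab lines (a minimal sketch of ours, not taken from the text; all variable names are illustrative) compute the coefficient $C_\circ(d)$ and the ratio of the volume of the inscribed hypersphere to that of the hypercube of side 2r, which collapses rapidly with the dimension:

>> d  = 1:20;                            % space dimensions
>> Co = pi.^(d/2) ./ gamma(d/2 + 1);     % C_o(d) = pi^(d/2)/Gamma(d/2+1), Eq. (1.5)
>> ratio = Co ./ 2.^d;                   % V_d^o(r)/V_d(2r), independent of r
>> semilogy(d, ratio, 'o-'), grid on
>> xlabel('dimension d'), ylabel('hypersphere to hypercube volume ratio')

For d = 10 the ratio is already below 0.3%, consistent with the concentration of the hypercube content near its corners discussed above.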
Exercise Using Eq. (1.1) compute the area $S_d^\circ(a)$ of the hypersphere of dimension
d and radius a.
Answer Use the fact that $V_d^\circ(a) = \int_0^a S_d^\circ(\rho)\, d\rho$ and, keeping in mind that $S_d^\circ(a) = a^{d-1} S_d^\circ(1)$, this yields

$$S_d^\circ(a) = \frac{d\, \pi^{d/2}\, a^{d-1}}{\Gamma\left(\frac{d}{2}+1\right)}.$$
sphere is $r_3 = \sqrt{3} - 1$. Extending the same procedure to the d-dimensional
hypercube of side 4 and its corresponding $2^d$ unit hyperspheres yields $r_d = \sqrt{d} - 1$. Hence
for $d \ge 10$, the small inner hypersphere reaches outside the hypercube. Note also,
as pointed out earlier, that the volume of this inner hypersphere goes to zero as d
increases.
• Diagonals in hypercubes (Scott 1992)
Consider the d-dimensional hypercube $[-1, 1]^d$. Any diagonal vector $\mathbf{v}$ joining the
origin to one of the corners is of the form $(\pm 1, \pm 1, \ldots, \pm 1)^T$. Now the angle θ
between v and any basis vector, i.e. coordinate axis $\mathbf{e}_i$, is given by:

$$\cos\theta = \frac{\mathbf{v} \cdot \mathbf{e}_i}{\|\mathbf{v}\|\, \|\mathbf{e}_i\|} = \frac{\pm 1}{\sqrt{d}}, \qquad (1.8)$$
which goes to zero as d goes to infinity. Hence the diagonals are nearly orthogonal
to the coordinate axes in high-dimensional spaces.7 An important consequence is
that any data that tend to cluster near the diagonals in hyperspaces will be mapped
onto the origin in every paired scatter plot. This points to the importance of the
choice of the coordinate system in multivariate data analysis.
• Consequence for the multivariate Gaussian distribution (Scott 1992; Carreira-
Perpiñán 2001)
7 Note that because of the square root, this is true for very large dimensions d, e.g. $d \ge 10^3$.
Equiprobable sets from (1.9) are hyperspheres, and the origin is the mode of the
distribution. Now consider the set of points y within the hypersphere associated
with a pdf of εf (0), i.e. points satisfying f (y) ≥ εf (0) for small ε. The probability
of this set is
$$Pr\left(\|\mathbf{x}\|^2 \le -2\log\varepsilon\right) = Pr\left(\chi_d^2 \le -2\log\varepsilon\right). \qquad (1.10)$$
For a given ε this probability falls off rapidly as d increases. This decrease becomes
sharp after d = 5. Hence the probability of points not in the tail decreases rapidly
as d increases. Consequently, unlike our prediction from low-dimensional intuition,
most of the mass of the multivariate normal in high-dimensional spaces is in the tail.
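A small Matlab sketch (ours, not the book's; it uses only the base function gammainc, writing the chi-square cdf as a regularised incomplete gamma function) illustrates Eq. (1.10), i.e. how quickly the probability of the non-tail set shrinks with dimension for a fixed ε:

>> eps0 = 0.01;                      % a fixed small epsilon
>> d    = 1:20;                      % dimensions
>> x    = -2*log(eps0);              % squared radius of the equiprobability set
>> P    = gammainc(x/2, d/2);        % Pr(chi2_d <= -2 log eps), Eq. (1.10)
>> plot(d, P, 'o-'), grid on
>> xlabel('dimension d'), ylabel('probability of the non-tail set')

With ε = 0.01 this probability drops from about 0.998 for d = 1 to roughly 0.5 for d = 10 and only a few percent for d = 20.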
of the order of a century of data record. In various climate analysis studies we use
at least 4 to 5 dimensions, which translates into an astronomical sample size record,
e.g. a million years.
The moral is that if $n_1$ is the sample size required in one dimension, then in
d dimensions we require $n_d = n_1^d$, i.e. a power law in the linear dimension. The
curse of dimensionality, coined by Bellman (1961), refers to this phenomenon,
i.e. the sample size needed to have an accurate estimate of a function in a high-
dimensional space grows exponentially with the number of variables or the space
dimension. The curse of dimensionality also refers to the paradox of neighbourhood
in high-dimensional spaces, i.e. empty space phenomenon (Scott and Thompson
1983). Local neighbourhoods are almost empty, whereas nonempty neighbourhoods
are certainly nonlocal (Scott 1992). This is to say that high-dimensional spaces are
inherently empty or sparse.
Example For the uniform distribution in the unit 10-dimensional hypersphere the
probability of a point falling in the hypersphere of radius 0.9 is only 0.35 whereas
the remaining 0.65 probability is for points on the outer shell of thickness 0.1.
The above example shows that density estimation in high dimensions can be
problematic. This is because regions of relatively very low density can contain
considerable part of the distribution whereas regions of apparently high density
can be completely devoid of observations in samples of moderate sizes (Silverman
1986). For example, 70% of the mass of the standard normal distribution is within
one standard deviation of the mean, whereas in 10 dimensions the same domain
contains only 0.02% of the mass, and one has to go to a radius of more than three
standard deviations to reach 70%. Consequently, and contrary to our intuition, the tails are
much more important in high dimensions than in low dimensions. This has a serious
consequence, namely the difficulty in probability density function (pdf) estimation
in high-dimensional spaces (see e.g. Silverman 1986). Therefore since most density
estimation methods are based on local concepts, e.g. local averages (Silverman
1986), in order to find enough neighbours in high dimensions, the neighbourhood
has to extend far beyond local neighbours and hence locality is lost (Carreira-
Perpiñán 2001) since local neighbourhoods are mostly empty.
The above discussion finds natural application in the climate system. If, for
example, we are interested in studying a phenomenon, such as the El-Niño Southern
Oscillation (ENSO) using, say, observations from 40 grid points of monthly sea
surface temperature (SST) data in the Tropical Pacific region, then theoretically one
would necessarily need a sample size of $O(10^{40})$. This metaphoric observational
period is far beyond the age of the Universe.8 This is another facet of the emptiness
phenomenon related to the inherent sparsity of high-dimensional spaces. As has
been pointed out earlier, this has a direct consequence on the probability distribution
of high-dimensional data. For example, in the one-dimensional case the probability
density function of the uniform distribution over $[-1, 1]$ is a box of height $2^{-1}$,
whereas in 10 dimensions the hyperbox height is only $2^{-10} \approx 9.8 \times 10^{-4}$.
Given the complexity, e.g. high dimensionality, involved in the climate system one
major challenge in atmospheric data analysis is the reduction of the system size.
The basic objective behind dimension reduction is to enable data mapping onto a
lower-dimensional space where data analysis, feature extraction, visualisation, inter-
pretation, . . ., etc. become manageable. Figure 1.5 shows a schematic representation
of the target/objective of data analysis, namely knowledge or KDD.
In probabilistic terminology, the previous concepts have become known as latent
variable problems. Latent variable models are probabilistic models that attempt
8 In practice, of course, data are not totally independent, and the sample size required is far less than the theoretical one.
9 This high dimensionality can arise from various causes, such as uncertainty related, for example, to nonlinearity and stochasticity, e.g. measurement errors.
Some Remedies
Some of the previous problems can be tackled using some or a combination of the
following approaches:
• Occam’s razor or model simplicity.
• Parsimony principle.
• Arbitrary choices of e.g. the latent (hidden) dimensions.
• Prior physical knowledge of the process under investigation. For example,
when studying tropical climate one can make use of the established ENSO
phenomenon linking the tropical Pacific SST to the see-saw in the large scale
pressure system.
10 Although it is also possible to perform an EOF analysis of more than one field, for example SST
and surface air temperature combined (Kutzbach 1967). This has been labelled combined principal
component analysis by Bretherton et al. (1992).
Chapter 2
General Setting and Basic Terminology
Abstract This chapter introduces some basic terminologies that are used in
subsequent chapters. It also presents some basic summary statistics of data sets and
reviews basic methods of data filtering and smoothing.
2.1 Introduction
It was not until the early part of the twentieth century that correlation started to
be used in meteorology by Gilbert Walker1 (Walker 1909, 1923, 1924; Walker and
Bliss 1932). It is fair to say that most of the multivariate climate data analyses are
based mainly on the analysis of the covariance between the observed variables of the
system. The concept of covariance in atmospheric science has become so important
that it is routinely used in climate analysis. Data, however, have to be processed
before getting to this slightly advanced stage. Some of the common processing
techniques are listed below.
In their basic form multivariate data are normally composed of many unidimen-
sional time series. A time series is a sequence of values x1 , x2 . . . xn , in which each
datum represents a specific value of the variable x. In probabilistic terms x would
represent a random variable, and the value xi is the ith realisation of x in some
experimental set-up. In everyday language xt represents the observation at time t of
the variable x.
In order to get a basic idea about the data, one has to be able to see at least some
of their aspects. Plotting some aspects of climate data is therefore a recommended
first step in the analysis. Certainly this cannot be done for the entire sample; for
example, simple plots for certain “key” variables could be very useful. Trends, for
instance, are examples of aspects that are best inspected visually before quantifying
them.
Various plotting techniques exist for the purpose of visualisation. Examples
include:
• Simple time series plots—this constitutes perhaps the primary step to data
exploration.
• Single/multiple scatter plots between pairs of variables—these simple plots
provide information on the relationships between various pairs of variables.
• Histogram plots—they are a very useful first step towards exploring the distribu-
tions of individual variables (see e.g. Silverman 1986).
1 The modern concept of correlation can be traced as far back as the late nineteenth century with
Galton (1885), see e.g. Stigler (1986). The use of the concept of correlation is actually older than
Galton’s (1885) paper and goes back to 1823 with the German mathematician Carl Friedrich Gauss
who developed the normal surface of N correlated variates. The term “correlation” appears to
have been first quoted by Auguste Bravais, a French naval officer and astronomer who worked on
bivariate normal distributions. The concept was also used later in 1868 by Charles Darwin, Galton’s
cousin, and towards the end of the nineteenth century, Pearson (1895) defined the (Galton-)
Pearson’s product-moment correlation coefficient. See Rodgers and Nicewander (1988) for some
details and Pearson (1920) for an account on the history of correlation. Rodgers and Nicewander
(1988) list thirteen ways to interpret the correlation coefficients.
• Contour plots of variables in two dimensions—contour plots are also very useful
in analysing, for example, individual maps or exploring smoothed histograms
between two variables.
• Box plots—these are very useful visualisation tools (Tukey 1977) used to display
and compare the distributions of an observed sample for a number of variables.
Other useful methods, such as sunflower scatter plots (Cleveland and McGill
1984), Chernoff faces (Chernoff 1973), brushing (Venables and Ripley 1994;
Cleveland 1993) and colour histograms (Wegman 1990) are also often used in high-
dimensional data exploration and visualisation. A list of these and other methods
with a brief description and further references can be found in Martinez and
Martinez (2002), see also Everitt (1978).
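As a simple illustration of the basic visualisation tools listed above, the following Matlab sketch (ours, with synthetic data; note that boxplot requires the Statistics Toolbox) produces a time series plot, a scatter plot, a histogram and box plots:

>> n = 500;  t = (1:n)';
>> x = 0.01*t + sin(2*pi*t/12) + randn(n,1);  % trend + seasonal cycle + noise
>> y = 0.5*x + randn(n,1);                    % a second, correlated variable
>> subplot(2,2,1), plot(t, x),      title('time series plot')
>> subplot(2,2,2), plot(x, y, '.'), title('scatter plot')
>> subplot(2,2,3), histogram(x),    title('histogram')
>> subplot(2,2,4), boxplot([x y]),  title('box plots')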
Climate data are the direct result of experimentation, which translates via our senses
into observations or (in situ) measurements and therefore represents information. By
their very nature, data are always subject to uncertainties and are hence deeply rooted
in probabilistic concepts. It is in general recommended to process the data before
engaging in advanced analysis techniques. The following list provides familiar
examples of processing steps that are normally applied at each grid point.
• Checking for missing/outlier values—these constitute simple processing tech-
niques that are routinely applied to data. For example, unexpectedly large values
or outliers can either be removed or replaced. Missing values are normally
interpolated using observations from the neighbourhood.
• Detrending—if the data indicate evidence of a trend, linear or polynomial, then it
is in general recommended to detrend the data, by calculating the trend and then
removing it.
• Deseasonalising—seasonality constitutes one of the major sources of variability
in climate data and is ubiquitous in most climate time series. For example, with
monthly data a smooth seasonal cycle can be estimated by fitting a sine wave.
Alternatively, the seasonal cycle can be obtained by the collection of the 12
monthly averages.
A more advanced way is to apply Fourier analysis and consider the few leading
sine waves as the seasonal cycle. The deseasonalised data are then obtained by
subtracting the seasonal cycle from the original (and perhaps detrended) data. If
the seasonal component is thought to change over time, then techniques based on
X11, for example, can be used. This technique is based on a local fit of a seasonal
component using a simple moving average. Pezzulli et al. (2005) have investigated
the spectral properties of the X11 filter and applied it to sea surface temperature.
The method uses a Henderson filter for the moving average and provides a more
flexible alternative to the standard (constant) seasonality.
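The following minimal Matlab sketch (ours; it assumes a monthly series starting in January with a length that is a multiple of 12) illustrates the detrending and deseasonalising steps described above, using a linear trend fit and the 12 monthly means as the seasonal cycle:

>> n  = 240;  t = (1:n)';                          % 20 years of monthly data
>> x  = 0.002*t + 2*cos(2*pi*t/12) + randn(n,1);   % synthetic series: trend + cycle + noise
>> p  = polyfit(t, x, 1);                          % linear trend coefficients
>> xd = x - polyval(p, t);                         % detrended series
>> sc = mean(reshape(xd, 12, n/12), 2);            % seasonal cycle: 12 monthly means
>> xa = xd - repmat(sc, n/12, 1);                  % deseasonalised (anomaly) series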
2.3.2 Smoothing
Moving Average
$$y_k = \frac{1}{M} \sum_{i=k}^{k+M-1} x_i, \qquad (2.1)$$
Exponential Smoothing
Unlike the simple moving average where the weights are uniform, the exponential
smoothing uses an exponentially decaying weighting function of the past observa-
tions as
$$y_k = (1-\phi) \sum_{i=0}^{\infty} \phi^i x_{k-i}, \qquad (2.2)$$
2 Similar to fitting a probability model where the data is decomposed as data = fit + residuals.
3 Depending on the calendar month; 28, 29, 30 or 31.
for an infinite time series. The smoothing parameter φ satisfies $|\phi| < 1$, and the
closer $|\phi|$ is to one, the smoother the obtained curve. The coefficient $(1-\phi)$ in Eq. (2.2)
is introduced to make the weights sum to one. In practice, for finite time series, the
previous equation is truncated to yield
$$y_k = \frac{1-\phi}{1-\phi^{m+1}} \sum_{i=0}^{m} \phi^i x_{k-i}, \qquad (2.3)$$
where m depends on k.
Spline Smoothing
Unlike moving averages or exponential smoothing, which are locally linear, splines
(Appendix A) provide a nonlinear smoothing using polynomial fitting. The most
popular spline is the cubic spline based on fitting a twice continuously differentiable
piece-wise cubic polynomial. If xk = x(tk ) is the observed time series at time tk ,
k = 1, . . . n, then the cubic spline f () is defined by
(i) $f(t) = f_k(t) = a_k + b_k t + c_k t^2 + d_k t^3$ for t in the interval $[t_k, t_{k+1}]$, $k = 1, \ldots, n-1$.
(ii) at each point $t_k$, $f()$ and its first two derivatives are continuous, i.e. $\frac{d^\alpha}{dt^\alpha} f_k(t_k) = \frac{d^\alpha}{dt^\alpha} f_{k-1}(t_k)$, $\alpha = 0, 1, 2$.
Remark The cubic spline (Appendix A) can also be obtained from an optimisation
problem.
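In Matlab, an interpolating cubic spline satisfying conditions (i)–(ii) can be evaluated with the built-in spline function (a small sketch of ours, with synthetic observation times tk and values xk):

>> tk = (1:20)';  xk = sin(tk/3) + 0.2*randn(20,1);   % synthetic observed series
>> tq = linspace(tk(1), tk(end), 500);                % finer query times
>> yq = spline(tk, xk, tq);                           % piece-wise cubic spline through (tk, xk)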
Kernel Smoothing
The kernel smoothing is a global weighted average and is often used to estimate
pdfs, see Appendix A for more advanced smoothing methods. The weights are
obtained as the value of a specific kernel function, e.g. exponential or Gaussian,
applied to the target point. Designate again by xk = x(tk ), k = 1, . . . n the finite
sample time series, and the smoothed time series is given by
$$y_l = \sum_{i=1}^{n} \kappa_{li}\, x_i, \qquad (2.4)$$

where $\kappa_{li} = K\left(\frac{t_i - t_l}{h}\right)$, and $K()$ is the smoothing kernel. The most widely used kernel is the Gaussian kernel.
The parameter h in κli is a smoothing parameter and plays a role equivalent to that
of a window width.
The list provided here is not exhaustive, and other methods exist, see e.g.
Chatfield (1996) or Tukey (1977).
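To fix ideas, here is a minimal Matlab sketch (ours, not the book's code) applying the three smoothers above to a noisy series: a trailing M-term moving average, the recursive form of the exponential smoother of Eq. (2.2), and a Gaussian kernel smoother in the spirit of Eq. (2.4), with the kernel weights normalised so that they sum to one:

>> n = 300;  t = (1:n)';  x = sin(2*pi*t/50) + 0.5*randn(n,1);  % noisy signal
>> M   = 11;   yma = filter(ones(M,1)/M, 1, x);     % trailing moving average
>> phi = 0.9;  yex = filter(1-phi, [1 -phi], x);    % exponential smoothing: y_k = phi*y_{k-1} + (1-phi)*x_k
>> h   = 5;    K   = exp(-0.5*((t - t').^2)/h^2);   % Gaussian kernel weights
>> K   = K ./ sum(K, 2);                            % normalise rows to sum to one
>> yks = K*x;                                       % kernel-smoothed series
>> plot(t, x, ':', t, yma, t, yex, t, yks)
>> legend('raw', 'moving average', 'exponential', 'kernel')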
Once the gridded data have been processed, advanced techniques can be applied
depending on the specific objective of the analysis. In general, the processed data
have to be written as an array to facilitate computational procedures, and this is
presented in the next section. But before that, let us define first a few characteristics
of time series.
The sample mean and variance of the time series are given respectively by

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad (2.5)$$

and

$$s_x^2 = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2. \qquad (2.6)$$
See Appendix B for the properties of these estimators. The sample standard
deviation of the time series is $s_x$. The time series is termed centred when the mean
is removed. When the time series is scaled by its standard deviation, it is termed
standardised and consequently has unit variance. Sometimes when the time series is
centred and standardised, it is termed scaled. Often the time series is supposed to be
a realisation of some random variable X with cumulative distribution function (cdf)
FX () with finite mean μX and finite variance σX2 (Appendices B and C). In this case
the sample mean and variance x and sx2 are regarded as estimates of the (population)
mean and variance μX and σX2 , respectively. Now let the time series be sorted into
x(1) ≤ x(2) ≤ . . . ≤ x(n) , and then the following function
$$\hat{F}_X(u) = \begin{cases} 0 & \text{if } u < x_{(1)} \\ \frac{k}{n} & \text{if } x_{(k)} \le u < x_{(k+1)} \\ 1 & \text{if } u \ge x_{(n)} \end{cases} \qquad (2.7)$$
The sample covariance between two time series $x_t$ and $y_t$, $t = 1, \ldots, n$, is

$$c_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}). \qquad (2.8)$$
Similarly, the sample correlation coefficient rxy between the two time series is the
covariance between the corresponding scaled time series, i.e.
$$r_{xy} = \frac{c_{xy}}{s_x s_y}. \qquad (2.9)$$
The (Spearman) rank correlation between the two time series is

$$\rho_r = 1 - \frac{6}{n(n^2-1)} \sum_{t=1}^{n} d_t^2, \qquad (2.10)$$

where $d_t$ is the difference between the ranks of $x_t$ and $y_t$.
It can be seen that the transform of the sample $x_1, x_2, \ldots, x_n$ using the empirical
distribution function (edf) $\hat{F}_X()$ in (2.7) is precisely $\frac{p_1}{n}, \frac{p_2}{n}, \ldots, \frac{p_n}{n}$, where
$p_1, p_2, \ldots, p_n$ are the ranks of the time series, and similarly for the second time
series y1 , y2 . . . , yn . Therefore, the rank correlation is an estimator of the correlation
corr (FX (X), FY (Y )) between the transformed uniform random variables FX (X)
and FY (Y ).
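The following short Matlab sketch (ours; it assumes series with no tied values, so that the ranks are unambiguous) computes the sample covariance (2.8), the correlation (2.9) and the rank correlation (2.10) for two synthetic series:

>> n = 200;
>> x = randn(n,1);  y = 0.6*x + 0.8*randn(n,1);      % two correlated series
>> cxy = sum((x - mean(x)).*(y - mean(y)))/(n-1);    % sample covariance, Eq. (2.8)
>> rxy = cxy/(std(x)*std(y));                        % correlation, Eq. (2.9)
>> [~, ix] = sort(x);  px(ix) = 1:n;                 % ranks of x
>> [~, iy] = sort(y);  py(iy) = 1:n;                 % ranks of y
>> d   = px(:) - py(:);                              % rank differences d_t
>> rho = 1 - 6*sum(d.^2)/(n*(n^2 - 1));              % rank correlation, Eq. (2.10)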
Most analysis methods in climate are described in a matrix form, which is the
essence of multivariate analyses. A given spatio-temporal field, e.g. sea level pres-
sure, is composed of a multivariate time series, where each time series represents the
values of the field X at a given spatial location, e.g. grid point s, taken at different
times4 t, denoted by X(s, t). The spatial locations are often represented by grid points
that are regularly spaced on the spherical earth at a given vertical level. For example,
a continuous time series at the jth grid point $s_j$ can be denoted $x_j(t)$, where t spans a
given period. The resulting continuous field then represents a multivariate, or vector,
time series:
$$\mathbf{x}(t) = \left(x_1(t), \ldots, x_p(t)\right)^T.$$
When the observations are sampled at discrete times, e.g. t = t1 , t2 , . . . tn , one gets
a finite sample xk , k = 1, . . . n, of our field, where xk = x(tk ). In our set-up the
j’th grid point sj represents the j’th variable. Now if we assume that we have p such
variables, then the sampled field X can be represented by an array X = (xij ), or
data matrix, as
$$\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)^T = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix}. \qquad (2.11)$$
Suppose now that we have another observed field Y = (yij ), observed at the same
times as X but perhaps at different grid points s∗k , k = 1, . . . q, think, for example,
of sea surface temperature. Then one can form the combined field obtained by
combining both data matrices Z = [X, Y] as
$$\mathbf{Z} = (z_{ij}) = \begin{pmatrix} x_{11} & \ldots & x_{1p} & y_{11} & \ldots & y_{1q} \\ x_{21} & \ldots & x_{2p} & y_{21} & \ldots & y_{2q} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{n1} & \ldots & x_{np} & y_{n1} & \ldots & y_{nq} \end{pmatrix}. \qquad (2.13)$$
This combination is useful when, for example, one is looking for combined patterns
such as empirical orthogonal functions (Jolliffe 2002; Hannachi et al. 2007). The
ordering or set-up of the data matrix shown in (2.11) or (2.13) where the temporal
component is treated as observation and the space component as variable is usually
referred to as S-mode convention. In the alternative convention, namely the T-mode,
the previous roles are swapped (see e.g. Jolliffe 2002).
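As a practical illustration (a Matlab sketch with a synthetic field; the array names are ours), a gridded field stored as a latitude × longitude × time array can be arranged into the S-mode data matrix of Eq. (2.11) by unfolding the two spatial dimensions into a single variable dimension:

>> nlat = 10;  nlon = 20;  nt = 100;
>> field = randn(nlat, nlon, nt);            % synthetic lat x lon x time field
>> X = reshape(field, nlat*nlon, nt)';       % S-mode data matrix: nt x (nlat*nlon)
>> % In the T-mode convention the roles are swapped, i.e. one would work with X'.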
2.5.1 Centring
Since the observed field is a realisation of some multivariate random variable, one
can speak of the mean μ of the field, also known as climatology, which is the
expectation of x, i.e.
$$\boldsymbol{\mu} = E(\mathbf{x}) = \int \mathbf{x}\, p(\mathbf{x})\, dx_1 \ldots dx_p. \qquad (2.14)$$
where 1n = (1, 1, . . . , 1)T is the column vector of length n containing only ones
and In is the n × n identity matrix. The Matlab command to compute the mean of X
and get the anomalies is
>> [n p] = size(X);           % n: number of times, p: number of grid points
>> Xbar = mean(X,1);          % 1 x p climatology (time mean of each column)
>> Xc = X-ones(n,1)*Xbar;     % n x p matrix of centred anomalies
$$\mathbf{S} = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T = \frac{1}{n-1} \mathbf{X}_c^T \mathbf{X}_c. \qquad (2.18)$$
If we designate

$$\mathbf{D} = \mathrm{Diag}\,(\boldsymbol{\Sigma}) = \begin{pmatrix} \sigma_{11}^2 & 0 & \ldots & 0 \\ 0 & \sigma_{22}^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{pp}^2 \end{pmatrix}, \qquad (2.20)$$
5 The coefficient $\frac{1}{n-1}$ used in (2.15) instead of $\frac{1}{n}$ is to make the estimate unbiased, but the difference in practice is in general insignificant.
then we get the correlation matrix

$$\mathbf{C} = \mathbf{D}^{-\frac{1}{2}}\, \boldsymbol{\Sigma}\, \mathbf{D}^{-\frac{1}{2}}. \qquad (2.21)$$
2.5.3 Scaling
The scaled data matrix is defined by

$$\mathbf{X}_s = \mathbf{X}\, \mathbf{D}^{-\frac{1}{2}}, \qquad (2.22)$$
so each variable in Xs is unit variance, but the correlation structure among the
variables has not changed. Note that the centred and scaled data matrix is
$$\mathbf{X}_{cs} = \mathbf{X}_c\, \mathbf{D}^{-\frac{1}{2}}. \qquad (2.23)$$
Note also that the correlation matrix of X is the covariance of the scaled data
matrix Xs .
2.5.4 Sphering
In (2.24) the power $-\frac{1}{2}$ represents the inverse of the square root6 of the covariance matrix. The covariance matrix
of $\mathbf{X}_\circ$ is the identity matrix, i.e. $\frac{1}{n} \mathbf{X}_\circ^T \mathbf{X}_\circ = \mathbf{I}_p$. Because sphering destroys the first
6 The square root of a symmetric matrix $\mathbf{A}$ is a matrix $\mathbf{R}$ such that $\mathbf{R}\mathbf{R}^T = \mathbf{A}$. The square root of $\mathbf{A}$, however, is not unique since for any orthogonal matrix $\mathbf{Q}$, i.e. $\mathbf{Q}\mathbf{Q}^T = \mathbf{Q}^T\mathbf{Q} = \mathbf{I}$, the matrix $\mathbf{R}\mathbf{Q}$ is also a square root. The standard square root is obtained via a congruential relationship with respect to orthogonality and is obtained using the singular value decomposition theorem.
two moments of the data, it can be useful when the covariance structure in the data is
not desired, e.g. when we are interested in higher order moments such as skewness.
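A minimal Matlab sketch (ours) of the scaling and sphering operations, using the symmetric square root of the sample covariance matrix obtained from its eigendecomposition:

>> n = 200;  p = 5;
>> X  = randn(n, p)*randn(p, p);               % synthetic correlated data
>> Xc = X - repmat(mean(X,1), n, 1);           % centred anomalies
>> S  = Xc'*Xc/(n-1);                          % sample covariance matrix, Eq. (2.18)
>> D  = diag(diag(S));                         % diagonal matrix of variances
>> Xs = Xc/sqrt(D);                            % scaled data: unit-variance columns
>> [V, L] = eig(S);                            % S = V*L*V'
>> X0 = Xc*V*diag(1./sqrt(diag(L)))*V';        % sphered data: covariance close to identity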
The singular value decomposition (see also Appendix D) is a powerful tool that
decomposes any n × p matrix X into the product of two orthogonal matrices and a
diagonal matrix as
$$\mathbf{X} = \mathbf{U}\, \boldsymbol{\Lambda}\, \mathbf{V}^T,$$
$$\mathbf{X} = \sum_{k=1}^{r} \lambda_k\, \mathbf{u}_k \mathbf{v}_k^T, \qquad (2.25)$$
where uk and vk , k = 1, . . . r, are, respectively, the left and right singular vectors of
X and r is the rank of X.
The SVD theorem can also be extended to the complex case. If X is an $n \times p$
complex matrix, we have a similar decomposition to (2.25), i.e. $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^{*T}$, where
now U and V satisfy $\mathbf{U}^{*T}\mathbf{U} = \mathbf{V}^{*T}\mathbf{V} = \mathbf{I}_r$ and the superscript (*) denotes the
complex conjugate.
Application
If $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T$ is the SVD decomposition of the $n \times p$ (centred) data matrix X, then
$\sum_k \mathbf{u}_k \mathbf{u}_k^T = \mathbf{I}_n$ and the covariance matrix is $\mathbf{S} = \frac{1}{n-1} \sum_k \lambda_k^2\, \mathbf{v}_k \mathbf{v}_k^T$. The Matlab
routine is called svd, which provides all the singular values and associated singular
vectors.
>> [u s v] = svd(X);     % u, v: left/right singular vectors; s: diagonal matrix of singular values
The routine svds is more economical and provides a preselected number of singular
values (see Chap. 3).
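As a quick check (a Matlab sketch of ours) of the link between the SVD of the centred data matrix and the covariance matrix of Eq. (2.18):

>> n = 100;  p = 6;
>> Xc = randn(n, p);  Xc = Xc - repmat(mean(Xc,1), n, 1);   % centred data
>> [U, L, V] = svd(Xc, 'econ');                             % Xc = U*L*V'
>> S1 = Xc'*Xc/(n-1);                                       % covariance, Eq. (2.18)
>> S2 = V*(L.^2)*V'/(n-1);                                  % same matrix from the SVD
>> disp(norm(S1 - S2))                                      % ~ 0 up to round-off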
Let us consider a continuous stationary time series (or signal) x(t) with autocovari-
ance function γx () and spectral density function fx () (see Appendix C). A linear
filter L is a linear operator transforming x(t) into a filtered time series y(t) = Lx(t).
This linear filter can be written formally as a convolution, i.e.
$$y(t) = Lx(t) = \int h(u)\, x(t-u)\, du, \qquad (2.26)$$
where the function h() is known as the transfer function of the filter or its impulse
response function. The reason for this terminology is that if x(t) is an impulse,
i.e. a Dirac delta function, then the response is simply h(t). From (2.26), the
autocovariance function γy () of the filtered time series is
$$\gamma_y(\tau) = \int\!\!\int h(u)\, h(v)\, \gamma_x(\tau + u - v)\, du\, dv. \qquad (2.27)$$
Taking the Fourier transform of (2.27), the spectral density function of the response
y(t) is

$$f_y(\omega) = |H(\omega)|^2 f_x(\omega), \qquad (2.28)$$

where
$$H(\omega) = \int h(u)\, e^{-iu\omega}\, du = |H(\omega)|\, e^{i\phi(\omega)} \qquad (2.29)$$

is the Fourier transform of the transfer function and is known as the frequency
response function. Its amplitude $|H(\omega)|$ is the gain of the filter, and $\phi(\omega)$ is its phase.
In the discrete case the transfer function is simply a linear combination of Dirac
pulses as
$$h(u) = \sum_k a_k\, \delta_k \qquad (2.30)$$
giving as output
$$y_t = \sum_k a_k\, x_{t-k}. \qquad (2.31)$$
The frequency response function is then the discrete Fourier transform of h() and is
given by
$$H(\omega) = \frac{1}{2\pi} \sum_k a_k\, e^{-i\omega k}. \qquad (2.32)$$
Exercise
1. Derive the frequency response function of the moving average filter (2.1).
2. Derive the same function for the exponential smoothing filter (2.2).
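Hint (for part 1 above; a sketch using the convention of Eq. (2.32), with the window aligned so that the weights are $a_k = 1/M$ for $k = 0, \ldots, M-1$): summing the geometric series gives

$$H(\omega) = \frac{1}{2\pi M} \sum_{k=0}^{M-1} e^{-i\omega k} = \frac{1}{2\pi M}\, e^{-i\omega (M-1)/2}\, \frac{\sin(M\omega/2)}{\sin(\omega/2)},$$

so the gain $\frac{1}{2\pi M} \left| \frac{\sin(M\omega/2)}{\sin(\omega/2)} \right|$ decays with frequency: the moving average acts as a low-pass filter. A different alignment of the averaging window in Eq. (2.1) only changes the phase factor, not the gain.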
Note that the cross-covariance function is not symmetric in general. The Fourier
transform of the cross-covariance function, i.e. the cross-spectrum $f_{xy}(\omega) = \frac{1}{2\pi} \sum_k \gamma_{xy}(k)\, e^{-i\omega k}$, is given by

$$f_{xy}(\omega) = H(\omega)\, f_x(\omega). \qquad (2.35)$$
Note that the cross-covariance function is not limited to time series defined, e.g. via
Eq. (2.26), but is defined for any two time series.
The previous concepts can be extended to the multivariate time series in the same
manner (Appendix C). Let us suppose that xt , t = 1, 2, . . ., is a d-dimensional time
series with zero mean (for simplicity). The lagged autocovariance matrix is
$$\boldsymbol{\Gamma}(\tau) = E\left(\mathbf{x}_t\, \mathbf{x}_{t+\tau}^T\right). \qquad (2.36)$$
$$\boldsymbol{\Gamma}^T(\tau) = \boldsymbol{\Gamma}(-\tau). \qquad (2.37)$$
The cross-spectrum (spectral density) matrix is then given by

$$\mathbf{F}(\omega) = \frac{1}{2\pi} \sum_k e^{-i\omega k}\, \boldsymbol{\Gamma}(k). \qquad (2.38)$$
where the notation (*) represents the complex conjugate. Note that the diagonal
elements of the cross-spectrum matrix, $[\mathbf{F}]_{ii}(\omega)$, $i = 1, \ldots, d$, represent the
individual power spectra of the components $x_{ti}$ of $\mathbf{x}_t$. The real part $\mathbf{F}_R = \mathrm{Re}\left(\mathbf{F}(\omega)\right)$ is the co-spectrum, and the imaginary part $\mathbf{F}_I = \mathrm{Im}\left(\mathbf{F}(\omega)\right)$ is the
quadrature spectrum. The co-spectrum is even and satisfies
$$\mathbf{F}_R(\omega) = \frac{1}{2\pi}\left[\boldsymbol{\Gamma}(0) + \sum_{k \ge 1} \cos(\omega k)\left(\boldsymbol{\Gamma}(k) + \boldsymbol{\Gamma}^T(k)\right)\right], \qquad (2.41)$$
The relations (2.27) and (2.35) can also be extended naturally to the multivariate
filtering problem. In fact, if the multivariate signal $\mathbf{x}(t)$ is passed through a linear
filter L to yield $\mathbf{y}(t) = L\mathbf{x}(t) = \int \mathbf{H}(u)\, \mathbf{x}(t-u)\, du$, then the covariance matrix of
the output is

$$\boldsymbol{\Gamma}_y(\tau) = \int\!\!\int \mathbf{H}(u)\, \boldsymbol{\Gamma}_x(\tau + u - v)\, \mathbf{H}^T(v)\, du\, dv.$$
By expanding the cross-spectrum matrix $\mathbf{F}_{xy}(\omega) = \frac{1}{2\pi} \sum_\tau \boldsymbol{\Gamma}_{xy}(\tau)\, e^{-i\omega\tau}$, using
(2.43), and similarly for the output spectrum matrix $\mathbf{F}_y(\omega)$, one gets
Chapter 3
Empirical Orthogonal Functions
Abstract This chapter describes the idea behind, and develops the theory of
empirical orthogonal functions (EOFs) along with a historical perspective. It also
shows different ways to obtain EOFs and provides examples from climate and
discusses their physical interpretation. Strengths and weaknesses of EOFs are also
mentioned.
3.1 Introduction
The inspection of multivariate data with a few variables can be addressed easily
using the techniques listed in Chap. 2. For atmospheric data, however, where one
deals with many variables, those techniques become impractical, see Fig. 3.1 for an
example of data cube of sea level pressure. In the sequel, we adopt the notation and
terminology presented in Chap. 2. In general, and before any advanced analysis, it
is recommended to inspect the data using simple exploratory tools such as:
• Plotting the mean field x of xk , k = 1, . . . n.
• Plotting the variance of the field, i.e. $\mathrm{diag}(\mathbf{S}) = (s_{11}, \ldots, s_{pp})$, see Eq. (2.18).
• Plotting time slices of the field, or the time evolution of the field at a given latitude
or longitude, that is the Hovmöller diagram.
• Computing and plotting one-point correlation maps between the field and a
specific time series. This could be a time series from the same field at, say,
a specific location sk in which case the field to be plotted is simply the k’th
column of the correlation matrix (Eq. (2.21)). Alternatively, the time series
could be any climate index zt , t = 1, 2, . . . n in which case the field to be plotted
is simply ρ1 , ρ2 , . . . , ρp , where ρk = ρ(xk , zt ) is the correlation between the
index and the k’th variable of the field. An example of one-point correlation map
for DJF NCEP/NCAR sea level pressure is shown in Fig. 3.1 (bottom), the base
point is also shown. This map represents the North Atlantic Oscillation (NAO)
teleconnection pattern, discussed below.
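A one-point correlation map can be computed in a few lines of Matlab (a sketch of ours, with synthetic data and a hypothetical base-point index k0):

>> n = 300;  p = 100;  k0 = 17;               % sample size, grid points, base point
>> X  = randn(n, p);                          % synthetic data matrix
>> Xc = X - repmat(mean(X,1), n, 1);          % anomalies
>> s  = sqrt(sum(Xc.^2, 1));                  % column norms
>> r  = (Xc(:,k0)'*Xc)./(s(k0)*s);            % 1 x p one-point correlation map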
As stated in Chap. 1, when we have multivariate data the objective is often to
find coherent structures or patterns and to examine possible dependencies and
relationships among them for various purposes such as exploration, identifica-
tion of physical and dynamical mechanisms and prediction. This can only be
achieved through “simplification” or reduction of the data structure. The words
simplification/reduction will become clear later. This chapter deals with one of the
most widely used techniques to simplify/reduce and interpret the data structure,
namely principal component analysis (PCA). This is an exploratory technique for
multivariate data, which is in essence an eigenvalue problem, aiming at explaining
and interpreting the variability in the data.
The climate system is studied using observations as well as model simulations. The
weather and climate system is not an isolated phenomenon, but is characterised by
high interconnections, namely climate anomalies at one location on the earth can be
related to climate anomalies at other distant locations. This is the basic concept
of what is known as teleconnection. In simple words, teleconnections represent
patterns connecting widely separated regions (e.g. Hannachi et al. 2017). Typical
examples of teleconnections include The El-Niño Southern Oscillation, ENSO (e.g.
Trenberth et al. 2007), the North Atlantic Oscillation, NAO (Hurrell et al. 2003;
Hannachi and Stendel 2016; Franzke and Feldstein 2005) and the Pacific-North
American (PNA) patterns (Hannachi et al. 2017; Franzke et al. 2011).
ENSO is a recurring ocean-atmosphere coupled pattern of interannual fluc-
tuations characterised by changes in sea surface temperature in the central and
eastern tropical Pacific Ocean associated with large scale changes in sea level
pressure and also surface wind across the maritime continent. The Ocean part
of ENSO embeds El-Niño and La-Niña, and the atmospheric part embeds the
Southern Oscillation. An example of El-Niño is shown in Chap. 16 (Sect. 16.9).
El-Niño and La-Niña represent, respectively, the warming (or above average) and
cooling (or below average) phases of the central and eastern Pacific Ocean surface
temperature. This process has a period of about three to seven years, during which the sea surface
temperature changes by about 1–3 °C. The Southern Oscillation (SO) involves
changes in pressure, and other variables such as wind, temperature and rainfall, over
the tropical Indo-Pacific region, and is measured by the difference in atmospheric
pressure between Australia/Indonesia and eastern South Pacific. An example of
SO is discussed in Chap. 8 (Sect. 8.5). ENSO, as a teleconnection, has an impact
over considerable parts of the globe, especially North and South America and parts
of east Asia and the summer monsoon region. Although ENSO teleconnection,
precisely the Southern Oscillation, seems to have been discovered by Gilbert Walker
(Walker 1923), through correlation between surface pressure, temperature and
rainfall, the concept of teleconnection, however, seems to have been mentioned for
the first time in Ångström (1935). The connection between the Southern Oscillation
and El-Niño was only recognised later by Bjerknes in the 1960s, see, e.g. Bjerknes
(1969).
The NAO (Fig. 3.1, bottom) is a kind of see-saw in the atmospheric mass
between the Azores and the extratropical North Atlantic. It is the dominant mode of
near surface pressure variability over the North Atlantic, Europe and North Africa
(Hurrell et al. 2003; Hannachi and Stendel 2016), and has an impact on considerable
parts of the northern hemisphere (Hurrell et al. 2003; Hannachi and Stendel 2016).
The two main centres of action of the NAO are located, respectively, near the Azores
and Iceland. For example, in its positive phase the pressure difference between the
two main centres of action is enhanced, compared to the climatology, resulting in
stronger than normal westerly flow.
PCA has been used since the beginning of the twentieth century by statisticians
such as Pearson (1901) and later Hotelling (1933, 1935). For statistical and more
general application of PCA, the reader is referred, e.g., to the textbooks by Seal
(1967), Morrison (1967), Anderson (1984), Chatfield and Collins (1980), Mardia
et al. (1979), Krzanowski (2000), Jackson (2003) and Jolliffe (2002) and more
references therein. In atmospheric science, it is difficult to trace the exact origin of
eigenvalue problems. According to Craddock (1973), the earliest1 recognisable use
of eigenvalue problems in meteorology seems to have been mentioned in Fukuoka
(1951). The earliest known and comprehensive development of eigenvalue analyses
and orthogonal expansion in atmospheric science are the works of Obukhov (1947)
and Bagrov (1959) from the previous USSR and Lorenz (1956) from the US Weather
Service. Fukuoka (1951) also mentioned the usefulness of these methods in weather
prediction. Obukhov (1947) used the method for smoothing purposes whereas
Lorenz (1956), who coined the name empirical orthogonal functions (EOFs), used
it for prediction purposes.
Because of the relatively large number of variables involved and the low
speed/memory of computers that were available in the mid-1950s, Gilman (1957),
for example, had to partition the atmospheric pressure field by dividing the northern
hemisphere into slices, and the data matrix was thus reduced substantially
and allowed an eigenanalysis. Later developments were conducted by various
1 Wallace (2000) maintains the view that the way Walker (1923) computed the Southern Oscillation
bears resemblance to iterative techniques used to compute empirical orthogonal functions.
PCA aims to find a new set of variables that explain most of the variance observed
in the data.2 Figure 3.2 shows the axes that explain most of the variability in the
popular three-variable Lorenz model. It has been extensively used in atmospheric
research to analyse particularly large scale and low frequency variability. The first
seminal work on PCA in atmospheric science goes back to the mid-1950s with
Ed. Lorenz. The method, however, has been used before by Obukhov (1947), see
e.g. North (1984), and was mentioned later by Fukuoka (1951), see e.g. Craddock
(1973). Here and elsewhere we will use both terminologies, i.e. PCA or EOF
analysis, interchangeably. Among the very few earliest textbooks on EOFs in
atmospheric science, the reader is referred to Preisendorfer and Mobley (1988), and
to later textbooks by Thiebaux (1994), Wilks (2011), von Storch and Zwiers (1999),
and Jolliffe (2002).
The original aim of EOF analysis (Obukhov 1947; Fukuoka 1951; Lorenz 1956)
was to achieve a decomposition of a continuous space–time field X(t, s), where t
and s denote respectively time and spatial position, as
$$X(t, s) = \sum_{k \ge 0} c_k(t)\, a_k(s) \qquad (3.1)$$
using an optimal set of orthogonal basis functions of space ak (s) and expansion
functions of time ck (t). When the field is discretised in space and/or time a similar
2 This is based on the main assumption in data analysis, that is variability represents information.
expansion to (3.1) is also sought. For example, if the field is discretised in both time
and space the expansion above is finite, and the obtained field can be represented by
a data matrix X as in (2.11). In this case the sum extends to the rank r of X. The
basis functions ak (s) and expansion coefficients ck (t) are obtained by minimising
the residual:
$$R_1 = \int_T \int_S \left[ X(t, s) - \sum_{k=1}^{M} c_k(t)\, a_k(s) \right]^2 dt\, ds, \qquad (3.2)$$
where the integration is performed over the time T and spatial S domains for the
continuous case and for a given expansion order M. A similar residual is minimised
for the discrete case except that the integrals are replaced by discrete sums.
In probability theory expansion (3.1), for a given s (and as such the parameter s is
dropped here for simplicity) is known as Karhunen–Loève expansion associated
with a continuous zero-mean3 stochastic process X(t) defined over an interval
[a, b], and which is square integrable, i.e. E|X(t)|2 < ∞, for all t in the interval
[a, b]. Processes having these properties constitute a Hilbert space (Appendix F)
with the inner product < X1 (t), X2 (t) >= E (X1 (t)X2 (t)). The covariance
function of X(t), $\gamma(s, t) = E\left(X(t)X(s)\right)$, can then be expanded (Mercer's theorem) as $\gamma(s, t) = \sum_{k \ge 1} \lambda_k\, \phi_k(t)\phi_k(s)$, where $\lambda_1, \lambda_2, \ldots$ and $\phi_1(\cdot), \phi_2(\cdot), \ldots$ are respectively the eigenvalues and associated orthonormal eigenfunctions of the Fredholm eigenproblem
$A\phi(t) = \int_a^b \gamma(t, s)\, \phi(s)\, ds = \lambda\, \phi(t),$
and satisfying < φi , φj >= δij . This result is due to Mercer (1909), see also
Basilevsky and Hum (1979), and the covariance function γ (t, s) is known as Mercer
kernel. Accordingly the stochastic process X(t) is then expanded as

$X(t) = \sum_{k=1}^{\infty} X_k\, \phi_k(t),$

with approximation error

$\int_a^b \left| X(t) - \sum_{i=1}^{k} X_i\, \phi_i(t) \right|^2 dt = 1 - \sum_{i=1}^{k} \lambda_i$
when the first k terms of the expansion are used. Note that when the stochastic
process is stationary, i.e. γ (s, t) = γ (s − t) then the previous expansion
becomes
$\gamma(s - t) = \sum_{k=1}^{\infty} \lambda_k\, \phi_k(s)\, \phi_k(t).$
Given the (centred) data matrix (2.11), the objective of EOF/PC analysis is to find
the linear combination of all the variables explaining maximum variance, that is to
find a unit-length direction $a = (a_1, \ldots, a_p)^T$ that captures maximum variability.
The projection of the data onto the vector a yields the centred time series Xa, whose
variance is simply the average of the squares of its elements, i.e. aT XT Xa/n. The
EOFs are therefore obtained as the solution to the quadratic optimisation problem
$\max F(a) = a^T S a \quad \text{subject to} \quad a^T a = 1. \qquad (3.3)$
Fig. 3.3 Illustration of a pair of EOFs: a simple monopole EOF1 and dipole EOF2
This problem is solved by introducing a Lagrange multiplier μ and maximising $a^T S a - \mu(1 - a^T a)$ over a, which is also equivalent to maximising $(a^T S a)/(a^T a)$. The EOFs are therefore
obtained as the solution to the eigenvalue problem:
Sa = λ2 a. (3.4)
The EOFs are the eigenvectors of the sample covariance matrix S arranged in
decreasing order of the eigenvalues. The first eigenvector a1 gives the first principal
component, i.e. the linear function Xa1 , with the largest variance; the second EOF
a2 gives the second principal component with the next largest variance subject to
being orthogonal to a1 as illustrated in Fig. 3.3, etc.
Remark In PCA one usually defines the PCs first as linear combinations of the
different variables explaining maximum variability from which EOFs are then
derived. Alternatively, one can similarly define EOFs as linear combinations vT X,
where v is a vector of weights, of the different maps of the field that maximise
the norm squared. Applying this definition one obtains a similar equation to (3.3),
namely:
vT Pv
max , (3.5)
vT v
where P = XXT is the matrix of scalar product between the different maps.
Equation (3.5) yields automatically the (standardised) principal components. Note
that Eq. (3.5) is formulated using a duality argument to Eq. (3.3), and can be useful
for numerical purposes when, for example, the sample size is smaller than the
number of variables.
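As a brief illustration of this duality (a sketch only, with illustrative variable names and assuming a centred n × p data matrix F), the EOFs can be recovered from the smaller n × n scalar-product matrix:

% Sketch of the dual formulation: diagonalise the n x n matrix P = F*F'
% instead of the p x p covariance matrix when n < p (names illustrative).
P = F * F';                          % n x n matrix of scalar products between maps
[W, L2] = eig(P, 'vector');          % eigenvectors W and eigenvalues L2 of P
[L2, idx] = sort(L2, 'descend');     % sort in decreasing order
W = W(:, idx);  L2 = max(L2, 0);     % guard against small negative round-off
k = sum(L2 > 0);                     % number of non-degenerate modes
eofs = F' * W(:, 1:k) ./ sqrt(L2(1:k)');   % recover the (unit-norm) EOFs
pcs  = F * eofs;                     % corresponding principal components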
The solution of the eigenvalue problem (3.4) can be written via the spectral decomposition of the sample covariance matrix,

$S = U \Lambda^2 U^T, \qquad (3.6)$

where U is a p × p orthogonal4 matrix, i.e. $U^T U = U U^T = I$, and $\Lambda^2$ is a diagonal matrix, i.e. $\Lambda^2 = \mathrm{Diag}(\lambda_1^2, \ldots, \lambda_p^2)$, containing the eigenvalues5 of S. The EOFs
u1 , u2 , . . . up are therefore the columns of U. It is clear that if p < n, then there are
at most p positive eigenvalues whereas if n < p there are at most n − 1 positive
eigenvalues.6 To be more precise, if r is the rank of S, then there are exactly r
positive eigenvalues. To be consistent with the previous maximisation problem, the
eigenvalues are sorted in decreasing order, i.e. λ21 ≥ λ22 ≥ . . . ≥ λ2p , so that the first
EOF yields the time series with maximum variance, the second one with the next
largest variance, etc. The solution of the above eigenvalue problem, Eq. (3.6), can
be obtained using either direct methods such as the singular value decomposition
(SVD) algorithm or iterative methods based on Krylov subspace solvers using
Lanczos or Arnoldi algorithms as detailed in Appendix D, see also Golub and van
Loan (1996) for further methods and more details. The Krylov subspace method is
particularly efficient for large and/or sparse systems.
In Matlab programming environment, let X(n, p1, p2) designate the two-
dimensional (p1 × p2) gridded (e.g. lat-lon) field, where n is the sample size. The
field is often assumed to be anomalies (though not required). The following simple
code computes the leading 10 EOFs, PCs and the associated covariance matrix
eigenvalues:
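The original listing is not reproduced here; a minimal sketch of such a computation (assuming the anomaly field is held in an n × p1 × p2 array X, with illustrative variable names and the 1/n normalisation used in the text) could read:

% Sketch: leading 10 EOFs, PCs and covariance-matrix eigenvalues (illustrative)
[n, p1, p2] = size(X);
F = reshape(X, n, p1*p2);            % n x (p1*p2) data matrix
F = F - mean(F, 1);                  % remove the time mean (anomalies)
[~, S, U] = svd(F, 'econ');          % SVD of the data matrix; columns of U = EOFs
neof   = 10;
eofs   = U(:, 1:neof);               % leading EOFs (right singular vectors)
pcs    = F * eofs;                   % principal components
evals  = diag(S(1:neof, 1:neof)).^2 / n;        % covariance matrix eigenvalues
expvar = 100 * diag(S).^2 / sum(diag(S).^2);    % explained variance (%)
eof1map = reshape(eofs(:, 1), p1, p2);          % leading EOF on the lat-lon grid
% Alternatively (Statistics Toolbox): [eofs, pcs, evals] = pca(reshape(X, n, []));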
4 This is different from a normal matrix U, which commutes with its transpose, i.e. UT U = UUT .
5 We use squared values because S is semi-definite positive, and also to be consistent with the SVD
of S.
6 Why n − 1 and not n?
Note also that Matlab has a routine pca, which performs PCA of the data matrix X(n, p1 p2) (see also Appendix H for resources).
The percentage of variance explained by the kth EOF is given by

$\frac{100\, \lambda_k^2}{\sum_{j=1}^{r} \lambda_j^2}\ \%, \qquad (3.7)$

and the principal components are obtained by projecting the data matrix onto the EOFs,

$C = X U, \qquad (3.8)$
so the k’th PC ck = (ck (1), ck (2), . . . , ck (n)) is simply Xuk whose elements are
$c_k(t) = \sum_{j=1}^{p} x_{tj}\, u_{jk}$
Fig. 3.4 Percentage of explained variance of the leading 40 EOFs of winter months (DJF)
NCAR/NCEP sea level pressure anomalies for the period Jan 1940–Dec 2000. The vertical bars
provide approximate 95% confidence interval of the explained variance. Adapted from Hannachi
et al. (2007)
for t = 1, . . . , n and where uj k is the j’th element of the kth EOF uk . It is clear
from (3.8) that the PCs are uncorrelated and that
$\mathrm{cov}(c_k, c_l) = \frac{1}{n} \lambda_k^2\, \delta_{kl}. \qquad (3.9)$

Writing the singular value decomposition of the (scaled) data matrix as

$\frac{1}{\sqrt{n}}\, X = V \Lambda U^T, \qquad (3.10)$

the principal components satisfy

$\frac{1}{\sqrt{n}}\, C = V \Lambda, \qquad (3.11)$
hence the columns of V are the standardised, i.e. unit variance principal components.
One concludes therefore that the EOFs and standardised PCs are respectively the
right and left singular vectors of X.
Figure 3.5 shows the leading two EOFs of DJF SLP anomalies (with respect to
the mean seasonal cycle) from NCAR/NCEP. They explain respectively 21% and
13% of the total winter variability of the SLP anomalies, see also Fig. 3.4. Note, in
particular, that the leading EOF reflects the Arctic Oscillation mode (Wallace and
Thompson 2002), and shows the North Atlantic Oscillation over the North Atlantic
sector. This illustrates one of the main features of EOFs, namely mixing, which is discussed
below. The corresponding two leading PCs are shown in Fig. 3.6.
The SVD algorithm is reliable, as pointed out by Toumazou and Cretaux (2001),
and the computation of the singular values is governed by the condition number of
the data matrix. Another strategy is to apply the QR algorithm (see Appendix D) to
the symmetric matrix XT X or XXT , depending on the smallest dimension of X. The
algorithm, however, can be unstable as the previous symmetric matrix has a larger
condition number compared to that of the data matrix. In this regard, Toumazou and
Cretaux (2001) suggest an algorithm based on a Lanczos eigensolver technique.
The method is based on using a Krylov subspace (see Appendix D), and reduces to
computing some eigen-elements of a small symmetric matrix.
Beside the SVD algorithm, iterative methods have also been proposed to compute
EOFs (e.g. van den Dool 2011). The main advantage of these methods is that they
avoid computing the covariance matrix, which may be prohibitive at high resolution
and large datasets, or even dealing directly with the data matrix as is the case with
Fig. 3.6 Leading two principal components of the winter (DJF) monthly sea level pressure
anomalies for the period Jan 1940–Dec 2000 (a) DJF sea level pressure PC1. (b) DJF sea level
pressure PC2. Adapted from Hannachi et al. (2007)
SVD. The iterative approach makes use of the identities linking EOFs and PCs.
The EOF $E_m(s)$ and corresponding PC $c_m(t)$ of a field X(t, s) satisfy $E_m(s) = A_n \sum_t c_m(t) X(t, s)$ (with $A_n$ a normalisation constant), and similarly for $c_m(t)$. The method then starts with an initial
guess of a time series, say c(0) (t), scaled to unit variance, and obtains the associated
pattern E (0) (s) following the previous identity. This pattern is then used to compute
an updated time series c(1) (t), etc. As pointed out by van den Dool (2011), the
process normally converges quickly to the leading EOF/PC. The process is then
continued with the residuals, after removing the contribution of the leading mode, to
get the subsequent modes of variability. The iterative method can be combined with
spatial weighting to account for the grid (e.g. Gaussian grid in spherical geometry)
and subgrid processes, and to maximise the signal-to-noise ratio of EOFs (Baldwin
et al. 2009).
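A minimal sketch of this iterative procedure (assuming a centred n × p data matrix F; names, tolerance and iteration count are illustrative) might look as follows:

% Power-type iteration for the leading EOF/PC, as described above (sketch)
c = randn(size(F, 1), 1);  c = c / std(c);     % initial unit-variance time series
for iter = 1:200
    e = F' * c;   e = e / norm(e);             % pattern associated with c
    cnew = F * e; cnew = cnew / std(cnew);     % updated unit-variance time series
    if norm(cnew - c) < 1e-8, break, end
    c = cnew;
end
eof1 = e;                                      % leading EOF (up to sign)
pc1  = F * e;                                  % leading principal component
F    = F - pc1 * e';                           % residuals: remove the leading mode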
3.4 Sampling, Properties and Interpretation of EOFs
There are various ways to estimate or quantify uncertainties associated with the
EOFs and corresponding eigenvalues. These uncertainties can be derived based on
asymptotic approximation. Alternatively, the EOFs and associated eigenvalues can
be obtained using a probabilistic framework where uncertainties are comprehen-
sively modelled. The Monte-Carlo approach is another option, although it can be computationally expensive. Cross-validation and the bootstrap are examples of Monte-Carlo methods and are discussed below; other Monte-Carlo techniques also exist, such as surrogate data, which is commented on below.
Asymptotically, and for large sample size n, the estimated eigenvalues are approximately distributed as

$\hat{\lambda}_k^2 \sim N\left(\lambda_k^2,\ \frac{2}{n}\lambda_k^4\right), \qquad (3.12)$

where N(μ, σ²) stands for the normal distribution with mean μ and variance σ² (see Appendix B), and λ²k, k = 1, . . . , p are the eigenvalues of the underlying population covariance matrix Σ. The standard error of the estimated eigenvalue λ̂²k is then
$\delta\hat{\lambda}_k^2 \approx \sqrt{\frac{2}{n}}\; \lambda_k^2. \qquad (3.13)$
For a given significance level α the interval $\hat{\lambda}_k^2 \pm \delta\hat{\lambda}_k^2\, z_{1-\alpha/2}$, where the notation $z_a$ refers to the a'th quantile,7 provides therefore the asymptotic 100(1 − α)% confidence limits of the population eigenvalue λ²k, k = 1, 2, . . . , p. For example, the 95% confidence interval is $[\hat{\lambda}_k^2 - 1.96\,\delta\hat{\lambda}_k^2,\ \hat{\lambda}_k^2 + 1.96\,\delta\hat{\lambda}_k^2]$. Figure 3.4 displays these limits for the winter sea level pressure anomalies. A similar approximation can also
be derived for the eigenvectors uk , k = 1, . . . p:
$\delta u_k \approx \frac{\delta\hat{\lambda}_k^2}{\lambda_j^2 - \lambda_k^2}\, u_j, \qquad (3.14)$
where λ2j is the closest eigenvalue to λ2k . Note that in practice the number n used in
the previous approximation corresponds to the size of independent data also known
as effective sample size, see below.
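A short sketch of these approximate error bars (with lam2 denoting the vector of sample eigenvalues and neff the effective sample size; both names are illustrative) is:

% Asymptotic ~95% error bars on the eigenvalues, following (3.13) (sketch)
dlam = lam2 * sqrt(2 / neff);     % standard error of each eigenvalue
lo95 = lam2 - 1.96 * dlam;        % approximate lower confidence limits
hi95 = lam2 + 1.96 * dlam;        % approximate upper confidence limits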
Probabilistic PCA
A more comprehensive method to obtain EOFs and explained variances along with
their uncertainties is to use a probability-based method. This has been done by
Tipping and Bishop (1999), see also Goodfellow et al. (2016). In this case EOFs
are computed via maximum likelihood. The model is based on a latent variable as
in factor analysis discussed in Chap. 10. Note, however, that the method relies on
multinormality assumption. More discussion on the method and its connection to
factor analysis is discussed in Sect. 10.6 of Chap. 10.
The asymptotic uncertainty method discussed above relies on quite large sample
sizes. In practice, however, this assumption may not be satisfied. An attractive
and easy to use alternative is the Monte-Carlo resampling method, which has become an invaluable tool in modern statistics (James et al. 2013; Goodfellow et al. 2016). The method involves repeatedly drawing subsamples from the training set at hand, refitting the model to each of these subsamples, and obtaining thereafter an ensemble of realisations of the parameters of interest, enabling the computation of uncertainties
on those parameters. Cross-validation and bootstrap are the most commonly used
resampling methods. The bootstrap method goes back to the late 1970s with Efron
(1979). The textbooks by Efron and Tibshirani (1994), and also Young and Smith
(2005) provide a detailed account of Monte-Carlo methods and their application. A
summary of cross-validation and bootstrap methods is given below, and for deeper
7 $z_a = \Phi^{-1}(a)$, where Φ(·) is the cumulative distribution function of the standard normal distribution (Appendix B).
analysis the reader is referred to the more recent textbooks of James et al. (2013),
Goodfellow et al. (2016), Brownlee (2018) and Watt et al. (2020).
Cross-Validation
Cross-validation is a measure of performance, and involves splitting the available
(training) data sample (assumed to have a sample size n) into two subsets, one
is used for training (or model fitting) and the other for validation. That is, the
fitted model (on the training set) is used to get responses via validation with the
second sample, enabling hence the computation of the test error rate. In this way,
cross-validation (CV) can be used to get the test error, and yields a measure of
model performance and model selection, see, e.g., Sect. 14.5 of Chap. 14 for an
application to parameter identification. There are basically two types of cross-
validation methods, namely, the leave-one-out CV and k-fold CV. The former deals
with leaving one observation out (validation set), and fitting the statistical model
on the remaining n − 1 observations (training set), and computing the test error
at the left-one-out observation. This error is simply measured by the mean square
error (test error) between the observation and the corresponding value given by the
fitted model. This procedure is then repeated with every observation, and then the
leave-one-out CV is estimated by the average of the obtained n test errors.
In the k-fold CV, the dataset is first divided randomly into k subsamples of similar sizes. The model is then fitted on k − 1 subsamples and validated on the remaining one, yielding one test error. The procedure is then repeated with each subsample serving in turn as the validation set, and the final k-fold CV estimate is obtained as the average of the obtained k test errors. The leave-one-out approach is a special case of k-fold CV (with k = n). In practice, however, k-fold CV with moderate k is often advantageous: it is computationally cheaper and tends to give more accurate estimates of the test error, a result that is related to the bias-variance trade-off. James et al. (2013) suggest the empirical values k = 5 or k = 10.
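As an illustration (a sketch only; P, y and the least-squares fit are placeholders for any predictor matrix, response and model), a k-fold cross-validation loop can be written as:

% k-fold cross-validation sketch (here with a simple least-squares model)
k = 5;  n = size(P, 1);
fold = mod(randperm(n), k) + 1;                 % random fold labels 1..k
cv_err = zeros(k, 1);
for j = 1:k
    test  = (fold == j);  train = ~test;        % hold out the j-th fold
    beta  = P(train, :) \ y(train);             % fit on the remaining k-1 folds
    cv_err(j) = mean((y(test) - P(test, :) * beta).^2);   % test error (MSE)
end
cv = mean(cv_err);                              % k-fold CV estimate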
The Bootstrap
The bootstrap method is a powerful statistical tool used to estimate uncertainties
on a given statistical estimator from a given dataset. The most common use of
bootstrap is to provide a measure of accuracy of the parameter estimate of interest.
In this context, the method is used to estimate summary statistics of the parameter
of interest, but can also yield approximate distribution of the parameter. The
bootstrap involves constructing a random subsample from the dataset, which is
used to construct an estimate of the parameter of interest. This procedure is then
repeated a large number of times, yielding hence an ensemble of estimates of the
parameter of interest. Each sample used in the bootstrap is constructed from the
dataset by drawing observations, one at a time, and returning each drawn observation to the dataset, until the required size is reached. This procedure is known as sampling
with replacement and enables observations to appear possibly more than once in a
bootstrap sample. In the end, the obtained ensemble of estimates of the parameter of
interest is used to compute the statistics, e.g. mean and variance, etc., and quantify
the uncertainty on the parameter (James et al. 2013). In summary, a bootstrap
sample, with a chosen size, is obtained by drawing observations, one at a time, from
the pool of observations of the training dataset. In practice, the number of bootstrap samples should be large, typically of the order of O(1000). Also, for reasonably large data, the bootstrap sample size can be of the order of 50–80% of the size of the dataset.
The algorithm of resampling with replacement goes as follows:
(1) Select the number of bootstrap samples, and a sample size of these samples.
(2) Draw the bootstrap sample
(3) Compute the parameter of interest
(4) Go to (2) until the number of bootstrap samples is reached.
The application of the above algorithm yields a distribution, e.g. histogram, of the
parameter.
Remarks on Surrogate Data Method The class of Monte-Carlo method is quite
wide and includes other methods than CV and bootstrap resampling. One partic-
ularly powerful method used in time series is that of surrogate data. The method
of surrogate data (Theiler et al. 1992) involves generating surrogate datasets, which
share some characteristics with the original time series. The method is mostly used
in nonlinear and chaotic time series analysis to test linearity null hypotheses (e.g.
autoregressive moving-average ARMA processes) versus nonlinearity. The most
common algorithm for surrogate data is phase randomisation and amplitude adjusted
Fourier transform (Theiler et al. 1992). Basically, the original time series is Fourier
transformed, the amplitudes of this transform are then used with new uniformly
distributed random phases, and finally an inverse Fourier transform is applied to
get the surrogate sample. For real time series, the phases are constrained to be
antisymmetric. By construction, these surrogates preserve the linear structure of
the original time series (e.g. autocorrelation function and power spectrum). Various
improvements and extensions have been proposed in the literature (e.g. Breakspear
et al. 2003, Lucio et al. 2012). Surrogate data method has also been applied in
atmospheric science and oceanography. For example, Osborne et al. (1986) applied
it to identify signatures of chaotic behaviour in the Pacific Ocean dynamics.
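A minimal sketch of the phase-randomisation part of the algorithm (for a real time series x; the amplitude-adjusted refinement is not included, and names are illustrative) is:

% Phase-randomisation surrogate of a real time series x (sketch)
x  = x(:) - mean(x);
n  = length(x);
Xf = fft(x);                              % Fourier transform of the series
m  = floor((n - 1) / 2);                  % number of positive frequencies
ph = 2 * pi * rand(m, 1);                 % new uniformly distributed phases
phases = zeros(n, 1);
phases(2:m+1) = ph;                       % positive frequencies
phases(end:-1:end-m+1) = -ph;             % antisymmetric negative frequencies
xs = real(ifft(abs(Xf) .* exp(1i * phases)));   % surrogate time series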
The bootstrap method can easily be applied to obtain uncertainties on the EOFs of
a given atmospheric field, as shown in the following algorithm:
(1) Select the number of bootstrap samples, and the sample size of these samples.
(2) For each drawn bootstrap sample:
(2.1) Compute the EOFs (e.g. via SVD) and associated explained variance.
(2.2) Rank the explained variances (and associated EOFs) in decreasing order.
(3) Calculate the mean, variance (and possibly histograms, etc.) of each explained
variance and associated EOFs (at each grid point).
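A compact sketch of this bootstrap procedure (assuming a centred n × p data matrix F; the number of samples, their size and the number of retained EOFs are illustrative choices) is:

% Bootstrap uncertainties on EOF explained variances (sketch)
nb   = 1000;                      % number of bootstrap samples
nsz  = round(0.7 * size(F, 1));   % bootstrap sample size (50-80% of n)
neof = 5;
ev   = zeros(nb, neof);
for b = 1:nb
    idx = randi(size(F, 1), nsz, 1);        % draw rows with replacement
    [~, S, ~] = svd(F(idx, :), 'econ');     % EOF analysis of the bootstrap sample
    lam = diag(S).^2;
    ev(b, :) = 100 * lam(1:neof)' / sum(lam);   % ranked explained variances (%)
end
ev_mean = mean(ev);  ev_std = std(ev);      % bootstrap mean and spread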
Fig. 3.7 Evolution of the frequency distribution of the north (a) and south (b) centres of action
of the NAO pattern computed over 20-yr running windows. The yellow line corresponds to the
longitude of the original sample. Adapted from Wang et al. (2014). ©American Meteorological
Society. Used with permission
where T0 is the e-folding time of ρ(·). The idea behind Leith (1973) is that if xt, t = 1, 2, . . . , n is a realisation of independent and identically distributed (IID) random variables X1, . . . , Xn with variance σ², then the mean $\bar{x} = n^{-1}\sum_{t=1}^{n} x_t$ has variance $\sigma_{\bar{x}}^2 = n^{-1}\sigma^2$. Now, consider a continuous time series x(t) defined for all
The variance of (3.16) can easily be derived and (3.15) can be recovered from a red
noise.
Exercise
1. Compute the variance $\sigma_T^2$ of (3.16).
2. Derive $\sigma_T^2$ for a red noise and show that $\frac{\sigma^2}{\sigma_T^2} = \frac{T}{2T_0}$.
Hint
1. From (3.16) we have

$T^2\sigma_T^2 = E\left[\int_{t-T/2}^{t+T/2}\!\int_{t-T/2}^{t+T/2} x(s_1)\, x(s_2)\, ds_1\, ds_2\right] = E\left[\int_{-T/2}^{T/2}\!\int_{-T/2}^{T/2} x(t+s_1)\, x(t+s_2)\, ds_1\, ds_2\right] = \sigma^2 \int_{-T/2}^{T/2}\!\int_{-T/2}^{T/2} \rho(s_2 - s_1)\, ds_1\, ds_2.$

Hence $\frac{\sigma_T^2}{\sigma^2} = \frac{2}{T}\int_0^T \left(1 - \frac{v}{T}\right)\rho(v)\, dv.$
Remark Note that for a red noise or (discrete) AR(1) process, $x_t = \phi_1 x_{t-1} + \varepsilon_t$, one has $\rho(\tau) = e^{-|\tau|/T_0}\ (= \phi_1^{|\tau|})$, see Appendix C, and the e-folding time $T_0$ is given by the integral $\int_0^{\infty}\rho(\tau)\, d\tau$. In the above formulation, the time interval was assumed to be unity. If the time series is sampled every $\Delta t$, then $T_0 = -\Delta t/\log\rho(\Delta t)$, and one gets

$n^* = -\frac{n}{2}\, \log\rho(\Delta t).$
Jones (1975) suggested a first-order estimate of the effective sample size, which, for a red noise, boils down to

$n^* = n\, \frac{1 - \rho(1)}{1 + \rho(1)}, \qquad (3.17)$

while another commonly used form is

$n^* = n\, \frac{1 - \rho(1)^2}{1 + \rho(1)^2}. \qquad (3.18)$
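In practice, for a time series x of length n, Eq. (3.17) can be applied as in the following sketch (variable names are illustrative):

% Effective sample size of a red-noise-like series, following (3.17) (sketch)
n  = length(x);
r  = corrcoef(x(1:end-1), x(2:end));   % lag-1 autocorrelation
r1 = r(1, 2);
nstar = n * (1 - r1) / (1 + r1);       % effective number of independent samples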
For time varying fields, or multivariate time series, with N grid points or variables
$x(t) = (x_1(t), \ldots, x_N(t))^T$ observed over some finite time interval, Bretherton et
al. (1999) discuss two measures of effective numbers of spatial d.o.f or number
of independently varying spatial patterns. For example, for isotropic turbulence a
similar equation to (3.15) was given by Taylor (1921) and Keller (1935). Using the
“moment matching” (mm) method of Bagrov (1969), derived from a χ 2 distribution,
an estimate of the effective number of d.o.f can be derived, namely

$N_{mm}^{*} = \frac{2\, \overline{E}^{\,2}}{\overline{E^2} - \overline{E}^{\,2}}, \qquad (3.19)$

where the overbar denotes a time mean and E is a quadratic measure of the field, e.g. the quadratic norm of x(t), $E = x(t)^T x(t)$.
An alternative way was also proposed by Bagrov (1969) and TerMegreditchian
(1969) based on the covariance matrix of the field. This estimate, which is also
discussed in Bretherton et al. (1999) takes the form
$N_{eff}^{*} = \frac{\left(\sum_{k=1}^{N}\lambda_k\right)^2}{\sum_{i=1}^{N}\lambda_i^2} = \frac{\mathrm{tr}(\Sigma)^2}{\mathrm{tr}(\Sigma^2)}, \qquad (3.20)$
The two estimates are related by

$N_{eff}^{*} = \frac{\kappa - 1}{2}\, N_{mm}^{*},$
where κ is the kurtosis assumed to be the same for all PCs. This shows, in particular,
that the two values can be quite different.
Since the leading order EOFs explain more variance than the lowest order ones, one
would then be tempted to focus on the few leading ones and discard the rest as being
noise variability. This is better assessed by the percentage of the explained variance
by the first, say, m retained EOFs:
$\frac{\sum_{k=1}^{m}\lambda_k^2}{\sum_{k=1}^{p}\lambda_k^2} = \frac{\sum_{k=1}^{m}\mathrm{var}(X u_k)}{\mathrm{tr}(S)}. \qquad (3.21)$
In this way one can choose a pre-specified percentage of explained variance, e.g.
70%, then keep the first m EOFs and PCs that explain altogether this amount.
Remark Although this seems a reasonable way to truncate the spectrum of the
covariance matrix, the choice of the amount of explained variance remains, however,
arbitrary.
We have seen in Chap. 1 two different types of transformations: scaling and
sphering. The principal component transformation, obtained by keeping a subset of
EOFs/PCs, is yet another transformation that can be used in this context to reduce
the dimension of the data. The transformation is given by
Y = XU.
The main characteristic features of EOF analysis are the orthogonality of the EOFs and the uncorrelatedness of the PCs. These are nice geometric properties that can be very useful
in modelling studies using PCs. For example, the covariance matrix of any subset
of retained PCs is always diagonal. These constraints, however, yield partially
predictable relationships between an EOF and the previous ones. For instance, as
pointed out by Horel (1981), if the first EOF has a constant sign over its domain,
then the second one will generally have both signs with the zero line going through
the maxima of the first EOF (Fig. 3.3). The orthogonality constraint also makes the
EOFs domain-dependent and can be too non-local (Horel 1981; Richman 1986).
Perhaps one of the main properties of EOFs is mixing. Assume, for example,
that our signal is a linear superposition of signals, not necessarily uncorrelated, then
EOF analysis tends to mix these signals in order to achieve optimality (i.e. maximum
variance), yielding patterns that are mixtures of the original signals. This is known as
the mixing problem in EOFs. This problem can be particularly serious when the data
contain multiple signals with comparable explained variance (e.g. Aires et al. 2002;
Kim and Wu 1999). Figure 3.10 shows the leading EOF of the monthly sea surface
temperature (SST) anomalies over the region 45.5◦ S–45.5◦ N. The anomalies are
computed with respect to the monthly mean seasonal cycle. The data are on a
1◦ × 1◦ latitude-longitude grid and come from the Hadley Centre Sea Ice and
Sea Surface Temperature8 spanning the period Jan 1870–Dec 2014 (Rayner et al.
2003). The EOF shows a clear signal of El-Niño in the eastern equatorial Pacific. In
addition we also see anomalies located on the western boundaries of the continents
related to the western boundary currents. These are discussed in more detail in
Chap. 16 (Sect. 16.9). Problems related to mixing are conventionally addressed
using, e.g. EOF rotation (Chap. 4), independent component analysis (Chap. 12) and
also archetypal analysis (see Chap. 16).
Furthermore, although the truncated EOFs may explain a substantial amount
of variance, there is always the possibility that some physical modes may not be
represented by these EOFs. EOF analysis may lead therefore to an underestimation
of the complexity of the system (Dommenget and Latif 2002). Consequently,
these constraints can cause limitations to any possible physical interpretation of
the obtained patterns (Ambaum et al. 2001; Dommenget and Latif 2002; Jolliffe
8 www.metoffice.gov.uk/hadobs/hadisst.
2003) because physical modes are not necessarily orthogonal. Normal modes
derived for example from linearised dynamical/physical models, such as barotropic
models (Simmons et al. 1983) are not orthogonal since physical processes are not
uncorrelated. The Arctic Oscillation/North Atlantic Oscillation (AO/NAO) EOF
debate is yet another example that is not resolved using (hemispheric) EOFs
(Wallace 2000; Ambaum et al. 2001, Wallace and Thompson 2002). Part of the
difficulty in interpretation may also be due to the fact that, although uncorrelated,
the PCs are not independent and this is particularly the case when the data are
not Gaussian, in which case other approaches exist and will be presented in later
chapters.
It is extremely difficult and perhaps not possible to get, using techniques based
solely on purely mathematical/statistical concepts, physical modes without prior
knowledge of their structures (Dommenget and Latif 2002) or other dynamical
constraints. For example, Jolliffe (2002, personal communication) points out that, in general, EOFs are unsuccessful at capturing modes of variability when the number of variables is larger than the number of modes, unless the latter are orthogonally related to the former. In this context we read the following quotation9 (Everitt and Dunn 2001, p. 305; also quoted in Jolliffe 2002, personal
communication): “Scientific theories describe the properties of observed variables
in terms of abstraction which summarise and make coherent the properties of
observed variables. Latent variables (modes), are, in fact one of this class of
abstract statements and the justification for the use of these variables (modes) lies
not in an appeal to their “reality” or otherwise but rather to the fact that these
variables (modes) serve to synthesise and summarise the properties of the observed
variables”.
One possible way to evaluate EOFs is to compare them with a first-order
spatial autoregressive process (e.g. Cahalan et al. 1996), or more generally using
a homogeneous diffusion process (Dommenget 2007; Hannachi and Dommenget
2009). The simplest homogeneous diffusion process is given by
$\frac{du}{dt} = -\lambda u + \nu\nabla^2 u + f \qquad (3.23)$
and is used as a null hypothesis to evaluate the modes of variability of the
data. The above process represents an extension of the simple spatial first-order
autoregressive model. In Eq. (3.23) λ and ν represent, respectively, damping and
diffusion parameters, and f is a spatial and temporal white noise process. Figure 3.11
shows the leading EOF of SST anomalies along with its PC and the time series of
the data at a point located in the south western part of the Indian Ocean. The data
span the period 1870–2005.
Figure 3.12 compares the data covariance matrix spectrum with that of a fitted
homogeneous diffusion process and suggests consistency with the null hypothesis,
Fig. 3.11 Leading EOF of the Indian Ocean SST anomalies (top), the associated PC (middle) and
the SST anomaly time series at the centre of the domain (0.5◦ S, 56.5◦ E) (bottom). Adapted from
Hannachi and Dommenget (2009)
Fig. 3.12 Spectrum of the covariance matrix of the Indian Ocean SST anomalies, with the
approximate 95% confidence limits, along with the spectrum of the fitted homogeneous diffusion
process following (3.17). Adapted from Hannachi and Dommenget (2009)
particularly for the leading few modes of variability. The issue here is the existence
of a secular trend, which invalidates the test. For example, Fig. 3.13 shows the time
series distribution of the SST anomalies averaged over the Indian Ocean, which
shows significant departure from normality. This departure is ubiquitous in the basin
as illustrated in Fig. 3.14. Hannachi and Dommenget (2009) applied a differencing
operator to the data to remove the trend. Figure 3.15 shows the spectrum compared
to that of similar diffusion process of the differenced fall SST anomalies. The
leading EOF of the differenced data (Fig. 3.16), reflecting the Indian Ocean dipole,
can be interpreted as an intrinsic mode of variability.
Another possible geometric interpretation of EOFs is possible with multinormal
data. In fact, if the underlying probabilistic law generating the data matrix is the
multivariate Gaussian, or multinormal, i.e. the probability density function of the
vector x is
$f(x) = \frac{1}{(2\pi)^{p/2}\, |\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right], \qquad (3.24)$

where μ and Σ are respectively the mean and the covariance matrix of x and |Σ| is the determinant of Σ, then the interpretation of the EOFs is straightforward. Indeed,
in this case the EOFs represent the principal axes of the ellipsoid of the distribution.
Fig. 3.13 Time series of the SST anomalies averaged over the box (0–4◦ S, 62–66◦ E) (a), its
histogram (b) and its quantile-quantile (c). Adapted from Hannachi and Dommenget (2009)
Fig. 3.14 Grid points where the detrended SST anomalies over the Indian Ocean are non-
Gaussian, based on a Lilliefors test at the 5% significance level. Adapted from Hannachi and
Dommenget (2009)
Fig. 3.15 Same as Fig. 3.12 but for the detrended Indian Ocean fall SST anomalies. Adapted from
Hannachi and Dommenget (2009)
Fig. 3.16 Leading EOF of the detrended fall Indian Ocean SST anomalies. Adapted from
Hannachi and Dommenget (2009)
This is discussed below in Sect. 3.7. The ellipsoid is given by the isolines10 of (3.24).
Furthermore, the PCs in this case are independent.
EOFs from the covariance matrix find new variables that successively maximise
variance. By contrast the EOFs from the correlation matrix C, i.e. the sample version of the population correlation matrix, attempt to maximise correlation instead. The correlation-based EOFs are
obtained using the covariance matrix of the standardised or scaled data matrix (2.22)
Xs = XD−1/2 , where D = diag(S). Therefore all the variables have the same
weight as far as variance is concerned. The correlation-based EOFs can also be
obtained by solving the generalised eigenvalue problem

$D^{-1} S a = \lambda^2 a. \qquad (3.25)$
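Equivalently, and as a sketch only (F denoting the n × p anomaly matrix; names are illustrative), correlation-based EOFs can be obtained by standardising the columns before the SVD:

% Correlation-based EOFs via the standardised data matrix (sketch)
Fs = (F - mean(F, 1)) ./ std(F, 0, 1);         % standardise each variable
[~, S, U] = svd(Fs, 'econ');                   % columns of U: correlation-based EOFs
expvar = 100 * diag(S).^2 / sum(diag(S).^2);   % explained variance (%)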
One of the main features of EOFs is that the PCs of a set of variables depend
critically on the scale used to measure the variables, i.e. the variables’ units. PCs
change, in general, under the effect of scaling and therefore do not constitute a
unique characteristic of the data. This problem does not occur in general when all
the variables have the same unit. Note also that this problem does not occur when the
10 This interpretation extends also to a more general class of multivariate distributions, namely
the elliptical distributions. These are distributions whose densities are constant on ellipsoids. The
multivariate t-distribution is an example.
correlation matrix is used instead. This is particularly useful when one computes for
example EOFs of combined fields such as 500 mb heights and surface temperature.
Consider for simplicity two variables: geopotential height x1 at one location, and
zonal wind x2 at another location. The variables x1 and x2 are expressed respectively
in Pa and m s−1. Let z1 and z2 be the obtained PCs. The PCs' units will depend on the original variables' units. Let us assume that one wants the PCs to be expressed in hPa and km/h; then one could think of either premultiplying x1 and x2 respectively by 0.01 and 3.6 and then applying EOF analysis, or simply post-multiplying the PCs z1 and
z2 respectively by 0.01 and 3.6. Now the question is: will the results be identical?
The answer is no. In fact, if C is the diagonal scaling matrix containing the scaling constants, the scaled variables are given by the data matrix Xs = XC, whose PCs are given by the columns of Z obtained from an SVD of the scaled data. Now, one can post-multiply the SVD decomposition of X by C to yield $X C = V\Lambda (C U)^T$. The PCs of the scaled data, however, cannot be obtained by simply rescaling the original PCs, since $V\Lambda (C U)^T$ is no longer an SVD of XC. This is because CU is no longer orthogonal unless C is of the form $a I_p$, i.e. isotropic. This is known as the scaling
problem in EOF/PCA. One simple way to get around the problem is to use the
correlation matrix. For more discussion on the scaling problem in PCA refer, for
example, to Jolliffe (2002), Chatfield and Collins (1980), and Thacker (1996).
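The effect can be checked directly on synthetic data, as in the following sketch (all names and numbers are illustrative): the explained variances, and hence the EOFs, change when the variables are rescaled.

% Demonstration of the scaling problem on synthetic data (sketch)
rng(1);
X = randn(500, 3) * [1 0.6 0.2; 0 1 0.4; 0 0 1];   % synthetic anomalies
X = X - mean(X, 1);
C = diag([0.01 3.6 1]);                % unit conversions (e.g. Pa -> hPa)
[~, S1, U1] = svd(X,     'econ');      % EOF analysis in the original units
[~, S2, U2] = svd(X * C, 'econ');      % EOF analysis after rescaling
disp([100*diag(S1).^2/sum(diag(S1).^2), ...
      100*diag(S2).^2/sum(diag(S2).^2)])   % explained variances differ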
EOFs can be difficult to interpret in general. However, there are cases in which
EOFs can be understood in a geometric sense, and that is when the data come
from a multivariate normal random variable, e.g. Y, with distribution N(μ, Σ) with probability density function given by (3.24). Let λk and ak, k = 1, . . . , p, be the eigenvalues and associated (normalised) eigenvectors of the covariance matrix Σ, i.e. $\Sigma = A\Lambda A^T$, with $A = (a_1, \ldots, a_p)$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ and $A^T A = I_p$.
Now from a sample data matrix X, the sample mean μ̂ and sample covariance matrix
S are maximum likelihood estimates of μ and Σ respectively. Furthermore, when the eigenvalues of Σ are all distinct, the eigenvalues and EOFs of S are also maximum likelihood estimates (MLE) of λk and ak, k = 1, . . . , p, respectively (see e.g.
Anderson 1984; Magnus and Neudecker 1995; Jolliffe 2002). Using the pdf f (y) of
Y , see Eq. (3.24), the joint probability density function of the PCs Z = AT (Y − μ)
is given by
$f(z) = (2\pi)^{-p/2}\left(\prod_{k=1}^{p}\lambda_k\right)^{-1/2}\exp\left(-\frac{1}{2}\sum_{k=1}^{p}\frac{z_k^2}{\lambda_k}\right), \qquad (3.26)$
The isolines of the density (3.24) are the ellipsoids $(y - \mu)^T\Sigma^{-1}(y - \mu) = \alpha$, whose principal axes are given by the EOFs.
It is shown above that EOFs are obtained as the solution of an eigenvalue problem.
EOFs can also be formulated through a matrix optimisation problem. Let again X be an n × p data matrix, decomposed using the SVD as $X = \sum_{k=1}^{p}\sqrt{\lambda_k}\, v_k u_k^T$. The sample covariance matrix can then be written as $S = \frac{1}{n-1}\sum_{k=1}^{p}\lambda_k\, u_k u_k^T$. Keeping the first r < p EOFs is equivalent to truncating the previous sum by keeping the first r terms to yield the filtered data matrix $X_r = \sum_{k=1}^{r}\sqrt{\lambda_k}\, v_k u_k^T$, and similarly for the associated covariance matrix $S_r$. The covariance matrix $S_r$ of the filtered data can
also be obtained as the solution of minimising a measure φ of the discrepancy between S and Y over the set of positive semi-definite matrices Y of rank r (see Appendix D). So $S_r$ provides the best approximation to S in this sense, and the minimum is in fact $\phi(S_r) = \sum_{k=r+1}^{p}\lambda_k$.
The expression of the data matrix as the sum of the contribution from different
EOFs/PCs provides a direct way of filtering the data. The idea of filtering the data can be formalised by writing

$X = Z A^T + E, \qquad (3.28)$

where E is a residual (noise) matrix, and the n × r matrix Z and p × r matrix A are obtained by minimising the residual sum of squares $\phi(Z, A) = \|X - Z A^T\|^2$. The minimum is then

$\min \phi(Z, A) = \sum_{k=r+1}^{p}\lambda_k.$
In other words A is the matrix of the first r EOFs, and Z is the matrix of the
associated PCs. This way of obtaining Z and A is referred to as the one-mode
component analysis (Magnus and Neudecker 1995), and attempts to reduce the
number of variables from p to r. Magnus and Neudecker (1995) also extend it to
two- and more mode analysis.
Remark Let xt , t = 1, . . . , n, be the data time series that we suppose to be centred,
and define $z_t = A^T x_t$, where A is a p × m matrix. Let also $S = U\Lambda^2 U^T$ be the decomposition of the sample covariance matrix into eigenvectors $U = (u_1, \ldots, u_p)$ and eigenvalues $\Lambda^2 = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_p^2)$, where the eigenvalues are arranged
in decreasing order. Then the following three optimisation problems are equivalent
in that they yield the same solution:
• Least square sum of errors of the reconstructed data (Pearson 1901), i.e.

$\min_{A}\ \sum_{t=1}^{n}\left\| x_t - A A^T x_t \right\|^2 = n\sum_{k=m+1}^{p}\lambda_k^2.$
3.9.1 Teleconnectivity
Teleconnection maps (Wallace and Gutzler 1981) are obtained using one-point
correlation where a base point is correlated to all other points. A teleconnection
map is simply a map of row (or column) of the correlation matrix C = (cij ) and
is characterised particularly by a (nearly) elliptical region of positive correlation
around the base point with correlation one at the base point, featuring a bullseye
to use the term of Wallace and Gutzler (1981). Occasionally, however, this main
feature can be augmented by another centre with negative correlations forming
hence a dipolar structure. It is this second centre that makes a big difference
between base points. Figure 3.18 shows an example of correlation between All
Fig. 3.18 Correlation between All India Rainfall (AIR) and September Mediterranean evaporation
India monsoon Rainfall (AIR) index, a measure of the Asian Summer Monsoon
strength, and September Mediterranean evaporation. AIR11 is an area average of 29 subdivisional rainfall amounts for all months over the Indian subcontinent.
The data used in Fig. 3.18 is for Jun–Sep (JJAS) 1958–2014. There is a clear
teleconnection between Monsoon precipitation and Mediterranean evaporation with
an east–west dipole. Stronger monsoon precipitation is normally associated with
stronger (weaker) evaporation over the western (eastern) Mediterranean and vice
versa. There will always be differences between teleconnection patterns even
without the second centre. For instance some will be localised and others will be
spread over a much larger area. One could also have more than one positive centre.
Using the teleconnection map, one can define the teleconnectivity $T_i$ at the ith grid point as

$T_i = -\min_j c_{ij}.$
The obtained teleconnectivity map is a special pattern and provides a simple way to
locate regions that are significantly inter-related in the correlation context.
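A teleconnectivity map is straightforward to compute; as a sketch (F being the n × p anomaly matrix, one column per grid point, and p1 × p2 the grid dimensions; names are illustrative):

% One-point correlation maps and teleconnectivity (sketch)
C = corrcoef(F);              % row i is the teleconnection map of grid point i
T = -min(C, [], 2);           % teleconnectivity T_i = -min_j c_ij
Tmap = reshape(T, p1, p2);    % teleconnectivity map on the lat-lon grid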
The idea of one-point correlation can be extended to deal with linear relationships
between fields such as SST variable at a given grid point, or even any climate index,
correlated with another field such as geopotential height. These simple techniques
are widely used in climate research and do reveal sometimes interesting features.
$D^{-1} S v = \lambda^2 v \qquad (3.30)$
11 http://www.m.monsoondata.org/india/allindia.html.
of R is the same as that of the symmetric matrix $D^{-1/2} S D^{-1/2}$. Furthermore, from the SVD of S we get $D^{-1/2} S D^{-1/2} = D^{-1/2} U \Lambda^2 U^T D^{-1/2}$. Now, since $T = D^{-1/2} U$ is not orthogonal, $\Lambda^2$ does not directly provide the spectrum of $D^{-1/2} S D^{-1/2}$; the eigenvectors of R are $v_k = D^{-1/2} a_k$, where $a_k$, k = 1, . . . , p, are the eigenvectors of $D^{-1/2} S D^{-1/2}$.
Remark The EOFs of the correlation matrix C are linearly related to the regression-based EOFs. The correlation matrix is $C = D^{-1/2} S D^{-1/2}$, and therefore the eigenvectors $v_k$ of R are related to the eigenvectors $a_k$ of C by $v_k = D^{-1/2} a_k$, k = 1, . . . , p.
Aij = 1{Sij −Tij ≥0} (1 − δij ), where 1X is the indicator function of set X, δij is
the Kronecker symbol and Tij is a threshold parameter, which may be constant.
Note that self-interactions are not included in the adjacency matrix. A number
of parameters are then defined from this adjacency matrix, such as closeness and
betweenness, which can be compared to EOFs, for example, and hence identify processes and patterns that are not accessible from linear measures of association.
Examples of those processes include synchronisation of climatic extreme events
(Malik et al. 2012; Boers et al. 2014), and reconstruction of causal interactions, from
a statistical information perspective, between climatic sub-processes (e.g. Ebert-
Uphoff and Deng 2012; Runge et al. 2012, 2014). More discussion is given in
Chap. 7 (Sect. 7.7) in relation to recurrence networks.
An example of connection is shown in Fig. 3.19 (left panel) based on monthly
sea level pressure field (1950–2015) from NCEP/NCAR reanalysis. The figure
shows connections between two locations, one in Iceland (60N, 330E) and the
other in north east Pacific (30N, 220E), and all other grid points. A connection
is defined when the correlation coefficient is larger than 0.3. Note, in particular,
the connections between the northern centre of the NAO (around Iceland) and
remote places in the Atlantic, North Africa and Eurasia. The middle panel of
Fig. 3.19 shows a measure of the total number of connections at each grid point.
It is essentially proportional to the fraction of the total area that a point is connected
to (Tsonis et al. 2008, 2008). This is similar to the degree defined in climate network,
see e.g. Donges et al. (2015). High connections are located in the NAO and PNA
regions, and also Central Asia. Note that if, for example, the PNA is removed from the SLP field (e.g. Tsonis et al. 2008), by regressing out the PNA time series, then the total number of connections (Fig. 3.19, right panel) mostly features the NAO pattern. We note here that the PNA pattern is normally defined, and better obtained, with the geopotential height anomalies at mid-tropospheric levels, and therefore results with, say, 500-hPa geopotential heights give clearer pictures (Tsonis et al. 2008).
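A rudimentary version of such a correlation-based network (a sketch only, with F the n × p anomaly matrix and 0.3 the correlation threshold quoted above) can be built as follows:

% Simple correlation-based "climate network" and its degree field (sketch)
C   = corrcoef(F);                       % p x p correlation matrix
A   = double(C >= 0.3) - eye(size(C));   % adjacency matrix, no self-connections
deg = sum(A, 2);                         % number of connections per grid point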
Fig. 3.19 Connections between two points (one in Iceland and one in north east Pacific) and all
other gridpoints for which monthly SLP (1950–2015) correlation is larger than 0.3 superimposed
on the SLP climatology (left), total number of connections (see text for details) defined at each
grid point (middle), and same as middle but when the PNA time series was regressed out from the
SLP field (right). Units (left) hPa
4 Rotated and Simplified EOFs

4.1 Introduction
In the previous chapter we have listed some problems that can be encountered
when working with EOFs, not least the physical interpretation caused mainly by
the geometric constraint imposed upon EOFs and PCs, such as orthogonality,
uncorrelatedness, and domain dependence. Physical modes are inter-related and
tend to be mostly non-orthogonal, or correlated. As an example, normal modes derived from linearised physical models (Simmons et al. 1983) are non-orthogonal, unlike EOFs. Furthermore, EOFs tend to be size and shape
domain-dependent (Horel 1981; Richman 1986, 1993; Legates 1991, 1993). For
instance, the first EOF pattern tends to have wavenumber one sitting on the whole
domain. The second EOF, on the other hand, tends to have wavenumber two and
be orthogonal to EOF1 regardless of the nature of the physical process involved
in producing the data, and this applies to subsequent EOFs. In his detailed review,
Richman (1986) maintains that EOFs exhibit four characteristics that hamper their
utility to isolate individual modes of variation. These are
• domain dependence,
• subdomain instability,
• sampling problems and
• inaccurate relationship to physical phenomena.
If the objective of EOFs is to reduce the data dimension, then the analysis can be
acceptable. If, however, one is looking to isolate patterns for physical interpretation,
then clearly as stated above EOFs may not be the best choice.
To overcome some of the drawbacks caused by the geometric constraints,
researchers have looked for an alternative through linear transformation of the
EOFs. The concept of rotation emerged in factor analysis and has been proposed
since the work of Thurstone (1947) in social science. In atmospheric science, rotated
EOFs (REOFs) have been applied nearly three decades later and continue to be
widely used (Horel 1981; Richman 1981, 1986; Preisendorfer and Mobley 1988;
Cheng et al. 1995). The review of Richman (1986) provides a particularly detailed
discussion of the characteristics of unrotated EOFs. REOFs yield simpler structures,
compared to EOFs, by rotating the vector of loadings or EOFs hence losing some
of the nice geometric properties of EOFs, in favour of yielding better interpretation.
REOFs, however, have their own shortcomings such as how to choose the number
of EOFs to be rotated and the rotation criteria that specify the simplicity.
The objective of pattern simplicity is manifold. Most important is perhaps the
fact that simple patterns avoid the trap of mixing, which is a main feature of EOFs.
Simple patterns and their time amplitude series cannot be spatially orthogonal
and temporally uncorrelated simultaneously. Furthermore, propagating planetary
waves (Hoskins and Karoly 1981) tend to follow wave guides (Hoskins and Ambrizzi 1993; Ambrizzi et al. 1995) because of the presence of critical lines (Held 1983;
Killworth and McIntyre 1985). Physically relevant patterns are therefore expected
to be more local or simple, i.e. with zeros outside the main centres of action.
A number of alternatives have been developed to construct simple structure
patterns without compromising the nice properties of EOFs, namely variance
maximisation and space–time orthogonality (Jolliffe et al. 2002; Trendafilov and
Jolliffe 2006; Hannachi et al. 2006). This chapter discusses these methods and their
usefulness in atmospheric science.
4.2 Rotation of EOFs

Horel (1981) and Richman (1981, 1986) argued that EOFs can be too non-local and dependent on the size and the shape of the spatial domain. Thurstone (1947, pp. 360–361) applied rotated factors and pointed out that invariance, or constancy, of a solution (e.g. factors or EOFs) when the domain changes is a fundamental necessity
if the solution is to be physically meaningful (see also Horel 1981). The previous
problems encountered with EOFs have led atmospheric researchers to geometrically
transform EOFs by introducing the concept of rotation in EOF analysis.
Rotated EOF (REOF) technique is based on rotating the EOF patterns or the PCs,
and has been adopted by atmospheric scientists since the early 1980s (Horel 1981;
Richman 1981, 1986). The technique, however, is much older and goes back to the
early 1940s when it was first suggested and applied in the field of social science1
(Thurstone 1940, 1947; Carroll (1953)). The technique is also known in factor
analysis as factor rotation and aims at getting simple structures. In atmospheric
science the main objective behind rotated EOFs is to obtain
• a relaxation of some of the geometric constraints
• simple and more robust spatial patterns
• simple temporal patterns
• an easier interpretation.
In this context simplicity refers in general to patterns with compact/confined
structure. It is in general accepted that simple/compact structures tend to be more
robust and more physically interpretable. To aid interpretation one definition of
simplicity is to drive the EOF coefficients (PC loadings) to have either small or
large magnitude with few or no intermediate values. Rotation of EOFs, among other
approaches, attempts precisely to achieve this.
Simply put, rotated EOFs are obtained by applying a rotation to a selected set of
EOFs explaining say a given percentage of the total variance. Rotation has been
applied extensively in social science and psychometry, see for example Carroll
(1953), Kaiser (1958), and Saunders (1961), and later in atmospheric science (e.g.
Horel 1981; Richman 1986). Let us denote by Um the p × m matrix containing
the first m EOFs u1 , u2 , . . . um that explain a given amount of variance, i.e. Um =
(u1 , u2 , . . . um ). Rotating these EOFs yields m rotated patterns Bm given by
Bm = Um R = (b1 , b2 , . . . , bm ) , (4.1)
where R = (rij) is an m × m rotation matrix. The obtained patterns $b_k = \sum_{j=1}^{m} r_{jk}\, u_j$, k = 1, . . . , m are the rotated EOFs (REOFs). In (4.1) the rotation matrix R has to
satisfy various constraints that reflect the simplicity criterion of the rotation, which
will be discussed in the next section.
As for EOFs, the amplitudes or the time series associated with the REOFs are
also obtained by projecting the data onto the REOFs, or equally by similarly rotating
the PCs matrix using the same rotation matrix R. The rotated principal components
C = (c1 , c2 , . . . , cm ) are given by
1 Before the availability of high speed computers, pattern rotation used to be done visually, which
made it somehow subjective because of the lack of a quantitative measure and the possibility of
non-reproducibility of results.
$C = V_m \Lambda_m R, \qquad (4.2)$
where Vm is the matrix of the leading (standardised) PCs, and $\Lambda_m$ is the diagonal matrix containing the leading m singular values. It is also clear from (4.1) that

$B_m^T B_m = R^T U_m^T U_m R = R^T R, \qquad (4.3)$

and therefore the rotated patterns will be orthonormal if and only if R is unitary, i.e. $R R^T = I_m$. In this case the rotation is referred to as orthogonal, otherwise it is
oblique.
From Eq. (4.2) we also get a similar result for the rotated PCs. The covariance
matrix of the rotated PCs is proportional to

$C^T C = R^T \Lambda_m^2 R. \qquad (4.4)$
Equation (4.4) shows that if the rotation is orthogonal the rotated PCs (RPCs)
are no longer uncorrelated. If one chooses the RPCs to be uncorrelated, then the
REOFs are non-orthogonal. In conclusion REOFs and corresponding RPCs cannot
be simultaneously orthogonal and uncorrelated respectively. In summary rotation
compromises some of the nice geometric properties of EOFs/PCs to gain perhaps a
better interpretation.
Rotation of the EOF patterns can systematically alter the structures of EOFs. By
constraining the rotation to maximise a simplicity criterion the rotated EOF patterns
can be made simple in the literal sense. Given a p ×m matrix Um = (u1 , u2 , . . . um )
of the leading m EOFs (or loadings), the rotation is formally achieved by seeking a
m × m rotation matrix R to construct the rotated EOFs B given by Eq. (4.1): The
criterion for choosing the rotation matrix R is what constitutes the rotation algorithm
or the simplicity criterion, and is expressed by the maximisation problem

$\max_{R} f(B) = f(U_m R), \qquad (4.5)$

subject to

$R R^T = R^T R = I_m, \qquad (4.6)$

where m is the number of EOFs chosen for rotation. The most widely used simplicity criterion is the VARIMAX criterion (Kaiser 1958), which maximises

$f(B) = \sum_{k=1}^{m}\left[\frac{1}{p}\sum_{j=1}^{p} b_{jk}^4 - \left(\frac{1}{p}\sum_{j=1}^{p} b_{jk}^2\right)^2\right]. \qquad (4.7)$

The quantity inside the square brackets in (4.7) is proportional to the (spatial) variance of the square of the rotated vector $b_k = (b_{1k}, \ldots, b_{pk})^T$. Therefore the VARIMAX attempts to simplify the
structure of the patterns by tending the loadings coefficients towards zero, or ±1. In
various cases, the loadings of the rotated EOFs B are weighted by the communalities of the different variables (Walsh and Richman 1981). The communalities $h_j^2$, j = 1, . . . , p, are directly proportional to $\sum_{k=1}^{m} a_{jk}^2$, i.e. the sum of squares of the loadings for a particular variable (grid point). Hence if $C = \left[\mathrm{Diag}\left(U_m U_m^T\right)\right]^{-1/2}$,
then in the weighted or normalised VARIMAX, the matrix B in (4.7) is simply
replaced by BC. This normalisation is generally used to reduce the bias toward the
first EOF with the largest eigenvalue.
Another familiar orthogonal rotation method is based on the QUARTIMAX
simplicity criterion (Harman 1976). It seeks to maximise the variance of the
patterns, i.e.
$f(B) = \frac{1}{mp}\sum_{k=1}^{m}\sum_{j=1}^{p}\left[ b_{jk}^2 - \frac{1}{mp}\sum_{l=1}^{m}\sum_{i=1}^{p} b_{il}^2 \right]^2. \qquad (4.8)$
Because of the orthogonality property (4.6) required by R, the rotated EOFs matrix
also satisfies BT B = Im . Therefore the sum of the squared elements of B is constant,
and the QUARTIMAX simply boils down to maximising the fourth-order moment
of the loadings, hence the term QUARTIMAX, and is based on the following
maximisation problem:
$\max\left[ f(B) = \frac{1}{mp}\sum_{k=1}^{m}\sum_{j=1}^{p} b_{jk}^4 \right]. \qquad (4.9)$
Computation of REOFs

The various rotation criteria lead to constrained optimisation problems of the general form

$\min f(x) \quad \text{s.t.} \quad g_k(x) = c_k, \ k = 1, \ldots, p, \qquad (4.12)$

which can be handled by introducing the Lagrangian

$H(x, \lambda) = f(x) + \sum_{k=1}^{p}\lambda_k\left(g_k(x) - c_k\right) = f(x) + \lambda^T g, \qquad (4.13)$
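For concreteness, one common way to carry out the VARIMAX rotation numerically is the SVD-based iteration sketched below; this is an illustrative implementation, not necessarily the algorithm used here, with A the p × m matrix of leading EOFs (loadings) and an arbitrary tolerance:

function [B, R] = varimax_sketch(A)
% Illustrative SVD-based VARIMAX rotation of the loading matrix A (p x m)
[p, m] = size(A);
R = eye(m);  d = 0;
for it = 1:500
    B = A * R;
    G = A' * (B.^3 - B * diag(sum(B.^2, 1)) / p);   % gradient-like matrix
    [U, S, V] = svd(G);
    R = U * V';                                     % updated rotation matrix
    dold = d;  d = sum(diag(S));
    if d < dold * (1 + 1e-8), break, end            % stop when criterion stalls
end
B = A * R;                                          % rotated EOFs (REOFs)
end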
Fig. 4.4 Scatter plot of VARIMAX REOFs versus QUARTIMIN REOFs using m = 30 EOFs.
Note that scatter with negative slopes correspond to similar REOFs but with opposite sign. Adapted
from Hannachi et al. (2007)
4.3 Simplified EOFs: SCoTLASS

4.3.1 Background
Another problem in REOFs, not often stressed in the literature, is that after rotation
the order is lost, and basically all REOFs become equivalent2 in that regard. It is
clear that addressing some of these concerns will depend to some extent on what the
rotated patterns will be used for. A simplification technique that can overcome most
of these problems, and which in the meantime retains some of the nice properties of
EOFs, is desirable. Such a technique is described next.
Various simplification techniques have been suggested, see also Jolliffe (2002,
chapter 11). Most of these techniques attempt to reduce the two-step procedure of
rotated PCA into just one step. Here we discuss a particularly interesting method
of simplicity that is rooted in regression analysis. A common problem that arises
in multiple linear regression is instability of regression coefficients because of
colinearity or high dimensionality. Tibshirani (1996) has investigated this problem
and proposed a technique known as the Least Absolute Shrinkage and Selection
Operator (LASSO). In a least-squares multiple linear regression

$y = X\beta + \varepsilon,$

the LASSO estimates the regression coefficients by minimising the residual sum of squares subject to a bound on the sum of their absolute values, $\sum_j |\beta_j| \le t$. The same idea can be applied to the standard EOF problem (4.14–4.15), i.e. maximising $F(a_k) = \frac{1}{2} a_k^T S a_k$ (4.14) subject to the unit-norm and orthogonality constraints $a_k^T a_k = 1$ and $a_k^T a_j = 0$ for j < k (4.15). To achieve simplicity the LASSO-type technique requires the following extra constraint to be satisfied (Jolliffe et al. 2003):

$\|a_k\|_1 = \sum_{j=1}^{d} |a_{kj}| = a_k^T\, \mathrm{sign}(a_k) \le \tau \qquad (4.16)$
for some tuning parameter τ. In Eq. (4.16) $\mathrm{sign}(a_k) = (\mathrm{sign}(a_{k1}), \ldots, \mathrm{sign}(a_{kp}))^T$ is the sign of $a_k$. Because $\sum_{j=1}^{p} a_{kj}^2 = 1 \le \left(\sum_{j=1}^{p} |a_{kj}|\right)^2$, it is clear that the optimisation problem (4.14–4.16) is only possible for τ ≥ 1. Furthermore, since $\|a\|_1$ reaches its maximum over the unit sphere $\sum_{i=1}^{p} a_i^2 = 1$ only when all the components are equal, we get $\|a\|_1 \le \sqrt{p}$. Hence if $\tau \ge \sqrt{p}$ we regain conventional EOFs.
Consequently EOFs can be regarded as a particular case of SEOFs. Figure 4.5
shows an example of the leading two SEOFs obtained with a threshold parameter
τ = 8. These patterns are orthogonal and they represent respectively the NAO and
the Pacific patterns. The centres of action are quite localised. These centres get
broader as τ increases. This is discussed in the next section.
$a_k^T\tanh(\gamma a_k) - \tau = \sum_{j=1}^{d} a_{kj}\tanh\left(\gamma a_{kj}\right) - \tau \le 0 \qquad (4.17)$
for some fixed large number γ . The problem (4.14–4.16) was solved by Trendafilov
and Jolliffe (2006), see also Hannachi et al. (2005), using the projected gradient
approach (Gill et al. 1981). To ease the problem further and to make it look like the
standard EOF problem (4.14–4.15), the nonlinear condition (4.17) is incorporated
into the function F () in Eq. (4.14) as an exterior penalty function, see e.g. Gill et al.
(1981). This means that this condition will be explicitly taken into account only if it
is violated. Hence if we designate by $P_e(x) = \max(x, 0)$ the exterior penalty function, then condition (4.17) can be incorporated into (4.14)
to yield the extended objective function:
$F_\mu(a_k) = \frac{1}{2} a_k^T S a_k - \mu P_e\!\left(a_k^T\tanh(\gamma a_k) - \tau\right) \qquad (4.18)$
to be maximised, and where μ is a large positive number. It is now clear from (4.18)
that (4.17) is not taken into account if it is satisfied, but when it is positive it is
penalised and is sought to be minimised. Note again that (4.18) is not differentiable,
and to make it so we use the fact that $\max(x, y) = \frac{1}{2}(x + y + |x - y|)$, and hence the exterior penalty function is replaced by $P(x) = \frac{1}{2} x\left[1 + \tanh(\gamma x)\right]$. Hence the smooth objective function to maximise becomes:
$F_\mu(a_k) = \frac{1}{2} a_k^T S a_k - \mu P\!\left(a_k^T\tanh(\gamma a_k) - \tau\right) \qquad (4.19)$
subject to the orthogonality condition (4.15). Figure 4.6 shows a plot of Fμ (EOF 1)
versus γ for μ = 1000, where EOF1 is the leading EOF of the winter SLP field. The
function becomes independent of γ for large values of this parameter. Hannachi et
al. (2006) found that the solution is invariant to changes in μ (for large values).
Various methods exist to solve the nonlinear constrained maximisation prob-
lem (4.15) and (4.19), such as steepest ascent, and projected/reduced gradient
methods (Gill et al. 1981). These methods look for linear directions of ascent
to achieve the optimum solution. In various problems, however, the search for
suitable step sizes (in line search) can be problematic particularly when the objective
function to be maximised is not quadratic, for which the algorithm can converge to
the wrong local maximum.
An elegant alternative approach to the linear search method is to look for a
smooth curvilinear trajectory to achieve the optimum. For instance the minimum
of an objective function F (x) can be achieved by integrating the system of ordinary
differential equations (ODE)
dx
= −∇F (x) (4.20)
dt
Fig. 4.6 Function Fμ (EOF 1) versus γ for μ = 1000. EOF1 is the leading EOF of winter SLP
anomalies. Adapted from Hannachi et al. (2006)
86 4 Rotated and Simplified EOFs
forward in time for a sufficiently long time using a suitably chosen initial condition
(Evtushenko 1974; Botsaris 1978; Brown 1986). In fact, if x∗ is an isolated local
minimum of F (x), then x∗ is a stable fixed point of the dynamical system (4.20),
see e.g. Hirsch and Smale (1974), and hence can be reached by integrating (4.20)
from some suitable initial condition. Such methods have been around since the
mid 1970 (Evtushenko 1974; Botsaris and Jacobson 1976, 1978) and can make
use of efficient integration algorithms available for dynamical systems. Trajectories
defined by second-order differential equations have also been suggested (Snyman
1982).
In the presence of constraints the gradient of the objective function to be
minimised (or maximised) has to be projected onto the tangent space of the feasible
set, i.e. the manifold or hypersurface satisfying the constraints (Botsaris 1979, 1981;
Evtushenko and Zhadan 1977; and Brown 1986). This is precisely what projected
gradient stands for (Gill et al. 1981). Now if Ak−1 = (a1 , a2 , . . . , ak−1 ), k ≥ 2,
is the set of the first k − 1 SEOFs, then the next kth SEOF ak has to satisfy the
following orthogonality constraints:
Therefore the feasible set is simply the orthogonal complement to the space spanned
by the columns of Ak−1 . This can be expressed conveniently using projection
operators. In fact, the following matrix:
k−1
πk = Id − al aTl (4.22)
l=1
provides the projection operator onto this space. Furthermore, the condition aTk ak =
1 is equivalent to (Id − ak aTk )ak = 0. Therefore the projection onto the feasible set
is achieved by applying the operator πk (Id − ak aTk ) to the gradient of the objective
function (4.19). Hence the solution to the SEOF problem (4.14–4.16) is provided by
the solution to the following system of ODEs:
d
ak = πk Id − ak aTk ∇Fμ (ak ) = πk+1 ∇Fμ (ak ). (4.23)
dt
The kth SEOF ak is obtained as the limit, when t → ∞, of the solution to Eq. (4.23).
This approach has been successfully applied by Jolliffe et al. (2003) and Trendafilov
and Jolliffe (2006) to a simplified example, and by Hannachi et al. (2005) to the sea
level pressure (SLP) field.
Figure 4.7 shows the leading SLP SEOF for τ = 18. The patterns get broader
as τ increases. The SEOF patterns depend on τ and they converge to the EOFs
as τ increases as shown in Fig. 4.8. Figure 4.9 shows the third SLP EOF pattern
for τ = 12 and τ = 16 respectively. For the latter value the pattern becomes
4.3 Simplified EOFs: SCoTLASS 87
nearly hemispheric. The convergence of SEOF1 to EOF1, as shown in Fig. 4.7, starts
√
around τ = 23 p
Hannachi et al. (2006) modified slightly the above system of ODEs. The kth
SEOF is obtained after removing the effect of the previous k − 1 SEOFs by
computing the residuals:
k−1
Yk = X Id − al aTl = Xπk (4.24)
l=0
88 4 Rotated and Simplified EOFs
Fig. 4.8 Variance ratio of simplified PC1 to that of PC1 versus parameter τ . Adapted from
Hannachi et al. (2007)
The k’th SEOF ak is then obtained as an asymptotic limit when t tends to infinity ,
i.e. stationary solution to the dynamical system:
d
ak = Id − ak aTk ∇Fμ(k) (ak ), (4.26)
dt
5.1 Background
The introduction of EOF analysis into meteorology since the late 1940s (Obukhov
1947; Fukuoka 1951; Lorenz 1956) had a strong impact on the course of weather
and climate research. This is because one major concern in climate research is the
extraction of patterns of variability from observations or model simulations, and the
EOF method is one such technique that provides a simple tool to achieve this. The
EOF patterns are stationary patterns in the sense that they do not evolve or propagate
but can only undergo magnitude and sign change. This is certainly a limitation if
one is interested in inferring the space–time characteristics of weather and climate,
since EOFs or REOFs, for example assume a time–space separation as expressed by
the Karhunen–Loéve expansion (3.1). For instance, one does not expect in general
EOFs to reveal to us the structure of the space–time characteristics of propagating
phenomena1 such as Madden–Julian oscillation (MJO) or quasi-biennial oscillation
(QBO), etc.
The QBO, for example, represents a clear case of oscillating phenomenon that
takes place in the stratosphere, which can be identified using stratospheric zonal
1 Inreality all depends on the variance explained by those propagating patterns. If they have
substantial variance these propagating patterns can actually be revealed by a EOF analysis when
precisely they appear as degenerate pair of eigenvalues and associated eigenvectors in quadrature.
wind. This wind is nearly zonally symmetric. Figure 5.1 shows the climatology
of the zonal wind for January and July from the surface to 1 mb level using the
European Reanalyses (ERA-40) from the European Centre for Medium Range
Weather Forecasting (ECMWF), for the period January 1958–December 2001. A
number of features can be seen. The tropospheric westerly jets are located near
250-mb height near 30–35◦ latitude. In January the Northern Hemisphere (NH) jet
is only slightly stronger than the Southern Hemisphere (SH) counterpart. In July,
Fig. 5.1 Climatology of the ERA-40 zonal mean zonal wind for January (a) and July (b) for the
period Jan 1958–Dec 2001. Adapted from Hannachi et al. (2007)
5.1 Background 93
however, the NH jet is weaker than the SH jet. This latter is stronger due in part to
the absence of boundary layer friction caused by mountains and land masses.
In the stratosphere, on the other hand, both easterly and westerly flows are
present. Stratospheric westerlies (easterlies) exist over most winter (summer)
hemispheres. The stratospheric westerly flow represents the polar vortex, which
is stronger on the winter time because of the stronger equator-pole temperature
gradient. Note also the difference in winter stratospheric wind speed between the
northern hemisphere, around 40–50 m/s at about 1-mb and the southern hemisphere,
around 90 m/s at the same height.
The above figure refers mainly to the seasonality of the zonal flow. The variability
of the stratospheric flow can be analysed after removing the seasonality. Figure 5.2
shows the variance of the zonal wind anomalies over the ERA-40 period. Most of the
variance is concentrated in a narrow latitudinal band around the region equatorward
of 15◦ and extends from around 70-mb up to 1-mb.
Figure 5.3 shows a time–height plot of the zonal wind anomalies at the equator
over the period January 1994–December 2001. A downward propagating signal is
identified between about 3 and 70-mb. The downward propagating speed is around
1.2 km/month. The period at a given level varies between about 24 and 34 months,
yielding an average of 28 months, hence quasi-biennial periodicity, see e.g. Baldwin
et al. (2001) and Hannachi et al. (2007) for further references.
To get better insight into space–time characteristics of various atmospheric
processes one necessarily has to incorporate time information into the analysis. This
is backed by the fact that atmospheric variability has significant auto- and cross-
Fig. 5.2 Variance of monthly zonal mean zonal wind anomalies, with respect to the mean seasonal
cycle, over the ERA-40 period. Adapted from Hannachi et al. (2007)
94 5 Complex/Hilbert EOFs
Fig. 5.3 Time–height plot of equatorial zonal mean zonal wind anomalies for the period January
1992–December 2001. Adapted from Hannachi et al. (2007)
where a is the wave amplitude and ω and φ are respectively its frequency and
phase shift (at the origin). Complex EOFs (CEOFs) are based on this representation.
There are, in principle, two ways to perform complex EOFs, namely “conventional”
complex EOFs and “Hilbert” EOFs. When we deal with a pair of associated climate
fields then conventional complex EOFs are obtained. Hilbert EOFs correspond to
the case when we deal with a single field, and where we are interested in finding
propagating patterns. In this case the field has to be complexified by introducing an
imaginary part, which is a transform of the actual field.
The method is similar to conventional EOFs except that it is applied to the complex
field obtained from a pair of variables such as the zonal and meridional components
u and v of the wind field U = (u, v) (Kundu and Allen 1976; Brink and Muench
1986; von Storch and Zwiers 1999; Preisendorfer and Mobley 1988). The wind field
Utl = U (t, sl ), defined at each location sl , l = 1, . . . p, and time t, t = 1, . . . n, can
be written using a compact complex form as
The complex covariance matrix is then obtained using the data matrix U = (Utl ) by
1
S= U ∗T U , (5.4)
n−1
96 5 Complex/Hilbert EOFs
1 ∗
n
skl = Utk Utl ,
n
t=1
ek = U u∗k (5.5)
e∗T
k el = λk δkl .
2
(5.6)
The complex EOFs and associated complex PCs are also obtained using the singular
value decomposition of U .
Any CEOF uk has a pattern amplitude and phase. The pattern of phase informa-
tion is given by
I m(uk )
φ k = arctan , (5.7)
Re(uk )
where Re() and I m() stand respectively for the real and imaginary parts, and where
the division is performed componentwise. The pattern amplitude of uk is given by
its componentwise amplitudes. This method of doing CEOFs seems to have been
originally applied by Kundu and Allen (1976) to the velocity field of the Oregon
coastal current. The conventional CEOFs are similar to conventional EOFs in the
sense that time ordering is irrelevant, and hence the method is mostly useful to
capture covarying spatial patterns between the two fields.
2 Since u∗T
k Suk = λk =
2 1
n−1 [Uuk ]∗T [Uuk ] ≥ 0.
5.3 Frequency Domain EOFs 97
the pair of lagged variables (xt , xt+τ ) for some chosen lag τ . The complex field is
defined by
yt = xt + ixt+τ . (5.8)
5.3.1 Background
Complex EOFs in spectral or time domain is a natural extension to EOFs and aims
at finding travelling patterns. In spectral domain, the method is based on an eigen-
decomposition of the cross-spectral matrix and therefore makes use of the whole
structure of the (complex) cross-spectral matrix. Ordinary EOFs method is simply
an application of frequency domain EOFs (FDEOFs) to contemporaneous data only.
It appears that the earliest introduction of complex frequency domain EOFs
(FDEOFs) in atmospheric context dates back to the early 1970s with Wallace and
Dickinson. Their work has stimulated the introduction of Hilbert EOFs, and we start
by reviewing FDEOFs first. The spectrum gives a measure of the contribution to the
variance across the whole frequency range. EOF analysis in the frequency domain
(Wallace and Dickinson 1972; Wallace 1972; Johnson and McPhaden 1993), see
also Brillinger (1981) for details, attempts to analyse propagating disturbances by
concentrating on a specific frequency band allowing thus the decomposition of
variance in this band while retaining phase relationships between locations.
98 5 Complex/Hilbert EOFs
1 −iωk
f (ω) = e γ (k), (5.9)
2π
k
and
π
γ (τ ) = eiτ ω f (ω)dω. (5.10)
−π
T
For a multivariate time series xt = xt1 , xt2 , . . . xtp , t = 1, 2, . . . , the
previous equations extend to yield respectively the cross-spectrum matrix F and
the autocovariance or lagged covariance matrix given by
1 −iωk
F(ω) = e (k) (5.11)
2π
k
and
π
(τ ) = eiτ ω F(ω)dω. (5.12)
−π
and gives the lagged covariance between the ith and jth variables. Because the cross-
spectrum matrix is Hermitian it is therefore diagonalizable, and can be factorised as
F = EDE∗T , (5.14)
3 Tohave a spectral representation of a continuous time series x(t), Wallace and Dickinson (1972)
used the concept of stochastic integrals as
5.3 Frequency Domain EOFs 99
complexified time series y(t) is then obtained, which involves the filtered time series
and its time derivative as its real and complex parts respectively. The EOFs and PCs
are then obtained from the real time series:
zt = Re [E(yt )] , (5.15)
where E() and Re[] stand, respectively, for the expectation and the real part
operators.
In practice, FDEOFs are based on performing an eigenanalysis of the cross-
spectrum matrix calculated in a small frequency band. Let u(ω) be the Fourier
transform (FT) of the (centred) field xt , t = 1, . . . n at frequency ω, i.e.
n
u(ω) = xt e−iωt . (5.16)
t=1
1
n−τ
Sτ = xt xTt+τ
n−τ
t=1
as
(ω) = Sτ e−iωτ .
τ
π
Note that the covariance matrix satisfies S = −π (ω)dω, and therefore the
spectrum gives a measure of the contribution to the variance across the whole
frequency range. The average of the cross-spectral matrix over the frequency band
[ω0 , ω1 ], i.e.
ω1
C= (ω)dω (5.17)
ω0
∞
x(t) = Re eiωt dε(ω) ,
0
where ε is an independent random noise and Re[.] stands for the real part. The filtered time
series
(i.e. spectral
components outside [ω, ω + dω]) by defining first x (t) =
is thend obtained
f
Re eiωt dε(ω) dω, from which they get z(t) = Re (1 − ωi dt )E(xf ) . This new time series then
satisfies E z(t)zT (t + τ ) = cos(ωτ )D(ω)dω.
100 5 Complex/Hilbert EOFs
“principal components” resulting from the FDEOF are obtained by projecting the
complexified time series onto the spectral domain EOFs.
Now since waves are coherent structures with consistent phase relationships at
various lags, and given that FDEOFs represent patterns that are uniform across
a frequency band, the leading FDEOF provides coherent structures with most
wave variance. The FDEOFs are then obtained as the EOFs of C (Brillinger
1981). Johnson and McPhaden (1993) have applied FDEOFs to study the spatial
structure of intraseasonal Kelvin wave structure in the Equatorial Pacific Ocean.
They identified coherent wave structures with periods 59–125 days. Because most
climate data spectra look reddish, FDEOF analysis may be cumbersome in practice
(Horel 1984). This is particularly the case if the power spectrum of an EOF, for
example is spread over a wide frequency band, requiring an averaging of the cross-
spectrum over this wide frequency range, where the theory behind FDEOFs is no
longer applicable (Wallace and Dickinson 1972).
To summarise the following bullet points provide the essence of FDEOFs:
• Conventional EOFs are simply frequency domain EOFs applied to contempora-
neous data only.
• FDEOFs are defined as the eigenvectors of the cross-spectrum matrix defined at
a certain frequency band ω ± dω.
• This means that all frequencies outside an infinitesimal interval around ω have to
be filtered.
The method, however is difficult to apply in practice. For instance, if the power
in the data is spread over a wide range, it is not clear how FDEOFs can be applied.4
There is also the issue related to the choice of the “right” frequency. Averaging
the cross-spectrum over a wider range is desirable but then the theory is no longer
valid (Wallace and Dickinson 1972). Note that averaging the cross-spectrum matrix
over the whole positive/negative frequency domain simply yields ordinary EOFs.
In addition to the previous difficulties there is also the problem of estimating
the power spectrum at a given frequency, given that the spectrum estimate is in
general highly erratic (see Chatfield 1996). Also and as pointed out by Barnett
(1983), the interactions between various climate components involve propagation
of information and irregular short term as well as cyclo-stationary, e.g. seasonality,
interactions. This complicated (non-stationary) behaviour cannot be analysed using
spectral techniques. These difficulties have led to the method being abandoned.
Many of the above problems can be handled by Hilbert EOFs discussed next.
4 For example, Horel (1984) suggests that many maps, one for each spectral estimate may be
studied.
5.4 Complex Hilbert EOFs 101
∞ λ
For example, P −∞ tdt = limλ→∞ −λ tdt = 0. Note that when the integral
is already convergent then it is identified to its Cauchy principal value. A direct
application of this integral is the Hilbert transform (Thomas 1969; Brillinger 1981).
Definition The Hilbert transform Hx(t), of the continuous signal x(t), is defined
by
1 x(s)
Hx(t) = P ds. (5.19)
π t −s
Note that the inverse of this transform is simply its opposite. This transform is
defined for every
∞signal x(t) in Lp , the space of functions whose pth power is
integrable, i.e. −∞ |x(t)|p dt < ∞. This result derives from Cauchy’s integral
formula5 and the function z(t) = x(t) − iHx(t) = a(t)eiθ(t) is analytic. In fact,
the Hilbert transform is the unique transform that defines an imaginary part so that
5 If f () is analytic over a domain D in the complex plane containing a simple path C0 , then
1 f (u)
f (z) = du
2iπ C0 u − z
102 5 Complex/Hilbert EOFs
the result is analytic. The Hilbert transform is defined as a convolution of x(t) with
the function 1t emphasizing therefore the local properties of x(t). Furthermore, using
the polar expression, it is seen that z(t) provides the best local fit of a trigonometric
function to x(t) and yields hence an instantaneous frequency of the signal and
provides information about the local rate of change of x(t).
The Hilbert transform y(t) = Hx(t) is related to the Fourier transform Fx(t) by
∞
1
y(t) = Hx(t) = Im Fx(s)e−ist ds , (5.20)
π 0
Equation (5.21) is obtained after remembering that the Hilbert transform is a simple
convolution of the signal x(t) with the function 1t . The filter transfer function is
therefore 1t , and the frequency response function is given by the principal value of
the Fourier transform of 1t . It is therefore clear from (5.21) that the Hilbert filter6
precisely removes the zero frequency but does not affect the modulus of all others.
The analytic signal z(t) has the same positive frequencies as x(t) but zero negative
frequencies.
for any z inside C0 . Furthermore, the expression f (t) = 12 (f (t) + i H[f (t)]) +
2 (f (t) − i H[f (t)]) provides a unique decomposition of f (t) into the sum of two analytic
1
Remark: Difference Between Fourier and Hilbert Transform Like Fourier trans-
form the Hilbert transform provides the energy 12 a 2 (t) = |z(t)|2 , and frequency
ω = − dθ(t)dt . In Hilbert transform these quantities are local (or instantaneous)
where at any given time the signal has one amplitude and one frequency. In
Fourier transform, however, the previous quantities are global in the sense that
each component in the Fourier spectrum covers the whole time span uniformly. For
instance, a spike in Fourier spectrum reflects a sine wave in a narrow frequency
band in the whole time span. So if we represent this in a frequency-time plot one
gets a vertical narrow band. The same spike in the Hilbert transform, however,
indicates the existence of a sine wave somewhere in the time series, which can be
represented by a small square in the frequency-time domain if the signal is short
lived. The Hilbert transform is particularly useful for transient signals, and is similar
to wavelet transform in this respect.7
Table 5.1 gives examples of familiar functions and their Hilbert transforms. The
function D(t) in Table 5.1 is known as Dawson’s integral defined by D(t) =
2 t
e−t 0 ex dx, and the symbol δ() refers to the spike function or Dirac delta
2
We suppose here that our continuous signal x(t) has been discretised at various
times tk = kt to yield the discrete time series xk , k = 0, ±1, ±2, . . . , where
xk = x(kt). To get the Hilbert transform of this time series, we make use of the
transfer function of the filter in (5.21) but now defined over [−π, π ] (why?), i.e.
where ψ() is the basis wavelet, a is a dilation factor and b is the translation of the origin. In this
transform higher frequencies are more localised and there is a uniform temporal resolution for
all frequency scales. A major problem here is that the resolution is limited by the basis wavelet.
Wavelets are particularly useful for characterising gradual frequency changes.
104 5 Complex/Hilbert EOFs
⎧
⎨ −1 for −π ≤ λ < 0
h(λ) = i sign(λ)I[−π,π ] (λ) = 0 for λ = 0 (5.22)
⎩
1 for 0 < λ < π.
To get the filter coefficients (in the time domain) we compute the frequency
response function, which is then expanded into Fourier series (Appendix C). The
transfer function (5.22) can now be expanded into Fourier series, after extending
it by periodicity to the real line then applying the discrete
π Fourier transform
(e.g. Stephenson 1973). The Fourier coefficients, ak = π1 −π h(λ)eikλ dλ, k =
0, 1, 2, . . ., become:8
0 if k = 2p
ak = (5.23)
− kπ
4
if k = 2p + 1.
that is,
2 xt−(2k+1) 2 1
yt = − = (xt+2k+1 − xt−2k−1 ) . (5.24)
π 2k + 1 π 2k + 1
k k≥0
The discrete Hilbert transform formulae (5.24) was also derived by Kress and
Martensen (1970), see also Weideman (1995), using the rectangular rule of inte-
gration applied to (5.19). Now the time series is expanded into Fourier series as
xt = a(ω)e−2iπ ωt (5.25)
ω
8 Note that this yields the following expansion of the transfer function into Fourier series as
then its Hilbert transform is obtained by multiplying (5.25) by the transfer function
to yield:
1
Hxt = yt = h(ω)a(ω)e−2π iωt . (5.26)
π ω
Note that in (5.26) the Hilbert transform has removed the zero frequency and phase
rotated the time series by π2 . The analytic (complex) Hilbert transform
i
zt = xt − iHxt = a(ω) 1 − h(ω) e−2iπ ωt (5.27)
ω
π
and
for ω = 0. Note that the first of these two equations is already known. It is therefore
clear that the co-spectrum is zero. This means that the cross-correlation between the
signal and its transform is odd, hence
Note in particular that (5.30) yields γxy (0) = 0, hence the two signals are
uncorrelated.
For the multivariate case similar results hold. Let us designate by xt , t = 1, 2, . . .
a d-dimensional signal and yt = Hxt , its Hilbert transform. Then the cross-spectrum
matrix, see Eqs. (5.28)–(5.29), is given by
Fy (ω) = Fx (ω)
Fxy (ω) = i sign(ω)Fx (ω), (5.31)
for ω = 0. Note that the Hilbert transform for multivariate signals is isotropic.
Furthermore, since the inverse of the transform is its opposite we get, using again
106 5 Complex/Hilbert EOFs
Using (5.31), the latter relationship yields, for the cross-covariance matrix, the
following property:
π π
xy = Fxy (ω)dω = −2 FIx = − yx . (5.33)
−π 0
The covariance matrix of the complexified signal is also related to the cross-spectra
Fx via
π π
z = 2 1 + sign(ω) Fx (ω)dω = 4 F∗x (ω)dω, (5.36)
−π 0
where a(ω) and b(ω) are vector Fourier coefficients, and since propagating distur-
bances require complex representation as in (5.2), Eq. (5.37) can be transformed to
yield the general (complex) Fourier decomposition:
zt = c(ω) e−iωt , (5.38)
ω
where precisely Re(zt ) = xt , and c(ω) = a(ω) + ib(ω). The new complex field
T
zt = zt1 , . . . ztp can therefore be written as
zt = xt − iH(xt ). (5.39)
is precisely the Hilbert transform, or quadrature function of the scalar field xt and
is seen to represent a simple phase shift by π2 in time. In fact, it can be seen that
the Hilbert transform, considered as a filter, removes the zero frequency without
affecting the modulus of all the others, and is as such a unit gain filter. Note that
if the time series (5.37) contains only one frequency, then the Hilbert transform is
simply proportional to the time derivative of the time series. Therefore, locally in
the frequency domain H(xt ) provides information about the rate of change of xt
with respect to time t.
Computational Aspects
In practice, various methods exist to compute the finite Hilbert transform. For a
scalar field xt of finite length n, the Hilbert transform H(xt ) can be estimated using
the discrete Fourier transform (5.40) in which ω becomes ωk = 2πn k , k = 1, . . . n2 .
Alternatively, H(xt ) can be obtained by truncating the infinite sum in Eq. (5.24).
This truncation can also be written using a convolution or a linear filter as (see e.g.
Hannan 1970):
L
H(xt ) = αk xt−k (5.41)
k=−L
2 πk
αk = sin2 .
kπ 2
108 5 Complex/Hilbert EOFs
Barnett (1983) found that 7 ≤ L ≤ 25 provides adequate values for L. For example
for L = 23 the frequency response function (Appendix C) is a band pass filter with
periods between 6 and 190 time units with a particular excellent response obtained
between 19 and 42 time units (Trenberth and Shin 1984). The Hilbert transform has
also been extended to vector fields, i.e. two or more fields, through concatenation
of the respective complexified fields (Barnett 1983). Another interesting method to
compute Hilbert transform of a time series is presented by Weideman (1995), using a
series expansion in rational eigenfunctions of the Hilbert transform operator (5.19).
The HEOFs uk , k = 1, . . . p, are then obtained as the eigenvectors of the
Hermitian covariance matrix
1 ∗T
n
S= zt zt = Sxx + Syy + i Sxy − Syx , (5.42)
n−1
t=1
where Sxx and Syy are covariance matrices of xt and yt respectively, and Sxy is
the cross-covariance between xt and yt and similarly for Syx . Alternatively, these
HEOFs can also be obtained as the right complex singular vectors of the data matrix
Z = (ztk ) using SVD, i.e.
p
Z = UV∗T = λk uk v∗T
k .
k=1
Note that (5.33) is the main difference between FDEOFs, where the integration is
performed over a very narrow (infinitesimal) frequency range ω0 ± dω, and Hilbert
EOFs, where the whole spectrum is considered. The previous SVD decomposition
T
expresses the complex map zt = zt1 , zt2 , . . . , ztp at time t as
r
zt = λk vtk u∗k , (5.43)
k=1
where vtk is the value of the k’th complex PC (CPC) vk at time t and r is the rank of
X. The computation of Hilbert EOFs is quite similar to conventional EOFs. Given
the gridded data matrix X(n, p12), the Hilbert transform in Matlab is given by
XH = hilbert(X);
Fig. 5.4 Spectrum of the Hermitian covariance matrix given by Eq. (5.42), of the zonal mean zonal
wind anomalies of the ERA-40 data. Vertical bars represent approximate 95% confidence limits.
Adapted from Hannachi et al. (2007)
The leading Hilbert EOF (real and complex parts) is shown in Fig. 5.5. The patterns
are in quadrature reflecting the downward propagating signal.
The time and spectral structure can be investigated further using the associated
complex (or Hilbert) PC. Figure 5.6 shows a plot of the real and complex parts of
the Hilbert PC1 along with the power spectrum of the former. The period of the
propagating signal comes out about 30 months. In a conventional EOF analysis
the real and complex parts of Hilbert EOF1 would come out approximately as
a pair of degenerate EOFs. Table 5.2 shows the percentage of the cumulative
explained variance of the leading 5 EOFs and Hilbert EOFs. The leading EOF pair
explain respectively about 39% and 37% whereas the third one explains about 8%
of the total. Table 5.2 also shows the efficiency of Hilbert EOFs in reducing the
dimensionality of the data compared to EOFs. This comes of course at a price,
namely the double size of the Hilbert covariance matrix.
Remark The real and imaginary parts of the CPC’s are Hilbert transform of
each other. In fact using the identity λk vk =
p ∗
j =1 ukj zj , where zk is the k’th
complexified variable (or time series at the kth grid point) and uk = (uk1 , . . . , ukp )
is the kth HEOF, we can apply the Hilbert transform to yield
p p
λk Hvk = ukj Hx ∗j + iHy ∗j = ukj y ∗j − ix ∗j = iλk vk ,
j =1 j =1
Fig. 5.5 Real (a) and imaginary (b) parts of the leading Hilbert EOF of the ERA-40 zonal mean
zonal wind anomalies. Adapted from Hannachi et al. (2007)
From this decomposition we get the spatial amplitude and phase functions
respectively:
Fig. 5.6 Time series of the Hilbert PC1 (a), phase portrait of Hilbert PC1 and the power spectrum
of the real part of Hilbert PC1 of the Era-40 zonal mean zonal wind anomalies. Adapted from
Hannachi et al. (2007). (a) Complex PC1: Real and imaginary parts. (b) Phase portrait of CPC1.
(c) Spectrum of real (CPC1)
Table 5.2 Percentage of explained variance of the leading 5 EOFs and Hilbert EOFs of the ERA-
40 zonal mean zonal wind anomalies
Eigenvalue rank 1 2 3 4 5
EOFs 39.4 37.4 7.7 5.5 2.3
Hilbert EOFs 71.3 10.0 7.7 2.8 1.9
Similarly, one also gets the temporal amplitude and phase functions as
where the vector product and division in (5.44) and (5.45) are performed compo-
nentwise. For each eigenmode, the amplitude map can be interpreted as a variability
map as in ordinary EOFs. The function θ k gives information on the relative
phase. For “simple” fields, its spatial derivative provides a measure of the local
wavenumber. Its interpretation for moderately complex fields/waves can be difficult
112 5 Complex/Hilbert EOFs
(Wallace 1972), and can be made easier by applying a prior filtering (Barnett 1983).
Also for simple waves, the time derivative of the temporal phase gives a measure of
the instantaneous frequency. Note that the phase speed of the wave of the kth mode
at time t and position x can be measured by dθ k (x)/dx
dφ k (t)/dt .
The amplitude and phase of the leading Hilbert EOF Eq. (5.44) are shown in
Fig. 5.7. The spatial amplitude shows that the maximum of wave amplitude is round
25-mb on the equator. It also shows the asymmetry of the amplitude in the vertical
Fig. 5.7 Spatial modes of the amplitude and phase of leading Hilbert EOF of the ERA-40 zonal
mean zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Spatial amplitude of
complex EOF1. (b) Spatial phase of complex EOF1
5.4 Complex Hilbert EOFs 113
direction. The spatial phase shows banded structure from 1-mb height down to
around 50-mb, where propagation stops, and indicates the direction of propagation
of the disturbance where the phase changes between −180◦ and 180◦ in the course
of complete cycle.
The temporal amplitude and phase, Eq. (5.45), are shown in Fig. 5.8 of the first
Hilbert PC of the ERA-40 zonal mean zonal wind anomalies for the period January
1992–December 2001. For example, the amplitude is larger in the middle of the
wave life cycle. The temporal phase, on the other hand, provides information on the
phase of the wave. For every wave lifecycle the phase is nearly quasi-linear, with
the slope providing a measure of the instantaneous frequency.
As with conventional EOFs, Hilbert EOFs can be used to filter the data. For the
example of the ERA-40 zonal mean zonal wind, the leading Hilbert EOF/PC can be
used to filter out the QBO signal. Figure 5.9 shows the obtained filtered anomalies
for the period Jan 1992–Dec 2001. The downward signal propagating is clearer than
the signal shown in Fig. 5.3. The average downward phase propagation is about
1 km/month.
The covariance matrix S in (5.42) is related to the cross-spectrum matrix
1
n−τ
(ω) = zt+τ z∗T
t e
−iωτ
n τ
t=1
Fig. 5.8 Time series of the amplitude and phase of the leading Hilbert EOF of ERA-40 zonal mean
zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Temporal amplitude of complex
PC 1. (b) Temporal phase of complex PC 1
114 5 Complex/Hilbert EOFs
Fig. 5.9 Filtered field of ERA-40 zonal mean zonal wind anomalies using Hilbert EOF1 for the
period January 1992–December 2001. Adapted from Hannachi et al. (2007)
via
ωN
S=2 (ω)dω, (5.46)
0
where ωN = 2t 1
and represents the Nyquist frequency and t is the time
interval between observations. This can be compared to (5.36). Note that since
the covariance matrix of xt is only related to the co-spectrum (i.e. the real part
of the spectrum) of the time series it is clear that conventional EOFs, based on
covariance or correlation matrix, does not take into consideration the quadrature
part of the cross-spectrum matrix, and therefore EOFs based on the cross-spectrum
matrix generalise conventional EOFs.
It is also clear from (5.46) that HEOFs are equivalent to FDEOFs with the cross-
spectrum integrated over all frequencies. Note that the frequency band of interest
can be controlled by prior smoothing. Horel (1984) points out that HEOFs can fail
to detect irregularly occurring progressive waves, see also Merrifield and Winant
(1989). Merrifield and Guza (1990) have shown that complex EOF analysis in the
time domain (HEOFs) is not appropriate for non-dispersive and broad-banded waves
in wavenumber κ relative to the largest separation measured (array size x). In
fact Merrifield and Guza (1990), see also Johnson and McPhaden (1993), identified
κx as the main parameter causing spread of propagating variability into more
than one HEOF mode, and the larger the parameter the lesser the captured data
variance. Barnett (1983) applied HEOFs to study relationship between the monsoon
5.5 Rotation of HEOFs 115
and the Pacific trade winds and found strong coupling particularly at interannual
time scales.
Although HEOFs constitute a very useful tool to study and identify propagating
phenomena (Barnett 1983; Horel 1984; Lazante 1990; Davis et al. 1991) such as
Kelvin waves in sea level or forced Rossby waves, the method suffers various
drawbacks. For example, HEOFs is unable to isolate, in one single mode, irregular
disturbance progressive waves Horel (1984). This point was also pointed out by
Merrifield and Guza (1990), who showed that the method can be inappropriate for
non-dispersive waves that are broadband in wavenumber relative to the array size.
More importantly, the drawbacks of EOFs, e.g. non-locality and domain depen-
dence (see Chap. 3) are inherited by HEOFs. Here again, rotation can come to their
rescue. Horel (1984) suggested the varimax rotation procedure to rotate HPCs using
real orthogonal rotation matrices, which can yield more realistic modes of variation.
The varimax rotation was applied later by Lazante (1990). Davis et al. (1991)
showed, however, that the (real) varimax procedure suffers a drawback related to a
lack of invariance to arbitrary complex rephasing of HEOFs. Bloomfield and Davis
(1994) proposed a remedy to rotation by using a complex unitary rotation matrix.
Bloomfield and Davis (1994) applied the complex orthogonal rotation to synthetic
examples and to sea level pressure. They argue that the rotated HPCs are easier to
interpret than the varimax rotation.
Chapter 6
Principal Oscillation Patterns and Their
Extension
6.1 Introduction
As was pointed out in the previous chapters, EOFs and closely related methods
are based on contemporaneous information contained in the data. They provide
therefore patterns that are by construction stationary in the sense that they do
not allow in general the detection of propagating disturbances. In some cases,
little dynamical information can be gained. The PCs contain of course temporal
information except that they only reflect the amplitude of the corresponding EOF.1
Complex Hilbert EOFs, on the other hand, have been conceived to circumvent
this shortcoming and allow for the detection of travelling waves. As pointed out
in Sect. 5.3.2, HEOFs are not exempt from drawbacks, including, for example,
difficulty in the interpretation of the phase information such as in the case of non-
dispersive waves or nonlinear dynamics, in addition to the drawbacks shared with
EOFs.
It is a common belief that propagating disturbances can be diagnosed using the
lagged structure of the observed field. For example, the eigenvectors of the lagged
1 There are exceptions, and these happen when, for example, there is a pair of equal eigenvalues
separated from the rest of the spectrum.
covariance matrix at various lags can provide information on the dynamics of the
propagating structures. The atmospheric system is a very complex dynamical system
par excellence, whose state can be well approximated by an evolution equation or
dynamical system:
d
x = F(x, t), (6.1)
dt
where the vector x(t) represents the state of the atmosphere at time t, and F
is a nonlinear function containing all physical and dynamical processes such as
nonlinear interactions like nonlinear advection, and different types of forcing such
as radiative forcing, etc.
Our knowledge of the state of the system is and will be always partial, and this
reflects back on our exploration of (6.1). For example, various convective processes
will only be parametrised, and subscale processes are considered as noise, i.e.
known statistically. A first and important step towards exploring (6.1) consists in
looking at a linearised version of this dynamical system. Hence a more simplified
system deriving from (6.1) reads
d
x = Bx + εt , (6.2)
dt
where ε t is a random forcing taking into account the non-resolved processes, which
cannot be represented by the deterministic part of (6.2). These include subgrid scales
and nonlinear effects. This model goes along with the assumption of Hasselmann
(1988), namely that the climate system can be split into two components: a
deterministic or signal part and a nondeterministic or noise part. Equation (6.2) is a
simple linear stochastic system that can be studied analytically and can be compared
to observed data. This model is known as continuous Markov process or continuous
first-order (multivariate) autoregressive (AR(1)) model and has nice properties. It
has been frequently used in climate studies (Hasselmann 1976, 1988; Penland 1989;
Frederiksen 1997; Frederiksen and Branstator 2001, 2005).
In practice Eq. (6.2) has to be discretised to yield a discrete multivariate AR(1),
which may look like:
in which t can be absorbed in B and can assume that t = 1, see also next sections
for time-dependent POPs.
Remark Note that when model data are used the vector xt may be complex,
containing, for example, spectral coefficients derived, for instance, from spherical
harmonics. Similarly, the operator A may also be complex. In the sequel, however,
it is assumed that the operator involved in Eq. (6.2) is real.
6.2 POP Derivation and Estimation 119
The AR(1) model (6.3) is very useful from various perspectives, such as the explo-
ration of observed data or the analysis of climate models or reduced atmospheric
models. This model constitutes the corner stone of principal oscillation pattern
(POP) analysis (Hasselmann 1988; von Storch et al. 1988, 1995; Schnur et al.
1993). According to Hasselmann (1988), the linear model part in (6.3), within the
signal subspace, is the POP model. If the simplified model takes into account the
system nonlinearity, then it is termed principal interaction pattern (PIP) model by
Hasselmann (1988). The main concern of POP analysis consists of an Eigen analysis
of the linear part of (6.3), see e.g. Hasselmann (1988), von Storch et al. 1988), and
Wikle (2004). POP analysis has been applied mostly to climate variability but has
started recently to creep towards other fields such as engineering and biomedical
science, see e.g. Wang et al. (2012).
We consider again the basic multivariate AR(1) model (6.3), in which the multivari-
ate process (εt ) is a white noise (in time) with covariance matrix Q = E εε T . The
matrix A is known as the feedback matrix (von Storch et al. 1988) in the discrete case
and as Green’s function in the continuous case (Riskin 1984; Penland 1989). POP
analysis attempts to infer empirical space–time characteristics of the climate system
using a simplified formalism expressed by (6.3). These characteristics are provided
by the normalised eigenvectors of the matrix A (or the empirical normal modes of
the multivariate AR(1) model (6.3)) and are referred to as the POPs. In (6.3) we
suppose that xt , like ε t , is zero mean. The autoregression matrix is then given by
−1
A = E xt+1 xTt E xt xTt = 1 −1
0 , (6.4)
T
where 1 is the lag-1 autocovariance matrix of xt = x1t , x2t , . . . xpt . That is, if
γij (τ ) is the lagged autocovariance between xit and xi,t+τ , then [ 1 ]ij = γij (1).
Recall that 1 is not symmetric.
If we now have a finite sample xt , t = 1, 2, . . . n, then an estimate of A is given
by
 = S1 S−1
0 , (6.5)
where S1 and S0 are, respectively, the sample lag-1 autocovariance matrix and the
covariance matrix of the time series. Precisely, we have the following result.
120 6 Principal Oscillation Patterns and Their Extension
n−1
n−1
F (A) = xt+1 − Axt 2
= Ft (A),
t=1 t=1
Let us forget for a moment the expectation operator E(.). The differential of Ft (A)
is obtained by computing Ft (A + H) − Ft (A) for any matrix H, with small norm.
We get
Ft (A + H) − Ft (A)
= −xTt Hxt−1 − xTt−1 HT xt + xTt−1 AT Hxt−1 + xTt−1 HT Axt−1 + O H 2
= DFt (A).H + O H 2 ,
DFt (A).H = −2xTt−1 HT xt +2xTt−1 HT Axt−1 = −2tr xt−1 xTt H − xt−1 xTt−1 AT H .
Now we can bring back either the expectation operator E() or the summa-
use the expectation operator, we get DFt (A).H =
tion. If, for example, we
−2tr −1 − 0 AT H . If the summation over the sample is used instead, we
obtain the same expression except that 1 and 0 are, respectively, replaced by S1
and S0 . Hence the minimum of F () satisfies DFt (A).H = 0 for all H, and this
yields (6.4).
The next step consists of computing the eigenvalues and the corresponding
eigenvectors.2 Let us denote by λk and uk , k = 1, . . . p, the eigenvalues and the
associated eigenvectors of A, respectively. The eigenvectors can be normalised to
have unit norm but are not orthogonal.
Remarks
• Because A is not symmetric, the eigenvalues/eigenvectors can be complex, in
which case they come in conjugate pairs.
• The eigenvalues/eigenvectors of A (or similarly Â) are also solution of a
generalised eigenvalue problem.
Exercise Let L be an invertible linear transformation, and yt = Lxt . Show that the
eigenvalues of the feedback matrix are invariant under this transformation.
Hint We have yt+1 = LAL−1 yt , and the eigenvalues of A and LAL−1 are identical.
γij (τ )
Exercise Let ρij (τ ) = √ be the lagged cross-correlation between xit and
γii (0)γjj (0)
xj,t+τ , then |ρij (τ )| ≤ 1.
Hint Use |E(XY )|2 ≤ E(X2 )E(Y 2 ).
In the noise-free case (6.3) yields xt = At x0 , from which we can easily derive the
condition of stationarity. The state xt can be decomposed into a linear combination
of the eigenvectors of A as
r
xt = at(i) ai ,
i=1
(i)
at+1 = ci λti ,
−1 ≤ λ = ρz (1) ≤ 1.
The last condition is also satisfied for the sample feedback matrix. In particular,
the above inequality becomes strict in general since in a finite sample the lag-
1 autocorrelation is in general less that one. Even for the population, the above
122 6 Principal Oscillation Patterns and Their Extension
inequality tends to be strict. In fact the equality ρij (τ ) = ±1 holds only when
xi,t+τ = αxj,t .
The POPs U = u1 , . . . , up are the normalised (right) eigenvectors of A
satisfying AU = U, where = Diag λ1 , . . . , λp . Since the feedback matrix
is not symmetric, it also has left eigenvectors V = v1 , . . . , vp satisfying VT A =
VT . These left eigenvectors are the eigenvectors of the adjoint AT of A and are
known as the adjoint vectors of U (Appendix F). They satisfy VT U = UVT = Ip .
Precisely, U and V are the left and right singular vectors of A, i.e.
p
A = UVT = λk uk vTk .
k=1
The interpretation of POPs is quite different from EOFs. For example, unlike
POPs, the EOFs are orthogonal and real. von Storch et al. (1988) interpret the
real and imaginary parts of POPs as standing oscillatory and propagating patterns,
respectively. Also, in EOFs the importance of the patterns is naturally dictated by
their explained variance, but this is not the case for the POPs. In fact, there is no a
priori unique rule of pattern selection in the latter case. One way forward is to look
at the time series of the corresponding POP.
As for EOFs, each POP has an associated time series, or POP coefficients, showing
their amplitudes as a function of time. The k’th complex coefficient zk (t) at time t,
associated with the k’th POP,3 is the projection of xt onto the adjoint vk of the k’th
POP uk
Unlike PCs, the POP coefficients satisfy a stochastic model dictated by Eq. (6.3).
In fact, it can be seen by post-multiplying (6.3) by vk that zk (t) yields a (complex)
AR(1) model:
Remark The expected values of (6.7) decouple completely, and the dynamics are
described by damped spirals.
One would like to find out the covariance matrix of the new noise term in Eq. (6.7).
To this end, one assumes that the feedback matrix is diagonalisable so A = UU−1 .
The state vector xt is decomposed in the basis U of the state space following:
p
xt uk = Ux+
(k)
xt = t .
k=1
(k)
Note that the components xt are the coordinates of xt in this new basis. They are
not, in general, identical to the original coordinates. Similarly we get for the noise
term:
p
εt uk = Uε +
(k)
εt = t ,
k=1
where x+ +
(1) (p)
t = (xt , . . . , xt ) and similarly for ε t . After substituting the above
T
x+ + +
t+1 = xt + ε t+1 .
Because λk = |λk |e−iωk is inside the unit circle (for the observed sample), the
evolution of zk (t), in the noise-free case (zk (t) = λtk zk (0)), describes in the complex
plane a damped spiral with period Tk = 2π ωk and e-folding time τk = − log |λk | .
1
Within the normal modes, the variance of the POP coefficients, i.e. σk2 =
E(zk2 (t)), reflects the dynamical importance of the k’th POP. It can be shown that
ckl
σk2 = . (6.8)
1 − |λk |2
Remark The above excitation, which reflects a measure of the dynamics of eigen-
modes, is at odds with the traditional measure in which the mode with the least
damping (or largest |λk |) is the most important. The latter view is based on the noise-
free dynamics, whereas the former takes into consideration the stochastic forcing
and therefore seems more relevant.
Let us write the k’th POP uk as uk = urk + iuik and similarly for the POP
coefficient zk (t) = zkr (t) + izki (t). The noise-free evolution of the k’th POP mode
(taking for simplicity zk (0) = 1) is given by
zk (t)uk + zk∗ (t)u∗k = λtk uk + (λ∗k )t u∗k = 2|λk |t urk cos ωk t + uik sin ωk t .
Therefore, within the two-dimensional space spanned by urk and uik , the evolution
can be described by a succession of patterns, with decreasing amplitudes, given by
and these represent the states occupied by the POP mode at successive times tm =
2ωk , for m = 0, 1, 2, . . . (Fig. 6.1), see also von Storch et al. (1995). The result of
mπ
f (ω)
fz (ω) = . (6.9)
|λk − eiω |2
ur
k
O u (t )
k 0
uk(tm)
6.2 POP Derivation and Estimation 125
6.2.3 Example
POP analysis (Hasselmann 1988) was applied by a number of authors, e.g. von
Storch et al. (1988), von Storch et al. (1995), Xu (1993). Schnur et al. (1993),
for example, investigated synoptic- and medium-scale (3–25 days) wave activity in
the atmosphere. They analysed geopotential heights from ECMWF analyses for the
Dec–Jan–Feb (DJF) period in 1984–1987 for wavenumbers 5–9. Their main finding
was a close similarity between the most significant POPs and the most unstable
waves, describing the linear growth of unstable baroclinic structures with period 3–
7 days. Figure 6.2 shows POP1 (phase and amplitude) of twice-daily geopotential
height for zonal wavenumber 8. Significant amplitude is observed around 45◦ N
associated with a 90◦ phase shift with the imaginary part westward of the real
part. The period of POP1 is 4 days with an e-folding of 8.6 days. Combined with
the propagating structure of POP evolution, as is shown in Sect. 6.2.2, the figure
manifests an eastward propagating perturbation with a westward phase tilt with
height. The eastward propagation is also manifested in the horizontal cross-section
of the upper level POP1 pattern shown in Fig. 6.3.
Frederiksen and Branstator (2005), for example, investigated POPs of 300-
hPa streamfunction fields from NCAR/NCEP reanalysis and general circulation
4 TheDirac function is defined by the property δa (u)f (u)du = f (a), and in general, δ0 (u) is
simply noted as δ(u) .
126 6 Principal Oscillation Patterns and Their Extension
Fig. 6.2 Leading POP, showing phase (upper row) and amplitude (lower row), of twice-daily
ECMWF geopotential height field during DJF 1984–1987 for zonal wavenumber 8. Adapted from
Schnur et al. (1993). ©American Meteorological Society. Used with permission
Fig. 6.3 Cross-section at the 200-hPa level of POP1 of Fig. 6.2. Adapted from Schnur et al. (1993).
©American Meteorological Society. Used with permission
6.3 Relation to Continuous POPs 127
Fig. 6.4 Leading EOF (a) and POP (b) of the NCAR/NCEP 300-hPa streamfunction for March.
Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with
permission
model (GCM) simulations. Figure 6.4 shows the leading EOF and POP patterns
of NCAR/NCEP March 300-hPa streamfunction field. There is similarity between
EOF1 and POP1. For example, both are characterised by approximate large scale
zonal symmetry capturing midlatitude and subtropical variability. The leading POPs
are in general real with decay e-folding times. Figure 6.5 shows the POP2 for
January. It shows a Pacific North America (PNA) pattern signature and is similar
to EOF2. The leading POPs can be obtained, in general, from a superposition of the
first 5 to 10 EOFs as pointed out by Frederiksen and Branstator (2005). Figure 6.6
shows the average global growth rate of the leading 5 POPs across the different
months. The figure shows in particular that the leading POPs are all damped.
Various interesting relationships can be derived from (6.3), and the relationship
given in Eq. (6.4) is one of them. Furthermore by computing xt+1 xTt+1 and taking
expectation, one gets
0 = A 0 AT + Q. (6.10)
Also, expanding (xt+1 − ε t+1 ) xTt+1 − ε Tt+1 and taking expectation, after
using (6.10), yield E ε t xTt + E xt ε Tt = 2Q. On a computer, the continuous
128 6 Principal Oscillation Patterns and Their Extension
Fig. 6.5 POP2 (a) and EOF2 (b) of January 300-hPa NCEP/NCAR streamfunction field. Adapted
from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permis-
sion
–0.045
–0.050
–0.055
∼ (t)
ω i
day–1
–0.060
–0.065
–0.070
–0.075
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan
Fig. 6.6 Average global growth rate of the leading 5 POPs (continuous) and FTPOPs (dashed).
Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with
permission
AR(1) model (6.2) has to be discretised, and when this is done, one gets a similar
equation to (6.3). So if we discretise (6.2) using a unit time interval, which can be
made after some scaling is applied, then Eq. (6.2) can be roughly approximated by
6.3 Relation to Continuous POPs 129
B 0 + 0 BT + B 0 BT + Q = O. (6.12)
B 0 + 0 BT + Q = O, (6.13)
which can be obtained using (6.12) by dropping the nonlinear term in B. A more
accurate discrete approximation of (6.2) can be obtained using an infinitesimal time
step δτ to obtain the following process:
where ηt,τ is now an autocorrelated noise but uncorrelated with xt . Now, for large
enough n, such that τ = nδτ remains finite, (6.15) can be approximated by
Note that now (6.16) is similar to (6.3) except that the noise is autocorrelated.
Multiplying both sides of (6.16) by xTt and taking expectation yield
τ = eBτ 0 . (6.17)
Relation (6.18) can be useful, for example, in deriving the conditional probability5
p (xt+τ |xt ). If we suppose that B is diagonalisable, and we decompose it using its
the noise term εt is Gaussian, then (6.15) implies that if xt is given the state vector xt+τ is also
5 If
left and right eigenvectors, i.e. B = LR, then the “continuous” POPs are given by
the right eigenvectors R of B. Note that this is not exactly the SVD decomposition
of B. Because RLT = Ip , we also have a similar decomposition of the Green’s
function:
Note that, in practice, the (feedback) matrix G(τ ) of the continuous POP is
calculated first before the matrix B. Note also that Eq. (6.18) can be useful, for
example, in
forecasting and provides a natural measure of forecasting accuracy,
namely E xt+τ − x̂t+τ /tr ( 0 ), where x̂t+τ = G(τ )xt , see e.g. Penland (1989)
and von Storch et al. (1995).
Finite time POPs (FTPOPs) or empirical finite time normal modes were introduced
by Frederiksen and Branstator (2005) as the analogue of finite time normal modes
in which the linear operator or feedback matrix A is obtained by direct linearisation
of the nonlinear equations (Frederiksen 1997; Frederiksen and Branstator 2001). In
FTPOPs, the linear operator in Eq. (6.2) is time-dependent, i.e.
d
xt = B(t)xt + ε t . (6.20)
dt
The solution to Eq. (6.20) is obtained as an integral (Appendix G):
t
xt = S(t, t0 )xt0 + S(t, u)ε u du, (6.21)
t0
where S(., .) is the propagator. Note that when the operator B is time independent,
then
An explicit expression of S(t, u) can be obtained when B(t) and B(u) commute, i.e.
B(t)B(u) = B(u)B(t), for all t and u, then (Appendix G):
t
S(t, u) = e u B(τ )dτ
. (6.23)
p 1 1
P r (xt+τ = x|xt ) = (2π )− 2 |σ (τ )|− 2 exp − (x − G(τ )xt )T [σ (τ )]−1 (x − G(τ )xt ) ,
2
which, under stationarity, tends to the multinormal distribution N (0, 0 ) when τ tends to infinity.
6.3 Relation to Continuous POPs 131
The FTPOPs are defined as the eigenvectors of S(t, 0). Over an interval [0, T ], the
propagator can be approximated via Eq. (6.24) using a second-order finite difference
scheme of S(tk , tk−1 ), for tk = T − (n − k)δt, k = 1, . . . n, and δt is a half-
hour time step. The eigenvalues λ = λr + iλi (and associated eigenvectors) of
the propagator S(T , 0) are used to compute the global growth rate ωi and phase
frequency ωr following:
ωi = 2T
1
log |λ|
(6.25)
ωr = − T1 arctan( λλri ).
In their analysis, Frederiksen and Branstator (2005) considered one year period
(T = 1 yr) for the global characteristics and examined also local characteristics for
each month using daily data. As for POPs, the eigenvalues determine the nature
of the FTPOPs, namely travelling when λi = 0 or recurring when λi = 0.
Figure 6.7 shows the leading 300-hPa streamfunction FTPOP during March using
NCEP/NCAR reanalysis for the northern and southern hemispheres. As for the
leading POP, the leading FTPOP has an approximate zonally symmetric state with
a particular high correlation between EOF1, FTPOP1 and POP1 (Fig. 6.4). There
is also a similarity between the growth rates of the leading POPs and leading
FTPOPs (Fig. 6.6). The leading FTPOP for January (Fig. 6.8) shows many common
features with EOF2 and POP2 (Fig. 6.5) especially over the PNA region and has
high correlation with both EOF2 and POP2.
Fig. 6.7 Leading NCEP/NCAR 300-hPa streamfunction FTPOP for March for the northern (left)
and southern (right) hemispheres. Adapted from Frederiksen and Branstator (2005). ©American
Meteorological Society. Used with permission
132 6 Principal Oscillation Patterns and Their Extension
Fig. 6.8 January FTPOP1 obtained from the NCAR/NCEP 300-hPa streamfunction. Adapted
from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permis-
sion
Cyclo-stationary POPs have been implicitly mentioned in the previous section when
finite time POPs were discussed. In this section more details on cyclo-stationarity
are given. In the POP model (6.3) the time series was assumed to be second-order
stationary.6 This can be a reasonable assumption when we analyse, for example,
data on an intraseasonal time scale for a given, e.g. winter, season. When the data
contain a (quasi-) periodic signal, such as the annual or biennial cycles, then the
AR(1) model can no longer be supported. An appropriate extension of the POP
model, which takes into account the periodic signal, is the cyclo-stationary POP
(CPOP) model. CPOP analysis was published first7 by Blumenthal (1991), who
applied it to analyse El-Niño Southern Oscillation (ENSO) from a climate model,
and was applied later by various authors, such as von Storch et al. (1995), who
applied it to the Madden Julian Oscillation (MJO), and Kim and Wu (1999), who
compared it to other EOF techniques.
We assume now that our data contain, say, T cycles, and in each cycle we have
l sample making the total sample size n = T l. For example, with a 10-year data
of monthly SST, we have T = 10 and l = 12. Given a time series x1 , . . . , xn ,
6 In practice POP analysis has been applied to many time series not necessarily strictly stationary.
7 see von Storch et al. (1995) for other unpublished works on CPOPs.
6.4 Cyclo-Stationary POPs 133
s = (t − 1)l + τ , (6.26)
be identified alternatively by t and τ , i.e. xs = xt,τ . Note that the noise term is not
cyclo-stationary. The CPOP (or cyclo-stationary AR(1)) model then reads
with the property xt,τ +l = xt+1,τ and A(τ + l) = A(τ ). The iteration of Eq. (6.27)
yields
!
l
B(τ ) = A(τ + l − 1) . . . A(τ ) = A(τ + l − k). (6.29)
k=1
The CPOPs are the eigenvectors of the system matrix B(τ ) for τ = 1, . . . l, i.e.
choose c = ρ(τ )eiθ , so that u(τ + 1) is unit-length and also periodic. The choice
ρ(τ ) = A(τ )u(τ ) yields u(τ +1)) = 1, and to achieve periodicity, we compute
u(τ + l) recursively yielding
!
l !
l
u(τ + l) = ρ(τ + l − k)eilθ B(τ )u(τ ) = |λ| ρ(τ + l − k)ei(lθ−φ) u(τ ).
k=1 k=1
(6.31)
Now,
( since u(τ + l) = 1 by construction, the above equation yields
|λ| lk=1 ρτ +l−k = 1, and then u(τ + l) = ei(lθ−φ) u(τ ). By choosing θ = φ/ l, the
eigenvectors u(τ ) become periodic. To summarise, the CPOPs are obtained from
the set of simultaneous eigenvalue problems B(τ )u(τ ) = λu(τ ), τ = 1, . . . l, and
one needs to only solve the problem for τ = 1.
Once the CPOPs are obtained, we proceed as for the POPs and compute the
CPOP coefficients z(t, τ ) = v(τ )∗T x(t, τ ), by projecting the data x(t, τ ) onto the
adjoint pattern v(τ ), i.e. the eigenvector of BT (τ ) with the eigenvalue λ, yielding
Here we attempt to compare two different but related techniques, POPs and normal
modes, to evaluate the effects of various feedback systems on the dynamics of
6.5 Other Extensions/Interpretations of POPs 135
dx
= F (x0 )x (6.33)
dt
associated with eigenvalues λ of F (x0 ) satisfying |λ| > 1. In (6.33) the vector x0 is
the basic state and F (x0 ) is the differential of F () at x0 , i.e. F (x0 ) = ∂F
∂x (x0 ). Note
that the eigenvalues and the corresponding normal modes are in general complex
and therefore have standing and propagating patterns.
It is commonly accepted that POPs also provide estimates of normal modes.
Schnur et al. (1993), see also von Storch and Zwiers (1999), investigated normal
modes from a quasi-geostrophic model and computed POPs using data simulated
by the model. They also investigated and compared normal modes and POPs using
the quasi-geostrophic vorticity equations on the sphere and concluded that POPs
can be attributed to the linear growing phase of baroclinic waves. Unstable normal
modes have eigenvalues with magnitude greater than unity. We have seen, however,
that the eigenvalues of A = 1 −1 0 , or its estimate Â, are inside the unit circle.
Part of this inconsistency is due to the fact that the normal modes are (implicitly)
derived from a continuous system, whereas (6.3) is a discrete AR(1). A useful way
is perhaps to compare the modes of (6.19) with the continuous POPs.
In POP analysis the state vector of the system is real. POPs are not appropriate
to model and identify standing oscillations (Bürger 1993). A standing oscillation
corresponds to a real POP associated with a real eigenvalue. But this implies a
(real) multivariate AR(1) model, i.e. a red spectrum or damped system. Since
the eigenvalues are inside the unit circle, a linear first-order system is unable to
model standing oscillations. Two alternatives can be considered to overcome this
shortcoming, namely:
8 Often taken to be the mean state or climatology, although conceptually it should be a stationary
state of the system. This choice is adopted because of the difficulties in finding stationary states.
136 6 Principal Oscillation Patterns and Their Extension
zt = xt + iyt , (6.34)
where yt = H(xt ) is then used to compute the POPs, yielding Hilbert POPs
(HPOPs). The Hilbert transform H(xt ) contains information about the system state
tendency. Both xt and H(xt ) play, respectively, the roles of position and momentum
in Hamiltonian systems. Hence the use of the complexified system states becomes
equivalent to using a second-order system but without increasing the size of the
unknown parameters. The HPOPs are therefore obtained as the eigenvectors of the
new complex matrix:
A = 1 −1
0 , (6.35)
in which one recognises the left hand side as a “time derivative”. Since Hilbert
transform can also be interpreted as a (special) time derivative, an alternative way is
to consider a similar model except that the left hand side is now a Hilbert transform
of the original field (Irene Oliveira, personal communication). The model reads
Hilbert oscillation patterns (HOPs) are obtained as the eigenvectors of Â. If Szz is
the (Hermitian) covariance matrix of the complexified field zt = xt + iyt , then one
has
1 −1
A = i Ip − Szz S ; (6.38)
2
hence, the eigenvalues of A are pure imaginary since the eigenvalues of Szz S−1 are
real (non-negative). It is also straightforward to check that if Szz is not invertible,
then i is an eigenvalue of A with associated eigenvectors given by SN, where N is
the null space of Szz .
Now let u be an eigenvector of Szz S−1 with eigenvalue λ, and then Szz v = λSv,
where v = S−1 u. Taking xt = v∗T xt , yt = v∗T yt and zt = v∗T zt , one gets, since
var(zt ) = var(xt ) + var(yt ) = 2var(xt ):
and a similar equation for the other patterns, with Sxy and Syx , exchanged. Defining
u = Sa, and noting that Syx = −Sxy , the above equation becomes
A2 u = −λu. (6.41)
Hence the eigenvalues of HCCA are the (opposite of the) square of the eigenvalues
associated with HOPs, and the associated eigenvectors are given by S−1 uk , where
uk , k = 1, . . . , p, are the HOPs.
138 6 Principal Oscillation Patterns and Their Extension
The POP model based on the multivariate AR(1), Eq. (6.3), is based on the lag-
1 autocorrelation of the process. This model can be extended by including m
consecutive lags:
m
xt = Al xt−l + ε t . (6.42)
l=1
AR(1) models in which the coupling is through the noise covariance. The idea
is to use the extended, or delay, state space as for extended EOFs, which is
presented in Chap. 7. Denoting by x t the delayed state vector using m lags, i.e.
x t = (xt , xt−1 , . . . , xt−m+1 )T , and t = (ε t , 0, . . . , 0)T , the model (6.42) can be
written as a generalised AR(1) model:
x t = Ax t−1 + t, (6.43)
where
⎛ ⎞
A1 A2. . . Ap−1 Ap
⎜ I O ... O O⎟
⎜ ⎟
⎜ O⎟
A=⎜O I... O ⎟. (6.44)
⎜ . ..
.. ⎟
⎝ .. . . O O⎠
O O ... I O
The same decomposition can now be applied as for the VAR(1) case. Note, however,
because of the Fröbenius structure of the mp × mp matrix A in (6.44) (e.g.
Wilkinson 1988, chap. 1.3), the eigenvectors vk , k = 1, . . . mp have a particular
structure:
⎛ ⎞
λm−1 uk
⎜ .. ⎟
⎜ . ⎟
vk = ⎜ ⎟, (6.45)
⎝ λuk ⎠
uk
The principal interaction pattern (PIP) method was proposed originally by Has-
selmann (1988). Slight modifications of PIP method was introduced later by, e.g.
Kwasniok (1996, 1997, 2004) and Wu (1996). A description is given below, and for
more technical details, the reader is referred to Kwasniok (1996) and later papers.
PIP method takes into account the (nonlinear) dynamics of a nonlinear system.
In its simplest form, the PIP method attempts to project the dynamics of a N-
dimensional (autonomous) dynamical system living in a Hilbert space E with basis
ei , i = 1, . . . N:
140 6 Principal Oscillation Patterns and Their Extension
du
= F (u), (6.47)
dt
where u = N i=1 ui ei , onto a lower L-dimensional Hilbert space P. This space is
spanned by the PIP patterns p1 , . . . , pL , with
N
pi = pki ek , i = 1, . . . , L, (6.48)
k=1
where the N × L matrix P = (pij ) contains the PIP patterns. The state vector u is
then projected onto the PIP patterns to yield
L
z = P roj (u) = zi pi . (6.49)
i=1
The PIPs are normally chosen to be orthonormal, with respect to a predefined scalar
product, i.e.
The patterns are then sought by minimising a costfunction measuring the discrep-
ancy between the (true) tendency of the PIP coefficients, żit , and the projection of
the tendency u̇ onto the PIP space, i.e.
L
P I P s = argmin{F =< |żit − żi |2 >}, (6.52)
i=1
where <, > is an ensemble average. In simple terms, given an initial condition u0 ,
system (6.47) is integrated for a specified time interval τ . The obtained trajectory is
then projected onto the PIPs. Similarly, the dynamics of Eq. (6.47) is projected onto
the PIPs, which is then integrated forward using the projection P roj (u0 ) = uP 0
of u0 onto the PIPs as initial condition. The norm of the difference between the
obtained two trajectories is then computed. More specifically, let uP τ be the state
of the trajectory of (6.47) at time τ , starting from u0 , and projected onto the PIP
space. Let also zτ be the state of the trajectory of (6.51) starting from uP 0 P . The
discrepancy between the two trajectories at time τ , uτP −zτ , is then computed, and
a global measure ε2 of the discrepancy is obtained. Kwasniok (2004), for example,
used ε2 = uτPmax − zτmax 2 , for some integration time τmax , whereas Crommelin
τ
and Majda (2004) used the total integral of this discrepancy, i.e. ε2 = 0 max uτPmax −
zτmax 2 . The costfunction to be minimised with respect to the matrix P is then
6.7 Principal Interaction Patterns 141
Fig. 6.9 Leading two PIPs (a,b) and two rotated EOFs (c,d) of the streamfunction from a
barotropic quasi-geostrophic model on the sphere. Adapted from Kwasniok (2004). ©American
Meteorological Society. Used with permission
142 6 Principal Oscillation Patterns and Their Extension
Fig. 6.10 Trajectory of the integration of the Charney and Devore (1979) model using the leading
four-EOF model projected onto the leading two EOFs (top left) and onto the leading EOF (middle),
and the reference trajectory projected onto the leading two EOFs (top right) and onto the leading
EOF (bottom). Adapted from Crommelin and Majda (2004). ©American Meteorological Society.
Used with permission
6.7 Principal Interaction Patterns 143
Fig. 6.11 Same as Fig. 6.10 but using a four-PIP model. The trajectories from the original
reference model are also shown for comparison. Adapted from Crommelin and Majda (2004).
©American Meteorological Society. Used with permission
Chapter 7
Extended EOFs and SSA
7.1 Introduction
Atmospheric fields are very often significantly correlated in both the space and time
dimensions. EOF technique, for example, finds patterns that are both spatially and
temporally uncorrelated. Such techniques make use of the spatial correlation but do
not take into account the significant auto- and cross-correlations in time. As a result,
travelling waves, for example, cannot be identified easily using these techniques as
was pointed out in the previous chapter.
Complex (or Hilbert) EOFs (HEOFs) (Rasmusson et al. 1981; Barnett 1983;
Horel 1984; von Storch and Zwiers 1999) presented in Chap. 5, have been intro-
duced to detect propagating structures. In HEOFs, a phase information is introduced
using the conjugate part of the field, which is provided by its Hilbert transform.
Chapter 5 provides illustration of Hilbert EOFs with the QBO signal. So the
information regarding the “propagation” is contained in the phase-shifted complex
part of the system. However, despite this extra information provided by the Hilbert
transform, HEOFs approach does not take into consideration the temporal auto- and
cross-correlation in the field (Merrifield and Guza (1990)).
POPs and HPOPs (Hasselmann 1988; von Storch et al. 1995; Bürger 1993) are
methods that aim at empirically inferring space–time characteristics of a complex
field. These methods are based on a first-order autoregressive model and can
therefore be used to identify travelling structures and in some cases forecast the
future system sates. The eigenfunctions of the system feedback matrix in POPs
and HPOPs, however, do not provide an orthogonal and complete basis. Besides
being linear, another drawback of the AR(1) model, which involves only lag-1
autocorrelations, is that it may be sometimes inappropriate to model higher order
systems.
The extended EOFs introduced by Weare and Nasstrom (1982) combine both the
aspects of spatial correlation of EOFs and the temporal auto- and cross-correlation
obtained from the lagged covariance matrix. Subsequent works by Broomhead and
colleagues (e.g. Broomhead and King (1986a,b)) and Fraedrich (1986) focussed
on the dynamical aspects of extended EOFs as a way of dynamical reconstruction
of the attractors of a dynamical system that is partially observed and termed it
singular system analysis (SSA). Multichannel singular spectrum analysis (MSSA)
was used later by Vautard et al. (1992) and Plaut and Vautard (1994) and applied it
to atmospheric fields. Hannachi et al. (2011) applied extended EOFs to stratospheric
warming, whereas an application to Rossby wave breaking and Greenland blocking
can be found in Hannachi et al. (2012), see also the review of Hannachi et al. (2007).
An appropriate example where propagating signals are prominent and where
extended EOFs can be applied to reveal these signals is the Madden–Julian
oscillation (MJO). MJO is an eastward propagating planetary-scale wave of tropical
convective anomalies and is a dominant mode of intraseasonal tropical variability.
The oscillation has a broadband with a period between 40 and 60 days and has
been identified in zonal and divergent wind, sea level pressure, outgoing long wave
radiation and OLR (Madden and Julian 1994). Figure 7.1 shows the OLR field over
the tropical band on 25 December 1996. It can be noted, for example, from this
figure the low-value region particularly over the warm pool, which is an area of
large convective activity, in addition to other regions such as Amazonia and tropical
Africa. The OLR data come from NCEP/NCAR reanalyses over the tropical region
Fig. 7.1 Outgoing long wave radiation distribution over the tropics in 25 December 1996. Units
w/m2 . Adapted from Hannachi et al. (2007)
7.2 Dynamical Reconstruction and SSA 147
Fig. 7.2 Leading EOF of OLR anomalies. Units arbitrary. Adapted from Hannachi et al. (2007)
equatorward from 30◦ latitude. Seasonality of OLR is quite complex and depends
on the latitude band (Hannachi et al. 2007).
Figure 7.2 shows the leading EOF pattern explaining about 15% of the total
variability. It has opposite signs north and south of the equator and represents the
seasonal cycle, mostly explained by the inter-tropical convergence zone (ITCZ).
7.2.1 Background
d
x = F (x),
dt
which cannot be analytically integrated. A chaotic trajectory is the trajectory of the
chaotic system within its phase space. A characteristic feature of a chaotic system is
its sensitivity to initial conditions, i.e. trajectories corresponding to two very close
initial conditions diverge exponentially in a finite time. A chaotic system gives rise
to a chaotic attractor, a set with extremely complex topology. Figure 7.3 shows an
example of the popular Lorenz (1963) system (Fig. 7.3a) and a contaminated (or
1 Of at least 3 variables.
148 7 Extended EOFs and SSA
a 6 b 6
4 4
2
2
0
X'
X
0
–2
–2 –4
–4 –6
0 100 200 300 400 0 100 200 300 400
Time Time
Fig. 7.3 Time series of the Lorenz (1963) model (a) and its contamination obtained by adding a
red noise. Adapted from Hannachi and Allen (2001)
noisy) time series (Fig. 7.3b). SSA can be used to obtain the hidden signal from the
noisy data.
The general problem of dynamical reconstruction consists in inferring dynamical
and geometric characteristics of a chaotic attractor from a univariate time series,
x1 , x2 , . . . , xn , sampled from the system. The solution to this problem is based on
transforming the one-dimensional sequence into a multivariate time series using the
so-called method of delays, or delay coordinates first proposed by Packard et al.
(1980), and is obtained using a sliding window through the time series to yield
T
xt = xt , xt+τ , . . . , xt+(M−1)τ , (7.1)
where τ is the time delay or lag and M is known as the embedding dimension.
A basic result in the theory of dynamical systems indicates that the characteristics
of the dynamical attractor can be faithfully recovered using the delay coordinate
method provided that the embedding dimension is large enough (Takens 1981).
In the sequel we suppose for simplicity that τ = 1. The analysis of the multivariate
time series xt , t = 1, 2, . . . n − M + 1, for dynamical reconstruction is once
more faced with the problem of dimensionality. Attractors of low-order chaotic
systems are in general low-dimensional and can in principle be explored within
a smaller dimension than that of the embedding space. A straightforward approach
is to first reduce the space dimension using, for example, EOFs of the data matrix
obtained from the multivariate time series given in Eq. (7.1), i.e. time EOFs. This
approach has been in fact adopted by Broomhead and King (1986a,b). They used
SVD to calculate an optimal basis for the trajectory of the reconstructed attractor.
If the dynamic is in fact chaotic, then the spectrum will be discontinuous or
singular, with the first few large singular values well separated from the floor (or
7.2 Dynamical Reconstruction and SSA 149
background) spectrum associated with the noise level, hence the label singular
system or spectrum analysis (SSA). At the same time, Fraedrich (1986) applied
SSA, using a few climate records, in an attempt to analyse their chaotic nature
and estimate their fractal dimensions. Vautard et al. (1992) analysed the spectral
properties of SSA and applied it the Central England temperature (CET) time series
in an attempt to find an oscillation buried in a noise background. They claim that
CET contains a 10-year cycle. Allen and Smith (1997) showed, however, that the
CET time series consists of a long-term trend in a coloured background noise. SSA
is in fact a useful tool to find a periodic signal contaminated with a background white
noise. For example, if the time series consists of a sine wave plus a white noise, then
asymptotically, the spectrum will have a leading pair of equal eigenvalues and a flat
spectrum. The time EOFs corresponding to the leading eigenvalues will consist of
two sine waves (in delay space) in quadrature. The method, however, can fail when,
for example, the noise is coloured or the system is nonlinear. A probability-based
approach is proposed in Hannachi (2000), see also Hannachi and Allen (2001).
The anomaly data matrix, referred to as trajectory matrix by Broomhead and
King (1986a,b), obtained from the delayed time series of Eq. (7.1), that we suppose
already centred, is given by
⎛ ⎞
x1 x2 . . . xM
⎜ x2 x3 . . . xM+1 ⎟
⎜ ⎟
X=⎜ .. .. .. ⎟ . (7.2)
⎝ . . . ⎠
xn−M+1 xn−M+2 . . . xn
The trajectory matrix expressed by Eq. (7.2) has a special form, namely with
constant second diagonals. This property gets transmitted to the covariance matrix:
1 1
n−M+1
C= XT X = xt xTt , (7.3)
n−M +1 n−M +1
t=1
which is a Toeplitz matrix, i.e. with constant diagonals corresponding to the same
lags. This Toeplitz structure is known to have useful properties, see e.g. Graybill
(1969). If σ 2 is the variance of the time series, then C becomes
⎛ ⎞
1 ρ1 . . . ρM−1
⎜ . . ⎟
1 ⎜ ρ1 1 . . .. ⎟
C==⎜
⎜ .
⎟,
⎟ (7.4)
σ 2
⎝ .. .. ..
. . ρ1 ⎠
ρM−1 . . . ρ1 1
M
ci (t) = xTM+t−1 ui = uil xM+t−l (7.5)
l=1
a b
10.0 0.6
0.4
Eigenvector
Eigenvalue
0.2
1.0
0.0
–0.2
0.1 –0.4
0 10 20 30 0 10 20 30
Eigenvalue rank Lag
Fig. 7.4 Spectrum of the grand covariance matrix, Eq. (7.3), of the noisy time series of Fig. 7.3b
weighted by the inverse of the same covariance matrix of the noise (a) and the leading two extended
EOFs. Adapted from Hannachi and Allen (2001)
7.3 Examples 151
M−1
M−1
[b]1 = b0 + ρk bk = aM−1 + ρk aM−k−1 = λaM−1 = λb0
k=1 k=1
and similarly for the remaining elements. Hence if the eigenvalues are distinct, then
clearly a = b, and therefore, a is symmetric.
This conclusion indicates therefore that the SSA filter is symmetric and does
not cause a frequency-dependent phase shift. This phase shift remains, however,
possible particularly for singular covariance matrices.
Remarks
• SSA enjoys various nice properties, such as being an adaptive filter.
• The major drawback in SSA, however, is related to the problem of choosing
the embedding dimension M. In general the bigger the value of M the more
accurate the reconstruction. However, in the case of extraction of periodic signals,
M should not be much larger than the period, see e.g. Vautard et al. (1992) for
discussion.
As we have mentioned above, SSA, like EOFs, can be used to reconstruct the
original time series using a time reconstruction, instead of space reconstruction as
in EOFs. Since xt , t = 1, . . . n − M + 1, can be decomposed as
M
xt = ck (t)uk , (7.6)
k=1
one gets
M
xt+l−1 = ck (t)uk,l (7.7)
k=1
7.3 Examples
For a white noise time series with variance σ 2 , the covariance matrix of the delayed
vectors is simply σ 2 IM , and the eigenvectors are simply the degenerate unitary
vectors.
152 7 Extended EOFs and SSA
for t = 1, 2, . . ., with independent and identically distributed (IID) noise with zero
mean and variance σ 2 . The autocorrelation function is
ρx (τ ) = ρ |τ | , (7.9)
and the Toeplitz covariance matrix C of xt = (xt , xt+1 , . . . , xt+M−1 )T takes the
form
⎛ ⎞
1 ρ . . . ρ M−1
⎜ . . ⎟
1 ⎜ ρ 1 . . .. ⎟
C=⎜
⎜ ..
⎟.
⎟ (7.10)
σ 2
⎝ .. ..
. . . ρ ⎠
ρ M−1 . . . ρ1 1
To compute the eigenvalues of (7.10), one could, for example, start first by inverting
C and then compute its eigenvalues. The inverse of C has been computed, see e.g.
Whittle (1951) and Wise (1955). For example, Wise (1955) provided a general way
to compute the autocovariance matrix of a general ARMA(p, q) time series.
Following Wise (1955), Eq. (7.8) can be written in a matrix form as
(I − ρJ) xt = ε t , (7.11)
where xt = (xt , xt+1 , . . .)T , and ε t = (εt , εt+1 , . . .)T are semi-infinite vectors, I is
the semi-infinite identity matrix, and J is the semi-infinite auxiliary matrix whose
finite counterpart is
⎛ ⎞
0 1 ... 0
⎜ .. ⎟
⎜ 0 0 ... .⎟
Jn = ⎜
⎜. . .
⎟,
⎟
⎝ .. . . . . 1⎠
0 ... 0 0
that is with ones on the superdiagonal and zeros elsewhere. The matrix Jn is
nilpotent with Jnn = O. Using Eq. (7.11), we get
−1
E xt xTt = = σ 2 (I − ρJ)−1 I − ρJT ,
7.3 Examples 153
that is
σ 2 −1 = I − ρJT (I − ρJ) . (7.12)
0 ... 0 −ρ 1 + ρ 2
Another direct way to compute C−1 is given, for example, in Graybill (1969,
theorem 8.3.7, p. 180). Now the eigenvalues of C−1 are given by solving the
polynomial equation:
Dn = |C−1 − λIn | = 0.
is the determinant of C−1 − λI after removing the first line and first column, and
) )
) −ρ 0 0 ... 0 )
) )
) −ρ 1 + ρ 2 − λ −ρ ... 0 )
) )
) .. .. .. .. )
δn−1 =) . . . . −ρ )
) )
) 0 −ρ 1 + ρ − λ
2 −ρ )
) ... )
) 0 ... 0 −ρ 1 + ρ − λ)
2
is the determinant of C−1 − λI after removing the second line and first column. We
now have the recurrence relationships:
δn = −ρn−1
, (7.14)
n = 1 + ρ 2 − λ n−1 + ρδn−1
154 7 Extended EOFs and SSA
n = aμn−2
1 + bμn−2
2 ,
where μ1,2 are the roots (supposed to be distinct) of the quadratic equation x 2 −
(1 + ρ 2 − λ)x + ρ 2 = 0. Note that the constants a and b are obtained from the initial
conditions such as 2 and 3 .
The use of SSA to find periodic signals from time series has been known since
the early 1950s with Whittle (1951). The issue was reviewed by Wise (1955),
who showed that for a stationary periodic (circular) time series of period n, the
autocovariance matrix of the time series contains at least one eigenvalue with
multiplicity greater or equal than two. Wise2 (1955) considered a stationary zero-
mean random vector x = (xn , xn−1 , . . . , x1 ) having a circular covariance matrix, i.e.
γn+l = γn−l = γl ,
where γn = E (xt xt+n ) = σ 2 ρn , and σ 2 is the common variance of the xk ’s. The
corresponding circular autocorrelation matrix is given by
⎛ ⎞
1 ρ1 ρ2 . . . ρ1
⎜ ρ1 1 ρ1 . . . ρ2 ⎟
⎜ ⎟
=⎜ .. .. .. ⎟. (7.15)
⎝ ρ2 . . . ρ1 ⎠
ρ1 ρ2 . . . ρ1 1
Exercise
1. Show that the matrix
⎛ ⎞
0 1 0
0 ...
⎜0 0 0⎟
1 ...
⎜ ⎟
⎜ ⎟
W = ⎜ ... ..
.
.. ⎟
.
..
.
⎜ ⎟
⎝0 0 0 ... 1⎠
1 0 0 ... 0
2 Wise also extended the analysis to calculate the eigenvalues of a circular ARMA process.
7.4 SSA and Periodic Signals 155
T
2. Show that = I+ρ1 W + WT +ρ2 W2 + W2 +. . .+qρp Wp + Wp T
where p = [n/2], i.e. the integer part of n/2 and q equals 1 if n is odd, and 12
otherwise.
3. Compute the eigenvalues of W and show that it is diagonalisable, i.e. W =
A A−1 .
Hint The eigenvalues of W are the unit roots ωk , k = 1, . . . n of ωn − 1 = 0. Hence
W = A A−1 , and Wα + W−α = A α + −α A−1 . This means in particular
that Wα + W−α is diagonalisable with ωk + ωk−1 as eigenvalues.
Exercise (Wise 1955) Compute the eigenvalues (or latent roots) λk , k = 1, 2 . . . n
of and show, in particular, that they can be expressed as an expansion into Fourier
sine–cosine functions as
2π k 4π k 2πpk
λk = 1 + 2ρ1 cos + 2ρ2 cos + . . . + 2qρp cos (7.16)
n n n
and that λk = λn−k .
Remark The eigenvalues of can also be obtained directly by calculating the
determinant = | − λI| by writing (see e.g. Mirsky 1955, p. 36)
⎡ ⎤
n−1
= nk=1 ⎣(1 − λ) + ρj ωk ⎦ .
j
j =1
n2/−1
2π kj
λk = 1 + 2 ρj cos + ρ n2 cos π k. (7.18)
n
j =1
The same procedure can be applied when the time series contains, for example,
a periodic signal but is not periodic. This was investigated by Basilevsky (1983)
and Basilevsky and Hum (1979) using the Karhunen-Loéve decomposition, which
consists precisely of an EOF analysis of the lagged covariance (Toeplitz) matrix, i.e.
156 7 Extended EOFs and SSA
*
U= 2/n (u0 , u1 , . . . , un−1 )T (7.20)
and
UT U − 2π
7.5.1 Background
In EEOF analysis the atmospheric state vector at time t, i.e. xt = xt1 , . . . xtp ,
t = 1, . . . , n, used in traditional EOF, is extended to include temporal information
as
It is now clear from (7.22) that time is incorporated in the state vector side by side
with the spatial dimension. If we denote by
then the extended state vector (7.22) is written in a similar form to the conventional
state vector, i.e.
p
x t = x 1t , x 2t , . . . , x t (7.25)
except that now the elements x kt , k = 1, . . . p, of this grand state vector, Eq. (7.24),
are themselves temporal-lagged values. The data matrix X in Eq. (7.23) now takes
the form
⎡ p ⎤
x 11 x 21 ... x1
⎢ .. .. .. ⎥
X =⎣ . . . ⎦, (7.26)
p
x 1n−M+1 x 2n−M+1 ... x n−M+1
which is again similar to traditional data matrix X, see Chap. 2, except that now its
elements are (temporal) vectors.
The vector x st in (7.24) is normally referred to as the delayed vector obtained
from the time series (xts ), t = 1, . . . n of the field value at grid point s. The new data
matrix (7.26) is now of order (n − M + 1) × pM, which is significantly larger than
the original matrix dimension n × p.
We suppose that X in (7.26) has been centred and weighted, etc. The covariance
matrix of time series (7.25) obtained using the grand data matrix (7.26) is
⎡ ⎤
C11 C12 . . . C1M
⎢ C21 C22 . . . C2M ⎥
1 ⎢ ⎥
= XT X = ⎢ . .. .. ⎥ , (7.27)
n−M +1 ⎣ .. . . ⎦
CM1 CM2 . . . CMM
7.5 Extended EOFs or Multivariate SSA 159
where each Cij , 1 ≤ i, j ≤ M, is the lagged covariance matrix between the i’th and
j’th gridpoints and is given by3
1
n−M+1
T j
Cij = x it x t . (7.28)
n−M +1
t=1
If the elements
of the data matrix (7.26) were random variables, then each submatrix
T
Cij = E x i x j , from the covariance matrix = E X T X , will exactly take
a symmetric Toeplitz form, i.e. with constant diagonals, and consequently, will
be block Toeplitz. Due to finite sampling; however, Cij is approximately Toeplitz
for large values of n, compared to the window length M. This is in general the case
when we deal with high frequency data, e.g. daily observations or even monthly
averages from long climate model simulations. The symmetric covariance matrix
is therefore approximately block Toeplitz for large values of n.
An alternative form of the data matrix is provided by re-writing the state
vector (7.22) in the form
that is
xt = xt1 , . . . , xt,p .
X 1 = X P, (7.32)
3 Other alternatives to compute C also exist, and they are related to the way the lagged covariance
ij
between two time series is computed, see e.g. Priestly (1981) and Jenkins and Watts (1968).
4 used by Weare and Nasstrom (1982).
160 7 Extended EOFs and SSA
X = VUT , (7.34)
Fig. 7.5 Spectrum of the grand covariance matrix, Eq. (7.3), of the leading 10 OLR PCs using a
window lag M = 80 days. The vertical bars show the approximate 95% confidence interval using
an hand-waving effective sample size of 116. Adapted from Hannachi et al. (2007)
5 That
is a matrix containing exactly 1 in every line and every column and zeros elsewhere. A
permutation matrix P is orthogonal, i.e. PPT = PT P = I.
7.5 Extended EOFs or Multivariate SSA 161
of the newly obtained variables, i.e. the number of columns of the grand data
matrix. The diagonal matrix contains the singular values θ1 , . . . θd of X , and
V = (v 1 , v 2 , . . . , v d ) is the matrix of the right singular vectors or extended PCs
where the k’th extended PC is v k = (vk (1), . . . , vk (n − M + 1))T .
The computation of Hilbert EOFs is again similar to conventional EOFs. Given
the gridded data matrix X(n, p12) and the window length m, the cornerstone of
EEOFs is to compute the extended data matrix EX as shown in the simple Matlab
code:
>> EX = [ ];
>> for t=1:(n-m+1)
>> test0 = [ ]; test1 = [ ];
>> for s = 1:p12
>> test1 = X (t:(t+m-1), s)’;
>> test0 = [test0 test1];
>> end
>> EX = [EX; test0];
>> end
The extended EOFs and extended PCs along with associated explained variance
are then computed as for EOFs using the new data matrix EX from the above
piece of Matlab code. These extended EOFs and PCs can be used to filter the
data by removing the contribution from nonsignificant components and also for
reconstruction purposes as detailed below and illustrated with the OLR example.
The extended EOFs U can be used as a filter exactly like EOFs. For instance, the
SVD decomposition (7.34) yields the expansion of each row x t of X in (7.26)
d
x Tt = θk vk (t)uk , (7.35)
k=1
d
j
xTt+j −1 = θk vk (t)uk (7.36)
k=1
j T
uk = uj,k , uj +M,k , . . . , uj +(p−1)M,k . (7.37)
162 7 Extended EOFs and SSA
j
Note that the expression of the vector uk in the expansion (7.36) depends on the
form of the data matrix. The one given above corresponds to (7.26), whereas for the
data matrix X 1 in Eq. (7.31) we get
j T
uk = u(j −1)p+1,k , u(j −1)p+2,k , . . . , ujp,k . (7.38)
Note also that when we filter out higher EEOFs, expression (7.36) is to be truncated
to the required order d1 < d.
Figure 7.6, for example, shows PC1 and its reconstruction based on Eq. (7.40)
using the leading 5 extended PCs. Figure 7.5 shows another pair of approximately
equal eigenvalues, namely, eigenvalues 4 and 5. Figure 7.7 shows a time plot
of extended PC4/PC5, and their phase plot. This figure reflects the semi-annual
oscillation signature in OLR. Figure 7.8 shows the extended EOF8 pattern along
10◦ N as a function of the time lag. Extended EOF8 shows an eastward propagating
Fig. 7.6 Time series of the raw and reconstructed OLR PC1. Adapted from Hannachi et al. (2007)
Fig. 7.7 Time series of OLR extended PC4 and PC5 and their phase plot. Adapted from Hannachi
et al. (2007)
7.5 Extended EOFs or Multivariate SSA 163
Fig. 7.8 Extended EOF 8 of the OLD anomalies along 10o N as a function of time lag. Units
arbitrary. Adapted from Hannachi et al. (2007)
wave with an approximate phase speed of about 9 m/s, comparable to that of Kelvin
waves.
The expansion (7.36) is exact by construction. However, when we truncate it by
keeping a smaller number of EEOFs for filtering purposes, e.g. when we reconstruct
the field components from a single EEOF or a pair of EEOFs corresponding, for
example, to an oscillation, then the previous expansion does not give a complete
picture. This is because when (7.36) is truncated to a smaller subset K of EEOFs
yielding, e.g.
j
yTt+j −1 = θk vk (t)uk , (7.39)
k in K
(7.40)
The eigenvalues related to the MJO are reflected by the pair (8,9), see Fig. 7.5.
The time series and phase plots of extended EOFs 8 and 9 are shown in Fig. 7.9.
This oscillating pair has a period of about 50 days. This MJO signal projects onto
many PCs. Figure 7.10 shows a time series of the reconstructed PC1 component
using extended EOFs/PCs 8 and 9. For example, PCs 5 to 8 are the most energetic
components regarding MJO. Notice, in particular, the weak projection of MJO onto
the annual cycle. Figure 7.11 shows the reconstructed OLR field for the period 3
March 1997 to 14 March 1997, at 5◦ N, using the extended 8th and 9th extended
EOFs/PCs.
The figure reveals that the MJO is triggered near 25–30◦ E over the African jet
region and matures over the Indian Ocean and Bay of Bengal due to the moisture
excess there. MJO becomes particularly damp near 150◦ E. Another feature that is
Fig. 7.9 Time series of OLR extended PCs 8 and 9 and their phase portrait. Adapted from
Hannachi et al. (2007)
Fig. 7.10 Reconstructed OLR PCs 1 to 8 using the extended EOFs 8 and 9. Adapted from
Hannachi et al. (2007)
clear from Fig. 7.11 is the dispersive nature of the MJO, with a larger phase speed
during the growth phase compared to that during the decay phase.
Note that these reconstructions can also be obtained using least squares (see e.g.
Vautard et al. 1992, and Ghil et al. 2002). The reconstructed components can also
be restricted to any subset of the Eigen elements of the grand data matrix (7.26) or
similarly the grand covariance matrix . For example, to reconstruct the time series
associated with an oscillatory Eigen element, i.e. a pair of degenerate eigenvalues,
the subset K in the sum (7.39) is limited to that pair.
The reconstructed multivariate time series yt , t = 1, . . . n, can represent the
reconstructed (or filtered) values of the original field at the original p grid points.
In general, however, the number of grid points is too large to warrant an eigen
decomposition of the grand data or covariance matrix. In this case a dimension
reduction of the data is first sought by using say the leading p0 PCs and then
apply a MSSA to these retained PCs. In this case the dimension of X becomes
(n − M + 1) × Mp0 , which may be made considerably smaller than the original
dimension. To get the reconstructed space–time field, one then use the reconstructed
PCs in conjunction with the p0 leading EOFs.
166 7 Extended EOFs and SSA
Fig. 7.11 Reconstructed OLR anomaly field using reconstructed EOFs/PCs 1 to 8 shown in
Fig. 7.10. Adapted from Hannachi et al. (2007)
Remark The previous sections discuss the fact that extended EOFs are used
essentially for identifying propagating patterns, filtering and also data reduction.
Another example where extended EOFs can be used is when the data contain some
kind of breaks. This includes studies focussing on synoptic and large-scale processes
in a particular season. In these studies the usual procedure is to restrict the analyses
to data restricted to the chosen period. This includes analysing, for example, winter
(e.g. DJF) daily data over a number of years. If care is not taken, an artificial
oscillatory signal emerges. An example was given in Hannachi et al. (2011) who
used geometric moments of the polar vortex using ERA-40 reanalyses. They used
December–March (DJFM) daily data of the aspect ratio, the centroid latitude and
the area of the vortex from ERA-40 reanalyses for the period 1958–2002.
Figure 7.12a shows the spectrum of the extended time series in the delay
coordinates of the vortex area time series using a window lag M = 400 days. A
pair of nearly identical eigenvalues emerges and is well separated from the rest
of the spectrum. The associated extended EOFs are shown in 7.11b, and they
show a clear pair of sine waves in quadrature. The associated extended PCs are
shown in 7.12c, revealing again a phase quadrature supported also by the phase
plot (7.12d). The time series is then filtered by removing the leading few extended
EOFs/PCs. The result is shown in Fig. 7.13. Note that in Fig. 7.13 the leading four
extended EOFs/PCs were filtered out from the original vortex area time series.
7.5 Extended EOFs or Multivariate SSA 167
Fig. 7.12 Spectrum of the grand covariance matrix, Eq. (7.3), of the northern winter (DJFM) polar
vortex (a) using a window lag M = 400 days, the leading two extended EOFs (b), the extended
PC1/PC2 (c) and the phase portrait of the extended PC1/PC2 (d). Adapted from Hannachi et al.
(2007)
Fig. 7.13 Raw polar vortex area time series and the reconstructed signal using the leading four
extended EOFs/PCs (a) and the filtered time series obtained by removing the reconstructed signal
of (a). Adapted from Hannachi et al. (2011). ©American Meteorological Society. Used with
permission
168 7 Extended EOFs and SSA
EEOFs are useful tools to detect propagating features, but some care needs to be
taken when interpreting the patterns. There are two main difficulties of interpretation
related, respectively, to the standing waves and the relationship between the
EEOFs substructures. The method finds one or more EEOFs where each EEOF
is composed of a number of patterns or substructures. These patterns are taken
to represent propagating features, and this would imply some sort of coherence
between individual features within a given EEOF. The fact that the method attempts
to maximise the variance of each EEOF (without considering extra constraints on
the correlation, or any other measure of association, between the substructures of a
given EEOF) suggests the existence of potential caveats. Chen and Harr (1993) used
a two-variable model to show that the partition of the loadings is much sensitive
to the variance ratio than to the correlation between the two variables. This may
yield some difficulties in interpretation particularly when some sort of relationships
are expected between the substructures of a given EEOF. Chen and Harr (1993)
constructed a 6-variable toy model data to show that interpretation of EEOF patterns
can be misleading.
In the same token and like POPs, EEOFs interpretation can also be difficult when
the data contains a standing wave. The problem arises for particular choices of
the delay parameter τ . Monahan et al. (1999) showed that if the dataset contains
a standing wave the EEOFs describing this wave will be degenerate if τ coincides
with a zero of the autocovariance function of the wave’s time series. The model used
by Monahan et al. (1999) takes the simple form:
xt = ayt + ε t ,
where E ε t ε Tt+τ = η(τ )I, aT a = 1 and E (yt yt+τ ) = a(τ ), and ε t and yt
T
are uncorrelated. For example, if zt = xTt , xTt+τ , the corresponding covariance
matrix is
(0) (τ )
z = ,
(τ ) (0)
where (τ ) = a(τ )aaT + η(τ )I is the covariance matrix function of xt . Denoting
γ (τ ) = a(τ )+η(τ ), then the eigenvalue λ = γ (0)+γ (τ ) is degenerate if γ (τ ) = 0.
Using more lags, the degeneracy condition becomes slightly more complicated. For
T
example, when using two time lags, i.e. zt = xTt , xTt+τ , xTt+2τ , then one gets
two obvious eigenvalues, λ1 = γ (0) + γ (τ ) + γ (2τ ), and λ2 = γ (0) − γ (2τ )
T T
with respective eigenvectors aT , aT , aT and aT , 0T , −aT . If γ (τ ) = γ (2τ ),
then the second eigenvalue degenerates, and if in addition γ (τ ) = 0, then the first
eigenvalue degenerates. When this happens, the substructures within a single EEOF
can be markedly different, similar to the case mentioned above. Monahan et al.
7.7 Alternatives to SSA and EEOFs 169
(1999) performed EOF and a single-lag EEOF analysis of the monthly averages
of the Comprehensive Ocean-Atmosphere Data Set (COADS) SLP from January
1952 to June 1997. The first EOF was identified as an east–west dipole and was
interpreted as a standing mode, with its PC representing the ENSO time series. They
obtained degeneracy of the leading eigenvalue when the lag τ is chosen near to the
first zero of the sample autocovariance function of the PC of the standing mode,
i.e. at 13 months. The result was a clear degradation, and in some substructures
a suppression, of the standing wave signal leading to a difficulty of interpretation.
Therefore, in the particular case of (unknown) standing wave, it is recommended to
try various lags and check the consistency of the EEOF substructures.
is a (2M −1)×(2M −1) Hankel matrix. It is defined using the (2M −1)×(2M −1)
cyclic permutation matrix P = (pij ), (i, j = 1, . . . 2M − 1), given by
The DAH modes and associated eigenvalues are then given by the eigenelements
of the grand (symmetric) Hankel matrix H. In a similar fashion to EEOFs, the
eigenvalues of H come in pairs of equal in magnitude but with opposite sign, and
the associated modes coefficients (time series) are shifted by quarter of a period.
The obtained modes, and their coefficients, are used, in similar way to EEOFs, to
identify oscillating features and to reconstruct the data. Kondrashov et al. (2018a)
used DAH to analyse Arctic sea ice and predict September sea ice extent, whereas
Kondrashov et al. (2018b) used it to analyse wind-driven ocean gyres. The authors
argue, in particular, that the model has some predictive skill.
Chapter 8
Persistent, Predictive and Interpolated
Patterns
8.1 Introduction
Early attempts along this direction were explored by Renwick and Wallace
(1995) to determine patterns that maximise correlation between forecast and
analysis. This is an example of procedure that attempts to determine persistent
patterns. Persistent patterns are very useful in prediction but do not substitute
predictable patterns. In fact, prediction involves in general (statistical and/or dynam-
ical) models, whereas in persistence there is no need for modelling. This chapter
reviews techniques that find most persistent and also most predictable patterns,
respectively. The former attempt to maximise the decorrelation time, whereas the
latter attempt to minimise the forecast error. We can also increase forecastability
by reducing uncertainties of time series. A further technique based on smoothing is
also presented, which attempts to find most smooth patterns.
Here we consider the case of a univariate stationary time series (see Appendix C)
xt , t = 1, 2, . . . , with zero mean and variance σ 2 . The autocorrelation function
ρ(τ ) is defined by ρ(τ ) = σ −2 E (xt xt+τ ), where xt , t = 1, 2 . . ., are considered as
random variables. Suppose now that the centred time series is a realisation of these
random variables, i.e. an observed time series of infinite sample, then an alternative
definition of the autocorrelation is
1
n
1
ρ(τ ) = 2 lim xt xt+τ , (8.1)
σ n→∞ n
k=1
where σ 2 = limn→∞ n1 nk=1 xt2 is the variance of this infinite sequence. The
connection between both the definitions is assured via what is known as the ergodic
theorem. Note that the first definition involves a probabilistic framework, whereas
the second is merely functional. Once the autocorrelation function is determined,
the power spectrum is defined in the usual way (see Appendix C) as
∞
∞
f (ω) = ρ(τ )e 2π iτ ω
=1+2 ρ(τ ) cos2π ωτ. (8.2)
τ =−∞ τ =1
The autocorrelation function of a stationary time series goes to zero at large lags.
The decorrelation time is defined theoretically as the smallest lag τ0 beyond which
the time series becomes decorrelated. This definition may be, however, meaningless
8.2 Background on Persistence and Prediction of Stationary Time Series 173
since in general the autocorrelation does not have a compact support,2 and therefore
the decorrelation time is in general infinite.
Alternatively, the decorrelation time can be defined using the integral of the
autocorrelation function when it exists:
∞ ∞
T = ρ(τ )dτ = 2 ρ(τ )dτ (8.3)
−∞ 0
T = f (0). (8.5)
The integral T in Eqs. (8.3) and (8.4) represents in fact a characteristic time between
effectively independent sample values (Leith 1973) and can therefore be used as a
measure of the decorrelation time. This can be seen, for example, when one deals
with a AR(1) or Markovian time series. The autocorrelation function of the red
noise is ρ(τ ) = exp (−|τ |/τ0 ), which yields T = 2τ0 . Therefore, in this case
the integral T plays the role of twice the e-folding time of ρ(τ ), as presented
in Sect. 3.4.2 of Chap. 3. Some climatic time series are known, however, not to
have finite decorrelation time. These are known to have long memory, and the
corresponding time series are also known as long-memory time series. By contrast
short-memory time series have autocorrelation functions that decay exponentially
with increasing lag, i.e.
for every integer k ≥ 0. Long-memory time series have autocorrelations that decay
hyperbolically, i.e.
ρ(τ ) ∼ aτ −α (8.7)
for large lag τ and for some 0 < α < 1, and hence T = ∞. Their power spectrum
also behaves in a similar way as
2A function with compact support is a function that is identically zero outside a bounded subset of
the real axis.
174 8 Persistent, Predictive and Interpolated Patterns
as ω → 0. In Eqs. (8.7) and (8.8) a and b are constant (e.g. Brockwell and Davis
2002, chap. 10).
where h(z) = k≥0 αk zk , and B is the backward shift operator, i.e. Bzt = zt−1 .
The prediction error covariance is given by
2
σ12 = min E xt+τ − x̂t+τ . (8.11)
The prediction theory based on the entire past of the time series using least square
estimation has been developed in the early 1940s by Kolmogorov and Wiener and
is known as the Kolmogorov–Wiener approach. Equation (8.10) represents a linear
filter with response function H (ω) = h(eiω ), see Appendix C. Now the prediction
error ετ = xt+τ − x̂t+τ has (1 − H (ω)) as response function. Hence the prediction
error variance becomes
π
σετ = |1 − H (ω)|2 f (ω)dω, (8.12)
−π
1
π
which can also be written as 2π exp 2π −π log f (ω) dω . The logarithm used here
is the natural logarithm with base e. Note that the expression given by Eq. (8.14) is
always finite for a stationary process.
π
Exercise Show that for a stationary time series, 0 ≤ exp 2π 1
−π log f (ω) dω <
∞.
π
Hint Use the fact that log x ≤ x along with the identity σ 2 = −π f (ω) dω,
π
yielding −∞ ≤ −π log f (ω) dω ≤ σ 2 .
π
Note that −π log f (ω) dω can be identical to −∞, in which case σ1 = 0, and the
time series xt , t = 1, 2, . . ., is known as singular or deterministic; see e.g. Priestly
(1981) or (Brockwell and Davis 1991, 2002).
The Kolmogorov formula can be extended to the multivariate case, see e.g.
Whittle (1953a,b) and Hannan (1970). Given a nonsingular spectral density matrix
F(ω), i.e. |F(ω)| = 0, the one-step ahead prediction error is given by (Hannan 1970,
theorem 3” p. 158, and theorem 3”’ p. 162)
π π
1 1
σ12 = exp tr (log 2π F(ω)) dω = exp log |2π F(ω)|dω .
2π −π 2π −π
(8.15)
Recall that in (8.15) |2π F(ω)| = (2π )p |F(ω)|, where p is the dimension of the
multivariate time series. Furthermore, if x̂t+τ is the minimum square error τ -step
ahead prediction of xt and ετ the error covariance matrix3 of ε τ = xt+τ − x̂t+τ ,
and then σ12 = det( ε1 ).
3 The error covariance matrix ετ can also be expressed explicitly using an expansion in
power series of the spectral density matrix. In particular, if F(ω) is factorised as F(ω) =
∗ iω
2π 0 (e )0 (e ), then ε1 takes the form:
1 iω
ε1 = 0 (0)∗0 (0);
Table 8.1 Comparison between time domain and spectral domain characteristics of multivariate
stationary time series
Time domain Spectral domain
τ = 0 : = F(ω)dω ω = 0 : F(0) = τ τ
π −2iπ ω dω
τ = 0 : τ = 2π 1
−π F(ω)e ω = 0 : F(ω) = τ e2π iωτ
[ 0 ]ii = σi is the ith variance
2 [F(0)]ii = Ti is the ith decorrelation time
Remark Stationary time series can be equally analysed in physical space or spectral
space and yields the same results. There is a duality between these two spaces.
Table 8.1 shows a comparison between time and spectral domains.
The table shows that spectral domain and time domain analyses of stationary
time series are mirrors of each other assured using Fourier transform. It is also clear
from the table that the image of EOFs is persistent patterns, whereas the image of
frequency domain EOFs is POPs.
The objective here is similar to that of EOFs where, instead of looking for patterns
that maximise the observed variance of the space–time field, one seeks patterns
that persist most. The method has been introduced and applied by DelSole (2001).
Formally speaking, the spatial patterns themselves, like EOFs, are stationary. It
is the time component of the patterns that is most persistent, i.e. has the largest
decorrelation time.
Given a space–time field xt , t = 1, 2, . . . , the objective is to find a pattern u,
the optimally persistent pattern (OPP), ∞whose time series coefficients display the
maximum decorrelation time T = 2 0 ρ(τ )dτ . Precisely, given a p-dimensional
times series xt , t = 1, 2, . . . , we let yt , t = 1, 2, . . . , n, be the univariate time
series obtained by projecting the field onto the pattern u, i.e. yt = uT xt . The
autocovariance function of this time series is given by
γ (τ ) = E (yt+τ yt ) = uT τ u, (8.16)
uT τ u
ρ(τ ) = , (8.17)
uT u
where is the covariance matrix of the time series xt , t = 1, 2, . . . . Using the
identity −τ = Tτ , the decorrelation time of (yt ) reduces to
8.3 Optimal Persistence and Average Predictability 177
∞
uT τ + Tτ dτ u
T = 0
. (8.18)
uT u
The maximum of T in (8.18) is given by the generalised eigenvalue problem:
−1 Mu = λu, (8.19)
∞
where M = 0 τ + Tτ dτ . Note that Eq. (8.19) can also be transformed to
yield a simple eigenvalue problem of a symmetric matrix as
where C = 1/2 is a square root of the covariance matrix . The optimal patterns
are then given by
uk = −1/2 vk , (8.21)
So the eigenvalues maximise the Raleigh quotient, and the eigenvectors maximise
the ratio of the zero frequency to the total power.
Exercise Show that M is semi-definite positive, and consequently M is also
symmetric and positive semi-definite.
Hint Use the autocovariance function γ () of yt .
Remark If F(ω) is the spectral density matrix of xt , then M = F(0) and is symmet-
ric semi-definite positive. Let zt , t = 1, 2, . . . , n, be a stationary multivariate time
series with M as covariance matrix, and then the generalised eigenvalue problem,
Eq. (8.19), represents the solution to a filtering problem based on signal-to-noise
maximisation, in which the input is xt , and the output is xt + zt .
The eigenvalue problem, Eq. (8.20), produces a set of optimal patterns that can
be ordered naturally according to the magnitude of the (non-negative) eigenvalues
of M. The patterns uk , k = 1, . . . , p, are not orthogonal, but vk , k = 1, . . . , p,
178 8 Persistent, Predictive and Interpolated Patterns
p
p
xt = αk (t)vk = xTt vk vk . (8.23)
k=1 k=1
In this case the time series coefficients are not uncorrelated. Alternatively, it is
possible to compromise the orthogonality of the filter patterns and get instead
uncorrelation of the time coefficients in a similar way to REOFs. This is achieved
using again the orthogonality of the filter patterns, or in other words the bi-
orthogonality of the optimal persistent pattern uk , and the associated filter wk =
uk , i.e.
wk ul = δkl
p
xt = βk (t)wk , (8.24)
k=1
where now
Note that the patterns (or filters) wk , k = 1, . . . , p, are not orthogonal, but the new
time series coefficients βk (t), k = 1, . . . , p, are uncorrelated. In fact, we have
The sampling errors associated with the decorrelation times T = λ, resulting from
a finite sample of data, can be calculated in a manner similar to to that of EOFs
(DelSole 2001; Girshik 1939; Lawley 1956; Anderson 1963; and North et al. 1982).
In some time series with oscillatory autocorrelation functions, such as those
produced by a AR(2) model, the correlation integral T in (8.3) can tend to zero as
the theoretical decorrelation time goes to infinity.4 DelSole (2001) proposes using
the integral of the square of the autocorrelation function, i.e.
∞
4 The example of a AR(2) model where ρ(τ ) = e−|τ |/τ0 cos ω0 τ and T1 = 0 ρ(τ )dτ =
−1
τ0 1 + ω02 τ02 was given in DelSole (2001).
8.3 Optimal Persistence and Average Predictability 179
∞
T2 = ρ 2 (τ )dτ. (8.27)
0
In this case, the maximisation problem cannot be solved analytically as in Eq. (8.18),
and the solution has to be found numerically. Note also that the square of the auto-
correlation function does not solve the problem related to the infinite decorrelation
time for a long-memory time series.
A comparison between the performance of T-optimals, Eq. (8.3), or T2-optimals,
Eq. (8.27), applied to the Lorenz (1963) model by DelSole (2001) shows that the T2-
optimal remains correlated well beyond 10 time units, compared to that obtained
using EOFs or Eq. (8.3). The latter patterns become uncorrelated after three time
units as shown in Fig. 8.1. This may have implication on forecasting models. For
example, the optimal linear prediction of the Lorenz (1963) system (Penland 1989)
had skill about 12 time unit, which makes the T2-optimal provide potential for
statistical forecast models. The T2-optimal mode (Fig. 8.1), however, cannot be well
modelled by the first-order Markov model. It can be better modelled by a second-
order Markov model or AR(2) as suggested by DelSole (2001). Furthermore, the
T2-optimal mode cannot also be produced by the POP model.
In practice we deal with discrete time series, and the matrix M is normally written
as a sum + 1 + T1 + 2 + T2 + . . . , which is normally limited to some
lag τ0 beyond which the autocorrelations of the (individual) variables become non-
significant, i.e.
0.5
ρτ
0.0
–0.5
0 10 0 10 0 10
τ τ τ
Fig. 8.1 Autocorrelation function of the leading PC (left), the leading time series of the T-optimal
(middle) and that of the T2 -optimal (right) of the Lorenz model. Adapted from DelSole (2001).
©American Meteorological Society. Used with permission
180 8 Persistent, Predictive and Interpolated Patterns
M = + 1 + T1 + · · · + τ0 + Tτ0 . (8.28)
Remark Note that, in general, τ0 need not be large, and in some cases it can be
limited to the first few lags. For example, for daily geopotential heights τ0 is around
a week. For monthly sea level pressure, τ0 ≈ 9 − 12 months. For sea surface
temperature, one expects a larger value of τ0 . Again here we suppose that the data
are short memory. There are of course exceptions with variables that may display
signature of long memory such as wind speed or perhaps surface temperature. In
those cases the lag τ0 will be significantly large. Some time series may have long-
term trends or periodic signals, in which case it is appropriate to filter out those
signals prior to the analysis.
Persistent patterns from atmospheric fields, e.g. reanalyses, using T or T2
measures may not be too different, particularly, for the prominent modes. In fact,
DelSole (2001) finds that the leading few optimal T- or T2-persistent patterns are
similar for allowable choices of EOF truncation (about 30 EOFs), but with possible
differing ordering between the two methods. For example, the leading T2-optimal
of daily NCEP 500–mb heights for the period 1950–1999 is shown in Fig. 8.2. This
pattern is also the second T-optimal pattern. The pattern bears resemblance to the
Arctic Oscillation (AO) Thompson and Wallace (1998). The trend signature in the
time series is not as strong as in the AO pattern as the T2-optimal is based on
all days not only winter days. Note also the 12–15 days decorrelation time from
the autocorrelation function of the time series (Fig. 8.2). An interesting result of
OPP is that it can also identify other low-frequency signatures such as trends and
discontinuities. For example, the trailing OPP patterns are found to be associated
with synoptic eddies along the storm track (Fig. 8.3).
In reality, the above truncation in Eq. (8.28) can run into problems when we
compute the optimal decorrelation time. In fact, a naive truncation of Eq. (8.18),
giving equal weights to the different lagged covariances, can yield a negative
decorrelation time as the lagged covariance matrix is not reliable due to the small
sample size used when the lag τ is large. Precisely, to obtain the finite sample of
Eq. (8.22), a smoothing of the spectrum is required as the periodogram is not a
consistent estimator of the power spectrum.
When we have a finite sample xt , t = 1, . . . , n, we use the sample lagged
covariance matrix
1 n−τ
xt+τ xTt 0 ≤ τ < n
Cτ = n1 t=1n+τ (8.29)
t=1 xt xt−τ −n < τ < 0
T
n
Fig. 8.2 The leading filter and T2-optimal persistent patterns (top), the associated time series
(middle) and its autocorrelation function (bottom) of the daily 500-hPa geopotential height
anomalies for the period 1950–1999. The analysis is based on the leading 26 EOFs/PCs. Adapted
from DelSole (2001). ©American Meteorological Society. Used with permission
M
F̂(ω) = ατ Cτ e−iωk τ (8.31)
τ =−M
182 8 Persistent, Predictive and Interpolated Patterns
Fig. 8.3 Same as Fig. 8.2 but for the trailing filter and T2-optimal persistent pattern. Adapted from
DelSole (2001). ©American Meteorological Society. Used with permission
with ω√k = 2π k/n, and ατ is a lag window at lag τ . DelSole (2006) considered
M = n (Chatfield 1989) and used a Parzen window,
8.3 Optimal Persistence and Average Predictability 183
,
τ 2
1−6 +6 τ
if 0 ≤ τ ≤ M/2
ατ = M
τ 2
M (8.32)
2 1− M if M/2 ≤ τ ≤ M,
because it cannot give negative power spectra compared, for example, to the Tukey
window. Note that here again, through an inverse Fourier transform of Eq. (8.31),
the finite sample eigenvalues maximise a similar Raleigh quotient to that derived
from Eq. (8.22).
DelSole (2006) applied OPP analysis to surface temperature using the observed
data HadCRUT2, compiled jointly by the Climate Research Unit and the Met
Office’s Hadley Centre, and 17 IPCC AR4 (Intergovernmental Panel for Climate
Change 4th Assessment Report) climate models. He found, in particular, that the
leading two observed OPPs are statistically distinguishable from noise and can
explain most changes in surface temperature. On the other hand, most model
simulations produce one single physically significant OPP.
Remark A similar method based on the lag-1 autocorrelation ρ(1) was proposed by
Wiskott and Sejnowski (2002). They labelled it slow feature analysis (SFA), and it
is discussed more in Sect. 8.6.
Average predictability pattern (APP) analysis was presented by DelSole and Tippett
(2009a,b) based on the concept of average predictability time (APT) decomposition,
which is a metric for the average predictability. Let us designate by τ the
covariance matrix of time series xt , t = 1, 2, . . . , at
the forecast of a p-dimensional
lead time τ , i.e. E (x̂t+τ − xt+τ )(x̂t+τ − xt+τ )T , and the APT is given by
∞
S=2 Sτ , (8.33)
τ =1
where Sτ = p1 tr ( ∞ − τ ) −1 ∞ , also known as the Mahalanobis signal, and
∞ is the climatological covariance.
APT decomposition analysis seeks vectors v such that the scalar time series yt =
xTt v, t = 1, 2, . . ., has maximum APT. Keeping in mind that for these univariate
time series στ2 = vT τ v and σ∞2 = vT v, the pattern v is obtained as the solution
∞
of the following generalised eigenvalue problem:
∞
2 ( ∞ − τ ) v = λ ∞ v. (8.34)
τ =1
Note that Eq. (8.34) can be transformed to yield the following eigenvalue problem:
184 8 Persistent, Predictive and Interpolated Patterns
∞
−1/2 T −1/2
2 I − ∞ τ ∞ w = λw, (8.35)
τ =1
1/2
where w = ∞ v and is taken to be unit norm. The APP u is then obtained by
projecting the time series onto v, i.e. E(xt yt ), and
u = ∞ v. (8.36)
A similar argument to the OPP can be applied here to get the decomposition of the
time series xt , t = 1, . . . , n, using Eqs. (8.24) and (8.25) after substituting uk and
p
vk , Eq. (8.36), for uk and wk , Eq. (8.24), respectively, i.e. xt = k=1 (vTk xt )uk .
To estimate the APPs, the linear model xt+τ = Axt + ε t , with A = Cτ C−1 0 , is
used, and for which τ = C0 − Cτ C−1 0 C T . The patterns are then solution to the
τ
eigenproblem:
Cτ C−1
0 Cτ v = λC0 v.
T
(8.37)
τ
The estimation from a finite sample is obtained in a similar manner to the OPPs.
In fact, to avoid getting negative APT values, which could result from a “naive”
truncation of Eq. (8.37), DelSole and Tippett (2009b) suggest using a Parzen lagged
window ατ , given in Eq. (8.32), to weight the lagged covariance matrices, which
does not produce negative spectra. The APT is then estimated using
M
S=2 ατ Sατ . (8.38)
τ =1
DelSole and Tippett (2009b) applied APP analysis to the National Center
for Environmental Prediction/National Center for Atmospheric Research
(NCEP/NCAR) 6-hourly 1000-hPa zonal velocity during the 50-year period
from 1 January 1956 to 31 December 2005, providing a sample size of 73,052.
Their analysis reveals that the prominent predictable patterns reflect the dominant
low-frequency modes, including a climate change signal (leading pattern), a ENSO-
related signal (second pattern) in addition to the annular mode (third pattern). For
example, the second predictable component (Fig. 8.3) has an average predictability
time of about 5 weeks, and predictability is mostly captured by ENSO signal. They
also obtained the MJO signal when the zonal wind was reconstructed based on
the leading seven predictable patterns. The remaining patterns were identified with
weather predictability having time scales less than a week.
Fischer (2015, 2016) provides an alternative expression of the OPP and APP
analyses based on the reduced rank multivariate regression Y = XB + E, where E
is a n × p residual matrix and B a p × d matrix of regression coefficients. For OPP
8.4 Predictive Patterns 185
M
and APP analyses, the tth row of Y is given by yt = τ =−M ατ xt+τ , where ατ
represents the Parzen lag window, Eq. (8.32), at lag τ .
8.4.1 Introduction
Each of the methods presented so far serves a particular goal. EOFs, for example, are
defined without constraint on the temporal structure (since they are based on using
contemporaneous maps) and are forced to be correlated with the original field. On
the other hand, POPs, for example, compromise the last property, i.e. correlation
with the field, but use temporal variation. EEOFs, or MSSA, use both the spatial
and the temporal structure and produce instead a set of patterns that can be used to
filter the field or find propagative patterns. They do not, however, make use of the
of the temporal structure, e.g. autocovariance, in an optimal way. This means that
there is no constraint on predictability.
Because they are formulated using persistence, the OPPs use the autocovariance
structure of the field. They achieve this by finding patterns whose time series evo-
lution decays very slowly. These patterns, however, are not particularly predictable.
Patterns that maximise covariance or correlation between forecast and analysis
(Renwick and Wallace 1995) deal explicitly with predictability since they involve
some measure of forecast skill. These patterns, however, are not the most predictable
since they are model-dependent.
An alternative way to find predictable patterns is to use a conventional measure
of forecast skill, namely the prediction error variance. In fact this is a standard
measure used in prediction and yields, for example, the Yule–Walker equations
for a stationary univariate time series. Predictive Oscillation Patterns (PrOPs)
Kooperberg and O’Sullivan (1996) achieve precisely this. PrOPs are patterns whose
time coefficients minimise the one-step ahead prediction error and as such are
considered as the Optimally Persistent Patterns or simply the most predictable
patterns. When this approach was introduced, Kooperberg and O’Sullivan (1996)
were not motivated by predicting the weather but mainly by working out a hybrid of
EOFs and POPs that attempt to retain desirable aspects of both, but not by predicting
the weather.
the time series yt , t = 1, 2 . . . , has the smallest one-step ahead prediction error
variance. Using the autocovariance of this time series, as a function of the lagged
covariance matrix of xt , see Eq. (8.16), the spectral density function f (ω) of yt ,
t = 1, 2 . . . , becomes
Higher order PrOPs are obtained in a similar manner under the constraint of being
orthogonal to the previous PrOPs. For instance, the k + 1th PrOP is given by
π uT F(ω)u
uk+1 = argmin log dω. (8.44)
u, uT uα = δα,k+1 −π uT u
Suppose, for example, that the first k PrOPs, k ≥ 1, have been identified. The next
one is obtained as the first PrOP of the residuals:
k
zt = xt − yt,j uj (8.45)
j =1
⎛ ⎞
k
zt = ⎝Ip − uj uTj ⎠ xt = Ak xt , (8.46)
j =1
and Ak is simply the projection operator onto the orthogonal complement to the
space spanned by the first k PrOPs. The PrOP optimisation problem derived from
the residual time series zt , t = 1, 2, . . . , yields
π uT ATk F(ω)Ak u
min ln dω, (8.47)
u −π uT u
which provides the k + 1th PrOP uk+1 . Note that Ak uk+1 = uk+1 , so that uk+1
belongs to the null space of (u1 , . . . , uk ).
The optimisation problem in Eq. (8.42) or Eq. (8.47) can only be carried out
numerically using some sort of descent algorithm (Appendix E) because of the
nonlinearity involved. Given a finite sample xt , t = 1, 2, . . . , n, the first step consists
in estimating the spectral density matrix using, for example, the periodogram (see
Appendix C):
1 −itωp T itωp
n n
I(ωp ) = xt e xt e (8.48)
nπ
t=1 t=1
for ωp = 2πpn , p = −[ 2 ], . . . , [ 2 ], where [x] is the integer part of x. Note that the
n−1 n
first sum in the rhs of Eq. (8.48) is the Fourier transform x̂(ωp ) of xt , t = 1, . . . , n
at frequency ωp , and I(ωp ) = π1 x̂(ωp )x̂∗T (ωp ), where the notation x ∗ stands for
the complex conjugate of x. However, since the periodogram is not a consistent
estimator of the power spectrum, a better estimator of the spectral density matrix
F(ω) can be obtained by smoothing the periodogram (see Appendix C):
n
[2]
F̂(ω) = (ω − ωp )I(ωp ), (8.49)
p=−[ n−1
2 ]
[ n2 ]−1
π uT F̂(ωk )u uT F̂(ωk+1 )u
F(u) = log + (8.50)
n uT u uT u
k=−[ n−1
2 ]
and similarly for Eq. (8.47). A gradient type (Appendix E) can be used in the
minimisation. The gradient of F(u) is given by
π
1 1 F(ω)
− ∇F(u) = Ip − T
dω u
4π 2π −π u F(ω)u
⎡ ⎤
[ n2 ]−1
⎢ F̂(ωk ) F̂(ωk+1 ) ⎥
≈ ⎣I p − + ⎦ u. (8.51)
T T uT F̂(ωk+1 )uT
n−1 u F̂(ωk )u
k=−[ 2 ]
The application of PrOPs to 47-year 500-mb height anomalies using the National
Meteorology Center (NMC) daily analyses (Kooperberg and O’Sullivan 1996)
shows some similarities with EOFs. For example, the first PrOP is quite similar
to the leading EOF, and the second PrOP is rather similar to the third EOF. An
investigation of forecast errors (Fig. 8.4) suggests that PrOPs perform, for example,
better than POPs and have similar performance to EOFs.
Fig. 8.4 Forecast error versus the number of the patterns using PCA (continuous), POPs (dotted)
and PrOPs (dashed) of the NMC northern extratropical daily 500-hPa geopotential height
anomalies for the period of Jan 1947–May 1989. Adapted from Kooperberg and O’Sullivan (1996)
8.5 Optimally Interpolated Patterns 189
8.5.1 Background
Accordingly, the error xt − xt,τ has Ip − H(ω) as frequency response function, and
hence the error covariance matrix takes the form (see Sect. 2.6, Chap. 2):
π ∗T
τ = Ip − H(ω) F(ω) Ip − H(ω) dω. (8.55)
−π
x̂t = αj xt−j = h(B)xt , (8.56)
j =0
where h(z) = j =0 αj z
j, such that the mean square error
T
xt − x̂t 2
=E xt − x̂t xt − x̂t (8.57)
We give below an outline of the proof and, for more details, refer to the above
references. We let xk = (Xk1 , . . . , Xkp ), k = 0, ±1, ±2, . . . , be a sequence of zero-
mean second-order random vectors, i.e. with components having finite variances.
Let also Ht be the space spanned by the sequence {Xkj , k = t, j = 1, . . . , p, k =
0, ±1, ±2, . . . , k = t} known also as random function. Basically, Ht is composed
of finite linear combinations of elements of this random function. Then Ht has
the structure of a Hilbert space with respect to a generalised scalar product (see
Appendix F). The estimator x̂t in Eq. (8.56) can be seen as the projection5 of xt
onto this space. Therefore xt − x̂t is orthogonal to x̂t and also to xs for all s = t.
The first of these two properties yields
5 Not exactly so, because Ht is the set of all finite linear combinations of elements from
the sequence. However, this can be easily overcome by considering partial sums of the form
N
hN (B)xt = k=−N,k=0 αk xt−k and then make N approach infinity. The limit h() of hN () is
then obtained from
π
lim tr (H(ω) − HN (ω)) F(ω) (H(ω) − HN (ω))∗ dω = 0, (8.61)
N →∞ −π
E xt − x̂t x∗T
s = O for all s = t, (8.62)
where the notation (∗ ) stands for the complex conjugate. This equation can be
expressed using the spectral density
matrix F(ω) and the multivariate frequency
response function Ip − H(ω) of xt − x̂t , refer to Sect. 2.6 in Chap. 2. Note that
t can be set to zero because of stationarity. This necessarily implies
π
−π Ip − H(ω) F(ω)e
isω dω = O for s = 0. (8.63)
This necessarily implies that the matrix inside the integral is independent of ω, i.e.
Ip − H(ω) F(ω) = A, (8.64)
where
A is a constant matrix. The second orthogonality property, i.e.
E xt − x̂t x̂∗T
t = O, implies a similar relationship; namely,
π
Ip − H(ω) F(ω)H∗T (ω)dω = O. (8.65)
−π
Now, by expanding the expression of the covariance matrix τ in Eq. (8.55) and
using the last orthogonality property in Eq. (8.65), one gets
π
= Ip − H(ω) F(ω)dω = 2π A. (8.66)
−π
Finally, substituting the expression of Ip − H(ω) from Eq. (8.64) into the expres-
sion of in Eq. (8.55) and noting that A is real (see Eq. (8.66)) yield
π
1
A−1 = F−1 (ω)dω, (8.67)
2π −π
2
E yt − ŷt = uT u. (8.68)
The optimally interpolated pattern (OIP) u is the vector minimising the interpolation
error variance, Eq. (8.68). Hence the OIPs are given by the eigenvectors of the
interpolation error covariance matrix in Eq. (8.59) associated with its successively
increasing eigenvalues.
192 8 Persistent, Predictive and Interpolated Patterns
Exercise In general, Eq. (8.66) can only be obtained numerically, but a simple
example where can be calculated analytically is given in the following model
(Hannachi 2008):
xt = uαt + ε t ,
β
= ε + uuT ,
1 − βuT ε −1 u
where F̂(ω) is an estimate of the spectral density matrix given, for example, by
Eq. (8.49). Note that, as it was mentioned above, smoothing the periodogram makes
F̂(ω) full rank. A trapezoidal rule can then be used to approximate this integral to
yield
n
2 −1
ˆ −1 ≈ 1
F̂−1 (ωk ) + F̂−1 (ωk+1 ) , (8.70)
4π n n−1
k=− 2
where [x] is the integer part of x, and ωk = 2π k/n represents the kth discrete
frequency. Any other form of quadrature, e.g. Gaussian, or interpolation (e.g. Linz
and Wang 2003) can also be used to evaluate the previous integral.
In Hannachi (2008) the spectral density was estimated using smoothed peri-
odogram given in Eq. (8.49), where () is a smoothing spectral window and
I(ωk ) = π −1 x̂(ωk )x̂∗T (ωk ), where x̂(ωk ) is the Fourier transform of xt , t =
1, . . . , n at ωk , that is x̂(ωk ) = n−1/2 nt=1 xt e−iωk t .
8.5 Optimally Interpolated Patterns 193
8.5.4 Application
where α = 0.004, εt1 , εt2 and εt3 are first-order AR(1) models with respective lag-1
autocorrelation of 0.5, 0.6 and 0.3. Figure 8.5 shows a simulated example from this
model. The trend is shared between PC1 and PC2 but is explained solely by OIP1,
as shown in Fig. 8.6, which shows the histograms of the correlation coefficients
between the linear trend and the PCs and OIP time series.
Fig. 8.5 Sample time series of system, Eq. (8.71), giving xt , yt and zt (upper row), the three
PCs (middle row) and the three OIPs (bottom row). Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission. (a) xt . (b) yt . (c) zt . (d) PC 1. (e) PC 2. (f) PC 3.
(g) OIP 1. (h) OIP 2. (i) OIP 3
194 8 Persistent, Predictive and Interpolated Patterns
Fig. 8.6 Histograms of the absolute value of the correlation coefficient between the linear trend in
Eq. (8.71) and the PCs (upper row) and the OIP time series (bottom row). Adapted from Hannachi
(2008). ©American Meteorological Society. Used with permission
Fig. 8.7 Leading two OIPs (upper row) and EOFs (bottom row). The OIPs are based on the leading
5 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission
Fig. 8.9 Spatial and temporal correlation of the leading 3 OIP patterns (thin) and associated IPC
time series (bold), obtained based on the leading 5 EOFs/PCs, with the same OIPs and IPCs, for
m=5,6, . . . 25 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008).
©American Meteorological Society. Used with permission
Power spectra of the leading five OIP time series, i.e. interpolated PCs (IPCs),
is shown in Fig. 8.10 based on two estimation methods, namely Welch and Burg
methods. The decrease of low-frequency power is clear as one goes from IPC1
to IPC5. Another example where OIP method was applied is the tropical SLP.
The leading two OIPs (and EOFs) are shown in Fig. 8.11. The leading OIP of
tropical SLP anomalies, with respect to the seasonal cycle, represents the Southern
Oscillation mode. Compare, for example, this mode with the leading EOF, which is
a monopole pattern. The second EOF is more associated with OIP1.
Forecastable patterns are patterns that are derived based on uncertainties in time
series. Forecastable component analysis (ForeCA) was presented by Goerg (2013)
and is based on minimising a measure of uncertainty represented by the (differential)
entropy of a time series. For a second-order stationary time series xt , t =
1, 2, . . ., with autocorrelation function ρ(.) and spectral density f (.), a measure
of uncertainty is given by the “spectral” entropy6
6 The idea behind this definition is that if U is a uniform distribution over [−π, π ] and V a random
variable,
√ independent of U , and with probability density function g(.), then the time series yt =
2 cos(2π V t + U ) has precisely g(.) as its power spectrum (Gibson 1994).
8.6 Forecastable Component Analysis 197
π
1
H (x) = − f (ω) log f (ω)dω. (8.72)
log 2π −π
Note that the factor log 2π in Eq. (8.72) corresponds to the spectral entropy of a
white noise.
198 8 Persistent, Predictive and Interpolated Patterns
Fig. 8.11 Leading two OIPs and EOFs of the northern hemispheric SLP anomalies. The OIPs
are obtained based on the leading 5 EOFs/PCs. Adapted from Hannachi (2008). ©American
Meteorological Society. Used with permission. (a) Tropical SLP OIP 1. (b) Tropical SLP OIP
2. (c) EOF 1. (d) EOF 2
8.6 Forecastable Component Analysis 199
Given a n × p data matrix X, the ForeCA pattern is defined as the vector w that
maximises the forecastability of the time series x = XT w, subject to the constraint
wT Sw = 1, with S being the sample covariance matrix. Before proceeding, the data
matrix is first whitened, and then the multivariate spectrum matrix F(ω) is obtained.
The univariate spectrum along the direction w is then given by
The ForeCA patterns are then obtained by minimising the uncertainty, i.e.
π
w∗ = argmax F (ω) log F (ω)dω. (8.75)
w,wT w=1 −π
In application the integral in Eq. (8.74) is approximated by a sum over the discrete
frequencies ωj = j/n, j = 0, 1, . . . , n − 1. Goerg (2013) used the weighted
overlapping segment averaging (Nuttal and Carter 1982) to compute the power
spectrum. The estimate from Eq. (8.49) can be used to compute the multivariate
spectrum. Fischer (2015, 2016) compared five methods pertaining to predictability,
namely OPP, APP, ForeCA, principal trend analysis (PTA) and slow feature analysis
(SFA). PTA or trend EOF analysis (Hannachi 2007) is presented in Chap. 15, and
SFA (Wiskott and Sejnowski 2002), mentioned in the end of Sect. 8.3, is based
on the lag-1 autocorrelation. In Fischer (2016) these methods were applied to a
global dataset of speleothem δ 18 O for the last 22 ka, whereas in Fischer (2015) the
methods were applied to the Australian daily near-surface minimum air temperature
for the period of 1910–2013. He showed, in particular, that the methods give
comparable results with some subtle differences. Figure 8.12 shows the leading
three predictable component time series of δ 18 O dataset for the five methods. It is
found, for example, that OPP analysis, SFA and PTA tend to identify low-frequency
persistent components, whereas APP analysis and ForeCA represent more measure
of predictability. Of particular interest from this predictability analysis is the
association of these signals with North American deglaciation, summer insolation
and the Atlantic meridional overturning circulation (AMOC), see Fig. 8.12.
200 8 Persistent, Predictive and Interpolated Patterns
OPA
0
–2
APTD
0
–2
Standardized δ18O
ForeCA
0
–2
SFA
0
–2
PTA
0
–2
20
15
10
20
15
10
20
15
10
5
Fig. 8.12 The leading three predictable component time series (black) obtained from the five
methods, OPP (first row), APP (second row), ForeCA (third row), SFA (fourth row) and PTA
(last row) applied to the global δ 18 O dataset. The yellow bars refer, respectively, to the timing of
Heinrich Stadial 1, centred around 15.7 ka BP, and the Younger Dryas, centred around 12.2 ka BP.
The red curves represent the percentage of deglaciated area in North America (left), the summer
insolation at 65◦ N (middle) and the AMOC reconstruction (right). Adapted from Fischer (2016)
Chapter 9
Principal Coordinates or
Multidimensional Scaling
9.1 Introduction
n 1
λ
dij = xi − xj λ = |xik − xj k |λ (9.1)
k=1
9.2 Dissimilarity Measures 203
a+d
sij = , (9.2)
a+b+c+d
and the dissimilarity (or distance) can be defined by dij = 1 − sij . As an illustration
we consider a simple example consisting of six patients labelled A, B, . . . , F .
The variables (or attributes) considered are sex (M/F), age category (old/young),
employment (employed/unemployed) and marital status (married/single) and we use
the value 1 and 0 to characterise the variables. We then construct the following data
matrix (Table 9.3):
a+d
The similarity between, for example, A and B is a+b+c+d = 2+1+1+0
2+1
= 34 .
For categorical data the most common dissimilarity measure is given by Sneath’s
coefficient:
1
sij = 1 − δij k ,
p
k
xi − xj S−1 xi − xj
T
Mahalanobis
|xik −xj k |
Canberra k |xik |+|xj k |
One minus Pearson correlation 1 - corr(xi , xj )
seek a low-dimensional representation of the data such that the obtained distances
give a faithful representation of the true distances or dissimilarities between units.
9.3 Metric Multidimensional Scaling 205
So far we did not impose extra constraints on the distance matrix D. However, to
be able to solve the problem, we need the nondegenerate1 distance or dissimilarity
matrix = (δij ) to satisfy the triangular metric inequality:
for all i, j and k. In fact, when (3) is satisfied one can always find n points in a
(n − 1)-dimensional Euclidean space En−1 in which the interpoint distances satisfy
dij = δij for all pairs of points. For what follows, and also for simplicity, we suppose
that the data have been centred by requiring that the centroid of the set of n points
lie at the origin of the coordinates:
n
xk = 0. (9.4)
k=1
Note that if the dissimilarity matrix does not fulfil nondegeneracy or the
triangular metric inequality (9.3), then the matrix has to be processed to achieve
these properties. In the sequel we assume, unless otherwise stated, that these two
properties are satisfied. It is possible to represent the information contained in by
n points in En−1 , i.e. using a n × (n − 1) matrix of coordinates. An application of
PCA analysis to this set of points can yield a lower dimensional representation in
2
which the dissimilarity matrix ∗ minimises i,j δij − δij∗ . Since EOF analysis
involves an eigen decomposition of a covariance-type matrix, we will express the
distances δij , that we suppose Euclidean, using the n × n scalar product matrix A=
p
XXT , where the element at the ith line and j th column of A is aij = k=1 xik xj k .
p 2
Using Eq. (9.4) and recalling that δij2 = k=1 xik − xj k , we get
1 2
aij = −
δij − δi.2 − δ.j2 + δ..2 , (9.5)
2
where δi.2 = δ.i 2 = n1 nj=1 δij2 and δ..2 = n12 ni,j =1 δij2 .
Exercise Derive Eq. (9.5).
Hint By summing δij2 = aii + ajj − 2aij over one index then both indexes one gets
n
δ.j2 = ajj + n1 ni=1 aii = δj. 2 and
i=1 aii = 2 δ.. . Hence the diagonal terms are
n 2
aii = δi.2 − 12 δ..2 , which yields −2aij = δij2 − δi.2 − δ.j2 + δ..2 .
The expression aij = 1
2 aii + ajj − δij2 represents the cosine law from triangles.2
Also, the matrix with elements (δij2 − δi.2 − δ.j2 + δ..2 ), i, j = 1, . . . n, is known as
the double-centred dissimilarity matrix. Denote by 1 = (1, 1, . . . , 1)T the vector of
length n containing only ones, the scalar product matrix A in (9.5) can be written in
the more compact form
1 1 1
A=− In − 11T 2 In − 11T , (9.6)
2 n n
X = U, (9.8)
2 This is because in a triangle with vertices indexed by i, j and k, and side lengths dij , dik
d 2 +d 2 −d 2
and dj k the angle θi at vertex i satisfies cos θi = ij 2dijikdik j k . Hence, if we define bij k =
dij2 + dik
2 − d 2 /2, then b
jk ij k = dij dik cos θi , i.e. a scalar product.
Young–Householder decomposition.
9.3 Metric Multidimensional Scaling 207
EOFs whereas in MDS we have the distance matrix and we seek the data matrix.
One should, however, ask: “how do we obtain EOFs from the same dissimilarity
matrix?” The eigenvectors of the matrix of scalar products provide, in fact, the PCs,
see Sect. 3.3 of Chap. 3. So one could say the PCs are the principal coordinates, and
the EOFs are therefore given by a linear transformation of the principal coordinates.
When the matrix A has zero eigenvalues the classical procedure considers only the
part of the spectrum corresponding to positive eigenvalues. The existence of zero
eigenvalues is implicitly assured by the double-centred structure of A since A1 = 0.
A natural question arises here, which is related to the robustness of this procedure
when errors are present. Sibson (1972, 1978, 1979) has investigated the effect of
errors on scaling. It turns out, in particular, that classical scaling is quite robust,
where observed dissimilarities remain approximately linearly related to distances
(see also Mardia et al. 1979).
When the dissimilarity matrix is not Euclidean the matrix A obtained from (9.5) may
cease to be positive semi-definite, hence some of its eigenvalues may be negative.
The classical solution to this problem is simply to choose the first k largest (positive)
eigenvalues of A, and the corresponding normalised eigenvectors, which are taken
as principal coordinates. Another situation that appears in practice corresponds to
the case when one is provided with similarities rather than dissimilarities. A n × n
symmetric matrix C = (cij ) is a similarity matrix if cij ≤ cii for all i and j . A
standard way to obtain a distance matrix D = (dij ) from the similarity matrix C is
to use the transformation:
2
i<j dij − d̂ij
S= 2
, (9.11)
i<j dij
where (d̂ij ) are obtained as the least square monotone regression of the distances
(dij ) on the dissimilarities (δij ). Note that if (δij ) and (dij ) have the same ranking
order then the stress is zero, and is between 0 and 1 otherwise. Unlike the
classical scaling where the dimension space of the new configuration can be chosen
9.4 Non-metric Scaling 209
explicitly by fixing, say, the percentage of total variance explained by the principal
coordinates, in the non-metric case the dimension is unknown. In practice two or
three dimensions are normally chosen. The monotonic regression (increasing in
this case) requires that the order is preserved after transformation. Two possible
definitions of monotonicity exist:
• If δij < δkl , then d̂ij ≤ d̂kl , this is known as the primary monotone condition.
• If δij ≤ δkl , then d̂ij ≤ d̂kl , which is the secondary monotone condition.
The true requirements differ only in the presence of ties. The secondary monotone
approach requires, in particular, that if δij = δkl , then one should have d̂ij = d̂kl .
The primary approach is more general, with less convergence problems and is also
referred to as the global scaling (Chatfield and Collins 1980).
For a given configuration of distances (dij ), the distances (d̂ij ) are obtained, as
stated above, from the (primary/secondary) least squares monotone regression of
(dij ) on (δij ). That is, they are the set of values that minimise (Kruskal 1964a,b):
2
dij − d̂ij (9.12)
i<j
(1972), and Lingoes and Roskam (1973), point out that the secondary definition
of monotonicity is usually less satisfactory than the primary definition. Gordon
(1981) further argues that the secondary configuration appears to be less readily
interpretable. More discussion on the subject can be found in Young (1987), Cox
and Cox (1994), Borg and Groenen (2005), and Mardia et al. (1979).
Remark: Non-metric Scaling and Errors The non-metric scaling solution can also
be viewed as a solution to the classical scaling problem except that the distances
are not known perfectly but are corrupted with various types of noise such as
measurement and distortion errors. In this case what we observe is a distance matrix
D = (dij ) of the form:
where f () is an unknown monotonic increasing function and (εij ) are error terms.
The objective is then to reconstruct this unknown configuration (δij ). It is clear
in this example that a better and more reliable information is not the exact value of
these distances but their rank order. Consequently, we are better off using non-metric
or ordinal scaling. In some simple cases where, for example, f () is the identity and
the noise characteristics are known, then one could attempt to use a linear filter
and then obtain an estimate of δij . In general, however, this information is seldom
available in which case we opt for the former solution.
In replicated MDS, RMDS, (McGee 1968, Young and Hamer 1994) we are given
m similarity matrices that have to be analysed simultaneously. In metric RMDS
the distance matrices are determined by minimising the sum of squared elements
k ∗
2
ij k δ ij − δij where the sum includes not only indices related to individuals
but also indices of the individual distance matrices. Similarly, non-metric RMDS
also attempts to minimise the total (including individual dissimilarity matrices)
sum of squared residuals, similar to Eq. (9.12), between distances and monotonic
transformations of the dissimilarities. The RMDS generates one unique distance
matrix D that is “simultaneously like” all the dissimilarity matrices, which we can
write as:
Fk (k ) = D + E k
Many datasets may contain nonlinear structures that cannot be revealed by classical
MDS. This includes data distributed on lower dimensional manifolds, such as Swiss-
roll, tori, etc., where only geodesic distances, i.e. distances on the manifold, can
reflect the true low-dimensionality of the data. Various methods have attempted
to solve this problem. These include local linear techniques and other nonlin-
ear techniques such as local linear embedding, that computes low-dimensional
neighbourhood-preserving embeddings of high-dimensional data (Roweis and Saul
2000). An MDS-related method, the ISOMAP, was proposed by Tenenbaum et al.
(2000). Their approach builds on classical MDS but, in addition, seeks to preserve
the intrinsic geometry of the data, hence including the global geometry of the data.
Their definition of geodesic distance is based on computing shortest paths in a graph
with edges connecting neighbouring data points. This is particularly helpful when
the data contain for example holes. In their algorithm each point is connected to its
“nearest” neighbours using the provided distances dij , which are used to define a
new weighted distance dw (i, j ) given by4
dij if i and j are neighbours
dw (i, j ) = (9.14)
∞ otherwise.
4A similar distance dictated by time proximity, i.e. using the autocorrelation can be used.
212 9 Principal Coordinates or Multidimensional Scaling
to require O(n3 ) operations, see e.g. Borg and Groenen (2005) for algorithms
that exploit the sparsity structure of the neighbourhood graph. The configuration
2
of the points is then obtained by minimising δij − δij∗ , which yields the
solution, Eq. (9.8), obtained using the scalar product matrix Eq. (9.6) with [2 ]ij =
dG (i, j )2 .
Remark Note that the main part of the algorithm consists in computing the
neighbourhood using the k-nearest neighbours or using a ball of small fixed radius
. The choice of the parameters k or is discussed, e.g. in Ross et al. (2008), Gamez
et al. (2004) and Hannachi and Turner (2013b).
Nonlinear MDS via Isomap or related methods can reveal structures that cannot be
obtained using linear MDS. We discuss here an application to the Asian summer
monsoon (ASM) using the European Re-analyses (ERA-40) products from the
European Centre for Medium Weather Forecast, ECMWF, (Uppala et al. 2005). The
example is taken from Hannachi and Turner (2013b) using sea level pressure (SLP)
and 850-hPa wind fields over the Asian monsoon region (50–145◦ E, 20S –20◦ N) for
June–September (JJAS) 1958–2001. The JJAS climatology of monsoon is shown
in Fig. 9.1, which is dominated by the low-level Somali jet bringing moisture from
the Indian Ocean into the Indian peninsula with a general low pressure over land
masses. The leading two EOFs of the SLP anomalies, with respect to the mean
seasonal cycle, are shown in Fig. 9.2. The leading mode of variability reflects one
sign over the whole region associated with an overall positive or negative phase of
Fig. 9.1 JJAS climatology of ERA-40 sea level pressure and 850-hPa wind over the ASM region.
SLP contour interval: 2-hPa, and maximum wind speed: 18 m/s. Adapted from Hannachi and
Turner (2013b)
9.5 Further Extensions 213
the EOF. The second EOF reflects a south-west/north-east dipole reflecting broadly
the variability of the strength between the Indian and east Asian monsoon.
A large volume of Asian monsoon literature deals with the existence of two main
monsoon states: break (or weak) and active (or strong) phases. Figure 9.3 shows a
kernel PDF estimate of JJAS daily SLP anomalies, where no signature of bimodality
exists. Because Isomap operates locally, it can follow the nonlinear manifold by
building a network using local neighbourhood. An example of such neighbourhood
of SLP anomalies is shown in Fig. 9.4. The two leading isomap time series and the
corresponding PDF are shown in Fig. 9.5, where a clear bimodality emerges. The
214 9 Principal Coordinates or Multidimensional Scaling
Fig. 9.5 Leading two Isomap time series of the ASM SLP anomalies (a) and the corresponding
two-dimensional kernel PDF estimate (b). Adapted from Hannachi and Turner (2013b)
two modes reflect in fact the break and active ASM phases (Hannachi and Turner
2013a). A close examination of the PDF of the two time series within the probability
space (Fig. 9.6) shows that there are indeed three robust modes. These modes are
also identified using a Gaussian mixture model as shown in Fig. 9.7. They show,
in addition to the active and break phases, a third mode, i.e. the west-north Pacific
(WNP) active phase. The precipitation map associated with those phases is shown
in Fig. 9.8. In the active phase most precipitation occurs over western India along
the Western Ghats in addition to the northern part of India and south Pakistan. The
WNP active phase is associated with precipitation over south east and north east
China with dry conditions over east China. This latter region receives precipitation
during the active and break phases.
9.5 Further Extensions 215
Fig. 9.7 Three-component Gaussian mixture model of the leading two Isomap time series of the
ASM SLP anomalies showing three regimes of ASM phases (a), namely the WNP active phase
(b), the active phase (c) and the break phase (d) of SLP anomalies and 850-hPa wind (maximum
wind arrow: 2.5 m/s). Adapted from Hannachi and Turner (2013b)
We have seen in Sect. 3.3 that when the dissimilarity matrix is non-Euclidean, one
solution is to restrict oneself to the leading positive eigenvalues and corresponding
eigenvectors for principal coordinates. An alternative and elegant way to solve the
problem has been proposed by Mathar (1985) and consists in finding the nearest
Euclidean distance matrix. Note that an Euclidean distance matrix D = (dij ) in
216 9 Principal Coordinates or Multidimensional Scaling
Fig. 9.8 Composites of APHRODITE precipitation and 850-hPa wind anomalies based on the 300
closest data points to the centres of the ASM monsoon phases of Fig. 9.7. (a) WNP active phase.
(b) Active phase. (c) Break phase. Adapted from Hannachi and Turner (2013b)
D∗ = argmin − D (z)
p , (9.16)
D
(z)
where A p = Pz APTz p. One first decomposes Pz APTz using, e.g. a SVD
procedure as:
− D∗ = U+
k U − b1 − 1b
T T T
(9.18)
When we are provided with a similarity matrix C, refer to Eq. (9.10) in Sect. 3.3,
which is positive semi-definite (PSD) the solution is already given by Eq. (9.10).
However, if C is not PSD, then again one has to find the nearest PSD matrix to
C. In the Minkowski norm, the nearest PSD matrix to C = UUT is given by
C+ = U+ UT where + contains only the positive eigenvalues defined as above.
This means that one keeps only the positive eigenvalues and associated eigenvectors.
Remark Other solutions
exist for other norms. The solution for the Fröbenius
2 12
norm, A F = i,j aij , and 2-norm; A 2 = ρ A A
2 T , where ρ(A) =
max{|λ|, |A − λIn | = 0} is the spectral radius of A, has been given for example in
Halmos (1972) and Higham (1988). Let us first consider the Fröbenius norm, and
we give the solution for a general matrix C not necessarily symmetric. Denote by
CS = 12 C + CT , and CK = 12 C − CT the symmetric and skew-symmetric
parts of C respectively. Then one can always decompose CS as
CS = UH,
10.1 Introduction
EOF analysis or PCA and related methods are examples of exploratory techniques
aiming, among other things, at reducing the dimension of the system by finding
a small number of patterns associated with maximum variance from the high
dimensional data. In these methods no explicit probability model is involved. Factor
analysis is a multivariate method that aims at finding patterns or factors from the
data based on a proper statistical model. Whereas EOFs, for example, are concerned
with explaining the observed variances, factor analysis (FA) attempts to explain the
observed covariances between the variables through a set of variables or factors.
By assuming that each observed variable is a linear combination of these factors,
together with a random error, the relationship between the factors and the original
variables yields a possible explanation of the observed association.
An example of methods that is based on a model was given by POP analysis
in Chap. 6, using the multivariate AR(1) model. This model focuses, however,
more on the temporal correlations rather than the covariability between variables.
Furthermore, the model was not explicitly formulated by specifying, for example,
the probability distribution of the noise. FA was developed in the turn of the century
in psychology and social science. It has been extended to other fields of science
recently. In meteorology, for example, and to the best of our knowledge, the method
was not applied extensively.
FA bears some similarities with PCA regarding, for example, dimension reduc-
tion and determination of prominent modes of variability or covariability. FA was
proposed around the same time as Pearson’s work on PCA (Pearson 1902) by
Spearman (1904a,b). The method was formally and mathematically formulated later
by Hotelling (1935, 1936a,b). PCA/EOF method was always regarded by most
data analysts and even some statisticians as an exploratory technique. The method,
however, can also be regarded as model-based, and this brings it closer to FA. It
turns out, indeed that EOF/PCA method can be regarded as a special case of FA.
10.2.1 Background
To explain the covariance observed in the multivariate time series one assumes
that there are m hidden variables or factors yt = (yt1 , yt2 , . . . , ytm )T whose linear
combination can explain the observed variability in the time series xt . This can be
written as
xt = yt + ε t + μ, (10.1)
m
xti = λij ytj + εti (10.2)
j =1
1 If
this is not the case it is always possible to find q new variables, q < p, such that the new
covariance matrix is full rank.
10.2 The Factor Model 221
so that λij represents the loading of the ith variable on the j th factor. Note that
in model (10.1) or (10.2) only the left hand side is known, whereas the right hand
side is entirely unknown. Model (10.1) is known as the linear factor model and is
described in many text books, e.g. Everitt (1984), Bartholomew (1987), Mardia et
al. (1979), and Lawley and Maxwell (1971).
Remarks
• Although model (10.1) looks like a regression model it is not so because, as we
stated above, the whole rhs is unknown. In fact, we need to estimate the factor
loadings, the common factors, the error terms and even the number of factors m.
In addition, the factors in (10.1) are also random variables unlike the regression
model in which the independent variables are non-random.
• If U = (u1 , . . . , ur ) are the leading r EOFs, (r < p), of the time series xt ,
t = 1, 2, . . ., and ct = (ct1 , . . . ctr ) are the corresponding r PCs, then we can
write
xt = Uct + rt , (10.3)
where rt is the remaining part, i.e. the p-r EOFs/PCs. Now there is a clear
similarity between (10.1) and (10.3). The main difference is that U is orthogonal
and the PCs uncorrelated, and this is not necessarily the case for model (10.1).
To identify model (10.1) one normally requires the following assumptions. Since xt
is assumed to be zero mean it is reasonable to assume E (yt ) = E (ε t ) = 0. One
also assumes that the factors and the noise are independent, implying
cov (yt , ε t ) = E yt ε Tt = O. (10.4)
These two assumptions are basic requirements but they do not permit a complete
model identification. What we need is an assumption on the probability distributions
of the unknown terms. The most “affordable” and common assumption is to
consider a multivariate Gaussian noise with diagonal covariance matrix, i.e.
E ε t ε Tt = = Diag ψ1 , . . . , ψp . (10.5)
In addition, we also assume that the factors are standard multivariate normal:
E yt yTt = Ip . (10.6)
222 10 Factor Analysis
= T + . (10.7)
Exercise
Derive Eq. (10.7).
Equations (10.2) or (10.7) yield
m
var (xti ) − ψi = σii − ψi = λ2ik (10.8)
k=1
which represents the part of the variance explained by the common factors known
as the communality. The diagonal elements of are referred to as the uniqueness.
The factor loadings matrix is therefore the covariance between xt and yt :
Now using Eqs. (10.6), (10.7) and (10.9), the zero-mean multivariate normal vector
T
zt = xTt , yTt has covariance:
T
E zt zTt = . (10.10)
T Ip
If the common factors are autocorrelated with E yt yTt = R, i.e. yt ∼ N (0, R),
then (10.7) becomes
= + T R. (10.11)
dx
= Bx + ε (10.12)
dt
one gets the famous fluctuation-dissipation relation (e.g. Penland 1989; Penland and
Sardeshmukh 1995):
where C(0) = E xt xTt , G(τ ) = eBτ = E xt+τ xTt [C(0)]−1 . The discretised
version of Eq. (10.12), i.e. xt+τ = G(τ )xt + ζ t,τ , is similar to Eq. (10.1) when
10.2 The Factor Model 223
One of the major drawbacks of the factor model is the non-unicity of the solution.
In fact, if H is any m × m orthogonal matrix, i.e. HHT = HT H = I, then the factor
model (10.1) can also be written as
We have therefore constructed from the original model another factor model with
new common factors yt and new loadings . Since an orthogonal matrix is a
rotation matrix, factor models are indeterminate with respect to rotations. One
possible solution to this drawback is presented later in Sect. 10.3.2.
Remark In data mining the factor model is regarded as a map f () from the latent
space, i.e. the space of the y’s, onto the data space (see, e.g. Carreira-Perpiñán 2001).
The map is simply given by
f (y) = y + μ (10.16)
with a normal probability distribution of p(y), i.e. y ∼ N(0, I). The probability
p(y) is known as the prior. The mapping f () describes an ideal measurement
process, mapping each point from the latent space onto exactly one point in a
smaller dimension manifold on the observed space. For the classical factor model
this map is simply linear. Other possibilities will be discussed in the next chapters.
The noise from the data space is given by the conditional distribution of (x|y),
which is N (f (y), ), and the marginal distribution in data space is that of x, i.e.
T
N μ, T + . Now given the normality of the joint distribution z = xT , yT ,
with covariance matrix given by Eq. (10.10), the conditional distribution of (y|x),
known as the posterior in latent space is also normal (see Appendix B) with mean:
−1
E (y|x) = A (x − μ) = T T + (x − μ) (10.17)
and covariance:
−1
cov (y|x) = I + T −1 . (10.18)
224 10 Factor Analysis
The most common way to estimate the parameters , and μ of the factor
model is to use the maximum likelihood method (MLE). Given a sample x1 , . . . , xn
of independent and identically distributed (IID) random variables drawn from a
probability distribution p (x) with a set of unknown parameters θ , the likelihood
of these parameters given the sample is given by
!
n
L (θ ) = P r (x1 , . . . , xn |θ) = p (xk |θ) . (10.19)
k=1
Usually one takes the Logarithm of Eq. (10.19) because the Logarithm is mono-
tonically increasing, and simplifies the computation substantially. The function to
be maximised becomes the log-likelihood L (θ ) = log [L (θ )]. Since the factor
model in Eq. (10.1) is based on normality assumption, the log-likelihood of the data
x1 , . . . , xn given the parameters , and μ is easy to compute and takes the form:
n
L (, , μ) = − p log(2π ) + log || + tr S −1 , (10.20)
2
where = T + and S is the sample covariance matrix of the data, i.e.
S = n−1 k (xk − μ) (xk − μ)T .
Exercise Derive the expression of the log-likelihood L , Eq. (10.20).
(n −p/2 ||− 12 exp −(x − μ)T −1 (x − μ) ,
Hint Expand L = log k=1 (2π ) k k
n −1
−1 n
then use the fact that k=1 (xk − μ) (xk − μ) = tr
T
k=1 (xk − μ)
(xk − μ)T = ntr −1 S .
Hence the parameters , and μ are provided by the minimiser of L, i.e.
min [−L (, )] = min log |T + | + tr S(T + )−1 . (10.21)
, ,
−1
(A + BCD)−1 = A−1 − A−1 B C−1 + DA−1 B DA−1 (10.22)
whenever the involved inverses exist (see e.g. Golub and van Loan 1996).
Exercise
Check that Eq. (10.22) holds.
Hint Proceed by direct substitution, i.e. (A + BCD) A−1 − A−1 B C−1 +
−1 −1
DA−1 B DA−1 = I − B C−1 +DA−1 B DA−1 + BCDA−1 −
−1
BCDA−1 B C−1 + DA−1 B DA−1 . But the last term in this sum can be
−1 −1
written as BC C + DA−1 B − C−1 C−1 + DA−1 B DA−1 which yields
−1
BCDA−1 − B C−1 + DA−1 B DA−1 .
Hence the inverse of becomes
−1 −1
T + = −1 − −1 Ip + T −1 T −1 (10.23)
The MLE of the factor model parameters leads to the following system of equations:
−1 ( − S) −1 = O
(10.25)
diag −1 ( − S) −1 = O.
Equations (10.24) or (10.25) represent a system of nonlinear equations that can only
be solved numerically using a suitable descent algorithm. We know also that the
solution is invariant to any orthogonal rotation. Hence the log-likelihood has an
infinite set of equivalent maxima, but one cannot tell whether these maxima are
global or local. Because of this nonuniqueness an additional constraint was required
to obtain a unique solution. The most commonly used constraint is
diag T −1 = T −1 , (10.26)
226 10 Factor Analysis
that is the matrix T −1 is diagonal. Note that even with this additional
constraint the equations remain difficult to solve analytically. The solution can
be made easier by maximising the log-likelihood in two stages, i.e. by fixing, for
example, then maximising with respect to . This approach was successfully
developed by Jöreskog (1967) who used a second-order Fletcher–Powell method
for the maximisation. Precisely, we have the following theorem:
ˆ the MLE of satisfying the extra constraint (10.26)
Theorem For fixed = ,
is given by
1 1
ˆ =
ˆ 2 U − Ip 2 , (10.27)
1 1
ˆ − 2 S
where is a diagonal matrix containing the eigenvalues of ˆ − 2 and the
columns of U are the associated eigenvectors.
Proof Outline Since is the only unknown we can transform the problem in which
the matrix = T + becomes the new unknown. The MLE estimate of
(see Appendix D) is ˆ = S. Now let us choose one solution among the many
solutions (obtained simply through orthogonal rotations). The MLE estimate ˆ is
T
simply obtained from S = ˆˆ + .ˆ We now have the following identity, namely:
1 1 1 1
ˆ − 2 S
ˆ − 2 − Ip =
ˆ −2
ˆ ˆ −2 .
ˆT
1 1
If UUT represents the SVD decomposition of ˆ − 2 S
ˆ − 2 , then we obtain
1
ˆ −2
ˆˆ T − 12 = U − Ip UT , which yields Eq. (10.27). Note that to guarantee
the property of semi-definite positivity the matrix − Ip has to be diagonal with
elements given by max (φi − 1, 0), where the φk s are the eigenvalues of . Of
course, any R ˆ is also a solution for any orthogonal R, which can be chosen so
that Eq. (10.26) is satisfied.
An iterative procedure based on (10.27) can be used to get the estimates as
follows. From an estimate ˆ k using (10.27). From
ˆ k of at the kth step obtain
this estimate maximise the log-likelihood numerically, subject to (10.26), to get a
new estimate ˆ k+1 , and so on until convergence is achieved.
In general the algorithm is simple to implement, and also has the advantage
of increasing the likelihood monotonically, and converges in general to a local
maximum2 and does not require the second derivatives to be calculated (Rubin and
Thayer 1982). The EM algorithm can be used therefore to explore the local maxima
in the state space of the parameters. The algorithm can be slow, however, particularly
after the first few iterations, which are usually quite effective.
The EM algorithm consists of two steps, namely, the expectation (E) step and
the maximisation (M) step. The E-step consists of computing the expectation of the
complete-data log-likelihood
Q y|x, θ (k) = E L y|x, θ (k)
with respect to the current posterior distribution p y|x, θ (k) , where θ (k) refers to
the current parameter estimates. The M-step consists of determining new parameters
θ (k+1) that maximise the expected complete-data log-likelihood:
θ (k+1) = argmin −Q y|x, θ (k) . (10.28)
θ
For the factor model setting the E-step results in computing the first two moments
E (y|xk ) and E yyT |xk for each data point xk given the parameters and
as given in Eqs. (10.17) and (10.18), which can also be written using the efficient
−1
inversion formula of T + .
Remark The second moment E yyT |x provides a measure of uncertainty in the
factors.
The expected log-likelihood of the factor model with respect to the posterior distri-
bution is obtained using
the sample-space
noise model, i.e. (x|y) ∼N (y + μ, ),
( p 1
and takes the form: Q y|x, θ (k)
= E log i (2π )− 2 ||− 2 exp − 12 (xi − y)T
−1 (xi − y) , which, after some little algebra, leads to
n n
Q y|x, θ (k) = − log (2π )p || − tr −1 S − xTi −1 E (y|xi )
2 2
i
1
+ tr T −1 E yyT |xi . (10.29)
2
The M-step is then to maximise Eq. (10.29) with to andT . The derivative
respect
−1 x E
of Eq. (10.29) with respect to , i.e. ∂Q
∂ is − i i (y|xi ) + E yy |xi .
T
T
E yyT |xi = xi E (y|xi )T . (10.30)
i i
n 1 1
− xi xTi − E (y|xi ) xTi + E yyT |xi T = 0. (10.31)
2 2 2
i
n 1
= xi xTi − E (y|xi ) xTi . (10.32)
2 2
i
n n T −1
(k+1) = i=1 xi E (y|xi )
T
i=1 E yyT |xi ,
n (10.33)
(k+1) = 1
n Diag i=1 xi xi
T − (k+1)
(E (y|xi )) xTi .
Equation (10.33) can be used iteratively to update the parameters of the factor
model. The conditional expectations in Eq. (10.33) are given by
−1
E (y|xi ) = A(k) (xi − μ) = (k)T (k) (k)T + (k) (xi − μ)
E yyT |xi = Ip − A(k) (k) + A(k) (xi − μ) (xi − μ)T A(k)T .
n
Remember that the MLE of μ is the sample mean x = n−1 i=1 xi .
Model Assessment
Once the model parameters are estimated the model fit is normally assessed using
the deviance between the fitted model and the full model in which has no
particular structure. The likelihood ratio statistic λ is given by
−1
− 2 log λ = n tr ˆ −1 S| − p .
ˆ SD−1 − log | (10.34)
It can be shown, see, e.g. Mardia et al. (1979), that the asymptotic distribution of
this statistic, under the null hypothesis Hm that there are m factors, is
−2 log λ ∼ χ 21 .
2 [(p−m)2 −p−m]
10.4 Factor Rotation 229
The estimate given by Eq. (10.36) is known as Bartlett’s factor score. The estimate
−1 T −1
provided by ŷ = Ip + T −1 x is known as Thompson’s factor
score. Refer, e.g., to Mardia et al. (1979) for more discussion on other estimates.
Exercise
Derive (10.35).
Hint Using (10.23) we get
−1
T T + = I − T −1 (I + T −1 )−1 T −1
= I − (T −1 + I − I)(I + T −1 )−1 T −1 .
A useful tool often applied in EOF and factor analysis to facilitate interpretation is
rotation. We know, for example, from Sect. 10.2.3 that if Y is the matrix of factor
scores and H is an orthogonal matrix (compatible with Y), then HT Y is also a
solution to the factor model and similarly for H where is the matrix of factor
loadings. In other words the rotated factors are also solution. Rotation of EOFs
has already been presented in Chap. 4 using a few examples of rotation, and we
give here a number of other rotation criteria. The reason for doing this here and
not in Chap. 4 is because rotation has historically been developed in connection
with factor analysis. Rotation of factor loadings goes back to the early 1940s with
230 10 Factor Analysis
Thurstone (1940, 1947), and also Carroll (1953) where the objective was to get
simple structures. This is similar to EOF rotation in atmospheric science, which
was introduced to alleviate some of the constraints imposed upon EOFs, and also to
ease physical interpretation.
There are two main rotation criteria: orthogonal and oblique. The general
orthogonal rotation problem is to find an orthogonal matrix H solution to the
following optimisation problem:
where is the factor loading matrix. The function f () is what defines the rotation,
and H is the new (rotated) loading matrix. Note that in a number of cases the
matrix H need not be square, but is known as semi-orthogonal. In oblique rotation
the problem is to find the matrix H satisfying
The condition imposed in (10.38) simply means that the columns of H are unit-
length. Note that what distinguishes (10.37) from (10.38) is the formulation, but
the rotation criteria given by f () can be the same. In general, however, criteria
appropriate for oblique rotation are also appropriate for orthogonal rotation but the
converse is not true. The most well known and popular example in atmospheric
science is the varimax criterion (Kaiser 1958) applied in meteorology by Richman
(1986), see Chap. 4. Below we give other examples of rotation criteria along with
their gradient since a wide range of numerical algorithms for minimisation require
the gradient of the function, see also Appendix E.
The quartimax rotation criterion (Carroll 1953, Ferguson 1954) used in orthogonal
rotation is based on the Fröbenius norm. The Fröbenius product of two m × m
matrices X and Y is defined by [X, Y]F = tr XT Y . The Fröbenius norm is then
-
defined by X F = tr XT X . Let now the element-wise product between X =
xij and Y = yij be given by X Y = xij yij , then the quartimax criterion is
defined by minimising the function:
1 4
m
1
f (X) = − XX F =− xij . (10.39)
4 4
i,j =1
10.4 Factor Rotation 231
The gradient of f (X) can be obtained by writing df (X) = − ij xij3 dxij which
can also be written as − ij [X X X]Tji dxij , hence we get
df (X)
= −X X X. (10.40)
dX
• Quartimin
Quartimin is another rotation criterion often used in oblique rotation, see e.g. Carroll
(1953) and Harman (1976, p. 305). The function to be minimised is given by
1 2 2 1
f (X) = xik xil = [X X, (X X) J]F , (10.41)
4 4
i k=l
df (X)
= X [(X dX) J] . (10.42)
dX
• Oblimin
1
f (X) = [X X, (Im − γ K) (X X) J]F . (10.43)
4
The gradient of the oblimin is given by
df (X)
= X [(Im − γ K) (X dX) J] , (10.44)
dX
• Oblimax
df (X) (X X X) X
= −4 +4 . (10.46)
dX XX F 2 X 2F
• Entropy
1
f (X) = − X X, log (X X) F (10.47)
2
whose gradient is given by
df (X)
= −X log (X X) − X. (10.48)
dX
Other rotation criteria can be found in Harman (1976) and Jennrich (2001, 2002).
Factor analysis can also be formulated, in a similar way to PC analysis, using matrix
decomposition. This was initially proposed by Henk A. L. Kiers as described in
Sočan (2003, pp. 19–20), and mentioned in Adachi and Trendafilov (2019). It was
applied e.g. by Unkel et al. (2010), see Adachi and Trendafilov (2019) for further
details and references. The model given by Eq. (10.11), with m factors, can be
written using the n × p data matrix X as
X = YT + U + EF A , (10.49)
containing the unique variances and EF A is the unsystematic error matrix. We also
have the following properties: UT U = Ip×p , YT Y = Im and ET Y = Op×m .
Unkel et al. (2010) solved for Y, and by minimising the costfunction
X − YT − U 2
F = X − BA 2
F = X + tr(BT BAT A) − 2tr(BT XA),
2
F
(10.50)
where B = [Y : U] and A = [ : ] are block matrices, . F is the Fröbenius
matrix norm and tr(.) is the trace operator. Unkel et al. (2010) then minimised
Eq. (10.50) iteratively. Note that here the first four matrices appearing in the rhs
of Eq. (10.49) are treated as fixed unknown matrices whereas in the standard
formulation of the factor model, e.g. Eq. (10.1), the elements of Y and U are treated
as random quantities.
Unkel et al. (2010) applied exploratory factor analysis (EFA) to gridded SLP
anomalies from NCEP/NCAR reanalysis. The data consist of monthly winter (Dec–
Jan–Feb) anomalies of SLP field for the period Dec 1948 to Feb 2006, i.e. a sample
size of n = 174, with a 2.5◦ × 2.5◦ lat×lon grid, north of 20◦ N, yielding p = 4176
grid points or variables.
Figure 10.1 shows an example of the costfunction (10.50) versus the iteration
number for m = 5 factors. Practically, the costfunction reaches its minimum after
10 to 20 iterations. Figure 10.2 shows the maps of four factor loading patterns. An
arctic-like oscillation and the North Pacific oscillation patterns can be recognised.
For comparison, Fig. 10.3 shows the leading two EOFs. These two EOFs have quite
similar structures to the two factor loadings (i) and (ii) shown in Fig. 10.2.
Fig. 10.1 Costfunction (10.58) versus iteration number. Adapted from Unkel et al. (2010)
234 10 Factor Analysis
Fig. 10.2 Spatial patterns of 4 factor loadings of winter NCEP/NCAR SLP anomalies. Adapted
from Unkel et al. (2010)
Unkel et al. (2010) applied a special rotation, as in Hannachi et al. (2009), that yield
approximate statistical independent factor scores. This subject is discussed in more
detail in Sect. 12.6 of Chap. 12. The rotated factor scores are obtained from the
rotation:
R = FT, (10.51)
Fig. 10.3 Leading EOFs of monthly winter NCEP/NCAR SLP anomalies. Adapted from Hannachi et al. (2007)
The difference between EOF (or PC) and FA analyses is one of the confusing issues in data analysis. The two methods are quite similar in many ways. Both are data reduction procedures, which allow the data variance to be captured in a smaller set of variables. The application of both methods yields in general quite similar results. Yet there is a basic difference between the two methods. This difference is discussed below based on the conventional FA model, e.g. Eq. (10.1), and the EFA model given by Eq. (10.49).

Fig. 10.4 Four rotated patterns of the anomalous SLP factor loadings. Adapted from Unkel et al. (2010)
In Chap. 3 we have discussed the fact that EOF analysis constitutes an exploratory
tool that produces patterns and corresponding time series successively explaining
maximum variance in the data. EOFs are also computationally easy to obtain since it
is a simple eigenvalue problem. The standard factor model, on the other hand, has a
number of parameters that have to be estimated as discussed in the previous sections.
The estimation procedure is not trivial since it is based on maximum likelihood
where a descent numerical algorithm is required for the optimisation.
The main point here is that EOF/PC analysis, unlike factor analysis, is not model-
based, see e.g. Mardia et al. (1979, p. 275), Chatfield and Collins (1980, p. 87),
Krzanowski (2000, p. 502) and Hinton et al. (1997) to name just a few. Early
literature on factor analysis has noticed, however, that PCA can also be seen as
a particular case of FA. This seems to have gone unnoticed until very recently,
see, e.g., Roweis (1998), Tipping and Bishop (1999), Carreira-Perpiñán (2001) and
Jolliffe (2002). In fact, EOFs can be obtained as the MLE of a factor model with isotropic uniqueness matrix, i.e.

Ψ = σ^2 I.   (10.52)
Consider the likelihood equation ∂L/∂Ψ = 0 provided by the first equation of system (10.25), namely,

S Ψ^{−1} Λ = Λ (I_m + Λ^T Ψ^{−1} Λ).   (10.53)

With the isotropic uniqueness (10.52), and writing Λ = AΔ with A having orthonormal columns and Δ diagonal, this leads to

S A = A (σ^2 I_m + Δ^2).   (10.54)
Exercise
Derive Eq. (10.54) and point out the expression of Λ in this equation in relation to the invariance principle mentioned above.

Hint Multiply both sides of Eq. (10.53) by Ψ = σ^2 I; this yields SΛ = Λ(σ^2 I_m + Λ^T Λ). By the invariance principle one can choose Λ = AΔ with Δ^2 = Λ^T Λ diagonal; substituting and multiplying both sides on the right by Δ^{−1} yields the answer, where A is the new loading matrix.
Equation (10.54) is precisely an eigenvalue problem, and the factor loadings, i.e. the column vectors of Λ, are the eigenvectors of the sample covariance matrix S, i.e. the EOFs, whereas the factor scores, see Eq. (10.35), are the corresponding PCs. In fact, if S = UDU^T, where U = [u_1, . . . , u_p] are the eigenvectors of the sample covariance matrix, and D = diag(α_1, . . . , α_p) is the diagonal matrix of the associated eigenvalues arranged in decreasing order, then from Eq. (10.54) we can choose the factor loadings

Λ = U_m (D_m − σ^2 I_m)^{1/2},   (10.55)
or similarly:

S Σ^{−1} = U [ I_m  O_{m,p−m} ; O_{p−m,m}  σ^{−2} D_{p−m} ] U^T,   (10.57)

where Σ = ΛΛ^T + σ^2 I_p is the model covariance matrix.
The maximum likelihood estimate of the isotropic variance σ^2 is then

σ̂^2 = (1/(p − m)) Σ_{k=m+1}^{p} α_k.   (10.59)
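The link between EOFs and the isotropic factor model can be illustrated with a short sketch (Python): the loadings of Eq. (10.55) and the noise variance of Eq. (10.59) are computed from the eigendecomposition of a sample covariance matrix of hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 500, 6, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # toy correlated data
S = np.cov(X, rowvar=False, bias=True)                    # sample covariance matrix

alpha, U = np.linalg.eigh(S)                  # eigenvalues in ascending order
alpha, U = alpha[::-1], U[:, ::-1]            # sort in decreasing order

sigma2 = alpha[m:].mean()                                  # Eq. (10.59)
Lam = U[:, :m] @ np.diag(np.sqrt(alpha[:m] - sigma2))      # Eq. (10.55)

# covariance implied by the isotropic factor model
Sigma = Lam @ Lam.T + sigma2 * np.eye(p)
print(np.round(alpha, 2))
print(np.round(np.linalg.eigvalsh(Sigma)[::-1], 2))
# the m leading eigenvalues are reproduced; the trailing ones collapse to sigma2
```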
In practice and as it was pointed out in Chap. 3, EOFs are efficiently obtained using
SVD and no MLE is required. This makes the problem of EOFs easier to solve than
the FA model. Among other things, the maximum likelihood estimates of the factor model overcome the scaling problem encountered in EOFs. In EOF/PC analysis
the number of retained components can be fixed by choosing e.g. the percentage of
explained variance. The choice of this percentage, although arbitrary, does not alter
the PCs because they are unique. In factor model, on the other hand, the number of
factors m is difficult to estimate. One way to estimate m is to use the EM algorithm
with a cross-validation procedure rather like fitting a mixture model (Hannachi
and O’Neill 2001). This is achieved by sequentially increasing m and taking the
best model with the largest likelihood. Also, and unlike EOF/PC analysis, when m
changes the form of the factors may also change! (Chatfield and Collins 1980).
Adachi and Trendafilov (2019) used the matrix formulation of exploratory factor
analysis, as given by Eq. (10.49), and provided a number of inequalities, which are
used to contrast the parameters’ estimates in PCA and FA. Given a n × p data
matrix X = [x1 , . . . , xp ], a PCA formulation is given for example in Eq. (3.28),
i.e. X = ZAT + E, with the n × m and p × m matrices Z and A representing
respectively the PC scores and loadings. This formulation of PCA can be used as a
basis for a comparison with the exploratory factor analysis model. This is illustrated
in Fig. 10.5, with p = 3 and m = 2, as described also in Adachi and Trendafilov
(2019).
For the PC problem (Fig. 10.5a) the variables x1, x2 and x3 are commonly explained by the PC scores z1 and z2 with, in general, unequal weights given by the loadings a_ij, plus an error term E. In a similar way, Eq. (10.49) is illustrated in Fig. 10.5b and shows the common part, namely YΛ^T, just like ZA^T in Eq. (3.28).

Fig. 10.5 Illustration of a PCA model as given by Eq. (3.28) (a) and the exploratory factor model given by Eq. (10.49) (b) with p = 3 and m = 2, along with the relative size of the different components for the PCA and the exploratory FA models
Unlike PCA, however, there is the unique part, namely UΨ, which is included in the exploratory FA model Eq. (10.49), where each unique factor u_k is weighted by ψ_k and affects only the variable x_k. Adachi and Trendafilov (2019) provide a kind
of quantification of the different contributions arising in Eqs. (3.28) and (10.49) to
the total variance of the data matrix as illustrated in Fig. 10.5c. For a given number
m of principal components and factors, the authors conclude that: (1) the common
part of PCA is larger than that of FA, and (2) the residual part for FA is smaller than
that for PCA. In addition, they suggested that the unique part for FA is often larger
than the residual part for PCA.
Chapter 11
Projection Pursuit
11.1 Introduction
is maximised. In a similar manner one can find other directions that optimise various
other criteria. For example, in optimally persistent patterns (OPP), see Chap. 8, the
function I (a) is the decorrelation time of the projected time series Xa. Similarly,
for the optimally interpolated pattern (OIP) the function I (a) represents the mean
square interpolation error. The procedure of finding these directions constitutes what is known as projection pursuit.
1 Who coined it projection pursuit.
2 In projection pursuit the term “projection” refers to the fact that the data X is first projected onto the direction a, i.e. a^T X, whereas “pursuit” refers to the optimisation used to find the correct direction.
H(Z) = −Σ_{k=1}^{m} p_k log p_k   (11.2)
and is also known as Shannon entropy (Shannon 1948). The entropy, Eq. (11.2), is
a measure of the uncertainty of the collection of events of Z. It also represents
the average amount of surprise upon learning the value of Z (Hamming 1980;
Ross 1998). In information theory, entropy can also be interpreted as the average
information content, or similarly a measure of the disorder in the system. So the concepts of entropy, uncertainty and information are all equivalent. In fact, as pointed
out, e.g., by McEliece (1977), given a random variable X entropy can be thought of
as a measure of
• The amount of information gained by observing X.
• Our uncertainty about X.
• The randomness of X.
It is easy to show that the entropy given in Eq. (11.2) is maximised for a (discrete) uniform distribution, i.e. one in which p_k = 1/m, k = 1, . . . , m. This is to say that
a discrete uniform distribution contains the least information content, or similarly
that this distribution has the largest uncertainty, and therefore has no interesting
structure. Hence it would seem that a reasonable way to define interestingness is
as departure from discrete uniformity. This is different from the continuous case,
presented next.
The entropy (11.3) is also known as Boltzmann’s H-function, see e.g. Hamming
(1980) and Cover and Thomas (1991). We have the following theorem:
Theorem Among all continuous random variables with zero mean and unit vari-
ance the standard normal distribution has the smallest entropy.
A proof outline is given as an exercise, see below.
Exercise Compute H for the normal random variable N(0, σ^2).
Answer H = −(1/2) log(2π e σ^2).
Exercise Using arguments from the calculus of variations show that among all
continuous distributions with zero mean and unit variance the normal distribution
has the smallest negentropy, Eq. (11.3).
The theorem can also be extended to the multivariate normal N(μ, Σ) with probability density function given by

f(x) = (2π)^{−m/2} |Σ|^{−1/2} exp[ −(1/2) (x − μ)^T Σ^{−1} (x − μ) ],   (11.4)

for which

H(x) = −(1/2) log[ (2π e)^m |Σ| ].   (11.5)
Exercise Prove (11.5).
Hint Make a variable change in (11.3) then use the result of the previous exercise.
It is therefore reasonable that in the continuous case one way of defining interest-
ingness is to consider departure from normality.
Another related measure of entropy is the Fisher information, defined as the variance of the efficient score in maximum likelihood estimation. For a probability density function f(x), the Fisher information is given by I(f) = E[ (d/dx log f(x))^2 ], i.e.

I(f) = ∫ (1/f(x)) (df(x)/dx)^2 dx = ∫ (d/dx log f(x))^2 f(x) dx.   (11.6)
Fisher information, also like differential entropy, is minimised with the normal
probability density function.
• Quality 2:
The reason for the last quality is that from the central limit theorem, we know that
x1 + x2 is more normal, and hence less interesting, than the more normal of x1 and
x2 . Consequently, if x1 , . . . , xm are IID random vectors having the same distribution
as the random vector x, then (11.9) yields
Huber (1985) points out that an index I () satisfying (11.8) essentially amounts to a
normality test.
It is also desirable to have a continuously differentiable index to increase
computational efficiency by using powerful hill climbing algorithms (Friedman
and Tukey 1974) or gradient and projected gradient methods (Appendix E). In the
following, and unless otherwise stated, we will concentrate on one-dimensional
projection pursuit, but the method is straightforward to extend to two or three
dimensions.
Historically, the first PP index is that of Friedman and Tukey (1974) for one and two dimensions. In one dimension, their index takes the form of the product I(x) = s(x) d(x), where s(x) is a measure of spread of the projected data, and d(x) describes the local density of the projected scatter. In their application s(x) was taken to be the trimmed standard deviation of the scatter, whereas for d(x) they took an average
nearness function of the form Σ_{i,j} f(r_{ij}) 1_{x≥0}(R − r_{ij}), where r_{ij} is the interpoint distance between the ith and jth projected data points, f() is a monotonically decreasing function satisfying f(R) = 0, and 1_{x≥0}(x) = 1 for x ≥ 0, and zero otherwise (the indicator function). Friedman and Tukey's index has stimulated thoughts about indexes, and various indexes have been proposed since then.
Jones (1983) and Jones and Sibson (1987) proposed using the projection index

I(x) = ∫ f^2(x) dx,   (11.12)

where f() is the probability density function of the projected data onto the direction x. Huber (1985) and also Jones and Sibson (1987) pointed out, in particular, that the function d(x) in (11.12) is an estimate of (11.11). Now, among all the probability distributions with zero mean and unit variance, the parabolic density function defined over [−√5, √5], i.e. f(x) = (3/(4√5)) (1 − x^2/5) 1_{[−√5,√5]}(x), known also as the Epanechnikov kernel (Silverman 1986), minimises the index (11.11), see e.g. Hodges and Lehmann (1956). Therefore maximising (11.11) can only lead to departure from the parabolic distribution rather than finding, for example, clusters, and this is the main criticism addressed by Jones and Sibson (1987). They also note that the Friedman and Tukey two-dimensional index is not invariant to rotation. That is, the projection index depends on the orientation of the plane of projection, which is a serious setback. Jones (1983), however, found little difference in practice between the entropy index Eq. (11.3) and index Eq. (11.11).
Entropy/Information Index
4 Following Rényi's (1961; 1970, p. 592) introduction of the concept of order-α entropy, the differential entropy, e.g. −∫ f log f, is of order 1, whereas the index ∫ f^2 is of order 2.
I(x) = σ^2 ∫ (1/f(x)) (df(x)/dx)^2 dx − 1,   (11.14)
where σ is the standard deviation of the projected data. Jee (1985) compares various
indices applied to synthetic multimodal data as well as the seven-dimensional
particle physics data of Friedman and Tukey (1974). He finds that the Fisher
information index achieves a better result with significantly higher information
content.
Moments-Based Indexes
The previous projection indexes are based on the probability density function of the projected data. A straightforward way to compute such an index is therefore to estimate the pdf, which will be presented later. Another way, suggested by Jones (1983) and Jones and Sibson (1987), is to expand the pdf in terms of its cumulants (see Appendix B) using Hermite polynomials.
Hermite polynomials H_k(.), k = 0, 1, . . ., are defined from the successive derivatives of the Gaussian PDF, i.e. H_k(x) = (−1)^k e^{x^2/2} d^k/dx^k e^{−x^2/2}. For example, H_0(x) = 1, H_1(x) = x and H_2(x) = x^2 − 1. Figure 11.1 shows the first five Hermite polynomials.

Fig. 11.1 Graphs of the first five Hermite polynomials

These polynomials are orthogonal with respect to the Gaussian PDF φ(x) as weight function, namely ∫_R φ(x) H_k(x) H_l(x) dx = k! δ_{kl}. This allows any PDF f(.) to be expanded in terms of those polynomials, i.e. f(x) = Σ_{k≥0} a_k φ(x) H_k(x). This expansion is known as the Gram–Charlier, or Edgeworth, polynomial expansion of f(.) (Kendall and Stuart 1977, p. 169).
This expansion leads to the expression f(x) = φ(x)[1 + ε(x)], where φ(x) is the normal PDF and ε(x) is a combination of Hermite polynomials. The entropy index is then approximated by (1/2) ∫ φ(x) ε^2(x) dx. This expression can be simplified further using the polynomial expansion of f(.):

f(x) = φ(x) [ 1 + (1/6) κ_3 H_3(x) + (1/24) κ_4 H_4(x) + . . . ],   (11.15)

where H_k(x) is the kth Hermite polynomial and κ_α is the cumulant of order α of f(x). Using the properties of these polynomials, the projection entropy index yields

I(x) = ∫ f(x) log f(x) dx ≈ (1/12) ( κ_3^2 + (1/4) κ_4^2 ).   (11.16)
5 If the projected data are centred and scaled to have zero mean and unit variance, then κ_31 = μ_31, κ_13 = μ_13, κ_22 = μ_22 − 1, κ_40 = μ_40 − 3, κ_04 = μ_04 − 3, κ_12 = μ_12, κ_21 = μ_21, κ_03 = μ_03, and κ_30 = μ_30, where the μ's refer to the centred moments.
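A minimal sketch (Python, synthetic data) of the moment-based approximation (11.16), computed from the sample skewness and excess kurtosis of a standardised projection:

```python
import numpy as np

def moment_pp_index(z):
    """Cumulant approximation (11.16): (1/12) * (k3^2 + k4^2 / 4)."""
    z = (z - z.mean()) / z.std()
    k3 = np.mean(z**3)                # third cumulant (skewness)
    k4 = np.mean(z**4) - 3.0          # fourth cumulant (excess kurtosis)
    return (k3**2 + 0.25 * k4**2) / 12.0

rng = np.random.default_rng(3)
print(moment_pp_index(rng.normal(size=20000)))       # near zero for Gaussian data
print(moment_pp_index(rng.exponential(size=20000)))  # clearly larger for skewed data
```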
Let X be a random variable with zero mean and unit variance. The projected data can be considered as realisations of this random variable. Friedman (1987) considers the transformation of this random variable using

U = 2Φ(X) − 1,   (11.18)

where Φ(x) is the cumulative distribution function of the standard normal distribution, i.e. Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt. The transformation (11.18) maps the projected data onto the interval [−1, 1]. The probability density function of U is given by

p(u) = f(Φ^{−1}((u+1)/2)) / [ 2 φ(Φ^{−1}((u+1)/2)) ],   (11.19)

where f() and φ() are respectively the pdfs of X and the standard normal distribution. Furthermore, if X is normal, then U is uniformly distributed over [−1, 1]. Friedman's index measures the departure of the transformed variable from the uniform distribution over [−1, 1], in the L2 sense, and is given by

I(X) = ∫_{−1}^{1} ( p(u) − 1/2 )^2 du = ∫_{−1}^{1} p^2(u) du − 1/2,   (11.20)
where p(u) is the density function of U. For efficient computation Friedman (1987) expanded p(u) into Legendre polynomials⁶ to yield

I(X) = (1/2) Σ_{k=1}^{∞} (2k + 1) [E(P_k(U))]^2,   (11.21)

where E[] stands for the expectation operator. For a sample of projected data x = (x_1, . . . , x_n), the previous expression yields, when truncated to the first K terms, the expression:

I(x) = (1/2) Σ_{k=1}^{K} (2k + 1) [ (1/n) Σ_{t=1}^{n} P_k(2Φ(x_t) − 1) ]^2.   (11.22)
Friedman (1987) also extended his index to two-dimensional projection, but the bi-
dimensional index is not rotationally invariant (see Morton 1989).
6 These are orthogonal polynomials over [−1, 1], satisfying P_0(y) = 1, P_1(y) = y, and for k ≥ 2, k P_k(y) − (2k − 1) y P_{k−1}(y) + (k − 1) P_{k−2}(y) = 0. They are orthogonal with respect to the uniform weight over [−1, 1].
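The sample index (11.22) is easy to evaluate with standard library routines. The sketch below (Python; the truncation K and the data are illustrative) uses the Legendre polynomials from numpy and the normal cdf from scipy:

```python
import numpy as np
from numpy.polynomial import legendre
from scipy.stats import norm

def friedman_index(x, K=8):
    """Sample Friedman (1987) index, Eq. (11.22), for a standardised projection x."""
    x = (x - x.mean()) / x.std()
    u = 2.0 * norm.cdf(x) - 1.0                  # transformation (11.18)
    total = 0.0
    for k in range(1, K + 1):
        ck = np.zeros(k + 1); ck[k] = 1.0        # coefficients selecting P_k
        Ek = legendre.legval(u, ck).mean()       # sample estimate of E[P_k(U)]
        total += (2 * k + 1) * Ek**2
    return 0.5 * total

rng = np.random.default_rng(4)
print(friedman_index(rng.normal(size=5000)))      # small: close to normality
print(friedman_index(rng.uniform(-1, 1, 5000)))   # larger: departure from normality
```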
Equation (11.20) can be simplified using the expression of the density function p(u) given by (11.19) to yield

I(X) = (1/2) ∫_{−∞}^{∞} [f(x) − φ(x)]^2 / φ(x) dx = (1/2) ∫_{−∞}^{∞} f^2(x)/φ(x) dx − 1/2.   (11.23)
In order for Eq. (11.23) to converge, the density function f (x) has to decrease at
least as fast as the normal density. Hall (1989) notes, as a consequence, that this
index is not very useful for heavy tailed distributions. He suggests the following
index as a remedy:
I(X) = ∫_{−∞}^{∞} [f(x) − φ(x)]^2 dx.   (11.24)
Another suggestion came from Cook et al. (1993), who propose the following index instead:

I(X) = ∫_{−∞}^{∞} [f(x) − φ(x)]^2 φ(x) dx,   (11.26)

together with a sample estimate (Eq. (11.27)) expressed in terms of the standard normal density evaluated at the projected data, involving terms of the form −(1/n^2) Σ_{i,j=1}^{n} φ(x_i)φ(x_j) + 1/(4π).
Chi-Square Index
where n_k = ∫∫_{B_k} φ_2(x, y) dx dy, φ_2 being the bivariate standard normal density, 1_{B_k} is the indicator function of B_k, i.e. it equals one inside B_k and zero elsewhere, and x_i^{(j)} = x_i cos(πj/36) − y_i sin(πj/36) and y_i^{(j)} = x_i sin(πj/36) + y_i cos(πj/36). The numbers x_i^{(j)} and y_i^{(j)} result from averaging the projections over [0, π/4] using nine equal angles πk/36, k = 0, 1, . . . , 8. In practice, it has been argued that the χ^2 index is very efficient and fast to compute. Other measures can also be used instead of the χ^2. One can use for example the Kolmogorov–Smirnov distance KS = max |F(x) − Φ(x)|, which is an L∞-norm, or else use the L1-norm ∫ |F(x) − Φ(x)| dx.
Clustering Index
Although the previous projection indexes attempt to maximise departure from nor-
mality, they can also be used to find clusters. There are, however, various studies that
investigated indexes specifically designed for uncovering low-dimensional cluster
structures (Eslava and Marriott 1994; Kwon 1999). Looking at the connection with discriminant analysis, consider the decomposition of the total scatter matrix T into between-group (B) and within-group (W) parts:

T = B + W.   (11.31)
Let X = (x_{ij}) be our n × p data matrix and suppose that there are K groups with n_k units in group k, k = 1, . . . , K, with Σ_{k=1}^{K} n_k = n, and that x_{ik} is the p-dimensional vector of unit i in group k, with x̄_k = (1/n_k) Σ_{i=1}^{n_k} x_{ik}. Then the within- and between-group covariance matrices are respectively W = (1/(n − K)) Σ_{k=1}^{K} Σ_{i=1}^{n_k} (x_{ik} − x̄_k)(x_{ik} − x̄_k)^T, and B = (1/(K − 1)) Σ_{k=1}^{K} n_k (x̄_k − x̄)(x̄_k − x̄)^T, where x̄ is the total mean. The eigenvectors of W^{−1}B, ordered by the size of their eigenvalues, give the complete set of canonical variables.⁷ Now, given that in Eq. (11.31) T is constant, Bock's procedure essentially maximises I_0(A) = tr(ABA^T), subject to A^T A = I_p, that is max Σ_k a_k^T B a_k, subject to a_k^T a_l = δ_{kl}, where A = [a_1, . . . , a_p] is the matrix of projection vectors. A related clustering index is the sum of ratios

I_1(A) = I_1(a_1, . . . , a_p) = Σ_{k=1}^{p} (a_k^T B a_k)/(a_k^T W a_k),   (11.32)
subject to AA^T = I_p. Note that using B or T in Eq. (11.32) yields the same result. Because Eq. (11.32) is a sum of some sort of signal-to-noise ratios, Bolton and Krzanowski (2003) point out that when the groups are not well defined, the separation between the groups will occur only on the first projection, and no separation thereafter, i.e. on subsequent projections. They propose as a remedy the following PP clustering index:

I_2(a_1, . . . , a_p) = [ Σ_{k=1}^{p} a_k^T T a_k ] / [ Σ_{k=1}^{p} a_k^T W a_k ]   s.t.   a_k^T a_l = δ_{kl}.   (11.33)
7 For example, Fisher’s linear discrimination function when there are only two groups.
Note that only a1 can be identified as the first canonical variate whereas subsequent
vectors are found numerically. In this approach the number of clusters is supposed to
be known, and the optimal clustering can be found using, e.g. k-means, or any other
algorithm, see e.g., Gordon (1981) or Everitt (1993). Note also that when the data
are sphered, then I1 (A) becomes identical to I2 (A). In their application, Bolton
and Krzanowski (2003) argue that their refined index I2 gives better results than the
previous index. When the number of clusters is unknown they use the rule of thumb
of Hartigan8 (1975).
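For a given set of orthonormal directions and known group labels, the index (11.33) can be evaluated directly, as in the following sketch (Python; the data, group structure and directions are illustrative, and no optimisation over A is performed):

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-group (W), between-group (B) and total (T) covariance matrices."""
    n, p = X.shape
    groups = np.unique(labels)
    K = len(groups)
    xbar = X.mean(axis=0)
    W = np.zeros((p, p)); B = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        mg = Xg.mean(axis=0)
        W += (Xg - mg).T @ (Xg - mg)
        B += len(Xg) * np.outer(mg - xbar, mg - xbar)
    T = (X - xbar).T @ (X - xbar) / (n - 1)
    return W / (n - K), B / (K - 1), T

def I2(A, T, W):
    """Clustering index (11.33): ratio of summed projected total to within variance."""
    num = sum(a @ T @ a for a in A.T)
    den = sum(a @ W @ a for a in A.T)
    return num / den

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
labels = np.repeat([0, 1], 100)
W, B, T = scatter_matrices(X, labels)
A = np.linalg.qr(rng.normal(size=(3, 2)))[0]      # two orthonormal trial directions
print(I2(A, T, W))
```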
Since many of the projection indexes are based on the probability density function of the projected data, it is necessary therefore to have an estimate of this pdf from the sample of projected data x = (x_1, . . . , x_n). Let us denote by f_a() an estimate of the pdf of the projection (11.34). If, for example, one is interested in the entropy index (11.13), then the sample index is

I(a) = ∫ f_a(x) log f_a(x) dx + log( σ_a √(2π e) ),   (11.35)
8 If W_k is the total within-groups sum-of-squares obtained using k-means with k clusters, then it is acceptable to move to k + 1 clusters if (n − k − 1)(W_k/W_{k+1} − 1) > 10.
Before applying the projection, it is recommended that the data be centred and sphered (Jones and Sibson 1987; Huber 1985; Tukey and Tukey 1981). Centring yields a zero-mean data matrix whereas sphering yields a data matrix whose covariance matrix is the identity matrix. The centred data matrix (see Chap. 2) is

X_c = XH,   (11.38)

where H = I_n − (1/n) 1_n 1_n^T is the centring matrix. The sphered data matrix can be obtained by multiplying the centred data matrix by the inverse of a square root of the sample covariance matrix S = (1/n) X_c X_c^T. For example, if X_c = UΓV^T, then one can take S^{−1/2} = √n U Γ^{−1} U^T. Note that S^{−1/2} (S^{−1/2})^T = S^{−1}. The sphered data matrix is then X° = S^{−1/2} X_c = √n U V^T. The gradient of (11.37) with respect to a is given by
∇_a f_a(x) = (1/(n h^3)) Σ_{k=1}^{n} (x − a^T x_k) x_k φ( (x − a^T x_k)/h ),   (11.39)
and that of I (a) in Eq. (11.35) can be obtained easily by differentiating the
expression under the integral.
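The kernel estimate of the projected density and the resulting sample entropy index (11.35) can be sketched as follows (Python; the Gaussian-kernel form of f_a, the bandwidth and the data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def projected_kde(z, h, grid):
    """Gaussian-kernel estimate of the pdf of the projected data z = X a."""
    return norm.pdf((grid[:, None] - z[None, :]) / h).mean(axis=1) / h

def entropy_index(a, X, h=0.3):
    """Sample entropy index (11.35): int f_a log f_a dx + log(sigma_a sqrt(2 pi e))."""
    z = X @ a
    grid = np.linspace(z.min() - 3 * h, z.max() + 3 * h, 400)
    f = projected_kde(z, h, grid)
    integrand = np.where(f > 0, f * np.log(f), 0.0)
    dg = grid[1] - grid[0]
    return integrand.sum() * dg + np.log(z.std() * np.sqrt(2 * np.pi * np.e))

rng = np.random.default_rng(6)
X = np.column_stack([rng.normal(size=2000),
                     np.concatenate([rng.normal(-2, 0.5, 1000),
                                     rng.normal(2, 0.5, 1000)])])
print(entropy_index(np.array([1.0, 0.0]), X))   # Gaussian direction: index near zero
print(entropy_index(np.array([0.0, 1.0]), X))   # bimodal direction: larger index
```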
Once the first direction a_1 is found, the next direction can be obtained in a similar fashion from the residuals X − a_1 a_1^T X, i.e. after removing from the data the structure along a_1, etc. The other alternative is to extend the definition of the index to data projected onto a plane, where the index I(a, b) is a function of x_a = a^T X and x_b = b^T X, with a^T a = b^T b = 1, and a^T b = 0. The 2D PP is useful if we are interested in finding 2D structures simultaneously. Nason (1992) has extended this to the 3D case to analyse multispectral images through RGB colour representation. The same procedure can still be applied to find further 2- or 3-dimensional structures after removing the previously found structures. This is known as structure removal (Friedman 1987). Structure removal is in general an iterative procedure and repeatedly transforms the data projected onto the current solution plane to standard normality.
11.5.1 PP Regression
Projection pursuit has been extended beyond exploratory data analysis to regression analysis and density estimation. PP regression is a non-parametric regression approach that attempts to identify nonlinear relationships between variables, e.g. a response and explanatory variables y and x. This is achieved by looking for m vectors a_1, . . . , a_m, and nonlinear transformations φ_1(x), . . . , φ_m(x) such that

y_t = ȳ + Σ_{k=1}^{m} α_k φ_k(a_k^T x_t) + ε_t   (11.40)

provides an accurate model for the data (y_t, x_t), t = 1, . . . , n. Formally, y and x are presumed to satisfy the conditional expectation:

E[y|x] = μ_y + Σ_{k=1}^{m} α_k φ_k(a_k^T x),   (11.41)

where μ_y = E(y) and φ_k, k = 1, . . . , m, are normalised so that E[φ_k(a_k^T x)] = 0 and E[φ_k^2(a_k^T x)] = 1. The functions φ_k(x), k = 1, . . . , m, are known as the ridge functions.
The parameters of the model (11.40), i.e. the ridge functions and projection vectors, are obtained by minimising the mean squared error:

E[ y − μ_y − Σ_{k=1}^{m} α_k φ_k(a_k^T x) ]^2.   (11.42)
Model (11.40) is the basic PP regression model.⁹ It includes the possibility of having interactions between the explanatory variables. The problem is normally solved using a forward stepwise procedure (Friedman and Stuetzle 1981). First, a trial direction a_1 is chosen to compute the projected data z_t = a_1^T x_t, t = 1, . . . , n, then a curve fitting is obtained between y_t and z_t. This amounts to a simple 2D scatter plot analysis, and is achieved by finding φ_1(x) such that Σ_{t=1}^{n} w_t (y_t − φ_1(z_t))^2 is minimised. This procedure is iterative in a_1 and φ_1. Once a_1 and φ_1 are found, the same procedure is repeated with the residuals y_t − ȳ − β_1 φ_1(a_1^T x_t), where φ_1 has been standardised as above.
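A single step of this forward procedure can be sketched as follows (Python). A simple polynomial fit stands in for the smoother of the ridge function, and a generic optimiser searches for the direction; everything (data, polynomial degree, optimiser) is illustrative rather than the original algorithm of Friedman and Stuetzle (1981).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(13)
n, p = 500, 4
X = rng.normal(size=(n, p))
a_true = np.array([0.8, -0.6, 0.0, 0.0])
y = np.tanh(2 * X @ a_true) + 0.1 * rng.normal(size=n)     # one ridge function plus noise

def fit_ridge(z, y, deg=5):
    """Fit the ridge function phi_1 by a polynomial in the projected variable z."""
    coef = np.polyfit(z, y, deg)
    return np.polyval(coef, z)

def rss(a):
    a = a / np.linalg.norm(a)                 # keep the trial direction unit norm
    z = X @ a
    return np.sum((y - fit_ridge(z, y))**2)

res = minimize(rss, x0=np.ones(p), method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-8})
a_hat = res.x / np.linalg.norm(res.x)
print(np.round(a_hat, 2))                     # close to +/-(0.8, -0.6, 0, 0)
```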
As for spline smoothing, to make the ridge functions smooth one can impose a smoothing constraint. For instance, if in (11.41) or (11.42) we let μ_y = 0, then the quantity to be minimised takes the form:

Σ_{i=1}^{n} [ y_i − Σ_{k=1}^{m} α_k φ_k(a_k^T x_i) ]^2 + λ Σ_{k=1}^{m} ∫ ( d^2 φ_k(u)/du^2 )^2 du   (11.43)

over all ridge functions φ_k for which ∫ (d^2 φ_k(u)/du^2)^2 du < ∞. Given a_k, k = 1, . . . , m, Eq. (11.43) provides estimates φ̂_k for the ridge functions. From these estimates one can compute better projections by minimising Eq. (11.43) but with λ = 0, i.e.

min Σ_{i=1}^{n} [ y_i − Σ_{k=1}^{m} α̂_k φ̂_k(a_k^T x_i) ]^2
with respect to a_1, yielding a new direction a_1^{(1)}. The ridge function φ_1^{(1)} is then obtained from minimising the 1D spline smoothing problem

Σ_{t=1}^{n} [ u_t − φ_1^{(1)}( a_1^{(1)T} x_t ) ]^2,

where u_t = y_t − Σ_{k=2}^{m} φ_k^{(0)}( a_k^{(0)T} x_t ). This is then repeated to estimate a_k^{(1)}, φ_k^{(1)}, for k ≥ 2. The whole procedure is then re-iterated to yield estimates a_k^{(p)}, φ_k^{(p)}, k = 1, . . . , m, until convergence is achieved. Note that the forward stepwise procedure is different from backfitting in that in the latter the sequentially estimated ridge functions are re-used to estimate new ridge functions.
Probability density functions can also be estimated using projection pursuit. This has been presented in Friedman et al. (1984) and in Huber (1985). The PP density estimation of a pdf f(x) is based on the following multiplicative decomposition:

f_K(x) = f_0(x) Π_{m=1}^{K} g_m(a_m^T x),   (11.44)

where f_0(x) is a given standard probability density function, such as the normal having the same first two moments as the unknown density function f(x). The objective is to determine the directions a_m and the univariate functions g_m() such that f_K() converges to f() in some metric. From (11.44) one gets the recursion

f_K(x) = f_{K−1}(x) g_K(a_K^T x),   (11.45)
given the initial density function f_0(). At each step a direction a_K and an augmenting function g_K() are computed so as to maximise the goodness of fit of f_K(). Various measures can be used to evaluate the approximation. Huber (1985) mentions two discrepancy measures that provide proximity between two density functions: (i) the relative entropy or Kullback–Leibler distance E(f, g) = ∫ f(x) log[f(x)/g(x)] dx, and (ii) the Hellinger distance H(f, g) = ∫ [√f(x) − √g(x)]^2 dx. One of the advantages of the relative entropy, despite not being a metric because E(f, g) ≠ E(g, f), is that any probability density function f() with finite second-order moments has one unique Gaussian distribution closest to it in the sense of entropy. This Gaussian distribution has the same first- and second-order moments, i.e. mean and covariance matrix, as that of f(). The relative entropy is also invariant under arbitrary affine transformations and is suitable for multiplicative approximation such as (11.44), see Huber (1985).
Friedman and Tukey (1974) used, in fact, a version of the relative entropy W = ∫ f(x) log f_K(x) dx, which has to be maximised, i.e.

max ∫ f(x) log g_K(a_K^T x) dx   s.t.   ∫ f_K(x) dx = 1.   (11.46)

Note that, using the available sample, the relative entropy W above can be estimated by the expression (1/n) Σ_{t=1}^{n} log f_K(x_t). Now, given the probability density function f_{K−1}() and the direction a_K, the solution to (11.46) is obtained from

g_K(a_K^T x) = f^{(a_K)}(a_K^T x) / f_{K−1}^{(a_K)}(a_K^T x),   (11.47)

where f^{(a)}() and f_{K−1}^{(a)}() denote the pdfs of the projection a^T x under f() and f_{K−1}(), respectively.
The procedure goes like this. Given a direction a, and since f_{K−1}() is supposed to be known, one computes g_K(a^T x) according to Eq. (11.47). The estimate of f^{(a)}() is obtained by projecting the data onto the direction a, then computing an estimate of the corresponding probability density function using e.g. a kernel smoother. The direction a is then chosen so as to maximise (11.46).
Numerical computation of f_{K−1}^{(a)}(), however, may not be efficient in an iterative process. Friedman et al. (1984) propose a Monte Carlo approach based on replacing the estimate f_{K−1}^{(a)}() by a random sample from f_{K−1}(), from which f_{K−1}^{(a)}() is estimated in the same way as f^{(a)}(), and this is found to speed up the computation.
See also Huber (1985) for a few other alternatives.
Remark: How to Construct Clusters at the Vertices of a High-Dimensional Simplex
A 14-dimensional simplex was used by Friedman and Tukey (1974). Sammon (1969) used Gaussian data distributed at the vertices of a 4D simplex to test his cluster-revealing, multidimensional clustering algorithm. The simplex can be constructed as follows. One has to fix first the origin O and the intervertex distance r, then
(1) Choose A_1 = (x_1 = r, 0, . . . , 0).
(2) Compute the centre of mass of the interval [O, A_1], i.e. G_1 = (g_11, 0, . . . , 0). Note that g_11 = r/2.
(3) Choose A_2 = (x_21 = g_11, x_22, 0, . . . , 0), where x_22 is obtained from OA_2 = r (i.e. x_22 = r√3/2).
(4) Compute the centre of mass G_2 = (g_21, g_22, 0, . . . , 0) of (O, A_1, A_2) using the fact that G_2O + G_2A_1 + G_2A_2 = 0. Note that g_21 = g_11.
(5) Now choose A_3 = (x_31 = g_21, x_32 = g_22, x_33, 0, . . . , 0), where again x_33 is chosen using OA_3 = r. Next compute the centre of the 3D simplex (O, A_1, A_2, A_3) using G_3O + G_3A_1 + G_3A_2 + G_3A_3 = 0, i.e. G_3 = (g_31 = g_11, g_32 = g_22, g_33, 0, . . . , 0).
(6) We then choose A_4 = (x_41 = g_31, x_42 = g_32, x_43 = g_33, x_44, 0, . . . , 0), where as before x_44 is obtained from OA_4 = r, then compute G_4 = (g_41 = g_11, g_42 = g_22, g_43 = g_33, g_44, 0, . . . , 0), and so on.
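The recursion above is straightforward to code for arbitrary dimension, as in the following sketch (Python); the function name and the final distance check are illustrative only.

```python
import numpy as np
from itertools import combinations

def simplex_vertices(d, r=1.0):
    """Vertices A_1, ..., A_d (plus the origin O) of a regular simplex with
    intervertex distance r, built by the centre-of-mass construction above."""
    V = [np.zeros(d)]                          # start from the origin O
    for k in range(d):
        G = np.mean(V, axis=0)                 # centre of mass of the current vertices
        A = G.copy()                           # copy the first k coordinates of G
        A[k] = np.sqrt(r**2 - np.sum(G[:k]**2))  # fix the k-th coordinate so that |OA| = r
        V.append(A)
    return np.array(V)

V = simplex_vertices(4, r=1.0)
dist = [np.linalg.norm(V[i] - V[j]) for i, j in combinations(range(len(V)), 2)]
print(np.allclose(dist, 1.0))                  # True: all intervertex distances equal r
```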
A number of PP studies have been applied to weather and climate. Chan and
Shi (1997) compared PP based on a graduation index (Huber 1981), which is a
robust measure of scatter, with EOF analysis. They suggest that PP analysis is more
robust than EOF analysis particularly vis-a-vis outliers. Christiansen (2009) applied
projection pursuit to 20- and 500-hPa geopotential heights from NCEP/NCAR
reanalyses for the period 1948–2005 by optimising five indices. In terms of the
500-hPa heights Christiansen (2009) identified a pattern resembling a combination
of the PNA and a European pattern when kurtosis is maximised, and a NAO pattern
when the negentropy is maximised. The PDF estimated from the first PP index was
unimodal but skewed when daily data are used. When seasonal means of 20-hPa
data were used the kernel PDF along the first PP direction was bimodal.
Figure 11.2 shows the kernel PDF estimates of the seasonal mean 20-hPa winter geopotential height within the (PC1, PC2) (left) and (PC2, PC3) (right) spaces, along with the PP directions based on the leading three PCs and using five indices:
“kurtosis” (orange), “negentropy” (red), “depth” (yellow), “multi” (brown) and
“critic” (grey). The “depth” index provides a measure of the depth of the largest
trough in the PDF and targets hence bimodality of the PDF, the “critic” index was
proposed by Silverman (1986) and provides a measure of the degree of smoothing
required to get a unimodal PDF, i.e. related to the smoothing parameters used in
kernel PDF estimates, whereas the “multi” index was introduced by Nason and
Sibson (1992) and provides a measure that targets multimodality. It is clear that
the different indices do not necessarily lead to the same directions. For example,
Fig. 11.2 shows that “kurtosis”, “negentropy” and “critic” indices provide the
same direction along the bimodality. However, the other two PP indices provide
a direction nearly along the diagonal.
Figure 11.3 shows the histograms and kernel PDF estimates of the time series
obtained by maximising the “depth” (left) and “critic” (right) indices using the
leading three PCs. The associated PP patterns are also shown in Fig. 11.3. These
results suggest that projection maximising the “critic” index is associated with
weakening and strengthening of the polar vortex whereas the “depth” index is
Fig. 11.2 PDF of the winter seasonal mean 20-hPa geopotential height within the leading two PCs
along with the directions maximising the “kurtosis”, “negentropy”, “critic” (overlapping—grey
colour), “depth” and “multi” (overlapping—red colour) within the space of the leading three PCs,
projected onto the space of (PC1,PC2) (left) and (PC2,PC3) (right). Adapted from Christiansen
(2009). ©American Meteorological Society. Used with permission
Fig. 11.3 Histograms and kernel PDF estimates (top) of the projection of 20-hPa winter mean
geopotential height on the direction maximising “depth” (left) and “critic” (right) PP indices, and
the associated PP patterns (bottom). Adapted from Christiansen (2009). ©American Meteorologi-
cal Society. Used with permission
dominated by the existence of two centres of same polarity sitting respectively over
North America and Siberia.
Pasmanter and Selten (2010) applied PP using skewness and proposed a numer-
ical algorithm to solve a system of nonlinear equations to obtain skewness modes.
Denoting by z_t = a^T x_t the projection of the zero-mean data x_t = (x_1(t), . . . , x_p(t))^T, t = 1, . . . , n, onto the pattern a = (a_1, . . . , a_p)^T, the skewness modes are obtained by maximising the third-order moment of z_t, t = 1, . . . , n, subject to having unit variance, i.e.

max_a E(z_t^3)   subject to   E(z_t^2) = a^T C a = 1,   (11.48)

where E(.) is the expectation operator and C = (c_ij) is the covariance matrix of x_t, t = 1, . . . , n.
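Problem (11.48) can be solved with a generic optimiser by absorbing the constraint into a normalisation of the pattern, as in the following minimal sketch (Python, synthetic data); this is not the specific algorithm proposed by Pasmanter and Selten (2010).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p = 5000, 3
skewed = rng.gamma(2.0, 1.0, size=n) - 2.0          # zero-mean, skewed component
X = np.column_stack([skewed + 0.3 * rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
X -= X.mean(axis=0)
C = np.cov(X, rowvar=False)

def neg_skew(b):
    a = b / np.sqrt(b @ C @ b)                       # enforce a^T C a = 1
    z = X @ a
    return -np.mean(z**3)                            # maximise E(z^3)

res = minimize(neg_skew, x0=rng.normal(size=p), method="Nelder-Mead")
a_hat = res.x / np.sqrt(res.x @ C @ res.x)
print(np.round(a_hat, 2), -res.fun)                  # skewness mode and its skewness
```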
Fig. 11.4 Leading skewness modes of daily 500-hPa streamfunction from ERA-40. Skew values
are shown on top of each panel. Adapted from Pasmanter and Selten (2010)
and Wallace (2009) (see their Fig. 14). These modes explain altogether around 8%
of the total variance. Note that by construction the skewness modes are temporally
uncorrelated, so it makes sense to talk about them explaining a certain fraction of
the total variance.
The message from these techniques is that although the amount of variance
explained by those patterns is not very large, it is still significant compared to
extratropical EOFs’ variance. In addition, it is important to notice also that skewness
(or kurtosis) modes project well onto large scale flows pointing to the presence of
non-Gaussianity in large scale atmosphere (Sura and Hannachi 2015; Franzke et al.
2007). As pointed out by Sura and Hannachi (2015), it is likely that “interesting
structures” of large scale flow may lie on nonlinear manifolds, for which nonlinear
methods, despite their shortcomings (Monahan and Fyfe 2007) can still provide
useful information (Hannachi and Turner 2013b; Monahan 2001). Monahan and
DelSole (2009) also point out that when dealing with non-Gaussian statistics
“linear” approaches, i.e. operating on hyperplanes, are not efficient and they suggest
using different measures based on information-theoretic metrics.
Chapter 12
Independent Component Analysis
12.1 Introduction
S = WX (12.1)
of the data matrix X that faithfully represents the data in a particular way (see
Chap. 3). In fact, EOF/PCA is only one way to determine the matrix W in (12.1),
and is based on the second-order moments, namely the covariance matrix (1/n) X^T X,
and yields uncorrelated variables. Alternative ways, based on higher order moments,
also exist. Independent component analysis constitutes one such alternative and is
the subject of this chapter.
The ideal deconvolution operation is the one for which there is a constant α and an integer m_0 such that z_n = α x_{n−m_0}, for n in N, and this is satisfied when Σ_k f_k g_{n−k} = α δ(n − m_0). Note that a similar problem formulation arises in signal
detection where the objective is to estimate the input signal xt given that we observe
yt = xt + εt , for which the noise εt is supposed to be independent of xt . In this
detection problem we suppose, for example, that the second-order moments of εt
are known. The solution to this problem is obtained through a mean square error
(MSE) minimisation and takes a similar form to (12.3). If in the original problem,
Eqs. (12.2)–(12.3), the unit-impulse response of the convolving operator (12.2), i.e.
fk , (k in N), is known, one talks about deconvolution when we seek the input signal
xt , t in N. The deconvolving signal, Eq. (12.3), can be obtained by inverting the
linear filter (12.2) using, for example, the z-transform or the frequency response
function.1
A more challenging task corresponds to the case when the convolving filter (12.2)
is unknown. In this case one talks about blind deconvolution (Shalvi and Weinstein
1990). Although this is not the main concern here, it is of interest to briefly discuss
the solution to this problem in a particular case where the signal x_k, k in N, is supposed to be independent white noise, i.e. an IID sample from a random variable X. The solution to this problem can be obtained using higher order moments of X estimated from the sample (see Appendix B). Precisely, the normalised cumulants,² see e.g. Kendall (1994, chap. 3), of the sample y_t, t in N, are used in this case. Note that if y_t = Σ_k f_k x_{t−k}, with x_t, t in N, IID realisations of X, then the cumulants of y_t are related to those of x_t via c_y(p) = c_x(p) Σ_k (f_k)^p, see e.g. Brillinger and Rosenblatt (1967) and Bartlett (1955). It can be shown, in fact, that the magnitude of the normalised response cumulant κ_y(p, q), of order (p, q), for any even q > 0, is bounded from above, and for any even p > 0, it is bounded from below, as (see e.g. Cadzow 1996)

|κ_y(p, q)| ≤ |κ_x(p, q)|  for all p > q,  and
|κ_x(p, q)| ≤ |κ_y(p, q)|  for all q > p.   (12.4)
1 Writing (12.2) as y_t = ψ(B) x_t, where ψ(z) = Σ_k f_k z^k and B is the backward shift operator, i.e. B x_t = x_{t−1}, one gets f_y(ω) = |Ψ(ω)|^2 f_x(ω) = |ψ(e^{iω})|^2 f_x(ω). In this expression |Ψ(ω)|^2 is the gain of the filter, and f_x() and f_y() are the power spectra of x_t and y_t, respectively. The Fourier transform of the deconvolving filter, i.e. its frequency response function, is then given by [Ψ(ω)]^{−1}.
2 Of order (p, q) of a random variable Y, κ_y(p, q) is defined by κ_y(p, q) = c_y(p) |c_y(q)|^{−p/q}, where c_y(p) is the cumulant of order p of Y, and where it is assumed that c_y(q) ≠ 0. The following are examples of cumulants: c_y(1) = μ (the mean), c_y(2) = μ(2) = σ^2, c_y(3) = μ(3), c_y(4) = μ(4) − 3σ^4 and c_y(5) = μ(5) − 10 μ(3) μ(2), where the μ's are the centred moments.
easily computed, and the problem is solved using a suitable gradient-based method
(Appendix E), see e.g. Cadzow (1996), Cadzow and Li (1995), Haykin (1999) and
Hyvärinen (1999) for further details.
The previous discussion concerns univariate time series. In the multivariate case
there is also a similar problem of data representation, which is historically related
to ICA. In blind source separation (BSS) we suppose that we observe a multivariate
time series xt = (x1t , . . . , xmt )T , t = 1, . . . , n, which is assumed to be a mixture of
m source signals. This is also equivalent to saying that x_t is a linear transformation of an m-dimensional unobserved time series s_t = (s_1t, . . . , s_mt)^T, t = 1, . . . , n, i.e. x_it = Σ_{k=1}^{m} a_ik s_kt, or in matrix form:

x_t = A s_t,   (12.5)
where A = (a_ij) represents the matrix of unknown mixing coefficients. The
components of the vector st are supposed to be independent. The objective is then
to estimate this time series st , t = 1, . . . , n, or similarly the mixing matrix A. This
problem is also known in speech separation as the cocktail-party problem. Briefly,
BSS is a technique that is used whenever one is in the presence of an array of m
receivers recording linear mixtures of m signals.
Remark An interesting solution to this problem is obtained from non-Gaussian
signals st , t = 1, . . . , n. In fact, if st is Gaussian so is xt , and the solution to
this problem is trivially obtained by pre-whitening xt using principal components,
i.e. using only the covariance matrix of xt , t = 1, . . . , n. The solution to the BSS
problem is also obtained as a linear combination of xt , t = 1, . . . , n, as presented
later. Note also that the BSS problem can be analysed as an application to ICA.
x = As, (12.6)
where A = (a_ij) represents the m × m mixing matrix. The objective of ICA is to estimate both the underlying independent components³ s and the mixing matrix A. If we denote the inverse of A by W = A^{−1}, the independent components are obtained as a linear combination of x:

u = Wx.   (12.7)
The objective of ICA is that the components of u = Wx, see Eq. (12.7), be as statistically independent as possible. If one denotes by f(s) = f(s_1, . . . , s_m) the joint probability density of s, then independence means that f(s) may be factorised as

f(s_1, . . . , s_m) = Π_{k=1}^{m} f_k(s_k),   (12.8)

i.e. s_i and s_j are nonlinearly uncorrelated for all types of nonlinearities. This is clearly difficult to check in practice directly from the definition of independence. We seek instead a simpler way to measure independence, and as will be seen later, this is possible using information-theoretic approaches.
12.3.2 Non-normality
4 Take X to be a standard normal random variable and define Y = X 2 , then X and Y are
uncorrelated but not independent.
5 Measurable.
6 Who referred to the statement “Normality is a myth; there never was, and never will be a normal distribution” due to Geary (1947). As Mardia (1980) points out, however, this is an overstatement from a practical point of view, and it is important to know when a sample departs from normality.
7 Such as the Laplace probability density function f(x) = (1/√2) e^{−√2 |x|}.
12.4.1 Entropy
H(U) = −Σ_{k=1}^{q} p_k ln p_k.   (12.11)
For a continuous variable with probability density function f(u), the differential entropy or Boltzmann H-function is given by

H(U) = −∫_R f(u) ln f(u) du.   (12.12)
Under a smooth invertible change of variables v = g(u), with Jacobian J, the entropy transforms as⁹

H(g(u)) = H(u) + E[ ln |J| ].   (12.14)

The entropy gives rise to a number of important measures useful for ICA, as detailed next; see also their usefulness in projection pursuit.
9 Writing first E[h(u)] = ∫ h(u) f_u(u) du and E[h(g(v))] = ∫ h(g(v)) f_v(v) dv = ∫ h(u) f_v(v) |J| du, one obtains the required result.
• D(f ‖ f) = 0.
• D(f ‖ g) ≥ 0 for all f() and g().
Proof The classical proof for the second property uses the so-called Gibbs inequality¹⁰ (Gibbs 1902, chap XI, theorem 2), see also Fraser and Dimitriadis (1994). Here we give another sketch of the proof using, as before, ideas from the calculus of variations. Let f() be any given probability density function, and we set the task to compute the minimum of D(f ‖ g) considered as a functional of g(). Let ε() be a “small perturbation” function such that [g + ε]() is still a probability density for a given probability density g(), that is, g + ε ≥ 0. This necessarily means that ∫ ε(u) du = 0. Given the constraint satisfied by g(), we consider the objective function G(g) = D(f ‖ g) − λ(1 − ∫ g(u) du), where λ is a Lagrange multiplier. Now, using the approximation ln(1 + ε(x)/g(x)) ≈ ε(x)/g(x), one gets G(g + ε) − G(g) ≈ ∫ [ −f(x)/g(x) + λ ] ε(x) dx. The necessary condition of optimum yields f = λg, i.e. f = g since f() and g() integrate to 1. It remains to show that g = f is indeed the minimum. Let F(g) = ∫ f(x) ln[f(x)/g(x)] dx. For any “small” perturbation ε(), we have, keeping in mind that ln(1 + ε)^{−1} ≈ −ε + ε^2/2, F(f + ε) − F(f) = (1/2) ∫ [ε^2(x)/f(x)] dx + o(ε^2) ≥ 0. Hence g = f minimises the functional F().
The K–L divergence is sometimes regarded as a distance between probability
density functions where in reality it is not because it is not symmetric.11 The K–L
divergence yields an important measure used in ICA, namely mutual information.
I(u) = Σ_{k=1}^{m} H(u_k) − H(u).   (12.17)
The mutual information is also known as redundancy (Bell and Sejnowski 1995),
and the process of minimising I () is known as redundancy reduction (Barlow 1989).
The mutual information I (u) provides a natural measure of the dependence between
the components of u. In fact, I (u) ≥ 0 and that I (u) = 0 if and only if (u1 , . . . , um )
are independent.
Exercise Show that I (u1 , . . . , um ) ≥ 0.
Hint See below.
The mutual information can also be defined using the K–L divergence. In fact, if f() is the probability density of u, f_k() is the marginal density function of u_k, k = 1, . . . , m, and f̃(u) = Π_{k=1}^{m} f_k(u_k), then

I(u) = D(f ‖ f̃) = ∫ f(u) ln[ f(u)/f̃(u) ] du,   (12.18)
12.4.4 Negentropy
Since the entropy is not invariant to variable changes, see Eq. (12.15), it is desirable
to construct a similar function that is invariant to linear transformations. Such a measure is provided by the negentropy. If u is an m-dimensional random variable with covariance matrix Σ and u_G is the multinormal random variable with the same covariance, then the negentropy of u is given by

J(u) = H(u_G) − H(u).   (12.19)

The mutual information can then be expressed in terms of the negentropy as

I(u) = J(u) − Σ_{i=1}^{m} J(u_i) + (1/2) ln( Π_{i=1}^{m} σ_i^2 / |Σ| ),   (12.20)

where σ_i^2 are the diagonal elements of Σ. When the components are uncorrelated the last term vanishes, and since J(u) is invariant under invertible linear transformations this yields

I(u) = −Σ_{k=1}^{m} J(u_k) + c,   (12.21)
for some 1 ≤ a ≤ 2. For the multivariate case, Hyvärinen (1998) also provides an approximation to the mutual information involving the third- and fourth-order cumulants, namely

I(u) = c + (1/48) Σ_{i=1}^{m} [ 4 (κ_3(u_i))^2 + (κ_4(u_i))^2 + 7 (κ_4(u_i))^4 − 6 (κ_3(u_i))^2 κ_4(u_i) ],   (12.24)

for uncorrelated u_i, i = 1, . . . , m, where c is a constant.
X = AS, (12.25)
where S = (s1 , . . . , sn ). The usual procedure to solve this problem is to find instead
a matrix W, known as matrix of filters, such that ut = Wxt , or in matrix form:
U = WX (12.26)
Unlike EOFs/PCA where the objective function is quadratic and the solution
is easily obtained, in ICA various objective functions exist along with various
algorithms to compute them. A great many objective functions are based on an information-theoretic approach, but there are also other objective functions using ideas from neural networks.
Furthermore, these objective functions can be either “one unit” where each ICA
component is estimated at a time or “several”/“whole units” where several/all ICA
components are obtained at once (Hyvärinen 1999). The one-unit case is particularly
similar to projection pursuit where an index of non-normality is maximised. A
particularly interesting independence index for one-unit ICA is provided by the
following criteria:
Negentropy
The negentropy J(y) of the projection

y = w^T X   (12.27)

is obtained using Eq. (12.19) and estimated using, for example, Eq. (12.22). The
maximisation of J (y) yields the first ICA direction. Subsequent ICA directions
can be computed in a similar manner after removing the effect of the previously
identified directions as in PP. Alternatively, one can also estimate negentropy using a
non-parametric estimation of the data probability density function discussed below.
Non-normality
There is a subtle relationship between projection pursuit and ICA. This subtlety
can be made clear using an argument based on the central limit theorem (CLT).
The linear expression in Eq. (12.27) can be expressed as a linear combination of
the (unknown) independent components using Eq. (12.6). A linear combination
of these components is more Gaussian than any of the individual (non-Gaussian)
components. Hence to achieve the objective, one has to maximise non-normality
of the index (12.27) using, e.g. kurtosis, Eq. (12.10) or any other index of non-
normality from the preceding chapter.
Information-Theoretic Approach
The log-likelihood of the data under the ICA model reads

L = Σ_{t=1}^{n} ln f_x(x_t) = Σ_{t=1}^{n} Σ_{k=1}^{m} ln f_k(w_k^T x_t) + n ln |det W|,   (12.28)
A different approach to solve the ICA problem was developed by Bell and
Sejnowski (1995) by maximising the input–output information (info-max) from a
neural network system. The info-max approach goes as follows. An input x, with
probability density f (), is passed through a (m → m) neural network system with
weight matrix W and bias vector w0 and a sigmoid function g() = (g1 (), . . . , gm ()).
In practice the sigmoids are taken to be identical.14 The output is then given by
y = g (Wx + w0 ) (12.29)
14 For example, tanh(), or the logistic function g(u) = e^u/(1 + e^u).
Σ_{k=1}^{m} [ 1/( d/du g_k(u_k) ) ] ∂/∂w_ij [ d/du g_k(u_k) ] = [ (d^2/du^2 g_i(u_i)) / (d/du g_i(u_i)) ] x_j.   (12.31)

This leads to the learning rules

ΔW = W^{−T} + a x^T,
Δw_0 = a,   (12.32)

where

a^T = ( (d^2/du^2 g_1(u_1)) / (d/du g_1(u_1)), . . . , (d^2/du^2 g_m(u_m)) / (d/du g_m(u_m)) ).
For example, if the logistic function is used, then a = 1 − 2y. Note that when one has a sample x_1, . . . , x_n, the function to be maximised is E[ln |J|], where the expectation of ln |J| and also the gradient, i.e. E[ΔW] and E[Δw_0], are simply sample means.
Remark Using Eq. (12.14) along with Eq. (12.17), the mutual information I(u) is related to the entropy of the output v = g(u) from a sigmoid function g() via

I(u) = −H(g(u)) + Σ_{k=1}^{m} E[ ln( | d/du g_k(u_k) | / f_k(u_k) ) ],   (12.33)
A Non-parametric Approach
A major difficulty in some of the previous methods is related to the estimation of the
pdf. This is a well known difficult problem particularly in high dimensions. A useful
and practical way to overcome this problem is to use a non-parametric approach for
the pdf estimation using kernel methods. The objective is to minimise the mutual
information of y = Wx, I (y) = H (yi ) − ln |W| − H (x), see Eq. (12.15). This is
equivalent to minimising
m
m
F (W) = H (yk ) − ln |W| = − E ln fyk (wTk x) − ln |W|, (12.34)
k=1 k=1
15 By ensuring that all response levels are used with equal frequency.
Other Methods
Various other methods exist that deal with ICA, see e.g. Hyvärinen et al. (2001) for details. But before closing this section, it is worth mentioning a particularly interesting and easy-to-use method, based on a weighted covariance matrix, see e.g. Cardoso (1989). The method is based on finding the eigenvectors of E[ ‖x‖^2 x x^T ], which can be estimated by (1/n) Σ_{k=1}^{n} ‖x_k‖^2 x_k x_k^T, where the data have been sphered prior to computing this covariance matrix. The method is based on the assumption that the kurtosis of the different components is different.
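The weighted-covariance idea can be sketched in a few lines (Python, synthetic mixtures): after sphering, the eigenvectors of the norm-weighted covariance recover the source directions provided the component kurtoses differ.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20000
S = np.column_stack([rng.uniform(-1, 1, n),          # sub-Gaussian source
                     rng.laplace(size=n)])           # super-Gaussian source
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # mixing matrix
X = S @ A.T
X -= X.mean(axis=0)

# sphere the data
d, E = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ E @ np.diag(1.0 / np.sqrt(d)) @ E.T

# weighted covariance E[||z||^2 z z^T] and its eigenvectors
Wc = (Z * (Z**2).sum(axis=1, keepdims=True)).T @ Z / n
_, V = np.linalg.eigh(Wc)
U = Z @ V                                            # estimated independent components
print(np.round(np.corrcoef(U.T, S.T)[:2, 2:], 2))    # one entry near +/-1 per row
```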
The above nonlinear measures of association between variables can also be used,
in a similar way to the linear covariances or Pearson correlation, to define climate
networks, see Sect. 3.9 (Chap. 3). For example, mutual information (Donges et
al. 2009; Barreiro et al. 2011) and transfer entropy (Runge et al. 2012) have been
used to define climate networks and quantify statistical association between climate
variables.
Prior to any numerical procedure, in ICA it is important to preprocess the data. The most common way is to sphere the data, see Chap. 2. After centring, the sphered (or whitened) variable is obtained from the transformation z = Q(x − E(x)), such that the covariance matrix of z is the identity. For our sample covariance matrix S = EΛ^2 E^T, one can take for Q the inverse of a square root¹⁶ of S, i.e. Q = Λ^{−1} E^T. From the sample data matrix X = (x_1, . . . , x_n), the sphered data matrix Z corresponds to the standardised PC matrix.
The benefit of sphering has been nicely illustrated in various places, e.g.
Hyvärinen et al. (2001) and Jenssen (2000), and helps simplify calculations. For
example, if the new mixing matrix B, for which z = Bs = QAs, is orthogonal, then
the number of degrees of freedom is reduced from m2 to m(m − 1)/2.
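The centring and sphering step can be written compactly as follows (Python; the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated toy data

Xc = X - X.mean(axis=0)                    # centring
S = Xc.T @ Xc / len(Xc)                    # sample covariance matrix
lam2, E = np.linalg.eigh(S)                # S = E diag(lam^2) E^T
Q = np.diag(1.0 / np.sqrt(lam2)) @ E.T     # Q = Lambda^{-1} E^T
Z = Xc @ Q.T                               # sphered (whitened) data

print(np.round(Z.T @ Z / len(Z), 3))       # identity covariance matrix
```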
Optimisation Algorithms
Once the data have been sphered, the ICA problem can be solved using one of
the previous objective functions. When the objective function is relatively simple,
e.g. the kurtosis (12.10) in the unidimensional case, the optimisation can be
achieved using any algorithm such as gradient type algorithms. For other objective
functions such as those based on information-theoretic approaches, commonly used
algorithms include the Infomax (Bell and Sejnowski 1995), the maximum likelihood
estimation and the FastICA algorithm (Hyvärinen and Oja 2000).
• Infomax
The gradient of the objective function based on Infomax has already been given in Eq. (12.32). For example, when the logistic sigmoid is used, the learning rule or algorithm is given by ΔW ∝ W^{−T} + (1 − 2y) x^T, whereas for the hyperbolic tangent, particularly useful for super-Gaussian independent components (Hyvärinen 1999), the learning rule becomes ΔW ∝ W^{−T} − 2 tanh(Wx) x^T. The algorithm used in Infomax is based on the (stochastic) gradient ascent of the objective function.¹⁷
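A minimal sketch of the Infomax learning rule (Python, synthetic super-Gaussian sources) is given below; it uses the relative-gradient simplification mentioned in the footnote, and the learning rate, batch size and number of iterations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 20000
S = np.column_stack([rng.laplace(size=n), rng.laplace(size=n)])   # super-Gaussian sources
X = S @ np.array([[1.0, 0.5], [0.3, 1.0]]).T                      # mixed signals
X -= X.mean(axis=0)

# sphere the data first, as recommended in the text
d, E = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ E @ np.diag(1.0 / np.sqrt(d)) @ E.T

W = np.eye(2)
eta = 0.05
for it in range(500):
    Zb = Z[rng.integers(0, n, 1000)]            # mini-batch
    U = Zb @ W.T                                # u = W z
    Y = 1.0 / (1.0 + np.exp(-U))                # logistic sigmoid outputs
    # relative-gradient form of the rule: dW = (I + (1 - 2y) u^T) W
    dW = (np.eye(2) + (1.0 - 2.0 * Y).T @ U / len(Zb)) @ W
    W += eta * dW

U = Z @ W.T
print(np.round(np.corrcoef(U.T, S.T)[:2, 2:], 2))   # entries near +/-1 indicate separation
```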
• FastICA
The FastICA algorithm was introduced by Hyvärinen and Oja (2000) in order to
accelerate convergence, compared to some cases with Infomax using stochastic
gradient. FastICA is based on a fixed point algorithm similar to the Newton
iteration procedure (Hyvärinen 1999). For the one-unit case, i.e. one ICA
at a time, FastICA finds directions of maximal non-normality. FastICA was
developed in conjunction with information-theoretic approaches based on the
negentropy approximation in (12.22), using a non-quadratic function G(), with
derivative g(), and yields, after discarding the constant E (G(ν)),
16 Note that the square root is not unique; EΛ and EΛE^T are two square roots, and the last one is symmetric.
17 Hyvärinen (1999) points out that algorithm (12.37) can be simplified by a right multiplication by W^T W to yield the relative gradient method (Cardoso and Hvam Laheld 1996).
J(w) = [ E( G(w^T x_t) ) ]^2,

where w is unitary. Because of sphering, one has E[ (w^T x_t)^2 ] = w^T w = 1. Using a Lagrange multiplier, the solution corresponds to the stationary points of the extended objective function J(w) − λ( ‖w‖^2 − 1 ), given by

F(w) = E[ x_t g(w^T x_t) ] − λ w = 0.   (12.38)

The Jacobian of F(w) is J_F(w) = E[ x_t x_t^T (dg/du)(w^T x_t) ] − λ I_m. Hyvärinen (1999) uses the approximation E[ x_t x_t^T (dg/du)(w^T x_t) ] ≈ E[ x_t x_t^T ] E[ (dg/du)(w^T x_t) ], which is isotropic and easily invertible. Now, using the Newton algorithm, this yields the iterative form:

w*_{k+1} = w_k − [ E( x_t g(w_k^T x_t) ) − λ w_k ] / [ E( (dg/du)(w_k^T x_t) ) − λ ],
w_{k+1} = w*_{k+1} / ‖w*_{k+1}‖.   (12.39)
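The one-unit fixed-point iteration is sketched below (Python), in the commonly used simplified form w ← E[x g(w^T x)] − E[g'(w^T x)] w followed by normalisation, which follows from (12.39); the nonlinearity G(u) = log cosh u (g = tanh) and the toy mixture are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20000
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])   # independent sources
X = S @ np.array([[1.0, 0.4], [0.6, 1.0]]).T
X -= X.mean(axis=0)

# sphering
d, E = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ E @ np.diag(1.0 / np.sqrt(d)) @ E.T

g = np.tanh
dg = lambda u: 1.0 - np.tanh(u)**2             # derivative of g

w = rng.normal(size=2)
w /= np.linalg.norm(w)
for it in range(50):
    u = Z @ w
    w_new = (Z * g(u)[:, None]).mean(axis=0) - dg(u).mean() * w   # fixed-point update
    w_new /= np.linalg.norm(w_new)                                 # renormalise, cf. (12.39)
    delta = np.abs(np.abs(w_new @ w) - 1.0)
    w = w_new
    if delta < 1e-8:
        break

print(np.round(w, 3), np.round(np.corrcoef(Z @ w, S.T)[0, 1:], 2))
```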
∂ ln |W| / ∂P = diag( ‖p_1‖^{−2}, . . . , ‖p_m‖^{−2} ) P + P^{−T}.
The use of ICA in climate research is quite recent, and the number of research
papers is limited. Philippon et al. (2007) employ ICA to extract independent
modes of interannual and intraseasonal variability of the West African vegetation.
Mori et al. (2006) applied ICA to monthly sea level pressures to find the main
independent contributors to the AO signal. See also Basak et al. (2004) for an
analysis of the NAO, Fodor and Kamath (2003) for an application of ICA to
global temperature series and Aires et al. (2000) for an analysis of tropical sea
surface temperatures. The ICA technique has the potential to avoid the PCA “mixing
problem”. PCA has the tendency to mix several modes of comparable magnitude,
often generating spurious regional overlaps or teleconnections where none exists or
distorting existing overlaps or teleconnections (Aires et al. 2002).
There is a wide class of ICA algorithms that achieve approximate independence
by optimising criteria involving higher order cumulants; for example, the JADE
criterion proposed by Cardoso and Souloumiac (1993) performs joint diagonal-
isation of a set of fourth-order cumulant matrices. The orthomax-based criteria
proposed in Kano et al. (2003) are, respectively, quadratic and linear functions of
fourth-order statistics. Unlike higher order cumulant-based methods, the popular
FastICA algorithm chooses a single non-quadratic smooth function (e.g. g(x) =
log cosh x), such that the expectations of this function yield a robust approximation
to negentropy (Hyvärinen et al. 2001). In the next section a criterion is introduced,
which requires the minimisation of the sum of squared fourth-order statistics formed
by covariances computed from squared components.
ICA can be interpreted in terms of EOF rotation. Aires et al. (2002) used a neural
network-based approach to obtain independent components. Hannachi et al. (2009)
presented a more analytical way to ICA via rotation by minimising a criterion based
on the sum of squared fourth-order statistics. The optimal rotation matrix is then
used to rotate the matrix of initial EOFs to enhance interpretation. The data are first
pre-whitened using EOFs, and the ICA problem is then solved by rotating the matrix
of the uncorrelated component scores (PCs), i.e.
Ŝ = YT , (12.40)
for some orthogonal matrix T. The method then uses the fact that if the components
s1 , . . . , sk are independent, their squares s12 , . . . , sk2 are also independent. Therefore
the model covariance matrix of the squared components is diagonal. Given any
orthogonal matrix V, and letting G = YV, the sample covariance matrix between
the element-wise squares of G is
R = (1/(n−1)) (G ⊙ G)ᵀ H (G ⊙ G),   (12.41)

where ⊙ denotes the element-wise (Hadamard) product. The rotation criterion is then

F(V) = (1/2) ( ‖R‖²_F − ‖diag(R)‖²_F ),   (12.42)
and the rotation matrix T in Eq. (12.40) is obtained as the orthogonal matrix V minimising F(V).
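As an illustration of the criterion (12.41)–(12.42), the sketch below evaluates F(V) for a matrix Y of whitened PC scores and, for the two-dimensional case only, searches planar rotation angles by brute force. The function names and the grid search are illustrative choices, not the optimisation strategy of Hannachi et al. (2009).

```python
# Criterion based on covariances between squared rotated components.
import numpy as np

def rotation_criterion(Y, V):
    """Y: (n, k) whitened PC scores; V: (k, k) orthogonal rotation."""
    n = Y.shape[0]
    G = Y @ V                                  # rotated scores
    G2 = G ** 2                                # element-wise squares
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    R = (G2.T @ H @ G2) / (n - 1)              # covariance of squared components
    off = R - np.diag(np.diag(R))
    return 0.5 * np.sum(off ** 2)              # half the sum of squared off-diagonals

def best_rotation_2d(Y, n_angles=360):
    """Brute-force angle search, valid for k = 2 only (illustration)."""
    angles = np.linspace(0.0, np.pi / 2.0, n_angles)
    costs = [rotation_criterion(Y, np.array([[np.cos(a), -np.sin(a)],
                                             [np.sin(a),  np.cos(a)]]))
             for a in angles]
    a = angles[int(np.argmin(costs))]
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
```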
Assuming the components of s to be uniformly distributed over [−1, 1], Fig. 12.1
shows PC1 vs PC2 of the above models. The figure shows clearly that the two
models are actually distinguishable. However, this is not the case when we compute
the covariance matrices of these models, which turn out to be proportional.
Fig. 12.1 Scatter plot of PC1 vs PC2 (a, b) and IPC1 vs IPC2 (c, d) of models AO1 (a, c) and
AO2 (b, d). Adapted from Hannachi et al. (2009) ©American Meteorological Society. Used with
permission
The joint density of the latent signals s1 and s2 is uniform on a square. In Fig. 12.1a
the joint density of the scores of EOF 1 and EOF 2 for Model AO1 is uniform on
a square, and PC 1 and PC 2 are independent. In Fig. 12.1b, the joint density of
the scores of EOF 1 and EOF 2 cannot be expressed as the product of the marginal
densities, and the components are not independent. When the above procedure is
applied, the independent components are extracted successfully (Fig. 12.1c, d). An
example of application of ICA via rotation to monthly SLP from NCEP-NCAR was
presented by Hannachi et al. (2009).
Monthly SLP anomalies for the period Jan 1948–Nov 2006 were used. Figure 12.2
shows grid points where the time series of SLP anomalies depart signifi-
cantly, at the 5% level, from normality. It is clear that non-Gaussianity is a dominant
feature of the SLP anomalies. The data were first reduced by applying an EOF
analysis, and ICA rotation was applied to various numbers of EOFs. Figure 12.3
shows the first five independent principal components (IPCs) obtained by rotating
the first five component scores. Their cross-correlations have been checked and
found to be zero up to machine precision. The cross-
correlations of various nonlinear transformations of IPCs have also been computed
and compared to those obtained using PCs.
Fig. 12.3 Independent principal components obtained via EOF rotation of the first five PCs of the
monthly mean SLP anomalies. Adapted from Hannachi et al. (2009). ©American Meteorological
Society. Used with permission
The upper triangular part of the matrix in Table 12.1 shows the absolute values of
the correlations between the element-wise fourth power of IPCs 1–5. The lower part
shows the corresponding correlations using the PCs instead of the IPCs. Significant
288 12 Independent Component Analysis
Table 12.1 Correlation matrix of the fourth power elements of ICs 1 to 5 (above the diagonal)
and the same correlation but for the PCs (below the diagonal). The sign of the correlations has
been dropped. Bold faces and underlined values refer to significant correlations at 1% and 5%
levels, respectively
              IPC1/PC1   IPC2/PC2   IPC3/PC3   IPC4/PC4   IPC5/PC5
IPC1/PC1         1          0          0.01       0          0
IPC2/PC2         0.08       1          0.02       0.02       0.01
IPC3/PC3         0          0.01       1          0          0.01
IPC4/PC4         0.01       0.01       0.08       1          0.02
IPC5/PC5         0.1        0.05       0.04       0.01       1
correlations at the 1% and 5% levels, respectively, are indicated below the diagonal.
Table 12.2 is similar to Table 12.1 except that now the nonlinear transformation is
the absolute value of the third power law. Note again the significant correlations
in the transformed PCs in the lower triangular part of Table 12.2, whereas no such
significant correlations are obtained with the IPCs.
Hannachi et al. (2009) computed those same correlations using various other
nonlinear functions and found consistent results with Tables 12.1 and 12.2. Note
that the IPCs also reflect large non-normality, as can be seen from the skewness
of the IPCs. For example, Fig. 12.4 shows the q–q plots of all SLP IPCs. The
straight diagonal lines in Fig. 12.4 are for the normal distribution, and any departure
from these lines reflects non-normality. Clearly, all the q–q plots display strong
nonlinearity, i.e. non-normality. A formal KS test reveals that the null hypothesis
of normality is rejected for the first three IPCs at 1% significance level and for the
last two IPCs at 5% level. The non-normality of the PCs has also been checked and
compared with that of the IPCs using the q–q plot (not shown). It is found that the
IPCs are more non-normal than the PCs.
The spatial patterns associated with the IPCs are shown in Fig. 12.5. In principle,
the rotated EOFs have no natural ranking, but the order of the rotated EOFs in
Fig. 12.1 is based on the amount of variance explained by those patterns. The first
REOF looks like the Arctic Oscillation (AO) pattern,18 and the second REOF repre-
sents the NAO. The fourth pattern, for example, is reminiscent of the Scandinavian
teleconnection pattern. Figure 12.6 shows the correlation maps between the leading two IPCs and sea surface temperature.
18 Note, however, that the middle centre of action is displaced from the pole and shifted towards
northern Russia.
Fig. 12.4 Quantile plots on individual IPCs of the monthly SLP anomalies. Adapted from
Hannachi et al. (2009). ©American Meteorological Society. Used with permission
Fig. 12.5 Spatial patterns associated with the leading five IPCs. The order is arbitrary. Adapted
from Hannachi et al. (2009). ©American Meteorological Society. Used with permission
Fig. 12.6 Correlation map of monthly SLP anomaly IPC1 (top) and IPC2 (bottom) with HadISST
monthly SST anomaly for the period January 1948–November 2006. Only significant values, at
1% level, are shown. Correlations are multiplied by 100. Adapted from Hannachi et al. (2009).
©American Meteorological Society. Used with permission
The figure shows the correlation of SLP IPC1 (Fig. 12.6, top) and SLP IPC2 (Fig. 12.6, bottom) with the Hadley Centre
Sea Ice and Sea Surface Temperature (HadISST). It can be seen, in particular, that
the correlation pattern with IPC1 reflects the North Pacific oscillation, whereas the
correlation with IPC2 reflects well the NAO pattern. The same rotation was also
applied in the context of exploratory factor analysis by Unkel et al. (2010), see
Chap. 10.
The above independent component analysis applied to two-way data was extended
to three-way climate data by Unkel et al. (2011). Three-way data consist of data that
are indexed by three indices, such as time, horizontal space and vertical levels. For
this four-dimensional case (3 spatial dimensions plus time), with measurements on J
horizontal grid points at K vertical levels for a sample size n, the data are represented
by a third-order tensor:

𝒳 = (x_{jtk}),   j = 1, …, J,   t = 1, …, n,   k = 1, …, K.   (12.46)
A standard model for the three-way data, Eq. (12.46), is the three-mode Parafac
model with R components (e.g. Caroll and Chang 1970; Harshman 1970):
x_{jtk} = Σ_{r=1}^{R} a_{jr} b_{tr} c_{kr} + ε_{jtk},   j = 1, …, J,   t = 1, …, n,   k = 1, …, K.   (12.47)
In Eq. (12.47), A = (a_{jr}), B = (b_{tr}) and C = (c_{kr}) are the component matrices (or
modes) of the model, and ε = (ε_{jtk}) is an error term.
A slight modification of the Parafac model, Eq. (12.47), was used by Unkel et
al. (2011), based on the Tucker3 model (Tucker 1966), which yields the following
model:

X_A = A (C |⊗| B)ᵀ + E_A,   (12.48)

where the J × (nK) matrices X_A and E_A are reshaped versions of 𝒳 = (x_{jtk}) and
ε = (ε_{jtk}), obtained from the K frontal slices of those tensors, and |⊗| is the column-wise
Kronecker (or Khatri–Rao) matrix product. The matrices A, B and C are obtained
by minimising the cost function

F = ‖X_A − A Aᵀ X_A (C Cᵀ ⊗ B Bᵀ)‖²_F,   (12.49)
where ⊗ is the Kronecker matrix product (see Appendix D), by using an alternating
least square optimisation algorithm. The estimate Â of A is then rotated towards
independence based on the algorithm of the two-way method discussed above, see
Eq. (12.41).
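The column-wise Kronecker (Khatri–Rao) product and the cost function (12.49) can be sketched as follows. It is assumed here that X_A is the J × (nK) matricisation with the vertical-level index varying slowest across the frontal slices; the routine names are illustrative.

```python
# Khatri-Rao product and the Tucker3-type cost of Eq. (12.49).
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product of C (K x R) and B (n x R) -> (K*n x R)."""
    K, R = C.shape
    n, _ = B.shape
    return np.einsum('kr,nr->knr', C, B).reshape(K * n, R)

def tucker3_cost(XA, A, B, C):
    """F = || X_A - A A^T X_A (C C^T (x) B B^T) ||_F^2, cf. Eq. (12.49)."""
    P = np.kron(C @ C.T, B @ B.T)          # (K*n x K*n); ordering assumed to match X_A
    fit = A @ (A.T @ XA) @ P
    return np.linalg.norm(XA - fit) ** 2

# Example of the Parafac fit in matricised form, cf. Eq. (12.48):
# XA_fit = A @ khatri_rao(C, B).T
```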
Unkel et al. (2011) applied the method to the NCEP/NCAR geopotential heights
by using 5 components in modes A and C and 6 components in B. The data represent
winter (Nov–Mar) means for the period 1949–2009 with 2.5◦ ×2.5◦ horizontal grid,
north of 20◦ N , sampled over 17 vertical levels, so J = 4176, n = 61 and K = 17.
Figure 12.7 shows the rotated patterns. Patterns (i) and (iii) suggest modes related
to stratospheric activity, showing two different phases of the winter polar vortex
during sudden stratospheric warming (e.g. Hannachi and Iqbal 2019). Pattern (ii)
Fig. 12.7 Rotated three-way independent component analysis of the winter means geopotential
heights from NCEP/NCAR. The order is arbitrary. Adapted from Unkel et al. (2011)
12.7 ICA Generalisation: Independent Subspace Analysis
S(y)_{ijk} = E(Y_i Y_j Y_k)
K(y)_{ijkl} = E(Y_i Y_j Y_k Y_l) − E(Y_i Y_j) E(Y_k Y_l) − E(Y_i Y_k) E(Y_j Y_l) − E(Y_i Y_l) E(Y_j Y_k),   (12.50)
Chapter 13
Kernel EOFs

Abstract This chapter describes a different way to obtain nonlinear EOFs via
kernel EOFs based on kernel methods. The kernel EOF method is based on mapping
the data onto a feature space and helps delineate complex structures. The chapter
discusses various types of transformations to obtain kernel EOFs, with particular
focus on the Gaussian kernel and its application to data from models and reanalyses.
13.1 Background
It has been suggested that the large scale atmospheric flow lies on a nonlinear
manifold due to the nonlinearity involved in the dynamics of the system. Never-
theless, it is always possible to embed this manifold in a high-dimensional linear
space. The system may have two or more substructures, e.g. attractors, that one
would like to identify and separate. This problem belongs to the field of pattern
recognition. Linear spaces have the characteristic that “linear” patterns can be, in
general, identified efficiently using, for example, linear discriminant analysis or
LDA (e.g. McLachlan 2004). In general patterns are characterised by an equation of
the form:
f (x) = 0. (13.1)
For linear patterns, the function f (.) is normally linear up to a constant. Figure 13.1
illustrates a simple example of discrimination where patterns are obtained as the
solution of f(x) = ⟨w, x⟩ + b = 0, where ⟨·, ·⟩ denotes a scalar product in the
linear space.
Given, however, the complex and nonlinear nature involved in weather and
climate, one expects in general the presence of different forms of nonlinearity and
Fig. 13.1 Illustration of linear discriminant analysis in linear spaces permitting a separation
between two sets or groups shown, respectively, by crosses and circles
where patterns or coherent structures in the input space are not easily separable. The
possibility then arises of attempting to embed the data of the system into another
space we label “feature space” where complex relationships can be simplified and
discrimination becomes easier and efficient, e.g. linear, through the application of
a nonlinear transformation φ(.). Figure 13.2 shows an example sketch of the
situation where the (nonlinear) manifold separating two groups in the input space
becomes linear in the feature space. Consider, as an example, and for the sake of
illustration, the case where the input space contains data that are “polynomially”
nonlinear. Then, it is possible that a transformation into the feature space involving
separately all monomials constituting the nonlinear polynomial would lead to
a simplification of the initial complex relationship. For example, if the input
space (x1 , x2 ) is two-dimensional and contains quadratic nonlinearities, then the
five-dimensional space obtained by considering all monomials of degree smaller
than 3, i.e. (x₁, x₂, x₁², x₂², x₁x₂), would disentangle the initial complex
relationships. Figure 13.2 could be an (ideal) example of polynomial nonlinearity
where the feature becomes a linear combination of the coordinates and linear
discriminant analysis, for example, could be applied. In general, however, the
separation will not be that trivial, and the hypersurface separating the different
groups could be nonlinear, but the separation remains feasible. An illustration of
this case is sketched in Fig. 13.3. This could be the case, for instance, when the
nonlinearity is not polynomial, as will be detailed later. It is therefore the objective
Fig. 13.2 Illustration of the effect of the nonlinear transformation φ(.) from the input space
(left) into the feature space (right) for a trivial separation between groups/clusters in the feature
space. Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with
permission
Fig. 13.3 As in Fig. 13.2 but for a curved separation between groups or clusters in the feature
space
of the kernel EOF method (or kernel PCA) to find an embedding or a transformation that
allows such easier discrimination or pattern identification to be revealed. The kernel
method can also be extended to include other related approaches such as principal
oscillation patterns (POPs), see Chap. 6, or maximum covariance analysis (MCA),
see Sect. 14.3.
Kernel PCA is not very well known in atmospheric science since it was first
applied by Richman and Adrianto (2010). These authors applied kernel EOFs to
classify sea level pressure over North America and Europe. They identified two
clusters for each domain in January and also in July. Their study suggests that kernel
PCA captures the essence of the data more accurately than conventional PCA.
Let ξ_t = φ(x_t), t = 1, …, n, denote the images of the data within the feature space. The covariance matrix in that space is

S = (1/n) Σ_{t=1}^{n} ξ_t ξ_tᵀ = (1/n) Σ_{t=1}^{n} φ(x_t) φ(x_t)ᵀ.   (13.2)
Now, as for the covariance matrix S in the input space, an eigenvector v of S with
eigenvalue λ satisfies
nλv = Σ_{t=1}^{n} φ(x_t) (φ(x_t)ᵀ v),   (13.3)

and therefore, any eigenvector must lie in the subspace spanned by φ(x_t),
t = 1, …, n. Denoting α_t = φ(x_t)ᵀ v and K = (K_{ij}), where K_{ij} = ξ_iᵀ ξ_j =
φ(x_i)ᵀ φ(x_j), an eigenvector v of S, with non-zero eigenvalue λ, is found by solving
the following eigenvalue problem:

Kα = nλα.   (13.4)
which is precisely (1/n) Σ_{t=1}^{n} nλ α_t ξ_t = λv.
Exercise Starting from v = Σ_{s=1}^{n} α_s φ(x_s), show, without using (13.4), that α =
(α_1, …, α_n)ᵀ is an eigenvector of K.

Answer Since v is an eigenvector of S, then Σ_{t=1}^{n} φ(x_t) φ(x_t)ᵀ v = nλv, yielding,
after a little algebra, Σ_k Σ_s α_s K_{ks} φ(x_k) = nλ Σ_k α_k φ(x_k). This equality is valid
for any eigenvector (one can also assume the vectors φ(x_k), k = 1, …, n, to be
independent), and thus we get Σ_s α_s K_{ks} = nλ α_k, which is precisely (13.4).
Denoting the kth EOF by v_k = Σ_{t=1}^{n} α_t^{(k)} φ(x_t), the PCs in the feature space are
computed using the dot product

φ(x)ᵀ v_k = Σ_{t=1}^{n} α_t^{(k)} (φ(x)ᵀ φ(x_t)).   (13.5)
The main problem with this expression is that, in general, the nonlinear mapping
φ() lies in a very high-dimensional space, and the computation is very expensive
in terms of memory and CPU time. For example, if one considers a polynomial
transformation of degree q, i.e. the mapping φ() contains all the monomials of
order q, then we get a feature space of dimension (p+q−1)! / (q!(p−1)!) ∝ p^q.
So, rather than choosing φ() first, one chooses K instead, which then yields φ()
using (13.6). Of course, many kernels exist, but not all of them satisfy Eq. (13.6).
Let us consider a positive self-adjoint (Appendix F) integral operator K() defined
for every square integrable function over the p-dimensional space Rp , i.e.
K(f) = ∫_{R^p} K(x, y) f(y) dy,   (13.7)
where the symbol <, > denotes a scalar product defined on the space of integrable
functions defined on Rp . Then, using Mercer’s theorem, see e.g. Mikhlin (1964)
and Moiseiwitsch (1977), and solving for the eigenfunctions of the linear integral
operator (13.7), K(f ) = λf , i.e.
∫_{R^p} K(x, y) f(y) dy = λ f(x),   (13.9)
1 Note that the convergence in (13.10) is in the space of square integrable functions, i.e.
lim_{k→∞} ∫∫ |K(x, y) − Σ_{i=1}^{k} λ_i φ_i(x) φ_i(y)|² dx dy = 0.
• The Gaussian kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) ).
• Other examples include K(x, y) = exp( (xᵀy)² / (2a²) ) and K(x, y) = tanh( α xᵀy + β ).

In these examples the vectors φ(x) are infinite-dimensional.
One of the main benefits of using the kernel transformation is that the kernel
K(.) avoids computing the large (and possibly infinite) covariance matrix S, see
Eq. (13.2), and permits instead the computation of the n × n matrix K = (K_{ij}) and
the associated eigenvalues/eigenvectors λ_k and α_k, k = 1, …, n. The kth extracted
PC of a target point x from the input data space is then obtained using the expression:

v_kᵀ φ(x) = Σ_{t=1}^{n} α_{k,t} K(x_t, x).   (13.13)
This means that the computation is not performed in the very high-dimensional
feature space, but rather in the lower n-dimensional space spanned by the images of
xt , t = 1, . . . , n.
Remark One sees that in ordinary PCA, one obtains min(n, p) patterns, whereas in
kernel PCA one can obtain up to max(n, p) patterns.
The computation of kernel EOFs/PCs within the feature space is based on the
assumption that the transformed data φ(x1 ), . . . , φ(xn ) have zero mean, i.e.
(1/n) Σ_{t=1}^{n} φ(x_t) = 0,   (13.14)

which is not guaranteed in general. The kernel matrix is therefore computed from the centred images, i.e.

K̃_{ij} = ( φ(x_i) − (1/n) Σ_{t=1}^{n} φ(x_t) )ᵀ ( φ(x_j) − (1/n) Σ_{t=1}^{n} φ(x_t) ),   (13.15)

which in matrix form reads

K̃ = K − (1/n) 1_{n×n} K − (1/n) K 1_{n×n} + (1/n²) 1_{n×n} K 1_{n×n},   (13.16)

where 1_{n×n} is the n × n matrix whose entries are all equal to 1.
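A minimal Gaussian kernel PCA sketch, following Eqs. (13.4), (13.13) and the centring formula (13.16), is given below. The normalisation of the coefficient vectors so that the corresponding feature-space EOFs have unit norm is one common convention; all names and the choice of kernel width are illustrative.

```python
# Gaussian kernel EOFs/PCs: X is an (n, p) data matrix, sigma2 the kernel width.
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma2))

def kernel_pca(X, sigma2, n_components=2):
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, sigma2)
    One = np.ones((n, n)) / n
    Kc = K - One @ K - K @ One + One @ K @ One             # centring, Eq. (13.16)
    lam, alpha = np.linalg.eigh(Kc)                        # ascending eigenvalues
    lam, alpha = lam[::-1], alpha[:, ::-1]                 # sort descending
    # scale alpha_k so that the feature-space EOF v_k has unit norm
    # (lam_k here is the eigenvalue of Kc, i.e. n*lambda in the book's convention)
    alpha = alpha[:, :n_components] / np.sqrt(np.maximum(lam[:n_components], 1e-12))
    kpcs = Kc @ alpha                                      # kernel PCs, cf. Eq. (13.13)
    return lam[:n_components], alpha, kpcs
```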
Fig. 13.4 First two coordinates of a scatter plot of three concentric clusters (a), the same scatter
projected onto the leading two PCs (b), and the leading two (Gaussian) kernel PCs (c), and the data
PDF within the leading two kernel PCs (d). Adapted from Hannachi and Iqbal (2019). ©American
Meteorological Society. Used with permission
(Fig. 13.4c). Figure 13.4c is obtained with σ² = 2.3, but the structure is quite robust
to a wide range of the smoothing parameter σ. We note here that non-local kernels
cannot discriminate between these structures, pointing to the importance of the local
character of the Gaussian kernel in this regard. The kernel PDF of the data within
the space spanned by kernel PC1 and kernel PC2 is shown in Fig. 13.4d. Note, in
particular, the curved shape of the two outer clusters (Figs. 13.4c,d), reflecting the
nonlinearity of the corresponding manifold.
To compare the above result with the performance of polynomial kernels, we now
apply the same procedure to the above concentric clusters using two polynomial
kernels of respective degrees 4 and 9. Figure 13.5 shows the obtained results. The
polynomial kernels fail drastically to identify the clusters. Instead, it seems that the
polynomial kernels attempt to project the data onto the outer sphere resulting in the
clusters being confounded. Higher polynomial degrees have also been applied with
no improvement.
Fig. 13.5 Same as Fig. 13.4c, but with polynomial kernels of degree 4 (a) and 9 (b). Adapted from
Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission
The kernel EOF analysis is closely related to what is known as spectral clustering
encountered, for example, in pattern recognition. Spectral clustering has its roots
in the spectral analysis of linear operators through the construction of a set
of orthonormal bases used to decompose the spatio-temporal field at hand. This
basis is provided precisely by a set of orthonormal eigenfunctions of the Laplace–
Beltrami differential operator, using the diffusion map algorithm (e.g. Coifman and
Lafon 2006; Belkin and Niyogi 2003; Nadler et al. 2006). Spectral clustering (e.g.
Shi and Malik 2000) is based on using a similarity matrix S = (Sij ), where Sij
is the similarity, such as spatial correlation, between states xi and xj . There are
various versions of spectral clustering such as the normalised and non-normalised
(Shi and Malik 2000). The normalised version of spectral clustering considers the
(normalised) Laplacian matrix:
with E the matrix whose rows are e_1, …, e_k, and

P = D⁻¹M = I − D⁻¹S.
where B = A − (1/(2m)) k kᵀ is the modularity matrix. The modularity is a measure
reflecting the extent to which nodes in a graph are connected to those of their own
groups, reflecting thus the presence of clusters known as communities.
The objective then is to maximise the modularity. The symmetric modularity
matrix has the vector 1 (containing only ones) with associated zero eigenvalue, as
in the Laplacian matrix in spectral clustering. By expanding the vector s in terms of
the eigenvectors u_j of B, s = Σ_j (sᵀu_j) u_j, the modularity reduces to

Q = Σ_j (u_jᵀ s)² λ_j.

For a given group G of nodes the modularity takes the form

Q = sᵀ B^{(G)} s,

with the elements of the new n_G × n_G modularity matrix B^{(G)} given by B^{(G)}_{ij} =
B_{ij} − δ_{ij} Σ_{k∈G} B_{ik}, with δ_{ij} being the Kronecker delta. The procedure is then
repeated by maximising the modularity using Q.
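A minimal sketch of one modularity-based bisection step is given below, assuming an undirected, unweighted adjacency matrix A; the split follows the sign of the leading eigenvector of the modularity matrix, and Q = sᵀBs/(4m) is the usual normalisation.

```python
# One bisection step of modularity-based community detection.
import numpy as np

def modularity_bisection(A):
    k = A.sum(axis=1)                       # node degrees
    m = k.sum() / 2.0                       # number of edges
    B = A - np.outer(k, k) / (2.0 * m)      # modularity matrix
    vals, vecs = np.linalg.eigh(B)
    u1 = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    s = np.where(u1 >= 0, 1.0, -1.0)        # community indicator (+1 / -1)
    Q = s @ B @ s / (4.0 * m)               # modularity of the split
    return s, Q
```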
13.4 Pre-images in Kernel PCA

The construction of the feature space and the computation of the EOF patterns and
associated PCs within it are a big leap in the analysis of the system producing
the data. What we need next is, of course, to be able to examine the associated
patterns in the input space. In many cases, therefore, one needs to be able to map
back, i.e. perform an “inverse mapping”, onto the input space. The inverse mapping
terminology is taken here in a rather broad sense, not literally, as the mapping may
not be one-to-one. Let us first examine the case of the standard (linear) EOF analysis.
In standard PCA the data x_t are expanded as

x_t = Σ_{k=1}^{p} (x_tᵀ u_k) u_k = Σ_{k=1}^{p} c_{tk} u_k,   (13.20)

where u_k is the kth EOF and c_{tk} is the kth PC of x_t. Since this is basically a linear
projection, given a point x in the EOF space for which only the l leading coordinates
β_1, …, β_l (l ≤ p) are observed, the pre-image is simply obtained as

x* = Σ_{k=1}^{l} β_k u_k.   (13.21)
The above expression minimises ‖x − x*‖², i.e. the quadratic distance to the exact
point. Of course, if all the EOFs are used, then the distance is zero and the pre-image
is exact (see Chap. 3).
Now, because the mapping may not be invertible,2 the reconstruction of the
patterns back in the p-dimensional input data space can only be achieved approx-
imately (or numerically). Let us assume again a nonlinear transformation φ(.)
mapping the input (physical) space onto the (high-dimensional) feature space F,
where the covariance matrix is obtained using Eq. (13.2). Let vk be the kth EOF,
within the feature space, with associated eigenvalue λk , i.e.
Svk = λk vk . (13.22)
Like for the standard (linear) EOF analysis, and as pointed out above, the
EOFs are combinations of the data; that is, they lie within the space spanned by
φ(x_1), …, φ(x_n), e.g.

v_k = Σ_{t=1}^{n} a_{kt} φ(x_t).   (13.23)
It can be seen, after inserting (13.23) back into (13.22), see also Eq. (13.3), that
the vector a_k = (a_{k1}, …, a_{kn})ᵀ is an eigenvector of the matrix K = (K_{ij}), where
K_{ij} = K(x_i, x_j) = φ(x_i)ᵀ φ(x_j), i.e.

K a_k = n λ_k a_k.   (13.24)
Now, given a point x in the input space, the kernel PC of x in the feature space is the
usual PC of φ(x) within this feature space. Hence, the kth kernel PC is given by
β_k(x) = φ(x)ᵀ v_k = Σ_{t=1}^{n} a_{kt} K(x, x_t).   (13.25)
In matrix form this reads V = ΦA, where Φ = (φ(x_1), …, φ(x_n)) and A = (a_1, …, a_n).
Consider now a vector w of the feature space expressed in terms of the leading m EOFs,

w = Σ_{k=1}^{m} β_k v_k,   (13.26)
where β1 , . . . , βm are scalar coefficients, and we would like to find a point x from
the input space that maps onto w in Eq. (13.26). Following Schölkopf et al. (1999),
one attempts to find the input x from the input space such that its image φ(x)
approximates w through maximising the ratio r:
r = (wᵀ φ(x))² / (φ(x)ᵀ φ(x)).   (13.27)
Precisely, the problem can be solved approximately through a least square minimi-
sation:
x* = argmin_x J = ‖φ(x) − Σ_{k=1}^{m} β_k v_k‖².   (13.28)
To solve Eq. (13.28), one makes use of Eq. (13.23) and expresses w as
w = Σ_{t=1}^{n} ( Σ_{k=1}^{m} β_k a_{kt} ) φ(x_t) = Σ_{t=1}^{n} α_t φ(x_t),   (13.29)

with α_t = Σ_{k=1}^{m} β_k a_{kt}, which is then inserted into (13.28). Using the property of the
kernel (kernel trick), one gets

‖φ(x) − w‖² = K(x, x) − 2 Σ_{t=1}^{n} α_t K(x, x_t) + c,   (13.30)
If the kernel is isotropic, i.e. K(x, y) = H(‖x − y‖²), then the gradient of the error
squared (13.30) is easy to compute, and the necessary condition for the minimum
of (13.30) is given by

∇_x J = Σ_{t=1}^{n} α_t (dH/du)(‖x − x_t‖²) (x − x_t) = 0,   (13.31)

that is,

x = ( Σ_{t=1}^{n} α_t (dH/du)(‖x − x_t‖²) x_t ) / ( Σ_{t=1}^{n} α_t (dH/du)(‖x − x_t‖²) ).   (13.32)
The above equation can be solved using the fixed point algorithm via the iterative
scheme:

x^{(m+1)} = ( Σ_{t=1}^{n} α_t (dH/du)(‖x^{(m)} − x_t‖²) x_t ) / ( Σ_{t=1}^{n} α_t (dH/du)(‖x^{(m)} − x_t‖²) ).

For example, in the case of the Gaussian kernel K(x, y) = exp(−‖x − y‖²/(2σ²)),
we get the iterative scheme:

z^{(m+1)} = ( Σ_{t=1}^{n} α_t exp(−‖z^{(m)} − x_t‖²/(2σ²)) x_t ) / ( Σ_{t=1}^{n} α_t exp(−‖z^{(m)} − x_t‖²/(2σ²)) ),   (13.33)
which can be used to find an optimal solution x∗ by taking x∗ ≈ z(m) for large
enough m. Schölkopf et al. (1998, 1999) show that kernel PCA outperforms ordinary
PCA. This can be understood since kernel PCA can include higher order moments,
unlike PCA where only second-order moments are used. Note that PCA corresponds
to kernel PCA with φ(.) being the identity.
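A minimal sketch of the fixed-point pre-image iteration (13.33) for the Gaussian kernel might look as follows; X holds the training points, alpha the expansion coefficients of the target feature-space vector, and the initial guess and stopping rule are illustrative choices.

```python
# Pre-image of a feature-space vector for the Gaussian kernel, via Eq. (13.33).
import numpy as np

def gaussian_preimage(X, alpha, sigma2, z0, n_iter=100, tol=1e-10):
    """X: (n, p) training points; alpha: (n,) coefficients; z0: (p,) initial guess."""
    z = z0.copy()
    for _ in range(n_iter):
        d2 = ((X - z) ** 2).sum(axis=1)             # ||z - x_t||^2
        w = alpha * np.exp(-d2 / (2.0 * sigma2))    # weights alpha_t k(z, x_t)
        denom = w.sum()
        if abs(denom) < 1e-30:                      # degenerate case: stop
            break
        z_new = (w[:, None] * X).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:         # converged
            z = z_new
            break
        z = z_new
    return z
```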
∂q_1/∂t = −J(ψ_1, q_1) + D_1(ψ_1, ψ_2) + S_1
∂q_2/∂t = −J(ψ_2, q_2) + D_2(ψ_1, ψ_2, ψ_3) + S_2   (13.34)
∂q_3/∂t = −J(ψ_3, q_3) + D_3(ψ_2, ψ_3) + S_3,

where

q_1 = ∇²ψ_1 − R_1^{-2}(ψ_1 − ψ_2) + f
q_2 = ∇²ψ_2 + R_1^{-2}(ψ_1 − ψ_2) − R_2^{-2}(ψ_2 − ψ_3) + f   (13.35)
q_3 = ∇²ψ_3 + R_2^{-2}(ψ_2 − ψ_3) + f(1 + h/H_0).
The model is solved on the sphere. The forcing terms S_i, i = 1, 2, 3, are calculated in such a way that the
January climatology of the National Center for Environmental Prediction/National
Center for Atmospheric Research (NCEP/NCAR) streamfunction fields at 200 hPa,
500 hPa and 800 hPa levels is a stationary solution of the system, Eq. (13.34). The
term h in Eq. (13.35) represents the real topography of the Earth in the northern
hemisphere (NH). The model is spectral with a triangular truncation T21 resolution,
i.e. 32 × 64 lat × lon grid resolution. The model is symmetrical with respect to the
equator, leading to slightly more than 3000 grid points or 693 spherical harmonics
(or spectral degrees of freedom). The model was shown by many authors to simulate
faithfully the main dynamical processes in the extratropics, see Hannachi and Iqbal
(2019) for more details and references. The point here is to show that the model
reveals nonlinearity when analysed using kernel EOFs and is therefore consistent
with conceptual low-order chaotic models such as the case of the Lorenz (1963)
model (Fig. 13.6).
The model run is described in Hannachi and Iqbal (2019) and consists of
a one million-day trajectory. The averaged flow tendencies are computed within
the PC space and compared to those obtained from the kernel PC space. Mean
flow tendencies have been applied in a number of studies using the leading
modes of variability to reveal possible signature of nonlinearity, see e.g. Hannachi
(1997), Branstator and Berner (2005) and Franzke et al. (2007). For example,
using a simple (toy) stochastic climate model, Franzke et al. (2007) found that
the interaction between the resolved planetary waves and the unresolved waves is
mainly responsible for the nonlinearity. Figure 13.7a shows an example of the
mean tendencies of the mid-level (500-hPa) streamfunction within the PC1-PC5
state space. The flow tendencies (Fig. 13.7a) reveal clear nonlinearities, which can
be identified by examining both the tendencies and their amplitudes. Note that
Fig. 13.6 PDF of a long simulation of the Lorenz (1963) model shown by shaded and solid
contours within the (x, z) plane (a), and the flow tendencies plotted in terms of magnitude (shaded)
and direction (normalised vectors) within the same (x, z) plane (b). A chunk of the model trajectory
is also shown in both panels along with the fixed points. Note that the variables are scaled by 10,
and the value z0 = 25.06 of the fixed point is subtracted from z. Adapted from Hannachi and Iqbal
(2019). ©American Meteorological Society. Used with permission
in linear dynamics the tendencies are antisymmetric with respect to the origin,
and the tendency amplitudes are normally elliptical, as shown in Fig. 13.7b. The
linear model is fitted to the trajectory within the two-dimensional PC space using
a first-order autoregressive model, as explained in Chap. 6. The departure of the
linear tendencies from the total tendencies in Fig. 13.7c reveals two singular (or
fixed) points representing (quasi-)stationary states. The PDF of the system trajectory
is shown in Fig. 13.7d and is clearly unimodal, which is not consistent with the
conceptual low-order chaotic models such as Fig. 13.6.
The same procedure can be applied to the trajectory within the leading kernel
PCs. Figure 13.8a shows the departure of the tendencies from the linear component
within the kernel PC1/PC4 space and reveals again two fixed points. Figure 13.8b
shows the PDF of the mid-level streamfunction within kernel PC1/PC4 state space.
In agreement with low-order conceptual models, e.g. Fig. 13.6, the figure now
reveals strong bimodality, where the modes correspond precisely to regions of low
tendencies. Figure 13.9 displays the two circulation flows corresponding to the PDF
modes of Fig. 13.8 showing the anomalies (top) and the total (bottom) flows. The
first stationary state shows a low over the North Pacific associated with a dipole over
the North Atlantic reflecting the negative NAO phase (Woollings et al. 2010). The
second anomalous stationary solution represents approximately the opposite phase,
with a high pressure over the North Pacific associated with an approximate positive
Fig. 13.7 (a) Total flow tendency of the mid-level streamfunction within the conventional PC1/PC5
state space. (b) Linear tendency based on a first-order autoregressive model fitted to the same
data. (c) Difference between the two tendencies of (a) and (b) showing the departure of the total
tendencies from the linear part. (d) Kernel PDF of the same data within the same two-dimensional
state space. Adapted from Hannachi and Iqbal (2019)
Fig. 13.8 As in Fig. 13.7c(a) and 13.7d(b), but for the kernel PC1/PC4. Adapted from Hannachi
and Iqbal (2019). ©American Meteorological Society. Used with permission
NAO phase. In both cases the anomalies over the North Atlantic are shifted slightly
poleward compared to the NAO counterparts.
Fig. 13.9 Anomalies (a,b) and total (c,d) flows of mid-level streamfunction field obtained by
compositing over states within the neighbourhood of the modes of the bimodal PDF. Contour
interval 29.8 × 10^8 m²/s (top) and 29.8 × 10^6 m²/s (bottom). Adapted from Hannachi and Iqbal
(2019). ©American Meteorological Society. Used with permission
The total flow of the stationary solutions, obtained by adding the climatology
to the anomalous stationary states, is shown in the bottom panels of Fig. 13.9. The
first solution shows a ridge over the western coast of North America associated
with a diffluent flow over the North Atlantic with a strong ridge over the eastern
North Atlantic. This latter flow is reminiscent of a blocked flow over the North
Atlantic. Note the stronger North Atlantic ridge compared to that of the western
314 13 Kernel EOFs
North American continent. The second stationary state (Fig. 13.9) shows a clear
zonal flow over both basins.
In the next example, kernel PCs are applied to reanalyses. The data used in
this example consist of sea level pressure (SLP) anomalies from the Japanese
Reanalyses, JRA-55 (Harada et al. 2016; Kobayashi et al. 2015). The anomalies
are obtained by removing the mean daily annual cycle from the SLP field and by
keeping unfiltered winter (December–January–February, DJF) daily anomalies over
the northern hemisphere. The kernel PCs of daily SLP anomalies are computed, and
the PDF is estimated.
Figure 13.10a shows the daily PDF of SLP anomalies over the NH poleward
of 27◦ N using KPC1/KPC7. Strong bimodality stands out from this PDF. To
characterise the flow corresponding to the two modes, a composite analysis is
performed by compositing over the points within the neighbourhood of the two
modes A and B. The left mode (Fig. 13.11a) shows a polar high stretching south
over Greenland accompanied by a low pressure system over the midlatitude
North Atlantic stretching from eastern North America to most of Europe and
the Mediterranean. This circulation regime projects strongly onto the negative
NAO. The second mode (Fig. 13.11b) shows a polar low with high pressure over
midlatitude North Atlantic, with a small high pressure over the northern North West
Pacific, and projects onto positive NAO. The regimes are not exactly mirror images
of each other; regime A is stronger than regime B. Hannachi and Iqbal (2019)
also examined the hemispheric 500-hPa geopotential height. Their two-dimensional
PDFs (not shown) reveal again strong bimodality associated, respectively, with polar
low and polar high. The mode associated with the polar high is stronger however.
Fig. 13.10 (a) Kernel PDF of the daily winter JRA-55 SLP anomalies within the kernel PC1/PC7
state space. (b) Difference between the PDFs of winter daily SLP anomalies of the first and second
halves of the JRA-55 record. Adapted from Hannachi and Iqbal (2019)
Fig. 13.11 SLP anomaly composites over states close to the left (a) and right (b) modes of the
PDF shown in Fig. 13.10. Units hPa. Adapted from Hannachi and Iqbal (2019).
Table 13.1 Correlation coefficients between the 10 leading KPCs and PCs for JRA-55 SLP
anomalies. The R-square of the regression between each individual KPC and the leading 10 PCs
is also shown. Correlations larger than 0.2 are shown in bold face
KPC1 KPC2 KPC3 KPC4 KPC5 KPC6 KPC7 KPC8 KPC9 KPC10
PC1 0.928 0.008 −0.031 −0.032 −0.090 −0.017 0.009 −0.017 −0.015 −0.030
PC2 0.031 −0.904 0.070 −0.037 0.251 0.006 −0.015 −0.054 0.063 0.003
PC3 −0.020 −0.002 0.561 −0.464 −0.271 0.471 −0.029 0.146 0.117 0.037
PC4 −0.035 −0.172 −0.287 0.400 −0.453 0.543 0.210 −0.108 −0.204 0.012
PC5 0.097 0.168 0.359 0.384 0.554 0.365 −0.043 −0.303 −0.050 0.084
PC6 0.011 0.079 −0.074 0.071 0.191 0.134 0.499 0.291 0.386 −0.489
PC7 0.027 −0.009 −0.104 −0.021 0.261 0.178 −0.091 0.615 −0.469 0.092
PC8 −0.018 −0.027 −0.079 0.072 −0.033 0.146 −0.644 0.120 0.032 −0.537
PC9 0.036 0.002 −0.129 0.155 0.052 0.079 −0.108 0.293 0.485 0.431
PC10 −0.003 −0.003 −0.040 0.072 0.042 0.097 −0.060 0.163 0.115 −0.052
R-square 0.88 0.88 0.57 0.57 0.77 0.74 0.73 0.72 0.68 0.73
The above results, derived from the quasi-geostrophic model and also from reanalyses,
clearly show that the leading KPCs, like the leading PCs, reflect large scale structure
and hence can explain a substantial amount of variance. This explained variance is
already there in the feature space but is not clear in the original space. Table 13.1
shows the correlations between the leading 10 KPCs and leading 10 PCs of the sea
level pressure anomalies, along with the R-square obtained from multiple regression
between each KPC and the leading 10 PCs. It is clear that these KPCs are large scale
and also explain a substantial amount of variance.
The kernel PCs can also be used to check for any change in the signal over
the reanalysis period. An example of this change is shown in Fig. 13.10b, which
represents the change in the PDF between the first and last halves of the data. A
clear regime shift is observed with a large decrease (increase) of the frequency of
the polar high (low) between the two periods. This change projects onto the +NAO
(and +AO) and is consistent with an increase in the probability of the zonal wind
speed over the midlatitudes.
S = (1/n) Zᵀ Z = (1/n) Σ_{s=1}^{n−M+1} ϕ_s ϕ_sᵀ.   (13.37)
By applying the same argument as for kernel EOFs, any eigenvector of S is a linear
combination of ϕ t , t = 1, . . . , n − M + 1, i.e.
v = Σ_{t=1}^{n−M+1} α_t ϕ_t.   (13.38)
Kα = nλα,   (13.40)

where K = (K_{ij}) with K_{ij} = Σ_{k=0}^{M−1} K_{i+k, j+k}. The vector α represents the (kernel)
extended PC in the feature space.
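Building the lagged kernel matrix of Eq. (13.40) from the ordinary kernel matrix is straightforward; a minimal sketch, with illustrative names, is given below.

```python
# Lagged (extended) kernel matrix: sum of diagonally shifted blocks of K.
import numpy as np

def extended_kernel_matrix(K, M):
    """K: (n, n) kernel matrix of the snapshots; M: embedding window length."""
    n = K.shape[0]
    q = n - M + 1
    Ke = np.zeros((q, q))
    for k in range(M):
        Ke += K[k:k + q, k:k + q]      # K_{i+k, j+k} summed over k = 0, ..., M-1
    return Ke
```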
The reconstruction within the feature space can be done in a similar fashion to
the standard extended EOFs (Chap. 7, Sect. 7.5.3), and the transformation back to
the input space can again be used as in Sect. 13.4, but this is not expanded further
here.
Alternative Formulations
As described in Chap. 6, the POP analysis is based on a linear model, the first
order autoregressive model AR(1). Although POPs were quite successful in various
climate applications, the possibility remains that, as we discussed in the previous
sections, the existence of nonlinearity can hinder the validity of the linear model.
The POP, or AR(1), model can be defended when nonlinearity is weak. If, however,
we think or we have evidence that nonlinearity is important and cannot be neglected,
then one solution is to use the kernel transformation and get kernel POPs.
Since the formulation of POPs involves inverting the covariance matrix, and to
avoid this complication in the feature space, a simple way is to apply POPs using
kernel PCs by selecting, say, the leading N KPCs. The computation, as it turns out,
becomes greatly simplified as the KPCs are uncorrelated. Like for kernel extended
EOFs, patterns obtained from the POP analysis are expressed as a combination of
kernel EOFs. The transformation back to the input space can be obtained using again
the fixed point algorithm.
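A minimal sketch of such a kernel POP analysis, fitting an AR(1) propagator to the leading kernel PCs and eigen-decomposing it, is given below; the lag-covariance estimators and names are simple illustrative choices.

```python
# POP (AR(1)) analysis applied to the leading kernel PCs.
import numpy as np

def kernel_pops(Z):
    """Z: (n, N) matrix of the leading N kernel PC time series."""
    Z = Z - Z.mean(axis=0)
    n = Z.shape[0]
    C0 = Z[:-1].T @ Z[:-1] / (n - 1)    # lag-0 covariance (nearly diagonal for KPCs)
    C1 = Z[1:].T @ Z[:-1] / (n - 1)     # lag-1 covariance
    B = C1 @ np.linalg.inv(C0)          # AR(1) propagator
    eigvals, eigvecs = np.linalg.eig(B) # possibly complex POPs
    return eigvals, eigvecs
```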
Chapter 14
Functional and Regularised EOFs
Abstract Weather and climate data are in general discrete and result from sampling
a continuous system. This chapter attempts to take this into account when computing
EOFs. The first part of the chapter describes methods to construct EOFs/PCs of
profiles with application to oceanography. The second part of the chapter describes
regularised EOFs with application to reanalysis data.
Functional EOF/PC analysis (e.g. Ramsay and Silverman 2006; Jolliffe and Cadima
2016) is concerned with EOF/PC analysis applied to data which consist of curves
or surfaces. In atmospheric science, for example, the observations consist of
spatio-temporal fields that represent a discrete sampling of continuous variables,
such as pressure or temperature at a set of finite grid points. In a number of cases
methods of coupled patterns (Chap. 15) can be applied to single fields simply by
assuming that the two fields are identical. For instance, functional and smooth EOFs
correspond to smooth maximum covariance analysis (MCA), as given by Eq. (15.30)
in Chap. 15, when the left and right fields are identical. Precisely, we suppose that
we are given a sample of n curves that constitute the coordinates of a vector curve
x(t) = (x_1(t), …, x_n(t))ᵀ, with zero mean, i.e. Σ_{k=1}^{n} x_k(t) = 0 for all values of t.
The covariance function is given by
S(s, t) = (1/(n−1)) xᵀ(t) x(s),   (14.1)
where t and s are in the domain of definition of the curves. The question
then is to find smooth functions (EOFs) a(t) maximising ⟨a, Sa⟩ =
∫∫ S(s, t) a(s) a(t) ds dt subject to a normalisation constraint of the type
⟨a, a⟩ + α ⟨D²a, D²a⟩ − 1 = 0. The solution to this problem is then given
by the following integro-differential equation:

∫ S(t, s) a(s) ds = μ (1 + αD⁴) a(t).   (14.2)
Consider first the case of functional EOFs, that is when α = 0, which yields a
homogeneous Fredholm equation of the second kind. We suppose that the curves
can be expanded in terms of a number of basis functions φ_1(), …, φ_p() so that
x_i(t) = Σ_{k=1}^{p} λ_{i,k} φ_k(t), i = 1, …, n. In vector form this can be written as x(t) =
Λφ(t), where we let Λ = (λ_{ij}). The covariance function becomes

S(t, s) = (1/(n−1)) φᵀ(t) Λᵀ Λ φ(s).   (14.3)
Assuming that the functional EOF a(t) is expanded using the basis functions as
a(t) = Σ_k a_k φ_k(t) = φᵀ(t) a, the above integro-differential equation yields the
following system:

φᵀ(t) Λᵀ Λ W a = μ φᵀ(t) a,   (14.4)

where W = (n−1)⁻¹ ⟨φ(s), φᵀ(s)⟩. This equality has to be satisfied for all
t in its domain of definition. Hence the solution a is given by the solution to the
eigenvalue problem:

Λᵀ Λ W a = μ a.   (14.5)
Exercise
1. Derive the above eigenvalue problem satisfied by the vector coefficients a =
(a1 , . . . , ap )T .
2. Now we can formulate the problem by going back to the original problem as we
did above. Write the generalised eigenvalue problem using V = (⟨φ_i, Sφ_j⟩)
and A = (⟨φ_i, φ_j⟩).
3. Are the two equations similar? Explain your answer.
Given that the covariance-like matrix W is symmetric and semi-definite positive,
one can compute its square root, or alternatively, one can use its Cholesky
decomposition to transform the previous equation into a symmetric eigenvalue
problem by multiplying both sides by the square root of W.
Exercise
1. Show that W is symmetric.
2. Show that aᵀ W b = (1/(n−1)) ∫ (aᵀφ(t)) (bᵀφ(t)) dt.
3. Deduce that W is semi-definite positive. Hint – W_{ij} = (n−1)⁻¹ ∫ φ_i(t) φ_j(t) dt = W_{ji}.
The above section presents functional EOFs applied to a finite number of curves
x1 (t), . . . , xn (t), for t varying in a specified interval. The parameter may represent
time or a conventional or curvilinear coordinate, e.g. height. In practice, however,
continuous curves or profiles are not commonly observed, but can be obtained from
a set of discrete values at a regular grid.1 To construct continuous curves or profiles
from these samples a linear combination of a number of basis functions can be used
as outlined above. Examples of basis functions commonly used include radial basis
functions and splines (Appendix A).
The profile xi (t) is projected onto the basis φk (t), k = 1, . . . K, as
x_i(t) = Σ_{k=1}^{K} λ_{i,k} φ_k(t).   (14.6)
The functional PCs are then given by solving the eigenvalue problem (Eq. (14.5)).
The problem is normally solved in two steps. First, the coefficients λi,k , i =
1, . . . n, k = 1, . . . K, are obtained from Eq. (14.6) using for example least squares
estimation. The matrix Λ = (λ_{i,k}) is then used as data matrix, and an SVD procedure
can be applied to get the eigenvectors (functional PCs) of the covariance matrix
of Λ.
1 When the sampling is not regular an interpolation can be applied to obtain regular sampling.
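A minimal sketch of this two-step procedure is given below, using Gaussian radial basis functions for illustration (B-splines, as in the oceanographic example that follows, could equally be used); all names and the choice of basis width are illustrative.

```python
# Functional PCs in two steps: (i) least-squares basis coefficients, (ii) SVD of them.
import numpy as np

def rbf_basis(t, centers, width):
    """Gaussian radial basis functions evaluated at the sampling points t."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def functional_pcs(X, t, n_basis=20):
    """X: (n_profiles, n_levels) discretely sampled profiles; t: (n_levels,) sample points."""
    centers = np.linspace(t.min(), t.max(), n_basis)
    width = (t.max() - t.min()) / n_basis
    Phi = rbf_basis(t, centers, width)                   # (n_levels, n_basis)
    Lam, *_ = np.linalg.lstsq(Phi, X.T, rcond=None)      # coefficients of Eq. (14.6)
    Lam = Lam.T - Lam.T.mean(axis=0)                     # centred (n_profiles, n_basis)
    U, s, Vt = np.linalg.svd(Lam, full_matrices=False)   # EOFs of the coefficient matrix
    return Vt, U * s                                     # basis-space EOFs and PC scores
```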
Fig. 14.1 An example of two vertical profiles of temperature (left), salinity (middle) constructed
using 20 B-splines (right). The dots represent the measurements. Adapted from Pauthenet (2018)
14.3 An Example of Functional PCs from Oceanography 323
Fig. 14.2 MIMOC global ocean temperature at 340 dbar (a), salinity at 750 dbar (c) climatology,
along with the spatial distribution of functional PC1 (b), PC2 (d) and PC3 (f). Panel (e) shows the
low- and high-salinity intermediate water mass (Courtesy of Talley 2008). Adapted from Pauthenet
(2018)
Fig. 14.3 The leading four vertical mode or PCs of temperature and salinity along with their
percentage explained variance. Individual explained variance is also shown for T and S separately.
The mean profile (solid) is shown along with the envelope of the corresponding PCs. (a) PC1
(72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4 (1.35%). Adapted from Pauthenet et al.
(2017). ©American Meteorological Society. Used with permission
We consider here the more general problem of regularised (or smooth) EOFs as
discussed in Hannachi (2016). As for the smoothed MCA (see Chap. 15), we let
V = (V_{ij}) = (⟨φ_i, Sφ_j⟩), A = (⟨φ_i, φ_j⟩) and Ψ = (Ψ_{ij}) = (⟨D²φ_i, D²φ_j⟩); the vector a is
then obtained by solving the generalised eigenvalue problem:

Va = μ (A + αΨ) a.
Remark Note that we used the original maximisation problem and not the previous
integro-differential equation.
It can be seen that the associated eigenvalues are real non-negative, and the
eigenvectors are real. The only parameter required to solve this problem is the
Fig. 14.4 Spatial structure of the leading four functional PCs of the vertical profiles of the 2007
annual mean temperature and salinity. Data are taken from the Southern Ocean State Estimate
(SOSE, Mazloff et al. 2010). (a) PC1 (72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4
(1.35%). Adapted from Pauthenet et al. (2017). ©American Meteorological Society. Used with
permission
Suppose, for example, that the curves x(t) = (x_1(t), …, x_n(t)) are observed at discrete times
t = tk , k = 1, . . . , p. The interpolation, or smoothing, using radial basis functions
is similar to, but simpler than the B-spline smoothing and is given by
x_i(t) = Σ_{k=1}^{p} λ_{i,k} φ(|t − t_k|).   (14.8)
Perhaps one main advantage of using this kind of smoothing, compared to splines,
is that it involves one single radial function, which can be chosen from a list of
functions given in the Appendix A. The coefficients of the smoothed curves can
be easily computed by solving a linear problem, as shown in Appendix A. The
covariance matrix can also be computed easily in terms of the matrix Λ = (λ_{ij})
and the radial function φ. The smoothed EOF curves a(t) = Σ_{k=1}^{p} u_k φ(|t − t_k|)
are then sought by solving the above generalised eigenvalue problem to get u =
(u1 , . . . , up ). Note that the function φ(|t − tk |) now plays the role of the basis
function φk (t) used previously.
S(x, y) = (1/(n−1)) Σ_{k=1}^{n} F(x, t_k) F(y, t_k).   (14.9)
The objective of smooth EOFs (Hannachi 2016) is to find the “EOFs” of the
covariance matrix (14.9). Denoting by Ω the spatial domain, which may represent
the entire globe or a part of it, an EOF is a continuous pattern u(x) maximising
∫∫ u(x) S(x, y) u(y) dx dy
subject to ∫ [u(x)² + α(∇²u(x))²] dx = 1. This yields a similar integro-
differential equation to (14.2), namely

∫_Ω S(x, y) u(x) dx = ∫_Ω S(x, y) u(x) dΩ = μ (1 + α∇⁴) u(y).   (14.10)
This is exactly similar to the approximation used in Sects. 14.1 and 14.2 above.
We use two-dimensional
RBFs φ_i(x) = φ(‖x − x_i‖) and expand u(x) in terms of
φ_i(x), e.g. u(x) = Σ_k u_k φ_k(x) = φᵀ(x) u. For example, for the case when α = 0, the
sample F(x) = (F_1(x), …, F_n(x))ᵀ is similarly expanded as F_t(x) = Σ_k λ_{t,k} φ_k(x),
i.e. F(x) = Λφ(x), and from S(x, y) = (n − 1)⁻¹ Fᵀ(x) F(y) we get exactly
a similar eigenvalue problem to that of Sect. 14.1, i.e. (n − 1)⁻¹ Λᵀ Λ G u = μ u, where
G = ∫_{S²} φ(x) φᵀ(x) dx. The set S² represents the spherical Earth. The advantage
of this is that we can use spherical RBFs, which are specific to the sphere, see e.g.
Hubbert and Baxter (2001).
To integrate Eq. (14.10) one starts by defining the sampled space–time field through
a (centred) data matrix X = (x1 , . . . , xd ), where xk = (x1k , . . . , xnk )T is the time
series of the field at the kth grid point. The associated sample covariance matrix is
designated by S. The pattern u = (u1 , . . . , ud )T satisfying the discretised version of
Eq. (14.10) is given by the solution to the following generalised eigenvalue problem
(see also Eq. (15.34)):
Su = μ (I_d + αD⁴) u.   (14.11)
R² (D²u)_{k,l} = (1/(δϕ)²) u_{k−1,l} − [ 2/(δϕ)² + 2/((δλ)² cos²ϕ_k) − tanϕ_k/δϕ ] u_{k,l}
              + [ 1/(δϕ)² − tanϕ_k/δϕ ] u_{k+1,l} + (δλ cosϕ_k)⁻² (u_{k,l−1} + u_{k,l+1}).   (14.12)

In matrix form, Eq. (14.13), this expression involves the p × p matrices
C = Diag(c_1, c_2, …, c_p)
and
2 Note that here q and p represent respectively the resolutions in the zonal and meridional directions
A = ⎛ a_1        b_1        0          …   0          0          0          ⎞
    ⎜ (δϕ)⁻²     a_2        b_2        …   0          0          0          ⎟
    ⎜ 0          (δϕ)⁻²     a_3        …   0          0          0          ⎟
    ⎜ ⋮          ⋮          ⋮          ⋱   ⋮          ⋮          ⋮          ⎟
    ⎜ 0          0          0          …   (δϕ)⁻²     a_{p−1}    b_{p−1}    ⎟
    ⎝ 0          0          0          …   0          (δϕ)⁻²     a_p + b_p  ⎠
where a_k = −[ 2/(δϕ)² + 2/((δλ)² cos²ϕ_k) − tanϕ_k/δϕ ], b_k = 1/(δϕ)² − tanϕ_k/δϕ, and
c_k = (δλ cosϕ_k)⁻². The eigenvalue problem (14.11) yields (Hannachi 2016):
Su = μ (I_pq + αD⁴) u = μ (I_pq + (α/R⁴) A²) u,   (14.14)
where Ipq is the pq × pq identity matrix. For a given smoothing parameter α
Eq. (14.14) is a generalised eigenvalue problem.
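Assuming the discrete Laplacian matrix has been assembled following Eqs. (14.12)–(14.16), the generalised eigenvalue problem (14.14) can be solved directly. The following sketch takes a generic symmetric Laplacian matrix L (its construction on the latitude–longitude grid is not repeated here) and uses illustrative names.

```python
# Regularised (smooth) EOFs: solve S u = mu (I + alpha L @ L) u.
import numpy as np
from scipy.linalg import eigh

def regularised_eofs(S, L, alpha, n_modes=5):
    """S: (d, d) sample covariance matrix; L: (d, d) symmetric discrete Laplacian."""
    d = S.shape[0]
    B = np.eye(d) + alpha * (L @ L)       # I + alpha D^4, symmetric positive definite
    B = 0.5 * (B + B.T)                    # enforce symmetry numerically
    mu, U = eigh(S, B)                     # generalised symmetric eigenproblem
    order = np.argsort(mu)[::-1]           # largest eigenvalues first
    return mu[order][:n_modes], U[:, order][:, :n_modes]
```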
Exercise The objective of this exercise is to derive Eq. (14.13).
1. Denote by v the vector R²D²u, and write v in a form similar to u, i.e. vᵀ =
(v_1ᵀ, v_2ᵀ, …, v_qᵀ), where v_lᵀ = (v_{1l}, v_{2l}, …, v_{pl}), for l = 1, …, q. Write
Eq. (14.12) for v_{1l}, v_{2l} and v_{pl} for l = 1, 2 and q.
2. Show in particular that v_1 = Au_1 + Cu_2 + Cu_q, v_2 = Au_2 + Cu_1 + Cu_3 and
v_q = Au_q + Cu_{q−1} + Cu_1.
3. Derive Eq. (14.13).
(Hint. 1. v_{11} = (1/(δϕ)²) u_{01} + a_1 u_{11} + b_1 u_{21} + c_1 (u_{10} + u_{12}) and v_{21} = (1/(δϕ)²) u_{11} +
a_2 u_{21} + b_2 u_{31} + c_2 (u_{20} + u_{22}), and similarly v_{p1} = (1/(δϕ)²) u_{p−1,1} + a_p u_{p1} +
b_p u_{p+1,1} + c_p (u_{p0} + u_{p2}), and recall that u_{01} = 0, u_{10} = u_{1q}, u_{20} = u_{2q}, etc.)
Note that in Eq. (14.14) it is assumed that u_{0l} = 0, for l = 1, …, q, meaning that
the pattern is zero at the pole. As explained in Hannachi (2016) this may be
reasonable for the wind field but not for other variables such as pressure. Another
type of boundary condition is to consider

u_{0l} = u_{1l},

in which case

R²D²u = Bu,   (14.15)
Exercise Derive Eq. (14.15) for the case when u0l = u1l for l = 1, . . . q
Hint Follow the previous exercise.
The third type of boundary conditions represents the non-periodic conditions,
such as the case of a local region on the globe. This means that periodicity is no longer assumed
in the zonal direction, and one keeps the same condition in the meridional direction,
u_{0l} = u_{1l} and u_{p+1,l} = u_{pl} for l = 1, …, q. Using the expression of matrix B given
in Eq. (14.15), Eq. (14.13) yields a block tridiagonal matrix as

R²D²u = Cu = ⎛ B+C  C    O    O   …   O    O   ⎞ ⎛ u_1     ⎞
             ⎜ C    B    C    O   …   O    O   ⎟ ⎜ u_2     ⎟
             ⎜ O    C    B    C   …   O    O   ⎟ ⎜ u_3     ⎟
             ⎜ ⋮    ⋮    ⋮    ⋮   ⋱   ⋮    ⋮   ⎟ ⎜ ⋮       ⎟   (14.16)
             ⎜ O    O    O    O   …   B    C   ⎟ ⎜ u_{q−1} ⎟
             ⎝ O    O    O    O   …   C   B+C  ⎠ ⎝ u_q     ⎠
Denote by U^{(k)} = (u_1^{(k)}, …, u_m^{(k)}) the matrix of the leading m
smooth EOFs obtained by discarding the kth observation (where the observations
can be assumed to be independent). Note that U^{(k)} is a function of α. The residual
obtained from approximating a data vector x using U^{(k)} is

ε^{(k)}(α) = x − Σ_{i=1}^{m} β_i u_i^{(k)},

where the β_i, i = 1, …, m, are obtained from the system of equations ⟨x, u_l^{(k)}⟩ =
Σ_{j=1}^{m} β_j ⟨u_j^{(k)}, u_l^{(k)}⟩, l = 1, …, m, i.e. β_i = Σ_{j=1}^{m} [G⁻¹]_{ij} ⟨x, u_j^{(k)}⟩. Here
G is the scalar product matrix with elements [G]_{ij} = ⟨u_i^{(k)}, u_j^{(k)}⟩. The optimal
value of α is then the one that minimises the CV, where CV = Σ_{k=1}^{n} tr Γ^{(k)}, with
Γ^{(k)} being the covariance matrix³ of ε^{(k)}. SVD can be used to efficiently extract the
first few leading smooth EOFs and then optimise the total sum of squared residuals.
We can also use instead the explained variance as follows. If one designates by
σ_{(k)} = Σ_{i=1}^{m} μ_i^{(k)} the total variance explained by the leading m EOFs when the kth
observation is removed, then the best α is the one that maximises Σ_{k=1}^{n} σ_{(k)}. Any
descent algorithm can be used to optimise this one-dimensional problem.
It turns out, however, and as pointed out by Hannachi (2016), that cross-
validation does not work in this particular case simply because EOFs minimise
precisely the residual variance, see Chap. 3. Hannachi (2016) used the Lagrangian
L of the original regularised EOF problem:
max_u ∫∫ u(x) S(x, y) u(y) dx dy
subject to ∫ [u(x)² + α(∇²u(x))²] dx = 1,   (14.17)

that is,

L = ∫∫ u(x) S(x, y) u(y) dx dy − μ ( 1 − ∫ [u(x)]² dx − α ∫ (∇²u(x))² dx ).
Fig. 14.5 Lagrangian L of the optimisation problem eq (14.17) versus the smoothing parameter
α, based on the leading smooth EOF of the extratropical NH SLP anomalies for 2.5◦ × 2.5◦ (a),
5◦ × 5◦ (b), 5◦ × 10◦ (c) and 10◦ × 10◦ (d) latitude–longitude grid. Adapted from Hannachi (2016)
Fig. 14.6 Eigenvalues of the generalised eigenvalue problem Eq. (14.11) for the winter SLP
anomalies over the northern hemisphere for the regularised problem (filled circles) and the
conventional (α = 0) problem (open circle). The eigenvalues are scaled by the total variance
of the SLP anomalies and transformed to a percentage, so for example the open circles provide the
percentage of explained variance of the EOFs. Adapted from Hannachi (2016)
Fig. 14.7 Leading two conventional (a,c) and regularised (b,d) EOFs of the northern hemisphere
SLP anomalies based on the 10◦ × 10◦ latitude–longitude grid. The smoothing parameter is based
on the optimal value obtained from the Lagrangian (Fig. 14.5d). Adapted from Hannachi (2016)
trend pattern, namely the Scandinavian pattern. Figure 14.8 shows the eigenspec-
trum associated with the inverse rank matrix, see Eq. (16.21), corresponding to the
5◦ × 5◦ downgraded resolution of the winter SLP anomalies. It is clear that when
the trend EOF method is applied with the regularisation procedure, i.e. applying
Eq. (14.11) to the data matrix Z of Eq. (16.21), the eigenvalue of the second TEOF
is raised off the “noise” floor (Fig. 14.8b), and the second trend pattern regained. The
leading two smooth trend PCs are shown in Fig. 14.8c, d. The leading two smooth
EOFs along with the associated smooth trend patterns are shown in Fig. 14.9, which
can be compared to Fig. 16.8 in Chap. 16 (see also Hannachi 2016).
Fig. 14.8 Eigenspectrum, given in terms of percentage of explained variance, of the covariance
(or correlation) matrix of the inverse ranks of the SLP anomalies with a reduced resolution of
5◦ × 5◦ grid for the non-regularised (a) and regularised (b) cases, along with the regularised first
(c) and second (d) trend PCs associated with the leading two eigenvalues shown in (b). The optimal
smoothing parameter, α = 60, is used in (b). Adapted from Hannachi (2016)
Fig. 14.9 Leading two regularised trend EOFs (a,b) of the SLP anomalies, corresponding to the
leading two eigenvalues of Fig. 14.8b, and the associated trend patterns (c,d). Contour interval in
(c,d) 1 hPa. Adapted from Hannachi (2016)
Chapter 15
Methods for Coupled Patterns
15.1 Introduction
1 It is possible to apply these techniques to two combined fields, e.g. SLP and SST, by combining
them into a single space–time field. In this way the method does not explicitly take into account
the co-variability of both fields.
principal predictor analysis (PPA) and principal regression analysis (PRA). RDA
(von Storch and Zwiers 1999; Wang and Zwiers 1999) aims at selecting predictors
that maximise the explained variance. PPA (Thacker 1999) seeks to select predictors
that maximise the sum of squared correlations. PRA (Yu et al. 1997), as its name
indicates, fits regression models between the principal components of the predictor
data and each of the predictand elements individually. Tippett et al. (2008) discuss
the connection between the different methods of finding coupled patterns and
multivariate regression through a SVD of the regression matrix.
In atmospheric science, it seems that the first idea to combine two or more sets
of variables in an EOF analysis was mentioned in Lorenz (1956) and was first
applied by Glahn (1962) and few others in statistical prediction of weather, see
e.g. Kutzbach (1967) and Bretherton et al. (1992). This combined EOF/PC analysis
is obtained by applying a standard EOF analysis to the combined space–time field
z_t = (x_tᵀ, y_tᵀ)ᵀ of x_t and y_t, t = 1, …, n. The grand covariance matrix of z_t,
t = 1, …, n, is given by

S_{x,y} = (1/n) Σ_{t=1}^{n} (z_t − z̄)(z_t − z̄)ᵀ,   (15.1)
where $\overline{\mathbf{z}} = n^{-1}\sum_{t=1}^{n}\mathbf{z}_t$ is the mean of the combined field. However, because of the
scaling problem, and in order for the combined field to be consistent, the individual
fields are normally scaled by their respective variances. Hence, if S_xx and S_yy are the
respective covariance matrices of x_t and y_t, t = 1, ..., n, and D_x = Diag(S_xx) and
D_y = Diag(S_yy), the scaled variables become D_x^{-1/2} x_t and D_y^{-1/2} y_t, respectively.
The grand covariance matrix of the combined scaled fields is
$$\begin{pmatrix} \mathbf{D}_x^{-1/2} & \mathbf{O} \\ \mathbf{O} & \mathbf{D}_y^{-1/2} \end{pmatrix}\, \mathbf{S}_{x,y}\, \begin{pmatrix} \mathbf{D}_x^{-1/2} & \mathbf{O} \\ \mathbf{O} & \mathbf{D}_y^{-1/2} \end{pmatrix}.$$
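The combined EOF recipe described above is straightforward to script. The following is a minimal sketch, assuming two centred NumPy arrays X and Y with time along the first axis; all names are illustrative, and the scaling follows the variance normalisation just described.

```python
import numpy as np

def combined_eofs(X, Y, n_modes=5):
    """Combined EOF analysis of two centred fields X (n x p1) and Y (n x p2)."""
    # scale each variable by its standard deviation (the D_x^{-1/2}, D_y^{-1/2} step)
    Xs = X / X.std(axis=0, ddof=1)
    Ys = Y / Y.std(axis=0, ddof=1)
    Z = np.hstack([Xs, Ys])                 # combined scaled field z_t
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eofs = Vt[:n_modes]                     # combined EOFs (one per row)
    pcs = U[:, :n_modes] * s[:n_modes]      # corresponding combined PCs
    expl_var = (s**2 / np.sum(s**2))[:n_modes]
    return eofs, pcs, expl_var
```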
15.2 Canonical Correlation Analysis

15.2.1 Background
Canonical correlation analysis (CCA) dates back to Hotelling (1936a) and attempts
to find relationships between two space–time fields. Let x_t = (x_{t1}, ..., x_{tp_1})^T and
y_t = (y_{t1}, ..., y_{tp_2})^T, t = 1, 2, ..., be two multidimensional stationary time series
with respective dimensions p_1 and p_2. The objective of CCA is to find a pair of
patterns a_1 and b_1 such that the time series obtained by projecting x_t and y_t onto a_1
and b_1, respectively, have maximum correlation. In the following we let a_t^{(1)} = a_1^T x_t
and b_t^{(1)} = b_1^T y_t, t = 1, 2, ..., be such time series.
Definition
The patterns a_1 and b_1 maximising corr(a_t^{(1)}, b_t^{(1)}) are the leading canonical
correlation patterns, and the associated time series a_t^{(1)} and b_t^{(1)}, t = 1, 2, ..., are
the canonical variates.
We suppose in the sequel that both time series have zero mean. Let Σ_xx and Σ_yy
be the respective covariance matrices of x_t and y_t. We also let Σ_xy be the cross-covariance
matrix between x_t and y_t, and similarly for Σ_yx, i.e. Σ_xy = E(x_t y_t^T) = Σ_yx^T.
The objective is to find a_1 and b_1 such that
$$\rho = \mathrm{corr}\!\left(a_t^{(1)}, b_t^{(1)}\right) = \frac{\mathbf{a}_1^T\boldsymbol{\Sigma}_{xy}\mathbf{b}_1}{\left(\mathbf{a}_1^T\boldsymbol{\Sigma}_{xx}\mathbf{a}_1\right)^{1/2}\left(\mathbf{b}_1^T\boldsymbol{\Sigma}_{yy}\mathbf{b}_1\right)^{1/2}} \qquad (15.2)$$
is maximum. Note that if a_1 and b_1 maximise (15.2), then so do αa_1 and βb_1 for any
positive scalars α and β. To overcome this indeterminacy, we suppose that the time
series a_t^{(1)} and b_t^{(1)}, t = 1, 2, ..., are scaled to have unit variance, i.e.
$$\mathbf{a}_1^T\boldsymbol{\Sigma}_{xx}\mathbf{a}_1 = \mathbf{b}_1^T\boldsymbol{\Sigma}_{yy}\mathbf{b}_1 = 1. \qquad (15.3)$$
Another way to look at (15.3) is by noting that (15.2) is independent of the scaling of
a_1 and b_1, and therefore maximising (15.2) is equivalent to maximising its numerator
subject to (15.3). The CCA problem (15.2) is then written as
$$\max_{\mathbf{a},\mathbf{b}}\ \rho = \mathbf{a}^T\boldsymbol{\Sigma}_{xy}\mathbf{b} \quad \text{s.t.} \quad \mathbf{a}^T\boldsymbol{\Sigma}_{xx}\mathbf{a} = \mathbf{b}^T\boldsymbol{\Sigma}_{yy}\mathbf{b} = 1. \qquad (15.4)$$
After differentiating the Lagrangian of (15.4) with respect to a and b, one gets
$$\boldsymbol{\Sigma}_{xy}\mathbf{b} = 2\alpha\,\boldsymbol{\Sigma}_{xx}\mathbf{a} \quad \text{and} \quad \boldsymbol{\Sigma}_{yx}\mathbf{a} = 2\beta\,\boldsymbol{\Sigma}_{yy}\mathbf{b},$$
which after combination yields (15.5). Note also that one could obtain the same
result without Lagrange multipliers (left as an exercise). We notice here that α and
β are necessarily equal. In fact, multiplying the respective previous equalities
(obtained after differentiation) by a^T and b^T, one gets, keeping in mind Eq. (15.3),
2α = 2β = a^T Σ_xy b = λ. Hence the obtained Eqs. (15.5) can be written as a single
(generalised) eigenvalue problem:
$$\begin{pmatrix} \mathbf{O}_{p_1,p_1} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \mathbf{O}_{p_2,p_2} \end{pmatrix}\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} = \lambda\,\begin{pmatrix} \boldsymbol{\Sigma}_{xx} & \mathbf{O}_{p_1,p_2} \\ \mathbf{O}_{p_2,p_1} & \boldsymbol{\Sigma}_{yy} \end{pmatrix}\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}. \qquad (15.6)$$
Remarks
1. The matrices M_x = Σ_xx^{-1}Σ_xyΣ_yy^{-1}Σ_yx and M_y = Σ_yy^{-1}Σ_yxΣ_xx^{-1}Σ_xy involved in (15.5)
have the same spectrum. In fact, we note first that the matrices M_x and M_y have
the same rank, which is that of Σ_xy. From the SVD of Σ_xx, one can easily compute
a square root² Σ_xx^{1/2} of Σ_xx, and similarly for Σ_yy. The matrix M_x then becomes
$$\mathbf{M}_x = \boldsymbol{\Sigma}_{xx}^{-1/2}\,\mathbf{A}\mathbf{A}^T\,\boldsymbol{\Sigma}_{xx}^{1/2}, \qquad (15.7)$$
where A = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2}. Hence, the eigenvalues of M_x are identical to the
eigenvalues of AA^T (see Appendix D). Similarly, we have M_y = Σ_yy^{-1/2} A^T A Σ_yy^{1/2},
and this completes the proof.

² If Σ_xx = UΛU^T, where U is orthogonal, then one can define this square root by Σ_xx^{1/2} = UΛ^{1/2}. In
this case we have Σ_xx = Σ_xx^{1/2}(Σ_xx^{1/2})^T. Note that this square root is not symmetric. A symmetric
square root can be defined by Σ_xx^{1/2} = UΛ^{1/2}U^T, in which case Σ_xx = Σ_xx^{1/2}Σ_xx^{1/2}; hence the square
root of Σ_xx is not unique.
2. The matrices M_x and M_y are positive semi-definite (why?).
3. Taking u = Σ_xx^{1/2} a and v = Σ_yy^{1/2} b, then (15.5) yields AA^T u = λu and A^T A v = λv,
i.e. u and v are, respectively, the left and right singular vectors of A.
In conclusion, using the SVD of A, i.e. Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2} = UΛV^T, where Λ =
Diag(λ_1, ..., λ_m), m is the rank of Σ_xy, and the λ_k have been arranged in decreasing
order, the canonical correlation patterns are given by
$$\mathbf{a}_k = \boldsymbol{\Sigma}_{xx}^{-1/2}\mathbf{u}_k \quad \text{and} \quad \mathbf{b}_k = \boldsymbol{\Sigma}_{yy}^{-1/2}\mathbf{v}_k, \qquad (15.8)$$
Exercise Show that CCA indeed consists of the spectral analysis of
X(X^T X)^{-1}X^T Y(Y^T Y)^{-1}Y^T.
Hint Multiply both sides of the first equation of (15.5) by X.
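As a concrete illustration of the recipe in (15.7)–(15.8), a minimal sketch of sample CCA might look as follows (Python/NumPy; the covariance matrices are assumed full rank, and all names are illustrative):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca(X, Y, n_pairs=3):
    """Sample CCA of centred data matrices X (n x p1) and Y (n x p2)."""
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Sxx_ih, Syy_ih = inv_sqrt(Sxx), inv_sqrt(Syy)
    A = Sxx_ih @ Sxy @ Syy_ih                # A = Sxx^{-1/2} Sxy Syy^{-1/2}
    U, lam, Vt = np.linalg.svd(A)
    a = Sxx_ih @ U[:, :n_pairs]              # canonical patterns for x, cf. Eq. (15.8)
    b = Syy_ih @ Vt.T[:, :n_pairs]           # canonical patterns for y, cf. Eq. (15.8)
    return lam[:n_pairs], a, b               # canonical correlations and patterns
```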
The various operations outlined above can be expensive in terms of CPU time,
particularly the matrix inversion in the case of a large number of variables. This
is not the only problem that can occur in CCA. For example, it may happen that
either Sxx or Syy or both can be rank deficient in which case the matrix inversion
breaks down.3 The most common way to address these problems is to project both
fields onto their respective leading EOFs and then apply CCA to the obtained PCs.
The use of PCs can be regarded as a filtering operation and hence can reduce the
effect of sampling fluctuation. This approach was first proposed by Barnett and
Preisendorfer (1987) and also Bretherton et al. (1992) and is often used in climate
research. When using the PCs, the covariance matrices Sxx and Syy become the
identity matrices with respective orders corresponding to the number of PCs retained
for each field. Hence the canonical correlation patterns simply reduce to the left
and right singular vectors of the cross-covariance Sxy between PCs. For example, if
α = (α_1, ..., α_q)^T is the left singular vector of this cross-covariance, where q is the
number of retained EOFs of the left field x_t, t = 1, ..., n, then the corresponding
canonical correlation pattern is given by
$$\mathbf{a} = \sum_{k=1}^{q} \alpha_k\,\mathbf{e}_k, \qquad (15.10)$$
where ek , k = 1, . . . , q are the q leading EOFs of the left field. The canonical
correlation pattern for the right field is obtained in a similar way. Note that since
the PCs are in general scaled to unit variance, in Eq. (15.10) the EOFs have to be
scaled so they have the same units as the original fields. To test the significance of
the canonical correlations, Bartlett (1939) proposed to test the null hypothesis that
only the leading, say r, canonical correlations are non-zero, i.e. H_0: λ_{r+1} = λ_{r+2} =
... = λ_m = 0, using the statistic
$$T = -\left[n - \frac{1}{2}\left(p_1 + p_2 + 3\right)\right]\sum_{k=r+1}^{m}\log\!\left(1 - \hat{\lambda}_k^2\right).$$
3 Generalised inverse can be used as an alternative, see e.g. Khatri (1976), but as pointed out by
Bretherton et al. (1992) the results in this case will be difficult to interpret.
Under H_0 and joint multinormality, one gets the asymptotic chi-square approximation, i.e.
$$T \sim \chi^2_{(p_1-r)(p_2-r)},$$
see also Mardia et al. (1979). Alternatively, Monte Carlo simulations can be used
to test the significance of the sample canonical correlations λ̂1 , . . . λ̂m , see e.g. von
Storch and Zwiers (1999) and references therein.
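A small numerical sketch of Bartlett's test as written above; lam holds the sample canonical correlations in decreasing order, and scipy is used only for the chi-square tail probability (names illustrative):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_cca_test(lam, n, p1, p2, r):
    """Test H0: only the leading r canonical correlations are non-zero."""
    lam = np.asarray(lam, dtype=float)
    T = -(n - 0.5 * (p1 + p2 + 3)) * np.sum(np.log(1.0 - lam[r:] ** 2))
    dof = (p1 - r) * (p2 - r)
    return T, chi2.sf(T, dof)   # statistic and asymptotic p-value
```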
Remark
1. It is possible to extend CCA to a set of more than two variables, see Kettenring
(1971) for details.
2. Principal prediction patterns (PPPs) correspond to CCA applied to xt and yt =
xt+τ , t = 1, . . . , n − τ . See, e.g. Dorn and von Storch (1999) and von Storch and
Zwiers (1999) for examples.
3. The equations of CCA can also be derived using regression arguments. We first
let x̃_t = a^T x_t and ỹ_t = b^T y_t. Then to find a and b we minimise the error variance
obtained from regressing ỹ_t onto x̃_t, t = 1, ..., n. Writing
$$\tilde{y}_t = \alpha\,\tilde{x}_t + \varepsilon_t,$$
it is easy to show that α = cov(x̃_t, ỹ_t)/var(x̃_t) and var(ε_t) = var(ỹ_t) − [cov(x̃_t, ỹ_t)]²/var(x̃_t); hence,
the sample variance estimate of the noise term is
$$\hat{\sigma}^2 = \mathbf{b}^T\mathbf{S}_{yy}\mathbf{b} - \frac{\left(\mathbf{a}^T\mathbf{S}_{xy}\mathbf{b}\right)^2}{\mathbf{a}^T\mathbf{S}_{xx}\mathbf{a}}.$$
The patterns a and b minimising σ̂²/(b^T S_yy b) are then obtained after differentiation,
yielding again the CCA equations (15.5).
In several instances the data can have multi-colinearity, which occurs when there are
near-linear relationships between the variables. When this happens, problems can
occur when looking to invert the covariance matrix of X and/or Y. One common
solution to this problem is to use regularised CCA (RCCA). This is very similar to
ridge regression (Hoerl and Kennard 1970). In ridge regression⁴ the parameters of
the model y = Xβ + ε are obtained by adding a regularising term to the residual
sum of squares, RSS = (y − Xβ)^T(y − Xβ) + λβ^Tβ, leading to the estimate
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}.$$
In regularised CCA the matrices X^T X and Y^T Y are similarly replaced by X^T X + λ_1 I
and Y^T Y + λ_2 I. Therefore, as for CCA, RCCA consists of a spectral
analysis of X(X^T X + λ_1 I)^{-1} X^T Y(Y^T Y + λ_2 I)^{-1} Y^T. In functional CCA, discussed
later, regularisation (or smoothing) is required to obtain a meaningful solution. The
choice of the regularisation parameters is discussed later.
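A minimal sketch of one common RCCA formulation, equivalent to the spectral analysis written above: the Gram matrices are ridged before the usual CCA step (λ1 and λ2 are user-chosen or cross-validated; all names are illustrative):

```python
import numpy as np

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def rcca(X, Y, lam1=0.1, lam2=0.1, n_pairs=3):
    """Regularised CCA of centred data matrices X (n x p1) and Y (n x p2)."""
    p1, p2 = X.shape[1], Y.shape[1]
    Cxx = X.T @ X + lam1 * np.eye(p1)        # ridged Gram matrices
    Cyy = Y.T @ Y + lam2 * np.eye(p2)
    A = inv_sqrt(Cxx) @ (X.T @ Y) @ inv_sqrt(Cyy)
    U, gamma, Vt = np.linalg.svd(A)
    a = inv_sqrt(Cxx) @ U[:, :n_pairs]       # regularised canonical patterns for x
    b = inv_sqrt(Cyy) @ Vt.T[:, :n_pairs]    # regularised canonical patterns for y
    return gamma[:n_pairs], a, b
```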
In canonical covariance analysis (CCOVA), also referred to as maximum covariance analysis (MCA), the correlation in (15.4) is replaced by the covariance and the patterns are simply constrained to have unit norm:
$$\max_{\mathbf{u},\mathbf{v}}\ \gamma = \mathbf{u}^T\mathbf{S}_{xy}\mathbf{v} \quad \text{s.t.} \quad \mathbf{u}^T\mathbf{u} = \mathbf{v}^T\mathbf{v} = 1. \qquad (15.11)$$
Hence u and v correspond, respectively, to the leading eigenvectors of Sxy Syx and
Syx Sxy , or equivalently to the left and right singular vectors of the cross-covariance
matrix Sxy .
In summary, from the cross-covariance data matrix S_xy = (1/n) X^T Y, where X and Y
are the centred data matrices of the respective fields, the set of canonical covariances
are provided by the singular values γ1 , . . . , γm , of Sxy arranged in decreasing order.
The set of canonical covariance pattern pairs is provided by the associated left and
right singular vectors U = (u1 , . . . , um ) and V = (v1 , . . . , vm ), respectively, where
m is the rank of Sxy . Unlike CCA patterns, the canonical covariance pattern pairs
are orthogonal to each other by construction.
The CCOVA can be easily computed in Matlab. If X(n, p) and Y (n, q) are the
two data matrices of both fields, the SVD of XT Y gives the left and right singular
vectors along with the singular values, which can be arranged in decreasing order,
as for EOFs (Chap. 3).
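A minimal sketch of these steps in Python/NumPy (the text uses Matlab; the computation is the same SVD of the cross-covariance matrix, and all names are illustrative):

```python
import numpy as np

def mca(X, Y, n_modes=3):
    """Maximum covariance (canonical covariance) analysis of centred X (n x p), Y (n x q)."""
    n = X.shape[0]
    Sxy = X.T @ Y / n                              # cross-covariance matrix
    U, gamma, Vt = np.linalg.svd(Sxy, full_matrices=False)
    u, v = U[:, :n_modes], Vt.T[:, :n_modes]       # left/right singular vectors (patterns)
    scf = (gamma**2 / np.sum(gamma**2))[:n_modes]  # squared covariance fractions
    return gamma[:n_modes], u, v, X @ u, Y @ v, scf
```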
Fig. 15.1 Left (top) and right (bottom) leading singular vectors of the covariance matrix between
September Mediterranean evaporation and August Asian monsoon outgoing long wave radiation
Remark Both CCA and CCOVA can be used as predictive tools. In fact, if, for
example, the right field yt lags the left field xt , t = 1, . . . , n, with a lag τ , then
both these techniques can be used to predict yt from xt as in the case of principal
predictive patterns (PPPs), see e.g. Dorn and von Storch (1999). In this case, the
leading pair of canonical correlation patterns, for example, yield the corresponding
time series that are most cross-correlated at lag τ , see also Barnston and Ropelewski
(1992).
Fig. 15.2 Leading time series associated with the leading singular vectors of Fig. 15.1. Red: left
singular vector (MEVA), blue: right singular vector (OLR)
15.4 Redundancy Analysis

We have seen in Sect. 15.2.3 (see the remarks in that section) that CCA can be
obtained from a regression perspective by minimising the error variance. This error
variance, however, is not the only way to express the degree of dependence between
two variables. Consider two multivariate time series xt and yt , t = 1, 2, . . . (with
zero mean for simplicity). The regression of y_t on x_t is written as
$$\mathbf{y}_t = \boldsymbol{\Psi}\,\mathbf{x}_t + \boldsymbol{\varepsilon}_t, \qquad (15.13)$$
for which the covariance matrix of the residuals is Σ_εε = Σ_yy − Σ_yx Σ_xx^{-1} Σ_xy. A natural
measure of the strength of the relationship is then the proportion of variance of y_t
explained by the regression:
$$R^2(\mathbf{y}_t, \mathbf{x}_t) = \frac{\mathrm{tr}\left(\boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{\varepsilon\varepsilon}\right)}{\mathrm{tr}\,\boldsymbol{\Sigma}_{yy}} = \frac{\mathrm{tr}\left(\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\right)}{\mathrm{tr}\,\boldsymbol{\Sigma}_{yy}}. \qquad (15.14)$$
Redundancy analysis was introduced first by van den Wollenberg (1977) and
extended later by Johansson (1981) and put in a unified frame by Tyler (1982). It
aims at finding pattern transformations (matrices) P and Q such that R 2 (Qyt , Pxt )
is maximised. To simplify the calculation, we will reduce the search to one single
pattern p such that R²(y_t, p^T x_t) is maximised. Now from (15.14) this redundancy
index takes the form tr[(p^T Σ_xx p)^{-1} Σ_yx p p^T Σ_xy] / tr(Σ_yy), which, after simplification using a
little algebra, yields the redundancy analysis problem:
$$\max_{\mathbf{p}}\ R^2\!\left(\mathbf{y}_t, \mathbf{p}^T\mathbf{x}_t\right) = \frac{\mathbf{p}^T\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yx}\,\mathbf{p}}{\mathbf{p}^T\boldsymbol{\Sigma}_{xx}\,\mathbf{p}}. \qquad (15.15)$$
Differentiating (15.15) with respect to p yields the eigenvalue problem
$$\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yx}\,\mathbf{p} = \lambda\,\mathbf{p}, \qquad (15.16)$$
and the associated patterns q for the y field satisfy
$$\mathbf{A}\,\mathbf{q} = \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\,\mathbf{q} = \lambda\,\mathbf{q}. \qquad (15.17)$$
The qs are then the orthogonal eigenvectors of the symmetric positive semi-definite
matrix A = Σ_yx Σ_xx^{-1} Σ_xy. Furthermore, from (15.16) the redundancy becomes
R²(y_t, p^T x_t) = λ.
van den Wollenberg (1977) solved (15.16) and its equivalent, where the roles of x
and y are exchanged, i.e. the eigenvalue problem of Σ_yy^{-1}Σ_yxΣ_xy. As pointed out by
Johansson (1981) and Tyler (1982), the transformations of x_t and y_t are not related.
Johansson (1981) suggested using successive linear transformations of y_t, i.e. b^T y_t,
such that b^T Σ_yxΣ_xy b is maximised with the bs being orthogonal. These vectors are in fact
the q vectors defined above, which are to be unitary and orthogonal. Since one
wishes the vectors q_k = η Σ_yx p_k to be unitary, i.e. η² p_k^T Σ_xyΣ_yx p_k = 1, we must
have η_k² = λ_k^{-1}, where the λ_k s are the eigenvalues of the matrix A (see Eq. (15.17)).
Here we have taken p_k^T Σ_xx p_k = 1, which does not change the optimisation
problem (15.15). Therefore the q vectors are given by q_k = λ_k^{-1/2} Σ_yx p_k, and
similarly p_k = λ_k^{-1/2} Σ_xx^{-1}Σ_xy q_k.
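A minimal numerical sketch of redundancy analysis along the lines of (15.16)–(15.17), using sample covariance matrices and assuming S_xx is invertible (names illustrative):

```python
import numpy as np

def redundancy_analysis(X, Y, n_modes=3):
    """Redundancy analysis of centred predictor X (n x p1) and predictand Y (n x p2)."""
    n = X.shape[0]
    Sxx, Sxy = X.T @ X / n, X.T @ Y / n
    Syx = Sxy.T
    M = np.linalg.solve(Sxx, Sxy @ Syx)            # Sxx^{-1} Sxy Syx, cf. Eq. (15.16)
    lam, P = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1][:n_modes]
    lam, P = lam.real[order], P[:, order].real
    # normalise so that p_k^T Sxx p_k = 1, then q_k = lam_k^{-1/2} Syx p_k
    norms = np.sqrt(np.einsum('ij,jk,ki->i', P.T, Sxx, P))
    P = P / norms
    Q = (Syx @ P) / np.sqrt(lam)
    return lam, P, Q
```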
15.5 Application: Optimal Lag Between Two Fields and Other Extensions

1. Given two zero-mean fields x_t and y_t, t = 1, ..., n, one wishes to find the optimal
lag τ between the two fields along with the associated patterns. If one chooses
the correlation between the two time series a^T x_t and b^T y_{t+τ} as a measure of
association, then the problem becomes equivalent to finding patterns a and b
satisfying
$$\max_{\mathbf{a},\mathbf{b}}\ \varphi(\mathbf{a},\mathbf{b}) = \frac{\left[\mathbf{a}^T\boldsymbol{\Sigma}_{xy}(\tau)\,\mathbf{b}\right]^2}{\left(\mathbf{a}^T\boldsymbol{\Sigma}_{xx}\mathbf{a}\right)\left(\mathbf{b}^T\boldsymbol{\Sigma}_{yy}\mathbf{b}\right)}. \qquad (15.18)$$
Writing the SVD
$$\mathbf{A}(\tau) = \boldsymbol{\Sigma}_{xx}^{-1/2}\,\boldsymbol{\Sigma}_{xy}(\tau)\,\boldsymbol{\Sigma}_{yy}^{-1/2} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T,$$
we get max[φ(a, b)] = λ_1², where λ_1 is the largest singular value of A(τ).
Exercise Derive the above result, i.e. max[φ(a, b)] = λ_1².
In conclusion, a simple way to find the best τ is to plot λ_1²(τ) versus τ and find
the maximum, if it exists. This can be achieved by differentiating this univariate
function and looking for its zeros. The associated patterns are then given by
a = Σ_xx^{-1/2} u_1 and b = Σ_yy^{-1/2} v_1, where u_1 and v_1 are the leading left and right singular
vectors of A(τ), respectively.
2. Note that one could also have taken Σ_xy(τ) + Σ_yx(−τ) instead of Σ_xy(τ)
in (15.18), so that the problem becomes symmetric. This extension simply means
considering the case of y leading x by (−τ) besides x leading y by τ, which are
the same.
3. Another extension is to look for patterns that maximise ∫ρ_xy(τ)dτ. This is like
the OPP (Chap. 8), which looks for patterns maximising the persistence time for a
single field, but applied to coupled patterns. In this case the numerator in (15.18)
is replaced by [a^T (Σ_{τ=−M}^{M} Σ_xy(τ)) b]² for some lag M. The matrix involved
here is also symmetric.
Remark If in the previous extension one considers the case where yt = xt , then one
obviously recovers the OPP.
The redundancy analysis problem can be extended in a similar way to lagged fields, yielding
$$\boldsymbol{\Sigma}_{xx}^{-1}\,\boldsymbol{\Sigma}_{xy}(\tau)\,\boldsymbol{\Sigma}_{yx}(-\tau)\,\mathbf{u} = \lambda\,\mathbf{u}. \qquad (15.19)$$
If one takes y_t = x_{t+τ} as a particular case, the redundancy problem (15.19) yields
the generalised eigenvalue problem:
$$\boldsymbol{\Sigma}_{xx}(\tau)\,\boldsymbol{\Sigma}_{xx}(-\tau)\,\mathbf{u} = \lambda\,\boldsymbol{\Sigma}_{xx}\,\mathbf{u}.$$
Note also that the matrix involved in (15.18) and (15.19), e.g. Σ_xy(τ)Σ_yx(−τ), is also
symmetric positive semi-definite. Therefore, to find the best τ (see above), one can
plot λ²(τ) versus τ, where λ(τ) is the leading singular value of Σ_xx^{-1/2}Σ_xy(τ), and
choose the lag associated with the maximum (if there is one).
Exercise Derive the leading solution of (15.19) and show that it corresponds to the
leading singular value of Σ_xx^{-1/2}Σ_xy(τ).
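A hedged sketch of the lag-scan procedure discussed in this section: for each lag τ, form the lagged cross-covariance and take the leading singular value of Σ_xx^{-1/2}Σ_xy(τ)Σ_yy^{-1/2}; the best lag is where λ_1²(τ) peaks (all names are illustrative):

```python
import numpy as np

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def best_lag(X, Y, max_lag=24):
    """Scan lags 0..max_lag for centred fields X (n x p1) and Y (n x p2)."""
    n = X.shape[0]
    Sxx_ih = inv_sqrt(X.T @ X / n)
    Syy_ih = inv_sqrt(Y.T @ Y / n)
    lam2 = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        # lagged cross-covariance between x_t and y_{t+tau}
        Sxy_tau = X[:n - tau].T @ Y[tau:] / (n - tau)
        lam2[tau] = np.linalg.svd(Sxx_ih @ Sxy_tau @ Syy_ih,
                                  compute_uv=False)[0] ** 2
    return int(np.argmax(lam2)), lam2          # best lag and the lambda_1^2(tau) curve
```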
15.6 Principal Predictors

As can be seen from (15.5), the time series x_t and y_t play exactly symmetric
roles, and so in CCA both the time series are treated equally. In redundancy
analysis, however, the first (or left) field xt plays the role of the predictor variable,
whereas the second field represents the predictand or response variable. Like
redundancy analysis, principal predictors (Thacker 1999) are based on finding a
linear combination of the predictor variables that efficiently describe collectively
the response variable. Unlike redundancy analysis, however, in principal predictors
the newly derived variables are required to be uncorrelated. Principal predictors
can be used to predict one field from another as presented by Thacker (1999). A
principal predictor is therefore required to be maximally correlated with all the
response variables. This can be achieved by maximising the sum of the squared
correlations with these response variables. Let y_t = (y_{t1}, ..., y_{tp})^T be the response
field, where p is the dimension of the problem, i.e. the number of variables of the
response field, and let a be a principal predictor. The squared correlation between
x̃_t = a^T x_t and the kth response variable y_tk, t = 1, 2, ..., n, is
$$r_k^2 = \frac{\left[\operatorname{cov}\!\left(\{y_{tk}\}_{t=1,\ldots,n},\ \{\mathbf{a}^T\mathbf{x}_t\}_{t=1,\ldots,n}\right)\right]^2}{\sigma_{y_k}^2\ \mathbf{a}^T\mathbf{S}_{xx}\,\mathbf{a}}, \qquad (15.20)$$
where σ_{y_k}² = (S_yy)_{kk} represents the variance of the kth response variable y_tk,
t = 1, ..., n. The numerator of (15.20) is a^T s_k s_k^T a, where s_k is the kth column of the
cross-covariance matrix S_xy. Letting D_yy = Diag(S_yy), we then have
$$\sum_{k=1}^{p}\frac{1}{\sigma_{y_k}^2}\,\mathbf{s}_k\mathbf{s}_k^T = \mathbf{S}_{xy}\mathbf{D}_{yy}^{-1}\mathbf{S}_{yx}.$$
The principal predictors are therefore given by the solution to the generalised
eigenvalue problem:
$$\mathbf{S}_{xy}\mathbf{D}_{yy}^{-1}\mathbf{S}_{yx}\,\mathbf{a} = \lambda\,\mathbf{S}_{xx}\,\mathbf{a}. \qquad (15.22)$$
If μ_k and u_k represent, respectively, the kth singular value and associated left singular
vector of S_xx^{-1/2} S_xy D_yy^{-1/2}, then the kth eigenvalue λ_k of (15.22) is μ_k² and the kth
principal predictor is a_k = S_xx^{-1/2} u_k. Furthermore, since a_k^T S_xx a_l = u_k^T u_l = δ_kl, the
new variables a_k^T x_t are standardised and uncorrelated.
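A minimal sketch of the principal predictor computation just described, via the SVD of S_xx^{-1/2} S_xy D_yy^{-1/2} (S_xx is assumed invertible; all names are illustrative):

```python
import numpy as np

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def principal_predictors(X, Y, n_modes=3):
    """Principal predictors of centred predictor X (n x p1) and response Y (n x p2)."""
    n = X.shape[0]
    Sxx_ih = inv_sqrt(X.T @ X / n)
    Sxy = X.T @ Y / n
    Dyy_ih = np.diag(1.0 / Y.std(axis=0))        # D_yy^{-1/2}
    U, mu, _ = np.linalg.svd(Sxx_ih @ Sxy @ Dyy_ih)
    a = Sxx_ih @ U[:, :n_modes]                  # principal predictor patterns a_k
    return mu[:n_modes] ** 2, a, X @ a           # eigenvalues of (15.22), patterns, new variables
```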
Remark The eigenvalue problem associated with EOFs using the correlation matrix,
e.g. Eqs. (2.21) or (3.25), can be written as D^{-1/2} S_xx D^{-1/2} u = λu, where D =
Diag(S_xx). After the transformation u = D^{1/2} v, it can also be written as D^{-1} S_xx v =
λv. This latter eigenvalue problem is identical to (15.22) when x_t = y_t. Therefore,
when the two fields are identical, principal predictors reduce to a simple linear
(diagonal) transformation of the correlation-based EOFs.
15.7 Extension: Functional Smooth CCA

15.7.1 Introduction
Conventional CCA and related methods are basically formulated to deal with
multivariate observations of classical statistics where the data are sampled at
discrete, often regular, time intervals. In a number of cases in various branches of
science, however, the data can be observed/monitored continuously. In medicine, for
example, we have EEG records, etc. In meteorology, barometric pressure records
at a given location provide a good example. In the latter case, we can in fact
have barometric pressure records at various locations. Similarly, we can have a
continuous function of space, e.g. continuous surface temperature observed at
different times. Most space–time atmospheric fields belong to this category when
the coverage is dense enough. This also applies to slowly varying fields in space
and time where, likewise, the obtained patterns are expected to be smooth, see e.g.
Ramsay and Silverman (2006) for an introduction to the subject and for further
examples.
When time series are looked at from the angle of conceptual stochastic processes,
then one could attempt to look for smooth time series. The point is that there is no
universal definition of such a process, although, some definitions of smoothness
have been proposed. For example, if the first difference of a random field is zero
mean multivariate normal, then the field can be considered as smooth (Pawitan
2001, chap. 1). Leurgans et al. (1993) point out that in the context of continuous
CCA, smoothing is particularly essential and that without smoothing every possible
function can become a canonical variate with perfect canonical correlation.
In functional or continuous CCA one assumes that the discrete space–time fields xk
and yk , k = 1, . . . n, are replaced by continuous curves xk (t) and yk (t), k = 1, . . . n,
and t is now a continuous parameter, in some finite interval T. For simplicity we
suppose that the curves have been centred to have zero mean, i.e. Σ_{k=1}^{n} x_k(t) =
Σ_{k=1}^{n} y_k(t) = 0, for all values of t within the above interval. Linear combinations⁵
of x_k(t) using, for example, a curve or a continuous function a(t) now take the form
of an integral, i.e.
$$\langle a, x_k\rangle = \int_T a(t)\,x_k(t)\,dt.$$
In the following we suppose that x(t) and y(t) are two random functions and that
x_k(t) and y_k(t), k = 1, ..., n, are two finite sample realisations (of length n) drawn
from x(t) and y(t), respectively.⁶ The covariance between ⟨a, x⟩ and ⟨b, y⟩
is given by
$$E\left[\langle a, x\rangle\langle b, y\rangle\right] = \int_T\!\!\int_T a(t)\,E\left[x(t)y(s)\right]b(s)\,dt\,ds = \int_T\!\!\int_T a(t)\,S_{xy}(t,s)\,b(s)\,dt\,ds, \qquad (15.23)$$
where S_xy(t, s) = E[x(t)y(s)] represents the cross-covariance between x(t) and
y(s). The sample estimate of this cross-covariance is defined in the usual way, i.e.
Ŝ_xy(t, s) = n^{-1} Σ_{k=1}^{n} x_k(t) y_k(s), so that the sample covariance of the linear combinations
is ∫_T∫_T a(t) Ŝ_xy(t, s) b(s) dt ds. A similar expression can also be obtained for the
remaining covariances, i.e. Ŝ_yx(t, s), Ŝ_xx(t, s) and Ŝ_yy(t, s).
Remarks
• The functions Sxx (t, s) and Syy (t, s) are symmetric functions and similarly for
their sample estimates.
• Sxy (t, s) = Syx (s, t)
• Note that by comparison to the standard statistics of covariances the index k
plays the role of “time” or sample (realisation) as pointed out earlier, whereas
the variables t and s in the above integral mimic “space” or variable.
In a similar manner to the standard discrete case, the objective of functional CCA
is to find functions a(t) and b(t) such that the cross-correlation between the linear
combinations ⟨a, x_k⟩ and ⟨b, y_k⟩ is maximised. The optimisation problem
applied to the population yields
$$\max_{a,b}\ \int_T\!\!\int_T a(t)\,S_{xy}(t,s)\,b(s)\,dt\,ds \qquad (15.24)$$
5 This is like removing the ensemble mean of each field from each curve. Note that t here plays
the role of variables in the discrete case and the index k refers to observations or realisation.
6 In the standard notation of stochastic processes x(t) may be better noted as x(t, ω), where ω
refers to the random part. That is, for fixed ω, i.e. ω = ω0 (a realisation), we get a usual function
x(t, ω0 ) of t, and for fixed t, i.e. t = t0 , we get a random variable x(t0 , ω).
When one deals with continuous or functional data, smoothing becomes a useful
tool to gain some insights into the data and can also ease the interpretation of the
results. An example of a widely known nonlinear smooth surface fitting of a scatter
of data points is spline. For a given scatter of data points, smoothing spline attempts
to minimise a penalised residual sum of squares, using a smoothing parameter that
controls the balance between goodness of fit and smoothness. In general terms, the
smoothing takes the form of an integral of the squared second derivative of the
smoothing function. This smoothing derives from the theory of elastic rods and
is proportional to the energy of the rod when stressed, see Appendix A for more
details.
The smoothing procedure in CCA is similar to the idea of spline smoothing. To
achieve smooth CCA, the constraints (15.25) are penalised by a smoothing condition
taking the following form:
$$\int_T\!\!\int_T a(t)\,S_{xx}(t,s)\,a(s)\,dt\,ds + \alpha\int_T\left[\frac{d^2}{dt^2}a(t)\right]^2 dt = \int_T\!\!\int_T b(t)\,S_{yy}(t,s)\,b(s)\,dt\,ds + \alpha\int_T\left[\frac{d^2}{dt^2}b(t)\right]^2 dt = 1, \qquad (15.27)$$
where α is a smoothing parameter and is also unknown, see also Ramsay and
Silverman (2006).
To solve the optimisation problem (15.24) subject to the smoothing constraints (15.27), a few assumptions on the regularity of the functions involved are
required. To ease things, one considers the notations ⟨a, b⟩₁ = ∫_T a(t)b(t)dt for
the natural scalar product between smooth functions a() and b(), and ⟨a, b⟩_S =
∫_T∫_T a(t)S(t,s)b(s)ds dt as the “weighted” scalar product between a() and b().
The functions involved, as well as their mth derivatives, m = 1, ..., 4, are supposed
to be square integrable over the interval T. It is also required that the functions
and their first four derivatives have periodic boundary conditions, i.e. if T = [τ₀, τ₁],
then d^α a/dt^α(τ₀) = d^α a/dt^α(τ₁), α = 1, ..., 4. With these conditions, we have the following
result.
Theorem The solutions a(t) and b(t) of the problem (15.24), (15.27) satisfy the integro-differential equations (15.28) derived below.
Proof Outline 1 We present here an outline of the proof using arguments from the
calculus of variations (e.g. Gupta 2004). The general approach used in the calculus of
variations is to assume the solution to be known and then work out the
conditions that it satisfies. Let a(t) and b(t) be the solutions to (15.24) and (15.27);
then for any functions â(t) and b̂(t) defined on T and satisfying the above properties,
the function g(ε₁, ε₂) = ⟨a + ε₁â, b + ε₂b̂⟩_{S_xy} is maximised when ε₁ = ε₂ = 0,
subject, of course, to the constraint (15.27). In fact, letting
$$G(\epsilon_1, \epsilon_2) = g(\epsilon_1, \epsilon_2) - \frac{1}{2}\mu_1\,G_1(a, \hat{a}, \epsilon_1, S_{xx}) - \frac{1}{2}\mu_2\,G_1(b, \hat{b}, \epsilon_2, S_{yy}),$$
where μ₁ and μ₂ are Lagrange multipliers. The necessary conditions for the
maximum of G(ε₁, ε₂), obtained using the gradient ∇G at ε₁ = ε₂ = 0, yield
and this is true for all â and b̂ satisfying the required properties. Now the periodic
boundary conditions imply, using integration by parts,
$$\int_T D^2 a\; D^2\hat{a}\;dt = \int_T \hat{a}\;D^4 a\;dt,$$
where the integration is over T. This result is a direct consequence of the fact that
the operator D² = d²/dt² in the space of the functions satisfying the above properties
is self-adjoint. Therefore the first of the two above equations leads to
$$\int_T\left[\int_T S_{xy}(t,s)\,b(s)\,ds - \mu_1\left(\int_T S_{xx}(t,s)\,a(s)\,ds + \alpha D^4 a(t)\right)\right]\hat{a}(t)\,dt = 0$$
for all functions â() with the required properties, and similarly for the second
equation. Hence the functions a(t) and b(t) are solution to the integro-differential
equations:
$$\int_T S_{xy}(t,s)\,b(s)\,ds = \mu_1\left(\int_T S_{xx}(t,s)\,a(s)\,ds + \alpha D^4 a(t)\right), \qquad \int_T S_{xy}(t,s)\,a(t)\,dt = \mu_2\left(\int_T S_{yy}(s,t)\,b(t)\,dt + \alpha D^4 b(s)\right). \qquad (15.28)$$
7 This is a formal differentiation noted as δa and operates as in the usual case. Note that the
differential δa is also a function of the same type.
Alternatively, using the formal differentials⁷ δa and δb, the stationarity condition reads
$$\int\!\!\int \delta a\,S_{xy}\,b - \mu_1\left(\int\!\!\int a\,S_{xx}\,\delta a + \alpha\int D^2 a\,D^2(\delta a)\right) + \int\!\!\int a\,S_{xy}\,\delta b - \mu_2\left(\int\!\!\int b\,S_{yy}\,\delta b + \alpha\int D^2 b\,D^2(\delta b)\right) = 0,$$
and similarly for the second part. These equalities are satisfied for all perturbations
δa and δb having the required properties. Expanding the integrals using the periodic
boundary conditions yields (15.28).
Application
One practical way to solve the smooth CCA problem is to expand the functions involved in terms of basis functions, see e.g. Ramsay and Silverman (2006).
Basis functions include Fourier and radial basis functions (Appendix A). In the
next section we consider the case of radial basis functions. Here I would like to go
back to the original problem based on (15.24) and (15.27) in the above theorem. We
also consider here the case of different spaces for x and y, hence φ and ψ.
Let us define the following matrices: V = (vij ), A = (aij ), B = (bij ), C = (cij )
and D = (dij ) given, respectively, by vij =< φi , Sxy ψj >, aij =< φi , Sxx φj >,
bij =< D 2 φi , D 2 φj >, cij =< ψi , Syy ψj > and dij =< D 2 ψi , D 2 ψj >,
where the notation <, > refers to the natural scalar product. Then the optimisation
problem (15.24), (15.27) can be transformed (see exercises below) to yield the
following generalised eigenvalue problem for the coefficients u = (u1 , . . . , up )T
and v = (v1 , . . . , vq )T :
$$\begin{pmatrix} \mathbf{V}^T & \mathbf{O} \\ \mathbf{O} & \mathbf{V} \end{pmatrix}\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} = \mu\,\begin{pmatrix} \mathbf{O} & \mathbf{C} + \alpha\mathbf{D} \\ \mathbf{A} + \alpha\mathbf{B} & \mathbf{O} \end{pmatrix}\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}.$$
There are various ways to determine the smoothing parameter α, and we discuss this
in the next two sections. In the following exercises we attempt to derive the above
system and propose a simple solution.
Exercise 1
1. Using the above notation associated with the expansion in terms of basis
functions, show that the maximisation problem (15.24) and (15.27) (see also the
above theorem) boils down to
$$\max_{\mathbf{u},\mathbf{v}}\ \mathbf{u}^T\mathbf{V}\mathbf{v} \quad \text{s.t.} \quad \mathbf{u}^T(\mathbf{A} + \alpha\mathbf{B})\mathbf{u} = 1 = \mathbf{v}^T(\mathbf{C} + \alpha\mathbf{D})\mathbf{v}.$$
2. Show that, after a suitable change of variables (e.g. u → (A + αB)^{1/2}u and v → (C + αD)^{1/2}v), this problem reduces to the standard SVD problem
$$\max\ \boldsymbol{\alpha}^T\mathbf{E}\boldsymbol{\beta} \quad \text{s.t.} \quad \boldsymbol{\alpha}^T\boldsymbol{\alpha} = 1 = \boldsymbol{\beta}^T\boldsymbol{\beta}.$$
Maximum Covariance
Smooth maximum covariance analysis (SMCA) is similar to SCCA except that the
constraint conditions (15.27) are reduced to
$$\|a\|^2 + \alpha\,\|D^2 a\|^2 = \|b\|^2 + \alpha\,\|D^2 b\|^2 = 1. \qquad (15.29)$$
The optimisation problem (15.24) subject to (15.29) yields the following system of
integro-differential equations:
$$\int_T S_{xy}(t,s)\,b(s)\,ds = \mu\left(1 + \alpha D^4\right)a(t), \qquad \int_T S_{xy}(t,s)\,a(t)\,dt = \mu\left(1 + \alpha D^4\right)b(s). \qquad (15.30)$$
where
$$K(s,t) = \begin{pmatrix} 0 & S_{xy}(s,t) \\ S_{xy}(t,s) & 0 \end{pmatrix}.$$
We now suppose that we have two continuous space–time fields F (x, tk ) and
G (y, tk ), observed at times tk , k = 1, . . . n, where x and y represent spatial
locations. We also suppose that x and y vary in two spatial domains Dx and Dy ,
respectively. The covariance function between fields F and G (with zero mean) at x
and y is given by
$$S(x, y) = \frac{1}{n}\sum_{k=1}^{n} F(x, t_k)\,G(y, t_k). \qquad (15.32)$$
The objective is to find (continuous) spatial patterns (functions) u(x) and v(y)
maximising the integrated covariance
$$\int_{D_x}\!\!\int_{D_y} u(x)\,S(x, y)\,v(y)\,dx\,dy, \qquad (15.33)$$
subject to smoothness constraints analogous to (15.29), in which the second derivative is replaced by the Laplacian ∇².
To solve (15.33), one can apply, for example, the method of expansion in terms of
radial basis functions (RBFs), see Appendix A. For global fields over the spherical
earth, one can use spherical RBFs, and one ends up with a generalised eigenvalue
problem similar to that presented in the application in Sect. 15.7.3. This will be
discussed further in more detail in the next chapter in connection with smooth EOFs.
An alternative (easy) method is to discretise the left hand side of the system. In
practice, the fields are provided by their respective (centred) data matrices X =
(x_tk), t = 1, ..., n, k = 1, ..., p, and Y = (y_tj), t = 1, ..., n, j = 1, ..., q. The
cross-covariance matrix is then approximated by
$$\mathbf{S}_{xy} = \frac{1}{n}\mathbf{X}^T\mathbf{Y},$$
where the fields are supposed to have been centred. The objective is then to find
patterns a = (a_1, ..., a_p)^T and b = (b_1, ..., b_q)^T satisfying
$$\mathbf{S}_{xy}^T\,\mathbf{a} = \mu\left(\mathbf{I}_q + \alpha\mathbf{D}^4\right)\mathbf{b}, \qquad \mathbf{S}_{xy}\,\mathbf{b} = \mu\left(\mathbf{I}_p + \alpha\mathbf{D}^4\right)\mathbf{a}, \qquad (15.34)$$
which can be written in the form of a single generalised eigenvalue problem. This
is the “easy” solution and is also discussed further in the next chapter. Note that if one
is interested in smooth CCA, the identity matrices I_p and I_q above are to be replaced
by S_xx and S_yy, respectively. The Laplacian operator ∇² in (15.33) is approximated
using a finite difference scheme. In one dimension, for example, if the real function
a(x) is observed at discrete positions x_1, ..., x_m, then
$$\frac{d^2}{dx^2}a(x_k) \approx \frac{1}{(\delta x)^2}\left[a(x_{k+1}) - 2a(x_k) + a(x_{k-1})\right] = \frac{1}{(\delta x)^2}\left(a_{k+1} - 2a_k + a_{k-1}\right).$$
Once a and b are found, the corresponding smooth functions can be obtained using,
for example, radial basis functions as
$$a(x) = \sum_{k=1}^{p} a_k\,\phi\left(|x - x_k|\right) \quad \text{and} \quad b(y) = \sum_{l=1}^{q} b_l\,\phi\left(|y - y_l|\right), \qquad (15.35)$$
where φ() is a radial basis function. One could also use other smoothing procedures
such as splines or kernel smoothers.
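A minimal sketch of the discretised system (15.34) for one-dimensional fields: build a second-difference matrix, form D⁴ = (D²)ᵀD², and solve the coupled equations as a single generalised symmetric eigenvalue problem (boundary conditions are handled crudely here, and all names are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def second_diff(m, dx=1.0):
    """Matrix approximating d^2/dx^2 on m equally spaced points."""
    return (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
            + np.diag(np.ones(m - 1), -1)) / dx**2

def smooth_mca(X, Y, alpha=10.0, n_modes=2):
    """Discretised smooth MCA, Eq. (15.34), for centred X (n x p) and Y (n x q)."""
    n, p = X.shape
    q = Y.shape[1]
    Sxy = X.T @ Y / n
    D4p = second_diff(p).T @ second_diff(p)
    D4q = second_diff(q).T @ second_diff(q)
    # block system: [[0, Sxy], [Sxy^T, 0]] w = mu [[I + a D4, 0], [0, I + a D4]] w
    L = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
    R = np.block([[np.eye(p) + alpha * D4p, np.zeros((p, q))],
                  [np.zeros((q, p)), np.eye(q) + alpha * D4q]])
    mu, W = eigh(L, R)                        # generalised symmetric-definite eigenproblem
    idx = np.argsort(mu)[::-1][:n_modes]      # leading smooth coupled pairs
    return mu[idx], W[:p, idx], W[p:, idx]    # mu, smooth a patterns, smooth b patterns
```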
In two dimensions, the discretised Laplacian in the plane is approximated by the usual five-point stencil,
$$\nabla^2 a(x_i, y_j) \approx \frac{a_{i+1,j} - 2a_{i,j} + a_{i-1,j}}{(\delta x)^2} + \frac{a_{i,j+1} - 2a_{i,j} + a_{i,j-1}}{(\delta y)^2}.$$
In spherical geometry where λ and ϕ are the longitudinal and latitudinal coordinates,
i.e. x = r cos ϕ cos λ, y = r cos ϕ sin λ and z = r sin ϕ, the Laplacian takes the
form:
$$\nabla^2 u = \frac{1}{r^2}\frac{\partial^2 u}{\partial\varphi^2} + \frac{1}{r^2\cos^2\varphi}\frac{\partial^2 u}{\partial\lambda^2} - \frac{\tan\varphi}{r^2}\frac{\partial u}{\partial\varphi},$$
Note that the two fields need not be defined over the same set of variables. For example, x_t and y_t may be observed at different grid points and from
different regions, e.g. hemispheric vs. regional, as in the example when hemispheric
SLP and tropical SST fields are used. Consequently, in (15.34) the Laplacian with
respect to x and y may be denoted, for example, by D1 and D2 , respectively.
The choice of the smoothing parameter α is slightly more complicated. The
experimentalist can always choose α based on previous experience. Another more
efficient approach is based on cross-validation, see e.g. chap. 7 of Ramsay and Sil-
verman (2006). The cross-validation procedure, introduced originally in statistical
estimation problems, can be extended in the same way to deal with SMCA. Letting
z_t^T = (x_t^T, y_t^T), we decompose the field z_t using the eigenvectors u_j = (a_j^T, b_j^T)^T
obtained from the generalised eigenvalue problem (15.34) and then compute the
“residuals”⁹ ε_t = z_t − Σ_{j=1}^{m} β_j u_j. If u_j^{(k)}, j = 1, ..., m, are the eigenvectors
obtained after removing the kth observation, and ε_{t,k} are the resulting residuals,
then the cross-validation score is computed as
$$C_v(\alpha) = \frac{1}{n-1}\sum_{k=1}^{n}\sum_{t=1}^{n}\mathrm{tr}\left[\boldsymbol{\varepsilon}_{t,k}(\alpha)\,\boldsymbol{\varepsilon}_{t,k}^T(\alpha)\right] = \frac{1}{n-1}\sum_{k=1}^{n}\sum_{t=1}^{n}\boldsymbol{\varepsilon}_{t,k}^T\,\boldsymbol{\varepsilon}_{t,k}.$$
$$\mathbf{X} = \mathbf{v}\,\mathbf{a}^T + \boldsymbol{\varepsilon}_X, \qquad \mathbf{Y} = \mathbf{u}\,\mathbf{b}^T + \boldsymbol{\varepsilon}_Y, \qquad (15.36)$$
9 The residuals here do not have the same meaning as those used to construct EOFs via
minimisation. Here instead, these residuals are used as an approximation to compute the “mis-
fit”.
Smooth MCA can be used to derive a new set of EOFs, smooth EOFs. These
represent, in fact, a particular case of smooth MCA where the spatial fields are
identical. This is discussed further in the next chapter.
15.8 Some Points on Coupled Patterns and Multivariate Regression

Consider the multivariate regression model relating the predictand y to the predictor x,
$$\mathbf{y} = \mathbf{A}\,\mathbf{x} + \boldsymbol{\varepsilon}, \qquad (15.37)$$
where the bracket stands for the expectation operator, i.e. < . >= E(.). The least
squares fitted value of the predictand is given by ŷ = ⟨y xᵀ⟩⟨x xᵀ⟩⁻¹ x. Note that in (15.40),
X and Y are assumed to be scaled by √(n − 1). In many instances
one usually transforms one or both data sets using a linear transformation for various
purposes, such as reducing the number of variables, and one would like
to identify the regression matrix of the transformed variables. If, in model (15.37),
x and y are replaced, respectively, by x' = Lx and y' = My, where L and M are
two matrices, then the model becomes
$$\mathbf{y}' = \mathbf{A}'\,\mathbf{x}' + \boldsymbol{\varepsilon}'. \qquad (15.41)$$
Using the data matrices X' = LX and Y' = MY, the new regression matrix is
$$\mathbf{A}' = \mathbf{Y}'\mathbf{X}'^T\left(\mathbf{X}'\mathbf{X}'^T\right)^{-1} = \mathbf{M}\,\mathbf{Y}\mathbf{X}^T\mathbf{L}^T\left(\mathbf{L}\mathbf{X}\mathbf{X}^T\mathbf{L}^T\right)^{-1}. \qquad (15.42)$$
• When L is invertible, Eq. (15.42) reduces to
$$\mathbf{A}' = \mathbf{M}\mathbf{A}\mathbf{L}^{-1}. \qquad (15.43)$$
• We also get, via Eqs. (15.37) and (15.41), a relationship between the sum of
squared errors:
The last equality in (15.45) means, in particular, that both the squared error
(y − ŷ)^T(y − ŷ) and the positive semi-definite quadratic function of the error,
(y − ŷ)^T MM^T (y − ŷ), are minimised.
From the above remarks, it can be seen, by choosing particular forms for the matrix
M, such as rows of the identity matrix, that the error covariance of each variable
separately is also minimised. Consequently the full regression model also embeds
the individual models for the predictand variables (Tippett et al. 2008). By using
the SVD of the regression matrix A, the regression coefficients can be interpreted
in terms of correlation, explained variance, standardised explained variance and
covariance (Tippett et al. 2008). For example, if a whitening transformation is used,
i.e. the scaled PCs of both variables, then (as in the unidimensional case) the
regression matrix simply becomes the correlation matrix between the (scaled)
predictand and predictors.
We now consider the SVD of the matrix A', i.e. A' = USV^T; then U^T Y'
and V^T X' decompose the (pre-whitened) data into time series that are maximally
correlated, and uncorrelated with subsequent ones. In terms of the original variables
the weight vectors satisfy, respectively, Q_x^T X = V^T X' and Q_y^T Y = U^T Y', and are
given by
$$\mathbf{Q}_x = \left(\mathbf{X}\mathbf{X}^T\right)^{-1/2}\mathbf{V} \quad \text{and} \quad \mathbf{Q}_y = \left(\mathbf{Y}\mathbf{Y}^T\right)^{-1/2}\mathbf{U}. \qquad (15.46)$$
Remark The pattern vectors are similar to EOFs (associated with PCs) and satisfy
Px QTx X = X, and similarly for Y. These equations are solved in a least square sense,
see e.g. Tippett et al. (2008). Note that the above condition leads to PTx Qx = I, and
similarly for the predictand variables.
Hence the data are decomposed into patterns with maximally correlated time series,
uncorrelated with subsequent predictor and predictand ones. The regression
matrix is also decomposed as
$$\mathbf{A} = \mathbf{P}_y\,\mathbf{S}\,\mathbf{Q}_x^T.$$
If instead one decomposes the cross-covariance matrix
$$\mathbf{A} = \mathbf{Y}\mathbf{X}^T,$$
which represents the covariances between predictand and predictors, one obtains MCA.
Tippett et al. (2008) applied different transformation (or filtering) to a statistical
Fig. 15.3 CCA, RDA and MCA within the α–β plane of scaled SVD
downscaling problem of precipitation over Brazil using a GCM. They found that
CCA provided the best overall results based on correlation as a measure of skill.
Remark MCA, CCA and RDA can be brought together in a unified-like approach
through the scaled SVD (Swenson 2015). If X = U_x S_x V_x^T is an SVD of the data
matrix X, then the data are scaled as
$$\mathbf{X}^* = \mathbf{U}_x\,\mathbf{S}_x^{\alpha-1}\mathbf{V}_x^T\,\mathbf{X} = \mathbf{U}_x\,\mathbf{S}_x^{\alpha}\,\mathbf{V}_x^T,$$
and similarly for Y (with exponent β). The scaled SVD is then obtained by applying the SVD to the
cross-covariance matrix X* Y*^T. It can be seen that CCA, MCA and RDA can be
recovered from the scaled SVD by using, respectively, α = β = 0, α = β = 1 and
α = 0, β = 1 (Fig. 15.3). Swenson (2015) points out that other intermediate values
of 0 ≤ α, β ≤ 1 can isolate coupled signals better. This is discussed in more detail
in the next chapter.
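A hedged sketch of the scaled SVD just described: each field is rescaled through its own SVD with exponent α (or β), and MCA is applied to the rescaled fields. The fields are taken here as time-by-space arrays and assumed full rank; all names are illustrative.

```python
import numpy as np

def scale_field(X, alpha):
    """Return X* = U S^alpha V^T from the SVD X = U S V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s**alpha) @ Vt

def scaled_svd(X, Y, alpha=0.5, beta=0.5, n_modes=3):
    """Scaled SVD: alpha=beta=0 ~ CCA, alpha=beta=1 ~ MCA, alpha=0, beta=1 ~ RDA."""
    Xs, Ys = scale_field(X, alpha), scale_field(Y, beta)
    U, gamma, Vt = np.linalg.svd(Xs.T @ Ys, full_matrices=False)
    return gamma[:n_modes], U[:, :n_modes], Vt.T[:, :n_modes]
```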
Chapter 16
Further Topics
Abstract This chapter describes a number of further methods that have been
developed and applied to weather and climate. They include random projection,
which deals with very large data size; trend EOFs, which finds trend patterns in
gridded data; common EOFs, which identifies common patterns between several
fields; and archetypal analysis, which finds extremes in gridded data. The chapter
also discusses other methods that deal with nonlinearity.
16.1 Introduction
The research in multivariate data analysis has led to further development in various
topics in EOF analysis. Examples of such development include EOFs of large
datasets or data containing quasi-stationary signals. Also, sometimes we seek
to identify trends from gridded climate data without resorting to simple linear
regression. Another example includes the case when we seek to compute, for
instance, common EOFs from different groups of (similar) datasets.
Computer power has witnessed lately an unprecedented explosion, which
impacted on different branches of science. In atmospheric science climate modelling
has increased in complexity, which led to the possibility of running climate models
at a high resolution. Datasets with large spatial and/or temporal resolution are being
produced currently by various weather and climate centres across the world. This
has led to the need for ways to analyse these data efficiently. Projection methods
can be used to address this problem. When the data contain quasi-stationary signals
then results from the theory of cyclo-stationary processes can be applied yielding
cyclo-stationary EOFs. Trend EOF analysis is another method that can be used
to identify trend patterns from spatio-temporal climate data. Also, when we have
several groups of (similar) datasets, common EOF analysis provides a natural framework for comparing them.

16.2 EOFs and Random Projection
EOF analysis has proved to be an easy and cheap way to reduce the dimension of
climate data retaining only a small set of the leading modes of variability usually
explaining a substantial amount of variance. This is particularly the case when the
size of the data matrix X is not too high, e.g. O(103 ). Advances in high performance
computers has led to the recent explosion in the volume of data from climate model
simulations, which beg for analysis tools. In particular, dimensionality reduction is
required in order to handle and analyse these climate simulations.
There are various ways to reduce the dimension of a data matrix. Perhaps the
most straightforward method is that based on “random projection” (RP). In simple
terms, RP is based on some sort of “sampling”. Precisely, given a n × p data matrix
X, RP is based on constructing a p × k data matrix R (k < p) referred to as random
projection matrix then projecting X onto R, i.e.
P = XR. (16.1)
By choosing k much smaller than p, the new n × k data matrix P becomes much
smaller than X where EOF analysis, or any other type of pattern identification
method, can be applied much more efficiently. Note that the “rotation matrix” is
approximately orthogonal because the vectors are drawn randomly. It can, however,
be made exactly orthogonal but this will be at the expense of saving memory and
CPU time.
Random projection takes its origin from the so-called Johnson and Lindenstrauss
(1984) lemma (see also Dasgupta and Gupta 2003):
Johnson-Lindenstrauss Lemma
Given an n × p data matrix X = (x_1, x_2, ..., x_n)^T, for any ε > 0 and integer
k > O(log n/ε²), there exists a mapping f: R^p → R^k such that for any 1 ≤ i, j ≤ n,
we have
$$(1-\varepsilon)\,\|\mathbf{x}_i - \mathbf{x}_j\|^2 \le \|f(\mathbf{x}_i) - f(\mathbf{x}_j)\|^2 \le (1+\varepsilon)\,\|\mathbf{x}_i - \mathbf{x}_j\|^2. \qquad (16.2)$$
The message from the above lemma is that it is always possible to embed the data
into a lower dimensional space such that the interpoint distance is conserved up to
any desired accuracy. One way to construct such a mapping is to generate random
vectors that make up the rows of the projection matrix R. Seitola et al. (2014) used
the standard normal distribution N (0, 1) and normalised the random row vectors of
R to have unit-length. Other distributions have also been used, see e.g. Achlioptas
(2003), and Frankl and Maehara (1988). Refinement of the lower limit value of the
dimension k, provided in the lemma, was also provided by a number of authors.
For example, the values k = 1 + 9(ε² − 2ε³/3)^{-1} log n and k = 4(ε²/2 − ε³/3)^{-1} log n
were provided, respectively, by Frankl and Maehara (1988) and Dasgupta and Gupta
(2003).
Seitola et al. (2014) applied EOF analysis to the randomly projected data. They
reduced the data volume down to 10% and 1% of the original volume, and recovered
the spatial structures of the modes of variability and their associated PCs. Let the
SVD of X be X = USV^T, where U and V contain the PCs and EOFs of
the full data matrix X, respectively. When the spatial dimension p is to be reduced,
one first obtains an approximation of the PCs U of X by taking the PCs of the
reduced/projected matrix P, i.e. U ≈ U_pr, where P = U_pr S_pr V_pr^T. The EOFs of X
are then approximated by projecting the PCs of the projected matrix P onto the data
matrix X, i.e.
$$\mathbf{V} \approx \mathbf{X}^T\,\mathbf{U}_{pr}\,\mathbf{S}_{pr}^{-1}. \qquad (16.3)$$
When the time dimension is to be reduced the same procedure can be applied to XT
using P = XT R where R is now n × k.
Remark Note that random projection can be applied only to reduce one dimension
but not both. This, however, is not a major obstacle since it is sufficient to reduce
only one dimension.
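A minimal sketch of the random-projection EOF approximation described above, reducing the spatial dimension with a Gaussian projection matrix whose random row vectors are normalised to unit length (names illustrative):

```python
import numpy as np

def rp_eofs(X, k, n_modes=10, seed=0):
    """Approximate leading EOFs/PCs of X (n x p) from a random projection to k columns."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    R = rng.standard_normal((p, k))
    R /= np.linalg.norm(R, axis=1, keepdims=True)   # unit-length random row vectors
    P = X @ R                                       # projected data matrix, Eq. (16.1)
    U, s, _ = np.linalg.svd(P, full_matrices=False)
    U, s = U[:, :n_modes], s[:n_modes]
    V_approx = X.T @ U / s                          # Eq. (16.3): V ~ X^T U_pr S_pr^{-1}
    return V_approx, U                              # approximate EOFs and PCs
```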
Seitola et al. (2014) applied the procedure to monthly surface temperature from
a millennial Earth System Model simulation (Jungclaus 2008) using two cases with
n × p = 4608 × 4607 and n × p = 4608 × 78336. They compared the results
obtained using 10% and 1% reductions. An example of plot of EOF patterns of
the original and reduced data is shown in Fig. 16.1. Figure 16.2 shows the (spatial)
correlation between the EOFs of the original data and those approximated from the
10% (top) and 1% (bottom) reduction of the spatial dimension. Clearly, the leading
EOFs are well reproduced even with a 1% reduction. Figure 16.3 compares the
spectra of the PCs from the original data and those from 10% and 1% reduction,
respectively. The main peaks associated with periods 1, 1/2, 1/3 and 1/4 yr are very
well reproduced in the PCs of the projected data. They also obtained similar results
with the reduction of the 4608 × 78336 data matrix. When the time dimension is
reduced, Seitola et al. (2014) find similar results to those obtained by reducing the
spatial dimension. The random projection was also extended to delay coordinates
yielding randomised multichannel SSA by Seitola et al. (2015). They applied it
to the twentieth century reanalysis data and two climate model (HadGEM2-ES and
MPI-ESM-MR) simulations from the CMIP5 data archive. They found, in particular,
that the 2–6 year timescale variability in the central Pacific was well captured by
HadGEM2.
Fig. 16.1 Ninth to twelfth EOF patterns obtained from the original model simulation (left), and from the randomly projected data with 10% (middle) and 1% (right) reduction. Adapted from Seitola et al. (2014)
16.3 Cyclo-stationary EOFs

16.3.1 Background
Fig. 16.2 Spatial correlations between the original EOFs and RP with 10% (top) and 1% (bottom)
reduction. Adapted from Seitola et al. (2014)
Fig. 16.3 Spectra of the 10 leading PCs of the original data (a), and of the randomly projected data with 10% (b) and 1% (c) reduction. Adapted from Seitola et al. (2014)
In its simplest form cyclo-stationary EOF analysis bears some similarities with
CSPOP analysis in the sense that the variables have two distinct time scales
representing, respectively, the cycle and the nested time within the cycle, as in
the theory of cyclo-stationary processes encountered in signal processing (Gardner
and Franks 1975; Gardner 1994; Gardner et al. 2006). Perhaps the main difference
between conventional EOFs and CSEOFs is that the loading patterns (or EOFs) in
the latter method depend on time and are periodic.
If T is the nested period, then the field X(x, t) is decomposed as
$$X(x,t) = \sum_{k} E_k(x,t)\,v_k(t). \qquad (16.4)$$
The cyclo-stationary loading vectors, i.e. the CSEOFs, are obtained as the solution
of a Karhunen–Loève equation (Loève 1978):
$$\int\!\!\int K(x,t;\,x',t')\,E(x',t')\,dt'\,dx' = \lambda\,E(x,t). \qquad (16.6)$$
Theoretically speaking, the periodicity in t × t', Eq. (16.8), implies that K(·) can
be expanded using double Fourier series in t × t', and the CSEOFs can then be
computed in spectral space and back-transformed into physical space.
This is feasible in simple examples such as the unidimensional case of Kim et al.
(1996), who constructed CSEOFs using Bloch's wave functions encountered in solid
state physics. The CSEOFs are expressed in terms of functions U_nm(·), which are
periodic with period T and can be expanded as
$$U_{nm}(t) = \sum_{k} u_{nmk}\,e^{2\pi i k t/T}. \qquad (16.10)$$
The coefficients unmk are obtained by solving an eigenvalue problem involving the
cyclic spectrum.
The application to weather and climate fields is, however, very expensive in terms
of memory and CPU time. An approximate solution was suggested by Kim and
North (1997), based on the assumption of independence of the PCs. The method is
based on the Fourier expansion of the field X(x, t):
$$X(x,t) = \sum_{k=0}^{T-1} a_k(x,t)\,e^{2\pi i k t/T}. \qquad (16.11)$$
The CSEOFs are then obtained as the eigenvectors of the covariance matrix of the
extended data χ(x, t), t = 1, ..., N, built from the Fourier coefficient fields a_k(x, t).
CSEOFs have been applied to various atmospheric fields. Hamlington et al. (2014),
for example, argue that CSEOFs are able to minimise mode mixing, a common
problem in conventional EOFs. Hamlington et al. (2011, 2014) applied CSEOFs
to reconstructed sea level. They suggest, using a nested period of 1 year, that
CSEOF analysis is able to extract the modulated annual cycle and the ENSO
signals from the Archiving, Validation and Interpretation of Satellite Oceanographic
(AVISO) altimetry data. Like many other methods, CSEOF analysis has been used
in forecasting (Kim and North 1999; Lim and Kim 2006) and downscaling (Lim et
al. 2010).
Kim and Wu (1999) conducted a comparative study between CSEOF analysis
and other methods based on EOFs and related techniques including extended EOFs,
POPs and cyclo-stationary POPs. Their study suggests that CSEOF analysis is quite akin to
extended EOFs where the lag is not unity, as in extended EOFs, but is equal to the
nested period T. Precisely, the extended data take the form
$$\boldsymbol{\chi}_t = \left[\mathbf{X}(\cdot, t)^T, \mathbf{X}(\cdot, t+T)^T, \ldots\right]^T,$$
i.e. the field together with copies lagged by multiples of the nested period T.
16.4 Trend EOFs

16.4.1 Motivation
The original context of EOFs (Obukhov 1947; Fukuoka 1951; Lorenz 1956) was to
achieve a decomposition of a continuous space-time field X(t, s), such as sea level
pressure, where t and s denote, respectively, time and spatial location, as
$$X(t, s) = \sum_{k=1}^{M} c_k(t)\,a_k(s), \qquad (16.16)$$
using an optimal set of basis functions of space ak () and expansion functions of time
ck (). As it was discussed in Chap. 3 the EOF method has some useful properties,
e.g. orthogonality in space and time. These properties yield, however, a number of
difficulties, as it is discussed in Chap. 3, such as:
1 In practice, of course, EOFs of any data, not necessarily Gaussian, can be computed. But the point
The trend EOF method was introduced as a way to find trend patterns from gridded
data through overcoming the drawbacks of EOFs by addressing somehow the
difficulties listed above, see Hannachi (2007) for details. In essence, the method uses
some concepts from rank correlation: it is based on the rank correlation between the
time positions of the sorted data.
Precisely, let x_1, x_2, ..., x_p, where x_k = (x_{1k}, ..., x_{nk})^T, k = 1, ..., p, be the p
variables, i.e. the p time series, of our spatio-temporal field or data
matrix X = [x_1, ..., x_n]^T = (x_ij), i = 1, ..., n, j = 1, ..., p. For each k, 1 ≤ k ≤ p,
we also designate by p_{1k}, p_{2k}, ..., p_{nk} the ranks of the corresponding kth time
series x_{1k}, ..., x_{nk}. The matrix of rank correlations is obtained by constructing first
the new variables:
for some permutation q_k() of {1, 2, ..., n}. It can be seen (see the appendix in
Hannachi (2007)) that this permutation is precisely the reciprocal of the rank
permutation p_k(), i.e. q_k = p_k^{-1},
where p_k^{(m)} = p_k ∘ p_k ∘ ... ∘ p_k = p_k(p_k(...(p_k))...) denotes the mth iterate of the
permutation p_k().
As an illustration consider again the previous simple 5-element time series. The
new transformed time series is obtained by sorting first x to yield (−3, 0, 1, 2, 5).
Then, z consists of the (time) position of these sorted elements as given in the
and we are looking for correlations (or covariances) between (time) positions from
the sorted data, i.e. between the new variables z_k and z_l, for k, l = 1, 2, ..., p. The
trend EOFs are then obtained as the “EOFs/PCs” of the newly obtained covariance
matrix, which is also identical to the correlation matrix (up to a multiplicative
constant):
$$\mathbf{T} = \left(\rho_T(\mathbf{x}_k, \mathbf{x}_l)\right) = \frac{1}{n}\,\mathbf{Z}^T\mathbf{H}^T\mathbf{H}\mathbf{Z}, \qquad (16.23)$$
Fig. 16.4 Time series of the first variables simulated from Eq. (16.24) (first row), PCs 1, 2 and 4
(second row) and the leading three trend PCs (third row). (a) Time series wt . (b) Time series xt . (c)
Time series yt . (d) PC1. (e) PC2. (f) PC4. (g) Trend PC1. (h) Trend PC2. (i) Trend PC3. Adapted
from Hannachi (2007)
The TEOF method was illustrated using simple examples by Hannachi (2007),
as shown in Fig. 16.4. The first row in Fig. 16.4 shows an example of time series
from the following 4-variable model containing a quadratic trend plus a periodic
wave contaminated by an additive AR(1) noise:
$$\begin{cases} w_t = 1.8\,a_t + 2\,\beta\,b(t) + 1.6\,\varepsilon_{t1} \\ x_t = 1.8\,a_t + 1.8\,\beta\,b(t) + 2.4\,\varepsilon_{t2} \\ y_t = 0.5\,a_t + 1.7\,\beta\,b(t) + 1.5\,\varepsilon_{t3} \\ z_t = 0.5\,a_t + 1.5\,\beta\,b(t) + 1.7\,\varepsilon_{t4} \end{cases} \qquad (16.24)$$
Fig. 16.5 Histograms of the correlation coefficients between the quadratic trend, Eq. (16.24), and
the correlation-based PCs 1 (a), 2 (b) and 4 (c), and the trend PCs 1 (d), 2 (e) and 3 (f). Adapted
from Hannachi (2007)
The method was applied to reanalysis data using SLP and 1000-mb, 925-mb and
500-mb geopotential heights from NCAR/NCEP (Hannachi 2007). The application
to the NH SLP anomaly field for the (DJFM) winter months shows a discontinuous
spectrum of the new matrix (Eq. (16.21)) with two well separated eigenvalues and
a noise floor (Fig. 16.6). By contrast, the spectrum of the original data matrix
(Fig. 16.7) is “continuous”, i.e. with no noise floor. It can be seen that the third
eigenvalue is in this noise floor. In fact, Fig. 16.8 shows the third EOF of the new
data matrix, where it is clearly seen that there is no coherent structure, i.e. only
noise and no trend. The leading two trend EOFs, with eigenvalues above the noise floor
(Fig. 16.6), are shown in Fig. 16.9. It is clear that the structure of these EOFs is
different from that of the third EOF.
Of course, to obtain the “physical” EOF pattern from those trend EOFs some
form of back-transformation is applied from the space of the transformed data
Fig. 16.8 Third EOF of winter monthly (DJFM) NCEP/NCAR SLP anomalies based on
Eq. (16.23)
Fig. 16.9 As in Fig. 16.8 but for the first (left) and second (right) trend EOFs of SLP anomalies
matrix (Eq. (16.21)) into the original (data) space. Hannachi (2007) applied a simple
regression between the trend PC and the original field to obtain the trend pattern,
but it is possible to apply more sophisticated methods for this back-transformation.
The obtained patterns are quite similar to the same patterns obtained based on
winter monthly (DJF) SLP anomalies (Hannachi 2007). These leading two patterns
associated with the leading two eigenvalues are shown in Fig. 16.10, and reveal,
respectively, the NAO pattern as the first trend (Fig. 16.10a), and the Siberian
high as the second trend (Fig. 16.10b). The latter is known to have a strong trend in
Fig. 16.10 The two leading trend modes obtained by projecting the winter SLP anomalies onto
the first and second EOFs of Eq. (16.23) then regressing back onto the same anomalies. (a) Trend
mode 1 (SLP). (b) Trend mode 2 (SLP). Adapted from Hannachi (2007)
wintertime, a trend that is not found in the mid- or upper troposphere, see e.g.
Panagiotopoulos et al. (2005).
The application to geopotential height shows a somewhat different signature from
that obtained from the SLP. At 500-mb, for example, there is one single eigenvalue
separated from the noise floor. The pattern of the trend represents the NAO.
However, the eigenspectrum at 925-mb and 1000-mb yield two eigenvalues well
separated from the noise floor. The leading one is associated with the NAO as above,
but the second one is associated with the Siberian high, a well-known surface
feature (Panagiotopoulos et al. 2005). The method was also applied to identify
the trend structures of global sea surface temperature by Barbosa and Andersen
(2009), regional mean sea level (Losada et al. 2013), and latent heat fluxes over
the equatorial and subtropical Pacific Ocean (Li et al. 2011a). The method was also
applied to other fields involving diurnal and seasonal time scales by Fischer and
Paterson (2014). They used the trend EOFs in combination with a linear trend model
for diurnal and seasonal time scales of Vinnikov et al. (2004). The TEOF method
was extended by Li et al. (2011b) to identify coherent structures between two fields,
and applied it to global ocean surface latent heat fluxes and SST anomalies.
16.5.1 Background
EOF (or PC) analysis deals with one data matrix, and as such it is also known as
one-sample method. The idea then arose to extend this to two- or more samples,
which is the objective of common EOF (or PC) analysis. The idea of using two-
sample PCA goes back to Krzanowski (1979) who computed the angles between the
leading EOFs for each group. Flury (1983) extended the two-sample EOF analysis
by computing the eigenvectors of Σ_1^{-1}Σ_2, where Σ_k, k = 1, 2, is the covariance
matrix of the kth group, to obtain, simultaneously, uncorrelated variables in the two
groups. Common PC (or EOF) analysis was suggested back in the late 1980s with
Flury (1984, 1988).
Given a number of groups, or population, common EOFs arise when we think
that the covariance matrices of these groups may have the same EOFs but with
different weights (or importance) in different groups. Sometimes this condition
is relaxed in favour of only a subset of similar EOFs with different weights in
different groups. This problem is quite common in weather and climate analysis.
Comparing, for example, the large scale flow patterns from a number of climate
model simulations, such as those from CMIP5, is a common problem in weather and
climate research. The basic belief underlying these models is that they may have the
same modes of variability, but with different prominence, e.g. explained variance,
in different models. This can also be used to compare different reanalysis products
from various weather centres, such as the National Center for Environmental
Prediction (NCEP), and the National Center for Atmospheric Research or the Japan
Meteorological Agency. Another direction where common EOF analysis can be
explored is the analysis of ensemble forecast. This can help identify any systematic
error in the forecasts.
The problem of common EOF analysis consists of finding common EOFs of a
set of M data matrices X1 , . . . , XM , where Xk , k = 1, . . . , M, is the nk × p data
matrix of the kth group. Note that data matrices can have different sample sizes.
The problem of common EOFs emerges naturally in climate research, particularly
in climate model evaluation. In CMIP5, for example, one important topic is often
to seek a comparison between the modes of variability of large scale flow of
different models. The way this is done in climate research literature is via a
simple comparison between the (conventional) EOFs of the different climate model
simulations. One of the main weaknesses of this approach is that the EOFs tend to be
model-dependent, leading to difficulties of comparison. It would be more objective
if the modes of variability are constructed on a common ground, and a natural way
to do this is via common EOFs.
Denoting by $\boldsymbol{\Sigma}_k$ the covariance matrix of the $k$th group, the common EOF hypothesis reads
$$H_c: \quad \mathbf{A}^T \boldsymbol{\Sigma}_k \mathbf{A} = \boldsymbol{\Lambda}_k, \qquad k = 1, \ldots, M. \qquad (16.25)$$
The column vectors of $\mathbf{A} = \left[\mathbf{a}_1, \ldots, \mathbf{a}_p\right]$ are the common EOFs, and $\mathbf{U}_k = \mathbf{X}_k\mathbf{A}$ are the common PCs.
Remark Note that in the case when the covariance matrices are different there is no
unique way to define the set of within-population EOFs (or PCs), i.e. the common
EOFs.
Note, however, that unlike conventional EOFs, the diagonal elements of $\boldsymbol{\Lambda}_k$, $k = 1, \ldots, M$, need not have the same order, and may not be monotonic simultaneously
for the different groups.
The common PCA model (16.25) appeared in Flury (1988) as a particular model
among five models with varying complexity. The first model (level-1) is based on the
equality of all covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$. The second level model assumes that the covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$, are proportional to $\boldsymbol{\Sigma}_1$. The third model
is precisely Eq. (16.25), whereas in the fourth model only a subset of EOFs are
common eigenvectors, hence partial common PC model. The last model has no
restriction on the covariance matrices. Those models are described in Jolliffe (2002).
Flury (1984) estimated the common EOFs from Eq. (16.25) using maximum
likelihood based on the normality assumption $N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, $k = 1, \ldots, M$, of the $k$th $p$-variate random vector generating the data matrix $\mathbf{X}_k$, $k = 1, \ldots, M$. Letting
$$L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_M) = \alpha \prod_{k=1}^{M} \frac{1}{|\boldsymbol{\Sigma}_k|^{n_k/2}} \exp\left[-\frac{n_k}{2}\,\mathrm{tr}\left(\boldsymbol{\Sigma}_k^{-1}\mathbf{S}_k\right)\right], \qquad (16.26)$$
Flury (1984) computed the likelihood ratio LR to test the hypothesis Hc , see
Eq. (16.25):
$$LR = -2\log\frac{L(\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_M)}{L(\mathbf{S}_1, \ldots, \mathbf{S}_M)} = \sum_{k=1}^{M} n_k \log\frac{|\hat{\boldsymbol{\Sigma}}_k|}{|\mathbf{S}_k|}, \qquad (16.28)$$
which, under $H_c$, is asymptotically distributed as
$$LR \sim \chi^2_{(M-1)(p-1)/2}. \qquad (16.29)$$
All the methods described above including those mentioned in Jolliffe (2002)
share, however, the same drawback mentioned earlier, related to the lack of simulta-
neous monotonic change of the eigenvalues for all groups. A more precise method
to deal with the problem was proposed later (Trendafilov 2010) by computing the
common EOFs based on a stepwise procedure. The common EOFs are estimated
sequentially one-by-one, allowing hence for a monotonic decrease (or increase) of the eigenvalues of $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, M$, in all groups simultaneously. The method
is similar to the stepwise procedure applied in simplifying EOFs (Hannachi et
al. 2006), and is based on projecting the gradient of the common EOF objective
function onto the orthogonal of the space composed of the common EOFs identified
in the previous step.
The reformulation of the common EOF problem is again based on the likelihood
Eq. (16.26), which takes a similar form, namely:
$$\min_{\mathbf{A}} \sum_{k=1}^{M} n_k \log\left|\mathrm{diag}\left(\mathbf{A}^T\mathbf{S}_k\mathbf{A}\right)\right| \quad \text{subject to} \quad \mathbf{A}^T\mathbf{A} = \mathbf{I}_p. \qquad (16.30)$$
and for $j = 2, \ldots, p$:
$$\left[\left(\mathbf{I}_p - \mathbf{Q}_{j-1}\mathbf{Q}_{j-1}^T\right)\sum_{k=1}^{M}\frac{n_k}{n}\,\frac{\mathbf{S}_k}{\mathbf{a}_j^T\mathbf{S}_k\mathbf{a}_j} - \mathbf{I}_p\right]\mathbf{a}_j = \mathbf{0}. \qquad (16.33)$$
Trendafilov (2010) solved Eqs. (16.32–16.33) using the standard power method
(Golub and van Loan 1996), which is a special case of the more general gradient
ascent method of quadratic forms on the unit sphere (e.g. Faddeev and Faddeeva
1963).
The solution to Eqs. (16.32–16.33) is applied to the NCAR/NCEP monthly SLP
anomalies over the period 1950–2015. The data are divided into 4 seasons to yield
4 datasets for monthly DJF, MAM, JJA and SON, respectively. The data used here
are on a 5◦ × 5◦ grid north of 20◦ N. The objective is to illustrate the common EOFs
from these data, that is to get the matrix of loadings that simultaneously diagonalise
Fig. 16.11 Explained variance of the common EOFs for the winter (dot), spring (circle), summer
(asterisk), and fall (plus)
the covariance matrices of those four datasets. The algorithm is quite fast with four 1080 × 1080 covariance matrices, taking only a few minutes on a MacPro.
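To make the stepwise procedure concrete, the following is a minimal numpy sketch of the sequential estimation in the spirit of Trendafilov (2010), iterating the fixed-point form of Eq. (16.33) with a projection onto the orthogonal complement of the previously found common EOFs; the function name, initialisation and stopping rule are illustrative choices of this sketch rather than the published algorithm.

import numpy as np

def common_eofs(S_list, n_list, n_comp, n_iter=500, tol=1e-10):
    """Stepwise common EOFs of several p x p covariance matrices S_k
    (sketch of a projected power-type iteration, cf. Eq. 16.33)."""
    p = S_list[0].shape[0]
    n_tot = float(sum(n_list))
    A = np.zeros((p, n_comp))                 # common EOFs stored column-wise
    for j in range(n_comp):
        Q = A[:, :j]                          # EOFs found in previous steps
        P = np.eye(p) - Q @ Q.T               # projector onto their orthogonal complement
        a = P @ np.random.default_rng(j).standard_normal(p)
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            # weighted combination of S_k a / (a' S_k a), cf. Eq. (16.33)
            m = sum((nk / n_tot) * (Sk @ a) / (a @ Sk @ a)
                    for Sk, nk in zip(S_list, n_list))
            a_new = P @ m
            a_new /= np.linalg.norm(a_new)
            if np.linalg.norm(a_new - np.sign(a_new @ a) * a) < tol:
                a = a_new
                break
            a = a_new
        A[:, j] = a
    # spectra of each group in the common basis (proportional to explained variances)
    spectra = [np.diag(A.T @ Sk @ A) for Sk in S_list]
    return A, spectra

Applied to the four seasonal covariance matrices above, the columns of A (approximately) diagonalise the four matrices simultaneously, and the returned spectra give the season-dependent explained variances.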
The results are illustrated in Figs. 16.11 and 16.12. The obtained diagonal matrix
(or spectrum) for each dataset, is transformed into percentage of explained variance
(Fig. 16.11). There is a clear jump in the spectrum between the first and the
remaining modes for the four datasets. The leading common mode explains a little
more variance for spring and summer compared to winter and autumn. The spatial
patterns of the leading 4 common EOFs are shown in Fig. 16.12. The first common
EOF (top left) projects well onto the Arctic oscillation, mainly over the polar region,
with an opposite centre in the North Atlantic centred around 40◦N. Common EOF2
(top right) shows a kind of wavetrain with main centre of action located near North
Russia. The third common EOF (bottom right) reflects more the NAO pattern,
with around 10% explained variance. The NAO is weak in the spring and summer
seasons, which explains the reduced explained variance. The fourth common EOF
shows mainly the North Pacific centre, associated most likely with the Aleutian low
variability. As pointed out earlier, the common EOF approach is more suited,
for example, to the analysis of outputs from comparable GCMs, such as the case of
CMIP models, where the objective is to evaluate and quantify what is common in
those models in terms of modes of variability.
Fig. 16.12 The leading 4 common EOFs of the four datasets, namely (monthly) DJF, MAM, JJA,
and SON NCEP/NCAR sea level pressure anomalies over the period 1950–2015
16.6.1 Background
Conventional CCA, MCA and RDA are standard linear methods that are used to
isolate pairs of coupled patterns from two datasets. They are all based on SVD
and the obtained patterns from the different methods are linked through linear
transformations. In fact, it is possible to view CCA, MCA and RDA within a
unified frame where each of them becomes a particular case. This is obtained
through what is known as partial whitening transformation. Partial whitening, with
degree α, aims at partially decorrelating the variables. This transformation is used
in continuum power regression, which links partial least squares (PLS) regression,
ordinary least squares regression and principal component regression (PCR), e.g.
Stone and Brooks (1990). Swenson (2015) extended continuum power regression to
CCA to get continuum power CCA (CPCCA).
The partial whitening transformation of the data matrix $\mathbf{X}$, with degree $\alpha$, is defined by
$$\mathbf{X}_* = \mathbf{A}_{\alpha,x}\mathbf{X}^T, \qquad (16.34)$$
where
$$\mathbf{A}_{\alpha,x} = \mathbf{C}^{-\frac{1-\alpha}{2}}, \qquad (16.35)$$
and $\mathbf{C}$ is the covariance matrix of $\mathbf{X}$. We suppose here that $\mathbf{C}$ is full rank so that its non-integer power exists.
Remark The standard whitening transformation corresponds to α = 0.
CPCCA patterns $\mathbf{u}$ and $\mathbf{v}$ are obtained via:
$$\max_{\mathbf{u},\mathbf{v}}\ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} \quad \text{s.t.} \quad \mathbf{u}^T\left(\mathbf{X}^T\mathbf{X}\right)^{1-\alpha}\mathbf{u} = \mathbf{v}^T\left(\mathbf{Y}^T\mathbf{Y}\right)^{1-\beta}\mathbf{v} = 1, \qquad (16.36)$$
where $\mathbf{X}_* = \mathbf{A}_{\alpha,x}\mathbf{X}^T$ and $\mathbf{Y}_* = \mathbf{A}_{\beta,y}\mathbf{Y}^T$. As for CCA, the CPCCA patterns (in the partially whitened space) are given by the SVD of $\mathbf{X}_*\mathbf{Y}_*^T$, i.e. $\mathbf{X}_*\mathbf{Y}_*^T = \mathbf{U}_+\mathbf{S}\mathbf{V}_+^T$, and the associated cross-covariance by the diagonal of $\mathbf{S}$. The CPCCA time series are provided by projecting the partially whitened variables onto the singular vectors $\mathbf{U}_+$ and $\mathbf{V}_+$, yielding $\mathbf{T}_x = \mathbf{X}_*^T\mathbf{U}_+$ and $\mathbf{T}_y = \mathbf{Y}_*^T\mathbf{V}_+$. The CPCCA patterns within the original space are obtained by using the inverse of $\mathbf{A}_{\alpha,x}$ and $\mathbf{A}_{\beta,y}$, i.e.
$$\mathbf{U} = \left(\mathbf{X}^T\mathbf{X}\right)^{\frac{1-\alpha}{2}}\mathbf{U}_+ \quad \text{and} \quad \mathbf{V} = \left(\mathbf{Y}^T\mathbf{Y}\right)^{\frac{1-\beta}{2}}\mathbf{V}_+. \qquad (16.38)$$
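As an illustration, here is a short numpy sketch of the CPCCA recipe of Eqs. (16.34)–(16.38): partial whitening of each field, SVD of the cross-covariance of the whitened data, and back-transformation of the patterns. The helper names and the symmetric-eigendecomposition route to the matrix power are assumptions of this sketch, which presumes column-centred, full-rank data matrices.

import numpy as np

def mat_power(C, power):
    """Symmetric matrix power via eigendecomposition (C assumed full rank)."""
    w, V = np.linalg.eigh(C)
    return (V * w**power) @ V.T

def cpcca(X, Y, alpha, beta, n_modes=3):
    """Continuum power CCA sketch following Eqs. (16.34)-(16.38);
    X (n x p) and Y (n x q) are anomaly data matrices."""
    Cx, Cy = X.T @ X, Y.T @ Y
    Xs = mat_power(Cx, -(1 - alpha) / 2) @ X.T       # partially whitened data, Eq. (16.34)
    Ys = mat_power(Cy, -(1 - beta) / 2) @ Y.T
    Up, s, Vpt = np.linalg.svd(Xs @ Ys.T)            # SVD of the whitened cross-covariance
    Up, Vp = Up[:, :n_modes], Vpt.T[:, :n_modes]
    Tx, Ty = Xs.T @ Up, Ys.T @ Vp                    # CPCCA time series
    U = mat_power(Cx, (1 - alpha) / 2) @ Up          # patterns in original space, Eq. (16.38)
    V = mat_power(Cy, (1 - beta) / 2) @ Vp
    return U, V, Tx, Ty, s[:n_modes]

With α = β = 1 the transformation reduces to the identity and the procedure is plain MCA, while α = β = 0 corresponds to full whitening, the CCA-like end of the continuum.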
Fig. 16.13 Fraction of signal variance explained (FSVE) for the leading CPCCA mode in the
partial whitening parameters (α,β) plane for different values of the signal amplitudes a and b
(maximum shown by ‘*’) for a = b = 1(a), 0.75(b), 0.6(c). Also shown are the maxima of the
cross-correlation, squared covariance fraction and the fraction of variance of Y explained by X,
associated, respectively, with CCA, MCA and RDA. Adapted from Swenson (2015). ©American
Meteorological Society. Used with permission
Remark The time series $\mathbf{T}_x$ and $\mathbf{T}_y$ are (cross-) uncorrelated, i.e. $\mathbf{T}_x^T\mathbf{T}_y = \mathbf{S}$. The time series $\mathbf{T}_x$, however, are not uncorrelated.
Partial whitening provides more flexibility through varying the parameters α and
β. CPCCA is similar, in some way, to the partial phase transform, PHAT-β,
encountered in signal processing (Donohue et al. 2007). Partial whitening can be
shown to yield an increase in performance when applied to CCA regarding the S/N
ratio. Figure 16.13 shows the performance of CPCCA as a function of α and β for
an artificial example in which the “common component” is weighted by specific
numbers a and b in the data matrices X and Y, respectively.
Various methods can be used to determine the degree of partial whitening (or
regularisation) parameter α. Perhaps an intuitive approach is to consider the
simultaneous optimisation of $\mathbf{u}^T\mathbf{X}_*\mathbf{Y}_*^T\mathbf{v}$ with respect to $\mathbf{u}$, $\mathbf{v}$, $\alpha$ and $\beta$, where the optimum solution is given by
$$\{\mathbf{u}_o, \mathbf{v}_o, \alpha_o, \beta_o\} = \mathrm{argmax}\ \mathbf{u}^T\mathbf{X}_*\mathbf{Y}_*^T\mathbf{v} \quad \text{s.t.} \quad \mathbf{u}^T\mathbf{u} = \mathbf{v}^T\mathbf{v} = 1. \qquad (16.39)$$
It is possible to solve the above equation numerically. This was used by Salim et al.
(2005) to estimate the smoothing parameter in regularised MCA in addition to the
spatial patterns. They applied the method to analysing the association between the
Irish winter precipitation and sea surface temperature. They found clear association
between Irish precipitation anomalies, El-Niño Southern Oscillation and the North
Atlantic Oscillation. We note, however, that the calculation can be cumbersome.
The other, and common, method is to use cross-validation (CV). The CV is
feasible in practice but requires a relatively extended computation as it is based
on leave-one-out procedure. We explain the procedure here for the conventional
CCA. The application to the CPCCA is similar. In CCA we seek the spectral analysis of $\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}\left(\mathbf{Y}^T\mathbf{Y}\right)^{-1}\mathbf{Y}^T$, whereas in regularised CCA, with regularisation parameter $\boldsymbol{\lambda} = (\lambda_1, \lambda_2)$, we are interested in the spectral analysis of $\mathbf{X}\left(\mathbf{X}^T\mathbf{X} + \lambda_1\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{Y}\left(\mathbf{Y}^T\mathbf{Y} + \lambda_2\mathbf{I}\right)^{-1}\mathbf{Y}^T$, where $\mathbf{I}$ is the identity matrix. We
designate by $\mathbf{X}_{-i}$ and $\mathbf{Y}_{-i}$ the data matrices derived from $\mathbf{X}$ and $\mathbf{Y}$, respectively, by removing the $i$th rows $\mathbf{x}_i$ and $\mathbf{y}_i$ of $\mathbf{X}$ and $\mathbf{Y}$, respectively. We also let $\rho_{\boldsymbol{\lambda}}^{(-i)}$ be the leading canonical correlation from CCA of $\mathbf{X}_{-i}$ and $\mathbf{Y}_{-i}$, with corresponding patterns (eigenvectors) $\mathbf{u}_{\boldsymbol{\lambda}}^{(-i)}$ and $\mathbf{v}_{\boldsymbol{\lambda}}^{(-i)}$. The cross-validation score can be defined, in general, as a measure of the squared error of a test set evaluated for an eigenvector from a training set. The CV score is defined (Leurgans et al. 1993) by
$$CV(\boldsymbol{\lambda}) = \mathrm{corr}\left(\left\{\mathbf{x}_i\mathbf{u}_{\boldsymbol{\lambda}}^{(-i)}\right\}_{i=1,\ldots,n},\ \left\{\mathbf{y}_i\mathbf{v}_{\boldsymbol{\lambda}}^{(-i)}\right\}_{i=1,\ldots,n}\right). \qquad (16.40)$$
Note that in the above equation we consider $\mathbf{x}_i$ as a $1 \times p$ row vector. The cross-validated parameter $\hat{\boldsymbol{\lambda}}$ is then given by the value maximising the CV score, i.e. $\hat{\boldsymbol{\lambda}} = \mathrm{argmax}_{\boldsymbol{\lambda}}\, CV(\boldsymbol{\lambda})$.
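A schematic numpy implementation of this leave-one-out procedure for regularised CCA is sketched below; the helper solving for the leading regularised canonical patterns, and the way the grid over λ is searched, are illustrative assumptions rather than a prescribed algorithm.

import numpy as np

def leading_rcca_patterns(X, Y, lam1, lam2):
    """Leading regularised CCA patterns (u, v), from the eigenanalysis of
    (X'X + lam1 I)^-1 X'Y (Y'Y + lam2 I)^-1 Y'X (sketch only)."""
    Rx = X.T @ X + lam1 * np.eye(X.shape[1])
    Ry = Y.T @ Y + lam2 * np.eye(Y.shape[1])
    M = np.linalg.solve(Rx, X.T @ Y) @ np.linalg.solve(Ry, Y.T @ X)
    w, V = np.linalg.eig(M)
    u = np.real(V[:, np.argmax(np.real(w))])
    v = np.linalg.solve(Ry, Y.T @ X @ u)
    return u, v / np.linalg.norm(v)

def cv_score(X, Y, lam):
    """Leave-one-out cross-validation score, Eq. (16.40)."""
    n = X.shape[0]
    sx, sy = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        u, v = leading_rcca_patterns(X[keep], Y[keep], *lam)
        sx[i], sy[i] = X[i] @ u, Y[i] @ v
    return np.corrcoef(sx, sy)[0, 1]

# lam_hat = max(grid, key=lambda lam: cv_score(X, Y, lam))   # search over a (lam1, lam2) grid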
The third method to estimate the optimal parameter is related to ridge regression
in which a regularisation parameter, in the form of λI, is added to the covariance
matrix of the predictor variables before computing the inverse. In ridge regression
a transformation is applied using $\mathbf{T}_{ridge} = \left[(1-\rho)\mathbf{X}^T\mathbf{X} + \rho\mu\mathbf{I}\right]^{-1/2}$, with $\mu = \frac{1}{p}\|\mathbf{X}\|_F^2$. An estimate of $\rho$ is derived by Ledoit and Wolf (2004):
$$\rho_{LW} = \frac{\sum_{i=1}^{n}\left\|(n-1)\,\mathbf{x}_i^T\mathbf{x}_i - \mathbf{X}^T\mathbf{X}\right\|_F^2}{\left\|\mathbf{X}^T\mathbf{X} - \mu\mathbf{I}\right\|_F^2}, \qquad (16.42)$$
with $\|\cdot\|_F$ being the Frobenius norm ($\|\mathbf{C}\|_F^2 = \mathrm{tr}(\mathbf{C}\mathbf{C}^T)$). For CPCCA, Swenson (2015) suggests the following estimator for the parameter $\alpha$:
$$\hat{\alpha} = \underset{\alpha}{\mathrm{argmin}}\left\|\nu\left(\mathbf{X}^T\mathbf{X}\right)^{\frac{1-\alpha}{2}} - (1-\rho_{LW})\mathbf{X}^T\mathbf{X} - \rho_{LW}\,\mu\,\mathbf{I}\right\|_F^2, \qquad (16.43)$$
with $\nu = \|\mathbf{X}\|_F^2 \Big/ \left\|\left(\mathbf{X}^T\mathbf{X}\right)^{\frac{1-\alpha}{2}}\right\|_F^2$.
16.7.1 Background
Given two data matrices X and Y, classical or standard MCA looks for patterns a
and b such that Xa and Yb have maximum covariance. These patterns are given,
respectively, by the left and right singular vectors of the cross-covariance matrix $\mathbf{B} = \mathbf{X}^T\mathbf{Y}$. These vectors satisfy $\mathbf{X}^T\mathbf{Y}\mathbf{b} = n\lambda\mathbf{a}$ and $\mathbf{Y}^T\mathbf{X}\mathbf{a} = n\lambda\mathbf{b}$. In addition, the
associated time series x = Xa and y = Yb satisfy, respectively:
$$\mathbf{X}\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{x} = n^2\lambda^2\mathbf{x}, \qquad \mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{X}^T\mathbf{y} = n^2\lambda^2\mathbf{y}. \qquad (16.44)$$
Exercise Derive Eq. (16.44).

In practice, of course, we do not solve Eq. (16.44),
but we apply the SVD algorithm to XT Y. The above derivation is useful for what
follows.
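For reference, classical MCA by SVD of the cross-covariance matrix takes only a few lines of numpy (a sketch; the 1/n scaling of the cross-covariance is a convention):

import numpy as np

def mca(X, Y, n_modes=3):
    """Classical MCA: patterns a_k, b_k from the SVD of the cross-covariance
    matrix X'Y, and the associated time series Xa_k and Yb_k (cf. Eq. 16.44)."""
    n = X.shape[0]
    B = X.T @ Y / n                                  # cross-covariance matrix
    A, s, Bt = np.linalg.svd(B, full_matrices=False)
    a, b = A[:, :n_modes], Bt.T[:, :n_modes]
    return a, b, X @ a, Y @ b, s[:n_modes]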
Kernel MCA takes its roots from kernel EOF where a transformation φ(.) is used
to map the input data space onto a feature space, then EOF analysis applied to
the transformed data. In kernel MCA the X and Y feature spaces are spanned,
respectively, by φ(x1 ), . . . , φ(xn ) and φ(y1 ), . . . , φ(yn ), respectively. The objective
is similar to standard MCA but applied to the feature spaces.
We designate by $\mathcal{X}$ and $\mathcal{Y}$ the matrices (or rather operators) defined, respectively, by
$$\mathcal{X} = \begin{pmatrix}\phi(\mathbf{x}_1)^T\\ \vdots\\ \phi(\mathbf{x}_n)^T\end{pmatrix} \quad \text{and} \quad \mathcal{Y} = \begin{pmatrix}\phi(\mathbf{y}_1)^T\\ \vdots\\ \phi(\mathbf{y}_n)^T\end{pmatrix}, \qquad (16.45)$$
and we seek "feature" patterns $\mathbf{u}$ and $\mathbf{v}$ from the feature space such that $\mathcal{X}\mathbf{u}$ and $\mathcal{Y}\mathbf{v}$ have maximum covariance.
The cross-covariance matrix between $\phi(\mathbf{x}_k)$ and $\phi(\mathbf{y}_k)$, $k = 1, \ldots, n$, is
$$\mathbf{C} = \frac{1}{n}\sum_{t=1}^{n}\phi(\mathbf{x}_t)\phi(\mathbf{y}_t)^T = \frac{1}{n}\mathcal{X}^T\mathcal{Y}, \qquad (16.46)$$
and the sought patterns satisfy
$$\mathbf{C}\mathbf{v} = \lambda\mathbf{u}, \qquad \mathbf{C}^T\mathbf{u} = \lambda\mathbf{v}. \qquad (16.47)$$
As in kernel EOF we see that $\mathbf{u}$ and $\mathbf{v}$ take, respectively, the following forms:
$$\mathbf{u} = \sum_{t=1}^{n} a_t\,\phi(\mathbf{x}_t) \quad \text{and} \quad \mathbf{v} = \sum_{t=1}^{n} b_t\,\phi(\mathbf{y}_t). \qquad (16.48)$$
Inserting (16.48) into (16.47), using (16.46), and denoting by $\mathbf{K}_x$ and $\mathbf{K}_y$ the matrices with respective elements $K^x_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ and $K^y_{ij} = \phi(\mathbf{y}_i)^T\phi(\mathbf{y}_j)$, we get
$$\mathbf{K}_x\mathbf{K}_y\mathbf{b} = n\lambda\mathbf{K}_x\mathbf{a}, \qquad \mathbf{K}_y\mathbf{K}_x\mathbf{a} = n\lambda\mathbf{K}_y\mathbf{b}. \qquad (16.49)$$
One can solve (16.49) simply by considering the necessary condition, i.e.
$$\mathbf{K}_y\mathbf{b} = n\lambda\mathbf{a}, \qquad \mathbf{K}_x\mathbf{a} = n\lambda\mathbf{b}, \qquad (16.50)$$
which yields
$$\mathbf{K}_y\mathbf{K}_x\mathbf{K}_y\mathbf{b} = n\lambda\mathbf{K}_y\mathbf{K}_x\mathbf{a} = n^2\lambda^2\mathbf{K}_y\mathbf{b}. \qquad (16.51)$$
One can use the (data) matrices within the feature space, as in the standard case (i.e.
without transformation), and directly solve the system:
$$\frac{1}{n}\mathcal{X}^T\mathcal{Y}\mathbf{v} = \lambda\mathbf{u}, \qquad \frac{1}{n}\mathcal{Y}^T\mathcal{X}\mathbf{u} = \lambda\mathbf{v}, \qquad (16.52)$$
which leads to
$$\mathcal{X}\mathcal{X}^T\mathcal{Y}\mathcal{Y}^T\mathcal{X}\mathbf{u} = n^2\lambda^2\mathcal{X}\mathbf{u}, \qquad \mathcal{Y}\mathcal{Y}^T\mathcal{X}\mathcal{X}^T\mathcal{Y}\mathbf{v} = n^2\lambda^2\mathcal{Y}\mathbf{v}. \qquad (16.53)$$
Now, $\mathbf{x} = \mathcal{X}\mathbf{u}$ is a time series of length $n$, and similarly for $\mathbf{y} = \mathcal{Y}\mathbf{v}$. Also, we have $\mathcal{X}\mathcal{X}^T = \mathbf{K}_x$ and $\mathcal{Y}\mathcal{Y}^T = \mathbf{K}_y$, and then Eq. (16.53) becomes
$$\mathbf{K}_x\mathbf{K}_y\mathbf{x} = n^2\lambda^2\mathbf{x}, \qquad \mathbf{K}_y\mathbf{K}_x\mathbf{y} = n^2\lambda^2\mathbf{y}. \qquad (16.54)$$
So the time series x and y having maximum covariance are given, respectively, by
the right and left eigenvectors of Kx Ky .
Remark Comparing Eqs. (16.51) and (16.54) one can see that $\mathbf{x} = \mathbf{K}_x\mathbf{a}$ and $\mathbf{y} = \mathbf{K}_y\mathbf{b}$, which can be verified keeping in mind that $\mathbf{u} = \sum_{t=1}^{n} a_t\phi(\mathbf{x}_t)$ and $\mathbf{v} = \sum_{t=1}^{n} b_t\phi(\mathbf{y}_t)$, in addition to the fact that $\mathbf{x} = \mathcal{X}\mathbf{u}$ and $\mathbf{y} = \mathcal{Y}\mathbf{v}$.
One finds either $\mathbf{a}$ and $\mathbf{b}$ (Eq. (16.51)) or $\mathbf{x}$ and $\mathbf{y}$ (Eq. (16.54)). We then construct the feature patterns $\mathbf{u}$ and $\mathbf{v}$ using Eq. (16.48). The corresponding patterns from the input spaces can be obtained by seeking $\mathbf{x}$ and $\mathbf{y}$ such that $\mathbf{u}^T\phi(\mathbf{x})$ and $\mathbf{v}^T\phi(\mathbf{y})$ are maximised. This leads to the maximisation problem:
$$\max_{\mathbf{x}}\ \sum_{t=1}^{n} a_t\, K(\mathbf{x}, \mathbf{x}_t) \quad \text{and} \quad \max_{\mathbf{y}}\ \sum_{t=1}^{n} b_t\, K(\mathbf{y}, \mathbf{y}_t). \qquad (16.55)$$
This is exactly like the pre-image for Kernel EOFs, and therefore the same fixed
point algorithm can be used.
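A minimal numpy sketch of kernel MCA along the lines of Eq. (16.54) is given below, using a Gaussian kernel; the kernel widths are free parameters of this sketch, and Gram-matrix centring (as in kernel EOFs) is omitted for brevity.

import numpy as np

def rbf_gram(Z, sigma):
    """Gaussian (RBF) Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_mca(X, Y, sigma_x, sigma_y, n_modes=2):
    """Kernel MCA sketch: the coupled time series are eigenvectors of KxKy
    and KyKx (Eq. 16.54), sorted by decreasing eigenvalue."""
    n = X.shape[0]
    Kx, Ky = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    wx, Vx = np.linalg.eig(Kx @ Ky)       # right eigenvectors give the time series x
    wy, Vy = np.linalg.eig(Ky @ Kx)       # right eigenvectors give the time series y
    ix = np.argsort(-np.real(wx))[:n_modes]
    iy = np.argsort(-np.real(wy))[:n_modes]
    lam = np.sqrt(np.abs(np.real(wx[ix]))) / n    # eigenvalues of KxKy are n^2 lambda^2
    return np.real(Vx[:, ix]), np.real(Vy[:, iy]), lam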
As above, we let X and Y denote two (anomaly) data matrices of respective dimensions n × p and n × q. The conventional CCA is written in the primal form as:
$$\rho = \max_{\mathbf{u},\mathbf{v}} \frac{\mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v}}{\sqrt{\left(\mathbf{u}^T\mathbf{X}^T\mathbf{X}\mathbf{u}\right)\left(\mathbf{v}^T\mathbf{Y}^T\mathbf{Y}\mathbf{v}\right)}}. \qquad (16.56)$$
By denoting u = XT α and v = YT β, the above form can be cast in the dual form:
$$\rho = \max_{\boldsymbol{\alpha},\boldsymbol{\beta}} \frac{\boldsymbol{\alpha}^T\mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta}}{\sqrt{\left(\boldsymbol{\alpha}^T\mathbf{K}_x^2\boldsymbol{\alpha}\right)\left(\boldsymbol{\beta}^T\mathbf{K}_y^2\boldsymbol{\beta}\right)}}, \qquad (16.57)$$
or, equivalently,
$$\max\ \boldsymbol{\alpha}^T\mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta} \quad \text{s.t.} \quad \boldsymbol{\alpha}^T\mathbf{K}_x^2\boldsymbol{\alpha} = \boldsymbol{\beta}^T\mathbf{K}_y^2\boldsymbol{\beta} = 1. \qquad (16.58)$$
This system can be analysed using Lagrange multipliers yielding a system of linear
equations in α and β:
$$\mathbf{K}_x\mathbf{K}_y\boldsymbol{\beta} - \lambda_1\mathbf{K}_x\boldsymbol{\alpha} = \mathbf{0}, \qquad \mathbf{K}_y\mathbf{K}_x\boldsymbol{\alpha} - \lambda_2\mathbf{K}_y\boldsymbol{\beta} = \mathbf{0}. \qquad (16.59)$$
Remark Note that in the dual formulation, i.e. the Rayleigh quotient (16.60) and also (16.57), the computation of the cross-correlation (or cross-covariance) matrix is avoided. This has implications when computing kernel CCA, as shown later.
Exercise Assume that Kx and Ky are invertible, show that we have λ = 1.
The conclusion from the above exercise is that when Kx and Ky are invertible
perfect correlation can be obtained and the CCA problem becomes useless. This
is a kind of “overfitting”.
Remark In CCA this problem occurs whenever Kx and Ky are invertible. This
means that rank(X) = n = rank(Y), i.e. n < q and n < p. This also means that the
covariance matrices XT X and YT Y are singular.
The solution to this problem is regularisation, as discussed in Sect. 16.6 (see also Chap. 15), by adding $\lambda_1\mathbf{I}$ and $\lambda_2\mathbf{I}$ to the correlation matrices of $\mathbf{X}$ and $\mathbf{Y}$, respectively, as in ridge regression. In ridge regression with a regression model $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$ ($\mathbf{E}$ being the noise term), the estimated matrix $\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$ is replaced by $(\mathbf{R} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}$,
with λ > 0, and R is the correlation matrix. The diagonal elements of R are
increased by λ, and this is where the name ridge comes from.
Remark The standard CCA problem can be cast into a generalised eigenvalue
problem as
$$\begin{pmatrix}\mathbf{O} & \mathbf{C}_{xy}\\ \mathbf{C}_{yx} & \mathbf{O}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\end{pmatrix} = \rho^2\begin{pmatrix}\mathbf{C}_{xx} & \mathbf{O}\\ \mathbf{O} & \mathbf{C}_{yy}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\end{pmatrix}$$
(see exercise above). The above form can be used to extend CCA to multiple
datasets. For example, for three data one form of this generalisation is given by
the following generalised eigenvalue problem:
$$\begin{pmatrix}\mathbf{O} & \mathbf{C}_{xy} & \mathbf{C}_{xz}\\ \mathbf{C}_{yx} & \mathbf{O} & \mathbf{C}_{yz}\\ \mathbf{C}_{zx} & \mathbf{C}_{zy} & \mathbf{O}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\\ \mathbf{w}\end{pmatrix} = \rho^2\begin{pmatrix}\mathbf{C}_{xx} & \mathbf{O} & \mathbf{O}\\ \mathbf{O} & \mathbf{C}_{yy} & \mathbf{O}\\ \mathbf{O} & \mathbf{O} & \mathbf{C}_{zz}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\\ \mathbf{w}\end{pmatrix}.$$
In canonical covariance analysis no scaling (i.e. correlation) was used, and therefore no regularisation was required. As with conventional CCA we denote, respectively, the Gram matrices of $\mathbf{X}$ and $\mathbf{Y}$ by $\mathbf{K} = (K_{ij})$ and $\mathbf{L} = (L_{ij})$, with $K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ and $L_{ij} = \psi(\mathbf{y}_i)^T\psi(\mathbf{y}_j)$. Note that here we can use a different map $\psi(.)$ for $\mathbf{Y}$. The solution of KCCA looks for patterns $\mathbf{a} = \sum_i\alpha_i\,\phi(\mathbf{x}_i)$ and $\mathbf{b} = \sum_i\beta_i\,\psi(\mathbf{y}_i)$
that are maximally correlated. This leads to maximising the Lagrangian:
$$\mathcal{L} = \boldsymbol{\alpha}^T\mathbf{K}\mathbf{L}\boldsymbol{\beta} - \frac{\lambda}{2}\left(\boldsymbol{\alpha}^T\mathbf{K}^2\boldsymbol{\alpha} - 1\right) - \frac{\lambda}{2}\left(\boldsymbol{\beta}^T\mathbf{L}^2\boldsymbol{\beta} - 1\right), \qquad (16.61)$$
and also maximising the Rayleigh quotient (in the dual form). The obtained system of equations is similar to Eq. (16.59). Again, if, for example, K is of full rank, which is
typically the case in practice, then a naive application of KCCA leads to λ = 1. This
shows the need to regularise the kernel, which leads to the regularised Lagrangian
$$\mathcal{L} = \boldsymbol{\alpha}^T\mathbf{K}\mathbf{L}\boldsymbol{\beta} - \frac{\lambda}{2}\left(\boldsymbol{\alpha}^T\mathbf{K}^2\boldsymbol{\alpha} + \eta_1\boldsymbol{\alpha}^T\boldsymbol{\alpha} - 1\right) - \frac{\lambda}{2}\left(\boldsymbol{\beta}^T\mathbf{L}^2\boldsymbol{\beta} + \eta_2\boldsymbol{\beta}^T\boldsymbol{\beta} - 1\right). \qquad (16.62)$$
The associated Rayleigh quotient is similar to that shown in the exercise above, except that $\mathbf{K}^2$ and $\mathbf{L}^2$ are replaced by $\mathbf{K}^2 + \eta_1\mathbf{I}$ and $\mathbf{L}^2 + \eta_2\mathbf{I}$, respectively, together with the associated generalised eigenvalue problem. Note that we can also take $\eta_1 = \eta_2 = \eta$.
Remarks
• The dual formulation allows us to use different kernels, e.g. $\phi(.)$ and $\psi(.)$ for X and Y, respectively. For example, one can kernelise only one variable and leave the other without kernel.
• The regularisation parameter η can be estimated using the cross-validation procedure.
The solution to the regularised KCCA is given, e.g. for α, assuming that K is invertible, by a standard eigenvalue problem.
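A compact numpy/scipy sketch of regularised KCCA, written as the symmetric generalised eigenvalue problem described above (with K² + ηI and L² + ηI on the diagonal blocks), is as follows; taking a single η for both blocks is the simplification mentioned in the remarks, and the function name is illustrative.

import numpy as np
from scipy.linalg import eigh

def kcca_regularised(K, L, eta, n_modes=2):
    """Regularised kernel CCA sketch: generalised symmetric eigenproblem with
    K^2 + eta I and L^2 + eta I replacing K^2 and L^2 (cf. Eq. 16.62)."""
    n = K.shape[0]
    Z = np.zeros((n, n))
    A = np.block([[Z, K @ L], [L @ K, Z]])                  # off-diagonal cross blocks
    B = np.block([[K @ K + eta * np.eye(n), Z],
                  [Z, L @ L + eta * np.eye(n)]])            # regularised diagonal blocks
    w, V = eigh(A, B)                                       # generalised eigenproblem
    idx = np.argsort(-w)[:n_modes]                          # largest canonical correlations
    alpha, beta = V[:n, idx], V[n:, idx]
    return alpha, beta, w[idx]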
16.9.1 Background
The above equations make the patterns $\mathbf{z}_1, \ldots, \mathbf{z}_p$ data (or pure) types. The data, in turn, are also approximated by a similar weighted average of the archetypes. That is, each $\mathbf{x}_t$, $t = 1, \ldots, n$, is approximated by a convex combination $\sum_{j=1}^{p}\alpha_{tj}\mathbf{z}_j$, with $\alpha_{tj} \geq 0$ and $\sum_{j=1}^{p}\alpha_{tj} = 1$. The archetypes are therefore the solution of a convex least squares problem obtained by minimising a residual sum of squares (RSS):
$$\{\mathbf{z}_1, \ldots, \mathbf{z}_p\} = \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\mathrm{argmin}}\sum_t \Big\|\mathbf{x}_t - \sum_{k=1}^{p}\alpha_{tk}\mathbf{z}_k\Big\|_2^2 = \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\mathrm{argmin}}\sum_t \Big\|\mathbf{x}_t - \sum_{k=1}^{p}\sum_{j=1}^{n}\alpha_{tk}\beta_{kj}\mathbf{x}_j\Big\|_2^2$$
$$\text{s.t.} \quad \alpha_{tk} \geq 0,\ \sum_{k=1}^{p}\alpha_{tk} = 1,\ t = 1, \ldots, n; \qquad \beta_{kj} \geq 0,\ \sum_{j=1}^{n}\beta_{kj} = 1,\ k = 1, \ldots, p, \qquad (16.65)$$
where $\|\cdot\|_2$ stands for the Euclidean norm.
The above formulation of archetypes can be cast in terms of matrices. Letting
$\mathbf{A}^T = (\alpha_{ij})$ and $\mathbf{B}^T = (\beta_{ij})$ ($\mathbf{A} \in \mathbb{R}^{p\times n}$, $\mathbf{B} \in \mathbb{R}^{n\times p}$), the above equation transforms into the following matrix optimisation problem:
$$\min_{\mathbf{A},\mathbf{B}} R = \left\|\mathbf{X} - \mathbf{A}^T\mathbf{B}^T\mathbf{X}\right\|_F^2 \quad \text{s.t.} \quad \mathbf{A}, \mathbf{B} \geq 0,\ \mathbf{A}^T\mathbf{1}_p = \mathbf{1}_n,\ \mathbf{B}^T\mathbf{1}_n = \mathbf{1}_p. \qquad (16.66)$$
In the above system $\mathbf{A}$ and $\mathbf{B}$ are row stochastic matrices, $\mathbf{1}_x$ stands for the $x$-dimensional column vector of ones and $\|\cdot\|_F$ stands for the Frobenius norm (Appendix D). The inferred archetypes are then convex combinations of the observations, given by $\mathbf{Z} = \left[\mathbf{z}_1, \ldots, \mathbf{z}_p\right] = \mathbf{X}^T\mathbf{B}$, and they exist on the convex hull of the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$.
Furthermore, letting A = (α 1 , . . . , α n ), then for each data xt , t = 1, . . . , n,
Zα t represents its projection on the convex hull of the archetypes as each α t is a
probability vector.
For a given p Cutler and Breiman (1994) show that the minimisers of RSS R,
Eq. (16.66), provide archetypes Z = z1 , . . . , zp that are, theoretically, located on
the boundary of the convex hull (or envelope) of the data. The convex hull of a given
data is the smallest convex set containing the data. Archetypes provide therefore
typical representation of the “corners” or extremes of the observations. Figure 16.14
shows an illustration of a two-dimensional example of a set of points with its convex
hull and its approximation using five archetypes. The sample mean $\bar{\mathbf{x}} = \frac{1}{n}\sum_t\mathbf{x}_t$ provides the unique archetype for $p = 1$, and for $p = 2$ the pattern $\mathbf{z}_2 - \mathbf{z}_1$ coincides with the leading EOF of the data. Unlike EOFs, archetypes are not required to be
nested (Cutler and Breiman 1994; Bauckhage and Thurau 2009).
Fig. 16.14 Two-dimensional illustration of a set of 25 points along with the convex hull (dashed),
an approximate convex hull (solid) and 5 archetypes (yellow). The blue colour refers to points that
contribute to the RSS. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological
Society. Used with permission
However, like k-means clustering (and unlike EOFs), AA is invariant to translation and scaling and to rotational ambiguity (Morup and Hansen 2012). In summary, AA combines the virtues of EOFs and clustering and, most importantly, deals with extremes in high dimensions.
Exercise Show that the mean x is the unique archetype for p = 1.
Hint Letting $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T = (\mathbf{y}_1, \ldots, \mathbf{y}_m)$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_n)^T$, and $\varepsilon^2 = \|\mathbf{X} - \mathbf{1}_n\boldsymbol{\beta}^T\mathbf{X}\|_F^2 = \sum_{t=1}^{n}\sum_{k=1}^{m}\left(x_{tk} - \boldsymbol{\beta}^T\mathbf{y}_k\right)^2$, and differentiating $\varepsilon^2$ with respect to $\boldsymbol{\beta}$, one obtains $\mathbf{X}\mathbf{X}^T\boldsymbol{\beta} = \frac{1}{n}\mathbf{X}\mathbf{X}^T\mathbf{1}_n$. The only solution (satisfying the constraint) is $\boldsymbol{\beta} = \frac{1}{n}\mathbf{1}_n$, that is $\mathbf{z} = \mathbf{X}^T\boldsymbol{\beta} = \bar{\mathbf{x}}$.
There are mainly two algorithms to solve the archetypes problem, which are
discussed below. The first one is based on the alternating algorithm (Cutler and
Breiman 1994), and the second one is based on an optimisation algorithm on
Riemannian manifolds (Hannachi and Trendafilov 2017).
Alternating Algorithm
To solve the above optimisation problem Cutler and Breiman (1994) used, starting
from an initial set of archetypes, an alternating minimisation between finding the
best A for a given set of archetypes, or equivalently B, and the best B for a given A.
The algorithm has the following multi-steps:
After each iteration the archetypes are updated. For example, after finding Al+1
from Eq. (16.67) and before solving the second equation, $\mathbf{Z}$ is updated using $\mathbf{X} = \mathbf{A}_{l+1}^T\mathbf{Z}^T$, i.e. $\mathbf{Z}^T = \left(\mathbf{A}_{l+1}\mathbf{A}_{l+1}^T\right)^{-1}\mathbf{A}_{l+1}\mathbf{X}$, which is then used in the second equation of (16.67). After optimising this second equation, $\mathbf{Z}$ is then updated
using Z = XT Bl+1 . This algorithm has been widely used since it was proposed
by Cutler and Breiman (1994).
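The following is a minimal Python sketch of this alternating scheme (the book's A and B correspond to the transposes of the arrays alpha and beta below); each simplex-constrained least squares subproblem is solved here with a generic SLSQP call, which is an illustrative shortcut rather than the dedicated solvers used in practice.

import numpy as np
from scipy.optimize import minimize

def simplex_ls(M, y):
    """min ||y - M w||^2  s.t.  w >= 0, sum(w) = 1  (small convex LS, sketch only)."""
    p = M.shape[1]
    cons = {'type': 'eq', 'fun': lambda w: w.sum() - 1.0}
    res = minimize(lambda w: ((y - M @ w) ** 2).sum(), np.full(p, 1.0 / p),
                   bounds=[(0.0, None)] * p, constraints=cons, method='SLSQP')
    return res.x

def archetypes_alternating(X, p, n_outer=30, seed=0):
    """Alternating archetypal analysis (Cutler & Breiman 1994 flavour); X is n x d."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(n, p, replace=False)].copy()      # initial archetypes: random data points
    alpha = np.full((n, p), 1.0 / p)
    beta = np.full((p, n), 1.0 / n)
    for _ in range(n_outer):
        for t in range(n):                              # best mixture weights for each x_t
            alpha[t] = simplex_ls(Z.T, X[t])
        Z = np.linalg.lstsq(alpha, X, rcond=None)[0]    # Z update: least squares given alpha
        for k in range(p):                              # archetypes as convex data combinations
            beta[k] = simplex_ls(X.T, Z[k])
        Z = beta @ X                                    # Z = B^T X in the book's notation
    rss = ((X - alpha @ Z) ** 2).sum()
    return Z, alpha, beta, rss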
Remark Both equations in (16.67) can be transformed into $n + p$ individual convex least squares problems; for example, the first equation decouples into $n$ independent problems, one for each weight vector $\boldsymbol{\alpha}_t$.
where ddiag(Y) is the double diagonal operator, which transforms a square matrix
Y into a diagonal matrix with the same diagonal elements as Y. This manifold
is known as the oblique manifold and is topologically equivalent to the Cartesian
product of spheres, with a natural inner product. In terms of the element-wise squared matrices $\mathbf{A}\odot\mathbf{A}$ and $\mathbf{B}\odot\mathbf{B}$, the residual sum of squares takes the form
$$R = \left\|\mathbf{X} - (\mathbf{A}\odot\mathbf{A})^T(\mathbf{B}\odot\mathbf{B})^T\mathbf{X}\right\|_F^2 = \mathrm{tr}(\mathbf{Z}) - 2\,\mathrm{tr}(\mathbf{Z}\mathbf{W}) + \mathrm{tr}(\mathbf{Z}\mathbf{W}^T\mathbf{W}), \qquad (16.70)$$
Exercise Consider the hypersphere $S^{n-1}$ in $\mathbb{R}^n$, $S^{n-1} = \{\mathbf{x} \in \mathbb{R}^n, \|\mathbf{x}\| = 1\}$, and the oblique manifold $Ob(n, p) = \{\mathbf{X} \in \mathbb{R}^{n\times p},\ \mathrm{ddiag}(\mathbf{X}^T\mathbf{X}) = \mathbf{I}_p\}$. The tangent space at $\mathbf{x} \in S^{n-1}$ is the set of all vectors orthogonal to $\mathbf{x}$. Using the topological equivalence between $Ob(n, p)$ and the Cartesian product of $p$ hyperspheres $S^{n-1}$, i.e. $Ob(n, p) \sim S^{n-1}\times\cdots\times S^{n-1}$, derive the projection $\mathbf{U}_*$ of any $\mathbf{U}$ from $\mathbb{R}^{n\times p}$ onto $T_{\mathbf{X}}Ob(n, p)$, namely
$$\mathbf{U}_* = \mathbf{U} - \mathbf{X}\,\mathrm{ddiag}\left(\mathbf{X}^T\mathbf{U}\right). \qquad (16.74)$$
Let us denote $\mathbf{A}_{2.} = \mathbf{A}\odot\mathbf{A}$ and similarly for $\mathbf{B}$. Then we have the following expression of the gradient of the costfunction $R$ (see Appendix D):
$$\nabla_{\mathbf{A}} R = 4\left[(\mathbf{B}_{2.})^T\mathbf{Z}\left(-\mathbf{I}_n + \mathbf{W}^T\right)\right]\odot\mathbf{A}, \qquad \nabla_{\mathbf{B}} R = 4\left[\mathbf{Z}\left(-\mathbf{I}_n + \mathbf{W}^T\right)(\mathbf{A}_{2.})^T\right]\odot\mathbf{B}. \qquad (16.75)$$
Finally, the projection of the gradient of $R$, $\nabla_{\mathbf{A},\mathbf{B}} R$, onto the tangent space of the oblique manifolds yields the final gradient $\mathrm{grad}_{\mathbf{A},\mathbf{B}} R$, namely:
$$\mathrm{grad}_{\mathbf{A}} R = \nabla_{\mathbf{A}} R - \mathbf{A}\,\mathrm{ddiag}\left(\mathbf{A}^T\nabla_{\mathbf{A}} R\right), \qquad \mathrm{grad}_{\mathbf{B}} R = \nabla_{\mathbf{B}} R - \mathbf{B}\,\mathrm{ddiag}\left(\mathbf{B}^T\nabla_{\mathbf{B}} R\right). \qquad (16.76)$$
$$\mathbf{Z} = \mathbf{X}^T(\mathbf{B}\odot\mathbf{B}) = \mathbf{X}^T\mathbf{B}_{2.} \qquad (16.77)$$
A closely related method is non-negative matrix factorisation (NMF), in which a non-negative data matrix $\mathbf{X}$ is approximated by the product of two non-negative matrices $\mathbf{Z}$ and $\mathbf{H}$:
$$\{\mathbf{Z}, \mathbf{H}\} = \underset{\mathbf{Z},\mathbf{H}\geq 0}{\mathrm{argmin}}\ \|\mathbf{X} - \mathbf{Z}\mathbf{H}\|_F^2. \qquad (16.78)$$
NMF was originally formulated as maximising the function
$$F(\mathbf{Z}, \mathbf{H}) = \sum_{i=1}^{n}\sum_{j=1}^{p}\left[x_{ij}\log(\mathbf{Z}\mathbf{H})_{ij} - (\mathbf{Z}\mathbf{H})_{ij}\right],$$
subject to the non-negativity constraint, and used a multiplicative updating rule. But
other algorithms, e.g. alternating rules as in AA, can also be used. It is clear that the
main difference between AA and NMF is the stochasticity of the matrices Z and H.
In terms of patterns, AA yields archetypes whereas NMF yields characteristic parts
(Bauckhage and Thurau 2009). To bring it closer to AA, NMF has been extended
to convex NMF (C-NMF), see e.g. Ding et al. (2010), where the non-negativity of
X is relaxed, and the non-negative matrix Z takes the form Z = XW, with W a
non-negative matrix.
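As an illustration, a minimal numpy sketch of NMF with the classical multiplicative updates for the Frobenius objective of Eq. (16.78) is given below (the divergence-type objective above admits analogous updates); the rank r, iteration count and initialisation are illustrative.

import numpy as np

def nmf_multiplicative(X, r, n_iter=500, eps=1e-9, seed=0):
    """NMF sketch with multiplicative updates for ||X - Z H||_F^2 (Eq. 16.78);
    X must be non-negative, r is the factorisation rank."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Z = rng.random((n, r)) + eps
    H = rng.random((r, p)) + eps
    for _ in range(n_iter):
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)     # update H with Z fixed
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)     # update Z with H fixed
    return Z, H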
One of the nice and elegant features of simplexes is the two-dimensional visualisation of any m-simplex, i.e. the m-dimensional polytope that is the convex hull of its m + 1 vertices. This projection is well known in algebraic topology, sometimes referred to as “skew orthogonal” projection, and shows all the vertices of a regular simplex on a circle where all vertex pairs are connected by edges. For example, the regular 3-simplex (tetrahedron) projects onto a square (Fig. 16.15). Any point
$\mathbf{y} = (y_1, \ldots, y_{m+1})^T$ in $\mathbb{R}^{m+1}$ can be projected onto the m-simplex. The projection of $\mathbf{y}$ onto the standard m-simplex is the closest point $\mathbf{t} = (t_1, \ldots, t_{m+1}) \geq 0$, $\sum_i t_i = 1$, to $\mathbf{y}$. The point $\mathbf{t}$ satisfies $t_i = \max(y_i + e, 0)$, $i = 1, \ldots, m+1$. The number $e$ can be obtained through a sorting algorithm of complexity $O(n\log n)$ (Michelot 1986; Malozemov and Pevnyi 1992; Causa and Raciti 2013).
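For completeness, a small numpy sketch of this sorting-based projection, which finds the shift e such that t_i = max(y_i + e, 0) sums to one, is shown below.

import numpy as np

def project_to_simplex(y):
    """Euclidean projection of y onto the standard simplex {t >= 0, sum(t) = 1},
    via the classical sorting algorithm alluded to in the text."""
    m = y.size
    u = np.sort(y)[::-1]                              # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0].max()
    e = (1.0 - css[rho]) / (rho + 1.0)                # the shift e of the text
    return np.maximum(y + e, 0.0)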
As pointed out above, $\mathbf{Z}\boldsymbol{\alpha}_i$, $i = 1, \ldots, n$, is the best approximation of $\mathbf{x}_i$ on the convex hull of the archetypes $\mathbf{Z}$, i.e. the $(p-1)$-simplex whose vertices are the archetypes.
Fig. 16.15 An example of a 3-simplex or tetrahedron (a), its two-dimensional projection (b), and
a two-dimensional projection of a 4-simplex (c)
Hannachi and Trendafilov (2017) applied AA to sea surface temperature (SST) and the Asian summer monsoon. The SST anomaly data come from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST, www.metoffice.gov.uk/hadobs/hadisst/; Rayner et al. 2003).
Fig. 16.16 Scree plot (a) of a five Gaussian-clusters, costfunction and gradient norm of R (b) for
5 archetypes, and the skew projection (c) using the same 5 archetypes. Adapted from Hannachi
and Trendafilov (2017). ©American Meteorological Society. Used with permission
The data are on a 1◦ × 1◦ latitude–longitude grid from Jan 1870 to Dec 2014, over the region 45.5◦S–45.5◦N. The scree plot (Fig. 16.17) shows a break- (or elbow-)like feature at p = 3, suggesting three archetypes.
The three archetypes suggested by Fig. 16.17 are shown in Fig. 16.18. The
first two archetypes show, respectively, El-Niño and La-Niña, the third archetype
shows the western boundary currents, namely Kuroshio, Gulf Stream and Agulhas
currents, in addition to the Brazil, East Australian and a few other currents. It is
known that western boundary currents are driven by major gyres, which transport
warm tropical waters poleward along narrow, and sometimes deep, currents. These
Fig. 16.17 Scree plot of the SST anomalies showing the relative RSS versus the archetypes
number. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society.
Used with permission
Fig. 16.18 The obtained three archetypes of the SST anomalies showing El-Niño (a), La-Niña
(b) and the western currents (c). Contour interval 0.2◦ C. Adapted from Hannachi and Trendafilov
(2017). ©American Meteorological Society. Used with permission
Fig. 16.19 Mixture weights of the three archetypes of SST anomalies, El-Niño (a), La-Niña
(b), and the western boundary currents (c). Adapted from Hannachi and Trendafilov (2017).
©American Meteorological Society. Used with permission
currents are normally fast, a feature referred to as the western intensification (e.g. Stommel 1948; Munk 1950). This strongly suggests that these western boundary currents project onto extreme events, which are located on the outer boundary of the system state space. It should be recalled here that the SST field is different
from the surface currents, which better capture the boundary currents. Records of
surface currents, however, are not long enough, in addition to the non-negligible
uncertainties in these currents.
The mixture weights of these archetypes are shown in Fig. 16.19.
For El-Niño archetype (Fig. 16.19a) the contribution comes from various obser-
vations scattered over the observation period and most notably from the first half
of the record. Those events correspond to prototype El-Niño’s, with largest weights
taking place end of the nineteenth and early twentieth centuries and in the last few
decades.
For the La-Niña archetype (Fig. 16.19b) there is a decreasing contribution with
time, with most weights located in the first half of the record, with particularly high
contribution from the event of the year 1916–1917. One can also see contributions
from La-Niña events of 1955 and 1975. It is interesting to note that these
contributing weights are clustered (in time). Unlike the previous two archetypes
Fig. 16.20 Time series amplitudes of the leading three archetypes (a, b, c) of the SST anomalies.
Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with
permission
the third, western current, archetype (Fig. 16.19c) is dominated by the last quarter
of the observational period starting around the late 1970s.
The time series of the archetypes, i.e. the columns of the stochastic matrix $\mathbf{A}_{2.}$, show the “amplitudes” of the archetypes, somewhat similar to the PCs, and are shown in Fig. 16.20. The time series of El-Niño shows a slight weakening of the archetype, although the events of the early 1980s and late 1990s clearly show up. There is a decrease from the 1990s to the end of the record. Prior to about 1945
the signal seemed quite stationary in terms of strength and frequency. The time
series of La-Niña archetype shows a general decrease in the last 50 or so years. The
signal was somehow “stationary” (with no clear trend) before about 1920. Unlike the
previous El-Niño and La-Niña archetypes the third (or western current) archetype
time series has an increasing trend starting immediately after an extended period of
weak activity around 1910. The trend is not gradual, with the existence of a period
of moderate activity around the 1960s. The strongest activity occurs during the last two decades, starting around the late 1990s. Figure 16.21 shows the simplex projection
of the data using three archetypes. The colours refer to the points that are closest to
each of the three archetypes.
Fig. 16.21 Two-dimensional simplex projection of the SST anomalies using three archetypes. The
200 points that are closest to each of the three archetypes are coloured, respectively, red, blue and
black and the remaining points are shown by light grey-shading. Adapted from Hannachi and
Trendafilov (2017). ©American Meteorological Society. Used with permission
Fig. 16.22 (a) Relative RSS of 8 subsamples of sizes ranging from 1500 to 100, by increments of
200 (continuous) as well as the curves for the whole sample (black dotted-dashed), and the original
data less the top 1st (blue diamond), 2.5th (red square) and the 5th (black triangle) percentiles. (b)
Projection of the three archetypes of the subsamples described in (a) onto the leading three EOFs
of the SST anomalies (star), along with the same plots of the whole sample (filled circle), and the
original data less the top 1st (diamond), 2.5th (square) and 5th (triangle) percentiles. Adapted from
Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission
Hannachi and Trendafilov (2017) also applied AA to the detrended SST anoma-
lies. Their finding suggests only two main archetypes, namely El-Niño and La Niña.
This once again strongly suggests that the archetype associated with the western
boundary currents is the pattern that mostly explains the trend in extremes. They also
show that the method is quite robust to sample size and extremes (Fig. 16.22).
$$\mathbf{x}(t) = \sum_{k=1}^{q}\mathbf{f}_k\left(p_k(t)\right) + \boldsymbol{\varepsilon}_t, \qquad (16.79)$$
where fk (.) is the kth trajectory, i.e. a map from the set of real numbers into the d-
dimensional data space, pk (.) is the associated time series, and εt is a residual term.
The conventional EOF method corresponds to linear maps in Eq. (16.79).
An interesting nonlinear EOF analysis method, namely nonlinear dynamical
mode (NDM) decomposition was presented by Mukhin et al. (2015). In their
decomposition Mukhin et al. (2015) used Eq. (16.79) with extra parameters and
fitted the model to the multivariate time series xt , t = 1, . . . n, with the nonlinear
trajectory fk (.) being the kth NDM. The NDMs are computed recursively by
identifying one NDM at a time, then compute the next one from the residuals, etc.,
that is:
$$\mathbf{x}_t = \mathbf{f}(\mathbf{a}, p_t) + \boldsymbol{\varepsilon}_t, \qquad t = 1, \ldots, n, \qquad (16.80)$$
with $\boldsymbol{\varepsilon}_t$ being multinormal with diagonal covariance matrix. The components $\mathbf{f}(.)$, $\mathbf{a}$ and $p_t$ are first estimated from the sample $\mathbf{x}_t$, $t = 1, \ldots, n$; then the next components, similar to $\mathbf{f}$, $\mathbf{a}$ and $p_t$, $t = 1, \ldots, n$, are estimated from the residuals $\boldsymbol{\varepsilon}_t$, etc.
Each component of the NDM, e.g. f(.) in Eq. (16.80), is basically a combination of
orthogonal, with respect to the Gaussian probability density function, polynomials,
namely Hermite polynomials.
As in most nonlinear PC methods, the data dimension was reduced via (linear)
PCA, which simplifies the computation significantly. In vector form the principal
components at time $t$, $\mathbf{y}_t$, $t = 1, \ldots, n$, is expanded as:
$$\mathbf{y}_t = \begin{pmatrix}\mathbf{A}\mathbf{w}(p_t)\\ \mathbf{O}\end{pmatrix} + \begin{pmatrix}\sigma\boldsymbol{\varepsilon}_1\\ \mathbf{D}\boldsymbol{\varepsilon}_2\end{pmatrix}. \qquad (16.81)$$
The last term on the right-hand side of Eq. (16.81) represents the residuals,
which are used in the next step to get the next hidden time series and associated
NDM.
Mukhin et al. (2015) used a Bayesian framework and maximised the likelihood
function, in which the last term is a prior distribution. The prior distribution of the latent
variables p11 , . . . p1n , i.e. p1 (t), t = 1, . . . n, was taken to be multinormal based
on the assumption of a first-order autoregressive model. They also assumed a
multinormal distribution with diagonal covariance matrix for the prior distribution
of the parameter vector a1 . One of the interesting properties of NDMs, compared
to other methods such as kernel EOFs and Isomap method (Hannachi and Turner
2013b), is that the method provides simultaneously the NDMs and associated time
series.
Mukhin et al. (2015), see also Gavrilov et al. (2016), applied the NDM method
to monthly National Oceanic and Atmospheric Administration (NOAA) optimal
interpolation OI.v2 SST data for the period 1981–2015. Substantial explained
variance was found to be captured by the leading few NDMs (Fig. 16.23). The
leading NDM captures the annual cycle (Fig. 16.23a). An interesting feature they
identified in the second NDM was a shift that occurred in 1997–1998 (Fig. 16.23b).
Fig. 16.23 The leading three time series p1t , p2t , p3t , t = 1, . . . n, of the leading three NDMs (a)
and the trajectory of the system within the obtained three-dimensional space (b). Blue colour refers
to the period 1981–1997 and the red refers to 1998–2015. Adapted from Mukhin et al. (2015)
Fig. 16.24 Spatial patterns associated with the leading two NDMs showing the difference between
winter and summer averages of SST explained by NDM1 (a) and the difference between SST
explained by NDM2 (b) averaged over the periods before and after 1998 and showing the opposite
phase of the PDO. Adapted from Mukhin et al. (2015)
The second NDM also captures some parts of the Pacific decadal oscillation
(PDO), North Tropical Atlantic (NTA) and the North Atlantic indices. The third
NDM (Fig. 16.23a) captures parts of the PDO, NTA and the Indian Ocean dipole
(IOD). The spatial patterns of the leading two NDMs are shown in Fig. 16.24. As
expected, the nonlinear modes capture larger explained variance compared to those
explained by conventional EOFs. For example, the leading three NDMs are found to
explain around 85% of the total variance versus 80% from the leading three EOFs.
The leading NDM by itself explains around 78% compared to 69% of the leading
EOF.
Gavrilov et al. (2016) computed the NDMs of SSTs from a 250-year pre-
industrial control run. They obtained several non-zero nonlinear modes. The leading
mode came out as the seasonal cycle whereas the ENSO cycle was captured by a
combination of the second and third NDMs. The combination of the fourth and
the fifth nonlinear modes yielded a decadal mode. The time series of these modes
are shown in Fig. 16.25. The leading five PCs of the SSTs are also shown for
comparison. The effect of mixing, characterising EOFs and PCs, can be clearly seen
in the figure. The time series of the nonlinear modes do not suffer from the mixing
drawback.
Fig. 16.25 Times series of the first five nonlinear dynamical modes (left) and the PCs (right)
obtained from the SST simulation of a climate model. Adapted from Gavrilov et al. (2016)
Chapter 17
Machine Learning
Abstract This last chapter discusses a relatively new method applied in atmo-
spheric and climate science: machine learning. Put simply, machine learning refers
to the use of algorithms allowing the computer to learn from the data and use
this learning to identify patterns or draw inferences from the data. The chapter
describes briefly the flavour of machine learning and discusses three main methods
used in weather and climate, namely neural networks, self-organising maps and
random forests. These algorithms can be used for various purposes, including
finding structures in the data and making predictions.
17.1 Background
Neural networks (NNs) originate from an attempt by scientists to mimic the human
brain during the process of learning and pattern recognition (McCulloch and Pitts
1943; Rosenblatt 1962; Widrow and Stearns 1985). The cornerstone of NNs is
the so-called universal approximation theorem (Cybenko 1989; Hornik 1991). The
theorem states that any regular multivariate real valued function f (x) can be
approximated at any given precision by a NN with one hidden layer and a finite
number of neurons with the same activation function and one linear output neuron.
That is, f (x) ≈ m α
k=1 k g(w T x + b ), with g(.) being a bounded function, with
k j
specific properties, referred to as sigmoid function, see below. NNs are supposed to
be capable of performing various tasks, e.g. pattern recognition and regression (Watt
et al. 2020). In addition, NNs can also be used for other purposes such as dimension
reduction, time series prediction (Wan 1994; Zhang et al. 1997), classification and
pdf estimation. And as put by Nielsen (2015), “Neural networks are one of the most
beautiful programming paradigms ever invented”.
$$c(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}.$$
Remark Note, in particular, that the logistic function (and similar sigmoids) has nice properties. For example, it has a simple derivative ($g'(x) = g(x)(1 - g(x))$) and $g(x)$ is approximately linear for small $|x|$.
Now c(x) can be interpreted as a probability of x belonging to class 1, i.e. c(x) =
P r(c = 1|x; w). The logistic regression model shows the importance of using the
sigmoid function to determine the class of a new observation. Note that this model
works only for linearly separable classes. The application of the binary classification
problem in NN becomes straightforward, given a training set (x1 , c1 ), . . . (xn , cn ).
For the Rosenblatt perceptron, this is obtained by finding the weights, w1 , . . . wm
(and possibly a bias b), minimising an objective function measuring the distance
between the model output and the target. Since the target is binary, the costfunction
consistent with the logistic regression is $\frac{1}{2n}\sum_{i=1}^{n} f\left(g(\mathbf{w}^T\mathbf{x}_i), c_i\right)$, with the distance
function given by f (y, z) = −z log y − (1 − z) log(1 − y). This convex function
is known as cross-entropy error (Bishop 2006) and is derived based on statistical
arguments, see discussion in Sect. 17.2.5 below. The convexity is particularly
useful in optimisation. Rosenblatt (1962) showed precisely that the perceptron
algorithm converges and identifies the hyperplane between the two classes, that is,
the perceptron convergence theorem. The single perceptron NN can be extended to
include two or more neurons, which will form a layer, the hidden layer, yielding
the single-layer perceptron (Fig. 17.1, bottom panel). This latter model can also be
Fig. 17.1 Basic model of a simple nonlinear neuron, or perceptron (top), two examples of sigmoid
functions (middle) and a feedforward perceptron with one input, one output and one hidden layers,
along with the weights W(1) and W(2) linking the input and output layers to the hidden layer
(bottom)
extended to include more than one hidden layer, yielding the multilayer perceptron
(Fig. 17.2). Now, recalling the universal approximation theorem, a network with
one hidden layer can approximate any bounded function with arbitrary accuracy.
Similarly, a network with two hidden layers can approximate any function with
arbitrary accuracy.
Fig. 17.3 Relation between a given neuron of a given layer and neurons from the previous layer
One of the main reasons behind using squashing functions is to reduce the effect
of extreme input values on the performance of the network (Hill et al. 1994).
Many of the sigmoid functions have other nice properties, such as having simple
derivatives, see above for the example of the sigmoid function. This property is
particularly useful during the optimisation process.
For the output layer, the activation can also be linear or threshold function. For
example, in the case of classification, the activation of the output layer is a threshold
function equaling one if the input belongs to its class and zero otherwise. This has
the effect of identifying a classifier, i.e. a function g(.) from the space of all objects
(inputs) into the set of K classes where the value of each point is either zero or one
(Ripley 1994).
Remarks
• The parameter a in the sigmoid function determines the steepness of the response.
• In the case of scalar prediction of a time series, the output layer contains only
one unit (or neuron).
• Note that there are also forward–backward as well as recurrent NNs. In recurrent
NNs connections exist between output units.
NNs can have different designs depending on the tasks. For instance, to compare
two multivariate time series xk and zk , k = 1, 2, . . . n, where xk is p-dimensional
and zk is q-dimensional, a three-layer NN can be used. The input layer will then
have p units (neurons) and the output layer will have q units. Using a three-layer NN in which the transfer functions of the second and third layers, $g^{(2)}(.)$ and $g^{(3)}(.)$, are, respectively, hyperbolic tangent and linear functions, Cybenko (1989), Hornik et al. (1989) and Hornik (1991) showed that it is possible to approximate any continuous function from $\mathbb{R}^p$ to $\mathbb{R}^q$ if the second layer contains a sufficiently large number of neurons.
The NN is trained by finding the optimal values of the weight and bias parameters
which minimise the error or costfunction. This error function is a measure of the
proximity of the NN output O from the desired target T and can be computed using
the squared error O − T 2 , which is a function of the weights (wij ). For example,
when comparing two time series x = {xt , t = 1, . . . , n} and z = {zt , t = 1, . . . , n},
the costfunction takes the form:
$$J = \left\langle\ \|\mathbf{z} - \mathbf{O}(\mathbf{x}, \boldsymbol{\theta})\|^2\ \right\rangle, \qquad (17.2)$$
where θ represents the set of parameters, i.e. weights and biases, O is the output
from the NN and “< >” is a time-average operator. The parameters are then
required to satisfy
∇θ J = 0. (17.3)
where $z_k = \sum_j w_{jk}^{(2)} y_j + b_k^{(2)}$, $y_j = \tanh\left(\sum_i w_{ij}^{(1)} x_i + b_j^{(1)}\right)$, and $z_k^{\circ}$ is the $k$th observed value (target). The NN model is normally trained, i.e. its parameters estimated, by using a first subset of the data (the training set), and then the second subset is used for forecasting (the testing subset).
Remark Different architectures can yield different networks. Examples of special
networks include convolutional (LeCun et al. 2015) and recurrent (e.g. Haykin
2009) networks. Convolutional networks act directly on matrices or tensors (for
images) overcoming, hence, the difficulty resulting from transforming those struc-
tures into vectors, as is the case in conventional fully connected networks (Watt et
al. 2020). Transforming, for example, images into vectors yields a loss of the spatial
information.
Not all networks are feedforward. There are networks that contain feedback con-
nections. These are the recurrent NNs. Recurrent networks differ from conventional
feedforward NN by the fact that they have at least one feedbackward loop. They can
be seen as multiple copies of the same network and are used to analyse sequential
data such as texts and time series.
Another type of network is support vector machine (SVM) pioneered by Boser
et al. (1992). SVM is basically a learning machine with a feedforward network
having a single hidden layer of nonlinear units. It is used in pattern recognition
and nonlinear regression, through a nonlinear mapping from the input space into
a high-dimensional feature space (see Chap. 13). Within this feature space, the
problem becomes linear and could be solved at least theoretically. For example, in
binary classification the problem boils down to constructing a hyperplane maximally
separating the two classes (Haykin 2009; Bishop 2006).
$$z(\mathbf{x}) = a_0 + \sum_{i=1}^{d} w_i x_i$$
Fig. 17.5 Schematic representation of neural network model with multiple layers
includes one or more hidden layers (Fig. 17.5). An architecture with more than one
hidden layer leads to the nested sigmoid scheme, see e.g. Poggio and Girosi (1990):
$$z_l(\mathbf{x}) = g\!\left(a_0 + \sum_i w_{il}^{(1)}\, g\!\left(a_1 + \sum_k w_{ik}^{(2)}\, g\!\left(\ldots\, g\!\left(a_p + \sum_\alpha w_\alpha x_\alpha\right)\ldots\right)\right)\right). \qquad (17.6)$$
Note that each layer can have its own sigmoid although, in general, the same sigmoid
is used for most layers. When the transfer function is an RBF φ(.) (Appendix A),
such as the Gaussian probability density function, one obtains an RBF network, see
e.g. chap. 5 of Haykin (2009), with the mapping:
$$z(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad (17.7)$$
where $\phi_i(\mathbf{x}) = \phi\left(\frac{1}{d_i}\|\mathbf{x} - \mathbf{c}_i\|\right)$. The parameter $\mathbf{c}_i$ is the centre of $\phi_i(.)$ and $d_i$ its scaling factor. In this case the distance between the input $\mathbf{x}$ and the centres (or weights) is used instead of the standard dot product $\sum_i w_{ij} x_i$.
Common choices of RBFs include the Cauchy distribution, multiquadratic and
its inverse, the Gaussian function and thin-plate spline. RBF networks1 can model
1 There is also another related class of NNs, namely probabilistic NNs, derived from RBF networks,
most useful in classification problems. They are based on estimating the pdfs of the different classes
(Goodfellow et al. 2016).
any shape of function, and therefore the number of hidden layers can be reduced
substantially compared to MLP. RBF networks can also be trained extremely
quickly compared to MLP. The learning process is achieved through the training
set. If no training set is available, the learning is unsupervised, also referred to
sometimes as self-organisation, which is presented later.
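The following numpy sketch illustrates why RBF networks can be trained quickly: once the centres are fixed (here simply a random subset of the training inputs, an illustrative choice), the output weights of Eq. (17.7) follow from a single linear least squares fit; the Gaussian basis and the common scale are assumptions of this sketch.

import numpy as np

def rbf_design(X, centres, scale):
    """Design matrix Phi_ij = exp(-||x_i - c_j||^2 / (2 scale^2)) (Gaussian RBF)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * scale**2))

def rbf_fit(X, z, n_centres=20, scale=1.0, seed=0):
    """Fit an RBF network (Eq. 17.7): random centres, output weights by least squares."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(X.shape[0], n_centres, replace=False)]
    Phi = rbf_design(X, centres, scale)
    w, *_ = np.linalg.lstsq(Phi, z, rcond=None)
    return centres, w

def rbf_predict(Xnew, centres, w, scale=1.0):
    return rbf_design(Xnew, centres, scale) @ w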
Example (Autoregressive Model) A two-layer NN model connected by linear
transfer functions with inputs being the lagged values of a time series, xt , . . . xt−d
and whose output is the best prediction x̂t+1 of xt+1 from previous values reduces
to a simple autoregressive model. The application section below discusses other
examples including nonlinear principal component analysis.
The process of changing the weights during the minimisation of the costfunction
in NNs makes the training or learning algorithm. The most known minimisation
algorithm in NNs is the backpropagation (e.g. Watt et al. 2020). It is based on taking small steps $\Delta w_{ij}$, controlled by the unchanged learning rate $\eta$, in the direction of steepest descent, i.e. following $-\nabla J$, as
$$w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij} = w_{ij}^{old} - \eta\,\frac{\partial J}{\partial w_{ij}}.$$
This descent is controlled by the learning rate η. Consider, for example, the
feedforward NN with two (input and output) layers and one hidden layer (Fig. 17.1).
Let x1 , . . . xd be the actual values of the input units (in the input layer), which will
propagate to the hidden layer. The response (or activation) value hl at unit l of the
hidden layer takes the form
$$h_l = g_h\!\left(\sum_{i=1}^{d} w_{il}^{(1)} x_i + \varepsilon_l^{(1)}\right),$$
where gh (.) is the activation function of the hidden layer. These responses will then
make the inputs to the output layer “o” (Fig. 17.1) so that the kth output ok takes the
form
$$o_k = g_o\!\left(\sum_{l} w_{lk}^{(2)} h_l + \varepsilon_k^{(2)}\right) = g_o\!\left(\varepsilon_k^{(2)} + \sum_{l} w_{lk}^{(2)}\, g_h\!\left(\sum_{i=1}^{d} w_{il}^{(1)} x_i + \varepsilon_l^{(1)}\right)\right),$$
to other algorithms such as simulated annealing, see e.g. Hertz et al. (1991) and
Hsieh and Tang (1998). Because of the nested sigmoid architecture, the conventional
chain rule to compute the gradient can easily become confusing and erroneous
particularly when the network grows complex. The most popular algorithm used for
supervised learning is the (error) backpropagation algorithm (see e.g. Amari 1990).
At its core, backpropagation is an efficient (and exact) way to compute the gradient
of the costfunction in only one pass through the system. Backpropagation is the
equivalent of adjoint method, i.e. backward integration of the adjoint equations, in
variational data assimilation (Daley 1991).
Backpropagation proceeds as follows. Let $y_i^{\alpha}$ denote the activation of the $i$th unit from layer $\alpha$ (Fig. 17.6), with $\alpha = 1, \ldots, L$ (the values 1 and $L$ correspond, respectively, to the input and output layers). Let also $x_i^{\alpha}$ denote the input to the $i$th neuron of layer $\alpha+1$ prior to the sigmoid activation and $w_{ij}^{\alpha}$ the weights between
Here we have dropped the bias function for simplicity, and we consider only one
sigmoid function g(.). If there are different sigmoids for different layers, then g(.)
will be replaced by gα (.). The costfunction then takes the form
$$J = \left\|\mathbf{z}^{\circ} - \mathbf{y}^{L}\right\|^2 = \sum_{k=1}^{m}\left(z_k^{\circ} - y_k^{L}\right)^2, \qquad (17.9)$$
and, over the whole training sample of $C$ cases,
$$J = \sum_{n=1}^{C}\left\|\mathbf{z}_n^{\circ} - \mathbf{y}_n^{L}\right\|^2. \qquad (17.10)$$
$$\frac{\partial J}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial x_i^{\alpha}}\frac{\partial x_i^{\alpha}}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial x_i^{\alpha}}\, y_j^{\alpha}, \qquad (17.11)$$
$$\frac{\partial J}{\partial w_{ji}^{\alpha}} = \frac{\partial J}{\partial y_i^{\alpha+1}}\, g'(x_i^{\alpha})\, y_j^{\alpha}. \qquad (17.12)$$
$$\frac{\partial J}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial x_j^{\alpha+1}}\frac{\partial x_j^{\alpha+1}}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial x_j^{\alpha+1}}\, w_{ij}^{\alpha+1}. \qquad (17.13)$$
The term $\frac{\partial J}{\partial x_j^{\alpha+1}}$ is computed as in Eq. (17.12), i.e. $\frac{\partial J}{\partial x_j^{\alpha+1}} = \frac{\partial J}{\partial y_j^{\alpha+2}}\, g'(x_j^{\alpha+1})$, and Eq. (17.13) becomes
$$\frac{\partial J}{\partial y_i^{\alpha+1}} = \sum_j \frac{\partial J}{\partial y_j^{\alpha+2}}\, g'(x_j^{\alpha+1})\, w_{ij}^{\alpha+1}. \qquad (17.14)$$
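The following numpy sketch implements one steepest-descent step for a network with a single sigmoid hidden layer and a linear output layer, with the gradients assembled as in Eqs. (17.11)–(17.14) but written in matrix form; the array shapes, the 1/n averaging of the cost and the learning rate are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(X, Z, W1, b1, W2, b2, eta=0.05):
    """One gradient-descent step for a one-hidden-layer network with squared-error
    costfunction; X: n x d inputs, Z: n x m targets (sketch of Eqs. 17.9-17.14)."""
    n = X.shape[0]
    # forward pass
    H = sigmoid(X @ W1 + b1)                 # hidden activations, n x h
    O = H @ W2 + b2                          # linear outputs, n x m
    # backward pass: propagate dJ towards the input layer
    dO = 2.0 * (O - Z) / n                   # dJ/d(output), cf. Eq. (17.9), averaged over n
    dW2, db2 = H.T @ dO, dO.sum(0)           # gradient w.r.t. output-layer weights, Eq. (17.12)
    dH = (dO @ W2.T) * H * (1.0 - H)         # chain rule through the sigmoid, Eqs. (17.13)-(17.14)
    dW1, db1 = X.T @ dH, dH.sum(0)
    # steepest-descent update with learning rate eta
    return (W1 - eta * dW1, b1 - eta * db1,
            W2 - eta * dW2, b2 - eta * db2, ((O - Z) ** 2).sum() / n)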
Remark The inverse of the logistic function is known as the logit function. It is used
in generalised linear models (McCullagh and Nelder 1989) in the logistic regression
model, i.e. $\log\frac{y}{1-y} = \mathbf{w}^T\mathbf{x}$, in which $y$ is the binary/Bernoulli response variable,
i.e. the probability of belonging to one of the two classes. The logit is also known as
the link function of this generalised linear model. The above illustration shows the
equivalence between the single-layer perceptron and the logistic regression.
The above Bernoulli distribution can be extended to $K$ outcomes with respective probabilities $p_1, \ldots, p_K$, $\sum_k p_k = 1$. Similarly, for a $K$-class classification, the likelihood for a given input $\mathbf{x}$ with targets (classes) $z_1, \ldots, z_K$ in $\{0, 1\}$ and outputs $y_k = Pr(z_k = 1|\mathbf{x}, \mathbf{w})$, $k = 1, \ldots, K$, is obtained as a generalisation of the two-class model to yield $\prod_{k=1}^{K} y_k^{z_k}(\mathbf{x}, \mathbf{w})$. With a training set $(\mathbf{x}_i, z_{ij})$, $i = 1:n$, $j = 1:K$, the $K$-class cross-entropy takes the form $-\sum_{ik} z_{ki}\log y_k(\mathbf{x}_i, \mathbf{w})$. The $K$-class logistic function can be obtained based on the Bayes theorem, i.e. $Pr(k|\mathbf{x}) \propto Pr(\mathbf{x}|k)Pr(k)$, where $Pr(k|\mathbf{x})$ is the probability of class $k$ given $\mathbf{x}$, and hence $y_k$ can be written as $y_k(\mathbf{x}, \mathbf{w}) = \exp(a_k)/\left(\sum_j\exp(a_j)\right)$, referred to as the softmax function (e.g. Bishop 2006).
Remark As outlined in Sect. 17.2.1, for regression problems the activation function
of the output unit is linear (in the weights), and the costfunction can simply be the
quadratic norm of the error.
17.3.1 Background
Broadly speaking, there are two main classes of training networks, namely super-
vised and unsupervised training. The former refers to the case in which a target
output exists for each input pattern (or observation) and the network learns to
produce the required outputs. When the pair of input and output patterns is not
available, and we only have a set of input patterns, then one deals with unsupervised
training. In this case, the network learns to identify the relevant information from
the available training sample. Clustering is a typical example of unsupervised
learning. A particularly interesting type of unsupervised learning is based on what
is known as competitive learning, in which the output network neurons compete
among themselves for activation resulting, through self-organisation, in only one
activated neuron: the winning neuron or the best matching unit (BMU), at any time.
The obtained network is referred to as self-organising map (SOM) (Kohonen 1982,
2001). SOM or a Kohonen network (Rojas 1996) has two layers, the input and output
layers. In SOM, the neurons are positioned at the nodes of a usually low-dimensional
(one- or two-dimensional) lattice. The positions of the neurons follow the principle
of neighbourhood so that neurons dealing with closely related input patterns are
kept close to each other following a meaningful coordinate system (Kohonen 1990;
Haykin 1999, 2009). In this way, SOM maps the input patterns onto the spatial
locations (or coordinates) of the neurons in the low-dimensional lattice. This kind of
discrete maps is referred to as topographic map. SOM can be viewed as a nonlinear
generalisation of principal component analysis (Ritter 1995; Haykin 1999, 2009).
SOM is often used for clustering and dimensionality reduction (or mapping) and
also prediction (Vesanto 1997).
Fig. 17.7 Schematic of the SOM mapping from the high-dimensional feature space into the
low-dimensional space (left) and the Kohonen SOM network showing the input layer and the
computational layer (or SOM grid) linked by a set of weights (right)
Training of SOM
Fig. 17.8 Left: schematic of the updating process of the best matching unit and its neighbours.
The solid and dashed lines represent, respectively, the topological relationships of the SOM before
and after updating, the units are at the crossings of the lines and the input sample vector is shown
by x. The neighbourhood is shown by the eight closest units in this example. Middle and right:
rectangular and hexagonal lattices, respectively, and the units are shown by the circles. In all panels
the BMU is shown by the blue filled circles
Equation (17.15) summarises the competitive process among the M output neurons
in which i(x) is the winning neuron or BMU (Fig. 17.8, left). SOM has a number of
components that set up the SOM algorithm. Next to the competitive process, there
are two more processes, namely co-operative and adaptive processes, as detailed
below.
The co-operative process concerns the neighbourhood structure. The winning
neuron determines a topological neighbourhood since a firing neuron tends to excite
nearby neurons more than those further away. A topology is then defined in the SOM
lattice, reflecting the lateral interaction among a set of excited neurons (Fig. 17.8).
Let d_kl denote the lateral distance between neurons k and l on the SOM grid. This
then allows us to define the topological neighbourhood h_{j,i(x)}. As is mentioned above,
various neighbourhood functions can be used. A typical choice for this topology or
neighbourhood is given by the Gaussian function:
h_{j,i(x)} = exp( − d_{j,i(x)}^2 / (2σ^2) ),                    (17.16)
where d_{j,i(x)} = ‖r_j − r_{i(x)}‖ denotes the lateral distance on the SOM grid
between the excited neuron j and the winning neuron i(x), whose positions on that
grid are r_j and r_{i(x)}, respectively. A characteristic feature of SOM
is that the neighbourhood size shrinks with time (or iteration). A typical choice of
this shrinking is given by
σ(n) = σ_0 exp( −n/τ_0 ),                                       (17.17)
which is applied to all neurons in the lattice within the neighbourhood of the winning
neuron (Fig. 17.8). The discrete time learning rate is given by η(n) = η_0 exp(−n/τ_1).
The values η0 = 0.1 and τ1 = 1000 are typical examples that can be used in
practice (Haykin 1999). Also, in Eq. (17.17), the value τ0 = 1000/ log(σ0 ) can be
adapted in practice (Haykin 1999). The neighbourhood of neurons can be defined by
a radius within the 2D topological map of neurons. This neighbourhood decreases
monotonically with iterations.
A typical initial value of this neighbourhood radius
could be of the order O(√N), e.g. √N/2, for a sample size N. The size of the
topological map, that is, the number of neurons M, can be learned from experience,
but typical values can be of the order O(√N). For example, for a two-dimensional
SOM, the grid can have a total number of, say, 5√N neurons (e.g. Vesanto and
Alhoniemi 2000).
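For illustration only, the competitive, co-operative and adaptive steps described above can be sketched in a few lines of Python/NumPy. This is a minimal sketch rather than a reference implementation; the lattice size and the constants σ_0, η_0, τ_0 and τ_1 are the illustrative values quoted in the text, and the update rule is the standard SOM weight update.

import numpy as np

def train_som(X, grid=(5, 5), n_iter=1000, sigma0=2.0, eta0=0.1, tau1=1000.0):
    """Minimal SOM training sketch: competitive, co-operative and adaptive steps."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    rows, cols = grid
    # lattice positions r_j of the M = rows*cols neurons
    r = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = X[rng.choice(n, rows * cols, replace=False)].astype(float)  # initial weight vectors
    tau0 = n_iter / np.log(sigma0)                   # as in Eq. (17.17)
    for t in range(n_iter):
        x = X[rng.integers(n)]                       # present one input pattern
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron (BMU)
        sigma = sigma0 * np.exp(-t / tau0)           # shrinking neighbourhood size
        eta = eta0 * np.exp(-t / tau1)               # decaying learning rate
        d2 = np.sum((r - r[bmu]) ** 2, axis=1)       # squared lattice distances to the BMU
        h = np.exp(-d2 / (2.0 * sigma ** 2))         # Gaussian neighbourhood, Eq. (17.16)
        W += eta * h[:, None] * (x - W)              # weight update around the BMU
    return W, r

# example usage with synthetic data (n = 500 observations, d = 3 variables)
W, r = train_som(np.random.default_rng(1).normal(size=(500, 3)))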
Remark The SOM error can be computed from the input data, the weight vectors
and the neighbourhood function. For a fixed neighbourhood function, the SOM error
function is E_SOM = Σ_{t=1}^{N} Σ_{j=1}^{M} h_{j,i(x_t)} ‖x_t − w_j‖^2, where h_{j,i(x_t)}
represents the neighbourhood function centred at unit i(x_t), i.e. the BMU of the input
vector x_t.
In order not to end in a metastable state, the adaptive process has two identifiable
phases, namely the ordering or self-organising and the convergence phases. During
the former phase, the topological ordering of the weight vectors takes place within
around 1000 iterations of the SOM algorithm. In this context, the choice of η0 =
0.1 and τ1 = 1000 is satisfactory, and the previous choice of τ0 in Eq. (17.17)
can also be adopted. During the convergence phase, the feature map is fine-tuned,
thus providing an accurate statistical quantification of the input patterns, which
takes a number of iterations, around 500 times the number of neurons in the network
(Haykin 1999).
17.4 Random Forest
Decision trees are the building blocks of random forests. They aim, based on a
training set, at predicting the output of any data from the input space. A decision
tree is based on a sequence of binary selections and looks like a (reversed) tree
(Bishop 2006). Precisely, the input sample is (sequentially) split into two or more
homogeneous sets based on the main features of the input variables. The following
simple example illustrates the basic concept. Consider the set {−5, −3, 1, 2, 6},
Fig. 17.9 Decision tree of the simple dataset shown in the top box
which is to be separated using the main features (or variables), namely sign (+/−)
and parity (even/odd). Starting with the sign, the binary splitting yields the two sets
{1, 2, 6} and {−3, −5}. The latter set is homogeneous, i.e. a class of negative odd numbers.
The other set is not, and we use the parity feature to yield {1} and {2, 6}. This can be
summarised by a decision tree shown in Fig. 17.9. Each decision tree is composed
of:
• root node—containing the entire sample,
• splitting node—a node that is split into further subnodes,
• decision node—a subnode that splits into further subnodes,
• terminal node or leaf—a node that cannot split further,
• branch—a subtree of the whole tree and
• parent and child nodes.
In the above illustrative example, the root node is the whole sample. The set {1, 2, 6},
for example, is a decision node, whereas {2, 6} is a terminal node (or leaf). Also, the
last subset, i.e. {2, 6}, is a child node of the parent node {1, 2, 6}. The splitting rule
in the above example is quite simple, but for real problems more criteria are used
depending on the type of problem at hand, namely regression or classification, which
are discussed next.
As is mentioned above, decision trees attempt to partition the input (or pre-
dictor) space using a sequence of binary splitting. This splitting is chosen to
optimise a splitting criterion, which depends on the nature of the predictor, e.g.
discrete/categorical versus continuous, and the type of problem at hand, e.g.
classification versus regression. In the remaining part of this chapter we assume
that our training set is composed of n observations (x_1, y_1), . . . , (x_n, y_n), where,
for each k, k = 1, . . . , n, x_k = (x_{k1}, . . . , x_{kd})^T contains the d variables (or
features), and y_k is the response variable. The splitting proceeds recursively with
each unsplit node by looking for the “best” binary split.
Case of Categorical Predictors
We discuss here the case of categorical predictors. A given unsplit (parent) node,
containing a subset x1 , . . . , xm of the training set, is split into two child nodes: a
left (L) node and a right (R) node. For regression, a common criterion F is the mean
square residual or variance of the response variable. For this subsample in the parent
node, the cost function (variance) is
F = (1/m) Σ_{i=1}^{m} (y_i − ȳ)^2,                              (17.19)
where ȳ is the average response of this subset. Now, if F_L and F_R are the
corresponding variances of the left and right child nodes, then the best split
is based on the variable (or feature) yielding the maximum variance reduction, i.e.
max |F − F_LR|, or equivalently minimising F_LR, where F_LR =
m^{-1} (m_L F_L + m_R F_R), with m_L and m_R being the sample sizes of the left and right
(child) nodes, respectively. Note again that to obtain the best result, every possible
split based on every variable (among the d variables) is considered.
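As a small illustration (not from the original text), the following Python helper computes F_LR for one candidate left/right split of the responses in a node; scanning all candidate splits and keeping the one with the smallest F_LR then gives the "best" split.

import numpy as np

def split_cost(y_left, y_right):
    """F_LR = (m_L F_L + m_R F_R) / m, the weighted variance of the two child nodes."""
    m_l, m_r = len(y_left), len(y_right)
    f_l = np.var(y_left) if m_l > 0 else 0.0    # variance of the left child responses
    f_r = np.var(y_right) if m_r > 0 else 0.0   # variance of the right child responses
    return (m_l * f_l + m_r * f_r) / (m_l + m_r)

# the best split maximises the variance reduction |F - F_LR|
y = np.array([3.1, 2.9, 0.2, 0.1, 0.3])
print(split_cost(y[:2], y[2:]))   # small F_LR: the split separates two homogeneous groups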
The splitting criterion for classification, with, say, K classes, is different. Let
p_k designate the proportion of the kth class in a given (unsplit) node, i.e.
p_k = m^{-1} Σ_{i=1}^{m} 1_{{y_i = k}}, where 1_A is the indicator of the set A, i.e. 1 if A is true and 0
otherwise. Then the most commonly used criterion is the Gini index (e.g. Cutler
et al. 2012):
F = Σ_{k ≠ k'} p_k p_{k'} = 1 − Σ_{k=1}^{K} p_k^2,              (17.20)
and the “best” split corresponds again to the one maximising |F − FLR | or
minimising FLR .
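A corresponding Python helper for the classification criterion of Eq. (17.20) is sketched below; it is an illustration only, not code from the book.

import numpy as np

def gini(labels):
    """Gini index F = 1 - sum_k p_k^2 of the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))   # 0.5: maximally mixed two-class node
print(gini([1, 1, 1, 1]))   # 0.0: pure (homogeneous) node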
Case of Continuous Predictors
In the case of continuous predictor variables, the splitting involves sorting the values
of the predictor subset in the node of interest and considering all splits between
consecutive pairs (Cutler et al. 2012). Precisely, denote again by x1 , . . . , xm the
continuous d-dimensional subset (of the training set) in the node at hand. For each
variable j, j = 1, . . . , d, the observations are sorted as x_{1,j} ≤ x_{2,j} ≤ . . . ≤ x_{m,j},
and a set of thresholds θ_{0,j}, θ_{1,j}, . . . , θ_{m,j} is chosen, with θ_{0,j} ≤ x_{1,j}, x_{m,j} ≤ θ_{m,j}
and x_{t,j} ≤ θ_{t,j} ≤ x_{t+1,j} for t = 1, . . . , m − 1. Note that any choice of the θ
parameters satisfying the above inequalities yields the same results. For example, a
common choice (after sorting) corresponds to the midpoint of any two consecutive
values, i.e. θ_{t,j} = 0.5 (x_{t,j} + x_{t+1,j}) for t = 1, . . . , m − 1. Next, for each t,
t = 0, . . . , m, and each j, j = 1, . . . , d, a binary split is defined based on the condition
1_{{x_{.j} ≤ θ_{t,j}}}, where x_{.j} stands for the jth variable of any observation. In other words,
for any choice of θ (among the (m + 1)d choices of θ_{t,j}) and any choice of variable,
a split is performed on the node under consideration. This requires around O(m^2 d)
operations. For each of these splits, the cost function F_LR is computed, and the best
split is again chosen based on minimising FLR . Using a fast algorithm, the number
of operations can be reduced to O(dm log(m)). We remind again that each split is
based on one variable or feature.
The above steps can be summarised in the following algorithm
(e.g. Cutler et al. 2012). Given a training set (x1 , y1 ), . . . , (xn , yn ), where xt =
(xt1 , . . . , xtd )T and yt , t = 1, . . . n, denote, respectively, the d predictors and
associated responses, the decision tree algorithm proceeds as follows:
(1) Starting—Begin with the training set in the root node.
(2) Tree construction—Each unsplit node is split into two child nodes based on the
best split, as detailed above by finding the predictor variable among 1, . . . d,
which minimises FLR .
(3) Prediction—When the tree is obtained, a new input variable x (not from the
training set) is passed through the tree until it falls in a terminal node (or
leaf), denoted l, with responses yl1 , . . . , ylp (obtained from the training set).
The prediction for x is then given by y_x = p^{-1} Σ_{t=1}^{p} y_{l_t} for regression and
y_x = argmax_y Σ_{t=1}^{p} 1_{{y_{l_t} = y}} for classification, where y refers to any one class.
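For readers who prefer a ready-made implementation, the recursive splitting and leaf prediction of steps (1)–(3) are available, for example, in scikit-learn; the snippet below is a hedged illustration with synthetic data, not the algorithm of the references above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                     # n = 200 observations, d = 4 predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # synthetic binary response
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(tree.predict(rng.normal(size=(5, 4))))      # new inputs are passed down to a leaf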
Remarks
• Besides the two optimisation criteria mentioned above, there are a number of
other criteria that are used in decision trees. These include chi-square, entropy
and gain ratio.
• Decision trees have a number of advantages but also disadvantages. The follow-
ing are the main advantages:
– it is a non-parametric method;
– it is easy to understand and implement;
– it can use any data type (categorical or continuous).
• Although decision trees can be applied to continuous variables, they are not ideal
because of categorisation. But the main disadvantage is that of overfitting, related
to the fact that the tree can grow unnecessarily—e.g. ending with one leaf for
each single observation.
One way to overcome the main downside mentioned above, i.e. overfitting, is to
apply what is known as pruning. Pruning consists of trimming off unnecessary
branches starting from the leaf nodes in such a way that accuracy is not much
reduced. This can be achieved, for example, by dividing the training set into
two subsets, fitting the (trimmed) tree with one subset and validating it with the
other subset. The best trimmed tree corresponds to optimising the accuracy of the
validation set. There is another more attractive way to overcome overfitting, namely
random forest discussed next.
A convenient and robust way to get an independent estimate of the prediction error
is to use, for each tree in the random forest, the observations that do not get selected
in the bootstrap; a tool similar to cross-validation. These data are referred to as out-
of-bag (oob) data. It is precisely these data that are used to estimate the performance
of the algorithm, and this works as follows. For each input observation (x_t, y_t) from the training
set, find the set T_t of those (N_t) trees that did not include this observation. Then
the oob prediction at x_t is given by y_oob,t = N_t^{-1} Σ_{i∈T_t} y_{x_t,i} for regression and
y_oob,t = argmax_y Σ_{i∈T_t} 1_{{y_{x_t,i} = y}} for classification, where y_{x_t,i} is the prediction
of x_t based on the ith tree of the random forest. These predictions are then used to calculate
the oob error, ε_oob, given by (e.g. Cutler et al. 2012) ε_oob = n^{-1} Σ_{t=1}^{n} (y_t − y_oob,t)^2 for
regression or ε_oob = n^{-1} Σ_{t=1}^{n} 1_{{y_t ≠ y_oob,t}} for classification.
Parameter Selection
To get the best out of the random forest, experience shows that the number of trees in
the forest should be chosen large enough (Breiman 2001), e.g. several hundred.
Typical values of the number of variables selected at each split depend on
the type of the problem at hand. For classification, a standard value is √d,
whereas for regression it is of the order d/3.
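A hedged scikit-learn illustration of these rules of thumb, including the oob error estimate discussed above, could look as follows; the data are synthetic and the library calls are an assumption, not the book's own code.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 9))                                   # d = 9 predictors
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=500)
rf = RandomForestRegressor(n_estimators=500,                    # several hundred trees
                           max_features=X.shape[1] // 3,        # ~d/3 variables per split
                           oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # out-of-bag estimate of prediction skill (R^2 for regression)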
Remarks Random forest algorithm has a number of advantages inherited from
decision trees. The main advantages are accuracy, robustness (to outliers) in addition
to being easy to use and reasonably fast. The algorithm can also handle missing
values in the predictors and scales well with large samples (Hastie et al. 2009).
The algorithm, however, has a few disadvantages. Random forests tend to be
difficult to interpret, in addition to being not very good at capturing relationships
involving linear combinations of predictor variables (Cutler et al. 2012).
17.5 Application
Machine learning has been applied in weather and climate since the late 90s, with
an application of neural network to meteorological data analysis (e.g. Hsieh and
Tang 1998). The interest in machine learning applications in weather and climate
has grown recently in academia and also in weather centres, see e.g. Scher
(2020). The application spans a wide range of topics ranging from exploration to
weather/climate prediction. This section discusses some examples, and for more
examples and details, the reader is referred to the references provided.
Nonlinear EOF, or nonlinear principal component analysis, can take various forms.
We have already presented some of the methods in the previous chapters, such as
Fig. 17.10 Schematic representation of the NN design for nonlinear PCA. Adapted from Hsieh
(2001b)
independent PCA, PP and kernel EOFs. Another class of nonlinear PCA has also
been presented, which originates from the field of artificial intelligence. These are
based on neural networks (Oja 1982; Diamantaras and Kung 1996; Kramer 1991;
Bishop 1995) and have also been applied to climate data (Hsieh 2001a,b; Hsieh
and Tang 1998; Monahan 2000). As is discussed in Sect. 17.2, the neural network
model is based on linking an input layer containing the input data to an output
layer using some sort of “neurons” and various intermediate layers. The textbook of
Hsieh (2009) provides a detailed account of the application of NN to nonlinear PC
analysis, which is briefly discussed below.
The NN model used by Hsieh and Tang (1998) and Monahan (2000) to construct
nonlinear PCs contains five layers: three intermediate (or hidden) layers plus the input
and output layers. A schematic representation of the model is shown in Fig. 17.10.
A nonlinear function maps the high-dimensional state vector x = (x_1, . . . , x_p)^T
onto a low-dimensional state vector u (one-dimensional in this case). Then, another
nonlinear transformation maps u onto the state space vector z = (z_1, . . . , z_p)^T
in the original p-dimensional space. This mapping is achieved by minimising
the cost function J = ⟨‖x − z‖^2⟩. These transformations (or weighting functions)
are
f(x) = f_1(W_1 x + b_1),
u = f_2(W_2 f(x) + b_2),
g(u) = f_3(w_3 u + b_3),                                        (17.21)
z = f_4(W_4 g(u) + b_4),
where f1 (.) and f3 (.) are m-dimensional functions. Also, W1 and W4 are m × p and
p × m weight matrices, respectively. The objective of NN nonlinear PCA is to find
the scalar function u(t) = F(x(t)) that minimises J. Note that if F(.) is linear, the
model reduces to classical (linear) PCA.
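The five-layer architecture of Eq. (17.21) is essentially a bottleneck autoencoder. The sketch below uses the Keras API (an assumption; any neural-network library would do) with illustrative layer sizes and placeholder data; it is not the code of Hsieh and Tang (1998).

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

p, m = 10, 4                                        # input dimension and hidden width (assumed)
inputs = keras.Input(shape=(p,))
h1 = layers.Dense(m, activation="tanh")(inputs)     # f1(W1 x + b1)
u = layers.Dense(1, activation="linear")(h1)        # bottleneck: the nonlinear PC u
h2 = layers.Dense(m, activation="tanh")(u)          # f3(w3 u + b3)
z = layers.Dense(p, activation="linear")(h2)        # reconstruction z
model = keras.Model(inputs, z)
model.compile(optimizer="adam", loss="mse")         # J = <||x - z||^2>

X = np.random.default_rng(0).normal(size=(500, p))  # placeholder data
model.fit(X, X, epochs=50, batch_size=32, verbose=0)
pcs = keras.Model(inputs, u).predict(X)             # u(t) = F(x(t))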
Weather and climate prediction is another topic that has attracted the interest of climate
scientists. Neural networks can in principle approximate any nonlinear function
(Nielsen 2015; Goodfellow et al. 2016) and can be used to approximate the
nonlinear relationships involved in weather forecasting. An example was presented
by Scher (2018) to approximate a simple GCM using a convolutional neural
network. This example was used as a proof of concept to show that it is possible
to consider NN to learn the time evolution of atmospheric fields and hence provide
a potential for weather prediction. Convolutional NN was also used to study
precipitation predictability on regional scales and discharge extremes by Knighton
et al. (2019). Another example was presented by Subashini et al. (2019) to forecast
weather variables using data collected from the National Climatic Data Centre. They
used a recurrent NN, based on a long short-term memory (LSTM) algorithm, to
forecast wind, temperature and cloud cover.
Weyn et al. (2019) developed an elementary deep learning NN to forecast a few
meteorological fields. Their model was based on a convolutional NN architecture
and was used to forecast 500-hPa geopotential height for up to 14 days lead
time. The model was found to outperform persistence, climatology and barotropic
Fig. 17.11 Nonlinear PCs from the control simulation of the Canadian climate model showing the
temporal variability along the NN PC1 with its PDF (top) and the nonlinear PC approximation of
the data projected onto the leading two PCs (bold curve, bottom) along with the PDFs associated
with the two branches. Adapted from Monahan et al.(2000)
Fig. 17.12 Composite SLP anomaly maps associated with the three representative points on the
nonlinear PC shown in Fig. 17.11. Adapted from Monahan et al.(2000)
vorticity models, in terms of root mean square errors (RMSEs) at forecast lead time
of 3 days. The model does not, however, beat an operational weather forecasting
system such as the climate forecast system (CFS), as expected, since the latter contains
the full physics. Weyn et al. (2019) found, however, that the model does a good job of
forecasting realistic states at a lead time of 14 days and captures reasonably well the
Fig. 17.13 As in Fig. 17.11, but for the climate change simulation. Adapted from Monahan
et al.(2000)
500-hPa climatology and annual variability. Figure 17.14 shows the RMSE of 500-
hPa heights for up to 3 days lead time of different convolutional NN architectures
compared to the other models. An example of 24-hr 500-hPa forecast is shown in
Fig. 17.15. The main features of the field are very well captured.
Another important topic in weather prediction is forecasting uncertainty. Forecast
uncertainty is an integral component of weather (and climate) prediction, which
is used by the end users for planning and design. In weather forecasting centres,
forecast uncertainty is usually obtained using a computationally expensive ensemble
of numerical weather predictions. A number of authors have proposed machine
learning as an alternative to ensemble methods. An example where this is important
is tropical cyclone (Richman and Leslie 2010; Richman et al. 2017) and typhoon
(Haghroosta 2019) forecasting. For example, Richman and Leslie (2012) used machine
learning approaches, based on support vector regression (a subclass of support
vector machine), to provide seasonal prediction of tropical cyclone frequency and
intensity over Australia. Recall that support vector regression is a special
architecture of neural net with two layers, an input and an output layer, and
where each input observation is mapped into a high-dimensional feature space.
As mentioned above, the architecture belongs to the class of radial basis function
Fig. 17.14 RMSE forecast error of 500-hPa height over the test period 2007–2009, obtained from
different models: persistence, climatology, barotropic vorticity, the operational climate forecast
system (CFS) and the different convolutional NN architectures. Adapted from Weyn et al. (2019)
Fig. 17.15 An example of a 24-hr 500-hPa height forecast at 0 UTC 13 April 2007, based on
the NN architectures (bottom) compared to the barotropic (c) and the CFS (d). Coloured shading
shows the difference between forecasts and the verification (b) in dekameter. (a) initial state, (e-h)
forecasts from LSTM neural network. Adapted from Weyn et al. (2019)
networks (Haykin 2009, Chap. 5), in which the mapping is based on nonlinear
radial basis functions (Appendix A). The authors obtained high values of R^2, of the
order of 0.56, compared to 0.18 obtained with conventional multiple linear regression.
Fig. 17.16 Schematic illustration of the convolutional NN used by Scher and Messori (2019)
to predict weather forecast uncertainty. The network is fed with gridded atmospheric fields and
generates a scalar representing the uncertainty forecast (Source: Modified from Scher and Messori
(2019))
Richman et al. (2017) used the same machine learning architecture to reduce tropical
cyclone prediction error in the North Atlantic regions. A review of the application
of machine learning to tropical cyclone forecast can be found in Chen et al. (2020).
Scher and Messori (2019) proposed machine learning to predict weather forecast
uncertainty. They considered a convolutional NN (Krizhevsky et al. 2012; LeCun
et al. 2015) trained on past weather forecasts. As is discussed above, a convolutional
NN is not fully connected: it is characterised by local (i.e. not full) connections and
weight sharing (i.e. similar weights shared across locations), involves the convolution
operation and hence is faster than a fully connected net. In this network, training is done
with the forecast errors and the ensemble spread of forecasts. An uncertainty is
then predicted, given an initial forecast field (Scher 2020). Figure 17.16 shows a
schematic illustration of the approach. They suggest that while the obtained skill is
lower than that of ensemble methods, the network-based method is computationally
very efficient and hence is worth exploring further.
SOM has been applied since the early 90s, and is still being applied, in atmospheric
science and oceanography to reduce the dimensionality of the system and to identify
patterns and clusters (e.g. Hewitson and Crane 1994; Ambroise et al. 2000; Liu et
al. 2006; Liu and Weisberg 2005; Cassano et al. 2015; Huva et al. 2015; Gibson et al.
2017; Meza–Padilla 2019). This application spans a wide range of topics including
synoptic climatology, cloud classification, weather/climate extremes, downscaling
and climate change. SOM has also been suggested to be used in time series
prediction (Vesanto 1997). SOM is particularly convenient and quite useful in
synoptic weather categorisation (Sheridan and Lee 2010; Hewitson and Crane
2002). The obtained weather types are then used to study the relationship between
large scale teleconnections and local surface climate variables such as surface
temperature or precipitation. Surface weather maps and mid-tropospheric fields
have been used by a number of authors to study changes in atmospheric synoptic
circulations and their relation to precipitation (e.g. Hewitson and Crane 2002). The
identification of synoptic patterns from reanalysis as well as climate models was also
performed via SOM by Schuenemann et al. (2009) and Schuenemann and Cassano
(2010). Mass and moisture fields were also used to study North American monsoon
(Cavazos et al. 2002), precipitation downscaling (Ohba et al. 2016), Antarctic
climate (Reusch et al. 2005) and local Mediterranean climate (Khedairia and Khadir
2008), see e.g. Huth et al. (2008) for a review of SOM application to synoptic
analysis.
Clouds are known to have complex features and constitute a convenient test-bed
for SOM application and feature extraction (Ambroise et al. 2000; McDonald et
al. 2016). For example, McDonald et al. (2016) show that SOM analysis enables
identification of a wide range of cloud clusters representative of low-level cloud
states, which are related to geographical position. They also suggest that SOM
enables an objective identification of the different cloud regimes. SOM has also
been applied to study weather and climate extremes and climate change (Gibson
et al. 2016; Horton et al. 2015; Gibson et al. 2017; Cassano et al. 2016). These
studies show that SOM can reveal correspondence between changes in the frequency
of geopotential height patterns and temperature and rainfall extremes (Horton et
al. 2015; Cassano et al. 2015, 2016). Gibson et al. (2017) found that synoptic
circulation patterns are well represented during heat waves in Australia but also
highlight the importance of critically assessing the SOM features.
SOM has also been used in oceanography to explore and analyse SST and sea
surface height (Leloup et al. 2007; Telszewski et al. 2009; Iskandar 2009), ocean
circulation (e.g. Meza-Padilla et al. 2019) and ocean current forecasting (Vilibić
et al. 2016), see also the textbook of Thomson and Emery (2014, chapter 4). An
interesting feature revealed by SOM in ocean circulation in the Gulf of Mexico
(Meza-Padilla 2016) is the existence of areas of loop current eddies dominating
the circulation compared to local regimes at the upper slope. Vilibić et al. (2016)
found that a SOM-based forecasting system of ocean surface currents was
slightly better than the operational ROMS-derived forecasting system, particularly
during periods of strong wind conditions. Altogether, this shows that SOM, and
machine learning in general, has potential for improving ocean surface current
forecast.
Figure 17.17 illustrates the SOM algorithm application to weather and climate
fields. The two-dimensional (gridded) field data are transformed into an n × d data
matrix, with n and d being the sample size and the number of grid points (or the
number of variables), respectively. Each observation xt (t = 1, . . . n) is then used
to update the weights of SOM following Eqs. (17.15–17.18). The obtained weight
vectors of the SOM lattice are then transformed to physical space to yield the SOM
patterns (Fig. 17.17). To illustrate how SOM organises patterns, large scale flow
based on SLP is a convenient example to discuss. Johnson et al. (2008), for example,
Fig. 17.17 Schematic illustration of the different steps used by SOM in meteorological application
(Source: Modified from Liu et al. (2006))
examined the SOM continuum of SLP over the winter (Dec–Mar) NH using NCEP-
NCAR reanalysis. Figure 17.18 shows an example of NH SLP 4 × 5 SOM maps
obtained from Johnson et al. (2008), based on daily winter NCEP-NCAR SLP
reanalysis over the period 1958–2005. By construction (a small number of SOM
patterns), the figure shows large scale and low-frequency patterns. One of the main
features of Fig. 17.18 is the emergence of familiar teleconnection patterns, e.g.
−NAO (bottom left) and +NAO (bottom right). The occurrence frequency of those
patterns is used as a measure of climate change signal reflected by the NAO shift.
The SOM analysis also shows that interdecadal SLP variability can be understood
in terms of changes in the frequency of occurrence of the teleconnection patterns.
The SOM analysis of Johnson et al. (2008) reveals a change from a westward-
displaced −NAO-like pattern to an eastward-displaced +NAO-like pattern. More
examples and references of SOM application to synoptic climatology and large scale
phenomena can be found in Hannachi et al. (2017).
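A compact sketch of the workflow of Fig. 17.17 is given below: the gridded anomaly field is reshaped into the n × d data matrix, trained with the SOM sketch given earlier in this chapter (train_som is the illustrative helper defined in Sect. 17.3, not a library routine), and the weight vectors are mapped back to physical space as SOM patterns. Grid sizes and data are placeholders.

import numpy as np

# field: (n_time, n_lat, n_lon) gridded anomalies (placeholder random data here)
rng = np.random.default_rng(2)
field = rng.normal(size=(300, 20, 30))
n, nlat, nlon = field.shape
X = field.reshape(n, nlat * nlon)            # n x d data matrix (d = number of grid points)

W, r = train_som(X, grid=(4, 5))             # train_som: the sketch from Sect. 17.3 (assumed in scope)
som_patterns = W.reshape(4, 5, nlat, nlon)   # SOM weight vectors mapped back to physical space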
Due to its local character and the proximity neighbourhood, SOM seems to offer
some advantages compared to classical methods such as PCA and k-means (Reusch
et al. 2005; Astel et al. 2007; Lin and Chen 2006; Solidoro et al. 2007). Reusch et al.
(2005) compared the performance of SOM and PCA using synthetic climatological
data with and without noise contamination. They conclude that SOM is more robust
than PCA. For instance, SOM is able to isolate the predefined patterns with the
correct explained variance. On the other hand, PCA fails to identify the patterns due
to mixing. This conclusion is shared with other researchers (Liu and Weisberg 2007;
Annas et al. 2007; Astel et al. 2007). Liu and Weisberg (2007) compared SOM and
EOFs in capturing ocean current patterns using velocity fields from a moored ADCP
array. They found that SOM was readily more accurate to reveal, for example,
Fig. 17.18 Illustration of 4 × 5 SOM maps of daily winter (Dec–Mar) SLP field from NCEP-
NCAR reanalysis for the period 1958–2005. Positive and negative values are shown by continuous
and dashed lines, respectively. Percentages of occurrence of the patterns are shown in the bottom
right corner for the whole period and in the top right corners for the periods 1958–1977 (top),
1978–1997 (middle) and 1998–2005 (bottom). Contour interval: 2 hPa. Adapted from Johnson et
al. (2008). ©American Meteorological Society. Used with permission
SOM transforms, using a topology-preserving projection, the data from its orig-
inal (usually high-dimensional) state space onto a low-dimensional (usually two-
dimensional) space, i.e. the SOM map. This SOM map, represented as an ordered
grid, contains prototype vectors representing the data (e.g. Vesanto and Alhoniemi
2000). This map can then be used to construct, for example clusters (Fig. 17.19).
This procedure of clustering the SOM prototypes rather than the original data is referred
to as the two-level approach (e.g. Vesanto and Alhoniemi 2000).
SOM therefore provides a sensible tool to classify, for example, rainfall regimes
at a given location. Figure 17.21 (top left) shows the obtained SOM map of rainfall
events in northern Tunisia. Note that each neuron, or prototype vector (individual
hexagon), contains a number of observations in its neighbourhood. These prototype
vectors are then agglomerated to obtain clusters. One particularly interesting method
to obtain the number of clusters is to use the data image method (Minnotte and West
1999). The data image is a powerful visualisation tool showing the dissimilarity
matrix as an image where each pixel shows the distance between two observations
(Martinez et al. 2010). Several variants of this image can be incorporated. Precisely,
rows and columns of the dissimilarity matrix can be reordered, for example, based
on some clustering algorithm, such as hierarchical clustering, allowing clusters to
emerge as blocks along the main diagonal. An example of application of data image
to the stratosphere can be found in Hannachi et al. (2011).
In the hierarchical clustering algorithm a collection of fully nested sets is obtained.
The smallest sets are the clusters given by the individual elements of the dataset,
whereas the largest set is the whole dataset. Starting, for example, from
the individual data elements as clusters, the algorithm then proceeds by successively
merging the closest clusters, based on a chosen similarity measure, until we are left with
only one single cluster. This can be achieved using a linkage algorithm, such as
single or complete linkages (e.g. Gordon 1999; Hastie et al. 2009). The result of the
hierarchical clustering is presented in the form of a tree-like graph or dendrogram.
This dendrogram is composed of branches linking the whole cluster to the individual
elements. Cutting through the dendrogram at a specific level yields specific number
Fig. 17.20 Two-dimensional scatter plot of two Gaussian clusters, with sample size of 50 each
(top left), dendrogram (top right), data matrix showing the interpoint distance between any two data
points (bottom left) and data matrix when the data are reordered so that the top left and bottom
right blocks represent, respectively, the first and the second clusters. Adapted from Hannachi et al.
(2011). ©American Meteorological Society. Used with permission
of clusters. Figure 17.20 shows an illustrative example of two clusters (Fig. 17.20,
top left) and the associated dendrogram (Fig. 17.20, top right). The interpoint
distance matrix between data points is shown as a data matrix in Fig. 17.20 (bottom
left). Dark and light contrasts represent, respectively, small and large distances. The
dark diagonal line represents the zero value. The figure shows scattered dark- and
light-coloured areas. When the lines and columns of the interpoint distance matrix
are reordered following the two clusters obtained from the dendrogram, see the
vertical dashed line in Fig. 17.20 (top right), the data matrix (Fig. 17.20, bottom
right) now shows two dark blocks along the main diagonal, with light contrast in the
background.
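A SciPy-based sketch of the dendrogram and data-image construction illustrated in Fig. 17.20 is given below (synthetic two-cluster data; the library calls are assumptions, not the code used for the figure).

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])  # two Gaussian clusters
D = squareform(pdist(X))                         # interpoint distance matrix (the data image)
Z = linkage(pdist(X), method="complete")         # hierarchical clustering (complete linkage)
order = dendrogram(Z, no_plot=True)["leaves"]    # leaf order from the dendrogram
D_ordered = D[np.ix_(order, order)]              # reordered image: dark blocks along the diagonal
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters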
The application of the two-level approach, i.e. SOM + clustering, is shown in
Fig. 17.21. The bottom left panel of Fig. 17.21 shows the data matrix obtained
from the interpoint distances between the SOM prototypes (SOM map). Like the
example above, dark and light contrasts reflect, respectively, a state of proximity
and remoteness of the SOM prototype vectors. Those prototypes that are close
to each other can be agglomerated by the clustering algorithm. Figure 17.21
shows the image when the SOM (prototype) data are reordered based on two
Fig. 17.21 SOM map with hexagonal grid of the rainfall events (top left), SOM map with three
clusters on the SOM map (top right), data image of the SOM prototype vectors (bottom left),
data image with two (bottom centre) and three (bottom right) clusters. The numbers in the SOM
map represent the numbers of observations within a neighbourhood of the neurons (or prototype
vectors). Courtesy of Sabrine Derouiche
(Fig. 17.21, bottom centre) and three (Fig. 17.21, bottom right) clusters obtained
from a dendrogram or clustering tree of hierarchical clustering. The contrast
between the diagonal blocks and the background is stronger in the three-cluster case
(Fig. 17.21, bottom right), compared to the two-cluster case (Fig. 17.21, bottom
centre), suggesting three clusters, which are shown in the SOM map (Fig. 17.21,
top right). These clusters are found to represent three rainfall regimes in the studied
area, namely wet, dry and semi-dry.
Random forest (RF) is quite new to the field of weather/climate. It has been applied
recently to weather prediction (Karthick et al. 2018), temperature downscaling
(Pang et al. 2017) and a few other related fields such as agriculture, e.g. crop yield
(Jeong et al. 2016), greenhouse soil temperature (Tsai et al. 2020) and forest fire (Su
et al. 2018). In weather prediction, for example, Karthick et al. (2018) compared
a few techniques and found that RF was the best, with around 87% accuracy, its only
disadvantage being overfitting. Temperature downscaling using RF was
performed by Pang et al. (2017) in the Pearl river basin in southern China. The
method was compared to three other methods, namely artificial NN, multilinear
regression and support vector machines. The authors found, based on five different
criteria, that RF outperforms all the other methods. For example, RF could identify
the best predictor combination compared to all the other methods. In crop yield
production, Jeong et al. (2016) used RF to predict three types of crops in response
to climate and biophysical variables and compared it to multiple linear regression as
a benchmark. RF was found to outperform the multilinear regression. For example,
the root mean square error was in the range of 6−14% compared to 14−49% for
multiple linear regression. Though this suggests that RF is an effective and versatile
tool for crop yield prediction, the authors also caution that it may result in a loss
of accuracy when applied to extremes or responses beyond the boundaries of the
training set, a weakness that characterises machine learning approaches in general.
Appendix A
Smoothing Techniques
Consider noisy observations of the form
y = f(x) + ε,                                                   (A.1)
for which the objective is to estimate f (.). It is of course easier if we knew the
general form of the function f (.). In practice, however, this information is very
seldom available. Spline smoothing considers f(.) to be a polynomial. One of the
most familiar polynomial smoothers is the cubic spline, which corresponds to the
case of a piece-wise cubic function, i.e.
d^α f_i(x_i)/dx^α = d^α f_{i−1}(x_i)/dx^α,                      (A.3)
for α = 0, 1, and 2. The constraints given by Eq. (A.2) and Eq. (A.3) lead to a
smooth function. However, the problem is not closed, and we need extra conditions.
The problem is normally simplified by minimising the quantity Σ_{i=1}^{n} (y_i − f(x_i))^2
with a smoothness condition that takes the form of an integral of the second
derivative squared. The functional to be minimised is
F = Σ_{k=1}^{n} [y_k − f(x_k)]^2 + λ ∫ (d^2 f(x)/dx^2)^2 dx.    (A.4)
The first part of Eq. (A.4) is a measure of the goodness of fit, whereas the second
part provides a measure of the overall smoothness. In the theory of elastic rods the
latter term is proportional to the energy of the rod when it is bent under constraints.
Note that the functional F (.), Eq. (A.4), can be extended to two dimensions and
the final surface will behave like a smooth plate. The function F in Eq. (A.4) is
known as the penalised residual sum of squares. Also, in Eq. (A.4) λ represents the
smoothing parameter and controls the relative weight given to the roughness penalty
and goodness of fit. It controls therefore the balance between goodness of fit and
smoothness. For example, the larger the parameter λ the smoother the function f .
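As a hedged illustration, SciPy's UnivariateSpline implements a cubic smoothing spline; its smoothing factor s plays a role analogous to λ in Eq. (A.4) (larger s gives a smoother fit, and s = 0 interpolates the data). The data below are synthetic.

import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.2 * np.random.default_rng(0).normal(size=x.size)   # noisy observations
spl_smooth = UnivariateSpline(x, y, k=3, s=5.0)   # cubic smoothing spline (penalised fit)
spl_interp = UnivariateSpline(x, y, k=3, s=0.0)   # s = 0: interpolates the data exactly
y_hat = spl_smooth(x)                             # smoothed curve evaluated at x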
Remark Note that if ε = 0 in Eq. (A.1) the spline simply interpolates the data. This
means that the spline solves Eq. (A.4) with λ → 0, which is equivalent to
min ∫ (f''(x))^2 dx   subject to   y_k = f(x_k), k = 1, . . . , n.   (A.5)
(A.6)
where k is a fixed positive integer. The obtained solution is known as the thin-plate
spline. The solution is generally obtained as a linear combination of the monomials
of degree less than k and a set of n radial basis functions (Wahba 1990).
The minimisation of Eq. (A.4), when λ is known, yields the cubic spline. The
determination of λ, however, is more important since it controls the smoothness of
the fitting. One way to obtain an appropriate estimate of λ is to use cross-validation.
The idea behind cross-validation is to have estimates that minimise the effect of
omitted observations. If f_{λ,k}(x) is an estimate of the spline with parameter λ when
the kth observation is omitted, the mis-fit at the point x_k is given by (y_k − f_{λ,k}(x_k))^2.
The best value of λ is the one that minimises the cross-validation criterion
Σ_{k=1}^{n} w_k (y_k − f_{λ,k}(x_k))^2,
Originally, splines were used as a way to smoothly interpolate the set of points
(xk , yk ), k = 1, . . . n, where x1 < x2 < . . . xn belong to some interval [a, b],
by means of piece-wise polynomials. The name spline was coined by Schoenberg
(1964), see also Wahba (2000). The function is obtained as a solution to a variational
problem such as¹
min (1/n) Σ_{i=1}^{n} (y_i − f(x_i))^2 + μ ∫_a^b (f^{(m)}(x))^2 dx   (A.7)
for some μ > 0, and over the set of functions with 2m − 2 continuous derivatives
over [a, b], C 2m−2 ([a, b]). The solution is a piece-wise polynomial of degree 2m−1
inside each interval [xi , xi+1 ], i = 1, . . . n − 1, and m − 1 inside the outer intervals
[a, x1 ] and [xn , b].
In general, the smoothing spline can be formulated as a regularisation problem
(Tikhonov 1963; Morozov 1984). Given a set of points x1 , . . . xn in Rd and n
numbers y1 , . . . yn , we seek a smooth function f (x) from Rd into R that best fits the
data (x1 , y1 ), . . . (xn , yn ). This problem is normally solved by seeking to minimise
the functional:
F(f) = Σ_{i=1}^{n} (y_i − f(x_i))^2 + μ ‖L(f)‖^2.               (A.8)
Note that the first part measures the mis-fit and the second part is a penalty
measuring the smoothness of the function f (.). The operator L is in general a
differential operator, and μ is the smoothing parameter, which controls the trade-
off between both attributes. By computing F (f + δf ) − F (f ), where δf is a small
“departure” from f , the stationary solutions of Eq. (A.8) can be shown to satisfy
μ L*L(f)(x) = Σ_{i=1}^{n} [y_i − f(x_i)] δ(x − x_i),            (A.9)
¹ The space over which Eq. (A.7) is defined is normally referred to as the Sobolev space of functions
defined over [a, b] with m − 1 absolutely continuous derivatives and satisfying ∫_a^b (f^{(m)}(x))^2 dx < ∞.
where G(x, y) is the Green’s function of the operator L∗ L (see Sect. A.3 below);
hence,
f(x) = (1/μ) Σ_{i=1}^{n} (y_i − f(x_i)) G(x, x_i) = Σ_{i=1}^{n} μ_i G(x, x_i).   (A.10)
Gμ = y, (A.11)
f(x) = Σ_{i=1}^{n} μ_i G(x, x_i) + p(x)
F(f) = Σ_{k=1}^{n} (y_k − f(x_k))^2 + λ ‖Lf‖^2                  (A.13)
for a fixed positive integer m. The function f (x) used in Eq. (A.12) or Eq. (A.13) is
of class C m , i.e. with (m − 1) continuous derivatives and the mth derivative satisfies
Lf (m) < ∞. The corresponding L∗ L operator is invariant under translation and
rotation; hence, the corresponding Green’s function G(x, y) is a radial function and
satisfies
whose solution, see e.g. Gelfand and Vilenkin (1964), is a thin-plate spline, i.e.
G(x) = ‖x‖^{2m−d} log ‖x‖   for 2m − d > 0 and d even,
G(x) = ‖x‖^{2m−d}           when d is odd.                      (A.15)
f(x) = Σ_{j=1}^{n} μ_j G(‖x − x_j‖) + p(x),                     (A.16)
yi = f (xi ) + εi
where f = (f (x1 ), . . . , f (xn ))T . In the case of thin-plate spline where L(f ) 2
is given by Eq. (A.12), the solution is similar to Eq. (A.16) and the parameters
μ = (μ1 , . . . , μl )T and λ = (λ1 , . . . , λn )T are the solution of a linear system of
the form:
[ G + nμW   P ] [ λ ]   [ y ]
[ P^T       O ] [ μ ] = [ 0 ],
So far the parameter μ was assumed to be fixed but unknown. One way to deal
with the problem would be to choose an arbitrary value based on experience. A
more concise way is to compute it from the data using an elegant procedure known
as cross-validation, see also Chap. 15. Suppose that one would like to solve the
problem given in Eq. (A.7) and would like to have an optimal estimate of μ. The
idea of cross-validation is to take one or more points out of the sample and find the
value of μ that minimises the mis-fit. Suppose in fact that xk was taken out. Then
the spline f_μ^{(k)}(.) that fits the remaining data minimises the functional:
(1/n) Σ_{i=1, i≠k}^{n} [y_i − f(x_i)]^2 + μ ∫_a^b (f^{(m)}(t))^2 dt.   (A.17)
The overall optimal value of μ is the one that minimises the overall mis-fit or cross-
validation:
cv(μ) = (1/n) Σ_{k=1}^{n} ( y_k − f_μ^{(k)}(x_k) )^2.           (A.18)
Let us designate by fμ (x) the spline function fitted to the whole sample for
a given μ. Let also A(μ) = (a_ij(μ)), i, j = 1, . . . , n, be the matrix relating
y = (y_1, . . . , y_n)^T to f_μ = (f_μ(x_1), . . . , f_μ(x_n))^T, i.e. satisfying A(μ)y = f_μ.
Then Craven and Wahba (1979) have shown that
y_k − f_μ^{(k)}(x_k) = ( y_k − f_μ(x_k) ) / ( 1 − a_kk(μ) ),
and therefore,
cv(μ) = (1/n) Σ_{k=1}^{n} [ ( y_k − f_μ(x_k) ) / ( 1 − a_kk(μ) ) ]^2.   (A.19)
A closely related criterion is the generalised cross-validation
cv(μ) = n ‖(I − A(μ)) y‖^2 / [ tr(I − A(μ)) ]^2.                (A.20)
Radial basis functions (RBFs) constitute one of the attractive tools to interpolate
and/or smooth scattered data. RBFs have been formally introduced and coined
by Powell (1987) in exact multivariate interpolation, although the technique had been
around before that, see e.g. Hardy (1971), Franke (1982).
Given n distinct points² x_i, i = 1, . . . , n, in R^d, and n real numbers f_i, i =
1, . . . , n, these numbers can be regarded as the values at x_i of a certain unknown
function f(x). The problem of RBF is to find a smooth real function s(x) satisfying
the interpolation conditions:
s(x_i) = f_i,   i = 1, . . . , n,                                (A.21)
s(x) = Σ_{k=1}^{n} λ_k φ_k(x) = Σ_{k=1}^{n} λ_k φ(‖x − x_k‖).   (A.22)
The functions φ_k(x) = φ(‖x − x_k‖) are known as radial basis functions. The real
function φ(r) is defined over the positive numbers, and ‖.‖ is a Euclidean norm
or Mahalanobis distance. Thus the radial basis function s(x) is a simple linear
combination of the shifted radial basis functions φ_k(x).
• φ(r) = (1 + r^2)^{−1}, which corresponds to the inverse quadratic.
Equations (A.21) and (A.22) lead to the following linear system:
Aλ = f, (A.23)
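A minimal NumPy sketch of Eqs. (A.21)–(A.23) is given below: build the matrix A with entries φ(‖x_i − x_j‖), solve Aλ = f, and evaluate s(x) at new points. The Gaussian radial function, its width and the data are assumptions for illustration only.

import numpy as np

def rbf_interpolate(xs, fs, phi, x_new):
    """Exact RBF interpolation: solve A lambda = f with A_ij = phi(||x_i - x_j||)."""
    A = phi(np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1))
    lam = np.linalg.solve(A, fs)                                   # interpolation weights
    d = np.linalg.norm(x_new[:, None, :] - xs[None, :, :], axis=-1)
    return phi(d) @ lam                                            # s(x) = sum_k lambda_k phi(||x - x_k||)

phi = lambda r: np.exp(-(r / 0.5) ** 2)                            # assumed Gaussian radial function
xs = np.random.default_rng(1).uniform(0, 1, (30, 2))               # scattered points in R^2
fs = np.sin(2 * np.pi * xs[:, 0]) * np.cos(2 * np.pi * xs[:, 1])   # values to interpolate
print(rbf_interpolate(xs, fs, phi, np.array([[0.3, 0.7]])))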
s(x) = p_m(x) + Σ_{k=1}^{n} λ_k φ(‖x − x_k‖),                   (A.24)
Σ_{j=1}^{n} λ_j p(x_j) = 0                                       (A.25)
for all polynomials p(x) of degree at most m. Apart from introducing more
equations, the system of Eq. (A.25) can be used to measure the smoothness of the
RBF (Powell 1990). It also controls the rate of growth at infinity of the non-polynomial
part of s(x) in Eq. (A.24) (Beatson et al. 1999). If (p_1, . . . , p_l), with
l = (m + d)!/(m! d!), i.e. the binomial coefficient C(m + d, d), is a basis of the space of
algebraic polynomials of degree less than or equal to m in R^d, then Eq. (A.25) becomes
³ For example, in the N-body problem, the gravitational potential at a point y takes the form
φ(y) = Σ_{i=1}^{N} α_i / ‖x_i − y‖.
Σ_{j=1}^{n} λ_j p_k(x_j) = 0,   k = 1, . . . , l.                (A.26)
Note also that p_m(x) in Eq. (A.24) can be substituted by the combination
Σ_{k=1}^{l} μ_k p_k(x), which, when combined with Eq. (A.26), yields the following
system:
[ A    P ] [ λ ]   [ f ]
[ P^T  O ] [ μ ] = [ 0 ],                                        (A.27)
nonsingular for the unidimensional case.⁴ Powell (1987) also gives further examples
of nonsingularity such as φ(r) = (r^2 + 1)^{−β}, (β > 0).
Considering now the extended interpolation of Eq. (A.24), the matrix in
Eq. (A.27) is nonsingular only if the columns of P are linearly independent.
Michelli (1986) gives sufficient conditions for the invertibility of the system of
Eq. (A.27) based on conditional positivity,5 i.e. when φ(r) is conditionally strictly
⁴ In this case the matrix is nonsingular for all φ(r) = r^{2α+1} (α a positive integer) and the
interpolation function s(x) = Σ_{i=1}^{n} λ_i |x − x_i|^{2α+1} is a spline function.
⁵ A real function φ(r) defined on the set of non-negative real numbers is conditionally (strictly)
positive definite of order m + 1 if for any distinct points x_1, . . . , x_n and scalars λ_1, . . . , λ_n
satisfying Eq. (A.26) the quadratic form Σ_{ij} λ_i φ(‖x_i − x_j‖) λ_j
is non-negative (positive). The “conditionally” in the definition refers to Eq. (A.26). The set of
conditionally positive definite functions of order m has been characterised by Michelli (1986). If a
continuous function φ(.) defined on the set of non-negative real numbers is such that (−1)^k d^k φ(r)/dr^k
is completely monotonic, then φ(r^2) is conditionally positive definite of order k. Note that if
(−1)^k d^k φ(r)/dr^k ≥ 0 for all positive integers k, then φ(r) is said to be completely monotonic. The
following important result is also given in Michelli (1986). If the continuous and positive function
positive. The previous two sufficient conditions allow for various choices of radial
basis functions, such as (r^2 + a^2)^{−α} for α > 0, and (r^2 + a^2)^{β} for 0 < β < 1. For
instance, the functions φ_1(r) = r^{3/2} and φ_2(r) = (1/2) r log r have their second derivatives
completely monotonic for r > 0. The functions φ_1(r^2) = r^3 and φ_2(r^2) = r^2 log r
can also be used as RBFs. The latter case corresponds to the thin-plate spline.
Another case of nonsingularity was provided by Powell (1987) and corresponds to
φ(r) = ∫_0^∞ e^{−x r^2} ψ(x) dx, where ψ(x) is non-negative with ∫_a^b ψ(t) dt > 0 for
some 0 ≤ a < b; see also Beatson et al. (1999) for fast fitting and evaluation of RBF.
φ(r) defined on the set of non-negative numbers has its first derivative completely monotonic (not
constant), then for any distinct points x_1, . . . , x_n, (−1)^{n−1} det[ φ(‖x_i − x_j‖^2) ] > 0.
      ( 1   x_1   y_1 )
      ( 1   x_2   y_2 )
P =   ( .    .     .  )
      ( 1   x_n   y_n ).
Note that thin-plate (or biharmonic) splines serve to model the deformation of an
infinite thin plate (Bookstein 1989) and are C^1 functions that minimise the energy
E(s) = ∫_{R^2} [ (∂^2 s/∂x^2)^2 + 2 (∂^2 s/∂x∂y)^2 + (∂^2 s/∂y^2)^2 ] dx dy.
In the previous section the emphasis was on exact interpolation where the fitted
function goes exactly through the data (xi , fi ). This corresponds to the case when
the data are noise-free. If the data are contaminated with noise, then instead of the
condition given by Eq. (A.21) we seek a function s(x) that minimises the functional
(1/n) Σ_{i=1}^{n} [ s(x_i) − f_i ]^2 + ρ ‖s‖^2,                  (A.28)
s(x) = p(x) + Σ_{i=1}^{n} λ_i ‖x − x_i‖,
In many problems in mathematical physics, one seeks to solve the following PDE:
Lu = f, (A.29)
where G(x, y) is the Green's function of the operator L, i.e. the solution of L G(x, y) =
δ(x − y) (A.30), with δ the Dirac (or impulse) function. The solution to Eq. (A.29) is then
given by the following convolution:
u(x) = ∫_D f(y) G(x, y) dy + p(x),                               (A.31)
where p(x) is a solution to the homogeneous equation Lp = 0. Note that Eq. (A.31)
is to be compared with Eq. (A.24). In fact, if there is an operator L satisfying Lφ(x−
xi ) = δ(x − xi ) and also Lpm (x) = 0, then clearly the RBF φ(r) is the Green’s
function of the differential operator L and the radial basis function s(x) given by
Eq. (A.24) is the solution to
Lu = Σ_{k=1}^{n} λ_k δ(x − x_k).
As such, it is possible to use the PDE solver to solve an interpolation problem (see
e.g. Press et al. (1992)). In general, given φ, the operator L can be determined using
filtering techniques from time series.
RBF interpolation is also related to kriging. For example, when pm (x) = 0 in
Eq. (A.24), then the equations are similar to kriging and where φ plays the role of
an (isotropic) covariance function. The relation to splines has also been outlined.
For example, when the radial function φ(.) is cubic or thin-plate spline, then we
have a spline interpolant function. In this case the function minimises the bending
energy of an infinite thin plate in two dimensions, see Poggio and Girosi (1990)
for a review. For instance, if φ(r) = r^{2m+1} (m a positive integer), then the function
s(x) = Σ_{i=1}^{n} λ_i |x − x_i|^{2m+1} + Σ_{k=1}^{m} μ_k x^k is a natural spline.
6 The Green’s function G depends only on the operator L and has various properties. For example, if
L is self-adjoint, then G is symmetric. If L is invariant under translation, then G(x, y) = G(x − y),
and if L is also invariant under rotation, then G is a radial function, i.e. G(x, y) = G( x − y ).
ŷ_i = Σ_{j=1}^{n} K_ij y_j,                                      (A.32)
Clearly the weights are non-negative and add up to one, i.e. K_ij ≥ 0 and, for each
i, Σ_{j=1}^{n} K_ij = 1. The kernel function K(.) satisfies the following properties:
(1) K(t) ≥ 0 for all t.
(2) ∫_{−∞}^{∞} K(t) dt = 1.
(3) K(−t) = K(t) for all t.
Hence, K(.) is typically a symmetric probability density function. Note that the
parameter b gives a measure of the size of the neighbourhood in the averaging
process around each target point xi . Basically, the parameter b controls the “width”
of the function K(x/b). In the limit b → 0, we get a Dirac function δ_0. In this case
the smoothed function is identical to the original scatter, i.e. ŷ_i = y_i. On the other
hand, in the limit b → ∞ we get a uniform weight function, and the smoothed curve
reduces to the mean, i.e. ŷ_i = ȳ = (1/n) Σ_j y_j. A familiar example of kernels is given
by the Gaussian PDF:
K(x) = (1/√(2π)) exp(−x^2/2).
There are several other kernels used in the literature. The following are examples:
• Box kernel K(x) = 1_{[−1/2, 1/2]}(x).
• Triangle kernel K(x) = (1 − a|x|) 1_{[−1/a, 1/a]}(x) (for some a > 0).
• Parzen kernel:
  K(x) = 1 − 6(x/M)^2 + 6(|x|/M)^3   for |x| ≤ M/2,
  K(x) = 2(1 − |x|/M)^3              for M/2 ≤ |x| ≤ M,
  K(x) = 0                           for |x| > M.
These kernels can also extend easily to the multidimensional case.
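The weights in Eq. (A.32) can be realised, for example, with the Gaussian kernel; the sketch below is an illustration only (the bandwidth b and the data are assumed values), with each row normalised so that the weights sum to one.

import numpy as np

def kernel_smooth(x, y, b):
    """Kernel smoother of Eq. (A.32): y_hat_i = sum_j K_ij y_j with Gaussian weights."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / b) ** 2)   # Gaussian kernel K((x_i - x_j)/b)
    W = K / K.sum(axis=1, keepdims=True)                      # rows sum to one
    return W @ y

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * np.random.default_rng(2).normal(size=x.size)
y_hat = kernel_smooth(x, y, b=0.5)                            # smoothed curve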
Appendix B
Introduction to Probability and Random
Variables
B.1 Background
p = lim_{n→∞} k/n,
where k is the number of times that event occurred in n trials.
(ii) The subjective or “Bayesian” school, which holds that the probability of an
event is a subjective or personal judgement of the likelihood of that event.
This interpretation goes back to Thomas Bayes (1763) and Pierre Simon
Laplace in the eighteenth and early nineteenth centuries (see Laplace 1951). This trend argues
that randomness is not an objectively measurable phenomenon but rather a
“knowledge” phenomenon, i.e. they regard probability as an epistemological
rather than ontological concept.
Besides these two main schools, there is another one: the classical school, which
interprets probability based on the concept of equally likely outcomes. According
to this interpretation, when performing a random experiment, one can assign the
same probability to events that are equally likely. This interpretation can be useful in
practice, although it has a few difficulties such as how to define equally likely events
before even computing their probabilities, and also how to define probabilities of
events that are not equally likely.
Operations on Subsets
Given a set S and subsets A, B, and C, one can perform the following operations:
• Union—The union of A and B, noted A ∪ B, is the subset containing the elements
from A or B. It is clear that A ∪ Ø = A, and A ∪ S = S. Also, if A ⊂ B, then
A ∪ B = B. This definition can be extended to an infinite sequence of subsets
A_1, A_2, . . . to yield ∪_{k=1}^{∞} A_k.
• Intersection—The intersection of two subsets A and B, noted as A ∩ B, is the
set containing only common elements to A and B. If no common elements exist,
then A∩B = Ø, and the two subsets are said to be mutually exclusive or disjoint.
It can be seen that A ∩ S = A and that if D ⊂ B then D ∩ B = D. The definition
also extends to an infinite sequence of subsets.
• Complements—The complement of A, noted as Ac , is the subset of elements that
are not in A. One has (Ac )c = A; S c = Ø; A ∪ Ac = S and A ∩ Ac = Ø.
Definition/Axioms of Probability
Properties of Probability
• Direct consequences
(1) P r(Ø) = 0.
(2) P r(Ac ) = 1 − P r(A).
(3) P r(A ∪ B) = P r(A) + P r(B) − P r(A ∩ B).
(4) If A ⊂ B, then P r(A) ≤ P r(B).
(5) If A and B are exclusive, then P r(A ∩ B) = 0.
Exercise Derive the above properties.
Exercise Compute P r(A ∪ B ∪ C).
Answer P r(A) + P r(B) + P r(C) − P r(A ∩ B) − P r(A ∩ C) − P r(B ∩ C) +
P r(A ∩ B ∩ C).
• Conditional Probability
Given two events A and B, with Pr(B) > 0, the conditional probability of A
given B, denoted by Pr(A|B), is defined by Pr(A|B) = Pr(A ∩ B)/Pr(B). For a partition
B_1, . . . , B_n of the sample space, Bayes' theorem then yields
Pr(B_i|A) = Pr(B_i) Pr(A|B_i) / Σ_{j=1}^{n} Pr(B_j) Pr(A|B_j).
f(x) = Pr(X = x)
is the probability function of X. One immediately sees that Σ_{j=1}^{k} f(x_j) =
Σ_{j=1}^{k} p_j = 1. The function F(x) defined by
F(x) = Pr(X ≤ x) = Σ_{x_i ≤ x} f(x_i)
is the cumulative distribution function (cdf) of X. The cdf of a discrete random vari-
able is a piece-wise constant function between 0 and 1. Various other characteristics
can be defined from X, which are included in the continuous case discussed below.
A function f(x) satisfying Pr(a ≤ X ≤ b) = ∫_a^b f(x) dx
for any interval [a, b] in I is the probability density function (pdf) of X. Hence
the quantity f (x)dx represents the probability of the event x ≤ X ≤ x + dx, i.e.
P r(x ≤ X ≤ x + dx) = f (x)dx. The pdf satisfies the following properties:
(1) f(x) ≥ 0 for all x.
(2) ∫_{−∞}^{∞} f(x) dx = 1.
The cumulative distribution function of X is given by
F(x) = ∫_{−∞}^{x} f(u) du.
Let X be a continuous random variable with pdf f () and cdf F (). The quantity:
E(X) = ∫ x f(x) dx
Cumulants
The centred moments are defined with respect to the centred random variable X −
E(X). The characteristic function is given by
φ(s) = E(e^{isX}) = ∫ e^{isx} f(x) dx,
We have, in particular, μ_m = i^{−m} d^m φ(s)/ds^m |_{s=0}. The cumulant of order m of X, κ_m, is
given by
κ_m = i^{−m} d^m log(φ(s))/ds^m |_{s=0}.
For example, the third-order moment is the skewness, which provides a measure of
the symmetry of the pdf (with respect to the mean when centred moment is used),
and κ_3 = μ_3 − 3μ_2 μ_1 + 2μ_1^3. For the fourth-order cumulant, also called the kurtosis
of the distribution, κ_4 = μ_4 − 4μ_3 μ_1 − 3μ_2^2 + 12μ_1^2 μ_2 − 6μ_1^4. Note that for zero-
mean distributions κ_4 = μ_4 − 3μ_2^2. A distribution with zero kurtosis is known as
mesokurtic, like the normal distribution.
A distribution with positive kurtosis is known as super-Gaussian or leptokurtic. This
distribution is characterised by a higher maximum and heavier tail than the normal
distribution with the same variance. A distribution with negative kurtosis is known
as sub-Gaussian or platykurtic and has a lower peak and lighter tails than the normal
distribution with the same variance.
Let X and Y be two random variables over a sample space S with respective pdfs
fX () and fY (). For any x and y, the function f (x, y) defined by
Pr(X ≤ x; Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) du dv
is the joint probability density function. The definition can be extended in a similar
fashion to p random variables X_1, . . . , X_p. The vector x = (X_1, . . . , X_p)^T is called
a random vector, and its pdf is given by the joint pdf f (x) of these random variables.
Two random variables X and Y are said to be independent if f(x, y) = f_X(x) f_Y(y)
for all x and y. The pdfs f_X() and f_Y() and associated cdfs F_X() and F_Y() are called
marginal pdfs and marginal cdfs of X and Y , respectively. The marginal pdfs and
cdfs are linked to the joint cdf via
ρ_{X,Y} = cov(X, Y) / √( var(X) var(Y) )
f(x_1, . . . , x_p) = ∂^p F(x_1, . . . , x_p) / (∂x_1 . . . ∂x_p).
Like the bivariate case, p random variables X_1, . . . , X_p are independent if the joint
cdf F() can be factorised into a product of the marginal cdfs as F(x_1, . . . , x_p) =
F_{X_1}(x_1) . . . F_{X_p}(x_p), and similarly for the joint pdf. Also, we have f_{X_1}(x_1) =
∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} f(x) dx_2 . . . dx_p, and similarly for the remaining marginal pdfs.
Let x = (X_1, . . . , X_p)^T be a random vector with pdf f() and cdf F(). The
expectation of a function g(x) is defined by
E[g(x)] = ∫ g(x) f(x) dx.
The mean μ of x is obtained when g() is the identity, i.e. μ = ∫ x f(x) dx. Assuming
the random variables X_1, . . . , X_p have finite variance, the covariance matrix Σ_xx
of x is given by
Σ_xx = E[(x − μ)(x − μ)^T] = E[x x^T] − μ μ^T,
Let x and y be two random vectors over some state space with joint pdf f_{x,y}(.). The
conditional probability density of y given x = x is given by f_{y|x}(y|x) = f_{x,y}(x, y)/f_x(x)
when f_x(x) ≠ 0; otherwise, one takes f_{y|x}(x, y) = f_{x,y}(x, y). Using this
conditional pdf, one can obtain the expectation of any function h(y) given x = x,
i.e.
E( h(y) | x = x ) = ∫_{−∞}^{∞} h(y) f_{y|x}(y|x) dy,
A Bernoulli random variable X takes only two values, 0 and 1, i.e. X has two
outcomes: success or failure (true or false) with respective probabilities P r(X =
1) = p and P r(X = 0) = q = 1 − p. The pdf of this distribution can be written as
f (x) = px (1 − p)1−x , where x is either 0 or 1. A familiar example of a Bernoulli
trial is the tossing of a coin.
Binomial Distribution
where nj = j !(n−j n!
)! . Given a Bernoulli trial with probability of success p, a
Binomial trial B(n, p) consists of n repeated and independent Bernoulli trials.
Formally, if X1 , . . . , Xn are independent and identically
distributed (IID) Bernoulli
random variables with probability of success p, then nk=1 Xk follows a binomial
distribution B(n, p). A typical example consists of tossing a coin n times, and the
number of heads is a binomial random variable.
Exercise Let X ∼ B(n, p), show that μ = E(X) = np, and σ 2 = var(X) =
np(1 − p). Show that the characteristic function φ(t) = E(eiXt ) is (peit + q)n .
for j = r, r + 1, . . .. If we are interested in the first success, i.e. r = 1, one gets the
geometric distribution.
Exercise Show that the mean and variance of the negative binomial distribution are,
respectively, μ = r/p and σ 2 = r(1 − p)/p2 .
Poisson Distribution
A Poisson random variable with parameter λ > 0 can take all the integer numbers
and satisfies
λk −λ
P r(X = k) = k! e k = 0, 1, . . . .
A continuous uniform random variable over the interval [a, b] has the following pdf:
B Introduction to Probability and Random Variables 477
1
f (x) = 1[a,b] (x),
b−a
where 1I () is the indicator of I , i.e. with a value of one inside the interval and zero
elsewhere.
Exercise Show that for a uniform random variable X over [a, b], E(X) = (a+b)/2
and var(X) = (a − b)2 /12.
The normal (or Gaussian) distribution, N(μ, σ 2 ), has the following pdf:
1 (x − μ)2
f (x) = √ exp − .
σ 2π 2σ 2
Exercise Show that for the above normal distribution E(X) = μ and var(X) = σ 2 .
For a normal distribution, the random variable X−μ
σ has zero mean and unit variance
and is referred
x to as the standard normal. The cdf of X is generally noted as
(x) = −∞ f (u)du and is known as the error function. The normal distribution
is very useful and can be reached using a number of ways. For example, if Y is
Y −np
binomial B(n, p), Y ∼ B(n, p), then √np(1−p) approximates the standard normal
for large np. The same result holds for Y√−λ when Y follows a Poisson distribution
λ
with parameter λ. This result constitutes a particular case of a more general result,
namely the central limit theorem (see e.g. DeGroot and Shervish 2002, p. 282).
The Central Limit Theorem
Let X1 , . . . , Xn be a sequence of n IID random variables with mean μ and variance
0 < σ 2 < ∞ each, then for every number x
Xn − μ
lim P r √ ≤ x = (x),
n→∞ σ/ n
where () is the standard normal cdf, and Xn = n1 nk=1 Xk . The theorem says that
the (properly scaled) sum of a sequence of independent random variables with same
mean and (finite) variance is approximately normal.
λe−λx if x≥0
f (x) =
0 otherwise.
The pdf of the gamma distribution with parameters λ > 0 and β > 0 is given by
,
λβ β−1 −βx
(β) x e if x>0
f (x) =
0 otherwise,
∞
where (y) = 0 e−t t y−1 dt, for y > 0. If the parameter β < 0, the distribution is
known as Erlang distribution.
Exercise Show that for the above gamma distribution E(X) = β/λ, and var(X) =
β/λ2 . Show that φ(t) = (1 − it/λ)−β .
The chi-square random variable with n degrees of freedom (dof), noted as χn2 , has
the following pdf:
,
2−n/2 n2 −1 −x/2
(n/2) x e if x>0
f (x) =
0 otherwise.
− n+1
n−1/2 ( n+1
2 ) x2 2
f (x) = 1+ .
(1/2)(n/2) n
B Introduction to Probability and Random Variables 479
The Fisher–Snedecor random variable with n and m dof, Fn,m , has the following
pdf:
,
( mn )n/2 ( n+m
2 )
n n+m
nx − 2
f (x) = (n/2)(m/2) x 2 −1 1 + m if x>0
0 otherwise.
2m2 (n+m−2)
Exercise Show that E(Fn,m ) = m
m−2 and var(Fm,n ) = 4(m−2)(m−4) .
X/n
If X ∼ χn2 and Y ∼ χm2 are independent, then Fn,m = Y /m follows a Fisher–
Snedecor distribution with n and m dof.
A multinormally distributed random vector x, noted as x ∼ Np (μ, ), has the pdf
1 1
f (x) = p exp − (x − μ)T −1 (x − μ) ,
1
(2π ) ||
2 2
2
where μ and are, respectively, the mean and the covariance matrix of x. The
characteristic function of this distribution is φ(t) = exp iμT t − 12 tT t . The
multivariate normal distribution is widely used and has some very useful properties
that are given below:
• Let A be a m × p matrix, and y = Ax, and then y ∼ Nm (Aμ, AAT ).
• If x ∼ Np (μ, ), and rank() = p, then
(x − μ)T −1 (x − μ) ∼ χp2 .
then
480 B Introduction to Probability and Random Variables
n
W= Xk XTk
k=1
The function γ (), defined on the integers, is the autocovariance function of the
stationary stochastic process. The function
B Introduction to Probability and Random Variables 481
γk
ρk =
γ0
1
n−k
γ̂k = (xi − x)(xi+k − x),
n
t=1
This appendix gives a brief introduction to stationary time series analysis for the
univariate and multivariate cases.
It is clear that the variance of the time series is simply σ 2 = γ (0). The
autocorrelation function ρ() is given by
γ (τ )
ρ(τ ) = . (C.3)
σ2
p
γ (τi − τj )ai aj ≥ 0, (C.4)
i,j =1
Let εt , t ≥ 1, be a sequence of IID random variables with zero mean and variance
σε2 . This sequence is called white noise. The autocovariance of such a process is
simply a Dirac pulse, i.e.
γε (τ ) = δτ ,
i.e. one at τ = 0, and zero elsewhere. Although the white noise process is the
simplest time series, it remains, however, hypothetical because it does not exist
in practice. Climate and other time series are autocorrelated. Simple linear time
series models have been formulated to explain this autocorrelation. The models
we are reviewing here have been formulated in the early 1970s and are known as
autoregressive moving average (ARMA) models (Box and Jenkins 1970; see also
Box et al. (1994)).
Given a time series (xt ) where t is either continuous or discrete, various operations
can be defined.
C Stationary Time Series Analysis 485
• Backward shift or delay operator B—This is defined for discrete time series by
More generally, for any integer m ≥ 1, B m xt = xt−m . By analogy, one can define
the inverse operator B −1 , which is the forward operator. It is clear from Eq. (C.5)
that for a constant c, Bc = c. Also for any integers m and n, B m B n = B m+n .
Furthermore, for any time series (xt ), t = 1, 2, . . ., we have
1
xt = (1 + αB + α 2 B 2 + . . .)xt = xt + αxt−1 + . . . (C.6)
1 − αB
dy(t)
Dy(t) = (C.8)
dt
whenever this differentiation is possible.
• Continuous shift operator—Another useful operator normally encountered in
filtering is the shift operator in continuous time series, B u , defined by
This operator is equivalent to the backward shift operator in discrete time series.
It can be shown that
B u = e−uD = e−u dt .
d
(C.10)
486 C Stationary Time Series Analysis
Fig. C.1 Examples of time series of AR(1) models with lag-1 autocorrelation 0.5 (a) and −0.5
(b)
ARMA Models
The white noise εt is only correlated with xs for s ≥ t. When p = 1, one gets
the familiar Markov or first-order autoregressive, AR(1), model also known as red
noise. Figure C.1 shows an example of generated time series of an AR(1) model with
opposite lag-1 autocorrelations. The red noise is a particularly simple model that is
frequently used in climate research and constitutes a reasonably good model for
many climate processes, see e.g. Hasselmann (1976, 1988), von Storch (1995a,b),
Penland and Sardeshmukh (1995), Hall and Manabe (1997), Feldstein (2000) and
Wunsch (2003) to name just a few.
• Moving average scheme: MA(q)
Moving average models of order q, MA(q), are defined by
q
xt = εt + φ1 εt−1 + . . . + φq εt−q = 1 + φk B k εt . (C.12)
k=1
C Stationary Time Series Analysis 487
It is possible to combine both the above models, AR(p) and MA(q), into just one
single model, the ARMA model.
• Autoregressive moving average scheme: ARMA(p, q)
It is given by
p
q
1− φk B k
xt = 1 + θk B k εt . (C.13)
k=1 k=1
φ(z) = 0 (C.14)
be outside the unit circle, see e.g. Box et al. (1994) for details.
Various ways exist to identify possible models for a given time series. For
example, the autocorrelation function of an ARMA model is a damped exponential
and/or sine waves that could be used as a guide to select models. Another useful
measure is the partial autocorrelation function. It exploits the fact that, for example,
for an AR(p) model the autocorrelation function can be entirely described by
the first p lagged autocorrelations whose behaviour is described by the partial
autocorrelation, which is a function that cuts off after lag p for the AR(p) model.
Alternatively, one can use concepts from information theory (Akaike 1969, 1974)
by fitting a whole range of models, computing the residual estimates ε̂ and their
variances (the mean squared errors) σ̂ 2 and then deriving, for example, the Akaike
information criterion (AIC) given by
2
AI C = log(σ̂ 2 ) + (P + 1), (C.15)
n
where P is the number of parameters to be estimated. The best model corresponds
to the smallest AIC.
Fig. C.2 Autocorrelation function of AR(1) models with lag-1 autocorrelations 0.5 (a) and
−0.5(b)
Using the symmetry of the autocovariance function, the power spectrum becomes
∞
σ2
f (ω) = 1+2 ρ(k)coskω . (C.17)
2π
k=1
Remark Similar to power spectrum, the bispectrum is the Fourier transform of the
bicovariance function, and is related to the skewness (e.g. Pires and Hannachi 2021)
Properties of the Power Spectrum
• f () is even, i.e. f (−ω) = f (ω).
• f (ω) ≥ 0 for all ω in [−π, π].
π π
• γ (τ ) = −π eiωτ f (ω)dω = −π cosτ ωf (ω)dω, i.e. the autocovariance function
is the inverse Fourier transform of the power spectrum. Note π that from the last
property, one gets, in particular, the familiar result σ 2 = −π f (ω)dω, i.e. the
power spectrum distributes the variance.
Examples
2
• The power spectrum of a white noise is constant, i.e. f (ω) = 2π
σ
.
• For a red noise time series (of zero mean), xt = αxt−1 + εt , the auto-
correlation function is ρ(τ ) = α |τ | , and its power spectrum is f (ω) =
σ2 2 −1 (Figs. C.2, C.3).
2π 1 − 2αcosω + α
Exercise Derive the relationship between the variance of xt and that of the
innovation εt in the red noise model.
Hint σ 2 = σε2 (1 − α 2 )−1 .
Exercise
Fig. C.3 Power spectra of two AR(1) models with lag-1 autocorrelation 0.5 and −0.5
p
yt = αk xt−k .
k=1
490 C Stationary Time Series Analysis
p
A(z) = αk zk ,
k=1
where the function a(ω) = A(eiω ) is the frequency response function, which is the
Fourier transform of the transfer function. Now, the power spectrum of yt is linked
to that of xt following:
The application of this to the ARMA time series model (C.13), see also Chap. 2,
yields
) )
) θ (eiω ) )2
fx (ω) = σε2 )) ) . (C.18)
φ(eiω ) )
In the above equation it is assumed that the roots of φ(z) are outside unit circle
(stationarity) and similarly for θ (z) (for invertibility, i.e. εt is written as a convergent
power series in xt , xt−1 , . . .).
The elements of (τ ) are [(τ )]ij = E xt+τ,i xt,j . The diagonal elements are the
autocovariances of the individual unidimensional time series forming xt , whereas
its off-diagonal elements are the lagged cross-covariances. The lagged covariance
matrix has the following properties:
• (−τ ) = [(τ )]T .
• (0) is the covariance matrix 0 of xt .
C Stationary Time Series Analysis 491
• (τ ) is positive semi-definite, i.e. for any integer m > 0, and real vectors
a1 , . . . , am ,
m
aTi (i − j )aj ≥ 0. (C.20)
i,j =1
−1/2 −1/2
ϒ(τ ) = 0 (τ ) 0 ,
− 1
whose elements ρij (τ ) are [ϒ(τ )]ij = γij (τ ) γii (0)γjj (0) 2 , has similar
properties. Furthermore, we have
|ρij (τ )| ≤ 1.
Note that the inequality γij (τ ) ≤ γij (0), for i = j , is not true in general.
C.3.2 Cross-Spectrum
As for the univariate case, we can define the spectral density matrix F(ω) of xt ,
t = 1, 2, . . . for −π ≤ ω ≤ π as the Fourier transform of the autocovariance
matrix:
∞
1 −iτ ω
F(ω) = e (τ ) (C.21)
2π τ =−∞
whenever
τ (τ ) < ∞, where . is a matrix norm. For example, if
τ |γ ij (τ )| < ∞, for i, j = 1, 2, . . . p, then F(ω) exists. Unlike the univariate case,
however, the spectral density matrix can be complex because () is not symmetric.
The diagonal elements of F(ω) are real because they represent the power spectra of
the individual univariate time series that constitute xt . The real part of F(ω) is the
co-spectrum matrix, whereas the imaginary part is the quadrature spectrum matrix.
The spectral density matrix has the following properties:
• F(ω) is Hermitian, i.e.
F(−ω) = [F(ω)]∗T ,
π
(τ ) = F(ω)eiτ ω dω. (C.22)
−π
π
• 0 = −π F(ω)dω, and 2π F(0) = k (k).
• F(ω) is semi-definite (Hermitian), i.e. for any integer m > 0, and complex
numbers c1 , c2 , . . . , cp , we have
p
∗T
c F(ω)c = ci∗ Fij (ω)cj ≥ 0, (C.23)
i,j =1
T
where c = c1 , c2 , . . . , cp . The coherence and phase between xt,i and xt,j ,
t = 1, 2, . . ., for i = j , are, respectively, given by
|Fij (ω)|2
cij (ω) = , (C.24)
Fii (ω)Fjj (ω)
and
I m(Fij (ω))
φij (ω) = Atan . (C.25)
Re(Fij (ω))
The coherence, Eq. (C.24), gives essentially a measure of the square of the
correlation coefficient between both the time series in the frequency domain. The
phase, Eq. (C.25), on the other hand, gives a measure of the time lag between the
time series.
1
n−τ
γ̂1 (τ ) = (xt − x) (xt+τ − x) (C.26)
n
t=1
and
1
n−τ
γ̂2 (τ ) = (xt − x) (xt+τ − x) . (C.27)
n−τ
t=1
C Stationary Time Series Analysis 493
We can assume for simplicity that the sample mean is zero. It is clear from
Eq. (C.26) and Eq. (C.27) that γ̂1 () is slightly biased, with bias of order n1 , i.e.
asymptotically unbiased, whereas γ̂2 () is unbiased. The asymptotically unbiased
estimator γ̂1 () is, however, consistent, i.e. its variance goes to zero as the sample
size goes to infinity, whereas the estimator γ̂2 () is inconsistent with its variance
tending to infinity with the sample size (see e.g. Jenkins and Watts 1968). But, for
a fixed lag both the estimators are asymptotically unbiased and with approximate
variances satisfying (Priestly 1981, p. 328)
γ̂ (τ )
ρ̂(τ ) = , (C.28)
σ̂ 2
E ρ̂(τ ) ≈ 0 for τ = 0
and
var ρ̂(τ ) ≈ 1
n for τ = 0.
These approximations can be used to construct confidence intervals for the sample
autocorrelation function.
as the Nyquist frequency.1 The Nyquist frequency represents the highest frequency
that can be resolved, and therefore, the power spectrum can only be estimated for
frequencies less than the Nyquist frequency.
The sequence of the following complex vectors:
1 T
ck = √ eiωk , e2iωk , . . . , einωk
n
As for the power spectrum, the periodogram also distributes the sample variance, i.e.
the periodogram In (ωk ) represents the contribution to the sample variance from the
frequency ωk . The periodogram can be seen as an estimator of the power spectrum,
Eq. (C.16). In fact, by expanding Eq. (C.30) one gets
⎡ ⎤
1 ⎣ 2
n n−1
In (ωp ) = xt + xt xτ eikωp + e−ikωp ⎦
n
t=1 k=1 |t−τ |=k
1 Or 1
2t if the frequency is expressed in (1/time unit). For example, if the sampling time interval
is unity, then the Nyquist frequency is 12 .
C Stationary Time Series Analysis 495
n−1
= γ̂ (k) cos(ωp k). (C.32)
k=−(n−1)
1
Therefore 2π In (ωp ) is a candidate estimator for the power spectrum f (ωp ). Fur-
thermore, it can be seen from Eq. (C.32) that E In (ωp ) = n−1 k=−(n−1) E γ̂ (k)
cos(ωp k), i.e.
Periodogram Smoothing
The spectral window is a symmetric kernel function that integrates to unity and
decays at large values. This smoothing is equivalent to a discrete Fourier transform
of the weighted autocovariance estimator using a (time domain) lag window λ(.) as
1
n−1
fˆ(ω) = λ(k)γ̂ (k) cos(ωp k). (C.35)
2π
k=−(n−1)
The sum in Eq. (C.35) is normally truncated at the truncation point of the lag
window.
The spectral window W () is the Fourier transform of the lag window, whose
aim is to neglect the contribution, in the sample autocovariance function, from
large lags. This means that localisation in time is associated with broadness in the
spectral domain and vice versa. Figure C.4 illustrates the relationship between time
(or lag) window and spectral window. Various lag/spectral windows exist in the
literature. Two examples are given below, namely, the Bartlett (1950) and Parzen
(1961) windows:
496 C Stationary Time Series Analysis
Fig. C.4 Illustration of the relationship between time and spectral windows
• Parzen window:
⎧
⎪ τ 2
⎨1 − 6 M + 6
τ 3
for |τ | ≤ M
M 2
λ(τ ) = 2 1− Mτ 3 (C.38)
⎪
⎩
0 otherwise,
and
4
6 sin(Mω/4)
W (ω) = . (C.39)
πM3 sinω/2
C Stationary Time Series Analysis 497
Fig. C.5 Parzen window showing W (ω) in ordinate versus ω in abscissa for different values of
the parameter M
Figure C.5 shows an example of the Parzen window for different values of the
parameter M. Notice in particular that as M increases the lag window becomes
narrower. Since M can be regarded as a time resolution, it is clear that the variance
increases with M and vice versa.
Remark There are other ways to estimate the power spectrum such as the maximum
entropy method (MEM). The MEM estimator is achieved by fitting an autoregres-
sive model to the time series and then using the model parameters to compute
the power spectrum, see e.g. Burg (1972), Ulrych and Bishop (1975), and Priestly
(1981).
The cross-covariance and the cross-spectrum can be estimated in a similar way to the
sample covariance function and sample spectrum. For example, the cross-covariance
between two zero-mean time series samples xt , and yt , t = 1, . . . n, can be estimated
using
1
n−τ
γ̂12 (τ ) = xt yt+τ (C.40)
n
t=1
1
M
fˆ12 (ω) = λ(k)γ̂12 (k)eiωk . (C.41)
2π
k=−M
Appendix D
Matrix Algebra and Matrix Function
D.1 Background
Any n×p matrix X is a representation of a linear operator from a linear space Ep into
a linear space En . For example, if the space En is real, then one gets En = Rn . Let us
denote by xk = (x1k , . . . , xnk )T , and then the matrix is written as X = x1 , . . . , xp .
The kth column xk of X represents the image of the kth basis vector ek of Ep , i.e.
Xek = xk .
Product
Diagonal
A diagonal matrix is n × n matrix of the form A = xij δij , where δij is the
Kronecker symbol. For a n × p matrix A, the main diagonal is given by all the
elements aii , i = 1, . . . , min(n, p).
Trace
n
The trace of a square n × n matrix X = (xij ) is given by tr (X) = k=1 xkk .
Determinant
!
p
det (X) = |X| = (−1)|π | xkπ(k) , (D.1)
π k=1
D Matrix Algebra and Matrix Function 501
where the sum is over all permutations π() of {1, 2, . . . , p} and |π | is either +1 or
−1 depending on whether π() is written as the product of an even or odd number
of transpositions, respectively. The determinant can also be defined in a recurrent
manner as follows. For a scalar x, the determinant is simply x, i.e. det (x) = x.
Then, for a p × p matrix X, the determinant is given by
|X| = (−1)i+j xij ij = (−1)i+j xij ij ,
j i
p
xik cj k = |X|δij , (D.2)
k=1
where δij is the Kronecker symbol. The matrix C = (cij ) is the matrix of cofactors
of X.
Matrix Inversion
• Conventional inverse
When |X| = 0, the square p × p matrix X = (xij ) is invertible and its inverse
X−1 satisfies XX−1 = X−1 X = Ip . It is clear from Eq. (D.2) that when X is
invertible the inverse is given by
1 T
X−1 = C , (D.3)
|X|
where C is the matrix of cofactors of X. In what follows, the elements of X−1 are
denoted by x ij , i, j = 1, . . . n, i.e. X−1 = (x ij ).
• Generalised inverse
Let X be a n × p matrix, and the generalised inverse of X is the p × n matrix X−
satisfying the following properties:
Direct Product
Let A = (aij ) and B = (bij ) two matrices of respective order n × p and q × r. The
direct product of A and B, noted as A × B or A ⊗ B, is the nq × pr matrix defined
by
⎛ ⎞
a11 B a12 B . . . a1p B
⎜ a21 B a22 B . . . a2p B ⎟
⎜ ⎟
A⊗B=⎜ . .. .. ⎟ .
⎝ .. . . ⎠
an1 B an2 B . . . anp B.
The above product is indeed a left direct product. A direct product is also known
as Kronecker product. There is also another type of product between two n × p
matrices of the same order A = (aij ) and B = (bij ), and that is the Hadamard
product given by
A B = aij bij .
Positivity
Eigenvalues/Eigenvectors
Let A a p ×p matrix. The eigenvalues of A are given by the set of complex numbers
λ1 , . . . , λp solution to the algebraic polynomial equation:
|A − λIp | = 0.
Au = λu,
The SVD theorem has different forms, see e.g. Golub and van Loan (1996), and
Linz and Wang (2003). In its simplest form, any n × p real matrix X, of rank r, can
be decomposed as
X = UDVT , (D.4)
504 D Matrix Algebra and Matrix Function
then we have
then
−1
A11 = A11 − A12 A−1
22 A21 ,
−1
A22 = A22 − A21 A−111 A12 , and
• LU decomposition
For any nonsingular n × n matrix A, there exists some permutation matrix P such
that
PA = LU,
where L is a lower triangular matrix with ones in the main diagonal and U is an
upper triangular matrix.
• Cholesky factorisation
For any symmetric positive semi-definite matrix A, there exists a lower triangular
matrix L such that
A = LLT .
• QR decomposition
For any m × n matrix A, with m ≥ n say, there exist a m × m unitary matrix Q and
a m × n upper triangular matrix R such that
A = QR. (D.5)
The proof of this result is based on Householder transformation and finds a sequence
of n unitary matrices Q1 , . . . , Qn such that Qn . . . Q1 A = R. If at step k, we have
say
Lk | B
Qk . . . Q1 A = ,
Om−k,k | c|C
where Lk is a k × k upper triangular matrix, then Qk+1 will transform the vector c =
(ck+1 , . . . , cm )T into the vector d = (d, 0, . . . , 0)T without changing the structure
of Lk and the (m − k) × k null matrix Om−k,k . This matrix is known as Householder
transformation and has the form:
Ik Ok,m−k
Qk+1 = ,
Om−k,k Pm−k
j
because uik vk is the ith and j th element of uk vTk , one gets
q
UVT = uk vTk .
k=1
and is also known as the gradient of f (.) at x. The differential of f () is then written
p ∂f T
as df = K=1 ∂x k
dxk = ∇f (x)T dx, where dx = dx1 , . . . , dxp .
Examples
• For a linear form f (x) = aT x, ∇f (x) = a.
• For a quadratic form f (x) = xT Ax, ∇x f = 2Ax.
For a vector function f(x) = f1 (x), . . . , fq (x) , where f1 (.), . . . , fq (.) are scalar
functions of x, the gradient in this case is called the Jacobian matrix of f(.) and is
given by
∂f
j
Df(x) = ∇f1 (x) , . . . , ∇fq (x) =
T T
(x) . (D.7)
∂xi
1. Scalar Case
If Y = F (X) is a scalar function, then to define the derivative of F () we first use
T
the vec (.) notation given by vec (X) = xT1 , . . . , xTq transforming X into a pq-
dimensional vector. The differential of F (X) is then obtained by considering F ()
as a function of vec (X). One gets the following expression:
∂F ∂F
= . (D.8)
∂X ∂xij
The derivative ∂F
∂X is then a p × q matrix.
2. Matrix Case
If Y = F (X) is a r × s matrix, where each yij = Fij (X) is a differentiable scalar
function of X, the partial derivative of Y with respect to xmn is the r × s matrix:
∂Y ∂Fij (X)
= . (D.9)
∂xmn ∂xmn
Equation (D.10) also defines the Jacobian matrix DY (X) of the transformation.
Another definition of the Jacobian matrix is given in Magnus and Neudecker (1995,
p. 173) based on the vec transformation, namely,
∂vecF (X)
DF (X) = . (D.11)
∂ (vecX)T
Equation (D.11) is useful to compute the Jacobian matrices using the vec trans-
formation of X and Y and then get the Jacobian of a vector function. Note that
Eqs. (D.9) or (D.10) can also be written as a Kronecker product:
∂Y ∂ ∂yij
=Y⊗ = . (D.12)
∂X ∂X ∂X
D.3.3 Examples
In the following examples the p × q matrix Jij will denote the matrix whose ith
and j th element is one and zero elsewhere, i.e. Jij = δm−i,n−j = δmi δnj , and
similarly for the r × s matrix Kαβ . For instance, if X = (xmn ), then Y = Jij X is the
matrix whose ith line is the j th line of X and zero elsewhere (i.e. ymn = δmi xj n ), and
Z = XJij is the matrix whose j column is the ith column of X and zero elsewhere
(i.e. zmn = δj n xmi ). The matrices Jij and Kij are essentially identical, but they are
obtained differently, see the remark below.
∂ ∂
tr (X) = Ip = tr XT . (D.13)
∂X ∂X
∂f
• f (X) = tr (AX). Here f (X) = i k aik xki , and ∂xmn = k n
i,k aik δm δi =
anm ; hence,
∂f
= AT . (D.14)
∂X
• g (X) = g (f (X)), where f (.) is a scalar function of X and g(y) is a
differentiable scalar function of y. In this case we have
∂g dg ∂f
= (f (X)) .
∂X dy ∂X
∂ tr(XA)
For example, ∂X e = etr(XA) AT .
(X) = det (X) = |X|. For this case, one can use Eq. (D.2), i.e. |X| =
• f
j xαj Xαj where Xαj is the cofactor of xαj . Since Xαj is independent of xαk ,
∂|X|
for k = 1, . . . n, one gets ∂xαβ = Xαβ , and using Eq. (D.3), one gets
∂|X|
= |X|X−T . (D.15)
∂X
Consequently, if g(y) is any real differentiable scalar function of y, then we get
D Matrix Algebra and Matrix Function 509
∂ dg
g (|X|) = (|X|) |X|X−T . (D.16)
∂X dy
T
∂|Y(X)| −T ∂Y ∂Y −1
= tr |Y|Y = |Y|tr Y . (D.17)
∂xαβ ∂xαβ ∂xαβ
Remark We can also compute the derivative with respect to an element and
derivative of an element with respect to a matrix as in the following examples.
∂XT
• Let f (X) = X, then ∂X
∂xαβ = Jαβ , and
∂xαβ = Jβα = Jαβ .
T
∂y
• For a r × s matrix Y = yij , we have ∂Yij = Kij .
∂[f (X)]ij
• For f (X) = AXB, one obtains ∂f (X)
∂xαβ = AJαβ B and ∂X = AT Kij BT .
∂Xn
Exercise Compute ∂xαβ .
∂XXn−1
Hint Use a recursive relationship. Write Un = ∂xαβ , and then Un = Jαβ Xn−1 +
n−1
∂xαβ = Jαβ X
X ∂X n−1 + XU
n−1 . By induction, one finds that
∂f (X)
= Jαβ AX + XAJαβ . (D.18)
∂xαβ
This could be proven by expressing the ith and j th element [XAX]ij of XAX.
Application2. g(X) = tr(f (X)) where f (X)
= XAX.
Since tr ∂f (X)
∂xαβ = tr Jαβ AX + XAJ αβ = [AX]βα + [XA]βα , hence
∂tr (XAX)
= (AX + XA)T . (D.19)
∂X
510 D Matrix Algebra and Matrix Function
Application 3.
−1 −1
∂|XAXT |
= |XAXT | XAT XT XAT + XAXT XA . (D.20)
∂X
∂|XAXT | −1
In particular, if A is symmetric, then ∂X = 2|XAXT | XAT XT XA .
∂ XAXT
One can use the fact that = Jαβ AXT + XAJβα , see also Eq. (D.18),
∂xαβ
∂ XAXT ∂xik T ∂ AXT
which can be proven by writing ∂xαβ ij = k ∂x αβ
AX kj + xik ∂xαβ kj .
β
The first sum in the right hand side of this expression is simply k δk δiα AXT kj ,
which is the (i, j )th element of Jαβ AXT (and also the (α, β)th element of Jij XAT ).
Similarly, the second sum is the (i, j )th element of XAJαβ and, by applying the
trace operator, provides the required answer.
Exercise Complete the proof of Eq. (D.20).
∂|XAXT | −1
∂xαβ =|XAX |tr Jαβ AXT+XAJβα .
Hint First use Eq. (D.17), i.e. T XAXT
−1
Next use the same argument as that used in Eq. (D.19) to get tr XAXT Jαβ
−1 T −T
AX T = i XAX
T AX βi = XAX T XA T . A similar
iα αβ
reasoning yields
−1 −1 −1
tr XAXT
XAJαβ = tr XAJβα XAXT
= XAXT XA ,
αβ
∂|AXB|
= |AXB|AT (AXB)−T BT . (D.21)
∂X
In fact, one has ∂x∂αβ [AXB]ij = aiα bβj = AJαβ B ij . Hence ∂|AXB|
∂xαβ =
−1 −1
|AXB|tr AJαβ B (AXB) . The last term equals i aiα B (AXB) =
−1
−1
βi
B (AXB) βi aiα , which can be easily recognised as B (AXB) A βα =
iT
A (AXB)−T BT αβ .
−1
−1
∂xαβ , one can use the fact that X X =
• Derivative of the inverse. To compute ∂X
ik ∂x ik
Ip , i.e. k x xkj = δij , which yields after differentiation: k ∂xαβ xkj =
ik ∂X−1 −1
− k x Jαβ kj , i.e. ∂xαβ X = −X Jαβ or
D Matrix Algebra and Matrix Function 511
∂
X−1 = −X−1 Jαβ X−1 . (D.22)
∂xαβ
∂yij
= −X−T AT Jij BT X−T . (D.23)
∂X
∂
tr X−1 A = −X−T AT X−T . (D.24)
∂X
Alternatively, one can also use the identity tr(X) = |X|tr(X−1 ) (e.g. Graybill
1969, p. 227).
The matrices dealt with in the previous examples have independent elements. When,
however, the elements are not independent, the rules change. Here we consider
the case of symmetric matrices, but there are various other dependencies such as
normality, orthogonality etc. Let X = xij be a symmetric matrix, and J ij =
Jij + Jj i − diag Jij , i.e. the matrix with one for the (i, j )th and (j, i)th elements
and zero elsewhere. We have ∂x ∂X
ij
= J ij . Now, if f (X) is a scalar function of
the symmetric matrix X, then we can start first with the scalar function f (Y) for a
general matrix, and we get (e.g. Rogers 1980)
∂f (X) ∂f (Y) ∂f (Y) ∂f (Y)
= + − diag .
∂X ∂Y ∂YT ∂Y Y=X
∂
tr (AX) = A + AT . (D.25)
∂X
∂
|X| = |X| 2X−1 − diag X−1 . (D.26)
∂X
∂
|AXB|=|AXB| AT (AXB)−T BT +B (AXB)−1 A−diag B (AXB)−1 A .
∂X
(D.27)
Exercise Derive Eq. (D.26) and Eq. (D.27).
Hint Apply (D.17) to the transformation Y(X) = X1 + XT1 − diag (X), where
X1 is the lower triangular matrix whose elements are xij1 = xij , for i ≤ j . Then
∂|Y| ∂yij
one gets ∂x∂αβ |Y(X)| = ij ∂yij ∂xαβ = ij |Y|y
ji J
αβ ij . Keeping in mind
that Y = X, the previous expression yields |X|tr X−1 J αβ . To complete the
proof,
remember that J αβ = Jαβ + Jβα − diag Jαβ ; hence, tr X−1 J αβ =
x βα + x αβ − x αβ δx = 2X−1 − diag X−1 αβ .
αβ
Similarly, Eq. (D.27) is similar to Eq. (D.21) but involves symmetry, i.e.
−1
∂xαβ AXB = AJ αβ B. Therefore, ∂xαβ |AXB| = |AXB|tr AJ αβ B (AXB)
∂ ∂
.
• Derivative of a matrix inverse
∂X−1
= −X−1 J αβ X−1 . (D.28)
∂xαβ
∂
tr X−1 A = −X−1 A + AT X−1 + diag X−1 AX−1 . (D.29)
∂X
D.4 Application
Matrix derivatives find straight application in multivariate analysis. The most famil-
iar example is perhaps the estimation of a p-dimensional multinormal distribution
D Matrix Algebra and Matrix Function 513
!
n !n
−p/2 −1/2 1 −1
L= f (xt ; μ, ) = (2π ) || exp − (xt − μ) (xt −μ) .
T
2
t=1 t=1
(D.30)
The objective is then to estimate μ and by maximising L. It is usually simpler to
use the log-likelihood L = log L, which reads
1
n
np n
L = log L = log 2π − log || − (xt − μ)T −1 (xt − μ) . (D.31)
2 2 2
t=1
L
The estimates are obtained by solving the system of equations given by ∂∂μ = 0 and
∂L
∂ = O. The first of these yields is (assuming that −1 exists)
n
(xt − μ) = 0, (D.32)
t=1
which provides the sample mean. For the second, one can use Eqs. (D.16)–(D.26),
and Eq. (D.29) for the last term, which can be written as a trace of a matrix product.
This yields
2 −1 − diag −1 − 2 −1 S −1 + diag −1 S−1 −1 = O,
The estimation of the parameters of a factor model can be found in various text
books, e.g. Anderson (1984), Mardia et al. (1979). The log-likelihood of the model
has basically the same form as Eq. (D.31) except that now is given by =
+ T , where is a diagonal covariance matrix (see Chap. 10, Eq. (10.11)).
Using Eq. (D.16) along with results from Eq. (D.20), we get ∂ ∂
log |T +
−T
| = 2 T + . In a similar way we get ∂
∂ log |T + | =
514 D Matrix Algebra and Matrix Function
−1
diag T + ∂
. Furthermore, using Eq. (D.27), one gets ∂ log |T +
−1
−1
| = 2T T + − diag T T + .
∂ ∂tr(H−1 S) ∂
tr H−1 S = hij
∂xαβ ∂hij ∂xαβ
ij
= −H−T ST H−T Jαβ AXT + XAJβα .
ij ij
ij
This is precisely tr −H−1 SH−1 Jαβ AXT + XAJβα . Using an argument similar
to that presented in Eqs. (D.23)–(D.24), one gets
∂
trH−1 S = − H−T ST H−T XAT + H−1 SH−1 XA .
∂xαβ αβ
Applying the above exercise and keeping in mind the symmetry of , see
Eq. (D.29), yield
∂
tr −1 S = 2 −2 −1 S −1 + diag −1 S −1 .
∂
∂
tr −1 S = −2 −1 S −1 + diag −1 S −1 , that is − diag −1 S −1 .
∂ []αα αα
∂L
= − n2 2 −1 + 2 −2 −1 S −1 + diag −1 S −1
∂ −1
= −n ( − 2S) −1 + diag −1 S −1
∂L
= − n2 2T −1 − diag T −1 + 2T −2 −1 S −1 + diag −1 S −1
∂
− n2 diag T −2 −1 S −1 + diag −1 S −1
∂L
∂ = − n2 diag −1 − diag −1 S −1 = − n2 diag −1 ( − S) −1 .
(D.34)
Note that if one removes the terms pertaining to symmetry one finds what has been
presented in the literature, e.g. Dwyer (1967), Magnus and Neudecker (1995). For
example, in Dwyer (1967) symmetry was not explicitly taken into account in the
differentiation. The symmetry condition can be considered via Lagrange multipliers
(Magnus and Neudecker 1995). It turns out, in fact, that the stationary points of a
scalar function f (X) of the symmetric p × p matrix X, i.e. ∂f∂X
(X)
= O are also the
solutions to (Rogers 1980, th. 101, p. 80)
∂f (Y)
= O, (D.35)
∂Y Y=X
−1 ( − S) −1 = O
T −1 ( − S) −1 = O (D.36)
diag −1 ( − S) −1 = O.
Matrix derivative also finds application in various other subjects. The eigenvalue
problem of EOFs is a straightforward application. An interesting alternative to this
eigenvalue problem, which uses matrix derivative, is provided by the following
result (see Magnus and Neudecker 1995, th. 3, p. 355). For a given p × p positive
semi-definite matrix , of rank r, the minimum of
q
Y= λ2k vk vTk , (D.38)
k=1
516 D Matrix Algebra and Matrix Function
2 The number λ and vector Qk x are, respectively, referred to as Ritz value and Ritz vector of A.
518 D Matrix Algebra and Matrix Function
Lanczos Method
Lanczos method is based on a triangularisation algorithm of a Hermitian matrix (or
symmetric for real cases) A, as
AQn = Qn Hn , (D.41)
βj qj +1 = Aqj − βj −1 qj −1 − αj qj . (D.42)
The algorithm then starts from an initial vector q1 (taking q0 = 0) and obtains
αj , βj and qj +1 at each iteration step. (The vectors qi , i = 1, . . . n, are orthonor-
mal.) After k steps, one gets
with ek being the k-element vector (0, . . . , 0, 1)T . The algorithm stops when βk =
0.
Arnoldi Method
Arnoldi algorithm is similar to Lanczos’s except that the matrix Hn = (hij ) is upper
Hessenberg matrix, which satisfies hij = 0 for i ≥ j + 2. As for the Lanczos
j
method, Eq. (D.42) yields hj +1,j qj +1 = Aqj − i=1 hij qi . After k steps, one
obtains
The above Eq. (D.44) can be written in a compact form as AQk = Qk+1 Hk ,
where Hk is the obtained (k + 1) × k Hessenberg matrix. The matrix Hk is
obtained from H k+1 by deleting the last row. Note that Arnoldi (and also Lanczos)
methods are modified versions of the Gram–Schmidt orthogonalisation procedure,
with Hessenberg and tridiagonal matrices involved, respectively, in the two methods.
Note also that a non-symmetric Lanczos method exists, which yields a non-
symmetric tridiagonal matrix (e.g. Parlett et al. 1985).
The simplest iterative method is the Jacobi iteration, which solves a fixed point
problem. It transforms the linear system Ax = b into x = Âx + b̂, where  =
Im − D−1 A and b̂ = D−1 b, with D being either the diagonal matrix of A or simply
the identity matrix. The fixed point iterative algorithm is then given by xn+1 =
Âxn +b̂, with a given initial condition x0 . The algorithm converges when the spectral
D Matrix Algebra and Matrix Function 519
radius of  is less than unity, i.e. ρ(Â) < 1. The computation of the nth residual
vector rn = b − Axn involves the Krylov subspace Kn+1 (A, r0 ). Other methods
like gradient and semi-iterative methods are included in the Krylov space solver.
From an initial condition x0 , the residual takes the form
for a polynomial pn−1 (.) of degree n − 1, and belongs to Kn (A, r0 ). The problem
is then to find a good choice of xn in the Krylov space. There are essentially two
methods for this (Saad 2003), namely, Arnoldi’s method (described above) or FOM
(Full Orthogonalisation Method) and the GMRES (Generalised Minimum Residual
Method) algorithms.
The FOM algorithm is based on the above Arnoldi orthogonalisation procedure
and looks for xn − x0 within Kn (A, r0 ) such that (b − Axn ) is orthogonal to
this Krylov space (Galerkin condition). From the initial residual r0 , and letting
r0 = βq1 , with β = r0 2 , the algorithm yields a similar equation to (D.41), i.e.
QTk AQk = Hk and QTk r0 = βq1 . The approximate solution at step k is then given
by
xk = x0 + Qk yk = x0 + βQk H−1 −1
k Qk q1 = x0 + βQk Hk e1 ,
T
(D.46)
xk = x0 + Qk z∗ . (D.47)
where the last equality holds because Qk+1 is orthonormal ( Qk+1 x 2 = x 2 ), and
Hk is the matrix defined below Eq. (D.44).
Remark The Krylov space can be used, for example, to approximate the exponential
of a matrix, which is useful particularly for large matrices. Given a matrix A and a
vector v, an approximation of eA v, using Kk (A, v), is given by (Saad 1990)
with β = v 2 . Equation (D.50) can be used, for example, to compute the solution
of an inhomogeneous system of first-order ODE. Also, and as pointed out by Saad
(1990), Eq. (D.50) can be used to approximate the (matrix) integral:
∞ T
X= euA bbT euA du, (D.51)
0
E.1 Background
For a detailed review of the various optimisation problems and algorithms, the
reader is referred to Gill et al. (1981).
There is in general a large difference between one- and multidimensional
problems. The univariate and bivariate minimisation problems are in general not
difficult to solve since the function can be plotted and visualised, particularly when
the function is smooth. When the first derivative is not provided, methods like the
golden section can be used. The problem gets more difficult for many variables
when there are multiple minima. In fact, the main obstacle to minimisation in the
multivariate case is the problem of local minima. For example, the global minimum
can be attained when the function is quadratic:
1 T
f (x) = x Ax − bT x + c, (E.1)
2
where A is a symmetric matrix. The quadratic Eq. (E.1) is a typical example that
deserves attention. The gradient of f (.) is Ax − b, and a necessary condition for
optimality is given by ∇f (x) = 0. The solution to this linear equation provides a
partial answer, however. To get a complete answer, one has to compute the second
derivative at the solution of the necessary condition, to yield the Hessian:
∂ 2f
H = (hij ) = = A. (E.2)
∂xi ∂xj
1
In = I0 ,
2n/2
where I0 = x2 − x1 if [x1 , x2 ] is the initial interval.
• Golden section—It is based on subdividing the initial interval [x1 , x2 ] into three
subintervals using two extra points x3 and x4 , with x1 < x3 < x4 < x2 . For
example, if f (x3 ) ≤ f (x4 ), then the minimum is expected to lie within [x1 , x4 ];
otherwise, it is in [x3 , x2 ]. The iterative procedure takes the form:
(i) τ −1 (i) (i) (i)
x3 = τ x2 − x 1 + x1
(i) (i) (i) (i)
x4 = τ x2 − x1
1
+ x1 ,
√
where τ is the Golden number1 1+2 5 . There are various other methods such as
quadratic interpolation and Powell’s method, see Everitt (1987) and Box et al.
(1969).
When the first and perhaps the second derivatives are available, then it is known that
the two conditions:
∗
dx f (x ) = 0
d
d2 (E.3)
dx 2
f (x ∗ ) > 0
are sufficient conditions for x ∗ to be a minimum of f (). In this case the most widely
used method is based on Newton algorithm, also known as Newton–Raphson, and
aims at computing the zero of dfdx(x) based on the tangent line at dfdx(x) . The algorithm
reads
df (xk )/dx
xk+1 = xk − (E.4)
d 2 f (xk )/dx 2
when d 2 f (xk )/dx 2 = 0. Note that when the second derivative is not provided, then
the denominator of Eq. (E.4) can be approximated using a finite difference scheme:
xk − xk−1 df (xk )
xk+1 = xk − .
df (xk ) − df (xk−1 ) dxk
un+1
1 It is the limit of un when n → ∞ where u0 = u1 = 1 and un+1 = un + un−1 .
524 E Optimisation Algorithms
As for the one-dimensional case, there are direct search methods and gradient-
based algorithms. Among the most widely used direct search methods, one finds
the following:
This method is due to Nelder and Mead (1965) and was originally described by
Spendley et al. (1962). The method is based on a simplex,2 generally with mutually
equidistant vertices, from which a new simplex is formed simply by reflection of
the vertex (where the objective function is largest) through the opposite facet, i.e.
through the hyperplane formed by the remaining m points (or vertices), to a “lower”
vertex where the function is smaller. Details on the method can be found, e.g. in Box
et al. (1969) and Press et al. (1992). The method can be useful for a quick search
but can become inefficient for large dimensions.
Most multivariate minimisation algorithms attempt to find the best search direction
along which the function can be minimised. The conjugate direction method is
based on minimising a quadratic function and is known as quadratically convergent.
Consider the quadratic function:
1 T
f (x) = x Gx + bT x + c. (E.5)
2
facets. Triangles and pyramids are examples of simplexes in three- and four-dimensional spaces,
respectively.
E Optimisation Algorithms 525
x0 and a direction u0 , one minimises the univariate function f (x0 + λu0 ) and then
replaces x0 and u0 by x0 + λu0 and λu0 , respectively. Powell’s algorithms run as
follows:
0. Initialise ui = ei , i.e. the canonical basis vectors, i = 1, . . . , m.
1. Initialise x = x0 .
2. Minimise f (xi−1 + λui ), xi = x0 + λui , i = 1, . . . , m.
3. Set ui+1 = ui , i = 1, . . . , m, um = xm − x0 .
4. Minimise f (xm + λum ), x0 = xm + λum , and then go to 1.
Powell (1964) showed that the procedure yields a set of k mutually conjugate
directions after k iterations. The procedure has to be reinitialised with new vectors
after every m iterations in order to escape dependency of the obtained vectors, see
Press et al. (1992) for further details.
Remark The reason for using one-dimensional minimisation is conjugacy. In fact,
if u1 , . . . , um are mutually conjugate with respect to G, the required minimum is
taken to be
m
x1 = x0 + αk uk ,
k=1
m
1
f (x1 ) = αi2 uTi Gui + αi uTi (Gx0 + b) + f (x0 ) (E.6)
2
i=1
This algorithm is based on concepts from statistical mechanics and makes use
of Boltzmann probability of energy distribution in thermodynamical systems in
equilibrium (Metropolis et al. 1953). The method uses Monte Carlo simulation to
526 E Optimisation Algorithms
generate moves and is particularly useful because it can escape local minima. The
algorithm can be applied to continuous and discrete problems (Press et al. 1992), see
also Hannachi and Legras (1995) for an application to atmospheric low-frequency
variability.
Unlike direct search methods, gradient-based approaches use the gradient of the
objective function. Here we assume that the (smooth) objective function can be
approximated by
1
f (x + δx) = f (x) + g(x)T δx + δxT Hδx + o(|δx|2 ), (E.7)
2
where g(x) = ∇f (x), and H = ∂xi∂∂xj f (x) are, respectively, the gradient vector
and Hessian matrix of f (x). Gradient methods also belong to the class of descent
algorithms where the approximation of the desired minimum at various iterations is
perturbed in an additive manner as
Descent algorithms are distinguished by the manner in which the search direction
u is chosen. Most gradient methods use the gradient as search direction since the
gradient ∇f (x) points in the direction where the function increases most rapidly.
∇f (xm )
xm+1 = xm − λ . (E.10)
∇f (xm )
E Optimisation Algorithms 527
Note that Eq. (E.9) is quadratic when Eq. (E.7) is used, in which case the solution
is given by3
∇f (xm ) 3
λ= . (E.11)
∇f (xm )T H∇f (xm )
Note that because of the one-dimensional minimisation at each step, the method
can be computationally expensive. Some authors use decreasing step-size selection
λ = α k , (0 < α < 1), for k = 1, 2, . . . , until the first k where f () has decreased
(Cadzow 1996).
Note that it is also possible to choose xm+1 = xm − λH−1 ∇f (xm ), where λ can be
found through a one-dimensional minimisation as in the steepest descent.
Newton method requires the inverse of the Hessian at each iteration, and this can
be quite expensive particularly for large problems. There is also another drawback
of the approach, namely, the convergence towards the minimum can be secured
only if the Hessian is positive definite. Similarly, the steepest descent is no better
since it is known to exhibit a linear convergence, i.e. a slow convergence rate. These
drawbacks have led to the development of more advanced and improved algorithms.
Among these methods, two main classes of algorithms stand out, namely, the
conjugate gradient and the quasi-Newton methods discussed next.
3 One can eliminate the Hessian from Eq. (E.11) by choosing a first guess λ0 for λ and then using
−1
g λ20
Eq. (E.7) with δx = −λ0 g , which yields λ = 2 g f x − λ0 gg − f (x) + λ0 g .
528 E Optimisation Algorithms
It is possible that the descent direction −g = −∇f (x) and the direction to the
minimum may be near to orthogonality, which can explain the slow convergence
rate of the steepest descent. For a quadratic function, for example, the best search
direction is conjugate to that taken at the previous step (Fletcher 1972, th. 1). This is
the basic idea of conjugate gradient for which the new search direction is constructed
to be conjugate to the gradient of the previous step. The method can be thought of as
an association of conjugacy with steepest descent (Fletcher 1972), and is also known
as Fletcher–Reeves (or projection) method. From the set of conjugate gradients −gk ,
k = 1, . . . , m, a new set of conjugate directions is formed via linear combination as
k−1
uk = −gk−1 + αj k uj , (E.14)
j =1
gTk−1 δgj −1
αj k = − (E.15)
uTj δgj −1
gTk−1 δgk−2
uk = −gk−1 + uk−1 ,
uTk−1 δgk−2
which simplifies to
gTk−1 gk−1
uk = −gk−1 + uk−1 , (E.16)
gTk−2 gk−2
4 After k − 1 one-dimensional searches in (u1 , . . . , uk−1 ), the quadratic form is minimised at xk−1 ,
then gk−1 is orthogonal to uj , j = 1, . . . , k − 2, because of the one-dimensional requirement for
minimisation in each direction um , m = 1, . . . , k − 2, dα d
f (xk−2 + αuj ) = gTk−1 uj = 0 . Fur-
thermore, since the vectors uj are linear combinations of gi , i = 1, . . . , j , the vectors gj are
j
also linear combinations of u1 , . . . , uj , i.e. gj = i=1 αi ui , hence gTk−1 gj = 0, j = 1, . . . , k − 2.
E Optimisation Algorithms 529
xk+1 = xk − λk Sk gk , (E.19)
where δgk = gk+1 − gk and δxk = xk+1 − xk = −λk Sk δgk . Note that there exist in
the literature various other formulae for updating Sk , see e.g. Adby and Dempster
(1974) and Press et al. (1992). These techniques can be adapted and simplified
further depending on the objective function, such as the case of the sum of squares,
encountered in least square regression analysis, see e.g. Everitt (1987).
530 E Optimisation Algorithms
dx
= −∇F (x), (E.22)
dt
starting from a suitable initial condition, one should converge in principle to x∗ .
This method can be regarded as the continuous version of the steepest descent
algorithm. In fact, Eq. (E.22) becomes equivalent to the steepest algorithm when
dx xt+h −xt
dt is approximated by the simple finite difference h . The system of ODE,
Eq. (E.22), can be interpreted as the equation describing a particle moving in
a potential well given by F (.). Note that Eq. (E.22) can also be replaced by a
continuous Newton equation of the form:
dx
= −H−1 (x) ∇F (x), (E.23)
dt
E.5.1 Background
When the functions involved in Eq. (E.24) are convex or polynomials, the problem
is known under the name of mathematical programming. For instance, if f (.) is
quadratic or convex and the constraints are linear, efficient programming procedures
exist for the minimisation.
In general, most algorithms attempt to transform Eq. (E.24) into an unconstrained
problem. This can be done easily, via a change of variable, when the constraints are
simple. The following examples illustrate this.
• For constraints of the form x ≥ 0, the change of variable is x = y 2 .
• For a ≤ x ≤ b, one can have x = a+b 2 + 2 sin y.
b−a
h(x) + y 2 = 0.
the conditions given by Eq. (E.25) are known as Kuhn–Tucker optimality conditions
and express the stationarity of the Lagrangian:
r
m
L (x; u, v) = f (x) + ui gi (x) + vj hj (x) (E.26)
i=1 j =1
at x∗ for the optimum values u∗ and v∗ . Note that the first vector equation in
Eq. (E.25) can be solved by minimising the sum of squares of its elements, i.e.
2
min nk=1 ∂x ∂L
k
. In mathematical programming, system Eq. (E.25) is generally
referred to as the dual problem of Eq. (E.24).
5 Namely, linear independence between ∇hj (x∗ ) and ∇gi (x∗ ), i = 1, . . . , r, for all j satisfying
hj (x∗ ) = 0.
532 E Optimisation Algorithms
yielding a minimum xk+1 at the next iteration step k + 1. The multipliers uk+1 and
vk+1 are taken to be the optimal multipliers for the linearised constraints:
This method is based on linearising the constraints about the current point xk+1 .
More details can be found in Adby and Dempster (1974) and Gill et al. (1981).
Note that in
most iterative
techniques, an initial feasible point can be obtained by
minimising j hj (x) + i gi2 (x).
Penalty Function
where wj , j = 1, . . . , m, and ρ are parameters that can change value during the
minimisation, and usually ρ decreases to zero as the iteration number increases.
The functions G() and H () are penalty functions. For example, the function
E Optimisation Algorithms 533
u2
G(u, ρ) = (E.30)
ρ
is one of the widely used penalties. When inequality constraints are present, and
for a fixed ρ, the barrier function G() is non-zero in the interior of the feasible
region (hj (x) ≤ 0, j = 1, . . . , m) and infinite on its border. This maintains
iterates xk inside the feasible set, and as ρ → 0 the constrained minimum is
approached. Examples of barrier functions in this case include log (−h(x)) and
ρ
h2 (x)
. The following penalty function
wj 1 2
ρ3 + gi (x) (E.31)
j
h2j (x) ρ
i
Gradient Projection
Kx = 0, (E.32)
− g = u + KT w. (E.33)
where the summation is taken over the violated constraints at the current point xk .
Another search method, based on small step gradient, is given by
m
u = −∇f (xk ) − wj (xk )∇hj (xk ), (E.37)
j =1
where wj (xk ) = w if hj (xk ) > 0 (w is a suitably chosen large constant) and zero
otherwise, see Adby and Dempster (1974).
The ordinary differential equations-based method can also be used in constrained
minimisation in a similar way after the problem has been transformed into an
unconstrained minimisation problem, see e.g. Brown (1986) and Hannachi et al.
(2006) for the case of simplified EOFs.
Appendix F
Hilbert Space
This appendix introduces some concepts of linear vector spaces, metrics and Hilbert
spaces.
A metric d(., .) defined on a set X is a real valued function defined over X × X with
the following properties:
(i) d(x, y) = d(y, x),
(ii) d(x, y) = 0 if and only if x = y,
(iii) d(x, y) ≤ d(x, z) + d(z, y), for all x, y and z in X .
A set X with a metric d(., .) is referred to as a metric space (X , d).
F.2.1 Norm
F.2.3 Consequences
converges to an element x0 in X if
lim d (xn , x0 ) = 0.
n→∞
lim xn − x0 = 0.
n→∞
The existence of an inner product in a linear vector space X allows the definition of
orthogonality as follows. Two vectors x and y are orthogonal, denoted by x ⊥ y, if
< x, y >= 0.
F.2.4 Properties
1. A normed linear space, with norm . , defines a metric space with the metric
given by d(x, y) = x − y .
2. An inner product space X , with inner product < ., . >, is a normed linear space
with the norm defined by x =< x, x >1/2 , and is consequently a metric space.
3. For any x and y in an inner product space X , the following properties hold.
• | < x, y > | ≤ x y ,
• x+y 2+ x−y 2 = 2 x 2 + 2 y 2 . This is known as the parallelogram
identity.
4. Given an n-dimensional linear vector space with an inner product, one can always
construct an orthonormal basis (u1 , . . . , un ), i.e. < uk , ul >= δkl .
5. Also, the limit of the sum of two sequences in an inner product space is the sum
of the limit of the sequences. Similarly, the limit of the inner product of two
sequences is the inner product of the limits of the corresponding sequences.
538 F Hilbert Space
F.3.1 Completeness
(1987).
A fundamental result in Hilbert spaces concerns the concept of approximation of
vectors from the Hilbert space by vectors from subspaces. This result is expressed
under the so-called projection theorem, given below (see e.g. Halmos 1951).
Projection Theorem Let U be a Hilbert space and V a Hilbert subspace of U . Let
also u be a vector in U but not in V, and v a vector in V. Then there exists a unique
vector v in V such that
u − v = min u − v .
v in V
• Example 1
F Hilbert Space 539
Consider the collection U of all (complex) random variables U , with zero mean and
finite variance, i.e. E(U ) = 0, and V ar(|U |2 ) < ∞, defined on some sample space.
The following operation defined for all random variables U and V in U by
< U, V >= E U ∗ V ,
where U ∗ is the complex conjugate of U , defines a scalar product over U and makes
U a Hilbert space, see e.g. Priestly (1981, p. 190).
Exercise Show that the above operation is well defined.
Hint Use the fact that V ar(λU + V ) ≥ 0 for all scalar λ to deduce that < U, V >
is well defined.
The theory of Hilbert space in stochastic processes and time series started towards
the late 1940s (Loève 1948) and was lucidly formulated by Parzen (1959, 1961) in
the context of random function (stochastic processes). The concept of Hilbert space,
and in particular the projection theorem, finds natural application in the theory of
time series prediction.
• Example 2
Designate by T a subset of the real numbers, and let {Xt , for t in T } be a stochastic
process (or random function) satisfying E |Xt |2 < ∞ for t in T . Such stochastic
is said to be second order. Let U be the set of random variables of the form
process
U = nk=1 ck Xtk , where n is a positive integer, c1 , . . . , cn are scalars and t1 , . . . , tn
are elements in T . That is, U is the set of all finite linear combinations of random
variables Xt for t in T and is known as the space spanned by the random function
{Xt , for t in T }. The inner product < U, V >= E(U V ∗ ) induces an inner product
on U . The space U , extended by including all random variables that are limit of
sequences in U , i.e. random variables W satisfying
lim Wn − W = 0
n→∞
Let H be the Hilbert space defined in example 1 above and {Xt , t = 0, ±1, ±2, . . .}
a (discrete) stochastic process. Let now Ht be the subset spanned by the sequence
Xt , Xt−1 , Xt−2 , . . .. Using the same reasoning as in example 2 above, Ht is a
Hilbert space.
540 F Hilbert Space
Let now m be a given positive integer, and our objective is to estimate Xt+m
using elements from Ht . This is the classical prediction problem, which seeks an
element X̂t+m in Ht satisfying
Xt+m − X̂t+m 2
= E |Xt+m − X̂t+m |2 = min Xt+m − Y 2
.
Y in Ht
Hence X̂t+m is simply the orthogonal projection of Xt+m onto Ht . From the
projection theorem, we get
E Xt+m − X̂t+m Y = 0,
and the term εn+h = Xn+h − X̂n+h is known as the forecast error.
Exercise Show that the solution to Eq. (F.1) is given by Eq. (F.2).
Hint Recall the condition fh (x)dx = 1 and use Lagrange multiplier.
An important result emerges when {Xt } is Gaussian, namely, E (Xn+h |Xt , t ≤ n)
is a linear function of Xt , t ≤ n, and this what makes the reason behind choosing
F Hilbert Space 541
The predictor X̂n+h is meant to optimally approximate the (unobserved) future value
of Xn+h of the time series. In stationary time series the forecast error εn+h is also
stationary, and its variance σ2 = E(εn+h
2 ) is the forecast error variance.
Prediction of multivariate time series is more subtle than single variables time series
not least because matrices are involved. Matrices have two main features, namely,
they do not (in general) commute, and they can be singular without being null.
In this appendix a brief review of the multivariate prediction problem is given.
For a full discussion on prediction of vector time series, the reader is referred to
Doob (1953), Wiener and Masani (1957, 1958), Helsen and Lowdenslager (1958),
Rozanov (1967), Masani (1966), Hannan (1970) and Koopmans (1974), and the
up-to-date text by Wei (2019).
T
Let xt = Xt1 , . . . , Xtp denote a p-dimensional second-order (E |xt |2 <
∞) zero-mean random vector. Let also {xt , t = 0, ±1, ±2, . . .} be a second-order
vector random function (or stochastic process, or time series), H the Hilbert space
spanned by this random function, i.e. the space spanned by Xt,k , k = 1, . . . , p,
t = 0, ±1, ±2, . . ., and finally, Hn the Hilbert space spanned by Xt,k , k = 1, . . . , p,
t ≤ n. A p-dimensional random vector y = Y1 , . . . , Yp is an element of Hn
if each component Yk , k = 1, . . . , p belongs to Hn . Stated otherwise Hn can be
regarded as composed of random vectors y that are finite linear combinations of
elements of the vector random functions of the form:
m
y= Ak xtk
k=1
u, v p = E uv∗ ,
542 F Hilbert Space
where (∗ ) stands for the transpose complex conjugate.1 Note that the norm over H
is the trace of the Gramian matrix, i.e.
p
x 2
= E|Xk |2 = tr xxT p .
k=1
As for the univariate case, the predictor x̂t+m of xt+m is given by the orthogonal
projection of xt+m onto Ht . The prediction error εt+m = xt+m − x̂t+m is orthogonal
to all vectors in Ht . Also, ε k is orthogonal to ε l for l = k, i.e.
E ε k ε Tl = δkl ,
where is the covariance matrix of the prediction error εk . The prediction error
variance tr E ε t+1 ε Tt+1 of the one-step ahead prediction is given in Chap. 8.
1 Thatis, the Gramian matrix consists of all the inner products between the individual components
of u and v.
Appendix G
Systems of Linear Ordinary Differential
Equations
dx
= Ax + b (G.1)
dt
with the initial condition x0 = x(t0 ), where A is a m × m real (or complex) matrix
and b is a m-dimensional real (or complex) vector. When A is constant, the solution
is quite simple, but when it is time-dependent the solution is slightly more elaborate.
Recall that the matrix exponential is defined by the (convergent) series $e^{A} = \sum_{k \ge 0} A^k/k!$, which can also be extended to $e^{tA}$ for any scalar t; one gets
$$\frac{d\,e^{tA}}{dt} = e^{tA} A = A e^{tA}.$$
Hence the solution to the homogeneous system
$$\frac{d\mathbf{x}}{dt} = A\mathbf{x} \qquad (G.3)$$
with initial condition $\mathbf{x}_0$ is $\mathbf{x}(t) = e^{(t-t_0)A}\,\mathbf{x}_0$.
Remark The above result can be used to solve the differential equation:
$$\frac{d^m y}{dt^m} + a_{m-1}\frac{d^{m-1} y}{dt^{m-1}} + \ldots + a_1\frac{dy}{dt} + a_0 y = 0 \qquad (G.5)$$
with initial conditions $y(t_0), \frac{dy(t_0)}{dt}, \ldots, \frac{d^{m-1}y(t_0)}{dt^{m-1}}$. The above ODE can be transformed into a system similar to Eq. (G.3), with the Frobenius matrix A given by
$$A = \begin{pmatrix} 0 & 1 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ -a_0 & -a_1 & \cdots & -a_{m-2} & -a_{m-1} \end{pmatrix},$$
and $\mathbf{x}(t) = \left(y(t), \frac{dy(t)}{dt}, \ldots, \frac{d^{m-1}y(t)}{dt^{m-1}}\right)^T$, with initial condition $\mathbf{x}_0 = \mathbf{x}(t_0)$.
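As a short sketch (added here, not part of the original text), the damped oscillator $y'' + 2\zeta\omega\,y' + \omega^2 y = 0$ can be written in the above companion form and propagated with the matrix exponential; `expm` from SciPy evaluates $e^{tA}$. The values of omega and zeta below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

# Minimal sketch: solve y'' + 2*zeta*omega*y' + omega**2*y = 0 through the
# Frobenius (companion) matrix and the matrix-exponential solution
# x(t) = expm(t*A) @ x0, with x = (y, dy/dt)^T and t0 = 0.
omega, zeta = 2.0, 0.1
A = np.array([[0.0, 1.0],
              [-omega**2, -2.0 * zeta * omega]])   # last row: -a0, -a1

x0 = np.array([1.0, 0.0])                          # y(0) = 1, y'(0) = 0
for t in (0.5, 1.0, 2.0):
    y, dy = expm(t * A) @ x0
    print(f"t = {t:3.1f}   y(t) = {y:+.4f}   y'(t) = {dy:+.4f}")
```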
Consider now the time-dependent homogeneous system
$$\frac{d\mathbf{x}}{dt} = A(t)\,\mathbf{x} \qquad (G.7)$$
with initial condition $\mathbf{x}_0$. The theory behind the integration of Eq. (G.7) is
based on using a set of independent solutions of the differential equation. If
x1 (t), . . . , xm (t) is a set of m solutions of Eq. (G.7) with respective initial conditions
x1 (t0 ), . . . , xm (t0 ), assumed to be linearly independent, then the matrix M(t) =
(x1 (t), . . . , xm (t)) satisfies the following system of ODEs:
$$\frac{dM}{dt} = A M. \qquad (G.8)$$
It turns out that if $M(t_0)$ is invertible, then the solution $M(t)$ to (G.8) is also invertible.
Remark It can be shown, see e.g. Said-Houari (2015) or Teschl (2012), that the
Wronskian W (t) = det (M(t)) satisfies the ODE:
$$\frac{dW}{dt} = \mathrm{tr}(A)\,W, \qquad (G.9)$$
or $W(t) = W(t_0)\exp\left(\int_{t_0}^{t} \mathrm{tr}(A(u))\,du\right)$. The Wronskian can be used to show that,
like M(t0 ), M(t) is also invertible.
The solution to Eq. (G.7) then takes the form $\mathbf{x}(t) = S(t, t_0)\,\mathbf{x}_0$, where $S(t, t_0) = M(t)M(t_0)^{-1}$ is the propagator. The same approach applies to the inhomogeneous system
$$\frac{d\mathbf{x}}{dt} = A(t)\,\mathbf{x} + \mathbf{b}(t), \qquad (G.12)$$
with initial condition $\mathbf{x}_0$, whose solution takes the form:
$$\mathbf{x}(t) = S(t, t_0)\,\mathbf{x}(t_0) + \int_{t_0}^{t} S(t, u)\,\mathbf{b}(u)\,du. \qquad (G.13)$$
It is worth mentioning here that Eq. (G.13) can be extended to the case when the
term b is a random forcing in relation to time-dependent multivariate autoregressive
models.
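A minimal numerical sketch (an illustration with an assumed A(t), added here and not part of the original text): the propagator $S(t, t_0)$ can be obtained by integrating Eq. (G.8) with $M(t_0) = I$, and the Wronskian relation (G.9) gives a consistency check on its determinant.

```python
import numpy as np
from scipy.integrate import solve_ivp, quad

# Minimal sketch: propagator S(t1, t0) of dx/dt = A(t) x for an assumed A(t),
# obtained by integrating dM/dt = A(t) M with the initial condition M(t0) = I.
def A(t):
    return np.array([[-0.5, np.sin(t)],
                     [0.0, -0.2 * np.cos(t)]])

def rhs(t, m_flat):
    M = m_flat.reshape(2, 2)
    return (A(t) @ M).ravel()

t0, t1 = 0.0, 3.0
sol = solve_ivp(rhs, (t0, t1), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
S = sol.y[:, -1].reshape(2, 2)                 # propagator S(t1, t0)

# Wronskian check (Eq. G.9): det S(t1, t0) = exp( int_{t0}^{t1} tr(A(u)) du )
integral, _ = quad(lambda u: np.trace(A(u)), t0, t1)
print("det S            :", np.linalg.det(S))
print("exp(int tr A du) :", np.exp(integral))
```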
The solution of the system
$$\dot{\mathbf{x}} = A(t)\,\mathbf{x}, \qquad (G.15)$$
with initial condition $\mathbf{x}_0 = \mathbf{x}(t_0)$, for a periodic m × m matrix A(t), i.e. A(t + T) = A(t) for some period T, is covered by the so-called Floquet theory (Floquet 1883). The solution takes the form $\mathbf{x}(t) = e^{\mu t}\,\mathbf{y}(t)$
for some periodic function $\mathbf{y}(t)$, and therefore need not be periodic. A set of m independent solutions $\mathbf{x}_1(t), \ldots, \mathbf{x}_m(t)$ makes up what is known as the fundamental matrix X(t), i.e. $X(t) = [\mathbf{x}_1(t), \ldots, \mathbf{x}_m(t)]$, and if the initial condition $X_0 = X(t_0)$ is the identity matrix, i.e. $X_0 = I_m$, then X(t) is called the principal fundamental matrix. It is therefore clear that the solution to Eq. (G.15) is $\mathbf{x}(t) = X(t)X_0^{-1}\,\mathbf{x}_0$, where X(t) is a fundamental matrix.
An important result from Floquet theory is that if X(t) is a fundamental matrix so is X(t + T), and that there exists a nonsingular matrix B such that X(t + T) = X(t)B. Using the Wronskian, Eq. (G.9), one gets the determinant of B, i.e. $|B| = \exp\left(\int_0^T \mathrm{tr}(A(u))\,du\right)$. Furthermore, the eigenvalues of B, or characteristic multipliers, which can be written as $e^{\mu_1 T}, \ldots, e^{\mu_m T}$, yield the so-called characteristic (or Floquet) exponents $\mu_1, \ldots, \mu_m$.
Remark In terms of the resolvent, see Sect. G.2, the propagator S(t, τ) is the principal fundamental matrix.
The characteristic exponents, which may be complex, are not unique but the characteristic multipliers are. In addition, the system (or the origin) is asymptotically stable if the characteristic exponents have negative real parts. It can be seen that if $\mathbf{u}$ is an eigenvector of B with eigenvalue $\rho = e^{\mu T}$, then $\mathbf{x}(t) = X(t)\mathbf{u}$ is a solution to Eq. (G.15), and that $\mathbf{x}(t + T) = \rho\,\mathbf{x}(t)$. The solution then takes the form $\mathbf{x}(t) = e^{\mu t}\left(e^{-\mu t}\mathbf{x}(t)\right) = e^{\mu t}\,\mathbf{y}(t)$, where precisely the vector $\mathbf{y}(t) = e^{-\mu t}\mathbf{x}(t)$ is T-periodic.
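The sketch below (an illustrative computation with an assumed T-periodic A(t), added here and not part of the original text) obtains the matrix B (the monodromy matrix) by integrating the principal fundamental matrix over one period; its eigenvalues are the characteristic multipliers, from which the Floquet exponents and the stability of the origin follow.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Minimal sketch: characteristic multipliers/exponents of x' = A(t) x for an
# assumed 2*pi-periodic matrix A(t).  With X(0) = I, B = X(T) (monodromy matrix).
T = 2.0 * np.pi
def A(t):
    return np.array([[-0.1 + 0.5 * np.cos(t), 1.0],
                     [-1.0, -0.1 - 0.5 * np.cos(t)]])

def rhs(t, x_flat):
    X = x_flat.reshape(2, 2)
    return (A(t) @ X).ravel()

sol = solve_ivp(rhs, (0.0, T), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
B = sol.y[:, -1].reshape(2, 2)                   # monodromy matrix

rho = np.linalg.eigvals(B).astype(complex)       # characteristic multipliers
mu = np.log(rho) / T                             # Floquet exponents (principal branch)
print("multipliers:", np.round(rho, 4))
print("exponents  :", np.round(mu, 4))
print("asymptotically stable:", bool(np.all(mu.real < 0)))
```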
Appendix H
Links for Software Resource Material
A CRAN (R programming language) package for EOFs and EOF rotation by Alan
Jassby is here:
https://www.rdocumentation.org/packages/wq/versions/0.4.8/topics/eof.
The site of David M. Kaplan provides Matlab codes for EOFs and varimax rotation:
https://websites.pmc.ucsc.edu/~dmk/notes/EOFs/EOFs.html.
Mathworks provides codes for PCA, factor analysis and factor rotation using
different rotation criteria at:
https://uk.mathworks.com/help/stats/rotatefactors.html.
https://uk.mathworks.com/help/stats/analyze-stock-prices-using-factor-analysis.
html.
There are also freely available Matlab source codes of factor analysis at
freesourcecode.net:
http://freesourcecode.net/matlabprojects/57962/factor-analysis-by-the-principal-
components-method.--in-matlab#.XysoXfJS80o.
Python (and R) PCA and varimax rotation can be found at this site:
https://mathtuition88.com/2019/09/13/python-code-for-pca-rotation-varimax-
matrix/.
An R package provided by MetNorway, including EOF, CCA and more, can be found
here:
https://rdrr.io/github/metno/esd/man/ERA5.CDS.html.
The site of Imad Dabbura from HMS provides coding implementations in R and
Python at:
https://imaddabbura.github.io/.
Mathworks also provides software for recurrent NNs used in time series forecasting:
https://uk.mathworks.com/help/deeplearning/.
The site of Dr Qadri Hamarsheh provides the lecture notes “Neural Network and Fuzzy Logic: Self-Organizing Map Using Matlab” here:
http://www.philadelphia.edu.jo/academics/qhamarsheh/uploads/Lecture%2016_
Self-organizing%20map%20using%20matlab.pdf.
The book by Nielsen (2015) provides a hands-on approach to NNs (and deep learning)
with Python (2.7) here:
http://neuralnetworksanddeeplearning.com/about.html.
The book by Buduma (2017) provides codes for deep learning in Tensorflow at:
https://github.com/darksigma/Fundamentals-of-Deep-Learning-Book.
The book by Chollet (2018) provides an exploration of deep learning from scratch
with Python codes here:
https://www.manning.com/books/deep-learning-with-python.
forest-in-python-77bf308a9b76.
Time series forecasting with random forest via time delay embedding (in the R programming language), by Manuel Tilgner:
https://www.statworx.com/at/blog/time-series-forecasting-with-random-forest/.
A recurrent NN library for LSTM, multidimensional RNN, and more, can be found
here:
https://sourceforge.net/projects/rnnl/.
References

Absil P-A, Mahony R, Sepulchre R (2010) Optimization on manifolds: Methods and applications. In: Diehl M, Glineur F, Michiels EJ (eds) Recent advances in optimization and its applications in engineering. Springer, pp 125–144
Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary
coins. J Comput Syst Sci 66:671–687
Ackoff RL (1989) From data to wisdom. J Appl Syst Anal 16:3–9
Adby PR, Dempster MAH (1974) Introduction to optimization methods. Chapman and Hall,
London
Aires F, Rossow WB, Chédin A (2002) Rotation of EOFs by the independent component analysis:
toward a solution of the mixing problem in the decomposition of geophysical time series. J
Atmospheric Sci 59:111–123
Aires F, Chédin A, Nadal J-P (2000) Independent component analysis of multivariate time series:
application to the tropical SST variability. J Geophys Res 105(D13):17437–17455
Akaike H (1969) Fitting autoregressive models for prediction. Ann Inst Stat Math 21:243–247
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Auto Control
19:716–723
Allen MR, Smith LA (1997) Optimal filtering in singular spectrum analysis. Phys Lett A 234:419–
423
Allen MR, Smith LA (1996) Monte Carlo SSA: Detecting irregular oscillations in the presence of
colored noise. J Climate 9:3373–3404
Aluffi-Pentini F, Parisi V, Zirilli F (1984) Algorithm 617: DAFNE: a differential-equations
algorithm for nonlinear equations. Trans Math Soft 10:317–324
Amari S-I (1990) Mathematical foundation of neurocomputing. Proc IEEE 78:1443–1463
Ambaum MHP, Hoskins BJ, Stephenson DB (2001) Arctic oscillation or North Atlantic oscilla-
tion? J Climate 14:3495–3507
Ambaum MHP, Hoskins BJ, Stephenson DB (2002) Corrigendum: Arctic oscillation or North
Atlantic oscillation? J Climate 15:553
Ambrizzi T, Hoskins BJ, Hsu H-H (1995) Rossby wave propagation and teleconnection patterns in
the austral winter. J Atmos Sci 52:3661–3672
Ambroise C, Seze G, Badran F, Thiria S (2000) Hierarchical clustering of self-organizing maps for
cloud classification. Neurocomputing 30:47–52. ISSN: 0925–2312
Anderson JR, Rosen RD (1983) The latitude-height structure of 40–50 day variations in atmo-
spheric angular momentum. J Atmos Sci 40:1584–1591
Anderson TW (1963) Asymptotic theory for principal component analysis. Ann Math Statist 34:122–148
Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. Wiley, New
York
Angell JK, Korshover J (1964) Quasi-biennial variations in temperature, total ozone, and
tropopause height. J Atmos Sci 21:479–492
Ångström A (1935) Teleconnections of climatic changes in present time. Geografiska Annaler
17:242–258
Annas S, Kanai T, Koyama S (2007) Principal component analysis and self-organizing map for
visualizing and classifying fire risks in forest regions. Agricul Inform Res 16:44–51. ISSN:
1881–5219
Asimov D (1985) The grand tour: A tool for viewing multidimensional data. SIAM J Sci Statist
Comp 6:128–143
Adachi K, Trendafilov N (2019) Some inequalities contrasting principal component and factor
analyses solutions. Jpn J Statist Data Sci. https://doi.org/10.1007/s42081-018-0024-4
Astel A, Tsakouski S, Barbieri P, Simeonov V (2007) Comparison of self-organizing maps
classification approach with cluster and principal components analysis for large environmental
data sets. Water Research 41:4566–4578. ISSN: 0043-1354
Bach F, Jordan M (2002) Kernel independent component analysis. J Mach Learn Res 3:1–48
Bagrov NA (1959) Analytical presentation of the sequences of meteorological patterns by means
of the empirical orthogonal functions. TSIP Proceeding 74:3–24
Bagrov NA (1969) On the equivalent number of independent data (in Russian). Tr Gidrometeor
Cent 44:3–11
Baker CTH (1974) Methods for integro-differential equations. In: Delves LM, Walsh J (eds)
Numerical solution of integral equations. Oxford University Press, Oxford
Baldwin MP, Gray LJ, Dunkerton TJ, Hamilton K, Haynes PH, Randel WJ, Holton JR, Alexander
MJ, Hirota I, Horinouchi T, Jones DBA, Kinnersley JS, Marquardt C, Sao K, Takahas M (2001)
The Quasi-biennial oscillation. Rev Geophys 39:179–229
Baldwin MP, Stephenson DB, Jolliffe IT (2009) Spatial weighting and iterative projection methods for EOFs. J Climate 22:234–243
Barbosa SM, Andersen OB (2009) Trend patterns in global sea surface temperature. Int J Climatol
29:2049–2055
Barlow HB (1989) Unsupervised learning. Neural Computation 1:295–311
Barnett TP (1983) Interaction of the monsoon and Pacific trade wind system at interannual time scales. Part I: The equatorial case. Mon Wea Rev 111:756–773
Barnston AG, Livezey RE (1987) Classification, seasonality, and persistence of low-frequency atmospheric circulation patterns. Mon Wea Rev 115:1083–1126
Barnett TP (1984a) Interaction of the monsoon and the Pacific trade wind systems at interannual
time scales. Part II: The tropical band. Mon Wea Rev 112:2380–2387
Barnett TP (1984b) Interaction of the monsoon and the Pacific trade wind systems at interannual
time scales. Part III: A partial anatomy of the Southern Oscillation. Mon Wea Rev 112:2388–
2400
Barnett TP, Preisendorfer R (1987) Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Mon Wea Rev 115:1825–1850
Barnston AG, Ropelewski CF (1992) Prediction of ENSO episodes using canonical correlation
analysis. J Climate 5:1316–1345
Barreiro M, Marti AC, Masoller C (2011) Inferring long memory processes in the climate network
via ordinal pattern analysis. Chaos 21:13,101. https://doi.org/10.1063/1.3545273
Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London
Bartlett MS (1939) The standard errors of discriminant function coefficients. J Roy Statist Soc
Suppl. 6:169–173
Bartlett MS (1950) Periodogram analysis and continuous spectra. Biometrika 37:1–16
Bartlett MS (1955) An introduction to stochastic processes. Cambridge University Press, Cam-
bridge
Basak J, Sudarshan A, Trivedi D, Santhanam MS (2004) Weather data mining using independent
component analysis. J Mach Lear Res 5:239–253
Basilevsky A, Hum PJ (1979) Karhunen-Loève analysis of historical time series with application
to Plantation birth in Jamaica. J Am Statist Ass 74:284–290
Basilevsky A (1983) Applied matrix algebra in the statistical science. North Holland, New York
Bauckhage C, Thurau C (2009) Making archetypal analysis practical. In: Pattern recognition,
Lecture Notes in Computer Science, vol 5748. Springer, Berlin, Heidelberg, pp 272–281.
https://doi.org/10.1007/978-3-642-03798-6-28
Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Phil Trans 53:370
Beatson RK, Cherrie JB, Mouat CT (1999) Fast fitting of radial basis functions: Methods based on
preconditioned GMRES iteration. Adv Comput Math 11:253–270
Beatson RK, Light WA, Billings S (2000) Fast solution of the radial basis function interpolation
equations: Domain decomposition methods. SIAM J Sci Comput 200:1717–1740
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data represen-
tation. Neural Comput 15:1373–1396
Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press, Princeton
Bell AJ, Sejnowski TJ (1995) An information-maximisation approach to blind separation and blind
deconvolution. Neural Computing 7:1004–1034
Bell AJ, Sejnowski TJ (1997) The “independent components” of natural scenes are edge filters.
Vision Research 37:3327–3338
Belouchrani A, Abed-Meraim K, Cardoso J-F, Moulines E (1997) A blind source separation
technique using second order statistics. IEEE Trans Signal Process 45:434–444
Bentler PM, Tanaka JS (1983) Problems with EM algorithms for ML factor analysis. Psychome-
trika 48:247–251
Berthouex PM, Brown LC (1994) Statistics for environmental engineers. Lewis Publishers, Boca
Raton
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford, 482 p.
Bishop CM (2006) Pattern recognition and machine learning. Springer series in information
science and statistics. Springer, New York, 758 p.
Bjerknes J (1969) Atmospheric teleconnections from the equatorial Pacific. Mon Wea Rev 97:163–
172
Björnsson H, Venegas SA (1997) A manual for EOF and SVD analyses of climate data. Report
No 97-1, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global
Change Research, McGill University, p 52
Blumenthal MB (1991) Predictability of a coupled ocean-atmosphere model. J Climate 4:766–784
Bloomfield P, Davis JM (1994) Orthogonal rotation of complex principal components. Int J
Climatol 14:759–775
Bock H-H (1986) Multidimensional scaling in the framework of cluster analysis. In: Degens P,
Hermes H-J, Opitz O (eds) Studien Zur Klasszfikation. INDEKS-Verlag, Frankfurt, pp 247–
258
Bock H-H (1987) On the interface between cluster analysis, principal component analysis, and
multidimensional scaling. In: Bozdogan H, Kupta AK (eds) Multivariate statistical modelling
and data analysis. Reidel, Boston
Boers N, Donner RV, Bookhagen B, Kurths J (2014) Complex network analysis helps to identify
impacts of the El Niño Southern Oscillation on moisture divergence in South America. Clim
Dyn. https://doi.org/10.1007/s00382-014-2265-7
Bolton RJ, Krzanowski WJ (2003) Projection pursuit clustering for exploratory data analysis. J Comput Graph Statist 12:121–142
Bonnet G (1965) Théorie de l'information – sur l'interpolation optimale d'une fonction aléatoire échantillonnée. C R Acad Sci Paris 260:297–343
Bookstein FL (1989) Principal warps: thin plate splines and the decomposition of deformations.
IEEE Trans Pattern Anal Mach Intell 11:567–585
Borg I, Groenen P (2005) Modern multidimensional scaling. Theory and applications, 2nd edn.
Springer, New York
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Haussler D (ed) Proceedings of the 5th annual ACM workshop on computational learning theory. ACM Press, Pittsburgh, pp 144–152
Botsaris CA, Jacobson HD (1976) A Newton-type curvilinear search method for optimisation. J
Math Anal Applics 54:217–229
Botsaris CA (1978) A class of methods for unconstrained minimisation based on stable numerical
integration techniques. J Math Anal Applics 63:729–749
Botsaris CA (1979) A Newton-type curvilinear search method for constrained optimisation. J Math
Anal Applics 69:372–397
Botsaris CA (1981) Constrained optimisation along geodesics. J Math Anal Applics 79:295–306
Box MJ, Davies D, Swann WH (1969) Non-linear optimization techniques. Oliver and Boyd, Edinburgh
Box GEP, Jenkins MG, Reinsel CG (1994) Time series analysis: forecasting and control. Prentice
Hall, New Jersey
Box GEP, Jenkins MG (1970) Time series analysis. Forecasting and control. Holden-Day, San Francisco (Revised and published in 1976)
Branstator G, Berner J (2005) Linear and nonlinear Signatures in the planetary wave dynamics of
an AGCM: Phase space tendencies. J Atmos Sci 62:1792–1811
Breakspear M, Brammer M, Robinson PA (2003) Construction of multivariate surrogate sets from
nonlinear data using the wavelet transform. Physica D 182:1–22
Breiman L (2001) Random forests. Machine Learning 45:5–32
Bretherton CS, Smith C, Wallace JM (1992) An intercomparison of methods for finding coupled
patterns in climate data. J Climate 5:541–560
Bretherton CS, Widmann M, Dymnykov VP, Wallace JM, Bladé I (1999) The effective number of
spatial degrees of freedom of a time varying field. J Climate 12:1990–2009
Brillinger DR, Rosenblatt M (1967) Computation and interpretation of k-th order spectra. In: Harris
B (ed) Spectral analysis of time series. Wiley, New York, pp 189–232
Brillinger DR (1981) Time series-data: analysis and theory. Holden-Day, San-Francisco
Brink KH, Muench RD (1986) Circulation in the point conception-Santa Barbara channel region.
J Geophys Res C 91:877–895
Brockwell PJ, Davis RA (1991) Time series: theory and methods, 2nd edn. Springer, New York
Brockwell PJ, Davis RA (2002) Introduction to time series and forecasting. Springer, New York
Brown AA (1986) Optimisation methods involving the solution of ordinary differential equations.
Ph.D. thesis, the Hatfield polytechnic, available from the British library
Brownlee J (2018) Statistical methods for machine learning. e-learning. ISBN-10. https://www.
unquotebooks.com/get/ebook.php?id=386nDwAAQBAJ
Broomhead DS, King GP (1986a) Extracting qualitative dynamics from experimental data. Physica
D 20:217–236
Broomhead DS, King GP (1986b) On the qualitative analysis of experimental dynamical systems.
In: Sarkar S (ed) Nonlinear phenomena and chaos. Adam Hilger, pp 113–144
Buduma N (2017) Fundamentals of deep learning, 1st edn. O’Reilly, Beijing
Bürger G (1993) Complex principal oscillation pattern analysis. J Climate 6:1972–1986
Burg JP (1972) The relationship between maximum entropy spectra and maximum likelihood
spectra. Geophysics 37:375–376
Cadzow JA, Li XK (1995) Blind deconvolution. Digital Signal Process J 5:3–20
Cadzow JA (1996) Blind deconvolution via cumulant extrema. IEEE Signal Process Mag (May
1996), 24–42
Cahalan RF, Wharton LE, Wu M-L (1996) Empirical orthogonal functions of monthly precipitation and temperature over the United States and homogeneous stochastic models. J Geophys Res 101(D21):26309–26318
Capua GD, Runge J, Donner RV, van den Hurk B, Turner AG, Vellore R, Krishnan R, Coumou
D (2020) Dominant patterns of interaction between the tropics and mid-latitudes in boreal
summer: Causal relationships and the role of time scales. Weather Climate Discuss. https://
doi.org/10.5194/wcd-2020-14.
Cardoso J-F (1989) Source separation using higher order moments. In: Proc. ICASSP’89, pp 2109–
2112
Cardoso J-F (1997) Infomax and maximum likelihood for source separation. IEEE Lett Signal
Process 4:112–114
Cardoso J-F, Souloumiac A (1993) Blind beamforming for non-Gaussian signals. IEE Proc F
140:362–370
Cardoso J-F, Hvam Laheld B (1996) Equivalent adaptive source separation. IEEE Trans Signal
Process 44:3017–3030
Carr JC, Fright RW, Beatson KR (1997) Surface interpolation with radial basis functions for
medical imaging. IEEE Trans Med Imag 16:96–107
Carreira-Perpiñán MA (2001) Continuous latent variable models for dimensionality reduction and
sequential data reconstruction. Ph.D. dissertation. Department of Computer Science, University
of Sheffield
Carroll JB (1953) An analytical solution for approximating simple structure in factor analysis.
Psychometrika 18:23–38
Carroll JD, Chang JJ (1970) Analysis of individual differences in multidimensional scaling via an n-way generalization of 'Eckart-Young' decomposition. Psychometrika 35:283–319
Cassano EN, Glisan JM, Cassano JJ, Gutowski Jr. WJ, Seefeldt MW (2015) Self-organizing map
analysis of widespread temperature extremes in Alaska and Canada. Clim Res 62:199–218
Cassano JJ, Cassano EN, Seefeldt MW, Gutowski WJ, Glisan JM (2016) Synoptic conditions
during wintertime temperature extremes in Alaska. J Geophys Res Atmos 121:3241–3262.
https://doi.org/10.1002/2015JD024404
Causa A, Raciti F (2013) A purely geometric approach to the problem of computing the projection
of a point on a simplex. JOTA 156:524–528
Cavazos T, Comrie AC, Liverman DM (2002) Intraseasonal variability associated with wet
monsoons in southeast Arizona. J Climate 15:2477–2490. ISSN: 0894-8755
Chan JCL, Shi J-E (1997) Application of projection-pursuit principal component analysis method
to climate studies. Int J Climatol 17(1):103–113
Charney JG, DeVore JG (1979) Multiple equilibria in the atmosphere and blocking. J Atmos Sci 36:1205–1216
Chatfield C (1996) The analysis of time series. An introduction 5th edn. Chapman and Hall, Boca
Raton
Chatfield C, Collins AJ (1980) Introduction to multivariate analysis. Chapman and Hall, London
Chatfield C (1989) The analysis of time series: An introduction. Chapman and Hall, London, 241 p
Chekroun MD, Kondrashov D (2017) Data-adaptive harmonic spectra and multilayer Stuart-
Landau models. Chaos 27:093110
Chen J-M, Harr PA (1993) Interpretation of extended empirical orthogonal function (EEOF)
analysis. Mon Wea Rev 121:2631–2636
Chen R, Zhang W, Wang X (2020) Machine learning in tropical cyclone forecast modeling: A
Review. Atmosphere 11:676. https://doi.org/10.3390/atmos11070676
Cheng X, Nitsche G, Wallace MJ (1995) Robustness of low-frequency circulation patterns derived
from EOF and rotated EOF analysis. J Climate 8:1709–1720
Chernoff H (1973) The use of faces to represent points in k-dimensional space graphically. J Am
Stat Assoc 68:361–368
Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21:5–30. https://doi.
org/10.1016/j.acha.2006.04.006
Chollet F (2018) Deep learning with Python. Manning Publications, New York, 361 p
Christiansen B (2009) Is the atmosphere interesting? A projection pursuit study of the circulation
in the northern hemisphere winter. J Climate 22:1239–1254
Cleveland WS, McGill R (1984) The many faces of a scatterplot. J Am Statist Assoc 79:807–822
Cleveland WS (1993) Visualising data. Hobart Press, New York
Comon P, Jutten C, Herault J (1991) Blind separation of sources, Part II: Problems statement. Signal Process 24:11–20
Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314
Cook D, Buja A, Cabrera J (1993) Projection pursuit indices based on expansions with orthonormal
functions. J Comput Graph Statist 2:225–250
Cover TM, Thomas JA (1991) Elements of information theory. Wiley Series in Telecommunica-
tion. Wiley, New York
Cox DD (1984) Multivariate smoothing spline functions. SIAM J Num Anal 21:789–813
Cox TF, Cox MAA (1994) Multidimensional scaling. Chapman and Hall, London
Craddock JM (1965) A meteorological application of principal component analysis. Statistician
15:143–156
Craddock JM (1973) Problems and prospects for eigenvector analysis in meteorology. Statistician
22:133–145
Craven P, Wahba G (1979) Smoothing noisy data with spline functions: estimating the correct
degree of smoothing by the method of generalized cross-validation. Numer Math 31:377–403
Cristianini N, Shawe-Taylor J, Lodhi H (2001) Latent semantic kernels. In: Brodley C, Danyluk
A (eds) Proceedings of ICML-01, 18th international conference in machine learning. Morgan
Kaufmann, San Francisco, pp 66–73
Crommelin DT, Majda AJ (2004) Strategies for model reduction: Comparing different optimal
bases. J Atmos Sci 61:2206–2217
Gupta AS (2004) Calculus of variations with applications. PHI Learning, India, 256p. ISBN: 9788120311206
Cutler A, Breiman L (1994) Archetypal analysis. Technometrics 36:338–347
Cutler A, Cutler DR, Stevens JR (2012) Random forests. In: Zhang C, Ma YQ (eds) Ensemble
machine learning. Springer, New York, pp 157–175
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signal
Sys 2:303–314
Daley R (1991) Atmospheric data analysis. Cambridge University Press, Cambridge, 457 p
Dasgupta S, Gupta A (2003) An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Struct Algorithm 22:60–65
Daubechies I (1992) Ten lectures on wavelets. Soc. for Ind. and Appl. Math., Philadelphia, PA
Davis JM, Estis FL, Bloomfield P, Monahan JF (1991) Complex principal components analysis of
sea-level pressure over the eastern USA. Int J Climatol 11:27–54
de Lathauwer L, de Moor B, Vandewalle J (2000) A multilinear singular value decomposition.
SIAM J Matrix Analy Appl 21:1253–1278
DeGroot MH, Shervish MJ (2002) Probability and statistics, 4th edn. Addison–Wesley, Boston,
p 893
DelSole T (2001) Optimally persistent patterns in time varying fields. J Atmos Sci 58:1341–1356
DelSole T (2006) Low-frequency variations of surface temperature in observations and simula-
tions. J Climate 19:4487–4507
DelSole T, Tippett MK (2009a) Average predictability time. Part I: theory. J Atmos Sci 66:1172–
1187
DelSole T, Tippett MK (2009b) Average predictability time. Part II: seamless diagnoses of
predictability on multiple time scales. J Atmos Sci 66:1188–1204
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM
algorithm. J Roy Statist Soc B 39:1–38
De Swart HE (1988) Low-order spectral models of the atmospheric circulation: A survey. Acta
Appl Math 11:49–96
Derouiche S, Mallet C, Hannachi A, Bargaoui Z (2020) Rainfall analysis via event features and
self-organizing maps with application to northern Tunisia. J Hydrol (revised)
Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12:793–
815
Diamantaras KI, Kung SY (1996) Principal component neural networks. Wiley, New York
Ding C, Li T, Jordan MI (2010) Convex and semi-nonnegative matrix factorizations. IEEE Trans Pattern Anal Mach Intell 32:45–55
Donges JF, Petrova I, Loew A, Marwan N, Kurths J (2015) How complex climate networks
complement eigen techniques for the statistical analysis of climatological data. Clim Dyn
45:2407–2424
Donges JF, Zou Y, Marwan N, Kurths J (2009) Complex networks in climate dynamics. Eur Phys
J Spec Top 174:157–179. https://doi.org/10.1140/epjst/e2009-01098-2
Dommenget D, Latif M (2002) A cautionary note on the interpretation of EOFs. J Climate 15:216–
225
Dommenget D (2007) Evaluating EOF modes against a stochastic null hypothesis. Clim Dyn 28:517–531
Donner RV, Zou Y, Donges JF, Marwan N, Kurths J (2010) Recurrence networks—a novel
paradigm for nonlinear time series analysis. New J Phys 12:033025. https://doi.org/10.1088/
1367-2630/12/3/033025
Donohue KD, Hennemann J, Dietz HG (2007) Performance of phase transform for detecting
sound sources with microphone arrays in reverberant and noisy environments. Signal Process
87:1677–1691
Doob JL (1953) Stochastic processes. Wiley, New York
Dorn M, von Storch H (1999) Identification of regional persistent patterns through principal
prediction patterns. Beitr Phys Atmos 72:15–111
Dwyer PS (1967) Some applications of matrix derivatives in multivariate analysis. J Am Statist
Ass 62:607–625
Ebdon RA (1960) Notes on the wind flow at 50 mb in tropical and subtropical regions in January
1957 and in 1958. Q J R Meteorol Soc 86:540–542
Ebert-Uphoff I, Deng Y (2012) Causal discovery for climate research using graphical models. J
Climate 25:5648–5665. https://doi.org/10.1175/JCLI-D-11-00387.1
Efron B (1979) Bootstrap methods: Another look at the Jackknife. Ann Stat 7:1–26
Efron B, Tibshirani RJ (1994) An introduction to bootstrap. Chapman-Hall, Boca-Raton. ISBN-13:
978-0412042317
Eslava G, Marriott FHC (1994) Some criteria for projection pursuit. Stat Comput 4:13–20
Eugster MJA, Leisch F (2011) Weighted and robust archetypal analysis. Comput Stat Data Anal
55:1215–1225
Eugster MJA, Leisch F (2013) Archetypes: Archetypal analysis. http://CRAN.R-project.org/
package=archetypes. R package version 2.1-2
Everitt BS (1978) Graphical techniques for multivariate data. Heinemann Educational Books,
London
Everitt BS (1984) An introduction to latent variable models. Chapman and Hall, London
Everitt BS (1987) Introduction to optimization methods and their application in statistics. Chapman
and Hall, London
Everitt BS (1993) Cluster analysis, 3rd edn. Academic Press, London, 170pp
Everitt BS, Dunn G (2001) Applied Multivariate Data Analysis, 2nd edn. Arnold, London
Evtushenko JG (1974) Two numerical methods of solving non-linear programming problems. Sov
Math Dokl 15:20–423
Evtushenko JG, Zhadan GV (1977) A relaxation method for solving problems of non-linear
programming. USSR Comput Math Math Phys 17:73–87
Fang K-T, Zhang Y-T (1990) Generalized multivariate analysis. Springer, 220p
Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy U (eds) (1996) Advances in knowledge
discovery and data mining. AAAI Press/The MIT Press, Menlo Park, CA
Ferguson GA (1954) The concept of parsimony in factor analysis. Psychometrika 18:23–38
Faddeev DK, Faddeeva NV (1963) Computational methods of linear algebra. W.H. Freeman and
Company, San Francisco
Fisher RA (1925) Statistical methods for research workers. Oliver & Boyd, Edinburgh
Fischer MJ, Paterson AW (2014) Detecting trends that are nonlinear and asymmetric on diurnal
and seasonal time scales. Clim Dyn 43:361–374
Fischer MJ (2016) Predictable components in global speleothem δ 18 O. Q Sci Rev 131:380–392
Fukuoka A (1951) A study of 10-day forecast (A synthetic report). Geophys Mag Tokyo XXII:177–
218
Galton F (1885) Regression towards mediocrity in hereditary stature. J Anthropological Inst
15:246–263
Gámez AJ, Zhou CS, Timmermann A, Kurths J (2004) Nonlinear dimensionality reduction in
climate data. Nonlin Process Geophys 11:393–398
Gardner WA, Napolitano A, Paura L (2006) Cyclostationarity: Half a century of research. Signal
Process 86:639–697
Gardner WA (1994) Cyclostationarity in communications and signal processing. IEEE Press, 504 p
Gardner WA, Franks LE (1975) Characterization of cyclostationary random signal processes. IEEE
Trans Inform Theory 21:4–14
Gavrilov A, Mukhin D, Loskutov E, Volodin E, Feigin A, Kurths J (2016) Method for reconstruct-
ing nonlinear modes with adaptive structure from multidimensional data. Chaos 26:123101.
https://doi.org/10.1063/1.4968852
Geary RC (1947) Testing for normality. Biometrika 34:209–242
Gelfand IM, Vilenkin NYa (1964) Generalized functions-vol 4: Applications of harmonic analysis.
Academic Press
Ghil M, Allen MR, Dettinger MD, Ide K, Kondrashov D, Mann ME, Robertson AW, Saunders
A, Tian Y, Varadi F, Yiou P (2002) Advanced spectral methods for climatic time series. Rev
Geophys 40:1.1–1.41
Giannakis D, Majda AJ (2012) Nonlinear laplacian spectral analysis for time series with intermit-
tency and low-frequency variability. Proc Natl Acad Sci USA 109:2222–2227
Giannakis D, Majda AJ (2013) Nonlinear laplacian spectral analysis: capturing intermittent and
low-frequency spatiotemporal patterns in high-dimensional data. Stat Anal Data Mining 6.
https://doi.org/10.1002/sam.11171
Gibbs JW (1902) Elementary principles in statistical mechanics developed with especial reference
to the rational foundation of thermodynamics. Yale University Press, New Haven, CT.
Republished by Dover, New York in 1960
Gibson J (1994) What is the interpretation of spectral entropy? In: Proceedings of IEEE
international symposium on information theory, p 440
Gibson PB, Perkins-Kirkpatrick SE, Uotila P, Pepler AS, Alexander LV (2017) On the use of self-
organizing maps for studying climate extremes. J Geophys Res Atmos 122:3891–3903. https://
doi.org/10.1002/2016JD026256
Gibson PB, Perkins-Kirkpatrick SE, Renwick JA (2016) Projected changes in synoptic weather
patterns over New Zealand examined through self-organizing maps. Int J Climatol 36:3934–
3948. https://doi.org/10.1002/joc.4604
Gill PE, Murray W, Wright HM (1981) Practical optimization. Academic Press, London
Gilman DL (1957) Empirical orthogonal functions applied to thirty-day forecasting. Sci Rep No 1,
Department of Meteorology, Mass Inst of Tech, Cambridge, Mass, 129pp.
Girshik MA (1939) On the sampling theory of roots of determinantal equations. Ann Math Statist
43:128–136
Glahn HR (1962) An experiment in forecasting rainfall probabilities by objective methods. Mon
Wea Rev 90:59–67
Goerg GM (2013) Forecastable components analysis. J Mach Learn Res Workshop Conf Proc
28:64–72
Goldfeld SM, Quandt RE, Trotter HF (1966) Maximization by quadratic hill-climbing. Economet-
rica 34:541–551
Golub GH, van Loan CF (1996) Matrix computation. John Hopkins University Press, Baltimore,
MD
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, 749 p.
http://www.deeplearningbook.org
Gordon AD (1999) Classification, 2nd edn. Chapman and Hall, 256 p
Gordon AD (1981) Classification: methods for the exploratory analysis of multivariate data.
Chapman and Hall, London
Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate
analysis. Biometrika 53:325–338
Graybill FA (1969) Introduction to matrices with application in statistics. Wadsworth, Belmont, CA
Graystone P (1959) Meteorological office discussion−Tropical meteorology. Meteorol Mag
88:113–119
Grenander U, Rosenblatt M, (1957) Statistical analysis of time series. Wiley, New York
Grimmer M (1963) The space-filtering of monthly surface temperature anomaly data in terms of
pattern using empirical orthogonal functions. Q J Roy Meteorol Soc 89:395–408
Hackbusch W (1995) Integral equations: theory and numerical treatment. Birkhauser Verlag, Basel
Haghroosta T (2019) Comparative study on typhoon’s wind speed prediction by a neural networks
model and a hydrodynamical model. MethodsX 6:633–640
Haines K, Hannachi A (1995) Weather regimes in the Pacific from a GCM. J Atmos Sci 52:2444–2462
Hall A, Manabe S (1997) Can local linear stochastic theory explain sea surface temperature and
salinity variability? Clim Dyn 13:167–180
Hall, P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann
Statist 17:589–605
Halmos PR (1951) Introduction to Hilbert space. Chelsea, New York
Halmos PR (1972) Positive approximants of operators. Indiana Univ Math J 21:951–960
Hamlington BD, Leben RR, Nerem RS, Han W, Kim K-Y (2011) Reconstructing sea level using
cyclostationary empirical orthogonal functions. J Geophys Res 116:C12015. https://doi.org/10.
1029/2011JC007529
Hamlington BD, Leben RR, Strassburg MW, Kim K-Y (2014) Cyclostationary empirical orthogo-
nal function sea-level reconstruction. Geosci Data J 1:13–19
Hamming RW (1980) Coding and information theory. Prentice-Hall, Englewood Cliffs, New Jersey
Hannachi A, Allen M (2001) Identifying signals from intermittent low-frequency behaving
systems. Tellus A 53A:469–480
Hannachi A, Legras B (1995) Simulated annealing and weather regimes classification. Tellus
47A:955–973
Hannachi A, Iqbal W (2019) On the nonlinearity of winter northern hemisphere atmospheric
variability. J Atmos Sci 76:333–356
Hannachi A, Turner AG (2013a) Isomap nonlinear dimensionality reduction and bimodality of
Asian monsoon convection. Geophys Res Lett 40:1653–1658
Hannachi A, Turner GA (2013b) 20th century intraseasonal Asian monsoon dynamics viewed from
isomap. Nonlin Process Geophys 20:725–741
Hannachi A, Dommenget D (2009) Is the Indian Ocean SST variability a homogeneous diffusion
process. Clim Dyn 33:535–547
Hannachi A, Unkel S, Trendafilov NT, Jolliffe TI (2009) Independent component analysis of
climate data: A new look at EOF rotation. J Climate 22:2797–2812
Hannachi, A (2010) On the origin of planetary-scale extratropical winter circulation regimes. J
Atmos Sci 67:1382–1401
Hannachi A (1997) Low frequency variability in a GCM: three-dimensional flow regimes and their
dynamics. J Climate 10:1357–1379
Hannachi A, O’Neill A (2001) Atmospheric multiple equilibria and non-Gaussian behaviour in
model simulations. Q J R Meteorol Soc 127:939–958
Hannachi A (2008) A new set of orthogonal patterns in weather and climate: Optimally interpolated patterns. J Climate 21:6724–6738
Hannachi A, Jolliffe TI, Trendafilov N, Stephenson DB (2006) In search of simple structures in
climate: Simplifying EOFs. Int J Climatol 26:7–28
Hannachi A, Jolliffe IT, Stephenson DB (2007) Empirical orthogonal functions and related techniques in atmospheric science: A review. Int J Climatol 27:1119–1152
Hannachi A (2007) Pattern hunting in climate: A new method for finding trends in gridded climate
data. Int J Climatol 27:1–15
Hannachi A (2000) A probabilistic-based approach to optimal filtering. Phys Rev E 61:3610–3619
Hewitson BC, Crane RG (1994) Neural nets: Applications in geography. Springer, New York.
ISBN: 978-07-923-2746-2
Higham NJ (1988) Computing nearest symmetric positive semi-definite matrix. Linear Algebra
Appl 103:103–118
Hill T, Marquez L, O’Connor M, Remus W (1994) Artificial neural network models for forecasting
and decision making. Int J Forecast 10:5–15
Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of hand written digits.
IEEE Trans Neural Netw 8:65–74
Hirsch MW, Smale S (1974) Differential equations, dynamical systems, and linear algebra.
Academic Press, London
Hochstadt H (1973) Integral equations. Wiley, New York
Hodges JL, Lehmann EL (1956) The efficiency of some non-parametric competitors of the t-test.
Ann Math Statist 27:324–335
Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems.
Technometrics 12:55–67
Holsheimer M, Siebes A (1994) Data mining: The search for knowledge in databases. Technical
Report CS-R9406, CWI Amsterdam
Horel JD (1981) A rotated principal component analysis of the interannual variability of the Northern Hemisphere 500 mb height field. Mon Wea Rev 109:2080–2092
Horel JD (1984) Complex principal component analysis: Theory and examples. J Climate Appl
Meteor 23:1660–1673
Horn RA, Johnson CA (1985) Matrix analysis. Cambridge University Press, Cambridge
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks
4:251–257
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Networks 2:359–366
Horton DE, Johnson NC, Singh D, Swain DL, Rajaratnam B, Diffenbaugh NS (2015) Contribution
of changes in atmospheric circulation patterns to extreme temperature trends. Nature 522:465–
469. https://doi.org/10.1038/nature14550
Hosking JRM (1990) L-moments: analysis and estimation of distributions using linear combina-
tions of order statistics. J R Statist Soc B 52:105–124
Hoskins BJ, Karoly DJ (1981) The steady linear response of a spherical atmosphere to thermal and orographic forcing. J Atmos Sci 38:1179–1196
Hoskins BJ, Ambrizzi T (1993) Rossby wave propagation on a realistic longitudinally varying
flow. J Atmos Sci 50:1661–1671
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psych 24:417–441, 498–520
Hotelling H (1935) The most predictable criterion. J Educ Psych 26:139–142
Hotelling H (1936a) Simplified calculation of principal components. Psychometrika 1:27–35
Hotelling H (1936b) Relation between two sets of variables. Biometrika 28:321–377
Hsieh WW (2001a) Nonlinear canonical correlation analysis of the tropical Pacific climate
variability using a neural network approach. J Climate 14:2528–2539
Hsieh WW (2001b) Nonlinear principal component analysis by neural networks. Tellus 53A:599–
615
Hsieh WW (2009) Machine learning methods in the environmental sciences: neural networks and
kernels. Cambridge University Press, Cambridge
Hsieh W, Tang B (1998) Applying neural network models to prediction and data analysis in
meteorology and oceanography. Bull Am Meteorol Soc 79:1855–1870
Hubbert S, Baxter B (2001) Radial basis functions for the sphere. In: Haussmann W, Jetter
K, Reimer M (eds) Recent progress in multivariate approximation, 4th international confer-
ence, September 2000, Witten-Bommerholz. International Series of Numerical Mathematics,
vol. 137. Birkhäuser, Basel, pp 33–47
Huber PJ (1985) Projection pursuit. Ann Statist 13:435–475
Huber PJ (1981) Robust statistics. Wiley, New York, 308 p
Jolliffe IT, Trendafilov TN, Uddin M (2003) A modified principal component technique based on
the LASSO. J Comput Graph Stat 12:531–547
Jolliffe IT (1987) Rotation of principal components: Some comments. J Climatol 7:507–510
Jolliffe IT (1995) Rotation of principal components: Choice of normalization constraints. J Appl
Stat 22:29–35
Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York
Jones MC (1983) The projection pursuit algorithm for exploratory data analysis. Ph.D. Thesis,
University of Bath
Jones MC, Sibson R (1987) What is projection pursuit? J R Statist Soc A 150:1–36
Jones RH (1975) Estimating the variance of time averages. J Appl Meteor 14:159–163
Jöreskog KG (1967) Some contributions to maximum likelihood factor analysis. Psychometrika
32:443–482
Jöreskog KG (1969) A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika 34:183–202
Jung T-P, Makeig S, Mckeown MJ, Bell AJ, Lee T-W, Sejnowski TJ (2001) Imaging brain dynamics
using independent component analysis. Proc IEEE 89:1107–1122
Jungclaus J (2008) MPI-M earth system modelling framework: millennium full forcing experiment
(ensemble member 1). World Data Center for climate. CERA-DB “mil0010”. http://cera-www.
dkrz.de/WDCC/ui/Compact.jsp?acronym=mil0010
Jutten C, Herault J (1991) Blind separation of sources, part i: An adaptive algorithm based on
neuromimetic architecture. Signal Process 24:1–10
Kaiser HF (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187–200
Kano Y, Miyamoto Y, Shimizu S (2003) Factor rotation and ICA. In: Proceedings of the 4th
international symposium on independent component analysis and blind source separation
(Nara, Japan), pp 101–105
Kao SK (1968) Governing equations and spectra for atmospheric motion and transports in
frequency-wavenumber space. J Atmos Sci 25:32–38
Kapur JN (1989) Maximum-entropy models in science and engineering. Wiley, New York
Karlin S, Taylor HM (1975) A first course in stochastic processes, 2nd edn. Academic Press,
Boston
Karthick S, Malathi D, Arun C (2018) Weather prediction analysis using random forest algorithm.
Int J Pure Appl Math 118:255–262
Keller LB (1935) Expanding of limit theorems of probability theory on functions with continuous
arguments (in Russian). Works Main Geophys Observ 4:5–19
Kendall MG (1994) Advanced theory of statistics. Vol I: distribution theory, 6th edn. In: Stuart A,
Ord JK (eds). Arnold, London.
Kendall MG, Stuart A (1961) The advanced theory of statistics: Inference and relationships, 3rd
edn. Griffin, London.
Kendall MG, Stuart A (1977) The advanced Theory of Statistics. Volume 1: distribution theory,
4th edn. Griffin, London
Keogh EJ, Chu S, Hart D, Pazzani MJ (2001) An online algorithm for segmenting time series. In:
Proceedings 2001 IEEE international conference on data mining, pp 289–296. https://doi.org/
10.1109/ICDM.2001.989531
Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58:433–451
Khatri CG (1976) A note on multiple and canonical correlation for a singular covariance matrix.
Psychometrika 41:465–470
Khedairia S, Khadir MT (2008) Self-organizing map and k-means for meteorological day type
identification for the region of Annaba–Algeria. In: 7th computer information systems and
industrial management applications, Ostrava, pp 91–96. ISBN: 978-0-7695-3184-7
Kiers HAL (1994) Simplimax: Oblique rotation to an optimal target with simple structure.
Psychometrika 59:567–579
Kikkawa S, Ishida M (1988) Number of degrees of freedom, correlation times, and equivalent
bandwidths of a random process. IEEE Trans Inf Theory 34:151–155
Kiladis GN, Weickmann KM (1992) Circulation anomalies associated with tropical convection
during northern winter. Mon Wea Rev 120:1900–1923
Killworth PD, McIntyre ME (1985) Do Rossby-wave critical layers absorb, reflect or over-reflect?
J Fluid Mech 161:449–492
Kim K-Y, Hamlington B, Na H (2015) Theoretical foundation of cyclostationary EOF analysis for
geophysical and climate variables: concept and examples. Earth Sci Rev 150:201–218
Kim K-Y, North GR (1999) A comparison study of EOF techniques: analysis of non-stationary data with periodic statistics. J Climate 12:185–199
Kim K-Y, Wu Q (1999) A comparison study of EOF techniques: Analysis of nonstationary data
with periodic statistics. J Climate 12:185–199
Kim K-Y, North GR, Huang J (1996) EOFs of one-dimensional cyclostationary time series:
Computations, examples, and stochastic modeling. J Atmos Sci 53:1007–1017
Kim K-Y, North GR (1997) EOFs of harmonizable cyclostationary processes. J Atmos Sci
54:2416–2427
Kimoto M, Ghil M, Mo KC (1991) Spatial structure of the extratropical 40-day oscillation. In:
Proc. 8’th conf. atmos. oceanic waves and stability. Amer. Meteor. Soc., Boston, pp 115–116
Knighton J, Pleiss G, Carter E, Walter MT, Steinschneider S (2019) Potential predictability of
regional precipitation and discharge extremes using synoptic-scale climate information via
machine learning: an evaluation for the eastern continental United States. J Hydrometeorol
20:883–900
Knutson TR, Weickmann KM (1987) 30–60 day atmospheric oscillation: Composite life cycles of
convection and circulation anomalies. Mon Wea Rev 115:1407–1436
Kobayashi S, Ota Y, Harada Y, Ebita A, Moriya M, Onoda H, Onogi K, Kamahori H, Kobayashi
C, Endo H, Miyaoka K, Takahashi K (2015) The JRA-55 Reanalysis: General specifications
and basic characteristics. J Meteor Soc Jpn 93:5–48
Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin, 501 p
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biological
Cybernetics 43:59–69
Kohonen T (1990) The self-organizing map. Proc IEEE 78:1464–1480
Kolmogorov AN (1933) Foundations of the theory of probability (Grundbegriffe der Wahrschein-
lichkeitsrechnung). Translated by Nathan Morrison and Published by Chelsea Publishing
Company, New York, 1950
Kolmogorov AN (1939) Sur l’interpolation et l’extrapolation des suites stationaires. Comptes
Rendus Acad Sci Paris 208:2043–2045
Kolmogorov AN (1941) Stationary sequences in Hilbert space. Bull Math Univ Moscow 2:1–40
Kondrashov D, Chekroun MD, Yuan X, Ghil M (2018a) Data-adaptive harmonic decomposition
and stochastic modeling of Arctic sea ice. Dyn Statist Clim Syst 3:179–205
Kondrashov D, Chekroun MD, Berloff P (2018b) Multiscale Stuart-Landau emulators: Application to wind-driven ocean gyres. Fluids 3:21. https://doi.org/10.3390/fluids3010021
Kooperberg C, O'Sullivan F (1996) Predictive oscillation patterns: A synthesis of methods for spatial-temporal decomposition of random fields. J Am Statist Assoc 91:1485–1496
Koopmans LH (1974) The spectral analysis of time series. Academic Press, New York
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks.
AIChE J 37:233–243
Kress R, Martensen E (1970) Anwendung der rechteckregel auf die reelle Hilbertransformation
mit unendlichem intervall. Z Angew Math Mech 50:T61–T64
Krishnamurti TN, Chakraborty DR, Cubucku N, Stefanova L, Vijaya Kumar TSV (2003) A
mechanism of the Madden-Julian oscillation based on interactions in the frequency domain.
Q J R Meteorol Soc 129:2559–2590
Kruskal JB (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrika 29:1–27
Kruskal JB (1964b) Nonmetric multidimensional scaling: a numerical method. Psychometrika
29:115–129
Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of
multivariate observations by finding the linear transformation which optimizes a new ‘index
of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation, New York
Kruskal JB (1972) Linear transformations of multivariate data to reveal clustering. In: Multidi-
mensional scaling: theory and application in the behavioural sciences, I, theory. Seminar Press,
New York
Krzanowski WJ, Marriott FHC (1994) Multivariate analysis, Part 1. Distributions, ordination and
inference. Arnold, London
Krzanowski WJ (2000) Principles of multivariate analysis: A user’s perspective, 2nd edn. Oxford
University Press, Oxford
Krzanowski WJ (1984) Principal component analysis in the presence of group structure. Appl
Statist 33:164–168
Krzanowski WJ (1979) Between-groups comparison of principal components. J Am Statist Assoc
74:703–707
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional
neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural
information processing systems, vol 25. Curran Associates, Red Hook, NY, pp 1097–1105
Kubáčkouá L, Kubáček L, Kukuča J (1987) Probability and statistics in geodesy and geophysics.
Elsevier, Amsterdam
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Kundu PK, Allen JS (1976) Some three-dimensional characteristics of low-frequency current
fluctuations near the Oregon coast. J Phys Oceanogr 6:181–199
Kutzbach JE (1967) Empirical eigenvectors of sea-level pressure, surface temperature and
precipitation complexes over North America. J Appl Meteor 6:791–802
Kwon S (1999) Clustering in multivariate data: visualization, case and variable reduction. Ph.D.
Thesis, Iowa State University
Kwasniok F (1996) The reduction of complex dynamical systems using principal interaction
patterns. Physica D 92:28–60
Kwasniok F (1997) Optimal Galerkin approximations of partial differential equations using
principal interaction patterns. Phys Rev E 55:5365–5375
Kwasniok F (2004) Empirical low-order models of barotropic flow. J Atmos Sci 61:235–245
Labitzke K, van Loon H (1999) The stratosphere. Springer, New York
Laplace PS (1951) A philosophical essay on probabilities. Dover Publications, New York
Larsson E, Fornberg B (2003) A numerical study of some radial basis function based solution
methods for elliptic PDEs. Comput Math Appli 47:37–55
Laughlin S (1981) A simple coding procedure enhances a neuron’s information capacity. Z
Natureforsch 36c:910–912
Lawley DN (1956) Tests of significance for the latent roots of covariance and correlation matrices.
Biometrika 43:128–136
Lawley DN, Maxwell AE (1971) Factor analysis as a statistical method, 2nd edn. Butterworth,
London
Lanzante JR (1990) The leading modes of 10–30 day variability in the extratropics of the Northern
Hemisphere during the cold season. J Atmos Sci 47:2115–2140
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/
nature14539
Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices.
J Multivar Anal 88:365–411
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization.
Nature 401:788–791
Legates DR (1991) The effect of domain shape on principal components analyses. Int J Climatol
11:135–146
Legates DR (1993) The effect of domain shape on principal components analyses: A reply. Int J
Climatol 13:219–228
Leith CE (1973) The standard error of time-average estimates of climatic means. J Appl Meteorol
12:1066–1069
Leloup JA, Lachkar Z, Boulanger JP, Thiria S (2007) Detecting decadal changes in ENSO using
neural networks. Clim Dyn 28:147–162. https://doi.org/10.1007/s00382-006-0173-1. ISSN:
0930-7575
Leurgans SE, Moyeed RA, Silverman BW (1993) Canonical correlation analysis when the data are
curves. J R Statist Soc B 55:725–740
Li G, Ren B, Yang C, Zheng J (2011a) Revisiting the trend of the tropical and subtropical Pacific
surface latent heat flux during 1977–2006. J Geophys Res 116:D10115. https://doi.org/10.1029/
2010JD015444
Li G, Ren B, Zheng J, Yang C (2011b) Trend singular value decomposition analysis and its
application to the global ocean surface latent heat flux and SST anomalies. J Climate 24:2931–
2948
Lin G-F, Chen L-H (2006) Identification of homogeneous regions for regional frequency analysis
using the self-organizing map. J Hydrology 324:1–9. ISSN: 0022-1694
Lingoes JC, Roskam EE (1973) A mathematical and empirical analysis of two multidimensional
analysis scaling algorithms. Psychometrika 38(Monograph supplement):1–93
Linz P, Wang RLC (2003) Exploring numerical methods: An introduction to scientific computing
using MATLAB. Jones and Bartlett Publishers, Sudbury, MA
Lim Y-K, Kim K-Y (2006) A new perspective on the climate prediction of Asian summer monsoon
precipitation. J Climate 19:4840–4853
Lim Y-K, Cocke S, Shin DW, Schoof JT, LaRow TE, O’Brien JJ (2010) Downscaling large-scale
NCEP CFS to resolve fine-scale seasonal precipitation and extremes for the crop growing
seasons over the southeastern United States. Clim Dyn 35:449–471
Liu Y, Weisberg RH (2007) Ocean currents and sea surface heights estimated across the West
Florida Shelf. J Phys Oceanog 37:1697–1713. ISSN: 0022-3670
Liu Y, Weisberg RH, Mooers CNK (2006) Performance evaluation of the self-organizing map for
feature extraction. J Geophys Res 111:C05018. https://doi.org/10.1029/2005JC003117. ISSN:
0148-0227
Liu Y, Weisberg RH (2005) Patterns of ocean current variability on the West Florida Shelf using
the self-organizing map. J Geophys Res 110:C06003. https://doi.org/10.1029/2004JC002786
Loève M (1948) Fonctions aléatoires du second ordre. Supplement to P. Lévy: Processus Stochastiques et Mouvement Brownien. Gauthier-Villars, Paris
Loève M (1963) Probability theory. Van Nostrand Reinhold, New York
Loève M (1978) Probability theory, vol II, 4th edn. Springer, 413 p
Lorenz EN (1963) Deterministic non-periodic flow. J Atmos Sci 20:130–141
Lorenz EN (1970) Climate change as a mathematical problem. J Appl Meteor 9:325–329
Lorenz EN (1956) Empirical orthogonal functions and statistical weather prediction. Technical
report, Statistical Forecast Project Report 1, Dept. of Meteor., MIT, 49 p
Losada IJ, Reguero BG, Méndez FJ, Castanedo S, Abascal AJ, Minguez R (2013) Long-term
changes in sea-level components in Latin America and the Caribbean. Global Planetary Change
104:34–50
Lucio JH, Valdés R, Rodríguez LR (2012) Improvements to surrogate data methods for nonstation-
ary time series. Phys Rev E 85:056202
Luxburg U (2007) A tutorial on spectral clustering. Statist Comput 17:395–416
Lütkepohl H (2006) New introduction to multiple time series analysis. Springer, Berlin
Madden RA, Julian PR (1971) Detection of a 40–50 day oscillation in the zonal wind in the tropical
pacific. J Atmos Sci 28:702–708
Madden RA, Julian PR (1972) Description of global-scale circulation cells in the tropics with a
40–50 day period. J Atmos Sci 29:1109–1123
Madden RA, Julian PR (1994) Observations of the 40–50-day tropical oscillation – A review. Mon
Wea Rev 122:814–837
Magnus JR, Neudecker H (1995) Matrix differential calculus with applications in statistics and
econometrics. Wiley, Chichester
Malik N, Bookhagen B, Marwan N, Kurths J (2012) Analysis of spatial and temporal extreme
monsoonal rainfall over South Asia using complex networks. Clim Dyn 39:971–987. https://
doi.org/10.1007/s00382-011-1156-4
Malozemov VN, Pevnyi AB (1992) Fast algorithm of the projection of a point onto the simplex.
Vestnik St. Petersburg University 1(1):112–113
Mansour A, Jutten C (1996) A direct solution for blind separation of sources. IEEE Trans Signal
Process 44:746–748
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
Mardia KV (1980) Tests of univariate and multivariate normality. In: Krishnaiah PR (ed) Handbook
of statistics 1: Analysis of variance. North-Holland Publishing, pp 279–320
Martinez WL, Martinez AR (2002) Computational statistics handbook with MATLAB. Chapman
and Hall, Boca Raton
Martinez AR, Solka J, Martinez WL (2010) Exploratory data analysis with MATLAB, 2nd edn.
CRS Press, 530 p
Maruyama T (1997) The quasi-biennial oscillation (QBO) and equatorial waves – A historical
review. Pap Meteorol Geophys 47:1–17
Marwan N, Donges JF, Zou Y, Donner RV, Kurths J (2009) Complex network approach for
recurrence analysis of time series. Phys Lett A 373:4246–4254
Mathar R (1985) The best Euclidean fit to a given distance matrix in prescribed dimensions. Linear
Algebra Appl 67:1–6
Matsubara Y, Sakurai Y, van Panhuis WG, Faloutsos C (2014) FUNNEL: automatic mining
of spatially coevolving epidemics. In: KDD, pp 105–114 https://doi.org/10.1145/2623330.
2623624
Matthews AJ (2000) Propagation mechanisms for the Madden-Julian oscillation. Q J R Meteorol
Soc 126:2637–2651
Masani P (1966) Recent trends in multivariate prediction theory. In: Krishnaiah P (ed) Multivariate
analysis – I. Academic Press, New York, pp 351–382
Mazloff MR, Heimbach P, Wunsch C (2010) An eddy-permitting Southern Ocean State Estimate. J
Phys Oceanogr 40:880–899
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, London,
511 p
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
McDonald AJ, Cassano JJ, Jolly B, Parsons S, Schuddeboom A (2016) An automated satellite cloud
classification scheme using self-organizing maps: Alternative ISCCP weather states. J Geophys
Res Atmos 121:13,009–13,030. https://doi.org/10.1002/2016JD025199
McEliece RJ (1977) The theory of information and coding. Addison-Wesley, Reading, MA
McGee VE (1968) Multidimensional scaling of N sets of similarity measures: a nonmetric
individual differences approach. Multivar Behav Res 3:233–248
McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York
McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley Interscience,
545 p
Meila M, Shi J (2000) Learning segmentation by random walks. In: Proceedings of NIPS, pp 873–
879
Mercer J (1909) Functions of positive and negative type and their connection with the theory of
integral equations. Trans Lond Phil Soc A 209:415–446
Merrifield MA, Winant CD (1989) Shelf circulation in the Gulf of California: A description of the
variability. J Geophys Res 94:18133–18160
Merrifield MA, Guza RT (1990) Detecting propagating signals with complex empirical orthogonal
functions: A cautionary note. J Phys Oceanogr 20:1628–1633
Mestas-Nuñez AM (2000) Orthogonality properties of rotated empirical modes. Int J Climatol
20:1509–1516
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculation
by fast computing machines. J Chem Phys 21:1087–1092
Meyer Y (1992) Wavelets and operators. Cambridge University Press, New York, 223 p
Meza-Padilla R, Enriquez C, Liu Y, Appendini CM (2019) Ocean circulation in the western Gulf
of Mexico using self-organizing maps. J Geophys Res Oceans 124:4152–4167. https://doi.org/
10.1029/2018JC014377
Micchelli CA (1986) Interpolation of scattered data: Distance matrices and conditionally positive
definite functions. Constr Approx 2:11–22
Michelot C (1986) A finite algorithm for finding the projection of a point onto the canonical
simplex of Rn . JOTA 50:195–200
Mirsky L (1955) An introduction to linear algebra. Oxford University Press, Oxford, 896pp
Mitchell TM (1998) Machine learning. McGraw-Hill, New York, 432 p
Mikhlin SG (1964) Integral equations, 2nd edn. Pergamon Press, London
Minnotte MC, West RW (1999) The data image: A tool for exploring high dimensional data sets.
In: Proc. ASA section on stat. graphics, Dallas, TX, American Statistical Association, pp 25–33
Moiseiwitsch BL (1977) Integral equations. Longman, London
Monahan AH, DelSole T (2009) Information theoretic measures of dependence, compactness, and
non-Gaussianity for multivariate probability distributions. Nonlin Process Geophys 16:57–64
Monahan AH, Fyfe JC (2007) Comment on the shortcomings of nonlinear principal component
analysis in identifying circulation regimes. J Climate 20:374–377
Monahan AH, Pandolfo L, Fyfe JC (2001) The preferred structure of variability of the Northern
Hemisphere atmospheric circulation. Geophys Res Lett 27:1139–1142
Monahan AH, Tangang FT, Hsieh WW (1999) A potential problem with extended EOF analysis of
standing wave fields. Atmosphere-Ocean 37:241–254
Monahan AH, Fyfe JC, Flato GM (2000) A regime view of northern hemisphere atmospheric
variability and change under global warming. Geophys Res Lett 27:1139–1142
Monahan AH (2000) Nonlinear principal component analysis by neural networks: theory and
application to the Lorenz system. J Climate 13:821–835
Monahan AH (2001) Nonlinear principal component analysis: tropical Indo–Pacific sea surface
temperature and sea level pressure. J Climate 14:219–233
Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural
Comput 1:281–294
Moon TK (1996) The expectation maximization algorithm. IEEE Signal Process Mag, 47–60
Mori A, Kawasaki N, Yamazaki K, Honda M, Nakamura H (2006) A reexamination of the northern
hemisphere sea level pressure variability by the independent component analysis. SOLA 2:5–8
Mørup M, Hansen LK (2012) Archetypal analysis for machine learning and data mining.
Neurocomputing 80:54–63
Morozov VA (1984) Methods for solving incorrectly posed problems. Springer, Berlin. ISBN: 3-
540-96059-7
Morrison DF (1967) Multivariate statistical methods. McGraw-Hill, New York
Morton SC (1989) Interpretable projection pursuit. Technical Report 106. Department of Statistics,
Stanford University, Stanford. https://www.osti.gov/biblio/5005529-interpretable-projection-
pursuit
Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J (2015) Principal nonlinear dynamical
modes of climate variability. Sci Rep 5:15510. https://doi.org/10.1038/srep15510
Munk WH (1950) On the wind-driven ocean circulation. J Meteorol 7:79–93
Nadler B, Lafon S, Coifman RR, Kevrekidis IG (2006) Diffusion maps, spectral clustering, and
reaction coordinates of dynamical systems. Appl Comput Harmon Anal 21:113–127
Nason G (1992) Design and choice of projection indices. Ph.D. Thesis, The University of Bath
Nason G (1995) Three-dimensional projection pursuit. Appl Statist 44:411–430
Nason GP, Sibson R (1992) Measuring multimodality. Stat Comput 2:153–160
Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313
Newman MEJ (2006) Modularity and community structure in networks. PNAS 103:8577–8582.
www.pnas.org/cgi/doi/10.1073/pnas.0601602103
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys
Rev E 69:026113. https://doi.org/10.1103/PhysRevE.69.026113
Pearson K (1895) Notes on regression and inheritance in the case of two parents. Proc R Soc
London 58:240–242
Pearson K (1920) Notes on the history of correlation. Biometrika 13:25–45
Pham D-T, Garrat P, Jutten C (1992) Separation of mixture of independent sources through
maximum likelihood approach. In: Proc EUSIPCO, pp 771–774
Pires CAL, Hannachi A (2021) Bispectral analysis of nonlinear interaction, predictability and
stochastic modelling with application to ENSO. Tellus A 73:1–30
Plaut G, Vautard R (1994) Spells of low-frequency oscillations and weather regimes in the northern
hemisphere. J Atmos Sci 51:210–236
Pearson K (1902) On lines and planes of closest fit to systems of points in space. Phil Mag 2:559–
572
Penland C (1989) Random forcing and forecasting using principal oscillation patterns. Mon Wea
Rev 117:2165–2185
Penland C, Sardeshmukh PD (1995) The optimal growth of tropical sea surface temperature
anomalies. J Climate 8:1999–2024
Pezzulli S, Hannachi A, Stephenson DB (2005) The variability of seasonality. J Climate 18:71–88
Philippon N, Jarlan L, Martiny N, Camberlin P, Mougin E (2007) Characterization of the
interannual and intraseasonal variability of west African vegetation between 1982 and 2002
by means of NOAA AVHRR NDVI data. J Climate 20:1202–1218
Pires CAL, Hannachi A (2017) Independent subspace analysis of the sea surface temperature
variability: non-Gaussian sources and sensitivity to sampling and dimensionality. Complexity.
https://doi.org/10.1155/2017/3076810
Pires CAL, Ribeiro AFS (2017) Separation of the atmospheric variability into non-Gaussian
multidimensional sources by projection pursuit techniques. Climate Dynamics 48:821–850
Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
Polya G, Latta G (1974) Complex variables. Wiley, New York, 334pp
Powell MJD (1964) An efficient method for finding the minimum of a function of several variables
without calculating derivatives. Comput J 7:155–162
Powell MJD (1987) Radial basis functions for multivariate interpolation: a review. In: Mason JC,
Cox MG (eds) Algorithms for the approximation of functions and data. Oxford University
Press, Oxford, pp 143–167
Powell MJD (1990) The theory of radial basis function approximation in (1990) In: Light W (ed)
Advances in numerical analysis, Volume 2: wavelets, subdivision algorithms and radial basis
functions. Oxford University Press, Oxford
Preisendorfer RW, Mobley CD (1988) Principal component analysis in meteorology and oceanog-
raphy. Elsevier, Amsterdam
Press WH, et al (1992) Numerical recipes in Fortran: The art of scientific computing. Cambridge
University Press, Cambridge
Priestley MB (1981) Spectral analysis of time series. Academic Press, London
Posse C (1995) Tools for two-dimensional exploratory projection pursuit. J Comput Graph Statist
4:83–100
Ramsay JO, Silverman BW (2006) Functional data analysis, 2nd edn. Springer Series in Statistics,
New York
Rasmusson EM, Arkin PA, Chen W-Y, Jalickee JB (1981) Biennial variations in surface tempera-
ture over the United States as revealed by singular decomposition. Mon Wea Rev 109:587–598
Rayner NA, Parker DE, Horton EB, Folland CK, Alexander LV, Rowell DP, Kent EC, Kaplan A
(2003) Global analyses of sea surface temperature, sea ice, and night marine air temperature
since the late nineteenth century. J Geophys Res 108(D14):4407
Rényi A (1961) On measures of entropy and information. In: Neyman J (ed) Proceedings of the
Fourth Berkeley symposium on mathematical statistics and probability, vol I. The University of
California Press, Berkeley, pp 547–561
Rényi A (1970) Probability theory. North Holland, Amsterdam, 666pp
Reed RJ, Campbell WJ, Rasmussen LA, Rogers RG (1961) Evidence of a downward propagating
annual wind reversal in the equatorial stratosphere. J Geophys Res 66:813–818
Reichenbach H (1937) Les fondements logiques du calcul des probabilités. Ann Inst H Poincaré
7:267–348
Rennert KJ, Wallace JM (2009) Cross-frequency coupling, skewness and blocking in the Northern
Hemisphere winter circulation. J Climate 22:5650–5666
Renwick AJ, Wallace JM (1995) Predictable anomaly patterns and the forecast skill of northern
hemisphere wintertime 500-mb height fields. Mon Wea Rev 123:2114–2131
Reusch DB, Alley RB, Hewitson BC (2005) Relative performance of self-organizing maps and
principal component analysis in pattern extraction from synthetic climatological data. Polar
Geography 29(3):188–212. https://doi.org/10.1080/789610199
Reyment RA, Jöreskog KG (1996) Applied factor analysis in the natural sciences. Cambridge
University Press, Cambridge
Richman MB (1981) Obliquely rotated principal components: An improved meteorological map
typing technique. J Appl Meteor 20:1145–1159
Richman MB (1986) Rotation of principal components. J Climatol 6:293–335
Richman MB (1987) Rotation of principal components: A reply. J Climatol 7:511–520
Richman M (1993) Comments on: The effect of domain shape on principal components analyses.
Int J Climatol 13:203–218
Richman M, Adrianto I (2010) Classification and regionalization through kernel principal compo-
nent analysis. Phys Chem Earth 35:316–328
Richman MB, Leslie LM (2012) Adaptive machine learning approaches to seasonal prediction of
tropical cyclones. Procedia Comput Sci 12:276–281
Richman MB, Leslie LM, Ramsay HA, Klotzbach PJ (2017) Reducing tropical cyclone prediction
errors using machine learning approaches. Procedia Comput Sci 114:314–323
Ripley BD (1994) Neural networks and related methods for classification. J Roy Statist Soc B
56:409–456
Risken H (1984) The Fokker-Planck equation. Springer
Ritter H (1995) Self-organizing feature maps: Kohonen maps. In: Arbib MA (ed) The handbook
of brain theory and neural networks. MIT Press, Cambridge, MA, pp 846–851
Roach GF (1970) Green's functions: introductory theory with applications. Van Nostrand Reinhold
Company, London
Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am
Statistician 42:59–66
Rodwell MJ, Hoskins BJ (1996) Monsoons and the dynamics of deserts. Q J Roy Meteorol Soc
122:1385–1404
Rogers GS (1980) Matrix derivatives. Marcel Dekker, New York
Rojas R (1996) Neural networks: A systematic introduction. Springer, Berlin, 509 p
Rosenblatt F (1962) Principles of neurodynamics. Spartan Books, New York
Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Rev 65:386–408
Ross SM (1998) A first course in probability, 5th edn. Prentice-Hall, New Jersey
Roweis ST (1998) The EM algorithm for PCA and SPCA. In: Jordan MI, Kearns MJ, Solla SA
(eds) Advances in neural information processing systems, vol 10. MIT Press, Cambridge, MA
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding.
Science 290:2323–2326
Rozanov YuA (1967) Stationary random processes. Holden-Day, San-Francisco
Rubin DB, Thayer DT (1982) EM algorithms for ML factor analysis. Psychometrika 47:69–76
Rubin DB, Thayer DT (1983) More on EM for ML factor analysis. Psychometrika 48:253–257
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagation
errors. Nature 323:533–536
Rumelhart DE, Widrow B, Lehr AM (1994) The basic ideas in neural networks. Commun ACM
37:87–92
Runge J, Petoukhov V, Kurths J (2014) Quantifying the strength and delay of climatic interactions:
the ambiguities of cross correlation and a novel measure based on graphical models. J Climate
27:720–739
Runge J, Heitzig J, Kurths J (2012) Escaping the curse of dimensionality in estimating multivariate
transfer entropy. Phys Rev Lett 108:258701. https://doi.org/10.1103/PhysRevLett.108.258701
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
Saad Y (1990) Numerical solution of large Lyapunov equations. In: Kaashoek AM, van Schuppen
JH, Ran AC (eds) Signal processing, scattering, operator theory, and numerical methods, Pro-
ceedings of the international symposium MTNS-89, vol III, pp 503–511, Boston, Birkhauser
Saad Y, Schultz MH (1985) Conjugate gradient-like algorithms for solving nonsymmetric linear
systems. Math Comput 44:417–424
Said-Houari B (2015) Differential equations: Methods and applications. Springer, Cham, 212pp
Salim A, Pawitan Y, Bond K (2005) Modelling association between two irregularly observed
spatiotemporal processes by using maximum covariance analysis. Appl Statist 54:555–573
Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C-
18:401–409
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res
Dev 3:211–229
Saunders DR (1961) The rationale for an “oblimax” method of transformation in factor analysis.
Psychometrika 26:317–324
Scher S (2020) Artificial intelligence in weather and climate prediction. Ph.D. Thesis in Atmo-
spheric Sciences and Oceanography, Stockholm University, Sweden
Scher S (2018) Toward data-driven weather and climate forecasting: Approximating a simple
general circulation model with deep learning. Geophys Res Lett 45:12,616–12,622. https://
doi.org/10.1029/2018GL080704
Scher S, Messori G (2019) Weather and climate forecasting with neural networks: using general
circulation models (GCMs) with different complexity as a study ground. Geosci Model Dev
12:2797–2809
Schmidtko S, Johnson GC, Lyman JM (2013) MIMOC: A global monthly isopycnal upper-ocean
climatology with mixed layers. J Geophys Res, 118. https://doi.org/10.1002/jgrc.20122
Schneider T, Neumaier A (2001) Algorithm 808: ARFit − A Matlab package for the estimation
of parameters and eigenmodes of multivariate autoregressive models. ACM Trans Math Soft
27:58–65
Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue
problem. Neural Comput 10:1299–1319
Schölkopf B, Mika S, Burgers CJS, Knirsch P, Müller K-R, Rätsch G, Smola A (1999) Input space
vs. feature space in kernel-based methods. IEEE Trans Neural Netw 10:1000–1017
Schoenberg IJ (1935) Remarks to Maurice Fréchet's article 'Sur la définition axiomatique d'une
classe d'espaces distanciés vectoriellement applicable sur l'espace de Hilbert'. Ann Math (2nd
series) 36:724–732
Schoenberg IJ (1964) Spline interpolation and best quadrature formulae. Bull Am Math Soc 70:143–148
Schott JR (1991) Some tests for common principal component subspaces in several groups.
Biometrika 78:771–778
Schott JR (1988) Common principal component subspaces in two groups. Biometrika 75:229–236
Scott DW (1992) Multivariate density estimation: theory, practice, and visualization. Wiley, New
York
Schuenemann KC, Cassano JJ (2010) Changes in synoptic weather patterns and Greenland precip-
itation in the 20th and 21st centuries: 2. Analysis of 21st century atmospheric changes using
self-organizing maps, J Geophys Res 115:D05108. https://doi.org/10.1029/2009JD011706.
ISSN: 0148-0227
Schuenemann KC, Cassano JJ, Finnis J (2009) Forcing of precipitation over Greenland: Synoptic
climatology for 1961–99. J Hydrometeorol 10:60–78. https://doi.org/10.1175/2008JHM1014.
1. ISSN: 1525-7541
Scott DW, Thompson JR (1983) Probability density estimation in higher dimensions. In: Computer
science and statistics: Proceedings of the fifteenth symposium on the interface, pp 173–179
Seal HL (1967) Multivariate statistical analysis for biologists. Methuen, London
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656(1):5–28
Seitola T, Mikkola V, Silen J, Järvinen H (2014) Random projections in reducing the dimension-
ality of climate simulation data. Tellus A, 66. Available at www.tellusa.net/index.php/tellusa/
article/view/25274
Seitola T, Silén J, Järvinen H (2015) Randomized multi-channel singular spectrum analysis of the
20th century climate data. Tellus A 67:28876. Available at https://doi.org/10.3402/tellusa.v67.
28876.
Seltman HJ (2018) Experimental design and analysis. http://www.stat.cmu.edu/~hseltman/309/
Book/Book.pdf
Seth S, Eugster MJA (2015) Probabilistic archetypal analysis. Machine Learning. https://doi.org/
10.1007/s10994-015-5498-8
Shalvi O, Weinstein E (1990) New criteria for blind deconvolution of nonminimum phase systems
(channels). IEEE Trans Inf Theory 36:312–321
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–
656
Shepard RN (1962a) The analysis of proximities: multidimensional scaling with unknown distance
function. Part I. Psychometrika 27:125–140
Shepard RN (1962b) The analysis of proximities: multidimensional scaling with unknown distance
function. Part II. Psychometrika 27:219–246
Sheridan SC, Lee CC (2010) Synoptic climatology and the general circulation model. Progress
Phys Geography 34:101–109. ISSN: 1477-0296
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach
Intell 22:888–905
Schnur R, Schmitz G, Grieger N, von Storch H (1993) Normal modes of the atmosphere as
estimated by principal oscillation patterns and derived from quasi-geostrophic theory. J Atmos
Sci 50:2386–2400
Sibson R (1972) Order invariant methods for data analysis. J Roy Statist Soc B 34:311–349
Sibson R (1978) Studies in the robustness of multidimensional scaling: procrustes statistics. J Roy
Statist Soc B 40:234–238
Sibson R (1979) Studies in the robustness of multidimensional scaling: Perturbational analysis of
classical scaling. J Roy Statist Soc B 41:217–229
Sibson R (1981) Multidimensional scaling. Wiley, Chichester
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall,
London
Simmons AJ, Wallace JM, Branstator GW (1983) Barotropic wave propagation and instability, and
atmospheric teleconnection patterns. J Atmos Sci 40:1363–1392
Smith S (1994) Optimization techniques on Riemannian manifolds. In: Bloch A (ed) Hamiltonian
and gradient flows, algorithms and control. Fields Institute Communications, vol 3. Amer Math
Soc, pp 113–136
Snyman JA (1982) A new and dynamic method for unconstrained optimisation. Appl Math Modell
6:449–462
Solidoro C, Bandelj V, Barbieri P, Cossarini G, Fonda Umani S (2007) Understanding dynamic of
biogeochemical properties in the northern Adriatic Sea by using self-organizing maps and k-
means clustering. J Geophys Res 112:C07S90. https://doi.org/10.1029/2006JC003553. ISSN:
0148-0227
Sočan G (2003) The incremental value of minimum rank factor analysis. Ph.D. Thesis, University
of Groningen, Groningen
Spearman C (1904a) General intelligence, objectively determined and measured. Am J Psy
15:201–293
Spearman C (1904b) The proof and measurement of association between two things. Am J Psy
15:72, and 202
Spence I, Garrison RF (1993) A remarkable scatterplot. Am Statistician 47:12–19
Spendley W, Hext GR, Humsworth FR (1962) Sequential applications of simplex designs in
optimization and evolutionary operations. Technometrics 4:441–461
Stewart D, Love W (1968) A general canonical correlation index. Psy Bull 70:160–163
Steinschneider S, Lall U (2015) Daily precipitation and tropical moisture exports across the east-
ern United States: An application of archetypal analysis to identify spatiotemporal structure. J
Climate 28:8585–8602
Stephenson G (1973) Mathematical methods for science students, 2nd edn. Dover Publication,
Mineola, 526 p
Stigler SM (1986) The History of Statistics: The Measurement of Uncertainty Before 1900.
Harvard University Press, Cambridge, MA
Stommel H (1948) The westward intensification of wind-driven ocean currents. EOS Trans Amer
Geophys Union 29:202–206
Stone M, Brooks RJ (1990) Continuum regression: cross-validation sequentially constructed
prediction embracing ordinary least squares, partial least squares and principal components
regression. J Roy Statist Soc B52:237–269
Su Z, Hu H, Wang G, Ma Y, Yang X, Guo F (2018) Using GIS and Random Forests to identify fire
drivers in a forest city, Yichun, China. Geomatics Natural Hazards Risk 9:1207–1229. https://
doi.org/10.1080/19475705.2018.1505667
Subashini A, Thamarai SM, Meyyappan T (2019) Advanced weather forecasting prediction using
deep learning. Int J Res Appl Sci Eng Tech IJRASET 7:939–945. www.ijraset.com
Sura P, Hannachi A (2015) Perspectives of non-Gaussianity in atmospheric synoptic and low-
frequency variability. J Climate 28:5091–5114
Swenson ET (2015) Continuum power CCA: A unified approach for isolating coupled modes. J
Climate 28:1016–1030
Takens F (1981) Detecting strange attractors in turbulence. In: Rand D, Young LS (eds) Dynamical
systems and turbulence, warwick 1980. Lecture Notes in Mathematics, vol 898. Springer, New
York, pp 366–381
Talley LD (2008) Freshwater transport estimates and the global overturning circulation: shallow,
deep and throughflow components. Progress in Oceanography 78:257–303
Taylor GI (1921) Diffusion by continuous movement. Proc Lond Math Soc 20(2):196–212
Telszewski M, Chazottes A, Schuster U, Watson AJ, Moulin C, Bakker DCE, Gonzalez-Davila
M, Johannessen T, Kortzinger A, Luger H, Olsen A, Omar A, Padin XA, Rios AF, Steinhoff
T, Santana-Casiano M, Wallace DWR, Wanninkhof R (2009) Estimating the monthly pCO2
distribution in the North Atlantic using a self-organizing neural network. Biogeosciences
6:1405–1421. ISSN: 1726–4170
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear
dimensionality reduction. Science 290:2319–2323
TerMegreditchian MG (1969) On the determination of the number of independent stations which
are equivalent to prescribed systems of correlated stations (in Russian). Meteor Hydrol 2:24–36
Teschl G (2012) Ordinary differential equations and dynamical systems. Graduate Studies in
Mathematics, vol 140, Amer Math Soc, Providence, RI, 345pp
Thacker WC (1996) Metric-based principal components: data uncertainties. Tellus 48A:584–592
Thacker WC (1999) Principal predictors. Int J Climatol 19:821–834
Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method.
Sov Math Dokl 4:1035–1038
Theiler J, Eubank S, Longtin A, Galdrikian B, Farmer JD (1992) Testing for nonlinearity in time
series: the method of surrogate data. Physica D 58:77–94
Thiebaux HJ (1994) Statistical data analyses for ocean and atmospheric sciences. Academic Press
Thomas JB (1969) An introduction to statistical communication theory. Wiley
Thomson RE, Emery WJ (2014) Data analysis methods in physical oceanography, 3rd edn.
Elsevier, Amsterdam, 716 p
Thompson DWJ, Wallace JM (1998) The Arctic Oscillation signature in the wintertime geopotential
height and temperature fields. Geophys Res Lett 25:1297–1300
Thompson DWJ, Wallace JM (2000) Annular modes in the extratropical circulation. Part I: Month-
to-month variability. J Climate 13:1000–1016
Thompson DWJ, Wallace JM, Hegerl GC (2000) Annular modes in the extratropical circulation,
Part II: Trends. J Climate 13:1018–1036
Thurstone LL (1940) Current issues in factor analysis. Psychological Bulletin 37:189–236
Thurstone LL (1947) Multiple factor analysis. The University of Chicago Press, Chicago
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
Tippett MK, DelSole T, Mason SJ, Barnston AG (2008) Regression based methods for finding
coupled patterns. J Climate 21:4384–4398
Tipping ME, Bishop CM (1999) Probabilistic principal components. J Roy Statist Soc B 61:611–
622
Toumazou V, Cretaux J-F (2001) Using a Lanczos eigensolver in the computation of empirical
orthogonal functions. Mon Wea Rev 129:1243–1250
Torgerson WS (1952) Multidimensional scaling I: Theory and method. Psychometrika 17:401–419
Torgerson WS (1958) Theory and methods of scaling. Wiley, New York
Trenberth KE, Jones DP, Ambenje P, Bojariu R, Easterling D, Klein Tank A, Parker D, Rahimzadeh
F, Renwick AJ, Rusticucci M, Soden B, Zhai P (2007) Observations: surface and atmospheric
climate change. In: Solomon S, Qin D, Manning M, et al. (eds) Climate change 2007: The
physical science basis. Contribution of Working Group I to the fourth assessment report of the
Intergovernmental Panel on Climate Change. Cambridge University Press, pp 235–336
Trenberth KE, Shin W-TK (1984) Quasi-biennial fluctuations in sea level pressures over the
Northern Hemisphere. Mon Wea Rev 111:761–777
Trendafilov NT (2010) Stepwise estimation of common principal components. Comput Statist Data
Anal 54:3446–3457
Trendafilov NT, Jolliffe IT (2006) Projected gradient approach to the numerical solution of the
SCoTLASS. Comput Statist Data Anal 50:242–253
Tsai YZ, Hsu K-S, Wu H-Y, Lin S-I, Yu H-L, Huang K-T, Hu M-C, Hsu S-Y (2020) Application of
random forest and ICON models combined with weather forecasts to predict soil temperature
and water content in a greenhouse. Water 12:1176
Tsonis AA, Roebber PJ (2004) The architecture of the climate network. Phys A 333:497–504.
https://doi.org/10.1016/j.physa.2003.10.045
Tsonis AA, Swanson KL, Roebber PJ (2006) What do networks have to do with climate? Bull Am
Meteor Soc 87:585–595. https://doi.org/10.1175/BAMS-87-5-585
Tsonis AA, Swanson KL, Wang G (2008) On the role of atmospheric teleconnections in climate. J
Climate 21:2990–3001
Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decompo-
sition: Theory and applications. J Comput Dyn 1:391–421. https://doi.org/10.3934/jcd.2014.1.
391
Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–
311
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA
Tukey PA, Tukey JW (1981) Preparation, prechosen sequences of views. In: Barnett V (ed)
Interpreting multivariate data. Wiley, Chichester, pp 189–213
Tyler DE (1982) On the optimality of the simultaneous redundancy transformations. Psychome-
trika 47:77–86
Ulrych TJ, Bishop TN (1975) Maximum entropy spectral analysis and autoregressive decomposi-
tion. Rev Geophys Space Phys 13:183–200
Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2010) Independent exploratory factor analysis
with application to atmospheric science data. J Appl Stat 37:1847–1862
Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2011) Independent component analysis for three-
way data with an application from atmospheric science. J Agr Biol Environ Stat 16:319–338
Uppala SM, Kallberg PW, Simmons AJ, Andrae U, Bechtold VDC, Fiorino M, Gibson JK, Haseler
J, Hernandez A, Kelly GA, Li X, Onogi K, Saarinen S, Sokka N, Allan RP, Andersson E, Arpe
K, Balmaseda MA, Beljaars ACM, Berg LVD, Bidlot J, Bormann N, Caires S, Chevallier F,
Dethof A, Dragosavac M, Fisher M, Fuentes M, Hagemann S, Hólm E, Hoskins BJ, Isaksen
L, Janssen PAEM, Jenne R, Mcnally AP, Mahfouf J-F, Morcrette J-J, Rayner NA, Saunders
RW, Simon P, Sterl A, Trenberth KE, Untch A, Vasiljevic D, Viterbo P, Woollen J (2005) The
ERA-40 re-analysis. Q J Roy Meteorol Soc 131:2961–3012
van den Dool HM, Saha S, Johansson Å (2000) Empirical orthogonal teleconnections. J Climate
13:1421–1435
van den Dool HM (2011) An iterative projection method to calculate EOFs successively without
use of the covariance matrix. In: Science and technology infusion climate bulletin NOAA’s
National Weather Service. 36th NOAA annual climate diagnostics and prediction workshop,
Fort Worth, TX, 3–6 October 2011. www.nws.noaa.gov/ost/climate/STIP/36CDPW/36cdpw-
vandendool.pdf
van den Wollenberg AL (1977) Redundancy analysis: an alternative to canonical correlation
analysis. Psychometrika 42:207–219
Vasicek O (1976) A test for normality based on sample entropy. J R Statist Soc B 38:54–59
Vautard R, Ghil M (1989) Singular spectrum analysis in nonlinear dynamics, with applications to
paleoclimatic time series. Physica D 35:395–424
Vautard R, Yiou P, Ghil M (1992) Singular spectrum analysis: A toolkit for short, noisy chaotic
signals. Physica D 58:95–126
Venables WN, Ripley BD (1994) Modern applied statistics with S-plus. McGraw-Hill, New York
Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Net
11:586–600
Vesanto J (1997) Using the SOM and local models in time series prediction. In Proceedings of
workshop on self-organizing maps (WSOM’97), Espoo, Finland, pp 209–214
Vinnikov KY, Robock A, Grody NC, Basist A (2004) Analysis of diurnal and seasonal cycles and
trends in climate records with arbitrary observations times. Geophys Res Lett 31. https://doi.
org/10.1029/2003GL019196
Vilibić I, et al (2016) Self-organizing maps-based ocean currents forecasting system. Sci Rep
6:22924. https://doi.org/10.1038/srep22924
von Mises R (1928) Wahrscheinlichkeit, Statistik und Wahrheit, 3rd rev. edn. Springer, Vienna,
1936; trans. as Probability, statistics and truth, 1939. W. Hodge, London
von Storch H (1995a) Spatial patterns: EOFs and CCA. In: von Storch H, Navarra A (eds) Analysis
of climate variability: Application of statistical techniques. Springer, pp 227–257
von Storch J (1995b) Multivariate statistical modelling: POP model as a first order approximation.
In: von Storch H, Navarra A (eds) Analysis of climate variability: application of statistical
techniques. Springer, pp 281–279
von Storch H, Zwiers FW (1999) Statistical analysis in climate research. Cambridge University
Press, Cambridge
von Storch H, Xu J (1990) Principal oscillation pattern analysis of the tropical 30- to 60-day
oscillation. Part I: Definition of an index and its prediction. Climate Dynamics 4:175–190
von Storch H, Bruns T, Fisher-Bruns I, Hasselmann KF (1988) Principal oscillation pattern analysis
of the 30- to 60-day oscillation in a general circulation model equatorial troposphere. J Geophys
Res 93:11022–11036
von Storch H, Bürger G, Schnur R, Storch J-S (1995) Principal oscillation patterns: A review. J
Climate 8:377–400
von Storch H, Baumhefner D (1991) Principal oscillation pattern analysis of the tropical 30- to
60-day oscillation. Part II: The prediction of equatorial velocity potential and its skill. Climate
Dynamics 5:1–12
Wahba G (1979) Convergence rates of “Thin Plate” smoothing splines when the data are noisy.
In: Gasser T, Rosenblatt M (eds) Smoothing techniques for curve estimation. Lecture notes in
mathematics, vol 757. Springer, pp 232–246
Wahba G (1990) Spline models for observational data. SIAM, Society for Industrial and Applied
Mathematics, Philadelphia, PA, 169 p
Wahba G (2000) Smoothing splines in nonparametric regression. Technical Report No 1024,
Department of Statistics, University of Wisconsin. https://www.stat.wisc.edu/sites/default/files/
tr1024.pdf
Walker GT (1909) Correlation in seasonal variation of climate. Mem Ind Met Dept 20:122
Walker GT (1923) Correlation in seasonal variation of weather, VIII, a preliminary study of world
weather. Mem Ind Met Dept 24:75–131
Walker GT (1924) Correlation in seasonal variation of weather, IX. Mem Ind Met Dept 25:275–332
Walker GT, Bliss EW (1932) World weather V. Mem Roy Met Soc 4:53–84
Wallace JM (2000) North Atlantic Oscillation/annular mode: Two paradigms–one phenomenon.
QJR Meteorol Soc 126:791–805
Wallace JM, Dickinson RE (1972) Empirical orthogonal representation of time series in the
frequency domain. Part I: Theoretical consideration. J Appl Meteor 11:887–892
Wallace JM (1972) Empirical orthogonal representation of time series in the frequency domain.
Part II: Application to the study of tropical wave disturbances. J Appl Meteor 11:893–900
Wallace JM, Gutzler DS (1981) Teleconnections in the geopotential height field during the
Northern Hemisphere winter. Mon Wea Rev 109:784–812
Wallace JM, Smith C, Bretherton CS (1992) Singular value decomposition of wintertime sea
surface temperature and 500-mb height anomalies. J Climate 5:561–576
Wallace JM, Thompson DWJ (2002) The Pacific Center of Action of the Northern Hemisphere
annular mode: Real or artifact? J Climate 15:1987–1991
Walsh JE, Richman MB (1981) Seasonality in the associations between surface temperatures over
the United States and the North Pacific Ocean. Mon Wea Rev 109:767–783
Wan EA (1994) Time series prediction by using a connectionist network with internal delay lines.
In: Weigend AS, Gershenfeld NA (eds) Time series prediction: forecasting the future and
understanding the past. Addison-Wesley, Boston, MA, pp 195–217
Wang D, Arapostathis A, Wilke CO, Markey MK (2012) Principal-oscillation-pattern analysis of
gene expression. PLoS ONE 7 7:1–10. https://doi.org/10.1371/journal.pone.0028805
Wang Y-H, Magnusdottir G, Stern H, Tian X, Yu Y (2014) Uncertainty estimates of the EOF-
derived North Atlantic Oscillation. J Climate 27:1290–1301
Wang D-P, Mooers CNK (1977) Long coastal-trapped waves off the west coast of the United States,
summer 1973. J Phys Oceanogr 7:856–864
Wang XL, Zwiers F (1999) Interannual variability of precipitation in an ensemble of AMIP climate
simulations conducted with the CCC GCM2. J Climate 12:1322–1335
Watkins DS (2007) The matrix eigenvalue problem: GR and Krylov subspace methods. SIAM,
Philadelphia
Watt J, Borhani R, Katsaggelos AK (2020) Machine learning refined: foundation, algorithms and
applications, 2nd edn. Cambridge University Press, Cambridge, 574 p
Weare BC, Nasstrom JS (1982) Examples of extended empirical orthogonal function analysis. Mon
Wea Rev 110:481–485
Wegman E (1990) Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc
78:310–322
Wei WWS (2019) Multivariate time series analysis and applications. Wiley, Oxford, 518 p
Weideman JAC (1995) Computing the Hilbert transform on the real line. Math Comput 64:745–762
Weyn JA, Durran DR, Caruana R (2019) Can machines learn to predict weather? Using deep
learning to predict gridded 500-hPa geopotential height from historical weather data. J Adv
Model Earth Syst 11:2680–2693. https://doi.org/10.1029/2019MS001705
Werbos PJ (1990) Backpropagation through time: What it does and how to do it. Proc IEEE,
78:1550–1560
Whittle P (1951) Hypothesis testing in time series. Almqvist and Wicksell, Uppsala
Whittle P (1953a) The analysis of multiple stationary time series. J Roy Statist Soc B 15:125–139
Whittle P (1953b) Estimation and information in stationary time series. Ark Math 2:423–434
Whittle P (1983) Prediction and regulation by linear least-square methods, 2nd edn. University of
Minnesota, Minneapolis
Widrow B, Stearns PN (1985) Adaptive signal processing. Prentice-Hall, Englewood Cliffs, NJ
Wikle CK (2004) Spatio-temporal methods in climatology. In: El-Shaarawi AH, Jureckova J (eds)
UNESCO encyclopedia of life support systems (EOLSS). EOLSS Publishers, Oxford, UK.
Available: https://pdfs.semanticscholar.org/e11f/f4c7986840caf112541282990682f7896199.
pdf
Wiener N, Masani P (1957) The prediction theory of multivariate stochastic processes, I. Acta
Math 98:111–150
Wiener N, Masani P (1958) The prediction theory of multivariate stochastic processes, II. Acta
Math 99:93–137
Wilkinson JH (1988) The algebraic eigenvalue problem. Clarendon Oxford Science Publications,
Oxford
Wilks DS (2011) Statistical methods in the atmospheric sciences. Academic Press, San Diego
Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the Koopman
operator: extending dynamic mode decomposition. J Nonlin Sci 25:1307–1346
Wiskott L, Sejnowski TJ (2002) Slow feature analysis: unsupervised learning of invariances.
Neural Comput 14:715–770
Wise J (1955) The autocorrelation function and the spectral density function. Biometrika 42:151–
159
Woollings T, Hannachi A, Hoskins BJ, Turner A (2010) A regime view of the North Atlantic
Oscillation and its response to anthropogenic forcing. J Climate 23:1291–1307
Wright RM, Switzer P (1971) Numerical classification applied to certain Jamaican Eocene
nummulitids. Math Geol 3:297–311
Wunsch C (2003) The spectral description of climate change including the 100 ky energy. Clim
Dyn 20:353–363
Wu C-J (1996) Large optimal truncated low-dimensional dynamical systems. Discr Cont Dyn Syst
2:559–583
Xinhua C, Dunkerton TJ (1995) Orthogonal rotation of spatial patterns derived from singular value
decomposition analysis. J Climate 8:2631–2643
Xu J-S (1993) The joint modes of the coupled atmosphere-ocean system observed from 1967 to
1986. J Climate 6:816–838
Xue Y, Cane MA, Zebiak SE, Blumenthal MB (1994) On the prediction of ENSO: A study with a
low order Markov model. Tellus 46A:512–540
Young GA, Smith RL (2005) Essentials of statistical inference. Cambridge University Press, New
York, 226 p. ISBN-10: 0-521-54866-7
Young FW (1987) Multidimensional scaling: history, theory and applications. Lawrence Erlbaum,
Hillsdale, New Jersey
Young FW, Hamer RM (1994) Theory and applications of multidimensional scaling. Erlbaum
Associates, Hillsdale, NJ
Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual distances.
Psychometrika 3:19–22
Young G, Householder AS (1941) A note on multidimensional psycho-physical analysis. Psy-
chometrika 6:331–333
Yu Z-P, Chu P-S, Schroeder T (1997) Predictive skills of seasonal to annual rainfall variations
in the U.S. Affiliated Pacific Islands: Canonical correlation analysis and multivariate principal
component regression approaches. J Climate 10:2586–2599
Zveryaev II, Hannachi A (2012) Interannual variability of Mediterranean evaporation and its
relation to regional climate. Clim Dyn. https://doi.org/10.1007/s00382-011-1218-7
Zveryaev II, Hannachi A (2016) Interdecadal changes in the links between Mediterranean
evaporation and regional atmospheric dynamics during extended cold season. Int J Climatol.
https://doi.org/10.1002/joc.4779
Zeleny M (1987) Management support systems: towards integrated knowledge management.
Human Syst Manag 7:59–70
Zhang G, Patuwo BE, Hu MY (1997) Forecasting with artificial neural networks: The state of the
art. Int J Forecast 14:35–62
Zhu Y, Shasha D (2002) StatStream: Statistical monitoring of thousands of data streams in real time.
In: VLDB, pp 358–369. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8732
Index
Filtering, 112, 166, 342
Filtering problem, 177
Filter matrix, 276
Filter patterns, 178
Finear filter, 26
Finite difference scheme, 361
First-order auto-regressive model, 56
First-order Markov model, 179
First-order optimality condition, 386
First-order spatial autoregressive process, 56
First-order system, 135
Fisher information, 243, 245
Fisher’s linear discrimination function, 254
Fisher–Snedecor distribution, 479
Fitted model, 228
Fixed point, 394
Fletcher-Powell method, 226
Fletcher–Reeves, 528
Floquet theory, 546
Floyd’s algorithm, 211
Fluctuation-dissipation relation, 129
Forecast, 172, 185
Forecastability, 172, 197
Forecastable component analysis (ForeCA), 196
Forecastable patterns, 196
Forecasting, 38, 416, 422
Forecasting accuracy, 130
Forecasting models, 179
Forecasting uncertainty, 442
Forecast models, 179
Forecast skill, 185
Forward-backward, 421
Forward stepwise procedure, 257
Fourier analysis, 17
Fourier decomposition, 107
Fourier series, 104, 373
Fourier spectrum, 103
Fourier transform (FT), 27, 48, 99, 102, 125, 176, 183, 187, 192, 267, 494
Fourth order cumulant, 170
Fourth order moment, 75
Fractal dimensions, 149
Fredholm eigen problem, 37
Fredholm equation, 320
Fredholm homogeneous integral equation, 359
Frequency-band, 97
Frequency domain, 94, 97, 176
Frequency response function, 29, 104, 108, 189, 191, 267
Frequency-time, 103
Friction, 93
Friedman’s index, 251
Fröbenius matrix norm, 233
Fröbenius norm, 217, 230, 285, 391, 398
Fröbenius product, 230
Fröbenius structure, 139
Full model, 228
Full rank, 54, 187
Functional EOFs, 321
Functional analysis, 300
Functional CCA, 353
Functional EOF, 319
Functional PCs, 322
Fundamental matrix, 546

G
Gain, 27
Gamma distribution, 478
Gaussian, 19, 63, 129, 192, 243
Gaussian grid, 44, 328
Gaussianity, 375
Gaussian kernel, 301, 393
Gaussian mixture, 214
Gaussian noise, 221
General circulation models (GCMs), 387
Generalised AR(1), 139
Generalised eigenvalue problem, 61, 66, 177, 324, 327, 357, 361, 396
Generalised inverse, 501
Generalised scalar product, 190
Generating kernels, 300
Geodesic distances, 211
Geometric constraints, 70, 71
Geometric moments, 167
Geometric properties, 72
Geometric sense, 62
Geopotential height, 62, 66, 68, 125, 180, 188, 260, 262, 382, 445
Geopotential height anomalies, 181
Geopotential height re-analyses, 157
Gibbs inequality, 273
Gini index, 435
Global scaling, 209
Global temperature, 284
Golden section, 523
Goodness-of-fit, 209, 259
Gradient, 85, 242, 386
Gradient ascent, 282
Gradient-based algorithms, 283
Gradient-based approaches, 526
Gradient-based method, 268
Gradient methods, 247
Gradient optimisation algorithms, 256
Gradient types algorithms, 282
Gram matrix, 301
Gram-Schmidt orthogonalization, 397

V
Validation, 3
Validation set, 47

Y
Young-Householder decomposition, 206
Yule-Walker equations, 175, 185