MA4K0 Notes
MA4K0 Introduction to
Uncertainty Quantification
T. J. Sullivan
Mathematics Institute
University of Warwick
Coventry CV4 7AL UK
[email protected]
DRAFT
Not For General Distribution
Version 11 (2013-10-16 15:45)
Contents
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction 3
1.1 What is Uncertainty Quantification? . . . . . . . . . . . . . . . . 3
1.2 Mathematical Prerequisites . . . . . . . . . . . . . . . . . . . . . 6
1.3 The Road Not Travelled . . . . . . . . . . . . . . . . . . . . . . . 7
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2 Bayesian Inversion in Banach Spaces . . . . . . . . . . . . . . . . 62
6.3 Well-Posedness and Approximation . . . . . . . . . . . . . . . . . 63
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.2 Linear Kálmán Filter . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Extended Kálmán Filter . . . . . . . . . . . . . . . . . . . . . . . 77
7.4 Ensemble Kálmán Filter . . . . . . . . . . . . . . . . . . . . . . . 78
7.5 Eulerian and Lagrangian Data Assimilation . . . . . . . . . . . . 80
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8 Orthogonal Polynomials 85
8.1 Basic Definitions and Properties . . . . . . . . . . . . . . . . . . 85
8.2 Recurrence Relations . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.3 Roots of Orthogonal Polynomials . . . . . . . . . . . . . . . . . . 90
9 Numerical Integration 99
9.1 Quadrature in One Dimension . . . . . . . . . . . . . . . . . . . . 99
9.2 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.3 Clenshaw–Curtis / Fejér Quadrature . . . . . . . . . . . . . . . . 104
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
12.1 Lax–Milgram Theory and Galerkin Projection . . . . . . . . . . . 136
12.2 Stochastic Galerkin Projection . . . . . . . . . . . . . . . . . . . 140
12.3 Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.1 Pseudo-Spectral Methods . . . . . . . . . . . . . . . . . . . . . . 150
13.2 Stochastic Collocation . . . . . . . . . . . . . . . . . . . . . . . . 150
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
14 Distributional Uncertainty 155
14.1 Maximum Entropy Distributions . . . . . . . . . . . . . . . . . . 155
14.2 Distributional Robustness . . . . . . . . . . . . . . . . . . . . . . 157
14.3 Functional and Distributional Robustness . . . . . . . . . . . . . 162
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Index 179
Preface
These notes are designed as an introduction to Uncertainty Quantification (UQ)
at the fourth year (senior) undergraduate or beginning postgraduate level, and
are aimed primarily at students from a mathematical (rather than, say, engi-
neering) background; the mathematical prerequisites are listed in Section 1.2,
and the early chapters of the text recapitulate some of this material in more
detail. These notes accompany the University of Warwick mathematics mod-
ule MA4K0 Introduction to Uncertainty Quantification; while the notes are in-
tended to be general, certain contextual remarks and assumptions about prior
knowledge will be Warwick-specific, and will be indicated by a large “W” in the
margin, like the one to the right. W
The aim is to give a survey of the main objectives in the field of UQ and a
few of the mathematical methods by which they can be achieved. There are,
of course, even more UQ problems and solution methods in the world that are
not covered in these notes, which are intended — with the exception of the
preliminary material on measure theory and functional analysis — to comprise
approximately 30 hours’ worth of lectures. For any grievous omissions in this
regard, I ask for your indulgence, and would be happy to receive suggestions for
improvements.
The exercises contain, by deliberate choice, a number of terribly ill-posed or
under-specified problems of the messy type often encountered in practice. It is
my hope that these exercises will encourage students to grapple with the ques-
tions of mathematical modelling that are a necessary precursor to doing applied
mathematics outside the tame classroom environment. Theoretical knowledge
is important; however, problem solving, which begins with problem formula-
tion, is an equally vital skill that too often goes neglected in undergraduate
mathematics courses.
These notes have benefitted, from initial conception to nearly finished prod-
uct, from discussions with many people. I would like to thank Charlie Elliott,
Dave McCormick, Mike McKerns, Michael Ortiz, Houman Owhadi, Clint Scovel,
Andrew Stuart, and all the students on the 2013–14 iteration of MA4K0 for their
useful comments.
T. J. S.
University of Warwick, U.K.
Wednesday 16th October, 2013
Introduction to Uncertainty Quantification
Chapter 1
Introduction
become more comfortable with probability and uncertainty. We must think more
carefully about the assumptions and beliefs that we bring to a problem.

The Signal and the Noise: The Art and Science of Prediction
Nate Silver
Uncertainty Quantification (UQ) is, roughly put, the coming together of proba-
bility theory and statistical practice with ‘the real world’. These two anecdotes
illustrate something of what is meant by this statement:
• Until 1990–1995, risk modelling for catastrophe insurance and re-insurance
(i.e. insurance for property owners against risks arising from earthquakes,
hurricanes, terrorism, &c., and then insurance for the providers of such
insurance) was done on a purely statistical basis. Since that time, catas-
trophe modellers have started to incorporate models for the underlying
physics or human behaviour, hoping to gain a more accurate predictive
more precise definition is that “UQ is the end-to-end study of the reliability
of scientific inferences.” [109, p. 135]
It is especially important to appreciate the “end-to-end” nature of UQ studies:
one is interested in relationships between the pieces of information / assumptions
involved, bearing in mind that they are only approximations of reality, not the
‘truth’. There is always a risk of ‘Garbage In, Garbage
Out’. A mathematician performing a UQ analysis cannot tell you that your
model is ‘right’ or ‘true’, but only that, if you accept the validity of the model
(to some quantified degree), then you must logically accept the validity of cer-
tain conclusions (to some quantified degree). Naturally, a respectable analysis
will include a meta-analysis examining the sensitivity of the original analysis to
perturbations of the governing assumptions. In the author’s view, this is the
proper interpretation of philosophically sound but practically unhelpful asser-
tions like “Verification and validation of numerical models of natural systems is
impossible” and “The primary value of models is heuristic.” [74].
Example 1.1. Consider the following elliptic boundary value problem on a con-
nected Lipschitz domain Ω ⊆ Rn (typically n = 2 or 3):
−∇ · (κ∇u) = f   in Ω,
u = 0   on ∂Ω.
This PDE is a simple but not naïve model for the pressure field u of some fluid
occupying a domain Ω, the permeability of which to the fluid is described by a
tensor field κ : Ω → Rn×n ; there is a source term f and the boundary condition
specifies that the pressure vanishes on the boundary of Ω. This simple model
is of interest in the earth sciences because Darcy’s law asserts that the velocity
field v of the fluid flow in this medium is related to the gradient of the pressure
field by
v = −κ∇u.
If the fluid contains some kind of contaminant, then one is naturally very inter-
ested in where fluid following the velocity field v will end up, and how soon.
In a course on PDE theory, you will learn that if the permeability field κ is
positive-definite and essentially bounded, then this problem has a unique weak
solution u in the Sobolev space H₀¹(Ω) for each forcing term f in the dual Sobolev
space H⁻¹(Ω). One objective of this course is to tell you that this is far from the
end of the story! As far as practical applications go, existence and uniqueness
of solutions is only the beginning. For one thing, this PDE model is only an
approximation of reality. Secondly, even if the PDE were a perfectly accurate
model, the ‘true’ κ and f are not known precisely, so our knowledge about u =
u(κ, f ) is also uncertain in some way. If κ and f are treated as random variables,
then u is also a random variable, and one is naturally interested in properties
of that random variable such as mean, variance, deviation probabilities &c. —
and to do so it is necessary to build up the machinery of probability on function
spaces.
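To make the forward propagation of uncertainty concrete, the following is a minimal
Monte Carlo sketch for a one-dimensional analogue of this problem; the finite-difference
discretization, the particular random field for κ, and all numerical values are
illustrative assumptions rather than anything prescribed in these notes.

import numpy as np

def solve_darcy_1d(kappa, f, n):
    # Finite-difference solve of -(kappa u')' = f on (0, 1) with u(0) = u(1) = 0,
    # given nodal values of kappa and f on a uniform grid of n + 1 points.
    h = 1.0 / n
    k_mid = 0.5 * (kappa[:-1] + kappa[1:])        # kappa at cell midpoints
    main = (k_mid[:-1] + k_mid[1:]) / h**2        # diagonal entries, interior nodes
    off = -k_mid[1:-1] / h**2                     # off-diagonal entries
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    u_interior = np.linalg.solve(A, f[1:-1])
    return np.concatenate(([0.0], u_interior, [0.0]))

rng = np.random.default_rng(0)
n = 100
x = np.linspace(0.0, 1.0, n + 1)
f = np.ones(n + 1)

# Monte Carlo over random log-permeability fields.
us = []
for _ in range(500):
    xi = rng.standard_normal(2)
    kappa = np.exp(0.5 * (xi[0] * np.sin(np.pi * x) + xi[1] * np.cos(np.pi * x)))
    us.append(solve_darcy_1d(kappa, f, n))
us = np.array(us)

# Sample statistics of the uncertain solution at the midpoint of the domain.
print("mean of u(1/2):", us[:, n // 2].mean())
print("std  of u(1/2):", us[:, n // 2].std())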
Another issue is that often we want to solve the inverse problem: we know
something about f and something about u and want to infer κ. Even a simple
inverse problem such as this one is of enormous practical interest: it is by
solving such inverse problems that oil companies attempt to infer the location
of oil deposits in order to make a profit, and seismologists the structure of the
planet in order to make earthquake predictions. Both of these problems, the
forward and inverse propagation of uncertainty, fall under the very general remit
of UQ. Furthermore, in practice, the fields f , κ and u are all discretized and
solved for numerically (i.e. approximately and finite-dimensionally), so it is of
interest to understand the impact of these discretization errors.
It is common to divide
uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-
certainty — from the Latin alea, meaning a die — refers to uncertainty about
an inherently variable phenomenon. Epistemic uncertainty — from the Greek
ἐπιστήμη, meaning knowledge — refers to uncertainty arising from lack of
knowledge. To a certain extent, the distinction is an imprecise one, and re-
peats the old debate between frequentist and subjectivist (e.g. Bayesian) statis-
ticians. Someone who was simultaneously a devout Newtonian physicist and
a devout Bayesian might argue that the results of dice rolls are not aleatoric
uncertainties — one simply doesn’t have complete enough information about
the initial conditions of the die, the material and geometry of the die, any gusts of
wind that might affect the flight of the die, &c. On the other hand, it is usually
clear that some forms of uncertainty are epistemic rather than aleatoric: for
example, when physicists say that they have yet to come up with a Theory of
Everything, they are expressing a lack of knowledge about the laws of physics
in our universe, and the correct mathematical description of those laws. In any
case, regardless of one’s favoured interpretation of probability, the language of
probability theory is a powerful tool in describing uncertainty.
A Word of Warning. In this second decade of the third millennium, there is as
yet no elegant unified theory of UQ. UQ is not a mature field like linear algebra or
single-variable complex analysis, with stately textbooks containing well-polished
presentations of classical theorems bearing august names like Cauchy, Gauss and
Hamilton. Both because of its youth as a field and its very close engagement with
applications, UQ is much more about problems, methods, and ‘good enough for
the job’. There are some very elegant approaches within UQ, but as yet no
single, general, over-arching theory of UQ.
1.2 Mathematical Prerequisites
Like any course, MA4K0 has certain prerequisites. If you are just following the
course for fun, and attending the lectures merely to stay warm and dry in what
is almost sure to be a fine English autumn, then good for you. However, if you
actually want to understand what is going on, then it’s better for your own
health if you can use your nearest time machine to ensure that you have already
If any of the symbols, concepts or terms used or implicit in that sentence give you
more than a few moments’ pause, then you should think again before attempting
MA4K0.
If, in addition, you have taken the following courses, then certain techniques,
examples and remarks will make more sense to you:
• MA117 Programming for Scientists
• MA228 Numerical Analysis
• MA250 Introduction to Partial Differential Equations
• MA398 Matrix Analysis and Algorithms
• MA3H0 Numerical Analysis and PDEs
1.3 The Road Not Travelled
There are many topics relevant to UQ that are either not covered or discussed
only briefly in these notes, including: detailed treatment of data assimilation
beyond the confines of the Kálmán filter and its variations; accuracy, stability
and computational cost of numerical methods; details of numerical implementa-
tion of optimization methods; stochastic homogenization; optimal control; and
machine learning.
Chapter 2
Recap of Measure and
Probability Theory
3. Counting measure:
   κ(E) := |E|, if E ∈ F is a finite set;   +∞, if E ∈ F is an infinite set.
If N is a subset of some
measurable set E ∈ F such that µ(E) = 0, then N is called a µ-null set. If the
set of x ∈ X for which some property P(x) does not hold is µ-null, then P is
said to hold µ-almost everywhere (or, when µ is a probability measure, µ-almost
surely). If every µ-null set is in fact an F -measurable set, then the measure
space (X , F , µ) is said to be complete.
When the sample space is a topological space, it is usual to use the Borel
σ-algebra (i.e. the smallest σ-algebra that contains all the open sets); measures
on the Borel σ-algebra are called Borel measures. Unless noted otherwise, this
is the convention followed in these notes.
The support of µ is defined to be
supp(µ) := ⋂ { F ⊆ X | F is closed and µ(X \ F) = 0 }.
That is, supp(µ) is the smallest closed subset of X that has full µ-measure.
Equivalently, supp(µ) is the complement of the union of all open sets of µ-
measure zero, or the set of all points x ∈ X for which every neighbourhood of
x has strictly positive µ-measure.
and probability measures µ on the power set of X are in bijection with row vectors
(µ({1}), …, µ({n}))
such that µ({i}) ≥ 0 for all i ∈ {1, …, n} and ∑_{i=1}^{n} µ({i}) = 1. As illustrated
in Figure 2.1, the set of such µ is the (n − 1)-dimensional simplex in Rⁿ spanned
by the unit Dirac masses δ₁, …, δₙ.
Looking ahead, the expected value of f under µ is exactly the matrix product:
E_µ[f] = ∑_{i=1}^{n} µ({i}) f(i) = ⟨µ | f⟩ = µ⊤f.
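Concretely, if the finite sample space is identified with {1, …, n} and a function
f : {1, …, n} → R is stored as a vector, this is just a vector–vector product; the
numbers below are purely illustrative.

import numpy as np

mu = np.array([0.5, 0.3, 0.2])   # probability measure on {1, 2, 3}, as a row vector
f = np.array([1.0, 4.0, 10.0])   # a function f : {1, 2, 3} -> R, as a column vector

expectation = mu @ f             # E_mu[f] = sum_i mu({i}) f(i) = mu^T f
print(expectation)               # 0.5*1 + 0.3*4 + 0.2*10 = 3.7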
Sir Michael Atiyah [4, Paper 160, p. 7]:
Or, as is traditionally but perhaps apocryphally said to have been inscribed over
the entrance to Plato’s Academy:
In a sense that will be made precise later, for any ‘nice’ space X , M1 (X ) is the
{µ ∈ M(X ) | Eµ [f ] ≤ c}
{µ ∈ M1 (X ) | Eµ [f1 ] ≤ c1 , . . . , Eµ [fm ] ≤ cm }
Given B ∈ F with µ(B) > 0, the conditional probability of E given B is
µ(E|B) := µ(E ∩ B) / µ(B)   for E ∈ F,
and Bayes' rule states that
µ(A|B) = µ(B|A) µ(A) / µ(B).
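On a finite sample space, Bayes' rule is a one-line computation; the prior and
likelihood values below are hypothetical and purely for illustration.

import numpy as np

prior = np.array([0.3, 0.7])          # mu(A) and mu(not A)
likelihood = np.array([0.9, 0.2])     # mu(B | A) and mu(B | not A)

evidence = np.sum(likelihood * prior)        # mu(B), by the law of total probability
posterior = likelihood * prior / evidence    # mu(A | B) and mu(not A | B)
print(posterior)                             # approx [0.659, 0.341]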
Figure 2.1: The probability simplex M₁({1, 2, 3}), drawn as the triangle
spanned by the unit Dirac masses δᵢ, i ∈ {1, 2, 3}, in the space of signed mea-
sures on {1, 2, 3}.
2.2 Random Variables and Stochastic Processes
Definition 2.8. Let (X , F ) and (Y, G ) be measurable spaces. A function
f : X → Y generates a σ-algebra on X by
σ(f) := σ( { [f ∈ E] | E ∈ G } ),
Definition 2.9. Let Ω be any set and let (Θ, F , µ) be a probability space. A
i ≤ j in I =⇒ Fi ⊆ Fj .
2.3 Aside: Interpretations of Probability
It is worth noting that the above discussions are purely mathematical: a prob-
ability measure is an abstract algebraic–analytic object with no necessary con-
nection to everyday notions of chance or probability. The question of what
interpretation of probability to adopt, i.e. what practical meaning to ascribe to
probability measures, is a question of philosophy and mathematical modelling.
The two main points of view are the frequentist and Bayesian perspectives. To
a frequentist, the probability µ(E) of an event E is the relative frequency of
occurrence of the event E in the limit of infinitely many independent but iden-
tical trials; to a Bayesian, µ(E) is a numerical representation of one’s degree of
belief in the truth of a proposition E. The frequentist’s point of view is objec-
tive; the Bayesian’s is subjective; both use the same mathematical machinery of
probability measures to describe the properties of the function µ.
Frequentists are careful to distinguish between parts of their analyses that
are fixed and deterministic versus those that have a probabilistic character.
However, for a Bayesian, any uncertainty can be described in terms of a suitable
with high probability, and hence to give a confidence interval for θ, but θ
itself does not have a distribution.
• To the Bayesian, θ is a random variable, and its distribution in advance
of seeing the data is encoded in a prior π. Upon seeing the data and
conditioning upon it using Bayes’ rule, the distribution of the parameter
is the posterior distribution π(θ|d). The posterior encodes everything that
is known about θ in view of π, L(y|θ) ∝ e^{−|y−θ|²/2}, and d, although this
information may be summarized by a single number such as the maximum
a posteriori estimator
provided that at least one of the integrals on the right-hand side is finite. The
integral of a complex-valued measurable function f : X → C is defined to be
∫_X f dµ := ∫_X Re f dµ + i ∫_X Im f dµ.
One of the major attractions of the Lebesgue integral is that, subject to a
simple domination condition, pointwise convergence of integrands is enough to
ensure convergence of integral values:
Theorem 2.15 (Dominated convergence theorem). Let (X , F , µ) be a measure
space and let fn : X → K be a measurable function for each n ∈ N. If f : X → K
is such that lim_{n→∞} f_n(x) = f(x) for every x ∈ X and there is a measurable
function g : X → [0, ∞] such that ∫_X |g| dµ is finite and |f_n(x)| ≤ g(x) for all
x ∈ X and all large enough n ∈ N, then
∫_X f dµ = lim_{n→∞} ∫_X f_n dµ.
where
‖f‖_{L^p(µ)} := ( ∫_X |f(x)|^p dµ(x) )^{1/p}
for 1 ≤ p < ∞ and
‖f‖_{L^∞(µ)} := inf { ‖g‖_∞ | f = g : X → K µ-almost everywhere }
             = inf { t ≥ 0 | |f| ≤ t µ-almost everywhere }.
To be more precise, Lp (X , µ; K) is the set of equivalence classes of such functions,
where functions that differ only on a set of µ-measure zero are identified.
Theorem 2.18 (Chebyshev's inequality). Let X ∈ L^p(Θ, µ; K), 1 ≤ p < ∞, be a
random variable. Then, for all t ≥ 0,
P_µ[ |X − E_µ[X]| ≥ t ] ≤ t^{−p} E_µ[ |X − E_µ[X]|^p ]. (2.1)
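A quick Monte Carlo sanity check of (2.1), with an exponential distribution chosen
purely for illustration (this example is not part of the notes):

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=100_000)
p, t = 2, 2.0

lhs = np.mean(np.abs(X - X.mean()) >= t)              # P[|X - E[X]| >= t]
rhs = np.mean(np.abs(X - X.mean()) ** p) / t ** p     # t^{-p} E[|X - E[X]|^p]
print(lhs, rhs)                                       # lhs should not exceed rhs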
above for complex-valued integrands. Many interesting UQ problems concern
random fields, i.e. random variables with values in infinite-dimensional spaces
of functions. For definiteness, consider a function f defined on a measure space
(X , F , µ) taking values in a Banach space V. There are two ways to proceed,
and they are in general inequivalent:
1. The strong integral or Bochner integral of f is defined by integrating simple
V-valued functions as in the construction of the Lebesgue integral, and
then defining
∫_X f dµ := lim_{n→∞} ∫_X φ_n dµ
whenever (φ_n)_{n∈N} is a sequence of simple functions such that the (scalar-
valued) Lebesgue integral ∫_X ‖f − φ_n‖ dµ converges to 0 as n → ∞. It
Theorem 2.20 (Radon–Nikodým). Suppose that µ and ν are σ-finite measures
on a measurable space (X , F ) and that ν ≪ µ. Then there exists a measurable
function ρ : X → [0, ∞] such that, for all measurable functions f : X → R and
all E ∈ F,
∫_E f dν = ∫_E f ρ dµ
whenever either integral exists. Furthermore, any two functions ρ with this
property are equal µ-almost everywhere.
The product σ-algebra F ⊗ G on X × Y is the σ-algebra generated by the measurable rectangles F × G, F ∈ F, G ∈ G.
Definition 2.23. Let (X ×Y, F , µ) be a measure space and suppose that the fac-
tor space X is equipped with a σ-algebra such that the projection Π_X : (x, y) ↦
x is a measurable function. Then the marginal measure µ_X is the measure on
X defined by
µX (E) := (ΠX )∗ µ (E) = µ(E × Y).
The marginal measure µY on Y is defined similarly.
Theorem 2.24. Let X = (X1 , X2 ) be a random variable taking values in a
product space X = X1 × X2 . Let µ be the (joint) distribution of X, and µi the
(marginal) distribution of Xi for i = 1, 2. Then X1 and X2 are independent
random variables if and only if µ = µ1 ⊗ µ2 .
The important property of integration with respect to a product measure,
and hence taking expected values of independent random variables, is that it
can be performed by iterated integration:
Theorem 2.25 (Fubini–Tonelli). Let (X , F , µ) and (Y, G , ν) be σ-finite measure
spaces, and let f : X × Y → [0, +∞] be measurable. Then, of the following three
T
integrals, if one exists in [0, ∞], then all three exist and are equal:
∫_X ∫_Y f(x, y) dν(y) dµ(x),   ∫_Y ∫_X f(x, y) dµ(x) dν(y),
and ∫_{X×Y} f(x, y) d(µ ⊗ ν)(x, y).
for each measurable E ⊆ Rd . The Gaussian measure N (0, I) is called the stan-
dard Gaussian measure. A Dirac measure δm can be considered as a degenerate
Gaussian measure on R, one with variance equal to zero.
Definition 2.29. Let µ be a probability measure on a Banach space V. An
element mµ ∈ V is called the mean of µ if
∫_V ⟨ℓ | x − m_µ⟩ dµ(x) = 0   for all ℓ ∈ V′,
so that ∫_V x dµ(x) = m_µ in the sense of a Pettis integral. If m_µ = 0, then
µ is said to be centred. The covariance operator is the symmetric operator
Cµ : V ′ × V ′ → K defined by
C_µ(k, ℓ) = ∫_V ⟨k | x − m_µ⟩ ⟨ℓ | x − m_µ⟩ dµ(x)   for all k, ℓ ∈ V′.
We often abuse notation and write Cµ : V ′ → V ′′ for the operator defined by
hCµ k | ℓi := Cµ (k, ℓ)
The inverse of Cµ , if it exists, is called the precision operator of µ.
Theorem 2.30 (Vakhania). Let µ be a Gaussian measure on a separable, reflexive
Banach space V with mean mµ ∈ V and covariance operator Cµ : V ′ → V. Then
the support of µ is the affine subspace of V that is the translation by the mean
of the closure of the range of the covariance operator, i.e.
supp(µ) = mµ + Cµ V ′ .
Corollary 2.31. For a Gaussian measure µ on a separable, reflexive Banach
space V, the following are equivalent:
1. µ is non-degenerate;
2. Cµ : V ′ → V is one-to-one;
3. Cµ V ′ = V.
for some positive-definite quadratic form Q on V ′ . Indeed, Q(ℓ) = Cµ (ℓ, ℓ). Fur-
thermore, if two Gaussian measures µ and ν have the same mean and covariance
operator, then µ = ν.
Theorem 2.34 (Fernique). Let µ be a centered Gaussian measure on a separable
Banach space V. Then there exists α > 0 such that
∫_V exp(α‖x‖²) dµ(x) < +∞.
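For orientation, in the simplest case µ = N(0, σ²) on R (a worked special case, not
part of the original text), a direct Gaussian integral shows exactly which α are
admissible:
∫_R exp(αx²) dµ(x) = (2πσ²)^{−1/2} ∫_R exp( −(1/(2σ²) − α) x² ) dx = (1 − 2ασ²)^{−1/2},
which is finite if and only if α < 1/(2σ²).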
for any orthonormal basis (e_n) of H, and (by Lidskiĭ's theorem) this equals
the sum of the eigenvalues of K, counted with multiplicity.
Theorem 2.36. Let µ be a centred Gaussian measure on a separable Hilbert
space H. Then Cµ : H → H is trace class and
tr(C_µ) = ∫_H ‖x‖² dµ(x).
Gaussian measures on such spaces are either equivalent or mutually singular
— there is no middle ground in the way that Lebesgue measure on [0, 1] has a
density with respect to Lebesgue measure on R but is not equivalent to it —
and surprisingly simple operations can destroy equivalence.
• µ and ν are equivalent, i.e. µ(E) = 0 ⇐⇒ ν(E) = 0; or
• µ and ν are mutually singular, i.e. there exists E such that µ(E) = 0 and
ν(E) = 1.
Furthermore, equivalence holds if and only if
1. R(C_µ^{1/2}) = R(C_ν^{1/2}) = E; and
2. m_µ − m_ν ∈ E; and
3. T := (C_µ^{−1/2} C_ν^{1/2})(C_µ^{−1/2} C_ν^{1/2})* − I is Hilbert–Schmidt in E.
Bibliography
At Warwick, this material is mostly covered in MA359 Measure Theory and W
ST318 Probability Theory. Gaussian measure theory in infinite-dimensional
spaces is covered in MA482 Stochastic Analysis and MA612 Probability on
Function Spaces and Bayesian Inverse Problems. Vakhania’s theorem (Theo-
rem 2.30) on the support of a Gaussian measure can be found in [110]. Fernique’s
Dempster [23] and Shafer [87].
Chapter 3
Recap of Banach and Hilbert
Spaces
This chapter covers the necessary concepts from linear functional analysis
on Hilbert and Banach spaces, in particular basic and useful constructions like
direct sums and tensor products. Like Chapter 2, this chapter is intended as a
review of material that should be understood as a prerequisite before proceeding;
to some extent, Chapters 2 and 3 are interdependent and so can (and should) be read
in parallel with one another.
for all x, y ∈ V. (3.1)
identity, then the unique inner product h · , · i that induces this norm is found
by the polarization identity
⟨x, y⟩ = ( ‖x + y‖² − ‖x − y‖² ) / 4 (3.3)
in the real case, and
⟨x, y⟩ = ( ‖x + y‖² − ‖x − y‖² ) / 4 + i ( ‖ix − y‖² − ‖ix + y‖² ) / 4 (3.4)
in the complex case.
Example 3.3. 1. For any n ∈ N, the coordinate space Kn is an inner product
space under the inner product ⟨x, y⟩ := ∑_{i=1}^{n} x̄ᵢ yᵢ.
In the real case, this is usually known as the dot product and denoted x · y.
2. For any m, n ∈ N, the space Km×n of m × n matrices is an inner product
space under the Frobenius inner product
⟨A, B⟩ ≡ A : B := ∑_{i=1,…,m} ∑_{j=1,…,n} a_{ij} b_{ij}.
3. Given a measure space (X , F , µ), the space L2 (X , µ; K) of (equivalence
classes modulo equality µ-almost everywhere) of square-integrable func-
tions from X to K is a Hilbert space with respect to the inner product
⟨f, g⟩_{L²(µ)} := ∫_X f(x) g(x) dµ(x). (3.5)
Note that it is necessary to take the quotient by the equivalence relation
of equality µ-almost everywhere since a function f that vanishes on a set
of full measure but is non-zero on a set of zero measure is not the zero
function but nonetheless has kf kL2(µ) = 0. When (X , F , µ) is a probabil-
ity space, elements of L2 (X , µ; K) are thought of as random variables of
finite variance, and (for centred random variables) the L² inner product is the covariance:
⟨X, Y⟩_{L²(µ)} := E_µ[X̄ Y] = cov(X, Y).
there is a K-linear map T : H → K such that hT x, T yiK = hx, yiH for all
x, y ∈ H.
Example 3.6. 1. For any compact topological space X , the space C(X ; K)
of continuous functions f : X → K is a Banach space with respect to the
supremum norm
‖f‖_∞ := sup_{x∈X} |f(x)|. (3.6)
are Banach spaces, but only the L2 spaces are Hilbert spaces.
Another family of Banach spaces that arises very often in PDE applications
is the family of Sobolev spaces. For the sake of brevity, we limit the discussion to
those Sobolev spaces that are Hilbert spaces. To save space, we use multi-index
notation for derivatives: α := (α1 , . . . , αn ) ∈ Nn0 denotes a multi-index, and
|α| := α1 + · · · + αn .
Definition 3.7. Let Ω ⊆ Rn , let α ∈ Nn0 , and consider f : Ω → R. A weak
derivative of order α for f : Ω → R is a function v : Ω → R such that
∫_Ω f(x) ( ∂^{|α|} φ(x) / ∂^{α₁}x₁ ⋯ ∂^{αₙ}xₙ ) dx = (−1)^{|α|} ∫_Ω v(x) φ(x) dx (3.9)
for every φ ∈ Cc∞ (Ω; R). Such a weak derivative is usually denoted Dα f , and
coincides with the classical (strong) derivative if it exists. The Sobolev space
H^k(Ω) is
H^k(Ω) := { f ∈ L²(Ω) | f has a weak derivative D^α f ∈ L²(Ω) for all α ∈ N₀ⁿ with |α| ≤ k } (3.10)
with the inner product
⟨u, v⟩_{H^k} := ∑_{|α|≤k} ⟨D^α u, D^α v⟩_{L²}. (3.11)
3.2 Dual Spaces and Adjoints
Definition 3.8. The (continuous) dual space of a normed space V over K is the
vector space V ′ of all continuous linear functionals ℓ : V → K. The dual pairing
between an element ℓ ∈ V ′ and an element v ∈ V is denoted hℓ | vi or simply
ℓ(v). For a linear functional ℓ, being continuous is equivalent to being bounded
in the sense that the dual norm
‖ℓ‖′ := sup { |⟨ℓ | v⟩| | v ∈ V, ‖v‖ ≤ 1 }
is finite.
Proposition 3.9. For every normed space V, the dual space V ′ is a Banach space
with respect to k · k′ .
An important property of Hilbert spaces is that they are naturally self-dual :
Given a basis {e_i}_{i∈I} of H, the corresponding dual basis {e_i}_{i∈I} of H is defined
by the relation hei , ej iH = δij . The matrix of A with respect to bases {ei }i∈I
of H and {fj }j∈J of K and the matrix of A∗ with respect to the corresponding
dual bases are very simply related: the one is the conjugate transpose of the
other, and so by abuse of terminology the conjugate transpose of a matrix is
often referred to as the adjoint.
Definition 3.11. A subset E of an inner product space V is said to be orthogonal
if hx, yi = 0 for all distinct elements x, y ∈ E; it is said to be orthonormal if
⟨x, y⟩ = 1, if x = y ∈ E;   0, if x, y ∈ E and x ≠ y.
Theorem 3.12 (Hilbert projection theorem). Let K be a closed subspace of a
Hilbert space H. Then H = K ⊕⊥ K⊥.
Proof. Deferred to Lemma 4.20 and the more general context of convex opti-
mization.
The operator ΠK : H → K is called the orthogonal projection onto K.
Theorem 3.13. Let K and L be closed subspaces of a Hilbert space H. The
corresponding orthogonal projection operators
1. are continuous linear operators of norm at most 1;
2. are such that I − ΠK = ΠK⊥ ;
projection of X onto the subspace L2 (Θ, G , µ; K). In elementary contexts, G
is usually taken to be the σ-algebra generated by a single event E of positive
µ-probability, i.e.
G = {∅, [X ∈ E], [X ∈ / E], Θ}.
The orthogonal projection point of view makes two important properties of
conditional expectation intuitively obvious:
1. Whenever G₁ ⊆ G₂ ⊆ F, L²(Θ, G₁, µ; K) is a subspace of L²(Θ, G₂, µ; K) and
composition of the orthogonal projections onto these subspace yields the
tower rule for conditional expectations:
E[X | G₁] = E[ E[X | G₂] | G₁ ],
and, in particular, taking G₁ to be the trivial σ-algebra {∅, Θ}, E[E[X | G₂]] = E[X].
2. Whenever X is G-measurable,
E[XY | G] = X E[Y | G].
Direct Sums. Suppose that V and W are vector spaces over a common field
K. The Cartesian product V × W can be given the structure of a vector space
over K by defining the operations componentwise:
(v, w) + (v ′ , w′ ) := (v + v ′ , w + w′ ),
α(v, w) := (αv, αw),
are mutually orthogonal. For this reason, the orthogonality of the two sum-
mands in a Hilbert direct sum is sometimes emphasized by the notation H ⊕⊥ K.
The Hilbert space projection theorem (Theorem 3.12) was the statement that
whenever K is a closed subspace of a Hilbert space H, H = K ⊕⊥ K⊥.
It is necessary to be a bit more careful in defining the direct sum of countably
many Hilbert spaces. Let H_n be a Hilbert space over K for each n ∈ N. Then
the Hilbert space direct sum H := ⨁_{n∈N} H_n is defined to be the completion of
{ x = (x_n)_{n∈N} | x_n ∈ H_n for each n ∈ N, and x_n = 0 for all but finitely many n }
with respect to the norm induced by the inner product ⟨x, y⟩_H := ∑_{n∈N} ⟨x_n, y_n⟩_{H_n},
which is always a finite sum when applied to elements of the generating set.
This construction ensures that every element x of H has finite norm
‖x‖²_H = ∑_{n∈N} ‖x_n‖²_{H_n}. As before, each of the summands H_n is a subspace of H that is
orthogonal to all the others.
Orthogonal direct sums and orthogonal bases are among the most important
constructions in Hilbert space theory, and will be very useful in what follows.
The prototypical example to bear in mind is the Fourier basis {en | n ∈ Z} of
L2 (S1 ; C), where
e_n(x) := (1/2π) exp(−inx).
Indeed, Fourier's claim^(3.1) that any periodic function f could be written as
f(x) = ∑_{n∈Z} f̂_n e_n(x),   f̂_n := ∫_{S¹} f(y) e_n(y) dy,
can be seen as one of the historical drivers behind the development of much of
analysis. Other important examples are the systems of orthogonal polynomials
that will be considered in Chapter 8. Some important results about orthogonal
systems are summarized below:
^(3.1) Of course, Fourier did not use the modern notation of Hilbert spaces! Furthermore, if he
had, then it would have been ‘obvious’ that his claim could only hold true for L² functions
and in the L² sense, not pointwise for arbitrary functions.
Theorem 3.16 (Parseval identity). Let H be a Hilbert space, let (en )n∈N be an
orthonormal sequence in H, and let (αn )n∈N be a sequence in K. Then the series
∑_{n∈N} α_n e_n converges in H if and only if the series ∑_{n∈N} |α_n|² converges in R,
in which case
‖ ∑_{n∈N} α_n e_n ‖² = ∑_{n∈N} |α_n|². (3.13)
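As a finite-dimensional numerical illustration (a minimal sketch, not from the notes,
using the orthonormal discrete Fourier basis of Cᴺ), Parseval's identity can be
checked directly:

import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# The unitarily normalized DFT expresses x in an orthonormal basis of C^N,
# so the sum of squared coefficient moduli equals the squared norm of x.
coeffs = np.fft.fft(x, norm="ortho")
print(np.linalg.norm(x) ** 2, np.sum(np.abs(coeffs) ** 2))   # equal up to rounding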
The following conditions on an orthonormal sequence (e_n)_{n∈N} in a Hilbert space H are equivalent:
1. {e_n | n ∈ N}⊥ = {0};
2. H is the closed linear span of {e_n | n ∈ N};
3. H = ⨁_{n∈N} K e_n as a direct sum of Hilbert spaces;
4. for all x ∈ H, ‖x‖² = ∑_{n∈N} |⟨x, e_n⟩|²;
5. for all x ∈ H, x = ∑_{n∈N} ⟨x, e_n⟩ e_n.
If one (and hence all) of these conditions holds true, then (en )n∈N is called a
complete orthonormal basis for H.
Corollary 3.19. Let (en )n∈N be a complete orthonormal basis for H. For every
x ∈ H, the truncation error x − ∑_{n=1}^{N} ⟨x, e_n⟩ e_n is orthogonal to span{e₁, …, e_N}.
Proof. Let v := ∑_{n=1}^{N} v_n e_n be any element of span{e₁, …, e_N}. By complete-
ness, x = ∑_{n∈N} ⟨x, e_n⟩ e_n. Hence,
⟨ x − ∑_{n=1}^{N} ⟨x, e_n⟩ e_n , v ⟩ = ⟨ ∑_{n>N} ⟨x, e_n⟩ e_n , ∑_{n=1}^{N} v_n e_n ⟩
                                    = ∑_{n>N, m∈{1,…,N}} ⟨ ⟨x, e_n⟩ e_n , v_m e_m ⟩
                                    = ∑_{n>N, m∈{1,…,N}} ⟨x, e_n⟩ v_m ⟨e_n, e_m⟩
                                    = 0.
The ‘freeness’ of FV×W is that the elements e(v,w) are, by definition linearly
independent for distinct pairs (v, w) ∈ V × W. Now define an equivalence
relation ∼ on FV×W such that
Definition 3.21. The (algebraic) tensor product V ⊗ W is the quotient space
V ⊗ W := F_{V×W} / R.
One can easily check that V ⊗ W, as defined in this way, is indeed a vector
space over K. The subspace R of FV×W is mapped to the zero element of V ⊗ W
under the quotient map, and so the above equivalences become equalities in the
tensor product space:
(v + v ′ ) ⊗ w = v ⊗ w + v ′ ⊗ w,
v ⊗ (w + w′ ) = v ⊗ w + v ⊗ w′ ,
α(v ⊗ w) = (αv) ⊗ w = v ⊗ (αw)
norm.
Tensor products of Hilbert spaces arise very naturally when considering
spaces of functions of more than one variable, or spaces of functions that take
values in other function spaces. A prime example of the second type is a space
of stochastic processes.
Example 3.23. 1. Given two measure spaces (X , F , µ) and (Y, G , ν), con-
sider L2 (X × Y, µ ⊗ ν; K), the space of functions on X × Y that are square
integrable with respect to the product measure µ ⊗ ν. If f ∈ L2 (X , µ; K)
and g ∈ L2 (Y, ν; K), then we can define a function h : X × Y → K by
h(x, y) := f (x)g(y). The definition of the product measure ensures that
L²(X, µ; K) ⊗ L²(Y, ν; K) ≅ L²(X × Y, µ ⊗ ν; K),
2. Similarly, L2 (X , µ; H), the space of functions f : X → H that are square
integrable in the sense that
∫_X ‖f(x)‖²_H dµ(x) < +∞,
in L2 (X , µ; H).
3. Combining the previous two examples reveals that
L²(X, µ; K) ⊗ L²(Y, ν; K) ≅ L²(X × Y, µ ⊗ ν; K) ≅ L²(X, µ; L²(Y, ν; K)).
Similarly, one can consider Bochner spaces of functions (random variables)
taking values in a Banach space V that are pth-power-integrable in the sense that
∫_X ‖f(x)‖^p_V dµ(x) is finite, and identify this with a suitable tensor product
Lp (X , µ; R) ⊗ V. However, several subtleties arise in doing this, as there is no
single ‘natural’ tensor product of Banach spaces as there is for Hilbert spaces.
Readers who are interested in such spaces should consult the book of Ryan [85].
Bibliography
W At Warwick, the theory of Hilbert and Banach spaces is covered in courses such
as MA3G7 Functional Analysis I and MA3G8 Functional Analysis II. Sobolev
spaces are covered in MA4A2 Advanced PDEs.
Classic reference texts on elementary functional analysis, including Banach
and Hilbert space theory, include the monographs of Reed & Simon [79], Rudin [83],
and Rynne & Youngson [86]. Further discussion of the relationship between ten-
Chapter 4

Basic Optimization Theory
The Hitchhiker’s Guide to the Galaxy
Douglas Adams
This chapter reviews the basic elements of optimization theory and practice,
without going into the fine details of numerical implementation.
all x in a given subset of the domain of f , along with the point or points x that
realize those extreme values. The general form of a constrained optimization
problem is
extremize: f (x)
with respect to: x ∈ X
subject to: gi (x) ∈ Ei for i = 1, 2, . . . ,
where X is some set; f : X → R ∪ {±∞} is a function called the objective
function; and, for each i, gi : X → Yi is a function and Ei ⊆ Yi some subset.
over X is the same as unconstrained minimization over the feasible set. How-
ever, from a practical standpoint, the difference is huge. Typically, X is Rn for
some n, or perhaps a simple subset specified using inequalities on one coordinate
at a time, such as [a1 , b1 ] × · · · × [an , bn ]; a bona fide non-trivial constraint is
one that involves a more complicated function of one coordinate, or two or more
coordinates, such as
g1 (x) := cos(x) − sin(x) > 0
or
g2 (x1 , x2 , x3 ) := x1 x2 − x3 = 0.
Definition 4.1. The arg min or set of global minimizers of f : X → R ∪ {±∞}
is defined to be
arg min_{x∈X} f(x) := { x ∈ X | f(x) = inf_{x′∈X} f(x′) },
and the arg max or set of global maximizers of f is defined to be
arg max_{x∈X} f(x) := { x ∈ X | f(x) = sup_{x′∈X} f(x′) }.
x_{n+1} := x_n − (Df(x_n))^{−1} f(x_n).
Newton’s method is often applied to find critical points of f , i.e. points where
Df vanishes, in which case the iteration is
x_{n+1} := x_n − (D²f(x_n))^{−1} Df(x_n).
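For a one-dimensional illustration of this second iteration (with a hypothetical
objective chosen only for this example, not taken from the notes):

def newton_critical_point(df, d2f, x0, n_steps=20):
    # Newton iteration on the derivative: x_{n+1} = x_n - D2f(x_n)^{-1} Df(x_n).
    x = x0
    for _ in range(n_steps):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x^4 - 3x^2 + x; its critical points are the roots of f'.
df = lambda x: 4 * x**3 - 6 * x + 1
d2f = lambda x: 12 * x**2 - 6
print(newton_critical_point(df, d2f, x0=1.0))   # converges to a local minimizer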
mate minimizer x_best along with the corresponding value of f. Suppose that
random() generates independent samples of X from a probability measure µ
with support X .
f_best = +inf
n = 0
while n < n_max:
    x_new = random()
    f_new = f(x_new)
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]
A weakness of Algorithm 4.4 is that it completely neglects local information
about f. Even if the current state x_best is very close to the global minimizer
x_min, the algorithm may continue to sample points x_new that are very far
away and have f(x_new) ≫ f(x_best). It would be preferable to explore a
n = 0
while n < n_max:
    x_new = x_best + jump()
    f_new = f(x_new)
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]
Algorithm 4.5 also has a weakness: since the state is only ever updated to
states with a strictly lower value of f, and only looks for new states within
unit distance of the current one, the algorithm is prone to becoming stuck in
local minima if they are surrounded by wells that are sufficiently wide, even if
they are very shallow. The next algorithm, the simulated annealing method,
attempts to rectify this problem by allowing the optimizer to make some ‘up-
hill’ moves, which can be accepted or rejected according to comparison of a
uniformly-distributed random variable with a user-prescribed acceptance prob-
ability function. Therefore, in the simulated annealing algorithm, a distinction
is made between the current state x of the algorithm and the best state so far,
x_best; unlike in the previous two algorithms, proposed states x_new may be
accepted and become x even if f(x_new) > f(x_best). The idea is to introduce
a parameter T, to be thought of as ‘temperature’: the optimizer starts off ‘hot’,
and ‘uphill’ moves are likely to be accepted; by the end of the calculation, the
optimizer is relatively ‘cold’, and ‘uphill’ moves are unlikely to be accepted.
Algorithm 4.6 (Simulated annealing). Suppose that an initial state x0 is given,
and that functions temperature(), neighbour() and acceptance_prob() have
been specified. Suppose that uniform() generates independent samples from
T
the uniform distribution on [0, 1]. Then the simulated annealing algorithm is
x = x0
fx = f(x)
x_best = x
f_best = fx
n = 0
while n < n_max:
    T = temperature(n / n_max)
    x_new = neighbour(x)
    f_new = f(x_new)
    if acceptance_prob(fx, f_new, T) > uniform():
        x = x_new
        fx = f_new
        if f_new < f_best:
            x_best = x_new
            f_best = f_new
    n = n + 1
return [x_best, f_best]
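For concreteness, one common Metropolis-style choice of the user-specified functions
temperature(), neighbour() and acceptance_prob() is sketched below; the linear cooling
schedule and Gaussian proposal are illustrative assumptions, not prescriptions of these notes.

import math
import random as rnd

def temperature(fraction_elapsed):
    # Linear cooling from T = 1 down towards 0.
    return max(1.0 - fraction_elapsed, 1e-6)

def neighbour(x):
    # Propose a small random perturbation of the current state.
    return x + rnd.gauss(0.0, 0.1)

def acceptance_prob(f_old, f_new, T):
    # Always accept downhill moves; accept uphill moves with Boltzmann probability.
    if f_new < f_old:
        return 1.0
    return math.exp(-(f_new - f_old) / T)

def uniform():
    return rnd.random()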
Like Algorithm 4.4, the simulated annealing method can be guaranteed to find
the global minimizer of f provided that the neighbour() function allows full ex-
ploration of the state space and the maximum run time n_max is large enough.
is a constrained extremizer of f subject to the constraint that g(u0 ) = 0. Suppose
also that the Fréchet derivative Dg(u0 ) : X → Y is surjective. Then there exists
a Lagrange multiplier λ0 ∈ Y ′ such that (u0 , λ0 ) is an unconstrained critical
point of the Lagrangian L defined by
L(u, λ) := f(u) + ⟨λ | g(u)⟩.
The corresponding result for inequality constraints is the Karush–Kuhn–
Tucker theorem:
Theorem 4.8 (Karush–Kuhn–Tucker). Suppose that x∗ ∈ Rn is a local minimizer
of f ∈ C 1 (Rn ; R) subject to inequality constraints gi (x) ≤ 0 and equality con-
straints hj (x) = 0, where gi , hj ∈ C 1 (Rn ; R) for i = 1, . . . , m and j = 1, . . . ℓ.
Then there exist µ ∈ R^m and λ ∈ R^ℓ such that
∇f(x*) + ∑_{i=1}^{m} µᵢ ∇gᵢ(x*) + ∑_{j=1}^{ℓ} λⱼ ∇hⱼ(x*) = 0,
where x* is feasible, and µ satisfies the dual feasibility criteria µᵢ ≥ 0 and the
complementary slackness criteria µi gi (x∗ ) = 0 for i = 1, . . . , m.
minimize: f (x)
with respect to: x ∈ X
subject to: c(x) ≤ 0
putational time, and this approach takes no account of how ‘nearly feasible’ an
infeasible x’ might be.
One alternative approach is to use penalty functions: instead of considering
the constrained problem of minimizing f (x) subject to c(x) ≤ 0, one can consider
the unconstrained problem of minimizing x 7→ f (x)+p(x), where p : X → [0, ∞)
is some function that equals zero on the feasible set and takes larger values the
‘more’ the constraint inequality c(x) ≤ 0 is violated; e.g., for µ > 0,
p_µ(x) := 0, if c(x) ≤ 0;   e^{c(x)/µ} − 1, if c(x) > 0.
The hope is that (a) the minimization of f + pµ over all of X is easy, and (b) as
AF
µ → 0, minimizers of f + p_µ converge to minimizers of f on the original feasible
set. The penalty function approach is attractive, but the choice of penalty
function is rather ad hoc, and issues can easily arise of competition between the
penalties corresponding to multiple constraints.
An alternative to the use of penalty functions is to construct constraining
functions that enforce the constraints exactly. That is, we seek a function C()
that takes as input a possibly infeasible x’ and returns some x_new = C(x’)
that is guaranteed to satisfy the constraint c(x_new) <= 0. For example, sup-
pose that X = Rn and the feasible set is the Euclidean unit ball, so the constraint
is
c(x) := kxk22 − 1 ≤ 0.
Then a suitable constraining function could be
C(x) := x, if ‖x‖₂ ≤ 1;   x/‖x‖₂, if ‖x‖₂ > 1.
Constraining functions are very attractive because the constraints are treated
exactly. However, they must often be designed on a case-by-case basis for each
—
constraint function c, and care must be taken to ensure that multiple constrain-
ing functions interact well and do not unduly favour parts of the feasible set
over others; for example, the above constraining function C maps the entire in-
feasible set to the unit sphere, which might be considered undesirable in certain
settings, and so a function such as
C̃(x) := x, if ‖x‖₂ ≤ 1;   x/‖x‖₂², if ‖x‖₂ > 1
might be more appropriate. Finally, note that the original accept/reject method
of finding feasible states is a constraining function in this sense, albeit a very
inefficient one.
More generally, given points x0 , . . . , xn of a vector space, a sum of the form
α0 x0 + · · · + αn xn
is called a linear combination if the αi are any field elements, an affine combi-
nation if their sum is 1, and a convex combination if they are non-negative and
sum to 1.
Definition 4.9. A subset K ⊆ X is a convex set if, for all x₀, x₁ ∈ K and
t ∈ [0, 1], x_t := (1 − t)x₀ + t x₁ ∈ K; it is said to be strictly convex if x_t ∈ K̊ whenever x₀ and
x1 are distinct points of K̄ and t ∈ (0, 1). An extreme point of a convex set K
is a point of K that cannot be written as a non-trivial convex combination of
distinct elements of K; the set of all extreme points of K is denoted ext(K).
The convex hull co(S) (resp. closed convex hull co(S)) of S ⊆ X is defined to
be the intersection of all convex (resp. closed and convex) subsets of X that
contain S.
Example 4.10. The square [−1, 1]2 is a convex subset of R2 , but is not strictly
convex, and its extreme points are the four vertices (±1, ±1). The closed unit
The reason that these notes restrict attention to Hausdorff, locally convex
topological vector spaces X is that it is just too much of a headache to work
with spaces for which the following ‘common sense’ results do not hold:
Figure: (a) A convex subset of the plane (grey) and its extreme points (black).
(b) A non-convex subset of the plane (black) and its convex hull (grey).
Then there exists a probability measure µ supported on ext(K) such that, for all
affine functions f on K,
f(c) = ∫_{ext(K)} f(e) dµ(e).
be obtained as an average of its values at the extremal points in the same way.
Definition 4.14. Let K ⊆ X be convex. A function f : K → R ∪ {±∞} is a
convex function if, for all x0 , x1 ∈ K and t ∈ [0, 1],
f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ),
and is called a strictly convex function if, for all distinct x0 , x1 ∈ K and t ∈
(0, 1),
f (xt ) < (1 − t)f (x0 ) + tf (x1 ).
It is straightforward to see that f : K → R ∪ {±∞} is convex (resp. strictly
Remark. Note well that Theorem 4.15 does not assert the existence of mini-
mizers, for which simultaneous compactness of K and lower semicontinuity of
f is required. For example:
• the exponential function on R is strictly convex, continuous and bounded
below by 0 yet has no minimizer;
• the interval [−1, 1] is compact, and the function f : [−1, 1] → R ∪ {±∞}
defined by
f(x) := x, if |x| < 1/2;   +∞, if |x| ≥ 1/2,
is convex, yet f has no minimizer — although inf_{x∈[−1,1]} f(x) = −1/2, there
is no x for which f (x) attains this infimal value.
Proof. 1. Suppose that x0 is a local minimizer of f in K that is not a global
minimizer: that is, suppose that x0 is a minimizer of f in some open
neighbourhood N of x0 , and also that there exists x1 ∈ K \ N such that
f (x1 ) < f (x0 ). Then, for sufficiently small t > 0, xt ∈ N , but convexity
implies that
f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ) < (1 − t)f (x0 ) + tf (x0 ) = f (x0 ),
which contradicts the assumption that x0 is a minimizer of f in N .
2. Suppose that x0 , x1 ∈ K are global minimizers of f . Then, for all t ∈ [0, 1],
xt ∈ K and
f (x0 ) ≤ f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ) = f (x0 ).
Hence, xt ∈ arg minK f , and so arg minK f is convex.
3. Suppose that x0 , x1 ∈ K are distinct global minimizers of f , and let
2. In practice, many problems are not obviously convex programs, but
can be transformed into convex programs by e.g. a cunning change of
variables. Being able to spot the right equivalent problem is a major part
of the art of optimization.
explore the interior of the feasible set in search of the solution to the convex
program, while being kept away from the boundary of the feasible set by a bar-
rier function. The discussion that follows is only intended as an outline; for
details, see Chapter 11 of Boyd & Vandenberghe [16].
Consider the convex program
minimize: f (x)
with respect to: x ∈ Rn
subject to: ci (x) ≤ 0 for i = 1, . . . , m,
minimize: c⊤x
with respect to: x ∈ Rⁿ
subject to: Ax ≤ b
x ≥ 0,
where c ∈ Rn , A ∈ Rm×n and b ∈ Rm are given, and the two inequalities are
interpreted componentwise.
Note that the feasible set for a linear program is an intersection of finitely
many half-spaces of Rn , i.e. a polytope. This polytope may be empty, in which
case the constraints are mutually contradictory and the program is said to be
infeasible. Also, the polytope may be unbounded in the direction of c, in which
case the extreme value of the problem is infinite.
Since linear programs are special cases of convex programs, methods such as
interior point methods are applicable to linear programs as well. Such methods
approach the optimum point x∗ , which is necessarily an extremal element of
the feasible polytope, from the interior of the feasible polytope. Historically,
however, such methods were preceded by methods such as Dantzig’s simplex
algorithm, which sets out to directly explore the set of extreme points in a
(hopefully) efficient way. Although the theoretical worst-case complexity
and hence
‖2y − (x_m + x_n)‖² + ‖x_n − x_m‖² ≤ 4I² + 2/n + 2/m.
Since K is convex, ½(x_m + x_n) ∈ K, so the first term on the left-hand side above
is bounded below as follows:
‖2y − (x_m + x_n)‖² = 4 ‖y − (x_m + x_n)/2‖² ≥ 4I².
Hence,
‖x_n − x_m‖² ≤ 4I² + 2/n + 2/m − 4I² = 2/n + 2/m,
and so the sequence (x_n)_{n∈N} is Cauchy; since H is complete and K is closed,
this sequence converges to some x̂ ∈ K. Since the norm ‖·‖ is continuous,
‖y − x̂‖ = I.
Proof. Let J(x) := ½‖x − b‖², which has the same minimizers as x ↦ ‖x − b‖; by
Lemma 4.20, such a minimizer exists and is unique. Suppose that (x − b) ⊥ V
and let y ∈ V . Then y − x ∈ V and so (y − x) ⊥ (x − b). Hence, by Pythagoras’
theorem,
ky − bk2 = ky − xk2 + kx − bk2 ≥ kx − bk2 ,
and so x minimizes J.
Conversely, suppose that x minimizes J. Then, for every y ∈ V,
0 = ∂/∂λ J(x + λy) = ½ ( ⟨y, x − b⟩ + ⟨x − b, y⟩ ) = Re⟨x − b, y⟩
and, in the complex case,
0 = ∂/∂λ J(x + λiy) = ½ ( −i⟨y, x − b⟩ + i⟨x − b, y⟩ ) = −Im⟨x − b, y⟩.
Hence, hx − b, yi = 0, and since y was arbitrary, (x − b) ⊥ V .
the equations on the right-hand side being known as the normal equations.
Weighting and Regularization. It is common in practice that one does not
want to minimize the K-norm directly, but perhaps some re-weighted version
of the K-norm. This re-weighting is accomplished by a self-adjoint and positive
definite operator on K.
Corollary 4.23 (Weighted least squares). Let A : H → K be a linear operator
between Hilbert spaces such that R(A) is a closed subspace of K. Let Q : K → K
be self-adjoint and positive-definite and let
hk, k ′ iQ := hk, Qk ′ iK
Then, given b ∈ K,
x̂ ∈ arg min_{x∈H} ‖Ax − b‖_Q ⟺ A*QAx̂ = A*Qb.
that ‘the right solution’ should be close to some initial guess x0 . A technique
that accomplishes both of these aims is Tikhonov regularization:
Corollary 4.24 (Tikhonov-regularized least squares). Let A : H → K be a linear
operator between Hilbert spaces such that R(A) is a closed subspace of K, let
Q : H → H be self-adjoint and positive-definite, and let b ∈ K and x0 ∈ H be
given. Let
J(x) := kAx − bk2 + kx − x0 k2Q .
Then
x̂ ∈ arg min_{x∈H} J(x) ⟺ (A*A + Q)x̂ = A*b + Qx₀.
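In finite dimensions these regularized normal equations can be solved directly; the
matrices below are randomly generated and purely illustrative.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))     # the operator A : H -> K
b = rng.standard_normal(10)          # the data b
Q = 0.1 * np.eye(3)                  # self-adjoint, positive-definite Q
x0 = np.zeros(3)                     # initial guess x_0

# Solve (A*A + Q) x = A*b + Q x0, i.e. minimize ||Ax - b||^2 + ||x - x0||_Q^2.
x_hat = np.linalg.solve(A.T @ A + Q, A.T @ b + Q @ x0)
print(x_hat)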
The aim is to find θ to minimize the objective function J(θ) := kr(θ)k22 . Let
A := [ ∂rᵢ(θ)/∂θⱼ ]_{i=1,…,m; j=1,…,p} |_{θ=θₙ} ∈ R^{m×p}
be the Jacobian matrix of the residual vector, and note that A = −DF(θₙ), where
F(θ) := ( f(x₁; θ), …, f(x_m; θ) )⊤ ∈ R^m.
Thus, to approximately minimize ‖r(θ)‖₂, we seek the step δ := θ − θₙ for which
the linearized residual r(θₙ) + Aδ vanishes. This is an ordinary linear
least squares problem, the solution of which is given by the normal equations as
δ = −(A*A)^{−1} A* r(θₙ).
Thus, we obtain the Gauss–Newton iteration for a sequence (θₙ)_{n∈N} of approx-
imate minimizers of J:
θ_{n+1} := θₙ − (A*A)^{−1} A* r(θₙ).
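A compact numerical sketch of the Gauss–Newton iteration, using a hypothetical
exponential-decay model and synthetic data that are not part of the notes:

import numpy as np

x_data = np.linspace(0.0, 4.0, 20)
y_data = 2.0 * np.exp(-1.5 * x_data)          # synthetic, noise-free observations

def residual(theta):
    # r_i(theta) = y_i - f(x_i; theta) with model f(x; theta) = theta_0 exp(-theta_1 x)
    return y_data - theta[0] * np.exp(-theta[1] * x_data)

def jacobian(theta):
    # A[i, j] = dr_i / dtheta_j
    e = np.exp(-theta[1] * x_data)
    return np.column_stack([-e, theta[0] * x_data * e])

theta = np.array([1.0, 1.0])
for _ in range(10):
    A, r = jacobian(theta), residual(theta)
    theta = theta - np.linalg.solve(A.T @ A, A.T @ r)    # Gauss-Newton step
print(theta)                                             # approaches [2.0, 1.5]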
Bibliography
Direct and iterative methods for the solution of linear least squares problems
W are covered in MA398 Matrix Analysis and Algorithms.
The book of Boyd & Vandenberghe [16] is an excellent reference on the
theory and practice of convex optimization, as is the associated software library
cvxopt. The classic reference for convex analysis in general is the monograph of
Rockafellar [82]. A standard reference on numerical methods for optimization
is the book of Nocedal & Wright [72].
For constrained global optimization in the absence of ‘nice’ features, particu-
larly for the UQ methods in Chapter 14, the author recommends the Differential
Evolution algorithm [77, 97] within the Mystic framework.
Exercises
Exercise 4.1. Let k · k be a norm on a vector space X , and fix y ∈ X . Show
that the function J : X → [0, ∞) defined by J(x) := ky − xk2 is strictly convex.
Exercise 4.2. Let A : H → K be a linear operator between Hilbert spaces such
that R(A) is a closed subspace of K. Let Q : K → K be self-adjoint and positive-
definite and let
hk, k ′ iQ := hk, Qk ′ iK
Show that, given b ∈ K,
x̂ ∈ arg min_{x∈H} ‖Ax − b‖_Q ⟺ A*QAx̂ = A*Qb.
J(x) := kAx − bk2 + kx − x0 k2Q .
Show that
x̂ ∈ arg min_{x∈H} J(x) ⟺ (A*A + Q)x̂ = A*b + Qx₀.
Hint: Consider the operator from H into K⊕H given in block form as [A, Q1/2 ]⊤ .
Chapter 5
Measures of Information and
Uncertainty
Donald Rumsfeld
Sometimes, nothing more can be said about some unknown quantity than a
range of possible values, with none more or less probable than any other. In
the case of an unknown real number x, such information may boil down to an
interval such as [a, b] in which x is known to lie. This is, of course, a very basic
form of uncertainty, and one may simply summarize the degree of uncertainty
by the length of the interval.
Interval Arithmetic. As well as summarizing the degree of uncertainty by the
length of the interval estimate, it is often of interest to manipulate the interval
estimates themselves as if they were numbers. One commonly-used method of
manipulating interval estimates of real quantities is interval arithmetic. Each of
the basic arithmetic operations ∗ ∈ {+, −, ·, /} is extended to intervals A, B ⊆ R
by
A ∗ B := {x ∈ R | x = a ∗ b for some a ∈ A, b ∈ B}.
Hence,
[a, b] + [c, d] = [a + c, b + d],
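The remaining arithmetic operations follow the same pattern; here is a minimal
Python sketch (not from the notes) of interval addition and multiplication, with
intervals represented as pairs of endpoints:

def interval_add(A, B):
    # [a, b] + [c, d] = [a + c, b + d]
    (a, b), (c, d) = A, B
    return (a + c, b + d)

def interval_mul(A, B):
    # [a, b] * [c, d]: take the min and max over all products of endpoints
    (a, b), (c, d) = A, B
    products = [a * c, a * d, b * c, b * d]
    return (min(products), max(products))

print(interval_add((1, 2), (-1, 3)))   # (0, 5)
print(interval_mul((1, 2), (-1, 3)))   # (-2, 6)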
If V(µ) is small (resp. large), then we are relatively certain (resp. uncertain)
that X ∼ µ is in fact quite close to m. A more refined variance-based measure
of informativeness is the covariance operator
C(µ) := E_{X∼µ}[ (X − m) ⊗ (X − m) ].
A distribution µ for which the operator norm of C(µ) is large may be said to
be a relatively uninformative distribution. Note that when X = Rn , C(µ) is an
n × n symmetric positive-definite matrix. Hence, such a C(µ) has n positive
real eigenvalues (counted with multiplicity)
λ1 ≥ λ2 ≥ · · · ≥ λn > 0,
(Qψ)(x) := xψ(x).
In general, for a fixed unit-norm element ψ ∈ H, the expected value ⟨A⟩ and
variance V(A) ≡ σ_A² of a self-adjoint operator A : H → H are defined by
σ_B² = ⟨g, g⟩ = ‖g‖².
|⟨f, g⟩|² = Re(⟨f, g⟩)² + Im(⟨f, g⟩)²
          = ( (⟨f, g⟩ + ⟨g, f⟩)/2 )² + ( (⟨f, g⟩ − ⟨g, f⟩)/(2i) )².
Using the self-adjointness of A and B,
The negative sign in (5.2) makes I(x) non-negative, and logarithms are used
because one seeks a quantity I( · ) that represents in an additive way the ‘surprise
value’ of observing x. So, for example, if x has half the probability of y, then
one is ‘twice as surprised’ to see the outcome X = x instead of X = y, and so
I(x) = I(y) + log 2. The entropy of the measure µ is the expected information:
H(µ) := E_{X∼µ}[I(X)] ≡ − ∑_{x∈X} µ(x) log µ(x). (5.3)
(We follow the convention that 0 log 0 := lim_{p→0} p log p = 0.) These definitions
are readily extended to a random variable X taking values in Rn and distributed
according to a probability measure µ that has Lebesgue density ρ:
Since entropy measures the average information content of the possible values
of X ∼ µ, entropy is often interpreted as a measure of the uncertainty implicit
in µ. (Remember that if µ is very ‘spread out’ and describes a lot of uncertainty
about X, then observing a particular value of X carries a lot of ‘surprise value’
and hence a lot of information.)
Example 5.2. Consider a Bernoulli random variable X taking values in x1 , x2 ∈
X with probabilities p, 1 − p ∈ [0, 1] respectively. This random variable has
entropy
−p log p − (1 − p) log(1 − p).
If X is certain to equal x1 , then p = 1, and the entropy is 0; similarly, if
X is certain to equal x2 , then p = 0, and the entropy is again 0; these two
distributions carry zero information and have minimal entropy. On the other
hand, if p = 1/2, in which case X is uniformly distributed on X, then the entropy
is log 2; indeed, this is the maximum possible entropy for a Bernoulli random
variable. This example is often interpreted as saying that when interrogating
someone with questions that demand “yes” or “no” answers, one gains maximum
information by asking questions that have an equal probability of being answered
“yes” versus “no”.
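A three-line numerical check (a minimal sketch, not in the notes) that the Bernoulli
entropy is maximized at p = 1/2:

import numpy as np

p = np.linspace(1e-9, 1 - 1e-9, 1001)
H = -p * np.log(p) - (1 - p) * np.log(1 - p)    # entropy, in nats
print(p[np.argmax(H)], H.max())                 # approx 0.5 and log 2 = 0.693...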
Proposition 5.3. Let µ and ν be probability measures on discrete sets or Rn .
Then the product measure µ ⊗ ν satisfies H(µ ⊗ ν) = H(µ) + H(ν).
That is, the entropy of a random vector with independent components is the sum
of the entropies of the component random variables.
Proof. Exercise 5.2.
Implicit in the definition and entropy (5.3) is the use of a uniform measure
(counting measure on a finite set, or Lebesgue measure on Rn ) as a reference
measure. Upon reflection, there is no need to privilege uniform measure with
being the unique reference measure. Indeed, in some settings, such as infinite-
dimensional Banach spaces, there is no such thing as a uniform measure. In
general, if µ is a probability measure on a measurable space (X , F ) with refer-
ence measure π, then we would like to define the entropy of µ with respect to π
by an expression like
H(µ|π) = − ∫_X (dµ/dπ)(x) log (dµ/dπ)(x) dπ(x)
—
While DKL ( · k · ) is non-negative, and vanishes if and only if its arguments
are identical, it is neither symmetric nor does it satisfy the triangle inequality.
Nevertheless, DKL ( · k · ) generates a topology on M1 (X ) that is finer than the
total variation topology:
dTV(µ, ν) ≤ √( DKL(µ‖ν) / 2 ),
AF
where the total variation metric is defined by
dTV (µ, ν) := sup |µ(E) − ν(E)| E ∈ F . (5.4)
knowns θ from the observed data y that will result from the experiment λ. If, for
each λ and θ, we know the conditional distribution y|λ, θ of the observed data,
then the conditional distribution y|λ is obtained by integration with respect to
the prior distribution of θ:
ρ(y|λ) = ∫ ρ(y|λ, θ) ρ(θ) dθ,
and, by Bayes' rule,
ρ(θ|y, λ) = ρ(y|θ, λ) ρ(θ) / ρ(y|λ).
For example, one could take the utility U (y, λ) to be the Kullback–Leibler di-
vergence DKL ρ( · |y, λ) ρ( · |λ) between the prior and posterior distributions
on θ. An experimental design λ that maximizes
U(λ) := ∫ U(y, λ) ρ(y|λ) dy
is one that is optimal in the sense of maximizing the expected gain in Shannon
information.
Bibliography
Comprehensive treatments of interval analysis include the classic monograph of
Moore [69] and the more recent text of Jaulin & al. [42].
Information theory was pioneered by Shannon in his seminal 1948 paper [88].
The Kullback–Leibler divergence was introduced by Kullback & Leibler [55],
who in fact considered the symmetrized version of the divergence that now
bears their names. The book of MacKay [63] provides a thorough introduction
to information theory.
—
Exercises
Exercise 5.1. Prove Gibbs’ inequality that the Kullback–Leibler divergence is
non-negative, i.e.
DKL(µ‖ν) := ∫_X log(dµ/dν) dµ ≥ 0
whenever µ, ν are probability measures on (X, F) with µ ≪ ν. Show also that
DKL (µkν) = 0 if and only if µ = ν.
Exercise 5.2. Prove Proposition 5.3. That is, suppose that µ and ν are proba-
bility measures on discrete sets or Rn , and show that the product measure µ ⊗ ν
satisfies
H(µ ⊗ ν) = H(µ) + H(ν).
That is, the entropy of a random vector with independent components is the
sum of the entropies of the component random variables.
Exercise 5.3. Let µ0 = N (m0 , C0 ) and µ1 = N (m1 , C1 ) be non-degenerate
Gaussian measures on Rn. Show that DKL(µ0‖µ1) is
(1/2) [ tr(C1⁻¹ C0) + (m1 − m0)ᵀ C1⁻¹ (m1 − m0) − n − log( det C0 / det C1 ) ].
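The closed form in Exercise 5.3 is easy to evaluate numerically. The following Python sketch (illustrative; the function name kl_gaussian is ours) implements it and checks that the divergence vanishes when the two Gaussians coincide:

import numpy as np

def kl_gaussian(m0, C0, m1, C1):
    # D_KL(N(m0, C0) || N(m1, C1)) for non-degenerate Gaussians on R^n
    n = len(m0)
    C1inv = np.linalg.inv(C1)
    dm = m1 - m0
    return 0.5 * (np.trace(C1inv @ C0) + dm @ C1inv @ dm - n
                  - np.log(np.linalg.det(C0) / np.linalg.det(C1)))

m0, C0 = np.zeros(2), np.eye(2)
m1, C1 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
print(kl_gaussian(m0, C0, m1, C1))
print(kl_gaussian(m0, C0, m0, C0))  # zero when the measures coincide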
—
—
Chapter 6
—
Bayesian Inverse Problems
It ain't what you don't know that gets you into trouble. It's what you know for
sure that just ain't so.
Mark Twain
This chapter provides a general introduction, at the high level, to the back-
ward propagation of uncertainty/information in the solution of inverse problems,
and specifically a Bayesian probabilistic perspective on such inverse problems.
Under the umbrella of inverse problems, we consider parameter estimation and
regression. One specific aim is to make clear the connection between regulariza-
tion and the application of a Bayesian prior. The filtering methods of Chapter 7
fall under the general umbrella of Bayesian approaches to inverse problems, but
y = H(u)
—
y = H(u) + η. (6.1)
The problem is, given some observations of inputs ui and corresponding outputs
yi , to find the parameter value θ such that
—
arg min { ‖y − H(u)‖²_Y | u ∈ X }.
However, this problem, too, can be difficult to solve as it may possess minimizing
sequences that do not have a limit in X ,h6.1i or may possess multiple minima,
or may depend sensitively on the observed data y. Especially in this last case, it
may be advantageous to not try to fit the observed data too closely, and instead
regularize the problem by seeking
arg min { ‖y − H(u)‖²_Y + ‖u − ū‖²_E | u ∈ E ⊆ X }
y = H(u; h) + ε + η,
where ε := H(u) − H(u; h). In principle, the observational noise η and the
computational error ε could be combined into a single term, but keeping them
separate is usually more appropriate: unlike η, ε is typically not of mean zero,
and is dependent upon u.
h6.1i Take
a moment to reconcile the statement “there may exist minimizing sequences that do
not have a limit in X ” with X being a Banach space.
To illustrate the central role that least squares minimization plays in elemen-
tary statistical estimation, and hence motivate the more general considerations
of the rest of the chapter, consider the following finite-dimensional linear prob-
lem. Suppose that we are interested in learning some vector of parameters
x ∈ Kn , which gives rise to a vector y ∈ Km of observations via
y = Ax + η,
—
matrix E[ηη ∗ ] = Q ∈ Km×m , with η independent of x. A common approach is
to seek an estimate x̂ of x that is a linear function Ky of the data y, is unbiased
in the sense that E[x̂] = x, and is the best estimate in that it minimizes an
appropriate cost function. The following theorem, the Gauss–Markov theorem,
states that there is precisely one such estimator, and that it is the best in two
very natural senses:
Theorem 6.2 (Gauss–Markov). Suppose that A∗ Q−1 A is invertible. Then, among
all unbiased linear estimators K ∈ Kn×m , producing an estimate x̂ = Ky of x,
the one that minimizes both the mean-squared error E[kx̂ − xk2 ] and the covari-
ance matrixh6.2i E[(x̂ − x)(x̂ − x)∗ ] is
K = (A∗ Q−1 A)−1 A∗ Q−1 ,
Remark 6.3. Indeed, by Corollary 4.23, x̂ = (A∗ Q−1 A)−1 A∗ Q−1 y is also the
solution to the weighted least squares problem with weight Q⁻¹, i.e.
x̂ = arg min_{x ∈ Kⁿ} J(x),   J(x) := ½ ‖Ax − y‖²_{Q⁻¹}.
Note that the first and second derivatives (gradient and Hessian) of J are ∇J(x) = A*Q⁻¹(Ax − y) and D²J(x) = A*Q⁻¹A.
Proof of Theorem 6.2. Note that the first part of this proof is surplus to re-
quirements: we could simply check that K := (A∗ Q−1 A)−1 A∗ Q−1 is indeed the
minimal linear unbiased estimator, but it is nice to derive the formula for K
from first principles and get some practice at constrained convex optimization
into the bargain.
Since K is required to be unbiased, it follows that KA = I. Therefore,
Since ‖Kη‖² = η*K*Kη is a scalar and tr(XY) = tr(YX) for any two rectangular
matrices X and Y of the appropriate sizes,
E[‖Kη‖²] = E[tr(Kη(Kη)*)] = tr(K E[ηη*] K*) = tr(KQK*).
Note that this is a convex optimization problem, since, by Exercise 6.2, K ↦ √(tr(KQK*)) is a norm. Introduce a matrix Λ ∈ K^{n×n} of Lagrange multipliers,
so that the minimizer of the constrained problem is the unique critical point of
the Lagrangian
The critical point of the Lagrangian satisfies
and so, taking expectations of both sides, E[x̂] = x. Moreover, the covariance
E[(x̂ − x)(x̂ − x)∗ ] = (A∗ Q−1 A)−1 A∗ Q−1 E[ηη ∗ ]Q−1 A(A∗ Q−1 A)−1
= (A∗ Q−1 A)−1 ,
as claimed.
Now suppose that L = K + D is any linear unbiased estimator; note that
DA = 0. Then the covariance of the estimate Ly satisfies
= (K + D)Q(K ∗ + D∗ )
= KQK ∗ + DQD∗ + KQD∗ + (KQD∗ )∗ .
Since DA = 0,
and so
E[(Ly − x)(Ly − x)∗ ] = KQK ∗ + DQD∗ ≥ KQK ∗
in the sense of positive semi-definite matrices, as claimed.
B† := lim_{δ→0} (B*B + δI)⁻¹ B*,
B† := lim_{δ→0} B*(BB* + δI)⁻¹, or
—
B † := V Σ+ U ∗ ,
not ideal: for example, because of its characterization as the minimizer of a
quadratic cost function, it is sensitive to large outliers in the data, i.e. compo-
nents of y that differ greatly from the corresponding component of Ax̂. In such
a situation, it may be desirable to not try to fit the observed data y too closely,
and instead regularize the problem by seeking x̂, the minimizer of
J(x) := ½ ‖Ax − y‖²_{Q⁻¹} + ½ ‖x − x̄‖²_{R⁻¹},
for some chosen x̄ ∈ Kn and positive-definite Tikhonov matrix R ∈ Kn×n .
Depending upon the relative sizes of Q and R, x̂ will be influenced more by the
data y and hence lie close to the Gauss–Markov estimator, or be influenced more
by the regularization term and hence lie close to x̄. At first sight this procedure
may seem somewhat ad hoc, but it has a natural Bayesian interpretation.
The observation equation
y = Ax + η
in fact defines the conditional distribution y|x by (y − Ax)|x = η ∼ N (0, Q). To
find the minimizer of x ↦ ½ ‖Ax − y‖²_{Q⁻¹}, i.e. x̂ = Ky, amounts to finding the
maximum likelihood estimator of x given y. The Bayesian interpretation of the
regularization term is that N (x̄, R) is a prior distribution for x. The resulting
posterior distribution for x|y has Lebesgue density ρ(x|y) with
ρ(x|y) ∝ exp( −½ ‖Ax − y‖²_{Q⁻¹} ) exp( −½ ‖x − x̄‖²_{R⁻¹} )
= exp( −½ ‖Ax − y‖²_{Q⁻¹} − ½ ‖x − x̄‖²_{R⁻¹} )
= exp( −½ ‖x − Ky‖²_{A*Q⁻¹A} − ½ ‖x − x̄‖²_{R⁻¹} )
= exp( −½ ‖x − (A*Q⁻¹A + R⁻¹)⁻¹ (A*Q⁻¹AKy + R⁻¹x̄)‖²_{A*Q⁻¹A + R⁻¹} )
by the standard result that the product of two Gaussian densities with means
m1 and m2 and covariances C1 and C2 is, up to normalization, a Gaussian density with covariance C3 =
(C1⁻¹ + C2⁻¹)⁻¹ and mean C3(C1⁻¹ m1 + C2⁻¹ m2). The solution to the regular-
ized least squares problem, i.e. minimizing the exponent in the above posterior
distribution, is the maximum a posteriori estimator of x given y. However,
the full posterior contains more information than the MAP estimator alone. In
particular, the posterior covariance matrix (A∗ Q−1 A + R−1 )−1 reveals those
components of x about which we are relatively more or less certain.
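The estimators discussed above are straightforward to compute. The following Python sketch (illustrative only; the problem data are invented for the example) forms the Gauss–Markov estimator x̂ = (A*Q⁻¹A)⁻¹A*Q⁻¹y and the Tikhonov-regularized (MAP) estimator with prior N(x̄, R), whose posterior covariance is (A*Q⁻¹A + R⁻¹)⁻¹:

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 20
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
Q = 0.1 * np.eye(m)                                  # observational noise covariance
y = A @ x_true + rng.multivariate_normal(np.zeros(m), Q)

Qinv = np.linalg.inv(Q)
# Gauss-Markov / weighted least squares estimator x_hat = K y
K = np.linalg.solve(A.T @ Qinv @ A, A.T @ Qinv)
x_gm = K @ y

# Tikhonov-regularized (MAP) estimator with prior N(x_bar, R)
x_bar, R = np.zeros(n), 10.0 * np.eye(n)
Rinv = np.linalg.inv(R)
post_prec = A.T @ Qinv @ A + Rinv                    # posterior precision
x_map = np.linalg.solve(post_prec, A.T @ Qinv @ y + Rinv @ x_bar)
post_cov = np.linalg.inv(post_prec)                  # posterior covariance
print(x_gm, x_map)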
—
This section concerns Bayesian inversion in Banach spaces, and, in particular,
establishing the appropriate rigorous statement of Bayes’ rule in settings where
there is no Lebesgue measure with respect to which we can take densities.
Example 6.5. There are many applications in which it is of interest to deter-
mine the permeability of subsurface rock, e.g. the prediction of transport of
radioactive waste from an underground waste repository, or the optimization
of oil recovery from underground fields. The flow velocity v of a fluid under
pressure p in a medium of permeability K is given by Darcy's law
v = −K∇p.
The pressure field p within a bounded, open domain Ω ⊂ Rd is governed by the
elliptic PDE
−∇ · (K∇p) = 0 in Ω,
p = h on ∂Ω.
For simplicity, take the permeability tensor field K to be a scalar field k times
the identity tensor; for mathematical and physical reasons, it is important that k
be positive, so write k = eu . The objective is to recover u from, say, observations
respect to Lebesgue measure. Let Φ(u; y) be any function that differs from
− log ρ(y − H(u)) by an additive function of y alone, so that
ρ(y − H(u)) / ρ(y) ∝ exp(−Φ(u; y))
with a constant of proportionality independent of u. An informal application
of Bayes’ rule suggests that the posterior probability distribution of u given y,
µy , has Radon–Nikodým derivative with respect to the prior, µ0 , given by
(dµʸ/dµ0)(u) ∝ exp(−Φ(u; y)).
The next theorem makes this argument rigorous:
—
Nikodým derivative given by
(dµʸ/dνʸ)(x) = ϕ(x, y)/Z(y) if Z(y) > 0, and (dµʸ/dνʸ)(x) = 1 otherwise,
where Z(y) := ∫_S ϕ(x, y) dνʸ(x).
Proof. See Section 10.2 of [25].
Proof of Theorem 6.6. Let Q0(dy) := ρ(y) dy on R^m and Q(dy|u) := ρ(y −
H(u)) dy, so that, by construction,
(dQ/dQ0)(y|u) = C(y) exp(−Φ(u; y)).
Define measures ν0 and ν on Rm × X by
ν0 (dy, du) := Q0 (dy) ⊗ µ0 (du),
ν(dy, du) := Q(dy|u) µ0(du).
Note that ν0 is a product measure under which u and y are independent, whereas
Φ(u; y) ≥ M − εkuk2X .
2. For every r > 0, there exists K = K(r) > 0 such that, for all u ∈ X and
all y ∈ Y with kukX , kykY < r,
Φ(u; y) ≤ K.
—
3. For every r > 0, there exists L = L(r) > 0 such that, for all u1 , u2 ∈ X
and all y ∈ Y with ku1 kX , ku2 kX , kykY < r,
|Φ(u1; y) − Φ(u2; y)| ≤ L ‖u1 − u2‖_X.
4. For every ε > 0 and r > 0, there exists C = C(ε, r) > 0 such that, for all
T
u ∈ X and all y1, y2 ∈ Y with ‖y1‖_Y, ‖y2‖_Y < r,
|Φ(u; y1) − Φ(u; y2)| ≤ exp( ε‖u‖²_X + C ) ‖y1 − y2‖_Y.
Theorem 6.9. Let Φ satisfy standard assumptions (1), (2), and (3) and assume
that µ0 is a Gaussian probability measure on X . Then, for each y ∈ Y, µy given
by
since µ0 is Gaussian and we may choose ε small enough that the Fernique
theorem (Theorem 2.34) applies. Thus, µy can indeed be normalized to be a
probability measure on X .
Definition 6.10. The Hellinger distance between two measures µ and ν is de-
fined in terms of any reference measure ρ with respect to which both µ and ν
are absolutely continuous by:
dHell(µ, ν) := ( ½ ∫_Θ ( √(dµ/dρ)(θ) − √(dν/dρ)(θ) )² dρ(θ) )^{1/2}.
Exercises 6.4, 6.5 and 6.6 establish the major properties of the Hellinger
metric. A particularly useful property is that closeness in the Hellinger met-
—
ric implies closeness of expected values of polynomially bounded functions: if
f : X → E, for some Banach space E, then
‖E_µ[f] − E_ν[f]‖_E ≤ 2 ( E_µ[‖f‖²_E] + E_ν[‖f‖²_E] )^{1/2} dHell(µ, ν).
The following theorem shows that Bayesian inference with respect to a Gaus-
sian prior measure is well-posed with respect to perturbations of the observed
data y:
Theorem 6.11. Let Φ satisfy the standard assumptions (1), (2) and (4), suppose
that µ0 is a Gaussian probability measure on X , and that µy ≪ µ0 with density
given by the generalized Bayes’ rule for each y ∈ Y. Then there exists a constant
C ≥ 0 such that, for all y, y ′ ∈ Y,
dHell(µʸ, µʸ′) ≤ C ‖y − y′‖_Y.
Proof. As in the proof of Theorem 6.9, standard assumption (2) gives a lower
bound on Z(y). By standard assumptions (1) and (4) and the Fernique theorem
(Theorem 2.34),
|Z(y) − Z(y′)| ≤ ∫_X exp( ε‖u‖²_X − M ) exp( ε‖u‖²_X + C ) ‖y − y′‖_Y dµ0(u)
≤ C ‖y − y′‖_Y.
By the definition of the Hellinger distance,
2 dHell(µʸ, µʸ′)² = ∫_X ( Z(y)^{−1/2} e^{−Φ(u;y)/2} − Z(y′)^{−1/2} e^{−Φ(u;y′)/2} )² dµ0(u)
≤ I1 + I2,
where
—
I1 := (2/Z(y)) ∫_X ( e^{−Φ(u;y)/2} − e^{−Φ(u;y′)/2} )² dµ0(u),
I2 := 2 | Z(y)^{−1/2} − Z(y′)^{−1/2} |² ∫_X e^{−Φ(u;y′)} dµ0(u).
Combining these facts yields the desired Lipschitz continuity in the Hellinger
metric.
—
Similarly, the next theorem shows that Bayesian inference with respect to
a Gaussian prior measure is well-posed with respect to approximation of mea-
sures and log-likelihoods. The approximation of Φ by some ΦN typically arises
through the approximation of H by some discretized numerical model H N .
Theorem 6.12. Suppose that the probabilities µ and µN are the posteriors aris-
ing from potentials Φ and ΦN and are all absolutely continuous with respect to
µ0 , and that Φ, ΦN satisfy the standard assumptions (1) and (2) with constants
uniform in N . Assume also that, for all ε > 0, there exists K = K(ε) > 0 such
that
|Φ(u; y) − Φ^N(u; y)| ≤ K exp( ε‖u‖²_X ) ψ(N), (6.2)
where limN →∞ ψ(N ) = 0. Then there is a constant C, independent of N , such
that
dHell (µ, µN ) ≤ Cψ(N ).
Proof. Since y does not appear in this problem, y-dependence will be suppressed
for the duration of this proof. Let Z and Z N denote the appropriate normaliza-
tion constants, as in the proof of Theorem 6.11. By standard assumption (1),
(6.2), and the Fernique theorem,
|Z − Z^N| ≤ K ψ(N) ∫_X exp( ε‖u‖²_X − M ) exp( ε‖u‖²_X ) dµ0(u) ≤ C ψ(N),
where
I1 := (2/Z) ∫_X ( e^{−Φ(u)/2} − e^{−Φ^N(u)/2} )² dµ0(u),
I2 := 2 | Z^{−1/2} − (Z^N)^{−1/2} |² ∫_X e^{−Φ^N(u)} dµ0(u).
Similarly,
| Z^{−1/2} − (Z^N)^{−1/2} |² ≤ C max( Z^{−3}, (Z^N)^{−3} ) |Z − Z^N|² ≤ C ψ(N)².
Combining these facts yields the desired bound on dHell (µ, µN ).
Remark 6.13. Note well that, regardless of the value of the observed data y, the
Bayesian posterior µy is absolutely continuous with respect to the prior µ0 and,
in particular, cannot associate positive posterior probability to any event of prior
probability zero. However, the Feldman–Hájek theorem (Theorem 2.38) says
—
that it is very difficult for probability measures on infinite-dimensional spaces
to be absolutely continuous with respect to one another. Therefore, the choice
of infinite-dimensional prior µ0 is a very strong modelling assumption that, if
it is ‘wrong’, cannot be ‘corrected’ even by large amounts of data y. In this
sense, it is not reasonable to expect that Bayesian inference on function spaces
should be well-posed with respect to apparently small perturbations of the prior
µ0 , e.g. by a shift of mean that lies outside the Cameron–Martin space, or a
change of covariance arising from a non-unit dilation of the space. Nevertheless,
the infinite-dimensional perspective is not without genuine fruits: in particular,
the well-posedness results (Theorems 6.11 and 6.12) are very important for the
stability of finite-dimensional (discretized) Bayesian problems with respect to
discretization dimension N .
Bibliography
This material is covered in much greater detail in the module MA612 Probability
on Function Spaces and Bayesian Inverse Problems.
Tikhonov regularization was introduced in [105, 106]. An introduction to
Exercises
Exercise 6.1. Let µ0 be a Gaussian probability measure on a separable Banach
space X and suppose that the potential Φ(u; y) is quadratic in u. Show that
the posterior µy is also a Gaussian measure on X .
—
dTV(µ, ν) := ½ E_ρ[ | dµ/dρ − dν/dρ | ]
= ½ ∫_Θ | (dµ/dρ)(θ) − (dν/dρ)(θ) | dρ(θ).
Show that dTV is a metric on M1 (Θ, F ), that its values do not depend upon
the choice of ρ, that M1 (Θ, F ) has diameter at most 1, and that, if ν ≪ µ,
then
dTV(µ, ν) = ½ E_µ[ | 1 − dν/dµ | ] ≡ ½ ∫_Θ | 1 − (dν/dµ)(θ) | dµ(θ).
Show also that this definition agrees with the definition given in (5.4).
Exercise 6.4. Let µ and ν be probability measures on (Θ, F ), both absolutely
continuous with respect to a reference measure ρ. Define the Hellinger distance
between µ and ν by
dHell(µ, ν) := ( ½ E_ρ[ ( √(dµ/dρ) − √(dν/dρ) )² ] )^{1/2}
= ( ½ ∫_Θ ( √(dµ/dρ)(θ) − √(dν/dρ)(θ) )² dρ(θ) )^{1/2}.
Show that dHell is a metric on M1 (Θ, F ), that its values do not depend upon
the choice of ρ, that M1 (Θ, F ) has diameter at most 1, and that, if ν ≪ µ,
then
dHell(µ, ν) = ( ½ E_µ[ ( 1 − √(dν/dµ) )² ] )^{1/2} ≡ ( ½ ∫_Θ ( 1 − √((dν/dµ)(θ)) )² dµ(θ) )^{1/2}.
Exercise 6.5. Show that the total variation and Hellinger distances satisfy, for
all µ and ν,
(1/√2) dTV(µ, ν) ≤ dHell(µ, ν) ≤ dTV(µ, ν)^{1/2}.
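For measures on a finite set (with counting measure as the reference measure) both distances reduce to finite sums, and the inequalities of Exercise 6.5 can be checked numerically. A small Python sketch (illustrative; random discrete distributions):

import numpy as np

rng = np.random.default_rng(1)

def d_tv(mu, nu):
    return 0.5 * np.sum(np.abs(mu - nu))

def d_hell(mu, nu):
    return np.sqrt(0.5 * np.sum((np.sqrt(mu) - np.sqrt(nu)) ** 2))

for _ in range(5):
    mu = rng.random(10); mu /= mu.sum()
    nu = rng.random(10); nu /= nu.sum()
    tv, he = d_tv(mu, nu), d_hell(mu, nu)
    # check (1/sqrt(2)) d_TV <= d_Hell <= sqrt(d_TV) up to rounding
    assert tv / np.sqrt(2.0) <= he + 1e-12 <= np.sqrt(tv) + 1e-12
    print(tv, he)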
Exercise 6.6. Suppose that µ and ν are probability measures on a Banach space
X . Show that, if E is a Banach space and f : X → E has finite second moment
with respect to both µ and ν, then
‖E_µ[f] − E_ν[f]‖_E ≤ 2 ( E_µ[‖f‖²_E] + E_ν[‖f‖²_E] )^{1/2} dHell(µ, ν).
Show also that, if E is Hilbert and f has finite fourth moment with respect to
both µ and ν, then
‖E_µ[f ⊗ f] − E_ν[f ⊗ f]‖ ≤ 2 ( E_µ[‖f‖⁴_E] + E_ν[‖f‖⁴_E] )^{1/2} dHell(µ, ν).
Hence show that the differences between the means and covariance operators of
two measures on X are bounded above by the Hellinger distance between the
two measures.
Exercise 6.7. Let Γ ∈ Rq×q be symmetric and positive definite. Suppose that
H : X → Rq satisfies
—
1. For every ε > 0, there exists M ∈ R such that, for all u ∈ X ,
‖H(u)‖_{Γ⁻¹} ≤ exp( ε‖u‖²_X + M ).
2. For every r > 0, there exists K > 0 such that, for all u1 , u2 ∈ X with
ku1 kX , ku2 kX < r,
‖H(u1) − H(u2)‖_{Γ⁻¹} ≤ K ‖u1 − u2‖_X.
Show that Φ : X × R^q → R defined by
Φ(u; y) := ½ ⟨ y − H(u), Γ⁻¹(y − H(u)) ⟩
satisfies the standard assumptions.
Exercise 6.8. An exercise in forensic inference:
NEWSFLASH: THE PRESIDENT HAS BEEN SHOT!
While being driven through the streets of the capital in his open-topped limou-
sine, President Marx of Freedonia has been shot by a sniper. To make matters
worse, the bullet appears to have come from a twenty-storey building, on each
floor of which was stationed a single bodyguard of President Marx’s security de-
tail who was meant to protect him. None of the suspects have any obvious marks
of guilt such as one bullet missing from their magazine, gunpowder burns, failed
lie detector tests. . . not even an evil moustache. The soundproofing inside the
building was good enough that none of the security detail can even say whether
the shot came from above or below them. You have been called in as an expert
quantifier of uncertainty to try to infer from which floor the assassin took the
shot, and hence identify the guilty man. You have the following information:
1. At the time of the shot, the President’s limousine was 500m from the
—
—
—
Chapter 7
—
Filtering and Data
Assimilation
It is not bigotry to be certain we are right; but it is bigotry to be unable to
imagine how we might possibly have gone wrong.
G. K. Chesterton
—
state xk−1 ∈ Kn at time tk−1 according to the linear model
xk = Fk xk−1 + Gk uk + wk (7.1)
where, for each time tk ,
• Fk ∈ Kn×n is the state transition model, which is applied to the previous
state xk−1 ∈ Kn ;
• Gk ∈ Kn×p is the control-to-input model, which is applied to the control
vector uk ∈ Kp ;
• wk ∼ N (0, Qk ) is the process noise, a Kn -valued centred Gaussian random
variable with self-adjoint positive-definite covariance matrix Qk ∈ Kn×n .
At time tk an observation yk ∈ Kq of the true state xk is made according to
yk = Hk xk + vk , (7.2)
where
• Hk ∈ Kq×n is the observation operator, which maps the true state space
Kn into the observable space Kq
• vk ∼ N (0, Rk ) is the observation noise, a Kq -valued centred Gaussian
random variable with self-adjoint positive-definite covariance Qk ∈ Kq×q .
where
µ0 −w0
G1 u1 −w1
y1 +v1
.. x0 ..
. .
bk|m :=
G m m u ∈ Kn(k+1)+qm ,
zk := ... , ηk|m :=
−wm
ym +vm
xk
Gm+1 um+1 −wm+1
.. .
. ..
Gk uk −wk
Here A_{k|m} is the block matrix with k + 1 block columns, indexed by x_0, …, x_k, whose block rows are, in the same order as the entries of b_{k|m}: an initial row [ I 0 ⋯ 0 ]; for each i = 1, …, k, a dynamics row with −F_i in block column i − 1 and I in block column i (and zeros elsewhere); and, for each i = 1, …, m, an observation row with H_i in block column i (and zeros elsewhere).
Note that the noise vector ηk|m is Kn(k+1)+qm -valued and has mean zero and
block-diagonal positive-definite precision operator (inverse covariance) Wk|m
AF
given in block form by
W_{k|m} := diag( Q_0⁻¹, Q_1⁻¹, R_1⁻¹, …, Q_m⁻¹, R_m⁻¹, Q_{m+1}⁻¹, …, Q_k⁻¹ ).
By the Gauss–Markov theorem (Theorem 6.2), the best linear unbiased estimate
ẑk|m = [x̂0|m , . . . , x̂k|m ]∗ of zk satisfies
ẑ_{k|m} ∈ arg min_{z_k ∈ K^{n(k+1)}} J_{k|m}(z_k),   J_{k|m}(z_k) := ½ ‖A_{k|m} z_k − b_{k|m}‖²_{W_{k|m}}, (7.4)
By Exercise 7.1, it follows from the assumptions made above that these normal
equations have a unique solution
ẑ_{k|m} = ( A*_{k|m} W_{k|m} A_{k|m} )⁻¹ A*_{k|m} W_{k|m} b_{k|m}. (7.5)
—
By Theorem 6.2 and Remark 6.3, E[ẑk|m ] = zk and the covariance matrix of
the estimate ẑk|m is (A∗k|m Wk|m Ak|m )−1 ; a Bayesian statistician would say that
zk , conditioned upon the control and observation data bk|m , is the Gaussian
random variable with distribution N( ẑ_{k|m}, (A*_{k|m} W_{k|m} A_{k|m})⁻¹ ).
Note that, since Wk|m is block diagonal, Jk|m can be written as
J_{k|m}(z_k) = ½ ‖x_0 − µ_0‖²_{Q_0⁻¹} + ½ Σ_{i=1}^m ‖y_i − H_i x_i‖²_{R_i⁻¹} + ½ Σ_{i=1}^k ‖x_i − F_i x_{i−1} − G_i u_i‖²_{Q_i⁻¹}.
An expansion of this type will prove very useful in derivation of the linear
Kálmán filter in the next section.
—
find x̂ = arg min_{x ∈ K^n} ‖Ax − b‖²_{K^m}
where A ∈ Km×n , at least by direct methods such as solving the normal equa-
tions or QR factorization, requires of the order of mn2 floating-point operations,
as shown in MA398 Matrix Analysis and Algorithms. Hence, calculation of the
state estimate ẑk by direct solution of (7.5) takes of the order of
(x̂k−1|k−1 , Pk−1|k−1 ) for time tk−1 into the estimate (x̂k|k , Pk|k ) for tk is split
into two steps (which could, of course, be algebraically unified into a single
step):
• the prediction step uses the dynamics alone to update (x̂k−1|k−1 , Pk−1|k−1 )
into (x̂k|k−1 , Pk|k−1 ), an estimate for the state at time tk that does not
use the observation yk ;
• the correction step updates (x̂k|k−1 , Pk|k−1 ) into (x̂k|k , Pk|k ) using the ob-
servation yk .
—
Prediction. Write
F_k := [ 0 ⋯ 0 F_k ] ∈ K^{n×nk},
in which case the gradient and Hessian of J_{k|k−1} are given in block form with
respect to z_k = [z_{k−1}, x_k]* as
∇J_{k|k−1}(z_k) = [ ∇J_{k−1|k−1}(z_{k−1}) + F_k* Q_k⁻¹ (F_k z_{k−1} − x_k + G_k u_k) ; −Q_k⁻¹ (F_k z_{k−1} − x_k + G_k u_k) ]
and
D²J_{k|k−1}(z_k) = [ D²J_{k−1|k−1}(z_{k−1}) + F_k* Q_k⁻¹ F_k , −F_k* Q_k⁻¹ ; −Q_k⁻¹ F_k , Q_k⁻¹ ],
—
which, by Exercise 7.2, is positive definite. Hence, by a single iteration of New-
ton’s method with any initial condition zk , the minimizer ẑk|k−1 of Jk|k−1 (zk )
is simply
ẑ_{k|k−1} = z_k − ( D²J_{k|k−1}(z_k) )⁻¹ ∇J_{k|k−1}(z_k).
Note that ∇J_{k−1|k−1}(ẑ_{k−1|k−1}) = 0 and F_k ẑ_{k−1|k−1} = F_k x̂_{k−1|k−1}, so a clever
initial guess is to take
z_k = [ ẑ_{k−1|k−1} ; F_k x̂_{k−1|k−1} + G_k u_k ].
With this initial guess, the gradient becomes ∇J_{k|k−1}(z_k) = 0, i.e. the optimal
estimate of x_k given y_1, …, y_{k−1} is the bottom row of ẑ_{k|k−1} and — by Re-
mark 6.3 — the covariance matrix P_{k|k−1} of this estimate is the bottom-right
block of the inverse Hessian ( D²J_{k|k−1}(z_k) )⁻¹, calculated using the method of
Schur complements (Exercise 7.3):
These two updates comprise the prediction step of the Kálmán filter. The cal-
culation of x̂k|k−1 requires Θ(n2 + np) operations, and the calculation of Pk|k−1
requires O(nα ) operations, assuming that matrix-matrix multiplication for n×n
matrices can be effected in O(nα ) operations for some 2 ≤ α ≤ 3.
Correction. The next step is a correction step (or update step) that corrects the
prior estimate-covariance pair (x̂k|k−1 , Pk|k−1 ) to a posterior estimate-covariance
pair (x̂k|k , Pk|k ) given the observation yk . Write
—
H_k := [ 0 ⋯ 0 H_k ] ∈ K^{q×n(k+1)},
and
D2 Jk|k (zk ) = D2 Jk|k−1 (zk ) + Hk∗ Rk−1 Hk .
We now take zk = ẑk|k−1 as a clever initial guess for a single Newton iteration,
so that the gradient becomes
∇J_{k|k}(ẑ_{k|k−1}) = ∇J_{k|k−1}(ẑ_{k|k−1}) + H_k* R_k⁻¹ ( H_k ẑ_{k|k−1} − y_k ),
in which the first term vanishes.
The posterior estimate x̂k|k is now obtained as the bottom row of the Newton
update, i.e.
—
x̂k|k = x̂k|k−1 − Pk|k Hk∗ Rk−1 (Hk x̂k|k−1 − yk ) (7.8)
where the posterior covariance Pk|k is obtained as the bottom-right block of the
inverse Hessian ( D²J_{k|k}(z_k) )⁻¹ by Schur complementation:
P_{k|k} = ( P_{k|k−1}⁻¹ + H_k* R_k⁻¹ H_k )⁻¹. (7.9)
(Exercise 7.6).
In many presentations of the Kálmán filter, the correction step is phrased in
terms of the Kálmán gain Kk ∈ Kn×q :
K_k := P_{k|k−1} H_k* ( H_k P_{k|k−1} H_k* + R_k )⁻¹, (7.10)
so that
x̂_{k|k} = x̂_{k|k−1} + K_k ( y_k − H_k x̂_{k|k−1} ) (7.11)
Pk|k = (I − Kk Hk )Pk|k−1 . (7.12)
It is also common to refer to
DR
ỹ_k := y_k − H_k x̂_{k|k−1}
as the innovation residual and
Sk := Hk Pk|k−1 Hk∗ + Rk
as the innovation covariance, so that Kk = Pk|k−1 Hk∗ Sk−1 and x̂k|k = x̂k|k−1 +
Kk ỹk . It is an exercise in algebra to show that the first presentation of the
correction step (7.8)–(7.9) and the Kálmán gain formulation (7.10)–(7.12) are
the same.
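The prediction–correction cycle in the Kálmán-gain form (7.10)–(7.12) is only a few lines of code. The following Python sketch is illustrative only; the constant-velocity model and all numerical values are invented for the example:

import numpy as np

def kf_step(x_prev, P_prev, u, y, F, G, H, Q, R):
    # prediction: propagate estimate and covariance through the dynamics
    x_pred = F @ x_prev + G @ u
    P_pred = F @ P_prev @ F.T + Q
    # correction via the Kalman gain (7.10)-(7.12)
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_post = x_pred + K @ (y - H @ x_pred)
    P_post = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_post, P_post

# toy constant-velocity model with the position observed
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
G = np.zeros((2, 1)); H = np.array([[1.0, 0.0]])
Q = 1e-4 * np.eye(2); R = np.array([[0.01]])
rng = np.random.default_rng(2)
x_true, x_hat, P = np.array([0.0, 1.0]), np.zeros(2), np.eye(2)
for k in range(50):
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = H @ x_true + rng.multivariate_normal(np.zeros(1), R)
    x_hat, P = kf_step(x_hat, P, np.zeros(1), y, F, G, H, Q, R)
print(x_true, x_hat)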
—
and that Pk|k−1 is the solution at time tk of the initial value problem
—
Pk|k = (I − Kk Hk )Pk|k−1 .
The LKF with continuous time evolution and observation is known as the
Kálmán–Bucy filter. The evolution and observation equations are
Notably, in the Kálmán–Bucy filter, the distinction between prediction and
correction does not exist.
(d/dt) x̂(t) = F(t) x̂(t) + G(t) u(t) + K(t) ( y(t) − H(t) x̂(t) ),
Ṗ (t) = F (t)P (t) + P (t)F (t)∗ + Q(t) − K(t)R(t)K(t)∗ ,
where
K(t) := P (t)H(t)∗ R(t)−1 .
The extended Kálmán filter (EKF) is an extension of the Kálmán filter to nonlin-
ear dynamical systems. In discrete time, the evolution and observation equations
are
xk = fk (xk−1 , uk ) + wk ,
yk = hk (xk ) + vk ,
—
ũk := fk (x̂k−1|k−1 , uk ) − Fk x̂k−1|k−1 ,
zk := hk (x̂k|k−1 ) − Hk x̂k|k−1 ,
the linearized system is
xk = Fk xk−1 + ũk + wk ,
yk = Hk xk + zk + vk .
We now apply the standard LKF to this system, treating ũk as the controls for
the linear system and yk − zk as the observations, to obtain
x̂k|k−1 = fk (x̂k−1|k−1 , uk ), (7.13)
Pk|k−1 = Fk Pk−1|k−1 Fk∗ + Qk , (7.14)
P_{k|k} = ( P_{k|k−1}⁻¹ + H_k* R_k⁻¹ H_k )⁻¹, (7.15)
x̂k|k = x̂k|k−1 − Pk|k Hk∗ Rk−1 (hk (x̂k|k−1 ) − yk ). (7.16)
The EnKF is a Monte Carlo approximation of the Kalman filter that avoids
evolving the covariance matrix of the state vector x ∈ Kn . Instead, the EnKF
uses an ensemble of N states
X = [x(1) , . . . , x(N ) ].
The columns of the matrix X ∈ Kn×N are the ensemble members.
are not generally independent except in the initial ensemble, since every EnKF
step ties them together, but all the calculations proceed as if they actually were
independent.
Correction. The correction step for the EnKF uses a trick called data replica-
tion: the observed data yk = Hk xk + vk is replicated into an m × N matrix
so that each column di consists of the observed data vector yk plus an indepen-
dent random draw from N (0, Rk ). If the columns of X̂k|k−1 are a sample from
the prior distribution, then the columns of
X̂_{k|k−1} + K_k ( D − H_k X̂_{k|k−1} )
—
form a sample from the posterior probability distribution, in the sense of a
Bayesian prior (before data) and posterior (conditioned upon the data). The
EnKF approximates this sample by replacing the exact Kálmán gain matrix
(7.10),
K_k := P_{k|k−1} H_k* ( H_k P_{k|k−1} H_k* + R_k )⁻¹,
which involves the covariance matrix Pk|k−1 , which is not tracked in the EnKF,
by an approximate covariance matrix. The empirical mean and empirical co-
variance of X are
⟨X⟩ := (1/N) Σ_{i=1}^N x^{(i)},   C(X) := (X − ⟨X⟩)(X − ⟨X⟩)* / (N − 1).
The Kálmán gain matrix for the EnKF uses Ck|k−1 in place of Pk|k−1 :
K̃_k := C_{k|k−1} H_k* ( H_k C_{k|k−1} H_k* + R_k )⁻¹, (7.17)
One can also use sampling to dispense with Rk , and instead use the empirical
covariance of the replicated data,
(D − ⟨D⟩)(D − ⟨D⟩)* / (N − 1).
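A minimal Python sketch of the EnKF correction step with data replication (illustrative only; the function name enkf_correction and the toy dimensions are ours):

import numpy as np

def enkf_correction(X_pred, y, H, R, rng):
    # X_pred: n x N matrix whose columns are the forecast ensemble members
    n, N = X_pred.shape
    q = len(y)
    # data replication: each column of D is y plus an independent N(0, R) draw
    D = y[:, None] + rng.multivariate_normal(np.zeros(q), R, size=N).T
    Xm = X_pred.mean(axis=1, keepdims=True)
    A = X_pred - Xm
    C = A @ A.T / (N - 1)                           # empirical covariance
    Kt = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)   # approximate gain (7.17)
    return X_pred + Kt @ (D - H @ X_pred)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 100))                   # initial ensemble, n = 4, N = 100
H = np.array([[1.0, 0.0, 0.0, 0.0]]); R = 0.1 * np.eye(1)
X_post = enkf_correction(X, np.array([0.5]), H, R, rng)
print(X_post.mean(axis=1))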
—
Remark 7.1. Even when the matrices involved are positive-definite, instead
of computing the inverse of a matrix and multiplying by it, it is much better
(several times cheaper and also more accurate) to compute the Cholesky decom-
position of the matrix and treat the multiplication by the inverse as solution
of a system of linear equations (cf. MA398 Matrix Analysis and Algorithms).
This is a general point relevant to the implementation of all KF-like methods.
—
would report Lagrangian data about temperature, salinity &c. at its current
(evolving) position.
Consider for example the incompressible Stokes (ι = 0) or Navier–Stokes
(ι = 1) equations on the unit square with periodic boundary conditions, thought
of as the two-dimensional torus T2 , starting at time t = 0:
∂u/∂t + ι u · ∇u = ν∆u − ∇p + f   on T² × [0, ∞),
∇ · u = 0   on T² × [0, ∞),
u = u_0   on T² × {0}.
Here u, f : T2 × [0, ∞) → R2 are the velocity field and forcing term respectively,
p : T2 × [0, ∞) → R is the pressure field, u0 : T2 → R2 is the initial value of the
velocity field, and ν ≥ 0 is the viscosity of the fluid.
Eulerian observations of this system might take the form of noisy obser-
vations yj,k of the velocity field at fixed points zj ∈ T2 , j = 1, . . . , J, at an
increasing sequence of discrete times tk ≥ 0, k ∈ N, i.e.
yj,k = u(zj , tk ) + ηj,k .
On the other hand, Lagrangian observations might take the form of noisy ob-
servations yj,k of the locations zj (tk ) ∈ T2 at time tk of J passive tracers that
start at position zj,0 ∈ T2 at time t = 0 and are carried along with the flow
thereafter, i.e.
z_j(t) = z_{j,0} + ∫_0^t u(z_j(s), s) ds,
yj,k = zj (tk ) + ηj,k .
—
Bibliography
The description given of the Kálmán filter, particularly in terms of Newton’s
method applied to the quadratic objective function J, follows that of Humpherys
& al. [41]. The remarks about Eulerian versus Lagrangian data assimilation
borrow from §3.6 of Stuart [98].
The original presentation of the Kálmán [46] and Kálmán–Bucy filters [47]
was in the context of signal processing, and encountered some initial resistance
from the engineering community, as related in the article of Humpherys & al.
Filtering is now fully accepted in applications communities and has a sound
algorithmic and theoretical base; for a stochastic processes point of view on
filtering, see e.g. the books of Jazwinski [43] and Øksendal [73] (§6.1).
Exercises
Exercise 7.1. Verify that the normal equations for the state estimation problem
(7.4) have a unique solution.
Exercise 7.2. Suppose that A ∈ Kn×n and C ∈ Km×m are self-adjoint and
positive definite, B ∈ Km×n , and D ∈ Km×m is self-adjoint and positive semi-
definite. Then the block matrix
[ A + B*CB , −B*C ; −CB , C + D ]
is self-adjoint and positive-definite.
Exercise 7.3 (Schur complements). Let
M = [ A , B ; C , D ].
Then
M⁻¹ = [ A⁻¹ + A⁻¹B(D − CA⁻¹B)⁻¹CA⁻¹ , −A⁻¹B(D − CA⁻¹B)⁻¹ ; −(D − CA⁻¹B)⁻¹CA⁻¹ , (D − CA⁻¹B)⁻¹ ]
and
M⁻¹ = [ (A − BD⁻¹C)⁻¹ , −(A − BD⁻¹C)⁻¹BD⁻¹ ; −D⁻¹C(A − BD⁻¹C)⁻¹ , D⁻¹ + D⁻¹C(A − BD⁻¹C)⁻¹BD⁻¹ ].
Exercise 7.4. Schur complementation is often stated in the more restrictive
setting of self-adjoint positive-definite matrices, in which it has a natural in-
terpretation in terms of the conditioning of Gaussian random variables. Let
(X, Y ) ∼ N (m, C) be jointly Gaussian, where, in block form,
m = [ m1 ; m2 ],   C = [ C11 , C12 ; C12* , C22 ],
and C is self-adjoint and positive definite. Show:
1. C11 and C22 are necessarily self-adjoint and positive-definite matrices.
2. With the Schur complement defined by S := C11 − C12 C22⁻¹ C12*, S is self-
adjoint and positive definite, and
C⁻¹ = [ S⁻¹ , −S⁻¹ C12 C22⁻¹ ; −C22⁻¹ C12* S⁻¹ , C22⁻¹ + C22⁻¹ C12* S⁻¹ C12 C22⁻¹ ].
—
—
some modeling error in the system.
To do this, consider the objective function
J^{(λ)}_{k|k}(z_k) := (λ^k/2) ‖x_0 − µ_0‖²_{Q_0⁻¹} + ½ Σ_{i=1}^k λ^{k−i} ‖y_i − H_i x_i‖²_{R_i⁻¹} + ½ Σ_{i=1}^k λ^{k−i} ‖x_i − F_i x_{i−1} − G_i u_i‖²_{Q_i⁻¹},
where the parameter λ ∈ [0, 1] is called the forgetting factor ; note that the stan-
dard LKF is the case λ = 1, and the objective function increasingly relies upon
recent measurements as λ → 0. Find a recursive expression for the objective
(λ)
function Jk|k and follow the steps in the derivation of the usual LKF to derive
the LKF with fading memory λ.
Exercise 7.8. Write the prediction and correction equations (7.13)–(7.16) for
the EKF in terms of the Kálmán gain matrix.
Exercise 7.9. Suppose that a fuel tank of an aircraft is (when the aircraft is
level) the cuboid Ω := [a1 , b1 ] × [a2 , b2 ] × [a3 , b3 ]. Assume that at some time t,
the aircraft is flying such that
• the original upward-pointing [0, 0, 1]∗ vector of the plane, and hence the
tank, is ν(t) ∈ R3 ;
• the fuel in the tank is in static equilibrium;
• fuel probes inside the tank provide noisy measurements of the fuel depth
at the four corners of the tank: specifically, if [ai , bi , zi ]∗ is the intersection
of the fuel surface with the boundary of the tank at the corner [ai , bi ]∗ ,
assume that you are told ζi = zi + N (0, σ 2 ), independently for each fuel
probe.
—
trajectory as it passes by a radar sensor, but also effectively predicts the point
of impact as well as the point of origin — so that troops on the ground can both
duck for cover and return fire before the projectile lands.
The state of the projectile is. . .
—
—
—
—
Chapter 8
—
Orthogonal Polynomials
Although our intellect always longs for clarity and certainty, our nature often
finds uncertainty fascinating.
On War
Karl von Clausewitz
AF
Orthogonal polynomials are an important example of orthogonal decompo-
sitions of Hilbert spaces. They are also of great practical importance: they play
a central role in numerical integration using quadrature rules (Chapter 9) and
approximation theory; in the context of UQ, they are also a foundational tool
in polynomial chaos expansions (Chapter 11).
For the rest of this chapter, N = N0 or {0, 1, . . . , N } for some N ∈ N0 .
The constants
γ_n = ‖q_n‖²_{L²(µ)} = ∫_R q_n² dµ
are required to be strictly positive and are called the normalization constants
of the system Q. If γn = 1 for all n ∈ N , then Q is an orthonormal system.
—
2. The Hermite polynomials He_n are the orthogonal polynomials for standard
Gaussian measure (2π)^{−1/2} e^{−x²/2} dx on the real line:
∫_{−∞}^{∞} He_m(x) He_n(x) e^{−x²/2}/√(2π) dx = n! δ_{mn}.
The first few Legendre and Hermite polynomials are given in Table 8.1
and illustrated in Figures 8.1 and 8.2.
3. See Table 8.2 for a summary of other classical systems of orthogonal poly-
nomials corresponding to various probability measures on subsets of the
real line.
Remark 8.3. Many sources, typically physicists’ texts, use the weight function
e^{−x²} dx instead of probabilists' preferred (2π)^{−1/2} e^{−x²/2} dx or e^{−x²/2} dx for
the Hermite polynomials. Changing from one normalization to the other is
of course not difficult, but special care must be exercised in practice to see
which normalization a source is using, especially when relying on third-party
software packages: for example, the GAUSSQ Gaussian quadrature package from
http://netlib.org/ uses the e^{−x²} dx normalization. To convert integrals with
respect to one Gaussian measure to integrals with respect to another (and hence
get the right answers for Gauss–Hermite quadrature), use the following change-
of-variables formula:
(1/√(2π)) ∫_R f(x) e^{−x²/2} dx = (1/√π) ∫_R f(√2 x) e^{−x²} dx.
It follows from this that conversion between the physicists’ and probabilists’
Gauss–Hermite quadrature formulæ is achieved by
w_i^{prob} = w_i^{phys}/√π,   x_i^{prob} = √2 x_i^{phys}.
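The conversion can be checked numerically. The following Python sketch (illustrative; it uses numpy's hermgauss, which returns nodes and weights for the physicists' weight e^{−x²}) converts to the probabilists' normalization and verifies the Hermite orthogonality relation stated above:

import numpy as np
from numpy.polynomial.hermite import hermgauss
from math import factorial

x_phys, w_phys = hermgauss(20)          # physicists' Gauss-Hermite rule
x_prob = np.sqrt(2.0) * x_phys
w_prob = w_phys / np.sqrt(np.pi)

def hermite_prob(n, x):
    # probabilists' Hermite polynomials via He_{n+1} = x He_n - n He_{n-1}
    h_prev, h = np.zeros_like(x), np.ones_like(x)
    for k in range(n):
        h_prev, h = h, x * h - k * h_prev
    return h

for m in range(4):
    for n in range(4):
        val = np.sum(w_prob * hermite_prob(m, x_prob) * hermite_prob(n, x_prob))
        expected = factorial(n) if m == n else 0.0
        assert abs(val - expected) < 1e-8
print("Hermite orthogonality verified with a 20-node rule")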
Lemma 8.4. The L2 (R, µ) inner product is positive definite on R≤d [x] if and
only if the Hankel determinant det(Hn ) is strictly positive for n = 1, . . . , d + 1,
where
H_n := [ m_0, m_1, …, m_{n−1} ; m_1, m_2, …, m_n ; ⋮ ; m_{n−1}, m_n, …, m_{2n−2} ],   m_n := ∫_R x^n dµ(x).
Hence, the L2 (R, µ) inner product is positive definite on R[x] if and only if
det(Hn ) > 0 for all n ∈ N.
n    Le_n                        He_n
0    1                           1
1    x                           x
2    (3x² − 1)/2                 x² − 1
3    (5x³ − 3x)/2                x³ − 3x
4    (35x⁴ − 30x² + 3)/8         x⁴ − 6x² + 3
5    (63x⁵ − 70x³ + 15x)/8       x⁵ − 10x³ + 15x
—
Table 8.1: The first few Legendre polynomials Len , which are the orthogo-
nal polynomials for uniform measure dx on [−1, 1], and Hermite polynomi-
als Hen , which are the orthogonal polynomials for standard Gaussian measure
(2π)^{−1/2} e^{−x²/2} dx on R.
Figure 8.1: The Legendre polynomials Le0 (red), Le1 (orange), Le2 (yellow),
Le3 (green), Le4 (blue) and Le5 (purple) on [−1, 1].
Figure 8.2: The Hermite polynomials He0 (red), He1 (orange), He2 (yellow),
He3 (green), He4 (blue) and He5 (purple) on R.
—
Proof. Let
p(x) := cd xd + · · · + c1 x + c0
be any polynomial of degree at most d. Note that
‖p‖²_{L²(R,µ)} = ∫_R Σ_{k,ℓ=0}^d c_k c_ℓ x^{k+ℓ} dµ(x) = Σ_{k,ℓ=0}^d c_k c_ℓ m_{k+ℓ},
which is the quadratic form of H_{d+1} applied to the coefficient vector (c_0, …, c_d);
and so ‖p‖²_{L²(R,µ)} > 0 for all p ≠ 0 if and only if H_{d+1} is positive definite. This, in turn, is
equivalent to having det(H_n) > 0 for n = 1, 2, …, d + 1.
Theorem 8.5. If the L2 (R, µ) inner product is positive definite on R[x], then
there exists an infinite sequence of orthogonal polynomials for µ.
Proof. Apply the Gram–Schmidt procedure to the monomials xn , n ∈ N0 . That
Since the inner product is positive definite, hqk , qk i > 0, and so each qn is
uniquely defined. By construction, each qn is orthogonal to qk for k < n.
By Exercise 8.1, the hypothesis of Theorem 8.5 is satisfied if the measure µ
has infinite support. In the other direction, we have the following:
—
Theorem 8.6. If the L2 (R, µ) inner product is positive definite on K≤d [x], but
not on R≤n [x] for n > d, then µ admits only d + 1 orthogonal polynomials.
Proof. The Gram–Schmidt procedure can be applied so long as the denomina-
tors hqk , qk i are strictly positive, i.e. for k ≤ d + 1. The polynomial qd+1 is
orthogonal to qn for n ≤ d; we now show that qd+1 = 0. By assumption, there
exists a polynomial P of degree d + 1, having the same leading coefficient as
qd+1 , such that kP kL2(R,µ) = 0. Hence, P − qd+1 has degree d, so it can be
written in the orthogonal basis {q0 , . . . , qd } as
P − q_{d+1} = Σ_{k=0}^d c_k q_k,
which implies, in particular, that kqd+1 kL2 (R,µ) = 0. Hence, the normalization
constant γd+1 = 0, which is not permitted, and so qd+1 is not a member of a
sequence of orthogonal polynomials for µ.
Theorem 8.7. If µ has finite moments only of degrees 0, 1, . . . , r, then µ admits
—
only a finite system of orthogonal polynomials q0 , . . . , qd , where d = ⌊r/2⌋.
Proof. Exercise 8.2
Theorem 8.8. The coefficients of any system of orthogonal polynomials are
determined, up to multiplication by an arbitrary constant for each degree, by the
Hankel determinants of the polynomial moments. That is, if
m_n := ∫_R x^n dµ(x),
then the n-th degree orthogonal polynomial q_n for µ is, for some c_n ≠ 0,
q_n = c_n det( [ H_n | (m_n, …, m_{2n−1})ᵀ ; 1, x, …, x^{n−1} | x^n ] )
= c_n det [ m_0, m_1, m_2, …, m_n ; m_1, m_2, m_3, …, m_{n+1} ; ⋮ ; m_{n−1}, m_n, m_{n+1}, …, m_{2n−1} ; 1, x, x², …, x^n ],
where rows are separated by semicolons.
Proof. FINISH ME!!!
Theorem 8.9 (Favard). Let (An ), (Bn ), (Cn ) be real sequences and let Q =
{qn | n ∈ N } be defined by
qn+1 (x) = (An x + Bn )qn (x) − Cn qn−1 (x),
q0 (x) = 1,
q−1 (x) = 0.
Then Q is a system of orthogonal polynomials for some measure µ if and only
if, for all n ∈ N ,
A_n ≠ 0,   C_n ≠ 0,   C_n A_n A_{n−1} > 0.
Theorem 8.10 (Christoffel–Darboux formula). The orthonormal polynomials {Pn |
n ≥ 0} for a measure µ satisfy
Σ_{k=0}^n P_k(y) P_k(x) = √(β_{n+1}) ( P_{n+1}(y) P_n(x) − P_n(y) P_{n+1}(x) ) / (y − x)   (8.1)
and
Σ_{k=0}^n |P_k(x)|² = √(β_{n+1}) ( P′_{n+1}(x) P_n(x) − P′_n(x) P_{n+1}(x) ).   (8.2)
Proof. Multiply the recurrence relation
√(β_{k+1}) P_{k+1}(x) = (x − α_k) P_k(x) − √(β_k) P_{k−1}(x)
by P_k(y) on both sides and subtract the corresponding expression with x and y
interchanged to obtain
(y − x) P_k(y) P_k(x) = √(β_{k+1}) ( P_{k+1}(y) P_k(x) − P_k(y) P_{k+1}(x) ) − √(β_k) ( P_k(y) P_{k−1}(x) − P_{k−1}(y) P_k(x) ).
—
1. For each n ∈ N_0, q_n has exactly n distinct real roots z_1^{(n)}, …, z_n^{(n)} ∈ I.
2. If (a, b) is an open interval of µ-measure zero, then (a, b) contains at most
one root of any orthogonal polynomial q_n for µ.
3. The zeros of q_n and q_{n+1} alternate:
z_1^{(n+1)} < z_1^{(n)} < z_2^{(n+1)} < ⋯ < z_n^{(n+1)} < z_n^{(n)} < z_{n+1}^{(n+1)};
hence, whenever m > n, between any two zeros of qn there lies a zero of
qm .
4. If the support of µ is the entire interval I, then the set of all zeros for the
system Q is dense in I:
I = cl( ⋃_{n∈N} { z ∈ R | q_n(z) = 0 } ).
Proof. 1. First observe that hqn , 1iL2 (µ) = 0, and so qn changes sign in I.
Since qn is continuous, the Intermediate Value Theorem implies that qn
has at least one real root z_1^{(n)} ∈ I. For n > 1, there must be another root
z_2^{(n)} ∈ I of q_n distinct from z_1^{(n)}, since if q_n were to vanish only at z_1^{(n)},
then (x − z_1^{(n)}) q_n would not change sign in I, which would contradict the
orthogonality relation ⟨x − z_1^{(n)}, q_n⟩_{L²(µ)} = 0. Similarly, if n > 2, consider
(x − z_1^{(n)})(x − z_2^{(n)}) q_n to deduce the existence of yet a third distinct root
z_3^{(n)} ∈ I. This procedure terminates when all the n complex roots of q_n
guaranteed by the Fundamental Theorem of Algebra are shown to lie in
I.
2. Suppose that (a, b) contains two distinct zeros z_i^{(n)} and z_j^{(n)} of q_n. Then
⟨ q_n, ∏_{k≠i,j} (x − z_k^{(n)}) ⟩_{L²(µ)} = ∫_R q_n(x) ∏_{k≠i,j} (x − z_k^{(n)}) dµ(x)
= ∫_R ∏_{k≠i,j} (x − z_k^{(n)})² (x − z_i^{(n)})(x − z_j^{(n)}) dµ(x)
> 0,
since the integrand is positive outside of (a, b). However, this contradicts
the orthogonality of qn to all polynomials of degree less than n.
3. As usual, let Pn be the normalized version of qn . Let σ, τ be consecutive ze-
ros of Pn , so that Pn′ (σ)Pn′ (τ ) < 0. Then Corollary 8.11 implies that Pn+1
—
4. FINISH ME!!!
the invertibility of the Vandermonde matrix
V_n(x_0, …, x_n) := [ 1, x_0, x_0², …, x_0^n ; 1, x_1, x_1², …, x_1^n ; ⋮ ; 1, x_n, x_n², …, x_n^n ]   (8.3)
and hence the unique solvability of the system of simultaneous linear equations
V_n(x_0, …, x_n) [ c_0 ; ⋮ ; c_n ] = [ y_0 ; ⋮ ; y_n ].   (8.4)
There is, however, another way to express the polynomial interpolation problem,
the so-called Lagrange form, which amounts to a clever choice of basis for R≤n [x]
(instead of the usual monomial basis {1, x, x2 , . . . , xn }) so that the matrix in
(8.4) in the new basis is the identity matrix.
ℓ_j(x) := ∏_{0 ≤ k ≤ K, k ≠ j} (x − x_k) / (x_j − x_k).
—
f(x_k) = Σ_{j=0}^K c_j ℓ_j(x_k) = Σ_{j=0}^K c_j δ_{jk} = c_k,
and so p = L, as claimed.
the nodes xk to be equally spaced. Runge’s phenomenon [84] shows that this
is not always a good choice of interpolation scheme. Consider the function
f : [−1, 1] → R defined by
f(x) := 1 / (1 + 25x²), (8.5)
and let Ln be the degree-n (Lagrange) interpolation polynomial for f on the
equally-spaced nodes x_k := 2k/n − 1. As illustrated in Figure 8.1, L_n oscillates
wildly near the endpoints of the interval [−1, 1]. Even worse, as n increases,
these oscillations do not die down but increase without bound: it can be shown
that
lim_{n→∞} sup_{x∈[−1,1]} |f(x) − L_n(x)| = ∞.
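This divergence is easy to observe numerically. The following Python sketch (illustrative; the helper lagrange_eval is ours) interpolates f on equally-spaced and on Chebyshev nodes and prints the sup-norm errors, which grow in the first case and shrink in the second:

import numpy as np

def lagrange_eval(nodes, values, x):
    # evaluate the Lagrange interpolation polynomial sum_j values_j * l_j(x)
    total = np.zeros_like(x)
    for j, xj in enumerate(nodes):
        lj = np.ones_like(x)
        for k, xk in enumerate(nodes):
            if k != j:
                lj *= (x - xk) / (xj - xk)
        total += values[j] * lj
    return total

f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)
x_plot = np.linspace(-1.0, 1.0, 1001)
for n in (5, 9, 13, 17):
    equi = np.linspace(-1.0, 1.0, n + 1)
    cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))   # Chebyshev nodes
    err_equi = np.max(np.abs(f(x_plot) - lagrange_eval(equi, f(equi), x_plot)))
    err_cheb = np.max(np.abs(f(x_plot) - lagrange_eval(cheb, f(cheb), x_plot)))
    print(n, err_equi, err_cheb)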
However, Chebyshev nodes are not a panacea. Indeed, for every predefined
set of interpolation nodes there is a continuous function for which the interpo-
lation process on those nodes diverges. For every continuous function there is
a set of nodes on which the interpolation process converges. Interpolation on
Chebyshev nodes converges uniformly for every absolutely continuous function.
Figure 8.1: Runge’s phenomenon. The function f (x) := (1 + 25x2 )−1 in black,
and polynomial interpolations of degrees 5 (red), 9 (green), and 13 (blue) on
evenly-spaced nodes.
Theorem 8.17 (Weierstrass). Let [a, b] ⊂ R be a bounded interval, let f : [a, b] →
R be continuous, and let ε > 0. Then there exists a polynomial p such that ‖f − p‖_∞ := sup_{x∈[a,b]} |f(x) − p(x)| ≤ ε.
Theorem 8.18. For any f ∈ L2 (I, µ) and any d ∈ N0 , the orthogonal projection
Πd f of f onto R≤d [x] is the best degree d polynomial approximation of f in the
L2 (I, µ) norm, i.e.
Π_d f := Σ_{k=0}^d ( ⟨f, q_k⟩_{L²(µ)} / ‖q_k‖²_{L²(µ)} ) q_k,
⟨u, v⟩_{H^k(µ)} := Σ_{m=0}^k ⟨ d^m u/dx^m, d^m v/dx^m ⟩_{L²(µ)} = Σ_{m=0}^k ∫_I (d^m u/dx^m)(d^m v/dx^m) dµ,
‖u‖_{H^k(µ)} := ⟨u, u⟩_{H^k(µ)}^{1/2}.
The Sobolev space H k (µ) consists of all L2 (µ) functions that have weak deriva-
tives of all orders up to k in L2 (µ), and is equipped with the above inner product
—
and norm.
Legendre expansions of Sobolev functions on [−1, 1] satisfy the following
spectral convergence theorem; the analogous results for Hermite expansions of
Sobolev functions on R and Laguerre expansions of Sobolev functions on (0, ∞)
are Exercise 8.5 and Exercise 8.6 respectively.
Theorem 8.19 (Spectral convergence of Legendre expansions). There is a constant
C ≥ 0 that may depend upon k but is independent of d and f such that, for all
T
f ∈ H k ([−1, 1], dx),
L Le_n = λ_n Le_n, with λ_n = −n(n + 1),
Note that, by the definition of the Sobolev norm and the operator L, kLf kL2 ≤
Ckf kH 2 and hence, for any m ∈ N, kLm f kL2 ≤ Ckf kH 2m .
The key ingredient of the proof is integration by parts:
⟨f, Le_n⟩_{L²} = λ_n⁻¹ ∫_{−1}^{1} (L Le_n)(x) f(x) dx
= λ_n⁻¹ ∫_{−1}^{1} [ (1 − x²) Le_n″(x) f(x) − 2x Le_n′(x) f(x) ] dx
= −λ_n⁻¹ ∫_{−1}^{1} [ ((1 − x²) f)′(x) Le_n′(x) + 2x Le_n′(x) f(x) ] dx
= −λ_n⁻¹ ∫_{−1}^{1} (1 − x²) f′(x) Le_n′(x) dx
= λ_n⁻¹ ∫_{−1}^{1} ( (1 − x²) f′ )′(x) Le_n(x) dx
= λ_n⁻¹ ⟨Lf, Le_n⟩_{L²}.
Hence,
‖f − Π_d f‖² = Σ_{n=d+1}^∞ |⟨f, Le_n⟩|² / ‖Le_n‖²
= Σ_{n=d+1}^∞ |⟨L^m f, Le_n⟩|² / ( λ_n^{2m} ‖Le_n‖² )
≤ λ_d^{−2m} Σ_{n=d+1}^∞ |⟨L^m f, Le_n⟩|² / ‖Le_n‖²
≤ d^{−4m} ‖L^m f‖²
≤ C² d^{−4m} ‖f‖²_{H^{2m}}.
Taking k = 2m and square roots completes the proof.
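The rapid decay of the error can be observed numerically. A Python sketch (illustrative; it uses numpy's leggauss and legval, and the smooth test function is an arbitrary choice) computes the L² error of the degree-d Legendre projection by Gauss–Legendre quadrature:

import numpy as np
from numpy.polynomial.legendre import leggauss, legval

f = lambda x: np.exp(np.sin(np.pi * x))       # smooth test function on [-1, 1]
x, w = leggauss(200)                          # Gauss-Legendre rule on [-1, 1]

def legendre_projection_error(d):
    # coefficients <f, Le_n> / ||Le_n||^2 for n = 0, ..., d, with ||Le_n||^2 = 2/(2n+1)
    coeffs = []
    for n in range(d + 1):
        Len = legval(x, [0.0] * n + [1.0])    # Le_n at the quadrature nodes
        coeffs.append(np.sum(w * f(x) * Len) * (2 * n + 1) / 2.0)
    Pd = legval(x, coeffs)
    return np.sqrt(np.sum(w * (f(x) - Pd) ** 2))   # L^2([-1,1], dx) error

for d in (2, 4, 8, 16):
    print(d, legendre_projection_error(d))
# for this analytic integrand the error decays faster than any fixed power of d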
However, in the other direction poor regularity can completely ruin the nice
convergence of spectral expansions. The classic example of this is Gibbs’ phe-
T
nomenon, in which one tries to approximate the sign function
sgn(x) := −1 if x < 0,   0 if x = 0,   1 if x > 0,
on [−1, 1] by its expansion with respect to a system of orthogonal polynomi-
als such as the Legendre polynomials Le_n(x) or the Fourier polynomials e^{iπnx}.
FINISH ME!!!
hqα , piL2 (µ) = 0 for all p(x) ∈ R[x1 , . . . , xd ] with deg(p) < |α|.
—
{Pα | α ∈ Nd0 } is orthonormal if
Bibliography
The classic reference on orthogonal polynomials is the 1939 monograph of Szegő
[100]. An excellent more modern reference is the book of Gautschi [33]; some
topics covered in that book that are not treated here include. . . FINISH ME!!!
Many important properties of orthogonal polynomials, and standard exam-
ples, are given in Chapter 22 of Abramowitz & Stegun [1].
Exercises
Exercise 8.1. Show that the L2 (R, µ) inner product is positive definite on the
space of polynomials if the measure µ has infinite support.
Exercise 8.2. Show that if µ has finite moments only of degrees 0, 1, . . . , r,
mimic the proof of Theorem 8.19 to show that there is a constant C ≥ 0 that may
depend upon k but is independent of d and f such that, for all f ∈ H k (R, γ), f
and its degree d expansion in the Hermite orthogonal basis of L2 (R, γ) satisfy
Exercise 8.6 (Spectral convergence of Laguerre expansions). Let dµ(x) = e−x dx,
for which the orthogonal polynomials are the Laguerre polynomials Lan , n ∈ N0 .
Establish an integration-by-parts formula for µ and then use this and the fact
that La_n is an eigenfunction for x d²/dx² + (1 − x) d/dx with eigenvalue −n to prove
—
the analogue of Exercise 8.5 but for Laguerre expansions.
—
Chapter 9
—
Numerical Integration
ment that the human race cares about its
welfare “with increased statistical signif-
icance”. On the 1001st day, the turkey
has a surprise.
The Fourth Quadrant: A Map of the
Limits of Statistics
Nassim Taleb
Definition 9.1. A quadrature formula Q is said to have order of accuracy n ∈ N_0
if ∫_I f dµ = Q(f) whenever f is a polynomial of degree at most n.
A quadrature formula Q(f) = Σ_{k=1}^K w_k f(x_k) can be identified with the
discrete measure Σ_{k=1}^K w_k δ_{x_k}. If some of the weights w_k are negative, then this
measure is a signed measure. This point of view will be particularly useful when
considering multi-dimensional quadrature formulas. Regardless of the signature
of the weights, the following limitation on the accuracy of quadrature formulas
is fundamental:
—
Lemma 9.2. Let I ⊆ R be any interval. Then no quadrature formula with n
distinct nodes in I can have order of accuracy 2n or greater.
Proof. Let {x_1, …, x_n} ⊆ I be any set of n distinct points, and let {w_1, …, w_n}
be any set of weights. Let f be the degree-2n polynomial f(x) := ∏_{j=1}^n (x − x_j)²,
i.e. the square of the nodal polynomial. Then
∫_I f(x) dx > 0 = Σ_{j=1}^n w_j f(x_j),
since f vanishes at each node xj . Hence, the quadrature formula is not exact
for polynomials of degree 2n.
The first, simplest, quadrature formulas to consider are those in which the
nodes form an equally-spaced discrete set of points in [a, b]. Many of these
quadrature formulas may be familiar from high-school mathematics.
Definition 9.3 (Midpoint rule). The midpoint quadrature formula has the single
node x_1 := (a + b)/2 and the single weight w_1 := ρ(x_1)|b − a|. That is, it is the
approximation
∫_a^b f(x) ρ(x) dx ≈ I_1(f) := f( (a + b)/2 ) ρ( (a + b)/2 ) |b − a|.
Another viewpoint on the midpoint rule is that it is the approximation of the
integrand f by the constant function with value f((a + b)/2). The next quadrature
formula, on the other hand, amounts to the approximation of f by the affine
function
x ↦ f(a) + (f(b) − f(a)) (x − a)/(b − a)
that equals f (a) at a and f (b) at b.
—
Definition 9.4 (Trapezoidal rule). The trapezoidal quadrature formula has the
nodes x_1 := a and x_2 := b and the weights w_1 := ρ(a)|b − a|/2 and w_2 := ρ(b)|b − a|/2.
That is, it is the approximation
∫_a^b f(x) ρ(x) dx ≈ I_2(f) := (|b − a|/2) ( f(a) ρ(a) + f(b) ρ(b) ).
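For ρ ≡ 1 both rules are one-liners; the following Python sketch (illustrative) compares them, together with their composite versions, on ∫_0^1 e^x dx:

import numpy as np

def midpoint(f, a, b):
    return f(0.5 * (a + b)) * (b - a)

def trapezoidal(f, a, b):
    return 0.5 * (b - a) * (f(a) + f(b))

f, a, b, exact = np.exp, 0.0, 1.0, np.e - 1.0
print(midpoint(f, a, b) - exact, trapezoidal(f, a, b) - exact)

# composite versions on N subintervals: both errors decay like O(N^{-2})
for N in (4, 16, 64):
    edges = np.linspace(a, b, N + 1)
    mid = sum(midpoint(f, l, r) for l, r in zip(edges[:-1], edges[1:]))
    trap = sum(trapezoidal(f, l, r) for l, r in zip(edges[:-1], edges[1:]))
    print(N, mid - exact, trap - exact)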
Recall the definition of the Lagrange interpolation polynomial L for a set of
nodes and values from Definition 8.15. The midpoint and trapezoidal quadrature
formulas amount to approximating f by a Lagrange interpolation polynomial L
Rb Rb
of degree 0 or 1 and hence approximating a f (x) dx by a L(x) dx. The general
such construction is the following:
—
Proposition 9.6. The weights for the closed Newton–Cotes quadrature formula
are given by
w_j = ∫_a^b ℓ_j(x) dx.
∫_a^b f(x) dx ≈ ∫_a^b L(x) dx = ∫_a^b Σ_{j=0}^K f(x_j) ℓ_j(x) dx = Σ_{j=0}^K f(x_j) ∫_a^b ℓ_j(x) dx,
as claimed.
—
—
where the nodes x1 , . . . , xn and weights w1 , . . . , wn will be chosen appropriately.
Let {q0 , q1 , . . . } be a system of orthogonal polynomials with respect to the
weight function w. That is,
∫_a^b f(x) q_n(x) w(x) dx = 0
whenever f is a polynomial of degree at most n − 1. Let the nodes x1 , . . . , xn
be the zeros of qn ; by Theorem 8.14, qn has n distinct roots in [a, b]. Define the
associated weights by
w_j := ( a_n ∫_a^b q_{n−1}(x)² w(x) dx ) / ( a_{n−1} q_n′(x_j) q_{n−1}(x_j) ),
Theorem 9.8. The n-point Gauss quadrature formula has order of accuracy
exactly 2n − 1, and no quadrature formula has order of accuracy higher than
this.
Proof. Lemma 9.2 shows that no quadrature formula can have order of accuracy
greater than 2n − 1.
On the other hand, suppose that p is any polynomial of degree at most
2n − 1. Factor this polynomial as
of degree at most n, ∫_a^b g q_{n+1} dµ = 0. However, since g(x_j) q_{n+1}(x_j) = 0 for
each node x_j,
I_n(g q_{n+1}) = Σ_{j=1}^n w_j g(x_j) q_{n+1}(x_j) = 0.
Since ∫_a^b · dµ and Q_n(·) are both linear operators,
∫_a^b f dµ = ∫_a^b r dµ   and   Q_n(f) = Q_n(r).
Since r is of degree at most n − 1, ∫_a^b r dµ = I_n(r), and so ∫_a^b f dµ = Q_n(f), as
claimed.
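The order of accuracy 2n − 1 is easy to verify numerically for Gauss–Legendre quadrature, i.e. for Lebesgue measure on [−1, 1]. A Python sketch (illustrative; it uses numpy's leggauss):

import numpy as np
from numpy.polynomial.legendre import leggauss

n = 5
x, w = leggauss(n)                      # n-node Gauss-Legendre rule on [-1, 1]

def exact_monomial_integral(k):
    # int_{-1}^{1} x^k dx
    return 0.0 if k % 2 else 2.0 / (k + 1)

for k in range(2 * n + 1):
    quad = np.sum(w * x ** k)
    print(k, abs(quad - exact_monomial_integral(k)))
# degrees 0, ..., 2n-1 are integrated to machine precision; degree 2n is not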
Theorem 9.9. The Gauss weights are given by
w_j = ( a_n ∫_a^b q_{n−1}(x)² w(x) dx ) / ( a_{n−1} q_n′(x_j) q_{n−1}(x_j) ),
where a_k is the coefficient of x^k in q_k(x).
Proof. Suppose that p is any polynomial of degree at most 2n − 1. Factor this
polynomial as
p(x) = g(x)qn+1 (x) + r(x),
where g is a polynomial of degree at most n − 1, and the remainder r is also a
polynomial of degree at most n − 1. Using Lagrange basis polynomials, write
r = Σ_{i=1}^n r(x_i) ℓ_i, so that
∫_a^b r dµ = Σ_{i=1}^n r(x_i) ∫_a^b ℓ_i dµ.
Since the Gauss quadrature formula is exact for r, it follows that the Gauss
weights satisfy
w_i = ∫_a^b ℓ_i dµ
...
...
...
FINISH ME!!!
Theorem 9.10. The weights of the Gauss quadrature formula are all positive.
Proof. Fix 1 ≤ i ≤ n and consider the polynomial
p(x) := ∏_{1 ≤ j ≤ n, j ≠ i} (x − x_j)²,
i.e. the square of the nodal polynomial, divided by (x − xi )2 . Since the degree of
p is strictly less than 2n − 1, the Gauss quadrature formula is exact, and since
p vanishes at every node other than xi , it follows that
∫_I p dµ = Σ_{j=1}^n w_j p(x_j) = w_i p(x_i).
—
Clenshaw–Curtis quadrature rules [19] (although in fact discovered thirty years
previously by Fejér [29]) are nested quadrature rules, with accuracy comparable
to Gaussian quadrature in many circumstances, and with weights that can be
computed with cost O(n log n).
f(cos θ) = a_0/2 + Σ_{k=1}^∞ a_k cos(kθ),
where
a_k = (2/π) ∫_0^π f(cos θ) cos(kθ) dθ.
The cosine series expansion of f is also a Chebyshev polynomial expansion of
f , since by construction Tk (cos θ) = cos(kθ):
f(x) = (a_0/2) T_0(x) + Σ_{k=1}^∞ a_k T_k(x).
∫_{∏_{j=1}^d [a_j, b_j]} f(x) dx = ∫_{a_d}^{b_d} ⋯ ∫_{a_1}^{b_1} f(x_1, …, x_d) dx_1 ⋯ dx_d.
general, when the one-dimensional quadrature formula uses N nodes, the error
for an integrand in C r using a tensor product rule is O(N −r/d ).
—
Smolyak sparse grids, which is particularly useful when combined with a nested
one-dimensional quadrature rule such as the Clenshaw–Curtis rule.
Assume that we are given, for each ℓ ∈ N, a one-dimensional quadrature
formula Q_ℓ^(1). The formula for Smolyak quadrature in dimension d ∈ N at level
ℓ ∈ N is defined in terms of the lower-dimensional quadrature formulæ by
Q_ℓ^(d)(f) := ( Σ_{i=1}^ℓ ( Q_i^(1) − Q_{i−1}^(1) ) ⊗ Q_{ℓ−i+1}^{(d−1)} )(f),
with the convention Q_0^(1) := 0.
This formula takes a little getting used to, and it helps to first consider the
case d = 2 and a few small values of ℓ. First, for ℓ = 1, Smolyak’s rule is the
quadrature formula
Q_1^(2) = Q_1^(1) ⊗ Q_1^(1),
i.e. the full tensor product of the one-dimensional quadrature formula Q_1^(1) with
itself. For the next level, ℓ = 2, Smolyak's rule is
Q_2^(2) = Σ_{i=1}^2 ( Q_i^(1) − Q_{i−1}^(1) ) ⊗ Q_{2−i+1}^(1)
= Q_1^(1) ⊗ Q_2^(1) + ( Q_2^(1) − Q_1^(1) ) ⊗ Q_1^(1)
= Q_1^(1) ⊗ Q_2^(1) + Q_2^(1) ⊗ Q_1^(1) − Q_1^(1) ⊗ Q_1^(1).
The "−Q_1^(1) ⊗ Q_1^(1)" term is included to avoid double counting. See Figure 9.1
for an illustration of the nodes of the Smolyak construction in the case that the
one-dimensional quadrature formula Q_ℓ^(1) has 2ℓ − 1 equally-spaced nodes.
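The ℓ = 2, d = 2 combination above can be assembled explicitly. The following Python sketch is illustrative only: the nested equal-weight rule q1d on equally spaced interior points of [0, 1] is one convenient choice of nested rule, not necessarily the one used in the figure; it builds Q_2^(2) = Q_1⊗Q_2 + Q_2⊗Q_1 − Q_1⊗Q_1 by merging weighted node sets:

import numpy as np
import itertools

def q1d(level):
    # simple nested 1-d rule: equal weights on equally spaced interior nodes of [0, 1]
    N = 2 ** level - 1
    nodes = np.arange(1, N + 1) / (N + 1)
    weights = np.full(N, 1.0 / N)
    return nodes, weights

def tensor(q_a, q_b):
    # tensor product of two 1-d rules as a dict {2-d node: weight}
    rule = {}
    for (xa, wa), (xb, wb) in itertools.product(zip(*q_a), zip(*q_b)):
        rule[(xa, xb)] = rule.get((xa, xb), 0.0) + wa * wb
    return rule

def add_rules(r1, r2, sign=1.0):
    out = dict(r1)
    for node, w in r2.items():
        out[node] = out.get(node, 0.0) + sign * w
    return out

Q1, Q2 = q1d(1), q1d(2)
smolyak2 = add_rules(add_rules(tensor(Q1, Q2), tensor(Q2, Q1)), tensor(Q1, Q1), sign=-1.0)

f = lambda x, y: np.exp(x + y)
approx = sum(w * f(x, y) for (x, y), w in smolyak2.items())
print(len(smolyak2), approx, (np.e - 1.0) ** 2)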
In general, when the one-dimensional quadrature formula at level ℓ uses Nℓ
nodes, the quadrature error for an integrand in C r using Smolyak recursion is
—
Remark 9.11. The right Sobolev space for studying sparse grids, since we need
1
pointwise evaluation, is Hmix , in which functions are weakly differentiable in
each coordinate direction.
—
(a) ℓ = 1 (b) ℓ = 2 (c) ℓ = 3
Figure 9.1: Illustration of the nodes of the 2-dimensional Smolyak sparse quadra-
(2)
ture formulas Qℓ for levels ℓ = 1, 2, 3, in the case that the 1-dimensional
(1)
quadrature formula Qℓ has 2ℓ − 1 equally-spaced nodes in the interior of the
domain of integration, i.e. is an open Newton–Cotes formula.
only partially alleviate this problem. Remarkably, however, the curse of dimen-
sionality can be entirely circumvented by resorting to random sampling of the
integration domain — provided, of course, that it is possible to draw samples
from the measure against which the integrand is to be integrated.
Monte Carlo methods are, in essence, an application of the law of large
numbers (LLN). Recall that the LLN states that if X (1) , X (2) , . . . is a sequence
of independent samples from a random variable X with finite expectation E[X],
then the sample average
K
1 X (k)
DR
X
K
k=1
converges to E[X] as K → ∞. The weak LLN states that the mode of conver-
gence is convergence in probability:
" K
#
1 X (k)
for all ε > 0, lim P X − E[X] > ε = 0;
K→∞ K
k=1
the strong LLN (which is harder to prove than the weak LLN) states that the
mode of convergence is actually almost sure:
—
" K
#
1 X (k)
P lim X = E[X] = 1.
K→∞ K
k=1
—
to obtain that, for any t ≥ 0,
V[X]
P[|SK − E[X]| ≥ t] ≤ .
Kt2
That is, for any ε ∈ (0, 1], with probability at least 1 − ε with respect to the
K Monte Carlo samples, the Monte Carlo average SK lies within (V[X]/Kε)1/2
of the true expected value E[X]. The fact that the error decays like K −1/2 ,
T
i.e. slowly, is a major limitation of ‘vanilla’ Monte Carlo methods; it is undesir-
able to have to quadruple the number of samples to double the accuracy of the
approximate integral.
AF
CDF Inversion One obvious criticism of Monte Carlo integration as presented
above is the accessibility of the measure of integration µ. Even leaving aside
the sensitive topic of the generation of truly ‘random’ numbers, it is no easy
matter to draw random numbers from an arbitrary probability measure on R.
The uniform measure on an interval may be said to be easily accessible; ρ dx,
for some positive and integrable function ρ, is not.
Z
DR
Fν (x) := dν
(−∞,x]
Importance Sampling
Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) methods
are a class of algorithms for sampling from a probability distribution µ based
on constructing a Markov chain that has µ as its equilibrium distribution. The
state of the chain after a large number of steps is then used as a sample of µ.
—
The quality of the sample improves as a function of the number of steps. Usually
it is not hard to construct a Markov chain with the desired properties; the more
difficult problem is to determine how many steps are needed to converge to µ
within an acceptable error.
#(P ∩ B)
DN (P ) := sup − λd (B)
B∈J N
Qd
where J is the collection of all products of the form i=1 [ai , bi ), with 0 ≤ ai <
bi ≤ 1. The star-discrepancy:
∗ #(P ∩ B)
DN (P ) := sup − λd (B)
B∈J ∗ N
—
Qd
where J ∗ is the collection of all products of the form i=1 [0, bi ), with 0 ≤ bi < 1.
Lemma 9.14.
∗ ∗
DN ≤ DN ≤ 2 d DN .
Definition 9.15. Let f : [0, 1]d → R. If J ⊆ [0, 1]d is a subrectangle of [0, 1]d ,
i.e. a d-fold product of subintervals of [0, 1], let ∆J (f ) be the sum of the values
T
of f at the 2d vertices of J, with alternating signs at nearest-neighbour vertices.
The Vitali variation of f : [0, 1]d → R is defined to be
( )
X Π is a partition of [0, 1]d into finitely
AF
Vit
V (f ) := sup |∆J (f )|
many non-overlapping subrectangles
J∈Π
where the sum runs over all faces F of [0, 1]d having dimension at most s.
DR
N Z
1 X ∗
f (xi ) − f (x) dx ≤ V[0,1]d (f )DN (x1 , . . . , xN ).
N i=1 [0,1] d
Furthermore, this bound is sharp in the sense that, for every {x1 , . . . , xN } ⊆
[0, 1)N and every ε > 0, there exists f : [0, 1]d → R with V (f ) = 1 such that
N Z
1 X ∗
f (xi ) − f (x) dx > DN (x1 , . . . , xN ) − ε.
N i=1 [0,1]d
Bibliography
W At Warwick, Monte Carlo integration and related topics are covered in the mod-
ule ST407 Monte Carlo Methods. See also Robert & Casella [81] for a survey
of MC methods in statistics.
Orthogonal polynomials for quadrature formulas can be found in Section 25.4
of Abramowitz & Stegun [1]. Gautschi’s general monograph [33] on orthogonal
polynomials covers applications to Gaussian quadrature in Section 3.1. The
article [107] compares the Gaussian and Clenshaw–Curtis quadrature rules and
explains their similar accuracy in many circumstances.
—
Smolyak recursion was introduced in [90].
Exercises
Exercise 9.1. Determine the weights for the open Newton–Cotes quadrature
formula. (Cf. Proposition 9.6.)
T
Exercise 9.2 (Takahasi–Mori (tanh–sinh) Quadrature [101]). Consider a definite
R1
integral over [−1, 1] of the form −1 f (x) dx. Employ a change of variables
x = ϕ(t) := tanh( π2 sinh(t)) to convert this to an integral over the real line. Let
h > 0 and K ∈ N, and approximate this integral over R using 2K + 1 points
AF
equally spaced from −Kh to Kh to derive a quadrature rule
Z 1 k=K
X
f (x) dx ≈ Qh,K (f ) := wk f (xk ),
−1 k=−K
and wk := .
cosh2 ( π2 sinh(kh))
How are these nodes distributed in [−1, 1]? If f is bounded, then what rate of
decay does f ◦ ϕ have? Hence, why is excluding the nodes xk with |k| > K a
reasonable approximation?
—
110 CHAPTER 9. NUMERICAL INTEGRATION
—
T
AF
DR
—
Chapter 10
—
Sensitivity Analysis and
Model Reduction
T
Le doute n’est pas un état bien agréable,
mais l’assurance est un état ridicule.
Voltaire
AF
The topic of this chapter is sensitivity analysis, which may be broadly un-
derstood as understanding how f (x1 , . . . , xn ) depends upon variations not only
in the xi individually, but also combined or correlated effects among the xi
of V are called the right singular vectors of A; and the diagonal entries of
Σ are called the singular values of A. While the singular values are unique,
the singular vectors may fail to be. By convention, the singular values and
corresponding singular vectors are ordered so that the singular values form a
decreasing sequence
σ1 ≥ σ2 ≥ · · · ≥ σmin{m,n} ≥ 0.
Thus, the SVD is a decomposition of A into a sum of rank-1 operators:
min{m,n} min{m,n}
X X
A = U ΣV ∗ = σj uj ⊗ vj = σj uj hvj , · i.
j=1 j=1
112 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION
The appeal of the SVD is that it is numerically stable, and that it provides
optimal low-rank approximation of linear operators: if Ak ∈ Rm×n is defined
by
Xk
Ak := σj uj ⊗ vj ,
j=1
then Ak is the optimal rank-k approximation to A in the sense that
X ∈ Rm×n and
kA − Ak k2 = min kA − Xk2 ,
rank(X) ≤ k
—
where k · k2 denotes the operator 2-norm on matrices.
Chapter 11 contains an important application of the SVD to the analysis of
sample data from random variables, an discrete variant of the Karhunen–Loève
expansion. Simply put, when A is a matrix whose columns are independent
samples from some some stochastic process (random vector), the SVD of A is
the ideal way to fit a linear structure to those data points. One may consider
nonlinear fitting and dimensionality reduction methods in the same way, and
T
this is known as manifold learning: see, for instance, the IsoMap algorithm of
Tenenbaum & al. [104].
10.2 Derivatives
AF
A natural first way to understand the dependence of f (x1 , . . . , xn ) upon x1 , . . . , xn
near some nominal point x∗ = (x∗1 , . . . , x∗n ) is to estimate the partial derivatives
of f at x∗ , i.e. to approximate
∂f ∗ f (x∗1 , . . . , x∗i + h, . . . , x∗n ) − f (x∗ )
(x ) := lim .
∂xi h→0 h
DR
Approximate by e.g.
∂f ∗
(x ) ≈
∂xi
Ultimately boils down to polynomial approximation f ≈ p, f ′ ≈ p′ .
Qn
Definition 10.2. The ith McDiarmid subdiameter of f : i=1 Xi → K is
Qn
x, y ∈ i=1 Xi
Di [f ] := sup |f (x) − f (y)|
xj = yj for j 6= i
′ xj ∈ Xj for j = 1, . . . , n
= sup |f (x1 , . . . , xi , . . . , xn ) − f (x1 , . . . , xi , . . . , xn )| .
x′i ∈ Xi
The McDiarmid diameter of f is
v
u n
uX
D[f ] := t Di [f ]2 .
i=1
10.3. MCDIARMID DIAMETERS 113
Remark 10.3. Note that although the two definitions of Di [f ] given above are
obviously mathematically equivalent, they are very different from a computa-
tional point of view: the first formulation is ‘obviously’ a constrained optimiza-
tion problem in 2n variables with n − 1 constraints (i.e. ‘difficult’), whereas
the second formulation is ‘obviously’ an unconstrained optimization problem in
n + 1 variables (i.e. ‘easy’).
—
Proof. Exercise 10.1.
The McDiarmid subdiameters and diameter are useful not only as sensitivity
indices, but also for providing a rigorous upper bound on deviations of a function
of independent random variables from its mean value:
T
i=1 Xi , and let f : X → R be absolutely integrable with respect to the law of X
and have finite McDiarmid diameter D[f ]. Then, for any t ≥ 0,
2t2
AF
P f (X) ≥ E[f (X)] + t ≤ exp − ,
D[f ]2
2t2
P f (X) ≤ E[f (X)] − t ≤ exp − ,
D[f ]2
2t2
P |f (X) − E[f (X)]| ≥ t ≤ 2 exp − .
D[f ]2
able withPindependent components, taking values in the cuboid ni=1 [ai , bi ]. Let
1 n
Sn := n i=1 Xi . Then, for any t ≥ 0,
−2n2 t2
P Sn − E[Sn ] ≥ t ≤ exp − Pn 2
,
i=1 (bi − ai )
and similarly for deviations below, and either side, of the mean.
put, the concentration of measure phenomenon, which was first noticed by Lévy
[61], is the fact that a function of a high-dimensional random variable with many
independent (or weakly correlated) components has its values overwhelmingly
concentrated about the mean (or median). An inequality such as McDiarmid’s
provides a rigorous certification criterion: to be sure that f (X) will deviate
above its mean by more than t with probability no greater than ε ∈ [0, 1], it
sufficies to show that
2t2
exp − ≤ε
D[f ]2
i.e. r
2
D[f ] ≤ t .
log ε−1
114 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION
Experimental effort then revolves around determining E[f (X)] and D[f ]; given
those ingredients, the certification criterion is mathematically rigorous. That
said, it is unlikely to be the optimal rigorous certification criterion, because Mc-
Diarmid’s inequality is not guaranteed to be sharp. The calculation of optimal
probability inequalities is considered in Chapter 14.
To prove McDiarmid’s inequality first requires a lemma bounding the moment-
generating function of a random variable:
Lemma 10.7 (Hoeffding’s lemma). Let X be a random variable with mean zero
taking values in [a, b]. Then, for t ≥ 0,
—
2
t (b − a)2
E[etX ] ≤ exp .
8
Proof. By the convexity of the exponential function, for each x ∈ [a, b],
b − x ta x − a tb
etx ≤ e + e .
b−a b−a
T
Therefore, applying the expectation operator,
E[etX ] ≤
b ta
e +
a tb
e =: eφ(t) .
AF
b−a b−a
Observe that φ(0) = 0, φ′ (0) = 0, and φ′′ (t) ≤ 41 (b − a)2 . Hence, since exp is an
increasing and convex function,
2
tX (b − a)2 t2 t (b − a)2
E[e ] ≤ exp 0 + 0t + = exp ,
4 2 8
DR
as claimed.
Proof of McDiarmid’s inequality (Theorem 10.5). The proof uses the proper-
ties of conditional expectation outlined in Example 3.14. Let Fi be the σ-
algebra generated by X1 , . . . , Xi , and define random variables Z0 , . . . , Zn by
Zi := E[f (X)|Fi ]. Note that Z0 = E[f (X)] and Zn = f (X). Now consider the
conditional increment (Zi − Zi−1 )|Fi−1 . First observe that
Li ≤ Zi − Zi−1 |Fi−1 ≤ Ui ,
where
—
= e−st E es i=1 Zi −Zi−1 E es(Zn −Zn−1 ) Fn−1 (Z0 , . . . , Zn−1 are Fn−1 -measurable)
2 2
h Pn−1 i
≤ e−st es Dn [f ] /8 E es i=1 Zi −Zi−1
by the first part of the proof. Repeating this argument a further n − 1 times
shows that
s2
P[f (X) − E[f (X)] ≥ t] ≤ exp −st + D[f ]2 .
T
8
The expression on the right-hand side is minimized by s = 4t/D[f ]2 , which
yields the first of McDiarmid’s inequalities, and the others follow easily.
AF
10.4 ANOVA/HDMR Decompositions
The topic of this section is a variance-based decomposition of a function of n vari-
ables that goes by various names such as the analysis of variance (ANOVA), the
functional ANOVA, and the high-dimensional model representation (HDMR).
As before, let (Xi , Fi , µi ) be a probability space for i = 1, . . . , n, and let
(X , F , µ) be the product space. Write N = {1, . . . , n}, and consider a (F -
DR
Note, however, that low-order cooperativity does not necessarily imply that
there is a small set of significant variables (it is possible that f{i} is large for
most i ∈ {1, . . . , n}), not does it say anything about the linearity or non-linearity
of the input-output relationship. Furthermore, there are many HDMR-type
expansions of the form given above; orthogonality criteria can be used to select
a particular HDMR representation.
—
Z
f∅ (x) := f dµ,
X
i.e. f∅ is the orthogonal projection of f onto the one-dimensional space of con-
stant functions, and so it is common to abuse notation and write f∅ ∈ R. For
i = 1, . . . , n, let
Z Z Z Z
T
f{i} (x) := ... ... f dµ1 . . . dµi−1 dµi+1 . . . dµn − f∅ .
X1 Xi−1 Xi+1 Xn
Note that f{i} (x) is actually a function of xi only, and so it is common to abuse
notation and write f{i} (xi ) instead of f{i} (x); f{i} is constant with respect to the
AF
other n − 1 variables. To take this idea further and capture cooperative effects
among two or more xi , for I ⊆ N := {1, . . . , n}, let |I| denote the cardinality of
I and let ∼ I denote the relative complement D \ I. For I = (i1 , . . . , i|I| ) ⊆ N
and x ∈ X , define the point xI by xI := (xi1 , . . . , xi|I| ); similar notation like
x∼I , XI , µI &c. should hopefully be self-explanatory.
Definition
P 10.8. The ANOVA decomposition or RS-HDMR of f is the sum
f = I⊆N fI , where the functions fI : X → R (or, by abuse of notation,
DR
Z X
fI (xI ) := f (x) − fJ (xJ ) dx∼I
X∼I J(I
Z X
= f (x) dx∼I − fJ (xJ ).
—
X∼I J(I
—
j ∈ J whenever |J| < |I|. Then
Z Z Z X
fI dµi = f dµ∼I − fJ dµi
Xi Xi X∼I J(I
Z Z XZ
= f dµ∼I dµi − fJ dµi
Xi X∼I Xi
T
J(I
i∈J
/
AF
FINISH ME!!!
2. FINISH ME!!!
f∅ (x) := f (x̄),
f{i} (x) := f (x̄1 , . . . , x̄i−1 , xi , x̄i+1 , . . . , x̄n ) − f∅ (x)
≡ f (xi , x̄∼{i} ) − f∅ (x),
f{i,j} (x) := f (x̄1 , . . . , x̄i−1 , xi , x̄i+1 , . . . , x̄j−1 , xj , x̄j+1 , . . . , x̄n ) − f{i} (x) − f{j} (x) − f∅ (x)
≡ f (x{i,j} , x̄∼{i,j} ) − f (xi , x̄∼{i} ) − f (xj , x̄∼{j} ) − f∅ (x),
—
X
fI (x) := f (xI , x̄∼I ) − fJ (x).
J(I
Hence,
fI (x)fJ (x) = 0 whenever xk = x̄k for some k ∈ I ∪ J.
Indeed, this orthogonality relation defines the Cut-HDMR expansion.
118 CHAPTER 10. SENSITIVITY ANALYSIS AND MODEL REDUCTION
—
F∅ := {f ∈ F | f (x) ≡ a for some a ∈ R and all x ∈ [0, 1]n .}
is the space of constant functions. For i = 1, . . . , n, P{i} : F → F{i} , where
Z 1
F{i} := f ∈ F f is independent of xj for j 6= i and f (x) dxi = 0
0
T
and, for ∅ 6= I ⊆ N , PI : F → FI , where
Z 1
FI := f ∈ F f is independent of xj for j ∈/ I and, for i ∈ I, f (x) dxi = 0 .
0
AF
These linear operators PI are idempotent, commutative and mutually orthogo-
nal, i.e. (
PI f, if I = J,
PI PJ f = PJ PI f =
0, if I 6= J,
and form a resolution of the identity
X
PI f = f.
DR
I⊆N
L
Thus, the space of functions F decomposes as the direct sum F = I⊆N FI ,
and this direct sum is orthogonal when F is a Hilbert subspace of L2 (X , µ).
′
The normalized lower and upper Sobol sensitivity indices of I ⊆ N are, respec-
tively,
s2I := τ 2I /σ 2 , and s2I := τ 2I /σ 2 .
BIBLIOGRAPHY 119
P
Since I⊆N σI2 = σ 2 = kf − f∅ k2L2 , it follows immediately that, for each
I ⊆ N,
0 ≤ s2I ≤ s2I ≤ 1.
P
Note, however, that while the ANOVA theorem guarantees that σ 2 = I⊆D σI2 ,
in general Sobol′ indices satisfy no such additivity relation:
X X
1 6= s2I < s2I 6= 1.
I⊆N I⊆N
—
Bibliography
Detailed treatment of the singular value decomposition can be found in any text
on (numerical) linear algebra, such as that of Trefethen & Bau [108].
McDiarmid’s inequality appears in [65], although the underlying martingale
results go back to Hoeffding [39] and Azuma [5]. Ledoux [59] and Ledoux &
Talagrand [60] give more general presentations of the concentration-of-measure
T
phenomenon, including geometrical considerations such as isoperimetric inequal-
ities.
In the statistical literature, the analysis of variance (ANOVA) originates
with Fisher & Mackenzie [32]. The ANOVA decomposition was generalized by
AF
Hoeffding [38] to functions in L2 ([0, 1]d ) for d ∈ N; for d = ∞, see Owen [75].
That generalization can easily be applied to L2 functions on any product do-
main, and leads to the functional ANOVA of Stone [96]. In the mathematical
chemistry literature, the HDMR (with its obvious connections to ANOVA) was
popularized by Rabitz & al. [3, 78]. The presentation of ANOVA/HDMR in
this chapter draws upon those references and the presentations of Beccacece &
Borgonovo [7] and Hooker [40].
Sobol′ indices were introduced by Sobol′ in [92]. HDMR by Sobol′ in [93].
DR
Exercises
Exercise 10.1. Prove Lemma 10.4. That is, show that, for each j = 1, . . . , n, the
McDiarmid subdiameter Dj [ · ] is a semi-norm on the space of bounded functions
f : X → K, as is the McDiarmid diameter D[ · ]. What are the null-spaces of
these semi-norms?
Exercise 10.2. Let f : [−1, 1]2 → R be a function of two variables. Sketch
—
—
T
AF
DR
—
Chapter 11
—
Spectral Expansions
T
healthy mind is indeed the ability to live
with uncertainty and ambiguity, but only
as much as there really is.
Julian Baggini
AF
This chapter and its sequels consider several spectral methods for uncertainty
quantification. At their core, these are orthogonal decomposition methods in
which a random variable stochastic process (usually the solution of interest)
over a probability space (Θ, F , µ) is expanded with respect to an appropriate
orthogonal basis of L2 (Θ, µ; R). This chapter lays the foundations by considering
DR
—
Define the covariance operator of U , also denoted by CU : L2 (Ω, dx; R) →
2
L (Ω, dx; R) by Z
(CU f )(x) := CU (x, y)f (y) dy.
Ω
T
CU (x, y)en (y) dy = λn en (x)
Ω
and Z
AF
em (x)en (x) dx = δmn .
Ω
and this series converges absolutely, and uniformly over compact subsets of X .
11.1. KARHUNEN–LOÈVE EXPANSIONS 123
where the {en }n∈N are orthonormal eigenfunctions of the covariance operator
CU , the corresponding eigenvalues {λn }n∈N are non-negative, the convergence
of the series is in L2 (Θ, µ; R) and uniform in x ∈ Ω, with
Z
—
Zn = U (x)en (x) dx.
Ω
Furthermore, the random variables Zn are centred, uncorrelated, and have vari-
ance λn :
Eµ [Zn ] = 0, and Eµ [Zm Zn ] = λn δmn .
Proof. Since it is continuous, the covariance function is a Mercer kernel. Hence,
by Mercer’s theorem, there is an orthonormal basis {en }n∈N of L2 (Ω, dx; R)
T
consisting of eigenfunctions of the covariance operator with non-negative eigen-
values {λn }n∈N . In this basis, the covariance function has the representation
X
CU (x, y) = λn en (x)en (y).
AF
n∈N
Then Z Z
Eµ [Zn ] = Eµ U (x)en (x) dx = E[U (x)]en (x) dx = 0.
Ω Ω
and
Z Z
Eµ [Zm Zn ] = Eµ U (x)em (x) dx U (x)en (x) dx
ZΩ Z Ω
—
= em (x)λn en (x) dx
Ω
= λn δmn .
124 CHAPTER 11. SPECTRAL EXPANSIONS
PN
Let SN := n=1 Zn en : Ω × Θ → R. Then, for any x ∈ Ω,
Eµ |U (x) − SN (x)|2
= Eµ [U (x)2 ] + Eµ [SN (x)2 ] − 2Eµ [U (x)SN (x)]
" N N # " N
#
XX X
= CU (x, x) + Eµ Zn Zm em (x)en (x) − 2Eµ U (x) Zn en (x)
n=1 m=1 n=1
N
" N Z
#
X X
= CU (x, x) + λn en (x)2 − 2Eµ U (x)U (y)en (y)en (x) dy
Ω
—
n=1 n=1
N
X N Z
X
= CU (x, x) + λn en (x)2 − 2 CU (x, y)en (y)en (x) dy
n=1 n=1 Ω
N
X
= CU (x, x) − λn en (x)2
n=1
→ 0 as N → ∞, uniformly in x, by Mercer’s theorem.
T
Among many possible decompositions of a random process, the Karhunen–
Loève expansion is optimal in the sense that the mean-square error of any trun-
cation of the expansion after finitely many terms is minimal. However, its
AF
utility is limited since the covariance function of the solution process is often
not known a priori. Nevertheless, the Karhunen–Loève expansion provides an
effective means of representing input random processes when their covariance
structure is known, and provides a simple method for sampling Gaussian mea-
sures on Hilbert spaces, which is a necessary step in the implementation of the
methods outlined in Chapter 6.
Example 11.5. Suppose that C : H → H is a self-adjoint, positive-definite,
DR
1.0 1.0
0.5 0.5
0 0
1 1
−0.5 −0.5
—
(a) 10 KL modes (b) 100 KL modes
1.0 1.0
0.5 0.5
0 0
1 1
T
−0.5 −0.5
and assume without loss of generality that the samples have mean zero. The
empirical covariance matrix of the samples is
b :=
C 1 ⊤
M 2 XX .
—
In fact, performing principal component analysis via the singular value de-
composition is numerically preferable to forming and then diagonalizing the
covariance matrix, since the formation of XX ⊤ can cause a disastrous loss of
precision; the classic example of this phenomenon is the Läuchli matrix
1 ε 0 0
1 0 ε 0 (0 < ε ≪ 1),
T
1 0 0 ε
for which taking the singular value decomposition is stable, but forming and
diagonalizing XX ⊤ is unstable.
AF
11.2 Wiener–Hermite Polynomial Chaos
The next section will cover polynomial chaos (PC) expansions in greater gen-
erality, and this section serves as an introductory prelude. In this, the classical
and notationally simplest setting, we consider expansions of a real-valued ran-
dom variable U with respect to a single standard Gaussian random variable
DR
Proof. TO DO!
Let us further extend the h · , · i notation for inner products and write h · i for
expectation with respect to the distribution γ of Ξ. So, for example, the or-
thogonality relation for the Hermite polynomials reads hHem Hen i = n!δmn .
—
U= un Hen (Ξ)
n∈N0
T
2
hHen i n! 2π −∞
X
E[U ] = hHe0 , U i = un hHe0 , Hen i = u0 ,
DR
n∈N0
so the expected value of U is simply its 0th PC coefficient. Similarly, its variance
is a weighted sum of the squares of its PC coefficients:
V[U ] = E |U − E[U ]|2
2
X
= E un Hen since E[U ] = u0
n∈N
—
X
= um un hHem , Hen i
m,n∈N
X
= u2n hHe2n i by Hermitian orthogonality.
n∈N
P
expansion of Y as k∈N0 yk Hek (Ξ) has coeffients
E eµ eσΞ Hek (Ξ)
yk =
E [Hek (Ξ)2 ]
µ
e
= E eσΞ Hek (Ξ)
k!
k k
eµ σ Ξ
= E Hek (Ξ) by Hermitian orthogonality
k! k!
2
eµ eσ /2 σ k
=
—
k!
i.e.
2 X σk
Y = eµ+σ /2
Hek (Ξ).
k!
k∈N0
2
From this expansion it can be seen that E[Y ] = eµ+σ /2 and
k 2 2
2 X σ
T
2
V[Y ] = e2µ+σ hHe2k i = e2µ+σ eσ − 1 .
k!
k∈N0
P
Of course, in practice, the series expansion U = n∈N0 un Hen (Ξ) must be
truncated after finitely many terms, and so it is natural to ask about the quality
AF
of the approximation
P
X
P
U ≈ U := un Hen (Ξ)
n=0
Since the Hermite polynomials form a complete orthogonal basis for L2 (R, γ; R),
the standard results about orthogonal approximations in Hilbert spaces apply.
In particular, by Corollary 3.19, the truncation error U − U P is orthogonal to
the space from which U P was chosen, i.e.
DR
* ! P
!+
X X
hU − U P , V i = un Hen vm Hem
n>P m=0
* +
X
= un vm Hen Hem
n>P
m∈{0,...,P}
X
= un vm hHen Hem i
n>P
m∈{0,...,P}
= 0.
11.3. GENERALIZED PC EXPANSIONS 129
—
The ideas of polynomial chaos can be generalized well beyond the setting in
which the stochastic germ Ξ is a standard Gaussian random variable, or even a
vector Ξ = (Ξ1 , . . . , Ξd ) of mutually orthogonal Gaussian random variables.
Let Ξ = (Ξ1 , . . . , Ξd ) be an Rd -valued random variable with independent
(and hence orthogonal) components. As usual, let R[ξ1 , . . . , ξd ] denote the ring
T
of all polynomials in ξ1 , . . . , ξd with real coefficients, and let R[ξ1 , . . . , ξd ]≤p de-
note those polynomials of total degree at most p ∈ N0 . Let Γp ⊆ R[ξ1 , . . . , ξd ]≤p
be a collection of polynomials that are mutually orthogonal and orthogonal to
R[ξ1 , . . . , ξd ]≤p−1 , and let Γ̃p := span Γp . This yields the orthogonal decompo-
AF
sition
M
L2 (Θ, µ; R) = Γ̃p .
p∈N0
to the monomials ξ α .
Note that (as usual, assuming separability) the L2 space over the product
probability space (Θ, F , µ) is isomorphic to the Hilbert space tensor product of
the L2 spaces over the marginal probability spaces:
d
O
L2 (Θ1 × · · · × Θd , µ1 ⊗ · · · ⊗ µd ; R) = L2 (Θi , µi ; R),
i=1
—
Example 11.13. Let Ξ = (Ξ1 , Ξ2 ) be such that Ξ1 and Ξ2 are independent (and
hence orthogonal) and such that Ξ1 is a standard Gaussian random variable
and Ξ2 is uniformly distributed on [−1, 1]. Hence, the univariate orthogonal
polynomials for Ξ1 are the Hermite polynomials Hen and the univariate orthog-
onal polynomials for Ξ2 are the Legendre polynomials Len . Then a system of
130 CHAPTER 11. SPECTRAL EXPANSIONS
Γ0 = {1},
Γ1 = {He1 (ξ1 ), Le1 (ξ2 )}
= {ξ1 , ξ2 },
Γ2 = {He2 (ξ1 ), He1 (ξ1 )Le1 (ξ2 ), Le2 (ξ2 )}
= {ξ12 − 1, ξ1 ξ2 , 21 (3ξ22 − 1)},
Γ3 = {He3 (ξ1 ), He2 (ξ1 )Le1 (ξ2 ), He1 (ξ1 )Le2 (ξ2 ), Le3 (ξ2 )}
—
= {ξ13 − 3ξ1 , ξ12 ξ2 − ξ2 , 12 (3ξ1 ξ22 − ξ1 ), 12 (5ξ23 − 3ξ2 )}.
Rather than have the orthogonal basis polynomials have two indices, one for
the degree p and one within each set Γp , it is useful and conventional to order
the basis polynomials using a single index k ∈ N0 . It is common in practice
to take Ψ0 = 1 and to have the polynomial degree be (weakly) increasing with
respect to the new index k. So, to continue Example 11.13, one could take
T
Ψ0 (ξ) = 1,
Ψ1 (ξ) = ξ1 ,
Ψ2 (ξ) = ξ2 ,
AF
Ψ3 (ξ) = ξ12 − 1,
Ψ4 (ξ) = ξ1 ξ2 ,
Ψ5 (ξ) = 21 (3ξ22 − 1),
Ψ6 (ξ) = ξ13 − 3ξ1 ,
Ψ7 (ξ) = ξ12 ξ2 − ξ2 ,
Ψ8 (ξ) = 21 (3ξ1 ξ22 − ξ1 ),
DR
a truncated gPC expansion. Suppose that the stochastic germ Ξ has dimension
d (i.e. has d independent components), and we work only with polynomials
of total degree at most p. The total number of coefficients in the truncated
expansion U P is
(d + p)!
P+1=
d!p!
That is, the total number of gPC coefficients that must be calculated grows
combinatorially as a function of the number of input random variables and the
degree of polynomial approximation. Such rapid growth limits the usefulness of
gPC expansions for practical applications where d and p are much greater than,
say, 10.
11.3. GENERALIZED PC EXPANSIONS 131
—
for µi . The chaos function associated to a multi-index α defined to be
s
ρ1 (ξ1 ) . . . ρd (ξd ) (1)
Ψα (ξ) := φα1 (ξ1 ) . . . φ(d)
αd (ξd ).
ρ(ξ)
T
basis for L2 (Θ, µ; R), so we have the usual series expansion U = α uα Ψα .
Note, however, that with the exception of Ψ0 = 1, the functions Ψα are not
polynomials. Nevertheless, we still have the usual properties that truncation
error is orthogonal the the approximation subspace, and
AF
X
Eµ [U ] = u0 , Vµ [U ] = u2α hΨ2α i.
α6=0
P
X
U P (ξ) = uk Ψk (ξ),
k=0
where the polynomials Ψk are orthogonal with respect to the law of ξ, and with
the usual convention that Ψ0 = 1. A first, easy, observation is that
P
X
E[U P ] = hΨ0 , U P i = uk hΨ0 , Ψk i = u0 ,
k=0
so the expected value of U P is simply its 0th GPC coefficient. Similarly, its
—
—
d×d
and the covariance matrix C ∈ R of U P is given by
P
X P
X
C= uk u⊤ 2
k hΨk i i.e. Cij = uik ujk hΨ2k i.
k=1 k=1
T
tic process U : Ω × Θ → R; that is, for each x ∈ Ω, U (x, · ) ∈ L2 (Θ, µ) is a real-
valued random variable, and, for each θ ∈ Θ, U ( · , θ) ∈ L2 (Ω, dx) is a scalar
field on the domain Ω. Recall that
L2 (Θ, µ; R) ⊗ L2 (Ω, dx; R) ∼
= L2 (Θ × Ω, µ ⊗ dx; R) ∼
= L2 Θ, µ; L2 (Ω, dx) .
AF
As usual, take {Ψk | k ∈ N0 } to be an orthogonal polynomial basis of L2 (Θ, µ; R),
ordered (weakly) by total degree, with Ψ0 = 1. A GPC expansion of the random
field U is an L2 -convergent expansion of the form
X
U (x, ξ) = uk (x)Ψk (ξ),
k∈N0
P
X
U (x, ξ) ≈ U P (x, ξ) = uk (x)Ψk (ξ).
k=0
The functions uk : Ω → R are called the stochastic modes of the process U . The
stochastic mode u0 : Ω → R is the mean field of U :
E[U (x)] = E[U P (x)] = u0 (x).
The variance of the field at x ∈ Ω is
∞
X P
X
—
FINISH ME!!!
FINISH ME!!!
—
Figure 11.2: ...
and so
X
CU (x, y) = uk (x)uk (y)hΨ2k i
k∈N
T
P
X
≈ uk (x)uk (y)hΨ2k i
k=1
= CU P (x, y).
AF
At least when dim Ω is low, it is very common to see the behaviour of a stochastic
field U (or U P ) summarized by plots of the mean field and the variance field,
as in Figure 11.1; when dim Ω = 1, a surface or contour plot of the covariance
field CU (x, y) as in Figure 11.2 can also be informative.
Bibliography
DR
ularized by Ghanem & Spanos [34]; the extension to gPC and the connection
with the Askey scheme is due to Xiu & Karniadakis [118].
The extension of generalized polynomial chaos to arbitrary dependency among
the components of the stochastic germ, as in Remark 11.14, is due to Soize &
Ghanem [94].
Exercises
Exercise 11.1. Use the Karhunen–Loève expansion to generate samples from a
Gaussian random field U on [−1, 1] (i.e., for each x ∈ [−1, 1], U (x) is a Gaussian
random variable) with covariance function
134 CHAPTER 11. SPECTRAL EXPANSIONS
1. CU (x, y) = exp(−|x − y|2 /a2 ,
2. CU (x, y) = exp(−|x − y|/a , and
3. CU (x, y) = (1 + |x − y|2 /a2 )−1
for various values of a > 0. Plot and comment upon your results, particularly
the smoothness of the fields produced.
2
d
Exercise 11.2. Consider the negative Laplacian operator L := − dx 2 acting on
real-valued functions on the interval [0, 1], with zero boundary conditions. Show
that the eigenvalues µn and eigenfunctions en of L are
—
µn = (πn)2 , en (x) = sin(πnx).
Hence show that C := L−1 has the same eigenfunctions with eigenvalues λn =
(πn)−2 . Hence, using the Karhunen–Loève theorem, generate figures similar to
Figure 11.1 for your choice of mean field m : [0, 1] → R.
Exercise 11.3. Do the analogue of Exercise 11.3 for the negative Laplacian
d2 d2 2
operator L := − dx 2 − dy 2 acting on real-valued functions on the square [0, 1] ,
T
again with zero boundary conditions.
Exercise 11.4. Show that the eigenvalues λn and eigenfunctions en of the ex-
ponential covariance function C(x, y) = exp(−|x − y|/a) on [−b, b] are given
AF
by ( 2a
1+a2 wn2 , if n ∈ 2Z,
λn = 2a
2 ,
1+a2 vn if n ∈ 2Z + 1,
q
sin(wn x) b − sin(2w n b)
, if n ∈ 2Z,
2wn
en (x) = q
cos(v x) b + sin(2vn b) , if n ∈ 2Z + 1,
n 2vn
DR
Hence, using the Karhunen–Loève theorem, generate sample paths from the
Gaussian measure with covariance kernel C and your choice of mean path.
—
Chapter 12
—
Stochastic Galerkin Methods
T
Am I an Atheist or an Agnostic?
Bertrand Russell
AF
Unlike non-intrusive approaches, which rely on individual realizations to
determine the stochastic model response to random inputs, Galerkin methods
use a formalism of weak solutions, expressed in terms of inner products, to
form systems of governing equations for the solution’s PC coefficients, which
are generally coupled together. They are in essence the extension to a suitable
tensor product Hilbert space of the usual Galerkin formalism that that underlies
many theoretical and numerical approaches to PDEs. This chapter is devoted
DR
M(u; d) = 0.
A good model for this kind of set-up is an elliptic boundary balue problem on,
say, a bounded, connected domain Ω ⊆ Rn with smooth boundary ∂Ω:
In this case, the input data d are typically the forcing term f : Ω → R and
the permeability field κ : Ω → Rn×n ; in some cases, the domain Ω itself might
depend upon d, but this introduces additional complications that will not be
considered in this chapter. For a PDE such as this, solutions u are typically
sought in the Sobolev space H01 (Ω) of L2 functions that have a weak derivative
that itself lies in L2 , and that vanish on ∂Ω in the sense of trace. Moreover, it
is usual to seek weak solutions, i.e. u ∈ H01 (Ω) for which the inner product of
(12.1) with any v ∈ H01 (Ω) is an equality. That is, integrating by parts, we seek
u ∈ H01 (Ω) such that
hκ∇u, ∇viL2 (Ω) = hf, viL2 (Ω) for all v ∈ H01 (Ω). (12.2)
136 CHAPTER 12. STOCHASTIC GALERKIN METHODS
On expressing this problem in a chosen basis of H01 (Ω), the column vector [u] of
coefficients of u in this basis turn out to satisfy a matrix-vector equation (i.e. a
system of simultaneous linear equations) of the form [a][u] = [b] for some matrix
[a] determined by the permeability field κ and a column vector [b] determined
by the forcing term f .
In this chapter, after reviewing basic Lax–Milgram theory and Galerkin pro-
jection for problems like (12.1) / (12.2), we consider the situation in which the
input data d are uncertain and are described as a random variable D(ξ). Then
the solution is also a random variable U (ξ) and the model relationship becomes
—
M(U (ξ); D(ξ)) = 0.
T
form of a matrix-vector equation [A][U ] = [B] related to, but more complicated
than, the deterministic problem [a][u] = [b].
AF
12.1 Lax–Milgram Theory and Galerkin Projection
Let H be a real Hilbert space equipped with a bilinear form a : H × H → R.
Given f ∈ H∗ (i.e. a continuous linear functional f : H → R), the associated
weak problem is:
where the second equality follows from integration by parts when u, v are smooth
functions that vanish on ∂Ω; such functions form a dense subset of the Sobolev
space H01 (Ω). This short calculation motivates two important developments in
—
the treatment of the PDE (12.1). First, even though the original formulation
(12.1) seems to require the solution u to have two orders of differentiability, the
last line of the above calculation makes sense even if u and v have only one
order of (weak) differentiability, and so we restrict attention to H01 (Ω). Second,
we declare u ∈ H01 (Ω) to be a weak solution of (12.1) if the L2 (Ω) inner product
of (12.1) with any v ∈ H01 (Ω) holds as an equality of real numbers, i.e. if
Z Z
− ∇ · (κ(x)∇u(x))v(x) dx = f (x)v(x) dx
Ω Ω
i.e. if
a(u, v) = hf, viL2 (Ω) for all v ∈ H01 (Ω).
12.1. LAX–MILGRAM THEORY AND GALERKIN PROJECTION 137
The existence and uniqueness of solutions problems like (12.3), under appro-
priate conditions on a (which of course are inherited from appropriate conditions
on κ), is ensured by the Lax–Milgram theorem, which generalizes the Riesz rep-
resentation theorem that any Hilbert space is isomorphic to its dual space.
Theorem 12.2 (Lax–Milgram). Let a be a bilinear form on a Hilbert space H
such that
1. (boundedness) there exists a constant C > 0 such that, for all u, v ∈ H,
|a(u, v)| ≤ Ckukkvk; and
2. (coercivity) there exists a constant c > 0 such that, for all v ∈ H, |a(v, v)| ≥
—
ckvk2 .
Then for all f ∈ H∗ , there exists a unique u ∈ V such that, for all v ∈ H,
a(u, v) = hf | vi.
Proof. For each u ∈ H, v 7→ a(u, v) is a bounded linear functional on H. So,
by the Riesz representation theorem, given u ∈ H, there is a unique w ∈ H
such that hw, vi = a(u, v). Define Au := w. The map A : H → H is clearly
well-defined. It is also linear: take α1 , α2 ∈ R and u1 , u2 ∈ H:
T
hA(α1 u1 + α2 u2 ), vi = a(α1 u1 + α2 u2 , v)
= α1 a(u1 , v) + α2 a(u2 , v)
= α1 hAu1 , vi + α2 hAu2 , vi
AF
= hα1 Au1 + α2 Au2 , vi.
A is a bounded map, since
kAuk2 = hAu, Aui = a(u, Au) ≤ CkukkAuk,
so kAuk ≤ Ckuk. Furthermore, A is injective since
kAukkuk ≥ hAu, ui = a(u, u) ≥ ckuk2,
DR
so Au = 0 =⇒ u = 0.
To see that the range R(A) is closed, take a convergent sequence (vn )n∈N in
R(A) that converges to some v ∈ H. Choose un ∈ H such that Aun = vn for
each n ∈ N. The sequence (Aun )n∈N is Cauchy, so
kAun − Aum kkun − um k ≥ |hAun − Aum , un − um i|
= |a(un − um , un − um )|
≥ ckun − um k2 .
—
find u(M) ∈ V (M) such that a(u(M) , v (M) ) = hf, v (M) i for all v (M) ∈ V (M) .
Note that if the hypotheses of the Lax–Milgram theorem are satisfied on the full
space H, then they are certainly satisfied on the subspace V (M) , thereby ensuring
the existence and uniqueness of solutions to the Galerkin problem. Note well,
though, that existence of a unique Galerkin solution for each M ∈ N0 does not
—
imply the existence of a unique weak solution (nor even multiple weak solutions)
to the full problem; for this, one typically needs to show that the Galerkin
approximations are uniformly bounded and appeal to a Sobolev embedding
theorem to extract a convergent subsequence.
Example 12.3. 1. The Fourier basis {ek }k∈Z of L2 (S1 , dx; C) defined by
1
T
ek (x) = √ exp(ikx).
2π
For Galerkin projection, one can use the finite-dimensional subspace
AF
V (2M+1) := span{e−M , . . . , e−1 , e0 , e1 , . . . , eM }
x 7→ cos(kx), for k ∈ N0 ,
x 7→ sin(kx), for k ∈ N.
DR
—
onto V (M) ; if H has an orthonormal basis {en } and u = n∈N un en , then the
P
optimal approximation of u in V (M) = span{e1 , . . . , eM } is M n
n=1 u en , but this
(M)
is not generally the same as the Galerkin solution u . However, the next
result, Céa’s lemma, shows that u(M) is a quasi-optimal approximation to u
(note that the ratio C/c is always at least 1):
Lemma 12.4 (Céa’s lemma). Let a, c and C be as in the statement of the Lax–
Milgram theorem. Then the weak solution u ∈ H and the Galerkin solution
T
u(M) ∈ V (M) satisfy
C n o
u − u(M) ≤ inf u − v (M) v (M) ∈ V (M) .
c
AF
Proof. Exercise 12.2.
Matrix Form. It is helpful to recast the Galerkin problem in matrix form. Let
{φ1 , . . . , φM } be a basis for V (M) . Then u(M) solves the Galerkin problem if
and only if
a(u(M) , φm ) = hf, φm i for m ∈ {1, . . . , M }.
PM
Now expand u(M) in this basis as u(M) = m=1 um φm and insert this into the
DR
previous equation:
M
! M
X X
m
a u φm , φi = um a(φm , φi ) = hf, φi i for i ∈ {1, . . . , M }.
m=1 m=1
The matrix [a] ∈ RM×M is the Gram matrix of the bilinear form a, and is of
course a symmetric matrix whenever a is a symmetric bilinear form.
Remark 12.5. In practice the matrix-vector equation [a][u(M) ] = [b] is never
solved by explicitly inverting the Gram matrix [a] to obtain the coefficients um
via [u(M) ] = [a]−1 [b]. Indeed, in many situations the Gram matrix is sparse,
and so inversion methods that take advantage of that sparsity are used; further-
more, for large systems, the methods used are often iterative rather than direct
(e.g. factorization-based).
140 CHAPTER 12. STOCHASTIC GALERKIN METHODS
Lax–Milgram Theory for Banach Spaces. There are many extensions of the
now-classical Lax–Milgram lemma beyond the setting of symmetric bilinear
forms on Hilbert spaces. For example, the following generalization is due to
Kozono & Yanagisawa [53]:
Theorem 12.6 (Kozono–Yanagisawa). Let X be a Banach space, Y a reflexive
Banach space, and a : X × Y → C a bilinear form such that
1. there is a constant M > 0 such that
—
2. the null spaces
T
RX and Y = NY ⊕ RY ;
3. there is a constant C > 0 such that
!
AF
|a(x, y)|
kxkX ≤ C sup + kPX xkX for all x ∈ X ,
y∈Y kykY
!
|a(x, y)|
kykY ≤ C sup + kPY ykY for all x ∈ X ,
x∈X kxkX
with θ being drawn from some probability space (Θ, F , µ). To that end, we
introduce a stochastic space S, which in the following will be L2 (Θ, µ; R). We
retain also a Hilbert space V in which the deterministic solution u(θ) is sought
for each θ ∈ Θ; implicitly, V is independent of the problem data, or rather of
θ. Thus, the space in which the stochastic solution U is sought is the tensor
12.2. STOCHASTIC GALERKIN PROJECTION 141
—
β(Y ) := Eµ [hF | Y iV ].
Clearly, if α satisfies the boundedness and coercivity assumptions of the Lax–
Milgram theorem on H, then, for every F ∈ L2 (Θ, µ; V ′ ), there is a unique weak
solution U ∈ L2 (Θ, µ; V) satisfying
α(U, Y ) = β(Y ) for all Y ∈ L2 (Θ, µ; V).
A sufficient, but not necessary, condition for α to satisfy the hypotheses of the
T
Lax–Milgram theorem on H is for A(θ) to satisfy those hypotheses uniformly in
θ on V:
Theorem 12.7 (Stochastic Lax–Milgram theorem). Let (Θ, F , µ) be a probabil-
AF
ity space, and let A be a random variable on Θ, taking values in the space of
symmetric bilinear forms on a Hilbert space V, and satisfying the hypotheses of
the deterministic Lax–Milgram theorem (Theorem 12.2) uniformly with respect
to θ ∈ Θ. Define a bilinear form α on L2 (Θ, µ; V) by
α(X, Y ) := Eµ [A(X, Y )].
Then, for every F ∈ L2 (Θ, µ; V ′ ), there is a unique U ∈ L2 (Θ, µ; V) such that
DR
are both strictly positive and finite. The bilinear form α satisfies, for all X, Y ∈
H,
—
α(X, Y ) = Eµ [A(X, Y )]
≤ Eµ CkXkV kY kV
1/2
≤ C ′ Eµ kXk2V ]1/2 Eµ [kY k2V
= C ′ kXkKkY kH ,
and
α(X, X) = Eµ [A(X, X)]
≥ Eµ ckXk2V
≥ c′ kXk2H.
142 CHAPTER 12. STOCHASTIC GALERKIN METHODS
Remark 12.8. Note, however, that uniform boundedness and coercivity of A are
not necessary for α to be bounded and coercive. For example, the constants c(θ)
and C(θ) may degenerate to 0 or ∞ as θ approaches certain points of the sample
space Θ. Provided that these degeneracies are integrable and yield positive and
—
finite expected values, this will not ruin the boundedness and coercivity of α.
Indeed, there may be an arbitrarily large (but µ-measure zero) set of θ for which
there is no weak solution u to the deterministic problem
T
Stochastic Galerkin Projection. Let V (M) be a finite-dimensional subspace of
V, with basis φ1 , . . . , φM . As indicated above, take the stochastic space S to be
L2 (Θ, µ; K), which we assume to be equipped with an orthogonal decomposition
such as a PC decomposition. Let S P be a finite-dimensional subspace of S,
AF
for example the span of the polynomials of degree at most P P. The Galerikin
projection of the stochastic problem on H is to find U = m,k um k φm ⊗ Ψk ∈
(M) P
V ⊗ S such that
In particular, it suffices to find U that satisfies this condition for each basis
DR
[α][U ] = [β]
...
Suppose that we are already given a linear problem with its deterministic
problem discretized and cast in the matrix form
—
• and bi := hΨi , Bi ∈ RM is the ith stochastic mode of the source term B.
Note that, in general, the stochastic modes uj of the solution U (and, indeed the
coefficients um
j of the stochastic modes in the deterministic basis φ1 , . . . , φM )
are all coupled together through the matrix [A]. This can be a limitation of
stochastic Galerkin methods, and will be remarked upon later.
Example 12.9. As a special case, suppose that the random data have no impact
on the differential operator and affect only the right-hand side B. In this case
T
A(θ) = a for all θ ∈ Θ and so
[A]ij := hΨi , [a]Ψj i = [a]hΨi , Ψj i = [a]δij hΨ2i i.
AF
Hence, the stochastic Galerkin system, in its matrix form (12.5), becomes
[a] [0] ... [0] u0 b0
. . .. u b
[0] [a]hΨ1 i 2 . . 1 1
.. = .. .
. . .
.
.. .. .. [0] .
[0] ... [0] [a]hΨ2P i uP bP
DR
k=0
with coefficient matrices [α]k ∈ RM×M for k ≥ 0. In this case, the blocks [α]ij
are given by
P
X
[α]ij = hΨi , [α]Ψj i = [α]k hΨi , Ψj Ψk i.
k=0
Hence, the Galerkin block system (12.5) is equivalent to
[α]00 . . . [α]0P u0 b0
. . . . .
.. .. .. .. = .. , (12.6)
[α]P0 . . . [α]PP uP bP
144 CHAPTER 12. STOCHASTIC GALERKIN METHODS
where
hB, Ψi i
bi := ,
hΨ2i i
P
X
[α]ij := [α]k Ckji ,
k=0
hΨi Ψj Ψk i
Cijk := .
hΨk Ψk i
The rank-3 tensor Cijk is called the Galerkin tensor :
—
• it is symmetric in the first two indices: Cijk = Cjik ;
• this induces symmetry in the problem (12.6): [α]ij = [α]ji ;
• and since the Ψk are an orthogonal system, many of the (P + 1)3 entries
of Cijk are zero, leading to sparsity for the matrix problem; for example,
P
X
[α]00 = [α]k Ck00 = [α]0 .
T
k=0
Note that the Galerkin tensor Cijk is determined entirely by the PC system
{Ψk | k ≥ 0}, and so while there is a significant computational cost associated
AF
to evaluating its entries, this is a one-time cost: the Galerkin tensor can be
pre-computed, stored, and then used for many different problems, i.e. many As
and Bs.
dU
= −ZU, U (t) = B,
DR
dt
for U : [0, T ] × Θ → R. Let {Ψk }k∈N0 be an orthogonal basis for L2 (Θ, µ; R)
with the
P usual conventionPthat Ψ0 = 1. Give Z, B P and U the chaos expansions
Z = k∈N0 zk Ψk , B = k∈N0 bk Ψk and U (t) = k∈N0 uk (t)Ψk respectively.
Projecting the evolution equation onto the basis {Ψk }k∈N0 yields
dU
E Ψk = −E[ZU Ψk ] for each k ∈ N0 .
dt
Inserting the chaos expansions for Z and U into this yields, for every k ∈ N0 ,
—
" #
X X X
E u̇i (t)Ψi Ψk = −E zj Ψj ui (t)Ψi Ψk ,
i∈N0 j∈N0 i∈N0
X
i.e. u̇k (t)hΨ2k i =− zj ui (t)E[Ψj Ψi Ψk ],
i,j∈N0
X
i.e. u̇k (t) = − Cijk zj ui (t).
i,j∈N0
the above summations over N0 become summations over {0, . . . , P}, yielding a
coupled system of P + 1 ordinary differential equations. In matrix form:
u0 (t) u0 u0 (0) b0
d . ⊤ . .. ..
.
. = A .
. , =
. . ,
dt
uP (t) uP uP (0) bP
P
where A ∈ R(P+1)×(P+1) is Aik = − j Cijk zj .
—
12.3 Nonlinearities
Nonlinearities of various types occur throughout practical problems, and their
treatment is critical in the context of stochastic Galerkin methods, which require
the projection of these nonlinearities onto the finite-dimensional solution spaces.
For example, given the approximation
T
P
X
U (ξ) ≈ U P (ξ) = uk Ψk (ξ)
k=0
p
how does one calculate the coefficients of, say, U (ξ)2 or U (ξ)? The first exam-
AF
ple, U 2 , is a special case of taking the product of two Galerkin approximations,
and can be resolved using the Galerkin tensor Cijk of the previous section.
XP
hW, Ψk i
wk = = ui vj Cijk .
hΨk , Ψk i i,j=0
P
The truncation of the expansion W = k≥0 wk Ψk to k = 0, . . . , P is the or-
thogonal projection of W onto S P (and hence the L2 -closest approximation of
W in S P ) and is called the Galerkin product, or pseudo-spectral product, of U
and V , denoted U ∗ V .
The fact that multiplication of two random variables can be handled effi-
ciently, albeit with some truncation error, in terms of their expansions in the
146 CHAPTER 12. STOCHASTIC GALERKIN METHODS
gPC basis and the Galerkin tensor is very useful: it adds to the list of reasons
why one might wish to pre-computate and store the Galerkin tensor for use in
many problems.
However, the situation of binary products (and hence squares) is. . .
Triple and higher products... non-commutativity
X P
X
U= uk Ψk ≈ uk Ψk
—
k≥0 k=0
P PP
we seek a random variable V = k≥0 vk Ψk ≈ k=0 vk Ψk such that U (ξ)V (ξ) =
1 for almost every ξ. The weak interpretation of this desideratum is that U ∗
V = Ψ0 . Thus we are led to the following matrix-vector equation for the gPC
coefficients of V :
P PP 1
T
P
Ck00 uk . . . k=0 CkP0 uk v0
k=0 . . 0
. .. .. . =
.. (12.7)
. . . .
PP PP .
k=0 Ck0P uk ... k=0 CkPP uk
vP
0
AF
Naturally, if U (ξ) = 0 for a positive probability set of ξ, then V (ξ) will be
undefined for those same ξ. Furthermore, if U ≈ 0 with ‘too large’ probability,
then V may exist a.e. but fail to be in L2 . Hence, it is not surprising to learn
that while (12.7) has a unique solution whenever the matrix on the left-hand
is non-singular, the system becomes highly ill-conditioned as the amount of
probability mass near U = 0 increases.
DR
FINISH ME!!!
Bibliography
Basic Lax–Milgram theory and Galerkin methods for PDEs can be found in any
modern textbook on PDEs, such as those by Evans [27] (see Chapter 6) and
Renardy & Rogers [80] (see Chapter 9).
The monograph of Xiu [117] provides a general introduction to spectral
methods for uncertainty quantification, including Galerkin methods (Chapter
6), but is light on proofs. The book of Le Maı̂tre & Knio [58] covers Galerkin
—
Exercises
Exercise 12.1. Let a be a bilinear form satisfying the hypotheses of the Lax–
Milgram theorem. Given f ∈ H∗ , show that the unique u such that a(u, v) =
hf | vi for all v ∈ H satisfies kukH ≤ c−1 kf kH′ .
—
N
0, if x ≤ 0 or x ≤ xn−1 ;
(x − x
n−1 )/h, if xn−1 ≤ x ≤ xn ;
φn (x) :=
(x n+1 − x)/h, if xn ≤ x ≤ xn+1 ;
0, if x ≥ 1 or x ≥ xn+1 .
T
What space of functions is spanned by φ0 , . . . , φN ? For these functions φ0 , . . . , φN ,
calculate the Gram matrix for the bilinear form
Z 1
a(u, v) := u′ (x)v ′ (x) dx
AF
0
U ∗ V = V ∗ U,
DR
—
T
AF
DR
—
Chapter 13
—
Non-Intrusive Spectral
Methods
Isaac Asimov
minimizes the stochastic residual, but has the severe disadvantage that the
stochastic modes of the solution are coupled together by a large system such
as (12.5). Hence, the Galerkin formalism is not suitable for situations in which
deterministic solutions are slow and expensive to obtain, and the determinis-
tic solution method cannot be modified. Many so-called legacy codes are not
amenable to such intrusive methods of UQ.
In contrast, this chapter considers non-intrusive spectral methods for UQ.
These are characterized by the feature that the solution of the deterministic
problem is a ‘black box’ that does not need to be modified for use in the spec-
tral method, beyond being able to be evaluated at any desired point θ of the
probability space (Θ, F , µ).
150 CHAPTER 13. NON-INTRUSIVE SPECTRAL METHODS
of u ∈ L (Θ, µ; H) ∼
2
= H ⊗ L2 (Θ, µ; R) in terms of coefficients (stochastic modes)
uk ∈ H and an orthogonal basis {Ψk | k ∈ N0 } of L2 (Θ, µ; R). As usual, the
stochastic modes are given by
—
Z
Eµ [uΨk ] 1
ûk = = u(θ)Ψk (θ) dµ(θ).
Eµ [Ψ2k ] γk Θ
If the normalization constants γk := kΨk k2L2 (µ) are known ahead of time, then
it remains only to approximate the integral with respect to µ of the product of
u with each basis function Ψk .
T
Deterministic Quadrature. If the dimension of Θ is low and u(θ) is relatively
smooth as a function of θ, then an appealing approach to the estimation of
Eµ [uΨk ] is deterministic quadrature. For optimal polynomial accuracy, Gaus-
sian quadrature (i.e. nodes at the roots of µ-orthogonal polynomials) may be
AF
used. In practice, nested quadrature rules such as Clenshaw–Curtis may be
preferable since one does not wish to have to discard past solutions of u upon
passing to a more accurate quadrature rule. For multi-dimensional domains of
integration Θ, sparse quadrature rules may be used to alleviate the curse of
dimension.
Sources of Error. In practice, the following sources of error arise when com-
puting pseudo-spectral expansions of this type:
1. discretization error comes about through the approximation of H by a
finite-dimensional subspace, i.e. the
Pmapproximation of and of the stochastic
—
—
to be solved on an interval of time [a, b]. Choose n points
called collocation nodes. Now find a polynomial p(t) ∈ R≤n [t] so that the ODE
T
is satisfied for k = 1, . . . , n, as is the initial condition p(a) = ua . For example,
if n = 2, t1 = a and t2 = b, then the coefficients c2 , c1 , c0 ∈ R of the polynomial
approximation
X2
p(t) = ck (t − a)k ,
AF
k=0
i.e.
f (b, p(b)) − f (a, ua )
p(t) = (t − a)2 + f (a, ua )(t − a) + ua .
2(b − a)
The above equation implicitly defines the final value p(b) of the collocation
solution. This method is also known as the trapezoidal rule for ODEs, since the
same solution is obtained by rewriting the differential equation as
Z t
u(t) = u(a) + f (s, u(s)) ds
a
—
and approximating the integral on the right-hand side by the trapezoidal quadra-
ture rule for integrals.
It should be made clear at the outset that there is nothing stochastic about
‘stochastic collocation’, just as there is nothing chaotic about ‘polynomial chaos’.
The meaning of the term ‘stochastic’ in this case is that the collocation principle
is being applied across the ‘stochastic space’ (i.e. the probability space) of a
stochastic process, rather than the space/time/space-time domain. Consider
for example the random PDE
where, for µ-a.e. θ in some probability space (Θ, F , µ), the differential operator
Lθ and boundary operator Bθ are well-defined and the PDE admits a unique
solution u( · , θ) : Ω → R. The solution u : Ω × Θ → R is then a stochastic
process. We now let ΘM = {θ(1) , . . . , θ(M) } ⊆ Θ be a finite set of prescribed
collocation nodes. The collocation problem is to find an approximate solution
u(M) ≈ u that satisfies
Lθ(m) u(M) x, θ(m) = 0 for x ∈ Ω,
(M) (m)
Bθ(m) u x, θ =0 for x ∈ ∂Ω,
—
for m = 1, . . . , M ; there is, however, some flexibility in how to approximate
u(x, θ) for θ ∈
/ ΘM .
T
Example 13.2. Consider the initial value problem
d
u(t, θ) = −eθ u(t, θ), u(0, θ) = 1,
dt
AF
with θ ∼ N (0, 1). Take as the collocation nodes θ(1) , . . . , θ(M) ∈ R the M roots
of the Hermite polynomial HeM of degree M . The collocation solution u(m) is
given at the collocation nodes θ(m) by
d (m) (m)
u (t, θ(m) ) = −eθ u(t, θ(m) ), u(0, θ(m) ) = 1,
dt
i.e. (m)
u(t, θ(m) ) = exp(−eθ t)
DR
(M)
Away from the collocation nodes, u is defined by polynomial interpolation:
for each t, u(M) (t, θ) is a polynomial in θ of degree at most M with prescribed
values at the collocation nodes. Writing this interpolation in Lagrange form
yields
M
X
(M)
u (t, θ) = u(m) (t, θ(m) )ℓm (θ)
m=1
M
X Y
(m) θ − θ(k)
= exp(−eθ t) .
θ(m)− θ(k)
—
m=1 1≤k≤M
k6=m
Bibliography
The monograph of Xiu [117] provides a general introduction to spectral methods
for uncertainty quantification, including collocation methods, but is light on
proofs. The recent paper of Narayan & Xiu [70] presents a method for stochastic
collocation on arbitrary sets of nodes using the framework of least orthogonal
interpolation.
Non-intrusive methods for UQ, including NISP and stochastic collocation,
are covered in Chapter 3 of Le Maı̂tre & Knio [58].
—
Exercises
Exercise 13.1. Let u = (u1 , . . . , uM ) : Θ → RM be a square-integrable random
vector defined over a probability space (Θ, F , µ), and let {Ψk | k ∈ N0 }, with
normalization constants γk := kΨk k2L2 (µ) , be an orthogonal basis for L2 (Θ, µ; R).
Suppose that N independent samples {(θ(n) , u(θ(n) )) | n = 1, . . . , N } with
T
θ(n) ∼ µ, are given, and it is desired to use these N Monte Carlo samples
to form a truncated pseudo-spectral expansion
K
X (N )
u ≈ u(N ) := uk Ψk
AF
k=0
of u, where the approximate stochastic modes are obtained using Monte Carlo
integration. Write down the defining equation for the mth component of the
(N )
k th approximate stochastic mode, uk,m , and hence show that the approximate
stochastic modes solve the matrix equation
(N ) (N )
DR
u1,1 . . . uK,1
. .. ..
. −1 ⊤
. . . = Γ DP ,
(N ) (N )
u1,M . . . uK,M
where
Γ := diag(γ0 , . . . , γK ),
u1 (θ(1) ) . . . u1 (θ(N ) )
.. .. ..
D := . . . ,
—
(1) (N )
uM (θ ) . . . uM (θ )
Ψ0 (θ(1) ) . . . Ψ1 (θ(N ) )
.. .. ..
P := . . . .
ΨK (θ(1) ) . . . ΨK (θ ) (N )
Exercise 13.2. What is the analogue of the result of Exercise 13.1 when the
integrals are approximated using a quadrature rule, rather than using Monte
Carlo?
154 CHAPTER 13. NON-INTRUSIVE SPECTRAL METHODS
—
T
AF
DR
—
Chapter 14
—
Distributional Uncertainty
T
risks by investors. Uncertainty is ruled
out if possible. [P]eople generally prefer
the predictable. Few recognize how de-
structive this can be, how it imposes se-
AF
vere limits on variability and thus makes
whole populations fatally vulnerable to
the shocking ways our universe can throw
the dice.
Heretics of Dune
Frank Herbert
DR
In the previous chapters, it has been assumed that an exact model is avail-
able for the probabilistic components of a system, i.e. that all probability dis-
tributions involved are known and can be sampled. In practice, however, such
assumptions about probability distributions are always wrong to some degree:
the distributions used in practice may only be simple approximations of more
complicated real ones, or there may be significant uncertainty about what the
real distributions actually are. The same is true of uncertainty about the correct
form of the forward physical model.
Pm only constraints on p are the natural ones that pi ≥ 0 and that S(p) :=
The
i=1 pi = 1. Temporarily neglect the inequality constraints and use the method
of Lagrange multipliers to find the extrema of H(p) among all p ∈ Rm with
S(p) = 1; such p must satisfy, for some λ ∈ R,
1 + log p1 + λ
..
0 = ∇H(p) − λ∇S(p) = − .
1 + log pm + λ
—
It is clear that any solution to this equation must have p1 = · · · = pm , for
if pi and pj differ, then at most one of 1 + log pi + λ and 1 + log pj + λ can
equal 0 for the same value of λ. Therefore, since S(p) = 1, it follows that the
1
unique extremizer of H(p) among {p ∈ Rm | S(p) = 1} is p1 = · · · = pm = m .
The inequality constraints that were neglected initially are satisfied, and are not
active constraints, so it follows that the uniform probability measure on X is
the unique maximum entropy distribution on X .
T
A similar argument using the calculus of variations shows that the unique
maximum entropy probability distribution on an interval [a, b] ( R is the uni-
1
form distribution |b−a| dx.
AF
Example 14.2 (Constrained maximum entropy distributions). Consider the set of
all probability measures µ on R that have mean m and variance s2 ; what is
the maximum entropy distribution in this set? Consider probability measures µ
that are absolutely continuous with respect to Lebesgue measure, having density
ρ. Then the aim is to find µ to maximize
Z
H(µ) = − ρ(x) log ρ(x) dx,
R
DR
R R
Rsubject to2 the constraints that ρ ≥ 0, R ρ(x) dx = 1, R xρ(x) dx = 0 and
2
R (x − m) ρ(x) dx = s . Introduce Lagrange multipliers c = (c0 , c1 , c2 ) and the
Lagrangian
Z Z Z Z
Fc (ρ) := − ρ(x) log ρ(x) dx+c0 ρ(x) dx+c1 xρ(x) dx+c2 (x−m)2 ρ(x) dx.
R R R R
d
Fc (ρ + tσ) = 0.
dt t=0
—
Discrete Entropy and Convex Programming. In discrete settings, the entropy
of a probability measure p ∈ M1 ({1, . . . , m}) with respect to the uniform mea-
sure as defined in (14.1) is a strictly convex function of p ∈ Rm >0 . Therefore,
when p is constrained by a family of convex constraints, finding the maximum
entropy distribution is a convex program:
T
m
X
minimize: pi log pi
i=1
with respect to: p ∈ Rm
AF
subject to: p ≥ 0
p·1=1
ϕi (p) ≤ 0 for i = 1, . . . , n,
Suppose that we are interested in the value Q(µ† ) of some quantity of interest
that is a functional of a partially-known probability measure µ† on a space
X . Very often, Q(µ† ) arises as the expected value with respect to µ† of some
function q : X → R, so the objective is to determine
—
The inequality
Q(A) ≤ Q(µ† ) ≤ Q(A)
is, by construction, the sharpest possible bound on Q(µ† ) given only information
that µ† ∈ A. The obvious question is, can Q(A) and Q(A) be computed?
T
finite set equipped with the discrete topology. Then the space of measurable
functions f : X → R is isomorphic to RK and the space of probability measures
µ on X is isomorphic to the unit simplex in RK . If the available information on
If the available information on µ† is that it lies in the set

A := {µ ∈ M1(X) | E_µ[ϕ_n] ≤ c_n for n = 1, . . . , N},

then, identifying each µ ∈ M1(X) with a vector p in the unit simplex of R^K, the function q with the vector (q(1), . . . , q(K)) ∈ R^K, and each ϕ_n with the vector (ϕ_n(1), . . . , ϕ_n(K)), Q̲(A) and Q̄(A) are the optimal values of the finite-dimensional linear program

extremize: p · q
with respect to: p ∈ R^K
subject to: p ≥ 0,
p · 1 = 1,
p · ϕ_n ≤ c_n for n = 1, . . . , N.

Note that the feasible set A for this problem is a convex subset of R^K; indeed, A is a polytope, i.e. the intersection of finitely many closed half-spaces of R^K. Furthermore, as a closed subset of the probability simplex in R^K, A is compact.
Therefore, by Corollary 4.18, the extreme values of this problem are certain to
—
be found in the extremal set ext(A). This insight can be exploited to great effect
in the study of distributional robustness problems for general sample spaces X .
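Before moving beyond the finite case, note that on a finite sample space both Q̲(A) and Q̄(A) are simply the values of linear programs and can be computed with any LP solver. The following sketch (purely illustrative: the three-point space and the values of q, ϕ and c are invented) uses scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

# Toy data: a three-point sample space, a quantity of interest q,
# and a single moment constraint E_mu[phi] <= c (all values invented).
q = np.array([0.0, 1.0, 3.0])
phi = np.array([1.0, 2.0, 4.0])
c = 2.5

A_ub, b_ub = phi[np.newaxis, :], np.array([c])    # p . phi <= c
A_eq, b_eq = np.ones((1, 3)), np.array([1.0])     # p . 1 = 1
bounds = [(0.0, 1.0)] * 3                         # p >= 0

# Lower bound: minimise p . q; upper bound: minimise -(p . q).
lower = linprog(q, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
upper = linprog(-q, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("lower bound on E[q]:", lower.fun)
print("upper bound on E[q]:", -upper.fun)

Since a linear functional on a compact polytope attains its extreme values at vertices, the optimal p can always be taken to be an extreme point of A, in keeping with the discussion above.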
Remarkably, when the feasible set A of probability measures is sufficiently
like a polytope, it is not necessary to consider finite sample spaces. What would
appear to be an intractable optimization problem over an infinite-dimensional
set of measures is in fact equivalent to a tractable finite-dimensional problem.
Thus, the aim of this section is to find a finite-dimensional subset A_∆ of A with the property that

ext_{µ ∈ A} Q(µ) = ext_{µ ∈ A_∆} Q(µ).
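To preview the practical content of such a reduction, here is a small numerical sketch (illustrative only: the threshold t = 2, mean bound m = 1 and support bound B = 10 are invented). Over all µ ∈ M1([0, B]) with E_µ[X] ≤ m, the supremum of P_µ[X ≥ t] is Markov's bound m/t, and a derivative-free search over measures supported on at most two points, which is all that a reduction of this kind requires for a single moment constraint, recovers it:

from scipy.optimize import differential_evolution

t, m, B = 2.0, 1.0, 10.0   # threshold, mean bound and support bound (all invented)

def neg_objective(z):
    # z = (w, x1, x2) encodes the candidate measure w*delta_{x1} + (1 - w)*delta_{x2}.
    w, x1, x2 = z
    mean = w * x1 + (1.0 - w) * x2
    prob = w * (x1 >= t) + (1.0 - w) * (x2 >= t)
    penalty = 100.0 * max(mean - m, 0.0)   # soft enforcement of E[X] <= m
    return -(prob - penalty)

result = differential_evolution(neg_objective,
                                bounds=[(0.0, 1.0), (0.0, B), (0.0, B)],
                                seed=1, tol=1e-8)
print("numerical supremum :", -result.fun)   # approximately m / t = 0.5
print("Markov bound m / t :", m / t)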
Extreme Points of Moment Classes. The first step in this reduction is to clas-
sify the extremal measures in sets of probability measures that are prescribed
by inequality or equality constraints on the expected value of finitely many arbi-
trary measurable test functions, so-called moment classes. Since, in finite time,
we can only verify — even approximately, numerically — the truth of finitely
many inequalities, such moment classes are appealing feasible sets from an epis-
temological point of view because they conform to Karl Popper’s dictum
that “Our knowledge can only be finite, while our ignorance must necessarily
be infinite.”
—
Definition 14.3. A Borel measure µ on a topological space X is called inner regular if, for every Borel-measurable set E ⊆ X,

µ(E) = sup{µ(K) | K ⊆ E and K is compact}.

A topological space on which every Borel probability measure is inner regular is called a Radon space.
Definition 14.6. A Riesz space (or vector lattice) is a vector space V together with a partial order ≤ that is compatible with the vector space operations and in which every pair of elements has a least upper bound.
C := {(tx, t) ∈ V × R | t ∈ R, t ≥ 0, x ∈ V}
[Figure: the moment class A = {µ ∈ M1(X) | E_µ[ϕ] ≤ c} inside the simplex of probability measures M1(X) ⊂ M±(X) on a three-point sample space, with the Dirac measures δ_{x_1}, δ_{x_2}, δ_{x_3} as vertices; its extreme points satisfy ext(A) ⊆ ∆_1 ∩ A.]
The definition of a Choquet simplex extends the usual finite-dimensional
definition: a finite-dimensional compact Choquet simplex is a simplex in the
ordinary sense of being the closed convex hull of finitely many points.
With these definitions, the extreme points of moment sets of probability
measures can be described by the following theorem due to Winkler. The proof
is rather technical, and can be found in [116]. The important point is that
when X is a pseudo-Radon space, Winkler’s theorem applies with P = M1 (X ),
and so the extreme measures in moment classes will simply be finite convex
combinations of Dirac measures. Figures like Figure 14.1 should make this an
intuitively plausible claim.
Theorem 14.8 (Winkler). Let P ⊆ M1(X) be a Choquet simplex such that ext(P) consists of Dirac measures. Fix measurable functions ϕ_1, . . . , ϕ_n : X → R and c_1, . . . , c_n ∈ R and let

A := { µ ∈ P | ϕ_i ∈ L¹(µ) and E_µ[ϕ_i] ≤ c_i for i = 1, . . . , n }.

Then ext(A) consists of those µ ∈ A that are convex combinations µ = Σ_{i=1}^m α_i δ_{x_i} of at most m ≤ n + 1 Dirac measures for which the vectors (ϕ_1(x_i), . . . , ϕ_n(x_i), 1), i = 1, . . . , m, are linearly independent.
As always, the reader should check that the terminology ‘measure affine’
is a sensible choice by verifying that when X = {1, . . . , K} is a finite sample
space, the restriction of any affine function F : R^K ≅ M±(X) → R to a subset
A ⊆ M1 (X ) is a measure affine function in the sense of Definition 14.9.
—
An important and simple example of a measure affine functional is an eval-
uation functional, i.e. the integration of a fixed measurable function q:
Proposition 14.10. If q is bounded either below or above, then ν ↦ E_ν[q] is a measure affine map.
Proof. Exercise 14.2.
In summary, we now have the following:

Theorem 14.11. Let X be a pseudo-Radon space and let A ⊆ M1(X) be a moment class of the form

A := {µ ∈ M1(X) | E_µ[ϕ_j] ≤ 0 for j = 1, . . . , N}

for measurable functions ϕ_1, . . . , ϕ_N : X → R. Then

ext(A) ⊆ A_∆ := A ∩ ∆_N(X)
= { µ ∈ M1(X) | µ = Σ_{i=0}^N α_i δ_{x_i} for some α_0, . . . , α_N ∈ [0, 1] and x_0, . . . , x_N ∈ X with Σ_{i=0}^N α_i = 1 and Σ_{i=0}^N α_i ϕ_j(x_i) ≤ 0 for j = 1, . . . , N }.

Hence, if q is bounded either below or above, then Q̲(A) = Q̲(A_∆) and Q̄(A) = Q̄(A_∆).
Proof. The classification of ext(A) is given by Winkler’s theorem (Theorem 14.8).
Since q is bounded on at least one side, Proposition 14.10 implies that µ ↦ F(µ) := E_µ[q] is measure affine. Let µ ∈ A be arbitrary and choose a proba-
—
Sets of product probability measures can fail to be convex (see Exercise 14.3), so the reduction to extreme
points cannot be applied directly. Fortunately, a cunning application of Fubini’s
theorem resolves this difficulty. Note well, though, that unlike Theorem 14.11,
Theorem 14.12 does not say that A_∆ = ext(A); it only says that the optimization problem has the same extreme values over A_∆ and A.
Theorem 14.12. Let A ⊆ M1(X), with X = X_1 × · · · × X_K, be a moment class of product measures of the form

A := { µ = ⊗_{k=1}^K µ_k ∈ ⊗_{k=1}^K M1(X_k) | E_µ[ϕ_j] ≤ 0 for j = 1, . . . , N,
E_{µ_1}[ϕ_{1j}] ≤ 0 for j = 1, . . . , N_1,
. . .
E_{µ_K}[ϕ_{Kj}] ≤ 0 for j = 1, . . . , N_K },

and let

A_∆ := { µ ∈ A | µ_k ∈ ∆_{N+N_k}(X_k) for k = 1, . . . , K }.

Then, if q is bounded either above or below, Q̲(A) = Q̲(A_∆) and Q̄(A) = Q̄(A_∆).
Proof. Let ε > 0 and let µ* ∈ A be ε/(K + 1)-suboptimal for the maximization of µ ↦ E_µ[q] over µ ∈ A, i.e.

E_{µ*}[q] ≥ sup_{µ ∈ A} E_µ[q] − ε/(K + 1).

By Fubini’s theorem,

E_{µ*_1 ⊗ · · · ⊗ µ*_K}[q] = E_{µ*_1}[ E_{µ*_2 ⊗ · · · ⊗ µ*_K}[q] ].

By the same arguments used in the proof of Theorem 14.11, µ*_1 can be replaced
Suppose now that, in addition to the partially-known measure µ†, the quantity of interest also depends upon an imperfectly-known measurable function g† from X into another space Y of outputs: it may only be known that g† lies in some subset G of the set of all (measurable) functions from X to Y; furthermore, our information about g† and our information about µ† may be coupled in some way, e.g. by knowledge of E_{X∼µ†}[g†(X)]. Therefore, we now consider admissible sets of the form

A ⊆ { (g, µ) | g : X → Y is measurable and µ ∈ M1(X) },

quantities of interest of the form Q(g, µ) = E_{X∼µ}[q(X, g(X))] for some measurable function q : X × Y → R, and seek the extreme values
—
Q̲(A) := inf_{(g,µ) ∈ A} E_{X∼µ}[q(X, g(X))]   and   Q̄(A) := sup_{(g,µ) ∈ A} E_{X∼µ}[q(X, g(X))].
—
Figure 14.1: The function f that takes the three points on the left (equipped
with k · k∞ ) to the three points on the right (equipped with k · k2 ) has Lipschitz
constant 1, but has no 1-Lipschitz extension F to (0, 0), let alone the whole
plane R2 , since F ((0, 0)) would have to lie in the (empty) intersection of the
three grey discs. Cf. Example 14.14.
Special cases of Minty’s theorem include the Kirszbraun–Valentine theorem
(which assures that Lipschitz functions between Hilbert spaces can be extended
without increasing the Lipschitz constant) and McShane’s theorem (which as-
sures that scalar-valued continuous functions on metric spaces can be extended
without increasing a prescribed convex modulus of continuity). However, the
extensibility property fails for Lipschitz functions between Banach spaces, even
finite-dimensional ones:
Example 14.14. Let E ⊆ R² be given by E := {(1, −1), (−1, 1), (1, 1)} and define f : E → R² by

f((1, −1)) := (1, 0),    f((−1, 1)) := (−1, 0),    and    f((1, 1)) := (0, √3).
Then, if q is bounded either above or below, Q̲(A) = Q̲(A_∆) and Q̄(A) = Q̄(A_∆).
—
Example 14.16. Suppose that g† : [−1, 1] → R is known to have Lipschitz constant Lip(g†) ≤ L. Suppose also that the inputs of g† are distributed according to µ† ∈ M1([−1, 1]), and it is known that E_{X∼µ†}[X] = 0 and E_{X∼µ†}[g†(X)] ≥ m, so that the admissible set is

A = { (g, µ) | g : [−1, 1] → R has Lipschitz constant ≤ L, µ ∈ M1([−1, 1]), E_{X∼µ}[X] = 0, and E_{X∼µ}[g(X)] ≥ m }.

Suppose that our quantity of interest is the probability of output values below 0, i.e. q(x, y) = 1[y ≤ 0]. Then Theorem 14.15 ensures that the extreme values of

Q(g, µ) = E_{X∼µ}[1[g(X) ≤ 0]] = P_{X∼µ}[g(X) ≤ 0]

are the solutions of
extremize: Σ_{i=0}^{2} w_i 1[y_i ≤ 0]
with respect to: w_0, w_1, w_2 ≥ 0
x_0, x_1, x_2 ∈ [−1, 1]
y_0, y_1, y_2 ∈ R
subject to: Σ_{i=0}^{2} w_i = 1
Σ_{i=0}^{2} w_i x_i = 0
Σ_{i=0}^{2} w_i y_i ≥ m
|y_i − y_j| ≤ L |x_i − x_j| for i, j ∈ {0, 1, 2}.
Perhaps not surprisingly given its general form, McDiarmid’s inequality is not
the least upper bound on Pµ [g(X) ≤ 0]; the actual least upper bound can be
calculated using the reduction theorems.
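As a crude numerical sketch of such a calculation (illustrative only: the values L = 1 and m = 0.5 are invented, the constraints of the reduced program are enforced only approximately by penalties, and the output is an approximation rather than the exact least upper bound), the finite-dimensional program of Example 14.16 can be attacked with a derivative-free optimizer such as scipy's differential evolution:

import numpy as np
from scipy.optimize import differential_evolution

L, m = 1.0, 0.5   # illustrative Lipschitz constant and mean bound

def neg_objective(z):
    # z packs the reduced variables: support points x, values y = g(x), raw weights v.
    x, y, v = z[0:3], z[3:6], z[6:9]
    w = v / np.sum(v)                        # normalised weights, so sum(w) = 1
    prob = np.sum(w * (y <= 0.0))            # objective: P[g(X) <= 0]
    pen = 100.0 * abs(np.sum(w * x))         # soft enforcement of E[X] = 0
    pen += 100.0 * max(m - np.sum(w * y), 0.0)   # soft enforcement of E[g(X)] >= m
    for i in range(3):                        # Lipschitz consistency of the y values
        for j in range(i + 1, 3):
            pen += 100.0 * max(abs(y[i] - y[j]) - L * abs(x[i] - x[j]), 0.0)
    return -(prob - pen)

bounds = [(-1.0, 1.0)] * 3 + [(-2.0, 2.0)] * 3 + [(0.01, 1.0)] * 3
result = differential_evolution(neg_objective, bounds, seed=1, maxiter=2000)
print("approximate least upper bound on P[g(X) <= 0]:", -result.fun)

Such a penalty-based search is only a heuristic approximation of the reduced program's exact solution.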
FINISH ME!!!
—
Bibliography
Berger [8] makes the case for distributional robustness, with respect to priors
and likelihoods, in Bayesian inference. Smith [89] provides theory and several
practical examples for generalized Chebyshev inequalities in decision analysis.
Boyd & Vandenberghe [16, Sec. 7.2] cover some aspects of distributional robust-
ness under the heading of nonparametric distribution estimation, in the case in
which it is a convex problem. Convex optimization approaches to distributional
robustness and optimal probability inequalities are also considered by Bertsi-
mas & Popescu [9]. There is also an extensive literature on the related topic of
majorization, for which see the book of Marshall & al. [64].
The classification of the extreme points of moment sets and the consequences
for the optimization of measure affine functionals are due to von Weizsäcker &
Winkler [112, 113] and Winkler [116]. Karr [49] proved similar results under
additional topological and continuity assumptions. Theorem 14.12 and the Lip-
schitz version of Theorem 14.15 can be found in Owhadi & al. [76] and Sullivan
& al. [99] respectively. Extension Theorem 14.13 is due to Minty [68], and gen-
eralizes earlier results by McShane [66], Kirszbraun [51] and Valentine [111].
Example 14.14 is taken from the example given on p. 202 of Federer [28].
Exercises
Exercise 14.1. Consider the topology T on R generated by the basis of open
sets [a, b), where a, b ∈ R.
1. Show that this topology generates the same σ-algebra on R as the usual
Euclidean topology does. Hence, show that Gaussian measure is a well-
—
Exercise 14.3. Let λ denote uniform measure on the unit interval [0, 1] ⊊ R.
Show that the line segment in M1 ([0, 1]2 ) joining the measures λ ⊗ δ0 and δ0 ⊗ λ
contains measures that are not product measures. Hence show that a set A of
product probability measures like that considered in Theorem 14.12 is typically
not convex.
Exercise 14.4. Calculate by hand, as a function of t ∈ R, D ≥ 0 and m ∈ R,
—
where

A := { µ ∈ M1(R) | E_{X∼µ}[X] ≥ m and diam(supp(µ)) ≤ D }.
Exercise 14.5. Calculate by hand, as a function of t ∈ R, s ≥ 0 and m ∈ R,

and

sup_{µ ∈ A} P_{X∼µ}[|X − m| ≥ st],

where

A := { µ ∈ M1(R) | E_{X∼µ}[X] ≤ m and E_{X∼µ}[|X − m|²] ≤ s² }.
Exercise 14.6. Prove Theorem 14.15.
Exercise 14.7. Calculate by hand, as a function of t ∈ R, m ∈ R, z ∈ [0, 1] and v ∈ R,

sup_{(g,µ) ∈ A} P_{X∼µ}[g(X) ≤ t],

where

A := { (g, µ) | g : [0, 1] → R has Lipschitz constant 1, µ ∈ M1([0, 1]), E_{X∼µ}[g(X)] ≥ m, and g(z) = v }.
Numerically verify your calculations.
—
—
Bibliography and Index
—
Bibliography
—
[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical
Functions with Formulas, Graphs, and Mathematical Tables. Dover Pub-
lications Inc., New York, 1992. Reprint of the 1972 edition.
[2] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A
Hitchhiker’s Guide. Springer, Berlin, third edition, 2006.
[3] Ö. F. Alış and H. Rabitz. Efficient implementation of high dimensional
model representations. J. Math. Chem., 29(2):127–142, 2001.
[4] M. Atiyah. Collected Works. Vol. 6. Oxford Science Publications. The
Clarendon Press Oxford University Press, New York, 2004.
[5] K. Azuma. Weighted sums of certain dependent random variables. Tōhoku
Math. J. (2), 19:357–367, 1967.
[6] V. Barthelmann, E. Novak, and K. Ritter. High dimensional polynomial
interpolation on sparse grids. Adv. Comput. Math., 12(4):273–288, 2000.
[7] F. Beccacece and E. Borgonovo. Functional ANOVA, ultramodularity and
Mathematical Statistics. John Wiley & Sons Inc., New York, third edition,
1995. A Wiley-Interscience Publication.
[11] E. Bishop and K. de Leeuw. The representations of linear functionals by
measures on sets of extreme points. Ann. Inst. Fourier. Grenoble, 9:305–
331, 1959.
[12] S. Bochner. Integration von Funktionen, deren Werte die Elemente eines
Vectorraumes sind. Fund. Math., 20:262–276, 1933.
[13] G. Boole. An Investigation of the Laws of Thought on Which are Founded
the Mathematical Theories of Logic and Probabilities. Walton and Maber-
ley, London, 1854.
—
[17] R. H. Cameron and W. T. Martin. The orthogonal development of non-
linear functionals in series of Fourier–Hermite functionals. Ann. of Math.
(2), 48:385–392, 1947.
[18] M. Capiński and E. Kopp. Measure, Integral and Probability. Springer Un-
dergraduate Mathematics Series. Springer-Verlag London Ltd., London,
second edition, 2004.
[19] C. W. Clenshaw and A. R. Curtis. A method for numerical integration
on an automatic computer. Numer. Math., 2:197–205, 1960.
[20] S. L. Cotter, M. Dashti, J. C. Robinson, and A. M. Stuart. Bayesian
inverse problems for functions and applications to fluid mechanics. Inverse
Problems, 25(11):115008, 43, 2009.
[22] M. Dashti, S. Harris, and A. Stuart. Besov priors for Bayesian inverse
—
[32] R. A. Fisher and W. A. Mackenzie. The manurial response of different
potato varieties. J. Agric. Sci., 13:311–320, 1923.
Approach. Springer-Verlag, New York, 1991.
[41] J. Humpherys, P. Redd, and J. West. A Fresh Look at the Kalman Filter.
—
[42] L. Jaulin, M. Kieffer, O. Didrit, and É. Walter. Applied Interval Analysis:
With Examples in Parameter and State Estimation, Robust Control and
Robotics. Springer-Verlag London Ltd., London, 2001.
[47] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction
theory. Trans. ASME Ser. D. J. Basic Engrg., 83:95–108, 1961.
—
[48] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung.
Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys., 1947(37):79, 1947.
[49] Alan F. Karr. Extreme points of certain sets of probability measures, with
applications. Math. Oper. Res., 8(1):74–85, 1983.
[51] M. D. Kirszbraun. Über die zusammenziehende und Lipschitzsche Trans-
formationen. Fund. Math., 22:77–108, 1934.
[52] D. D. Kosambi. Statistics in function space. J. Indian Math. Soc. (N.S.),
7:76–88, 1943.
[54] M. Krein and D. Milman. On extreme points of regular convex sets. Studia
—
[64] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of
Majorization and its Applications. Springer Series in Statistics. Springer,
New York, second edition, 2011.
[66] E. J. McShane. Extension of range of functions. Bull. Amer. Math. Soc.,
40(12):837–842, 1934.
[67] J. Mikusiński. The Bochner Integral. Birkhäuser Verlag, Basel, 1978.
Lehrbücher und Monographien aus dem Gebiete der exakten Wis-
senschaften, Mathematische Reihe, Band 55.
1966.
[75] A. B. Owen. Latin supercube sampling for very high dimensional simula-
tions. ACM Trans. Mod. Comp. Sim., 8(2):71–102, 1998.
—
Equations, volume 13 of Texts in Applied Mathematics. Springer-Verlag,
New York, second edition, 2004.
ics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970
original, Princeton Paperbacks.
[96] C. J. Stone. The use of polynomial splines and their tensor products in
—
multivariate function estimation. Ann. Statist., 22(1):118–184, 1994.
[99] T. J. Sullivan, M. McKerns, D. Meyer, F. Theil, H. Owhadi, and M. Or-
tiz. Optimal uncertainty quantification for legacy data observations of
Lipschitz functions. Math. Model. Numer. Anal., 47(6):1657–1689, 2013.
[100] G. Szegő. Orthogonal Polynomials. American Mathematical Society, Prov-
idence, R.I., fourth edition, 1975. American Mathematical Society, Collo-
quium Publications, Vol. XXIII.
[102] M. Talagrand. Pettis integral and measure theory. Mem. Amer. Math.
[103] A. Tarantola. Inverse Problem Theory and Methods for Model Parame-
ter Estimation. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2005.
[108] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. Society for
Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.
—
1979/80.
[113] H. von Weizsäcker and G. Winkler. Noncompact extremal integral rep-
resentations: some probabilistic aspects. In Functional analysis: surveys
and recent results, II (Proc. Second Conf. Functional Anal., Univ. Pader-
born, Paderborn, 1979), volume 68 of Notas Mat., pages 115–148. North-
Holland, Amsterdam, 1980.
[114] P. Walley. Statistical Reasoning with Imprecise Probabilities, volume 42
of Monographs on Statistics and Applied Probability. Chapman and Hall
Ltd., London, 1991.
[115] K. Weichselberger. The theory of interval-probability as a unifying concept
for uncertainty. Internat. J. Approx. Reason., 24(2-3):149–170, 2000.
[116] G. Winkler. Extreme points of moment sets. Math. Oper. Res., 13(4):581–
587, 1988.
[117] D. Xiu. Numerical Methods for Stochastic Computations: A Spectral
Method Approach. Princeton University Press, Princeton, NJ, 2010.
—
affine combination, 39
almost everywhere, 10
ANOVA, 116
arg max, 34
arg min, 34
barycentre, 159
Bayes’ rule, 11, 63
Bessel’s inequality, 29
Birkhoff–Khinchin ergodic theorem, 106
Bochner integral, 16, 32
bounded differences inequality, 113
Céa’s lemma, 139, 146
Cameron–Martin space, 20
Cauchy–Schwarz inequality, 24, 52
Chebyshev nodes, 93
Chebyshev’s inequality, 16
Choquet simplex, 159
Choquet–Bishop–de Leeuw theorem, 39
Christoffel–Darboux formula, 90
Clenshaw–Curtis quadrature, 104
collocation method
    for ODEs, 151
    stochastic, 151
complete measure space, 10
concentration of measure, 113
Dirac measure, 10
direct sum, 28
dominated convergence theorem, 15
dual space, 26
entropy, 52, 155
equivalent measures, 17
Eulerian observations, 80
expectation, 15
expected value, 15
extended Kálmán filter, 77
extreme point, 39, 160
Favard’s theorem, 90
Fejér quadrature, 104
Feldman–Hájek theorem, 21, 67
Fernique’s theorem, 20
filtration, 12
Fubini–Tonelli theorem, 18
Galerkin product, 145
Galerkin projection
    deterministic, 137
    stochastic, 140
Galerkin tensor, 144
Gauss–Markov theorem, 59
Gauss–Newton iteration, 46
—
—
integral
    Lebesgue, 14
    Pettis, 16, 19
    strong, 16
    weak, 16
interior point method, 42
interval arithmetic, 50
Kálmán filter, 74
    ensemble, 78
    extended, 77
    linear, 74
Karhunen–Loève theorem, 123
    sampling Gaussian measures, 124
Karush–Kuhn–Tucker conditions, 37
Koksma’s inequality, 108
Koksma–Hlawka inequality, 108
Kozono–Yanagisawa theorem, 140
Kreĭn–Milman theorem, 39
Newton’s method, 34
Newton–Cotes formula, 101
norm, 23
normal equations, 44
normed space, 23
null set, 10
orthogonal complement, 27
orthogonal polynomials, 88
orthogonal projection, 27
orthogonal set, 27
orthonormal set, 27
parallelogram identity, 24
Parseval identity, 29
penalty function, 38
Pettis integral, 16, 19
Sherman–Morrison–Woodbury formula, 81
signed measure, 9
simulated annealing, 36
singular value decomposition, 111, 125
Sobol′ indices, 118
Sobolev space, 26, 95
stochastic collocation method, 151
stochastic process, 12
strong integral, 16
support, 10
surprisal, 52
total variation distance, 54, 68
trapezoidal rule, 100
trivial measure, 10
uncertainty principle, 52
Vandermonde matrix, 92
variance, 15
vector lattice, 159
weak integral, 16
Wiener–Hermite PC expansion, 127
zero-one measure, 39
—