
MA4K0 Introduction to
Uncertainty Quantification

T. J. Sullivan
Mathematics Institute
University of Warwick
Coventry CV4 7AL UK
[email protected]

DRAFT
Not For General Distribution
Version 11 (2013-10-16 15:45)


Contents

Preface

Introduction to Uncertainty Quantification

1 Introduction
  1.1 What is Uncertainty Quantification?
  1.2 Mathematical Prerequisites
  1.3 The Road Not Travelled

2 Recap of Measure and Probability Theory
  2.1 Measure and Probability Spaces
  2.2 Random Variables and Stochastic Processes
  2.3 Aside: Interpretations of Probability
  2.4 Lebesgue Integration
  2.5 The Radon–Nikodým Theorem and Densities
  2.6 Product Measures and Independence
  2.7 Gaussian Measures
  Bibliography

3 Recap of Banach and Hilbert Spaces
  3.1 Basic Definitions and Properties
  3.2 Dual Spaces and Adjoints
  3.3 Orthogonality and Direct Sums
  3.4 Tensor Products
  Bibliography

4 Basic Optimization Theory
  4.1 Optimization Problems and Terminology
  4.2 Unconstrained Global Optimization
  4.3 Constrained Optimization
  4.4 Convex Optimization
  4.5 Linear Programming
  4.6 Least Squares
  Bibliography
  Exercises

5 Measures of Information and Uncertainty
  5.1 The Existence of Uncertainty
  5.2 Interval Estimates
  5.3 Variance, Information and Entropy
  5.4 Information Gain
  Bibliography
  Exercises

6 Bayesian Inverse Problems
  6.1 Inverse Problems and Regularization
  6.2 Bayesian Inversion in Banach Spaces
  6.3 Well-Posedness and Approximation
  Bibliography
  Exercises

7 Filtering and Data Assimilation
  7.1 State Estimation in Discrete Time
  7.2 Linear Kálmán Filter
  7.3 Extended Kálmán Filter
  7.4 Ensemble Kálmán Filter
  7.5 Eulerian and Lagrangian Data Assimilation
  Bibliography
  Exercises

8 Orthogonal Polynomials
  8.1 Basic Definitions and Properties
  8.2 Recurrence Relations
  8.3 Roots of Orthogonal Polynomials
  8.4 Polynomial Interpolation
  8.5 Polynomial Approximation
  8.6 Orthogonal Polynomials of Several Variables
  Bibliography
  Exercises

9 Numerical Integration
  9.1 Quadrature in One Dimension
  9.2 Gaussian Quadrature
  9.3 Clenshaw–Curtis / Fejér Quadrature
  9.4 Quadrature in Multiple Dimensions
  9.5 Monte Carlo Methods
  9.6 Pseudo-Random Methods
  Bibliography
  Exercises

10 Sensitivity Analysis and Model Reduction
  10.1 Model Reduction for Linear Models
  10.2 Derivatives
  10.3 McDiarmid Diameters
  10.4 ANOVA/HDMR Decompositions
  Bibliography
  Exercises

11 Spectral Expansions
  11.1 Karhunen–Loève Expansions
  11.2 Wiener–Hermite Polynomial Chaos
  11.3 Generalized PC Expansions
  Bibliography
  Exercises

12 Stochastic Galerkin Methods
  12.1 Lax–Milgram Theory and Galerkin Projection
  12.2 Stochastic Galerkin Projection
  12.3 Nonlinearities
  Bibliography
  Exercises

13 Non-Intrusive Spectral Methods
  13.1 Pseudo-Spectral Methods
  13.2 Stochastic Collocation
  Bibliography
  Exercises

14 Distributional Uncertainty
  14.1 Maximum Entropy Distributions
  14.2 Distributional Robustness
  14.3 Functional and Distributional Robustness
  Bibliography
  Exercises

Bibliography and Index

Bibliography
Index



Preface
These notes are designed as an introduction to Uncertainty Quantification (UQ)
at the fourth year (senior) undergraduate or beginning postgraduate level, and
are aimed primarily at students from a mathematical (rather than, say, engi-
neering) background; the mathematical prerequisites are listed in Section 1.2,
and the early chapters of the text recapitulate some of this material in more
detail. These notes accompany the University of Warwick mathematics mod-
ule MA4K0 Introduction to Uncertainty Quantification; while the notes are in-
tended to be general, certain contextual remarks and assumptions about prior knowledge will be Warwick-specific, and will be indicated by a large “W” in the margin, like the one to the right.
The aim is to give a survey of the main objectives in the field of UQ and a
few of the mathematical methods by which they can be achieved. There are,
of course, even more UQ problems and solution methods in the world that are
not covered in these notes, which are intended — with the exception of the
preliminary material on measure theory and functional analysis — to comprise

approximately 30 hours’ worth of lectures. For any grievous omissions in this
regard, I ask for your indulgence, and would be happy to receive suggestions for
improvements.
The exercises contain, by deliberate choice, a number of terribly ill-posed or
under-specified problems of the messy type often encountered in practice. It is
my hope that these exercises will encourage students to grapple with the ques-
tions of mathematical modelling that are a necessary precursor to doing applied
mathematics outside the tame classroom environment. Theoretical knowledge
is important; however, problem solving, which begins with problem formula-
tion, is an equally vital skill that too often goes neglected in undergraduate
mathematics courses.
These notes have benefitted, from initial conception to nearly finished product, from discussions with many people. I would like to thank Charlie Elliott,
Dave McCormick, Mike McKerns, Michael Ortiz, Houman Owhadi, Clint Scovel,
Andrew Stuart, and all the students on the 2013–14 iteration of MA4K0 for their
useful comments.

T. J. S.
University of Warwick, U.K.
Wednesday 16th October, 2013



Introduction to Uncertainty Quantification

Chapter 1


Introduction

We must think differently about our ideas — and how we test them. We must become more comfortable with probability and uncertainty. We must think more carefully about the assumptions and beliefs that we bring to a problem.

The Signal and the Noise: The Art and Science of Prediction
Nate Silver

1.1 What is Uncertainty Quantification?



Uncertainty Quantification (UQ) is, roughly put, the coming together of proba-
bility theory and statistical practice with ‘the real world’. These two anecdotes
illustrate something of what is meant by this statement:
• Until 1990–1995, risk modelling for catastrophe insurance and re-insurance
(i.e. insurance for property owners against risks arising from earthquakes,
hurricanes, terrorism, &c., and then insurance for the providers of such
insurance) was done on a purely statistical basis. Since that time, catas-
trophe modellers have started to incorporate models for the underlying
physics or human behaviour, hoping to gain a more accurate predictive

understanding of risks by blending the statistics and the physics, e.g. by focussing on what is both statistically and physically reasonable.
• Over roughly the same period of time, deterministic engineering models of
complex physical processes began to incorporate some element of probabil-
ity to account for lack of knowledge about important physical parameters,
random variability in operating circumstances, or outright uncertainty
about what the form of a ‘correct’ model would be. Again the aim is to
provide more accurate predictions about systems’ behaviour.
Perhaps as a result of its history, there are many perspectives on what UQ is,
including at the extremes assertions like “UQ is just a buzzword for statistics”
or “UQ is just error analysis”; other perspectives on UQ include the study
of numerical error and the stability of algorithms. UQ problems of interest include certification, prediction, model and software verification and validation,
parameter estimation, data assimilation, and inverse problems. At its very
broadest,
“UQ studies all sources of error and uncertainty, including the
following: systematic and stochastic measurement error; ignorance;
limitations of theoretical models; limitations of numerical represen-
tations of those models; limitations of the accuracy and reliability of
computations, approximations, and algorithms; and human error. A more precise definition is: UQ is the end-to-end study of the reliability
of scientific inferences.” [109, p. 135]
It is especially important to appreciate the “end-to-end” nature of UQ studies:
one is interested in relationships between pieces of information, bearing in mind
that they are only approximations of reality, not the ‘truth’ of those pieces of
information / assumptions. There is always a risk of ‘Garbage In, Garbage
Out’. A mathematician performing a UQ analysis cannot tell you that your

model is ‘right’ or ‘true’, but only that, if you accept the validity of the model
(to some quantified degree), then you must logically accept the validity of cer-
tain conclusions (to some quantified degree). Naturally, a respectable analysis
will include a meta-analysis examining the sensitivity of the original analysis to
perturbations of the governing assumptions. In the author’s view, this is the
proper interpretation of philosophically sound but practically unhelpful asser-
tions like “Verification and validation of numerical models of natural systems is
impossible” and “The primary value of models is heuristic.” [74].
Example 1.1. Consider the following elliptic boundary value problem on a con-
nected Lipschitz domain Ω ⊆ Rn (typically n = 2 or 3):

−∇ · (κ∇u) = f   in Ω,
        u = 0   on ∂Ω.

This PDE is a simple but not naïve model for the pressure field u of some fluid
occupying a domain Ω, the permeability of which to the fluid is described by a
tensor field κ : Ω → Rn×n ; there is a source term f and the boundary condition
specifies that the pressure vanishes on the boundary of Ω. This simple model
is of interest in the earth sciences because Darcy’s law asserts that the velocity
field v of the fluid flow in this medium is related to the gradient of the pressure
field by

v = −κ∇u.
If the fluid contains some kind of contaminant, then one is naturally very inter-
ested in where fluid following the velocity field v will end up, and how soon.
In a course on PDE theory, you will learn that if the permeability field κ is
positive-definite and essentially bounded, then this problem has a unique weak
solution u in the Sobolev space H₀¹(Ω) for each forcing term f in the dual Sobolev space H⁻¹(Ω). One objective of this course is to tell you that this is far from the
end of the story! As far as practical applications go, existence and uniqueness
of solutions is only the beginning. For one thing, this PDE model is only an
approximation of reality. Secondly, even if the PDE were a perfectly accurate
model, the ‘true’ κ and f are not known precisely, so our knowledge about u =

u(κ, f ) is also uncertain in some way. If κ and f are treated as random variables,
then u is also a random variable, and one is naturally interested in properties
of that random variable such as mean, variance, deviation probabilities &c. —
and to do so it is necessary to build up the machinery of probability on function
spaces.
Another issue is that often we want to solve the inverse problem: we know
something about f and something about u and want to infer κ. Even a simple
inverse problem such as this one is of enormous practical interest: it is by
solving such inverse problems that oil companies attempt to infer the location
of oil deposits in order to make a profit, and seismologists the structure of the planet in order to make earthquake predictions. Both of these problems, the
forward and inverse propagation of uncertainty, fall under the very general remit
of UQ. Furthermore, in practice, the fields f , κ and u are all discretized and
solved for numerically (i.e. approximately and finite-dimensionally), so it is of
interest to understand the impact of these discretization errors.
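As a concrete taste of forward uncertainty propagation, here is a minimal one-dimensional finite-difference sketch in the spirit of Example 1.1. It is not part of the original notes: the i.i.d. lognormal model for κ, the grid size, and the sample count are all illustrative assumptions, not a realistic permeability field.

```python
import numpy as np

def solve_darcy_1d(kappa, f, n):
    """Finite-difference solve of -(kappa u')' = f on (0, 1), u(0) = u(1) = 0.

    kappa: length-n array of permeabilities at the cell interfaces;
    f:     length-(n-1) array of source values at the interior nodes.
    """
    h = 1.0 / n
    A = np.zeros((n - 1, n - 1))
    for i in range(n - 1):
        A[i, i] = kappa[i] + kappa[i + 1]
        if i > 0:
            A[i, i - 1] = -kappa[i]
        if i < n - 2:
            A[i, i + 1] = -kappa[i + 1]
    return np.linalg.solve(A / h**2, f)

rng = np.random.default_rng(0)
n = 50
f = np.ones(n - 1)  # constant source term

# Forward propagation of uncertainty by plain Monte Carlo: draw random
# permeability fields (crudely, i.i.d. lognormal per interface), solve the
# BVP for each draw, and look at statistics of the random solution u.
samples = np.array([solve_darcy_1d(rng.lognormal(0.0, 0.5, n), f, n)
                    for _ in range(1000)])
u_mean = samples.mean(axis=0)
u_std = samples.std(axis=0)
```

With deterministic κ ≡ 1 the scheme reproduces the exact solution u(x) = x(1 − x)/2 of −u″ = 1 at the nodes, which is a convenient sanity check before adding randomness.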

Epistemic and Aleatoric Uncertainty. It is common in the literature to divide

uncertainty into two types, aleatoric and epistemic uncertainty. Aleatoric un-
certainty — from the Latin alea, meaning a die — refers to uncertainty about
an inherently variable phenomenon. Epistemic uncertainty — from the Greek
ἐπιστήμη, meaning knowledge — refers to uncertainty arising from lack of
knowledge. To a certain extent, the distinction is an imprecise one, and re-
peats the old debate between frequentist and subjectivist (e.g. Bayesian) statis-
ticians. Someone who was simultaneously a devout Newtonian physicist and
a devout Bayesian might argue that the results of dice rolls are not aleatoric
uncertainties — one simply doesn’t have complete enough information about
the initial conditions of the roll, the material and geometry of the die, any gusts of
wind that might affect the flight of the die, &c. On the other hand, it is usually

clear that some forms of uncertainty are epistemic rather than aleatoric: for
example, when physicists say that they have yet to come up with a Theory of
Everything, they are expressing a lack of knowledge about the laws of physics
in our universe, and the correct mathematical description of those laws. In any
case, regardless of one’s favoured interpretation of probability, the language of
probability theory is a powerful tool in describing uncertainty.

Some Typical UQ Objectives. Many common UQ objectives can be illustrated in the context of a system, F, that maps inputs X in some space X to outputs
Y = F (X) in some space Y. Some common UQ objectives include:

• The reliability or certification problem. Suppose that some set Yfail ⊆ Y is identified as a ‘failure set’, i.e. the outcome F(X) ∈ Yfail is undesirable in
some way. Given appropriate information about the inputs X and forward
process F , determine the failure probability,
P[F (X) ∈ Yfail ].
Furthermore, in the case of a failure, how large will the deviation from
acceptable performance be, and what are the consequences?
• The prediction problem. Dually to the reliability problem, given a maximum acceptable probability of error ε > 0, find a set Yε ⊆ Y such that
P[F (X) ∈ Yε ] ≥ 1 − ε.

i.e. the prediction F (X) ∈ Yε is wrong with probability at most ε.


• The parameter identification or inverse problem. Given some observations
of the output, Y , which may be corrupted or unreliable in some way,
attempt to determine the corresponding inputs X such that F (X) = Y .
In what sense are some estimates for X more or less reliable than others?
• The model reduction or model calibration problem. Construct another func-
tion Fh (perhaps a numerical model with certain numerical parameters to
be calibrated, or one involving far fewer input or output variables) such
that Fh ≈ F in an appropriate sense. Quantifying the accuracy of the
approximation may itself be a certification or prediction problem.
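The first two objectives above can be sketched with plain Monte Carlo. Everything below is a hypothetical toy example: the forward model F, the failure threshold, and the sample sizes are invented for illustration and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(42)

def F(x):
    """Toy forward model: F(X) = ||X||^2 for X in R^2."""
    return np.sum(x**2, axis=-1)

N = 100_000
X = rng.standard_normal((N, 2))  # inputs X ~ N(0, I) in R^2
Y = F(X)

# Certification: estimate the failure probability P[F(X) in Y_fail] for the
# failure set Y_fail = (9, infinity); for this toy model the exact value is
# exp(-9/2), about 0.011.
fail = Y > 9.0
p_fail = fail.mean()
stderr = fail.std(ddof=1) / np.sqrt(N)  # Monte Carlo standard error

# Prediction: an empirical set Y_eps = [0, t] with coverage about 1 - eps,
# read off from the empirical (1 - eps)-quantile of the outputs.
eps = 0.01
t = np.quantile(Y, 1 - eps)
```

Note how the two problems are dual: certification fixes the set and asks for the probability, while prediction fixes the probability and asks for a set.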



A Word of Warning. In this second decade of the third millennium, there is as
yet no elegant unified theory of UQ. UQ is not a mature field like linear algebra or
single-variable complex analysis, with stately textbooks containing well-polished
presentations of classical theorems bearing august names like Cauchy, Gauss and
Hamilton. Both because of its youth as a field and its very close engagement with
applications, UQ is much more about problems, methods, and ‘good enough for

the job’. There are some very elegant approaches within UQ, but as yet no
single, general, over-arching theory of UQ.
1.2 Mathematical Prerequisites
Like any course, MA4K0 has certain prerequisites. If you are just following the
course for fun, and attending the lectures merely to stay warm and dry in what
is almost sure to be a fine English autumn, then good for you. However, if you
actually want to understand what is going on, then it’s better for your own
health if you can use your nearest time machine to ensure that you have already taken and understood, in addition to the standard G100/G103 Mathematics core courses, the following non-core courses:
• ST112 Probability B
• Either MA359 Measure Theory or ST318 Probability Theory
• MA3G7 Functional Analysis I
As a crude diagnostic test, read the following sentence:

Given any σ-finite measure space (X, F, µ), the set of all F-measurable functions f : X → C for which ∫_X |f|² dµ is finite, modulo equality µ-almost everywhere, is a Hilbert space with respect to the inner product ⟨f, g⟩ := ∫_X f̄ g dµ.

If any of the symbols, concepts or terms used or implicit in that sentence give you
more than a few moments’ pause, then you should think again before attempting
MA4K0.
If, in addition, you have taken the following courses, then certain techniques,
examples and remarks will make more sense to you:
• MA117 Programming for Scientists
• MA228 Numerical Analysis
• MA250 Introduction to Partial Differential Equations
• MA398 Matrix Analysis and Algorithms
• MA3H0 Numerical Analysis and PDEs

• ST407 Monte Carlo Methods


• MA482 Stochastic Analysis
• MA4A2 Advanced PDEs
• MA607 Data Assimilation
However, none of these courses are essential. That said, some ability and willingness to implement UQ methods — even in simple settings — in e.g. C/C++, Mathematica, Matlab, or Python⟨1.1⟩ is highly desirable. UQ is a topic best learned in the doing, not through pure theory.


1.3 The Road Not Travelled
There are many topics relevant to UQ that are either not covered or discussed
only briefly in these notes, including: detailed treatment of data assimilation
beyond the confines of the Kálmán filter and its variations; accuracy, stability
and computational cost of numerical methods; details of numerical implementa-
tion of optimization methods; stochastic homogenization; optimal control; and

machine learning.

⟨1.1⟩ The author’s current language of choice.



Chapter 2


Recap of Measure and
Probability Theory

To be conscious that you are ignorant is a great step to knowledge.

Sybil
Benjamin Disraeli

Probability theory, grounded in Kolmogorov’s axioms and the general foundations of measure theory, is an essential tool in the mathematical treatment of
uncertainty. This chapter serves as a review, without detailed proof, of concepts
from measure and probability theory that will be used in the rest of the text.

Like Chapter 3, this chapter is intended as a review of material that should be understood as a prerequisite before proceeding; to some extent, Chapters 2 and 3 are
interdependent and so can (and should) be read in parallel with one another.

2.1 Measure and Probability Spaces


Definition 2.1. A measurable space is a pair (Θ, F), where
• Θ is a set, called the sample space; and
• F is a σ-algebra on Θ, i.e. a collection of subsets of Θ containing ∅ and closed under countable applications of the operations of union, intersection, and complementation relative to Θ; elements of F are called measurable sets or events.
Definition 2.2. A signed measure (or charge) on a measurable space (Θ, F) is a function µ : F → R ∪ {±∞} that takes at most one of the two infinite values, has µ(∅) = 0, and, whenever E1, E2, · · · ∈ F are pairwise disjoint with union E ∈ F, the series ∑_{n∈N} µ(En) converges absolutely to µ(E). A measure is a signed measure that does not take negative values. A probability measure is a measure such that µ(Θ) = 1. The triple (Θ, F, µ) is called a signed measure space, measure space, or probability space as appropriate. The sets of all signed measures, measures, and probability measures on (Θ, F) are denoted M±(Θ, F), M+(Θ, F), and M1(Θ, F) respectively.

Examples 2.3. 1. The trivial measure: τ(E) := 0 for every E ∈ F.

2. The unit Dirac measure at a ∈ X:

   δa(E) := 1 if a ∈ E, and δa(E) := 0 if a ∉ E, for E ∈ F.

3. Counting measure:

   κ(E) := |E| if E ∈ F is a finite set, and κ(E) := +∞ if E ∈ F is an infinite set.

4. Lebesgue measure on Rn: the unique measure on Rn (equipped with its Borel σ-algebra B(Rn)) that assigns to every cube its volume. To be more precise, Lebesgue measure is actually defined on the completion of B(Rn), a larger σ-algebra than B(Rn).

Definition 2.4. Let (X, F, µ) be a measure space. If N ⊆ X is a subset of a measurable set E ∈ F such that µ(E) = 0, then N is called a µ-null set. If the set of x ∈ X for which some property P(x) does not hold is µ-null, then P is said to hold µ-almost everywhere (or, when µ is a probability measure, µ-almost surely). If every µ-null set is in fact an F-measurable set, then the measure space (X, F, µ) is said to be complete.

When the sample space is a topological space, it is usual to use the Borel
σ-algebra (i.e. the smallest σ-algebra that contains all the open sets); measures
on the Borel σ-algebra are called Borel measures. Unless noted otherwise, this
is the convention followed in these notes.

Definition 2.5. The support of a measure µ defined on a topological space X is

supp(µ) := ⋂ {F ⊆ X | F is closed and µ(X \ F) = 0}.

That is, supp(µ) is the smallest closed subset of X that has full µ-measure.
Equivalently, supp(µ) is the complement of the union of all open sets of µ-
measure zero, or the set of all points x ∈ X for which every neighbourhood of
x has strictly positive µ-measure.

M1(X) is often called the probability simplex on X. The motivation for this terminology comes from the case in which X = {1, . . . , n} is a finite set. In this case, functions f : X → R are in bijection with column vectors (f(1), . . . , f(n))ᵀ, and probability measures µ on the power set of X are in bijection with row vectors (µ({1}), . . . , µ({n})) such that µ({i}) ≥ 0 for all i ∈ {1, . . . , n} and ∑_{i=1}^n µ({i}) = 1. As illustrated in Figure 2.1, the set of such µ is the (n − 1)-dimensional simplex in Rn that is

the convex hull of the n points

δ1 = (1, 0, · · · , 0),
δ2 = (0, 1, · · · , 0),
⋮
δn = (0, 0, · · · , 1).

Looking ahead, the expected value of f under µ is exactly the matrix product:

Eµ[f] = ∑_{i=1}^n µ({i}) f(i) = ⟨µ | f⟩ = (µ({1}) · · · µ({n})) (f(1), . . . , f(n))ᵀ.
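On a finite sample space, measure-theoretic integration is thus nothing more than linear algebra. A quick sketch (the particular numbers for µ and f are invented for illustration):

```python
import numpy as np

# X = {1, 2, 3}: a probability measure is a row vector, a function a column vector.
mu = np.array([0.5, 0.3, 0.2])   # mu({1}), mu({2}), mu({3}); non-negative, sums to 1
f = np.array([1.0, 4.0, 9.0])    # f(1), f(2), f(3)

E = mu @ f                       # E_mu[f] as a row-times-column product; equals 3.5

# The unit Dirac measure delta_2 is the row vector (0, 1, 0): integration
# against it simply evaluates f at the point 2.
delta2 = np.array([0.0, 1.0, 0.0])
```

The Dirac line previews the remark below that M1(X) is the simplex spanned by the unit Dirac measures: every µ here is a convex combination of the rows δ1, δ2, δ3.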

The geometry of M1(X) is something that one forgets in favour of the algebraic properties of measures and functions at one’s peril, as poetically highlighted by Sir Michael Atiyah [4, Paper 160, p. 7]:

“Algebra is the offer made by the devil to the mathematician. The devil says: ‘I will give you this powerful machine, it will answer any question you like. All you need to do is give me your soul: give up geometry and you will have this marvellous machine.’”

Or, as is traditionally but perhaps apocryphally said to have been inscribed over
the entrance to Plato’s Academy:

ΑΓΕΩΜΕΤΡΗΤΟΣ ΜΗΔΕΙΣ ΕΙΣΙΤΩ

In a sense that will be made precise later, for any ‘nice’ space X, M1(X) is the simplex spanned by the collection of unit Dirac measures {δx | x ∈ X}. Given a bounded, measurable function f : X → R and c ∈ R,

{µ ∈ M(X ) | Eµ [f ] ≤ c}

is a half-space of M(X ), and so a set of the form

{µ ∈ M1 (X ) | Eµ [f1 ] ≤ c1 , . . . , Eµ [fm ] ≤ cm }

can be thought of as a polytope of probability measures.


Definition 2.6. If (Θ, F, µ) is a probability space and B ∈ F has µ(B) > 0, then the conditional probability measure µ( · |B) on (Θ, F) is defined by

µ(E|B) := µ(E ∩ B) / µ(B)   for E ∈ F.

Theorem 2.7 (Bayes’ rule). If (Θ, F, µ) is a probability space and A, B ∈ F have µ(A), µ(B) > 0, then

µ(A|B) = µ(B|A) µ(A) / µ(B).
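A quick numerical instance of Theorem 2.7, in the style of the standard diagnostic-test example. All the numbers are hypothetical and chosen purely for illustration:

```python
# A = 'patient has the condition', B = 'test result is positive'.
mu_A = 0.01             # mu(A): prevalence of the condition
mu_B_given_A = 0.95     # mu(B|A): sensitivity of the test
mu_B_given_notA = 0.05  # false-positive rate of the test

# mu(B) by the law of total probability, then Bayes' rule:
mu_B = mu_B_given_A * mu_A + mu_B_given_notA * (1 - mu_A)
mu_A_given_B = mu_B_given_A * mu_A / mu_B   # approximately 0.161
```

Even a very accurate test on a rare condition yields a modest posterior probability, which is exactly the kind of conclusion that the conditional-probability machinery above makes precise.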

[Figure: the triangle with vertices δ1, δ2, δ3 inside M±({1, 2, 3}) ≅ R3.]
Figure 2.1: The probability simplex M1({1, 2, 3}), drawn as the triangle spanned by the unit Dirac masses δi, i ∈ {1, 2, 3}, in the space of signed measures on {1, 2, 3}.

2.2 Random Variables and Stochastic Processes
Definition 2.8. Let (X, F) and (Y, G) be measurable spaces. A function f : X → Y generates a σ-algebra on X by

σ(f) := σ({[f ∈ E] | E ∈ G}),

and f is called a measurable function if σ(f) ⊆ F. A measurable function whose domain is a probability space is usually called a random variable.

Definition 2.9. Let Ω be any set and let (Θ, F, µ) be a probability space. A function U : Ω × Θ → X such that each U(ω, · ) is a random variable is called an X-valued stochastic process on Ω.

Definition 2.10. A measurable function f : X → Y from a measure space (X, F, µ) to a measurable space (Y, G) defines a measure f∗µ on (Y, G), called the push-forward of µ by f, by

(f∗µ)(E) := µ([f ∈ E]),   for E ∈ G.

When µ is a probability measure, f∗µ is called the distribution or law of the random variable f.
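Push-forwards are easy to realize by sampling: if X has law µ, then samples of f(X) are samples of f∗µ. A small sketch, under the illustrative assumption that µ is the standard Gaussian law and f(x) = x², whose push-forward is the chi-squared law with one degree of freedom:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)   # samples from mu = N(0, 1)
y = x**2                           # samples from the push-forward f_* mu

# Defining property (f_* mu)(E) = mu([f in E]), checked empirically on
# E = [0, 1], whose preimage under f is [-1, 1]:
lhs = (y <= 1.0).mean()            # (f_* mu)([0, 1])
rhs = (np.abs(x) <= 1.0).mean()    # mu([-1, 1])
```

The two empirical probabilities agree because the events {x : f(x) ∈ E} and {x : x ∈ f⁻¹(E)} are literally the same set of samples.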

Definition 2.11. A filtration of a σ-algebra F is a family F• = {Fi | i ∈ I} of sub-σ-algebras of F, indexed by an ordered set I, such that

i ≤ j in I =⇒ Fi ⊆ Fj.

The natural filtration associated to a stochastic process U : I × Θ → X is the filtration F•^U defined by

Fi^U := σ({U(i, · )⁻¹(E) ⊆ Θ | E ⊆ X is measurable}).

A stochastic process U is adapted to a filtration F• if Fi^U ⊆ Fi for each i ∈ I.



Measurability and adaptedness are important properties of stochastic processes, and loosely correspond to certain questions being ‘answerable’ or ‘decidable’ with respect to the information contained in a given σ-algebra. For
instance, if the event [X ∈ E] is not F -measurable, then it does not even make
sense to ask about the probability Pµ [X ∈ E]. For another example, suppose
that some stream of observed data is modelled as a stochastic process Y , and
it is necessary to make some decision U (t) at each time t. It is common sense
to require that the decision stochastic process be F•^Y-adapted, since the decision U(t) must be made on the basis of the observations Y(s), s ≤ t, not on observations from any future time.


2.3 Aside: Interpretations of Probability
It is worth noting that the above discussions are purely mathematical: a probability measure is an abstract algebraic–analytic object with no necessary connection to everyday notions of chance or probability. The question of what interpretation of probability to adopt, i.e. what practical meaning to ascribe to
probability measures, is a question of philosophy and mathematical modelling.
The two main points of view are the frequentist and Bayesian perspectives. To
a frequentist, the probability µ(E) of an event E is the relative frequency of
occurrence of the event E in the limit of infinitely many independent but iden-
tical trials; to a Bayesian, µ(E) is a numerical representation of one’s degree of
belief in the truth of a proposition E. The frequentist’s point of view is objec-
tive; the Bayesian’s is subjective; both use the same mathematical machinery of
probability measures to describe the properties of the function µ.
Frequentists are careful to distinguish between parts of their analyses that
are fixed and deterministic versus those that have a probabilistic character.
However, for a Bayesian, any uncertainty can be described in terms of a suitable
probability measure. In particular, one’s beliefs about some unknown θ (taking values in a space Θ) in advance of observing data are summarized by a prior
probability measure π on Θ. The other ingredient of a Bayesian analysis is a
likelihood function, which is up to normalization a conditional probability: given
any observed datum y, L(y|θ) is the likelihood of observing y if the parameter
value θ were the truth. A Bayesian’s belief about θ given the prior π and the
observed datum y is the posterior probability measure π( · |y) on Θ, which is
just the conditional probability
π(θ|y) = L(y|θ)π(θ) / Eπ [L(y|θ)] = L(y|θ)π(θ) / ∫_Θ L(y|θ) dπ(θ)

or, written in a fancier way that generalizes better to infinite-dimensional Θ,

(dπ( · |y)/dπ)(θ) ∝ L(y|θ).

Both the previous two equations are referred to as Bayes’ rule, and are at this
stage informal applications of the standard Bayes’ rule (Theorem 2.7) for events
A and B of non-zero probability.
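The normalization in Bayes’ rule can be made concrete on a discretized parameter space. The following sketch (an illustrative addition, not from the notes; the grid, prior and datum are arbitrary choices) multiplies a prior by a likelihood pointwise and renormalizes:

```python
import math

# Discretize the parameter space Theta = [-5, 5] into a uniform grid.
thetas = [-5.0 + 10.0 * i / 400 for i in range(401)]

# Prior weights: standard normal density (any positive weights would do,
# since Bayes' rule renormalizes).
prior = [math.exp(-t ** 2 / 2) for t in thetas]

# Likelihood of a single observed datum y under N(theta, 1).
y = 1.2
likelihood = [math.exp(-(y - t) ** 2 / 2) for t in thetas]

# Posterior is proportional to likelihood x prior, normalized to sum to one.
unnorm = [L * p for L, p in zip(likelihood, prior)]
Z = sum(unnorm)
posterior = [u / Z for u in unnorm]

# Posterior mean on the grid.
post_mean = sum(t * p for t, p in zip(thetas, posterior))
```

For a N(0, 1) prior and one N(θ, 1) observation y, the exact posterior is N(y/2, 1/2), so the grid posterior mean should be very close to y/2 = 0.6.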
Parameter estimation provides a good example of the philosophical differ-
ence between frequentist and subjectivist uses of probability. Suppose that
X1 , . . . , Xn are n independent and identically distributed observations of some
random variable X, which is distributed according to the normal distribution
N (θ, 1) of mean θ and variance 1. We set our frequentist and Bayesian statis-
ticians the challenge of estimating θ from the data X1 , . . . , Xn .
• To the frequentist, θ is a well-defined real number that happens to be
unknown. This number can be estimated using the estimator
θ̂n := (1/n) ∑_{i=1}^n Xi ,

which is a random variable. It makes sense to say that θ̂n is close to θ with high probability, and hence to give a confidence interval for θ, but θ
itself does not have a distribution.
• To the Bayesian, θ is a random variable, and its distribution in advance
of seeing the data is encoded in a prior π. Upon seeing the data and
conditioning upon it using Bayes’ rule, the distribution of the parameter
is the posterior distribution π(θ|d). The posterior encodes everything that is known about θ in view of π, L(y|θ) ∝ e^{−|y−θ|²/2} and d, although this information may be summarized by a single number such as the maximum a posteriori estimator

θ̂MAP := arg max_{θ∈R} π(θ|d)

or the maximum likelihood estimator

θ̂MLE := arg max_{θ∈R} L(d|θ).
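The contrast between the two viewpoints can be simulated. In the sketch below (the sample and prior are illustrative assumptions, not from the notes) the data are drawn from N(2, 1); the frequentist estimate is the sample mean, which here coincides with the MLE, while under a N(0, 1) prior the conjugate posterior is N(n x̄/(n + 1), 1/(n + 1)), so the MAP estimate shrinks the sample mean slightly towards the prior mean 0:

```python
import random

random.seed(0)
theta_true = 2.0
n = 1000
data = [random.gauss(theta_true, 1.0) for _ in range(n)]

# Frequentist point estimate: the sample mean (also the MLE for this model).
theta_hat = sum(data) / n

# Bayesian MAP estimate under a N(0, 1) prior: for this conjugate normal
# model the posterior is N(n*xbar/(n+1), 1/(n+1)), so the MAP estimator is
# the sample mean shrunk towards the prior mean 0 by a factor n/(n+1).
theta_map = n * theta_hat / (n + 1)
```

With n = 1000 observations the shrinkage is tiny (about θ̂/(n + 1)), illustrating that the prior’s influence fades as data accumulate.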

It is also worth noting that there is a significant community that, in addition to being frequentist or Bayesian, asserts that selecting a single probability measure is too precise a description of uncertainty. These ‘imprecise probabilists’ count such distinguished figures as George Boole and John Maynard Keynes among their ranks, and would prefer to say that 1/2 − 2^{−100} ≤ P[heads] ≤ 1/2 + 2^{−100} than commit themselves to the assertion that P[heads] = 1/2; imprecise probabilists would argue that the former assertion can be verified, to a prescribed
level of confidence, in finite time, whereas the latter cannot. Techniques like
the use of lower and upper probabilities (or interval probabilities) are popular
in this community, including sophisticated generalizations like Dempster–Shafer
theory; one can also consider feasible sets of probability measures, which is the approach taken in Chapter 14 of these notes.

2.4 Lebesgue Integration


Definition 2.12. Let (X , F , µ) be a measure space. A function f : X → K is called simple if f = ∑_{i=1}^n αi 1_{Ei} for some scalars α1 , . . . , αn ∈ K and some pairwise disjoint measurable sets E1 , . . . , En ∈ F with µ(Ei ) finite for i = 1, . . . , n. The Lebesgue integral of a simple function f = ∑_{i=1}^n αi 1_{Ei} is defined to be

∫_X f dµ := ∑_{i=1}^n αi µ(Ei ).

Definition 2.13. Let (X , F , µ) be a measure space and let f : X → [0, +∞] be a measurable function. The Lebesgue integral of f is defined to be

∫_X f dµ := sup { ∫_X φ dµ | φ : X → R is a simple function, and 0 ≤ φ(x) ≤ f (x) for µ-almost all x ∈ X }.

Definition 2.14. Let (X , F , µ) be a measure space and let f : X → R be a measurable function. The Lebesgue integral of f is defined to be

∫_X f dµ := ∫_X f_+ dµ − ∫_X f_− dµ

provided that at least one of the integrals on the right-hand side is finite. The integral of a complex-valued measurable function f : X → C is defined to be

∫_X f dµ := ∫_X Re f dµ + i ∫_X Im f dµ.

One of the major attractions of the Lebesgue integral is that, subject to a
simple domination condition, pointwise convergence of integrands is enough to
ensure convergence of integral values:
Theorem 2.15 (Dominated convergence theorem). Let (X , F , µ) be a measure
space and let fn : X → K be a measurable function for each n ∈ N. If f : X → K
is such that lim_{n→∞} fn (x) = f (x) for every x ∈ X and there is a measurable function g : X → [0, ∞] such that ∫_X |g| dµ is finite and |fn (x)| ≤ g(x) for all
x ∈ X and all large enough n ∈ N, then
∫_X f dµ = lim_{n→∞} ∫_X fn dµ.
Furthermore, if the measure space is complete, then the conditions on pointwise convergence and pointwise domination of fn (x) can be relaxed to hold µ-almost everywhere.
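As a numerical illustration (not from the notes), take fn (x) = x^n on [0, 1]: the sequence is dominated by g ≡ 1, converges pointwise to 0 on [0, 1), and the integrals ∫ fn dx = 1/(n + 1) indeed decrease to 0, as the theorem predicts. A midpoint Riemann sum stands in for the Lebesgue integral here:

```python
# f_n(x) = x**n on [0, 1] is dominated by g = 1 and converges pointwise to 0
# almost everywhere, so the dominated convergence theorem gives
# integral(f_n) -> 0.  Exact values are 1/(n+1).

def integral(f, a=0.0, b=1.0, m=10_000):
    """Midpoint Riemann sum of f over [a, b] with m subintervals."""
    h = (b - a) / m
    return sum(f(a + (j + 0.5) * h) for j in range(m)) * h

# Integrals for increasing n: should decrease towards 0.
values = [integral(lambda x, n=n: x ** n) for n in (1, 2, 5, 10, 50)]
```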

Definition 2.16. When (Θ, F , µ) is a probability space and X : Θ → K is a random variable, it is conventional to write Eµ [X] for ∫_Θ X(θ) dµ(θ) and to call Eµ [X] the expected value or expectation of X. Also,

Vµ [X] := Eµ [|X − Eµ [X]|²] ≡ Eµ [|X|²] − |Eµ [X]|²

is called the variance of X. If X is a Kd -valued random variable, then Eµ [X], if it exists, is an element of Kd , and

C := Eµ [(X − Eµ [X])(X − Eµ [X])∗ ] ∈ Kd×d ,
i.e. Cij := Eµ [(Xi − Eµ [Xi ])(Xj − Eµ [Xj ])] ∈ K,

is the covariance matrix of X.

Definition 2.17. Let (X , F , µ) be a measure space. For 1 ≤ p ≤ ∞, the Lp space (or Lebesgue space) is defined by

Lp (X , µ; K) := {f : X → K | f is measurable and kf kLp (µ) is finite},



where

kf kLp (µ) := ( ∫_X |f (x)|^p dµ(x) )^{1/p}

for 1 ≤ p < ∞ and

kf kL∞ (µ) := inf {kgk∞ | f = g : X → K µ-almost everywhere}
            = inf {t ≥ 0 | |f | ≤ t µ-almost everywhere}.
To be more precise, Lp (X , µ; K) is the set of equivalence classes of such functions,
where functions that differ only on a set of µ-measure zero are identified.


Theorem 2.18 (Chebyshev’s inequality). Let X ∈ Lp (Θ, µ; K), 1 ≤ p < ∞, be a random variable. Then, for all t > 0,

Pµ [|X − Eµ [X]| ≥ t] ≤ t^{−p} Eµ [|X − Eµ [X]|^p ]. (2.1)
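A quick Monte Carlo sanity check (an illustrative sketch, not from the notes): applying Markov’s inequality to |X − Eµ [X]| bounds the tail probability Pµ [|X − Eµ [X]| ≥ t] by t^{−p} times the centred p-th moment. Here we test the p = 2 case on a standard Gaussian sample:

```python
import random

random.seed(1)
N = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
mean = sum(xs) / N

t, p = 2.0, 2
# Empirical left-hand side: P[|X - E X| >= t].
tail = sum(1 for x in xs if abs(x - mean) >= t) / N
# Empirical right-hand side: t**(-p) * E[|X - E X|**p].
bound = sum(abs(x - mean) ** p for x in xs) / N / t ** p
```

For a standard Gaussian the exact tail is about 0.0455 while the bound is 0.25, so Chebyshev holds but is far from tight, as is typical.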

Integration of Vector-Valued Functions. Lebesgue integration of functions that take values in Rn can be handled componentwise, as indeed was done above for complex-valued integrands. Many interesting UQ problems concern
random fields, i.e. random variables with values in infinite-dimensional spaces
of functions. For definiteness, consider a function f defined on a measure space
(X , F , µ) taking values in a Banach space V. There are two ways to proceed,
and they are in general inequivalent:
1. The strong integral or Bochner integral of f is defined by integrating simple
V-valued functions as in the construction of the Lebesgue integral, and
then defining

∫_X f dµ := lim_{n→∞} ∫_X φn dµ

whenever (φn )n∈N is a sequence of simple functions such that the (scalar-valued) Lebesgue integral ∫_X kf − φn k dµ converges to 0 as n → ∞. It
transpires that f is Bochner integrable if and only if kf k is Lebesgue integrable. The Bochner integral satisfies a version of the Dominated Convergence Theorem, but there are some subtleties concerning the Radon–Nikodým theorem.
2. The weak integral or Pettis integral of f is defined using duality: ∫_X f dµ is defined to be an element v ∈ V such that

hℓ | vi = ∫_X hℓ | f (x)i dµ(x) for all ℓ ∈ V ′ .
Since this is a weaker integrability criterion, there are naturally more Pettis-integrable functions than Bochner-integrable ones, but the Pettis integral has deficiencies such as the space of Pettis-integrable functions being incomplete, the existence of a Pettis-integrable function f : [0, 1] → V such that F (t) := ∫_{[0,t]} f (τ ) dτ is not differentiable [44], and so on.

2.5 The Radon–Nikodým Theorem and Densities


Let (X , F , µ) be a measure space and let ρ : X → [0, +∞] be a measurable
function. The operation
ν : E ↦ ∫_E ρ(x) dµ(x) (2.2)

defines a measure ν on (X , F ). It is natural to ask whether every measure ν on (X , F ) can be expressed in this way. A moment’s thought reveals that the
answer, in general, is no: there is no ρ that will make (2.2) hold when µ and ν
are Lebesgue measure and a unit Dirac measure (or vice versa) on R.

Definition 2.19. Let µ and ν be measures on a measurable space (X , F ). If, for E ∈ F , ν(E) = 0 whenever µ(E) = 0, then ν is said to be absolutely continuous
with respect to µ, denoted ν ≪ µ. If ν ≪ µ ≪ ν, then µ and ν are said to be
equivalent, and this is denoted µ ≈ ν. If there exists E ∈ F such that µ(E) = 0
and ν(X \ E) = 0, then µ and ν are said to be mutually singular, denoted µ ⊥ ν.


Theorem 2.20 (Radon–Nikodým). Suppose that µ and ν are σ-finite measures
on a measurable space (X , F ) and that ν ≪ µ. Then there exists a measurable
function ρ : X → [0, ∞] such that, for all measurable functions f : X → R and
all E ∈ F ,

∫_E f dν = ∫_E f ρ dµ

whenever either integral exists. Furthermore, any two functions ρ with this
property are equal µ-almost everywhere.

The function ρ in the Radon–Nikodým theorem is called the Radon–Nikodým derivative of ν with respect to µ, and the suggestive notation ρ = dν/dµ is often used. In probability theory, when ν is a probability measure, dν/dµ is called the probability density function (PDF) of ν (or any ν-distributed random variable) with respect to µ.
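Densities allow integrals against one measure to be rewritten as weighted integrals against another, which is the basis of importance sampling. The sketch below (the particular measures are illustrative choices, not from the notes) takes ν = N(0, 1) and µ = N(0, 4) on R, forms ρ = dν/dµ as the ratio of their Lebesgue densities, and estimates E_ν [X²] = 1 using only samples from µ:

```python
import math
import random

random.seed(2)

def phi(x, s):
    """Lebesgue density of N(0, s**2) at x."""
    return math.exp(-x ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# rho = d(nu)/d(mu) for nu = N(0, 1) and mu = N(0, 4): ratio of densities.
def rho(x):
    return phi(x, 1.0) / phi(x, 2.0)

N = 100_000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]  # samples from mu

# E_nu[X^2] rewritten as E_mu[X^2 * rho]; the exact value is 1.
est = sum(x * x * rho(x) for x in xs) / N
```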

2.6 Product Measures and Independence



Definition 2.21. Let (Θ, F , µ) be a probability space.


1. Two measurable sets (events) E1 , E2 ∈ F are said to be independent if
µ(E1 ∩ E2 ) = µ(E1 )µ(E2 ).
2. Two sub-σ-algebras G1 and G2 of F are said to be independent if E1 and
E2 are independent events whenever E1 ∈ G1 and E2 ∈ G2 .
3. Two measurable functions (random variables) X : Θ → X and Y : Θ → Y
are said to be independent if the σ-algebras generated by X and Y are
independent.

Definition 2.22. Let (X , F , µ) and (Y, G , ν) be σ-finite measure spaces. The product σ-algebra F ⊗ G is the σ-algebra on X × Y that is generated by the measurable rectangles, i.e. the smallest σ-algebra for which all the products

F × G, F ∈ F , G ∈ G ,

are measurable sets. The product measure µ ⊗ ν : F ⊗ G → [0, +∞] is the measure such that

(µ ⊗ ν)(F × G) = µ(F )ν(G), for all F ∈ F , G ∈ G .

In the other direction, given a measure on a product space, we can consider the measures induced on the factor spaces:

Definition 2.23. Let (X × Y, F , µ) be a measure space and suppose that the factor space X is equipped with a σ-algebra such that the projection ΠX : (x, y) ↦ x is a measurable function. Then the marginal measure µX is the measure on X defined by

µX (E) := ((ΠX )∗ µ)(E) = µ(E × Y).
The marginal measure µY on Y is defined similarly.
Theorem 2.24. Let X = (X1 , X2 ) be a random variable taking values in a
product space X = X1 × X2 . Let µ be the (joint) distribution of X, and µi the
(marginal) distribution of Xi for i = 1, 2. Then X1 and X2 are independent random variables if and only if µ = µ1 ⊗ µ2 .
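Theorem 2.24 is easy to check by hand for finite discrete distributions, where µ = µ1 ⊗ µ2 means that the joint probability mass function factors as the product of its marginals. A small sketch (the particular joint pmf is an arbitrary example, not from the notes):

```python
# Joint pmf of (X1, X2): rows index values of X1, columns values of X2.
joint = [[0.1, 0.2, 0.1],
         [0.15, 0.3, 0.15]]

# Marginal distributions, obtained by summing out the other variable.
marg1 = [sum(row) for row in joint]                       # law of X1
marg2 = [sum(row[j] for row in joint) for j in range(3)]  # law of X2

# X1 and X2 are independent iff joint[i][j] == marg1[i] * marg2[j] for all i, j.
independent = all(
    abs(joint[i][j] - marg1[i] * marg2[j]) < 1e-12
    for i in range(2) for j in range(3)
)
```

For this table the marginals are (0.4, 0.6) and (0.25, 0.5, 0.25), and every entry factors, so the coordinates are independent; perturbing any single entry would break the factorization.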
The important property of integration with respect to a product measure,
and hence taking expected values of independent random variables, is that it
can be performed by iterated integration:
Theorem 2.25 (Fubini–Tonelli). Let (X , F , µ) and (Y, G , ν) be σ-finite measure
spaces, and let f : X × Y → [0, +∞] be measurable. Then, of the following three

integrals, if one exists in [0, ∞], then all three exist and are equal:

∫_X ∫_Y f (x, y) dν(y) dµ(x),   ∫_Y ∫_X f (x, y) dµ(x) dν(y),   and   ∫_{X ×Y} f (x, y) d(µ ⊗ ν)(x, y).
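A numerical sketch of the Fubini–Tonelli theorem (an illustration, not from the notes), with X = Y = [0, 1], µ = ν = Lebesgue measure and f (x, y) = x y²: both iterated midpoint sums approximate the same product integral, whose exact value is (1/2) · (1/3) = 1/6:

```python
# Midpoint nodes of a uniform grid on [0, 1].
m = 500
h = 1.0 / m
nodes = [(j + 0.5) * h for j in range(m)]

# Integrate over y first, then x.
inner_then_outer = sum(
    sum(x * y ** 2 for y in nodes) * h for x in nodes
) * h

# Integrate over x first, then y.
outer_then_inner = sum(
    sum(x * y ** 2 for x in nodes) * h for y in nodes
) * h
```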

2.7 Gaussian Measures


An important class of probability measures and random variables is the class of
Gaussian measures (also known as normal distributions) and random variables.
Gaussian measures are particularly important because, unlike Lebesgue measure, they are well-defined on infinite-dimensional topological vector spaces; the
non-existence of an infinite-dimensional Lebesgue measure is a consequence of
the following theorem:
Theorem 2.26. Let V be an infinite-dimensional, separable Banach space. Then
the only Borel measure µ on V that is locally finite (i.e. every point of V has a
neighbourhood of finite µ-measure) and translation invariant (i.e. µ(x + E) =
µ(E) for all x ∈ V and measurable E ⊆ V) is the trivial measure.
Gaussian measures on Rd are defined using a Radon–Nikodým derivative with respect to Lebesgue measure:


Definition 2.27. Let m ∈ Rd and let C ∈ Rd×d be symmetric and positive
definite. The Gaussian measure with mean m and covariance C is denoted
N (m, C) and defined by

N (m, C)(E) = (1 / (√(det C) (2π)^{d/2})) ∫_E exp( −(x − m) · C⁻¹(x − m) / 2 ) dx

for each measurable E ⊆ Rd . The Gaussian measure N (0, I) is called the stan-
dard Gaussian measure. A Dirac measure δm can be considered as a degenerate
Gaussian measure on R, one with variance equal to zero.
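In practice one samples from N (m, C) on Rd by factoring C = L Lᵀ (Cholesky) and setting X = m + L Z with Z standard Gaussian. A sketch for a 2 × 2 example (the particular m and C are arbitrary choices, not from the notes), with the empirical moments checked against the targets:

```python
import math
import random

random.seed(3)

m = [1.0, -2.0]
C = [[2.0, 0.6],
     [0.6, 1.0]]  # symmetric positive definite

# Cholesky factor L with C = L L^T, hand-coded for the 2x2 case.
l11 = math.sqrt(C[0][0])
l21 = C[1][0] / l11
l22 = math.sqrt(C[1][1] - l21 ** 2)

N = 100_000
samples = []
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    samples.append((m[0] + l11 * z1, m[1] + l21 * z1 + l22 * z2))

# Empirical mean and cross-covariance should approximate m and C[0][1].
mean1 = sum(s[0] for s in samples) / N
mean2 = sum(s[1] for s in samples) / N
cov12 = sum((s[0] - mean1) * (s[1] - mean2) for s in samples) / N
```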

It is easily verified that the push-forward of N (m, C) by any linear functional ℓ : Rd → R is a Gaussian measure on R, and this is taken as the defining property
of a general Gaussian measure for settings in which, by Theorem 2.26, there may
not be a Lebesgue measure with respect to which densities can be taken:
Definition 2.28. Let V be a (locally convex) topological vector space. A Borel
measure γ on V is said to be a (non-degenerate) Gaussian measure if, for every
ℓ ∈ V ′ , the push-forward measure ℓ∗ γ is a (non-degenerate) Gaussian measure
on R. Equivalently, γ is Gaussian if, for every linear map T : V → Rd , T∗ γ =
N (m, C) for some m ∈ Rd and some symmetric positive-definite C ∈ Rd×d .


Definition 2.29. Let µ be a probability measure on a Banach space V. An
element mµ ∈ V is called the mean of µ if
∫_V hℓ | x − mµ i dµ(x) = 0 for all ℓ ∈ V ′ ,

so that ∫_V x dµ(x) = mµ in the sense of a Pettis integral. If mµ = 0, then
µ is said to be centred. The covariance operator is the symmetric operator

Cµ : V ′ × V ′ → K defined by

Cµ (k, ℓ) = ∫_V hk | x − mµ ihℓ | x − mµ i dµ(x) for all k, ℓ ∈ V ′ .

We often abuse notation and write Cµ : V ′ → V ′′ for the operator defined by hCµ k | ℓi := Cµ (k, ℓ).
The inverse of Cµ , if it exists, is called the precision operator of µ.
Theorem 2.30 (Vakhania). Let µ be a Gaussian measure on a separable, reflexive
Banach space V with mean mµ ∈ V and covariance operator Cµ : V ′ → V. Then
the support of µ is the affine subspace of V that is the translation by the mean of the closure of the range of the covariance operator, i.e.

supp(µ) = mµ + cl(Cµ V ′),

where cl denotes closure with respect to the norm of V.
Corollary 2.31. For a Gaussian measure µ on a separable, reflexive Banach
space V, the following are equivalent:
1. µ is non-degenerate;
2. Cµ : V ′ → V is one-to-one;
3. Cµ V ′ is dense in V.

Example 2.32. Consider a Gaussian random variable X = (X1 , X2 ) ∼ µ taking values in R2 . Suppose that the mean and covariance of X (or, equivalently, µ) are, in the usual basis of R2 ,

m = (0, 1)ᵀ ,   C = [ 1 0 ; 0 0 ].

Then X = (Z, 1), where Z ∼ N (0, 1) is a standard Gaussian random variable on R; the values of X all lie on the affine line L := {(x1 , x2 ) ∈ R2 | x2 = 1}. Indeed, Vakhania’s theorem says that

supp(µ) = m + cl(C(R2 )) = { (0, 1)ᵀ + (x1 , 0)ᵀ | x1 ∈ R } = L.
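Example 2.32 can also be simulated directly (an illustrative sketch, not from the notes): every draw of X = (Z, 1) lies exactly on the line L, while the first coordinate spreads out over R, matching Vakhania’s description of the support:

```python
import random

random.seed(4)

# Sample the degenerate Gaussian X = (Z, 1) with Z ~ N(0, 1).
samples = [(random.gauss(0.0, 1.0), 1.0) for _ in range(10_000)]

# All samples lie on the affine line L = {x2 = 1} ...
on_line = all(x2 == 1.0 for (_, x2) in samples)
# ... while the first coordinate is genuinely spread out.
spread = max(x1 for (x1, _) in samples) - min(x1 for (x1, _) in samples)
```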

Theorem 2.33. A centred probability measure µ on V is a Gaussian measure if and only if its Fourier transform µ̂ : V ′ → C satisfies

µ̂(ℓ) := ∫_V e^{ihℓ | xi} dµ(x) = e^{−Q(ℓ)/2} for all ℓ ∈ V ′

for some positive-definite quadratic form Q on V ′ . Indeed, Q(ℓ) = Cµ (ℓ, ℓ). Fur-
thermore, if two Gaussian measures µ and ν have the same mean and covariance
operator, then µ = ν.
Theorem 2.34 (Fernique). Let µ be a centred Gaussian measure on a separable Banach space V. Then there exists α > 0 such that

∫_V exp(αkxk²) dµ(x) < +∞.

A fortiori, µ has moments of all orders: for all k ≥ 0,

∫_V kxk^k dµ(x) < +∞.

Definition 2.35. Let K : H → H be a linear operator on a separable Hilbert space H.
1. K is said to be compact if it has a singular value decomposition, i.e. if there exist finite or countably infinite orthonormal sequences (un ) and (vn ) in H and a sequence of non-negative reals (σn ) such that

K = ∑_n σn hvn , · i un ,

with lim_{n→∞} σn = 0 if the sequences are infinite.
2. K is said to be trace class or nuclear if ∑_n σn is finite, and Hilbert–Schmidt or nuclear of order 2 if ∑_n σn² is finite.
3. If K is trace class, then its trace is defined to be

tr(K) := ∑_n hen , Ken i

for any orthonormal basis (en ) of H, and (by Lidskiı̆’s theorem) this equals the sum of the eigenvalues of K, counted with multiplicity.
Theorem 2.36. Let µ be a centred Gaussian measure on a separable Hilbert
space H. Then Cµ : H → H is trace class and

tr(Cµ ) = ∫_H kxk² dµ(x).

Conversely, if K : H → H is positive, symmetric and of trace class, then there is a Gaussian measure µ on H such that Cµ = K.
Definition 2.37. Let µ = N (mµ , Cµ ) be a Gaussian measure on a Banach space
V. The Cameron–Martin space is the Hilbert space Hµ defined equivalently by:
• Hµ is the completion of

{ h ∈ V | for some h∗ ∈ V ′ , Cµ (h∗ , · ) = h · | hi }

with respect to the inner product hh, kiµ := Cµ (h∗ , k ∗ ).

• Hµ is the completion of the range of the covariance operator Cµ : V ′ → V with respect to this inner product (cf. the closure with respect to the norm in V in Theorem 2.30).
• If V is Hilbert, then Hµ is the completion of R(Cµ^{1/2}) with the inner product hh, kiµ = hCµ^{−1/2} h, Cµ^{−1/2} kiV .
• Hµ is the set of all v ∈ V such that (Tv )∗ µ ≈ µ, where Tv : x ↦ x + v denotes translation by v.
• Hµ is the intersection of all linear subspaces of V that have full µ-measure.

Note, however, that if Hµ is infinite-dimensional, then µ(Hµ ) = 0. Furthermore, infinite-dimensional spaces have the rather alarming property that Gaussian measures on such spaces are either equivalent or mutually singular
— there is no middle ground in the way that Lebesgue measure on [0, 1] has a
density with respect to Lebesgue measure on R but is not equivalent to it —
and surprisingly simple operations can destroy equivalence.

Theorem 2.38 (Feldman–Hájek). Let µ, ν be Gaussian probability measures on a locally convex topological vector space V. Then either
• µ and ν are equivalent, i.e. µ(E) = 0 ⇐⇒ ν(E) = 0; or
• µ and ν are mutually singular, i.e. there exists E such that µ(E) = 0 and ν(E) = 1.
Furthermore, equivalence holds if and only if
1. R(Cµ^{1/2}) = R(Cν^{1/2}) =: E; and
2. mµ − mν ∈ E; and
3. T := (Cµ^{−1/2} Cν^{1/2})(Cµ^{−1/2} Cν^{1/2})∗ − I is Hilbert–Schmidt in E.

Proposition 2.39. Let µ be a centred Gaussian measure on a separable Banach space V such that dim Hµ = ∞. For c ∈ R, let Dc : V → V be the dilation
map Dc (x) := cx. Then (Dc )∗ µ is equivalent to µ if and only if c ∈ {±1}, and
(Dc )∗ µ and µ are mutually singular otherwise.

Bibliography
At Warwick, this material is mostly covered in MA359 Measure Theory and
ST318 Probability Theory. Gaussian measure theory in infinite-dimensional
spaces is covered in MA482 Stochastic Analysis and MA612 Probability on
Function Spaces and Bayesian Inverse Problems. Vakhania’s theorem (Theorem 2.30) on the support of a Gaussian measure can be found in [110]. Fernique’s theorem (Theorem 2.34) on the integrability of Gaussian vectors was proved in [31]. The Feldman–Hájek dichotomy (Theorem 2.38) was proved independently
by Feldman [30] and Hájek [36] in 1958.
Gordon’s book [35] is mostly a text on the gauge integral, but its first
chapters provide an excellent condensed introduction to measure theory and
Lebesgue integration. The book of Capiński & Kopp [18] is a clear, readable
and self-contained introductory text confined mainly to Lebesgue integration on
R (and later Rn ), including material on Lp spaces and the Radon–Nikodým theo-
rem. Another excellent text on measure and probability theory is the monograph
of Billingsley [10]. The Bochner integral was introduced by Bochner in [12]; re-
cent texts on the topic include those of Diestel & Uhl [24] and Mikusiński [67].
For detailed treatment of the Pettis integral, see Talagrand [102]. Further discussion of the relationship between tensor products and spaces of vector-valued integrable functions can be found in the book of Ryan [85].
Bourbaki [15] contains a treatment of measure theory from a functional-
analytic perspective. The presentation is focussed on Radon measures on locally
compact spaces, which is advantageous in terms of regularity but leads to an
approach to measurable functions that is cumbersome, particularly from the
viewpoint of probability theory.
The modern origins of imprecise probability lie in treatises like those of
Boole [13] and Keynes [50]; more recent foundations and expositions have been
put forward by Walley [114], Kuznetsov [56], Weichselberger [115], and by


Dempster [23] and Shafer [87].


Chapter 3


Recap of Banach and Hilbert Spaces

Dr. von Neumann, ich möchte gern wissen, was ist dann eigentlich ein Hilbertscher Raum? [Dr. von Neumann, I would very much like to know: what, then, actually is a Hilbert space?]

David Hilbert

This chapter covers the necessary concepts from linear functional analysis
on Hilbert and Banach spaces, in particular basic and useful constructions like
direct sums and tensor products. Like Chapter 2, this chapter is intended as a
review of material that should be understood as a prerequisite before proceeding;
to some extent, Chapters 2 and 3 are interdependent and so can (and should) be read
in parallel with one another.

3.1 Basic Definitions and Properties


In what follows, K will denote either of the fields R or C.
Definition 3.1. A norm on a vector space V over K is a function k · k : V → R
that is
1. positive semi-definite: for all x ∈ V, kxk ≥ 0;

2. positive definite: for all x ∈ V, kxk = 0 if and only if x = 0;


3. homogeneous: for all x ∈ V and α ∈ K, kαxk = |α|kxk;
4. sublinear : for all x, y ∈ V, kx + yk ≤ kxk + kyk.
If the positive definiteness requirement is omitted, then k · k is said to be a semi-
norm. A vector space equipped with a (semi-)norm is called a (semi-)normed
space.
Definition 3.2. An inner product on a vector space V over K is a function
h · , · i : V × V → K that is
1. positive semi-definite: for all x ∈ V, hx, xi ≥ 0;
2. positive definite: for all x ∈ V, hx, xi = 0 if and only if x = 0;
3. conjugate symmetric: for all x, y ∈ V, hx, yi = hy, xi;

4. sesquilinear : for all x, y, z ∈ V and all α, β ∈ K, hx, αy + βzi = αhx, yi + βhx, zi.
A vector space equipped with an inner product is called an inner product space.
In the case K = R, conjugate symmetry becomes symmetry, and sesquilinearity
becomes bilinearity.
It is easily verified that every inner product space is a normed space under the induced norm kxk := hx, xi^{1/2} . The inner product and norm satisfy the Cauchy–Schwarz inequality

|hx, yi| ≤ kxk kyk for all x, y ∈ V. (3.1)

Every norm on V that is induced by an inner product satisfies the parallelogram identity

kx + yk² + kx − yk² = 2kxk² + 2kyk² for all x, y ∈ V. (3.2)

In the opposite direction, if k · k is a norm on V that satisfies the parallelogram identity, then the unique inner product h · , · i that induces this norm is found by the polarization identity

hx, yi = ( kx + yk² − kx − yk² ) / 4 (3.3)

in the real case, and

hx, yi = ( kx + yk² − kx − yk² ) / 4 + i ( kix − yk² − kix + yk² ) / 4 (3.4)

in the complex case.
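Both identities are easy to verify numerically for the Euclidean inner product on Rⁿ (an illustrative sketch, not from the notes): the real polarization identity (3.3) recovers the inner product from norms alone, and the parallelogram identity (3.2) holds up to rounding error:

```python
import random

random.seed(5)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm_sq(x):
    return dot(x, x)

# Two random real vectors in R^5.
x = [random.uniform(-1, 1) for _ in range(5)]
y = [random.uniform(-1, 1) for _ in range(5)]
xpy = [a + b for a, b in zip(x, y)]  # x + y
xmy = [a - b for a, b in zip(x, y)]  # x - y

# Polarization: (||x+y||^2 - ||x-y||^2) / 4 should equal <x, y>.
polarized = (norm_sq(xpy) - norm_sq(xmy)) / 4

# Parallelogram: ||x+y||^2 + ||x-y||^2 - 2||x||^2 - 2||y||^2 should vanish.
parallelogram_gap = norm_sq(xpy) + norm_sq(xmy) - 2 * norm_sq(x) - 2 * norm_sq(y)
```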
Example 3.3. 1. For any n ∈ N, the coordinate space Kn is an inner product space under the Euclidean inner product

hx, yi := ∑_{i=1}^n x̄i yi .

In the real case, this is usually known as the dot product and denoted x · y.
2. For any m, n ∈ N, the space Km×n of m × n matrices is an inner product space under the Frobenius inner product

hA, Bi ≡ A : B := ∑_{i=1,...,m; j=1,...,n} āij bij .

Definition 3.4. Let (V, k · k) be a normed space. A sequence (xn )n∈N in V converges to x ∈ V if, for every ε > 0, there exists N ∈ N such that, whenever
n ≥ N , kxn − xk < ε. A sequence (xn )n∈N in V is called Cauchy if, for every
ε > 0, there exists N ∈ N such that, whenever m, n ≥ N , kxm − xn k < ε. A
complete space is one in which each Cauchy sequence in V converges to some
element of V. Complete normed spaces are called Banach spaces, and complete
inner product spaces are called Hilbert spaces.
Example 3.5. 1. Kn and Km×n are finite-dimensional Hilbert spaces with
respect to their usual inner products.

2. The standard example of an infinite-dimensional Hilbert space is the space of square-summable sequences,

ℓ² (K) := { x = (xn )n∈N ∈ K^N | kxk²_{ℓ²} := ∑_{n∈N} |xn |² < ∞ },

which is a Hilbert space with respect to the inner product

hx, yi_{ℓ²} := ∑_{n∈N} x̄n yn .


3. Given a measure space (X , F , µ), the space L2 (X , µ; K) of (equivalence
classes modulo equality µ-almost everywhere) of square-integrable func-
tions from X to K is a Hilbert space with respect to the inner product
hf, giL2 (µ) := ∫_X f̄ (x) g(x) dµ(x). (3.5)

Note that it is necessary to take the quotient by the equivalence relation
of equality µ-almost everywhere since a function f that vanishes on a set
of full measure but is non-zero on a set of zero measure is not the zero
function but nonetheless has kf kL2(µ) = 0. When (X , F , µ) is a probabil-
AF
ity space, elements of L2 (X , µ; K) are thought of as random variables of
finite variance, and the L2 inner product is the covariance:
 
hX, Y iL2 (µ) := Eµ X̄Y = cov(X, Y ).

When L2 (X , µ; K) is a separable space, it is isometrically isomorphic to ℓ2 (K) (see Theorem 3.16).
4. Indeed, Hilbert spaces over a fixed field K are classified by their dimension:
whenever H and K are Hilbert spaces of the same dimension over K,
DR

there is a K-linear map T : H → K such that hT x, T yiK = hx, yiH for all
x, y ∈ H.
Example 3.6. 1. For any compact topological space X , the space C(X ; K)
of continuous functions f : X → K is a Banach space with respect to the
supremum norm
kf k∞ := sup_{x∈X} |f (x)|. (3.6)

For non-compact X , the supremum norm is only a bona fide norm if we restrict attention to bounded continuous functions (otherwise it may take the value ∞).


2. For 1 ≤ p ≤ ∞, the spaces Lp (X , µ; K) with their norms
kf kLp (µ) := ( ∫_X |f (x)|^p dµ(x) )^{1/p} (3.7)

for 1 ≤ p < ∞ and

kf kL∞ (µ) := inf {kgk∞ | f = g : X → K µ-almost everywhere} (3.8)
            = inf {t ≥ 0 | |f | ≤ t µ-almost everywhere}

are Banach spaces, but only the L2 spaces are Hilbert spaces.

Another family of Banach spaces that arises very often in PDE applications
is the family of Sobolev spaces. For the sake of brevity, we limit the discussion to
those Sobolev spaces that are Hilbert spaces. To save space, we use multi-index
notation for derivatives: α := (α1 , . . . , αn ) ∈ Nn0 denotes a multi-index, and
|α| := α1 + · · · + αn .
Definition 3.7. Let Ω ⊆ Rn , let α ∈ Nn0 , and consider f : Ω → R. A weak
derivative of order α for f : Ω → R is a function v : Ω → R such that
∫_Ω f (x) (∂^{|α|} φ / ∂^{α1} x1 · · · ∂^{αn} xn)(x) dx = (−1)^{|α|} ∫_Ω v(x)φ(x) dx (3.9)

for every φ ∈ Cc∞ (Ω; R). Such a weak derivative is usually denoted Dα f , and
coincides with the classical (strong) derivative if it exists. The Sobolev space
H k (Ω) is

H k (Ω) := { f ∈ L2 (Ω) | for all α ∈ Nn0 with |α| ≤ k, f has a weak derivative Dα f ∈ L2 (Ω) } (3.10)

with the inner product

hu, viH k := ∑_{|α|≤k} hDα u, Dα viL2 . (3.11)
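The defining identity (3.9) can be checked numerically for a first-order weak derivative (an illustrative sketch, not from the notes): on Ω = (0, 1), with f (x) = x² and the boundary-vanishing test function φ(x) = sin(πx)² standing in for a Cc∞ function, integration by parts gives ∫ f φ′ dx = −∫ f ′ φ dx with f ′(x) = 2x, and both sides equal −1/2:

```python
import math

def integral(g, m=20_000):
    """Midpoint Riemann sum of g over [0, 1] with m subintervals."""
    h = 1.0 / m
    return sum(g((j + 0.5) * h) for j in range(m)) * h

# Test function vanishing at the boundary, and its classical derivative.
def phi(x):
    return math.sin(math.pi * x) ** 2

def dphi(x):
    return math.pi * math.sin(2 * math.pi * x)

# Weak-derivative identity for f(x) = x^2, f'(x) = 2x:
lhs = integral(lambda x: x ** 2 * dphi(x))       # integral of f * phi'
rhs = -integral(lambda x: 2 * x * phi(x))        # minus integral of f' * phi
```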
AF
3.2 Dual Spaces and Adjoints
Definition 3.8. The (continuous) dual space of a normed space V over K is the
vector space V ′ of all continuous linear functionals ℓ : V → K. The dual pairing
between an element ℓ ∈ V ′ and an element v ∈ V is denoted hℓ | vi or simply
ℓ(v). For a linear functional ℓ, being continuous is equivalent to being bounded
in the sense that its operator norm (or dual norm)

kℓk′ := sup_{0≠v∈V} |hℓ | vi| / kvk ≡ sup_{kvk=1} |hℓ | vi| ≡ sup_{kvk≤1} |hℓ | vi|

is finite.
Proposition 3.9. For every normed space V, the dual space V ′ is a Banach space
with respect to k · k′ .
An important property of Hilbert spaces is that they are naturally self-dual :

every continuous linear functional on a Hilbert space can be naturally identified


with the action of taking the inner product with some element of the space.
This stands in stark contrast to the duals of even elementary Banach spaces.
Theorem 3.10 (Riesz representation theorem). Let H be a Hilbert space. For
every continuous linear functional f ∈ H′ , there exists f ♯ ∈ H such that hf | xi =
hf ♯ , xi for all x ∈ H. Furthermore, the map f 7→ f ♯ is an isometric isomorphism
between H and its dual.
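In H = Rⁿ the theorem is transparent: the functional f (x) = ha, xi is represented by the element a itself, and its operator norm equals kak, attained in the direction of a. A small numerical sketch (the vector a is an arbitrary choice, not from the notes):

```python
import math
import random

random.seed(6)

# The functional f(x) = <a, x> on R^3 is represented by a itself.
a = [3.0, -4.0, 12.0]

def f(x):
    return sum(ai * xi for ai, xi in zip(a, x))

norm_a = math.sqrt(sum(ai ** 2 for ai in a))  # = 13 for this a

# The ratio |f(x)| / ||x|| never exceeds ||a|| over random directions ...
best = 0.0
for _ in range(10_000):
    x = [random.gauss(0, 1) for _ in range(3)]
    nx = math.sqrt(sum(xi ** 2 for xi in x))
    best = max(best, abs(f(x)) / nx)

# ... and the supremum ||a|| is attained at the unit vector a / ||a||.
value_at_a = f([ai / norm_a for ai in a])
```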
Given a linear map A : V → W between normed spaces V and W, the adjoint
of A is the linear map A∗ : W ′ → V ′ defined by the relation
hA∗ ℓ | vi = hℓ | Avi for all v ∈ V and ℓ ∈ W ′ .

When considering a linear map A : H → K between Hilbert spaces H and K, we can appeal to the Riesz representation theorem and define the adjoint in terms
of inner products:

hA∗ k, hiH = hk, AhiK for all h ∈ H and k ∈ K.

Given a basis {ei }i∈I of H, the corresponding dual basis {e^i }i∈I of H is defined by the relation he^i , ej iH = δij . The matrix of A with respect to bases {ei }i∈I
of H and {fj }j∈J of K and the matrix of A∗ with respect to the corresponding
dual bases are very simply related: the one is the conjugate transpose of the


other, and so by abuse of terminology the conjugate transpose of a matrix is
often referred to as the adjoint.

3.3 Orthogonality and Direct Sums


Orthogonal decompositions of Hilbert spaces will be fundamental tools in many
of the methods considered later on.

Definition 3.11. A subset E of an inner product space V is said to be orthogonal
if hx, yi = 0 for all distinct elements x, y ∈ E; it is said to be orthonormal if
hx, yi = 1 if x = y ∈ E, and hx, yi = 0 if x, y ∈ E and x ≠ y.

The orthogonal complement E ⊥ of a subset E of an inner product space V is

E ⊥ := {y ∈ V | for all x ∈ E, hy, xi = 0}.

The orthogonal complement of E ⊆ V is always a closed linear subspace of V, and hence if V = H is a Hilbert space, then E ⊥ is also a Hilbert space in its own right.
Theorem 3.12. Let K be a closed subspace of a Hilbert space H. Then, for any
x ∈ H, there is a unique ΠK x ∈ K that is closest to x in the sense that

kΠK x − xk = inf_{y∈K} ky − xk.

Furthermore, x can be written uniquely as x = ΠK x + z, where z ∈ K⊥ . Hence, H decomposes as the orthogonal direct sum

H = K ⊕ K⊥ .

Proof. Deferred to Lemma 4.20 and the more general context of convex opti-
mization.
The operator ΠK : H → K is called the orthogonal projection onto K.
Theorem 3.13. Let K and L be closed subspaces of a Hilbert space H. The corresponding orthogonal projection operators
1. are continuous linear operators of norm at most 1;
2. are such that I − ΠK = ΠK⊥;
and satisfy, for every x ∈ H,
3. ‖x‖² = ‖ΠK x‖² + ‖(I − ΠK)x‖²;
4. ΠK x = x ⇐⇒ x ∈ K;
5. ΠK x = 0 ⇐⇒ x ∈ K⊥.
Example 3.14 (Conditional expectation). An important probabilistic application of orthogonal projection is the operation of conditioning a random variable. Let (Θ, F, µ) be a probability space and let X ∈ L²(Θ, F, µ; K) be a square-integrable random variable. If G ⊆ F is a σ-algebra, then the conditional expectation of X with respect to G, usually denoted E[X|G], is the orthogonal projection of X onto the subspace L²(Θ, G, µ; K). In elementary contexts, G is usually taken to be the σ-algebra generated by a single event E of positive µ-probability, i.e.

G = {∅, [X ∈ E], [X ∉ E], Θ}.

The orthogonal projection point of view makes two important properties of conditional expectation intuitively obvious:
1. Whenever G1 ⊆ G2 ⊆ F, L²(Θ, G1, µ; K) is a subspace of L²(Θ, G2, µ; K), and composition of the orthogonal projections onto these subspaces yields the tower rule for conditional expectations:

E[X|G1] = E[ E[X|G2] | G1 ],

and, in particular, taking G1 to be the trivial σ-algebra {∅, Θ},

E[X] = E[E[X|G2]].

2. Whenever X, Y ∈ L²(Θ, F, µ; K) and X is, in fact, G-measurable,

E[XY|G] = X E[Y|G].
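On a finite probability space the projection picture can be made completely concrete. The following sketch computes E[X|G] by block-averaging and checks the projection and tower properties; the uniform measure, the partition, and the random variable X are illustrative choices, not taken from the text.

```python
# Conditional expectation as orthogonal projection on a toy finite
# probability space: E[X|G] is the block-average of X over the partition
# generating G, and the residual X - E[X|G] is orthogonal to every
# G-measurable random variable.

p = [1.0 / 6.0] * 6                   # uniform probability measure on 6 points
X = [1.0, 4.0, 7.0, 2.0, 3.0, 10.0]   # a square-integrable random variable
blocks = [[0, 1, 2], [3, 4, 5]]       # partition generating the sigma-algebra G

def inner(U, V):                      # <U, V> = E[U V]
    return sum(pi * u * v for pi, u, v in zip(p, U, V))

def cond_exp(X):
    out = [0.0] * len(X)
    for B in blocks:
        mass = sum(p[i] for i in B)
        avg = sum(p[i] * X[i] for i in B) / mass
        for i in B:
            out[i] = avg
    return out

EXG = cond_exp(X)
residual = [x - e for x, e in zip(X, EXG)]

# projection property: the residual is orthogonal to the indicator of
# each block, hence to all of L^2(Theta, G, mu)
for B in blocks:
    ind = [1.0 if i in B else 0.0 for i in range(6)]
    assert abs(inner(residual, ind)) < 1e-12

# tower rule with G1 trivial: E[E[X|G]] = E[X]
ones = [1.0] * 6
assert abs(inner(EXG, ones) - inner(X, ones)) < 1e-12
```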
Direct Sums. Suppose that V and W are vector spaces over a common field
K. The Cartesian product V × W can be given the structure of a vector space
over K by defining the operations componentwise:
(v, w) + (v ′ , w′ ) := (v + v ′ , w + w′ ),
α(v, w) := (αv, αw),
for all v, v′ ∈ V, w, w′ ∈ W, and α ∈ K. The resulting vector space is called the (algebraic) direct sum of V and W and is usually denoted by V ⊕ W, while elements of V ⊕ W are usually denoted by v ⊕ w instead of (v, w).
If {ei |i ∈ I} is a basis of V and {ej |j ∈ J} is a basis of W, then {ek | k ∈
K := I ⊎ J} is a basis of V ⊕ W. Hence, the dimension of V ⊕ W over K is equal
to the sum of the dimensions of V and W.
When H and K are Hilbert spaces, their (algebraic) direct sum H ⊕ K can
be given a Hilbert space structure by defining
⟨h ⊕ k, h′ ⊕ k′⟩_{H⊕K} := ⟨h, h′⟩_H + ⟨k, k′⟩_K
for all h, h′ ∈ H and k, k′ ∈ K. The original spaces H and K embed into H ⊕ K as the subspaces H ⊕ {0} and {0} ⊕ K respectively, and these two subspaces are mutually orthogonal. For this reason, the orthogonality of the two summands in a Hilbert direct sum is sometimes emphasized by the notation H ⊕⊥ K. The Hilbert space projection theorem (Theorem 3.12) was the statement that whenever K is a closed subspace of a Hilbert space H, H = K ⊕ K⊥.
It is necessary to be a bit more careful in defining the direct sum of countably many Hilbert spaces. Let Hn be a Hilbert space over K for each n ∈ N. Then the Hilbert space direct sum H := ⊕_{n∈N} Hn is defined to be the completion of the set

{x = (xn)_{n∈N} | xn ∈ Hn for each n ∈ N, and xn = 0 for all but finitely many n}

with respect to the inner product

⟨x, y⟩_H := Σ_{n∈N} ⟨xn, yn⟩_{Hn},

which is always a finite sum when applied to elements of the generating set. This construction ensures that every element x of H has finite norm ‖x‖²_H = Σ_{n∈N} ‖xn‖²_{Hn}. As before, each of the summands Hn is a subspace of H that is orthogonal to all the others.
Orthogonal direct sums and orthogonal bases are among the most important constructions in Hilbert space theory, and will be very useful in what follows. The prototypical example to bear in mind is the Fourier basis {en | n ∈ Z} of L²(S¹; C), where

en(x) := (2π)^{−1/2} exp(−inx).

Indeed, Fourier's claim[3.1] that any periodic function f could be written as

f(x) = Σ_{n∈Z} f̂n en(x),   f̂n := ∫_{S¹} f(y) en(y) dy,
can be seen as one of the historical drivers behind the development of much of
analysis. Other important examples are the systems of orthogonal polynomials
that will be considered in Chapter 8. Some important results about orthogonal
systems are summarized below:
Lemma 3.15 (Bessel's inequality). Let V be an inner product space and (en)_{n∈N} an orthonormal sequence in V. Then, for any x ∈ V, the series Σ_{n∈N} |⟨x, en⟩|² converges and satisfies

Σ_{n∈N} |⟨x, en⟩|² ≤ ‖x‖². (3.12)
Theorem 3.16 (Parseval identity). Let H be a Hilbert space, let (en)_{n∈N} be an orthonormal sequence in H, and let (αn)_{n∈N} be a sequence in K. Then the series Σ_{n∈N} αn en converges in H if and only if the series Σ_{n∈N} |αn|² converges in R, in which case

‖Σ_{n∈N} αn en‖² = Σ_{n∈N} |αn|². (3.13)

[3.1] Of course, Fourier did not use the modern notation of Hilbert spaces! Furthermore, if he had, then it would have been 'obvious' that his claim could only hold true for L² functions and in the L² sense, not pointwise for arbitrary functions.
Corollary 3.17. Let H be a Hilbert space and let (en)_{n∈N} be an orthonormal sequence in H. Then, for any x ∈ H, the series Σ_{n∈N} ⟨x, en⟩ en converges.
Theorem 3.18. Let H be a Hilbert space and let (en)_{n∈N} be an orthonormal sequence in H. Then the following are equivalent:
1. {en | n ∈ N}⊥ = {0};
2. H = span{en | n ∈ N} (the closure of the linear span);
3. H = ⊕_{n∈N} K en as a direct sum of Hilbert spaces;
4. for all x ∈ H, ‖x‖² = Σ_{n∈N} |⟨x, en⟩|²;
5. for all x ∈ H, x = Σ_{n∈N} ⟨x, en⟩ en.
If one (and hence all) of these conditions holds true, then (en)_{n∈N} is called a complete orthonormal basis for H.
Corollary 3.19. Let (en)_{n∈N} be a complete orthonormal basis for H. For every x ∈ H, the truncation error x − Σ_{n=1}^N ⟨x, en⟩ en is orthogonal to span{e1, . . . , eN}.

Proof. Let v := Σ_{n=1}^N vn en be any element of span{e1, . . . , eN}. By completeness, x = Σ_{n∈N} ⟨x, en⟩ en. Hence,

⟨x − Σ_{n=1}^N ⟨x, en⟩ en, v⟩ = ⟨Σ_{n>N} ⟨x, en⟩ en, Σ_{m=1}^N vm em⟩
= Σ_{n>N, m∈{1,...,N}} ⟨⟨x, en⟩ en, vm em⟩
= Σ_{n>N, m∈{1,...,N}} ⟨x, en⟩ vm ⟨en, em⟩
= 0,

since ⟨en, em⟩ = δnm.
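These assertions can be illustrated numerically. The following sketch uses the orthonormal Walsh–Hadamard basis of R⁴ (an illustrative choice) to check the Parseval identity and the orthogonality of the truncation error from Corollary 3.19:

```python
# Parseval's identity and the orthogonality of the truncation error,
# checked in R^4 with the orthonormal Walsh-Hadamard basis; the vector x
# is arbitrary illustrative data.

basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [3.0, -1.0, 2.0, 5.0]
coeffs = [inner(x, e) for e in basis]

# Parseval: ||x||^2 = sum |<x, e_n>|^2 for a complete orthonormal system
assert abs(inner(x, x) - sum(c * c for c in coeffs)) < 1e-12

# Corollary 3.19: the truncation error after N = 2 terms is orthogonal
# to span{e_1, e_2}
N = 2
trunc = [sum(coeffs[n] * basis[n][i] for n in range(N)) for i in range(4)]
err = [xi - ti for xi, ti in zip(x, trunc)]
for e in basis[:N]:
    assert abs(inner(err, e)) < 1e-12
```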

3.4 Tensor Products
Intuitively, the tensor product V ⊗ W of two vector spaces V and W over a common field K is the vector space over K with basis given by the formal symbols {ei ⊗ fj | i ∈ I, j ∈ J}, where {ei | i ∈ I} is a basis of V and {fj | j ∈ J} is a basis of W. However, it is not immediately clear that this definition is independent of the bases chosen for V and W. A more thorough definition is as follows.
Definition 3.20. The free vector space FV×W on the Cartesian product V × W is defined by taking the vector space in which the elements of V × W are a basis:

FV×W := { Σ_{i=1}^n αi e_{(vi,wi)} | n ∈ N, αi ∈ K, (vi, wi) ∈ V × W }.
The ‘freeness’ of FV×W is that the elements e(v,w) are, by definition, linearly
independent for distinct pairs (v, w) ∈ V × W. Now define an equivalence
relation ∼ on FV×W such that

e(v+v′ ,w) ∼ e(v,w) + e(v′ ,w) ,
e(v,w+w′ ) ∼ e(v,w) + e(v,w′ ) ,
αe(v,w) ∼ e(αv,w) ∼ e(v,αw)

for arbitrary v, v′ ∈ V, w, w′ ∈ W, and α ∈ K. Let R be the subspace of FV×W generated by these equivalence relations, i.e. the equivalence class of e(0,0).
Definition 3.21. The (algebraic) tensor product V ⊗ W is the quotient space

V ⊗ W := FV×W / R.
One can easily check that V ⊗ W, as defined in this way, is indeed a vector
space over K. The subspace R of FV×W is mapped to the zero element of V ⊗ W
under the quotient map, and so the above equivalences become equalities in the
tensor product space:

(v + v ′ ) ⊗ w = v ⊗ w + v ′ ⊗ w,
v ⊗ (w + w′ ) = v ⊗ w + v ⊗ w′ ,
α(v ⊗ w) = (αv) ⊗ w = v ⊗ (αw)

for all v, v ′ ∈ V, w, w′ ∈ W, and α ∈ K.
One can also check that the heuristic definition in terms of bases holds true under the formal definition: if {ei | i ∈ I} is a basis of V and {fj | j ∈ J} is a basis of W, then {ei ⊗ fj | i ∈ I, j ∈ J} is a basis of V ⊗ W. Hence, the dimension of the tensor product is the product of the dimensions of the original spaces.
Definition 3.22. The Hilbert space tensor product of two Hilbert spaces H and K over the same field K is given by defining an inner product on the algebraic tensor product H ⊗ K by

⟨h ⊗ k, h′ ⊗ k′⟩_{H⊗K} := ⟨h, h′⟩_H ⟨k, k′⟩_K for all h, h′ ∈ H and k, k′ ∈ K,

extending this definition to all of the algebraic tensor product by sesquilinearity, and defining the Hilbert space tensor product H ⊗ K to be the completion of the algebraic tensor product with respect to this inner product and its associated norm.
Tensor products of Hilbert spaces arise very naturally when considering
spaces of functions of more than one variable, or spaces of functions that take
values in other function spaces. A prime example of the second type is a space
of stochastic processes.
Example 3.23. 1. Given two measure spaces (X , F , µ) and (Y, G , ν), con-
sider L2 (X × Y, µ ⊗ ν; K), the space of functions on X × Y that are square
integrable with respect to the product measure µ ⊗ ν. If f ∈ L2 (X , µ; K)
and g ∈ L2 (Y, ν; K), then we can define a function h : X × Y → K by
h(x, y) := f (x)g(y). The definition of the product measure ensures that
h ∈ L²(X × Y, µ ⊗ ν; K), so this procedure defines a bilinear mapping L²(X, µ; K) × L²(Y, ν; K) → L²(X × Y, µ ⊗ ν; K). It turns out that the span of the range of this bilinear map is dense in L²(X × Y, µ ⊗ ν; K) if L²(X, µ; K) and L²(Y, ν; K) are separable. This shows that

L²(X, µ; K) ⊗ L²(Y, ν; K) ≅ L²(X × Y, µ ⊗ ν; K),

and it also explains why it is necessary to take the completion in the construction of the Hilbert space tensor product.
2. Similarly, L²(X, µ; H), the space of functions f : X → H that are square integrable in the sense that

∫_X ‖f(x)‖²_H dµ(x) < +∞,

is isomorphic to L²(X, µ; K) ⊗ H if this space is separable. The isomorphism maps f ⊗ ϕ ∈ L²(X, µ; K) ⊗ H to the H-valued function x ↦ f(x)ϕ in L²(X, µ; H).
3. Combining the previous two examples reveals that

L²(X, µ; K) ⊗ L²(Y, ν; K) ≅ L²(X × Y, µ ⊗ ν; K) ≅ L²(X, µ; L²(Y, ν; K)).
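In finite dimensions, no completion is needed, and the defining identity for the tensor-product inner product can be checked directly by identifying h ⊗ k with the flattened outer product. A minimal sketch with arbitrary illustrative vectors:

```python
# Finite-dimensional check of the tensor-product inner product identity
# <h (x) k, h' (x) k'> = <h, h'> <k, k'>: identify h (x) k with the
# flattened outer product and use the Euclidean inner product.

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def tensor(h, k):
    # entries h_i * k_j in lexicographic order (the flattened outer product)
    return [hi * kj for hi in h for kj in k]

h, hp = [1.0, 2.0], [3.0, -1.0]
k, kp = [0.5, 1.5, -2.0], [1.0, 0.0, 4.0]

lhs = inner(tensor(h, k), tensor(hp, kp))
rhs = inner(h, hp) * inner(k, kp)
assert abs(lhs - rhs) < 1e-12
```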
Similarly, one can consider Bochner spaces of functions (random variables) taking values in a Banach space V that are pth-power-integrable in the sense that ∫_X ‖f(x)‖^p_V dµ(x) is finite, and identify this with a suitable tensor product Lp(X, µ; R) ⊗ V. However, several subtleties arise in doing this, as there is no single ‘natural’ tensor product of Banach spaces as there is for Hilbert spaces. Readers who are interested in such spaces should consult the book of Ryan [85].
Bibliography
At Warwick, the theory of Hilbert and Banach spaces is covered in courses such
as MA3G7 Functional Analysis I and MA3G8 Functional Analysis II. Sobolev
spaces are covered in MA4A2 Advanced PDEs.
Classic reference texts on elementary functional analysis, including Banach
and Hilbert space theory, include the monographs of Reed & Simon [79], Rudin [83],
and Rynne & Youngson [86]. Further discussion of the relationship between tensor products and spaces of vector-valued integrable functions can be found in the book of Ryan [85].
Truly intrepid students may wish to consult Bourbaki [14], but the standard
warnings about Bourbaki texts apply: the presentation is comprehensive but
often forbiddingly austere, and so is perhaps better as a reference text than
a learning tool. Aliprantis & Border’s Hitchhiker’s Guide [2] is another en-
cyclopædic text, but is surprisingly readable despite the Bourbakiste order in
which material is presented.
Chapter 4
Basic Optimization Theory
We demand rigidly defined areas of doubt and uncertainty!
The Hitchhiker’s Guide to the Galaxy
Douglas Adams
This chapter reviews the basic elements of optimization theory and practice,
without going into the fine details of numerical implementation.

4.1 Optimization Problems and Terminology
In an optimization problem, the objective is to find the extreme values (either
the minimal value, the maximal value, or both) f (x) of a given function f among
all x in a given subset of the domain of f , along with the point or points x that
realize those extreme values. The general form of a constrained optimization
problem is
extremize: f (x)
with respect to: x ∈ X
subject to: gi (x) ∈ Ei for i = 1, 2, . . . ,
where X is some set; f : X → R ∪ {±∞} is a function called the objective
function; and, for each i, gi : X → Yi is a function and Ei ⊆ Yi some subset.
The conditions {gi (x) ∈ Ei | i = 1, 2, . . . } are called constraints, and a point
x ∈ X for which all the constraints are satisfied is called feasible; the set of
feasible points,
{x ∈ X | gi (x) ∈ Ei for i = 1, 2, . . . },
is called the feasible set. If there are no constraints, so that the problem is a
search over all of X , then the problem is said to be unconstrained. In the case of
a minimization problem, the objective function f is also called the cost function
or energy; for maximization problems, the objective function is also called the
utility function.
From a purely mathematical point of view, the distinction between con-
strained and unconstrained optimization is artificial: constrained minimization
over X is the same as unconstrained minimization over the feasible set. How-
ever, from a practical standpoint, the difference is huge. Typically, X is Rn for
some n, or perhaps a simple subset specified using inequalities on one coordinate
at a time, such as [a1 , b1 ] × · · · × [an , bn ]; a bona fide non-trivial constraint is
one that involves a more complicated function of one coordinate, or two or more
coordinates, such as
g1 (x) := cos(x) − sin(x) > 0
or
g2 (x1 , x2 , x3 ) := x1 x2 − x3 = 0.
Definition 4.1. The arg min or set of global minimizers of f : X → R ∪ {±∞} is defined to be

arg min_{x∈X} f(x) := {x ∈ X | f(x) = inf_{x′∈X} f(x′)},

and the arg max or set of global maximizers of f : X → R ∪ {±∞} is defined to be

arg max_{x∈X} f(x) := {x ∈ X | f(x) = sup_{x′∈X} f(x′)}.
Definition 4.2. A constraint is said to be
1. redundant if it does not change the feasible set, and relevant otherwise;
2. non-binding if it does not change the extreme value, and binding otherwise;
3. active if it holds as an equality at the extremizer, and inactive otherwise.

4.2 Unconstrained Global Optimization
In general, finding a global minimizer of an arbitrary function is very hard, especially in high-dimensional settings and without nice features like convexity.
Except in very simple settings like linear least squares, it is necessary to con-
struct an approximate solution, and to do so iteratively; that is, one computes a
sequence (xn )n∈N in X such that xn converges as n → ∞ to an extremizer of the
objective function within the feasible set. A simple example of a deterministic
iterative method for finding the critical points, and hence extrema, of a smooth
function is Newton’s method:

Definition 4.3. Given a differentiable function f and an initial state x0, Newton's method for finding a zero of f is the sequence generated by the iteration

x_{n+1} := x_n − Df(x_n)^{−1} f(x_n).

Newton's method is often applied to find critical points of f, i.e. points where Df vanishes, in which case the iteration is

x_{n+1} := x_n − D²f(x_n)^{−1} Df(x_n).
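The critical-point iteration can be sketched in a few lines of one-dimensional Python. The objective f(x) = x⁴/4 − 2x is an illustrative choice, with f′(x) = x³ − 2 and f″(x) = 3x², so the critical point is the real cube root of 2:

```python
# Newton iteration for a critical point: x_{n+1} = x_n - f''(x_n)^{-1} f'(x_n).
# The objective f(x) = x^4/4 - 2x is illustrative; its unique critical
# point solves f'(x) = x^3 - 2 = 0.

def newton_min(df, d2f, x0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - df(x) / d2f(x)
    return x

x_star = newton_min(df=lambda x: x**3 - 2.0,
                    d2f=lambda x: 3.0 * x**2,
                    x0=2.0)
assert abs(x_star**3 - 2.0) < 1e-10   # f'(x_star) = 0 up to rounding
```

Starting close enough to the critical point, the iteration converges quadratically; a robust implementation would also guard against vanishing second derivatives.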
For objective functions f : X → R ∪ {±∞} that have little to no smoothness, or that have many local extremizers, it is often necessary to resort to
random searches of the space X . For such algorithms, there can only be a prob-
abilistic guarantee of convergence. The rate of convergence and the degree of
approximate optimality naturally depend upon features like randomness of the
generation of new elements of X and whether the extremizers of f are difficult
to reach, e.g. because they are located in narrow ‘valleys’. We now describe
three very simple random iterative algorithms for minimization of a prescribed
objective function f, in order to illustrate some of the relevant issues. For sim-
plicity, suppose that f has a unique global minimizer x_min and write f_min
for f(x_min).
Algorithm 4.4 (Random sampling). For simplicity, the following algorithm runs
for n_max steps with no convergence checks. The algorithm returns an approximate minimizer x_best along with the corresponding value of f. Suppose that
random() generates independent samples of X from a probability measure µ
with support X .
f_best = +inf
n = 0
while n < n_max:
x_new = random()
f_new = f(x_new)
if f_new < f_best:
x_best = x_new
AF
f_best = f_new
n = n + 1
return [x_best, f_best]
A weakness of Algorithm 4.4 is that it completely neglects local information
about f. Even if the current state x_best is very close to the global minimizer
x_min, the algorithm may continue to sample points x_new that are very far
away and have f(x_new) ≫ f(x_best). It would be preferable to explore a neighbourhood of x_best more thoroughly and hence find a better approximation of [x_min, f_min]. The next algorithm attempts to rectify this deficiency.
Algorithm 4.5 (Random walk). As before, this algorithm runs for n_max steps.
The algorithm returns an approximate minimizer x_best along with the corre-
sponding value of f. Suppose that an initial state x0 is given, and that jump()
generates independent samples of X from a probability measure µ with support
equal to the unit ball of X .
x_best = x0
f_best = f(x_best)

while n < n_max:
x_new = x_best + jump()
f_new = f(x_new)
if f_new < f_best:
x_best = x_new
f_best = f_new
n = n + 1
return [x_best, f_best]
Algorithm 4.5 also has a weakness: since the state is only ever updated to
states with a strictly lower value of f, and only looks for new states within
unit distance of the current one, the algorithm is prone to becoming stuck in
local minima if they are surrounded by wells that are sufficiently wide, even if
they are very shallow. The next algorithm, the simulated annealing method,
attempts to rectify this problem by allowing the optimizer to make some ‘up-
hill’ moves, which can be accepted or rejected according to comparison of a
uniformly-distributed random variable with a user-prescribed acceptance prob-
ability function. Therefore, in the simulated annealing algorithm, a distinction
is made between the current state x of the algorithm and the best state so far,
x_best; unlike in the previous two algorithms, proposed states x_new may be
accepted and become x even if f(x_new) > f(x_best). The idea is to introduce
a parameter T, to be thought of as ‘temperature’: the optimizer starts off ‘hot’,
and ‘uphill’ moves are likely to be accepted; by the end of the calculation, the
optimizer is relatively ‘cold’, and ‘uphill’ moves are unlikely to be accepted.
Algorithm 4.6 (Simulated annealing). Suppose that an initial state x0 is given,
and that functions temperature(), neighbour() and acceptance_prob() have
been specified. Suppose that uniform() generates independent samples from
the uniform distribution on [0, 1]. Then the simulated annealing algorithm is
x = x0
fx = f(x)
x_best = x
AF
f_best = fx
n = 0
while n < n_max:
T = temperature(n / n_max)
x_new = neighbour(x)
f_new = f(x_new)
if acceptance_prob(fx, f_new, T) > uniform():
x = x_new
fx = f_new
if f_new < f_best:
x_best = x_new
f_best = f_new
n = n + 1
return [x_best, f_best]
Like Algorithm 4.4, the simulated annealing method can be guaranteed to find the global minimizer of f provided that the neighbour() function allows full exploration of the state space and the maximum run time n_max is large enough.
However, the difficulty lies in coming up with functions temperature() and
acceptance_prob() such that the algorithm finds the global minimizer in rea-
sonable time. Simulated annealing calculations can be extremely computation-
ally costly. A commonly-used acceptance probability function P is the one from
the Metropolis–Hastings algorithm:
P(e, e′, T) = 1 if e′ < e, and P(e, e′, T) = exp(−(e′ − e)/T) if e′ ≥ e.
There are, however, many other choices; in particular, it is not necessary to
automatically accept downhill moves, and it is permissible to have P (e, e′ , T ) <
1 for e′ < e.
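The pieces of Algorithm 4.6 can be assembled into a runnable sketch using the Metropolis–Hastings acceptance probability above. The objective, the cooling schedule temperature(), and the proposal neighbour() below are illustrative choices only, not prescriptions:

```python
# A runnable instantiation of Algorithm 4.6. Everything problem-specific
# (objective, cooling schedule, proposal distribution) is an illustrative
# choice; a fixed seed makes the run reproducible.
import math
import random

random.seed(0)

def f(x):
    # illustrative rugged objective: quadratic bowl plus oscillation
    return 0.1 * x * x + math.sin(3.0 * x)

def temperature(r):
    # linear cooling from T = 1 down to a small positive floor
    return max(0.01, 1.0 - r)

def neighbour(x):
    return x + random.uniform(-0.5, 0.5)

def acceptance_prob(e, e_new, T):
    # Metropolis-Hastings: always accept downhill, accept uphill with
    # probability exp(-(e_new - e)/T)
    if e_new < e:
        return 1.0
    return math.exp(-(e_new - e) / T)

n_max = 5000
x = x_best = 4.0
fx = f_best = f(x)
for n in range(n_max):
    T = temperature(n / n_max)
    x_new = neighbour(x)
    f_new = f(x_new)
    if acceptance_prob(fx, f_new, T) > random.uniform(0.0, 1.0):
        x, fx = x_new, f_new
        if f_new < f_best:
            x_best, f_best = x_new, f_new
```

Note that the distinction between the current state x and the best state x_best is exactly as in Algorithm 4.6: uphill proposals may become x without ever touching x_best.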
4.3 Constrained Optimization
It is well-known that the unconstrained extremizers of smooth enough functions
must be critical points, i.e. points where the derivative vanishes. The following
theorem, the Lagrange multiplier theorem, states that the constrained minimiz-
ers of a smooth enough function, subject to smooth enough equality constraints,
are critical points of an appropriately generalized function:
Theorem 4.7 (Lagrange multipliers). Let X and Y be real Banach spaces. Let
U ⊆ X be open and let f ∈ C¹(U; R). Let g ∈ C¹(U; Y), and suppose that u0 is a constrained extremizer of f subject to the constraint that g(u0) = 0. Suppose
also that the Fréchet derivative Dg(u0 ) : X → Y is surjective. Then there exists
a Lagrange multiplier λ0 ∈ Y ′ such that (u0 , λ0 ) is an unconstrained critical
point of the Lagrangian L defined by
U × Y′ ∋ (u, λ) ↦ L(u, λ) := f(u) + ⟨λ | g(u)⟩ ∈ R,

i.e. Df(u0) = λ0 ◦ Dg(u0) as linear maps from X to R.
The corresponding result for inequality constraints is the Karush–Kuhn–
Tucker theorem:
Theorem 4.8 (Karush–Kuhn–Tucker). Suppose that x∗ ∈ Rn is a local minimizer
of f ∈ C 1 (Rn ; R) subject to inequality constraints gi (x) ≤ 0 and equality con-
straints hj(x) = 0, where gi, hj ∈ C¹(Rn; R) for i = 1, . . . , m and j = 1, . . . , ℓ.
Then there exist µ ∈ Rm and λ ∈ Rℓ such that

−∇f (x∗ ) = µ · ∇g(x∗ ) + λ · ∇h(x∗ ),

where x∗ is feasible, and µ satisfies the dual feasibility criteria µi ≥ 0 and the
complementary slackness criteria µi gi (x∗ ) = 0 for i = 1, . . . , m.
Strictly speaking, the validity of the Karush–Kuhn–Tucker theorem also depends upon some regularity conditions on the constraints called constraint qual-
ification conditions, of which there are many variations that can easily be found
in the literature. A very simple one is that if gi and hj are affine functions,
then no further regularity is needed; another is that the gradients of the active
inequality constraints and the gradients of the equality constraints be linearly
independent at x∗ .
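A worked instance of the Karush–Kuhn–Tucker conditions may help; the problem below is an illustrative choice, not from the text. Minimize f(x) = x1² + x2² subject to the single inequality constraint g(x) = 1 − x1 − x2 ≤ 0; the candidate x∗ = (1/2, 1/2) with multiplier µ = 1 can be checked against each condition in turn:

```python
# KKT check for: minimize x1^2 + x2^2 subject to 1 - x1 - x2 <= 0.
# Candidate: x* = (1/2, 1/2) with multiplier mu = 1 (illustrative example).

def grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]

def g(x):
    return 1.0 - x[0] - x[1]

def grad_g(x):
    return [-1.0, -1.0]

x_star = [0.5, 0.5]
mu = 1.0

# stationarity: -grad f(x*) = mu * grad g(x*)
for a, b in zip(grad_f(x_star), grad_g(x_star)):
    assert abs(-a - mu * b) < 1e-12
assert g(x_star) <= 1e-12            # primal feasibility (constraint active)
assert mu >= 0.0                     # dual feasibility
assert abs(mu * g(x_star)) < 1e-12   # complementary slackness
```

Here the constraint is active at x∗, so the multiplier is strictly positive; were the constraint inactive, complementary slackness would force µ = 0.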
Numerical Implementation of Constraints. In the numerical treatment of constrained optimization problems, there are many ways to implement constraints, not all of which actually enforce the constraints in the sense of ensuring
that trial states x_new, accepted states x, or even the final solution x_best are
actually members of the feasible set. For definiteness, consider the constrained
minimization problem

minimize: f (x)
with respect to: x ∈ X
subject to: c(x) ≤ 0

for some functions f, c : X → R ∪ {±∞}. One way of seeing the constraint
‘c(x) ≤ 0’ is as a Boolean true/false condition: either the inequality is satisfied,
or it is not. Supposing that neighbour(x) generates new (possibly infeasible)
elements of X given a current state x, one approach to generating feasible trial
states x_new is the following:
x’ = neighbour(x)
while c(x’) > 0:
x’ = neighbour(x)
x_new = x’
However, this accept/reject approach is extremely wasteful: if the feasible set
is very small, then x’ will ‘usually’ be rejected, thereby wasting a lot of computational time, and this approach takes no account of how ‘nearly feasible’ an infeasible x’ might be.
One alternative approach is to use penalty functions: instead of considering
the constrained problem of minimizing f (x) subject to c(x) ≤ 0, one can consider
the unconstrained problem of minimizing x 7→ f (x)+p(x), where p : X → [0, ∞)
is some function that equals zero on the feasible set and takes larger values the
‘more’ the constraint inequality c(x) ≤ 0 is violated; e.g., for µ > 0,

pµ(x) = 0 if c(x) ≤ 0, and pµ(x) = e^{c(x)/µ} − 1 if c(x) > 0.
The hope is that (a) the minimization of f + pµ over all of X is easy, and (b) as µ → 0, minimizers of f + pµ converge to minimizers of f on the original feasible
set. The penalty function approach is attractive, but the choice of penalty
function is rather ad hoc, and issues can easily arise of competition between the
penalties corresponding to multiple constraints.
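The penalty idea can be sketched concretely. Here f(x) = x is minimized subject to c(x) = 1 − x ≤ 0 (so the constrained minimizer is x = 1) using the exponential penalty pµ above; the grid search and the particular values of µ are illustrative only, and a practical implementation would use a proper unconstrained optimizer:

```python
# Penalty-method sketch: minimize f(x) = x subject to c(x) = 1 - x <= 0
# by minimizing f + p_mu over a grid for decreasing mu. Illustrative only.
import math

def f(x):
    return x

def c(x):
    return 1.0 - x

def p(x, mu):
    return 0.0 if c(x) <= 0.0 else math.exp(c(x) / mu) - 1.0

grid = [i * 0.001 for i in range(-2000, 4001)]   # covers [-2, 4]

minimizers = []
for mu in [2.0, 1.0, 0.5]:
    x_mu = min(grid, key=lambda x: f(x) + p(x, mu))
    minimizers.append(x_mu)

# for large mu the penalty is weak and the unconstrained minimizer of
# f + p_mu sits in the infeasible region; as mu shrinks it is pushed
# back to the constrained minimizer x = 1
assert minimizers[0] < 0.0
assert abs(minimizers[-1] - 1.0) < 1e-9
```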
An alternative to the use of penalty functions is to construct constraining
functions that enforce the constraints exactly. That is, we seek a function C()
that takes as input a possibly infeasible x’ and returns some x_new = C(x’)
that is guaranteed to satisfy the constraint c(x_new) <= 0. For example, sup-
pose that X = Rn and the feasible set is the Euclidean unit ball, so the constraint
is
c(x) := ‖x‖₂² − 1 ≤ 0.
Then a suitable constraining function could be
C(x) := x if ‖x‖₂ ≤ 1, and C(x) := x/‖x‖₂ if ‖x‖₂ > 1.
Constraining functions are very attractive because the constraints are treated
exactly. However, they must often be designed on a case-by-case basis for each

constraint function c, and care must be taken to ensure that multiple constrain-
ing functions interact well and do not unduly favour parts of the feasible set
over others; for example, the above constraining function C maps the entire in-
feasible set to the unit sphere, which might be considered undesirable in certain
settings, and so a function such as
C̃(x) := x if ‖x‖₂ ≤ 1, and C̃(x) := x/‖x‖₂² if ‖x‖₂ > 1
might be more appropriate. Finally, note that the original accept/reject method
of finding feasible states is a constraining function in this sense, albeit a very
inefficient one.
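The radial projection C above is easily implemented and tested; the following is a minimal pure-Python sketch for the Euclidean unit ball:

```python
# Constraining function for the Euclidean unit ball: any proposal is
# mapped back into the feasible set by radial projection, so that the
# constraint c(x) = ||x||_2^2 - 1 <= 0 holds exactly for the output.
import math

def constrain(x):
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= 1.0:
        return x
    return [xi / norm for xi in x]

def c(x):
    return sum(xi * xi for xi in x) - 1.0

inside = constrain([0.3, -0.4])
outside = constrain([3.0, 4.0])

assert inside == [0.3, -0.4]    # feasible points are left unchanged
assert c(outside) <= 1e-12      # infeasible points land on the sphere
```

Composed with the neighbour() proposals of the earlier algorithms, this enforces feasibility of every trial state without any rejection loop.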
4.4 Convex Optimization
The topic of this section is convex optimization. As will be seen, convexity
is a powerful property that makes optimization problems tractable to a much
greater extent than any amount of smoothness (which still permits local minima)
or low-dimensionality can do.
In this section, X will be a Hausdorff, locally convex topological vector
space. Given two points x0 and x1 of X and t ∈ [0, 1], xt will denote the convex
combination
xt := (1 − t)x0 + tx1 .
More generally, given points x0 , . . . , xn of a vector space, a sum of the form

α0 x0 + · · · + αn xn

is called a linear combination if the αi are any field elements, an affine combi-
nation if their sum is 1, and a convex combination if they are non-negative and
sum to 1.
Definition 4.9. A subset K ⊆ X is a convex set if, for all x0 , x1 ∈ K and
t ∈ [0, 1], xt ∈ K; it is said to be strictly convex if xt ∈ K̊ whenever x0 and
x1 are distinct points of K̄ and t ∈ (0, 1). An extreme point of a convex set K
is a point of K that cannot be written as a non-trivial convex combination of
distinct elements of K; the set of all extreme points of K is denoted ext(K).
The convex hull co(S) (resp. closed convex hull co(S)) of S ⊆ X is defined to
be the intersection of all convex (resp. closed and convex) subsets of X that
contain S.

Example 4.10. The square [−1, 1]2 is a convex subset of R2 , but is not strictly
convex, and its extreme points are the four vertices (±1, ±1). The closed unit disc {(x, y) ∈ R² | x² + y² ≤ 1} is a strictly convex subset of R², and its extreme points are the points of the unit circle {(x, y) ∈ R² | x² + y² = 1}. See Figure 4.1 for further examples.
Example 4.11. M1(X) is a convex subset of the space of all (signed) Borel measures on X. The extremal probability measures are the zero-one measures, i.e. those for which, for every measurable set E ⊆ X, µ(E) ∈ {0, 1}. Furthermore, as will be discussed in Chapter 14, if X is, say, a Polish space, then the zero-one measures (and hence the extremal probability measures) on X are the Dirac point masses. Indeed, in this situation, M1(X) is the closed convex hull of {δx | x ∈ X} as a subset of the space M±(X) of signed measures on X.
The reason that these notes restrict attention to Hausdorff, locally convex
topological vector spaces X is that it is just too much of a headache to work
with spaces for which the following ‘common sense’ results do not hold:
Theorem 4.12 (Kreĭn–Milman [54]). Let X be a Hausdorff, locally convex topological vector space, and let K ⊆ X be compact and convex. Then K is the closed convex hull of its extreme points.
Theorem 4.13 (Choquet–Bishop–de Leeuw [11]). Let X be a Hausdorff, locally convex topological vector space, let K ⊆ X be compact and convex, and let c ∈ K.
[Figure 4.1: Convex sets, extreme points, and convex hulls. (a) A convex subset of the plane (grey) and its extreme points (black). (b) A non-convex subset of the plane (black) and its convex hull (grey).]
Then there exists a probability measure µ supported on ext(K) such that, for all affine functions f on K,

f(c) = ∫_{ext(K)} f(e) dµ(e).
Informally speaking, the Kreĭn–Milman and Choquet–Bishop–de Leeuw theorems together assure us that a compact, convex subset K of a topologically respectable space is entirely characterized by its set of extreme points in the following sense: every point of K can be obtained as an average of extremal points of K, and, indeed, the value of any affine function at any point of K can be obtained as an average of its values at the extremal points in the same way.
Definition 4.14. Let K ⊆ X be convex. A function f : K → R ∪ {±∞} is a
convex function if, for all x0 , x1 ∈ K and t ∈ [0, 1],
f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ),
and is called a strictly convex function if, for all distinct x0 , x1 ∈ K and t ∈
(0, 1),
f (xt ) < (1 − t)f (x0 ) + tf (x1 ).
It is straightforward to see that f : K → R ∪ {±∞} is convex (resp. strictly
convex) if and only if its epigraph
epi(f ) := {(x, v) ∈ K × R | v ≥ f (x)}
is a convex (resp. strictly convex) subset of K × R. Convex functions have many
convenient properties with respect to minimization and maximization:
Theorem 4.15. Let f : K → R be a convex function on a compact, convex,
non-empty set K ⊆ X . Then
1. any local minimizer of f in K is also a global minimizer;
2. the set arg minK f of global minimizers of f in K is convex;
3. if f is strictly convex, then it has at most one global minimizer in K;
4. f has the same maximum values on K and ext(K).
Remark. Note well that Theorem 4.15 does not assert the existence of minimizers, for which simultaneous compactness of K and lower semicontinuity of f are required. For example:
• the exponential function on R is strictly convex, continuous and bounded
below by 0 yet has no minimizer;
• the interval [−1, 1] is compact, and the function f : [−1, 1] → R ∪ {±∞}
defined by f(x) := x if |x| < 1/2 and f(x) := +∞ if |x| ≥ 1/2 is convex, yet f has no minimizer — although inf_{x∈[−1,1]} f(x) = −1/2, there
is no x for which f (x) attains this infimal value.
Proof. 1. Suppose that x0 is a local minimizer of f in K that is not a global
minimizer: that is, suppose that x0 is a minimizer of f in some open
neighbourhood N of x0 , and also that there exists x1 ∈ K \ N such that
f (x1 ) < f (x0 ). Then, for sufficiently small t > 0, xt ∈ N , but convexity
implies that
f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ) < (1 − t)f (x0 ) + tf (x0 ) = f (x0 ),
which contradicts the assumption that x0 is a minimizer of f in N .
2. Suppose that x0 , x1 ∈ K are global minimizers of f . Then, for all t ∈ [0, 1],
xt ∈ K and
f (x0 ) ≤ f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ) = f (x0 ).
Hence, xt ∈ arg minK f , and so arg minK f is convex.
3. Suppose that x0 , x1 ∈ K are distinct global minimizers of f , and let
t ∈ (0, 1). Then xt ∈ K and
f (x0 ) ≤ f (xt ) < (1 − t)f (x0 ) + tf (x1 ) = f (x0 ),
which is a contradiction. Hence, f has at most one minimizer in K.
4. Suppose that x is a non-extreme point of K that is also a strict maximizer
for f on K. Let x = xt = (1 − t)x0 + tx1 , where x0 , x1 are distinct points
of K and t ∈ (0, 1). Since x is assumed to be a strict maximizer for f on
K, max{f (x0 ), f (x1 )} < f (xt ). Then, since f is convex,
f (xt ) ≤ (1 − t)f (x0 ) + tf (x1 ) ≤ max{f (x0 ), f (x1 )} < f (xt ),


which is a contradiction.
Definition 4.16. A convex optimization problem (or convex program) is a minimization problem in which the objective function is convex, every inequality constraint is an inequality with respect to a convex function, and every equality constraint is affine.
Remark 4.17. 1. Beware of the common pitfall of saying that a convex program is simply the minimization of a convex function over a convex set. Of course, by Theorem 4.15, such minimization problems are nicer than general minimization problems, but bona fide convex programs are an even nicer special case.
2. In practice, many problems are not obviously convex programs, but can be transformed into convex programs by, e.g., a cunning change of variables. Being able to spot the right equivalent problem is a major part of the art of optimization.

It is difficult to overstate the importance of convexity in making optimization problems tractable. Indeed, it has been remarked that lack of convexity is a much greater obstacle to tractability than high dimension. There are many powerful methods for the solution of convex programs, with corresponding standard software libraries such as cvxopt. For example, the interior point methods explore the interior of the feasible set in search of the solution to the convex program, while being kept away from the boundary of the feasible set by a barrier function. The discussion that follows is only intended as an outline; for details, see Chapter 11 of Boyd & Vandenberghe [16].
Consider the convex program

    minimize:        f(x)
    with respect to: x ∈ Rn
    subject to:      ci(x) ≤ 0 for i = 1, . . . , m,

where the functions f, c1 , . . . , cm : Rn → R are all convex and differentiable. Let F denote the feasible set for this program. Let 0 < µ ≪ 1 be a small scalar, called the barrier parameter, and define the barrier function associated to the program by

    B(x; µ) := f(x) − µ Σ_{i=1}^{m} log(−ci(x)).

Note that B( · ; µ) is strictly convex for µ > 0, that B(x; µ) → +∞ as x → ∂F , and that B( · ; 0) = f ; therefore, the unique minimizer x∗µ of B( · ; µ) lies in F̊ and (hopefully) converges to the minimizer of the original problem as µ → 0. Indeed, using arguments based on convex duality, one can show that

    f(x∗µ) − inf_{x∈F} f(x) ≤ mµ.

The strictly convex problem of minimizing B( · ; µ) can be solved approximately using Newton's method. In fact, however, one settles for a partial minimization of B( · ; µ) using only one or two steps of Newton's method, then decreases µ to µ′, performs another partial minimization of B( · ; µ′) using Newton's method, and so on in this alternating fashion.
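This alternating scheme can be sketched in a few lines of code. The example below is an assumption made for illustration, not taken from the text: it minimizes f(x) = x² subject to the single convex constraint c(x) = 1 − x ≤ 0, so that F = [1, ∞) and the exact minimizer is x∗ = 1. A damped Newton step keeps the iterate strictly inside F.

```python
import math

# Toy log-barrier interior point sketch (made-up one-dimensional example):
# minimize f(x) = x^2 subject to c(x) = 1 - x <= 0, i.e. over F = [1, oo).
# Barrier: B(x; mu) = x^2 - mu * log(x - 1).

def newton_step(x, mu):
    g = 2.0 * x - mu / (x - 1.0)       # B'(x; mu)
    h = 2.0 + mu / (x - 1.0) ** 2      # B''(x; mu) > 0: B(.; mu) is strictly convex
    t, step = 1.0, g / h
    while x - t * step <= 1.0:         # damp the step so the iterate stays in the interior
        t *= 0.5
    return x - t * step

x, mu = 2.0, 1.0                       # strictly feasible start; initial barrier parameter
while mu > 1e-8:
    for _ in range(2):                 # partial minimization: two Newton steps only
        x = newton_step(x, mu)
    mu *= 0.5                          # decrease mu and repeat
print(x)                               # approaches the true minimizer x* = 1
```

As µ decreases, the minimizer of B( · ; µ) tracks the "central path" toward the constrained optimum, in line with the bound f(x∗µ) − inf_F f ≤ mµ above.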

4.5 Linear Programming


Theorem 4.15 has the following immediate corollary for the minimization and
maximization of affine functions on convex sets:

Corollary 4.18. Let ℓ : K → R be an affine function on a non-empty, compact, convex set K ⊆ X . Then

    ext_{x∈K} ℓ(x) = ext_{x∈ext(K)} ℓ(x).
Definition 4.19. A linear program is an optimization problem of the form

    extremize:       f(x)
    with respect to: x ∈ Rn
    subject to:      gi(x) ≤ 0 for i = 1, . . . , m,

where the functions f, g1 , . . . , gm : Rn → R are all affine functions. Linear programs are often written in the canonical form

    maximize:        c · x
    with respect to: x ∈ Rn
    subject to:      Ax ≤ b,
                     x ≥ 0,

where c ∈ Rn , A ∈ Rm×n and b ∈ Rm are given, and the two inequalities are interpreted componentwise.
Note that the feasible set for a linear program is an intersection of finitely many half-spaces of Rn , i.e. a polytope. This polytope may be empty, in which case the constraints are mutually contradictory and the program is said to be infeasible. Also, the polytope may be unbounded in the direction of c, in which case the extreme value of the problem is infinite.
Since linear programs are special cases of convex programs, methods such as interior point methods are applicable to linear programs as well. Such methods approach the optimum point x∗ , which is necessarily an extremal element of the feasible polytope, from the interior of the feasible polytope. Historically, however, such methods were preceded by methods such as Dantzig's simplex algorithm, which sets out to directly explore the set of extreme points in a (hopefully) efficient way. Although the theoretical worst-case complexity of the simplex method as formulated by Dantzig is exponential in n and m, in practice the simplex method is remarkably efficient (polynomial running time) provided that certain precautions are taken to avoid pathologies such as 'stalling'.
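As a concrete illustration of Corollary 4.18, the optimum of a small linear program can be found by brute-force enumeration of the extreme points of the feasible polytope; the simplex method visits the same extreme points, only far more cleverly. The numbers below are a made-up two-dimensional example.

```python
from itertools import combinations

# Made-up toy LP: maximize c.x over the polytope {x : a.x <= b for each half-plane}.
# By Corollary 4.18, the maximum is attained at an extreme point, so in 2-d we may
# enumerate vertices: each is the intersection of two active constraint lines.
c = (3.0, 5.0)
halfplanes = [((1.0, 0.0), 4.0), ((0.0, 2.0), 12.0), ((3.0, 2.0), 18.0),
              ((-1.0, 0.0), 0.0), ((0.0, -1.0), 0.0)]   # includes x >= 0, y >= 0

def intersect(h1, h2):
    (a1, b1), (a2, b2) = h1, h2
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if abs(det) < 1e-12:
        return None                        # parallel boundary lines: no vertex
    x = (b1 * a2[1] - b2 * a1[1]) / det    # Cramer's rule
    y = (a1[0] * b2 - a2[0] * b1) / det
    return (x, y)

def feasible(p):
    return all(a[0] * p[0] + a[1] * p[1] <= b + 1e-9 for a, b in halfplanes)

vertices = [p for h1, h2 in combinations(halfplanes, 2)
            if (p := intersect(h1, h2)) is not None and feasible(p)]
best = max(vertices, key=lambda p: c[0] * p[0] + c[1] * p[1])
print(best)  # (2.0, 6.0), with objective value 36
```

Enumeration is exponential in general, which is exactly why the simplex method's guided walk along edges of the polytope matters in practice.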
4.6 Least Squares
An elementary example of convex programming is unconstrained quadratic minimization, otherwise known as least squares. Least squares minimization plays a central role in elementary statistical estimation, as will be demonstrated by the Gauss–Markov theorem (Theorem 6.2).
Lemma 4.20. Let K be a closed, convex subset of a Hilbert space H. Then, for each y ∈ H, there is a unique element x̂ ∈ K such that

    x̂ ∈ arg min_{x∈K} ‖y − x‖.

Proof. By Exercise 4.1, the function J : H → [0, ∞) defined by J(x) := ‖y − x‖² is strictly convex, and hence it has at most one minimizer in K. Therefore, it only remains to show that J has at least one minimizer in K. Since J is bounded below (on H, not just on K), J has a sequence of approximate minimizers: let

    I := inf_{x∈K} ‖y − x‖,

and let (xn)n∈N be a sequence in K such that

    I² ≤ ‖y − xn‖² ≤ I² + 1/n.
By the parallelogram identity for the Hilbert norm ‖ · ‖,

    ‖(y − xm) + (y − xn)‖² + ‖(y − xm) − (y − xn)‖² = 2‖y − xm‖² + 2‖y − xn‖²,

and hence

    ‖2y − (xm + xn)‖² + ‖xn − xm‖² ≤ 4I² + 2/n + 2/m.

Since K is convex, (xm + xn)/2 ∈ K, so the first term on the left-hand side above is bounded below as follows:

    ‖2y − (xm + xn)‖² = 4‖y − (xm + xn)/2‖² ≥ 4I².

Hence,

    ‖xn − xm‖² ≤ 4I² + 2/n + 2/m − 4I² = 2/n + 2/m,

and so the sequence (xn)n∈N is Cauchy; since H is complete and K is closed, this sequence converges to some x̂ ∈ K. Since the norm ‖ · ‖ is continuous, ‖y − x̂‖ = I.
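Lemma 4.20 can be seen in action numerically. The example below is an assumption made for illustration: for the closed convex set K = [0, 1]⁴ ⊆ R⁴, the unique nearest point of K to y is obtained by clipping each coordinate of y to [0, 1], since the squared distance separates over coordinates.

```python
import numpy as np

# Projection onto the closed convex set K = [0, 1]^4 (made-up example):
# the minimizer of ||y - x|| over K is coordinate-wise clipping of y.
rng = np.random.default_rng(3)
y = 2.0 * rng.standard_normal(4)
x_hat = np.clip(y, 0.0, 1.0)               # the projection of y onto K
d_hat = np.linalg.norm(y - x_hat)

# No random feasible point should be closer to y than x_hat:
candidates = rng.uniform(0.0, 1.0, size=(1000, 4))
print(all(d_hat <= np.linalg.norm(y - c) + 1e-12 for c in candidates))  # True
```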

Lemma 4.21 (Orthogonality of the residual). Let V be a closed subspace of a Hilbert space H and let b ∈ H. Then x̂ ∈ V minimizes the distance to b if and only if the residual x̂ − b is orthogonal to V , i.e.

    x̂ = arg min_{x∈V} ‖x − b‖  ⟺  (x̂ − b) ⊥ V.

Proof. Let J(x) := 12 kx−bk2 , which has the same minimizers as x 7→ kx−bk; by
Lemma 4.20, such a minimizer exists and is unique. Suppose that (x − b) ⊥ V
and let y ∈ V . Then y − x ∈ V and so (y − x) ⊥ (x − b). Hence, by Pythagoras’
DR

theorem,
ky − bk2 = ky − xk2 + kx − bk2 ≥ kx − bk2 ,
and so x minimizes J.
Conversely, suppose that x minimizes J. Then, for every y ∈ V ,

∂ 1
0= J(x + λy) = (hy, x − bi + hx − b, yi) = Rehx − b, yi
∂λ 2
and, in the complex case,

∂ 1
0= J(x + λiy) = (−ihy, x − bi + ihx − b, yi) = −Imhx − b, yi.
∂λ 2
Hence, hx − b, yi = 0, and since y was arbitrary, (x − b) ⊥ V .

Lemma 4.22 (Normal equations). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Then, given b ∈ K,

    x̂ ∈ arg min_{x∈H} ‖Ax − b‖_K  ⟺  A∗Ax̂ = A∗b,

the equations on the right-hand side being known as the normal equations.
Proof. Recall that, as a consequence of completeness, the only element of a Hilbert space that is orthogonal to every other element of the space is the zero element. Hence,

    ‖Ax − b‖_K is minimal
    ⟺ (Ax − b) ⊥ Av for all v ∈ H, by Lemma 4.21
    ⟺ ⟨Ax − b, Av⟩_K = 0 for all v ∈ H
    ⟺ ⟨A∗Ax − A∗b, v⟩_H = 0 for all v ∈ H
    ⟺ A∗Ax = A∗b, by completeness of H.
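For a finite-dimensional operator A (so that R(A) is automatically closed), the normal equations can be solved directly. The matrix and data below are arbitrary made-up values for illustration only.

```python
import numpy as np

# Solve min_x ||Ax - b|| via the normal equations A* A x = A* b (Lemma 4.22),
# for a made-up overdetermined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least squares, for comparison
print(np.allclose(x_normal, x_lstsq))              # True

# Orthogonality of the residual (Lemma 4.21): A*(A x - b) = 0
print(np.allclose(A.T @ (A @ x_normal - b), 0.0))  # True
```

Note that forming A∗A squares the condition number of the problem, which is why dedicated least squares solvers (QR-based, as in `lstsq`) are preferred in practice.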


Weighting and Regularization. It is common in practice that one does not
want to minimize the K-norm directly, but perhaps some re-weighted version
of the K-norm. This re-weighting is accomplished by a self-adjoint and positive
definite operator on K.
Corollary 4.23 (Weighted least squares). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Let Q : K → K be self-adjoint and positive-definite, and define

    ⟨k, k′⟩_Q := ⟨k, Qk′⟩_K .

Then, given b ∈ K,

    x̂ ∈ arg min_{x∈H} ‖Ax − b‖_Q  ⟺  A∗QAx̂ = A∗Qb.

Proof. Exercise 4.2.


Another situation that arises frequently in practice is that the normal equations do not have a unique solution (e.g. because A∗A is not invertible) and so it is necessary to select one by some means, or that one has some prior belief that 'the right solution' should be close to some initial guess x0 . A technique that accomplishes both of these aims is Tikhonov regularization:
Corollary 4.24 (Tikhonov-regularized least squares). Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K, let Q : H → H be self-adjoint and positive-definite, and let b ∈ K and x0 ∈ H be given. Let

    J(x) := ‖Ax − b‖² + ‖x − x0‖²_Q .

Then

    x̂ ∈ arg min_{x∈H} J(x)  ⟺  (A∗A + Q)x̂ = A∗b + Qx0 .

Proof. Exercise 4.3.
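The point of the regularization is visible numerically: even when A∗A is singular, A∗A + Q is invertible, so the regularized problem has a unique solution. The rank-one matrix below is a made-up example chosen precisely so that the unregularized normal equations fail.

```python
import numpy as np

# Tikhonov regularization (Corollary 4.24) on a made-up rank-deficient problem:
# A*A is singular, but A*A + Q is invertible.
rng = np.random.default_rng(1)
A = np.outer(rng.standard_normal(5), rng.standard_normal(3))  # rank one
b = rng.standard_normal(5)
x0 = np.zeros(3)
Q = 0.1 * np.eye(3)                                # self-adjoint, positive-definite

x_hat = np.linalg.solve(A.T @ A + Q, A.T @ b + Q @ x0)

def J(x):
    return np.sum((A @ x - b) ** 2) + (x - x0) @ Q @ (x - x0)

# x_hat should minimize J: nearby perturbations only increase the objective.
perturbed = [x_hat + 1e-3 * rng.standard_normal(3) for _ in range(100)]
print(all(J(x_hat) <= J(x) for x in perturbed))    # True
```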

Nonlinear Least Squares and Gauss–Newton Iteration. It often occurs in practice that one wishes to find a vector of parameters θ ∈ Rp such that a function Rk ∋ x ↦ f(x; θ) ∈ Rℓ best fits a collection of data points {(xi , yi) ∈ Rk × Rℓ | i = 1, . . . , m}. For each candidate parameter vector θ, define the residual vector

    r(θ) := (r1(θ), . . . , rm(θ))⊤ := (y1 − f(x1; θ), . . . , ym − f(xm; θ))⊤ ∈ Rm.
The aim is to find θ to minimize the objective function J(θ) := ‖r(θ)‖²₂. Let

    A := [ ∂ri(θ)/∂θj ]_{1≤i≤m, 1≤j≤p}, evaluated at θ = θn , A ∈ Rm×p,

be the Jacobian matrix of the residual vector at the current iterate θn , and note that A = −DF(θn), where

    F(θ) := (f(x1; θ), . . . , f(xm; θ))⊤ ∈ Rm.
Consider the first-order Taylor approximation

    r(θ) ≈ r(θn) + A(θ − θn).

Thus, to approximately minimize ‖r(θ)‖₂, we seek the increment δ := θ − θn that makes the right-hand side of the approximation equal to zero. This is an ordinary linear least squares problem, the solution of which is given by the normal equations as

    δ = −(A∗A)⁻¹A∗r(θn).

Thus, we obtain the Gauss–Newton iteration for a sequence (θn)n∈N of approximate minimizers of J:

    θn+1 := θn − (A∗A)⁻¹A∗r(θn)
          = θn + ((DF(θn))∗(DF(θn)))⁻¹(DF(θn))∗r(θn).
In general, the Gauss–Newton iteration is not guaranteed to converge to the exact solution, particularly if δ is 'too large', in which case it may be appropriate to use a judiciously chosen small positive multiple of δ. The use of Tikhonov regularization in this context is known as the Levenberg–Marquardt algorithm or trust region method, and the small multiplier applied to δ is essentially the reciprocal of the Tikhonov regularization parameter.
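The iteration above can be sketched on a made-up model: here f(x; θ) = θ₁ exp(θ₂x), fitted to noiseless synthetic data generated from (θ₁, θ₂) = (2, −1.5), with a simple step-halving safeguard playing the role of the "small positive multiple of δ".

```python
import numpy as np

# Damped Gauss-Newton for the made-up model f(x; theta) = theta_1 * exp(theta_2 * x),
# fitted to noiseless synthetic data with true parameters (2, -1.5).
xs = np.linspace(0.0, 1.0, 20)
theta_true = np.array([2.0, -1.5])
ys = theta_true[0] * np.exp(theta_true[1] * xs)

def residual(theta):
    return ys - theta[0] * np.exp(theta[1] * xs)     # r(theta)

def jacobian(theta):
    e = np.exp(theta[1] * xs)
    return -np.column_stack([e, theta[0] * xs * e])  # A = dr/dtheta = -DF(theta)

theta = np.array([1.5, -1.0])                        # initial guess
for _ in range(30):
    A, r = jacobian(theta), residual(theta)
    delta = -np.linalg.solve(A.T @ A, A.T @ r)       # Gauss-Newton increment
    t = 1.0
    while np.linalg.norm(residual(theta + t * delta)) > np.linalg.norm(r) and t > 1e-8:
        t *= 0.5                                     # damp 'too large' steps
    theta = theta + t * delta

print(theta)  # should recover approximately (2, -1.5)
```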

Bibliography

Direct and iterative methods for the solution of linear least squares problems are covered in MA398 Matrix Analysis and Algorithms.
The book of Boyd & Vandenberghe [16] is an excellent reference on the
theory and practice of convex optimization, as is the associated software library
cvxopt. The classic reference for convex analysis in general is the monograph of
Rockafellar [82]. A standard reference on numerical methods for optimization
is the book of Nocedal & Wright [72].
For constrained global optimization in the absence of ‘nice’ features, particu-
larly for the UQ methods in Chapter 14, the author recommends the Differential
Evolution algorithm [77, 97] within the Mystic framework.
Exercises
Exercise 4.1. Let ‖ · ‖ be the norm induced by an inner product on a vector space X , and fix y ∈ X . Show that the function J : X → [0, ∞) defined by J(x) := ‖y − x‖² is strictly convex.
Exercise 4.2. Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K. Let Q : K → K be self-adjoint and positive-definite, and define

    ⟨k, k′⟩_Q := ⟨k, Qk′⟩_K .

Show that, given b ∈ K,

    x̂ ∈ arg min_{x∈H} ‖Ax − b‖_Q  ⟺  A∗QAx̂ = A∗Qb.

Exercise 4.3. Let A : H → K be a linear operator between Hilbert spaces such that R(A) is a closed subspace of K, let Q : H → H be self-adjoint and positive-definite, and let b ∈ K and x0 ∈ H be given. Let

    J(x) := ‖Ax − b‖² + ‖x − x0‖²_Q .

Show that

    x̂ ∈ arg min_{x∈H} J(x)  ⟺  (A∗A + Q)x̂ = A∗b + Qx0 .

Hint: Consider the operator from H into K ⊕ H given in block form as [A, Q^{1/2}]⊤.
Chapter 5

Measures of Information and Uncertainty

As we know, there are known knowns.
There are things we know we know. We
also know there are known unknowns.
That is to say we know there are some
things we do not know. But there are also
unknown unknowns, the ones we don't
know we don't know.

Donald Rumsfeld
This chapter briefly summarizes some basic numerical measures of uncertainty, from interval bounds to information-theoretic quantities such as (Shannon) information and entropy. This discussion then naturally leads to consideration of distances (and distance-like functions) between probability measures.

5.1 The Existence of Uncertainty

At a very fundamental level, the first step in understanding the uncertainties affecting some system is to identify the sources of uncertainty. Sometimes, this can be a challenging task because there may be so much lack of knowledge about, e.g., the relevant physical mechanisms, that one does not even know what a list of the important parameters would be, let alone what uncertainty one has about their values. The presence of such so-called unknown unknowns is of major concern in high-impact settings like risk assessment.
One way of assessing the presence of unknown unknowns is that if one subscribes to a deterministic view of the universe in which reality maps inputs x ∈ X to outputs y = f(x) ∈ Y by a well-defined single-valued function f : X → Y, then unknown unknowns are additional variables z ∈ Z whose existence one infers from contradictory observations like

    f(x) = y1 and f(x) = y2 ≠ y1 .


Unknown unknowns explain away this contradiction by asserting the existence of a space Z containing distinct elements z1 and z2 , that in fact f is a function f : X × Z → Y, and that the observations were actually

    f(x, z1) = y1 and f(x, z2) = y2 .

Of course, this viewpoint does nothing to actually identify the relevant space Z nor the values z1 and z2 .

5.2 Interval Estimates


Sometimes, nothing more can be said about some unknown quantity than a
range of possible values, with none more or less probable than any other. In
the case of an unknown real number x, such information may boil down to an
interval such as [a, b] in which x is known to lie. This is, of course, a very basic
form of uncertainty, and one may simply summarize the degree of uncertainty
by the length of the interval.

Interval Arithmetic. As well as summarizing the degree of uncertainty by the length of the interval estimate, it is often of interest to manipulate the interval estimates themselves as if they were numbers. One commonly-used method of manipulating interval estimates of real quantities is interval arithmetic. Each of the basic arithmetic operations ∗ ∈ {+, −, ·, /} is extended to intervals A, B ⊆ R by

    A ∗ B := {x ∈ R | x = a ∗ b for some a ∈ A, b ∈ B}.
Hence,

    [a, b] + [c, d] = [a + c, b + d],
    [a, b] − [c, d] = [a − d, b − c],
    [a, b] · [c, d] = [min{a·c, a·d, b·c, b·d}, max{a·c, a·d, b·c, b·d}],
    [a, b] / [c, d] = [min{a/c, a/d, b/c, b/d}, max{a/c, a/d, b/c, b/d}] when 0 ∉ [c, d].

The addition and multiplication operations are commutative and associative, and multiplication is sub-distributive over addition:

    A(B + C) ⊆ AB + AC.
These ideas can be extended to elementary functions without too much difficulty: monotone functions are straightforward, and the Intermediate Value Theorem ensures that the continuous image of an interval is again an interval. However, for general functions f , it is not straightforward to compute (the convex hull of) the image of f .
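The four operations above can be sketched in a few lines of code, representing an interval [a, b] as a pair (a, b); the function names `iadd`, `imul`, etc. are just for this illustration.

```python
# Minimal sketch of interval arithmetic for the four operations defined above.
def iadd(A, B): return (A[0] + B[0], A[1] + B[1])
def isub(A, B): return (A[0] - B[1], A[1] - B[0])
def imul(A, B):
    ps = [a * b for a in A for b in B]     # the four corner products
    return (min(ps), max(ps))
def idiv(A, B):
    if B[0] <= 0 <= B[1]:
        raise ZeroDivisionError("0 in [c, d]")
    qs = [a / b for a in A for b in B]
    return (min(qs), max(qs))

print(iadd((1, 2), (3, 4)))    # (4, 6)
print(isub((1, 2), (3, 4)))    # (-3, -1)
print(imul((-1, 2), (3, 4)))   # (-4, 8)

# Sub-distributivity A(B + C) ⊆ AB + AC can be a strict inclusion:
A, B, C = (-1, 1), (1, 2), (-2, -1)
print(imul(A, iadd(B, C)))               # (-1, 1)
print(iadd(imul(A, B), imul(A, C)))      # (-4, 4): a strictly larger interval
```

The last two lines illustrate the "dependency problem" of naive interval arithmetic: algebraically equivalent expressions can produce enclosures of very different widths.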
The distributional robustness approaches covered in Chapter 14 can be seen
as an extension of this approach from partially known real numbers to partially
known probability measures.

5.3 Variance, Information and Entropy


Suppose that one adopts a subjectivist (e.g. Bayesian) interpretation of probability, so that one's knowledge about some system of interest with possible values in X is summarized by a probability measure µ ∈ M1 (X ). The probability measure µ is a very rich and high-dimensional object; often it is necessary to summarize the degree of uncertainty implicit in µ with a few numbers — perhaps even just one number.

Variance. One obvious summary statistic, when X is (a subset of) a normed vector space and µ has mean m, is the variance of µ, i.e.

    V(µ) := ∫_X ‖x − m‖² dµ(x) ≡ E_{X∼µ}[‖X − m‖²].

If V(µ) is small (resp. large), then we are relatively certain (resp. uncertain) that X ∼ µ is in fact quite close to m. A more refined variance-based measure of informativeness is the covariance operator

    C(µ) := E_{X∼µ}[(X − m) ⊗ (X − m)].

A distribution µ for which the operator norm of C(µ) is large may be said to

T
be a relatively uninformative distribution. Note that when X = Rn , C(µ) is an
n × n symmetric positive-definite matrix. Hence, such a C(µ) has n positive
real eigenvalues (counted with multiplicity)
AF
λ1 ≥ λ2 ≥ · · · ≥ λn > 0,

with corresponding normalized eigenvectors v1 , . . . , vn ∈ Rn . The direction v1


corresponding to the largest eigenvalue λ1 is the direction in which the uncer-
tainty about the random vector X is greatest; correspondingly, the direction vn
is the direction in which the uncertainty about the random vector X is least.
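This eigenvalue picture is easy to compute. The 2 × 2 covariance matrix below is a made-up example; the eigenvector for the largest eigenvalue maximizes the quadratic form v·C(µ)v over unit vectors, i.e. it is the direction of greatest uncertainty.

```python
import numpy as np

# Eigendecomposition of a made-up 2x2 covariance matrix C(mu).
C = np.array([[4.0, 1.0],
              [1.0, 1.0]])
lam, V = np.linalg.eigh(C)            # eigh: eigenvalues in ascending order
v_most, v_least = V[:, -1], V[:, 0]   # directions of greatest / least uncertainty

print(lam)                            # both eigenvalues positive: C is SPD
print(v_most @ C @ v_most)            # equals the largest eigenvalue
print(v_least @ C @ v_least)          # equals the smallest eigenvalue
```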
A beautiful and classical result concerning the variance of two quantities of interest is the uncertainty principle from quantum mechanics. In this setting, the probability distribution is written as p = |ψ|², where ψ is a unit-norm element of a suitable Hilbert space, usually something like L²(Rn; C). Physical observables like position, momentum &c. act as self-adjoint operators on this Hilbert space; e.g. the position operator Q is

    (Qψ)(x) := xψ(x),

so that the expected position is

    ⟨ψ, Qψ⟩ = ∫_{Rn} ψ̄(x) x ψ(x) dx = ∫_{Rn} x |ψ(x)|² dx.

In general, for a fixed unit-norm element ψ ∈ H, the expected value hAi and
2
variance V(A) ≡ σA of a self-adjoint operator A : H → H are defined by

hAi := hψ, Aψi,


2
σA := h(A − hAi)2 i.

The following inequality provides a fundamental lower bound on the product of the variances of any two observables A and B in terms of their commutator [A, B] := AB − BA and their anti-commutator {A, B} := AB + BA. When this lower bound is positive, the two variances cannot both be close to zero, so simultaneous high-precision measurements of A and B are impossible.
Theorem 5.1 (Uncertainty principle: Schrödinger's inequality). Let A, B be self-adjoint operators on a Hilbert space H, and let ψ ∈ H have unit norm. Then

    σ²_A σ²_B ≥ ((⟨{A, B}⟩ − 2⟨A⟩⟨B⟩)/2)² + (⟨[A, B]⟩/(2i))²    (5.1)

and, a fortiori, σ_A σ_B ≥ ½ |⟨[A, B]⟩|.
Proof. Let f := (A − ⟨A⟩)ψ and g := (B − ⟨B⟩)ψ, so that

    σ²_A = ⟨f, f⟩ = ‖f‖²,
    σ²_B = ⟨g, g⟩ = ‖g‖².

Therefore, by the Cauchy–Schwarz inequality (3.1),

    σ²_A σ²_B = ‖f‖² ‖g‖² ≥ |⟨f, g⟩|².

Now write the right-hand side of this inequality as

    |⟨f, g⟩|² = (Re⟨f, g⟩)² + (Im⟨f, g⟩)²
              = ((⟨f, g⟩ + ⟨g, f⟩)/2)² + ((⟨f, g⟩ − ⟨g, f⟩)/(2i))².

Using the self-adjointness of A and B,

    ⟨f, g⟩ = ⟨(A − ⟨A⟩)ψ, (B − ⟨B⟩)ψ⟩
           = ⟨AB⟩ − ⟨A⟩⟨B⟩ − ⟨A⟩⟨B⟩ + ⟨A⟩⟨B⟩
           = ⟨AB⟩ − ⟨A⟩⟨B⟩;

similarly, ⟨g, f⟩ = ⟨BA⟩ − ⟨A⟩⟨B⟩. Hence,

    ⟨f, g⟩ − ⟨g, f⟩ = ⟨[A, B]⟩,
    ⟨f, g⟩ + ⟨g, f⟩ = ⟨{A, B}⟩ − 2⟨A⟩⟨B⟩,

which yields (5.1).
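Schrödinger's inequality also holds for self-adjoint operators on finite-dimensional spaces, so it can be sanity-checked numerically; the random Hermitian matrices and state vector below are made up for this purpose.

```python
import numpy as np

# Finite-dimensional check of Schrödinger's inequality (5.1), using made-up
# random Hermitian matrices A, B and a random unit vector psi in C^4.
rng = np.random.default_rng(4)
n = 4
M1 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M1 + M1.conj().T                       # self-adjoint (Hermitian)
B = M2 + M2.conj().T
psi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
psi /= np.linalg.norm(psi)                 # unit norm

def ev(X):
    return psi.conj() @ X @ psi            # <X> = <psi, X psi>

var_A = (ev(A @ A) - ev(A) ** 2).real      # sigma_A^2
var_B = (ev(B @ B) - ev(B) ** 2).real
comm = ev(A @ B - B @ A)                   # <[A, B]>, purely imaginary
anti = ev(A @ B + B @ A)                   # <{A, B}>, real

lhs = var_A * var_B
rhs = abs((anti - 2 * ev(A) * ev(B)) / 2) ** 2 + abs(comm / 2) ** 2
print(lhs >= rhs)                          # Schrödinger's inequality holds
```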

Information and Entropy. In information theory as pioneered by Claude Shannon, the information (or surprisal) associated to a possible outcome x of a random variable X ∼ µ taking values in a finite set X is defined to be

    I(x) := − log P_{X∼µ}[X = x] ≡ − log µ(x).    (5.2)

Information has units according to the base of the logarithm used:

    base 2 ↔ bits,  base e ↔ nats/nits,  base 10 ↔ bans/dits/hartleys.

The negative sign in (5.2) makes I(x) non-negative, and logarithms are used because one seeks a quantity I( · ) that represents in an additive way the 'surprise value' of observing x. So, for example, if x has half the probability of y, then one is 'twice as surprised' to see the outcome X = x instead of X = y, and so I(x) = I(y) + log 2. The entropy of the measure µ is the expected information:

    H(µ) := E_{X∼µ}[I(X)] ≡ − Σ_{x∈X} µ(x) log µ(x).    (5.3)
(We follow the convention that 0 log 0 := lim_{p→0} p log p = 0.) These definitions are readily extended to a random variable X taking values in Rn and distributed according to a probability measure µ that has Lebesgue density ρ:

    I(x) := − log ρ(x),
    H(µ) := − ∫_{Rn} ρ(x) log ρ(x) dx.

Since entropy measures the average information content of the possible values of X ∼ µ, entropy is often interpreted as a measure of the uncertainty implicit in µ. (Remember that if µ is very 'spread out' and describes a lot of uncertainty about X, then observing a particular value of X carries a lot of 'surprise value' and hence a lot of information.)
Example 5.2. Consider a Bernoulli random variable X taking values in x1 , x2 ∈ X with probabilities p, 1 − p ∈ [0, 1] respectively. This random variable has entropy

    −p log p − (1 − p) log(1 − p).

If X is certain to equal x1 , then p = 1, and the entropy is 0; similarly, if X is certain to equal x2 , then p = 0, and the entropy is again 0; these two distributions carry zero information and have minimal entropy. On the other hand, if p = 1/2, in which case X is uniformly distributed on X , then the entropy is log 2; indeed, this is the maximum possible entropy for a Bernoulli random variable. This example is often interpreted as saying that when interrogating someone with questions that demand "yes" or "no" answers, one gains maximum information by asking questions that have an equal probability of being answered "yes" versus "no".
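The entropy curve of Example 5.2 is simple to evaluate; the short script below (using the natural logarithm, so entropy is in nats) confirms the endpoint values and the maximum at p = 1/2.

```python
import math

# Entropy of a Bernoulli(p) random variable, as in Example 5.2.
def bernoulli_entropy(p):
    if p in (0.0, 1.0):
        return 0.0                       # convention: 0 log 0 = 0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

print(bernoulli_entropy(0.0))            # 0.0: a certain outcome carries no surprise
print(bernoulli_entropy(0.5))            # log 2 = 0.693...: the maximum
grid = [k / 100 for k in range(101)]
print(max(grid, key=bernoulli_entropy))  # 0.5
```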
Proposition 5.3. Let µ and ν be probability measures on discrete sets or Rn . Then the product measure µ ⊗ ν satisfies

    H(µ ⊗ ν) = H(µ) + H(ν).

That is, the entropy of a random vector with independent components is the sum of the entropies of the component random variables.

Proof. Exercise 5.2.

5.4 Information Gain

Implicit in the definition of entropy (5.3) is the use of a uniform measure (counting measure on a finite set, or Lebesgue measure on Rn ) as a reference measure. Upon reflection, there is no need to privilege the uniform measure as the unique reference measure. Indeed, in some settings, such as infinite-dimensional Banach spaces, there is no such thing as a uniform measure. In general, if µ is a probability measure on a measurable space (X , F ) with reference measure π, then we would like to define the entropy of µ with respect to π by an expression like

    H(µ|π) = − ∫_X (dµ/dπ)(x) log (dµ/dπ)(x) dπ(x)
whenever µ has a Radon–Nikodým derivative with respect to π. The negative of this functional is a distance-like function on the set of probability measures on (X , F ):

Definition 5.4. Let µ, ν be σ-finite measures on (X , F ). The Kullback–Leibler divergence from µ to ν is defined to be

    D_KL(µ‖ν) := ∫_X (dµ/dν) log (dµ/dν) dν ≡ ∫_X log (dµ/dν) dµ,  if µ ≪ ν,

and D_KL(µ‖ν) := +∞ otherwise.


While DKL ( · k · ) is non-negative, and vanishes if and only if its arguments
are identical, it is neither symmetric nor does it satisfy the triangle inequality.
Nevertheless, DKL ( · k · ) generates a topology on M1 (X ) that is finer than the
total variation topology:

Theorem 5.5 (Pinsker). For any µ, ν ∈ M1 (X , F ),

T
r
DKL (µkν)
dTV (µ, ν) ≤ ,
2
AF
where the total variation metric is defined by

dTV (µ, ν) := sup |µ(E) − ν(E)| E ∈ F . (5.4)
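For measures on a finite set, both quantities reduce to finite sums, so Pinsker's inequality and the asymmetry of D_KL can be checked directly; the two probability vectors below are made up for illustration.

```python
import math

# Discrete Kullback-Leibler divergence and total variation distance, with a
# check of Pinsker's inequality (Theorem 5.5) on a made-up three-point space.
def kl(mu, nu):
    # D_KL(mu || nu) = sum_x mu(x) log(mu(x)/nu(x)), with 0 log 0 = 0
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def tv(mu, nu):
    # For discrete measures, sup_E |mu(E) - nu(E)| = (1/2) sum_x |mu(x) - nu(x)|
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

mu = [0.5, 0.3, 0.2]
nu = [0.1, 0.6, 0.3]
print(tv(mu, nu) <= math.sqrt(kl(mu, nu) / 2))  # True, as Pinsker's inequality predicts
print(kl(mu, nu), kl(nu, mu))                   # unequal: D_KL is not symmetric
```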

Example 5.6 (Bayesian experimental design). Suppose that a Bayesian point of view is adopted, and for simplicity that all the random variables of interest are finite-dimensional with Lebesgue densities ρ. Consider the problem of selecting an optimal experimental design λ for the inference of some parameters / unknowns θ from the observed data y that will result from the experiment λ. If, for each λ and θ, we know the conditional distribution y|λ, θ of the observed data, then the conditional distribution y|λ is obtained by integration with respect to the prior distribution of θ:

    ρ(y|λ) = ∫ ρ(y|λ, θ)ρ(θ) dθ.

Let U(y, λ) be a real-valued measure of the utility of the posterior distribution

    ρ(θ|y, λ) = ρ(y|θ, λ)ρ(θ) / ρ(y|λ).

For example, one could take the utility U(y, λ) to be the Kullback–Leibler divergence D_KL(ρ( · |y, λ) ‖ ρ( · |λ)) from the posterior to the prior distribution on θ. An experimental design λ that maximizes

    U(λ) := ∫ U(y, λ)ρ(y|λ) dy

is one that is optimal in the sense of maximizing the expected gain in Shannon information.
Bibliography
Comprehensive treatments of interval analysis include the classic monograph of
Moore [69] and the more recent text of Jaulin & al. [42].
Information theory was pioneered by Shannon in his seminal 1948 paper [88].
The Kullback–Leibler divergence was introduced by Kullback & Leibler [55],
who in fact considered the symmetrized version of the divergence that now
bears their names. The book of MacKay [63] provides a thorough introduction
to information theory.


Exercises
Exercise 5.1. Prove Gibbs' inequality that the Kullback–Leibler divergence is non-negative, i.e.

    D_KL(µ‖ν) := ∫_X log (dµ/dν) dµ ≥ 0

whenever µ, ν are σ-finite measures on (X , F ) with µ ≪ ν. Show also that D_KL(µ‖ν) = 0 if and only if µ = ν.
Exercise 5.2. Prove Proposition 5.3. That is, suppose that µ and ν are probability measures on discrete sets or Rn , and show that the product measure µ ⊗ ν satisfies

    H(µ ⊗ ν) = H(µ) + H(ν).

That is, the entropy of a random vector with independent components is the sum of the entropies of the component random variables.
Exercise 5.3. Let µ0 = N(m0, C0) and µ1 = N(m1, C1) be non-degenerate Gaussian measures on Rn . Show that D_KL(µ0‖µ1) is

    ½ ( tr(C1⁻¹C0) + (m1 − m0)⊤C1⁻¹(m1 − m0) − n − log(det C0 / det C1) ).

Exercise 5.4. Suppose that µ and ν are equivalent probability measures on (X , F ) and define

    d(µ, ν) := ess sup_{x∈X} | log (dµ/dν)(x) |.

Show that this defines a well-defined metric on the measure equivalence class E containing µ and ν. In particular, show that neither the choice of function used as the Radon–Nikodým derivative dµ/dν, nor the choice of measure in E with respect to which the essential supremum is taken, affects the value of d(µ, ν).
Exercise 5.5. Show that Pinsker’s inequality (Theorem 5.5) cannot be reversed.
That is, show that, for any ε > 0, there exist probability measures µ and ν with
dTV (µ, ν) ≤ ε but DKL (µkν) = +∞.
Chapter 6

Bayesian Inverse Problems

It ain't what you don't know that gets
you into trouble. It's what you know for
sure that just ain't so.

Mark Twain
This chapter provides a general introduction, at a high level, to the backward propagation of uncertainty/information in the solution of inverse problems, and specifically a Bayesian probabilistic perspective on such inverse problems. Under the umbrella of inverse problems, we consider parameter estimation and regression. One specific aim is to make clear the connection between regularization and the application of a Bayesian prior. The filtering methods of Chapter 7 fall under the general umbrella of Bayesian approaches to inverse problems, but have an additional emphasis on real-time computational expediency.

6.1 Inverse Problems and Regularization

In many applications it is of interest to solve inverse problems, namely to find u, an input to a mathematical model, given y, an observation of (some components of, or functions of) the solution of the model. We have an equation of the form

    y = H(u),

where X , Y are, say, Banach spaces, u ∈ X , y ∈ Y, and H : X → Y is the observation operator. However, inverse problems are typically ill-posed: there may be no solution, the solution may not be unique, or there may be a unique solution that depends sensitively on y. Indeed, very often we do not actually observe H(u), but some noisily corrupted version of it, such as

    y = H(u) + η.    (6.1)

The inverse problem framework encompasses the problem of model calibration (or parameter estimation), where a model Hθ relating inputs to outputs depends upon some parameters θ ∈ Θ, e.g., when X = Y = Θ, Hθ(u) = θu. The problem is, given some observations of inputs ui and corresponding outputs yi , to find the parameter value θ such that

    yi = Hθ(ui) for each i.

Again, this problem is typically ill-posed.


One approach to the problem of ill-posedness is to seek a least-squares solution: find, for the norm ‖ · ‖_Y on Y,

    arg min_{u∈X} ‖y − H(u)‖²_Y .

However, this problem, too, can be difficult to solve as it may possess minimizing sequences that do not have a limit in X ,h6.1i or may possess multiple minima, or may depend sensitively on the observed data y. Especially in this last case, it may be advantageous to not try to fit the observed data too closely, and instead regularize the problem by seeking

    arg min{ ‖y − H(u)‖²_Y + ‖u − ū‖²_E : u ∈ E ⊆ X }

for some Banach space E embedded in X and a chosen ū ∈ E. The standard example of this regularization setup is Tikhonov regularization, as in Corollary 4.24: when X and Y are Hilbert spaces, for suitable self-adjoint positive-definite operators Q on Y and R on X , we seek

    arg min{ ‖Q^{−1/2}(y − H(u))‖²_Y + ‖R^{−1/2}(u − ū)‖²_X : u ∈ X }.

However, this approach appears to be somewhat ad hoc, especially where the choice of regularization is concerned.
Taking a probabilistic — specifically, Bayesian — viewpoint alleviates these difficulties. If we think of u and y as random variables, then (6.1) defines the conditioned random variable y|u, and we define the 'solution' of the inverse problem to be the conditioned random variable u|y. This allows us to model the noise, η, via its statistical properties, even if we do not know the exact instance of η that corrupted the given data, and it also allows us to specify a priori the form of solutions that we believe to be more likely, thereby enabling us to attach weights to multiple solutions which explain the data. This is the essence of the Bayesian approach to inverse problems.

Remark 6.1. In practice the true observation operator is often approximated by some numerical model H( · ; h), where h denotes a mesh parameter, or a parameter controlling missing physics. In this case (6.1) becomes

    y = H(u; h) + ε + η,

where ε := H(u) − H(u; h). In principle, the observational noise η and the computational error ε could be combined into a single term, but keeping them separate is usually more appropriate: unlike η, ε is typically not of mean zero, and is dependent upon u.
h6.1i Take a moment to reconcile the statement "there may exist minimizing sequences that do not have a limit in X " with X being a Banach space.
To illustrate the central role that least squares minimization plays in elementary statistical estimation, and hence motivate the more general considerations of the rest of the chapter, consider the following finite-dimensional linear problem. Suppose that we are interested in learning some vector of parameters x ∈ Kn , which gives rise to a vector y ∈ Km of observations via

    y = Ax + η,

where A ∈ Km×n is a known linear operator (matrix) and η is a Gaussian noise vector known to have mean zero and self-adjoint, positive-definite covariance matrix E[ηη∗] = Q ∈ Km×m , with η independent of x. A common approach is to seek an estimate x̂ of x that is a linear function Ky of the data y, is unbiased in the sense that E[x̂] = x, and is the best estimate in that it minimizes an appropriate cost function. The following theorem, the Gauss–Markov theorem, states that there is precisely one such estimator, and that it is the best in two very natural senses:

Theorem 6.2 (Gauss–Markov). Suppose that A∗ Q−1 A is invertible. Then, among
all unbiased linear estimators K ∈ Kn×m , producing an estimate x̂ = Ky of x,
the one that minimizes both the mean-squared error E[kx̂ − xk2 ] and the covariance matrix⟨6.2⟩ E[(x̂ − x)(x̂ − x)∗ ] is
K = (A∗ Q−1 A)−1 A∗ Q−1 ,

and the resulting estimate x̂ has E[x̂] = x and covariance matrix

E[(x̂ − x)(x̂ − x)∗ ] = (A∗ Q−1 A)−1 .

Remark 6.3. Indeed, by Corollary 4.23, x̂ = (A∗ Q−1 A)−1 A∗ Q−1 y is also the

solution to the weighted least squares problem with weight Q−1 , i.e.

\hat{x} = \operatorname*{arg\,min}_{x \in \mathbb{K}^n} J(x), \qquad J(x) := \tfrac{1}{2} \|Ax - y\|_{Q^{-1}}^2.

Note that the first and second derivatives (gradient and Hessian) of J are

∇J(x) = A∗ Q−1 (Ax − y), D2 J(x) = A∗ Q−1 A,

so the covariance matrix of x̂ is the inverse of the Hessian of J. These observa-


tions will be useful in the construction of the Kálmán filter in Chapter 7.
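As a concrete numerical illustration (our own, not part of the notes; the matrices A and Q below are arbitrary choices), the Gauss–Markov estimator, its unbiasedness, and the claimed covariance formula can be checked by Monte Carlo simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary small example: n = 2 parameters, m = 4 observations.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Q = np.diag([0.5, 1.0, 1.5, 2.0])        # noise covariance, SPD
Qinv = np.linalg.inv(Q)

# Gauss-Markov estimator K = (A* Q^{-1} A)^{-1} A* Q^{-1} of Theorem 6.2.
K = np.linalg.solve(A.T @ Qinv @ A, A.T @ Qinv)
est_cov = np.linalg.inv(A.T @ Qinv @ A)  # predicted covariance of x_hat

x_true = np.array([1.0, -2.0])

# Monte Carlo check of unbiasedness and of the covariance formula.
eta = rng.multivariate_normal(np.zeros(4), Q, size=20000)
x_hats = (A @ x_true + eta) @ K.T        # one estimate per noise draw

print(x_hats.mean(axis=0))               # close to x_true (unbiased)
print(np.cov(x_hats.T))                  # close to est_cov
```

Note that `K @ A` equals the identity, which is exactly the unbiasedness constraint KA = I used in the proof below.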

Proof of Theorem 6.2. Note that the first part of this proof is surplus to re-
quirements: we could simply check that K := (A∗ Q−1 A)−1 A∗ Q−1 is indeed the
minimal linear unbiased estimator, but it is nice to derive the formula for K
from first principles and get some practice at constrained convex optimization
into the bargain.
Since K is required to be unbiased, it follows that KA = I. Therefore,

kx̂ − xk2 = kKy − xk2 = kK(Ax + η) − xk2 = kKηk2 .


⟨6.2⟩ Here, the minimization is meant in the sense of positive semi-definite matrices: for two matrices A and B, we say that A ≤ B if B − A is a positive semi-definite matrix.

Since kKηk2 = η ∗ K ∗ Kη is a scalar and tr(XY ) = tr(Y X) for any two rectan-
gular matrices X and Y of the appropriate sizes,

E[kx̂ − xk2 ] = E[η ∗ K ∗ Kη] = tr(E[Kηη ∗ K ∗ ]) = tr(KQK ∗ ).

Thus, K is the solution to the constrained optimization problem

K = arg min{tr(KQK ∗ ) | KA = I}.

Note that this is a convex optimization problem, since, by Exercise 6.2, K 7→ √(tr(KQK ∗ )) is a norm. Introduce a matrix Λ ∈ Kn×n of Lagrange multipliers,
so that the minimizer of the constrained problem is the unique critical point of
the Lagrangian

L(K, Λ) := tr(KQK ∗ ) − Λ : (KA − I) = tr(KQK ∗ − Λ∗ (KA − I)).

The critical point of the Lagrangian satisfies

0 = ∇K L(K, Λ) = KQ∗ + KQ − ΛA∗ = 2KQ − ΛA∗ ,


since Q is self-adjoint. Multiplication on the right by Q−1 A, and using the
constraint that KA = I, reveals that Λ = 2(A∗ Q−1 A)−1 , and hence that K =
(A∗ Q−1 A)−1 A∗ Q−1 .
It is easily verified that K is an unbiased estimator:

x̂ = (A∗ Q−1 A)−1 A∗ Q−1 (Ax + η) = x + (A∗ Q−1 A)−1 A∗ Q−1 η

and so, taking expectations of both sides, E[x̂] = x. Moreover, the covariance

of this estimator satisfies

E[(x̂ − x)(x̂ − x)∗ ] = (A∗ Q−1 A)−1 A∗ Q−1 E[ηη ∗ ]Q−1 A(A∗ Q−1 A)−1
= (A∗ Q−1 A)−1 ,

as claimed.
Now suppose that L = K + D is any linear unbiased estimator; note that
DA = 0. Then the covariance of the estimate Ly satisfies

E[(Ly − x)(Ly − x)∗ ] = E[(K + D)ηη ∗ (K ∗ + D∗ )]


= (K + D)Q(K ∗ + D∗ )
= KQK ∗ + DQD∗ + KQD∗ + (KQD∗ )∗ .

Since DA = 0,

KQD∗ = (A∗ Q−1 A)−1 A∗ Q−1 QD∗ = (A∗ Q−1 A)−1 (DA)∗ = 0,

and so
E[(Ly − x)(Ly − x)∗ ] = KQK ∗ + DQD∗ ≥ KQK ∗
in the sense of positive semi-definite matrices, as claimed.

Remark 6.4. In the situation that A∗ Q−1 A is not invertible, it is standard to


use the estimator
K = (A∗ Q−1 A)† A∗ Q−1 ,
where B † denotes the Moore–Penrose pseudo-inverse of B, defined equivalently
by

B^\dagger := \lim_{\delta \to 0} (B^* B + \delta I)^{-1} B^*, \qquad B^\dagger := \lim_{\delta \to 0} B^* (B B^* + \delta I)^{-1}, \quad \text{or} \quad B^\dagger := V \Sigma^+ U^*,

where B = U ΣV ∗ is the singular value decomposition of B, and Σ+ is the


transpose of the matrix obtained from Σ by replacing all the strictly positive
singular values by their reciprocals.
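A quick numerical sanity check (an illustration of ours, not from the notes): the regularized formula (B∗B + δI)−1B∗ converges to the SVD-based pseudo-inverse as δ → 0.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 3))     # arbitrary full-column-rank matrix

B_pinv = np.linalg.pinv(B)          # Moore-Penrose pseudo-inverse via SVD

# Limit definition: (B*B + delta I)^{-1} B* approaches B_pinv as delta -> 0.
errs = []
for delta in (1e-2, 1e-4, 1e-8):
    approx = np.linalg.solve(B.T @ B + delta * np.eye(3), B.T)
    errs.append(np.linalg.norm(approx - B_pinv))
print(errs)                          # errors shrink monotonically with delta
```

On the singular values, the approximation replaces 1/σ by σ/(σ² + δ), which also explains why the limit sends zero singular values to zero rather than to infinity.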

Bayesian Interpretation of Regularization. The Gauss–Markov estimator is

not ideal: for example, because of its characterization as the minimizer of a
quadratic cost function, it is sensitive to large outliers in the data, i.e. compo-
nents of y that differ greatly from the corresponding component of Ax̂. In such
a situation, it may be desirable to not try to fit the observed data y too closely,
and instead regularize the problem by seeking x̂, the minimizer of
J(x) := \tfrac{1}{2} \|Ax - y\|_{Q^{-1}}^2 + \tfrac{1}{2} \|x - \bar{x}\|_{R^{-1}}^2,
for some chosen x̄ ∈ Kn and positive-definite Tikhonov matrix R ∈ Kn×n .
Depending upon the relative sizes of Q and R, x̂ will be influenced more by the
data y and hence lie close to the Gauss–Markov estimator, or be influenced more by the regularization term and hence lie close to x̄. At first sight this procedure
may seem somewhat ad hoc, but it has a natural Bayesian interpretation.
The observation equation
y = Ax + η
in fact defines the conditional distribution y|x by (y − Ax)|x = η ∼ N (0, Q). Finding the minimizer of x 7→ ½ kAx − yk2Q−1 , i.e. x̂ = Ky, amounts to finding the maximum likelihood estimator of x given y. The Bayesian interpretation of the
regularization term is that N (x̄, R) is a prior distribution for x. The resulting
posterior distribution for x|y has Lebesgue density ρ(x|y) with

\begin{aligned}
\rho(x|y) &\propto \exp\bigl(-\tfrac12 \|Ax - y\|_{Q^{-1}}^2\bigr) \exp\bigl(-\tfrac12 \|x - \bar x\|_{R^{-1}}^2\bigr) \\
&= \exp\bigl(-\tfrac12 \|Ax - y\|_{Q^{-1}}^2 - \tfrac12 \|x - \bar x\|_{R^{-1}}^2\bigr) \\
&= \exp\bigl(-\tfrac12 \|x - Ky\|_{A^* Q^{-1} A}^2 - \tfrac12 \|x - \bar x\|_{R^{-1}}^2\bigr) \\
&= \exp\bigl(-\tfrac12 \bigl\|x - (A^* Q^{-1} A + R^{-1})^{-1} (A^* Q^{-1} A K y + R^{-1} \bar x)\bigr\|_{A^* Q^{-1} A + R^{-1}}^2\bigr)
\end{aligned}

by the standard result that the product of Gaussian distributions with means
m1 and m2 and covariances C1 and C2 is a Gaussian with covariance C3 =

(C1−1 + C2−1 )−1 and mean C3 (C1−1 m1 + C2−1 m2 ). The solution to the regular-
ized least squares problem, i.e. minimizing the exponent in the above posterior
distribution, is the maximum a posteriori estimator of x given y. However,
the full posterior contains more information than the MAP estimator alone. In
particular, the posterior covariance matrix (A∗ Q−1 A + R−1 )−1 reveals those
components of x about which we are relatively more or less certain.
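In code, the MAP estimator and posterior covariance amount to one linear solve; the sketch below (our own illustration, with arbitrary A, Q, R and data) solves the normal equations (A∗Q−1A + R−1)x̂ = A∗Q−1y + R−1x̄ that follow from minimizing J:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))     # arbitrary forward operator
Q = 0.25 * np.eye(6)                # observational noise covariance
R = 4.0 * np.eye(3)                 # prior (Tikhonov) covariance
xbar = np.zeros(3)                  # prior mean
y = rng.standard_normal(6)          # some (made-up) data

Qinv, Rinv = np.linalg.inv(Q), np.linalg.inv(R)

# MAP estimate: minimizer of J(x) = 1/2|Ax-y|^2_{Q^-1} + 1/2|x-xbar|^2_{R^-1}.
prec = A.T @ Qinv @ A + Rinv        # posterior precision (inverse covariance)
x_map = np.linalg.solve(prec, A.T @ Qinv @ y + Rinv @ xbar)
post_cov = np.linalg.inv(prec)      # posterior covariance (A*Q^-1 A + R^-1)^-1

# x_map is a critical point of J: the gradient vanishes there.
grad = A.T @ Qinv @ (A @ x_map - y) + Rinv @ (x_map - xbar)
print(np.linalg.norm(grad))         # essentially zero
```

Scaling R up (a weak prior) drives x_map toward the Gauss–Markov estimate; scaling it down drives x_map toward x̄.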

6.2 Bayesian Inversion in Banach Spaces


This section concerns Bayesian inversion in Banach spaces, and, in particular,
establishing the appropriate rigorous statement of Bayes’ rule in settings where
there is no Lebesgue measure with respect to which we can take densities.
Example 6.5. There are many applications in which it is of interest to deter-
mine the permeability of subsurface rock, e.g. the prediction of transport of
radioactive waste from an underground waste repository, or the optimization
of oil recovery from underground fields. The flow velocity v of a fluid under

pressure p in a medium of permeability K is given by Darcy’s law
v = −K∇p.
The pressure field p within a bounded, open domain Ω ⊂ Rd is governed by the
elliptic PDE
−∇ · (K∇p) = 0 in Ω,
p = h on ∂Ω.
For simplicity, take the permeability tensor field K to be a scalar field k times
the identity tensor; for mathematical and physical reasons, it is important that k
be positive, so write k = eu . The objective is to recover u from, say, observations

of the pressure field at known points x1 , . . . , xm ∈ Ω:


yi = p(xi ) + ηi .
Note that this fits the general ‘y = H(u) + η’ setup, with H being defined
implicitly by the solution operator to the elliptic PDE.
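To make the forward map of Example 6.5 concrete in one dimension (a sketch of ours, with our own discretization choices): on Ω = (0, 1) with boundary values p(0) = 0 and p(1) = 1, the flux q = −e^u p′ is constant, so u ↦ (p(x1), …, p(xm)) can be evaluated cell by cell in closed form.

```python
import numpy as np

def pressure(u, x):
    """1D Darcy forward map: solve -(e^u p')' = 0 on (0,1), p(0)=0, p(1)=1,
    with u piecewise constant on n equal cells; evaluate p at the points x.
    In 1D the flux q = -e^u p' is constant, so p(x) = q * int_0^x e^{-u} ds."""
    n = len(u)
    h = 1.0 / n
    resist = h * np.exp(-u)                   # "resistance" of each cell
    q = 1.0 / resist.sum()                    # flux fixed by p(1) = 1
    p_nodes = np.concatenate([[0.0], np.cumsum(q * resist)])
    grid = np.linspace(0.0, 1.0, n + 1)
    return np.interp(x, grid, p_nodes)        # p is linear within each cell

rng = np.random.default_rng(3)
u_true = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 50))  # hypothetical log-permeability
x_obs = np.array([0.2, 0.4, 0.6, 0.8])
y = pressure(u_true, x_obs) + 0.01 * rng.standard_normal(4)  # noisy observations
print(y)
```

With u ≡ 0 (constant permeability) the map returns the linear profile p(x) = x, a useful sanity check on any discretization.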
In general, let u be a random variable with (prior) distribution µ0 — which
we do not at this stage assume to be Gaussian — on a separable Banach space
X . Suppose that we observe data y ∈ Rm according to (6.1), where η is an
Rm -valued random variable independent of u with probability density ρ with

respect to Lebesgue measure. Let Φ(u; y) be any function that differs from
− log ρ(y − H(u)) by an additive function of y alone, so that
\frac{\rho(y - H(u))}{\rho(y)} \propto \exp(-\Phi(u; y))
with a constant of proportionality independent of u. An informal application
of Bayes’ rule suggests that the posterior probability distribution of u given y,
µy , has Radon–Nikodým derivative with respect to the prior, µ0 , given by
\frac{d\mu^y}{d\mu_0}(u) \propto \exp(-\Phi(u; y)).
The next theorem makes this argument rigorous:

Theorem 6.6 (Generalized Bayes’ rule). Suppose that H : X → Rm is continuous,


and that η is absolutely continuous with support Rm . If u ∼ µ0 , then u|y ∼ µy ,
where µy ≪ µ0 and
\frac{d\mu^y}{d\mu_0}(u) \propto \exp(-\Phi(u; y)).
Lemma 6.7. Let µ, ν be probability measures on S × T , where (S, A) and (T, B)
are measurable spaces. Assume that µ ≪ ν and that dµ/dν = ϕ. Assume further
that the conditional distribution of x|y under ν, denoted by ν y (dx), exists. Then
the conditional distribution of x|y under µ, denoted µy (dx), exists and µy ≪ ν y ,
with Radon–Nikodým derivative given by

\frac{d\mu^y}{d\nu^y}(x) = \begin{cases} \dfrac{\varphi(x, y)}{Z(y)}, & \text{if } Z(y) > 0, \\ 1, & \text{otherwise}, \end{cases}

where Z(y) := \int_S \varphi(x, y) \, d\nu^y(x).
Proof. See Section 10.2 of [25].

Proof of Theorem 6.6. Let Q0 (dy) := ρ(y) dy on Rm and Q(dy|u) := ρ(y − H(u)) dy, so that, by construction,
\frac{dQ}{dQ_0}(y|u) = C(y) \exp(-\Phi(u; y)).
Define measures ν0 and ν on Rm × X by
ν0 (dy, du) := Q0 (dy) ⊗ µ0 (du),
ν(dy, du) := Q(dy|u) µ0 (du).
Note that ν0 is a product measure under which u and y are independent, whereas

ν is not. Since H is continuous, so is Φ; since µ0 (X ) = 1, it follows that Φ is


µ0 -measurable. Therefore, ν is well-defined, ν ≪ ν0 , and

\frac{d\nu}{d\nu_0}(y, u) = C(y) \exp(-\Phi(u; y)).
Note that
\int_X \exp(-\Phi(u; y)) \, d\mu_0(u) = C(y) \int_X \rho(y - H(u)) \, d\mu_0(u) > 0,

since ρ is strictly positive on Rm and H is continuous. Since ν0 (du|y) = µ0 (du),


the result follows from Lemma 6.7.


Proposition 6.8. If the prior µ0 is a Gaussian measure and the potential Φ is
quadratic in u, then, for all y, the posterior µy is Gaussian.
Proof. Exercise 6.1.
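The Radon–Nikodým formula dµ^y/dµ0 ∝ exp(−Φ) also suggests a simple sampling scheme: draw from the prior and reweight. The sketch below (our illustration; all numbers arbitrary) checks this in the scalar conjugate-Gaussian case, where the posterior mean is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(4)

sigma = 0.5              # observational noise standard deviation
y = 1.3                  # observed datum (arbitrary)

def Phi(u):
    """Negative log-likelihood for y = u + eta, eta ~ N(0, sigma^2)."""
    return (y - u) ** 2 / (2.0 * sigma ** 2)

# Self-normalized importance sampling: samples from the prior mu_0 = N(0,1),
# weighted by exp(-Phi), i.e. by the density of mu^y with respect to mu_0.
u = rng.standard_normal(200_000)
w = np.exp(-Phi(u))
post_mean = np.sum(w * u) / np.sum(w)

exact = y / (1.0 + sigma ** 2)   # closed-form posterior mean (Gaussian conjugacy)
print(post_mean, exact)          # the two agree up to Monte Carlo error
```

This is exactly Proposition 6.8 in miniature: Gaussian prior plus quadratic Φ yields a Gaussian posterior, here recovered empirically from reweighted prior samples.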

6.3 Well-Posedness and Approximation


This section concerns the well-posedness of the Bayesian inference problem for
Gaussian priors on Banach spaces. To save space later on, the following will be
taken as our standard assumptions on the negative log-likelihood / potential Φ:

Assumptions on Φ. Assume that Φ : X × Y → R satisfies:


1. For every ε > 0 and r > 0, there exists M = M (ε, r) ∈ R such that, for
all u ∈ X and all y ∈ Y with kykY < r,

Φ(u; y) ≥ M − εkuk2X .

2. For every r > 0, there exists K = K(r) > 0 such that, for all u ∈ X and
all y ∈ Y with kukX , kykY < r,

Φ(u; y) ≤ K.


3. For every r > 0, there exists L = L(r) > 0 such that, for all u1 , u2 ∈ X
and all y ∈ Y with ku1 kX , ku2 kX , kykY < r,

|Φ(u1 ; y) − Φ(u2 ; y)| ≤ L ku1 − u2 kX .

4. For every ε > 0 and r > 0, there exists C = C(ε, r) > 0 such that, for all

u ∈ X and all y1 , y2 ∈ Y with ky1 kY , ky2 kY < r,

|Φ(u; y1 ) − Φ(u; y2 )| ≤ exp(εkuk2X + C) ky1 − y2 kY .
Theorem 6.9. Let Φ satisfy standard assumptions (1), (2), and (3) and assume
that µ0 is a Gaussian probability measure on X . Then, for each y ∈ Y, µy given
by

\frac{d\mu^y}{d\mu_0}(u) = \frac{\exp(-\Phi(u; y))}{Z(y)}, \qquad Z(y) = \int_X \exp(-\Phi(u; y)) \, d\mu_0(u),

is a well-defined probability measure on X .

Proof. Assumption (2) implies that Z(y) is bounded below:


 
Z(y) \geq \int_{\{u \,:\, \|u\|_X \leq r\}} \exp(-K(r)) \, d\mu_0(u) = \exp(-K(r)) \, \mu_0\bigl[\|u\|_X \leq r\bigr] > 0

for r > 0, since µ0 is a strictly positive measure on X . Assumption (3) implies


that Φ is µ0 -measurable, and hence that µy is a well-defined measure. By

assumption (1), for kykY ≤ r and ε sufficiently small,


Z(y) = \int_X \exp(-\Phi(u; y)) \, d\mu_0(u) \leq \int_X \exp\bigl(\varepsilon \|u\|_X^2 - M(\varepsilon, r)\bigr) \, d\mu_0(u) \leq C \exp(-M(\varepsilon, r)) < \infty,

since µ0 is Gaussian and we may choose ε small enough that the Fernique
theorem (Theorem 2.34) applies. Thus, µy can indeed be normalized to be a
probability measure on X .

Definition 6.10. The Hellinger distance between two measures µ and ν is de-
fined in terms of any reference measure ρ with respect to which both µ and ν
are absolutely continuous by:
d_{\mathrm{Hell}}(\mu, \nu) := \sqrt{\frac{1}{2} \int_\Theta \biggl( \sqrt{\frac{d\mu}{d\rho}(\theta)} - \sqrt{\frac{d\nu}{d\rho}(\theta)} \biggr)^{\!2} \, d\rho(\theta)}.

Exercises 6.4, 6.5 and 6.6 establish the major properties of the Hellinger
metric. A particularly useful property is that closeness in the Hellinger metric implies closeness of expected values of polynomially bounded functions: if f : X → E, for some Banach space E, then

\|\mathbb{E}^{\mu}[f] - \mathbb{E}^{\nu}[f]\|_E \leq 2 \sqrt{\mathbb{E}^{\mu}\bigl[\|f\|_E^2\bigr] + \mathbb{E}^{\nu}\bigl[\|f\|_E^2\bigr]} \; d_{\mathrm{Hell}}(\mu, \nu).
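For scalar Gaussians the Hellinger distance has a closed form (the standard Gaussian Bhattacharyya formula), which makes for an easy numerical check of Definition 6.10; this is a sketch of ours, not part of the notes.

```python
import numpy as np

def hellinger_gauss(m1, s1, m2, s2):
    """Closed-form Hellinger distance between N(m1, s1^2) and N(m2, s2^2)."""
    bc = np.sqrt(2.0 * s1 * s2 / (s1**2 + s2**2)) \
         * np.exp(-(m1 - m2) ** 2 / (4.0 * (s1**2 + s2**2)))
    return np.sqrt(1.0 - bc)   # d_Hell^2 = 1 - Bhattacharyya coefficient

# Check against the definition, taking Lebesgue measure as the reference rho.
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 1.5
x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-(x - m1) ** 2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - m2) ** 2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
d_num = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
print(d_num, hellinger_gauss(m1, s1, m2, s2))
```

The quadrature value matches the closed form to many digits, since the integrand is smooth and decays rapidly outside the chosen window.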

The following theorem shows that Bayesian inference with respect to a Gaus-
sian prior measure is well-posed with respect to perturbations of the observed
data y:

Theorem 6.11. Let Φ satisfy the standard assumptions (1), (2) and (4), suppose
that µ0 is a Gaussian probability measure on X , and that µy ≪ µ0 with density
given by the generalized Bayes’ rule for each y ∈ Y. Then there exists a constant
C ≥ 0 such that, for all y, y ′ ∈ Y,

dHell (µy , µy′ ) ≤ Cky − y ′ kY .
Proof. As in the proof of Theorem 6.9, standard assumption (2) gives a lower
bound on Z(y). By standard assumptions (1) and (4) and the Fernique theorem
(Theorem 2.34),
|Z(y) - Z(y')| \leq \int_X \exp\bigl(\varepsilon\|u\|_X^2 - M\bigr) \exp\bigl(\varepsilon\|u\|_X^2 + C\bigr) \|y - y'\|_Y \, d\mu_0(u) \leq C \|y - y'\|_Y.
By the definition of the Hellinger distance,
2 \, d_{\mathrm{Hell}}(\mu^y, \mu^{y'})^2 = \int_X \biggl( \frac{e^{-\Phi(u;y)/2}}{\sqrt{Z(y)}} - \frac{e^{-\Phi(u;y')/2}}{\sqrt{Z(y')}} \biggr)^{\!2} d\mu_0(u) \leq I_1 + I_2
where

I_1 := \frac{2}{Z(y)} \int_X \Bigl( e^{-\Phi(u;y)/2} - e^{-\Phi(u;y')/2} \Bigr)^{2} \, d\mu_0(u), \qquad I_2 := 2 \biggl| \frac{1}{\sqrt{Z(y)}} - \frac{1}{\sqrt{Z(y')}} \biggr|^{2} \int_X e^{-\Phi(u;y')} \, d\mu_0(u).

By standard assumptions (1) and (4) and the Fernique theorem,


I_1 \leq \frac{2}{Z(y)} \int_X \frac{1}{4} \exp\bigl(\varepsilon\|u\|_X^2 - M\bigr) \exp\bigl(2\varepsilon\|u\|_X^2 + 2C\bigr) \|y - y'\|_Y^2 \, d\mu_0(u) \leq C \|y - y'\|_Y^2.

A similar application of standard assumption (1) and the Fernique theorem


shows that the integral in I2 is finite. Also, the lower bound on Z( · ) implies
that
\biggl| \frac{1}{\sqrt{Z(y)}} - \frac{1}{\sqrt{Z(y')}} \biggr|^2 \leq C \max\biggl( \frac{1}{Z(y)^3}, \frac{1}{Z(y')^3} \biggr) |Z(y) - Z(y')|^2 \leq C \|y - y'\|_Y^2.

Combining these facts yields the desired Lipschitz continuity in the Hellinger
metric.


Similarly, the next theorem shows that Bayesian inference with respect to
a Gaussian prior measure is well-posed with respect to approximation of mea-
sures and log-likelihoods. The approximation of Φ by some ΦN typically arises
through the approximation of H by some discretized numerical model H N .
Theorem 6.12. Suppose that the probability measures µ and µN are the posteriors arising from the potentials Φ and ΦN , and are both absolutely continuous with respect to
µ0 , and that Φ, ΦN satisfy the standard assumptions (1) and (2) with constants

uniform in N . Assume also that, for all ε > 0, there exists K = K(ε) > 0 such
that
|Φ(u; y) − ΦN (u; y)| ≤ K exp(εkuk2X ) ψ(N ), (6.2)
where limN →∞ ψ(N ) = 0. Then there is a constant C, independent of N , such
that
dHell (µ, µN ) ≤ Cψ(N ).
Proof. Since y does not appear in this problem, y-dependence will be suppressed
for the duration of this proof. Let Z and Z N denote the appropriate normaliza-
tion constants, as in the proof of Theorem 6.11. By standard assumption (1),
(6.2), and the Fernique theorem,

|Z - Z^N| \leq K \psi(N) \int_X \exp\bigl(\varepsilon\|u\|_X^2 - M\bigr) \exp\bigl(\varepsilon\|u\|_X^2\bigr) \, d\mu_0(u) \leq C \psi(N).

By the definition of the Hellinger distance,


2 \, d_{\mathrm{Hell}}(\mu, \mu^N)^2 = \int_X \biggl( \frac{e^{-\Phi(u)/2}}{\sqrt{Z}} - \frac{e^{-\Phi^N(u)/2}}{\sqrt{Z^N}} \biggr)^{\!2} d\mu_0(u) \leq I_1 + I_2

where
I_1 := \frac{2}{Z} \int_X \Bigl( e^{-\Phi(u)/2} - e^{-\Phi^N(u)/2} \Bigr)^{2} \, d\mu_0(u), \qquad I_2 := 2 \biggl| \frac{1}{\sqrt{Z}} - \frac{1}{\sqrt{Z^N}} \biggr|^{2} \int_X e^{-\Phi^N(u)} \, d\mu_0(u).

By standard assumption (1), (6.2), and the Fernique theorem,


I_1 \leq \frac{K^2}{2Z} \int_X \exp\bigl(3\varepsilon\|u\|_X^2 - M\bigr) \psi(N)^2 \, d\mu_0(u) \leq C \psi(N)^2.

Similarly,
\biggl| \frac{1}{\sqrt{Z}} - \frac{1}{\sqrt{Z^N}} \biggr|^2 \leq C \max\biggl( \frac{1}{Z^3}, \frac{1}{(Z^N)^3} \biggr) |Z - Z^N|^2 \leq C \psi(N)^2.
Combining these facts yields the desired bound on dHell (µ, µN ).
Remark 6.13. Note well that, regardless of the value of the observed data y, the
Bayesian posterior µy is absolutely continuous with respect to the prior µ0 and,
in particular, cannot associate positive posterior probability to any event of prior
probability zero. However, the Feldman–Hájek theorem (Theorem 2.38) says


that it is very difficult for probability measures on infinite-dimensional spaces
to be absolutely continuous with respect to one another. Therefore, the choice
of infinite-dimensional prior µ0 is a very strong modelling assumption that, if
it is ‘wrong’, cannot be ‘corrected’ even by large amounts of data y. In this
sense, it is not reasonable to expect that Bayesian inference on function spaces
should be well-posed with respect to apparently small perturbations of the prior
µ0 , e.g. by a shift of mean that lies outside the Cameron–Martin space, or a

change of covariance arising from a non-unit dilation of the space. Nevertheless,
the infinite-dimensional perspective is not without genuine fruits: in particular,
the well-posedness results (Theorems 6.11 and 6.12) are very important for the
stability of finite-dimensional (discretized) Bayesian problems with respect to
discretization dimension N .

Bibliography
This material is covered in much greater detail in the module MA612 Probability
on Function Spaces and Bayesian Inverse Problems.
Tikhonov regularization was introduced in [105, 106]. An introduction to

the general theory of regularization and its application to inverse problems is


the book of Engl & al. [26]. The book of Tarantola [103] also provides a good
applied introduction to inverse problems. Kaipio & Somersalo [45] provide a
good introduction to the Bayesian approach to inverse problems, especially in
the context of differential equations.
This chapter owes a great deal to the papers of Stuart [98] and Cotter &
al. [20, 21], which set out the common structure of Bayesian inverse prob-
lems on Banach and Hilbert spaces, focussing on Gaussian priors. Stuart’s
article stresses the importance of delaying discretization to the last possible
moment, much as in PDE theory, lest one carelessly end up with a family of

finite-dimensional problems that are individually well-posed but collectively ill-


conditioned as the discretization dimension tends to infinity. Extensions to
Besov priors, which are constructed using wavelet bases of L2 and allow for
non-smooth local features in the random fields, can be found in the articles of
Dashti & al. [22] and Lassas & al. [57].

Exercises
Exercise 6.1. Let µ0 be a Gaussian probability measure on a separable Banach
space X and suppose that the potential Φ(u; y) is quadratic in u. Show that
the posterior µy is also a Gaussian measure on X .

Exercise 6.2. Let Q ∈ Kn×n be a self-adjoint and positive-definite matrix. Show


that

hA, Bi := tr(A∗ QB) for A, B ∈ Kn×m

defines an inner product on the space Kn×m of n × m matrices over K, and hence that A 7→ √(tr(A∗ QA)) is a norm on Kn×m .
Exercise 6.3. Let µ and ν be probability measures on (Θ, F ), both absolutely
continuous with respect to a reference measure ρ. Define the total variation
distance between µ and ν by
 


d_{\mathrm{TV}}(\mu, \nu) := \frac{1}{2} \mathbb{E}^{\rho}\biggl[ \biggl| \frac{d\mu}{d\rho} - \frac{d\nu}{d\rho} \biggr| \biggr] = \frac{1}{2} \int_\Theta \biggl| \frac{d\mu}{d\rho}(\theta) - \frac{d\nu}{d\rho}(\theta) \biggr| \, d\rho(\theta).
Show that dTV is a metric on M1 (Θ, F ), that its values do not depend upon
the choice of ρ, that M1 (Θ, F ) has diameter at most 1, and that, if ν ≪ µ,

then

d_{\mathrm{TV}}(\mu, \nu) = \frac{1}{2} \mathbb{E}^{\mu}\biggl[ \biggl| 1 - \frac{d\nu}{d\mu} \biggr| \biggr] \equiv \frac{1}{2} \int_\Theta \biggl| 1 - \frac{d\nu}{d\mu}(\theta) \biggr| \, d\mu(\theta).
Show also that this definition agrees with the definition given in (5.4).
Exercise 6.4. Let µ and ν be probability measures on (Θ, F ), both absolutely
continuous with respect to a reference measure ρ. Define the Hellinger distance
between µ and ν by
v s 
u s 2
u
u 1  dµ dν 
dHell (µ, ν) := t Eρ −
2 dρ dρ
DR

v
u Z s s 2
u1 dµ dν
t
= (θ) − (θ) dρ(θ).
2 Θ dρ dρ

Show that dHell is a metric on M1 (Θ, F ), that its values do not depend upon
the choice of ρ, that M1 (Θ, F ) has diameter at most 1, and that, if ν ≪ µ,
then
d_{\mathrm{Hell}}(\mu, \nu) = \sqrt{ \frac{1}{2} \mathbb{E}^{\mu}\biggl[ \biggl( 1 - \sqrt{\frac{d\nu}{d\mu}} \biggr)^{\!2} \biggr] } \equiv \sqrt{ \frac{1}{2} \int_\Theta \biggl( 1 - \sqrt{\frac{d\nu}{d\mu}(\theta)} \biggr)^{\!2} \, d\mu(\theta) }.

Exercise 6.5. Show that the total variation and Hellinger distances satisfy, for
all µ and ν,
\frac{1}{\sqrt{2}} \, d_{\mathrm{TV}}(\mu, \nu) \leq d_{\mathrm{Hell}}(\mu, \nu) \leq d_{\mathrm{TV}}(\mu, \nu)^{1/2}.
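These bounds are easy to probe numerically on discrete distributions (our own illustration, taking ρ to be the counting measure on a five-point set):

```python
import numpy as np

rng = np.random.default_rng(6)

# Random pairs of probability vectors on {1,...,5}; rho = counting measure.
for _ in range(200):
    mu = rng.random(5); mu /= mu.sum()
    nu = rng.random(5); nu /= nu.sum()
    d_tv = 0.5 * np.abs(mu - nu).sum()
    d_hell = np.sqrt(0.5 * ((np.sqrt(mu) - np.sqrt(nu)) ** 2).sum())
    assert d_tv / np.sqrt(2.0) <= d_hell + 1e-12   # lower bound
    assert d_hell <= np.sqrt(d_tv) + 1e-12          # upper bound
print("bounds verified on 200 random pairs")
```

Of course, passing on random examples is evidence and not proof; the exercise asks for the general argument.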
Exercise 6.6. Suppose that µ and ν are probability measures on a Banach space
X . Show that, if E is a Banach space and f : X → E has finite second moment
with respect to both µ and ν, then
\|\mathbb{E}^{\mu}[f] - \mathbb{E}^{\nu}[f]\|_E \leq 2 \sqrt{ \mathbb{E}^{\mu}\bigl[\|f\|_E^2\bigr] + \mathbb{E}^{\nu}\bigl[\|f\|_E^2\bigr] } \; d_{\mathrm{Hell}}(\mu, \nu).

Show also that, if E is Hilbert and f has finite fourth moment with respect to
both µ and ν, then
\|\mathbb{E}^{\mu}[f \otimes f] - \mathbb{E}^{\nu}[f \otimes f]\|_E \leq 2 \sqrt{ \mathbb{E}^{\mu}\bigl[\|f\|_E^4\bigr] + \mathbb{E}^{\nu}\bigl[\|f\|_E^4\bigr] } \; d_{\mathrm{Hell}}(\mu, \nu).

Hence show that the differences between the means and covariance operators of
two measures on X are bounded above by the Hellinger distance between the
two measures.
Exercise 6.7. Let Γ ∈ Rq×q be symmetric and positive definite. Suppose that
H : X → Rq satisfies


1. For every ε > 0, there exists M ∈ R such that, for all u ∈ X ,

kH(u)kΓ−1 ≤ exp(εkuk2X + M ).
2. For every r > 0, there exists K > 0 such that, for all u1 , u2 ∈ X with
ku1 kX , ku2 kX < r,
kH(u1 ) − H(u2 )kΓ−1 ≤ K ku1 − u2 kX .

Show that Φ : X × Rq → R defined by

\Phi(u; y) := \tfrac{1}{2} \bigl\langle y - H(u), \, \Gamma^{-1} (y - H(u)) \bigr\rangle
satisfies the standard assumptions.
Exercise 6.8. An exercise in forensic inference:
NEWSFLASH: THE PRESIDENT HAS BEEN SHOT!
While being driven through the streets of the capital in his open-topped limou-
sine, President Marx of Freedonia has been shot by a sniper. To make matters
worse, the bullet appears to have come from a twenty-storey building, on each

floor of which was stationed a single bodyguard of President Marx’s security de-
tail who was meant to protect him. None of the suspects have any obvious marks
of guilt such as one bullet missing from their magazine, gunpowder burns, failed
lie detector tests. . . not even an evil moustache. The soundproofing inside the
building was good enough that none of the security detail can even say whether
the shot came from above or below them. You have been called in as an expert
quantifier of uncertainty to try to infer from which floor the assassin took the
shot, and hence identify the guilty man. You have the following information:
1. At the time of the shot, the President’s limousine was 500m from the

building, travelling at about 20mph.


2. The bullet entered the President’s body at the base of his neck and exited
through the centre of the breastbone. The President is an average-sized
man.
Apply the techniques you have learned in this chapter to make a recommen-
dation as to which floor the shot came from, and hence which bodyguard is
the traitor. Do your own research on rifle muzzle velocities &c., explaining the
assumptions that you make along the way.
Exercise 6.9. Construct a short sequence of Bayesian inferences in which the posterior appears to predict first one outcome and then another. What are the implications for cutting analyses and studies short?



Chapter 7

Filtering and Data Assimilation

It is not bigotry to be certain we are right; but it is bigotry to be unable to imagine how we might possibly have gone wrong.

— G. K. Chesterton, The Catholic Church and Conversion

Data assimilation is the integration of two information sources:


• a mathematical model of a time-dependent physical system, or a numerical
implementation of such a model; and


• a sequence of observations of that system, usually corrupted by some noise.
The objective is to combine these two ingredients to produce a more accurate
estimate of the system’s true state, and hence more accurate predictions of
the system’s future state. Very often, data assimilation is synonymous with
filtering, which incorporates many of the same ideas but arose in the context of
signal processing. An additional component of the data assimilation / filtering
problem is that one typically wants to achieve it in real time: if today is Monday,
then a data assimilation scheme that takes until Friday to produce an accurate

prediction of Tuesday’s weather using Monday’s observations is basically useless.


Data assimilation methods are typically Bayesian, in the sense that the cur-
rent knowledge of the system state can be thought of as a prior, and the incor-
poration of the dynamics and observations as an update/conditioning step that
produces a posterior. Bearing in mind considerations of computational cost and
the imperative for real time data assimilation, there are two key ideas underly-
ing filtering: the first is to build up knowledge about the posterior sequentially,
and hence perhaps more efficiently; the second is to break up the unknown u
and build up knowledge about its constituent parts sequentially, hence reducing
the computational dimension of each sampling problem. Thus, the first idea
means decomposing the data sequentially, while the second means decomposing
the unknown state sequentially.

7.1 State Estimation in Discrete Time


In the Kálmán filter, the probability distributions representing the system state
and various noise terms are described purely in terms of their mean and covari-
ance, so they are effectively being approximated as Gaussians.
For simplicity, the first description of the Kálmán filter will be of a controlled
linear dynamical system that evolves in discrete time steps
t0 < t1 < · · · < tk < . . . .
The state of the system at time tk is a vector xk ∈ Kn , and it evolves from the


state xk−1 ∈ Kn at time tk−1 according to the linear model
xk = Fk xk−1 + Gk uk + wk (7.1)
where, for each time tk ,
• Fk ∈ Kn×n is the state transition model, which is applied to the previous
state xk−1 ∈ Kn ;
• Gk ∈ Kn×p is the control-to-input model, which is applied to the control

vector uk ∈ Kp ;
• wk ∼ N (0, Qk ) is the process noise, a Kn -valued centred Gaussian random
variable with self-adjoint positive-definite covariance matrix Qk ∈ Kn×n .
At time tk an observation yk ∈ Kq of the true state xk is made according to
yk = Hk xk + vk , (7.2)
where
• Hk ∈ Kq×n is the observation operator, which maps the true state space
Kn into the observable space Kq ;
• vk ∼ N (0, Rk ) is the observation noise, a Kq -valued centred Gaussian
random variable with self-adjoint positive-definite covariance matrix Rk ∈ Kq×q .

As an initial condition, the state of the system at time t0 is taken to be x0 =


µ0 + w0 where µ0 ∈ Kn is known and w0 ∼ N (0, Q0 ). All the noise vectors are
assumed to be mutually independent.
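A minimal instance of the model (7.1)–(7.2) can be simulated directly; the constant-velocity tracking example below is our own toy choice, and every matrix in it is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition F_k (position, velocity)
G = np.array([[0.0], [dt]])             # control-to-input G_k (acceleration input)
H = np.array([[1.0, 0.0]])              # observation operator: position only
Q = 0.01 * np.eye(2)                    # process noise covariance Q_k
R = np.array([[0.04]])                  # observation noise covariance R_k

x = np.array([0.0, 1.0])                # initial state x_0
xs, ys = [x], []
for k in range(50):
    u = np.array([0.5])                 # constant control u_k
    w = rng.multivariate_normal(np.zeros(2), Q)   # process noise w_k
    x = F @ x + G @ u + w                         # dynamics (7.1)
    v = rng.multivariate_normal(np.zeros(1), R)   # observation noise v_k
    ys.append(H @ x + v)                          # observation (7.2)
    xs.append(x)
print(len(xs), len(ys))
```

The lists xs and ys generated here are exactly the kind of trajectory and data stream that the estimation schemes of this chapter aim to reconcile.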
As a preliminary to constructing the full Kálmán filter, we consider the problem of estimating the states x0 , . . . , xk given the corresponding controls u1 , . . . , uk and m known observations y1 , . . . , ym , where k ≥ m. In particular, we seek the best linear unbiased estimate of x0 , . . . , xk .
First note that (7.1)–(7.2) is equivalent to the single equation
bk|m = Ak|m zk + ηk|m , (7.3)

where

b_{k|m} := \begin{bmatrix} \mu_0 \\ G_1 u_1 \\ y_1 \\ \vdots \\ G_m u_m \\ y_m \\ G_{m+1} u_{m+1} \\ \vdots \\ G_k u_k \end{bmatrix} \in \mathbb{K}^{n(k+1)+qm}, \qquad z_k := \begin{bmatrix} x_0 \\ \vdots \\ x_k \end{bmatrix}, \qquad \eta_{k|m} := \begin{bmatrix} -w_0 \\ -w_1 \\ +v_1 \\ \vdots \\ -w_m \\ +v_m \\ -w_{m+1} \\ \vdots \\ -w_k \end{bmatrix},

and Ak|m ∈ K(n(k+1)+qm)×n(k+1) is the block bidiagonal matrix

A_{k|m} := \begin{bmatrix}
I \\
-F_1 & I \\
 & H_1 \\
 & -F_2 & I \\
 & & H_2 \\
 & & \ddots & \ddots \\
 & & -F_m & I \\
 & & & H_m \\
 & & & -F_{m+1} & I \\
 & & & & \ddots & \ddots \\
 & & & & & -F_k & I
\end{bmatrix},

with all unmarked blocks equal to zero: each dynamics row carries the blocks [−Fi , I] and each observation row carries the block Hi .

Note that the noise vector ηk|m is Kn(k+1)+qm -valued and has mean zero and block-diagonal positive-definite precision operator (inverse covariance) Wk|m given in block form by

W_{k|m} := \operatorname{diag}\bigl( Q_0^{-1}, Q_1^{-1}, R_1^{-1}, \ldots, Q_m^{-1}, R_m^{-1}, Q_{m+1}^{-1}, \ldots, Q_k^{-1} \bigr).

By the Gauss–Markov theorem (Theorem 6.2), the best linear unbiased estimate
ẑk|m = [x̂0|m , . . . , x̂k|m ]∗ of zk satisfies

\hat{z}_{k|m} \in \operatorname*{arg\,min}_{z_k \in \mathbb{K}^{n(k+1)}} J_{k|m}(z_k), \qquad J_{k|m}(z_k) := \tfrac{1}{2} \bigl\| A_{k|m} z_k - b_{k|m} \bigr\|_{W_{k|m}}^{2}, \tag{7.4}

and by Lemma 4.22 is the solution of the normal equations

A∗k|m Wk|m Ak|m ẑk|m = A∗k|m Wk|m bk|m .

By Exercise 7.1, it follows from the assumptions made above that these normal
equations have a unique solution
\hat{z}_{k|m} = \bigl( A_{k|m}^* W_{k|m} A_{k|m} \bigr)^{-1} A_{k|m}^* W_{k|m} b_{k|m}. \tag{7.5}

By Theorem 6.2 and Remark 6.3, E[ẑk|m ] = zk and the covariance matrix of
the estimate ẑk|m is (A∗k|m Wk|m Ak|m )−1 ; a Bayesian statistician would say that
zk , conditioned upon the control and observation data bk|m , is the Gaussian

random variable with distribution N (ẑk|m , (A∗k|m Wk|m Ak|m )−1 ).
Note that, since Wk|m is block diagonal, Jk|m can be written as
J_{k|m}(z_k) = \tfrac{1}{2} \|x_0 - \mu_0\|_{Q_0^{-1}}^2 + \tfrac{1}{2} \sum_{i=1}^{m} \|y_i - H_i x_i\|_{R_i^{-1}}^2 + \tfrac{1}{2} \sum_{i=1}^{k} \|x_i - F_i x_{i-1} - G_i u_i\|_{Q_i^{-1}}^2.

An expansion of this type will prove very useful in derivation of the linear
Kálmán filter in the next section.
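The assembled system (7.3)–(7.5) can be checked on a tiny scalar example (our own construction): with noise-free data, b = A_{k|m} z exactly, so the weighted least squares estimate recovers the trajectory exactly.

```python
import numpy as np

# Tiny instance of (7.3): scalar state (n = q = 1), k = m = 2.
F1 = F2 = 0.9          # state transition
H1 = H2 = 1.0          # observation operator
G1 = G2 = 0.1          # control-to-input
u1, u2 = 1.0, -1.0     # controls
mu0 = 0.5
Q0, Q1, Q2 = 1.0, 0.2, 0.2     # process noise covariances
R1, R2 = 0.05, 0.05            # observation noise covariances

# Noise-free trajectory and observations, so that b = A z exactly.
x0 = mu0
x1 = F1 * x0 + G1 * u1
x2 = F2 * x1 + G2 * u2
y1, y2 = H1 * x1, H2 * x2

# Rows: prior on x0, then a (dynamics, observation) pair for each time step.
A = np.array([
    [1.0, 0.0, 0.0],
    [-F1, 1.0, 0.0],
    [0.0, H1, 0.0],
    [0.0, -F2, 1.0],
    [0.0, 0.0, H2],
])
b = np.array([mu0, G1 * u1, y1, G2 * u2, y2])
W = np.diag([1/Q0, 1/Q1, 1/R1, 1/Q2, 1/R2])   # block-diagonal precision

z_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)   # normal equations (7.5)
print(z_hat)   # equals (x0, x1, x2), since the data are noise-free
```

Re-running this with noisy b shows the weighting in action: observations with small R pull the estimate harder than those with large R.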

7.2 Linear Kálmán Filter


We now consider the state estimation problem in the common practical situation
that k = m. Why is the state estimate (7.5) not the end of the story? For one
thing, there is an issue of immediacy: one does not want to have to wait for
observation y1000 to come in before estimating states x1 , . . . , x999 as well as
x1000 , in particular because the choice of the control uk+1 typically depends
upon the estimate of xk ; what one wants is to estimate xk upon observing yk .
However, there is also an issue of computational cost, and hence computation
time: the solution of the least squares problem


find \hat{x} = \operatorname*{arg\,min}_{x \in \mathbb{K}^n} \|Ax - b\|_{\mathbb{K}^m}^2

where A ∈ Km×n , at least by direct methods such as solving the normal equa-
tions or QR factorization, requires of the order of mn2 floating-point operations,
as shown in MA398 Matrix Analysis and Algorithms. Hence, calculation of the
state estimate ẑk by direct solution of (7.5) takes of the order of

(n(k + 1) + qm) n2 (k + 1)2

operations. It is clearly impractical to work with a state estimation scheme with


a computational cost that increases cubically with the number of time steps to
be considered. The idea of filtering is to break the state estimation problem
down into a sequence of estimation problems that can be solved with constant
computational cost per time step, as each observation comes in.
The two-step linear Kálmán filter (LKF) is a recursive method for construct-
ing the best linear unbiased estimate x̂k|k (with covariance matrix Pk|k ) of xk
in terms of the previous state estimate x̂k−1|k−1 and the data uk and yk . It
is called the two-step filter because the process of updating the state estimate

(x̂k−1|k−1 , Pk−1|k−1 ) for time tk−1 into the estimate (x̂k|k , Pk|k ) for tk is split
into two steps (which could, of course, be algebraically unified into a single
step):
• the prediction step uses the dynamics alone to update (x̂k−1|k−1 , Pk−1|k−1 )
into (x̂k|k−1 , Pk|k−1 ), an estimate for the state at time tk that does not
use the observation yk ;
• the correction step updates (x̂k|k−1 , Pk|k−1 ) into (x̂k|k , Pk|k ) using the ob-
servation yk .

Initialization. We begin by initializing the state estimate as

(x̂0|0 , P0|0 ) := (µ0 , Q0 ).

Prediction. Write
 
Fk := [ 0 · · · 0 Fk ] ∈ Kn×nk ,

where the Fk block is the block corresponding to xk−1 , so that Fk zk−1 =


Fk xk−1 . A key insight here is to write the cost function Jk|k−1 recursively
as
J_{k|k-1}(z_k) = J_{k-1|k-1}(z_{k-1}) + \tfrac{1}{2} \|x_k - F_k z_{k-1} - G_k u_k\|_{Q_k^{-1}}^2,

in which case the gradient and Hessian of Jk|k−1 are given in block form with
respect to zk = [zk−1 , xk ]∗ as
 
\nabla J_{k|k-1}(z_k) = \begin{bmatrix} \nabla J_{k-1|k-1}(z_{k-1}) + F_k^* Q_k^{-1} (F_k z_{k-1} - x_k + G_k u_k) \\ -Q_k^{-1} (F_k z_{k-1} - x_k + G_k u_k) \end{bmatrix}

and

D^2 J_{k|k-1}(z_k) = \begin{bmatrix} D^2 J_{k-1|k-1}(z_{k-1}) + F_k^* Q_k^{-1} F_k & -F_k^* Q_k^{-1} \\ -Q_k^{-1} F_k & Q_k^{-1} \end{bmatrix},


which, by Exercise 7.2, is positive definite. Hence, by a single iteration of New-
ton’s method with any initial condition zk , the minimizer ẑk|k−1 of Jk|k−1 (zk )
is simply
−1
ẑk|k−1 = zk − D2 Jk|k−1 (zk ) ∇Jk|k−1 (zk ).
Note that ∇Jk−1|k−1 (ẑk−1|k−1 ) = 0 and Fk ẑk−1|k−1 = Fk x̂k−1|k−1 , so clever
initial guess is to take
 

T
ẑk−1|k−1
zk = .
Fk x̂k−1|k−1 + Gk uk

With this initial guess, the gradient becomes ∇Jk|k−1 (zk ) = 0 i.e. the optimal
AF
estimate of xk given y1 , . . . , yk−1 is the bottom row of ẑk|k−1 and — by Re-
mark 6.3 — the covariance matrix Pk|k−1 of this estimate is the bottom-right
−1
block of the inverse Hessian D2 Jk|k−1 (zk ) , calculated using the method of
Schur complements (Exercise 7.3):

x̂k|k−1 = Fk x̂k−1|k−1 + Gk uk , (7.6)


Pk|k−1 = Fk Pk−1|k−1 Fk∗ + Qk . (7.7)
These two updates comprise the prediction step of the Kálmán filter. The cal-
culation of x̂k|k−1 requires Θ(n2 + np) operations, and the calculation of Pk|k−1
requires O(nα ) operations, assuming that matrix-matrix multiplication for n×n
matrices can be effected in O(nα ) operations for some 2 ≤ α ≤ 3.

Correction. The next step is a correction step (or update step) that corrects the
prior estimate–covariance pair (x̂k|k−1 , Pk|k−1 ) to a posterior estimate–covariance
pair (x̂k|k , Pk|k ) given the observation yk . Write

H̄k := [0 · · · 0 Hk ] ∈ K^{q×n(k+1)},

where the Hk block is the block corresponding to xk , so that H̄k zk = Hk xk .
Again, the cost function is written recursively:

Jk|k (zk ) = Jk|k−1 (zk ) + (1/2)‖yk − H̄k zk ‖²_{Rk⁻¹}.

The gradient and Hessian are

∇Jk|k (zk ) = ∇Jk|k−1 (zk ) + H̄k∗ Rk⁻¹ (H̄k zk − yk )
            = ∇Jk|k−1 (zk ) + H̄k∗ Rk⁻¹ (Hk xk − yk )

and

D²Jk|k (zk ) = D²Jk|k−1 (zk ) + H̄k∗ Rk⁻¹ H̄k .

We now take zk = ẑk|k−1 as a clever initial guess for a single Newton iteration,
so that the gradient becomes

∇Jk|k (ẑk|k−1 ) = ∇Jk|k−1 (ẑk|k−1 ) + H̄k∗ Rk⁻¹ (H̄k ẑk|k−1 − yk ),

in which the first term vanishes. The posterior estimate x̂k|k is now obtained as
the bottom row of the Newton update, i.e.

x̂k|k = x̂k|k−1 − Pk|k Hk∗ Rk⁻¹ (Hk x̂k|k−1 − yk ),                    (7.8)

where the posterior covariance Pk|k is obtained as the bottom-right block of the
inverse Hessian (D²Jk|k (zk ))⁻¹ by Schur complementation:

Pk|k = (Pk|k−1⁻¹ + Hk∗ Rk⁻¹ Hk )⁻¹.                                  (7.9)

Determination of the computational costs of these two steps is left as an exercise
(Exercise 7.6).
In many presentations of the Kálmán filter, the correction step is phrased in
terms of the Kálmán gain Kk ∈ K^{n×q}:

Kk := Pk|k−1 Hk∗ (Hk Pk|k−1 Hk∗ + Rk )⁻¹,                            (7.10)

so that

x̂k|k = x̂k|k−1 + Kk (yk − Hk x̂k|k−1 ),                              (7.11)
Pk|k = (I − Kk Hk )Pk|k−1 .                                          (7.12)

It is also common to refer to

ỹk := yk − Hk x̂k|k−1

as the innovation residual and

Sk := Hk Pk|k−1 Hk∗ + Rk

as the innovation covariance, so that Kk = Pk|k−1 Hk∗ Sk⁻¹ and x̂k|k = x̂k|k−1 +
Kk ỹk . It is an exercise in algebra to show that the first presentation of the
correction step (7.8)–(7.9) and the Kálmán gain formulation (7.10)–(7.12) are
the same.
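The cycle above is easy to sketch in code. The following is a minimal NumPy sketch of one predict/correct step in the Kálmán-gain form (7.6)–(7.7), (7.10)–(7.12); the constant-velocity model at the bottom is a hypothetical toy example, not one taken from the text.

```python
import numpy as np

def lkf_step(x_hat, P, y, F, G, u, H, Q, R):
    """One predict/correct cycle of the linear Kalman filter (Kalman-gain form)."""
    # Prediction (7.6)-(7.7): propagate estimate and covariance through the dynamics.
    x_pred = F @ x_hat + G @ u
    P_pred = F @ P @ F.T + Q
    # Correction (7.10)-(7.12): assimilate the observation y.
    S = H @ P_pred @ H.T + R                        # innovation covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain (7.10)
    x_new = x_pred + K @ (y - H @ x_pred)           # (7.11)
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred   # (7.12)
    return x_new, P_new

# Hypothetical constant-velocity model: state (position, velocity), observe position.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
G = np.zeros((2, 1)); u = np.zeros(1)
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.25]])
x_hat, P = np.zeros(2), np.eye(2)
x_hat, P = lkf_step(x_hat, P, np.array([1.1]), F, G, u, H, Q, R)
```

In production code one would solve with Sk rather than invert it explicitly (see Remark 7.1 below).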

The Kálmán filter can also be formulated in continuous time, or in a hybrid
form with continuous evolution but discrete observations. For example, the
hybrid Kálmán filter has the evolution and observation equations

ẋ(t) = F (t)x(t) + G(t)u(t) + w(t),
yk = Hk xk + vk ,

where xk := x(tk ). The prediction equations are that x̂k|k−1 is the solution at
time tk of the initial value problem

dx̂(t)/dt = F (t)x̂(t) + G(t)u(t),
x̂(tk−1 ) = x̂k−1|k−1 ,

and that Pk|k−1 is the solution at time tk of the initial value problem

Ṗ (t) = F (t)P (t) + P (t)F (t)∗ + Q(t),
P (tk−1 ) = Pk−1|k−1 .

The correction equations (in Kálmán gain form) are as before:

Kk = Pk|k−1 Hk∗ (Hk Pk|k−1 Hk∗ + Rk )⁻¹,
x̂k|k = x̂k|k−1 + Kk (yk − Hk x̂k|k−1 ),
Pk|k = (I − Kk Hk )Pk|k−1 .

The LKF with continuous time evolution and observation is known as the
Kálmán–Bucy filter. The evolution and observation equations are

ẋ(t) = F (t)x(t) + G(t)u(t) + w(t),
y(t) = H(t)x(t) + v(t).

Notably, in the Kálmán–Bucy filter, the distinction between prediction and
correction does not exist: the state estimate and its covariance evolve according
to

dx̂(t)/dt = F (t)x̂(t) + G(t)u(t) + K(t)(y(t) − H(t)x̂(t)),
Ṗ (t) = F (t)P (t) + P (t)F (t)∗ + Q(t) − K(t)R(t)K(t)∗ ,

where

K(t) := P (t)H(t)∗ R(t)⁻¹.

7.3 Extended Kálmán Filter



The extended Kálmán filter (EKF) is an extension of the Kálmán filter to nonlin-
ear dynamical systems. In discrete time, the evolution and observation equations
are

xk = fk (xk−1 , uk ) + wk ,
yk = hk (xk ) + vk ,

where, as before, xk ∈ Kn are the states, uk ∈ Kp are the controls, yk ∈ Kq
are the observations, fk : Kn × Kp → Kn are the vector fields for the dynamics,
hk : Kn → Kq are the observation maps, and the noise processes wk and vk
are uncorrelated with zero mean and positive-definite covariances Qk and Rk
respectively.
The classical derivation of the EKF is to approximate the nonlinear evolution–
observation equations with a linear system and then use the LKF on that linear
system. In contrast to the LKF, the EKF is neither the unbiased minimum
mean-squared error estimator nor the minimum variance unbiased estimator of
the state; in fact, the EKF is generally biased. However, the EKF is the best
linear unbiased estimator of the linearized dynamical system, which can often
be a good approximation of the nonlinear system. As a result, how well the
local linear dynamics match the nonlinear dynamics determines in large part
how well the EKF will perform.

The approximate linearized system is obtained by first-order Taylor expansion
of fk about the previous estimated state x̂k−1|k−1 and of hk about x̂k|k−1 :

xk = fk (x̂k−1|k−1 , uk ) + Dfk (x̂k−1|k−1 , uk )(xk−1 − x̂k−1|k−1 ) + wk ,
yk = hk (x̂k|k−1 ) + Dhk (x̂k|k−1 )(xk − x̂k|k−1 ) + vk .
Taking

Fk := Dfk (x̂k−1|k−1 , uk ),
Hk := Dhk (x̂k|k−1 ),
ũk := fk (x̂k−1|k−1 , uk ) − Fk x̂k−1|k−1 ,
zk := hk (x̂k|k−1 ) − Hk x̂k|k−1 ,

the linearized system is

xk = Fk xk−1 + ũk + wk ,
yk = Hk xk + zk + vk .

We now apply the standard LKF to this system, treating ũk as the controls for
the linear system and yk − zk as the observations, to obtain

x̂k|k−1 = fk (x̂k−1|k−1 , uk ),                                       (7.13)
Pk|k−1 = Fk Pk−1|k−1 Fk∗ + Qk ,                                      (7.14)
Pk|k = (Pk|k−1⁻¹ + Hk∗ Rk⁻¹ Hk )⁻¹,                                  (7.15)
x̂k|k = x̂k|k−1 − Pk|k Hk∗ Rk⁻¹ (hk (x̂k|k−1 ) − yk ).                (7.16)
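Written in the algebraically equivalent Kálmán-gain form (cf. Exercise 7.8), one EKF cycle can be sketched as follows; the mildly nonlinear dynamics, observation map and Jacobians in the toy example are hypothetical.

```python
import numpy as np

def ekf_step(x_hat, P, y, f, Df, h, Dh, Q, R):
    """One EKF cycle (7.13)-(7.16): linearize f at the previous estimate and
    h at the predicted estimate, then apply the LKF in Kalman-gain form."""
    F = Df(x_hat)                  # Jacobian of the dynamics
    x_pred = f(x_hat)              # (7.13)
    P_pred = F @ P @ F.T + Q       # (7.14)
    H = Dh(x_pred)                 # Jacobian of the observation map
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical example: linear-in-state dynamics, quadratic observation map.
f = lambda x: np.array([x[0] + 0.1 * x[1], 0.9 * x[1]])
Df = lambda x: np.array([[1.0, 0.1], [0.0, 0.9]])
h = lambda x: np.array([x[0] ** 2])
Dh = lambda x: np.array([[2.0 * x[0], 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
x_hat, P = np.array([1.0, 0.0]), np.eye(2)
x_hat, P = ekf_step(x_hat, P, np.array([1.2]), f, Df, h, Dh, Q, R)
```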

7.4 Ensemble Kálmán Filter



The ensemble Kálmán filter (EnKF) is a Monte Carlo approximation of the
Kálmán filter that avoids evolving the covariance matrix of the state vector
x ∈ Kn . Instead, the EnKF uses an ensemble of N states

X = [x(1) , . . . , x(N ) ].

The columns of the matrix X ∈ Kn×N are the ensemble members.

Initialization. The ensemble is initialized by choosing the columns of X̂0|0 to
be N independent draws from, say, N (µ0 , Q0 ). The ensemble members are not
generally independent after the initial ensemble, since every EnKF step ties
them together; nevertheless, all the calculations proceed as if they were
independent.

Prediction. The prediction step of the EnKF is straightforward: each column
x̂(i)k−1|k−1 is evolved to x̂(i)k|k−1 using the LKF prediction step (7.6),

x̂(i)k|k−1 = Fk x̂(i)k−1|k−1 + Gk uk ,

or the EKF prediction step (7.13),

x̂(i)k|k−1 = fk (x̂(i)k−1|k−1 , uk ).

Correction. The correction step for the EnKF uses a trick called data replication:
the observed data yk = Hk xk + vk is replicated into a q × N matrix

D = [d(1) , . . . , d(N ) ],     d(i) := yk + ηi ,     ηi ∼ N (0, Rk ),

so that each column d(i) consists of the observed data vector yk plus an independent
random draw from N (0, Rk ). If the columns of X̂k|k−1 are a sample from
the prior distribution, then the columns of

X̂k|k−1 + Kk (D − Hk X̂k|k−1 )

form a sample from the posterior probability distribution, in the sense of a
Bayesian prior (before data) and posterior (conditioned upon the data). The
EnKF approximates this sample by replacing the exact Kálmán gain matrix
(7.10),

Kk := Pk|k−1 Hk∗ (Hk Pk|k−1 Hk∗ + Rk )⁻¹,

which involves the covariance matrix Pk|k−1 that is not tracked in the EnKF,
with a gain built from an empirical covariance matrix. The empirical mean and
empirical covariance of the forecast ensemble X = X̂k|k−1 are

⟨X⟩ := (1/N) Σ_{i=1}^{N} x(i) ,     Ck|k−1 := (X − ⟨X⟩)(X − ⟨X⟩)∗ / (N − 1),

where X − ⟨X⟩ denotes the matrix whose columns are x(i) − ⟨X⟩.

The Kálmán gain matrix for the EnKF uses Ck|k−1 in place of Pk|k−1 :

K̃k := Ck|k−1 Hk∗ (Hk Ck|k−1 Hk∗ + Rk )⁻¹,                           (7.17)

so that the correction step becomes

X̂k|k := X̂k|k−1 + K̃k (D − Hk X̂k|k−1 ).                             (7.18)

One can also use sampling to dispense with Rk , and instead use the empirical
covariance of the replicated data,

(D − ⟨D⟩)(D − ⟨D⟩)∗ / (N − 1).

Note, however, that the empirical covariance matrix is typically rank-deficient
(in practical applications, there are usually many more state variables than
ensemble members), in which case the matrix inverse in (7.17) may fail to exist;
in such situations, a pseudo-inverse may be used.
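A minimal sketch of the EnKF correction step (7.17)–(7.18), assuming the forecast ensemble has already been produced by a prediction step; the dimensions, observation operator and random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def enkf_correction(X_pred, y, H, R):
    """EnKF correction (7.17)-(7.18): replicate the data with independent
    N(0, R) perturbations and use the empirical covariance of the forecast
    ensemble in place of the exact covariance P_{k|k-1}."""
    n, N = X_pred.shape
    Xbar = X_pred.mean(axis=1, keepdims=True)
    C = (X_pred - Xbar) @ (X_pred - Xbar).T / (N - 1)   # empirical covariance
    D = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=N).T
    K = C @ H.T @ np.linalg.inv(H @ C @ H.T + R)        # approximate gain (7.17)
    return X_pred + K @ (D - H @ X_pred)                # (7.18)

# Toy setup: n = 2 states, observe the first component, N = 50 members.
H = np.array([[1.0, 0.0]])
R = np.array([[0.1]])
X_pred = rng.normal(size=(2, 50)) + np.array([[2.0], [0.0]])
X_post = enkf_correction(X_pred, np.array([2.5]), H, R)
```

After the correction, the spread of the observed component shrinks, as expected of conditioning on data.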

Remark 7.1. Even when the matrices involved are positive-definite, instead
of computing the inverse of a matrix and multiplying by it, it is much better
(several times cheaper and also more accurate) to compute the Cholesky decom-
position of the matrix and treat the multiplication by the inverse as solution
of a system of linear equations (cf. MA398 Matrix Analysis and Algorithms).
This is a general point relevant to the implementation of all KF-like methods.
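A small NumPy illustration of this point: treat S⁻¹B as a linear solve via the Cholesky factor rather than forming the explicit inverse. The matrix S below is an arbitrary synthetic positive-definite matrix standing in for, say, an innovation covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 200))
S = A @ A.T + 200.0 * np.eye(200)    # self-adjoint positive definite
B = rng.normal(size=(200, 5))

# Preferred: factorize once and treat S^{-1} B as two triangular solves.
L = np.linalg.cholesky(S)                               # S = L L*
X_solve = np.linalg.solve(L.T, np.linalg.solve(L, B))

# Discouraged: form the explicit inverse and multiply by it.
X_inv = np.linalg.inv(S) @ B
```

With SciPy available, `scipy.linalg.cho_factor` and `cho_solve` perform the two triangular solves while exploiting the triangular structure directly; `np.linalg.solve` is used here only to keep the sketch NumPy-only.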

7.5 Eulerian and Lagrangian Data Assimilation


Systems involving some kind of fluid flow are an important application area for
data assimilation. It is interesting to consider two classes of observations of such
systems, namely Eulerian and Lagrangian observations: Eulerian observations
are observations at fixed points in space, whereas Lagrangian observations are
observations at points that are carried along by the flow field. For example,
a fixed weather station on the ground with a thermometer, a barometer, an
anemometer &c. would report Eulerian observations of temperature, pressure
and wind speed. By contrast, an unpowered float set adrift on the ocean currents


would report Lagrangian data about temperature, salinity &c. at its current
(evolving) position.
Consider for example the incompressible Stokes (ι = 0) or Navier–Stokes
(ι = 1) equations on the unit square with periodic boundary conditions, thought
of as the two-dimensional torus T², starting at time t = 0:

∂u/∂t + ι u · ∇u = ν∆u − ∇p + f     on T² × [0, ∞),
∇ · u = 0                            on T² × [0, ∞),
u = u0                               on T² × {0}.

Here u, f : T² × [0, ∞) → R² are the velocity field and forcing term respectively,
p : T² × [0, ∞) → R is the pressure field, u0 : T² → R² is the initial value of the
velocity field, and ν ≥ 0 is the viscosity of the fluid.
Eulerian observations of this system might take the form of noisy observations
yj,k of the velocity field at fixed points zj ∈ T², j = 1, . . . , J, at an
increasing sequence of discrete times tk ≥ 0, k ∈ N, i.e.

yj,k = u(zj , tk ) + ηj,k .

On the other hand, Lagrangian observations might take the form of noisy observations
yj,k of the locations zj (tk ) ∈ T² at time tk of J passive tracers that
start at position zj,0 ∈ T² at time t = 0 and are carried along with the flow
thereafter, i.e.

zj (t) = zj,0 + ∫₀ᵗ u(zj (s), s) ds,
yj,k = zj (tk ) + ηj,k .
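Synthetic Lagrangian observations of this kind can be generated by integrating the tracer ODE numerically. The sketch below substitutes a hypothetical steady, divergence-free cellular velocity field on the torus for an actual Stokes/Navier–Stokes solution, and uses classical RK4 time stepping.

```python
import numpy as np

def u(z):
    """Hypothetical steady, divergence-free velocity field on the torus [0,1)^2."""
    x, y = z
    return np.array([-np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y),
                      np.cos(2 * np.pi * x) * np.sin(2 * np.pi * y)])

def tracer_path(z0, dt, n_steps):
    """Integrate dz/dt = u(z) with RK4, wrapping positions back onto the torus."""
    z = np.array(z0, dtype=float)
    path = [z.copy()]
    for _ in range(n_steps):
        k1 = u(z)
        k2 = u(z + 0.5 * dt * k1)
        k3 = u(z + 0.5 * dt * k2)
        k4 = u(z + dt * k3)
        z = (z + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0) % 1.0
        path.append(z.copy())
    return np.array(path)

rng = np.random.default_rng(2)
path = tracer_path([0.3, 0.2], dt=0.01, n_steps=500)
sigma = 0.01
y_obs = path[::50] + sigma * rng.normal(size=path[::50].shape)  # noisy observations
```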

Bibliography
The description given of the Kálmán filter, particularly in terms of Newton’s
method applied to the quadratic objective function J, follows that of Humpherys
& al. [41]. The remarks about Eulerian versus Lagrangian data assimilation
borrow from §3.6 of Stuart [98].
The original presentation of the Kálmán [46] and Kálmán–Bucy filters [47]
was in the context of signal processing, and encountered some initial resistance
from the engineering community, as related in the article of Humpherys & al.
Filtering is now fully accepted in applications communities and has a sound
algorithmic and theoretical base; for a stochastic processes point of view on
filtering, see e.g. the books of Jazwinski [43] and Øksendal [73] (§6.1).

Exercises
Exercise 7.1. Verify that the normal equations for the state estimation problem
(7.4) have a unique solution.
Exercise 7.2. Suppose that A ∈ Kn×n and C ∈ Km×m are self-adjoint and
positive definite, B ∈ Km×n , and D ∈ Km×m is self-adjoint and positive semi-definite.
Then

[ A + B∗CB    −B∗C  ]
[ −CB         C + D ]

is self-adjoint and positive-definite.
Exercise 7.3 (Schur complements). Let

M = [ A  B ]
    [ C  D ]

be a square matrix with A, D, A − BD⁻¹C and D − CA⁻¹B all non-singular.
Then

M⁻¹ = [ A⁻¹ + A⁻¹B(D − CA⁻¹B)⁻¹CA⁻¹     −A⁻¹B(D − CA⁻¹B)⁻¹ ]
      [ −(D − CA⁻¹B)⁻¹CA⁻¹              (D − CA⁻¹B)⁻¹      ]

and

M⁻¹ = [ (A − BD⁻¹C)⁻¹                   −(A − BD⁻¹C)⁻¹BD⁻¹          ]
      [ −D⁻¹C(A − BD⁻¹C)⁻¹              D⁻¹ + D⁻¹C(A − BD⁻¹C)⁻¹BD⁻¹ ].
Exercise 7.4. Schur complementation is often stated in the more restrictive
setting of self-adjoint positive-definite matrices, in which it has a natural
interpretation in terms of the conditioning of Gaussian random variables. Let
(X, Y ) ∼ N (m, C) be jointly Gaussian, where, in block form,

m = [ m1 ]      C = [ C11   C12 ]
    [ m2 ],         [ C12∗  C22 ],

and C is self-adjoint and positive definite. Show:
1. C11 and C22 are necessarily self-adjoint and positive-definite matrices.
2. With the Schur complement defined by S := C11 − C12 C22⁻¹ C12∗ , S is self-adjoint
and positive definite, and

C⁻¹ = [ S⁻¹                −S⁻¹ C12 C22⁻¹                     ]
      [ −C22⁻¹ C12∗ S⁻¹    C22⁻¹ + C22⁻¹ C12∗ S⁻¹ C12 C22⁻¹   ].

3. The conditional distribution of X given that Y = y is Gaussian:

(X|Y = y) ∼ N (m1 + C12 C22⁻¹ (y − m2 ), S).

Sketch the PDF of a Gaussian random variable in R² to further convince
yourself of this result.
Exercise 7.5 (Sherman–Morrison–Woodbury formula). Let A and D be invertible
matrices and let B and C be such that A + BD−1 C and D + CA−1 B are non-
singular. Then
(A + BD−1 C)−1 = A−1 − A−1 B(D + CA−1 B)−1 CA−1 .
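A quick numerical sanity check of this identity (the sizes, diagonal shifts and scalings below are arbitrary choices made to keep all four matrices comfortably invertible):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # diagonally shifted, so invertible
D = rng.normal(size=(m, m)) + m * np.eye(m)
B = 0.3 * rng.normal(size=(n, m))
C = 0.3 * rng.normal(size=(m, n))

Ai = np.linalg.inv(A)
# Left- and right-hand sides of the Sherman-Morrison-Woodbury formula.
lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)
rhs = Ai - Ai @ B @ np.linalg.inv(D + C @ Ai @ B) @ C @ Ai
```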

Exercise 7.6. Determine the asymptotic computational costs of the correction
steps (7.8) and (7.9) of the LKF, and hence the asymptotic computational cost
per iteration of the LKF.
Exercise 7.7 (Fading memory). In the LKF, the current state variable is updated
as the latest inputs and measurements become known, but the estimation
is based on the least squares solution of all the previous states where all measurements
are weighted according to their covariance. One can also use an estimator
that discounts the error in older measurements, leading to a greater emphasis
on recent observations, which is particularly useful in situations where there is
some modeling error in the system.

To do this, consider the objective function

J^{(λ)}_{k|k}(zk ) := (λᵏ/2)‖x0 − µ0 ‖²_{Q0⁻¹} + (1/2) Σ_{i=1}^{k} λ^{k−i} ‖yi − Hi xi ‖²_{Ri⁻¹}
                      + (1/2) Σ_{i=1}^{k} λ^{k−i} ‖xi − Fi xi−1 − Gi ui ‖²_{Qi⁻¹},

where the parameter λ ∈ [0, 1] is called the forgetting factor; note that the standard
LKF is the case λ = 1, and the objective function increasingly relies upon
recent measurements as λ → 0. Find a recursive expression for the objective
function J^{(λ)}_{k|k} and follow the steps in the derivation of the usual LKF to derive
the LKF with fading memory λ.
Exercise 7.8. Write the prediction and correction equations (7.13)–(7.16) for
the EKF in terms of the Kálmán gain matrix.
Exercise 7.9. Suppose that a fuel tank of an aircraft is (when the aircraft is
level) the cuboid Ω := [a1 , b1 ] × [a2 , b2 ] × [a3 , b3 ]. Assume that at some time t,
the aircraft is flying such that
• the originally upward-pointing [0, 0, 1]∗ vector of the plane, and hence of the
tank, is now ν(t) ∈ R³ ;
• the fuel in the tank is in static equilibrium;
• fuel probes inside the tank provide noisy measurements of the fuel depth
at the four corners of the tank: specifically, if [ai , bi , zi ]∗ is the intersection
of the fuel surface with the boundary of the tank at the corner [ai , bi ]∗ ,
assume that you are told ζi = zi + N (0, σ²), independently for each fuel
probe.

Using this information:


1. Assuming first that σ 2 = 0, calculate the volume of fuel in the tank.
2. Assuming now that σ 2 > 0, estimate the volume of fuel in the tank.
3. Explain how to estimate the volume of fuel remaining in the tank as a
function of time. What other information other than the probe measure-
ments might you use, and why?
4. Generalize your results to more general fuel tank and probe geometries,
and more general observational noise.
Exercise 7.10. Consider the problem of using the LKF to estimate the position
and velocity of a projectile given a few noisy measurements of its position. In
fact, the LKF not only provides a relatively smooth profile of the projectile’s

trajectory as it passes by a radar sensor, but also effectively predicts the point
of impact as well as the point of origin — so that troops on the ground can both
duck for cover and return fire before the projectile lands.
The state of the projectile is. . .



Chapter 8


Orthogonal Polynomials

Although our intellect always longs for clarity and certainty, our nature often
finds uncertainty fascinating.

On War
Karl von Clausewitz
Orthogonal polynomials are an important example of orthogonal decompo-
sitions of Hilbert spaces. They are also of great practical importance: they play
a central role in numerical integration using quadrature rules (Chapter 9) and
approximation theory; in the context of UQ, they are also a foundational tool
in polynomial chaos expansions (Chapter 11).
For the rest of this chapter, N = N0 or {0, 1, . . . , N } for some N ∈ N0 .

8.1 Basic Definitions and Properties


Recall that a polynomial of degree n in a single indeterminate x is an expression
of the form
p(x) = cn xn + cn−1 xn−1 + · · · + c1 x + c0 ,
where the coefficients ci are scalars drawn from some field K, and cn ≠ 0. If
cn = 1, then p is said to be monic. The space of all polynomials in x is denoted
K[x], and the space of polynomials of degree at most n is denoted K≤n [x].

Definition 8.1. Let µ be a non-negative measure on R. A family of polynomials
Q = {qn | n ∈ N } is called an orthogonal system of polynomials if qn is of degree
n and

⟨qm , qn ⟩L²(µ) := ∫_R qm (x)qn (x) dµ(x) = γn δmn     for m, n ∈ N .

The constants

γn = ‖qn ‖²L²(µ) = ∫_R qn² dµ

are required to be strictly positive and are called the normalization constants
of the system Q. If γn = 1 for all n ∈ N , then Q is an orthonormal system.

In other words, a system of orthogonal polynomials is nothing but a collection
of orthogonal elements of the Hilbert space L²(R, µ) that happen to be
polynomials. Note that, given µ, orthogonal (resp. orthonormal) polynomials
for µ can be found inductively by using the Gram–Schmidt orthogonalization
(resp. orthonormalization) procedure on the monomials 1, x, x², . . . .
Example 8.2. 1. The Legendre polynomials Len are the orthogonal polynomials
for uniform measure on [−1, 1]:

∫₋₁¹ Lem (x)Len (x) dx = (2/(2n + 1)) δmn .

2. The Hermite polynomials Hen are the orthogonal polynomials for standard
Gaussian measure (2π)^{−1/2} e^{−x²/2} dx on the real line:

∫₋∞^∞ Hem (x)Hen (x) (e^{−x²/2}/√(2π)) dx = n! δmn .

The first few Legendre and Hermite polynomials are given in Table 8.1
and illustrated in Figures 8.1 and 8.2.
3. See Table 8.2 for a summary of other classical systems of orthogonal polynomials
corresponding to various probability measures on subsets of the
real line.
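The Hermite orthogonality relation above can be checked numerically with Gauss quadrature. The sketch below uses NumPy's probabilists' Hermite module, whose quadrature rule is for the weight e^{−x²/2}; dividing by √(2π) converts to the standard Gaussian probability measure.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials He_n

x, w = He.hermegauss(20)   # Gauss nodes/weights for the weight exp(-x^2/2) on R

def he(n, x):
    """Evaluate He_n at x via its coefficient vector."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return He.hermeval(x, c)

# Gram matrix of He_0,...,He_4 under the standard Gaussian probability measure.
gram = np.array([[np.dot(w, he(m, x) * he(n, x)) / math.sqrt(2 * math.pi)
                  for n in range(5)] for m in range(5)])
expected = np.diag([float(math.factorial(n)) for n in range(5)])
```

A 20-point rule integrates polynomials up to degree 39 exactly, so the products of degree at most 8 here are computed to machine precision.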
AF
 Remark 8.3. Many sources, typically physicists' texts, use the weight function
e^{−x²} dx instead of probabilists' preferred (2π)^{−1/2} e^{−x²/2} dx or e^{−x²/2} dx for
the Hermite polynomials. Changing from one normalization to the other is
of course not difficult, but special care must be exercised in practice to see
which normalization a source is using, especially when relying on third-party
software packages: for example, the GAUSSQ Gaussian quadrature package from
http://netlib.org/ uses the e^{−x²} dx normalization. To convert integrals with
respect to one Gaussian measure to integrals with respect to another (and hence
get the right answers for Gauss–Hermite quadrature), use the following change-of-variables
formula:

(1/√(2π)) ∫_R f (x) e^{−x²/2} dx = (1/√π) ∫_R f (√2 x) e^{−x²} dx.

It follows from this that conversion between the physicists' and probabilists'
Gauss–Hermite quadrature formulæ is achieved by

wi^prob = wi^phys / √π ,     xi^prob = √2 xi^phys .

Lemma 8.4. The L²(R, µ) inner product is positive definite on R≤d [x] if and
only if the Hankel determinant det(Hn ) is strictly positive for n = 1, . . . , d + 1,
where

      [ m0     m1    · · ·  mn−1  ]
      [ m1     m2    · · ·  mn    ]
Hn := [ ..     ..    ..     ..    ] ,     mn := ∫_R xⁿ dµ(x).
      [ mn−1   mn    · · ·  m2n−2 ]

Hence, the L²(R, µ) inner product is positive definite on R[x] if and only if
det(Hn ) > 0 for all n ∈ N.

n    Len                         Hen
0    1                           1
1    x                           x
2    (3x² − 1)/2                 x² − 1
3    (5x³ − 3x)/2                x³ − 3x
4    (35x⁴ − 30x² + 3)/8         x⁴ − 6x² + 3
5    (63x⁵ − 70x³ + 15x)/8       x⁵ − 10x³ + 15x

Table 8.1: The first few Legendre polynomials Len , which are the orthogonal
polynomials for uniform measure dx on [−1, 1], and Hermite polynomials
Hen , which are the orthogonal polynomials for standard Gaussian measure
(2π)^{−1/2} e^{−x²/2} dx on R.

Figure 8.1: The Legendre polynomials Le0 (red), Le1 (orange), Le2 (yellow),
Le3 (green), Le4 (blue) and Le5 (purple) on [−1, 1].

Figure 8.2: The Hermite polynomials He0 (red), He1 (orange), He2 (yellow),
He3 (green), He4 (blue) and He5 (purple) on R.

Distribution of ξ              Polynomials qk (ξ)    Support

Continuous:
  Gaussian                     Hermite               R
  Gamma                        Laguerre              [0, ∞)
  Beta                         Jacobi                [a, b]
  Uniform                      Legendre              [a, b]

Discrete:
  Poisson                      Charlier              N0
  Binomial                     Krawtchouk            {0, 1, . . . , n}
  Negative Binomial            Meixner               N0
  Hypergeometric               Hahn                  {0, 1, . . . , n}

Table 8.2: Families of probability distributions and the corresponding families
of orthogonal polynomials.

Proof. Let

p(x) := cd xᵈ + · · · + c1 x + c0

be any polynomial of degree at most d. Note that

‖p‖²L²(R,µ) = ∫_R Σ_{k,ℓ=0}^{d} ck cℓ x^{k+ℓ} dµ(x) = Σ_{k,ℓ=0}^{d} ck cℓ mk+ℓ ,

and so ‖p‖²L²(R,µ) > 0 for every non-zero p of degree at most d if and only if
Hd+1 is positive definite. This, in turn, is equivalent to having det(Hn ) > 0 for
n = 1, 2, . . . , d + 1.
equivalent to having det(Hn ) > 0 for n = 1, 2, . . . , d + 1.
Theorem 8.5. If the L²(R, µ) inner product is positive definite on R[x], then
there exists an infinite sequence of orthogonal polynomials for µ.

Proof. Apply the Gram–Schmidt procedure to the monomials xⁿ, n ∈ N0 . That
is, take q0 (x) = 1, and for n ∈ N recursively define

qn (x) := xⁿ − Σ_{k=0}^{n−1} (⟨xⁿ, qk ⟩ / ⟨qk , qk ⟩) qk (x).

Since the inner product is positive definite, ⟨qk , qk ⟩ > 0, and so each qn is
uniquely defined. By construction, each qn is orthogonal to qk for k < n.
By Exercise 8.1, the hypothesis of Theorem 8.5 is satisfied if the measure µ
has infinite support. In the other direction, we have the following:

Theorem 8.6. If the L²(R, µ) inner product is positive definite on R≤d [x], but
not on R≤n [x] for n > d, then µ admits only d + 1 orthogonal polynomials.
Proof. The Gram–Schmidt procedure can be applied so long as the denominators
⟨qk , qk ⟩ are strictly positive, i.e. for k ≤ d + 1. The polynomial qd+1 is
orthogonal to qn for n ≤ d; we now show that qd+1 = 0. By assumption, there
exists a polynomial P of degree d + 1, having the same leading coefficient as
qd+1 , such that ‖P‖L²(R,µ) = 0. Hence, P − qd+1 has degree d, so it can be
written in the orthogonal basis {q0 , . . . , qd } as

P − qd+1 = Σ_{k=0}^{d} ck qk

for some coefficients c0 , . . . , cd . Hence,

0 = ‖P‖²L²(R,µ) = ‖qd+1 ‖²L²(R,µ) + Σ_{k=0}^{d} ck² ‖qk ‖²L²(R,µ) ,

which implies, in particular, that ‖qd+1 ‖L²(R,µ) = 0. Hence, the normalization
constant γd+1 = 0, which is not permitted, and so qd+1 is not a member of a
sequence of orthogonal polynomials for µ.
Theorem 8.7. If µ has finite moments only of degrees 0, 1, . . . , r, then µ admits
only a finite system of orthogonal polynomials q0 , . . . , qd , where d = ⌊r/2⌋.

Proof. Exercise 8.2.
Theorem 8.8. The coefficients of any system of orthogonal polynomials are
determined, up to multiplication by an arbitrary constant for each degree, by the
Hankel determinants of the polynomial moments. That is, if

mn := ∫_R xⁿ dµ(x),

then the nth degree orthogonal polynomial qn for µ is, for some cn ≠ 0,

            [                 |  mn    ]
            [       Hn        |  ..    ]
qn = cn det [                 | m2n−1  ]
            [ 1  · · ·  xⁿ⁻¹  |  xⁿ    ]

            [ m0     m1     m2     · · ·   mn     ]
            [ m1     m2     m3     · · ·   mn+1   ]
   = cn det [ ..     ..     ..     ..      ..     ] .
            [ mn−1   mn     mn+1   · · ·   m2n−1  ]
            [ 1      x      x²     · · ·   xⁿ     ]

Proof. FINISH ME!!!
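Theorem 8.8's determinant formula can be evaluated numerically by cofactor expansion along the last row. A sketch for (unnormalized) uniform measure dx on [−1, 1], checked against Le2 and Le3 from Table 8.1 after rescaling by the arbitrary constant cn:

```python
import numpy as np

def orth_poly_coeffs(moments, n):
    """Ascending coefficients of the degree-n orthogonal polynomial of
    Theorem 8.8 (up to scaling), via cofactor expansion along the last row."""
    M = np.array([[moments[i + j] for j in range(n + 1)] for i in range(n)])
    return np.array([(-1) ** (n + j) * np.linalg.det(np.delete(M, j, axis=1))
                     for j in range(n + 1)])

# Moments of dx on [-1, 1]: m_k = 2/(k+1) for even k, 0 for odd k.
moments = [2.0 / (k + 1) if k % 2 == 0 else 0.0 for k in range(8)]

c2 = orth_poly_coeffs(moments, 2)
c2 = c2 / c2[-1] * 1.5          # match Le2 = (3x^2 - 1)/2, leading coefficient 3/2
c3 = orth_poly_coeffs(moments, 3)
c3 = c3 / c3[-1] * 2.5          # match Le3 = (5x^3 - 3x)/2, leading coefficient 5/2
```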

8.2 Recurrence Relations

The Legendre polynomials satisfy the recurrence relation

Len+1 (x) = ((2n + 1)/(n + 1)) x Len (x) − (n/(n + 1)) Len−1 (x).

The Hermite polynomials satisfy the recurrence relation

Hen+1 (x) = x Hen (x) − n Hen−1 (x).

In fact, all systems of orthogonal polynomials satisfy a three-term recurrence
relation of the form

qn+1 (x) = (An x + Bn )qn (x) − Cn qn−1 (x)

for some sequences (An ), (Bn ), (Cn ), with the initial terms q0 (x) = 1 and
q−1 (x) = 0. Furthermore, there is a characterization of precisely which sequences
(An ), (Bn ), (Cn ) arise from systems of orthogonal polynomials.
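For example, the Hermite recurrence above generates the coefficient arrays of Table 8.1 directly (a sketch; coefficients are stored in ascending powers of x):

```python
import numpy as np

def hermite_coeffs(n_max):
    """Coefficient arrays (ascending powers of x) of He_0,...,He_{n_max},
    generated by the recurrence He_{n+1}(x) = x He_n(x) - n He_{n-1}(x)."""
    polys = [np.array([1.0]), np.array([0.0, 1.0])]   # He_0 = 1, He_1 = x
    for n in range(1, n_max):
        shifted = np.concatenate(([0.0], polys[n]))       # multiply He_n by x
        prev = np.pad(polys[n - 1], (0, len(shifted) - len(polys[n - 1])))
        polys.append(shifted - n * prev)
    return polys

H = hermite_coeffs(5)
```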

Theorem 8.9 (Favard). Let (An ), (Bn ), (Cn ) be real sequences and let Q =
{qn | n ∈ N } be defined by
qn+1 (x) = (An x + Bn )qn (x) − Cn qn−1 (x),
q0 (x) = 1,
q−1 (x) = 0.
Then Q is a system of orthogonal polynomials for some measure µ if and only
if, for all n ∈ N ,

An ≠ 0,     Cn ≠ 0,     Cn An An−1 > 0.
Theorem 8.10 (Christoffel–Darboux formula). The orthonormal polynomials {Pn |
n ≥ 0} for a measure µ satisfy

Σ_{k=0}^{n} Pk (y)Pk (x) = √βn+1 (Pn+1 (y)Pn (x) − Pn (y)Pn+1 (x)) / (y − x)     (8.1)

and

Σ_{k=0}^{n} |Pk (x)|² = √βn+1 (P′n+1 (x)Pn (x) − P′n (x)Pn+1 (x)).               (8.2)
AF
Proof. Multiply the recurrence relation

√βk+1 Pk+1 (x) = (x − αk )Pk (x) − √βk Pk−1 (x)

by Pk (y) on both sides and subtract the corresponding expression with x and y
interchanged to obtain

(y − x)Pk (y)Pk (x) = √βk+1 (Pk+1 (y)Pk (x) − Pk (y)Pk+1 (x))
                      − √βk (Pk (y)Pk−1 (x) − Pk−1 (y)Pk (x)).

Sum both sides from k = 0 to k = n and use the telescoping nature of the sum
on the right to obtain (8.1). Take the limit as y → x to obtain (8.2).
Corollary 8.11. The orthonormal polynomials {Pn | n ≥ 0} for a measure µ
satisfy

P′n+1 (x)Pn (x) − P′n (x)Pn+1 (x) > 0.

8.3 Roots of Orthogonal Polynomials

Definition 8.12. The Jacobi matrix of a measure µ is the infinite, symmetric,
tridiagonal matrix

           [ α0    √β1   0     · · · ]
           [ √β1   α1    √β2   · · · ]
J∞ (µ) :=  [ 0     √β2   α2    · · · ]
           [ ..    ..    ..    ..    ]

where αk and βk are as in FINISH ME!!!. The upper-left n × n minor of
J∞ (µ) is denoted Jn (µ).

Theorem 8.13. Let P0 , P1 , . . . be the orthonormal polynomials for µ. The zeros


of Pn are the eigenvalues of Jn (µ), and the eigenvector corresponding to the zero
z is  
P0 (z)
 .. 
 . .
Pn−1 (z)
Theorem 8.14 (Zeros of orthogonal polynomials). Let µ be supported in a non-degenerate
interval I ⊆ R, and let Q = {qn | n ∈ N0 } be a system of orthogonal
polynomials for µ.
1. For each n ∈ N0 , qn has exactly n distinct real roots z1(n) , . . . , zn(n) ∈ I.
2. If (a, b) is an open interval of µ-measure zero, then (a, b) contains at most
one root of any orthogonal polynomial qn for µ.
3. The zeros of qn and qn+1 alternate:

z1(n+1) < z1(n) < z2(n+1) < · · · < zn(n+1) < zn(n) < zn+1(n+1) ;

hence, whenever m > n, between any two zeros of qn there lies a zero of
qm .
4. If the support of µ is the entire interval I, then the set of all zeros for the
system Q is dense in I, i.e. I is the closure of

⋃_{n∈N} {z ∈ R | qn (z) = 0}.

Proof. 1. First observe that ⟨qn , 1⟩L²(µ) = 0, and so qn changes sign in I.
Since qn is continuous, the Intermediate Value Theorem implies that qn
has at least one real root z1(n) ∈ I. For n > 1, there must be another root
z2(n) ∈ I of qn distinct from z1(n) , since if qn were to vanish only at z1(n) ,
then (x − z1(n) )qn would not change sign in I, which would contradict the
orthogonality relation ⟨x − z1(n) , qn ⟩L²(µ) = 0. Similarly, if n > 2, consider
(x − z1(n) )(x − z2(n) )qn to deduce the existence of yet a third distinct root
z3(n) ∈ I. This procedure terminates when all the n complex roots of qn
guaranteed by the Fundamental Theorem of Algebra are shown to lie in
I.
2. Suppose that (a, b) contains two distinct zeros zi(n) and zj(n) of qn . Then

⟨qn , Π_{k≠i,j} (x − zk(n) )⟩L²(µ) = ∫_R qn (x) Π_{k≠i,j} (x − zk(n) ) dµ(x)
                                  = ∫_R Π_{k≠i,j} (x − zk(n) )² (x − zi(n) )(x − zj(n) ) dµ(x)
                                  > 0,

since the integrand is positive outside of (a, b). However, this contradicts
the orthogonality of qn to all polynomials of degree less than n.
3. As usual, let Pn be the normalized version of qn . Let σ, τ be consecutive zeros
of Pn , so that P′n (σ)P′n (τ ) < 0. Then Corollary 8.11 implies that Pn+1
has opposite signs at σ and τ , and so the Intermediate Value Theorem
implies that Pn+1 has at least one zero between σ and τ . This observation
accounts for n − 1 of the n + 1 zeros of Pn+1 , namely z2(n+1) < · · · < zn(n+1) .
There are two further zeros of Pn+1 , one to the left of z1(n) and one to the
right of zn(n) . This follows because P′n (zn(n) ) > 0, so Corollary 8.11 implies
that Pn+1 (zn(n) ) < 0. Since Pn+1 (x) → +∞ as x → ∞, the Intermediate
Value Theorem implies the existence of zn+1(n+1) > zn(n) . A similar argument
establishes the existence of z1(n+1) < z1(n) .
4. FINISH ME!!!

8.4 Polynomial Interpolation

The existence of a unique polynomial p(x) = Σ_{i=0}^{n} ci xⁱ of degree at most n that
interpolates the values y0 , . . . , yn at n + 1 distinct points x0 , . . . , xn follows from
the invertibility of the Vandermonde matrix

                         [ 1   x0   x0²   · · ·  x0ⁿ ]
                         [ 1   x1   x1²   · · ·  x1ⁿ ]
Vn (x0 , . . . , xn ) := [ ..  ..   ..    ..     ..  ]                    (8.3)
                         [ 1   xn   xn²   · · ·  xnⁿ ]

and hence the unique solvability of the system of simultaneous linear equations

Vn (x0 , . . . , xn ) [c0 , . . . , cn ]∗ = [y0 , . . . , yn ]∗ .          (8.4)

There is, however, another way to express the polynomial interpolation problem,
the so-called Lagrange form, which amounts to a clever choice of basis for R≤n [x]
(instead of the usual monomial basis {1, x, x², . . . , xⁿ }) so that the matrix in
(8.4) in the new basis is the identity matrix.
(8.4) in the new basis is the identity matrix.

Definition 8.15 (Lagrange polynomials). Let x0 , . . . , xK ∈ R be distinct. For
0 ≤ j ≤ K, the associated Lagrange basis polynomial ℓj is defined by

ℓj (x) := Π_{0≤k≤K, k≠j} (x − xk )/(xj − xk ).

Given also arbitrary values y0 , . . . , yK ∈ R, the associated Lagrange interpolation
polynomial is

L(x) := Σ_{j=0}^{K} yj ℓj (x).

Theorem 8.16. Given distinct x0 , . . . , xK ∈ R and any y0 , . . . , yK ∈ R, the
associated Lagrange interpolation polynomial L is the unique polynomial of degree
at most K such that L(xk ) = yk for k = 0, . . . , K.

Proof. Observe that each Lagrange basis polynomial is a polynomial of degree
K, and so L is a polynomial of degree at most K. Observe also that ℓj (xk ) = δjk .
Hence,

L(xk ) = Σ_{j=0}^{K} yj ℓj (xk ) = Σ_{j=0}^{K} yj δjk = yk .

For uniqueness, consider the basis {ℓ0 , . . . , ℓK } of R≤K [x] and suppose that
p = Σ_{j=0}^{K} cj ℓj is any polynomial that interpolates the values {yk }_{k=0}^{K} at the
points {xk }_{k=0}^{K} . But then, for each k = 0, . . . , K,

yk = p(xk ) = Σ_{j=0}^{K} cj ℓj (xk ) = Σ_{j=0}^{K} cj δjk = ck ,

and so p = L, as claimed.
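Definition 8.15 and Theorem 8.16 translate directly into code; a minimal sketch (the cubic test data below are arbitrary — any polynomial of degree at most K is reproduced exactly):

```python
import numpy as np

def lagrange_interpolant(xs, ys):
    """Return L(x) = sum_j y_j * l_j(x) for distinct nodes xs and values ys."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)

    def L(x):
        x = np.asarray(x, dtype=float)
        total = np.zeros_like(x)
        for j in range(len(xs)):
            others = np.delete(xs, j)
            # Lagrange basis polynomial l_j from Definition 8.15.
            lj = np.prod([(x - xk) / (xs[j] - xk) for xk in others], axis=0)
            total = total + ys[j] * lj
        return total

    return L

xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = xs ** 3                       # values of a cubic at 4 distinct nodes
L = lagrange_interpolant(xs, ys)
t = np.linspace(-1.0, 2.0, 7)      # evaluation points for checking
```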

Runge’s Phenomenon. Given the task of choosing nodes xk ∈ [a, b] between


which to interpolate functions f : [a, b] → R, it might seem natural to choose

T
the nodes xk to be equally spaced. Runge’s phenomenon [84] shows that this
is not always a good choice of interpolation scheme. Consider the function
f : [−1, 1] → R defined by
1
f (x) := , (8.5)
AF
1 + 25x2
and let Ln be the degree-n (Lagrange) interpolation polynomial for f on the
equally-spaced nodes xk := 2k n − 1. As illustrated in Figure 8.1, Ln oscillates
wildly near the endpoints of the interval [−1, 1]. Even worse, as n increases,
these oscillations do not die down but increase without bound: it can be shown
that
    lim_{n→∞} sup_{x∈[−1,1]} |f(x) − Ln(x)| = ∞.

As a consequence, polynomial interpolation and numerical integration using


uniformly spaced nodes — as in the Newton–Cotes formula (Definition 9.5) —
can in general be very inaccurate. The oscillations near ±1 can be controlled
by using a non-uniform set of nodes, in particular one that is denser near ±1 than
near 0; the standard example is the set of Chebyshev nodes defined by
 
    xk := cos( (2k − 1)π / (2n) ),    k = 1, . . . , n,
i.e. the roots of the Chebyshev polynomials of the first kind Tn , which are the
orthogonal polynomials for the measure (1 − x2 )−1/2 dx on [−1, 1].

However, Chebyshev nodes are not a panacea. Indeed, for every predefined
set of interpolation nodes there is a continuous function for which the interpo-
lation process on those nodes diverges. For every continuous function there is
a set of nodes on which the interpolation process converges. Interpolation on
Chebyshev nodes converges uniformly for every absolutely continuous function.
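The phenomenon is easy to reproduce numerically. A sketch (Python with NumPy; fitting a degree-n polynomial through n + 1 distinct nodes with `numpy.polyfit` recovers the unique interpolant of Theorem 8.16, up to floating-point conditioning):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)     # the Runge function (8.5)
xx = np.linspace(-1.0, 1.0, 2001)           # fine grid for the sup-norm error

def max_interp_error(nodes):
    coeffs = np.polyfit(nodes, f(nodes), len(nodes) - 1)
    return np.max(np.abs(f(xx) - np.polyval(coeffs, xx)))

n = 14
equispaced = np.linspace(-1.0, 1.0, n + 1)
chebyshev = np.cos((2.0 * np.arange(1, n + 2) - 1.0) / (2.0 * (n + 1)) * np.pi)
```

On the equispaced nodes the sup-norm error is already larger than the function itself; on the Chebyshev nodes it is small and shrinks further as n grows.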

8.5 Polynomial Approximation


The following theorem on the uniform approximation (on compact sets) of con-
tinuous functions by polynomials should be familiar from elementary real anal-
ysis:

Figure 8.1: Runge’s phenomenon. The function f (x) := (1 + 25x2 )−1 in black,
and polynomial interpolations of degrees 5 (red), 9 (green), and 13 (blue) on
evenly-spaced nodes.
AF
Theorem 8.17 (Weierstrass). Let [a, b] ⊂ R be a bounded interval, let f : [a, b] →
R be continuous, and let ε > 0. Then there exists a polynomial p such that

    sup_{a≤x≤b} |f(x) − p(x)| < ε.

As a consequence of standard results on orthogonal projection in Hilbert


spaces, we have the following:

Theorem 8.18. For any f ∈ L2 (I, µ) and any d ∈ N0 , the orthogonal projection
Πd f of f onto R≤d [x] is the best degree d polynomial approximation of f in the
L2 (I, µ) norm, i.e.

    Πd f = arg min_{p∈R≤d[x]} ‖p − f‖_{L²(I,µ)},

where, denoting the orthogonal polynomials for µ by {qk | k ∈ N0 },

    Πd f := Σ_{k=0}^{d} ( ⟨f, qk⟩_{L²(µ)} / ‖qk‖²_{L²(µ)} ) qk,

and the residual is orthogonal to the projection subspace:

    ⟨f − Πd f, p⟩_{L²(µ)} = 0    for all p ∈ R≤d[x].
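As a sketch of how Πd f might be computed in practice for µ = dx on [−1, 1] (Python with NumPy; the inner products are themselves evaluated by a high-order Gauss rule, anticipating Chapter 9, and ‖Pk‖²_{L²} = 2/(2k + 1) for the Legendre polynomials):

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_projection(f, d, quad_n=200):
    """L^2(dx) projection of f onto polynomials of degree <= d on [-1, 1]."""
    x, w = legendre.leggauss(quad_n)            # Gauss-Legendre nodes, weights
    coeffs = np.empty(d + 1)
    for k in range(d + 1):
        Pk = legendre.Legendre.basis(k)(x)
        # <f, P_k> / ||P_k||^2, with ||P_k||^2 = 2/(2k+1)
        coeffs[k] = np.sum(w * f(x) * Pk) * (2 * k + 1) / 2.0
    return legendre.Legendre(coeffs)
```

For example, projecting f(x) = x³ onto degree 1 returns (3/5)x, the best affine approximation of x³ in L²(dx), while the degree-3 projection recovers x³ itself.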

An important property of polynomial expansions of functions is that the


quality of the approximation (i.e. the rate of convergence) improves as the reg-
ularity of the function to be approximated increases. This property is referred
8.5. POLYNOMIAL APPROXIMATION 95

to as spectral convergence and is easily quantified by using the machinery of


Sobolev spaces. Recall that, given a measure µ on a subinterval I ⊆ R,

    ⟨u, v⟩_{H^k(µ)} := Σ_{m=0}^{k} ⟨d^m u/dx^m, d^m v/dx^m⟩_{L²(µ)} = Σ_{m=0}^{k} ∫_I (d^m u/dx^m)(d^m v/dx^m) dµ,

    ‖u‖_{H^k(µ)} := ⟨u, u⟩_{H^k(µ)}^{1/2}.

The Sobolev space H k (µ) consists of all L2 (µ) functions that have weak deriva-
tives of all orders up to k in L2 (µ), and is equipped with the above inner product


and norm.
Legendre expansions of Sobolev functions on [−1, 1] satisfy the following
spectral convergence theorem; the analogous results for Hermite expansions of
Sobolev functions on R and Laguerre expansions of Sobolev functions on (0, ∞)
are Exercise 8.5 and Exercise 8.6 respectively.
Theorem 8.19 (Spectral convergence of Legendre expansions). There is a constant
C ≥ 0 that may depend upon k but is independent of d and f such that, for all

f ∈ H k ([−1, 1], dx),

    ‖f − Πd f‖_{L²(dx)} ≤ C d^{−k} ‖f‖_{H^k(dx)}.


Proof. Recall that the Legendre polynomials satisfy

LLen = λn Len ,

where the differential operator L and eigenvalues λn are


 
    L = (d/dx)( (1 − x²) d/dx ) = (1 − x²) d²/dx² − 2x d/dx,    λn = −n(n + 1).

Note that, by the definition of the Sobolev norm and the operator L, ‖Lf‖_{L²} ≤ C‖f‖_{H²} and hence, for any m ∈ N, ‖L^m f‖_{L²} ≤ C‖f‖_{H^{2m}}.
The key ingredient of the proof is integration by parts:
    ⟨f, Len⟩_{L²} = λn^{−1} ∫_{−1}^{1} (L Len)(x) f(x) dx
                  = λn^{−1} ∫_{−1}^{1} ( (1 − x²) Len″(x) f(x) − 2x Len′(x) f(x) ) dx
                  = −λn^{−1} ∫_{−1}^{1} ( ((1 − x²) f)′(x) Len′(x) + 2x Len′(x) f(x) ) dx
                  = −λn^{−1} ∫_{−1}^{1} (1 − x²) f′(x) Len′(x) dx
                  = λn^{−1} ∫_{−1}^{1} ((1 − x²) f′)′(x) Len(x) dx
                  = λn^{−1} ⟨Lf, Len⟩_{L²}.

Hence, for all m ∈ N0 for which f has 2m weak derivatives,


    ⟨f, Len⟩ = ⟨L^m f, Len⟩ / λn^m.

Hence,

    ‖f − Πd f‖² = Σ_{n=d+1}^{∞} |⟨f, Len⟩|² / ‖Len‖²
                = Σ_{n=d+1}^{∞} |⟨L^m f, Len⟩|² / ( λn^{2m} ‖Len‖² )
                ≤ λd^{−2m} Σ_{n=d+1}^{∞} |⟨L^m f, Len⟩|² / ‖Len‖²
                ≤ d^{−4m} ‖L^m f‖²
                ≤ C² d^{−4m} ‖f‖²_{H^{2m}}.
Taking k = 2m and square roots completes the proof.
However, in the other direction, poor regularity can completely ruin the nice convergence of spectral expansions. The classic example of this is Gibbs' phenomenon, in which one tries to approximate the sign function


    sgn(x) := −1 if x < 0;  0 if x = 0;  1 if x > 0,
on [−1, 1] by its expansion with respect to a system of orthogonal polynomi-
als such as the Legendre polynomials Len(x) or the Fourier basis functions e^{iπnx}.
FINISH ME!!!

8.6 Orthogonal Polynomials of Several Variables



For working with polynomials in d variables, we will use standard multi-index


notation. Multi-indices will be denoted by Greek letters α = (α1 , . . . , αd ) ∈ Nd0 .
For x = (x1 , . . . , xd ) ∈ Rd and α ∈ Nd0 , the monomial xα is defined by
    xα := x1^{α1} x2^{α2} · · · xd^{αd},

and |α| := α1 + · · · + αd is called the total degree of xα . The total degree of


a polynomial p (i.e. a finite linear combination of such monomials) is denoted
deg(p) and is the maximum of the total degrees of the summands.
Given a measure µ on Rd , it is tempting to apply the Gram–Schmidt process

with respect to the inner product


    ⟨f, g⟩_{L²(µ)} := ∫_{R^d} f(x) g(x) dµ(x)

to the monomials {xα | α ∈ Nd0 } to obtain a system of orthogonal polynomials


for the measure µ. However, there is an immediate problem, in that orthogonal
polynomials of several variables are not unique. In order to apply the Gram–
Schmidt process, we need to give a linear order to multi-indices α ∈ Nd0 . There
are many choices of well-defined total order (for example, the lexicographic order
or the graded lexicographic order); but there is no natural choice and different
orders will give different sequences of orthogonal polynomials. Instead of fixing
such a total order, we relax Definition 8.1 slightly:

Definition 8.20. Let µ be a non-negative measure on Rd . A family of polyno-


mials Q = {qα | α ∈ Nd0 } is called an orthogonal system of polynomials if qα is
such that

    ⟨qα, p⟩_{L²(µ)} = 0    for all p ∈ R[x1, . . . , xd] with deg(p) < |α|.

Hence, in the many-variables case, an orthogonal polynomial of total degree


n, while it is required to be orthogonal to all polynomials of strictly lower total
degree, may be non-orthogonal to other polynomials of the same total degree n.
However, the meaning of orthonormality is unchanged: a system of polynomials


{Pα | α ∈ Nd0 } is orthonormal if

    ⟨Pα, Pβ⟩_{L²(µ)} = δαβ.

Bibliography
The classic reference on orthogonal polynomials is the 1939 monograph of Szegő

[100]. An excellent more modern reference is the book of Gautschi [33]; some
topics covered in that book that are not treated here include. . . FINISH ME!!!
Many important properties of orthogonal polynomials, and standard exam-
ples, are given in Chapter 22 of Abramowitz & Stegun [1].
Exercises
Exercise 8.1. Show that the L2 (R, µ) inner product is positive definite on the
space of polynomials if the measure µ has infinite support.
Exercise 8.2. Show that if µ has finite moments only of degrees 0, 1, . . . , r,

then µ admits only a finite system of orthogonal polynomials q0 , . . . , qd , where


d = ⌊r/2⌋.
Exercise 8.3. Define a Borel measure µ on R by
    dµ/dx (x) = 1 / ( π (1 + x²) ).
Show that µ is a probability measure, that dim L2 (R, µ; R) = ∞, find all or-
thogonal polynomials for µ, and explain your results.
Exercise 8.4. Calculate the orthogonal polynomials of Table 8.2 by hand for

degree p ≤ 5, and write a numerical program to compute them for higher


degree.
Exercise 8.5 (Spectral convergence of Hermite expansions). Let γ = N (0, 1)
be standard Gaussian measure on R. First establish the integration-by-parts
formula

    ∫_R f(x) g′(x) dγ(x) = − ∫_R ( f′(x) − x f(x) ) g(x) dγ(x).
Using this, and the fact that the Hermite polynomials satisfy
    ( d²/dx² − x d/dx ) Hen(x) = −n Hen(x),    for n ∈ N0,

mimic the proof of Theorem 8.19 to show that there is a constant C ≥ 0 that may
depend upon k but is independent of d and f such that, for all f ∈ H k (R, γ), f
and its degree d expansion in the Hermite orthogonal basis of L2 (R, γ) satisfy

    ‖f − Πd f‖_{L²(γ)} ≤ C d^{−k/2} ‖f‖_{H^k(γ)}.

Exercise 8.6 (Spectral convergence of Laguerre expansions). Let dµ(x) = e−x dx,
for which the orthogonal polynomials are the Laguerre polynomials Lan , n ∈ N0 .
Establish an integration-by-parts formula for µ and then use this and the fact
that Lan is an eigenfunction for x d²/dx² + (1 − x) d/dx with eigenvalue −n to prove


the analogue of Exercise 8.5 but for Laguerre expansions.


Chapter 9


Numerical Integration

A turkey is fed for a 1000 days — every day confirms to its statistical department that the human race cares about its welfare “with increased statistical significance”. On the 1001st day, the turkey has a surprise.

The Fourth Quadrant: A Map of the Limits of Statistics
Nassim Taleb

The topic of this chapter is the numerical (i.e. approximate) evaluation of


definite integrals. Such methods of numerical integration will be essential if the UQ methods of later chapters — with their many expectations — are to be implemented in a practical manner.
The topic of integration has a long history, as one of the twin pillars of
calculus, and was historically also known as quadrature. Nowadays, quadrature
usually refers to a particular method of numerical integration.

9.1 Quadrature in One Dimension


This section concerns the numerical integration of a real-valued function f with

respect to a measure µ on a sub-interval I ⊆ R, and to do so by sampling the


function at pre-determined points of I and taking a suitable weighted average.
That is, the aim is to construct an approximation of the form
    ∫_I f(x) dµ(x) ≈ Q(f) := Σ_{k=1}^{K} wk f(xk),

with prescribed points x1 , . . . , xK ∈ I called nodes and scalars w1 , . . . , wK ∈ R


called weights. The approximation Q(f ) is called a quadrature formula. The aim
is to choose nodes and weights wisely, so that the quality of the approximation ∫_I f dµ ≈ Q(f) is good for a large class of integrands f. One measure of the
quality of the approximation is the following:

Definition 9.1. A quadrature formula is said to have order of accuracy n ∈ N0 if ∫_I f dµ = Q(f) whenever f is a polynomial of degree at most n.
A quadrature formula Q(f) = Σ_{k=1}^{K} wk f(xk) can be identified with the discrete measure Σ_{k=1}^{K} wk δ_{xk}. If some of the weights wk are negative, then this
measure is a signed measure. This point of view will be particularly useful when
considering multi-dimensional quadrature formulas. Regardless of the signature
of the weights, the following limitation on the accuracy of quadrature formulas
is fundamental:


Lemma 9.2. Let I ⊆ R be any interval. Then no quadrature formula with n
distinct nodes in I can have order of accuracy 2n or greater.
Proof. Let {x1, . . . , xn} ⊆ I be any set of n distinct points, and let {w1, . . . , wn} be any set of weights. Let f be the degree-2n polynomial f(x) := Π_{j=1}^{n} (x − xj)², i.e. the square of the nodal polynomial. Then

    ∫_I f(x) dx > 0 = Σ_{j=1}^{n} wj f(xj),

since f vanishes at each node xj . Hence, the quadrature formula is not exact
for polynomials of degree 2n.
The first, simplest, quadrature formulas to consider are those in which the
nodes form an equally-spaced discrete set of points in [a, b]. Many of these
quadrature formulas may be familiar from high-school mathematics.
Definition 9.3 (Midpoint rule). The midpoint quadrature formula has the single node x1 := (a + b)/2 and the single weight w1 := ρ(x1)|b − a|. That is, it is the approximation

    ∫_a^b f(x) ρ(x) dx ≈ I1(f) := f( (a + b)/2 ) ρ( (a + b)/2 ) |b − a|.
Another viewpoint on the midpoint rule is that it is the approximation of the integrand f by the constant function with value f( (a + b)/2 ). The next quadrature
formula, on the other hand, amounts to the approximation of f by the affine
function
    x ↦ f(a) + ( (x − a)/(b − a) ) ( f(b) − f(a) )
that equals f (a) at a and f (b) at b.

Definition 9.4 (Trapezoidal rule). The trapezoidal quadrature formula has the nodes x1 := a and x2 := b and the weights w1 := ρ(a)|b − a|/2 and w2 := ρ(b)|b − a|/2. That is, it is the approximation

    ∫_a^b f(x) ρ(x) dx ≈ I2(f) := ( f(a)ρ(a) + f(b)ρ(b) ) |b − a| / 2.
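Both rules are one-liners; a sketch with ρ ≡ 1 (plain Lebesgue measure — an assumption for simplicity, since the definitions above allow general ρ):

```python
def midpoint(f, a, b):
    """Midpoint rule: one node at (a + b)/2 with weight (b - a)."""
    return f((a + b) / 2.0) * (b - a)

def trapezoid(f, a, b):
    """Trapezoidal rule: nodes a and b, each with weight (b - a)/2."""
    return (f(a) + f(b)) * (b - a) / 2.0
```

Both rules are exact for affine integrands; for a convex integrand such as x² on [0, 1] (exact integral 1/3) the midpoint rule under-estimates (1/4) and the trapezoidal rule over-estimates (1/2).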
Recall the definition of the Lagrange interpolation polynomial L for a set of
nodes and values from Definition 8.15. The midpoint and trapezoidal quadrature
formulas amount to approximating f by a Lagrange interpolation polynomial L
of degree 0 or 1 and hence approximating ∫_a^b f(x) dx by ∫_a^b L(x) dx. The general
such construction is the following:

Definition 9.5 (Newton–Cotes formula). Consider K + 1 equally-spaced points

a = x0 < x1 = x0 + h < x2 = x0 + 2h < · · · < xK = b,


where h = (b − a)/K. The closed Newton–Cotes quadrature formula is the quadrature
formula that arises from approximating f by the Lagrange interpolating poly-
nomial L that runs through the points (xk, f(xk)) for k = 0, . . . , K; the open Newton–Cotes
quadrature formula is the quadrature formula that arises from approximating
f by the Lagrange interpolating polynomial L that runs through the points
(xk, f(xk)) for k = 1, . . . , K − 1.


Proposition 9.6. The weights for the closed Newton–Cotes quadrature formula
are given by
    wj = ∫_a^b ℓj(x) dx.

Proof. Let L be as in the definition of the Newton–Cotes rule. Then

    ∫_a^b f(x) dx ≈ ∫_a^b L(x) dx
                 = ∫_a^b Σ_{j=0}^{K} f(xj) ℓj(x) dx
                 = Σ_{j=0}^{K} f(xj) ∫_a^b ℓj(x) dx,

as claimed.

The midpoint rule is the open Newton–Cotes quadrature formula on three points; the trapezoidal rule is the closed Newton–Cotes quadrature formula on two points.
The quality of Newton–Cotes quadrature formulas can be very poor, essen-
tially because Runge's phenomenon can make the quality of the approximation
f ≈ L very poor. The weakness of Newton–Cotes quadrature can be seen an-
other way: it is possible for the Newton–Cotes weights to be negative, and any
quadrature formula with weights of both signs is prone to errors. For example,
the positive-definite function f that linearly interpolates the values

??? at xk

has Q(f ) = 0. FINISH ME!!!

9.2 Gaussian Quadrature


Gaussian quadrature is a powerful method for numerical integration in which
both the nodes and the weights are chosen so as to maximize the order of accu-
racy of the quadrature formula. Remarkably, by the correct choice of n nodes
and weights, the quadrature formula can be made accurate for all polynomials
of degree at most 2n − 1. Moreover, the weights in this quadrature formula are
all positive, and so the quadrature formula is stable even for high n.

See [108, Ch. 37].


The objective is to approximate a definite integral
    ∫_a^b f(x) w(x) dx,

where w : [a, b] → (0, +∞) is a fixed weight function, by a finite sum


    In(f) := Σ_{j=1}^{n} wj f(xj),


where the nodes x1 , . . . , xn and weights w1 , . . . , wn will be chosen appropriately.
Let {q0 , q1 , . . . } be a system of orthogonal polynomials with respect to the
weight function w. That is,
    ∫_a^b f(x) qn(x) w(x) dx = 0

whenever f is a polynomial of degree at most n − 1. Let the nodes x1 , . . . , xn
be the zeros of qn ; by Theorem 8.14, qn has n distinct roots in [a, b]. Define the
associated weights by
    wj := (an / an−1) · ( ∫_a^b qn−1(x)² w(x) dx ) / ( qn′(xj) qn−1(xj) ),

where ak is the coefficient of xk in qk (x).


Orthogonal polynomials for quadrature formulas can be found in Abramowitz & Stegun [1, §25.4].
1. The nodes actually determine the weights as above.
2. Those weights are positive. DONE
3. Order of accuracy 2n − 1. DONE
4. Error estimate [95, Th. 3.6.24]: for f ∈ C^{2n},

    ∫_a^b f(x) ρ(x) dx − QK(f) = ( f^{(2n)}(ξ) / (2n)! ) ‖pn‖²_ρ
Definition 9.7. The n-point Gauss–Legendre quadrature formula is the quadrature formula with nodes {x1, . . . , xn} given by the zeros of the degree-n Legendre polynomial qn. The nodes are sometimes called Gauss points.

Theorem 9.8. The n-point Gauss quadrature formula has order of accuracy
exactly 2n − 1, and no quadrature formula on n nodes has order of accuracy higher than this.
Proof. Lemma 9.2 shows that no quadrature formula can have order of accuracy
greater than 2n − 1.
On the other hand, suppose that p is any polynomial of degree at most 2n − 1. Factor this polynomial as

    p(x) = g(x) qn(x) + r(x),

where g is a polynomial of degree at most n − 1, and the remainder r is also a polynomial of degree at most n − 1. Since qn is orthogonal to all polynomials

of degree at most n − 1, ∫_a^b g qn dµ = 0. However, since g(xj) qn(xj) = 0 for each node xj,

    In(g qn) = Σ_{j=1}^{n} wj g(xj) qn(xj) = 0.
Since ∫_a^b · dµ and In( · ) are both linear operators,

    ∫_a^b p dµ = ∫_a^b r dµ    and    In(p) = In(r).

Since r is of degree at most n − 1, ∫_a^b r dµ = In(r), and so ∫_a^b p dµ = In(p), as claimed.
Theorem 9.9. The Gauss weights are given by
    wj = (an / an−1) · ( ∫_a^b qn−1(x)² w(x) dx ) / ( qn′(xj) qn−1(xj) ),
where ak is the coefficient of xk in qk (x).

T
Proof. Suppose that p is any polynomial of degree at most 2n − 1. Factor this
polynomial as
    p(x) = g(x) qn(x) + r(x),
AF
where g is a polynomial of degree at most n − 1, and the remainder r is also a polynomial of degree at most n − 1. Using Lagrange basis polynomials, write r = Σ_{i=1}^{n} r(xi) ℓi, so that

    ∫_a^b r dµ = Σ_{i=1}^{n} r(xi) ∫_a^b ℓi dµ.

Since the Gauss quadrature formula is exact for r, it follows that the Gauss
weights satisfy

    wi = ∫_a^b ℓi dµ
...
...
...
FINISH ME!!!
Theorem 9.10. The weights of the Gauss quadrature formula are all positive.
Proof. Fix 1 ≤ i ≤ n and consider the polynomial
    p(x) := Π_{1≤j≤n, j≠i} (x − xj)²,

i.e. the square of the nodal polynomial, divided by (x − xi )2 . Since the degree of
p is strictly less than 2n − 1, the Gauss quadrature formula is exact, and since
p vanishes at every node other than xi , it follows that
    ∫_I p dµ = Σ_{j=1}^{n} wj p(xj) = wi p(xi).

Since µ is a non-negative measure, p ≥ 0 everywhere, and p(xi ) > 0, it follows


that wi > 0.
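These properties can be checked directly against NumPy's built-in Gauss–Legendre nodes and weights (`numpy.polynomial.legendre.leggauss`, which targets the weight w ≡ 1 on [−1, 1]):

```python
import numpy as np

n = 5
nodes, weights = np.polynomial.legendre.leggauss(n)

def gauss(f):
    """Apply the n-point Gauss-Legendre rule to f on [-1, 1]."""
    return np.sum(weights * f(nodes))

# Monomial test: int_{-1}^{1} x^k dx = 0 for odd k, 2/(k+1) for even k.
# The rule should be exact for k = 0, ..., 2n - 1 (Theorem 9.8) but not
# for k = 2n (Lemma 9.2).
exact = lambda k: 0.0 if k % 2 else 2.0 / (k + 1)
errors = [abs(gauss(lambda x, k=k: x**k) - exact(k)) for k in range(2 * n + 1)]
```

With n = 5 the rule integrates all polynomials up to degree 9 to machine precision, while x^{10} already incurs a visible error, and all five weights are strictly positive (Theorem 9.10).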

9.3 Clenshaw–Curtis / Fejér Quadrature


Despite its optimal degree of polynomial exactness, Gaussian quadrature has
some major drawbacks in practice. One principal drawback is that, by The-
orem 8.14, the Gaussian quadrature nodes are never nested — that is, if one
wishes to increase the accuracy of the numerical integral by passing from using,
say, n nodes to 2n nodes, then none of the first n nodes will be re-used. If
evaluations of the integrand are computationally expensive, then this lack of
nesting is a serious concern. Another drawback of Gaussian quadrature on n
nodes is the cost of computing the weights, which is O(n2 ). By contrast, the


Clenshaw–Curtis quadrature rules [19] (although in fact discovered thirty years
previously by Fejér [29]) are nested quadrature rules, with accuracy comparable
to Gaussian quadrature in many circumstances, and with weights that can be
computed with cost O(n log n).

The construction begins by expanding the integrand in a cosine series,

    f(cos θ) = a0/2 + Σ_{k=1}^{∞} ak cos(kθ),

where

    ak = (2/π) ∫_0^π f(cos θ) cos(kθ) dθ.
The cosine series expansion of f is also a Chebyshev polynomial expansion of
f , since by construction Tk (cos θ) = cos(kθ):
    f(x) = (a0/2) T0(x) + Σ_{k=1}^{∞} ak Tk(x).

    ∫_{−1}^{1} f(x) dx = ∫_0^π f(cos θ) sin θ dθ = a0 + Σ_{k=1}^{∞} 2 a_{2k} / ( 1 − (2k)² ).
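The series above yields a working quadrature rule in a few lines. A sketch (Python with NumPy; `chebinterpolate` samples f at Chebyshev points of the first kind, so this is closer to Fejér's rule than to classical Clenshaw–Curtis at the Chebyshev extrema, but the principle — interpolate in the Chebyshev basis, then integrate term by term using ∫₋₁¹ Tk dx = 2/(1 − k²) for even k and 0 for odd k — is the same):

```python
import numpy as np
from numpy.polynomial import chebyshev

def chebyshev_quadrature(f, n):
    """Integrate f over [-1, 1] by interpolating at n + 1 Chebyshev points
    and integrating the Chebyshev expansion term by term."""
    c = chebyshev.chebinterpolate(f, n)   # coefficients of sum_k c_k T_k
    k = np.arange(0, n + 1, 2)            # only even-degree terms contribute
    return np.sum(c[::2] * 2.0 / (1.0 - k**2))
```

For the Runge function (8.5) this converges geometrically, matching ∫₋₁¹ dx/(1 + 25x²) = (2/5) arctan 5 to high accuracy with about a hundred nodes.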

9.4 Quadrature in Multiple Dimensions


Having established quadrature formulae for integrals with a one-dimensional domain of integration, the next agendum is to produce quadrature formulas for multi-dimensional (i.e. iterated) integrals of the form

    ∫_{Π_{j=1}^{d} [aj,bj]} f(x) dx = ∫_{ad}^{bd} · · · ∫_{a1}^{b1} f(x1, . . . , xd) dx1 . . . dxd.

Tensor Product Quadrature Formulæ. The first, obvious, strategy to try is


to treat d-dimensional integration as a succession of d one-dimensional integrals
and apply our favourite one-dimensional quadrature formula d times. This is
the idea underlying tensor product quadrature formulas, and it has one major
flaw: if the one-dimensional quadrature formula uses N nodes, then the tensor
product rule uses N d nodes, which very rapidly leads to an impractically large
number of integrand evaluations for even moderately large values of N and d. In

general, when the one-dimensional quadrature formula uses N nodes, the error
for an integrand in C r using a tensor product rule is O(N −r/d ).

Sparse Quadrature Formulæ. The curse of dimension, which quickly renders


tensor product quadrature formulae impractical in high dimension, spurs the
consideration of sparse quadrature formulas, in which far fewer than N d nodes
are used, at the cost of some accuracy in the quadrature formula: in practice, we
are willing to pay the price of loss of accuracy in order to get any answer at all!
One example of a popular sparse quadrature rule is the recursive construction of


Smolyak sparse grids, which is particularly useful when combined with a nested
one-dimensional quadrature rule such as the Clenshaw–Curtis rule.
Assume that we are given, for each ℓ ∈ N, a one-dimensional quadrature formula Qℓ^{(1)}. The formula for Smolyak quadrature in dimension d ∈ N at level ℓ ∈ N is defined in terms of the lower-dimensional quadrature formulae by
    Qℓ^{(d)}(f) := Σ_{i=1}^{ℓ} ( (Qi^{(1)} − Q_{i−1}^{(1)}) ⊗ Q_{ℓ−i+1}^{(d−1)} )(f),    with Q0^{(1)} := 0.

This formula takes a little getting used to, and it helps to first consider the
case d = 2 and a few small values of ℓ. First, for ℓ = 1, Smolyak’s rule is the
quadrature formula

    Q1^{(2)} = Q1^{(1)} ⊗ Q1^{(1)},

i.e. the full tensor product of the one-dimensional quadrature formula Q1^{(1)} with
itself. For the next level, ℓ = 2, Smolyak’s rule is
    Q2^{(2)} = Σ_{i=1}^{2} ( Qi^{(1)} − Q_{i−1}^{(1)} ) ⊗ Q_{3−i}^{(1)}
             = Q1^{(1)} ⊗ Q2^{(1)} + ( Q2^{(1)} − Q1^{(1)} ) ⊗ Q1^{(1)}
             = Q1^{(1)} ⊗ Q2^{(1)} + Q2^{(1)} ⊗ Q1^{(1)} − Q1^{(1)} ⊗ Q1^{(1)}.

The “−Q1^{(1)} ⊗ Q1^{(1)}” term is included to avoid double counting. See Figure 9.1 for an illustration of the nodes of the Smolyak construction in the case that the one-dimensional quadrature formula Qℓ^{(1)} has 2ℓ − 1 equally-spaced nodes.
In general, when the one-dimensional quadrature formula at level ℓ uses Nℓ
nodes, the quadrature error for an integrand in C r using Smolyak recursion is

    O( Nℓ^{−r} (log Nℓ)^{(d−1)(r+1)} ).
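A minimal sketch of the d = 2 combination (Python with NumPy; the nested 1-d family here is the composite trapezoidal rule with 2^{ℓ−1} + 1 nodes on [0, 1] — my choice for illustration, not one made in the notes):

```python
import numpy as np

def trap_rule(level):
    """Nested 1-d composite trapezoidal rule on [0, 1] with 2^(level-1)+1 nodes."""
    n = 2 ** (level - 1) + 1
    x = np.linspace(0.0, 1.0, n)
    w = np.full(n, 1.0 / (n - 1))
    w[0] *= 0.5
    w[-1] *= 0.5
    return x, w

def tensor_apply(f, rule_x, rule_y):
    """Apply the tensor product of two 1-d rules to a vectorized f(x, y)."""
    (x, wx), (y, wy) = rule_x, rule_y
    return np.sum(wx[:, None] * wy[None, :] * f(x[:, None], y[None, :]))

def smolyak2(f, level):
    """Smolyak rule in d = 2: sum_{i=1}^{level} (Q_i - Q_{i-1}) (x) Q_{level-i+1},
    with Q_0 := 0."""
    total = 0.0
    for i in range(1, level + 1):
        qy = trap_rule(level - i + 1)
        total += tensor_apply(f, trap_rule(i), qy)
        if i > 1:
            total -= tensor_apply(f, trap_rule(i - 1), qy)
    return total
```

Because the trapezoidal rule is exact for affine integrands, the combination reproduces integrals of x + y and xy on [0, 1]² exactly, while smooth non-polynomial integrands are approximated using far fewer nodes than the full tensor grid.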

Remark 9.11. The right Sobolev space for studying sparse grids, since we need
pointwise evaluation, is H¹_mix, in which functions are weakly differentiable in
each coordinate direction.

9.5 Monte Carlo Methods


As seen above, tensor product quadrature formulas suffer from the curse of di-
mensionality: they require exponentially many evaluations of the integrand as a
function of the dimension of the integration domain. Sparse grid constructions

(a) ℓ = 1    (b) ℓ = 2    (c) ℓ = 3

Figure 9.1: Illustration of the nodes of the 2-dimensional Smolyak sparse quadrature formulas Qℓ^{(2)} for levels ℓ = 1, 2, 3, in the case that the 1-dimensional quadrature formula Qℓ^{(1)} has 2ℓ − 1 equally-spaced nodes in the interior of the domain of integration, i.e. is an open Newton–Cotes formula.

only partially alleviate this problem. Remarkably, however, the curse of dimensionality can be entirely circumvented by resorting to random sampling of the
integration domain — provided, of course, that it is possible to draw samples
from the measure against which the integrand is to be integrated.
Monte Carlo methods are, in essence, an application of the law of large
numbers (LLN). Recall that the LLN states that if X (1) , X (2) , . . . is a sequence
of independent samples from a random variable X with finite expectation E[X],
then the sample average
    (1/K) Σ_{k=1}^{K} X^{(k)}
converges to E[X] as K → ∞. The weak LLN states that the mode of conver-
gence is convergence in probability:
    for all ε > 0,    lim_{K→∞} P[ | (1/K) Σ_{k=1}^{K} X^{(k)} − E[X] | > ε ] = 0;

the strong LLN (which is harder to prove than the weak LLN) states that the
mode of convergence is actually almost sure:

    P[ lim_{K→∞} (1/K) Σ_{k=1}^{K} X^{(k)} = E[X] ] = 1.

‘Vanilla’ Monte Carlo.


Theorem 9.12 (Birkhoff–Khinchin ergodic theorem). Let T : Θ → Θ be a measure-
preserving map of a probability space (Θ, F , µ), and let f ∈ L1 (Θ, µ; R). Then,
for µ-almost every θ ∈ Θ,
    lim_{K→∞} (1/K) Σ_{k=0}^{K−1} f(T^k θ) = Eµ[f | GT],

where GT is the σ-algebra of T -invariant sets. Hence, if T is ergodic, then


    lim_{K→∞} (1/K) Σ_{k=0}^{K−1} f(T^k θ) = Eµ[f]    µ-a.s.

To obtain an error estimate for Monte Carlo integrals, we simply apply Chebyshev's inequality to SK := (1/K) Σ_{k=1}^{K} X^{(k)}, which has E[SK] = E[X] and

    V[SK] = (1/K²) Σ_{k=1}^{K} V[X] = V[X]/K,

to obtain that, for any t ≥ 0,

    P[ |SK − E[X]| ≥ t ] ≤ V[X] / (K t²).
Kt2
That is, for any ε ∈ (0, 1], with probability at least 1 − ε with respect to the
K Monte Carlo samples, the Monte Carlo average SK lies within (V[X]/Kε)1/2
of the true expected value E[X]. The fact that the error decays like K −1/2 ,

i.e. slowly, is a major limitation of ‘vanilla’ Monte Carlo methods; it is undesir-
able to have to quadruple the number of samples to double the accuracy of the
approximate integral.
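A sketch of 'vanilla' Monte Carlo over [0, 1]^d (Python with NumPy; note that the sampling cost is independent of d):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(f, d, K):
    """Vanilla Monte Carlo estimate of int_{[0,1]^d} f(x) dx from K i.i.d.
    uniform samples; the RMS error decays like K**-0.5."""
    samples = rng.random((K, d))   # K points in [0, 1]^d
    return float(np.mean(f(samples)))

# Example: f(x) = sum_i x_i^2 on [0, 1]^10 has exact integral 10/3.
estimate = mc_integrate(lambda x: np.sum(x**2, axis=1), 10, 100_000)
```

Even in dimension 10, a hundred thousand samples give roughly two to three correct digits, consistent with the K^{−1/2} rate.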
AF
CDF Inversion. One obvious criticism of Monte Carlo integration as presented
above is the accessibility of the measure of integration µ. Even leaving aside
the sensitive topic of the generation of truly ‘random’ numbers, it is no easy
matter to draw random numbers from an arbitrary probability measure on R.
The uniform measure on an interval may be said to be easily accessible; ρ dx,
for some positive and integrable function ρ, is not.
    Fν(x) := ∫_{(−∞,x]} dν = ν( (−∞, x] ),

    X ∼ Unif([0, 1])  ⟹  Fν^{−1}(X) ∼ ν.
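A sketch for a concrete ν, the exponential measure dν(x) = e^{−x} dx on (0, ∞), for which Fν(x) = 1 − e^{−x} and Fν^{−1}(u) = −log(1 − u):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_exponential(n):
    """Draw n samples from dnu(x) = exp(-x) dx by CDF inversion:
    F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u)."""
    u = rng.random(n)          # U ~ Unif([0, 1])
    return -np.log1p(-u)       # F^{-1}(U) ~ nu

xs = sample_exponential(200_000)
```

The sample mean and variance both converge to 1, the exact moments of the exponential distribution.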

Importance Sampling

Markov Chain Monte Carlo. Markov chain Monte Carlo (MCMC) methods
are a class of algorithms for sampling from a probability distribution µ based
on constructing a Markov chain that has µ as its equilibrium distribution. The
state of the chain after a large number of steps is then used as a sample of µ.

The quality of the sample improves as a function of the number of steps. Usually
it is not hard to construct a Markov chain with the desired properties; the more
difficult problem is to determine how many steps are needed to converge to µ
within an acceptable error.
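A minimal random-walk Metropolis sketch (the simplest MCMC scheme, shown here only as an illustration; note that the target density is needed only up to its normalizing constant):

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis(log_density, x0, steps, step_size=1.0):
    """Random-walk Metropolis chain targeting the unnormalized density
    exp(log_density). The Gaussian proposal is symmetric, so a move is
    accepted with probability min(1, pi(y)/pi(x))."""
    chain = np.empty(steps)
    x, lx = x0, log_density(x0)
    for k in range(steps):
        y = x + step_size * rng.normal()
        ly = log_density(y)
        if np.log(rng.random()) < ly - lx:
            x, lx = y, ly          # accept the proposal
        chain[k] = x               # otherwise repeat the current state
    return chain

# Target: standard Gaussian, specified only up to normalization.
chain = metropolis(lambda x: -0.5 * x**2, 0.0, 50_000)
```

After discarding an initial burn-in, the empirical mean and variance of the chain match those of the standard Gaussian.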

9.6 Pseudo-Random Methods


This chapter concludes with a very brief survey of numerical integration methods
that are in fact based upon deterministic sampling, but in such a way that the sample points ‘might as well be’ random.
Niederreiter [71]

Definition 9.13. The discrepancy of a point set P = {x1, . . . , xN} ⊆ [0, 1]^d is

    DN(P) := sup_{B∈J} | #(P ∩ B)/N − λd(B) |,

where J is the collection of all products of the form Π_{i=1}^{d} [ai, bi), with 0 ≤ ai < bi ≤ 1. The star-discrepancy is

    DN*(P) := sup_{B∈J*} | #(P ∩ B)/N − λd(B) |,

where J* is the collection of all products of the form Π_{i=1}^{d} [0, bi), with 0 ≤ bi < 1.
Lemma 9.14. DN* ≤ DN ≤ 2^d DN*.
Definition 9.15. Let f : [0, 1]d → R. If J ⊆ [0, 1]d is a subrectangle of [0, 1]d ,
i.e. a d-fold product of subintervals of [0, 1], let ∆J (f ) be the sum of the values

of f at the 2^d vertices of J, with alternating signs at nearest-neighbour vertices.
The Vitali variation of f : [0, 1]d → R is defined to be
    V^{Vit}(f) := sup { Σ_{J∈Π} |∆J(f)| : Π is a partition of [0, 1]^d into finitely many non-overlapping subrectangles }.

For 1 ≤ s ≤ d, the Hardy–Krause variation of f is defined to be


    V^{HK}(f) := Σ_F V^{Vit}(f|_F),

where the sum runs over all faces F of [0, 1]d having dimension at most s.

Theorem 9.16 (Koksma’s inequality). If f : [0, 1] → R has bounded (total) vari-


ation, then, for any {x1 , . . . , xN } ⊆ [0, 1],
    | (1/N) Σ_{i=1}^{N} f(xi) − ∫_{[0,1]} f(x) dx | ≤ V(f) DN*(x1, . . . , xN).

Theorem 9.17 (Koksma–Hlawka inequality). Let f : [0, 1]^d → R have bounded Hardy–Krause variation. Then, for any {x1, . . . , xN} ⊆ [0, 1)^d,

    | (1/N) Σ_{i=1}^{N} f(xi) − ∫_{[0,1]^d} f(x) dx | ≤ V^{HK}(f) DN*(x1, . . . , xN).

Furthermore, this bound is sharp in the sense that, for every {x1, . . . , xN} ⊆ [0, 1)^d and every ε > 0, there exists f : [0, 1]^d → R with V^{HK}(f) = 1 such that

    | (1/N) Σ_{i=1}^{N} f(xi) − ∫_{[0,1]^d} f(x) dx | > DN*(x1, . . . , xN) − ε.

Halton’s sequence: [37]


Sobol′ sequence: [91]
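Halton's sequence is simple enough to sketch here: coordinate j is the radical-inverse (van der Corput) sequence in the j-th prime base. The code below is my illustration, not taken from [37]:

```python
def van_der_corput(n, base):
    """n-th element of the van der Corput sequence in the given base:
    reflect the base-`base` digits of n about the radix point."""
    q, bk = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        q += digit * bk
        bk /= base
    return q

def halton(n_points, bases=(2, 3)):
    """First n_points elements of the Halton sequence, dimension len(bases)."""
    return [[van_der_corput(i, b) for b in bases]
            for i in range(1, n_points + 1)]
```

In base 2 the sequence begins 1/2, 1/4, 3/4, 1/8, . . . , filling the unit interval with low discrepancy; pairing coprime bases gives the multi-dimensional sequence.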

Bibliography
At Warwick, Monte Carlo integration and related topics are covered in the mod-
ule ST407 Monte Carlo Methods. See also Robert & Casella [81] for a survey
of MC methods in statistics.
Orthogonal polynomials for quadrature formulas can be found in Section 25.4
of Abramowitz & Stegun [1]. Gautschi’s general monograph [33] on orthogonal
polynomials covers applications to Gaussian quadrature in Section 3.1. The
article [107] compares the Gaussian and Clenshaw–Curtis quadrature rules and
explains their similar accuracy in many circumstances.


Smolyak recursion was introduced in [90].

Exercises
Exercise 9.1. Determine the weights for the open Newton–Cotes quadrature
formula. (Cf. Proposition 9.6.)

Exercise 9.2 (Takahasi–Mori (tanh–sinh) Quadrature [101]). Consider a definite
integral over [−1, 1] of the form ∫_{−1}^{1} f(x) dx. Employ a change of variables x = ϕ(t) := tanh( (π/2) sinh(t) ) to convert this to an integral over the real line. Let h > 0 and K ∈ N, and approximate this integral over R using 2K + 1 points equally spaced from −Kh to Kh to derive a quadrature rule

    ∫_{−1}^{1} f(x) dx ≈ Qh,K(f) := Σ_{k=−K}^{K} wk f(xk),

where

    xk := tanh( (π/2) sinh(kh) )    and    wk := ( (π/2) h cosh(kh) ) / cosh²( (π/2) sinh(kh) ).

How are these nodes distributed in [−1, 1]? If f is bounded, then what rate of
decay does f ◦ ϕ have? Hence, why is excluding the nodes xk with |k| > K a
reasonable approximation?




Chapter 10


Sensitivity Analysis and
Model Reduction

Doubt is not a pleasant condition, but certainty is an absurd one.

Voltaire
The topic of this chapter is sensitivity analysis, which may be broadly understood as understanding how f(x1, . . . , xn) depends upon variations not only in the xi individually, but also upon combined or correlated effects among the xi.

10.1 Model Reduction for Linear Models



Suppose that the model mapping inputs x ∈ Rn to outputs y = f (x) ∈ Rm is


actually a linear map, and so can be represented by a matrix A ∈ Rm×n . There is
essentially only one method for the dimensional reduction of such linear models,
the singular value decomposition (SVD).
Theorem 10.1 (Singular value decomposition). For every matrix A ∈ Cm×n ,
there exist unitary matrices U ∈ Cm×m and V ∈ Cn×n and a diagonal matrix
Σ ∈ R_{≥0}^{m×n} such that

    A = U Σ V*.
The columns of U are called the left singular vectors of A; the columns

of V are called the right singular vectors of A; and the diagonal entries of
Σ are called the singular values of A. While the singular values are unique,
the singular vectors may fail to be. By convention, the singular values and
corresponding singular vectors are ordered so that the singular values form a
decreasing sequence
σ1 ≥ σ2 ≥ · · · ≥ σmin{m,n} ≥ 0.
Thus, the SVD is a decomposition of A into a sum of rank-1 operators:
    A = U Σ V* = Σ_{j=1}^{min{m,n}} σj uj ⊗ vj = Σ_{j=1}^{min{m,n}} σj uj ⟨vj, · ⟩.

The appeal of the SVD is that it is numerically stable, and that it provides
optimal low-rank approximation of linear operators: if Ak ∈ Rm×n is defined
by
\[
A_k := \sum_{j=1}^{k} \sigma_j u_j \otimes v_j ,
\]
then Ak is the optimal rank-k approximation to A in the sense that
\[
\| A - A_k \|_2 = \min \bigl\{ \| A - X \|_2 \bigm| X \in \mathbb{R}^{m \times n} \text{ and } \operatorname{rank}(X) \leq k \bigr\},
\]


where k · k2 denotes the operator 2-norm on matrices.
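This Eckart–Young optimality property is easy to check numerically. The following Python sketch (the random test matrix and seed are illustrative, not from the text) builds Ak by truncating the SVD and verifies that the 2-norm error equals the first discarded singular value:

```python
import numpy as np

def best_rank_k(A, k):
    """Optimal rank-k approximation of A in the operator 2-norm, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep the k leading singular triples: sum_{j<=k} sigma_j u_j <v_j, .>
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(42)
A = rng.standard_normal((6, 4))
A2 = best_rank_k(A, 2)

s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2, ord=2)   # Eckart-Young: this equals sigma_3 = s[2]
```

The agreement of `err` with `s[2]` is exactly the minimality statement above, specialized to k = 2.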
Chapter 11 contains an important application of the SVD to the analysis of sample data from random variables, a discrete variant of the Karhunen–Loève expansion. Simply put, when A is a matrix whose columns are independent samples from some stochastic process (random vector), the SVD of A is the ideal way to fit a linear structure to those data points. One may consider nonlinear fitting and dimensionality reduction methods in the same way, and

this is known as manifold learning: see, for instance, the IsoMap algorithm of
Tenenbaum & al. [104].

10.2 Derivatives
A natural first way to understand the dependence of f (x1 , . . . , xn ) upon x1 , . . . , xn
near some nominal point x∗ = (x∗1 , . . . , x∗n ) is to estimate the partial derivatives
of f at x∗ , i.e. to approximate
\[
\frac{\partial f}{\partial x_i}(x^*) := \lim_{h \to 0} \frac{f(x_1^*, \dots, x_i^* + h, \dots, x_n^*) - f(x^*)}{h}.
\]

Approximate by finite differences: e.g.
\[
\frac{\partial f}{\partial x_i}(x^*) \approx \frac{f(x^* + h e_i) - f(x^*)}{h}
\quad \text{or} \quad
\frac{\partial f}{\partial x_i}(x^*) \approx \frac{f(x^* + h e_i) - f(x^* - h e_i)}{2h}
\]
for some small step size h > 0, where ei denotes the ith standard basis vector. Ultimately this boils down to polynomial approximation f ≈ p, f ′ ≈ p′ .
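For instance, the central difference scheme can be implemented in a few lines of Python; the test function and the step size h in the sketch below are illustrative choices, not from the text:

```python
import numpy as np

def partial_derivatives(f, x_star, h=1e-6):
    """Estimate all partial derivatives of f at x_star by central differences."""
    x_star = np.asarray(x_star, dtype=float)
    grad = np.empty_like(x_star)
    for i in range(x_star.size):
        e_i = np.zeros_like(x_star)
        e_i[i] = h                  # perturb only the i-th coordinate
        grad[i] = (f(x_star + e_i) - f(x_star - e_i)) / (2.0 * h)
    return grad

# illustrative model: f(x) = x_1^2 + 3 x_1 x_2, so grad f(1, 2) = (8, 3)
g = partial_derivatives(lambda x: x[0]**2 + 3 * x[0] * x[1], [1.0, 2.0])
```

Since the central difference is exact for quadratics up to rounding error, `g` recovers the gradient essentially to machine precision here.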

10.3 McDiarmid Diameters


This section considers an ‘L∞ -type’ sensitivity index that measures the sen-
sitivity of a function of n variables or parameters to variations in those vari-
ables/parameters individually.

Definition 10.2. The ith McDiarmid subdiameter of $f \colon \prod_{i=1}^{n} \mathcal{X}_i \to \mathbb{K}$ is
\[
D_i[f] := \sup \Bigl\{ |f(x) - f(y)| \Bigm| x, y \in \prod_{j=1}^{n} \mathcal{X}_j \text{ and } x_j = y_j \text{ for } j \neq i \Bigr\}
\]
\[
\phantom{D_i[f]} = \sup \Bigl\{ |f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \Bigm| x_j \in \mathcal{X}_j \text{ for } j = 1, \dots, n, \ x_i' \in \mathcal{X}_i \Bigr\}.
\]
The McDiarmid diameter of f is
\[
D[f] := \sqrt{ \sum_{i=1}^{n} D_i[f]^2 }.
\]

Remark 10.3. Note that although the two definitions of Di [f ] given above are
obviously mathematically equivalent, they are very different from a computa-
tional point of view: the first formulation is ‘obviously’ a constrained optimiza-
tion problem in 2n variables with n − 1 constraints (i.e. ‘difficult’), whereas
the second formulation is ‘obviously’ an unconstrained optimization problem in
n + 1 variables (i.e. ‘easy’).

Lemma 10.4. For each j = 1, . . . , n, Dj [ · ] is a semi-norm on the space of


bounded functions f : X → K, as is D[ · ].


Proof. Exercise 10.1.

The McDiarmid subdiameters and diameter are useful not only as sensitivity
indices, but also for providing a rigorous upper bound on deviations of a function
of independent random variables from its mean value:

Theorem 10.5 (McDiarmid's bounded differences inequality). Let X = (X1 , . . . , Xn ) be any random variable with independent components taking values in $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, and let f : X → R be absolutely integrable with respect to the law of X and have finite McDiarmid diameter D[f ]. Then, for any t ≥ 0,
\begin{align*}
\mathbb{P}\bigl[ f(X) \geq \mathbb{E}[f(X)] + t \bigr] &\leq \exp\biggl( -\frac{2t^2}{D[f]^2} \biggr), \\
\mathbb{P}\bigl[ f(X) \leq \mathbb{E}[f(X)] - t \bigr] &\leq \exp\biggl( -\frac{2t^2}{D[f]^2} \biggr), \\
\mathbb{P}\bigl[ |f(X) - \mathbb{E}[f(X)]| \geq t \bigr] &\leq 2 \exp\biggl( -\frac{2t^2}{D[f]^2} \biggr).
\end{align*}

Corollary 10.6 (Hoeffding's inequality). Let X = (X1 , . . . , Xn ) be a random variable with independent components, taking values in the cuboid $\prod_{i=1}^{n} [a_i, b_i]$. Let $S_n := \frac{1}{n} \sum_{i=1}^{n} X_i$. Then, for any t ≥ 0,
\[
\mathbb{P}\bigl[ S_n - \mathbb{E}[S_n] \geq t \bigr] \leq \exp\biggl( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \biggr),
\]
and similarly for deviations below, and either side, of the mean.

McDiarmid’s and Hoeffding’s inequalities are just two examples of a broad


family of inequalities known as concentration of measure inequalities. Roughly

put, the concentration of measure phenomenon, which was first noticed by Lévy
[61], is the fact that a function of a high-dimensional random variable with many
independent (or weakly correlated) components has its values overwhelmingly
concentrated about the mean (or median). An inequality such as McDiarmid’s
provides a rigorous certification criterion: to be sure that f (X) will deviate
above its mean by more than t with probability no greater than ε ∈ [0, 1], it
sufficies to show that  
2t2
exp − ≤ε
D[f ]2
i.e. r
2
D[f ] ≤ t .
log ε−1

Experimental effort then revolves around determining E[f (X)] and D[f ]; given
those ingredients, the certification criterion is mathematically rigorous. That
said, it is unlikely to be the optimal rigorous certification criterion, because Mc-
Diarmid’s inequality is not guaranteed to be sharp. The calculation of optimal
probability inequalities is considered in Chapter 14.
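As a worked illustration (not an example from the text): for the empirical mean f (x) = (x1 + · · · + xn )/n on [0, 1]^n, changing any single coordinate moves f by at most 1/n, so Di [f ] = 1/n and D[f ] = n^{-1/2}, and the one-sided McDiarmid bound exp(−2t²/D[f ]²) = exp(−2nt²) can be evaluated directly:

```python
import math

def mcdiarmid_tail(D, t):
    """McDiarmid bound on P[f(X) >= E f(X) + t] for a function with diameter D."""
    return math.exp(-2.0 * t**2 / D**2)

n, t = 100, 0.1
D = 1.0 / math.sqrt(n)        # diameter of the empirical mean on [0, 1]^n
bound = mcdiarmid_tail(D, t)  # equals exp(-2 n t^2)
```

With n = 100 and t = 0.1 this gives exp(−2) ≈ 0.135, rigorously certifying that the sample mean exceeds its expectation by more than 0.1 with probability at most 13.6%.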
To prove McDiarmid’s inequality first requires a lemma bounding the moment-
generating function of a random variable:

Lemma 10.7 (Hoeffding's lemma). Let X be a random variable with mean zero taking values in [a, b]. Then, for t ≥ 0,
\[
\mathbb{E}[e^{tX}] \leq \exp\biggl( \frac{t^2 (b - a)^2}{8} \biggr).
\]

Proof. By the convexity of the exponential function, for each x ∈ [a, b],
\[
e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.
\]
Therefore, applying the expectation operator and using the fact that E[X] = 0,
\[
\mathbb{E}[e^{tX}] \leq \frac{b}{b - a} e^{ta} - \frac{a}{b - a} e^{tb} =: e^{\varphi(t)}.
\]
Observe that φ(0) = 0, φ′ (0) = 0, and $\varphi''(t) \leq \tfrac{1}{4}(b - a)^2$. Hence, by Taylor's theorem (and since exp is an increasing and convex function),
\[
\mathbb{E}[e^{tX}] \leq \exp\biggl( 0 + 0 t + \frac{(b - a)^2}{4} \frac{t^2}{2} \biggr) = \exp\biggl( \frac{t^2 (b - a)^2}{8} \biggr),
\]
as claimed.

Proof of McDiarmid’s inequality (Theorem 10.5). The proof uses the proper-
ties of conditional expectation outlined in Example 3.14. Let Fi be the σ-
algebra generated by X1 , . . . , Xi , and define random variables Z0 , . . . , Zn by
Zi := E[f (X)|Fi ]. Note that Z0 = E[f (X)] and Zn = f (X). Now consider the
conditional increment (Zi − Zi−1 )|Fi−1 . First observe that

E[Zi − Zi−1 |Fi−1 ] = 0,

so that the sequence (Zi )i≥0 is a martingale. Secondly, observe that


Li ≤ Zi − Zi−1 |Fi−1 ≤ Ui ,

where

\begin{align*}
L_i &:= \inf_{\ell} \mathbb{E}[f(X) \mid \mathcal{F}_{i-1}, X_i = \ell] - \mathbb{E}[f(X) \mid \mathcal{F}_{i-1}], \\
U_i &:= \sup_{u} \mathbb{E}[f(X) \mid \mathcal{F}_{i-1}, X_i = u] - \mathbb{E}[f(X) \mid \mathcal{F}_{i-1}].
\end{align*}

Since Ui − Li ≤ Di [f ], Hoeffding’s lemma implies that


\[
\mathbb{E}\bigl[ e^{s(Z_i - Z_{i-1})} \bigm| \mathcal{F}_{i-1} \bigr] \leq e^{s^2 D_i[f]^2 / 8}.
\]

Hence, for any s ≥ 0,
\begin{align*}
\mathbb{P}[f(X) - \mathbb{E}[f(X)] \geq t]
&= \mathbb{P}\bigl[ e^{s(f(X) - \mathbb{E}[f(X)])} \geq e^{st} \bigr] \\
&\leq e^{-st} \, \mathbb{E}\bigl[ e^{s(f(X) - \mathbb{E}[f(X)])} \bigr] && \text{(Markov's inequality)} \\
&= e^{-st} \, \mathbb{E}\bigl[ e^{s \sum_{i=1}^{n} (Z_i - Z_{i-1})} \bigr] && \text{(telescoping sum)} \\
&= e^{-st} \, \mathbb{E}\bigl[ \mathbb{E}\bigl[ e^{s \sum_{i=1}^{n} (Z_i - Z_{i-1})} \bigm| \mathcal{F}_{n-1} \bigr] \bigr] && \text{(tower rule)} \\
&= e^{-st} \, \mathbb{E}\bigl[ e^{s \sum_{i=1}^{n-1} (Z_i - Z_{i-1})} \, \mathbb{E}\bigl[ e^{s(Z_n - Z_{n-1})} \bigm| \mathcal{F}_{n-1} \bigr] \bigr] && \text{($Z_0, \dots, Z_{n-1}$ are $\mathcal{F}_{n-1}$-measurable)} \\
&\leq e^{-st} e^{s^2 D_n[f]^2/8} \, \mathbb{E}\bigl[ e^{s \sum_{i=1}^{n-1} (Z_i - Z_{i-1})} \bigr]
\end{align*}
by the first part of the proof. Repeating this argument a further n − 1 times shows that
\[
\mathbb{P}[f(X) - \mathbb{E}[f(X)] \geq t] \leq \exp\biggl( -st + \frac{s^2}{8} D[f]^2 \biggr).
\]
The expression on the right-hand side is minimized by s = 4t/D[f ]², which yields the first of McDiarmid's inequalities, and the others follow easily.
10.4 ANOVA/HDMR Decompositions
The topic of this section is a variance-based decomposition of a function of n vari-
ables that goes by various names such as the analysis of variance (ANOVA), the
functional ANOVA, and the high-dimensional model representation (HDMR).
As before, let (Xi , Fi , µi ) be a probability space for i = 1, . . . , n, and let (X , F , µ) be the product space. Write N = {1, . . . , n}, and consider an (F -measurable) function of interest f : X → R. Bearing in mind that in practical applications n may be large (10³ or more), it is of interest to efficiently identify
• which of the xi contribute in the most dominant ways to the variations in f (x1 , . . . , xn ),
• how the effects of multiple xi are cooperative or competitive with one
another,
• and hence construct a surrogate model for f that uses a lower-dimensional
set of input variables, by using only those that give rise to dominant effects.
The idea is to write f (x1 , . . . , xn ) as a sum of the form
\[
f(x_1, \dots, x_n) = f_\emptyset + \sum_{i=1}^{n} f_{\{i\}}(x_i) + \sum_{1 \leq i < j \leq n} f_{\{i,j\}}(x_i, x_j) + \cdots = \sum_{I \subseteq N} f_I(x_I).
\]

Experience suggests that ‘typical real-world systems’ f exhibit only low-order


cooperativity in the effects of the input variables x1 , . . . , xn . That is, the terms
fI with |I| ≫ 1 are typically small, and a good approximation of f is given by,
say, a second-order expansion,
\[
f(x_1, \dots, x_n) \approx f_\emptyset + \sum_{i=1}^{n} f_{\{i\}}(x_i) + \sum_{1 \leq i < j \leq n} f_{\{i,j\}}(x_i, x_j).
\]

Note, however, that low-order cooperativity does not necessarily imply that there is a small set of significant variables (it is possible that f{i} is large for most i ∈ {1, . . . , n}), nor does it say anything about the linearity or non-linearity of the input–output relationship. Furthermore, there are many HDMR-type expansions of the form given above; orthogonality criteria can be used to select a particular HDMR representation.

RS-HDMR / ANOVA. A long-established decomposition of this type is the


analysis of variance (ANOVA) or random sampling HDMR (RS-HDMR) de-
composition. Let


\[
f_\emptyset(x) := \int_{\mathcal{X}} f \, d\mu,
\]
i.e. f∅ is the orthogonal projection of f onto the one-dimensional space of constant functions, and so it is common to abuse notation and write f∅ ∈ R. For i = 1, . . . , n, let
\[
f_{\{i\}}(x) := \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_{i-1}} \int_{\mathcal{X}_{i+1}} \cdots \int_{\mathcal{X}_n} f \, d\mu_1 \cdots d\mu_{i-1} \, d\mu_{i+1} \cdots d\mu_n - f_\emptyset.
\]

Note that f{i} (x) is actually a function of xi only, and so it is common to abuse
notation and write f{i} (xi ) instead of f{i} (x); f{i} is constant with respect to the
other n − 1 variables. To take this idea further and capture cooperative effects
among two or more xi , for I ⊆ N := {1, . . . , n}, let |I| denote the cardinality of
I and let ∼I denote the relative complement N \ I. For I = (i1 , . . . , i|I| ) ⊆ N
and x ∈ X , define the point xI by xI := (xi1 , . . . , xi|I| ); similar notation like
x∼I , XI , µI &c. should hopefully be self-explanatory.
Definition 10.8. The ANOVA decomposition or RS-HDMR of f is the sum $f = \sum_{I \subseteq N} f_I$, where the functions fI : X → R (or, by abuse of notation, fI : XI → R) are defined recursively by
\begin{align*}
f_\emptyset(x) &:= \int_{\mathcal{X}} f(x) \, d\mu(x), \\
f_I(x_I) &:= \int_{\mathcal{X}_{\sim I}} \biggl( f(x) - \sum_{J \subsetneq I} f_J(x_J) \biggr) dx_{\sim I}
= \int_{\mathcal{X}_{\sim I}} f(x) \, dx_{\sim I} - \sum_{J \subsetneq I} f_J(x_J).
\end{align*}
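This recursion can be carried out symbolically for a toy model. The sketch below (the function f and the uniform measure on [0, 1]² are illustrative choices, not from the text) computes the RS-HDMR components of f (x1 , x2 ) = x1 + x2 + x1 x2 :

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1 + x2 + x1 * x2   # illustrative model on [0, 1]^2 with uniform measure

# f_empty: integrate out every variable
f0 = sp.integrate(f, (x1, 0, 1), (x2, 0, 1))

# first-order components: integrate out the other variable, subtract f_empty
f1 = sp.integrate(f, (x2, 0, 1)) - f0
f2 = sp.integrate(f, (x1, 0, 1)) - f0

# second-order component: subtract all strict-subset components
f12 = sp.expand(f - f0 - f1 - f2)
```

Here f∅ = 5/4, f{1} (x1 ) = (3/2)(x1 − 1/2), f{2} (x2 ) = (3/2)(x2 − 1/2), and f{1,2} = (x1 − 1/2)(x2 − 1/2); each non-constant component integrates to zero in each of its own variables, and distinct components are mutually orthogonal.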

Theorem 10.9 (ANOVA). Let f ∈ L²(X , µ) have variance $\sigma^2 := \| f - f_\emptyset \|_{L^2}^2$. Then
1. whenever i ∈ I, $\int_{\mathcal{X}_i} f_I(x) \, d\mu_i(x_i) = 0$;
2. whenever I ≠ J, $\int_{\mathcal{X}} f_I(x) f_J(x) \, d\mu(x) = 0$;
3. $\sigma^2 = \sum_{I \subseteq N} \sigma_I^2$, where
\[
\sigma_\emptyset^2 := 0, \qquad \sigma_I^2 := \int f_I(x)^2 \, dx.
\]

Proof. 1. First suppose that I = {i}. Then
\[
\int_{\mathcal{X}_i} f_I \, d\mu_i = \int_{\mathcal{X}_i} \biggl( \int_{\mathcal{X}_{\sim\{i\}}} f \, d\mu_{\sim\{i\}} - f_\emptyset \biggr) d\mu_i = \int_{\mathcal{X}} f \, d\mu - f_\emptyset = 0.
\]
Now suppose for an induction that |I| = k and that $\int_{\mathcal{X}_j} f_J \, d\mu_j = 0$ for all j ∈ J whenever |J| < |I|. Then
\[
\int_{\mathcal{X}_i} f_I \, d\mu_i = \int_{\mathcal{X}_i} \biggl( \int_{\mathcal{X}_{\sim I}} f \, d\mu_{\sim I} - \sum_{J \subsetneq I} f_J \biggr) d\mu_i
= \int_{\mathcal{X}_i} \int_{\mathcal{X}_{\sim I}} f \, d\mu_{\sim I} \, d\mu_i - \sum_{\substack{J \subsetneq I \\ i \notin J}} \int_{\mathcal{X}_i} f_J \, d\mu_i
\]
FINISH ME!!!

2. FINISH ME!!!

3. This follows immediately from the orthogonality relation
\[
\int_{\mathcal{X}} f_I(x) f_J(x) \, d\mu(x) = \langle f_I , f_J \rangle_{L^2(\mu)} = 0 \quad \text{whenever } I \neq J.
\]

Cut-HDMR. In Cut-HDMR, an expansion is performed with respect to a reference point x̄ ∈ X :
\begin{align*}
f_\emptyset(x) &:= f(\bar{x}), \\
f_{\{i\}}(x) &:= f(\bar{x}_1, \dots, \bar{x}_{i-1}, x_i, \bar{x}_{i+1}, \dots, \bar{x}_n) - f_\emptyset(x) \\
&\equiv f(x_i, \bar{x}_{\sim\{i\}}) - f_\emptyset(x), \\
f_{\{i,j\}}(x) &:= f(\bar{x}_1, \dots, \bar{x}_{i-1}, x_i, \bar{x}_{i+1}, \dots, \bar{x}_{j-1}, x_j, \bar{x}_{j+1}, \dots, \bar{x}_n) - f_{\{i\}}(x) - f_{\{j\}}(x) - f_\emptyset(x) \\
&\equiv f(x_{\{i,j\}}, \bar{x}_{\sim\{i,j\}}) - f(x_i, \bar{x}_{\sim\{i\}}) - f(x_j, \bar{x}_{\sim\{j\}}) + f_\emptyset(x), \\
f_I(x) &:= f(x_I, \bar{x}_{\sim I}) - \sum_{J \subsetneq I} f_J(x).
\end{align*}

Note that a component function fI of a Cut-HDMR expansion vanishes at any


x ∈ X that has a component in common with x̄, i.e.

fI (x) = 0 whenever xi = x̄i for some i ∈ I.

Hence,
fI (x)fJ (x) = 0 whenever xk = x̄k for some k ∈ I ∪ J.
Indeed, this orthogonality relation defines the Cut-HDMR expansion.
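The Cut-HDMR recursion is equally mechanical to carry out. Continuing with an illustrative toy model (not an example from the text), the component functions through the reference point x̄ = (0, 0) are:

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1 + x2 + x1 * x2               # illustrative model
xbar = {x1: 0, x2: 0}               # reference point (0, 0)

f0 = f.subs(xbar)                   # f_empty = f(xbar)
f1 = f.subs({x2: 0}) - f0           # f_{1}(x) = f(x1, xbar_2) - f_empty
f2 = f.subs({x1: 0}) - f0           # f_{2}(x) = f(xbar_1, x2) - f_empty
f12 = sp.expand(f - f1 - f2 - f0)   # subtract all strict-subset components
```

Here f∅ = 0, f{1} = x1 , f{2} = x2 and f{1,2} = x1 x2 ; note that f{1,2} indeed vanishes whenever x1 = x̄1 or x2 = x̄2 , as asserted above.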

HDMR Projectors. Decomposition of a function f into an HDMR expansion


can be seen as the application of a suitable sequence of orthogonal projection op-
erators, and hence HDMR provides an orthogonal decomposition of L2 ([0, 1]n ).
However, in contrast to orthogonal decompositions such as Fourier and polyno-
mial chaos expansions, in which L2 is decomposed into an infinite direct sum of
finite-dimensional subspaces, the ANOVA/HDMR decomposition is a decompo-
sition of L2 into a finite direct sum of infinite-dimensional subspaces.
To be more precise, let F be any vector space of measurable functions from
X to R. The obvious candidate for the projector P∅ is the orthogonal projection
P∅ : F → F∅ , where


\[
\mathcal{F}_\emptyset := \{ f \in \mathcal{F} \mid f(x) \equiv a \text{ for some } a \in \mathbb{R} \text{ and all } x \in [0,1]^n \}
\]
is the space of constant functions. For i = 1, . . . , n, P{i} : F → F{i} , where
\[
\mathcal{F}_{\{i\}} := \Bigl\{ f \in \mathcal{F} \Bigm| f \text{ is independent of } x_j \text{ for } j \neq i \text{ and } \int_0^1 f(x) \, dx_i = 0 \Bigr\}
\]
and, for ∅ ≠ I ⊆ N , PI : F → FI , where
\[
\mathcal{F}_I := \Bigl\{ f \in \mathcal{F} \Bigm| f \text{ is independent of } x_j \text{ for } j \notin I \text{ and, for } i \in I, \int_0^1 f(x) \, dx_i = 0 \Bigr\}.
\]
These linear operators PI are idempotent, commutative and mutually orthogonal, i.e.
\[
P_I P_J f = P_J P_I f =
\begin{cases}
P_I f, & \text{if } I = J, \\
0, & \text{if } I \neq J,
\end{cases}
\]
and form a resolution of the identity
\[
\sum_{I \subseteq N} P_I f = f.
\]
Thus, the space of functions F decomposes as the direct sum $\mathcal{F} = \bigoplus_{I \subseteq N} \mathcal{F}_I$, and this direct sum is orthogonal when F is a Hilbert subspace of L²(X , µ).

Sobol′ Sensitivity Indices. The decomposition of the variance given by an


HDMR / ANOVA decomposition naturally gives rise to a set of sensitivity
indices for ranking the most important input variables and their cooperative
effects. An obvious (and naïve) assessment of the relative importance of the variables xI is the variance component σI², or the normalized contribution σI²/σ². However, this measure neglects the contributions of those xJ with J ⊆ I, or those xJ such that J has some indices in common with I. With this in mind, the Sobol′ sensitivity indices are defined as follows:
Definition 10.10. Given an HDMR decomposition of a function f of n variables, the lower and upper Sobol′ sensitivity indices of I ⊆ N are, respectively,
\[
\underline{\tau}_I^2 := \sum_{J \subseteq I} \sigma_J^2, \quad \text{and} \quad \overline{\tau}_I^2 := \sum_{J \cap I \neq \emptyset} \sigma_J^2.
\]
The normalized lower and upper Sobol′ sensitivity indices of I ⊆ N are, respectively,
\[
\underline{s}_I^2 := \underline{\tau}_I^2 / \sigma^2, \quad \text{and} \quad \overline{s}_I^2 := \overline{\tau}_I^2 / \sigma^2.
\]

Since $\sum_{I \subseteq N} \sigma_I^2 = \sigma^2 = \| f - f_\emptyset \|_{L^2}^2$, it follows immediately that, for each I ⊆ N ,
\[
0 \leq \underline{s}_I^2 \leq \overline{s}_I^2 \leq 1.
\]
Note, however, that while the ANOVA theorem guarantees that $\sigma^2 = \sum_{I \subseteq N} \sigma_I^2$, in general Sobol′ indices satisfy no such additivity relation:
\[
1 \neq \sum_{I \subseteq N} \underline{s}_I^2 < \sum_{I \subseteq N} \overline{s}_I^2 \neq 1.
\]
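A hedged numerical illustration (the model f (x1 , x2 ) = x1 + x2 + x1 x2 with uniform measure on [0, 1]² is an illustrative choice, not from the text): the variance components are σ{1}² = σ{2}² = 3/16 and σ{1,2}² = 1/144, so σ² = 55/144, and the lower and upper indices for I = {1} come out as 27/55 and 28/55:

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1 + x2 + x1 * x2

# ANOVA components with respect to the uniform measure on [0, 1]^2
f0 = sp.integrate(f, (x1, 0, 1), (x2, 0, 1))
f1 = sp.integrate(f, (x2, 0, 1)) - f0
f2 = sp.integrate(f, (x1, 0, 1)) - f0
f12 = sp.expand(f - f0 - f1 - f2)

var = lambda g: sp.integrate(g**2, (x1, 0, 1), (x2, 0, 1))  # sigma_I^2
v1, v2, v12 = var(f1), var(f2), var(f12)
sigma2 = v1 + v2 + v12

lower = v1 / sigma2           # normalized lower index: only J = {1} contributes
upper = (v1 + v12) / sigma2   # normalized upper index: all J with J intersecting {1}
```

The gap between the lower and upper index of {1} is exactly the interaction contribution σ{1,2}²/σ².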


Bibliography
Detailed treatment of the singular value decomposition can be found in any text
on (numerical) linear algebra, such as that of Trefethen & Bau [108].
McDiarmid’s inequality appears in [65], although the underlying martingale
results go back to Hoeffding [39] and Azuma [5]. Ledoux [59] and Ledoux &
Talagrand [60] give more general presentations of the concentration-of-measure

phenomenon, including geometrical considerations such as isoperimetric inequal-
ities.
In the statistical literature, the analysis of variance (ANOVA) originates
with Fisher & Mackenzie [32]. The ANOVA decomposition was generalized by
Hoeffding [38] to functions in L2 ([0, 1]d ) for d ∈ N; for d = ∞, see Owen [75].
That generalization can easily be applied to L2 functions on any product do-
main, and leads to the functional ANOVA of Stone [96]. In the mathematical
chemistry literature, the HDMR (with its obvious connections to ANOVA) was
popularized by Rabitz & al. [3, 78]. The presentation of ANOVA/HDMR in
this chapter draws upon those references and the presentations of Beccacece &
Borgonovo [7] and Hooker [40].
Sobol′ indices were introduced by Sobol′ in [92], and HDMR by Sobol′ in [93].

Exercises
Exercise 10.1. Prove Lemma 10.4. That is, show that, for each j = 1, . . . , n, the
McDiarmid subdiameter Dj [ · ] is a semi-norm on the space of bounded functions
f : X → K, as is the McDiarmid diameter D[ · ]. What are the null-spaces of
these semi-norms?
Exercise 10.2. Let f : [−1, 1]2 → R be a function of two variables. Sketch

the vanishing sets of the component functions of f in a Cut-HDMR expansion


through x̄ = (0, 0). Do the same exercise for f : [−1, 1]3 → R and x̄ = (0, 0, 0),
taking particular care with second-order terms like f{1,2} .

Chapter 11


Spectral Expansions

The mark of a mature, psychologically

healthy mind is indeed the ability to live
with uncertainty and ambiguity, but only
as much as there really is.

Julian Baggini
This chapter and its sequels consider several spectral methods for uncertainty quantification. At their core, these are orthogonal decomposition methods in which a random variable or stochastic process (usually the solution of interest) over a probability space (Θ, F , µ) is expanded with respect to an appropriate orthogonal basis of L2 (Θ, µ; R). This chapter lays the foundations by considering

spectral expansions in general, starting with the Karhunen–Loève biorthogonal


decomposition, and continuing with orthogonal polynomial bases for L2 (Θ, µ; R)
and the resulting polynomial chaos decompositions.

11.1 Karhunen–Loève Expansions


Fix a compact domain Ω ⊆ Rd (which could be thought of as ‘space’, ‘time’, or
a general parameter space) and a probability space (Θ, F , µ). The Karhunen–
Loève expansion of a stochastic process U : Ω × Θ → R is a particularly nice

spectral decomposition, in that it decomposes U in a biorthogonal fashion, i.e.


in terms of components that are both orthogonal over the parameter domain Ω
and the probability space Θ.
To be more precise, consider a stochastic process U : Ω × Θ → R such that
• for all x ∈ Ω, U (x) ∈ L2 (Θ, µ; R);
• for all x ∈ Ω, Eµ [U (x)] = 0;
• the covariance function CU (x, y) := Eµ [U (x)U (y)] is a well-defined con-
tinuous function of x, y ∈ Ω.

Remark 11.1. 1. The condition that U is a zero-mean process is not a serious


restriction; if U is not a zero-mean process, then simply consider Ũ defined
by Ũ (x, θ) := U (x, θ) − Eµ [U (x)].

2. It is common in practice to see the covariance function interpreted as providing some information on the correlation length of the process U . That is, CU (x, y) depends only upon ‖x − y‖: for some function g : [0, ∞) → [0, ∞), CU (x, y) = g(‖x − y‖). A typical such g is g(r) = exp(−r/r0 ), and the constant r0 encodes how similar values of U at nearby points of Ω are expected to be; when the correlation length r0 is small, the field U has dissimilar values near to one another, and so is rough; when r0 is large, the field U has only similar values near to one another, and so is more smooth.


Define the covariance operator of U , also denoted by CU : L2 (Ω, dx; R) → L2 (Ω, dx; R), by
\[
(C_U f)(x) := \int_\Omega C_U(x, y) f(y) \, dy.
\]

Now let {en | n ∈ N} be an orthonormal basis of L2 (Ω, dx; R) consisting of eigenfunctions of CU , with corresponding eigenvalues {λn | n ∈ N}, i.e.
\[
\int_\Omega C_U(x, y) e_n(y) \, dy = \lambda_n e_n(x)
\quad \text{and} \quad
\int_\Omega e_m(x) e_n(x) \, dx = \delta_{mn}.
\]

Definition 11.2. Let X be a first-countable topological space. A function K : X × X → R is called a Mercer kernel if
1. K is continuous;
2. K is symmetric, i.e. K(x, x′ ) = K(x′ , x) for all x, x′ ∈ X ; and
3. K is positive semi-definite in the sense that, for all choices of finitely many points x1 , . . . , xn ∈ X , the Gram matrix
\[
G := \begin{pmatrix}
K(x_1, x_1) & \cdots & K(x_1, x_n) \\
\vdots & \ddots & \vdots \\
K(x_n, x_1) & \cdots & K(x_n, x_n)
\end{pmatrix}
\]
is positive semi-definite, i.e. satisfies ξ · Gξ ≥ 0 for all ξ ∈ Rn .


Theorem 11.3 (Mercer). Let X be a first-countable topological space equipped with a complete Borel measure µ. Let K : X × X → R be a Mercer kernel. If x ↦ K(x, x) lies in L1 (X , µ; R), then there is an orthonormal basis {en }n∈N of L2 (X , µ; R) consisting of eigenfunctions of the operator
\[
f \mapsto \int_{\mathcal{X}} K(\,\cdot\,, y) f(y) \, d\mu(y)
\]
with non-negative eigenvalues {λn }n∈N . Furthermore, the eigenfunctions corresponding to non-zero eigenvalues are continuous,
\[
K(x, y) = \sum_{n \in \mathbb{N}} \lambda_n e_n(x) e_n(y),
\]
and this series converges absolutely, and uniformly over compact subsets of X .

Theorem 11.4 (Karhunen–Loève). Under the above assumptions on U , its covariance function and its covariance operator, U can be written as
\[
U = \sum_{n \in \mathbb{N}} Z_n e_n ,
\]
where the {en }n∈N are orthonormal eigenfunctions of the covariance operator CU , the corresponding eigenvalues {λn }n∈N are non-negative, the convergence of the series is in L2 (Θ, µ; R) and uniform in x ∈ Ω, with
\[
Z_n = \int_\Omega U(x) e_n(x) \, dx.
\]
Furthermore, the random variables Zn are centred, uncorrelated, and have variance λn :
\[
\mathbb{E}_\mu[Z_n] = 0, \quad \text{and} \quad \mathbb{E}_\mu[Z_m Z_n] = \lambda_n \delta_{mn}.
\]
Proof. Since it is continuous, the covariance function is a Mercer kernel. Hence, by Mercer's theorem, there is an orthonormal basis {en }n∈N of L2 (Ω, dx; R) consisting of eigenfunctions of the covariance operator with non-negative eigenvalues {λn }n∈N . In this basis, the covariance function has the representation
\[
C_U(x, y) = \sum_{n \in \mathbb{N}} \lambda_n e_n(x) e_n(y).
\]
Write the process U in terms of this basis as
\[
U = \sum_{n \in \mathbb{N}} Z_n e_n ,
\]
where the coefficients Zn are random variables given by orthogonal projection:
\[
Z_n := \int_\Omega U(x) e_n(x) \, dx.
\]
Then
\[
\mathbb{E}_\mu[Z_n] = \mathbb{E}_\mu\biggl[ \int_\Omega U(x) e_n(x) \, dx \biggr] = \int_\Omega \mathbb{E}_\mu[U(x)] e_n(x) \, dx = 0,
\]
and
\begin{align*}
\mathbb{E}_\mu[Z_m Z_n]
&= \mathbb{E}_\mu\biggl[ \int_\Omega U(x) e_m(x) \, dx \int_\Omega U(y) e_n(y) \, dy \biggr] \\
&= \mathbb{E}_\mu\biggl[ \int_\Omega \int_\Omega U(x) e_m(x) U(y) e_n(y) \, dy \, dx \biggr] \\
&= \int_\Omega \int_\Omega \mathbb{E}_\mu[U(x) U(y)] e_m(x) e_n(y) \, dy \, dx \\
&= \int_\Omega \int_\Omega C_U(x, y) e_m(x) e_n(y) \, dy \, dx \\
&= \int_\Omega e_m(x) \int_\Omega C_U(x, y) e_n(y) \, dy \, dx \\
&= \int_\Omega e_m(x) \lambda_n e_n(x) \, dx \\
&= \lambda_n \delta_{mn}.
\end{align*}

Let $S_N := \sum_{n=1}^{N} Z_n e_n \colon \Omega \times \Theta \to \mathbb{R}$. Then, for any x ∈ Ω,
\begin{align*}
\mathbb{E}_\mu\bigl[ |U(x) - S_N(x)|^2 \bigr]
&= \mathbb{E}_\mu[U(x)^2] + \mathbb{E}_\mu[S_N(x)^2] - 2 \mathbb{E}_\mu[U(x) S_N(x)] \\
&= C_U(x, x) + \mathbb{E}_\mu\biggl[ \sum_{n=1}^{N} \sum_{m=1}^{N} Z_n Z_m e_m(x) e_n(x) \biggr] - 2 \mathbb{E}_\mu\biggl[ U(x) \sum_{n=1}^{N} Z_n e_n(x) \biggr] \\
&= C_U(x, x) + \sum_{n=1}^{N} \lambda_n e_n(x)^2 - 2 \mathbb{E}_\mu\biggl[ \sum_{n=1}^{N} \int_\Omega U(x) U(y) e_n(y) e_n(x) \, dy \biggr] \\
&= C_U(x, x) + \sum_{n=1}^{N} \lambda_n e_n(x)^2 - 2 \sum_{n=1}^{N} \int_\Omega C_U(x, y) e_n(y) e_n(x) \, dy \\
&= C_U(x, x) - \sum_{n=1}^{N} \lambda_n e_n(x)^2 \\
&\to 0 \text{ as } N \to \infty, \text{ uniformly in } x, \text{ by Mercer's theorem.}
\end{align*}

Among many possible decompositions of a random process, the Karhunen–
Loève expansion is optimal in the sense that the mean-square error of any trun-
cation of the expansion after finitely many terms is minimal. However, its
utility is limited since the covariance function of the solution process is often
not known a priori. Nevertheless, the Karhunen–Loève expansion provides an
effective means of representing input random processes when their covariance
structure is known, and provides a simple method for sampling Gaussian mea-
sures on Hilbert spaces, which is a necessary step in the implementation of the
methods outlined in Chapter 6.
Example 11.5. Suppose that C : H → H is a self-adjoint, positive-definite,

nuclear operator on a Hilbert space H and let m ∈ H. Let (λk , ek )k∈N be


a sequence of orthonormal eigenpairs for C, ordered by decreasing eigenvalue
λk . Let ξ1 , ξ2 , . . . be IID samples from N (0, 1). Then, by the Karhunen–Loève
theorem,
\[
X := m + \sum_{k=1}^{\infty} \sqrt{\lambda_k} \, \xi_k e_k
\]
is an H-valued random variable with distribution N (m, C). Therefore, a finite sum of the form $m + \sum_{k=1}^{K} \sqrt{\lambda_k} \, \xi_k e_k$ for large K is a reasonable approximation to a draw from N (m, C); this is the procedure used to generate the sample paths in Figure 11.1.
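For the measure used in Figure 11.1, the eigenpairs of the covariance operator −(d²/dx²)⁻¹ with Dirichlet boundary conditions on [0, 1] are available in closed form, namely ek (x) = √2 sin(kπx) with λk = (kπ)⁻². Assuming these, a sample path can be generated by truncating the Karhunen–Loève sum; in the Python sketch below the grid resolution and the truncation level K = 100 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 501)      # spatial grid on [0, 1]
m = x * (1.0 - x)                   # mean path m(x) = x(1 - x)
K = 100                             # truncation level of the KL expansion

k = np.arange(1, K + 1)
lam = 1.0 / (np.pi * k) ** 2                        # eigenvalues of (-d^2/dx^2)^{-1}
e = np.sqrt(2.0) * np.sin(np.pi * np.outer(x, k))   # eigenfunctions sqrt(2) sin(k pi x)

xi = rng.standard_normal(K)         # i.i.d. N(0, 1) coefficients
U = m + e @ (np.sqrt(lam) * xi)     # one approximate sample path from N(m, C)
```

Each call with fresh ξk produces a new path; the homogeneous Dirichlet boundary values U(0) = U(1) = 0 are inherited from the eigenfunctions and the mean path.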

Definition 11.6. A principal component analysis of an RN -valued random vec-


tor U is the Karhunen–Loève expansion of U seen as a stochastic process
U : {1, . . . , N } × Ω → R. It is also known as the discrete Karhunen–Loève
transform, the Hotelling transform, and the proper orthogonal decomposition.
Principal component analysis is often applied to sample data, and is inti-
mately related to the singular value decomposition:
Example 11.7. Let X ∈ RN ×M be a matrix whose columns are M indepen-
dent and identically distributed samples from some probability measure on RN ,

[Figure 11.1 appears here: four panels of sample paths, (a) 10 KL modes, (b) 100 KL modes, (c) 1000 KL modes, (d) 10000 KL modes.]

Figure 11.1: Sample paths of the Gaussian distribution on H01 ([0, 1]) that has mean path m(x) = x(1 − x) and covariance operator −(d²/dx²)⁻¹. Along with the mean path (dashed), six sample paths are shown for Karhunen–Loève expansions using 10, 100, 1000, and 10000 terms.

and assume without loss of generality that the samples have mean zero. The
empirical covariance matrix of the samples is
\[
\widehat{C} := \frac{1}{M^2} X X^\top .
\]

The eigenvalues λn and eigenfunctions en of the Karhunen–Loève expansion are just the eigenvalues and eigenvectors of this matrix Ĉ. Let Λ ∈ RN×N be the diagonal matrix of the eigenvalues λn (which are non-negative, and are assumed to be in decreasing order) and E ∈ RN×N the matrix of corresponding orthonormal eigenvectors, so that Ĉ diagonalizes as
\[
\widehat{C} = E \Lambda E^\top .
\]

The principal component transform of the data X is W := E ⊤ X; this is an or-


thogonal transformation of RN that transforms X to a new coordinate system
in which the greatest component-wise variance comes to lie on the first coordi-
nate (called the first principal component), the second greatest variance on the
second coordinate, and so on.
On the other hand, taking the singular value decomposition of the data
(normalized by the number of samples) yields
\[
\tfrac{1}{M} X = U \Sigma V^\top ,
\]

where U ∈ RN×N and V ∈ RM×M are orthogonal and Σ ∈ RN×M is diagonal with decreasing non-negative diagonal entries (the singular values of (1/M)X). Then
\[
\widehat{C} = U \Sigma V^\top (U \Sigma V^\top)^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma^2 U^\top ,
\]
from which we see that U = E and Σ² = Λ. This is just another instance of the well-known relation that, for any matrix A, the eigenvalues of AA∗ are the squared singular values of A and the eigenvectors of AA∗ are the left singular vectors of A; however, in this context, it also provides an alternative way to compute the principal component transform.
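This equivalence is easy to verify numerically. The following sketch (synthetic mean-zero data; the anisotropic scaling is an illustrative choice) computes the principal components once via the SVD of (1/M)X and once via diagonalization of the empirical covariance, and checks that the eigenvalues agree with the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 500
X = np.diag([3.0, 1.0, 0.1]) @ rng.standard_normal((N, M))
X -= X.mean(axis=1, keepdims=True)     # centre the samples

# route 1: SVD of X/M (numerically stable; X X^T is never formed)
U, s, Vt = np.linalg.svd(X / M, full_matrices=False)
W = U.T @ X                            # principal component transform

# route 2: diagonalize the empirical covariance (1/M^2) X X^T
lam, E = np.linalg.eigh(X @ X.T / M**2)
lam, E = lam[::-1], E[:, ::-1]         # eigh returns ascending order; reverse
```

Both routes produce the same spectrum (Λ = Σ²); the columns of U and E agree up to sign.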


In fact, performing principal component analysis via the singular value de-
composition is numerically preferable to forming and then diagonalizing the
covariance matrix, since the formation of XX ⊤ can cause a disastrous loss of
precision; the classic example of this phenomenon is the Läuchli matrix
\[
\begin{pmatrix}
1 & \varepsilon & 0 & 0 \\
1 & 0 & \varepsilon & 0 \\
1 & 0 & 0 & \varepsilon
\end{pmatrix}
\qquad (0 < \varepsilon \ll 1),
\]

for which taking the singular value decomposition is stable, but forming and
diagonalizing XX ⊤ is unstable.
11.2 Wiener–Hermite Polynomial Chaos
The next section will cover polynomial chaos (PC) expansions in greater gen-
erality, and this section serves as an introductory prelude. In this, the classical
and notationally simplest setting, we consider expansions of a real-valued ran-
dom variable U with respect to a single standard Gaussian random variable

Ξ, using appropriate orthogonal polynomials of Ξ, i.e. the Hermite polynomi-


als. This setting was pioneered by Norbert Wiener, and so it is known as the
Wiener–Hermite polynomial chaos.
To this end, let Ξ ∼ N (0, 1) =: γ be a standard Gaussian random variable.
The probability density function ρΞ : R → R of Ξ with respect to Lebesgue
measure on R is
\[
\frac{d\gamma}{dx}(\xi) = \rho_\Xi(\xi) = \frac{1}{\sqrt{2\pi}} \exp\biggl( -\frac{\xi^2}{2} \biggr).
\]
Now let Hen (ξ) ∈ R[ξ], for n ∈ N0 , be the Hermite polynomials, which are a

system of orthogonal polynomials for the standard Gaussian measure γ.


Lemma 11.8. If $p(x) = \sum_{n \in \mathbb{N}_0} p_n x^n$ is any polynomial or convergent power series, then E[p(Ξ)Hem (Ξ)] = m! pm .

Proof. TO DO!

By the Stone–Weierstrass theorem and the approximability of L2 functions


by continuous ones, the Hermite polynomials form a complete orthogonal basis
of the Hilbert space L2 (R, γ; R) with the inner product
\[
\langle U, V \rangle := \mathbb{E}[U(\Xi) V(\Xi)] \equiv \int_{\mathbb{R}} U(\xi) V(\xi) \rho_\Xi(\xi) \, d\xi.
\]

Let us further extend the ⟨ · , · ⟩ notation for inner products and write ⟨ · ⟩ for expectation with respect to the distribution γ of Ξ. So, for example, the orthogonality relation for the Hermite polynomials reads ⟨Hem Hen ⟩ = n! δmn .

Definition 11.9. Let U ∈ L2 (R, γ; R) be a square-integrable real-valued ran-


dom variable. The Wiener–Hermite polynomial chaos expansion of U with re-
spect to the standard Gaussian Ξ is the expansion of U in the orthogonal basis
{Hen }n∈N0 , i.e.
\[
U = \sum_{n \in \mathbb{N}_0} u_n \mathrm{He}_n(\Xi)
\]
with scalar Wiener–Hermite polynomial chaos coefficients {un }n∈N0 ⊆ R given by
\[
u_n = \frac{\langle U \, \mathrm{He}_n \rangle}{\langle \mathrm{He}_n^2 \rangle} = \frac{1}{n! \sqrt{2\pi}} \int_{-\infty}^{\infty} U(x) \, \mathrm{He}_n(x) \, e^{-x^2/2} \, dx.
\]

Remark 11.10. From the perspective of sampling of random variables, this means that if we wish to draw a sample from the distribution of U , it is enough to draw a sample ξ from the standard normal distribution and then evaluate the series $\sum_{n \in \mathbb{N}_0} u_n \mathrm{He}_n(\xi)$ at that ξ.

Note that, in particular, since He0 ≡ 1,
\[
\mathbb{E}[U] = \langle \mathrm{He}_0 , U \rangle = \sum_{n \in \mathbb{N}_0} u_n \langle \mathrm{He}_0 , \mathrm{He}_n \rangle = u_0 ,
\]
so the expected value of U is simply its 0th PC coefficient. Similarly, its variance is a weighted sum of the squares of its PC coefficients:
\begin{align*}
\mathbb{V}[U] &= \mathbb{E}\bigl[ |U - \mathbb{E}[U]|^2 \bigr] \\
&= \mathbb{E}\biggl[ \Bigl| \sum_{n \in \mathbb{N}} u_n \mathrm{He}_n \Bigr|^2 \biggr] && \text{since } \mathbb{E}[U] = u_0 \\
&= \sum_{m, n \in \mathbb{N}} u_m u_n \langle \mathrm{He}_m , \mathrm{He}_n \rangle \\
&= \sum_{n \in \mathbb{N}} u_n^2 \langle \mathrm{He}_n^2 \rangle && \text{by Hermitian orthogonality.}
\end{align*}

Example 11.11. Let X ∼ N (µ, σ²) be a real-valued Gaussian random variable with mean µ ∈ R and variance σ² ≥ 0. Let Y := eX ; since log Y is normally distributed, the non-negative-valued random variable Y is said to be a log-normal random variable. As usual, let Ξ ∼ N (0, 1) be the standard Gaussian random variable; clearly X = µ + σΞ and Y = eµ eσΞ in law. The Wiener–Hermite expansion of Y as $\sum_{k \in \mathbb{N}_0} y_k \mathrm{He}_k(\Xi)$ has coefficients
\begin{align*}
y_k &= \frac{\mathbb{E}\bigl[ e^{\mu} e^{\sigma\Xi} \mathrm{He}_k(\Xi) \bigr]}{\mathbb{E}\bigl[ \mathrm{He}_k(\Xi)^2 \bigr]} \\
&= \frac{e^{\mu}}{k!} \, \mathbb{E}\bigl[ e^{\sigma\Xi} \mathrm{He}_k(\Xi) \bigr] \\
&= \frac{e^{\mu} e^{\sigma^2/2} \sigma^k}{k!} && \text{by Hermitian orthogonality,}
\end{align*}
i.e.
\[
Y = e^{\mu + \sigma^2/2} \sum_{k \in \mathbb{N}_0} \frac{\sigma^k}{k!} \mathrm{He}_k(\Xi).
\]
From this expansion it can be seen that E[Y ] = eµ+σ²/2 and
\[
\mathbb{V}[Y] = e^{2\mu + \sigma^2} \sum_{k \in \mathbb{N}} \biggl( \frac{\sigma^k}{k!} \biggr)^2 \langle \mathrm{He}_k^2 \rangle = e^{2\mu + \sigma^2} \bigl( e^{\sigma^2} - 1 \bigr).
\]
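These coefficients can be checked numerically. NumPy's `numpy.polynomial.hermite_e` module implements the probabilists' polynomials Hen together with the associated Gauss quadrature (for the weight e^{−x²/2}, so the weights must be divided by √(2π) to produce expectations); the values µ = 0.3, σ = 0.5 and the quadrature order below are illustrative:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, exp

mu, sigma = 0.3, 0.5
x, w = He.hermegauss(60)           # nodes/weights for the weight exp(-x^2/2)
w = w / np.sqrt(2.0 * np.pi)       # normalize: now sum(w * g(x)) ~ E[g(Xi)]

Y = np.exp(mu + sigma * x)         # the log-normal Y evaluated at the nodes

y_quad = [np.sum(w * Y * He.hermeval(x, [0] * k + [1])) / factorial(k)
          for k in range(6)]       # <Y He_k> / <He_k^2>, with <He_k^2> = k!
y_exact = [exp(mu + sigma**2 / 2) * sigma**k / factorial(k) for k in range(6)]
```

The quadrature values agree with the closed form e^{µ+σ²/2} σ^k / k! to high accuracy, since the integrand is entire and the quadrature converges rapidly.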
Of course, in practice, the series expansion $U = \sum_{n \in \mathbb{N}_0} u_n \mathrm{He}_n(\Xi)$ must be truncated after finitely many terms, and so it is natural to ask about the quality of the approximation
\[
U \approx U^P := \sum_{n=0}^{P} u_n \mathrm{He}_n(\Xi).
\]
Since the Hermite polynomials form a complete orthogonal basis for L2 (R, γ; R), the standard results about orthogonal approximations in Hilbert spaces apply. In particular, by Corollary 3.19, the truncation error U − U^P is orthogonal to the space from which U^P was chosen, i.e. span{He0 , He1 , . . . , HeP }, and tends to zero in mean square.
Lemma 11.12. The truncation error U − U^P is orthogonal to the subspace span{He0 , He1 , . . . , HeP } of L2 (R, ρΞ dx; R). Furthermore, limP→∞ U^P = U in L2 (R, γ; R).

Proof. Let $V := \sum_{m=0}^{P} v_m \mathrm{He}_m$ be any element of the subspace of L2 (R, γ; R) spanned by the Hermite polynomials of degree at most P. Then
\begin{align*}
\langle U - U^P , V \rangle
&= \Bigl\langle \Bigl( \sum_{n > P} u_n \mathrm{He}_n \Bigr) \Bigl( \sum_{m=0}^{P} v_m \mathrm{He}_m \Bigr) \Bigr\rangle \\
&= \sum_{\substack{n > P \\ m \in \{0, \dots, P\}}} u_n v_m \langle \mathrm{He}_n \mathrm{He}_m \rangle = 0.
\end{align*}
Hence, by Pythagoras' theorem,
\[
\| U \|_{L^2(\gamma)}^2 = \| U^P \|_{L^2(\gamma)}^2 + \| U - U^P \|_{L^2(\gamma)}^2 ,
\]
and hence ‖U − U^P‖L²(γ) → 0 as P → ∞.

11.3 Generalized PC Expansions


The ideas of polynomial chaos can be generalized well beyond the setting in
which the stochastic germ Ξ is a standard Gaussian random variable, or even a
vector Ξ = (Ξ1 , . . . , Ξd ) of mutually orthogonal Gaussian random variables.
Let Ξ = (Ξ1 , . . . , Ξd ) be an Rd -valued random variable with independent (and hence orthogonal) components. As usual, let R[ξ1 , . . . , ξd ] denote the ring of all polynomials in ξ1 , . . . , ξd with real coefficients, and let R[ξ1 , . . . , ξd ]≤p denote those polynomials of total degree at most p ∈ N0 . Let Γp ⊆ R[ξ1 , . . . , ξd ]≤p be a collection of polynomials that are mutually orthogonal and orthogonal to R[ξ1 , . . . , ξd ]≤p−1 , and let Γ̃p := span Γp . This yields the orthogonal decomposition
\[
L^2(\Theta, \mu; \mathbb{R}) = \bigoplus_{p \in \mathbb{N}_0} \tilde{\Gamma}_p .
\]

It is important to note that there is a lack of uniqueness in these basis polyno-


mials whenever d ≥ 2: each choice of ordering of multi-indices α ∈ Nd0 can yield
a different orthogonal basis of L2 when the Gram–Schmidt procedure is applied

to the monomials ξ α .
Note that (as usual, assuming separability) the L2 space over the product
probability space (Θ, F , µ) is isomorphic to the Hilbert space tensor product of
the L2 spaces over the marginal probability spaces:

\[
L^2(\Theta_1 \times \cdots \times \Theta_d , \mu_1 \otimes \cdots \otimes \mu_d ; \mathbb{R}) = \bigotimes_{i=1}^{d} L^2(\Theta_i , \mu_i ; \mathbb{R}),
\]

from which we see that an orthogonal system of multivariate polynomials for


L2 (Θ, µ; R) can be found by taking products of univariate orthogonal polynomi-
als for the marginal spaces L2 (Θi , µi ; R). A generalized polynomial chaos (gPC)
expansion of a random variable or stochastic process U is simply the expansion
of U with respect to such a complete orthogonal polynomial basis of L2 (Θ, µ).

Example 11.13. Let Ξ = (Ξ1 , Ξ2 ) be such that Ξ1 and Ξ2 are independent (and
hence orthogonal) and such that Ξ1 is a standard Gaussian random variable
and Ξ2 is uniformly distributed on [−1, 1]. Hence, the univariate orthogonal
polynomials for Ξ1 are the Hermite polynomials Hen and the univariate orthog-
onal polynomials for Ξ2 are the Legendre polynomials Len . Then a system of

orthogonal polynomials for Ξ up to total degree 3 is

    Γ0 = {1},
    Γ1 = {He1 (ξ1 ), Le1 (ξ2 )} = {ξ1 , ξ2 },
    Γ2 = {He2 (ξ1 ), He1 (ξ1 )Le1 (ξ2 ), Le2 (ξ2 )} = {ξ1² − 1, ξ1 ξ2 , (3ξ2² − 1)/2},
    Γ3 = {He3 (ξ1 ), He2 (ξ1 )Le1 (ξ2 ), He1 (ξ1 )Le2 (ξ2 ), Le3 (ξ2 )}
       = {ξ1³ − 3ξ1 , ξ1² ξ2 − ξ2 , (3ξ1 ξ2² − ξ1 )/2, (5ξ2³ − 3ξ2 )/2}.

Rather than have the orthogonal basis polynomials have two indices, one for
the degree p and one within each set Γp , it is useful and conventional to order
the basis polynomials using a single index k ∈ N0 . It is common in practice
to take Ψ0 = 1 and to have the polynomial degree be (weakly) increasing with
respect to the new index k. So, to continue Example 11.13, one could take

    Ψ0 (ξ) = 1,
    Ψ1 (ξ) = ξ1 ,
    Ψ2 (ξ) = ξ2 ,
    Ψ3 (ξ) = ξ1² − 1,
    Ψ4 (ξ) = ξ1 ξ2 ,
    Ψ5 (ξ) = (3ξ2² − 1)/2,
    Ψ6 (ξ) = ξ1³ − 3ξ1 ,
    Ψ7 (ξ) = ξ1² ξ2 − ξ2 ,
    Ψ8 (ξ) = (3ξ1 ξ2² − ξ1 )/2,
    Ψ9 (ξ) = (5ξ2³ − 3ξ2 )/2.
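It is straightforward to check this orthogonality numerically; the following sketch (quadrature orders are arbitrary choices) forms the ten polynomials Ψ0, . . . , Ψ9 above and verifies that their Gram matrix with respect to the law of Ξ is diagonal:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from numpy.polynomial.legendre import leggauss, legval

# quadrature for the product measure N(0,1) x Uniform[-1,1]
xh, wh = hermegauss(20)
wh = wh / wh.sum()
xl, wl = leggauss(20)
wl = wl / wl.sum()

def He(n, x):
    c = np.zeros(n + 1); c[n] = 1.0
    return hermeval(x, c)

def Le(n, x):
    c = np.zeros(n + 1); c[n] = 1.0
    return legval(x, c)

# (Hermite degree, Legendre degree) of Psi_0, ..., Psi_9
degs = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2),
        (3, 0), (2, 1), (1, 2), (0, 3)]
X1, X2 = np.meshgrid(xh, xl, indexing="ij")
W = np.outer(wh, wl)
Psi = [He(a, X1) * Le(b, X2) for a, b in degs]

gram = np.array([[np.sum(Pi * Pj * W) for Pj in Psi] for Pi in Psi])
offdiag = gram - np.diag(np.diag(gram))
print(np.max(np.abs(offdiag)))   # ~ 0: the ten polynomials are mutually orthogonal
```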


Truncation of gPC Expansions. Suppose that a gPC expansion U = Σ_{k∈N0} u_k Ψ_k is truncated, i.e. we consider

    U_P = Σ_{k=0}^{P} u_k Ψ_k .

It is an easy exercise to show that the truncation error U − U P is orthogonal to


span{Ψ0 , . . . , ΨP }. It is also worth considering how many terms there are in such

a truncated gPC expansion. Suppose that the stochastic germ Ξ has dimension
d (i.e. has d independent components), and we work only with polynomials
of total degree at most p. The total number of coefficients in the truncated
expansion U P is
    P + 1 = (d + p)! / (d! p!) .
That is, the total number of gPC coefficients that must be calculated grows
combinatorially as a function of the number of input random variables and the
degree of polynomial approximation. Such rapid growth limits the usefulness of
gPC expansions for practical applications where d and p are much greater than,
say, 10.
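The count P + 1 = (d + p)!/(d!p!) is simply a binomial coefficient, so the growth is easy to tabulate (a small illustrative helper; the function name is ours):

```python
from math import comb

def n_gpc_terms(d: int, p: int) -> int:
    """Number of multivariate polynomials of total degree <= p in d variables."""
    return comb(d + p, p)

# d = 2, p = 3 reproduces the ten polynomials Psi_0, ..., Psi_9 of Example 11.13
print(n_gpc_terms(2, 3))    # 10
print(n_gpc_terms(10, 10))  # 184756 -- already impractical for many solvers
```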

Remark 11.14. It is possible to adapt the notion of a gPC expansion to the


situation of dependent random variables, but there are some complications. In
summary, suppose that Ξ = (Ξ1 , . . . , Ξd ) taking values in Θ = Θ1 × · · · × Θd
has joint law µ, which is not necessarily a product measure. Nevertheless, let
µi denote the marginal law of Ξi , i.e.

µi (Ei ) := µ(Θ1 × · · · × Θi−1 × Ei × Θi+1 × · · · × Θd ).

To simplify matters further, assume that µ (resp. µi ) has Lebesgue density ρ (resp. ρi ). Now let φ_p^{(i)} (ξi ) ∈ R[ξi ], p ∈ N0 , be univariate orthogonal polynomials for µi . The chaos function associated to a multi-index α is defined to be

    Ψ_α (ξ) := √( ρ1 (ξ1 ) · · · ρd (ξd ) / ρ(ξ) ) φ_{α1}^{(1)} (ξ1 ) · · · φ_{αd}^{(d)} (ξd ).

It can be shown that the family {Ψ_α | α ∈ N0^d } is a complete orthonormal basis for L2 (Θ, µ; R), so we have the usual series expansion U = Σ_α u_α Ψ_α . Note, however, that with the exception of Ψ0 = 1, the functions Ψ_α are not polynomials. Nevertheless, we still have the usual properties that the truncation error is orthogonal to the approximation subspace, and

    E_µ [U ] = u_0 ,    V_µ [U ] = Σ_{α≠0} u_α² ⟨Ψ_α²⟩.

Expansions of Random Variables. Consider a real-valued random variable U , which we expand in terms of a stochastic germ ξ. Let U_P be a truncated gPC expansion of U :

    U_P (ξ) = Σ_{k=0}^{P} u_k Ψ_k (ξ),

where the polynomials Ψ_k are orthogonal with respect to the law of ξ, and with the usual convention that Ψ0 = 1. A first, easy, observation is that

    E[U_P ] = ⟨Ψ0 , U_P ⟩ = Σ_{k=0}^{P} u_k ⟨Ψ0 , Ψ_k ⟩ = u_0 ,

so the expected value of U_P is simply its 0th gPC coefficient. Similarly, its variance is a weighted sum of the squares of its gPC coefficients:

    E[ |U_P − E[U_P ]|² ] = E[ ( Σ_{k=1}^{P} u_k Ψ_k )² ]
                         = Σ_{k,ℓ=1}^{P} u_k u_ℓ ⟨Ψ_k , Ψ_ℓ ⟩
                         = Σ_{k=1}^{P} u_k² ⟨Ψ_k²⟩.
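Both identities are easy to confirm numerically for a Hermite chaos; in the sketch below the coefficient values are arbitrary illustrative choices, and the coefficient-based formulas are checked against direct quadrature:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

x, w = hermegauss(40)
w = w / w.sum()                        # quadrature for the standard Gaussian law

u = np.array([2.0, 0.5, -0.3, 0.1])    # gPC coefficients u_0, ..., u_3
UP = hermeval(x, u)                    # U_P(xi) = sum_k u_k He_k(xi)

# mean and variance read off from the coefficients, using <He_k^2> = k!
mean_coeff = u[0]
var_coeff = sum(u[k] ** 2 * math.factorial(k) for k in range(1, len(u)))

# the same quantities computed directly by quadrature
mean_quad = np.sum(UP * w)
var_quad = np.sum((UP - mean_quad) ** 2 * w)

print(mean_coeff, mean_quad)   # both 2.0
print(var_coeff, var_quad)     # both 0.49
```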

Expansions of Random Vectors. Similar remarks can be made for an R^d -valued random vector U having truncated gPC expansion

    U_P (ξ) = Σ_{k=0}^{P} u_k Ψ_k (ξ),

with coefficients u_k = (u_{1k} , . . . , u_{dk} ) ∈ R^d for each k ∈ {0, . . . , P}. As before,

    E[U_P ] = ⟨Ψ0 , U_P ⟩ = Σ_{k=0}^{P} u_k ⟨Ψ0 , Ψ_k ⟩ = u_0 ∈ R^d ,

and the covariance matrix C ∈ R^{d×d} of U_P is given by

    C = Σ_{k=1}^{P} u_k u_k^⊤ ⟨Ψ_k²⟩ ,    i.e.    C_ij = Σ_{k=1}^{P} u_{ik} u_{jk} ⟨Ψ_k²⟩.

Expansions of Stochastic Processes. Consider now a square-integrable stochastic process U : Ω × Θ → R; that is, for each x ∈ Ω, U (x, · ) ∈ L2 (Θ, µ) is a real-valued random variable, and, for each θ ∈ Θ, U ( · , θ) ∈ L2 (Ω, dx) is a scalar field on the domain Ω. Recall that

    L2 (Θ, µ; R) ⊗ L2 (Ω, dx; R) ≅ L2 (Θ × Ω, µ ⊗ dx; R) ≅ L2 (Θ, µ; L2 (Ω, dx)).

As usual, take {Ψ_k | k ∈ N0 } to be an orthogonal polynomial basis of L2 (Θ, µ; R), ordered (weakly) by total degree, with Ψ0 = 1. A gPC expansion of the random field U is an L2 -convergent expansion of the form

    U (x, ξ) = Σ_{k∈N0} u_k (x) Ψ_k (ξ),

which in practice is truncated to

    U (x, ξ) ≈ U_P (x, ξ) = Σ_{k=0}^{P} u_k (x) Ψ_k (ξ).

The functions u_k : Ω → R are called the stochastic modes of the process U . The stochastic mode u_0 : Ω → R is the mean field of U :

    E[U (x)] = E[U_P (x)] = u_0 (x).
The variance of the field at x ∈ Ω is

    V[U (x)] = Σ_{k≥1} u_k (x)² ⟨Ψ_k²⟩ ≈ Σ_{k=1}^{P} u_k (x)² ⟨Ψ_k²⟩,

whereas, for two points x, y ∈ Ω,

    E[U (x)U (y)] = ⟨ ( Σ_{k∈N0} u_k (x) Ψ_k (ξ) ) ( Σ_{ℓ∈N0} u_ℓ (y) Ψ_ℓ (ξ) ) ⟩
                 = Σ_{k∈N0} u_k (x) u_k (y) ⟨Ψ_k²⟩
                 ≈ Σ_{k=0}^{P} u_k (x) u_k (y) ⟨Ψ_k²⟩,

FINISH ME!!!

Figure 11.1: ...

FINISH ME!!!


Figure 11.2: ...

and so

    C_U (x, y) = Σ_{k∈N} u_k (x) u_k (y) ⟨Ψ_k²⟩
               ≈ Σ_{k=1}^{P} u_k (x) u_k (y) ⟨Ψ_k²⟩ = C_{U_P} (x, y).
At least when dim Ω is low, it is very common to see the behaviour of a stochastic
field U (or U P ) summarized by plots of the mean field and the variance field,
as in Figure 11.1; when dim Ω = 1, a surface or contour plot of the covariance
field CU (x, y) as in Figure 11.2 can also be informative.
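In computations, these summary fields are assembled directly from the stochastic modes. A minimal sketch for a Hermite chaos follows; the sine-profile modes u_k are made up purely for illustration:

```python
import math
import numpy as np

xs = np.linspace(0.0, 1.0, 101)
# hypothetical stochastic modes u_0, ..., u_3 of a Hermite-chaos field on [0, 1]
modes = [np.sin(np.pi * xs),
         0.30 * np.sin(2 * np.pi * xs),
         0.10 * np.sin(3 * np.pi * xs),
         0.05 * np.sin(4 * np.pi * xs)]
norms = [math.factorial(k) for k in range(len(modes))]   # <He_k^2> = k!

mean_field = modes[0]                                    # E[U(x)] = u_0(x)
var_field = sum(m ** 2 * n for m, n in zip(modes[1:], norms[1:]))
cov_field = sum(np.outer(m, m) * n for m, n in zip(modes[1:], norms[1:]))

# the diagonal of the covariance field is exactly the variance field
print(np.max(np.abs(np.diag(cov_field) - var_field)))    # ~ 0 up to roundoff
```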

Bibliography

Spectral expansions in general are covered in Chapter 2 of the monograph of Le


Maı̂tre & Knio [58], and Chapter 5 of the book of Xiu [117].
The Karhunen–Loève expansion bears the names of Karhunen [48] and Loève
[62], but KL-type series expansions of stochastic processes were considered ear-
lier by Kosambi [52]. Lemma 11.12, that the truncation error in a PC expansion
is orthogonal to the approximation subspace, is nowadays a simple corollary of
standard results in Hilbert spaces, but is an observation that appears to have
first first been made in the stochastic context by Cameron & Martin [17]. The
application of Wiener–Hermite PC expansions to engineering systems was pop-

ularized by Ghanem & Spanos [34]; the extension to gPC and the connection
with the Askey scheme is due to Xiu & Karniadakis [118].
The extension of generalized polynomial chaos to arbitrary dependency among
the components of the stochastic germ, as in Remark 11.14, is due to Soize &
Ghanem [94].

Exercises
Exercise 11.1. Use the Karhunen–Loève expansion to generate samples from a
Gaussian random field U on [−1, 1] (i.e., for each x ∈ [−1, 1], U (x) is a Gaussian
random variable) with covariance function

1. C_U (x, y) = exp(−|x − y|²/a²),
2. C_U (x, y) = exp(−|x − y|/a), and
3. C_U (x, y) = (1 + |x − y|²/a²)^{−1}
for various values of a > 0. Plot and comment upon your results, particularly
the smoothness of the fields produced.
Exercise 11.2. Consider the negative Laplacian operator L := −d²/dx² acting on real-valued functions on the interval [0, 1], with zero boundary conditions. Show that the eigenvalues µ_n and eigenfunctions e_n of L are

    µ_n = (πn)² ,    e_n (x) = sin(πnx).

Hence show that C := L−1 has the same eigenfunctions with eigenvalues λn =
(πn)−2 . Hence, using the Karhunen–Loève theorem, generate figures similar to
Figure 11.1 for your choice of mean field m : [0, 1] → R.
Exercise 11.3. Do the analogue of Exercise 11.2 for the negative Laplacian operator L := −∂²/∂x² − ∂²/∂y² acting on real-valued functions on the square [0, 1]², again with zero boundary conditions.
Exercise 11.4. Show that the eigenvalues λ_n and eigenfunctions e_n of the exponential covariance function C(x, y) = exp(−|x − y|/a) on [−b, b] are given by

    λ_n = 2a / (1 + a² w_n²),   if n ∈ 2Z,
    λ_n = 2a / (1 + a² v_n²),   if n ∈ 2Z + 1,

    e_n (x) = sin(w_n x) / √( b − sin(2 w_n b)/(2 w_n ) ),   if n ∈ 2Z,
    e_n (x) = cos(v_n x) / √( b + sin(2 v_n b)/(2 v_n ) ),   if n ∈ 2Z + 1,

where w_n and v_n solve the transcendental equations

    a w_n + tan(w_n b) = 0,      for n ∈ 2Z,
    1 − a v_n tan(v_n b) = 0,    for n ∈ 2Z + 1.

Hence, using the Karhunen–Loève theorem, generate sample paths from the
Gaussian measure with covariance kernel C and your choice of mean path.

Chapter 12


Stochastic Galerkin Methods

Not to be absolutely certain is, I think,


one of the essential things in rationality.

Am I an Atheist or an Agnostic?
Bertrand Russell
Unlike non-intrusive approaches, which rely on individual realizations to
determine the stochastic model response to random inputs, Galerkin methods
use a formalism of weak solutions, expressed in terms of inner products, to
form systems of governing equations for the solution’s PC coefficients, which
are generally coupled together. They are in essence the extension to a suitable
tensor product Hilbert space of the usual Galerkin formalism that underlies
many theoretical and numerical approaches to PDEs. This chapter is devoted

to the formulation of Galerkin methods of UQ using PC expansions.


Suppose that the relationship between some input data d and the output
(solution) u can be expressed formally as

M(u; d) = 0.

A good model for this kind of set-up is an elliptic boundary value problem on,
say, a bounded, connected domain Ω ⊆ Rn with smooth boundary ∂Ω:

−∇ · (κ(x)∇u(x)) = f (x) for x ∈ Ω, (12.1)


u(x) = 0 for x ∈ ∂Ω.

In this case, the input data d are typically the forcing term f : Ω → R and
the permeability field κ : Ω → Rn×n ; in some cases, the domain Ω itself might
depend upon d, but this introduces additional complications that will not be
considered in this chapter. For a PDE such as this, solutions u are typically
sought in the Sobolev space H01 (Ω) of L2 functions that have a weak derivative
that itself lies in L2 , and that vanish on ∂Ω in the sense of trace. Moreover, it
is usual to seek weak solutions, i.e. u ∈ H01 (Ω) for which the inner product of
(12.1) with any v ∈ H01 (Ω) is an equality. That is, integrating by parts, we seek
u ∈ H01 (Ω) such that

hκ∇u, ∇viL2 (Ω) = hf, viL2 (Ω) for all v ∈ H01 (Ω). (12.2)

On expressing this problem in a chosen basis of H01 (Ω), the column vector [u] of
coefficients of u in this basis turns out to satisfy a matrix-vector equation (i.e. a
system of simultaneous linear equations) of the form [a][u] = [b] for some matrix
[a] determined by the permeability field κ and a column vector [b] determined
by the forcing term f .
In this chapter, after reviewing basic Lax–Milgram theory and Galerkin pro-
jection for problems like (12.1) / (12.2), we consider the situation in which the
input data d are uncertain and are described as a random variable D(ξ). Then
the solution is also a random variable U (ξ) and the model relationship becomes


M(U (ξ); D(ξ)) = 0.

Again, this equation is usually interpreted in a weak sense in a suitable Hilbert


space of H01 (Ω)-valued random variables. If D and U are expanded in some PC
basis, it is natural to ask how the coefficients of U with respect to this PC basis
and a chosen basis of H01 (Ω) are related to one another. It will turn out that,
like in the standard deterministic setting, this problem can be written in the

form of a matrix-vector equation [A][U ] = [B] related to, but more complicated
than, the deterministic problem [a][u] = [b].
12.1 Lax–Milgram Theory and Galerkin Projection
Let H be a real Hilbert space equipped with a bilinear form a : H × H → R.
Given f ∈ H∗ (i.e. a continuous linear functional f : H → R), the associated
weak problem is:

find u ∈ H such that a(u, v) = ⟨f | v⟩ for all v ∈ H. (12.3)



Example 12.1. Let Ω ⊆ Rn be a bounded, connected domain. Let a matrix-


valued function κ : Ω → Rn×n and a scalar-valued function f : Ω → R be given,
and consider the elliptic problem (12.1). The appropriate bilinear form a( · , · )
is defined by

a(u, v) := h−∇ · (κ∇u), viL2 (Ω) = hκ∇u, ∇viL2 (Ω) ,

where the second equality follows from integration by parts when u, v are smooth
functions that vanish on ∂Ω; such functions form a dense subset of the Sobolev
space H01 (Ω). This short calculation motivates two important developments in

the treatment of the PDE (12.1). First, even though the original formulation
(12.1) seems to require the solution u to have two orders of differentiability, the
last line of the above calculation makes sense even if u and v have only one
order of (weak) differentiability, and so we restrict attention to H01 (Ω). Second,
we declare u ∈ H01 (Ω) to be a weak solution of (12.1) if the L2 (Ω) inner product
of (12.1) with any v ∈ H01 (Ω) holds as an equality of real numbers, i.e. if
    −∫_Ω ∇ · (κ(x)∇u(x)) v(x) dx = ∫_Ω f (x) v(x) dx,

i.e. if
a(u, v) = hf, viL2 (Ω) for all v ∈ H01 (Ω).

The existence and uniqueness of solutions of problems like (12.3), under appro-
priate conditions on a (which of course are inherited from appropriate conditions
on κ), is ensured by the Lax–Milgram theorem, which generalizes the Riesz rep-
resentation theorem that any Hilbert space is isomorphic to its dual space.
Theorem 12.2 (Lax–Milgram). Let a be a bilinear form on a Hilbert space H
such that
1. (boundedness) there exists a constant C > 0 such that, for all u, v ∈ H,
|a(u, v)| ≤ Ckukkvk; and
2. (coercivity) there exists a constant c > 0 such that, for all v ∈ H, |a(v, v)| ≥


ckvk2 .
Then for all f ∈ H∗ , there exists a unique u ∈ H such that, for all v ∈ H,
a(u, v) = hf | vi.
Proof. For each u ∈ H, v 7→ a(u, v) is a bounded linear functional on H. So,
by the Riesz representation theorem, given u ∈ H, there is a unique w ∈ H
such that hw, vi = a(u, v). Define Au := w. The map A : H → H is clearly
well-defined. It is also linear: take α1 , α2 ∈ R and u1 , u2 ∈ H:

hA(α1 u1 + α2 u2 ), vi = a(α1 u1 + α2 u2 , v)
= α1 a(u1 , v) + α2 a(u2 , v)
= α1 hAu1 , vi + α2 hAu2 , vi
= hα1 Au1 + α2 Au2 , vi.
A is a bounded map, since
kAuk2 = hAu, Aui = a(u, Au) ≤ CkukkAuk,
so kAuk ≤ Ckuk. Furthermore, A is injective since
kAukkuk ≥ hAu, ui = a(u, u) ≥ ckuk2,

so Au = 0 =⇒ u = 0.
To see that the range R(A) is closed, take a convergent sequence (vn )n∈N in
R(A) that converges to some v ∈ H. Choose un ∈ H such that Aun = vn for
each n ∈ N. The sequence (Aun )n∈N is Cauchy, so
kAun − Aum kkun − um k ≥ |hAun − Aum , un − um i|
= |a(un − um , un − um )|
≥ ckun − um k2 .

So ckun − um k ≤ kvn − vm k → 0. So (un )n∈N is Cauchy and converges to some


u ∈ H. So vn = Aun → Au = v by the continuity (boundedness) of A, so
v ∈ R(A), and so R(A) is closed.
Finally, A is surjective: for, if not, there is an s ∈ H, s 6= 0, such that
s ⊥ R(A). But then
cksk2 ≤ a(s, s) = hs, Asi = 0,
so s = 0, a contradiction.
So, take f ∈ H∗ . There is a unique w ∈ H such that hw, vi = hf | vi for
all v ∈ H. The equation Au = w has a unique solution since A is invertible.
So hAu, vi = hf | vi for all v ∈ H. But hAu, vi = a(u, v). So there is a unique
u ∈ H such that a(u, v) = hf | vi.

Galerkin Projection. Now consider the problem of finding a good finite-dimensional


approximation to u. Fix a subspace V (M) ⊆ H of dimension M . The Galerkin
projection of the weak problem is:

find u(M) ∈ V (M) such that a(u(M) , v (M) ) = hf, v (M) i for all v (M) ∈ V (M) .

Note that if the hypotheses of the Lax–Milgram theorem are satisfied on the full
space H, then they are certainly satisfied on the subspace V (M) , thereby ensuring
the existence and uniqueness of solutions to the Galerkin problem. Note well,
though, that existence of a unique Galerkin solution for each M ∈ N0 does not


imply the existence of a unique weak solution (nor even of any weak solution)
to the full problem; for this, one typically needs to show that the Galerkin
approximations are uniformly bounded and appeal to a Sobolev embedding
theorem to extract a convergent subsequence.
Example 12.3. 1. The Fourier basis {e_k }_{k∈Z} of L2 (S1 , dx; C) defined by

    e_k (x) = exp(ikx) / √(2π) .

For Galerkin projection, one can use the finite-dimensional subspace

    V^(2M+1) := span{e_{−M} , . . . , e_{−1} , e_0 , e_1 , . . . , e_M }

of functions that are band-limited to contain frequencies at most M . In the case of real-valued functions, one can use the functions

    x ↦ cos(kx), for k ∈ N0 ,
    x ↦ sin(kx), for k ∈ N.

2. Fix a partition a = x_0 < x_1 < · · · < x_M = b of a compact interval [a, b] ⊊ R and consider the associated tent functions defined by

    φ_m (x) := 0,                                    if x ≤ a or x ≤ x_{m−1} ;
    φ_m (x) := (x − x_{m−1} )/(x_m − x_{m−1} ),      if x_{m−1} ≤ x ≤ x_m ;
    φ_m (x) := (x_{m+1} − x)/(x_{m+1} − x_m ),       if x_m ≤ x ≤ x_{m+1} ;
    φ_m (x) := 0,                                    if x ≥ b or x ≥ x_{m+1} .

The function φ_m takes the value 1 at x_m and decays linearly to 0 along the two line segments adjacent to x_m . The (M + 1)-dimensional vector space V^(M) := span{φ_0 , . . . , φ_M } consists of all continuous functions on [a, b] that are piecewise affine on the partition, i.e. have constant derivative on each of the open intervals (x_{m−1} , x_m ). The space Ṽ^(M) := span{φ_1 , . . . , φ_{M−1} } consists of the continuous functions that are piecewise affine on the partition and take the value 0 at a and b; hence, Ṽ^(M) is one good choice for a finite-dimensional space to approximate the Sobolev space H01 ([a, b]). More generally, one could consider tent functions associated to any simplicial mesh in R^n .
An important property of the solution u(M) of the Galerkin problem, viewed
as an approximation to the solution u of the original problem, is that the error

u − u(M) is a-orthogonal to the approximation subspace V (M) : for any choice of


v (M) ∈ V (M) ⊆ H,
a(u − u(M) , v (M) ) = a(u, v (M) ) − a(u(M) , v (M) ) = hf, v (M) i − hf, v (M) i = 0.
However, note well that u^(M) is generally not the optimal approximation of u from the subspace V^(M) , i.e.

    ‖u − u^(M)‖ ≠ inf{ ‖u − v^(M)‖ : v^(M) ∈ V^(M) }.

The optimal approximation of u from V^(M) is the orthogonal projection of u onto V^(M) ; if H has an orthonormal basis {e_n } and u = Σ_{n∈N} u^n e_n , then the optimal approximation of u in V^(M) = span{e_1 , . . . , e_M } is Σ_{n=1}^{M} u^n e_n , but this is not generally the same as the Galerkin solution u^(M) . However, the next result, Céa's lemma, shows that u^(M) is a quasi-optimal approximation to u (note that the ratio C/c is always at least 1):
Lemma 12.4 (Céa's lemma). Let a, c and C be as in the statement of the Lax–Milgram theorem. Then the weak solution u ∈ H and the Galerkin solution u^(M) ∈ V^(M) satisfy

    ‖u − u^(M)‖ ≤ (C/c) inf{ ‖u − v^(M)‖ : v^(M) ∈ V^(M) }.
Proof. Exercise 12.2.

Matrix Form. It is helpful to recast the Galerkin problem in matrix form. Let
{φ1 , . . . , φM } be a basis for V (M) . Then u(M) solves the Galerkin problem if
and only if
a(u(M) , φm ) = hf, φm i for m ∈ {1, . . . , M }.
Now expand u^(M) in this basis as u^(M) = Σ_{m=1}^{M} u^m φ_m and insert this into the previous equation:

    a( Σ_{m=1}^{M} u^m φ_m , φ_i ) = Σ_{m=1}^{M} u^m a(φ_m , φ_i ) = ⟨f, φ_i ⟩    for i ∈ {1, . . . , M }.

In other words, the vector of coefficients [u^(M)] = [u^1 , . . . , u^M ]^⊤ ∈ R^M satisfies the matrix equation

    [ a(φ_1 , φ_1 )  . . .  a(φ_M , φ_1 ) ] [ u^1 ]   [ ⟨f, φ_1 ⟩ ]
    [      ...       . . .       ...      ] [  ... ] = [    ...    ]     (12.4)
    [ a(φ_1 , φ_M )  . . .  a(φ_M , φ_M ) ] [ u^M ]   [ ⟨f, φ_M ⟩ ]

where the matrix, the coefficient vector and the right-hand side are denoted [a], [u^(M)] and [b] respectively. The matrix [a] ∈ R^{M×M} is the Gram matrix of the bilinear form a, and is of course a symmetric matrix whenever a is a symmetric bilinear form.
Remark 12.5. In practice the matrix-vector equation [a][u(M) ] = [b] is never
solved by explicitly inverting the Gram matrix [a] to obtain the coefficients um
via [u(M) ] = [a]−1 [b]. Indeed, in many situations the Gram matrix is sparse,
and so inversion methods that take advantage of that sparsity are used; further-
more, for large systems, the methods used are often iterative rather than direct
(e.g. factorization-based).
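As a concrete instance of (12.4), take −u″ = 1 on [0, 1] with zero boundary conditions, discretized with the tent functions of Example 12.3 on a uniform mesh; the Gram matrix a(φ_m, φ_i) = ∫ φ_m′ φ_i′ dx is then the familiar tridiagonal stiffness matrix. (A sketch: the mesh size is an arbitrary choice, and for this particular problem the piecewise-linear Galerkin solution happens to agree with the exact solution u(x) = x(1 − x)/2 at the mesh nodes.)

```python
import numpy as np

M = 50                            # number of interior tent functions phi_1..phi_M
h = 1.0 / (M + 1)                 # uniform mesh width on [0, 1]
nodes = np.linspace(h, 1.0 - h, M)

# Gram ("stiffness") matrix [a]_{im} = a(phi_m, phi_i) = int phi_m' phi_i' dx
A = (2.0 * np.eye(M) - np.eye(M, k=1) - np.eye(M, k=-1)) / h
# load vector [b]_i = <f, phi_i> with f = 1, since int phi_i dx = h
b = h * np.ones(M)

u = np.linalg.solve(A, b)         # coefficients = nodal values of the FEM solution

exact = nodes * (1.0 - nodes) / 2.0
print(np.max(np.abs(u - exact)))  # ~ machine precision: nodally exact here
```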

Lax–Milgram Theory for Banach Spaces. There are many extensions of the
now-classical Lax–Milgram lemma beyond the setting of symmetric bilinear
forms on Hilbert spaces. For example, the following generalization is due to
Kozono & Yanagisawa [53]:
Theorem 12.6 (Kozono–Yanagisawa). Let X be a Banach space, Y a reflexive
Banach space, and a : X × Y → C a bilinear form such that
1. there is a constant M > 0 such that

|a(x, y)| ≤ M kxkX kykY for all x ∈ X , y ∈ Y;


2. the null spaces

NX := {x ∈ X | a(x, y) = 0 for all y ∈ Y} ⊆ X ,


NY := {y ∈ Y | a(x, y) = 0 for all x ∈ X } ⊆ Y,

admit complementary closed subspaces RX and RY such that X = NX ⊕

RX and Y = NY ⊕ RY ;
3. there is a constant C > 0 such that

    ‖x‖_X ≤ C ( sup_{y∈Y} |a(x, y)| / ‖y‖_Y + ‖P_X x‖_X )    for all x ∈ X ,
    ‖y‖_Y ≤ C ( sup_{x∈X} |a(x, y)| / ‖x‖_X + ‖P_Y y‖_Y )    for all y ∈ Y ,

where P_X (resp. P_Y ) is the projection of X (resp. Y) onto N_X (resp. N_Y ) along the direct sum X = N_X ⊕ R_X (resp. Y = N_Y ⊕ R_Y ).

Then, for every f ∈ Y ′ such that hf | yi = 0 for all y ∈ NY , there exists x ∈ X


such that
a(x, y) = hf | yi for all y ∈ Y.
Furthermore, there is a constant C independent of x and f such that kxkX ≤
Ckf kY ′ .

12.2 Stochastic Galerkin Projection


Stochastic Lax–Milgram Theory. The next step is to build appropriate Lax–


Milgram theory and Galerkin projection for stochastic problems, for which a
good prototype is

    −∇ · (κ(θ, x)∇u(θ, x)) = f (θ, x)    for x ∈ Ω,
    u(θ, x) = 0    for x ∈ ∂Ω,

with θ being drawn from some probability space (Θ, F , µ). To that end, we
introduce a stochastic space S, which in the following will be L2 (Θ, µ; R). We
retain also a Hilbert space V in which the deterministic solution u(θ) is sought
for each θ ∈ Θ; implicitly, V is independent of the problem data, or rather of
θ. Thus, the space in which the stochastic solution U is sought is the tensor

product Hilbert space H := V ⊗ S, which is isomorphic to the space L2 (Θ, µ; V)


of square-integrable V-valued random variables.
In terms of bilinear forms, the setup is that of a bilinear-form-on-V-valued
random variable A and a V ′ -valued random variable F . Define a bilinear form
α on H by
    α(X, Y ) := E_µ [A(X, Y )] ≡ ∫_Θ A(θ)( X(θ), Y (θ) ) dµ(θ)

and, similarly, a linear functional β on H by


β(Y ) := Eµ [hF | Y iV ].
Clearly, if α satisfies the boundedness and coercivity assumptions of the Lax–
Milgram theorem on H, then, for every F ∈ L2 (Θ, µ; V ′ ), there is a unique weak
solution U ∈ L2 (Θ, µ; V) satisfying
α(U, Y ) = β(Y ) for all Y ∈ L2 (Θ, µ; V).
A sufficient, but not necessary, condition for α to satisfy the hypotheses of the

Lax–Milgram theorem on H is for A(θ) to satisfy those hypotheses uniformly in
θ on V:
Theorem 12.7 (Stochastic Lax–Milgram theorem). Let (Θ, F , µ) be a probabil-
ity space, and let A be a random variable on Θ, taking values in the space of
symmetric bilinear forms on a Hilbert space V, and satisfying the hypotheses of
the deterministic Lax–Milgram theorem (Theorem 12.2) uniformly with respect
to θ ∈ Θ. Define a bilinear form α on L2 (Θ, µ; V) by
α(X, Y ) := Eµ [A(X, Y )].
Then, for every F ∈ L2 (Θ, µ; V ′ ), there is a unique U ∈ L2 (Θ, µ; V) such that

α(U, Y ) = β(Y ) for all Y ∈ L2 (Θ, µ; V).


Proof. Suppose that A(θ) satisfies the boundedness assumption with constant
C(θ) and the coercivity assumption with constant c(θ). By hypothesis,
    C′ := sup_{θ∈Θ} C(θ)    and    c′ := inf_{θ∈Θ} c(θ)

are both strictly positive and finite. The bilinear form α satisfies, for all X, Y ∈
H,

    α(X, Y ) = E_µ [A(X, Y )]
             ≤ E_µ [ C(θ) ‖X‖_V ‖Y‖_V ]
             ≤ C′ E_µ [‖X‖_V²]^{1/2} E_µ [‖Y‖_V²]^{1/2}
             = C′ ‖X‖_H ‖Y‖_H ,

and

    α(X, X) = E_µ [A(X, X)]
             ≥ E_µ [ c(θ) ‖X‖_V² ]
             ≥ c′ ‖X‖_H² .

Hence, by the deterministic Lax–Milgram theorem applied to the bilinear form


α on the Hilbert space H, for every F ∈ L2 (Θ, µ; V ′ ), there exists a unique
U ∈ L2 (Θ, µ; V) such that

α(U, Y ) = β(Y ) for all Y ∈ L2 (Θ, µ; V).

Remark 12.8. Note, however, that uniform boundedness and coercivity of A are
not necessary for α to be bounded and coercive. For example, the constants c(θ)
and C(θ) may degenerate to 0 or ∞ as θ approaches certain points of the sample
space Θ. Provided that these degeneracies are integrable and yield positive and


finite expected values, this will not ruin the boundedness and coercivity of α.
Indeed, there may be an arbitrarily large (but µ-measure zero) set of θ for which
there is no weak solution u to the deterministic problem

A(θ)(u, v) = hF (θ) | vi for all v ∈ V.

Stochastic Galerkin Projection. Let V^(M) be a finite-dimensional subspace of V, with basis φ_1 , . . . , φ_M . As indicated above, take the stochastic space S to be L2 (Θ, µ; K), which we assume to be equipped with an orthogonal decomposition such as a PC decomposition. Let S^P be a finite-dimensional subspace of S, for example the span of the polynomials of degree at most P. The Galerkin projection of the stochastic problem on H is to find U = Σ_{m,k} u_k^m φ_m ⊗ Ψ_k ∈ V^(M) ⊗ S^P such that

    α(U, V ) = β(V )    for all V ∈ V^(M) ⊗ S^P .

In particular, it suffices to find U that satisfies this condition for each basis

element V = φn ⊗ Ψℓ of V (M) ⊗ S P . Recall that φn ⊗ Ψℓ is the function


(θ, x) 7→ φn (x)Ψℓ (θ).
As before, the Galerkin problem is equivalent to the matrix-vector equation

[α][U ] = [β]

in the basis {φ_m ⊗ Ψ_k | m = 1, . . . , M ; k = 0, . . . , P} of V^(M) ⊗ S^P . An obvious


question is how the Gram matrix [α] ∈ RM(P+1)×M(P+1) is related to the Gram
matrix of the random bilinear form A.
...

...
Suppose that we are already given a linear problem with its deterministic
problem discretized and cast in the matrix form

[A](ξ)U (ξ) = B(ξ)


which, for each fixed realization of the germ ξ, has its solution U (ξ) ∈ V^(M) ≅ R^M . The stochastic Galerkin projection for the stochastic solution U = Σ_{k=0}^{P} u_k Ψ_k ∈ V^(M) ⊗ S^P gives

    Σ_{j=0}^{P} ⟨Ψ_i , [A]Ψ_j ⟩ u_j = ⟨Ψ_i , B⟩    for each i ∈ {0, . . . , P}.

This is equivalent to the (large!) block system

    [ [A]_00  . . .  [A]_0P ] [ u_0 ]   [ b_0 ]
    [   ...   . . .    ...  ] [ ...  ] = [ ...  ] ,     (12.5)
    [ [A]_P0  . . .  [A]_PP ] [ u_P ]   [ b_P ]

where, for 0 ≤ i, j ≤ P,
• [A]_ij := ⟨Ψ_i , [A]Ψ_j ⟩ ∈ R^{M×M} , where [A] ∈ R^{M×M} is the Gram matrix of the random bilinear form A;
• u_i = Σ_{m=1}^{M} u_i^m φ_m ∈ V^(M) is the ith stochastic mode of the solution U ;
• and b_i := ⟨Ψ_i , B⟩ ∈ R^M is the ith stochastic mode of the source term B.
Note that, in general, the stochastic modes u_j of the solution U (and, indeed, the coefficients u_j^m of the stochastic modes in the deterministic basis φ_1 , . . . , φ_M ) are all coupled together through the matrix [A]. This can be a limitation of stochastic Galerkin methods, and will be remarked upon later.
Example 12.9. As a special case, suppose that the random data have no impact
on the differential operator and affect only the right-hand side B. In this case

A(θ) = a for all θ ∈ Θ, and so

    [A]_ij := ⟨Ψ_i , [a]Ψ_j ⟩ = [a] ⟨Ψ_i , Ψ_j ⟩ = [a] δ_ij ⟨Ψ_i²⟩.

Hence, the stochastic Galerkin system, in its matrix form (12.5), becomes

    [ [a]   [0]         . . .  [0]        ] [ u_0 ]   [ b_0 ]
    [ [0]   [a]⟨Ψ_1²⟩   . . .  [0]        ] [ u_1 ] = [ b_1 ]
    [  ...     ...      . . .  [0]        ] [ ...  ]   [ ...  ]
    [ [0]   . . .       [0]    [a]⟨Ψ_P²⟩ ] [ u_P ]   [ b_P ]

Hence the stochastic modes u_j decouple and are given by

    u_j = [a]^{−1} ⟨b, Ψ_j ⟩ / ⟨Ψ_j²⟩ .

The Galerkin Tensor. In contrast to Example 12.9, in which the differential operator is deterministic, we can consider the case in which the matrix [α] has a (truncated) PC expansion

    [α] = Σ_{k=0}^{P} [α]_k Ψ_k

with coefficient matrices [α]_k ∈ R^{M×M} for k ≥ 0. In this case, the blocks [α]_ij are given by

    [α]_ij = ⟨Ψ_i , [α]Ψ_j ⟩ = Σ_{k=0}^{P} [α]_k ⟨Ψ_i , Ψ_j Ψ_k ⟩.
Hence, the Galerkin block system (12.5) is equivalent to

    [ [α]_00  . . .  [α]_0P ] [ u_0 ]   [ b_0 ]
    [   ...   . . .    ...  ] [ ...  ] = [ ...  ] ,     (12.6)
    [ [α]_P0  . . .  [α]_PP ] [ u_P ]   [ b_P ]

where

    b_i := ⟨B, Ψ_i ⟩ / ⟨Ψ_i²⟩ ,
    [α]_ij := Σ_{k=0}^{P} [α]_k C_kji ,
    C_ijk := ⟨Ψ_i Ψ_j Ψ_k ⟩ / ⟨Ψ_k Ψ_k ⟩ .
The rank-3 tensor C_ijk is called the Galerkin tensor:
• it is symmetric in the first two indices: C_ijk = C_jik ;
• this induces symmetry in the problem (12.6): [α]_ij = [α]_ji ;
• and since the Ψ_k are an orthogonal system, many of the (P + 1)³ entries of C_ijk are zero, leading to sparsity for the matrix problem; for example,

    [α]_00 = Σ_{k=0}^{P} [α]_k C_k00 = [α]_0 .

Note that the Galerkin tensor Cijk is determined entirely by the PC system
{Ψk | k ≥ 0}, and so while there is a significant computational cost associated
to evaluating its entries, this is a one-time cost: the Galerkin tensor can be
pre-computed, stored, and then used for many different problems, i.e. many As
and Bs.
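For a concrete germ the tensor is easily pre-computed by quadrature. The sketch below does this for a one-dimensional Hermite basis (the truncation order and quadrature resolution are arbitrary choices) and confirms the symmetry C_ijk = C_jik, the identity C_k00 = δ_k0, and the sparsity:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

P = 5
x, w = hermegauss(30)              # exact for polynomials up to degree 59
w = w / w.sum()

def He(n, x):
    c = np.zeros(n + 1); c[n] = 1.0
    return hermeval(x, c)

H = np.array([He(n, x) for n in range(P + 1)])        # (P+1, n_quad)
norms = (H ** 2 * w).sum(axis=1)                      # <He_k^2> = k!

# C_{ijk} = <He_i He_j He_k> / <He_k He_k>
C = np.einsum("iq,jq,kq,q->ijk", H, H, H, w) / norms[None, None, :]

assert np.allclose(C, C.transpose(1, 0, 2))           # C_{ijk} = C_{jik}
assert np.allclose(C[:, 0, 0], np.eye(P + 1)[0])      # C_{k00} = delta_{k0}
print(np.mean(np.abs(C) < 1e-10))                     # most entries are zero
```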

Example 12.10 (Ordinary differential equations). Consider random variables Z, B ∈ L2 (Θ, µ; R) and the random linear first-order ordinary differential equation

    dU/dt = −ZU,    U (0) = B,
for U : [0, T ] × Θ → R. Let {Ψ_k }_{k∈N0} be an orthogonal basis for L2 (Θ, µ; R) with the usual convention that Ψ0 = 1. Give Z, B and U the chaos expansions Z = Σ_{k∈N0} z_k Ψ_k , B = Σ_{k∈N0} b_k Ψ_k and U (t) = Σ_{k∈N0} u_k (t) Ψ_k respectively. Projecting the evolution equation onto the basis {Ψ_k }_{k∈N0} yields

    E[ (dU/dt) Ψ_k ] = −E[Z U Ψ_k ]    for each k ∈ N0 .

Inserting the chaos expansions for Z and U into this yields, for every k ∈ N0 ,

    E[ ( Σ_{i∈N0} u̇_i (t) Ψ_i ) Ψ_k ] = −E[ ( Σ_{j∈N0} z_j Ψ_j ) ( Σ_{i∈N0} u_i (t) Ψ_i ) Ψ_k ] ,
    i.e.  u̇_k (t) ⟨Ψ_k²⟩ = − Σ_{i,j∈N0} z_j u_i (t) E[Ψ_j Ψ_i Ψ_k ] ,
    i.e.  u̇_k (t) = − Σ_{i,j∈N0} C_ijk z_j u_i (t).

The coefficients u_k form a coupled system of countably many ordinary differential equations. If all the chaos expansions are truncated at order P, then all the above summations over N0 become summations over {0, . . . , P}, yielding a coupled system of P + 1 ordinary differential equations. In matrix form:

    d/dt [u_0 (t), . . . , u_P (t)]^⊤ = A^⊤ [u_0 (t), . . . , u_P (t)]^⊤ ,    [u_0 (0), . . . , u_P (0)]^⊤ = [b_0 , . . . , b_P ]^⊤ ,

where A ∈ R^{(P+1)×(P+1)} has entries A_ik = − Σ_j C_ijk z_j .
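The following sketch carries this out for a Hermite germ with Z = z0 + σ He1(ξ) Gaussian and deterministic initial datum B = 1 (all parameter values are arbitrary choices); the mean u0(t) of the Galerkin solution is compared against the exact mean E[e^{−Zt}] = exp(−z0 t + σ²t²/2):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

P = 10
x, w = hermegauss(60)
w = w / w.sum()

def He(n, x):
    c = np.zeros(n + 1); c[n] = 1.0
    return hermeval(x, c)

H = np.array([He(n, x) for n in range(P + 1)])
norms = (H ** 2 * w).sum(axis=1)
C = np.einsum("iq,jq,kq,q->ijk", H, H, H, w) / norms[None, None, :]

# Z = z0 + sigma*He_1(xi), B = 1, so z = (z0, sigma, 0, ...), u(0) = e_0
z0, sigma, t = 1.0, 0.3, 0.5
z = np.zeros(P + 1); z[0], z[1] = z0, sigma
u_init = np.zeros(P + 1); u_init[0] = 1.0

# du_k/dt = -sum_{i,j} C_{ijk} z_j u_i, i.e. du/dt = M u, M[k,i] = -sum_j C_{ijk} z_j
M = -np.einsum("ijk,j->ki", C, z)

# solve the linear ODE system exactly via the eigendecomposition of M
lam, V = np.linalg.eig(M)
u_t = (V @ np.diag(np.exp(lam * t)) @ np.linalg.solve(V, u_init)).real

exact_mean = np.exp(-z0 * t + 0.5 * (sigma * t) ** 2)
print(u_t[0], exact_mean)      # the two means agree closely
```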


12.3 Nonlinearities
Nonlinearities of various types occur throughout practical problems, and their
treatment is critical in the context of stochastic Galerkin methods, which require
the projection of these nonlinearities onto the finite-dimensional solution spaces.
For example, given the approximation

    U (ξ) ≈ U_P (ξ) = Σ_{k=0}^{P} u_k Ψ_k (ξ),

how does one calculate the coefficients of, say, U (ξ)² or √U (ξ)? The first example, U², is a special case of taking the product of two Galerkin approximations,
and can be resolved using the Galerkin tensor Cijk of the previous section.

Galerkin Products. Consider two random variables U, V ∈ L2 (Θ, µ; R). The product random variable U V is again an element of L2 (Θ, µ; R). The natural question to ask, given expansions U = Σ_{k∈N0} u_k Ψ_k and V = Σ_{k∈N0} v_k Ψ_k , is how to quickly compute the coefficients of U V in terms of {u_k }_{k∈N0} and {v_k }_{k∈N0} — particularly if expansions are truncated to finite precision.


Example 12.11. Suppose that U = Σ_{k=0}^{P} uk Ψk and V = Σ_{k=0}^{P} vk Ψk. Then their product W := UV has the expansion

W = Σ_{i,j=0}^{P} ui vj Ψi Ψj.

Note that, while W is guaranteed to be in L², it is not necessarily in S^P. Nevertheless, if we write W = Σ_{k≥0} wk Ψk, it follows that

wk = ⟨W, Ψk⟩ / ⟨Ψk, Ψk⟩ = Σ_{i,j=0}^{P} ui vj Cijk.

The truncation of the expansion W = Σ_{k≥0} wk Ψk to k = 0, . . . , P is the orthogonal projection of W onto S^P (and hence the L²-closest approximation of W in S^P) and is called the Galerkin product, or pseudo-spectral product, of U and V, denoted U ∗ V.
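In code, the Galerkin product is a single tensor contraction against Cijk. A sketch, assuming for concreteness the Hermite basis for a standard Gaussian germ (the tensor is built by quadrature, and the truncation order P = 4 is an arbitrary choice):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

P = 4
x, w = hermegauss(3 * P)                 # enough nodes for exact triple products
w /= np.sqrt(2 * np.pi)                  # expectation under N(0,1)
H = np.array([hermeval(x, np.eye(P + 1)[k]) for k in range(P + 1)])
C = np.einsum("iq,jq,kq,q->ijk", H, H, H, w) / (w * H ** 2).sum(axis=1)

def galerkin_product(u, v):
    # w_k = sum_{i,j} u_i v_j C_ijk: orthogonal projection of UV onto S^P
    return np.einsum("i,j,ijk->k", u, v, C)
```

As a check: with U = V = ξ (coefficient vectors e1), the product is ξ² = Ψ2 + Ψ0, so the Galerkin product returns the coefficients (1, 0, 1, 0, 0) exactly.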

The fact that multiplication of two random variables can be handled efficiently, albeit with some truncation error, in terms of their expansions in the gPC basis and the Galerkin tensor is very useful: it adds to the list of reasons why one might wish to pre-compute and store the Galerkin tensor for use in many problems.

However, the situation for binary products (and hence squares) is simpler than that for products of three or more factors. Since each Galerkin product incurs a projection onto S^P, the operation ∗ is commutative but not, in general, associative: (U ∗ V) ∗ W need not equal U ∗ (V ∗ W), and so the order in which triple and higher products are evaluated matters.

Galerkin Inversion. Given

U = Σ_{k≥0} uk Ψk ≈ Σ_{k=0}^{P} uk Ψk,

we seek a random variable V = Σ_{k≥0} vk Ψk ≈ Σ_{k=0}^{P} vk Ψk such that U(ξ)V(ξ) = 1 for almost every ξ. The weak interpretation of this desideratum is that U ∗ V = Ψ0. Thus we are led to the following matrix-vector equation for the gPC coefficients of V:

[ Σ_{k=0}^{P} Ck00 uk  · · ·  Σ_{k=0}^{P} CkP0 uk ] [ v0 ]   [ 1 ]
[          ⋮            ⋱             ⋮          ] [ ⋮  ] = [ 0 ]        (12.7)
[ Σ_{k=0}^{P} Ck0P uk  · · ·  Σ_{k=0}^{P} CkPP uk ] [ vP ]   [ ⋮ ]
                                                             [ 0 ]
Naturally, if U (ξ) = 0 for a positive probability set of ξ, then V (ξ) will be
undefined for those same ξ. Furthermore, if U ≈ 0 with ‘too large’ probability,
then V may exist a.e. but fail to be in L2 . Hence, it is not surprising to learn
that while (12.7) has a unique solution whenever the matrix on the left-hand side
is non-singular, the system becomes highly ill-conditioned as the amount of
probability mass near U = 0 increases.
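Numerically, Galerkin inversion is just the solution of this (P+1)×(P+1) linear system. A sketch (again assuming the Hermite basis for a standard Gaussian germ, with C built by quadrature; the example U = 3 + 0.1ξ stays well away from zero, so the system is well-conditioned):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

P = 6
x, w = hermegauss(3 * P)
w /= np.sqrt(2 * np.pi)                  # expectation under N(0,1)
H = np.array([hermeval(x, np.eye(P + 1)[k]) for k in range(P + 1)])
C = np.einsum("iq,jq,kq,q->ijk", H, H, H, w) / (w * H ** 2).sum(axis=1)

def galerkin_inverse(u):
    # Row r, column c of the matrix in (12.7) is sum_k C_{kcr} u_k.
    M = np.einsum("kcr,k->rc", C, u)
    rhs = np.zeros(len(u)); rhs[0] = 1.0  # U * V = Psi_0 in the weak sense
    return np.linalg.solve(M, rhs)

u = np.zeros(P + 1); u[0], u[1] = 3.0, 0.1   # U = 3 + 0.1 xi (hypothetical data)
v = galerkin_inverse(u)
```

By construction, the Galerkin product of u and v recovers the coefficients of Ψ0 to solver precision, and the zeroth mode of V is close to 1/3, the reciprocal of the mean of U.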

Bibliography
Basic Lax–Milgram theory and Galerkin methods for PDEs can be found in any
modern textbook on PDEs, such as those by Evans [27] (see Chapter 6) and
Renardy & Rogers [80] (see Chapter 9).
The monograph of Xiu [117] provides a general introduction to spectral
methods for uncertainty quantification, including Galerkin methods (Chapter
6), but is light on proofs. The book of Le Maître & Knio [58] covers Galerkin

methods in Chapter 4, including an extensive treatment of nonlinearities in


Section 4.5.

Exercises
Exercise 12.1. Let a be a bilinear form satisfying the hypotheses of the Lax–Milgram theorem. Given f ∈ H*, show that the unique u such that a(u, v) = ⟨f | v⟩ for all v ∈ H satisfies ‖u‖_H ≤ c⁻¹ ‖f‖_{H*}.

Exercise 12.2 (Céa's lemma). Let a, c and C be as in the statement of the Lax–Milgram theorem. Show that the weak solution u ∈ H and the Galerkin solution u^(M) ∈ V^(M) satisfy

‖u − u^(M)‖ ≤ (C/c) inf{ ‖u − v^(M)‖ | v^(M) ∈ V^(M) }.
Exercise 12.3. Consider a partition of the unit interval [0, 1] into N + 1 equally spaced nodes

0 = x0 < x1 = h < x2 = 2h < · · · < xN = 1,

where h = 1/N > 0. For n = 0, . . . , N, let

φn(x) := { 0,                 if x ≤ 0 or x ≤ x_{n−1};
           (x − x_{n−1})/h,   if x_{n−1} ≤ x ≤ xn;
           (x_{n+1} − x)/h,   if xn ≤ x ≤ x_{n+1};
           0,                 if x ≥ 1 or x ≥ x_{n+1}.

What space of functions is spanned by φ0, . . . , φN? For these functions φ0, . . . , φN, calculate the Gram matrix for the bilinear form

a(u, v) := ∫_0^1 u′(x) v′(x) dx

corresponding to the Laplace operator. Determine also the vector components ⟨f, φn⟩ in the Galerkin equation (12.4).
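A numerical sanity check of the Gram matrix in this exercise can be coded directly; the midpoint rule used below is exact because φ′m φ′n is piecewise constant on each subinterval. (N = 4 is an arbitrary choice, and only structural properties are checked, so as not to give the answer away.)

```python
import numpy as np

N = 4
h = 1.0 / N
nodes = np.linspace(0.0, 1.0, N + 1)

def phi_prime(n, t):
    # derivative of the hat function phi_n, away from the nodes
    if n > 0 and nodes[n - 1] < t < nodes[n]:
        return 1.0 / h
    if n < N and nodes[n] < t < nodes[n + 1]:
        return -1.0 / h
    return 0.0

mids = (nodes[:-1] + nodes[1:]) / 2.0    # one midpoint per subinterval
A = np.array([[h * sum(phi_prime(m, t) * phi_prime(n, t) for t in mids)
               for n in range(N + 1)] for m in range(N + 1)])
```

The matrix is symmetric and, because Σ_n φn ≡ 1 on [0, 1], each of its rows sums to zero.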
Exercise 12.4. Show that, for fixed P, the Galerkin product satisfies, for all U, V, W ∈ S^P and α, β ∈ R,

U ∗ V = V ∗ U,
(αU) ∗ (βV) = αβ(U ∗ V),
(U + V) ∗ W = U ∗ W + V ∗ W.

Exercise 12.5 (Galerkin division). Given U, W ∈ S^P, formulate the Galerkin division problem of finding V ∈ S^P such that U ∗ V = W, and show that the gPC coefficients of V satisfy a linear system analogous to (12.7), with right-hand side (w0, . . . , wP)⊤ in place of (1, 0, . . . , 0)⊤.




Chapter 13


Non-Intrusive Spectral
Methods

[W]hen people thought the Earth was flat, they were wrong. When people thought the Earth was spherical, they were wrong. But if you think that thinking the Earth is spherical is just as wrong as thinking the Earth is flat, then your view is wronger than both of them put together.

The Relativity of Wrong
Isaac Asimov

Chapter 12 considered a spectral approach to UQ, namely Galerkin expan-


sion, that is mathematically very attractive in that it is a natural extension
of the Galerkin methods that are commonly used for deterministic PDEs and

minimizes the stochastic residual, but has the severe disadvantage that the
stochastic modes of the solution are coupled together by a large system such
as (12.5). Hence, the Galerkin formalism is not suitable for situations in which
deterministic solutions are slow and expensive to obtain, and the determinis-
tic solution method cannot be modified. Many so-called legacy codes are not
amenable to such intrusive methods of UQ.
In contrast, this chapter considers non-intrusive spectral methods for UQ.
These are characterized by the feature that the solution of the deterministic
problem is a ‘black box’ that does not need to be modified for use in the spec-
tral method, beyond being able to be evaluated at any desired point θ of the
probability space (Θ, F , µ).

13.1 Pseudo-Spectral Methods


Consider a square-integrable stochastic process u : Θ → H taking values in a separable Hilbert space H, with a spectral expansion

u = Σ_{k∈N0} uk Ψk

of u ∈ L²(Θ, µ; H) ≅ H ⊗ L²(Θ, µ; R) in terms of coefficients (stochastic modes) uk ∈ H and an orthogonal basis {Ψk | k ∈ N0} of L²(Θ, µ; R). As usual, the stochastic modes are given by

uk = Eµ[u Ψk] / Eµ[Ψk²] = (1/γk) ∫_Θ u(θ) Ψk(θ) dµ(θ).

If the normalization constants γk := ‖Ψk‖²_{L²(µ)} are known ahead of time, then it remains only to approximate the integral with respect to µ of the product of u with each basis function Ψk.

Deterministic Quadrature. If the dimension of Θ is low and u(θ) is relatively
smooth as a function of θ, then an appealing approach to the estimation of
Eµ [uΨk ] is deterministic quadrature. For optimal polynomial accuracy, Gaus-
sian quadrature (i.e. nodes at the roots of µ-orthogonal polynomials) may be
AF
used. In practice, nested quadrature rules such as Clenshaw–Curtis may be
preferable since one does not wish to have to discard past solutions of u upon
passing to a more accurate quadrature rule. For multi-dimensional domains of
integration Θ, sparse quadrature rules may be used to alleviate the curse of
dimension.
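For a one-dimensional Gaussian germ, this pseudo-spectral computation is only a few lines of code. The following sketch treats u(θ) = e^θ as a hypothetical 'black box' response and recovers its Hermite modes by Gauss–Hermite quadrature; for this particular u the exact modes are known to be e^{1/2}/k!, which makes the sketch easy to verify:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

K = 6
x, w = hermegauss(32)
w /= np.sqrt(2 * np.pi)                  # normalize: expectation under N(0,1)
u_vals = np.exp(x)                       # 'black box' evaluations u(theta_q)

modes = np.empty(K + 1)
for k in range(K + 1):
    Hk = hermeval(x, np.eye(K + 1)[k])   # Psi_k = He_k at the quadrature nodes
    gamma_k = np.sum(w * Hk ** 2)        # gamma_k = <Psi_k^2> = k!
    modes[k] = np.sum(w * u_vals * Hk) / gamma_k
```

Each mode is the quadrature approximation of Eµ[u Ψk]/γk, exactly as in the formula above.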

Monte Carlo Integration. If the dimension of Θ is high, or u(θ) is a non-smooth function of θ, then it is tempting to resort to Monte Carlo approximation of Eµ[u Ψk]. This approach is also appealing because the calculation of the stochastic modes uk can be written as a straightforward (but often large) matrix-matrix multiplication, as in Exercise 13.1. The problem with Monte Carlo methods, as ever, is the slow convergence rate of ∼ (number of samples)^{−1/2}.
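For comparison with the quadrature approach, a Monte Carlo version of the same computation (the same hypothetical black box u(θ) = e^θ, with the Hermite basis truncated at K = 2) illustrates the slow convergence: even with 2 × 10⁴ samples the modes carry relative errors on the order of a few percent.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(20000)       # samples theta^(n) ~ N(0,1)
u_vals = np.exp(theta)                   # black-box evaluations u(theta^(n))

# He_0 = 1, He_1 = x, He_2 = x^2 - 1, with gamma_k = k!
Psi = np.stack([np.ones_like(theta), theta, theta ** 2 - 1])
gamma = np.array([1.0, 1.0, 2.0])
modes = (Psi * u_vals).mean(axis=1) / gamma   # Monte Carlo estimate of E[u Psi_k]/gamma_k
```

The exact modes for this u are e^{1/2}(1, 1, 1/2), so the sampling error is directly visible.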

Sources of Error. In practice, the following sources of error arise when computing pseudo-spectral expansions of this type:
1. discretization error comes about through the approximation of H by a finite-dimensional subspace, i.e. the approximation of the stochastic modes uk by finite sums uk ≈ Σ_{i=1}^{m} u_{k,i} φi, where {φi | i ∈ N} is some basis for H;
2. truncation error comes about through the truncation of the spectral expansion for u after finitely many terms, i.e. u ≈ Σ_{k=0}^{K} uk Ψk;
3. quadrature error comes about through the approximate nature of the numerical integration scheme used to find the stochastic modes.

13.2 Stochastic Collocation


Collocation methods for ordinary and partial differential equations are a form
of polynomial interpolation. The idea is to find a low-dimensional object —

usually a polynomial — that approximates the true solution to the differential


equation by means of exactly satisfying the differential equation at a selected
set of points, called collocation points or collocation nodes.
Example 13.1 (Collocation for an ODE). Consider for example the initial value
problem

u̇(t) = f (t, u(t)), for t ∈ [a, b]


u(a) = ua ,


to be solved on an interval of time [a, b]. Choose n points

a ≤ t1 < t2 < · · · < tn ≤ b,

called collocation nodes. Now find a polynomial p(t) ∈ R≤n[t] so that the ODE

ṗ(tk) = f(tk, p(tk))

is satisfied for k = 1, . . . , n, as is the initial condition p(a) = ua. For example, if n = 2, t1 = a and t2 = b, then the coefficients c2, c1, c0 ∈ R of the polynomial approximation

p(t) = Σ_{k=0}^{2} ck (t − a)^k,

which has derivative ṗ(t) = 2c2(t − a) + c1, are required to satisfy

ṗ(a) = c1 = f(a, p(a)),
ṗ(b) = 2c2(b − a) + c1 = f(b, p(b)),
p(a) = c0 = ua,

i.e.

p(t) = [(f(b, p(b)) − f(a, ua)) / (2(b − a))] (t − a)² + f(a, ua)(t − a) + ua.

The above equation implicitly defines the final value p(b) of the collocation solution. This method is also known as the trapezoidal rule for ODEs, since the same solution is obtained by rewriting the differential equation as

u(t) = u(a) + ∫_a^t f(s, u(s)) ds

and approximating the integral on the right-hand side by the trapezoidal quadrature rule for integrals.
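The two-node scheme of this example can be sketched in a few lines; the implicit equation for p(b) is solved here by fixed-point iteration, which converges when the step b − a is small:

```python
def collocation_step(f, a, b, ua, iters=50):
    # Two-node collocation (trapezoidal rule): solve
    #   p(b) = ua + (b - a)/2 * (f(a, ua) + f(b, p(b)))
    # for p(b) by fixed-point iteration.
    pb = ua
    for _ in range(iters):
        pb = ua + 0.5 * (b - a) * (f(a, ua) + f(b, pb))
    return pb

# March u' = -u, u(0) = 1 across [0, 1] in 100 steps; compare with e^{-1}.
u, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    u = collocation_step(lambda s, y: -y, t, t + h, u)
    t += h
```

The global error of this scheme is O(h²), consistent with its interpretation as the trapezoidal quadrature rule.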
It should be made clear at the outset that there is nothing stochastic about
‘stochastic collocation’, just as there is nothing chaotic about ‘polynomial chaos’.
The meaning of the term ‘stochastic’ in this case is that the collocation principle
is being applied across the ‘stochastic space’ (i.e. the probability space) of a
stochastic process, rather than the space/time/space-time domain. Consider
for example the random PDE

Lθ [u(x, θ)] = 0 for x ∈ Ω, θ ∈ Θ,


Bθ [u(x, θ)] = 0 for x ∈ ∂Ω, θ ∈ Θ,

where, for µ-a.e. θ in some probability space (Θ, F , µ), the differential operator
Lθ and boundary operator Bθ are well-defined and the PDE admits a unique
solution u( · , θ) : Ω → R. The solution u : Ω × Θ → R is then a stochastic
process. We now let ΘM = {θ(1) , . . . , θ(M) } ⊆ Θ be a finite set of prescribed
collocation nodes. The collocation problem is to find an approximate solution u^(M) ≈ u that satisfies

L_{θ^(m)}[u^(M)(x, θ^(m))] = 0   for x ∈ Ω,
B_{θ^(m)}[u^(M)(x, θ^(m))] = 0   for x ∈ ∂Ω,

for m = 1, . . . , M; there is, however, some flexibility in how to approximate u(x, θ) for θ ∉ ΘM.

Interpolation Approach. An obvious first approach is to use interpolating


polynomials when they are available. This is easiest when the stochastic space
Θ is one-dimensional.

Example 13.2. Consider the initial value problem

(d/dt) u(t, θ) = −e^θ u(t, θ),   u(0, θ) = 1,

with θ ∼ N(0, 1). Take as the collocation nodes θ^(1), . . . , θ^(M) ∈ R the M roots of the Hermite polynomial HeM of degree M. The collocation solution u^(M) is given at the collocation nodes θ^(m) by

(d/dt) u^(M)(t, θ^(m)) = −e^{θ^(m)} u^(M)(t, θ^(m)),   u^(M)(0, θ^(m)) = 1,

i.e.

u^(M)(t, θ^(m)) = exp(−e^{θ^(m)} t).

Away from the collocation nodes, u^(M) is defined by polynomial interpolation: for each t, u^(M)(t, θ) is a polynomial in θ of degree at most M − 1 with prescribed values at the collocation nodes. Writing this interpolation in Lagrange form yields

u^(M)(t, θ) = Σ_{m=1}^{M} u^(M)(t, θ^(m)) ℓm(θ)
            = Σ_{m=1}^{M} exp(−e^{θ^(m)} t) ∏_{1≤k≤M, k≠m} (θ − θ^(k)) / (θ^(m) − θ^(k)).
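This construction is mechanical to implement. A sketch with M = 8 nodes (NumPy's hermegauss returns sample points that are exactly the roots of HeM, so they serve directly as collocation nodes; the nodal solutions are then combined by Lagrange interpolation):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

M = 8
theta_nodes, _ = hermegauss(M)           # roots of He_M: the collocation nodes

def u_nodal(t, m):
    # exact solution of the decoupled nodal ODE at theta^(m)
    return np.exp(-np.exp(theta_nodes[m]) * t)

def u_colloc(t, theta):
    # Lagrange interpolation through the M nodal solutions
    total = 0.0
    for m in range(M):
        ell = 1.0
        for k in range(M):
            if k != m:
                ell *= (theta - theta_nodes[k]) / (theta_nodes[m] - theta_nodes[k])
        total += u_nodal(t, m) * ell
    return total
```

At each collocation node the interpolant reproduces the nodal solution exactly; between the nodes it gives a polynomial surrogate for the stochastic solution.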

Tensor Product Collocation. When Θ is a product of one-dimensional domains, one-dimensional collocation node sets can be combined into a tensor product grid; the number of nodes, however, then grows exponentially with the dimension of Θ.

Sparse Grid Collocation. To alleviate this curse of dimension, sparse grid interpolation using Smolyak constructions, e.g. on Smolyak–Chebyshev nodes [6], may be used.

Stochastic Collocation on Unstructured Grids. Stochastic collocation on arbitrary unstructured sets of nodes is a notably tricky subject, essentially because it boils down to polynomial interpolation through an unstructured set of nodes, which, as we have seen, is generally impossible.

Bibliography
The monograph of Xiu [117] provides a general introduction to spectral methods
for uncertainty quantification, including collocation methods, but is light on
proofs. The recent paper of Narayan & Xiu [70] presents a method for stochastic
collocation on arbitrary sets of nodes using the framework of least orthogonal
interpolation.
Non-intrusive methods for UQ, including NISP and stochastic collocation,
are covered in Chapter 3 of Le Maître & Knio [58].


Exercises
Exercise 13.1. Let u = (u1, . . . , uM) : Θ → R^M be a square-integrable random vector defined over a probability space (Θ, F, µ), and let {Ψk | k ∈ N0}, with normalization constants γk := ‖Ψk‖²_{L²(µ)}, be an orthogonal basis for L²(Θ, µ; R). Suppose that N independent samples {(θ^(n), u(θ^(n))) | n = 1, . . . , N}, with θ^(n) ∼ µ, are given, and it is desired to use these N Monte Carlo samples to form a truncated pseudo-spectral expansion

u ≈ u^(N) := Σ_{k=0}^{K} u_k^(N) Ψk

of u, where the approximate stochastic modes are obtained using Monte Carlo integration. Write down the defining equation for the mth component of the kth approximate stochastic mode, u_{k,m}^(N), and hence show that the approximate stochastic modes solve the matrix equation

[ u_{0,1}^(N)  · · ·  u_{K,1}^(N) ]
[     ⋮        ⋱          ⋮      ]  =  (1/N) D P⊤ Γ⁻¹,
[ u_{0,M}^(N)  · · ·  u_{K,M}^(N) ]

where

Γ := diag(γ0, . . . , γK) ∈ R^{(K+1)×(K+1)},

D := [ u1(θ^(1))  · · ·  u1(θ^(N)) ]
     [     ⋮       ⋱         ⋮     ]  ∈ R^{M×N},
     [ uM(θ^(1))  · · ·  uM(θ^(N)) ]

P := [ Ψ0(θ^(1))  · · ·  Ψ0(θ^(N)) ]
     [     ⋮       ⋱         ⋮     ]  ∈ R^{(K+1)×N}.
     [ ΨK(θ^(1))  · · ·  ΨK(θ^(N)) ]

Exercise 13.2. What is the analogue of the result of Exercise 13.1 when the
integrals are approximated using a quadrature rule, rather than using Monte
Carlo?

Chapter 14


Distributional Uncertainty

Technology, in common with many other activities, tends toward avoidance of risks by investors. Uncertainty is ruled out if possible. [P]eople generally prefer the predictable. Few recognize how destructive this can be, how it imposes severe limits on variability and thus makes whole populations fatally vulnerable to the shocking ways our universe can throw the dice.

Heretics of Dune
Frank Herbert

In the previous chapters, it has been assumed that an exact model is avail-
able for the probabilistic components of a system, i.e. that all probability dis-
tributions involved are known and can be sampled. In practice, however, such
assumptions about probability distributions are always wrong to some degree:
the distributions used in practice may only be simple approximations of more
complicated real ones, or there may be significant uncertainty about what the
real distributions actually are. The same is true of uncertainty about the correct
form of the forward physical model.

14.1 Maximum Entropy Distributions


Principle of Maximum Entropy. If all one knows about a probability measure


µ is that it lies in some set A ⊆ M1 (X ), then one should take µ to be the
element µME ∈ A of maximum entropy.
Example 14.1 (Unconstrained maximum entropy distributions). If X = {1, . . . , m} and p ∈ R^m_{>0} is a probability measure on X, then the entropy of p is

H(p) := − Σ_{i=1}^{m} pi log pi.   (14.1)

The only constraints on p are the natural ones that pi ≥ 0 and that S(p) := Σ_{i=1}^{m} pi = 1. Temporarily neglect the inequality constraints and use the method of Lagrange multipliers to find the extrema of H(p) among all p ∈ R^m with S(p) = 1; such p must satisfy, for some λ ∈ R,

0 = ∇H(p) − λ∇S(p) = −(1 + log p1 + λ, . . . , 1 + log pm + λ)⊤.

It is clear that any solution to this equation must have p1 = · · · = pm, for if pi and pj differ, then at most one of 1 + log pi + λ and 1 + log pj + λ can equal 0 for the same value of λ. Therefore, since S(p) = 1, it follows that the unique extremizer of H(p) among {p ∈ R^m | S(p) = 1} is p1 = · · · = pm = 1/m.
The inequality constraints that were neglected initially are satisfied, and are not
active constraints, so it follows that the uniform probability measure on X is
the unique maximum entropy distribution on X .

A similar argument using the calculus of variations shows that the unique maximum entropy probability distribution on an interval [a, b] ⊊ R is the uniform distribution (1/|b − a|) dx.
AF
Example 14.2 (Constrained maximum entropy distributions). Consider the set of all probability measures µ on R that have mean m and variance s²; what is the maximum entropy distribution in this set? Consider probability measures µ that are absolutely continuous with respect to Lebesgue measure, having density ρ. Then the aim is to find µ to maximize

H(µ) = − ∫_R ρ(x) log ρ(x) dx,

subject to the constraints that ρ ≥ 0, ∫_R ρ(x) dx = 1, ∫_R x ρ(x) dx = m and ∫_R (x − m)² ρ(x) dx = s². Introduce Lagrange multipliers c = (c0, c1, c2) and the Lagrangian

Fc(ρ) := − ∫_R ρ(x) log ρ(x) dx + c0 ∫_R ρ(x) dx + c1 ∫_R x ρ(x) dx + c2 ∫_R (x − m)² ρ(x) dx.

Consider a perturbation ρ + tσ; if ρ is indeed a critical point of Fc, then, regardless of σ, it must be true that

(d/dt)|_{t=0} Fc(ρ + tσ) = 0.

This derivative is given by

(d/dt)|_{t=0} Fc(ρ + tσ) = ∫_R σ(x) [ − log ρ(x) − 1 + c0 + c1 x + c2 (x − m)² ] dx.

Since it is required that

0 = ∫_R [ − log ρ(x) + c0 − 1 + c1 x + c2 (x − m)² ] σ(x) dx
for every σ, the expression in the brackets must vanish, i.e.

ρ(x) = exp(c0 − 1 + c1 x + c2 (x − m)²).

Since ρ(x) is the exponential of a quadratic form in x, µ must be a Gaussian of some mean and variance, which, by hypothesis, are m and s² respectively, i.e.

c0 = 1 − log √(2πs²),   c1 = 0,   c2 = −1/(2s²).


Discrete Entropy and Convex Programming. In discrete settings, the negative of the entropy (14.1) of a probability measure p ∈ M1({1, . . . , m}) with respect to the uniform measure, i.e. p ↦ Σ_{i=1}^{m} pi log pi, is a strictly convex function of p ∈ R^m_{>0}. Therefore, when p is constrained by a family of convex constraints, finding the maximum entropy distribution is a convex program:

minimize:        Σ_{i=1}^{m} pi log pi
with respect to: p ∈ R^m
subject to:      p ≥ 0
                 p · 1 = 1
                 ϕi(p) ≤ 0 for i = 1, . . . , n,

for given convex functions ϕ1, . . . , ϕn : R^m → R. This is useful because an explicit formula for the maximum entropy distribution, such as in Example 14.2, is rarely available. Therefore, the possibility of efficiently computing the maximum entropy distribution, as in this convex programming situation, is very attractive.
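As a hypothetical illustration, the convex program above can be handed directly to a general-purpose solver. The following sketch uses SciPy's SLSQP method to find the maximum entropy distribution on {1, . . . , 5} subject to a prescribed mean (the value 2.5 and the whole set-up are illustrative assumptions, not taken from the text):

```python
import numpy as np
from scipy.optimize import minimize

m = 5
x = np.arange(1, m + 1, dtype=float)     # the states 1, ..., 5

def neg_entropy(p):                       # sum_i p_i log p_i (strictly convex)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # p . 1 = 1
    {"type": "eq", "fun": lambda p: p @ x - 2.5},     # prescribed mean (illustrative)
]
result = minimize(neg_entropy, np.full(m, 1.0 / m),
                  bounds=[(1e-9, 1.0)] * m, constraints=constraints,
                  method="SLSQP")
p_me = result.x
```

The optimizer recovers the expected Gibbs form pi ∝ exp(−β xi): the successive ratios p_{i+1}/p_i come out constant.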
However, it must be remembered that despite the various justifications for
the use of the MaxEnt principle, it remains a selection mechanism that in some
sense returns a ‘typical’ or ‘representative’ distribution from a given class; what
if one is more interested in ‘atypical’ behaviour? This is the topic of the next
section.

14.2 Distributional Robustness


Suppose that we are interested in the value Q(µ† ) of some quantity of interest
that is a functional of a partially-known probability measure µ† on a space
X . Very often, Q(µ† ) arises as the expected value with respect to µ† of some
function q : X → R, so the objective is to determine

Q(µ† ) ≡ EX∼µ† [q(X)].

Now suppose that µ† is known only to lie in some subset A ⊆ M1 (X ). In the


absence of any further information about which µ ∈ A are more or less likely
to be µ†, and particularly if the consequences of planning based on an inaccurate
estimate of Q(µ† ) are very high, it makes sense to adopt a posture of ‘healthy

conservatism’ and compute bounds on Q(µ† ) that are as tight as justified by


the information that µ† ∈ A, but no tighter, i.e. to find

Q(A) := inf Q(µ) and Q(A) := sup Q(µ).


µ∈A µ∈A

When Q(µ) is the expected value with respect to µ of some function q : X → R,


the objective is to determine

Q̲(A) := inf_{µ∈A} Eµ[q]   and   Q̄(A) := sup_{µ∈A} Eµ[q].


The inequality

Q̲(A) ≤ Q(µ†) ≤ Q̄(A)

is, by construction, the sharpest possible bound on Q(µ†) given only the information that µ† ∈ A. The obvious question is, can Q̲(A) and Q̄(A) be computed?

Finite Sample Spaces. Suppose that the sample space X = {1, . . . , K} is a

finite set equipped with the discrete topology. Then the space of measurable
functions f : X → R is isomorphic to RK and the space of probability measures
µ on X is isomorphic to the unit simplex in RK . If the available information on
µ† is that it lies in the set
A := {µ ∈ M1 (X ) | Eµ [ϕn ] ≤ cn for n = 1, . . . , N }

for known measurable functions ϕ1 , . . . , ϕN : X → R and values c1 , . . . , cN ∈ R,


then the problem of finding the extreme values of Eµ [q] among µ ∈ A reduces
to linear programming:

extremize:       p · q
with respect to: p ∈ R^K
subject to:      p ≥ 0
                 p · 1 = 1
                 p · ϕn ≤ cn for n = 1, . . . , N.
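Such linear programs are immediately solvable with off-the-shelf tools. A sketch using scipy.optimize.linprog on a three-point sample space (the quantity of interest q and the single moment constraint here are hypothetical choices, purely for illustration):

```python
import numpy as np
from scipy.optimize import linprog

q = np.array([0.0, 1.0, 3.0])            # quantity of interest on X = {1, 2, 3}
phi = np.array([1.0, 2.0, 3.0])          # constraint function: E_mu[phi] <= 1.5

# maximize p.q  <=>  minimize -p.q
res = linprog(c=-q,
              A_ub=[phi], b_ub=[1.5],        # p . phi <= 1.5
              A_eq=[np.ones(3)], b_eq=[1.0], # p . 1 = 1
              bounds=[(0.0, None)] * 3)      # p >= 0
Q_upper = -res.fun                           # upper bound on E_mu[q]
```

For this data the optimum is Q̄ = 3/4, attained by the two-point measure (3/4)δ1 + (1/4)δ3 — a convex combination of Dirac measures, anticipating the reduction results that follow.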

Note that the feasible set A for this problem is a convex subset of RK ; indeed,
A is a polytope, i.e. the intersection of finitely many closed half-spaces of RK .
Furthermore, as a closed subset of the probability simplex in RK , A is compact.
Therefore, by Corollary 4.18, the extreme values of this problem are certain to

be found in the extremal set ext(A). This insight can be exploited to great effect
in the study of distributional robustness problems for general sample spaces X .
Remarkably, when the feasible set A of probability measures is sufficiently
like a polytope, it is not necessary to consider finite sample spaces. What would
appear to be an intractable optimization problem over an infinite-dimensional
set of measures is in fact equivalent to a tractable finite-dimensional problem.
Thus, the aim of this section is to find a finite-dimensional subset A∆ of A with
the property that
ext_{µ∈A} Q(µ) = ext_{µ∈A∆} Q(µ).

To perform this reduction, it is necessary to restrict attention to probability


measures, topological spaces, and functionals that are sufficiently well-behaved.
14.2. DISTRIBUTIONAL ROBUSTNESS 159

Extreme Points of Moment Classes. The first step in this reduction is to clas-
sify the extremal measures in sets of probability measures that are prescribed
by inequality or equality constraints on the expected value of finitely many arbi-
trary measurable test functions, so-called moment classes. Since, in finite time,
we can only verify — even approximately, numerically — the truth of finitely
many inequalities, such moment classes are appealing feasible sets from an epis-
temological point of view because they conform to Karl Popper's dictum
that “Our knowledge can only be finite, while our ignorance must necessarily
be infinite.”


Definition 14.3. A Borel measure µ on a topological space X is called inner
regular if, for every Borel-measurable set E ⊆ X ,

µ(E) = sup{µ(K) | K ⊆ E and K is compact}.

A pseudo-Radon space is a topological space on which every Borel probability


measure is inner regular. A Radon space is a separable, metrizable, pseudo-Radon space.

Examples 14.4. 1. Lebesgue measure on Euclidean space R^n (restricted to the Borel σ-algebra B(R^n), if pedantry is the order of the day) is an inner regular measure. Similarly, Gaussian measure is an inner regular probability measure on R^n. Proof. See MA359 Measure Theory.
2. However, Lebesgue/Gaussian measures on R equipped with the topology of one-sided convergence are not inner regular measures. Proof. See Exercise 14.1.
3. Every Polish space (i.e. every separable and completely metrizable topological space) is a pseudo-Radon space.

Compare the following definition of a barycentre (a centre of mass) for a set


of probability measures with the conclusion of the Choquet–Bishop–de Leeuw
theorem Theorem 4.13:

Definition 14.5. A barycentre for a set A ⊆ M1(X) is a probability measure µ ∈ M1(X) such that there exists p ∈ M1(ext(A)) such that

µ(B) = ∫_{ext(A)} ν(B) dp(ν)   for all measurable B ⊆ X.   (14.2)

Definition 14.6. A Riesz space (or vector lattice) is a vector space V together

with a partial order ≤ that is


1. (translation invariant) for all x, y, z ∈ V, x ≤ y =⇒ x + z ≤ y + z;
2. (positively homogeneous) for all x, y ∈ V and scalars α ≥ 0, x ≤ y =⇒
αx ≤ αy;
3. (lattice structure) for all x, y ∈ V, there exists a supremum element x ∨ y
that is a least upper bound for x and y in the order x ≤ y.

Definition 14.7. A subset S of a vector space V is a Choquet simplex if the cone

C := {(tx, t) ∈ V × R | t ≥ 0, x ∈ S}

is such that C − C is a Riesz space when C is taken to be the non-negative cone.



[Figure: the simplex of probability measures inside M±(X), showing the moment set A = {µ ∈ M1(X) | Eµ[ϕ] ≤ c} with extreme points among convex combinations of the Dirac measures δx1, δx2, δx3, i.e. ext(A) ⊆ ∆1 ∩ A.]

Figure 14.1: Heuristic justification of Winkler's classification of extreme points of moment sets (Theorem 14.8).

The definition of a Choquet simplex extends the usual finite-dimensional
definition: a finite-dimensional compact Choquet simplex is a simplex in the
ordinary sense of being the closed convex hull of finitely many points.
With these definitions, the extreme points of moment sets of probability
measures can be described by the following theorem due to Winkler. The proof
is rather technical, and can be found in [116]. The important point is that
when X is a pseudo-Radon space, Winkler’s theorem applies with P = M1 (X ),
and so the extreme measures in moment classes will simply be finite convex
combinations of Dirac measures. Figures like Figure 14.1 should make this an
intuitively plausible claim.

Theorem 14.8 (Winkler). Let (X, F) be a measurable space and let P ⊆ M1(F) be a Choquet simplex such that ext(P) consists of Dirac measures. Fix measurable functions ϕ1, . . . , ϕn : X → R and c1, . . . , cn ∈ R and let

A := { µ ∈ P | ϕi ∈ L¹(µ) and Eµ[ϕi] ≤ ci for i = 1, . . . , n }.

Then A is convex and its extremal set satisfies

ext(A) ⊆ A∆ := { µ ∈ A | µ = Σ_{i=1}^{m} αi δ_{xi}, 1 ≤ m ≤ n + 1, and the vectors (ϕ1(xi), . . . , ϕn(xi), 1), i = 1, . . . , m, are linearly independent }.

Furthermore, if all the moment conditions defining A are given by equalities, then ext(A) = A∆.

Optimization of Measure Affine Functionals. Having understood the extreme


points of moment classes, the next step is to show that the optimization of
suitably nice functionals on such classes can be exactly reduced to optimization
over the extremal measures in the class.

Definition 14.9. For A ⊆ M1(X), a function F : A → R ∪ {±∞} is said to be measure affine if, for all µ ∈ A and p ∈ M1(ext(A)) for which (14.2) holds, F is p-integrable with

F(µ) = ∫_{ext(A)} F(ν) dp(ν).   (14.3)

As always, the reader should check that the terminology 'measure affine' is a sensible choice by verifying that when X = {1, . . . , K} is a finite sample space, the restriction of any affine function F : R^K ≅ M±(X) → R to a subset A ⊆ M1(X) is a measure affine function in the sense of Definition 14.9.


An important and simple example of a measure affine functional is an eval-
uation functional, i.e. the integration of a fixed measurable function q:
Proposition 14.10. If q is bounded either below or above, then ν 7→ Eν [q] is a
measure affine map.
Proof. Exercise 14.2.

In summary, we now have the following:
Theorem 14.11. Let X be a pseudo-Radon space and let A ⊆ M1(X) be a moment class of the form

A := {µ ∈ M1(X) | Eµ[ϕj] ≤ 0 for j = 1, . . . , N}

for prescribed measurable functions ϕj : X → R. Then the extreme points of A are given by

ext(A) ⊆ A∆ := A ∩ ∆N(X)
= { µ = Σ_{i=0}^{N} αi δ_{xi} ∈ M1(X) | for some α0, . . . , αN ∈ [0, 1] and x0, . . . , xN ∈ X, Σ_{i=0}^{N} αi = 1, and Σ_{i=0}^{N} αi ϕj(xi) ≤ 0 for j = 1, . . . , N }.

Hence, if q is bounded either below or above, then Q̲(A) = Q̲(A∆) and Q̄(A) = Q̄(A∆).
Proof. The classification of ext(A) is given by Winkler’s theorem (Theorem 14.8).
Since q is bounded on at least one side, Proposition 14.10 implies that µ 7→
F (µ) := Eµ [q] is measure affine. Let µ ∈ A be arbitrary and choose a proba-

bility measure p ∈ M1 (ext(A)) with barycentre µ. Then, it follows from the


barycentric formula (14.3) that

F (µ) ≤ sup{F (ν) | ν ∈ ext(A)}.

This proves the claim for maximization; minimization is similar.
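Theorem 14.11 turns an optimization over an infinite-dimensional set of measures into one over finitely many weights and support points. As a hypothetical illustration with a single constraint (N = 1), take X = [0, 1], q(x) = x² and A = {µ | Eµ[X] ≤ 1/3}, so that it suffices to search over two-point measures α δ_{x0} + (1 − α) δ_{x1}. A crude grid-search sketch:

```python
import numpy as np

# two-point candidates mu = a*delta_{x0} + (1-a)*delta_{x1} on X = [0, 1]
a = np.linspace(0.0, 1.0, 101)[:, None, None]
x0 = np.linspace(0.0, 1.0, 101)[None, :, None]
x1 = np.linspace(0.0, 1.0, 101)[None, None, :]

mean = a * x0 + (1.0 - a) * x1           # E_mu[X]
value = a * x0 ** 2 + (1.0 - a) * x1 ** 2  # E_mu[q] with q(x) = x^2
feasible = mean <= 1.0 / 3.0             # the moment constraint
Q_upper = value[feasible].max()          # approximates sup over A
```

Since x² ≤ x on [0, 1], every feasible µ has Eµ[X²] ≤ 1/3, with equality for (2/3)δ0 + (1/3)δ1; the grid search reproduces Q̄(A) = 1/3 up to grid resolution.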


The kinds of constraints on measures (or, if you prefer, random variables)
that can be considered in Theorem 14.11 include values for or bounds on func-
tions of one or more of those random variables, e.g. the mean of X1 , the variance
of X2 , the covariance of X3 and X4 . However, one type of information that is
not of this type is that X5 and X6 are independent random variables, i.e. that
their joint law is a product measure. The problem here is that sets of product

measures can fail to be convex (see Exercise 14.3), so the reduction to extreme
points cannot be applied directly. Fortunately, a cunning application of Fubini’s
theorem resolves this difficulty. Note well, though, that unlike Theorem 14.11, 
Theorem 14.12 does not say that A∆ = ext(A); it only says that the optimiza-
tion problem has the same extreme values over A∆ and A.
Theorem 14.12. Let A ⊆ M1(X) be a moment class of the form

A := { µ = ⊗_{k=1}^{K} µk ∈ ⊗_{k=1}^{K} M1(Xk) | Eµ[ϕj] ≤ 0 for j = 1, . . . , N, and, for each k = 1, . . . , K, Eµk[ϕkj] ≤ 0 for j = 1, . . . , Nk }

for prescribed measurable functions ϕj : X → R and ϕkj : Xk → R. Let

A∆ := {µ ∈ A | µk ∈ ∆_{N+Nk}(Xk) for each k}.

Then, if q is bounded either above or below, Q̲(A) = Q̲(A∆) and Q̄(A) = Q̄(A∆).
Proof. Let ε > 0 and let µ∗ ∈ A be ε/(K+1)-suboptimal for the maximization of µ ↦ Eµ[q] over µ ∈ A, i.e.

Eµ∗[q] ≥ sup_{µ∈A} Eµ[q] − ε/(K + 1).

By Fubini's theorem,

E_{µ∗1⊗···⊗µ∗K}[q] = E_{µ∗1}[ E_{µ∗2⊗···⊗µ∗K}[q] ].

By the same arguments used in the proof of Theorem 14.11, µ∗1 can be replaced by some probability measure ν1 ∈ M1(X1) with support on at most N + N1 points, such that ν1 ⊗ µ∗2 ⊗ · · · ⊗ µ∗K ∈ A, and

E_{ν1}[ E_{µ∗2⊗···⊗µ∗K}[q] ] ≥ E_{µ∗1}[ E_{µ∗2⊗···⊗µ∗K}[q] ] − ε/(K + 1) ≥ sup_{µ∈A} Eµ[q] − 2ε/(K + 1).

Repeating this argument a further K − 1 times yields ν = ⊗_{k=1}^{K} νk ∈ A∆ such that

Eν[q] ≥ sup_{µ∈A} Eµ[q] − ε.

Since ε > 0 was arbitrary, it follows that

sup_{µ∈A∆} Eµ[q] = sup_{µ∈A} Eµ[q].

The proof for the infimum is similar.

14.3 Functional and Distributional Robustness


In addition to epistemic uncertainty about probability measures, applications
often feature epistemic uncertainty about the functions involved. For example,
if the system of interest is in reality some function g † from a space X of inputs to

another space Y of outputs, it may only be known that g † lies in some subset G of
the set of all (measurable) functions from X to Y; furthermore, our information
about g † and our information about µ† may be coupled in some way, e.g. by
knowledge of E_{X∼µ†}[g†(X)]. Therefore, we now consider admissible sets of the form

A ⊆ { (g, µ) | g : X → Y is measurable and µ ∈ M1(X) },

quantities of interest of the form Q(g, µ) = E_{X∼µ}[q(X, g(X))] for some measurable function q : X × Y → R, and seek the extreme values

Q̲(A) := inf_{(g,µ)∈A} E_{X∼µ}[q(X, g(X))]   and   Q̄(A) := sup_{(g,µ)∈A} E_{X∼µ}[q(X, g(X))].

Obviously, if for each g : X → Y the set of µ ∈ M1(X) such that (g, µ) ∈ A is a moment class of the form considered in Theorem 14.12, then

ext_{(g,µ)∈A} E_{X∼µ}[q(X, g(X))] = ext_{(g,µ)∈A, µ ∈ ⊗_{k=1}^{K} ∆_{N+Nk}(Xk)} E_{X∼µ}[q(X, g(X))].

Although the search over µ is now finite-dimensional for each g, the search over g is still infinite-dimensional. However, the passage to discrete
measures often enables us to finite-dimensionalize the search over g, since, in
AF
some sense, only the values of g on the finite set supp(µ) ‘matter’ in computing
EX∼µ [q(X, g(X))].
The idea is quite simple: instead of optimizing with respect to g ∈ G, we
optimize with respect to the finite-dimensional vector y = g|supp(µ). However,
this reduction step requires some care:
1. Some ‘functions’ do not have their values defined pointwise, e.g. ‘functions’
in Lebesgue and Sobolev spaces, which are actually equivalence classes of
functions modulo equality almost everywhere. If isolated points have measure
zero, then it makes no sense to restrict such ‘functions’ to a finite set
supp(µ). These difficulties are circumvented by insisting that G be a space of
functions with pointwise-defined values.
2. In the other direction, it is sometimes difficult to verify whether a vector
y indeed arises as the restriction to supp(µ) of some g ∈ G; we need functions
that can be extended from supp(µ) to all of X. Suitable extension properties
are ensured if we restrict attention to smooth enough functions between the
right kinds of spaces.
Theorem 14.13 (Minty’s extension theorem). Let (X, d) be a metric space, let H
be a Hilbert space, let E ⊆ X, and suppose that f : E → H satisfies

    ‖f(x) − f(y)‖_H ≤ d(x, y)^α for all x, y ∈ E                (14.4)

with Hölder constant 0 < α ≤ 1. Then there exists F : X → H such that F|_E = f
and (14.4) holds for all x, y ∈ X if either α ≤ 1/2 or if X is an inner product
space with metric given by d(x, y) = k^{1/α} ‖x − y‖ for some k > 0.
Furthermore, the extension can be performed so that F(X) ⊆ co(f(E)), and hence
without increasing the Hölder norm

    sup_x ‖f(x)‖_H + sup_{x≠y} ‖f(x) − f(y)‖_H / d(x, y)^α.
[Figure 14.1 graphic omitted: the three points of E in (R², ‖·‖∞) on the left,
their images under f in (R², ‖·‖₂) on the right, and the three unit discs
around the image points.]


Figure 14.1: The function f that takes the three points on the left (equipped
with ‖·‖∞) to the three points on the right (equipped with ‖·‖₂) has Lipschitz
constant 1, but has no 1-Lipschitz extension F to (0, 0), let alone the whole
plane R², since F((0, 0)) would have to lie in the (empty) intersection of the
three grey discs. Cf. Example 14.14.

Special cases of Minty’s theorem include the Kirszbraun–Valentine theorem
(which assures that Lipschitz functions between Hilbert spaces can be extended
without increasing the Lipschitz constant) and McShane’s theorem (which assures
that scalar-valued continuous functions on metric spaces can be extended
without increasing a prescribed convex modulus of continuity). However, the
extensibility property fails for Lipschitz functions between Banach spaces,
even finite-dimensional ones:

Example 14.14. Let E ⊆ R² be given by E := {(1, −1), (−1, 1), (1, 1)} and
define f : E → R² by

    f((1, −1)) := (1, 0),   f((−1, 1)) := (−1, 0),   and   f((1, 1)) := (0, √3).

Suppose that we wish to extend this f to F : R² → R², where E and the domain
copy of R² are given the metric arising from the maximum norm ‖·‖∞ and the
range copy of R² is given the metric arising from the Euclidean norm ‖·‖₂.
Then, for all distinct x, y ∈ E,

    ‖x − y‖∞ = 2 = ‖f(x) − f(y)‖₂,

so f has Lipschitz constant 1 on E. What value should F take at the origin,
(0, 0), if it is to have Lipschitz constant at most 1? Since (0, 0) lies at
‖·‖∞-distance 1 from all three points of E, F((0, 0)) must lie within
‖·‖₂-distance 1 of all three points of f(E). However, there is no such point
of R² within distance 1 of all three points of f(E), and hence any extension
of f to F : R² → R² must have Lip(F) > 1. See Figure 14.1.
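The empty-intersection argument can be checked numerically. The sketch below (our own illustration, not part of the text) grid-searches candidate values for F((0, 0)), taking the apex of the image triangle to be (0, √3), the value for which f is exactly 1-Lipschitz, and reports the smallest achievable worst-case Euclidean distance to the three image points; it comes out strictly greater than 1.

```python
import math

# Images of the three points of E under f (Example 14.14).
targets = [(1.0, 0.0), (-1.0, 0.0), (0.0, math.sqrt(3.0))]

def worst_dist(px, py):
    """Largest Euclidean distance from (px, py) to the three image points."""
    return max(math.hypot(px - tx, py - ty) for tx, ty in targets)

# Grid search over candidate values of F((0, 0)).  Since (0, 0) lies at
# sup-norm distance 1 from every point of E, a 1-Lipschitz extension would
# need some candidate with worst_dist <= 1.
best = min(
    worst_dist(-2.0 + 0.01 * i, -1.0 + 0.01 * j)
    for i in range(401)   # x in [-2, 2]
    for j in range(301)   # y in [-1, 2]
)
print(best)  # ≈ 1.156 on this grid; the exact minimum is 2/√3 ≈ 1.1547 > 1
```

The minimizing candidate is the circumcentre (0, 1/√3) of the image triangle, whose circumradius 2/√3 exceeds 1, so no 1-Lipschitz extension value exists at the origin.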

Theorem 14.15. Let G be a collection of measurable functions from X to Y such
that, for every finite subset E ⊆ X and g : E → Y, it is possible to determine
whether or not g can be extended to an element of G. Let A ⊆ G × M1(X) be such
that, for each g ∈ G, the set of µ ∈ M1(X) such that (g, µ) ∈ A is a moment
class of the form considered in Theorem 14.12. Let

    A∆ := { (y, µ) | µ ∈ ⊗_{k=1}^{K} ∆_{N+N_k}(X_k), y is the restriction to
            supp(µ) of some g ∈ G, and (g, µ) ∈ A }.

Then, if q is bounded either above or below, Q̲(A) = Q̲(A∆) and Q̄(A) = Q̄(A∆).

Proof. Exercise 14.6.


Example 14.16. Suppose that g† : [−1, 1] → R is known to have Lipschitz
constant Lip(g†) ≤ L. Suppose also that the inputs of g† are distributed
according to µ† ∈ M1([−1, 1]), and it is known that

    EX∼µ† [X] = 0   and   EX∼µ† [g†(X)] ≥ m > 0.

Hence, the corresponding feasible set is

    A = { (g, µ) | g : [−1, 1] → R has Lipschitz constant ≤ L,
                   µ ∈ M1([−1, 1]), EX∼µ[X] = 0, and EX∼µ[g(X)] ≥ m }.

Suppose that our quantity of interest is the probability of output values below
0, i.e. q(x, y) = 1[y ≤ 0]. Then Theorem 14.15 ensures that the extreme values
of

    Q(g, µ) = EX∼µ[1[g(X) ≤ 0]] = PX∼µ[g(X) ≤ 0]

are the solutions of

    extremize:        Σ_{i=0}^{2} wi 1[yi ≤ 0]
    with respect to:  w0, w1, w2 ≥ 0
                      x0, x1, x2 ∈ [−1, 1]
                      y0, y1, y2 ∈ R
    subject to:       Σ_{i=0}^{2} wi = 1
                      Σ_{i=0}^{2} wi xi = 0
                      Σ_{i=0}^{2} wi yi ≥ m
                      |yi − yj| ≤ L|xi − xj| for i, j ∈ {0, 1, 2}.
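A crude numerical sketch of this reduced problem is given below, with hypothetical values L = 1 and m = 0.5. It samples only two-point candidate measures whose weights are forced by the mean-zero constraint and whose g-values are interpolated at maximal Lipschitz slope, so it produces a lower bound on the supremum rather than a certified optimum; a serious implementation would use a constrained optimizer over the full three-point parametrization.

```python
import random

L, m = 1.0, 0.5  # hypothetical values of the Lipschitz and mean bounds
rng = random.Random(42)

best = 0.0
for _ in range(200000):
    # Two-point candidate: support {x1, x2} with x1 <= 0 <= x2; the weights
    # w1, w2 are forced by the mean-zero constraint E[X] = 0.
    x1 = rng.uniform(-1.0, 0.0)
    x2 = rng.uniform(0.0, 1.0)
    if x2 - x1 < 1e-9:
        continue
    w1 = x2 / (x2 - x1)
    w2 = -x1 / (x2 - x1)
    # g-values: y1 <= 0 at x1, and y2 climbing at maximal Lipschitz slope.
    y1 = rng.uniform(-1.0, 0.0)
    y2 = y1 + L * (x2 - x1)
    if w1 * y1 + w2 * y2 < m:  # mean constraint E[g(X)] >= m
        continue
    value = w1 * (y1 <= 0.0) + w2 * (y2 <= 0.0)
    best = max(best, value)

print(best)  # best found value of P[g(X) <= 0]: a lower bound on the supremum
```

Within this restricted family the objective is w1 = x2/(x2 − x1), and the mean constraint forces x1 ≤ y1 − m ≤ −m, so the search cannot exceed 1/(1 + m) here; comparing the printed value with the solution of the full problem is a useful consistency check.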

Example 14.17 (McDiarmid). The following admissible set corresponds to the
assumptions of McDiarmid’s inequality (Theorem 10.5):

    A = { (g, µ) | g : X → R has Dk[g] ≤ Dk,
                   µ = ⊗_{k=1}^{K} µk ∈ M1(X), and EX∼µ[g(X)] = m }.

McDiarmid’s inequality was the upper bound

    Q̄(A) := sup_{(g,µ)∈A} Pµ[g(X) ≤ 0] ≤ exp( −2 max{0, m}² / Σ_{k=1}^{K} Dk² ).

Perhaps not surprisingly given its general form, McDiarmid’s inequality is not
the least upper bound on Pµ[g(X) ≤ 0]; the actual least upper bound can be
calculated using the reduction theorems.
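For reference, the right-hand side of McDiarmid's bound is trivial to evaluate; the helper below is our own illustration, not part of the text.

```python
import math

def mcdiarmid_upper_bound(m, diameters):
    """Upper bound exp(-2 max{0, m}^2 / sum_k D_k^2) on P[g(X) <= 0]
    for g with McDiarmid subdiameters D_k[g] <= D_k and E[g(X)] = m."""
    numerator = 2.0 * max(0.0, m) ** 2
    return math.exp(-numerator / sum(d**2 for d in diameters))

print(mcdiarmid_upper_bound(1.0, [1.0, 1.0]))  # exp(-1) ≈ 0.3679
print(mcdiarmid_upper_bound(-0.5, [1.0]))      # 1.0: the bound is vacuous for m <= 0
```

The least upper bound produced by the reduction theorems is in general strictly smaller than this value.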


Bibliography

Berger [8] makes the case for distributional robustness, with respect to priors
and likelihoods, in Bayesian inference. Smith [89] provides theory and several
practical examples for generalized Chebyshev inequalities in decision analysis.
Boyd & Vandenberghe [16, Sec. 7.2] cover some aspects of distributional
robustness under the heading of nonparametric distribution estimation, in the
case in which it is a convex problem. Convex optimization approaches to
distributional robustness and optimal probability inequalities are also
considered by Bertsimas & Popescu [9]. There is also an extensive literature on
the related topic of majorization, for which see the book of Marshall & al. [64].

The classification of the extreme points of moment sets and the consequences
for the optimization of measure affine functionals are due to von Weizsäcker &
Winkler [112, 113] and Winkler [116]. Karr [49] proved similar results under
additional topological and continuity assumptions. Theorem 14.12 and the
Lipschitz version of Theorem 14.15 can be found in Owhadi & al. [76] and
Sullivan & al. [99] respectively. Extension Theorem 14.13 is due to Minty [68],
and generalizes earlier results by McShane [66], Kirszbraun [51] and Valentine
[111]. Example 14.14 is taken from the example given on p. 202 of Federer [28].

Exercises
Exercise 14.1. Consider the topology T on R generated by the basis of open
sets [a, b), where a, b ∈ R.
1. Show that this topology generates the same σ-algebra on R as the usual
Euclidean topology does. Hence, show that Gaussian measure is a well-defined
probability measure on the Borel σ-algebra of (R, T ).
2. Show that every compact subset of (R, T ) is a countable set.
3. Conclude that Gaussian measure on (R, T ) is not inner regular and that
(R, T ) is not a pseudo-Radon space.

Exercise 14.2. Suppose that A is a moment class of probability measures on X
and that q : X → R ∪ {±∞} is bounded either below or above. Show that
µ ↦ Eµ[q] is a measure affine map. Hint: verify the assertion for the case in
which q is the indicator function of a measurable set; extend it to bounded
measurable functions using the Monotone Class Theorem; for non-negative
µ-integrable functions q, use monotone convergence to verify the barycentric
formula.

Exercise 14.3. Let λ denote uniform measure on the unit interval [0, 1] ⊂ R.
Show that the line segment in M1([0, 1]²) joining the measures λ ⊗ δ0 and
δ0 ⊗ λ contains measures that are not product measures. Hence show that a set
A of product probability measures like that considered in Theorem 14.12 is
typically not convex.
Exercise 14.4. Calculate by hand, as a function of t ∈ R, D ≥ 0 and m ∈ R,

    sup_{µ∈A} PX∼µ[X ≤ t],

where

    A := { µ ∈ M1(R) | EX∼µ[X] ≥ m and diam(supp(µ)) ≤ D }.
Exercise 14.5. Calculate by hand, as a function of t ∈ R, s ≥ 0 and m ∈ R,

    sup_{µ∈A} PX∼µ[X − m ≥ st]

and

    sup_{µ∈A} PX∼µ[|X − m| ≥ st],

where

    A := { µ ∈ M1(R) | EX∼µ[X] ≤ m and EX∼µ[|X − m|²] ≤ s² }.
Exercise 14.6. Prove Theorem 14.15.
Exercise 14.7. Calculate by hand, as a function of t ∈ R, m ∈ R, z ∈ [0, 1]
and v ∈ R,

    sup_{(g,µ)∈A} PX∼µ[g(X) ≤ t],

where

    A := { (g, µ) | g : [0, 1] → R has Lipschitz constant 1,
                    µ ∈ M1([0, 1]), EX∼µ[g(X)] ≥ m, and g(z) = v }.

Numerically verify your calculations.
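One possible shape for the requested numerical verification is sketched below. The parameter values t, m, z and v are arbitrary choices of ours, and plain random search over reduced three-point candidates stands in for a proper constrained optimizer; the printed value is a lower bound on the supremum, to be compared against the hand calculation.

```python
import random

t, m, z, v = 0.2, 0.4, 0.5, 0.5  # arbitrary illustrative parameter values
rng = random.Random(0)

def feasible(xs, ys, ws):
    """Check the reduced constraints: weights form a probability vector,
    the y-values are 1-Lipschitz data compatible with g(z) = v, and the
    mean constraint E[g(X)] >= m holds."""
    if abs(sum(ws) - 1.0) > 1e-9 or min(ws) < 0.0:
        return False
    pts = list(zip(xs + [z], ys + [v]))
    for i in range(len(pts)):
        for j in range(i):
            if abs(pts[i][1] - pts[j][1]) > abs(pts[i][0] - pts[j][0]) + 1e-12:
                return False
    return sum(w * y for w, y in zip(ws, ys)) >= m

best = 0.0
for _ in range(100000):
    xs = [rng.random() for _ in range(3)]
    # Sample each y inside the Lipschitz cone anchored at (z, v).
    ys = [v + rng.uniform(-1.0, 1.0) * abs(x - z) for x in xs]
    a, b = sorted(rng.random() for _ in range(2))
    ws = [a, b - a, 1.0 - b]  # uniform point of the probability simplex
    if feasible(xs, ys, ws):
        best = max(best, sum(w for w, y in zip(ws, ys) if y <= t))

print(best)  # best found value of P[g(X) <= t]
```

Sampling the y-values inside the cone anchored at (z, v) enforces the g(z) = v constraint automatically; the pairwise Lipschitz conditions and the mean constraint are then checked by rejection.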

Bibliography and Index

Bibliography

[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications Inc., New York, 1992. Reprint of the 1972 edition.

[2] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, Berlin, third edition, 2006.

[3] Ö. F. Alış and H. Rabitz. Efficient implementation of high dimensional model representations. J. Math. Chem., 29(2):127–142, 2001.

[4] M. Atiyah. Collected Works. Vol. 6. Oxford Science Publications. The Clarendon Press Oxford University Press, New York, 2004.

[5] K. Azuma. Weighted sums of certain dependent random variables. Tōhoku Math. J. (2), 19:357–367, 1967.

[6] V. Barthelmann, E. Novak, and K. Ritter. High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math., 12(4):273–288, 2000.

[7] F. Beccacece and E. Borgonovo. Functional ANOVA, ultramodularity and monotonicity: applications in multiattribute utility theory. European J. Oper. Res., 210(2):326–335, 2011.

[8] J. O. Berger. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994. With comments and a rejoinder by the author.

[9] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convex optimization approach. SIAM J. Optim., 15(3):780–804 (electronic), 2005.

[10] P. Billingsley. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons Inc., New York, third edition, 1995. A Wiley-Interscience Publication.

[11] E. Bishop and K. de Leeuw. The representations of linear functionals by measures on sets of extreme points. Ann. Inst. Fourier. Grenoble, 9:305–331, 1959.

[12] S. Bochner. Integration von Funktionen, deren Werte die Elemente eines Vectorraumes sind. Fund. Math., 20:262–276, 1933.

[13] G. Boole. An Investigation of the Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities. Walton and Maberley, London, 1854.

[14] N. Bourbaki. Topological Vector Spaces. Chapters 1–5. Elements of Mathematics (Berlin). Springer-Verlag, Berlin, 1987. Translated from the French by H. G. Eggleston and S. Madan.

[15] N. Bourbaki. Integration. I. Chapters 1–6. Elements of Mathematics (Berlin). Springer-Verlag, Berlin, 2004. Translated from the 1959, 1965 and 1967 French originals by Sterling K. Berberian.

[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[17] R. H. Cameron and W. T. Martin. The orthogonal development of nonlinear functionals in series of Fourier–Hermite functionals. Ann. of Math. (2), 48:385–392, 1947.

[18] M. Capiński and E. Kopp. Measure, Integral and Probability. Springer Undergraduate Mathematics Series. Springer-Verlag London Ltd., London, second edition, 2004.

[19] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on an automatic computer. Numer. Math., 2:197–205, 1960.

[20] S. L. Cotter, M. Dashti, J. C. Robinson, and A. M. Stuart. Bayesian inverse problems for functions and applications to fluid mechanics. Inverse Problems, 25(11):115008, 43, 2009.

[21] S. L. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM J. Numer. Anal., 48(1):322–345, 2010.

[22] M. Dashti, S. Harris, and A. Stuart. Besov priors for Bayesian inverse problems. Inverse Probl. Imaging, 6(2):183–200, 2012.

[23] A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist., 38:325–339, 1967.

[24] J. Diestel and J. J. Uhl, Jr. Vector Measures. Number 15 in Mathematical Surveys. American Mathematical Society, Providence, R.I., 1977.

[25] R. M. Dudley. Real Analysis and Probability, volume 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. Revised reprint of the 1989 original.

[26] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.

[27] L. C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, second edition, 2010.

[28] H. Federer. Geometric Measure Theory. Die Grundlehren der mathematischen Wissenschaften, Band 153. Springer-Verlag New York Inc., New York, 1969.

[29] L. Fejér. On the infinite sequences arising in the theories of harmonic analysis, of interpolation, and of mechanical quadratures. Bull. Amer. Math. Soc., 39(8):521–534, 1933.

[30] J. Feldman. Equivalence and perpendicularity of Gaussian processes. Pacific J. Math., 8:699–708, 1958.

[31] X. Fernique. Intégrabilité des vecteurs gaussiens. C. R. Acad. Sci. Paris Sér. A-B, 270:A1698–A1699, 1970.

[32] R. A. Fisher and W. A. Mackenzie. The manurial response of different potato varieties. J. Agric. Sci., 13:311–320, 1923.

[33] W. Gautschi. Orthogonal Polynomials: Computation and Approximation. Numerical Mathematics and Scientific Computation. Oxford University Press, New York, 2004. Oxford Science Publications.

[34] R. G. Ghanem and P. D. Spanos. Stochastic Finite Elements: A Spectral Approach. Springer-Verlag, New York, 1991.

[35] R. A. Gordon. The Integrals of Lebesgue, Denjoy, Perron, and Henstock, volume 4 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 1994.

[36] J. Hájek. On a property of normal distribution of any stochastic process. Czechoslovak Math. J., 8 (83):610–618, 1958.

[37] J. H. Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numer. Math., 2:84–90, 1960.

[38] W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Statistics, 19:293–325, 1948.

[39] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.

[40] G. Hooker. Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comput. Graph. Statist., 16(3):709–732, 2007.

[41] J. Humpherys, P. Redd, and J. West. A Fresh Look at the Kalman Filter. SIAM Rev., 54(4):801–823, 2012.

[42] L. Jaulin, M. Kieffer, O. Didrit, and É. Walter. Applied Interval Analysis: With Examples in Parameter and State Estimation, Robust Control and Robotics. Springer-Verlag London Ltd., London, 2001.

[43] A. H. Jazwinski. Stochastic Processes and Filtering Theory, volume 64 of Mathematics in Science and Engineering. Academic Press, New York, 1970.

[44] V. M. Kadets. Non-differentiable indefinite Pettis integrals. Quaestiones Math., 17(2):137–139, 1994.

[45] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160 of Applied Mathematical Sciences. Springer-Verlag, New York, 2005.

[46] R. E. Kalman. A new approach to linear filtering and prediction problems. Trans. ASME Ser. D. J. Basic Engrg., 82:35–45, 1960.

[47] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Trans. ASME Ser. D. J. Basic Engrg., 83:95–108, 1961.

[48] K. Karhunen. Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys., 1947(37):79, 1947.

[49] A. F. Karr. Extreme points of certain sets of probability measures, with applications. Math. Oper. Res., 8(1):74–85, 1983.

[50] J. M. Keynes. A Treatise on Probability. Macmillan and Co., London, 1921.

[51] M. D. Kirszbraun. Über die zusammenziehende und Lipschitzsche Transformationen. Fund. Math., 22:77–108, 1934.

[52] D. D. Kosambi. Statistics in function space. J. Indian Math. Soc. (N.S.), 7:76–88, 1943.

[53] H. Kozono and T. Yanagisawa. Generalized Lax–Milgram theorem in Banach spaces and its application to the elliptic system of boundary value problems. Manuscripta Math., 141(3-4):637–662, 2013.

[54] M. Krein and D. Milman. On extreme points of regular convex sets. Studia Math., 9:133–138, 1940.

[55] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statistics, 22:79–86, 1951.

[56] V. P. Kuznetsov. Intervalnye statisticheskie modeli. “Radio i Svyaz′”, Moscow, 1991.

[57] M. Lassas, E. Saksman, and S. Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. Inverse Probl. Imaging, 3(1):87–122, 2009.

[58] O. P. Le Maître and O. M. Knio. Spectral Methods for Uncertainty Quantification: With applications to computational fluid dynamics. Scientific Computation. Springer, New York, 2010.

[59] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[60] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Classics in Mathematics. Springer-Verlag, Berlin, 2011. Isoperimetry and Processes, Reprint of the 1991 edition.

[61] P. Lévy. Problèmes Concrets d’Analyse Fonctionnelle. Avec un Complément sur les Fonctionnelles Analytiques par F. Pellegrino. Gauthier-Villars, Paris, 1951. 2d ed.

[62] M. Loève. Probability Theory. II. Springer-Verlag, New York, fourth edition, 1978. Graduate Texts in Mathematics, Vol. 46.

[63] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, New York, 2003.

[64] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of Majorization and its Applications. Springer Series in Statistics. Springer, New York, second edition, 2011.

[65] C. McDiarmid. On the method of bounded differences. In Surveys in combinatorics, 1989 (Norwich, 1989), volume 141 of London Math. Soc. Lecture Note Ser., pages 148–188. Cambridge Univ. Press, Cambridge, 1989.

[66] E. J. McShane. Extension of range of functions. Bull. Amer. Math. Soc., 40(12):837–842, 1934.

[67] J. Mikusiński. The Bochner Integral. Birkhäuser Verlag, Basel, 1978. Lehrbücher und Monographien aus dem Gebiete der exakten Wissenschaften, Mathematische Reihe, Band 55.

[68] G. J. Minty. On the extension of Lipschitz, Lipschitz–Hölder continuous, and monotone functions. Bull. Amer. Math. Soc., 76:334–339, 1970.

[69] R. E. Moore. Interval Analysis. Prentice-Hall Inc., Englewood Cliffs, N.J., 1966.

[70] A. Narayan and D. Xiu. Stochastic collocation methods on unstructured grids in high dimensions via interpolation. SIAM J. Sci. Comput., 34(3):A1729–A1752, 2012.

[71] H. Niederreiter. Low-discrepancy and low-dispersion sequences. J. Number Theory, 30(1):51–70, 1988.

[72] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006.

[73] B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Universitext. Springer-Verlag, Berlin, sixth edition, 2003.

[74] N. Oreskes, K. Shrader-Frechette, and K. Belitz. Verification, validation, and confirmation of numerical models in the earth sciences. Science.

[75] A. B. Owen. Latin supercube sampling for very high dimensional simulations. ACM Trans. Mod. Comp. Sim., 8(2):71–102, 1998.

[76] H. Owhadi, C. Scovel, T. J. Sullivan, M. McKerns, and M. Ortiz. Optimal Uncertainty Quantification. SIAM Rev., 55(2):271–345, 2013.

[77] K. V. Price, R. M. Storn, and J. A. Lampinen. Differential Evolution: A Practical Approach to Global Optimization. Natural Computing Series. Springer-Verlag, Berlin, 2005.

[78] H. Rabitz and Ö. F. Alış. General foundations of high-dimensional model representations. J. Math. Chem., 25(2-3):197–233, 1999.

[79] M. Reed and B. Simon. Methods of Modern Mathematical Physics. I. Functional Analysis. Academic Press, New York, 1972.

[80] M. Renardy and R. C. Rogers. An Introduction to Partial Differential Equations, volume 13 of Texts in Applied Mathematics. Springer-Verlag, New York, second edition, 2004.

[81] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 2004.

[82] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970 original, Princeton Paperbacks.

[83] W. Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill Inc., New York, second edition, 1991.

[84] C. Runge. Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik, 46:224–243, 1901.

[85] R. A. Ryan. Introduction to Tensor Products of Banach Spaces. Springer Monographs in Mathematics. Springer-Verlag London Ltd., London, 2002.

[86] B. P. Rynne and M. A. Youngson. Linear Functional Analysis. Springer Undergraduate Mathematics Series. Springer-Verlag London Ltd., London, second edition, 2008.

[87] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, N.J., 1976.

[88] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.

[89] J. E. Smith. Generalized Chebychev inequalities: theory and applications in decision analysis. Oper. Res., 43(5):807–825, 1995.

[90] S. A. Smolyak. Quadrature and interpolation formulae on tensor products of certain function classes. Dokl. Akad. Nauk SSSR, 148:1042–1045, 1963.

[91] I. M. Sobol′. Uniformly distributed sequences with an additional property of uniformity. Ž. Vyčisl. Mat. i Mat. Fiz., 16(5):1332–1337, 1375, 1976.

[92] I. M. Sobol′. Estimation of the sensitivity of nonlinear mathematical models. Mat. Model., 2(1):112–118, 1990.

[93] I. M. Sobol′. Sensitivity estimates for nonlinear mathematical models. Math. Modeling Comput. Experiment, 1(4):407–414 (1995), 1993.

[94] C. Soize and R. Ghanem. Physical systems with random uncertainties: chaos representations with arbitrary probability measure. SIAM J. Sci. Comput., 26(2):395–410 (electronic), 2004.

[95] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis, volume 12 of Texts in Applied Mathematics. Springer-Verlag, New York, third edition, 2002. Translated from the German by R. Bartels, W. Gautschi and C. Witzgall.

[96] C. J. Stone. The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist., 22(1):118–184, 1994.

[97] R. Storn and K. Price. Differential evolution — a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim., 11(4):341–359, 1997.

[98] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010.

[99] T. J. Sullivan, M. McKerns, D. Meyer, F. Theil, H. Owhadi, and M. Ortiz. Optimal uncertainty quantification for legacy data observations of Lipschitz functions. Math. Model. Numer. Anal., 47(6):1657–1689, 2013.

[100] G. Szegő. Orthogonal Polynomials. American Mathematical Society, Providence, R.I., fourth edition, 1975. American Mathematical Society, Colloquium Publications, Vol. XXIII.

[101] H. Takahasi and M. Mori. Double exponential formulas for numerical integration. Publ. Res. Inst. Math. Sci., 9:721–741, 1973/74.

[102] M. Talagrand. Pettis integral and measure theory. Mem. Amer. Math. Soc., 51(307):ix+224, 1984.

[103] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.

[104] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[105] A. N. Tikhonov. On the stability of inverse problems. C. R. (Doklady) Acad. Sci. URSS (N.S.), 39:176–179, 1943.

[106] A. N. Tikhonov. On the solution of incorrectly put problems and the regularisation method. In Outlines Joint Sympos. Partial Differential Equations (Novosibirsk, 1963), pages 261–265. Acad. Sci. USSR Siberian Branch, Moscow, 1963.

[107] L. N. Trefethen. Is Gauss quadrature better than Clenshaw–Curtis? SIAM Rev., 50(1):67–87, 2008.

[108] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.

[109] U.S. Department of Energy. Scientific Grand Challenges for National Security: The Role of Computing at the Extreme Scale. 2009.

[110] N. N. Vakhania. The topological support of Gaussian measure in Banach space. Nagoya Math. J., 57:59–63, 1975.

[111] F. A. Valentine. A Lipschitz condition preserving extension for a vector function. Amer. J. Math., 67(1):83–93, 1945.

[112] H. von Weizsäcker and G. Winkler. Integral representation in the set of solutions of a generalized moment problem. Math. Ann., 246(1):23–32, 1979/80.

[113] H. von Weizsäcker and G. Winkler. Noncompact extremal integral representations: some probabilistic aspects. In Functional analysis: surveys and recent results, II (Proc. Second Conf. Functional Anal., Univ. Paderborn, Paderborn, 1979), volume 68 of Notas Mat., pages 115–148. North-Holland, Amsterdam, 1980.

[114] P. Walley. Statistical Reasoning with Imprecise Probabilities, volume 42 of Monographs on Statistics and Applied Probability. Chapman and Hall Ltd., London, 1991.

[115] K. Weichselberger. The theory of interval-probability as a unifying concept for uncertainty. Internat. J. Approx. Reason., 24(2-3):149–170, 2000.

[116] G. Winkler. Extreme points of moment sets. Math. Oper. Res., 13(4):581–587, 1988.

[117] D. Xiu. Numerical Methods for Stochastic Computations: A Spectral Method Approach. Princeton University Press, Princeton, NJ, 2010.

[118] D. Xiu and G. E. Karniadakis. The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput., 24(2):619–644 (electronic), 2002.

Index

absolute continuity, 17
affine combination, 39
almost everywhere, 10
ANOVA, 116
arg max, 34
arg min, 34

Banach space, 24
barycentre, 159
Bayes’ rule, 11, 63
Bessel’s inequality, 29
Birkhoff–Khinchin ergodic theorem, 106
Bochner integral, 16, 32
bounded differences inequality, 113

Céa’s lemma, 139, 146
Cameron–Martin space, 20
Cauchy–Schwarz inequality, 24, 52
Chebyshev nodes, 93
Chebyshev’s inequality, 16
Choquet simplex, 159
Choquet–Bishop–de Leeuw theorem, 39
Christoffel–Darboux formula, 90
Clenshaw–Curtis quadrature, 104
collocation method
  for ODEs, 151
  stochastic, 151
complete measure space, 10
concentration of measure, 113
conditional expectation, 28, 114
conditional probability, 11
constraining function, 38
convex combination, 39
convex function, 40
convex hull, 39
convex optimization problem, 41
convex set, 39
counting measure, 10
covariance
  matrix, 15
  operator, 19, 51
Cut-HDMR, 117

Dirac measure, 10
direct sum, 28
dominated convergence theorem, 15
dual space, 26

ensemble Kálmán filter, 78
entropy, 52, 155
equivalent measures, 17
Eulerian observations, 80
expectation, 15
expected value, 15
extended Kálmán filter, 77
extreme point, 39, 160

Favard’s theorem, 90
Fejér quadrature, 104
Feldman–Hájek theorem, 21, 67
Fernique’s theorem, 20
filtration, 12
Fubini–Tonelli theorem, 18

Galerkin product, 145
Galerkin projection
  deterministic, 137
  stochastic, 140
Galerkin tensor, 144
Gauss–Markov theorem, 59
Gauss–Newton iteration, 46
Gaussian measure, 18, 19
Gibbs’ inequality, 55
Gibbs’ phenomenon, 96
Gram matrix, 122, 139

Hankel determinant, 86, 89
HDMR
  projectors, 118
Hellinger distance, 65, 68
Hermite polynomials, 86, 126
Hilbert space, 24
Hoeffding’s inequality, 113
Hoeffding’s lemma, 114
Hotelling transform, 124

independence, 17
information, 52
inner product, 23
inner product space, 24
inner regular measure, 159
integral
  Bochner, 16, 32
  Lebesgue, 14
  Pettis, 16, 19
  strong, 16
  weak, 16
interior point method, 42
interval arithmetic, 50

Kálmán filter, 74
  ensemble, 78
  extended, 77
  linear, 74
Karhunen–Loève theorem, 123
  sampling Gaussian measures, 124
Karush–Kuhn–Tucker conditions, 37
Koksma’s inequality, 108
Koksma–Hlawka inequality, 108
Kozono–Yanagisawa theorem, 140
Kreĭn–Milman theorem, 39
Kullback–Leibler divergence, 54

Lagrange multipliers, 37
Lagrange polynomials, 92
Lagrangian observations, 80
law of large numbers, 106
Lax–Milgram theorem
  deterministic, 137
  stochastic, 141
Lebesgue Lp space, 15
Lebesgue integral, 14
Lebesgue measure, 10, 18
Legendre polynomials, 86
linear Kálmán filter, 74
linear program, 43

marginal, 18
maximum entropy
  principle of, 155
McDiarmid diameter, 112
McDiarmid subdiameter, 112
McDiarmid’s inequality, 113
measurable function, 12
measurable space, 9
measure, 9
measure affine function, 160
Mercer kernel, 122
Mercer’s theorem, 122
midpoint rule, 100
Minty’s extension theorem, 163
Moore–Penrose pseudo-inverse, 61
mutually singular measures, 17

Newton’s method, 34
Newton–Cotes formula, 101
norm, 23
normal equations, 44
normed space, 23
null set, 10

orthogonal complement, 27
orthogonal polynomials, 88
orthogonal projection, 27
orthogonal set, 27
orthonormal set, 27

parallelogram identity, 24
Parseval identity, 29
penalty function, 38
Pettis integral, 16, 19
polarization identity, 24
precision operator, 19
principal component analysis, 124
probability density function, 17
probability measure, 9
product measure, 17
push-forward measure, 12

quadrature formula, 99

Radon space, 159
Radon–Nikodým theorem, 17
random variable, 12
Riesz representation theorem, 26
Riesz space, 159
RS-HDMR, 116
Runge’s phenomenon, 93

Schrödinger’s inequality, 52
Schur complements, 81
  and conditioning of Gaussians, 81
semi-norm, 23
Sherman–Morrison–Woodbury formula, 81
signed measure, 9
simulated annealing, 36
singular value decomposition, 111, 125
Sobol′ indices, 118
Sobolev space, 26, 95
stochastic collocation method, 151
stochastic process, 12
strong integral, 16
support, 10
surprisal, 52

Takahasi–Mori quadrature, 109
tanh–sinh quadrature, 109
tensor product, 30
Tikhonov regularization, 45, 58
total variation distance, 54, 68
trapezoidal rule, 100
trivial measure, 10

uncertainty principle, 52

Vandermonde matrix, 92
variance, 15
vector lattice, 159

weak integral, 16
Wiener–Hermite PC expansion, 127
Winkler’s theorem, 160

zero-one measure, 39
