Stochastic Processes Lecture Notes
Stochastic Processes
Lecture, Summer term 2020, Bonn
3 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 Definition of stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Construction of stochastic processes; Kolmogorov’s theorem . . . . . . 44
3.3 Examples of stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.3 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.4 Gibbs measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Upcrossings and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Doob decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Markov processes with stationary transition probabilities . . . . . . . . . 83
5.2 The strong Markov property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Markov processes and martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Harmonic functions and martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Dirichlet problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5.1 Green function, equilibrium potential, and equilibrium measure . . . 92
5.6 Doob’s h-transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.7 Markov chains with countable state space . . . . . . . . . . . . . . . . . . . . . . 96
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 1
A review of measure theory
The most useful application of Dynkin's theorem is the observation that, if two probability (resp. σ-finite) measures are equal on a Π-system, T, then they are equal on σ(T), the σ-algebra generated by T. This holds since the collection of sets on which the two measures coincide forms a λ-system containing T.
Examples.
The general setup allows us to treat many important examples on the same footing.
In the case when I is infinite (e.g. I = R+ ), we will often use a weaker topology that
"ignores infinity", called the topology of "uniform convergence on finite subsets".
It can be metrised by
\[
\| f - g\| \equiv \sum_{n=1}^{\infty} 2^{-n}\, \frac{\sup_{0\le t\le n} | f(t) - g(t)|}{1 + \sup_{0\le t\le n} | f(t) - g(t)|}. \tag{1.1.6}
\]
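The metric (1.1.6) can be sketched numerically. The helper below is hypothetical (not from the notes): it truncates the series at finitely many terms (the neglected tail is at most 2^{-n_terms}, since each summand is bounded by 2^{-n}) and approximates the suprema on a grid.

```python
def finite_interval_dist(f, g, n_terms=30, mesh=1000):
    """Truncated version of the series (1.1.6); each summand is <= 2**(-n),
    so stopping at n_terms incurs an error of at most 2**(-n_terms)."""
    total = 0.0
    for n in range(1, n_terms + 1):
        # approximate sup_{0 <= t <= n} |f(t) - g(t)| on a grid
        sup = max(abs(f(k * n / mesh) - g(k * n / mesh)) for k in range(mesh + 1))
        total += 2.0 ** (-n) * sup / (1.0 + sup)
    return total

# f and g agree on [0, 5] and differ only afterwards: the distance is small,
# because the disagreement is felt only in the heavily discounted terms n >= 6.
f = lambda t: 0.0
g = lambda t: 0.0 if t <= 5 else 1.0
print(finite_interval_dist(f, g))  # about 0.0156 = (1/2) * sum_{n > 5} 2^-n
```

Functions that differ only far out on the time axis are thus close in this metric, which is exactly the sense in which the topology "ignores infinity".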
We will begin to deal with such examples in the later parts of this course, when we
introduce Gaussian random processes with continuous time.
Spaces of measures. Another space we often encounter in probability theory is that of measures on a Borel σ-algebra. There are various ways to introduce topologies on spaces of measures, but a very common one is the so-called weak topology. Let E be the topological space in question, and C0 (E, R) the space of real-valued, bounded, and continuous functions on E. We denote by M+ (E, B(E)) the set of all positive measures on (E, B(E)). One can then define neighbourhoods of a measure µ of the form
\[
B_{\varepsilon, k, f_1,\dots,f_k}(\mu) \equiv \Big\{ \nu \in M_+(E, \mathcal B(E)) : \max_{1\le i\le k} |\mu(f_i) - \nu(f_i)| < \varepsilon \Big\}, \tag{1.1.7}
\]
where k ∈ N, f1 , . . . , fk ∈ C0 (E, R), and ε > 0.
The aim of this section is to prove the following version of Carathéodory's theorem.
Theorem 1.15 (Carathéodory’s theorem). Let Ω be a set and let S be an algebra
of subsets of Ω. Let µ0 : S → [0, ∞] be a σ-additive set-function. Then there exists a
measure, µ, on (Ω, σ(S)), such that µ = µ0 on S. If µ0 is σ-finite, then µ is unique.
Proof. We begin by defining the notion of an outer measure.
Definition 1.16. Let Ω be a set. A map µ∗ : P(Ω) → [0, ∞] is called an outer measure
if,
(i) µ∗ (∅) = 0;
(ii) If A ⊂ B, then µ∗ (A) ≤ µ∗ (B) (increasing);
(iii) µ∗ is σ-subadditive, i.e., for any sequence An ∈ P(Ω), n ∈ N,
\[
\mu^*\Big(\bigcup_{n\in\mathbb N} A_n\Big) \le \sum_{n\in\mathbb N} \mu^*(A_n). \tag{1.2.2}
\]
Note that an outer measure is far less constrained than a measure; this is why it can be defined on the full power set, and not just on a σ-algebra.
Example. If (Ω, F, µ) is a measure space, we can define an extension of µ that will be an outer measure on P(Ω) as follows: For any D ⊂ Ω, let
\[
\mu^*(D) \equiv \inf\{\mu(A) : A \in \mathcal F,\ A \supseteq D\}. \tag{1.2.3}
\]
A set B ⊂ Ω is called µ∗-measurable if, for all A ⊂ Ω,
\[
\mu^*(A) = \mu^*(A\cap B) + \mu^*(A\cap B^c). \tag{1.2.4}
\]
We denote the collection of all µ∗-measurable sets by M(µ∗ ). By σ-subadditivity, the inequality
\[
\mu^*(A) \le \mu^*(A\cap B) + \mu^*(A\cap B^c) \tag{1.2.5}
\]
always holds, so only the converse inequality needs to be verified. In particular, if µ∗ (B) = 0, then µ∗ (A ∩ B) = 0, and hence
\[
\mu^*(A) \ge \mu^*(A\cap B^c) = \mu^*(A\cap B) + \mu^*(A\cap B^c). \tag{1.2.6}
\]
Thus, M(µ∗ ) contains all sets B with µ∗ (B) = 0. This implies in particular that ∅ ∈ M(µ∗ ). Also, by the symmetry of the definition, M(µ∗ ) contains the complement of each of its elements. Thus the only non-trivial thing to show is the stability under countable unions. Note that the measurability criterion holds trivially if µ∗ (A) = +∞; therefore we can assume henceforth that the sets A satisfy µ∗ (A) < ∞. Let B1 , B2 be in M(µ∗ ). Then, using sub-additivity and the measurability of B2 (applied to A ∩ B1c) and then of B1,
\[
\mu^*(A\cap(B_1\cup B_2)) + \mu^*(A\cap(B_1\cup B_2)^c)
\le \mu^*(A\cap B_1) + \mu^*(A\cap B_1^c\cap B_2) + \mu^*(A\cap B_1^c\cap B_2^c)
\]
\[
= \mu^*(A\cap B_1) + \mu^*(A\cap B_1^c) = \mu^*(A).
\]
Thus B1 ∪ B2 ∈ M(µ∗ ). This implies that M(µ∗ ) is closed under finite union. Since
it is also closed under passage to the complement, it is closed under finite inter-
section. Thus it is enough to show that countable unions of pairwise disjoint sets,
Bk ∈ M(µ∗ ), k ∈ N, are in M(µ∗ ). To show this, we show that, for all m ∈ N,
\[
\mu^*(A) = \sum_{n=1}^{m} \mu^*(A\cap B_n) + \mu^*\Big(A\cap\bigcap_{n=1}^{m} B_n^c\Big). \tag{1.2.9}
\]
so, inserting this into (1.2.9), it holds for m + 1. Hence, by induction, it is true for all
m ∈ N.
From (1.2.9) we deduce further that
\[
\mu^*(A) \ge \sum_{n=1}^{m} \mu^*(A\cap B_n) + \mu^*\Big(A\cap\bigcap_{n=1}^{\infty} B_n^c\Big). \tag{1.2.10}
\]
Letting m ↑ ∞ and using σ-subadditivity,
\[
\mu^*(A) \ge \sum_{n=1}^{\infty} \mu^*(A\cap B_n) + \mu^*\Big(A\cap\bigcap_{n=1}^{\infty} B_n^c\Big)
\ge \mu^*\Big(A\cap\bigcup_{n=1}^{\infty} B_n\Big) + \mu^*\Big(A\cap\bigcap_{n=1}^{\infty} B_n^c\Big). \tag{1.2.11}
\]
On the other hand, by sub-additivity,
\[
\mu^*(A) \le \mu^*\Big(A\cap\bigcup_{n=1}^{\infty} B_n\Big) + \mu^*\Big(A\cap\Big(\bigcup_{n=1}^{\infty} B_n\Big)^c\Big). \tag{1.2.12}
\]
Hence equality holds throughout, and so ⋃n∈N Bn ∈ M(µ∗ ). It remains to show that µ∗ is σ-additive on M(µ∗ ). Recall
that µ∗ (∅) = 0. Let now Bn be disjoint as above. Choosing A = ⋃∞n=1 Bn in the first line of (1.2.11) gives
\[
\mu^*\Big(\bigcup_{n=1}^{\infty} B_n\Big) \ge \sum_{n=1}^{\infty} \mu^*(B_n). \tag{1.2.13}
\]
Since the converse inequality holds by sub-additivity, equality holds and the result is proven. □
The preceding theorem provides a clear strategy for proving Carathéodory's theorem. All we need is to prescribe a σ-additive set-function, µ0 , on the algebra, and then to construct from it an outer measure, µ∗ . This can be done in the following way: If S is an algebra, set
\[
\mu^*(D) = \inf\Big\{ \sum_{n\in\mathbb N} \mu_0(F_n) : F_n \in S;\ \bigcup_{n\in\mathbb N} F_n \supseteq D \Big\}. \tag{1.2.14}
\]
One needs to show that this is sub-additive and defines an outer measure. Once this
is done, it remains to show that M(µ∗ ) contains σ(S). This is done by showing that
it contains S, since M(µ∗ ) is a σ-algebra.
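On a finite space, the infimum over covers in (1.2.14) can be computed by brute force. The following toy sketch (an assumed example, not from the notes) takes the algebra generated by a two-block partition and checks that the resulting µ∗ extends µ0.

```python
from itertools import chain, combinations

# Toy example: Omega = {0,1,2,3}, S the algebra generated by the partition
# {0,1}, {2,3}; mu0 is additive on S.
OMEGA = frozenset({0, 1, 2, 3})
B1, B2 = frozenset({0, 1}), frozenset({2, 3})
S = [frozenset(), B1, B2, OMEGA]
mu0 = {frozenset(): 0.0, B1: 1.0, B2: 2.0, OMEGA: 3.0}

def outer_measure(D):
    """mu*(D) as in (1.2.14): the cheapest total mu0-mass of a cover of D by
    elements of S.  On a finite space all subfamilies can be enumerated."""
    best = float("inf")
    for r in range(len(S) + 1):
        for cover in combinations(S, r):
            if D <= frozenset(chain.from_iterable(cover)):
                best = min(best, sum(mu0[F] for F in cover))
    return best

assert outer_measure(B1) == 1.0           # mu* extends mu0 on S
print(outer_measure(frozenset({0})))      # 1.0: the cheapest cover is {0,1}
print(outer_measure(frozenset({0, 2})))   # 3.0: needs both blocks
```

Note how a singleton such as {0}, which is not in S, receives the mass of the cheapest covering element; such sets are in general not µ∗ -measurable.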
Let us now conclude our proof by carrying out these steps.
Proof. First, note that the first two conditions for µ∗ to be an outer measure are trivially satisfied. To prove sub-additivity, let An , n ∈ N, be a family of subsets of Ω. For each n and given ε > 0, we can choose a family of sets Fn,i ∈ S, i ∈ N, such that An ⊂ ⋃i∈N Fn,i , and Σi∈N µ0 (Fn,i ) ≤ µ∗ (An ) + ε2−n . Such families must exist due to the definition of µ∗ as the infimum over the sums of the masses of all such covers. Then, since ⋃n ⋃i Fn,i ⊃ ⋃n An ,
\[
\mu^*\Big(\bigcup_{n\in\mathbb N} A_n\Big) \le \sum_{n,i\in\mathbb N} \mu_0(F_{n,i}) \le \sum_{n\in\mathbb N} \mu^*(A_n) + \varepsilon\sum_{n\in\mathbb N}2^{-n} \le \sum_{n\in\mathbb N} \mu^*(A_n) + 2\varepsilon, \tag{1.2.15}
\]
and since ε > 0 is arbitrary, µ∗ is σ-subadditive.
Lemma 1.20. Let µ∗ be the outer measure defined by (1.2.14). Let M(µ∗ ) be the
σ-algebra of µ∗ -measurable sets. Then σ(S) ⊂ M(µ∗ ).
Proof. We must show that M(µ∗ ) contains a family that generates σ(S). In fact, we will show that it contains all the elements of the algebra S. To see this, let B ∈ S, and let A ⊂ Ω be arbitrary with µ∗ (A) < ∞. Then, for any ε > 0, there is a collection (Fn )n∈N , with Fn ∈ S for all n ∈ N, such that A ⊂ ⋃n∈N Fn and µ∗ (A) ≥ Σn∈N µ0 (Fn ) − ε. But then,
\[
\mu^*(A\cap B) \le \mu^*\Big(\bigcup_{n\in\mathbb N} (F_n\cap B)\Big) \le \sum_{n\in\mathbb N} \mu_0(F_n\cap B), \tag{1.2.16}
\]
where the last inequality uses subadditivity and the fact that µ∗ equals µ0 on S. Also, for the same reasons,
\[
\mu^*(A\cap B^c) \le \mu^*\Big(\bigcup_{n\in\mathbb N} (F_n\setminus B)\Big) \le \sum_{n\in\mathbb N} \mu_0(F_n\setminus B). \tag{1.2.17}
\]
Adding both inequalities and using that µ0 (Fn ∩ B) + µ0 (Fn \ B) = µ0 (Fn ), we get that
\[
\mu^*(A\cap B) + \mu^*(A\cap B^c) \le \sum_{n\in\mathbb N} \mu_0(F_n) \le \mu^*(A) + \varepsilon. \tag{1.2.18}
\]
This proves
\[
\mu^*(A) \ge \mu^*(A\cap B) + \mu^*(A\cap B^c), \tag{1.2.19}
\]
and since the opposite inequality follows by sub-additivity, B ∈ M(µ∗ ). □
Remark. Carathéodory's theorem may appear rather striking at first in its generality: it makes no assumptions on the nature of the space Ω whatsoever. Does this mean that the construction of a measure is in general trivial? The answer is of course no, but Carathéodory's theorem separates clearly the topological aspects from the algebraic aspects of measure theory. Namely, it shows that in a concrete situation, to construct a measure one needs to construct a σ-additive set-function on an algebra that contains a Π-system that will generate the desired σ-algebra. The proof of Carathéodory's theorem shows that the extension to a measure is essentially a matter of algebra and completely general. We will see later how topological aspects enter into the construction of additive set-functions, and why aspects like separability and metric topologies become relevant.
Remark. The σ-algebra M(µ∗ ) is in general not equal to the σ-algebra generated by S. In particular, we have seen that M(µ∗ ) contains all sets of µ∗ -measure zero, all of which need not be in σ(S). This observation suggests considering, in general, extensions of a given σ-algebra with respect to a measure that ensure that all sets of measure zero are measurable. Let (Ω, F, µ) be a measure space. Define the outer measure, µ∗ , as in (1.2.3), and define the inner measure, µ∗ , as
\[
\mu_*(A) \equiv \sup\{\mu(B) : B \in \mathcal F,\ B \subseteq A\}. \tag{1.2.20}
\]
Then set
\[
M(\mu) \equiv \{A \subset \Omega : \mu^*(A) = \mu_*(A)\}. \tag{1.2.21}
\]
One can easily check that M(µ) is a σ-algebra that contains F and all sets of outer measure zero.
Terminology. A measure, µ, defined on a Borel-σ-algebra F = B(Ω) is sometimes
called a Borel measure. The measure space (Ω, M(µ), µ) is called the completion of
(Ω, F, µ).
It is a nice feature of null-sets that not only can they be added, but they can also be gotten rid of. This is the content of the next lemma.
Lemma 1.21. Let (Ω, F, µ) be a probability space and assume that G ⊂ Ω is such
that µ∗ (G) = 1. Then, for any A ∈ F, µ∗ (G ∩ A) = µ(A) and if G ≡ F ∩ G (that is the
set of all subsets of G of the form G ∩ A, A ∈ F), then (G, G, µ∗ ) is a probability
space.
Proof. Exercise. □
Definition 1.22. Let Ω be a Hausdorff space and B(Ω) the corresponding Borel-σ-
algebra. A measure, µ, on (Ω, F = B(Ω)), is called:
(0) a Borel measure, if for any compact set C ∈ F, µ(C) < ∞;
(i) inner regular or tight, if, for all B ∈ F, µ(B) = supC⊂B µ(C), where the supremum is over all compact sets contained in B;
(ii) outer regular, if for all B ∈ F, µ(B) = inf O⊃B µ(O), where the infimum is over all open sets containing B;
(iii) locally finite, if for any point p ∈ Ω there exists a neighbourhood U p such that µ(U p ) < ∞;
(iv) a Radon measure, if it is inner regular and locally finite.
A very important result is that on a compact metrisable space, all probability measures are inner regular. The following result will be used in the construction of stochastic processes in Section 3.2.
Theorem 1.23. Let Ω be a (Hausdorff) compact metrisable space and let P be a
probability measure on (Ω, B(Ω)). Then P is inner regular.
Proof. Let A be the class of elements, B, of B(Ω), such that, for all ε > 0, there exist a compact set, K ⊂ B, and an open set, G ⊃ B, such that P(B\K) < ε and P(G\B) < ε.
Step 1: Show that A is an algebra. First, if B ∈ A, then its complement, Bc , will also be in A (for Gc is closed, and hence compact since Ω is compact, Bc ⊃ Gc , and Bc \Gc = G\B, and vice versa). Next, if B1 , B2 ∈ A, then there are Ki ⊂ Bi and Gi ⊃ Bi , such that P(Bi \Ki ) < ε/2 and P(Gi \Bi ) < ε/2. Then K = K1 ∪ K2 and G = G1 ∪ G2 are the desired sets for B = B1 ∪ B2 . Thus A is an algebra.
Step 2: Show that A is a σ-algebra. Now let Bn be an increasing sequence of elements of A such that ⋃n∈N Bn = B. We choose sets Kn and Gn as before, but with ε/2 replaced by ε2−n−1 . Then there exists an N < ∞ such that P(B \ ⋃Nn=1 Kn ) < ε. Indeed, P(B \ ⋃n∈N Kn ) < ε/2, while P(⋃n∈N Kn \ ⋃Nn=1 Kn ) < ε/2 for N large enough.
contains all closed sets, and since B(Ω) is the smallest σ-algebra that contains all closed sets, B(Ω) ⊂ A. By definition A ⊂ B(Ω); thus B(Ω) = A.
Now for any B ∈ B(Ω) and K ⊂ B compact, P(B) = P(K) + P(B\K). Since, for any B ∈ A and any ε > 0, there exists by definition a compact K ⊂ B such that P(B\K) < ε, it follows that sup{P(K) : K ⊂ B compact} = P(B), so P is inner regular. □
Remark. We used in the proof the fact that Ω is compact to conclude that all closed
sets are compact. Otherwise, we would only deduce that for all B ∈ B(Ω), P(B) =
inf{P(C) : C ⊂ B}, where the infimum is over all closed sets. Sometimes this is called
inner regular, and tight is reserved for our definition. Note also that the conclusion
of the theorem also holds for separable metrisable spaces without the compactness
assumption.
Remark. Note that the proof shows that P is also outer regular. Measures that are
both inner and outer regular are sometimes called regular.
Definition 1.24. Let (Ω, F) and (E, G) be two measurable spaces. A map f : Ω → E
is called measurable from (Ω, F) to (E, G), if, for all A ∈ G, f −1 (A) ≡ {ω ∈ Ω : f (ω) ∈
A} ∈ F.
If P is a probability measure on (Ω, F) and f is measurable, then
\[
P_f \equiv P \circ f^{-1}
\]
defines a probability measure on (E, G), called the induced measure (or image measure). Namely, for any B ∈ G, by definition
\[
P_f(B) = P\big(f^{-1}(B)\big).
\]
Definition 1.25. Let (Ω, F) be a measurable space, and let (E, B(E)) be a topological
space equipped with its Borel-σ-algebra. Let f be an E-valued random variable. We
define σ( f ) as the smallest σ-algebra such that f is measurable from (Ω, σ( f )) to (E, B(E)).
Note that σ( f ) depends on the set of values f takes. E.g., if f is real-valued but takes only finitely many values, the σ-algebra generated by f has just finitely many elements. If f is a constant function, then σ( f ) = {Ω, ∅}, the trivial σ-algebra. This notion is particularly useful if several random variables are defined on the same probability space.
Dynkin’s theorem has a useful analogue for so-called monotone classes of func-
tions.
Theorem 1.26 (Monotone class theorem). Let H be a class of bounded functions
on Ω to R. Assume that
(i) H is a vector space over R,
(ii) 1 ∈ H,
(iii) if fn ≥ 0 are in H, and fn ↑ f , where f is bounded, then f ∈ H.
If H contains the indicator functions of every element of a Π-system S, then H
contains any bounded σ(S)-measurable function.
Proof. Let D be the class of subsets D of Ω such that 1D ∈ H. Then D is a λ-system.
Since by hypothesis D contains S, by Dynkin’s theorem, D contains the σ-algebra
generated by S. Now let f be a σ(S)-measurable function s.t. 0 ≤ f ≤ K < ∞ for
some constant K. Set D(n, i) ≡ {ω : i2−n ≤ f (ω) < (i + 1)2−n } ∈ σ(S), and set
\[
f_n(\omega) \equiv \sum_{i=0}^{K2^n} i\,2^{-n}\, \mathbb 1_{D(n,i)}(\omega). \tag{1.3.2}
\]
Then fn ↑ f , and each fn belongs to H as a finite linear combination of indicator functions of sets in σ(S); hence f ∈ H by (iii).
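The dyadic staircase (1.3.2) can be sketched numerically at a single point. The helper below is a hypothetical illustration, with D(n, i) taken as the level sets {i2^{-n} ≤ f < (i + 1)2^{-n}} (a standard choice).

```python
def dyadic_approx(f_val, n, K):
    """Value at a single point of f_n = sum_i i 2^-n 1_{D(n,i)}, where
    D(n, i) = {i 2^-n <= f < (i+1) 2^-n} and 0 <= f <= K."""
    i = min(int(f_val * 2 ** n), K * 2 ** n)  # index of the dyadic level set
    return i * 2.0 ** (-n)

# f_n increases to f pointwise, with error at most 2^-n:
value = 0.7310
approximations = [dyadic_approx(value, n, K=1) for n in range(1, 8)]
assert approximations == sorted(approximations)   # monotone increasing in n
for n, v in enumerate(approximations, start=1):
    assert v <= value < v + 2.0 ** (-n)
print(approximations[:3])  # [0.5, 0.5, 0.625]
```

The monotone increase fn ↑ f is exactly what lets hypothesis (iii) of the monotone class theorem be applied.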
R are equipped with their Borel σ-algebras. Thus, all functions that are pointwise limits of continuous functions are measurable, etc.
1.4 Integrals
A non-negative simple function is a function of the form
\[
g(\omega) = \sum_{i=1}^{k} w_i\, \mathbb 1_{A_i}(\omega),
\]
with k ∈ N, wi ≥ 0, and Ai ∈ F.
Definition 1.29. Let (Ω, F, µ) be a measure space and g = Σki=1 wi 1Ai a non-negative simple function. Then
\[
\int_\Omega g\,d\mu \equiv \sum_{i=1}^{k} w_i\, \mu(A_i). \tag{1.4.1}
\]
Note that ∫ 1A dµ = µ(A), consistent with any reasonable interpretation of what an integral should be.
Definition 1.30.
(i) Let f be non-negative and measurable. Then
\[
\int_\Omega f\,d\mu \equiv \sup_{g\le f,\ g\in\mathcal E_+} \int_\Omega g\,d\mu, \tag{1.4.2}
\]
where E+ denotes the set of non-negative simple functions.
Remark. One sometimes also writes µ( f ) for ∫ f dµ, but this should be avoided in situations where it could cause confusion.
In the case when we are dealing with integrals with respect to a probability mea-
sure, there exists a very useful improvement of the dominated convergence theorem
that leads us to the important notion of uniform integrability.
We have frequently used the following fact about probability measures:
Lemma 1.35. Let (Ω, F, P) be a probability space and let X be an absolutely integrable real-valued random variable on this space. Then, for any ε > 0, there exists K < ∞, such that
\[
E\big(|X|\,\mathbb 1_{|X|>K}\big) < \varepsilon. \tag{1.4.9}
\]
When dealing with families of random variables, one problem is that this property will in general not hold uniformly. A nice situation occurs if it does: a family, C, of real-valued random variables is called uniformly integrable if, for any ε > 0, there exists K < ∞ such that, for all X ∈ C,
\[
E\big(|X|\,\mathbb 1_{|X|>K}\big) < \varepsilon. \tag{1.4.10}
\]
A standard example of a family that is not uniformly integrable is given by random variables Xn with P(Xn = 0) = 1 − 1/n and P(Xn = n) = 1/n: clearly, for any K, limn→∞ E(|Xn |1|Xn |>K ) = 1. One should always keep this example
in mind when reflecting upon uniform integrability. Note that, on the other hand, the class of random variables (Yn , n ∈ N) with
\[
P(Y_n = 1) = 1 - 1/n \quad\text{and}\quad P\big(Y_n = \sqrt n\big) = 1/n \tag{1.4.12}
\]
is uniformly integrable.
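The contrast between the two families can be checked by direct computation of the tail expectations. The family Xn below (taking the value n with probability 1/n) is an assumed standard counterexample; the family Yn is the one of (1.4.12).

```python
import math

def tail_mass_X(n, K):
    """E(|X_n| 1_{|X_n|>K}) for X_n = n w.p. 1/n and 0 otherwise
    (an assumed standard non-uniformly-integrable example)."""
    return n * (1.0 / n) if n > K else 0.0

def tail_mass_Y(n, K):
    """E(|Y_n| 1_{|Y_n|>K}) for the family (1.4.12)."""
    m = 0.0
    if 1 > K:
        m += 1.0 * (1 - 1.0 / n)
    if math.sqrt(n) > K:
        m += math.sqrt(n) / n        # = n**-0.5, which tends to 0
    return m

K = 10.0
print(max(tail_mass_X(n, K) for n in range(1, 10_000)))  # 1.0: never small
print(max(tail_mass_Y(n, K) for n in range(1, 10_000)))  # < 0.1: u.i. family
```

For the family Xn the tail mass equals 1 for every n > K, so no single K works for the whole family; for Yn the tail contribution is n^{-1/2}, which is uniformly small once K is large.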
The following lemma gives an equivalent formulation of uniform integrability
that is sometimes useful.
Lemma 1.37. A class, C, of real-valued random variables is uniformly integrable if and only if supX∈C E|X| < ∞ and for all ε > 0 there exists δ > 0, such that, for any set A ∈ F with
\[
P(A) < \delta, \tag{1.4.13}
\]
it holds that, for all X ∈ C,
\[
E\big(|X|\,\mathbb 1_A\big) < \varepsilon. \tag{1.4.14}
\]
Proof. First, assume that the class C is uniformly integrable. For given ε, let K be such that E(|X|1|X|>K ) < ε/2 for all X ∈ C, and set δ = ε/(2K). Then, for any A ∈ F with P(A) < δ,
\[
E\big(|X|\,\mathbb 1_A\big) \le E\big(|X|\,\mathbb 1_{|X|>K}\big) + K\,P(A) < \varepsilon/2 + \varepsilon/2 = \varepsilon.
\]
Moreover, E|X| ≤ K + ε/2, so supX∈C E|X| < ∞. Conversely, by Chebyshev's inequality, P(|X| > K) ≤ E|X|/K, so for given δ > 0 there exists K < ∞ such that P(|X| > K) ≤ δ for all X ∈ C. Then E(|X|1|X|>K ) < ε, if δ is chosen as in the conclusion of the lemma. Hence C is uniformly integrable. □
Proof. The key idea is that if a sequence of random variables is uniformly integrable, then this is almost as good as if the sequence were bounded. But a sequence of bounded random variables has limit points, and these limit points must all be the same if the sequence converges in probability. So Xn converges to X almost surely, and then Xn also converges in L1 by dominated convergence. Conversely, if Xn converges to X in L1 , then Xn is essentially bounded, since almost all Xn are very close to X (which is essentially bounded by Lemma 1.36). Let us now make this rigorous.
We first show the "if" part. Define
\[
\phi_K(x) \equiv \begin{cases} K, & \text{if } x > K,\\ x, & \text{if } |x| \le K,\\ -K, & \text{if } x < -K. \end{cases} \tag{1.4.17}
\]
We have obviously from the uniform integrability that for any ε there exists K < ∞, such that
\[
E\big(|\phi_K(X_n) - X_n|\big) \le \varepsilon, \tag{1.4.18}
\]
for all n ≥ 0 (where for convenience we set X ≡ X0 ). Moreover, since |φK (x) − φK (y)| ≤ |x − y|, (i) implies that φK (Xn ) → φK (X) in probability. Thus, for any δ > 0 and ε > 0, there exists n0 ∈ N such that, for all n ≥ n0 , P(|φK (Xn ) − φK (X)| > δ) ≤ ε/K. Then, since |φK (Xn )| ≤ K, for such n,
\[
E\big(|\phi_K(X_n) - \phi_K(X)|\big) \le \delta + 2K\,P\big(|\phi_K(X_n) - \phi_K(X)| > \delta\big) \le \delta + 2\varepsilon,
\]
and so limn→∞ E(|φK (Xn ) − φK (X)|) = 0. In view of the fact that (1.4.18) holds for any ε, it follows that E(|Xn − X|) → 0.
Let us now show the converse ("only if") direction. If E(|Xn − X|) → 0, then by Chebychev's inequality, P(|Xn − X| > ε) ≤ E(|Xn − X|)/ε → 0, so Xn → X in probability. Now write Xn = (Xn − X) + X and use that, by the triangle inequality, |Xn | ≤ |Xn − X| + |X|. For any ε > 0, there exists n0 such that, for all n ≥ n0 , E(|Xn − X|) < ε. Since all Xi and X are integrable, there exists K such that, for all n ≤ n0 , E(|Xn |1|Xn |>K ) < ε. Hence
\[
E\big(|X_n|\,\mathbb 1_{|X_n|>2K}\big) \le \begin{cases} \varepsilon, & \text{if } n \le n_0,\\ E\big(|X|\,\mathbb 1_{|X_n|>2K}\big) + \varepsilon, & \text{if } n > n_0. \end{cases} \tag{1.4.21}
\]
The importance of this result lies in the fact that in probability theory we are very often dealing with functions that are not uniformly bounded, and where Lebesgue's theorem is not immediately applicable either. Uniform integrability is the best possible condition for convergence of the integrals. Note that the simple example (1.4.12) of a uniformly integrable family given above furnishes a nice example where E(|Xn − X|) → 0, but where Lebesgue's dominated convergence theorem cannot be applied.
Exercise: Use the previous criterion to prove Lebesgue’s dominated convergence
theorem in the case of probability measures.
1.5 ℒ^p and L^p spaces

I will only rather briefly summarise some frequently used notions concerning spaces of integrable functions. Given a measure space, (Ω, F, µ), one defines, for p ∈ [1, ∞) and measurable functions, f ,
\[
\| f \|_{p,\mu} \equiv \| f \|_p \equiv \big(E| f |^p\big)^{1/p} = \Big( \int_\Omega | f |^p\,d\mu \Big)^{1/p}. \tag{1.5.1}
\]
Also, set
\[
\| f \|_\infty \equiv \sup_{\omega\in\Omega} | f(\omega)|. \tag{1.5.2}
\]
Minkowski's inequality states that
\[
\| f + g \|_p \le \| f \|_p + \| g \|_p. \tag{1.5.5}
\]
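On a finite measure space the norm (1.5.1) is a plain finite sum, so Minkowski's inequality (and the p = 2 Hölder inequality) can be checked directly; the data below are a hypothetical toy example.

```python
def lp_norm(f, mu, p):
    """||f||_p on a finite measure space, matching (1.5.1):
    (sum over omega of |f(omega)|^p * mu(omega))^(1/p)."""
    return sum(abs(f[w]) ** p * mu[w] for w in mu) ** (1.0 / p)

mu = {"a": 0.2, "b": 0.3, "c": 0.5}          # a finite measure (toy data)
f = {"a": 1.0, "b": -2.0, "c": 0.5}
g = {"a": 0.0, "b": 1.5, "c": -1.0}
h = {w: f[w] + g[w] for w in mu}

# Minkowski's inequality (1.5.5), for several values of p:
for p in (1, 2, 3):
    assert lp_norm(h, mu, p) <= lp_norm(f, mu, p) + lp_norm(g, mu, p) + 1e-12

# The p = 2 Hoelder (Cauchy-Schwarz) inequality:
inner = sum(f[w] * g[w] * mu[w] for w in mu)
assert abs(inner) <= lp_norm(f, mu, 2) * lp_norm(g, mu, 2)
```

Of course this verifies the inequalities only for one sample of data; the proofs below establish them in general.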
Both inequalities follow from one of the most important inequalities in integra-
tion theory, Jensen’s inequality.
Theorem 1.41 (Jensen's inequality). Let (Ω, F, P) be a probability space, let X be an absolutely integrable random variable, and let ϕ : R → R be a convex function. Then, for any c ∈ R,
Eϕ(X − EX + c) ≥ ϕ(c), (1.5.7)
and in particular
Eϕ(X) ≥ ϕ (EX) . (1.5.8)
Proof. If ϕ is convex, then for any y there is a straight line below ϕ that touches ϕ at (y, ϕ(y)), i.e. there exists m ∈ R such that ϕ(x) ≥ ϕ(y) + (x − y)m. Choosing x = X − EX + c and y = c and taking expectations on both sides yields (1.5.7). Taking c = EX in (1.5.7) yields (1.5.8). □
Fig. 1.1 A convex function ϕ lies above the tangent line touching it at (y, ϕ(y)).
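Jensen's inequality can also be illustrated empirically. For the convex function φ(x) = x², inequality (1.5.7) reduces to Var(X) + c² ≥ c², so the assertions below hold for any sample (the data here are simulated, as an illustration only):

```python
import random

random.seed(0)
phi = lambda x: x * x          # a convex function
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)       # empirical EX

# (1.5.8): E phi(X) >= phi(EX)
assert sum(phi(x) for x in xs) / len(xs) >= phi(mean)

# (1.5.7): E phi(X - EX + c) >= phi(c), for any c
for c in (-1.0, 0.0, 2.5):
    assert sum(phi(x - mean + c) for x in xs) / len(xs) >= phi(c)
```

The same check works for any convex φ, e.g. `math.exp` or `abs`, though in general the slack between the two sides is no longer just the variance.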
Exercise: Prove the Hölder inequalities (for p > 1) using Jensen’s inequality.
Since Minkowski's inequality is really a triangle inequality and property (i) (linearity w.r.t. scalar multiplication) is trivial, we would be inclined to think that ‖ · ‖p is a norm and ℒ^p is a normed space. In fact, the only problem is that ‖ f ‖p = 0 does not imply f = 0, since f may be non-zero on sets of µ-measure zero. Therefore, to define a normed space, one considers equivalence classes of functions in ℒ^p by calling two functions, f, f′, equivalent if f − f′ is non-zero only on a set of measure zero. The space of these equivalence classes is called L^p ≡ L^p (Ω, F, µ).
The following fact about L p spaces will be useful to know.
Lemma 1.42. The spaces L^p (Ω, F, µ) are Banach spaces (i.e. complete normed vector spaces).
Proof. By now, the only non-trivial fact that needs to be proven is the completeness of L^p . Let fi ∈ L^p , i ∈ N, be a Cauchy sequence. Then there are nk ∈ N, such that, for all i, j ≥ nk , ‖ fi − f j ‖p ≤ 2−k−k/p . Set gk ≡ fnk and
\[
F \equiv \sum_{k\in\mathbb N} 2^{kp}\, |g_k - g_{k+1}|^p. \tag{1.5.9}
\]
Then
\[
\int F\,d\mu = \sum_{k\in\mathbb N} 2^{kp} \int |g_k - g_{k+1}|^p\,d\mu = \sum_{k\in\mathbb N} 2^{kp}\, \|g_k - g_{k+1}\|_p^p \le \sum_{k\in\mathbb N} 2^{-k} \le 1. \tag{1.5.10}
\]
Therefore, F is integrable and hence finite except possibly on a set of measure zero. It follows that, for all ω ∈ Ω such that F(ω) is finite, |gk (ω) − gk+1 (ω)| ≤ 2−k F(ω)1/p . It follows further, using a telescopic expansion and the triangle inequality, that gk (ω) is a Cauchy sequence of real numbers, and hence convergent. Set f (ω) = limk→∞ gk (ω). For the ω in the null-set where F(ω) = +∞, we set f (ω) = 0. It follows readily that
\[
\int |g_k - f|^p\,d\mu \to 0, \tag{1.5.11}
\]
so that f ∈ L^p and gk → f in L^p ; since ( fn ) is a Cauchy sequence, the whole sequence converges to f . □
The case p = 2 is particularly nice, in that the space L2 is not only a Banach space, but a Hilbert space. The point here is that the Hölder inequality, applied in the case p = 2, yields
\[
\int fg\,d\mu \le \| f \|_2\, \| g \|_2. \tag{1.5.13}
\]
This allows us to define
\[
( f, g)_\mu \equiv \int fg\,d\mu,
\]
which has the properties of a scalar product. The L2 -norm is the derived norm, ‖ f ‖2 = √( f, f )µ . Although somehow L2 spaces are not the most natural settings for
An important tool for the computation of integrals on product spaces is Fubini's theorem. We consider first the case of non-negative functions.
Theorem 1.43 (Fubini–Tonelli). Let (Ω1 , F1 , µ1 ) and (Ω2 , F2 , µ2 ) be two measure spaces, and let f be a real-valued, non-negative measurable function on (Ω1 × Ω2 , F1 ⊗ F2 ). Then the two functions
\[
h(x) \equiv \int_{\Omega_2} f(x,y)\,\mu_2(dy) \quad\text{and}\quad g(y) \equiv \int_{\Omega_1} f(x,y)\,\mu_1(dx)
\]
(i) ν ≪ µ.
(ii) There exists a non-negative measurable function, f , such that ν = µ_f .
Moreover, f is unique up to µ-null sets.
Proof. It is enough to consider the case when µ is finite. Moreover, we may restrict ourselves to the case when |gt | < C, for all t ∈ T (e.g. by passing from gt to tanh(gt ), which is monotone and preserves all properties of the definition). Let S denote the class of all countable subsets of T . Set
\[
\alpha \equiv \sup_{I\in S} E\Big( \sup_{t\in I} g_t \Big). \tag{1.7.4}
\]
The notion of essential supremum is used in the next lemma, which is the major
step in the proof of the Radon-Nikodým theorem.
Lemma 1.50. Let (Ω, F, µ) be a measure space, with µ a σ-finite measure, and let ν be another σ-finite measure on (Ω, F). Let H be the family of all measurable functions, h ≥ 0, such that, for all A ∈ F, ∫A h dµ ≤ ν(A). Then, for all A ∈ F,
\[
\nu(A) = \psi(A) + \int_A g\,d\mu, \tag{1.7.6}
\]
where ψ is a measure that is singular with respect to µ, and g is the essential supremum
\[
g = \operatorname{esup}_{h\in H} h \tag{1.7.7}
\]
with respect to µ.
Proof. We again assume µ, ν to be finite, and leave the extension to σ-finite measures as an easy exercise. We also exclude the trivial case µ = 0. From Lemma 1.49 we know that there exists a sequence of functions hn ∈ H, such that g = supn∈N hn . Let us first note that if h1 , h2 ∈ H, then so is h ≡ max(h1 , h2 ). To see this, note that, with the disjoint sets A1 ≡ {h1 > h2 } and A2 ≡ {h1 ≤ h2 }, for any A ∈ F,
\[
\int_A h\,d\mu = \int_{A\cap A_1} h_1\,d\mu + \int_{A\cap A_2} h_2\,d\mu \le \nu(A\cap A_1) + \nu(A\cap A_2) = \nu(A),
\]
which implies h ∈ H. We may therefore assume the sequence hn ordered such that hn ≤ hn+1 , for all n ≥ 1. Then g = limn→∞ hn , and by monotone convergence, for all A ∈ F,
\[
\int_A g\,d\mu = \lim_{n\to\infty} \int_A h_n\,d\mu \le \nu(A). \tag{1.7.10}
\]
Define the measure ψ by ψ(A) ≡ ν(A) − ∫A g dµ, and, for A ∈ F, let Dn (A) denote the collection of sets B ⊂ A, B ∈ F, with µ(B) > 0 and ψ(B) < n−1 µ(B). The key fact is that any set A of positive µ-measure contains such subsets, i.e. Dn (A) ≠ ∅ whenever µ(A) ≠ 0. This is proven by contradiction: assume that Dn (A) = ∅, i.e. that ψ(B′) ≥ n−1 µ(B′) for all B′ ⊂ A, and set h0 = n−1 1A . For all B ∈ F one then has that
\[
\int_B h_0\,d\mu = n^{-1}\mu(A\cap B) \le \psi(A\cap B) \le \psi(B) = \nu(B) - \int_B g\,d\mu. \tag{1.7.12}
\]
But then ∫B (h0 + g) dµ ≤ ν(B), for all B ∈ F, so that g + h0 ∈ H, which contradicts the fact that g = esuph∈H h, since h0 > 0 on a set of positive µ-measure. Ideally, we might try to look at the union of all sets B ∈ Dn (Ω). If this were an element of F, it would have ψ-mass of order at most n−1 times its µ-mass, while the complement of this set would have vanishing µ-mass (otherwise there would be parts of Dn (Ω) in this complement). The problem is that this union is not necessarily countable and thus may not be in F. Therefore we must resort to a delicate iterative procedure to construct the desired set.
We begin by choosing a set B1,n ∈ Dn (Ω) with the property that
\[
\mu(B_{1,n}) \ge \tfrac12 \sup\{\mu(B) : B \in D_n(\Omega)\} \equiv \alpha_{1,n}. \tag{1.7.13}
\]
Morally, B1,n is our first attempt to pick up as much µ-mass as we can from the ψ-tiny sets. If we were lucky, and µ(Bc1,n ) = 0, then we stop the procedure. Otherwise, we continue by picking up as much mass as we can from what was left, i.e. we choose B2,n ∈ Dn (Bc1,n ) with
\[
\mu(B_{2,n}) \ge \tfrac12 \sup\{\mu(B) : B \in D_n(B_{1,n}^c)\} \equiv \alpha_{2,n}. \tag{1.7.14}
\]
If µ((B1,n ∪ B2,n )c ) = 0, we are happy and stop. Otherwise, we continue and choose B3,n ∈ Dn ((B1,n ∪ B2,n )c ) with
\[
\mu(B_{3,n}) \ge \tfrac12 \sup\{\mu(B) : B \in D_n(B_{1,n}^c \cap B_{2,n}^c)\} \equiv \alpha_{3,n}, \tag{1.7.15}
\]
and so on. If the process stops at some kn -th step, set B j,n = ∅ for j > kn .
It is obvious from the definition that B j,n ∈ Dn (Ω), if B j,n ≠ ∅. Since Dn (Ω) is closed under countable disjoint unions (both ψ and µ being measures), also Mn ≡ ⋃∞j=1 B j,n ∈ Dn (Ω). We want to show that µ(Mnc ) = 0, that is, that we have picked up
[Figure: the sets B1,n , B2,n , . . . picked successively from Dn (Ω).]
all the mass eventually. To do this, note again that, if µ(Mnc ) > 0, then there exists
D ∈ Dn (Mnc ) with µ(D) > 0.
On the other hand, for any m ∈ N,
\[
2\alpha_{m,n} = \sup\Big\{\mu(B) : B \in D_n\Big(\bigcap_{j=1}^{m-1} B_{j,n}^c\Big)\Big\} \ge \sup\big\{\mu(B) : B \in D_n(M_n^c)\big\} \ge \mu(D). \tag{1.7.16}
\]
Thus, if µ(D) > 0, then there exists some α > 0, such that µ(Bm,n ) ≥ αm,n ≥ α, for all m. Since all B j,n are disjoint, this would imply that µ(Mn ) = ∞, which contradicts the assumption that µ is a finite measure. Thus we conclude that µ(Mnc ) = 0, and so ψ(Mn ) < n−1 µ(Mn ) = n−1 µ(Ω). Therefore,
\[
\psi\Big(\bigcap_{n=1}^{\infty} M_n\Big) \le \lim_{n\to\infty} \psi(M_n) = 0, \tag{1.7.17}
\]
while
\[
\mu\Big(\Big(\bigcap_{n=1}^{\infty} M_n\Big)^c\Big) = \mu\Big(\bigcup_{n=1}^{\infty} M_n^c\Big) \le \sum_{n=1}^{\infty} \mu(M_n^c) = 0.
\]
Proof. Lemma 1.50 provides the existence of two measures νs and νc with the desired properties. To prove the uniqueness of this decomposition, assume that there are ν̃s , ν̃c with the same properties. Since the measures νs , ν̃s are carried on sets of zero µ-mass, they can only be different if there exists a set A ∈ F with µ(A) = 0 and νs (A) ≠ ν̃s (A). But then νc (A) ≠ ν̃c (A) as well, while by absolute continuity, νc (A) = ν̃c (A) = 0. Thus νs = ν̃s and consequently νc = ν̃c . □
The Radon-Nikodým theorem is now immediate: Assume that ν is absolutely
continuous with respect to µ. The decomposition (1.7.6) applied to µ-null sets A
then implies that for all these sets, ψ(A) = 0. But ψ is singular with respect to µ,
so there should be a µ-null set, A, for which ψ(Ac ) = 0. But since for all such A,
ψ(A) = 0, it follows that ψ(Ω) = ψ(A) + ψ(Ac ) = 0, and so ψ is the zero-measure.
All that remains is to assert that the Radon–Nikodým derivative is unique a.e. To do this, assume that there exists another measurable function, g∗ , such that
\[
\nu(A) = \int_A g^*\,d\mu. \tag{1.7.18}
\]
Now define, for C < ∞, the measurable set A = {ω : C > g∗ (ω) > g(ω) > −C}. Then, by assumption,
\[
\int_A g^*\,d\mu = \nu(A) = \int_A g\,d\mu. \tag{1.7.19}
\]
But since g∗ > g on A, this can only hold if µ(A) = 0, for all C < ∞. Thus, µ(g∗ > g) = 0. In the same way one shows that µ(g∗ < g) = 0, implying that g and g∗ differ at most on sets of measure zero. □
Remark. We have said (and seen in the proof) that the Radon–Nikodým derivative is defined modulo null-sets (w.r.t. µ). This is completely natural. Note that if µ and ν are equivalent, then 0 < dν/dµ < ∞ almost everywhere, and dν/dµ = (dµ/dν)−1 .
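On a finite space with equivalent measures, the Radon–Nikodým derivative is simply the pointwise ratio of the masses, and both the defining identity and the reciprocal relation of the remark can be checked exhaustively (the data below are a hypothetical toy example):

```python
from itertools import combinations

# Two equivalent measures on a finite space.
mu = {"a": 0.5, "b": 0.25, "c": 0.25}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}

dnu_dmu = {w: nu[w] / mu[w] for w in mu}   # the derivative d(nu)/d(mu)
dmu_dnu = {w: mu[w] / nu[w] for w in nu}   # ... and its counterpart

# nu(A) equals the integral of dnu/dmu over A with respect to mu, for all A:
points = sorted(mu)
for r in range(len(points) + 1):
    for A in combinations(points, r):
        assert abs(sum(nu[w] for w in A)
                   - sum(dnu_dmu[w] * mu[w] for w in A)) < 1e-12

# ... and the two derivatives are reciprocal:
assert all(abs(dnu_dmu[w] * dmu_dnu[w] - 1.0) < 1e-12 for w in mu)
print(dnu_dmu)  # {'a': 0.4, 'b': 1.2, 'c': 2.0}
```

If some point carried µ-mass but no ν-mass, the derivative would vanish there and the reciprocal relation would fail — the two measures would no longer be equivalent, only ν ≪ µ.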
Proof. We may assume that µ is finite and X non-negative. Appealing to the monotone convergence theorem, it is also enough to consider bounded X (otherwise, approximate and pass to the limit on both sides). Let H be the class of all bounded non-negative F-measurable functions for which (1.7.20) is true. Then H satisfies the hypotheses of Theorem 1.26: clearly, (i) H is a vector space, (ii) the function 1 is contained in H by definition of the Radon–Nikodým derivative, and (iii) the property (1.7.20) is stable under monotone convergence by the monotone convergence theorem. Also, H contains the indicator functions of all elements of F. Then the assertion of Theorem 1.26 implies that H contains all bounded F-measurable functions, as claimed. □
Chapter 2
Conditional expectations and conditional probabilities
In this chapter we will considerably generalise the notions of conditional expectations and conditional probabilities from elementary probability theory. In elementary probability, we could condition only on events of positive probability. This notion is too restrictive, as we have seen in the context of Markov processes, where it limited us to considering discrete state spaces. The new notion we will introduce is conditioning on σ-algebras. In this section we largely follow the presentation in Chow and Teicher [4], where much further material can be found.
To check this, note that the right-hand side is obviously A-measurable. There are only four sets in A, namely ∅, A, Ac , and Ω. But
(2.1.6)
and call this the conditional expectation of X given Y. We may then also think of E(X|Y) as a function of the value of the random variable Y. As we can see, the difficulty associated with constructing conditional expectations in the general case relates to making sense of expressions of the form 0/0. The key to the construction of conditional expectations in the general case will be the concept of the Radon–Nikodým derivative.
Theorem 2.2. Let (Ω, F, P) be a probability space, let X be a random variable such that E(|X|) < ∞, and let G ⊂ F be a sub-σ-algebra of F. Then
(i) there exists a G-measurable function, E(X|G), unique up to sets of measure zero, the conditional expectation of X given G, such that, for all A ∈ G,
\[
\int_A E(X|\mathcal G)\,dP = \int_A X\,dP. \tag{2.1.8}
\]
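On a finite probability space, the defining property (2.1.8) can be verified by hand for the σ-algebra generated by a partition; there, E(X|G) is constant on each block and equal to the P-weighted average of X over that block. The data below are a hypothetical toy example.

```python
from itertools import combinations

# Finite probability space and the sigma-algebra G generated by a partition.
P = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
X = {1: 5.0, 2: -1.0, 3: 2.0, 4: 0.0}
partition = [(1, 2), (3, 4)]   # the blocks generating G

cond = {}
for block in partition:
    p_block = sum(P[w] for w in block)
    avg = sum(X[w] * P[w] for w in block) / p_block
    for w in block:
        cond[w] = avg          # E(X|G) takes this value on the whole block

# The defining property (2.1.8): for every A in G (i.e. every union of
# blocks), the integrals of E(X|G) and of X over A coincide.
for r in range(len(partition) + 1):
    for chosen in combinations(partition, r):
        A = [w for b in chosen for w in b]
        assert abs(sum(cond[w] * P[w] for w in A)
                   - sum(X[w] * P[w] for w in A)) < 1e-12
print(cond[1], cond[3])  # block averages: approximately 1.0 and 6/7
```

Note that (2.1.8) is required only for A ∈ G: over arbitrary subsets of Ω the two integrals generally differ, which is exactly why E(X|G) is coarser than X.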
Thus, Y has the properties of a conditional expectation and we may set E(X|G) =
dλG
Y = dP G
. Note that Y is unique up to sets of measure zero. Finally, to show that the
conditional expectation is unique in the same sense, assume that there is a function
Y′ satisfying the conditions of the conditional expectation that differs from Y on a
set of positive measure. Then one may set A± = {ω : ±(Y′(ω) − Y(ω)) > 0}, and at
least one of these sets, say A+, has positive measure. Then
∫_{A+} X dP = ∫_{A+} Y′ dP > ∫_{A+} Y dP = ∫_{A+} X dP, (2.1.13)

a contradiction.
Of course, these are just the analogs of the three basic convergence theorems for
ordinary expectations. Note that we replaced the usual condition 0 ≤ Xn in (i) and
(ii) by Y ≤ Xn , which is of course a trivial generalisation (just pass to Xn − Y). We
leave the proofs as exercises.
A useful, but not unexpected, property is the following lemma.
Lemma 2.5. Let X be integrable and let Y be bounded and G-measurable. Then E(XY|G) = Y E(X|G), a.s.
Proof. We may assume that X, Y are non-negative; otherwise decompose them into
positive and negative parts and use linearity of the conditional expectation.
Define, for any A ∈ F,
ν(A) ≡ ∫_A XY dP, µ(A) ≡ ∫_A X dP. (2.2.4)
Both µ and ν are finite measures that are absolutely continuous with respect to P.
Then
dν_G/dP_G = E(XY|G), dµ_G/dP_G = E(X|G), dµ/dP = X. (2.2.5)
Then, using Lemma 1.52, for any A ∈ G,
Then, using Lemma 1.52, for any A ∈ G,

∫_A Y dµ_G = ∫_A Y (dµ_G/dP_G) dP_G = ∫_A Y E(X|G) dP_G, (2.2.6)
Specializing the second equality to the case when A ∈ G, we find that for those A,
∫_A Y E(X|G) dP = ∫_A Y X dP. (2.2.8)
Note that in the theorem we can replace “for all integrable G2-measurable random
variables” by “for all random variables of the form X = 1_B, B ∈ G2”.
Proof. Assume first that G1 and G2 are independent. Let A ∈ G1 and X be G2 -
measurable. The random variables 1A and X are independent, thus
Let us consider some cases where conditional expectations can be computed more
“explicitly”. For this, consider two random variables, X, Y, with values in Rm and
R^n (in the sequel, nothing but notation changes if we assume n = m = 1, so we will
do this). We assume that the joint distribution of X and Y is absolutely continuous
with respect to Lebesgue’s measure with density p(x, y). That is, for any function
f : R^m × R^n → R_+,

E(f(X, Y)) = ∫ f(x, y) p(x, y) dx dy. (2.3.1)
(where we should modify the density to be zero when ∫ p(x, y) dx = ∞; this can be
done because this can be true only on a set of Lebesgue measure zero). Let us note
also that the set where q(y) = 0 has measure zero.
∫∫ 1_{q(y)=0} p(x, y) dx dy = ∫ 1_{q(y)=0} q(y) dy = 0. (2.3.3)
Proof. It is obvious that the right-hand side of Equation (2.3.6) is measurable with
respect to σ(Y). Verifying the second defining property of the conditional expecta-
tion amounts to repeating the computations in Eq. (2.3.4). □
Definition 2.8. The function p(x, y)/q(y), as a function of x, is called the conditional density of X given Y = y.
What is particular here is that we can represent it as an expectation with respect to
an explicitly given probability measure.
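To make this concrete, here is a small numerical sketch (our own toy example, not from the notes; NumPy assumed): for the joint density p(x, y) = x + y on [0, 1]², the conditional density p(x, y)/q(y) of Definition 2.8 integrates to one in x, and integrating x against it produces the conditional expectation E(X | Y = y).

```python
import numpy as np

# Toy joint density p(x, y) = x + y on [0,1]^2; its marginal is q(y) = 1/2 + y.
xs = np.linspace(0.0, 1.0, 2001)

def trap(vals):
    # simple trapezoidal rule on the fixed grid xs
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(xs)))

def p(x, y):
    return x + y

y0 = 0.3
q_y0 = trap(p(xs, y0))          # marginal density q(y0) = 0.8
cond = p(xs, y0) / q_y0         # conditional density p(x, y0)/q(y0)

mass = trap(cond)               # should integrate to 1 in x
cond_mean = trap(xs * cond)     # E(X | Y = 0.3) = 29/48 = 0.60416...
print(round(mass, 4), round(cond_mean, 4))  # → 1.0 0.6042
```

Here q(y₀) = ∫₀¹ (x + y₀) dx = ½ + y₀, so everything can be checked against the closed-form answer.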
Conditional expectations have a particularly nice interpretation in the case when the
random variable X is square-integrable, i.e. if X ∈ L²(Ω, F, P) (since for the moment
we think of conditional expectations as equivalence classes modulo sets of measure
zero, we may consider X as an element of L² rather than 𝓛²). We will identify the
space L2 (Ω, G, P) as the subspace of L2 (Ω, F, P) for which at least one representative
of each equivalence class is G-measurable.
Theorem 2.9. If X ∈ L2 (Ω, F, P), then E(X|G) is the orthogonal projection of X on
L2 (Ω, G, P).
Proof. Jensen's inequality applied to the conditional expectation yields that
E(X²|G) ≥ E(X|G)², and hence E[E(X|G)²] ≤ E[E(X²|G)] = E(X²) < ∞, so that
E(X|G) ∈ L2 (Ω, G, P). Moreover, for any bounded, G-measurable function Z,
Note that this interpretation of the conditional expectation can be used to define
the conditional expectation for L2 -random variables.
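On a finite probability space the projection picture can be verified by hand. In the following sketch (a made-up six-point space with uniform measure and G generated by a two-block partition; NumPy assumed), E(X|G) is the blockwise average, it satisfies the defining relation (2.1.8), and the residual X − E(X|G) is L²-orthogonal to every G-measurable Z.

```python
import numpy as np

# Toy space Omega = {0,...,5}, uniform P; G generated by the partition
# {0,1,2}, {3,4,5}.  All concrete values here are illustrative choices.
probs = np.full(6, 1 / 6)
X = np.array([1.0, 2.0, 6.0, 3.0, 3.0, 9.0])
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# E(X|G) is constant on each block: the block average w.r.t. P.
EXg = np.empty_like(X)
for b in blocks:
    EXg[b] = np.sum(X[b] * probs[b]) / np.sum(probs[b])

# Defining property (2.1.8): integrals over each A in G agree.
for b in blocks:
    assert abs(np.sum(EXg[b] * probs[b]) - np.sum(X[b] * probs[b])) < 1e-12

# Orthogonality: E[(X - E(X|G)) Z] = 0 for every G-measurable Z
# (i.e. every Z constant on the blocks).
Z = np.array([2.0, 2.0, 2.0, -1.0, -1.0, -1.0])
inner = np.sum((X - EXg) * Z * probs)
print(abs(inner) < 1e-12)  # → True
```

Since any G-measurable Z is constant on the blocks, the orthogonality check for one nontrivial Z already illustrates Theorem 2.9 on this space.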
for any B ∈ G.
It clearly inherits from the conditional expectation the following properties:
(i) 0 ≤ P(A|G) ≤ 1, a.s.;
(ii) P(A|G) = 0, a.s., if and only if P(A) = 0; also P(A|G) = 1, a.s., if and only if
P(A) = 1;
(iii) If A_n ∈ F, n ∈ N, are disjoint sets, then

P(⋃_{n∈N} A_n | G) = Σ_{n∈N} P(A_n | G), a.s.; (2.5.3)
However, this set may depend on the sequence, and since that space is not countable,
it is unclear whether there exists a set of full measure on which (2.5.5) holds for all
sequences of sets.
These considerations lead to the definition of so-called regular conditional prob-
abilities.
The point is that, if we have a regular conditional probability, then we can express
conditional expectations as expectations with respect to genuine probability measures.
Theorem 2.11. With the notation from above, if P is a regular conditional probability on F given G, then for an F-measurable integrable random variable, X,
E(X|G)(ω) = ∫ X(ω̃) P(ω, dω̃), a.s. (2.5.6)
Proof. As often, we may assume X positive. The proof then goes through the mono-
tone class theorem (Theorem 1.26), quite similar to the proof of Theorem 1.52. One
defines the class of functions where (2.5.6) holds, verifies that it satisfies the hy-
pothesis of the monotone class theorem and notices that it is true for all indicator
functions of sets in F. □
The question remains whether and when regular conditional probabilities exist.
An example of a regular conditional probability measure (on the measure space
(R^n × R^m, B(R^n × R^m), p(x, y) dx dy)) is the measure ν from Proposition 2.7. It is easy
to check that this has all the required properties; in particular it exists for
every y.
A central result for us is the existence in the case when Ω is a Polish space.
Theorem 2.12. Let (Ω, B(Ω), P) be a probability space where Ω is a Polish space
and B(Ω) is the Borel-σ-algebra. Let G ⊂ B(Ω) be a sub-σ-algebra. Then there
exists a regular conditional probability P(ω, A) given G.
From the point of view of mappings, we have the picture that for any t ∈ I, there
is a measurable map,
Xt : Ω → S , (3.1.1)
whose inverse maps B into F.
For this to work, we do want, of course, F to be so rich that it makes all functions
Xt , t ∈ I measurable. We denote this σ-algebra by
Sample paths.
Given a stochastic process as defined above, we can take a different perspective and
view, for each ω ∈ Ω, X(ω) as a map from I to S ,
X(ω) : I → S, t ↦ X_t(ω). (3.1.3)

X : Ω → S^I, ω ↦ X(ω), (3.1.4)
with A ∈ B, t ∈ I. Then B^I is the smallest σ-algebra such that all the maps π_t : S^I → S
that map x ↦ x_t are measurable from (S^I, B^I) → (S, B). Moreover, σ(X_t, t ∈ I) is the
smallest σ-algebra such that the map X : Ω → S^I is measurable from (Ω, σ(X_t, t ∈ I))
to (S^I, B^I).
Proof. We first show that all π_t are measurable from

B^I → B. (3.1.6)
π_t^{-1}(A) = C(A, t) ∈ B^I. (3.1.7)

Thus each π_t is measurable. On the other hand, assume that there is some t and some
A such that C(A, t) ∉ B^I. Then clearly π_t^{-1}(A) ∉ B^I, and then π_t is not measurable!
Lemma 3.2. The map X : Ω → S^I is measurable from F → B^I if and only if, for
each t, X_t is measurable from F → B.
a cylinder set, or more precisely a finite dimensional cylinder set. If B is of the form
B = ×_{t∈J} A_t, A_t ∈ B, we call such a set a special cylinder.
It is clear that B^I contains all finite dimensional cylinder sets, but of course it
contains much more. We call B^I the product σ-algebra, or the σ-algebra generated by
the cylinder sets.
It is easy to check that the special cylinders form a π-system, and the cylinders
form an algebra; both generate B^I.
Thus we see that the choice of the σ-algebra B^I is just the right one to make
the two points of view on stochastic processes equivalent from the point of view of
measurability.
Once we view X as a map from Ω to the S -valued functions on I, we can define the
probability distribution induced by P on the space (S^I, B^I),

µ_X ≡ P ◦ X^{-1}. (3.1.10)
Canonical process.
Given a stochastic process with law µ, one can of course realise this process on the
probability space (S^I, B^I, µ). In that case the random variable X is the trivial map
The viewpoint of the canonical process is, however, not terribly helpful, since more
often than not, we want to keep a much richer probability space on which many
other random objects can be defined.
The construction of a stochastic process may appear rather formidable, but we may
draw encouragement from the fact that we have introduced a rather coarse σ-algebra
on the space S^I. The most fundamental observation is that stochastic processes are
determined by their observation on just finitely many points in time. We first make
this important notion precise.
For any J ⊂ I, we will denote by π_J the canonical projection from S^I to S^J,
i.e. π_J X ∈ S^J, such that, for all t ∈ J, (π_J X)_t = X_t. Naturally, we can define the
distributions

µ_J^X ≡ P ◦ (π_J X)^{-1} (3.2.1)

on S^J.
Definition 3.4. Let F(I) denote the set of all finite, non-empty subsets of I. Then
the collection of probability measures

{µ_J^X : J ∈ F(I)} (3.2.2)

is called the family of finite dimensional distributions (marginals) of X.

µ_{J_1} = µ_{J_2} ◦ π_{J_1}^{-1}, (3.2.3)

µ ◦ π_J^{-1} = µ_J. (3.2.4)
Proof. It will not come as a surprise that we will use Carathéodory’s theorem to
prove our result. To do this, we have to construct a σ-additive set function on an
algebra that generates the σ-algebra B^I. Of course, this algebra will be the algebra
of all finite-dimensional cylinder events. It is rather easy to see what this set function
will have to be. Namely, if B is a finite dimensional cylinder, then there exists J ∈
F(I), and A_J ∈ B^J, such that B = A_J × S^{I\J} (we call in such a case J the base of the
cylinder). Then we can define

µ_0(B) = µ_J(A_J). (3.2.5)
where the consistency relations (3.2.3) were used in the last step. The usual way to
prove σ-additivity is to use the fact that an additive set-function, µ_0, is σ-additive if
and only if for any sequence G_n ↓ ∅, µ_0(G_n) ↓ 0.
Therefore, the proof will be finished once we establish the following lemma.
Lemma 3.6. Let B_n, n ∈ N, be a sequence of cylinder sets such that B_n ⊃ B_{n+1} for all
n. If there exists an ε > 0, such that for all n ∈ N, µ_0(B_n) ≥ 2ε, then lim_{n→∞} B_n ≠ ∅.
Proof. If B_n satisfies the assumptions of the lemma, then there exists a sequence
J_n ∈ F(I) and A_n ∈ B^{J_n}, such that B_n = A_n × S^{I\J_n}, J_n ⊂ J_{n+1}, and
there exists an x ∈ S^I whose projections are equal to these limits for all k and hence
x ∈ ⋂_{j=1}^k B_j for all k, hence x ∈ ⋂_{n∈N} B_n and so ⋂_{n∈N} B_n ≠ ∅. But this is the claim of
the lemma. □
Note that we have used the assumption on the space S only to ensure that the
measures µ J , for J ∈ F(I), are all inner regular. Thus we can replace the assertion of
the theorem by:
Theorem 3.7. Let S be a topological space, and let B = B(S) be its Borel σ-algebra. Let I be a set. Suppose that, for each J ∈ F(I), there exists an inner regular
probability measure, µ_J, on (S^J, B^J), such that for any J_1 ⊂ J_2 ∈ F(I),

µ_{J_1} = µ_{J_2} ◦ π_{J_1}^{-1}. (3.2.11)

Then there exists a unique probability measure, µ, on (S^I, B^I), such that for all J ∈ F(I),

µ ◦ π_J^{-1} = µ_J. (3.2.12)
Finally, one can show that the assumption that S be compact and metrisable in
Theorem 1.23 can be replaced by assuming that S be Polish.
This is due to the following extension of Theorem 1.23.
Proof. We only need to modify the proof of Theorem 1.23 slightly. Instead of show-
ing that any compact set is an element of the algebra A, we will show that any closed
set is in A. The main step here is to show that S ∈ A, since then the mass of S is well
approximated by masses of compact subsets and we are done.
² This is possible because the subsequence for k + 1 is a sub-subsequence of the one for k, due to
⋂_{j=1}^k K_j ⊃ ⋂_{j=1}^{k+1} K_j. Indeed, denote by (n_{i,k})_{i≥1} the subsequence for k. Applying the Cantor diagonal procedure,
i.e., taking (n_{k,k})_{k≥1}, provides the desired subsequence.
Denote by K_r(x) = {y ∈ S : ρ(x, y) ≤ r} the closed ball around x with radius r with
respect to a complete metric ρ on S. Then there is a countable sequence of points,
(x_n, n ∈ N), such that for any r > 0, S = ⋃_{n∈N} K_r(x_n). By σ-additivity,

lim_{k↑∞} P(⋃_{i=1}^k K_r(x_i)) = 1. (3.2.13)
In particular, for any ε > 0, there exists a sequence k_n, such that for all n ∈ N,

P(⋃_{i=1}^{k_n} K_{1/n}(x_i)) ≥ 1 − ε 2^{-n}. (3.2.14)
Clearly the finite unions of the balls are closed and so is their intersection,

K ≡ ⋂_{n∈N} ⋃_{i=1}^{k_n} K_{1/n}(x_i), (3.2.15)
As a consequence:
Corollary 3.9. The hypothesis of Theorem 3.6 holds for any probability measure
on (S, B(S)) when S is a Polish space.
Remark. Note that we have seen no need to distinguish cases according to the nature
of the set I.
Theorem 3.10. Let I be a set and let, for each t ∈ I, µ_t be a probability measure on
(S, B(S)), where S is a Polish space. Then there exists a unique probability measure,
µ, on (S^I, B^I), such that, for J ∈ F(I), A_t ∈ B, and A_J ≡ ×_{t∈J} A_t ∈ B^J,

µ(π_J^{-1}(A_J)) ≡ µ_J(A_J) = ∏_{t∈J} µ_t(A_t). (3.3.1)
Proof. Eq. (3.3.1) prescribes µ_J on the rectangles ×_{t∈J} A_t. But the rectangles are a
π-system that generates the σ-algebra B^J. By Carathéodory's theorem this fixes a
unique measure µ_J on B^J. This family is easily verified to satisfy the consistency
relations in the Kolmogorov-Daniell theorem. Hence, a unique measure µ on (S^I, B^I)
with these marginals exists. □
Remark. Note that we do not assume I to be countable. In the case when I is un-
countable, such a collection of random variables is sometimes called white noise.
This is, however, a rather unpleasant object. When we discuss seriously the issue
of stochastic processes with continuous time, we will see that we always will want
additional properties of sample paths that the theorem above does not provide.
Independent random variables are a major building block for more interesting
stochastic processes. We have already encountered sums of independent random
variables. Other interesting processes are e.g. maxima of independent random vari-
ables: If X_i, i ∈ N, are independent random variables, define

M_n = max_{1≤k≤n} X_k. (3.3.2)
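For example, independence immediately gives P(M_n ≤ x) = F(x)^n, where F is the common distribution function of the X_i. A quick Monte Carlo sketch (standard library only; uniform variables, so F(x) = x):

```python
import random

random.seed(0)

# For i.i.d. uniform [0,1] variables, P(M_n <= x) = x^n by independence.
n, x, trials = 5, 0.8, 200_000
hits = 0
for _ in range(trials):
    M = max(random.random() for _ in range(n))  # M_n = max_{1<=k<=n} X_k
    hits += (M <= x)
est = hits / trials
print(abs(est - x ** n) < 0.01)  # compare with 0.8^5 = 0.32768; → True
```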
Gaussian processes form one of the most important classes of stochastic processes that
can be defined with the help of densities. Let us proceed in two steps.
First, we consider finite dimensional Gaussian vectors. Let n ∈ N be fixed, and let
C be a real symmetric positive definite n × n matrix. We denote by C^{-1} its inverse.
Define the Gaussian density,

f_C(x_1, . . . , x_n) ≡ (1 / ((2π)^{n/2} √(det C))) exp(−½ (x, C^{-1} x)). (3.3.3)
You see that the necessity of having C positive definite derives from the fact that we want
this density to be integrable with respect to the n-dimensional Lebesgue measure.
Definition 3.11. A family of n real random variables is called jointly Gaussian with
mean zero and covariance C, if and only if their distribution is absolutely continuous
w.r.t. the Lebesgue measure on Rn with density given by fC .
Remark. In this section we always assume that Gaussian random variables have
mean zero. The corresponding expressions in the general case can be recovered by
simple computations.
The definition of Gaussian vectors is no problem. The question is, can we define
Gaussian processes? From what we have learned, it will be crucial to be able to
define compatible families of finite dimensional distributions.
The following result will be important.
and

E(e^{i(u,X)}) = e^{−(u,Cu)/2}. (3.3.6)
Proof. Item (i) follows from the first part of Item (iii), or by direct computation. (ii)
can be proven in various ways. Note that it is clear that if we denote the random
vector with components X_ℓ, ℓ ∈ J, by X_J, then, trivially,
Therefore, all we have to show to infer (ii) is that X J is jointly Gaussian. For this we
need to compute the joint marginal density of this vector. Without loss of generality,
we can assume that J = {1, . . . , m}. Let us write x = (x_J, x_R). We have to compute

∫ f_C(x_J, x_R) dx_R, (3.3.8)
where the notation should be obvious. Now it is clear that the matrix C^{-1} can be
written in block form as

C^{-1} = ( A   D
           D^t  B ), (3.3.9)

where the particular form of A, B, D does not matter. Then we can write

((x_J, x_R), C^{-1}(x_J, x_R)) = (x_J, A x_J) + (x_J, D x_R) + (x_R, D^t x_J) + (x_R, B x_R). (3.3.10)
Therefore,

∫ f_C(x_J, x_R) dx_R = const. e^{−½ (x_J, (A − D B^{-1} D^t) x_J)} ∫ e^{−½ (x_R + B^{-1} D^t x_J, B (x_R + B^{-1} D^t x_J))} dx_R
= const.′ e^{−½ (x_J, (A − D B^{-1} D^t) x_J)}, (3.3.12)
for constants that we do not care to compute. Moreover, we know that the integral
over the expression in the last line is equal to one, so the quadratic form in the
exponent must be positive definite. That shows that the marginal distribution of X_J is
Gaussian, and since we know the covariances from (i), we know that the covariance
matrix is just the restriction of C to the set J.
Item (iii) is just computations. We first compute the moment generating function,
or the Laplace transform, of our jointly Gaussian vector. We define, for u ∈ Cn ,
φ_C(u) ≡ E(e^{(u,X)}) ≡ E(e^{Σ_{i=1}^n u_i X_i}) = ∫ dx_1 · · · dx_n f_C(x_1, . . . , x_n) e^{Σ_{i=1}^n u_i x_i}. (3.3.13)
φ_C(u) = (1 / ((2π)^{n/2} √(det C))) ∫ dx exp(−½ (x, C^{-1} x) + (u, x))
= (1 / ((2π)^{n/2} √(det C))) ∫ dx exp(−½ (x − Cu, C^{-1}(x − Cu)) + ½ (u, Cu))
= (exp(½ (u, Cu)) / ((2π)^{n/2} √(det C))) ∫ dx exp(−½ (x − Cu, C^{-1}(x − Cu)))
= exp(½ (u, Cu)), (3.3.14)
where in the last line we used that the domain of integration in the integral is invariant under translation. To obtain the analogous result for the Fourier transform, i.e. Eq.
(3.3.6), use Cauchy's theorem to show that the last line in Eq. (3.3.14) is also valid
if u is a complex vector.
Now it is easy to compute the correlation functions. Clearly,
E(X_k X_ℓ) = ∂²φ_C(u)/∂u_k ∂u_ℓ |_{u=0} = C_{k,ℓ}. (3.3.15)
This establishes (i). (ii) is now quite simple. To compute the Laplace or Fourier
transform of the vector X_ℓ, ℓ ∈ J, we just need to set u_i = u_i^J for i ∈ J and u_i = 0
for i ∉ J. The result is precisely the Laplace transform of a Gaussian vector with
covariance C_J. Since the Fourier (and also the Laplace) transform determines the
distribution uniquely, (ii) follows. □
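Item (ii) can also be seen numerically. The sketch below (NumPy assumed; the positive definite covariance matrix is an arbitrary example of ours) samples a centred Gaussian vector via the Cholesky factorisation C = L L^t and checks that the empirical covariance of the subvector X_J matches the restriction C_J.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed symmetric positive definite covariance (an arbitrary example):
C = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])

# Sampling with covariance C: write C = L L^t and set X = L g with g
# standard normal; then E(X X^t) = L E(g g^t) L^t = C.
L = np.linalg.cholesky(C)
g = rng.standard_normal((3, 200_000))
X = L @ g

# Lemma 3.12 (ii): the subvector (X_0, X_2) is again Gaussian, with
# covariance the restriction C_J for J = {0, 2}.
J = [0, 2]
emp_CJ = (X[J] @ X[J].T) / X.shape[1]
print(np.allclose(emp_CJ, C[np.ix_(J, J)], atol=0.05))  # → True
```

The same construction (Cholesky factor times a standard normal vector) is the standard way to realise any finite dimensional Gaussian distribution with a given positive definite covariance.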
This result is very encouraging for the prospect of defining Gaussian processes.
If we can specify an infinite dimensional positive matrix, C, then all its finite dimensional sub-matrices, C_J, J ∈ F(N), are positive and the ensuing family of finite
dimensional distributions are Gaussian distributions that do satisfy the consistency
requirements of Kolmogorov's theorem! The result is:
Theorem 3.13. Let C : N×N → R define a positive quadratic form. Then there exists
a unique stochastic process, X, with index set N and state space R, such that, for all
finite J ⊂ N, the marginal distributions are |J|-dimensional Gaussian vectors with
covariance C_J. X is called a Gaussian process indexed by N.
Theorem 3.14. Let I be a set and let C : I × I → R define a positive quadratic form,
in the sense that for every J ∈ F(I), the matrix C_J with elements C^J_{t,s} = C(t, s), t, s ∈ J,
is positive definite. Then there exists a unique stochastic process, X, with index set I
and state space R, such that, for all finite J ⊂ I, the marginal distributions are |J|-dimensional Gaussian vectors with covariance C_J. X is called a Gaussian process
indexed by I.
Thus the trick is to construct positive quadratic forms. Of course it is easy to guess
a few by going the other way, and using independent Gaussian random variables
as building blocks. For example, consider X_n, n ∈ N, to be independent Gaussian
random variables with mean zero and variance σ_n². Set Z_n ≡ Σ_{k=1}^n X_k. Then
C_{n,m} ≡ E(Z_n Z_m) = E(Σ_{i=1}^n Σ_{j=1}^m X_i X_j) = Σ_{i=1}^{m∧n} E(X_i²) = Σ_{i=1}^{m∧n} σ_i². (3.3.16)
(u, Cu) = Σ_{n,m∈N} u_n u_m Σ_{i=1}^{m∧n} σ_i² = Σ_{i∈N} σ_i² (Σ_{m≥i} u_m)(Σ_{n≥i} u_n) (3.3.17)
= Σ_{i∈N} σ_i² (Σ_{m≥i} u_m)² ≥ 0
C(t, s) ≡ t ∧ s. (3.3.18)
What we have to check is that, for any J ∈ F(R_+), the restriction of C to a quadratic
form on R^J is positive. But indeed,

Σ_{t,s∈J} u_t u_s (t ∧ s) = Σ_{t,s∈J} u_t u_s ∫_0^{t∧s} 1 dr = ∫_0^∞ dr (Σ_{t∈J, t≥r} u_t)² ≥ 0. (3.3.19)
Thus all finite dimensional distributions exist as Gaussian vectors, and the com-
patibility conditions are trivially satisfied. Therefore there exists a Gaussian process
on R_+ with this covariance. This process is called “Brownian motion”. Note, however, that this constructs the process only in the product topology, which does not
yet yield any nice path properties. We will later see that this process can actually
be constructed on the space of continuous functions, and this object will then be more
properly called Brownian motion.
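A small numerical sketch of these finite dimensional distributions (NumPy assumed; the times are arbitrary choices): for fixed times the matrix C_ij = t_i ∧ t_j admits a Cholesky factorisation, confirming positive definiteness, and sampling with it reproduces the covariance t_i ∧ t_j.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-dimensional distribution of "Brownian motion" at a few times:
# the covariance matrix is C_ij = t_i ∧ t_j.
ts = np.array([0.5, 1.0, 2.0, 3.5])
C = np.minimum.outer(ts, ts)

# Cholesky succeeds iff C is positive definite; then B = L g has
# covariance L L^t = C.
L = np.linalg.cholesky(C)
B = L @ rng.standard_normal((len(ts), 300_000))

emp = (B @ B.T) / B.shape[1]            # empirical E(B_s B_t)
print(np.allclose(emp, C, atol=0.08))   # → True: E(B_s B_t) ≈ s ∧ t
```

This only produces the finite dimensional marginals; as noted above, the pathwise construction on continuous functions is a separate (later) matter.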
Exercise. Let X_k, k ∈ N, be independent Gaussian random variables with mean zero
and variance σ² = 1. Define, for n ∈ N, and t ∈ [0, 1],

Z_n(t) ≡ (1/√n) Σ_{k=1}^{[nt]} X_k, (3.3.20)
where [·] denotes the largest integer not exceeding its argument. Show that
(i) Z_n(t) is a stochastic process with index set [0, 1] and state space R.
(ii) Compute the covariance, C_n, of Z_n and show that for any I ∈ F([0, 1]), C_n^I → C^I,
where C(s, t) = s ∧ t.
(iii) Show that the finite dimensional distributions of the processes Zn converge, as
n → ∞, to those of the “Brownian motion” defined above.
(iv) Show that the results (i)–(iii) remain true if, instead of requiring that the X_k are
Gaussian, we just assume that their variance equals 1.
Note that to prove (iv), you need to prove the multi-dimensional analogue of the
central limit theorem. This requires, however, little more than an adaptation of the
notation from the standard CLT in dimension one.
Gaussian processes were built from independent random variables using densities.
Another important way to construct non-trivial processes uses conditional probabil-
ities. Markov processes are the most prominent examples. In the case of Markov
processes we really think of the index set, N0 or R+ , as time. The process Xt then
shall have two properties: (1) it should be causal, i.e. we want an expression for the
law of Xt given the σ-algebra Ft− ≡ σ(X s , s < t), (2) we want this law to be forgetful
of the past: if we know the position (value; we will think mostly of a Markov pro-
cess as a “particle” moving around in S ) of X at some time s < t, then the law of
X_t should be independent of the positions of X_{s′} with s′ < s. In a way, Markov processes are meant to be the stochastic analogues of deterministic evolutions (differential
equations).
To set such a process up, let us consider the (much simpler) case of discrete time,
i.e. I = N0 (we always want zero in our index set). The main building block for a
Markov chain is then the so called (one-step) transition kernel, P : N0 × S × B →
[0, 1], with the following properties:
(i) For each t ∈ N0 and x ∈ S , Pt (x, ·) is a probability measure on (S , B).
(ii) For each A ∈ B, and t ∈ N0 , Pt (·, A) is a B-measurable function on S .
In the sequel we denote by Ft = σ(X s , s ≤ t) ⊂ F the sigma algebra generated by
the process up to time t.
Definition 3.15. A stochastic process X with state space S and index set N_0 is a
discrete time Markov process with transition kernel P if, for all A ∈ B and t ∈ N,
P(X_t ∈ A | F_{t−1}) = P_{t−1}(X_{t−1}, A), a.s.
The remarkable thing is that this requirement fixes the law P up to one more proba-
bility measure on (S , B), the so-called initial distribution, P0 .
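A simulation sketch of this definition (standard library only; the two-state kernel and initial distribution are hypothetical choices of ours): drawing X_t from P(X_{t−1}, ·) step by step realises the Markov property, and P_0 together with the kernel indeed fixes the finite dimensional distributions, e.g. P(X_0 = 0, X_1 = 0, X_2 = 1) = P_0(0) P(0, {0}) P(0, {1}).

```python
import random

random.seed(42)

# A time-homogeneous two-state kernel on S = {0, 1}: row x is P(x, .).
P = [[0.9, 0.1],
     [0.4, 0.6]]
P0 = [0.5, 0.5]  # initial distribution

def sample_path(n):
    # X_0 ~ P0, then X_t ~ P(X_{t-1}, .): exactly the Markov property.
    x = random.choices([0, 1], weights=P0)[0]
    path = [x]
    for _ in range(n):
        x = random.choices([0, 1], weights=P[x])[0]
        path.append(x)
    return path

# The law of (X_0, X_1, X_2) is fixed by P0 and P:
target = P0[0] * P[0][0] * P[0][1]      # P(X_0=0, X_1=0, X_2=1) = 0.045
trials = 200_000
hits = sum(sample_path(2) == [0, 0, 1] for _ in range(trials))
freq = hits / trials
print(abs(freq - target) < 0.005)  # → True
```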
Proof. In view of the Kolmogorov-Daniell theorem, we have to show that our re-
quirements fix all finite dimensional distributions, and that these satisfy the compat-
ibility conditions. This is more a problem of notation than anything else. We will
need to be able to derive formulas for
P(X_{t_n} ∈ A_n, . . . , X_{t_1} ∈ A_1). (3.3.23)
To get started, we consider P (Xt ∈ A|F s ), for s < t. To do this, we use that by the
elementary properties of conditional expectations (we drop the a.s. that applies to
all equations relating to conditional expectations).
P(X_t ∈ A | F_s) = E( ∫ P_{t−1}(x_{t−1}, A) P_{t−2}(x_{t−2}, dx_{t−1}) · · · P_{s+1}(x_{s+1}, dx_{s+2}) P_s(X_s(ω), dx_{s+1}) | F_s )
= ∫ P_{t−1}(x_{t−1}, A) P_{t−2}(x_{t−2}, dx_{t−1}) · · · P_s(X_s(ω), dx_{s+1}) ≡ P_{s,t}(X_s(ω), A), (3.3.24)

and we call P_{s,t} the transition kernel from time s to time t. With this object defined, we
can now proceed to more complicated expressions:
P(X_{t_n} ∈ A_n, . . . , X_{t_1} ∈ A_1)
= E[ P(X_{t_n} ∈ A_n | F_{t_{n−1}}) 1_{A_{n−1}}(X_{t_{n−1}}) · · · 1_{A_1}(X_{t_1}) ]
Thus, we have the desired expression of the marginal distributions in terms of the
transition kernel P and the initial distribution P0 . The compatibility relations follow
from the following obvious, but important property of the transition kernels.
Lemma 3.17. The transition kernels P_{s,t} satisfy the Chapman-Kolmogorov equations: for s < r < t,

P_{s,t}(x, A) = ∫ P_{r,t}(y, A) P_{s,r}(x, dy). (3.3.27)
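On a finite state space the kernels are stochastic matrices, and the Chapman-Kolmogorov equations become matrix multiplication. A quick sketch (NumPy assumed; the one-step kernels are arbitrary examples):

```python
import numpy as np

# Three made-up time-inhomogeneous one-step kernels P_t on a two-point
# state space (each row is a probability measure):
P_steps = [np.array([[0.7, 0.3], [0.2, 0.8]]),
           np.array([[0.5, 0.5], [0.1, 0.9]]),
           np.array([[0.6, 0.4], [0.3, 0.7]])]

def P_st(s, t):
    # Transition kernel from time s to time t: the product of the
    # one-step kernels, as built up in (3.3.24).
    M = np.eye(2)
    for u in range(s, t):
        M = M @ P_steps[u]
    return M

# Chapman-Kolmogorov (3.3.27): P_{s,t} = P_{s,r} P_{r,t} for s < r < t,
# i.e. summing over the intermediate position y at time r.
s, r, t = 0, 1, 3
print(np.allclose(P_st(s, t), P_st(s, r) @ P_st(r, t)))  # → True
```

Here the identity is simply associativity of matrix multiplication, which is the finite-state shadow of the integral identity (3.3.27).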
Using this function, we will construct a family of probability kernels, µΛ , that have
the following properties:
(i) For each y ∈ S^{Z^d}, µ_Λ(·, y) is a probability measure on S^{Z^d};
(ii) For each A ∈ B^{Z^d}, µ_Λ(A, ·) is an F_{Λ^c}-measurable function, where F_{Λ^c} = σ(X_i, i ∈ Λ^c);
(iii) For any pair of volumes, Λ, Λ′, with Λ ⊂ Λ′, and any A ∈ B^{Z^d},

∫ µ_Λ(z, A) µ_{Λ′}(x, dz) = µ_{Λ′}(x, A). (3.3.29)
It is easily checked that this expression indeed defines a kernel with properties (i)
and (ii). An expression of this type is called a local Gibbs specification.
Now we see that the properties of these kernels are reminiscent of those of regular
conditional probabilities.
One defines the notion of a Gibbs measure as follows:
Definition 3.18. A probability measure µ on S^{Z^d} is called a Gibbs measure, if and
only if, for any finite Λ ⊂ Z^d, the kernel µ_Λ is a regular conditional probability for µ
given F_{Λ^c}.
More specifically, if the kernel is the Gibbs specification (3.3.30), it will be called
a Gibbs measure for the d-dimensional Ising model at temperature β^{-1}.
One can prove that such Gibbs measures exist; for this one shows that any ac-
cumulation point of a sequence µΛn (·, x), where Λn ↑ Zd is any increasing sequence
of volumes that converges to Zd (in the sense, that, for any finite Λ, there exists
n0 , such that, for all n ≥ n0 , Λ ⊂ Λn ), will be a Gibbs measure. This is relatively
straightforward, by writing equation (3.3.29) for a sequence of volumes Λn ↑ Zd :
∫ µ_Λ(z, A) µ_{Λ_n}(x, dz) = µ_{Λ_n}(x, A). (3.3.31)
If µ_{Λ_n} converges weakly to some measure µ, then the right-hand side converges to
µ(A). The left-hand side will converge to ∫ µ_Λ(z, A) µ(x, dz), since one can easily see
that µΛ (z, A) is a continuous function, if A is a cylinder event (in fact, in our example,
it is a local function on a discrete space). But then µ satisfies the desired properties
of a Gibbs measure.
The existence of accumulation points is then guaranteed by the fact that S^{Z^d} is
compact (Tychonov, since S = {−1, 1} is compact), and that the set of probability
measures over a compact space is compact. What makes this setting interesting is
that there is no general uniqueness result. In fact, if d ≥ 2, and β > βc , for a certain βc ,
then it is known that there exists more than one Gibbs measure. This mathematical
fact is connected to the physical phenomenon of a so-called phase transition, and
this is what makes the study of Gibbs measures so interesting. For deeper material
on Gibbs measures see [3, 12, 6].
Chapter 4
Martingales
4.1 Definitions
In this chapter we will henceforth always assume that we are given a filtered
space. Also, all stochastic processes are assumed to have state space R.
Filtrations and stochastic processes are closely linked. We will see that this goes
in two ways.
Definition 4.2. A stochastic process, {Xn , n ∈ N0 }, is called adapted to the filtration
{Fn , n ∈ N0 }, if, for every n, Xn is Fn -measurable.
Now the other direction:
Definition 4.3. Let {Xn , n ∈ N0 } be a stochastic process on (Ω, F, P). The natural
filtration, {Wn , n ∈ N0 } with respect to X is the smallest filtration such that X is
adapted to it, that is,
Wn = σ(X0 , . . . , Xn ). (4.1.2)
We see that the basic idea of the natural filtration is that functions of the process
that are measurable with respect to Wn depend only on the observations of the
process up to time n.
We now define martingales.
Definition 4.4. A stochastic process, X, on a filtered space is called a martingale, if
and only if the following hold:
(i) The process X is adapted to the filtration {Fn , n ∈ N0 };
(ii) For all n ∈ N0 , E(|Xn |) < ∞;
(iii) For all n ∈ N,
E(Xn |Fn−1 ) = Xn−1 , a.s.. (4.1.3)
If (i) and (ii) hold, but instead of (iii) it holds that E(X_n|F_{n−1}) ≥ X_{n−1}, respectively
E(X_n|F_{n−1}) ≤ X_{n−1}, then the process X is called a sub-martingale, respectively a
super-martingale.
In particular, for a martingale E(Xn ) = E(Xn−1 ), for a sub-martingale E(Xn ) ≥
E(Xn−1 ), finally, for a super-martingale E(Xn ) ≤ E(Xn−1 ).
It is clear that the property (iii) is what makes martingales special: intuitively, it
means that the best guess for what Xn could be, knowing what happened up to time
n − 1 is simply Xn−1 . No prediction on the direction of change is possible.
We will now head for the fundamental theorem concerning the impossibility of
winning systems in games built on martingales.
To put us into the gambling mood, we think of the increments of the process,
Yn ≡ Xn − Xn−1 , as the result of (not necessarily independent) games (Examples: (i)
Coin tosses, or (ii) the daily increase of the price of a stock). We are allowed to bet
on the outcome in the following way: at each moment in time, k − 1, we choose a
number Ck ∈ R.
Then our wealth will increase by the amount C_k Y_k on the k-th day, i.e. the wealth
process, W_n, is given by W_n = Σ_{k=1}^n C_k Y_k. (Examples: (i) in the coin-toss case, choosing
C_n > 0 means to bet the amount C_n on heads (= {Y_n = +1}), and C_n < 0 means to bet
the amount −C_n on tails (= {Y_n = −1}); (ii) in the stock case, C_n represents
the number of shares an investor decides to hold from time n − 1 up to time n (here
negative values can be realised by short-selling).)
The choice of the Ck is done knowing the process up to time k − 1. This justifies
the following definition.
Definition 4.5. A stochastic process {C_n, n ∈ N} is called previsible¹, if, for all n ∈ N,
C_n is F_{n−1}-measurable.
Given an adapted stochastic process, X and a previsible process C, we can define
the wealth process
W_n ≡ Σ_{k=1}^n C_k (X_k − X_{k−1}) ≡ (C • X)_n. (4.1.4)
Definition 4.6. The process C • X is called the martingale transform of X by C or
the discrete stochastic integral of C with respect to X.
Now we can formulate the general “no-system” theorem for martingales:
Theorem 4.7. Let (Ω, F, P, {Fn , n ∈ N}) be a filtered probability space.
(i) Let C be a non-negative bounded previsible process, i.e. there exists K <
∞ such that, for all n and all ω ∈ Ω, |C_n(ω)| ≤ K. Let X be a super-martingale.
Then C • X is a super-martingale that vanishes at n = 0.
(ii) Let C be a bounded previsible process (boundedness as above) and X be a
martingale. Then C • X is a martingale that vanishes at zero.
(iii) Both in (i) and (ii), the condition of boundedness can be replaced by Cn ∈ L2 ,
if also Xn ∈ L2 .
Remark. In terms of gambling, (i) says that, if the underlying process has a tendency
to fall, then playing against the trend (“investing in a falling stock”) leads to a wealth
process that tends to fall. On the other hand, (ii) says that, if the underlying process
X is a martingale, then no matter what strategy you use, the wealth process has mean
zero.
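A Monte Carlo illustration of (ii) (standard library only; the capped "double after a loss" staking rule is our own example): the stake depends only on past outcomes, so C is previsible and bounded, and the transformed wealth has mean zero no matter how aggressive the strategy looks.

```python
import random

random.seed(7)

# X_n = Y_1 + ... + Y_n with fair coin tosses Y_k = ±1 is a martingale.
# Previsible strategy: double the stake after each loss (capped at 16),
# reset to 1 after a win; C_k depends only on Y_1,...,Y_{k-1}.
def wealth(n):
    stake, W = 1, 0
    for _ in range(n):
        Y = random.choice([-1, 1])
        W += stake * Y                            # increment C_k Y_k of (C • X)_n
        stake = min(2 * stake, 16) if Y < 0 else 1
    return W

# Theorem 4.7 (ii): C • X is again a martingale, so E(W_n) = 0.
trials = 200_000
mean_W = sum(wealth(10) for _ in range(trials)) / trials
print(abs(mean_W) < 0.2)  # → True (E(W_10) = 0 up to Monte Carlo error)
```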
Proof. (i) and (ii). Integrability is trivial to check. We also have that W_n − W_{n−1} =
C_n(X_n − X_{n−1}). Then

E(W_n − W_{n−1}|F_{n−1}) = C_n^k E(X_n − X_{n−1}|F_{n−1}) + E((C_n − C_n^k)(X_n − X_{n−1})|F_{n−1}). (4.1.6)

Again by Cauchy-Schwarz, the second term tends to zero as k ↑ ∞, while the first
tends to C_n E(X_n − X_{n−1}|F_{n−1}), almost surely. □
¹ The terminology previsible refers to the fact that C_n can be foreseen from the information available at time n − 1.
Some examples.
X_n ≡ E(X|F_n). (4.1.8)
In particular, sums of independent random variables with mean zero are martingales.
Consider an interval [a, b]. We want to count the number of times a process crosses
this interval from below.
Definition 4.8. Let a < b ∈ R and let X_s be a stochastic process with values in R. We
say that an upcrossing of [a, b] occurs between times s and t, if
(i) X_s < a, X_t > b,
(ii) for all r such that s < r < t, X_r ∈ [a, b].
We denote by U N (X, [a, b])(ω) the number of uprossings in the time interval
[0, N].
We consider an (obviously) previsible process constructed as follows:
This process represents a “winning” strategy: wait until the process (say, price of
....) drops below a. Buy the stock, and hold it until its price exceeds b; sell, wait
until the price drops below a, and so on. Our wealth process is W = C • X.
Now each time there is an upcrossing of [a, b] we win at least (b − a). Thus, at time N, we have
W_N ≥ (b − a) U_N(X, [a, b]) − (X_N − a)^−,
where the last term is the maximal loss that we could have incurred if we are still invested at time N and the price is below a.
Naive intuition would suggest that in the long run, the first term must win. Our
theorem above says that this is false, if we are in a fair or disadvantageous game
(that is, in practice, always).
[Figure: the previsible process C: C_n = 0 ("wait") while X_{n−1} ≥ a; C_n = 1 ("buy") once X_{n−1} < a, held while X_{n−1} ≤ b; C_n = 0 again ("sell") once X_{n−1} > b. After the last buy before time N, the open position shows the maximal possible loss when X_N lies below a.]
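The upcrossing count and the pathwise wealth bound just discussed can be checked numerically. In the sketch below, the simple random walk, the interval [−2, 2], and the bookkeeping are illustrative choices of mine, not part of the definition.

```python
import random

def upcrossings_and_wealth(X, a, b):
    """Count completed upcrossings of [a, b] and the wealth of the
    buy-below-a / sell-above-b strategy described in the text."""
    holding, U, W = False, 0, 0.0
    for k in range(1, len(X)):
        if holding:                    # C_k decided from X_{k-1}: previsible
            W += X[k] - X[k - 1]
        if not holding and X[k] < a:   # buy
            holding = True
        elif holding and X[k] > b:     # sell: one completed upcrossing
            holding, U = False, U + 1
    return U, W

random.seed(1)
X = [0]
for _ in range(10000):
    X.append(X[-1] + random.choice((-1, 1)))
a, b = -2, 2
U, W = upcrossings_and_wealth(X, a, b)
# pathwise bound from the text: W_N >= (b - a) U_N - (X_N - a)^-
assert W >= (b - a) * U - max(a - X[-1], 0)
print(U, W)
```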
Definition 4.10. We say that a stochastic process X with discrete time and values in a Banach space is L^p-bounded, if
sup_{n∈N_0} E(‖X_n‖^p) < ∞.
Remark. Note that this requirement to be L^p-bounded is strictly stronger than just asking that, for all n, E(‖X_n‖^p) < ∞.
Proof. Exercise! ⊔⊓
Proof. Define
Λ_{a,b} ≡ {ω : lim inf_{n→∞} X_n(ω) < a < b < lim sup_{n→∞} X_n(ω)}.
But
Λa,b ⊂ {ω : U∞ (X, [a, b])(ω) = ∞}. (4.2.7)
Therefore, by Corollary 4.11, P(Λ_{a,b}) = 0, and thus also P(∪_{a<b∈Q} Λ_{a,b}) = 0, since countable unions of null-sets are null-sets.
Thus the limit of Xn exists in [−∞, ∞] with probability one. It remains to show
that it is finite. To do this, we use Fatou’s lemma:
E(|X_∞|) = E(lim inf_{n→∞} |X_n|) ≤ lim inf_{n→∞} E(|X_n|) ≤ sup_{n∈N_0} E(|X_n|) < ∞. (4.2.8)
Theorem. Let X be an absolutely integrable random variable. Then the family {E(X|G) : G ⊂ F a sub-σ-algebra} is uniformly integrable.
Proof. Since X is absolutely integrable, for any ε > 0, we can find δ > 0 such that, if F ∈ F with P(F) < δ, then E(|X|1_F) < ε. Let such ε and δ be given. Choose K such that K^{−1}E(|X|) < δ. Let now G ⊂ F be a σ-algebra, and let Y be a version of E(X|G). Then Jensen’s inequality for conditional expectations implies that
By Chebyshev’s inequality we have KP(|Y| > K) ≤ E(|Y|) ≤ E(|X|). Thus P(|Y| > K) < δ. On the other hand, since the event {|Y| > K} ∈ G, we can argue that
where in the last step we have set F = {|Y| > K}. This is the uniform integrability property we wanted to prove. ⊔⊓
Theorem 4.15 (Lévy’s upward theorem). Let ξ be an absolutely integrable ran-
dom variable on a filtered probability space (Ω, F, P, {Fn , n ∈ N0 }). Define Xn ≡
E(ξ|Fn ), a.s.. Then Xn is a uniformly integrable martingale and
Xn → X∞ = E(ξ|F∞ ), (4.2.14)
and so
E[1F Xn ] = lim E[1F Xm ] = E[1F X∞ ] (4.2.17)
m↑∞
since X_m converges in L^1. Thus E[1_F E(ξ|F_∞)] = E(1_F X_∞) for any F in the π-system ∪_{n∈N_0} F_n that generates the σ-algebra F_∞. Using Lebesgue’s dominated convergence theorem, one can verify easily that the class of sets for which the equality holds is also a λ-system. Therefore, by Dynkin’s theorem, the equality holds for the σ-algebra generated by the π-system, that is for F_∞. But this means that E(ξ|F_∞) = X_∞, almost surely. ⊔⊓
Now η is Tn -measurable for each n and hence independent of Fn . Thus, for any n
and so η = P(F), a.s.. But η takes only the values 0 and 1, being an indicator function.
Thus P(F) ∈ {0, 1}, proving the theorem. u t
The next theorem concerns filtrations extending to the infinite past. It is called the Lévy–Doob downward theorem; it is in some sense an inverted version of the upward theorem.
for m ≥ n. Assume that sup_{n≥1} E(X_{−n}) < ∞. Then the process X is uniformly integrable and the limit
X_{−∞} = lim_{n→∞} X_{−n} (4.2.22)
exists almost surely and in L^1.
Remark. Note that the limit we are considering here is really quite different from the one in the previous convergence theorems. We are looking backward in time: as n tends to infinity, X_{−n} is measurable with respect to smaller and smaller σ-algebras, contrary to the usual X_n, which depend on more and more information. Therefore, while a convergent martingale X_n is typically a genuine random variable (it can converge to a constant only if the entire sequence is constant), a convergent X_{−n} has a much better chance to converge to a deterministic constant. We will see shortly how this can be used to prove results like the strong law of large numbers.
Proof. The condition sup_{n∈N} E(X_{−n}) < ∞ and the super-martingale property imply that ∞ > E(X_{−∞}) ≥ E(X_{−1}) > −∞. In particular, the sequence E(X_{−n}) is increasing and bounded, and so the limit lim_{n↑∞} E(X_{−n}) exists and is finite. Thus, for any ε > 0, there is k ∈ N, such that
E(|X_{−n}|1_{|X_{−n}|>λ}) ≤ −E(X_{−k}1_{X_{−n}<−λ}) + E(X_{−k}) − E(X_{−k}1_{X_{−n}≤λ}) + ε/2
= −E(X_{−k}1_{X_{−n}<−λ}) + E(X_{−k}1_{X_{−n}>λ}) + ε/2
≤ E(|X_{−k}|1_{|X_{−n}|>λ}) + ε/2. (4.2.26)
Since X_{−k} is absolutely integrable, there exists δ > 0 such that, for all F with
P(F) < δ: E(|X_{−k}|1_F) < ε/2. (4.2.27)
But P(|X_{−n}| > λ) ≤ λ^{−1}E(|X_{−n}|). To control E(|X_{−n}|), let us set X^− ≡ max(−X, 0), and write
E(|X_{−n}|) = E(X_{−n}) + 2E(X^−_{−n}). (4.2.28)
But X − is a sub-martingale, and so
(for the second we just use the integrability for the finitely many values of i; for the first we use the uniform bound (4.2.29)). Then the first inequalities imply that E(|X_{−n}|1_{|X_{−n}|>K}) ≤ ε for n ≥ k, via (4.2.26) and the implication (4.2.27). This proves the uniform integrability. But uniform integrability trivially implies boundedness in L^1, and then the upcrossing lemma implies a.s. convergence. By uniform integrability, convergence in L^1 follows. Equation (4.2.23) follows from the convergence of X_m to X_{−∞}. ⊔⊓
n−1 S n → µ, (4.2.31)
a.s. and in L1 .
The reason for these equalities is simply that knowing something about the sums S_n, S_{n+1}, etc. affects the expectation of the X_k, k ≤ n, all in the same way: we could simply re-label the first indices without changing anything. Then, by linearity,
E(X_1|G_{−n}) = (n − 1)^{−1} E(S_{n−1}|G_{−n}) = n^{−1} E(S_n|G_{−n}) = n^{−1} S_n, a.s., (4.2.33)
where we used the fact that S n is G−n measurable. Thus, L−n ≡ n−1 S n is a mar-
tingale with respect to the filtration {G−n , n ∈ N}. Thus, by the preceding theorem
L ≡ limn→∞ L−n exists a.s. and in L1 .
But clearly we also have, for any finite k, that L = limn→∞ n−1 (Xk+1 + · · · + Xn+k ),
which means that L is measurable with respect to Tk , for any k. Now Kolmogorov’s
zero-one law implies that, for any c, P(L ≤ c) ∈ {0, 1}. Since as a function of c this is
monotone and right-continuous, there must be exactly one c0 , such that P(L ≤ c) = 1
for all c ≥ c0 and P(L ≤ c) = 0 for all c < c0 , and hence P(L = c0 ) = 1. Then E(L) = c0 .
But E(L−n ) = µ, for all n, so c0 = µ. u t
4.3 Inequalities
In this section we derive some fundamental inequalities for martingales. One of the
most useful ones is the following maximum inequality.
Remark. You may recall a similar result for sums of iid random variables as Kol-
mogorov’s inequality. The estimate is extremely powerful, since it gives the same
estimate for the probability of the maximum to exceed c as Chebyshev’s inequality
would give for just the endpoint!
and set
F ≡ {max_{k≤n} Z_k ≥ c} = ∪_{k=0}^n F_k. (4.3.3)
for all k ≤ n. Here the first inequality used of course the sub-martingale property of Z. Thus
E(Z_n 1_F) = ∑_{k=0}^n E(Z_n 1_{F_k}) ≥ c ∑_{k=0}^n P(F_k) = cP(F). (4.3.5)
This allows us to obtain many useful inequalities from that of Theorem 4.19! In particular, Kolmogorov’s inequality follows by choosing f(x) = x². Another useful choice is the exponential function, f(x) = exp(λx), for λ > 0.
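A quick simulation makes the strength of the maximum inequality concrete: applying it to the sub-martingale S_n² gives Kolmogorov’s bound P(max_{k≤n}|S_k| ≥ c) ≤ E(S_n²)/c². The simple ±1 walk and the parameters below are illustrative choices.

```python
import random

random.seed(2)
n, trials, c = 100, 5000, 25.0
exceed, second_moment = 0, 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))          # running maximum of |S_k|
    exceed += (m >= c)
    second_moment += s * s          # endpoint S_n^2 only
# Kolmogorov / Doob: P(max_{k<=n} |S_k| >= c) <= E(S_n^2) / c^2
empirical, bound = exceed / trials, (second_moment / trials) / c ** 2
assert empirical <= bound
print(empirical, bound)
```

Note that the bound on the right is computed from the endpoint alone, exactly as Chebyshev would give for P(|S_n| ≥ c), yet it controls the whole running maximum.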
Our next target is Doob’s L p inequality. The next lemma is a first step in this
direction.
Lemma 4.21. Let X and Y be non-negative random variables such that, for all c > 0,
cP(X ≥ c) ≤ E(Y 1_{X≥c}).
Then, for p > 1 and q^{−1} = 1 − p^{−1},
‖X‖_p ≤ q‖Y‖_p.
Starting from the right-hand side, we can perform the same calculation, and derive
that
R = qE(X p−1 Y) ≤ qkYk p kX p−1 kq , (4.3.10)
where the second inequality is just Hölder’s inequality. Then
Theorem 4.22 (Doob’s L^p-inequality). Let p > 1 and q^{−1} = 1 − p^{−1}. Let Z be a non-negative sub-martingale bounded in L^p, and define
Z* ≡ sup_{k∈N_0} Z_k. (4.3.14)
Then Z* ∈ L^p, and
‖Z*‖_p ≤ q sup_{n∈N_0} ‖Z_n‖_p. (4.3.15)
Proof. Define Zn∗ ≡ supk≤n Zk . Theorem 4.19 implies that the random variables Zn∗
and Zn satisfy the hypothesis on X and Y in Lemma 4.21. Therefore,
One of the games when dealing with stochastic processes is to “extract the martin-
gale part”. There are several such decompositions, but the following Doob decom-
position is very important and its continuous time analogue will be fundamental for
the theory of stochastic integration.
Theorem 4.23 (Doob decomposition).
(i) Let {X_n, n ∈ N_0} be an adapted process on a filtered space (Ω, F, P, {F_n, n ∈ N_0}) with X_n ∈ L^1 for all n. Then X can be written in the form²
X = X_0 + M + A, (4.4.1)
where M is a martingale null at zero and A is a previsible process null at zero.
from (4.4.4). u t
An immediate application of the decomposition theorem is a maximum inequal-
ity without positivity assumption.
2 To make sure that there is no confusion about notation: the following equation is to be understood
in the sense that X0 = X0 , and for n ≥ 1, Xn = X0 + Mn + An .
X = X0 + M + A (4.4.7)
sup_{k≤n} |X_k| ≤ |X_0| + sup_{k≤n} |M_k| + sup_{k≤n} A_k = |X_0| + sup_{k≤n} |M_k| + A_n. (4.4.8)
Note that |M| is a non-negative sub-martingale, so for the supremum of |M_k| we can use Theorem 4.19. We use the simple observation that, if x + y + z > 3c, then at least one of x, y, z must exceed c. Thus,
cP(sup_{k≤n} |X_k| ≥ 3c) ≤ cP(|X_0| ≥ c) + cP(sup_{k≤n} |M_k| ≥ c) + cP(A_n ≥ c)
≤ E(|X_0|) + E(|M_n|) + E(A_n). (4.4.9)
Now
E(|Mn |) = E(|Xn − X0 − An |) ≤ E(|Xn |) + E(|X0 |) + E(An ) (4.4.10)
and
E(An ) = E(Xn − X0 − Mn ) = E(Xn − X0 ) ≤ E(|Xn |) + E(|X0 |). (4.4.11)
Inserting these two bounds into (4.4.9) gives the claimed inequality. ⊔⊓
The Doob decomposition gives rise to two important derived processes associ-
ated to a martingale, M, the bracket, hMi, and [M].
Definition 4.25. Let M be a martingale in L2 with M0 = 0. Then M 2 is a sub-
martingale with Doob decomposition
M 2 = N + hMi, (4.4.12)
where N is a martingale that vanishes at zero and hMi is a previsible process that
vanishes at zero. The process hMi is called the bracket of M.
Note that boundedness in L1 of hMi is equivalent to boundedness in L2 of M.
From the formulas associated with the Doob decomposition, we derive that
Proof. Exercise! ⊔⊓
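The bracket is easy to see in the simplest example. For the ±1 simple random walk M, the conditional increment variance E((M_k − M_{k−1})²|F_{k−1}) equals 1 at every step, so ⟨M⟩_n = n and M_n² − n is a martingale; in particular E(M_n²) = E(⟨M⟩_n) = n. A short simulation confirms this (the walk is an illustrative choice):

```python
import random

random.seed(3)
n, trials = 60, 20000
mean_sq = 0.0
for _ in range(trials):
    m = sum(random.choice((-1, 1)) for _ in range(n))
    mean_sq += m * m
# For the simple walk, E((M_k - M_{k-1})^2 | F_{k-1}) = 1, so <M>_n = n and
# N_n = M_n^2 - n is a martingale; hence E(M_n^2) = E(<M>_n) = n.
assert abs(mean_sq / trials - n) < 3.0
print(mean_sq / trials)
```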
We will now give some justification of the name “discrete stochastic integral” for the martingale transform. We consider a martingale M null at zero and a function F : R → R. We want to consider the process F(M_T) and ask whether we can represent F(M_T) − F(M_0) as a “stochastic integral”. Since we have called C • M a stochastic integral, we might expect that this formula could simply read F(M_T) = (F′ • M)_T + F(M_0), as in the usual fundamental theorem of calculus, but this will turn out not to be the case in general.
Let us consider the situation when the increments of M_t are getting very small; the idea here is that the spacings between consecutive times are really small. So we introduce a parameter ε > 0 that will later tend to zero, while we think of T = ε^{−1}C. We also assume that E(M_t − M_{t−1})² = O(ε). To see why this may be reasonable, think of M_t ≡ B_{t/T} with B a Brownian motion, where E(B_{t/T} − B_{(t−1)/T})² = 1/T = ε. Assuming that F is a smooth function, we can expand F(M_t) in a Taylor series:
This expression looks almost like the Doob decomposition of the process F(Mt ),
except that the last term is not exactly predictable. In fact, from the Doob decompo-
However, under reasonable assumptions (on F and on the behavior of the increments
of the martingale M), the martingale
Δ_T ≡ ∑_{t=1}^T F″(M_{t−1})[(M_t − M_{t−1})² − E((M_t − M_{t−1})²|F_{t−1})] (4.5.5)
satisfies E(∆2T ) = O(), and is therefore negligible in our approximation. This implies
the discrete version of Itô’s formula:
F(M_T) = F(M_0) + ∑_{t=1}^T F′(M_{t−1})(M_t − M_{t−1}) (4.5.6)
+ (1/2) ∑_{t=1}^T F″(M_{t−1}) E[(M_t − M_{t−1})²|F_{t−1}] + O(ε^{1/2}).
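For F(x) = x² and ±1 increments, the discrete Itô formula is not merely an approximation but an exact algebraic identity: F′(x) = 2x, F″(x) = 2, and E((M_t − M_{t−1})²|F_{t−1}) = 1, so M_T² = ∑ 2M_{t−1}ΔM_t + T with no remainder. The sketch below (the random walk is an illustrative choice) verifies this on a sample path:

```python
import random

random.seed(4)
T = 1000
M = [0]
for _ in range(T):
    M.append(M[-1] + random.choice((-1, 1)))

# F(x) = x^2, F'(x) = 2x, F''(x) = 2; for +-1 steps the conditional
# quadratic variation E((M_t - M_{t-1})^2 | F_{t-1}) equals 1, and the
# discrete Ito formula holds exactly, with no error term:
stoch_int = sum(2 * M[t - 1] * (M[t] - M[t - 1]) for t in range(1, T + 1))
compensator = sum(0.5 * 2 * 1 for t in range(1, T + 1))
assert M[T] ** 2 == stoch_int + compensator
print(stoch_int, compensator)
```

Note the correction term T: the naive fundamental-theorem formula F(M_T) = (F′ • M)_T would miss it entirely.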
One important further result for martingales concerns central limit theorems. There
are various different formulations of such theorems. We will present one which em-
phasises the rôle of the bracket.
s_n^{−2} ∑_{k=1}^n E[(M_k − M_{k−1})² 1_{|M_k−M_{k−1}|>εs_n} | F_{k−1}] ↓ 0, a.s. (4.6.1)
Then
s_n^{−1} M_n → N(0, 1) (4.6.2)
in distribution.
Remark. Condition (4.6.1) is called the conditional Lindeberg condition. In the case when M_n = S_n = ∑_{i=1}^n X_i with independent centered random variables X_i, (4.6.1) reduces to the usual Lindeberg condition
s_n^{−2} ∑_{k=1}^n E[X_k² 1_{|X_k|>εs_n}] ↓ 0. (4.6.3)
Moreover, in this case E([M]n ) = hMin , and so the condition hMin /s2n → 1 is trivially
verified (it is equal to one for all n). Thus the above theorem implies the usual CLT
for sums of independent random variables under the weakest possible conditions.
Interestingly, the conditions for the CLT for the martingale include a law of large
numbers for the bracket of the martingale. This is worth keeping in mind.
Proof. To simplify notation we set (for n fixed) M̃_k ≡ M_k/s_n. (We should really write M̃_k^n to indicate that this quantity depends explicitly on n; the same is true for all other objects carrying a tilde.) Then the assumptions of the theorem read:
as n → ∞. We have to prove that M̃_n → N(0, 1). This holds if and only if, for all u ∈ R,
lim_{n→∞} E(e^{iuM̃_n}) = e^{−u²/2}. (4.6.5)
Things are a little tricky, and the following decomposition is quite helpful:
|E(e^{iuM̃_n}) − e^{−u²/2}|
= |E[e^{iuM̃_n}(1 − e^{(u²/2)⟨M̃⟩_n}e^{−u²/2})] + e^{−u²/2} E[e^{iuM̃_n}e^{(u²/2)⟨M̃⟩_n} − 1]|
≤ E|1 − e^{(u²/2)⟨M̃⟩_n}e^{−u²/2}| + |E(e^{iuM̃_n}e^{(u²/2)⟨M̃⟩_n} − 1)| (4.6.7)
≤ E|1 − e^{(u²/2)⟨M̃⟩_n}e^{−u²/2}| + ∑_{k=1}^n |E(e^{iuM̃_k}e^{(u²/2)⟨M̃⟩_k} − e^{iuM̃_{k−1}}e^{(u²/2)⟨M̃⟩_{k−1}})|.
Let us first prove (4.6.5) under the additional assumption that
⟨M̃⟩_n ≤ C (4.6.8)
for some finite constant C. In a second step we will show how to remove this as-
sumption. First, notice that the assumption that h M̃in → 1 in probability implies
that
E|1 − e^{(u²/2)⟨M̃⟩_n}e^{−u²/2}| → 0, as n → ∞. (4.6.9)
Thus we need to deal with the second term in (4.6.7). Using (4.6.6), we get
E(e^{iuM̃_k}e^{(u²/2)⟨M̃⟩_k} − e^{iuM̃_{k−1}}e^{(u²/2)⟨M̃⟩_{k−1}})
= E[e^{iuM̃_{k−1}}e^{(u²/2)⟨M̃⟩_{k−1}}(e^{iuX̃_k + (u²/2)E(X̃_k²|F_{k−1})} − 1)] (4.6.10)
= E[e^{iuM̃_{k−1}}e^{(u²/2)⟨M̃⟩_{k−1}} E(e^{iuX̃_k + (u²/2)E(X̃_k²|F_{k−1})} − 1 | F_{k−1})],
where X̃_k ≡ M̃_k − M̃_{k−1}.
To simplify the notation, set σ_k² ≡ E(X̃_k²|F_{k−1}). To bound E(e^{iuX̃_k + (u²/2)σ_k²} − 1 | F_{k−1}), we use the following elementary estimates:
Since σ_k is F_{k−1}-measurable, the second bracket can be taken out of the conditional expectation. Also, E(X̃_k|F_{k−1}) = 0, since M̃ is a martingale, and E(X̃_k²|F_{k−1}) = σ_k², so that
E(e^{iuX̃_k + (u²/2)σ_k²} − 1 | F_{k−1}) = (1 + (u²/2)σ_k² + R_2(uσ_k)) E(R_1(uX̃_k)|F_{k−1})
+ (1 − (u²/2)σ_k²) R_2(uσ_k) − (u⁴/4)σ_k⁴. (4.6.15)
(ii) σ_k² = E(X̃_k²|F_{k−1}) ≤ ε² + E(X̃_k²1_{|X̃_k|>ε}|F_{k−1}). This is nice, because the second term is controlled by the Lindeberg condition.
(iii) |E(R_1(uX̃_k)|F_{k−1})| ≤ |u|³ε σ_k² + u² E(X̃_k²1_{|X̃_k|>ε}|F_{k−1}). This holds by computing the conditional expectation given F_{k−1} of both sides of the inequality
min{u²X̃_k², |u|³|X̃_k|³} = min{u²X̃_k², |u|³|X̃_k|³}(1_{|X̃_k|≤ε} + 1_{|X̃_k|>ε}) ≤ |u|³ε X̃_k²1_{|X̃_k|≤ε} + u²X̃_k²1_{|X̃_k|>ε}.
(iv) |R_2(uσ_k)| ≤ e^{(u²/2)C}u⁴σ_k⁴ ≤ e^{(u²/2)C}u⁴Cσ_k².
Using these estimates, we get
|(4.6.15)| ≤ (1 + (u²/2)C + e^{(u²/2)C}u⁴C²)(|u|³ε σ_k² + u² E(X̃_k²1_{|X̃_k|>ε}|F_{k−1}))
+ (1 + (u²/2)C) e^{(u²/2)C}u⁴σ_k²(ε² + C E(X̃_k²1_{|X̃_k|>ε}|F_{k−1}))
+ (u⁴/4)σ_k²(ε² + C E(X̃_k²1_{|X̃_k|>ε}|F_{k−1}))
≤ K(u)(ε σ_k² + E(X̃_k²1_{|X̃_k|>ε}|F_{k−1})), (4.6.17)
where, after summing over k ≤ n, the second term tends to zero by the Lindeberg condition, for any ε > 0. Thus the limit as n ↑ ∞ of the second term in Eq. (4.6.7) is bounded by a constant times ε, for any ε > 0; that is, it is equal to zero, as desired. This proves the CLT under the assumption (4.6.8).
To conclude, let us show that we can remove Assumption (4.6.8). Define, for n
fixed and m ≤ n
A_m ≡ {ω ∈ Ω : ⟨M̃⟩_m ≡ ∑_{k=1}^m E(X̃_k²|F_{k−1}) ≤ C}. (4.6.19)
k=1
(to be more formal, we should write Anm , as this set is different for different choices
of n. Remember that all terms with a tilda are divided by s2n ). Of course, for m ≤ n,
n /sn → 1 and so
An ⊂ Am , and so P(An ) ≤ P(Am ). Moreover, by assumption hMi 2
limn→∞ P(An ) = limn→∞ P(An ) = 1. Notice that k=1 E X̃k |Fk−1 is Fm−1 measur-
n Pm 2
able, and hence so is 1Am . Thus, if we set Zm ≡ X̃m 1Am , it holds that E(Zm |Fm−1 ) = 0,
for all m ≤ n. Therefore the variables {Zm , m ≤ n}, for fixed n, form a martingale dif-
ference sequence. Since |Zm | ≤ |X̃m |, all the properties used in the calculations above
carry over to the Z_m. Therefore, repeating the calculations above with M̃_n replaced by M̂_n ≡ ∑_{m=1}^n Z_m, we find that
lim_{n→∞} E(e^{iuM̂_n}) = e^{−u²/2}. (4.6.20)
Moreover, since M̂_n = M̃_n on A_n and P(A_n^c) → 0,
lim_{n→∞} E(e^{iuM̃_n}) = lim_{n→∞} E(e^{iuM̃_n}1_{A_n}) + lim_{n→∞} E(e^{iuM̃_n}1_{A_n^c}) (4.6.21)
= lim_{n→∞} E(e^{iuM̂_n}1_{A_n}) + 0
= lim_{n→∞} E(e^{iuM̂_n}) − lim_{n→∞} E(e^{iuM̂_n}1_{A_n^c})
= e^{−u²/2}.
Very similar computations like those presented above play an important rôle in
what is called the concentration of measure phenomenon. Without going into too
many details, let me briefly describe this. The setting one is considering is the fol-
lowing. We have n independent, identically distributed random variables, X1 , . . . , Xn ,
assumed to have mean zero, variance one, and to satisfy, e.g. E(euXi ) < ∞, for all
u ∈ R. Let f : Rn → R be a differentiable function that satisfies
∂f
n
sup sup (x1 , . . . , xn ) ≤ 1. (4.6.22)
k=1 x∈Rn ∂xk
Set F ≡ f(X_1, . . . , X_n). Then one can show that, for some constant C > 0,
P(|F − E(F)| > ρn) ≤ 2e^{−nρ²/(2C)}. (4.6.23)
The proof relies on the exponential Markov inequality, which states that
P(F − E(F) > ρn) ≤ inf_{λ>0} e^{−λρn} E(e^{λ(F−E(F))}).
The computations one has to do are quite similar to those we have performed in the proof of the central limit theorem. There is one small trick that is useful: set F^u ≡ f(X_1, . . . , uX_k, X_{k+1}, . . . , X_n). Then
F¹ − F⁰ = ∫₀¹ du (d/du)F^u = ∫₀¹ du X_k (∂f/∂x_k)(X_1, . . . , uX_k, X_{k+1}, . . . , X_n) (4.6.27)
and
E(F|F_k) − E(F|F_{k−1}) = ∫₀¹ du (E[(d/du)F^u | F_k] − E[(d/du)F^u | F_{k−1}])
≡ E(Z_k|F_k) − E(Z_k|F_{k−1}), (4.6.28)
≤ λ2C (4.6.29)
by the assumption on the law of Xk . We leave the remaining details of the calculation
as an exercise. For more on concentration of measure, see e.g. [9].
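The phenomenon itself is easy to observe numerically. The sketch below uses f(x) = ∑_k sin(x_k), a smooth function with |∂f/∂x_k| = |cos(x_k)| ≤ 1 as in (4.6.22); this choice of f, and the Gaussian inputs, are illustrative assumptions of mine. Its fluctuations are of order √n, so deviations of order ρn essentially never occur.

```python
import math, random

random.seed(5)
n, trials, rho = 200, 2000, 0.2

def f(xs):
    # smooth, with |df/dx_k| = |cos(x_k)| <= 1 for every k, as in (4.6.22)
    return sum(math.sin(x) for x in xs)

samples = [f([random.gauss(0, 1) for _ in range(n)]) for _ in range(trials)]
mean_f = sum(samples) / trials
frac = sum(abs(s - mean_f) > rho * n for s in samples) / trials
# fluctuations of f are O(sqrt(n)), far below the scale rho * n = 40
assert frac < 0.01
print(frac)
```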
In a stochastic process we often want to consider random times that are determined by the occurrence of a particular event. If this event depends only on what happens “in the past”, we call such a time a stopping time. Stopping times are convenient, since we can determine their occurrence as we observe the process; hence, if we are only interested in them, we can stop the process at that moment, whence the name.
{T = n} ∈ Fn . (4.7.1)
Example. The most important examples of stopping times are hitting times. Let X be an adapted process, and let B ∈ B. Define
Definition 4.30. The pre-T -σ-algebra, FT , is the set of events F ⊂ F, such that, for
all n ∈ N0 ∪ {+∞},
F ∩ {T ≤ n} ∈ Fn . (4.7.5)
Pre-T-σ-algebras will play an important rôle in the formulation of the strong Markov property.
There are some useful elementary facts associated with this concept.
Proof. Exercise. ⊔⊓
and since C_n^T only takes the two values 0 and 1, this suffices to show that C_n^T is F_{n−1}-measurable. The wealth process associated to this strategy is then
(C T • X)n = XT ∧n − X0 . (4.7.8)
C T • X = X T − X0 . (4.7.10)
for all n ∈ N, ω ∈ Ω.
(ii) If X is a martingale and one of the conditions (a)-(c) holds, then E(XT ) = E(X0 ).
Remark. This theorem may look strange, and seem to contradict the “no strategy” idea: take a simple random walk, S_n (i.e. a series of fair games), and define the stopping time T = inf{n : S_n = 10}. Then clearly E(S_T) = 10 ≠ E(S_0) = 0! So we conclude, using (c), that E(T) = +∞. In fact, the “sure” gain if we achieve our goal is offset by the fact that, on average, it takes infinitely long to reach it (of course, most games will end quickly, but chances are that some may take very, very long!).
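This remark can be tested directly: stopping at the bounded time T ∧ n preserves the mean, E(S_{T∧n}) = 0 for every fixed n, even though S_T = 10 whenever the target is reached. The horizon and target below are illustrative parameters.

```python
import random

random.seed(6)
trials, horizon, target = 4000, 2000, 10
stopped_mean, hits = 0.0, 0
for _ in range(trials):
    s, t = 0, 0
    while s != target and t < horizon:
        s += random.choice((-1, 1))
        t += 1
    hits += (s == target)
    stopped_mean += s          # this is S_{T ^ horizon}
stopped_mean /= trials
# optional stopping at the bounded time T ^ horizon: E(S_{T ^ horizon}) = 0,
# even though S_T = 10 on the event {T < infinity}
assert abs(stopped_mean) < 1.5
print(stopped_mean, hits / trials)
```

The paths that have not yet hit 10 sit, on average, far enough below zero to cancel the guaranteed gain of the successful ones.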
Proof. We already know that E(XT ∧n ) − E(X0 ) ≤ 0 for all n ∈ N. Consider case (a).
Then we know that T ∧ N = T , and so E(XT ) = E(XT ∧N ) ≤ E(X0 ), as claimed.
In case (b), we start from E(XT ∧n ) − E(X0 ) ≤ 0 and let n → ∞. Since T is almost
surely finite, limn→∞ XT ∧n = XT , a.s., and since Xn is uniformly bounded,
and by assumption E(KT ) < ∞. Thus, we can again take the limit n → ∞ and use
Lebesgue’s dominated convergence theorem to justify that the inequality survives.
Finally, to justify (ii), use that if X is a martingale, then both X and −X are super-
martingales. The ensuing two inequalities imply the desired equality. u t
Case (c) in the above theorem is certainly the most frequent situation one may
hope to be in. For this it is good to know how to show that E(T ) < ∞, if that is the
case. The following lemma states that this is always the case, whenever, eventually,
the probability that the event leading to T is reasonably big.
Lemma 4.35. Suppose that T is a stopping time and that there exist N ∈ N and ε > 0, such that, for all n ∈ N,
P(T ≤ n + N | F_n) > ε, a.s.
Then E(T) < ∞.
Proof. Under the hypothesis,
P(T > kN) ≤ (1 − ε)E(1_{T>(k−1)N}) ≤ (1 − ε)^k,
by iteration. The exponential decay of the probability implies the finiteness of the expectation of T immediately. ⊔⊓
Proof. We know that E(X_{T∧n}) ≤ E(X_0). Using Fatou’s lemma and the fact that X_n → X_∞, a.s., allows us to show that
E(X0 ) ≥ lim inf E(XT ∧n ) ≥ E lim inf XT ∧n = E (XT ) . (4.7.21)
n n
For (4.7.20), set T = inf{n : Xn > c}. Then, E(X0 ) ≥ E(XT ). But XT > c, provided the
set {n : Xn > c} is not empty, so E(XT ) ≥ cP(supk Xk > c). u t
Chapter 5
Markov processes
In general, we call a stochastic process whose index set supports the action of a group (or semi-group) stationary (with respect to the action of this (semi-)group), if all finite-dimensional distributions are invariant under the simultaneous shift of all time-indices. Specifically, if our index set, I, is R_+ or Z, resp. N, then a stochastic process is stationary if, for all ℓ ∈ N, s_1, . . . , s_ℓ ∈ I, all A_1, . . . , A_ℓ ∈ B, and all t ∈ I,
h i h i
P X s1 ∈ A1 , . . . , X s` ∈ A` = P X s1 +t ∈ A1 , . . . , X s` +t ∈ A` . (5.1.1)
We can express this also as follows: Define the shift θ, for any t ∈ I, as (X ◦ θt ) s ≡
Xt+s . Then X is stationary, if and only if, for all t ∈ I, the processes X and X ◦ θt have
the same finite dimensional distributions.
In the case of Markov processes, a necessary (but not sufficient) condition for
stationarity is the stationarity of the transitions kernels. Recall that we have defined
the one-step transition kernel Pt of a Markov process in Section 3.3.
Definition 5.1. A Markov process with discrete time N_0 and state space S is said to have stationary transition probabilities (kernels), if its one-step transition kernel, P_t, is independent of t, i.e. there exists a probability kernel, P, such that
for all s < t ∈ N, x ∈ S, and A ∈ B. Note the potential conflict of notation between P_t and P^t, which should not be confused.
A key concept for Markov chains with stationary transition kernels is the notion
of an invariant distribution.
Definition 5.2. Let P be the transition kernel of a Markov chain with stationary
transition kernels. Then a probability measure, π, on (S , B) is called an invariant
(probability) distribution, if
Z
π(dx)P(x, A) = π(A), (5.1.4)
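For a finite (or countable, suitably truncated) state space, an invariant distribution can be found numerically by iterating the action of the kernel on measures. The small 3-state chain below is an illustrative example of mine, not taken from the text.

```python
def invariant_distribution(P, iters=2000):
    """Approximate pi with pi P = pi by iterating the action of P on measures."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# a small aperiodic, irreducible 3-state chain; rows sum to 1
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = invariant_distribution(P)
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(piP[j] - pi[j]) < 1e-9 for j in range(3))   # pi P = pi
print([round(x, 3) for x in pi])
```

The convergence of the iterates to π, independently of the starting measure, is exactly the ergodicity phenomenon alluded to later for finite chains.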
The setting of Markov processes is very much suitable for the application of the
notions of stopping times introduced in the last section. In fact, one of the very
important properties of Markov processes is the fact that we can split expectations
between past and future also at random times.
Theorem 5.4. Let X be a Markov process with stationary transition kernels. Let
Fn = σ(X0 , . . . , Xn ) be the natural filtration, and let T be a stopping time. Let F and
G be F-measurable functions, and let F in addition be measurable with respect to
the pre-T -σ-algebra FT . Then
Remark. If this looks fancy, just think of G as a function of the Markov process, i.e.
G = G(Xi1 , . . . , Xik ), and F = F(XT , XT −1 , . . . , X0 ). Then the statement of the theorem
says that
Proof. We have
We now want to develop some theory that will be more important, and more difficult, in the continuous-time case. First we want to see how the transition kernels can be seen as operators acting on spaces of measures, respectively on spaces of functions.
If µ is a σ-finite measure on S , and P is a Markov transition kernel, we define
the measure µP as Z
µP(A) ≡ P(x, A)dµ(x), (5.3.1)
S
and similarly, for the t-step transition kernel, Pt ,
Z
µPt (A) ≡ Pt (x, A)dµ(x). (5.3.2)
S
The action on measures has of course the following natural interpretation in terms
of the process: if P(X0 ∈ A) = µ(A), then
and Z
(Pt f )(x) ≡ f (y)Pt (x, dy), (5.3.6)
S
where again
Pt f = Pt f. (5.3.7)
We say that Pt is a semi-group acting on the space of measures, respectively on the
space of bounded measurable functions. The interpretation of the action on functions
is given as follows.
⊔⊓
where we call L ≡ P − 1 the (discrete) generator of our Markov process (this formula
will have a complete analogue in the continuous-time case).
An interesting consequence is the following observation:
is a martingale.
where the last term vanishes because of (5.3.10). This proves the lemma. t
u
Remark. (5.3.11) is of course the Doob decomposition of the process f(X_t), since ∑_{s=0}^{t−1} (L f)(X_s) is a previsible process. One may check that this can be obtained directly using the formula (4.4.5) [Exercise!].
Proof. Lemma 5.6 already provides the “only if” part, so it remains to show the “if”
part.
First, if we assume that X is a Markov process, setting r = 1 and t = 0 above and taking conditional expectations given F_0, we see from Lemma 5.5 that E(f(X_1)|F_0) = f(X_0) + (L f)(X_0), implying that the transition kernel must be 1 + L.
It remains to show that X is indeed a Markov process. For this we want to show
that
E( f (Xt+s )|Ft ) = (1 + L) s f (Xt ) ≡ P s f (Xt ), (5.3.13)
follows from the martingale problem formulation. To see this, we just use the above
calculation to see that
which is (5.3.13) for r = 1. Now proceed by induction: assume that (5.3.13) holds for all bounded measurable functions for all s ≤ r − 1. We must show that it then also holds for s = r. To do this, we use (5.3.13) for the last sum in (5.3.14),
∑_{s=0}^{r−1} E((L f)(X_{t+s})|F_t) = ∑_{s=0}^{r−1} (P^s(L f))(X_t) = (P^r f)(X_t) − f(X_t), (5.3.16)
where we undid the telescopic sum. Inserting this into (5.3.14) yields (5.3.13) for
s = r. Hence (5.3.13) holds for all r, by induction. u
t
Remark. The full strength of this theorem will come out in the continuous-time case. A crucial point is that it will not be necessary to consider all bounded functions, but just sufficiently rich classes. This allows one to formulate martingale problems even when one cannot write down the generator in an explicit form. The idea of characterising Markov processes by the associated martingale problem goes back to Stroock and Varadhan, see [13].
We have seen that measures that satisfy µL = 0 are of special importance in the
theory of Markov processes (they are the invariant measures). Also of central im-
portance are functions that satisfy L f = 0. In this section we will assume that the
transition kernels of our Markov chains have bounded support, so that for some
K < ∞, |Xt+1 − Xt | ≤ K < ∞ for all t.
L f (x) = 0, ∀x ∈ D, (5.4.1)
for all x ∈ D, hence the claim of the theorem. Of course we again used Doob’s optional stopping theorem in case (i,c). ⊔⊓
The theorem can be phrased as saying that (sub) harmonic functions take on their
maximum on the boundary, since of course the set Dc in (5.4.2) can be replaced by
a subset, ∂D ⊂ Dc such that P x (XT ∈ ∂D) = 1.
The above proof is an example of how intrinsically analytic results can be proven
with probabilistic means. The next section will further develop this theme.
Let us now consider a connected bounded open subset D of S . We define the stop-
ping time T = τDc ≡ inf{t > 0 : Xt ∈ Dc }.
If g is a bounded measurable function on D, we consider the Dirichlet problem
associated to a generator, L, of a Markov process, X:
Theorem 5.11. Assume that E(τDc ) < ∞. Then (5.5.1) has a unique solution given
by
f(x) = E(∑_{t=0}^{τ_{D^c}−1} g(X_t) | F_0)(x) ≡ E_x(∑_{t=0}^{τ_{D^c}−1} g(X_t)). (5.5.2)
But we want f such that −L f = g on D. Thus, (5.5.3) seen as a problem for f , reads
M_T = −f(X_0) + ∑_{t=0}^{T−1} g(X_t). (5.5.4)
or
f(x) = E_x(∑_{t=0}^{T−1} g(X_t)), (5.5.6)
where
P_t^D(x, dy) = ∫_D P(x, dz_1) ∫_D P(z_1, dz_2) · · · ∫_D P(z_{t−1}, dy). (5.5.10)
Theorem 5.12 is a two-way game: it allows one to produce solutions of analytic problems in terms of stochastic processes, and it allows one to compute interesting probabilistic quantities analytically. As an example, assume that D^c = A ∪ B with A ∩ B = ∅. Set h = 1_A. Then, clearly, for x ∈ D,
and so P x (XT ∈ A) can be represented as the solution of the boundary value problem
(L f )(x) = 0, x ∈ D, (5.5.14)
f (x) = 1, x ∈ A,
f (x) = 0, x ∈ B.
This is a generalisation of the ruin problem for the random walk that we discussed in Probability 1.
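For the simple random walk on {0, . . . , N} with A = {0}, B = {N}, the boundary value problem can be solved by elementary local averaging, and the solution is the familiar linear ruin probability. The solver below is a minimal sketch; the chain and method are illustrative choices.

```python
def harmonic_extension(N, sweeps=5000):
    """Solve (L f)(x) = 0 on D = {1,...,N-1} with f(0) = 1, f(N) = 0 for the
    simple random walk, whose generator is (L f)(x) = (f(x-1)+f(x+1))/2 - f(x),
    by repeated local averaging (Gauss-Seidel sweeps)."""
    f = [0.0] * (N + 1)
    f[0] = 1.0
    for _ in range(sweeps):
        for x in range(1, N):
            f[x] = 0.5 * (f[x - 1] + f[x + 1])
    return f

N = 10
f = harmonic_extension(N)
# the solution is P_x(hit 0 before N), which for the simple walk is 1 - x/N
assert all(abs(f[x] - (1 - x / N)) < 1e-9 for x in range(N + 1))
print(f[5])
```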
Exercise. Derive the formula for P x (τA < τB ) directly from the Markov property
without using Lemma 5.6.
Let us consider the case where the solution of the Dirichlet problem is unique. Then
the solution of (5.5.12) can be written in the form
Z Z
f (x) = G Dc (x, dz)g(z) + HDc (x, dz)h(z), (5.5.15)
D Dc
where
G_{D^c}(x, A) = E_x(∑_{t=0}^{τ_{D^c}−1} 1_{X_t∈A})
is called the Green kernel, and
H_{D^c}(x, A) = E_x(1_{X(τ_{D^c})∈A}) = ∑_{t=0}^∞ P_x(τ_{D^c} = t ∧ X_t ∈ A) (5.5.16)
is called the Poisson kernel. The Green kernel can also be characterised as the weak solution of the problem
Suppose that (5.5.18) has a unique solution, e.g. because E x [τA∪B ] < ∞ for all x ∈ S .
The harmonic function that solves (5.5.18) will be denoted by hA,B (x) and is called
the equilibrium potential. We have already seen that
We would like to view this equation as an analytic expression for the probability
in the right-hand side. Naturally we would like to obtain such a formula also when
x ∈ A or x ∈ B. However, using the Markov property, we see that
P_x(τ_A < τ_B) = ∫_{(A∪B)^c} P(x, dy)P_y(τ_A < τ_B) + ∫_A P(x, dy) (5.5.20)
= ∫_S P(x, dy)h_{A,B}(y) = Ph_{A,B}(x)
= (Lh_{A,B})(x) + h_{A,B}(x).
and for x ∈ A as
The quantity eA,B (x) ≡ −LhA,B (x), x ∈ A, is called the equilibrium measure on A.
The following simple observation provides the fundamental connection between
the objects we have introduced so far, and leads to a different representation of the
equilibrium potential. Pretend that the equilibrium measure eA,B is already known.
Then the equilibrium potential satisfies the inhomogeneous Dirichlet problem
Note that e_{a,B}(a) = P_a(τ_B < τ_a) has the meaning of an escape probability.
In particular, Ph [Ω|F0 ] = 1.
Theorem 5.15. Let X be a Markov chain with generator L. Let h be a positive har-
monic function. Then the h-transformed measure, Ph , is the law of a Markov process
with generator Lh , where for any bounded measurable function f ,
L^h f(x) ≡ (1/h(x)) ∫_S P(x, dy)h(y) f(y) − f(x). (5.6.5)
Proof. To prove this theorem we turn to the martingale problem. We will show that, for L^h defined by (5.6.5),
M_t^h ≡ f(X_t) − f(X_0) − ∑_{s=0}^{t−1} (L^h f)(X_s) (5.6.6)
is a martingale under P^h.
The middle terms are part of Mrh and we must consider E[ f (Xt )h(Xt )|Fr ]. This is
done by applying the martingale problem for P and the function f h. This yields
E[f(X_t)h(X_t)|F_r] = f(X_r)h(X_r) + ∑_{s=r}^{t−1} E[(L(f h))(X_s)|F_r].
The second term will vanish if we choose L^h f(x) = h(x)^{−1}(L(h f))(x), i.e. as defined in (5.6.5).
Hence we see that under Ph , X solves the martingale problem corresponding to
the generator Lh , and so is a Markov chain with transition kernel Ph = Lh + 1. The
process X under Ph is called the (Doob) h-transform of the original Markov process.
⊓⊔

Now let us take h to be the harmonic potential h(x) = h_{A,B}(x), for A, B ⊂ S. Then (recall the definition of h_{A,B}, (5.5.18))
    P^h(x, dy) = (1/P_x(τ_A < τ_B)) P(x, dy) h_{A,B}(y)
               = (1/P_x(τ_A < τ_B)) P_x(X_1 ∈ dy, τ_A < τ_B)
               = P_x(X_1 ∈ dy | τ_A < τ_B).    (5.6.9)
    E_x^h[Y] = (1/h(x)) E_x[Y h(X_{τ_{A∪B}})]    (5.6.11)
             = (1/P_x[τ_A < τ_B]) E_x[Y 1_{τ_A < τ_B}]
             = E_x[Y | τ_A < τ_B]. ⊓⊔
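In the countable-state case the transformed kernel P^h(x, y) = P(x, y) h(y)/h(x) can be tabulated directly. A minimal numerical sketch for a gambler's-ruin chain; the state space {0, …, N} and the choice h(x) = x/N are illustrative assumptions, not part of the text:

```python
import numpy as np

def h_transform(N):
    # Simple random walk on {0, ..., N} absorbed at 0 and N; the function
    # h(x) = P_x(hit N before 0) = x / N is harmonic in the interior, and
    # P^h(x, y) = P(x, y) h(y) / h(x) is the walk conditioned to hit N first.
    h = np.arange(N + 1) / N
    up = np.array([0.5 * h[x + 1] / h[x] for x in range(1, N)])
    down = np.array([0.5 * h[x - 1] / h[x] for x in range(1, N)])
    return up, down

up, down = h_transform(10)
print(up + down)  # P^h is still a stochastic kernel: all ones
print(up[:3])     # [1.0, 0.75, ~0.667]: upward drift away from 0
```

The conditioned walk pushes away from the forbidden boundary, in line with the conditional-law interpretation (5.6.9).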
5.7 Markov chains with countable state space

Much of the theory of Markov chains with countable state space is similar to the
case of finite state space. In particular, the notions of communicating classes, irre-
ducibility, and periodicity carry over. There are, however, important new concepts
in the case when the state space is infinite. These are the notions of recurrence and
transience. It will be useful to use a notation close to the matrix notation of finite
chains. Thus we set
P(i, { j}) = p(i, j) (5.7.1)
We place ourselves in the setting of an irreducible Markov chain, i.e. all states in S communicate (for any i, j ∈ S, P_i(τ_j < ∞) > 0). We may also for simplicity assume that our chain is aperiodic. In the case of finite state space, we have seen that
such chains are ergodic in the sense that there exists a unique invariant probability
distribution, and the marginal distributions at time t, converge to this distribution
independently of the starting measure. Essentially this is true because the chain is
trapped on the finite set. If S is infinite, a new phenomenon is possible: the chain
may run “to infinity”.
Definition 5.17. Let X be an irreducible aperiodic Markov chain with countable
state space S . Then:
(i) X is called transient, if for any i ∈ S, P_i(τ_i < ∞) < 1;
(ii) X is called recurrent, if for any i ∈ S, P_i(τ_i < ∞) = 1.
Remark. The notion of recurrence and transience can be defined for states rather
than for the entire chain. In the case of irreducible and aperiodic chains, all states
have the same characteristics.
Some simple consequences of the definition are the following.
Lemma 5.18. Let X be an irreducible Markov chain with countable state space. Then X is transient if and only if, for any ℓ ∈ S,

    P_ℓ(X_t = ℓ for infinitely many t) = 0.    (5.7.4)
Proof. Assume that X is transient. Then P` (τ` < ∞) = c < 1. By the first Borel-
Cantelli lemma, (5.7.4) holds if
    Σ_{t=0}^∞ P_ℓ(X_t = ℓ) < ∞.    (5.7.5)

But

    Σ_{t=0}^∞ P_ℓ(X_t = ℓ) = E_ℓ[Σ_{t=0}^∞ 1_{X_t=ℓ}] = Σ_{n=1}^∞ n P_ℓ(X_t = ℓ for exactly n times t).    (5.7.6)

By the strong Markov property, the probability that the chain visits ℓ exactly n times is c^{n−1}(1 − c), so the right-hand side equals Σ_{n=1}^∞ n c^{n−1}(1 − c) = (1 − c)^{−1} < ∞. Inserting this equality into (5.7.6) yields that (5.7.5) holds and thus that (5.7.4) is true.
To show the converse, assume that (5.7.4) holds. Then
Positive recurrent chains are called ergodic, because they are ergodic in the same
sense as finite Markov chains.
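The summability criterion of Lemma 5.18 can be evaluated numerically for the nearest-neighbour walk on Z; a sketch, where the walk with up-probability p and the truncation tmax are illustrative choices:

```python
from math import comb

def green_at_zero(p, tmax):
    # Partial sums of G(0,0) = sum_t P_0(X_t = 0) for the nearest-neighbour
    # walk on Z with up-probability p; returns happen only at even times,
    # and P_0(X_{2t} = 0) = C(2t, t) p^t (1-p)^t.
    return sum(comb(2 * t, t) * (p * (1 - p)) ** t for t in range(tmax + 1))

# p != 1/2: the series converges (transience); closed form 1/sqrt(1 - 4p(1-p)).
print(round(green_at_zero(0.6, 300), 3))   # ~ 5.0 = 1/sqrt(0.04)
# p = 1/2: the partial sums keep growing like sqrt(t) (recurrence).
print(green_at_zero(0.5, 100) < green_at_zero(0.5, 400))  # True
```

The finite value in the biased case corresponds to a finite expected number of visits to 0, the divergence in the symmetric case to recurrence.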
Lemma 5.19. Let X be a positive recurrent Markov chain with countable state space, S. Then, for any j, ℓ ∈ S,

    μ(j) ≡ E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_t = j}] / E_ℓ[τ_ℓ]    (5.7.9)

is the unique invariant probability distribution of X.
Proof. Define ν_ℓ(j) ≡ E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_t = j}]. We show first that ν_ℓ is an invariant measure. Obviously, 1 = Σ_{m∈S} 1_{X_{t−1} = m}, and hence, using the strong Markov property,
    ν_ℓ(j) ≡ E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_t = j}] = Σ_{m∈S} E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_t = j} 1_{X_{t−1} = m}]
           = Σ_{m∈S} E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_{t−1} = m} P[X_t = j | F_{t−1}]]
           = Σ_{m∈S} E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_{t−1} = m}] p(m, j)
           = Σ_{m∈S} E_ℓ[Σ_{t=1}^{τ_ℓ} 1_{X_t = m}] p(m, j)
           = Σ_{m∈S} ν_ℓ(m) p(m, j),

where in the penultimate step we used that X_0 = X_{τ_ℓ} = ℓ, so that Σ_{t=1}^{τ_ℓ} 1_{X_{t−1} = m} and Σ_{t=1}^{τ_ℓ} 1_{X_t = m} have the same expectation.
Thus ν_ℓ solves the invariance equation and hence is an invariant measure. It remains to show that ν_ℓ is normalisable. But

    Σ_{j∈S} ν_ℓ(j) = E_ℓ(τ_ℓ) < ∞,    (5.7.10)

by positive recurrence.
Now iterate the same relation in the first term in (5.7.11). Thus
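The excursion representation (5.7.9) is easy to test numerically; a sketch, where the three-state transition matrix is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
# A small irreducible aperiodic chain; the matrix is an arbitrary example.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

# Exact invariant distribution: solve mu P = mu together with sum(mu) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
mu_exact = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

# Excursion estimate of (5.7.9): cut the trajectory into excursions from
# state l = 0 and average the visit counts per excursion.
x, visits, steps, excursions = 0, np.zeros(3), 0, 0
for _ in range(100_000):
    x = rng.choice(3, p=P[x])
    visits[x] += 1
    steps += 1
    if x == 0:
        excursions += 1
nu = visits / excursions            # estimates E_0[visits to j per excursion]
mu_emp = nu / (steps / excursions)  # divide by the estimate of E_0[tau_0]
print(np.round(mu_exact, 3), np.round(mu_emp, 3))
```

The two vectors should agree up to Monte Carlo error of order 10^{-2}.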
Proof. The proof uses the method of coupling. Let π_0 be our initial distribution. We construct a second Markov chain, Y, independent of X, with the same transition kernel
but initial distribution µ. Then we define the stopping time T with respect to the
filtration Fn ≡ σ(X0 , Y0 , X1 , Y1 , . . . , Xn , Yn ) as
T ≡ inf {n : Xn = Yn = i} , (5.7.18)
The expression in the brackets is smaller than one, while the coefficient P(T > n) tends to zero as n ↑ ∞. This proves the theorem. ⊓⊔
Remark. Note that both irreducibility and aperiodicity were used in the proof. It is
clear from elementary examples that the conclusion cannot hold for periodic Markov
chains.
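The coupling argument can be imitated in a few lines; the chain below and the meeting rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# An arbitrary irreducible aperiodic chain on {0, 1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

def coupling_time(x0, y0, max_steps=10_000):
    # Run two independent copies until they first agree; after that time T
    # they can be glued together, giving |law(X_n) - law(Y_n)|_TV <= P(T > n).
    x, y = x0, y0
    for n in range(1, max_steps + 1):
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        if x == y:
            return n
    return max_steps

times = [coupling_time(0, 2) for _ in range(2000)]
print(round(float(np.mean(times)), 2))  # small mean meeting time
```

A finite (here small) mean meeting time translates into fast convergence of the marginal distributions, regardless of the initial states.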
Finally we note that the strong ergodic theorem that we know for irreducible
Markov chains with finite state space holds also for positive recurrent chains with
countable state space. The proof is identical to that in finite state space, given that
we already know existence and uniqueness of an invariant probability measure.
Chapter 6
Random walks and Brownian motion
Sums S_n ≡ Σ_{i=1}^n X_i with X_i, i ∈ N, iid random variables are generally called random walks and receive considerable attention in probability theory. A special case is the so-called simple
random walk on Zd , characterised by the fact that the random variables Xi take
values in the set of ± unit vectors of the lattice Z^d. Consequently, S_n ∈ Z^d is a stochastic process with discrete state space. Obviously, S_n is a Markov chain, and, moreover, the coordinate processes S_n^μ, μ = 1, . . . , d, are sub-, super-, or martingales, depending on whether E(X_0^μ) is positive, negative, or zero.
Let us focus on the centred case, E(X1 ) = 0. In this case we have seen that Zn ≡
n−1/2 S n converges in distribution to a Gaussian random variable. By considering the
process coordinate-wise, it will also be enough to think about d = 1. We now want to extend this result to a convergence result on the level of stochastic processes. That is, rather than saying something about the position of the random walk at a time n, we want to trace the entire trajectories of the process and try to give a description of
their statistical properties in terms of some limiting stochastic process.
It is rather clear from the central limit theorem that we must consider a rescaling
like
    Z_n(t) ≡ n^{−1/2} Σ_{k=1}^{[tn]} X_k.    (6.1.2)
In that case we have, from the central limit theorem, that for any t ∈ (0, 1],

    Z_n(t) → B_t in distribution,    (6.1.3)
([x] denotes the lower integer part of x) where Bt is a centred Gaussian random vari-
able with variance t. Moreover, for any finite collection of indices t1 , . . . , t` , define
Yn (i) ≡ Zn (ti ) − Zn (ti−1 ). Then the random variables Yn (i) are independent and it is
easy to see that they converge, as n → ∞, jointly to a family of independent centred
Gaussian variables with variances ti − ti−1 . This implies that the finite dimensional
distributions of the processes Zn (t), t ∈ (0, 1], converge to the finite dimensional dis-
tributions of the Gaussian process with covariance C(s, t) = s ∧ t, that we introduced
in Section 3.3.2 and that we have preliminarily called Brownian motion.
We now want to go a step further and discuss the properties of the paths of our
processes.
[Figure: three simulated random walk trajectories, over roughly 50, 5 000, and 20 000 steps.]
From looking at pictures, it is clear that the limiting process Bt should have rather
continuous looking sample paths.
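Pictures like these can be generated from (6.1.2) directly; a minimal sketch, where the ±1 step distribution and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def rescaled_walk(n, rng):
    # Z_n(j/n) = n^{-1/2} * (X_1 + ... + X_j) for simple +-1 steps.
    steps = rng.choice([-1.0, 1.0], size=n)
    return np.concatenate([[0.0], np.cumsum(steps)]) / np.sqrt(n)

for n in (50, 5_000, 20_000):
    Z = rescaled_walk(n, rng)
    print(n, round(Z[-1], 3))  # Z_n(1) is approximately N(0, 1) for large n
```

Plotting Z against the grid j/n for increasing n produces increasingly "continuous looking" paths on the same scale.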
Before stating the desired convergence result, we have to define and construct the
limiting object, the Brownian motion.
6.2 Construction of Brownian motion
for k ∈ {0, . . . , 2^{n−1} − 1} and n ≥ 1. We set I(n) ≡ {0, . . . , 2^{n−1} − 1} for n ≥ 1 and I(0) ≡ {0}. The functions h_n^k, n ∈ N, k ∈ I(n), form a complete orthonormal system of functions in L²([0, 1]), as one may easily check. Now set
    f_n^k(t) ≡ ∫_0^t h_n^k(u) du,    (6.2.2)
and set

    B_t^{(n)} ≡ Σ_{m=0}^{n} Σ_{k∈I(m)} f_m^k(t) X_{m,k}    (6.2.3)

for t ∈ [0, 1], where the X_{m,k} are our independent standard normal random variables.
We will show that (i) the continuous functions B(n) (ω) converge uniformly, almost
surely, and hence to continuous limit functions, and (ii) that the covariances of B^{(n)} converge to the correct limit. The limit, modified to be B_t(ω) ≡ 0 when B_t^{(n)}(ω) does not converge to a continuous function, will then be Brownian motion on [0, 1].
Let us now prove (i). The point here is, of course, that the functions f_n^k(t) are very small, namely

    |f_n^k(t)| ≤ 2^{−(n+1)/2}.    (6.2.4)
Moreover, for given t, there is only one value of k such that f_n^k(t) ≠ 0. Therefore,

    P[sup_{0≤t≤1} |B_t^{(n)} − B_t^{(n−1)}| > a_n] = P[sup_{0≤t≤1} |Σ_{k∈I(n)} f_n^k(t) X_{n,k}| > a_n]    (6.2.5)
      ≤ P[sup_{k∈I(n)} |X_{n,k}| > 2^{(n+1)/2} a_n]
      ≤ 2^n P[|X_{n,1}| > 2^{(n+1)/2} a_n]
      ≤ 2^n e^{−a_n² 2^n} / (√(π/2) a_n 2^{(n+1)/2}) = 2^{n/2} e^{−a_n² 2^n} / (√π a_n),
where we used the very useful bound

    P[|X| > u] ≤ (1/(u √(π/2))) e^{−u²/2}    (6.2.6)
for Gaussian probabilities. Now we are close to being done: choose a sequence a_n such that Σ_{n=0}^∞ a_n < ∞ and

    Σ_{n=1}^∞ P[sup_{0≤t≤1} |B_t^{(n)} − B_t^{(n−1)}| > a_n] < ∞.    (6.2.7)
Clearly, the choice a_n = 2^{−n/4} will do. Then, by the Borel–Cantelli lemma,

    P[sup_{0≤t≤1} |B_t^{(n)} − B_t^{(n−1)}| > a_n  i.o.] = 0,    (6.2.8)
which implies that, almost surely, the sequence B^{(n)} converges uniformly on the interval [0, 1]. Since uniformly convergent sequences of continuous functions converge to continuous functions, lim_{n→∞} B_t^{(n)}(ω) ≡ B_t(ω) exists in C([0, 1], R), for almost all ω.
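The construction can be carried out numerically by truncating (6.2.3) at a finite level; a sketch, where the grid size and the truncation level are arbitrary choices:

```python
import numpy as np

def schauder(m, k, t):
    # f_m^k(t) = integral of h_m^k up to t: for m = 0 this is t itself; for
    # m >= 1 it is a tent of height 2^{-(m+1)/2} on [k 2^{-(m-1)}, (k+1) 2^{-(m-1)}].
    if m == 0:
        return t
    a = k * 2.0 ** (-(m - 1))
    mid = a + 2.0 ** (-m)
    b = a + 2.0 ** (-(m - 1))
    s = 2.0 ** ((m - 1) / 2)
    return np.where((t >= a) & (t < mid), s * (t - a),
                    np.where((t >= mid) & (t < b), s * (b - t), 0.0))

def brownian_approx(n, t, rng):
    # Truncated series B^{(n)}_t = sum_{m <= n} sum_k f_m^k(t) X_{m,k}.
    B = np.zeros_like(t)
    for m in range(n + 1):
        for k in range(1 if m == 0 else 2 ** (m - 1)):
            B += schauder(m, k, t) * rng.standard_normal()
    return B

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 513)
B = brownian_approx(12, t, rng)
print(B[0], round(float(B[-1]), 3))  # B_0 = 0; B_1 is a standard normal sample
```

Increasing the truncation level adds ever finer, ever smaller tents, which is exactly the uniform convergence argument above.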
To check (ii), we compute the covariances:
    E(B_t^{(n)} B_s^{(n)}) = Σ_{m=0}^n Σ_{k∈I(m)} Σ_{m′=0}^n Σ_{k′∈I(m′)} f_m^k(t) f_{m′}^{k′}(s) E(X_{m,k} X_{m′,k′})
      = Σ_{m=0}^n Σ_{k∈I(m)} f_m^k(t) f_m^k(s)    (6.2.10)
      = ∫_0^1 du ∫_0^1 dv 1_{[0,t]}(u) 1_{[0,s]}(v) Σ_{m=0}^n Σ_{k∈I(m)} h_m^k(u) h_m^k(v).

As n → ∞, this converges to ⟨1_{[0,t]}, 1_{[0,s]}⟩ = s ∧ t, due to the fact that the system h_n^k is a complete orthonormal system. Now note that, from the definition of Brownian motion, for s < t,

    E(B_t B_s) = E((B_t − B_s) B_s) + E(B_s²) = s = s ∧ t,

so the limiting covariance is that of Brownian motion. Finally, since the B_t^{(n)} are Gaussian processes whose covariances converge, the limit is necessarily Gaussian with the limiting covariance (Exercise! Hint: Show that the Fourier transforms converge!).
This provides B_t on [0, 1]. To construct B_t for t ∈ (k, k + 1], just take k + 1 independent copies of the B we just constructed, say B_{t,1}, . . . , B_{t,k+1}, and set

    B_t ≡ Σ_{i=1}^k B_{1,i} + B_{t−k,k+1}.    (6.2.13)

It is easily checked that this process is a Brownian motion on R_+. This concludes the existence proof. ⊓⊔
Having constructed the random variable Bt in C(R+ , Rd ), we can now define its
distribution, the so-called Wiener measure.
For this it is useful to observe the following.

Lemma 6.3. The smallest σ-algebra, C, on C(R_+, R^d) that makes the marginals w ↦ w(t) measurable for all t ∈ R_+ coincides with the Borel σ-algebra,
B ≡ B(C(R+ , Rd )), of the metrisable space C(R+ , Rd ) equipped with the topology of
uniform convergence on compact sets.
Proof. First, C ⊂ B, since the evaluation maps w ↦ w(t) are continuous and hence measurable with respect to the Borel σ-algebra B. To prove that B ⊂ C, we note that the
topology of uniform convergence is equivalent to the metric topology relative to the
metric

    d(w, w′) ≡ Σ_{n∈N} 2^{−n} (sup_{0≤t≤n} |w(t) − w′(t)| ∧ 1).    (6.2.15)
We thus have to show that any ball with respect to this distance is measurable with respect to C. But since the elements w are continuous functions, the suprema in (6.2.15) may be restricted to rational t, so that the balls lie in C.
Note that by construction, the map ω 7→ B(ω) is measurable, since the maps
ω 7→ Bt (ω) are measurable for all t, and by definition of C, all coordinate maps
B 7→ Bt are measurable. Thus the following definition makes sense.
6.3 Donsker's invariance principle

Remark. The assertion of the theorem implies what is called weak convergence in the uniform topology on [0, 1]. This means the following: take any function F : C([0, 1], R) → R that is continuous in the uniform topology, meaning that for any ε > 0 one can find δ > 0 such that whenever two functions w, w′ satisfy sup_{t∈[0,1]} |w(t) − w′(t)| < δ, then |F(w) − F(w′)| < ε. Then

    E F(Z_n) = E F(Z̃_n).    (6.3.3)
Next,

    |E[F(Z̃_n) − F(B)]| ≤ E[|F(Z̃_n) − F(B)| 1_{sup_{t∈[0,1]} |Z̃_n(t)−B(t)| ≤ δ}]    (6.3.4)
       + E[|F(Z̃_n) − F(B)| 1_{sup_{t∈[0,1]} |Z̃_n(t)−B(t)| > δ}]
      ≤ ε + C P[sup_{t∈[0,1]} |Z̃_n(t) − B_t| > δ].
Obviously, the interval [0, 1] can be replaced with any other finite interval.
Proof. We will give an interesting proof of this theorem which will not use what we already know about finite dimensional distributions. For simplicity we consider the case d = 1 only. It will be based on the famous Skorokhod embedding. What this will do is to construct any desired random walk from a Brownian motion. This goes as follows: we assume that F is the common distribution function of our random variables X_i, assumed to have finite second moment σ². We now want to construct stopping times, T, for the Brownian motion, B, such that (i) the law of B_T is F, and (ii) E(T) = σ². This is a little tricky. First, we construct a probability measure on (−R_+) × R_+ from the restrictions, F_±, of F to the positive and negative axes:
We need some elementary facts that are easy if we accept that our results on
martingales carry over to continuous time.
Lemma 6.6. Let a < 0 < b and τ ≡ inf{t > 0 : B_t ∉ (a, b)}. Then
(i) P(B_τ = a) = b/(b − a);
(ii) E(τ) = |ab|.
Proof. As we will discuss shortly, B_t is a martingale, and let us anticipate that Doob's optional stopping theorem also holds for Brownian motion. Then 0 = E(B_τ) = b P[B_τ = b] + a P[B_τ = a] = b + (a − b) P[B_τ = a], which gives (i). To prove (ii), consider M_t ≡ (b − B_t)(B_t − a) + t, which is a martingale with M_0 = −ba. On the other hand (again assuming that we can use the optional stopping theorem), E(M_τ) = E(τ), since (b − B_τ)(B_τ − a) = 0. Hence E(τ) = −ab = |ab|. ⊓⊔
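The discrete analogue of Lemma 6.6 for simple random walk (exit from (−A, B) with P(hit −A first) = B/(A+B) and E(τ) = AB, by the same two martingales S_n and S_n² − n) can be checked by simulation; the parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def exit_stats(A, B, n_runs=4000):
    # Simple random walk from 0 with absorbing barriers -A and B:
    # P(hit -A first) = B/(A+B) and E(tau) = A*B, via optional stopping
    # applied to the martingales S_n and S_n^2 - n.
    hits_low, total_steps = 0, 0
    for _ in range(n_runs):
        s, n = 0, 0
        while -A < s < B:
            s += int(rng.choice((-1, 1)))
            n += 1
        hits_low += (s == -A)
        total_steps += n
    return hits_low / n_runs, total_steps / n_runs

p_low, mean_tau = exit_stats(3, 5)
print(round(p_low, 3), round(mean_tau, 2))  # ~ 0.625 = 5/8 and ~ 15 = 3*5
```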
Exercise. Construct the Skorokhod embedding for the simple random walk on Z.
We can now define a sequence of stopping times T_1 ≡ T, T_2 ≡ T_1 + T′_2, . . . , where the T′_i are independent and constructed in the same way as T from the Brownian motions B_{T_{i−1}+t} − B_{T_{i−1}}. Then it follows immediately from the preceding theorem that:
Theorem 6.8. The process S̃_n, n ∈ N, where S̃_n ≡ B_{T_n} for all n ∈ N, has the same distribution as the process S_n ≡ Σ_{i=1}^n X_i, where the X_i are iid with distribution F.
Proof. Let the X_i be iid with distribution function F. By Theorem 6.7, the random variables X̃_i ≡ B_{T_i} − B_{T_{i−1}} are iid with the same distribution as the X_i. Therefore, S̃_n has the same law as S_n, and Z_n(t) has the same distribution as n^{−1/2} B_{T_{[nt]}}. However, we can also construct the Skorokhod embedding to reproduce the random variables n^{−1/2} X_i as B_{T_i^n} − B_{T_{i−1}^n}. Then Z_n(t) also has the same distribution as Z̃_n(t) ≡ B_{T^n_{[nt]}}.
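For a two-point distribution the Skorokhod embedding is explicit: T is just the exit time of (a, b). A Monte Carlo sketch, with Brownian motion replaced by a scaled random walk (step size and run count are arbitrary choices, and the grid approximation is crude):

```python
import numpy as np

rng = np.random.default_rng(5)

def embed_two_point(a, b, dt=0.01, n_runs=500):
    # For the mean-zero two-point law P(X = a) = b/(b-a), P(X = b) = -a/(b-a)
    # (a < 0 < b), a Skorokhod embedding time is the exit time of (a, b):
    # B_T then has this law, and E(T) = -ab = E(X^2).
    vals, times = [], []
    for _ in range(n_runs):
        x, t = 0.0, 0.0
        while a < x < b:
            x += np.sqrt(dt) * float(rng.choice((-1.0, 1.0)))
            t += dt
        vals.append(b if x >= b else a)
        times.append(t)
    return float(np.mean(vals)), float(np.mean(times))

m, et = embed_two_point(-1.0, 2.0)
print(round(m, 2), round(et, 2))  # mean(B_T) ~ 0, E(T) ~ |ab| = 2
```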
Lemma 6.9. For any a ∈ R+ , the processes Bt and Bat ≡ a−1 Bta2 have the same dis-
tribution.
From the scaling property it follows easily that the T_i^n have the same law as T_i/n. This proves the theorem. ⊓⊔
    lim_{n→∞} T_n/n = E(T) = 1, a.s.    (6.3.14)
Thus
lim n−1 sup |T k − k| = 0, a.s. (6.3.15)
n→∞ k≤n
Since the T_i^n have the same law as T_i/n, this implies that also

    P[sup_{k≤n} |T_k^n − k/n| ≥ δ/3] ≤ ε/2.    (6.3.17)
This implies that the difference between Z̃_n(t) and B_t converges to zero in probability, uniformly in t ∈ [0, 1]. On the other hand, Z̃_n(t) has the same law as Z_n(t). This implies weak convergence as claimed. ⊓⊔
6.4 Martingale and Markov properties

Although we have not studied with full rigour the concepts of martingales and Markov processes in continuous time, Brownian motion is a good example to get provisionally acquainted with them. The nice thing here is that we know already that it has continuous paths, so that we need not worry about discontinuities; moreover, a path is determined by knowing it on a dense set of times, say the rational numbers, so we also need not worry about uncountability.
Proposition 6.10. Brownian motion is a continuous time martingale, in the sense
that, if Ft = σ(Bs , s ≤ t), then, for any s < t,
E[Bt |F s ] = Bs . (6.4.1)
Proof. Of course we have not defined what a continuous time filtration is, but we will not worry about this at the moment, and just take F_t as the σ-algebra generated by {B_s}_{s≤t}. Now we know that B_t = (B_t − B_s) + B_s, where B_t − B_s is independent of F_s. Thus

    E[B_t | F_s] = E[B_t − B_s] + B_s = B_s,

as claimed. ⊓⊔
Next we show that Brownian motion is also a Markov process. As a definition of
a continuous time Markov process, we adopt the obvious generalisation of (3.3.21).
Definition 6.11. A stochastic process with state space S and index set R+ is called
a continuous time Markov process, if there exists a two-parameter family of proba-
bility kernels, P s,t , satisfying the Chapman-Kolmogorov equations,
Z
P s,t (x, A) = Pr,t (y, A)P s,r (x, dy), ∀r ∈ (s, t), A ∈ B, (6.4.3)
S
This definition may not sound abstract enough, because it stipulates that we search for the kernels P_{s,t}; one may replace it by saying that, given B_s, the law of B_t is independent of the σ-algebras F_r, for all r < s; or in other words, that P[B_t ∈ A|F_s](ω) is a function of B_s(ω), a.s. You can see that we will have to worry a little
bit about these definitions in general, but by the continuity of Brownian motion, we
may just look at rational times and then no problem arises. We come to these things
in the next course. We see that the two definitions are really the same, using the
existence of regular conditional probabilities: namely, P s,t will be just the regular
version of P[Bt ∈ A|F s ].
    P_{s,t}(x, A) = (2π(t − s))^{−d/2} ∫_A exp(−|y − x|²/(2(t − s))) dy.    (6.4.6)
Proof. The proof is next to trivial from the defining property (i) of Brownian motion and is left as an exercise. ⊓⊔
Theorem 6.13. Let f be a twice differentiable function with bounded second derivatives. Let B_t be Brownian motion. Then
    M_t ≡ f(B_t) − f(B_0) − (1/2) ∫_0^t Δf(B_s) ds    (6.4.7)
is a martingale.
Proof. We consider for simplicity only the case d = 1; the general case works the
same way. We proceed as in the discrete time case.
    E[M_{t+r} | F_t] = f(B_t) − f(B_0) − (1/2) ∫_0^t f″(B_s) ds    (6.4.8)
       + E[f(B_{t+r}) − f(B_t) | F_t] − (1/2) ∫_0^r E[f″(B_{t+s}) | F_t] ds
      = M_t + (1/√(2πr)) ∫_R f(y) exp(−(y − B_t)²/(2r)) dy − f(B_t)
       − (1/2) ∫_0^r (1/√(2πs)) ∫_R f″(y) exp(−(y − B_t)²/(2s)) dy ds
      = M_t,

where the last equality follows from the computation below.
Namely, integrating by parts twice,

    (1/√(2πs)) ∫_R f″(y) exp(−(y − x)²/(2s)) dy    (6.4.9)
      = ∫_R f(y) (d²/dy²) [(1/√(2πs)) exp(−(y − x)²/(2s))] dy
      = (1/√(2π)) ∫_R f(y) [−s^{−3/2} + (y − x)² s^{−5/2}] exp(−(y − x)²/(2s)) dy
      = 2 ∫_R f(y) (d/ds) [(1/√(2πs)) exp(−(y − x)²/(2s))] dy,

so that, integrating over s from 0 to r,

    ∫_0^r (1/√(2πs)) ∫_R f″(y) exp(−(y − x)²/(2s)) dy ds
      = (2/√(2πr)) ∫_R f(y) exp(−(x − y)²/(2r)) dy − 2 f(x).    (6.4.10)

Evaluated at x = B_t, this shows that the two correction terms in (6.4.8) cancel.
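Theorem 6.13 can be sanity-checked by simulation for f(x) = x², where Δf = 2 and (6.4.7) reads M_t = B_t² − t; the grid and sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
# Check E(M_t) = E(M_0) = 0 for f(x) = x^2, i.e. M_t = B_t^2 - t.
# Brownian motion at time t is sampled as a sum of Gaussian increments.
runs, n, t = 5_000, 500, 1.0
dB = rng.standard_normal((runs, n)) * np.sqrt(t / n)
B_t = dB.sum(axis=1)
M_t = B_t ** 2 - t
print(round(float(M_t.mean()), 3))  # ~ 0
```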
Note that the function

    e(t, x) ≡ (1/√(2πt)) exp(−‖x‖²/(2t))    (6.4.12)

satisfies the heat equation

    ∂e(t, x)/∂t = (1/2) Δe(t, x),    (6.4.13)

with the (singular) initial condition e(0, x) = δ(x) (where δ here denotes the Dirac delta function, i.e., for any bounded continuous function, ∫_R δ(x) f(x) dx = f(0)). e(t, x) is called the heat kernel associated to (one-dimensional) Brownian motion.
If we rewrite (6.4.7) as

    f(B_t) = f(B_0) + M_t + (1/2) ∫_0^t Δf(B_s) ds,    (6.4.15)

it formally resembles the Itô formula (4.5.6) that we derived formally in Section 4.
The martingale M_t should then play the rôle of the stochastic integral, i.e. we would like to think of

    M_t = ∫_0^t ∇f(B_s) · dB_s.    (6.4.16)

It will turn out that this is indeed a correct interpretation, and that (6.4.15) is the Itô formula for Brownian motion.
6.5 Sample path properties

Clearly

    A_{n,K} ⊂ ∪_{k=2}^n {|B_{j/n} − B_{(j−1)/n}| ≤ 4K/n, for j ∈ {k − 1, k, k + 1}}.    (6.5.2)
Now
t
u
Remark. The argument used in the proof can be extended to show that Brownian
motion is nowhere Hölder continuous with exponent larger than 1/2. Namely, for
α > 1/2, let k be chosen such that k(α − 1/2) > 1. Then define
An important notion is that of the quadratic variation. Let t_k^n ≡ (k 2^{−n}) ∧ t and set

    [B]_t^n ≡ Σ_{k=1}^∞ (B_{t_k^n} − B_{t_{k−1}^n})².    (6.5.8)
Proof. Note that all sums over k contain only finitely many non-zero terms, and that all the summands in (6.5.8) are independent random variables, satisfying (for t_k^n ≤ t)

    E[(B_{t_k^n} − B_{t_{k−1}^n})²] = 2^{−n},    (6.5.9)
    var[(B_{t_k^n} − B_{t_{k−1}^n})²] = 2 · 2^{−2n}.    (6.5.10)

Thus

    E[B]_t^n = t,  var([B]_t^n) = 2^{1−n} t,    (6.5.11)

and thus, by Chebyshev's inequality and Borel–Cantelli,

    lim_{n→∞} [B]_t^n = t, a.s.    (6.5.12)
By telescopic expansion,

    B_t² − B_0² = Σ_{k=1}^∞ (B²_{t_k^n} − B²_{t_{k−1}^n})    (6.5.13)
      = Σ_{k=1}^∞ (B_{t_k^n} − B_{t_{k−1}^n})(B_{t_k^n} + B_{t_{k−1}^n})
      = Σ_{k=1}^∞ 2 B_{t_{k−1}^n} (B_{t_k^n} − B_{t_{k−1}^n}) + [B]_t^n.
Now set

    V_t^n ≡ B_t² − [B]_t^n = Σ_{k=1}^∞ 2 B_{t_{k−1}^n} (B_{t_k^n} − B_{t_{k−1}^n}).    (6.5.14)
One can check easily that for any n, V n is a martingale. Then also
where the last inequality is obtained by explicit computation. This implies that [B]nt
converges uniformly on compact intervals. ⊓⊔
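The convergence of the dyadic quadratic variation is easy to observe numerically; a sketch, where the grid levels are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
# Dyadic quadratic variation on [0, 1]: [B]^n_1 = sum of squared increments
# over the 2^-n grid should concentrate around 1 as n grows, cf. (6.5.12).
for n in (4, 8, 12):
    N = 2 ** n
    dB = rng.standard_normal(N) * np.sqrt(1.0 / N)
    qv = float(np.sum(dB ** 2))
    print(n, round(qv, 4))  # -> 1, with fluctuations of order 2^{-n/2}
```

The same increments can be used to watch the linear variation Σ|ΔB| grow like 2^{n/2}, illustrating the remark below on infinite path length.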
Remark. Lemma 6.15 plays a crucial rôle in stochastic calculus. It justifies the claim that d[B]_t = dt. If we insert this into our "discrete" Itô formula (Section 4.6), it justifies in a more precise way the step from Eq. (4.5.3) to Eq. (4.5.6).
Remark. The definition of the quadratic variation we adopt here via dyadic partitions is different from the "true" quadratic variation, which would be

    sup { Σ_{k=1}^n (B_{t_k} − B_{t_{k−1}})² : n ∈ N, 0 = t_0 < t_1 < · · · < t_n = 1 },    (6.5.17)
which can be shown to be infinite almost surely (note that the choices of the t_i can be adapted to the specific realisation of the BM). The dyadic version above is, however, important in the construction of stochastic integrals.
Remark. The fact that the quadratic variation of BM converges to t implies that the linear variation,

    Σ_{k=1}^∞ |B_{t_k^n} − B_{t_{k−1}^n}|,    (6.5.18)
diverges, as n → ∞, on every interval. This means in particular that the length of a Brownian path between any two times t, t′ is infinite.
6.6 The law of the iterated logarithm

Remark. Just as the CLT, the LIL has extensions to the case of non-identically distributed random variables. For a host of results, see [4], Chapter 10. Furthermore, there are extensions to the case of martingales, under similar conditions as for the CLT.
The nicest proof of this result passes through the analogous result for Brownian motion and then uses the Skorokhod embedding theorem. The proof below follows [11].
Thus we want to first prove:
Theorem 6.17. Let B_t be a one-dimensional Brownian motion. Then

    P[lim sup_{t→∞} B_t/√(2t ln ln t) = 1] = 1,    (6.6.2)

and

    P[lim sup_{t↓0} B_t/√(2t ln ln(1/t)) = 1] = 1.    (6.6.3)
Proof. Note first that the two statements are equivalent, since the two processes B_t and tB_{1/t} have the same law (Exercise!).
We concentrate on (6.6.3). Set h(t) ≡ √(2t ln ln(1/t)). Basically, the idea is to use exponentially shrinking subsequences t_n ≡ θ^n in such a way that the variables B_{t_n} are essentially independent. Then, for the lower bound, it is enough to show that along such a subsequence, the level h(t_n) is reached infinitely often: this will prove that the lim sup is as large as claimed. For the upper bound, one shows that along such subsequences the threshold h(t_n) is not exceeded, and then uses a maximum inequality for martingales to control the intermediate values of t.
We first show that lim supt↓0 (· · · ) ≤ 1. For this we will assume that we can use
Doob’s submartingale inequality, Theorem 4.19 also in the continuous time case.
Define

    Z_t ≡ exp(αB_t − α² t/2).    (6.6.4)
A simple calculation shows that Z_t is a martingale (with E(Z_t) = 1), and so

    P[sup_{s≤t} (B_s − αs/2) > β] = P[sup_{s≤t} e^{αB_s − α²s/2} > e^{αβ}] ≤ e^{−αβ} E(Z_t) = e^{−αβ}.    (6.6.5)
Let θ, δ ∈ (0, 1), and choose t_n = θ^n, α_n = θ^{−n}(1 + δ)h(θ^n), and β_n = h(θ^n)/2. Then

    P[sup_{s≤θ^n} (B_s − α_n s/2) > β_n] ≤ n^{−(1+δ)} (ln(1/θ))^{−(1+δ)},    (6.6.6)
which is summable in n. By the first Borel–Cantelli lemma, it follows that, for all but finitely many n,

    sup_{s≤θ^n} B_s ≤ α_n θ^n/2 + β_n = ((1 + δ)/2) h(θ^n) + (1/2) h(θ^n) = ((2 + δ)/2) h(θ^n).    (6.6.8)
For the lower bound, consider the events A_n ≡ {B_{θ^n} − B_{θ^{n+1}} > (1 − θ)^{1/2} h(θ^n)}. These events are independent, and their probability can be bounded easily using that, for any u > 0,

    (1/√(2π)) ∫_u^∞ e^{−x²/2} dx ≥ (1/(u√(2π))) e^{−u²/2} (1 − 2u^{−2}).    (6.6.13)
This implies that

    P[A_n] = (1/√(2πθ^n(1 − θ))) ∫_{(1−θ)^{1/2} h(θ^n)}^∞ exp(−x²/(2θ^n(1 − θ))) dx    (6.6.14)
      = (1/√(2π)) ∫_{θ^{−n/2} h(θ^n)}^∞ exp(−x²/2) dx
      ≥ (exp(−θ^{−n} h(θ^n)²/2) / (√(2π) θ^{−n/2} h(θ^n))) (1 − 2θ^n h(θ^n)^{−2}) ≡ γ_n.
Now, the upper bound (6.6.11) also holds for −Bt , so that, almost surely, for all but
finitely many n,
Bθn+1 ≥ −h(θn+1 ). (6.6.17)
But by some simple estimates,

    h(θ^{n+1}) = θ^{1/2} h(θ^n) √(ln ln(θ^{−n}θ^{−1}) / ln ln(θ^{−n})) ≤ θ^{1/2} h(θ^n) (1 + O(ln θ^{−1}/n)),    (6.6.18)
From the LIL for Brownian motion one can prove the LIL for random walk using
the Skorokhod embedding.
Proof (of Theorem 6.16). From the construction of the Skorokhod embedding, we know that we may choose S_n(ω) = B_{T_n}(ω). The strong law of large numbers implies that T_n/n → 1, a.s., and so also h(T_n)/h(n) → 1, a.s. Thus the upper bound follows trivially:

    lim sup_{n→∞} S_n/h(n) = lim sup_{n→∞} B_{T_n}/h(T_n) ≤ lim sup_{t→∞} B_t/h(t) = 1.    (6.6.22)
To prove the complementing lower bound, note that by Kolmogorov's 0–1-law, ρ ≡ lim sup_{n→∞} S_n/h(n) is almost surely a constant (since the lim sup is measurable with respect to the tail-σ-algebra). Then there exists n_0 < ∞ such that for all n ≥ n_0, B_{T_n}/h(T_n) ≤ ρ. Assume ρ < 1; we will show that this leads to a contradiction with (6.6.2)
of Theorem 6.17. To show this, we must show that the Brownian motion cannot rise
too far in the intervals [T_n, T_{n+1}]. But recall that T_{n+1} is defined via the exit time from a random interval [α, β] of the Brownian motion B_{T_n + t} − B_{T_n}. We want to show that in no such interval can the BM climb by more than √(2n ln ln n). An explicit computation shows that
    φ(x) ≡ P[sup_{t≤T_1} B_t > x] = γ ∫_{−∞}^0 dF_−(a) ∫_x^∞ dF_+(b) (b − a) (−a)/(x − a),    (6.6.23)

where the ratio (−a)/(x − a) is the probability that the BM reaches x before a (i.e. before T_1); the logic of the formula is that for B_t to exceed x before T_1, the random variable β must be larger than x, and then B_t may not hit the lower boundary before reaching x. Now we will be done by Borel–Cantelli, if
    Σ_n φ(√(2n ln ln n)) < ∞,    (6.6.24)
which implies

    lim sup_{t→∞} B_t/h(t) < ρ + ε,    (6.6.27)

which is smaller than 1 if ε is, e.g., (1 − ρ)/2, thus contradicting the result for BM.
We are left with checking (6.6.25). We may decompose φ as

    φ(x) = γ ∫_{−∞}^0 dF_−(a) ∫_x^∞ dF_+(b) (b − x) (−a)/(x − a)    (6.6.28)
      + γ ∫_{−∞}^0 dF_−(a) ∫_x^∞ dF_+(b) |a| ≡ φ_1(x) + φ_2(x).
Now Σ_n φ_2(√n) < ∞ if ∫_0^∞ φ_2(√x) dx < ∞. Recalling the formula for γ, (6.3.7), we see that

    ∫_0^∞ φ_2(√x) dx = ∫_0^∞ (1 − F(√x)) dx = 2 ∫_0^∞ (1 − F(t)) t dt ≤ 2 E(X²) < ∞.    (6.6.29)
To deal with φ_1, use that x − a > x, and then, as before,

    φ_1(x) ≤ x^{−1} ∫_x^∞ (b − x) dF_+(b),    (6.6.30)

which again is summable along the sequence √(2n ln ln n), since F has a finite second moment. This concludes the proof. ⊓⊔
Remark. One can show more than what we did. For one thing, not only is lim sup_t B_t/h(t) = +1 (and hence, by symmetry, lim inf_t B_t/h(t) = −1), a.s.; it is also true that the set of limit points of the process B_t/h(t) is the entire interval [−1, 1]; i.e., for any a ∈ [−1, 1], there exist subsequences t_n such that lim_n B_{t_n}/h(t_n) = a.
The LIL shows that at any given time t, the increment of BM, B_{t+δ} − B_t, grows as fast as √(2δ ln ln(1/δ)). The following theorem, called Lévy's theorem, shows that there are exceptional times when it increases even faster.
Remark. This theorem implies in particular that Brownian motion is almost surely Hölder continuous with exponent α, i.e.

    P[lim sup_{δ↓0} sup_{t∈[0,1]} δ^{−α} |B_{t+δ} − B_t| = 0] = 1,    (6.6.33)

for any α < 1/2. This is a basic property of Brownian motion that one should memorise. But Theorem 6.18 is sharper than that. It states that almost surely, on any compact interval, there will be points where BM increases like √(δ|ln δ|), faster than what one would guess from the LIL, which states that at any given point, it increases like √(δ ln|ln δ|)!
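The modulus scale √(2δ|ln δ|) can be observed in simulation; a rough sketch on a finite grid, where the grid size, seed, and lags are arbitrary choices and the grid approximation is crude:

```python
import numpy as np

rng = np.random.default_rng(8)
# Levy's modulus: over a whole interval the largest increment of width delta
# behaves like sqrt(2 delta ln(1/delta)), above the pointwise LIL scale
# sqrt(2 delta ln ln(1/delta)). Brownian motion is approximated on a fine grid.
N = 2 ** 16
B = np.concatenate([[0.0], np.cumsum(rng.standard_normal(N) / np.sqrt(N))])
for k in (12, 8, 4):
    lag = 2 ** k
    delta = lag / N
    sup_inc = np.max(np.abs(B[lag:] - B[:-lag]))
    ratio = sup_inc / np.sqrt(2 * delta * np.log(1 / delta))
    print(delta, round(ratio, 2))  # ratios of order 1 as delta decreases
```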
Proof. The proof we give here is due to Lévy and differs from that of the LIL in that it does not use maximum inequalities for the upper bound, but a new technique, called chaining. We first prove the lower bound. For ε ∈ (0, 1), it is enough to exhibit candidates for the highly singular behaviour:
    P[max_{k≤2^n} (B_{k2^{−n}} − B_{(k−1)2^{−n}}) ≤ (1 − ε)√(2^{1−n} ln 2^n)]    (6.6.34)
      = (1 − P[B_{2^{−n}} > (1 − ε)√(2^{1−n} ln 2^n)])^{2^n}
      = (1 − P[B_1 > (1 − ε)√(2 ln 2^n)])^{2^n}
      ≤ [1 − (2π · 2n ln 2)^{−1/2} exp(−(1 − ε)² ln 2^n)]^{2^n}
      ≤ exp(−2^{−n(1−ε)² + n}/√(2π · 2n ln 2)) ≤ exp(−2^{εn}),

for n large,
which tends to zero and is summable over n, for any ε > 0. By the first Borel–Cantelli lemma, this implies that the event considered can happen only for finitely many values of n, almost surely. Thus

    P[lim sup_{δ↓0} sup_{t∈[0,1]} (B_{t+δ} − B_t)/√(2δ|ln δ|) ≥ 1] = 1.    (6.6.35)
The upper bound is more tricky and uses an interesting technique of chaining. We first establish that the required bound holds on a 2^{−n} grid. By convention we set h(ε) ≡ √(2ε|ln ε|). Then we estimate
    P[max_{j_2 − j_1 ≤ 2^{εn}, j_i ≤ 2^n} h((j_2 − j_1)2^{−n})^{−1} |B_{j_2 2^{−n}} − B_{j_1 2^{−n}}| > 1 + 2ε]    (6.6.36)
      ≤ 2^{(1+ε)n} max_{j ≤ 2^{εn}} P[|B_{j2^{−n}}| > (1 + 2ε) h(j2^{−n})]
      ≤ 2^{(1+ε)n} P[|B_1| > (1 + 2ε)√(2 ln 2^{n(1−ε)})]
      ≤ 2^{(1+ε)n} (2/√(2π · 2n ln 2)) exp(−(1 + 2ε)² ln 2^{n(1−ε)}) ≤ 2^{−εn},

for n large enough.
This bound is summable over n, so that, by the first Borel–Cantelli lemma, almost surely there exists an n(ω) < ∞ such that for all n ≥ n(ω),

    max_{j_2 − j_1 ≤ 2^{εn}, j_i ≤ 2^n} h((j_2 − j_1)2^{−n})^{−1} |B_{j_2 2^{−n}} − B_{j_1 2^{−n}}| ≤ 1 + 2ε.    (6.6.37)
We may choose n(ω) in such a way that 2^{ε(n+1)} − 1 > 2 and 2^{−n(1−ε)} < 1/e, and

    Σ_{m=n+1}^∞ h(2^{−m}) ≤ h(2^{−(1−ε)(n+1)}),    (6.6.38)
References

1. H. Bauer and R. Burckel. Probability theory and elements of measure theory. Academic Press, London, 1981.
2. P. Billingsley. Probability and measure. Wiley Series in Probability and Mathematical Statis-
tics. John Wiley & Sons Inc., New York, 1995.
3. A. Bovier. Statistical mechanics of disordered systems. Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, Cambridge, 2006.
4. Y. Chow and H. Teicher. Probability theory. Springer Texts in Statistics. Springer-Verlag,
New York, 1997.
5. J. Doob. Measure theory, volume 143. Springer, 1994.
6. H.-O. Georgii. Gibbs measures and phase transitions, volume 9 of de Gruyter Studies in
Mathematics. Walter de Gruyter & Co., Berlin, 1988.
7. K. Itô and H. P. McKean, Jr. Diffusion processes and their sample paths. Die Grundlehren
der Mathematischen Wissenschaften, Band 125. Academic Press Inc., Publishers, New York,
1965.
8. O. Kallenberg. Random measures. Akademie-Verlag, Berlin, 1983.
9. M. Ledoux and M. Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der
Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)].
Springer-Verlag, Berlin, 1991.
10. M. Rao. Measure theory and integration, volume 265. CRC, 2004.
11. L. Rogers and D. Williams. Diffusions, Markov processes, and martingales, volume 1. Cam-
bridge University Press, 2000.
12. B. Simon. The statistical mechanics of lattice gases. Vol. I. Princeton Series in Physics.
Princeton University Press, Princeton, NJ, 1993.
13. D. Stroock and S. Varadhan. Multidimensional diffusion processes, volume 233. Springer,
1979.
14. M. Taylor. Measure theory and integration, volume 76. Amer. Math. Society, 2006.