
Inference in Hidden Markov Models

Olivier Cappé, Eric Moulines and Tobias Rydén

June 17, 2009


Contents

1 Main Definitions and Notations
  1.1 Markov Chains
    1.1.1 Transition Kernels
    1.1.2 Homogeneous Markov Chains
    1.1.3 Non-homogeneous Markov Chains
  1.2 Hidden Markov Models
    1.2.1 Definitions and Notations
    1.2.2 Conditional Independence in Hidden Markov Models

I State Inference

2 Filtering and Smoothing Recursions
  2.1 Basic Notations and Definitions
    2.1.1 Likelihood
    2.1.2 Smoothing
    2.1.3 The Forward-Backward Decomposition
    2.1.4 Implicit Conditioning (Please Read This Section!)
  2.2 Forward-Backward
    2.2.1 The Forward-Backward Recursions
    2.2.2 Filtering and Normalized Recursion
  2.3 Markovian Decompositions
    2.3.1 Forward Decomposition
    2.3.2 Backward Decomposition

3 Forgetting of the Initial Condition and Filter Stability
    3.0.3 Total Variation
    3.0.4 Lipschitz Contraction for Transition Kernels
    3.0.5 The Doeblin Condition and Uniform Ergodicity
    3.0.6 Forgetting Properties
    3.0.7 Uniform Forgetting Under Strong Mixing Conditions
    3.0.8 Forgetting Under Alternative Conditions

4 Sequential Monte Carlo Methods
  4.1 Importance Sampling and Resampling
    4.1.1 Importance Sampling
    4.1.2 Sampling Importance Resampling
  4.2 Sequential Importance Sampling
    4.2.1 Sequential Implementation for HMMs
    4.2.2 Choice of the Instrumental Kernel
  4.3 Sequential Importance Sampling with Resampling
    4.3.1 Weight Degeneracy
    4.3.2 Resampling
  4.4 Complements
    4.4.1 Implementation of Multinomial Resampling
    4.4.2 Alternatives to Multinomial Resampling

II Parameter Inference

5 Maximum Likelihood Inference, Part I: Optimization Through Exact Smoothing
  5.1 Likelihood Optimization in Incomplete Data Models
    5.1.1 Problem Statement and Notations
    5.1.2 The Expectation-Maximization Algorithm
    5.1.3 Gradient-based Methods
  5.2 Application to HMMs
    5.2.1 Hidden Markov Models as Missing Data Models
    5.2.2 EM in HMMs
    5.2.3 Computing Derivatives
  5.3 The Example of Normal Hidden Markov Models
    5.3.1 EM Parameter Update Formulas
    5.3.2 Estimation of the Initial Distribution
    5.3.3 Computation of the Score and Observed Information
  5.4 The Example of Gaussian Linear State-Space Models
    5.4.1 The Intermediate Quantity of EM
  5.5 Complements
    5.5.1 Global Convergence of the EM Algorithm
    5.5.2 Rate of Convergence of EM
    5.5.3 Generalized EM Algorithms

6 Statistical Properties of the Maximum Likelihood Estimator
  6.1 A Primer on MLE Asymptotics
  6.2 Stationary Approximations
  6.3 Consistency
    6.3.1 Construction of the Stationary Conditional Log-likelihood
    6.3.2 The Contrast Function and Its Properties
  6.4 Identifiability
    6.4.1 Equivalence of Parameters
    6.4.2 Identifiability of Mixture Densities
    6.4.3 Application of Mixture Identifiability to Hidden Markov Models
  6.5 Asymptotic Normality of the Score and Convergence of the Observed Information
    6.5.1 The Score Function and Invoking the Fisher Identity
    6.5.2 Construction of the Stationary Conditional Score
    6.5.3 Weak Convergence of the Normalized Score
    6.5.4 Convergence of the Normalized Observed Information
    6.5.5 Asymptotics of the Maximum Likelihood Estimator
  6.6 Applications to Likelihood-based Tests

III Background and Complements

7 Elements of Markov Chain Theory
  7.1 Chains on Countable State Spaces
    7.1.1 Irreducibility
    7.1.2 Recurrence and Transience
    7.1.3 Invariant Measures and Stationarity
    7.1.4 Ergodicity
  7.2 Chains on General State Spaces
    7.2.1 Irreducibility
    7.2.2 Recurrence and Transience
    7.2.3 Invariant Measures and Stationarity
    7.2.4 Ergodicity
    7.2.5 Geometric Ergodicity and Foster-Lyapunov Conditions
    7.2.6 Limit Theorems
  7.3 Applications to Hidden Markov Models
    7.3.1 Phi-irreducibility
    7.3.2 Atoms and Small Sets
    7.3.3 Recurrence and Positive Recurrence
Chapter 1

Main Definitions and Notations

We now formally describe hidden Markov models, setting the notations that will be
used throughout the book. We start by reviewing the basic definitions and concepts
pertaining to Markov chains.

1.1 Markov Chains


1.1.1 Transition Kernels
Definition 1 (Transition Kernel). Let (X, X ) and (Y, Y) be two measurable spaces.
An unnormalized transition kernel from (X, X ) to (Y, Y) is a function Q : X × Y →
[0, ∞] that satisfies

(i) for all x ∈ X, Q(x, ·) is a positive measure on (Y, Y);

(ii) for all A ∈ Y, the function x ↦ Q(x, A) is measurable.

If Q(x, Y) = 1 for all x ∈ X, then Q is called a transition kernel, or simply a kernel.


If X = Y and Q(x, X) = 1 for all x ∈ X, then Q will also be referred to as a Markov
transition kernel on (X, X ).
An (unnormalized) transition kernel Q is said to admit a density with respect to
the positive measure µ on Y if there exists a non-negative function q : X×Y → [0, ∞],
measurable with respect to the product σ-field X ⊗ Y, such that
    Q(x, A) = ∫_A q(x, y) µ(dy) ,    A ∈ Y .

The function q is then referred to as an (unnormalized) transition density function.


When X and Y are countable sets it is customary to write Q(x, y) as a short-
hand notation for Q(x, {y}), and Q is generally referred to as a transition matrix
(whether or not X and Y are finite sets).

We summarize below some key properties of transition kernels, introducing im-


portant pieces of notation that are used in the following.

• Let Q and R be unnormalized transition kernels from (X, X ) to (Y, Y) and


from (Y, Y) to (Z, Z), respectively. The product QR, defined by

    QR(x, A) := ∫ Q(x, dy) R(y, A) ,    x ∈ X, A ∈ Z ,

is then an unnormalized transition kernel from (X, X ) to (Z, Z). If Q and R


are transition kernels, then so is QR, that is, QR(x, Z) = 1 for all x ∈ X.

• If Q is an (unnormalized) Markov transition kernel on (X, X ), its iterates are
  defined inductively by

      Q^0(x, ·) = δ_x for x ∈ X    and    Q^k = Q Q^{k−1} for k ≥ 1 .

  These iterates satisfy the Chapman-Kolmogorov equation: Q^{n+m} = Q^n Q^m
  for all n, m ≥ 0. That is, for all x ∈ X and A ∈ X ,

      Q^{n+m}(x, A) = ∫ Q^n(x, dy) Q^m(y, A) .    (1.1)

  If Q admits a density q with respect to the measure µ on (X, X ), then for
  all n ≥ 2 the kernel Q^n is also absolutely continuous with respect to µ. The
  corresponding transition density is

      q_n(x, y) = ∫_{X^{n−1}} q(x, x1 ) · · · q(x_{n−1}, y) µ(dx1 ) · · · µ(dx_{n−1}) .    (1.2)

• Positive measures operate on (unnormalized) transition kernels in two different


ways. If µ is a positive measure on (X, X ), the positive measure µQ on (Y, Y)
is defined by

    µQ(A) := ∫ µ(dx) Q(x, A) ,    A ∈ Y .

Moreover, the measure µ ⊗ Q on the product space (X × Y, X ⊗ Y) is defined by

    µ ⊗ Q(C) := ∬_C µ(dx) Q(x, dy) ,    C ∈ X ⊗ Y .

If µ is a probability measure and Q is a transition kernel, then µQ and µ ⊗ Q


are probability measures.

• (Unnormalized) transition kernels operate on functions. Let f be a real mea-


surable function on Y. The real measurable function Qf on X is defined by
    Qf(x) := ∫ Q(x, dy) f(y) ,    x ∈ X ,
provided the integral is well-defined. It will sometimes be more convenient


to use the alternative notation Q(x, f ) instead of Qf (x). In particular, for
x ∈ X and A ∈ Y, Q(x, A), δx Q(A), Q1A (x), and Q(x, 1A ), where 1A denotes
the indicator function of the set A, are four equivalent ways of denoting the
same quantity. In general, we prefer using the Q(x, 1A ) and Q(x, A) variants,
which are less prone to confusion in complicated expressions.

• For any positive measure µ on (X, X ) and any real measurable function f on
(Y, Y),
    (µQ)(f) = µ(Qf) = ∬ µ(dx) Q(x, dy) f(y) ,

provided the integrals are well-defined. We may thus use the simplified nota-
tion µQf instead of (µQ)(f ) or µ(Qf ).
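When X and Y are finite sets, these operations reduce to matrix algebra. The following is a minimal numpy sketch (the matrices and numbers are our own illustrative choices, not from the book) of the products µQ and Qf and of the Chapman-Kolmogorov equation (1.1):

    import numpy as np

    # Minimal sketch: on a finite space, a transition kernel is a
    # row-stochastic matrix and the operations above become matrix algebra.
    Q = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.5, 0.3],
                  [0.0, 0.4, 0.6]])   # Q[x, y] = Q(x, {y}); rows sum to 1

    mu = np.array([0.5, 0.3, 0.2])    # a probability measure on X
    f = np.array([1.0, -1.0, 2.0])    # a function f : X -> R

    muQ = mu @ Q                      # the measure muQ
    Qf = Q @ f                        # the function Qf
    assert np.isclose(muQ @ f, mu @ Qf)   # (muQ)(f) = mu(Qf) = "muQf"

    # Chapman-Kolmogorov equation (1.1): Q^{n+m} = Q^n Q^m.
    n, m = 2, 3
    assert np.allclose(np.linalg.matrix_power(Q, n + m),
                       np.linalg.matrix_power(Q, n) @ np.linalg.matrix_power(Q, m))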

Definition 2 (Reverse Kernel). Let Q be a transition kernel from (X, X ) to (Y, Y)
and let ν be a probability measure on (X, X ). The reverse kernel ←Q_ν associated
to ν and Q is a transition kernel from (Y, Y) to (X, X ) such that for all bounded
measurable functions f defined on X × Y,

    ∬_{X×Y} f(x, y) ν(dx) Q(x, dy) = ∬_{X×Y} f(x, y) νQ(dy) ←Q_ν(y, dx) .    (1.3)

The reverse kernel does not necessarily exist and is not uniquely defined. Never-
theless, if ←Q_{ν,1} and ←Q_{ν,2} satisfy (1.3), then for all A ∈ X , ←Q_{ν,1}(y, A) =
←Q_{ν,2}(y, A) for νQ-almost every y in Y. The reverse kernel does exist if X and Y
are Polish spaces endowed with their Borel σ-fields. If Q admits a density q with
respect to a measure µ on (Y, Y), then ←Q_ν can be defined for all y such that
∫_X q(z, y) ν(dz) ≠ 0 by

    ←Q_ν(y, dx) = q(x, y) ν(dx) / ∫_X q(z, y) ν(dz) .    (1.4)

The values of ←Q_ν on the set {y ∈ Y : ∫_X q(z, y) ν(dz) = 0} are irrelevant because
this set is νQ-negligible. In particular, if X is discrete and µ is counting measure,
then for all (x, y) ∈ X × Y such that νQ(y) ≠ 0,

    ←Q_ν(y, x) = ν(x) Q(x, y) / νQ(y) .    (1.5)
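On a finite state space, (1.5) is simply Bayes' rule applied to the pair (X0, X1). A minimal numpy sketch (our own illustrative numbers):

    import numpy as np

    # Reverse kernel via (1.5): Q_rev[y, x] = nu(x) Q(x, y) / (nu Q)(y).
    Q = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
    nu = np.array([0.6, 0.4])

    nuQ = nu @ Q                                  # distribution of X_1
    Q_rev = (nu[:, None] * Q).T / nuQ[:, None]    # rows indexed by y, columns by x
    assert np.allclose(Q_rev.sum(axis=1), 1.0)    # each row is a probability measure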

1.1.2 Homogeneous Markov Chains


Let (Ω, F, P) be a probability space and let (X, X ) be a measurable space. An
X-valued (discrete index) stochastic process {Xn }n≥0 is a collection of X-valued
random variables. A filtration of (Ω, F) is a non-decreasing sequence {Fn }n≥0 of
sub-σ-fields of F. A filtered space is a triple (Ω, F, F), where F is a filtration;
(Ω, F, F, P) is called a filtered probability space. For any filtration F = {Fn }n≥0 , we
denote by F_∞ = ∨_{n=0}^∞ F_n the σ-field generated by F or, in other words, the minimal
σ-field containing F. A stochastic process {Xn }n≥0 is adapted to F = {Fn }n≥0 , or
simply F-adapted, if Xn is Fn -measurable for all n ≥ 0. The natural filtration of a
process {Xn }n≥0 , denoted by F^X = {F_n^X }n≥0 , is the smallest filtration with respect
to which {Xn } is adapted.
Definition 3 (Markov Chain). Let (Ω, F, F, P) be a filtered probability space and
let Q be a Markov transition kernel on a measurable space (X, X ). An X-valued
stochastic process {Xk }k≥0 is said to be a Markov chain under P, with respect to
the filtration F and with transition kernel Q, if it is F-adapted and for all k ≥ 0 and
A ∈ X,

P (Xk+1 ∈ A | Fk ) = Q(Xk , A) . (1.6)

The distribution of X0 is called the initial distribution of the chain, and X is called
the state space.
If {Xk }k≥0 is F-adapted, then for all k ≥ 0 it holds that FkX ⊆ Fk ; hence a
Markov chain with respect to a filtration F is also a Markov chain with respect to
its natural filtration. Hereafter, a Markov chain with respect to its natural filtration
will simply be referred to as a Markov chain. When there is no risk of confusion,
we will not mention the underlying probability measure P.
A fundamental property of a Markov chain is that its finite-dimensional distri-
butions, and hence the distribution of the process {Xk }k≥0 , are entirely determined
by the initial distribution and the transition kernel.

Proposition 4. Let {Xk }k≥0 be a Markov chain with initial distribution ν and
transition kernel Q. For any k ≥ 0 and any bounded X^{⊗(k+1)}-measurable function
f on X^{k+1},

    Eν [f(X0 , . . . , Xk )] = ∫ f(x0 , . . . , xk ) ν(dx0 ) Q(x0 , dx1 ) · · · Q(xk−1 , dxk ) .

In the following, we will use the generic notation f ∈ Fb (Z) to denote the fact
that f is a measurable bounded function on (Z, Z). In the case of Proposition 4 for
instance, one considers functions f that are in Fb(X^{k+1}). More generally, we will
usually describe measures and transition kernels on (Z, Z) by specifying the way
they operate on the functions of Fb (Z).
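Proposition 4 also suggests how a trajectory of the chain is simulated: draw X0 from ν, then iteratively draw the next state from Q. A minimal sketch for a finite state space (the function name and numbers are ours, not the book's):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_chain(nu, Q, k):
        path = [rng.choice(len(nu), p=nu)]                       # X_0 ~ nu
        for _ in range(k):
            path.append(rng.choice(Q.shape[1], p=Q[path[-1]]))   # X_{l+1} ~ Q(X_l, .)
        return np.array(path)

    Q = np.array([[0.9, 0.1], [0.3, 0.7]])
    nu = np.array([0.5, 0.5])
    print(sample_chain(nu, Q, k=10))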

Canonical Version

Let (X, X ) be a measurable space. The canonical space associated to (X, X ) is


the infinite-dimensional product space (X^N, X^{⊗N}). The coordinate process is the X-
valued stochastic process {Xk }k≥0 defined on the canonical space by X_n(ω) = ω(n).
The canonical space will always be endowed with the natural filtration F^X of the
coordinate process.
Let (Ω, F) = (X^N, X^{⊗N}) be the canonical space associated to the measurable
space (X, X ). The shift operator θ : Ω → Ω is defined by

θ(ω)(n) = ω(n + 1) , n≥0.

The iterates of the shift operator are defined inductively by θ^0 = Id (the identity),
θ^1 = θ and θ^k = θ ◦ θ^{k−1} for k ≥ 1. If {Xk }k≥0 is the coordinate process with
associated natural filtration F^X, then for all k, n ≥ 0, Xk ◦ θ^n = X_{k+n}, and more
generally, for any F_k^X-measurable random variable Y , Y ◦ θ^n is F_{n+k}^X-measurable.
The following theorem, which is a particular case of the Kolmogorov consistency
theorem, states that it is always possible to define a Markov chain on the canonical
space.

Theorem 5. Let (X, X ) be a measurable space, ν a probability measure on (X, X ), and
Q a transition kernel on (X, X ). Then there exists a unique probability measure on
(X^N, X^{⊗N}), denoted by Pν , such that the coordinate process {Xk }k≥0 is a Markov
chain (with respect to its natural filtration) with initial distribution ν and transition
kernel Q.

For x ∈ X, let Px be an alternative simplified notation for P_{δ_x}. Then for all A ∈
X^{⊗N}, the mapping x ↦ Px (A) is X -measurable, and for any probability measure ν
on (X, X ),

    Pν (A) = ∫ ν(dx) Px (A) .    (1.7)

The Markov chain defined in Theorem 5 is referred to as the canonical version


of the Markov chain. The probability Pν defined on (X^N, X^{⊗N}) depends on ν and
on the transition kernel Q. Nevertheless, the dependence with respect to Q is
traditionally omitted in the notation. The relation (1.7) implies that x ↦ Px is a
regular version of the conditional probability Pν ( · | X0 = x), in the sense that one
can rewrite (1.6) as

    Pν (Xk+1 ∈ A | F_k^X) = Pν (X1 ◦ θ^k ∈ A | F_k^X) = P_{X_k}(X1 ∈ A)    Pν -a.s.
Markov Properties

More generally, an induction argument easily yields the Markov property: for any
F_∞^X-measurable random variable Y ,

    Eν [Y ◦ θ^k | F_k^X] = E_{X_k}[Y ]    Pν -a.s.    (1.8)

The Markov property can be extended to a specific class of random times known as
stopping times. Let N̄ = N∪{+∞} denote the extended integer set and let (Ω, F, F)
be a filtered space. Then, a mapping τ : Ω → N̄ is said to be an F-stopping time if
{τ = n} ∈ Fn for all n ≥ 0. Intuitively, this means that at any time n one should
be able to tell, based on the information Fn available at that time, if the stopping
time occurs at this time n (or before then) or not. The class Fτ defined by

Fτ = {B ∈ F∞ : B ∩ {τ = n} ∈ Fn for all n ≥ 0} ,

is a σ-field, referred to as the σ-field of the events occurring before τ .

Theorem 6 (Strong Markov Property). Let {Xk }k≥0 be the canonical version of
a Markov chain and let τ be an F^X-stopping time. Then for any bounded F_∞^X-measurable
function Ψ,

    Eν [1_{τ<∞} Ψ ◦ θ^τ | F_τ^X] = 1_{τ<∞} E_{X_τ}[Ψ]    Pν -a.s.    (1.9)

We note that an F_∞^X-measurable function, or random variable, Ψ is typically a
function of potentially the whole trajectory of the Markov chain, although it may
of course be a rather simple function like X1 or X2 + X3^2.

1.1.3 Non-homogeneous Markov Chains


Definition 7 (Non-homogeneous Markov Chain). Let (Ω, F, F, P) be a filtered prob-
ability space and let {Qk }k≥0 be a family of transition kernels on a measurable space
(X, X ). An X-valued stochastic process {Xk }k≥0 is said to be a non-homogeneous
Markov chain under P, with respect to the filtration F and with transition kernels
{Qk }, if it is F-adapted and for all k ≥ 0 and A ∈ X ,

P(Xk+1 ∈ A | Fk ) = Qk (Xk , A) .

For i ≤ j we define

    Q_{i,j} = Q_i Q_{i+1} · · · Q_j .

With this notation, if ν denotes the distribution of X0 (which we refer to as the
initial distribution as in the homogeneous case), the distribution of Xn is νQ_{0,n−1}.
An important example of a non-homogeneous Markov chain is the so-called reverse
chain. The construction of the reverse chain is based on the observation that if
{Xk }k≥0 is a Markov chain, then for any index n ≥ 1 the time-reversed (or, index-
reversed) process {X_{n−k}}_{k=0}^{n} is a Markov chain too. The definition below provides
its transition kernels.

Definition 8 (Reverse Chain). Let Q be a Markov kernel on some space X, let ν be
a probability measure on this space, and let n ≥ 1 be an index. The reverse chain is
the non-homogeneous Markov chain with initial distribution νQ^n, (time) index set
k = 0, 1, . . . , n, and transition kernels

    Q_k = ←Q_{νQ^{n−k−1}} ,    k = 0, . . . , n − 1 ,

assuming that the reverse kernels are indeed well-defined.


If the transition kernel Q admits a transition density function q with respect
to a measure µ on (X, X ), then Q_k also admits a density with respect to the same
measure µ, namely

    h_k(y, x) = ∫ q_{n−k−1}(z, x) q(x, y) ν(dz) / ∫ q_{n−k}(z, y) ν(dz) .    (1.10)

Here, q_l is the transition density function of Q^l with respect to µ, as defined in (1.2).
If the state space is countable, then

    Q_k(y, x) = νQ^{n−k−1}(x) Q(x, y) / νQ^{n−k}(y) .    (1.11)
An interesting question is in what cases the kernels Qk do not depend on the
index k and are in fact all equal to the forward kernel Q. A Markov chain with
this property is said to be reversible. The following result gives a necessary and
sufficient condition for reversibility.
Theorem 9. Let X be a Polish space. A Markov kernel Q on X is reversible with
respect to a probability measure ν if and only if for all bounded measurable functions
f on X × X,
    ∬ f(x, x′) ν(dx) Q(x, dx′) = ∬ f(x, x′) ν(dx′) Q(x′, dx) .    (1.12)

The relation (1.12) is referred to as the local balance equations (or detailed bal-
ance equations). If the state space is countable, these equations hold if for all
x, x′ ∈ X,

    ν(x) Q(x, x′) = ν(x′) Q(x′, x) .    (1.13)
Upon choosing a function f that only depends on the second variable in (1.12),
it is easily seen that νQ(f ) = ν(f ) for all functions f ∈ Fb (X). We can also write
this as ν = νQ. This equation is referred to as the global balance equations. By
induction, we find that νQ^n = ν for all n ≥ 0. The left-hand side of this equation
is the distribution of Xn , which thus does not depend on n when global balance
holds. This is a form of stationarity, obviously implied by local balance. We shall
tie this form of stationarity to the following customary definition.
Definition 10 (Stationary Process). A stochastic process {Xk } is said to be sta-
tionary (under P) if its finite-dimensional distributions are translation invariant,
that is, if for all k, n ≥ 1 and all indices n1 , . . . , nk , the distribution of the random
vector (X_{n1+n}, . . . , X_{nk+n}) does not depend on n.
A stochastic process with index set N, stationary but otherwise general, can
always be extended to a process with index set Z, having the same finite-dimensional
distributions (and hence being stationary). This is a consequence of Kolmogorov’s
existence theorem for stochastic processes.
For a Markov chain, any multi-dimensional distribution can be expressed in
terms of the initial distribution and the transition kernel—this is Proposition 4—
and hence the characterization of stationarity becomes much simpler than above.
Indeed, a Markov chain is stationary if and only if its initial distribution ν and
transition kernel Q satisfy νQ = ν, that is, satisfy global balance. Much more will
be said about stationary distributions of Markov chains in Chapter 7.
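For a finite state space, both balance conditions are immediate to check numerically. A minimal numpy sketch, using an arbitrary birth-death kernel of our own as an example of a reversible chain:

    import numpy as np

    Q = np.array([[0.5, 0.5, 0.0],
                  [0.25, 0.5, 0.25],
                  [0.0, 0.5, 0.5]])

    # Stationary distribution: normalized left eigenvector for eigenvalue 1.
    w, v = np.linalg.eig(Q.T)
    nu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    nu = nu / nu.sum()

    assert np.allclose(nu @ Q, nu)       # global balance: nu Q = nu
    M = nu[:, None] * Q                  # M[x, y] = nu(x) Q(x, y)
    assert np.allclose(M, M.T)           # local balance (1.13)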

1.2 Hidden Markov Models


A hidden Markov model is a doubly stochastic process with an underlying stochastic
process that is not directly observable (it is “hidden”) but can be observed only
through another stochastic process that produces the sequence of observations.

1.2.1 Definitions and Notations


In simple cases such as fully discrete models, it is common to define hidden Markov
models by using the concept of conditional independence. It turns out that condi-
tional independence is mathematically more difficult to define in general settings (in
particular, when the state space X of the Markov chain is not countable), and we
will adopt a different route to define general hidden Markov models. The HMM is
defined as a bivariate Markov chain, only partially observed though, whose transi-
tion kernel has a special structure. Indeed, its transition kernel should be such that
both the joint process {Xk , Yk }k≥0 and the marginal unobservable (or hidden) chain
{Xk }k≥0 are Markovian. From this definition, the usual conditional independence
properties of HMMs will then follow (see Corollary 15 below).

Definition 11 (Hidden Markov Model). Let (X, X ) and (Y, Y) be two measurable
spaces and let Q and G denote, respectively, a Markov transition kernel on (X, X )
and a transition kernel from (X, X ) to (Y, Y). Consider the Markov transition kernel
defined on the product space (X × Y, X ⊗ Y) by

    T[(x, y), C] = ∬_C Q(x, dx′) G(x′, dy′) ,    (x, y) ∈ X × Y, C ∈ X ⊗ Y .    (1.14)

The Markov chain {Xk , Yk }k≥0 with Markov transition kernel T and initial distri-
bution ν ⊗ G, where ν is a probability measure on (X, X ), is called a hidden Markov
model.

Although the definition above concerns the joint process {Xk , Yk }k≥0 , the term
hidden is only justified in cases where {Xk }k≥0 is not observable. In this respect,
{Xk }k≥0 can also be seen as a fictitious intermediate process that is useful only
in defining the distribution of the observed process {Yk }k≥0 . We shall denote by
Pν and Eν the probability measure and corresponding expectation associated with
the process {Xk , Yk }k≥0 on the canonical space ((X × Y)^N, (X ⊗ Y)^{⊗N}). Notice that


this constitutes a slight departure from the Markov notations introduced previously,
as ν is a probability measure on X only and not on the state space X × Y of the
joint process. This slight abuse of notation is justified by the special structure of
the model considered here. Equation (1.14) shows that whatever the distribution
of the initial joint state (X0 , Y0 ), even if it were not of the form ν ⊗ G, the law of
{Xk , Yk }k≥1 only depends on the marginal distribution of X0 . Hence it makes sense
to index probabilities and expectations by this marginal initial distribution only.
If both X and Y are countable, the hidden Markov model is said to be discrete,
which is the case originally considered by Baum and Petrie (1966).
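Definition 11 translates directly into a sampling scheme: move the hidden state with Q, then emit an observation with G. A minimal sketch for a discrete HMM (the two-state kernels below are our own illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)

    Q = np.array([[0.95, 0.05],    # hidden-state transition kernel
                  [0.10, 0.90]])
    G = np.array([[0.8, 0.2],      # G[x, y] = probability of emitting y in state x
                  [0.3, 0.7]])
    nu = np.array([0.5, 0.5])      # initial distribution of X_0

    def sample_hmm(nu, Q, G, n):
        x = rng.choice(len(nu), p=nu)
        xs, ys = [], []
        for _ in range(n + 1):
            xs.append(x)
            ys.append(rng.choice(G.shape[1], p=G[x]))   # Y_k ~ G(X_k, .)
            x = rng.choice(Q.shape[1], p=Q[x])          # X_{k+1} ~ Q(X_k, .)
        return np.array(xs), np.array(ys)

    X, Y = sample_hmm(nu, Q, G, n=20)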

Definition 12 (Partially Dominated Hidden Markov Model). The model of Def-
inition 11 is said to be partially dominated if there exists a probability measure
µ on (Y, Y) such that for all x ∈ X, G(x, ·) is absolutely continuous with respect
to µ, G(x, ·) ≪ µ(·), with transition density function g(x, ·). Then, for A ∈ Y,
G(x, A) = ∫_A g(x, y) µ(dy) and the joint transition kernel T can be written as

    T[(x, y), C] = ∬_C Q(x, dx′) g(x′, y′) µ(dy′) ,    C ∈ X ⊗ Y .    (1.15)

In the second part of the book (Chapter 5 and following), where we consider
statistical estimation for HMMs with unknown parameters, we will require even
stronger conditions and assume that the model is fully dominated in the following
sense.

Definition 13 (Fully Dominated Hidden Markov Model). Assume that, in addition
to the requirements of Definition 12, there exists a probability measure λ on (X, X )
such that ν ≪ λ and, for all x ∈ X, Q(x, ·) ≪ λ(·) with transition density function
q(x, ·). Then, for A ∈ X , Q(x, A) = ∫_A q(x, x′) λ(dx′) and the model is said to be
fully dominated. The joint Markov transition kernel T is then dominated by the
product measure λ ⊗ µ and admits the transition density function

    t[(x, y), (x′, y′)] := q(x, x′) g(x′, y′) .    (1.16)

Note that for such models, we will generally re-use the notation ν to denote the
probability density function of the initial state X0 (with respect to λ) rather than
the distribution itself.

1.2.2 Conditional Independence in Hidden Markov Models


In this section, we will show that the “intuitive” way of thinking about an HMM,
in terms of conditional independence, is justified by Definition 11.

Proposition 14. Let {Xk , Yk }k≥0 be a Markov chain over the product space X × Y
with transition kernel T given by (1.14). Then, for any integer p, any ordered set
{k1 < · · · < kp } of indices and all functions f1 , . . . , fp ∈ Fb(Y),

    Eν [∏_{i=1}^{p} f_i(Y_{k_i}) | X_{k_1}, . . . , X_{k_p}] = ∏_{i=1}^{p} ∫_Y f_i(y) G(X_{k_i}, dy) .    (1.17)

Proof. For any h ∈ Fb(X^p), it holds that

    Eν [∏_{i=1}^{p} f_i(Y_{k_i}) h(X_{k_1}, . . . , X_{k_p})]
        = ∫ · · · ∫ ν(dx0 ) G(x0 , dy0 ) [∏_{i=1}^{k_p} Q(x_{i−1}, dx_i) G(x_i, dy_i)]
              × [∏_{i=1}^{p} f_i(y_{k_i})] h(x_{k_1}, . . . , x_{k_p})
        = ∫ · · · ∫ ν(dx0 ) [∏_{i=1}^{k_p} Q(x_{i−1}, dx_i)] h(x_{k_1}, . . . , x_{k_p})
              × [∏_{i∉{k_1,...,k_p}} ∫ G(x_i, dy_i)] [∏_{j=1}^{p} ∫ f_j(y) G(x_{k_j}, dy)] .

Because ∫ G(x_i, dy_i) = 1,

    Eν [∏_{i=1}^{p} f_i(Y_{k_i}) h(X_{k_1}, . . . , X_{k_p})]
        = Eν [h(X_{k_1}, . . . , X_{k_p}) ∏_{i=1}^{p} ∫ f_i(y) G(X_{k_i}, dy)] ,

which proves (1.17).

Corollary 15.

(i) For any integer p and any ordered set {k1 < · · · < kp } of indices, the random
variables Y_{k_1}, . . . , Y_{k_p} are Pν -conditionally independent given (X_{k_1}, X_{k_2}, . . . , X_{k_p}).

(ii) For any integers k and p and any ordered set {k1 < · · · < kp } of indices
such that k ∉ {k1 , . . . , kp }, the random variables Y_k and (X_{k_1}, . . . , X_{k_p}) are
Pν -conditionally independent given X_k.
Proof. Part (i) is an immediate consequence of Proposition 14. To prove (ii), note
that for any f ∈ Fb(Y) and h ∈ Fb(X^p),

    Eν [f(Y_k) h(X_{k_1}, . . . , X_{k_p}) | X_k]
        = Eν [ Eν [f(Y_k) | X_{k_1}, . . . , X_{k_p}, X_k] h(X_{k_1}, . . . , X_{k_p}) | X_k]
        = Eν [f(Y_k) | X_k] Eν [h(X_{k_1}, . . . , X_{k_p}) | X_k] .

The conditional independence of the observations given the underlying sequence
of states implies that for any integers p and p′, any indices k1 < · · · < kp and
k′1 < · · · < k′_{p′} such that {k1 , . . . , kp } ∩ {k′1 , . . . , k′_{p′}} = ∅, and any function
f ∈ Fb(Y^p),

    Eν [f(Y_{k_1}, . . . , Y_{k_p}) | X_{k_1}, . . . , X_{k_p}, X_{k′_1}, . . . , X_{k′_{p′}}, Y_{k′_1}, . . . , Y_{k′_{p′}}]
        = Eν [f(Y_{k_1}, . . . , Y_{k_p}) | X_{k_1}, . . . , X_{k_p}] .    (1.18)

Indeed, in terms of conditional independence of the variables,

    (Y_{k_1}, . . . , Y_{k_p}) ⊥⊥ (Y_{k′_1}, . . . , Y_{k′_{p′}}) | (X_{k_1}, . . . , X_{k_p}, X_{k′_1}, . . . , X_{k′_{p′}})    [Pν ]

and

    (Y_{k_1}, . . . , Y_{k_p}) ⊥⊥ (X_{k′_1}, . . . , X_{k′_{p′}}) | (X_{k_1}, . . . , X_{k_p})    [Pν ] .

Hence,

    (Y_{k_1}, . . . , Y_{k_p}) ⊥⊥ (X_{k′_1}, . . . , X_{k′_{p′}}, Y_{k′_1}, . . . , Y_{k′_{p′}}) | (X_{k_1}, . . . , X_{k_p})    [Pν ] ,

which implies (1.18).


Part I

State Inference
Chapter 2

Filtering and Smoothing Recursions

This chapter deals with a fundamental issue in hidden Markov modeling: given a
fully specified model and some observations Y0 , . . . , Yn , what can be said about the
corresponding unobserved state sequence X0 , . . . , Xn ? More specifically, we shall
be concerned with the evaluation of the conditional distributions of the state at
index k, Xk , given the observations Y0 , . . . , Yn , a task that is generally referred
to as smoothing. There are of course several options available for tackling this
problem (Anderson and Moore, 1979, Chapter 7) and we focus, in this chapter, on
the fixed-interval smoothing paradigm in which n is held fixed and it is desired to
evaluate the conditional distributions of Xk for all indices k between 0 and n. Note
that only the general mechanics of the smoothing problem are dealt with in this
chapter. In particular, most formulas will involve integrals over X. We shall not,
for the moment, discuss ways in which these integrals can be effectively evaluated,
or at least approximated, numerically.
The driving line of this chapter is the existence of a variety of smoothing ap-
proaches that involve a number of steps that only increase linearly with the number
of observations. This is made possible by the fact (to be made precise in Sec-
tion 2.3) that conditionally on the observations Y0 , . . . , Yn , the state sequence still
is a Markov chain, albeit a non-homogeneous one.
From a historical perspective, it is interesting to recall that most of the early
references on smoothing, which date back to the 1960s, focused on the specific case
of Gaussian linear state-space models, following the pioneering work by Kalman and
Bucy (1961). The classic book by Anderson and Moore (1979) on optimal filtering,
for instance, is fully devoted to linear state-space models—see also Chapter 10 of the
recent book by Kailath et al. (2000) for a more exhaustive set of early references on
the smoothing problem. Although some authors, such as Ho and Lee (1964),
considered more general state-space models, it is fair to say that the Gaussian
linear state-space model was the dominant paradigm in the automatic control
community.¹ In contrast, the work by Baum and his colleagues on hidden Markov
models (Baum et al., 1970) dealt with the case where the state space X of the hidden
state is finite. These two streams of research (on Gaussian linear models and finite
state space models) remained largely separated. At approximately the same time, in
the field of probability theory, the seminal work by Stratonovich (1960) stimulated
a number of contributions that were to compose a body of work generally referred to
as filtering theory. The object of filtering theory is to study inference about partially
observable Markovian processes in continuous time. A number of early references in
this domain indeed consider some specific form of discrete state space continuous-
time equivalent of the HMM (Shiryaev, 1966; Wonham, 1965)—see also Liptser
and Shiryaev (2001), Chapter 9. Working in continuous time, however, implies the
use of mathematical tools that are definitely more complex than those needed to
tackle the discrete-time model of Baum et al. (1970). As a matter of fact, filtering
theory and hidden Markov models evolved as two mostly independent fields of re-
search. A poorly acknowledged fact is that the pioneering paper by Stratonovich
(1960) (translated from an earlier Russian publication) describes, in its first sec-
tion, an equivalent to the forward-backward smoothing approach of Baum et al.
(1970). It turns out, however, that the formalism of Baum et al. (1970) generalizes
well to models where the state space is not discrete anymore, in contrast to that
of Stratonovich (1960).

¹ Interestingly, until the early 1980s, the works that did not focus on the linear state-space
model were usually advertised by the use of the words "Bayes" or "Bayesian" in their title—see,
e.g., Ho and Lee (1964) or Askar and Derin (1981).

2.1 Basic Notations and Definitions


2.1.1 Likelihood
The joint probability of the unobservable states and observations up to index n is
such that for any function f ∈ Fb((X × Y)^{n+1}),

    Eν [f(X0 , Y0 , . . . , Xn , Yn )] = ∫ · · · ∫ f(x0 , y0 , . . . , xn , yn )
        × ν(dx0 ) g(x0 , y0 ) ∏_{k=1}^{n} {Q(xk−1 , dxk ) g(xk , yk )} µ^n(dy0 , . . . , dyn ) ,    (2.1)

where µ^n denotes the product distribution µ^{⊗(n+1)} on (Y^{n+1}, Y^{⊗(n+1)}). Marginaliz-
ing with respect to the unobservable variables X0 , . . . , Xn , one obtains the marginal
distribution of the observations only,

    Eν [f(Y0 , . . . , Yn )] = ∫ · · · ∫ f(y0 , . . . , yn ) Lν,n (y0 , . . . , yn ) µ^n(dy0 , . . . , dyn ) ,    (2.2)

where Lν,n is an important quantity which we define below for future reference.
Definition 16 (Likelihood). The likelihood of the observations is the probability
density function of Y0 , Y1 , . . . , Yn with respect to µ^n, defined for all (y0 , . . . , yn ) ∈
Y^{n+1} by

    Lν,n (y0 , . . . , yn )
        = ∫ · · · ∫ ν(dx0 ) g(x0 , y0 ) Q(x0 , dx1 ) g(x1 , y1 ) · · · Q(xn−1 , dxn ) g(xn , yn ) .    (2.3)

In addition,

    ℓν,n := log Lν,n    (2.4)

is referred to as the log-likelihood function.
Remark 17 (Concise Notation for Sub-sequences). For the sake of conciseness, we
will use in the following the notation Yl:m to denote the collection of consecutively
indexed variables Yl , . . . , Ym wherever possible (proceeding the same way for the
unobservable sequence {Xk }). In quoting (2.3) for instance, we shall write Lν,n (y0:n )
rather than Lν,n (y0 , . . . , yn ). By transparent convention, Yk:k refers to the single
variable Yk , although the second notation (Yk ) is to be preferred in this particular
case. In systematic expressions, however, it may be helpful to understand Yk:k as a


valid replacement of Yk . For similar reasons, we shall, when needed, accept Yk+1:k as
a valid empty set. The latter convention should easily be recalled by programmers,
as instructions of the form “for i equals k+1 to k, do...”, which do nothing, constitute
a well-accepted ingredient of most programming idioms.

2.1.2 Smoothing
We first define generically what is meant by the word smoothing before deriving
the basic results that form the core of the techniques discussed in the rest of the
chapter.
Definition 18 (Smoothing, Filtering, Prediction). For positive indices k, l, and n
with l ≥ k, denote by φν,k:l|n the conditional distribution of Xk:l given Y0:n , that is,

(a) φν,k:l|n is a transition kernel from (Y^{n+1}, Y^{⊗(n+1)}) to (X^{l−k+1}, X^{⊗(l−k+1)}):

    • for any given set A ∈ X^{⊗(l−k+1)}, y0:n ↦ φν,k:l|n (y0:n , A) is a Y^{⊗(n+1)}-
      measurable function,
    • for any given sub-sequence y0:n , A ↦ φν,k:l|n (y0:n , A) is a probability dis-
      tribution on (X^{l−k+1}, X^{⊗(l−k+1)}).

(b) φν,k:l|n satisfies, for any function f ∈ Fb(X^{l−k+1}),

    Eν [f(Xk:l ) | Y0:n ] = ∫ · · · ∫ f(xk:l ) φν,k:l|n (Y0:n , dxk:l ) ,

where the equality holds Pν -almost surely. Specific choices of k and l give rise to
several particular cases of interest:
Joint Smoothing: φν,0:n|n , for n ≥ 0;

(Marginal) Smoothing: φν,k|n for n ≥ k ≥ 0;

Prediction: φν,n+1|n for n ≥ 0. In describing algorithms, it will be convenient to
extend our notation to use φν,0|−1 as a synonym for the initial distribution ν;

p-step Prediction: φν,n+p|n for n, p ≥ 0;

Filtering: φν,n|n for n ≥ 0. Because the use of filtering will be preeminent in the
following, we shall most often abbreviate φν,n|n to φν,n .
In more precise terms, φν,k:l|n is a version of the conditional distribution of Xk:l
given Y0:n . It is however not obvious that such a quantity indeed exists in great
generality. The proposition below complements Definition 18 by a constructive
approach to defining the smoothing quantities from the elements of the hidden
Markov model.
Proposition 19. Consider a hidden Markov model compatible with Definition 12,
let n be a positive integer, and let y0:n ∈ Y^{n+1} be a sub-sequence such that
Lν,n (y0:n ) > 0. The joint smoothing distribution φν,0:n|n then satisfies

    φν,0:n|n (y0:n , f) = Lν,n (y0:n )^{−1} ∫ · · · ∫ f(x0:n )
        × ν(dx0 ) g(x0 , y0 ) ∏_{k=1}^{n} Q(xk−1 , dxk ) g(xk , yk )    (2.5)

for all functions f ∈ Fb(X^{n+1}). Likewise, for indices p ≥ 0,

    φν,0:n+p|n (y0:n , f) = ∫ · · · ∫ f(x0:n+p )
        × φν,0:n|n (y0:n , dx0:n ) ∏_{k=n+1}^{n+p} Q(xk−1 , dxk )    (2.6)

for all functions f ∈ Fb(X^{n+p+1}).
Proof. Equation (2.5) defines φν,0:n|n in a way that obviously satisfies part (a) of
Definition 18. To prove the (b) part of the definition, consider a function h ∈
Fb(Y^{n+1}). By (2.1),

    Eν [h(Y0:n ) f(X0:n )] = ∫ · · · ∫ h(y0:n ) f(x0:n )
        × ν(dx0 ) g(x0 , y0 ) [∏_{k=1}^{n} Q(xk−1 , dxk ) g(xk , yk )] µ^n(dy0:n ) .

Using Definition 16 of the likelihood Lν,n and (2.5) for φν,0:n|n yields

    Eν [h(Y0:n ) f(X0:n )] = ∫ · · · ∫ h(y0:n ) φν,0:n|n (y0:n , f) Lν,n (y0:n ) µ^n(dy0:n )
        = Eν [h(Y0:n ) φν,0:n|n (Y0:n , f)] .    (2.7)

Hence Eν [f(X0:n ) | Y0:n ] equals φν,0:n|n (Y0:n , f), Pν -a.s., for any function f ∈ Fb(X^{n+1}).
For (2.6), proceed similarly and consider two functions f ∈ Fb(X^{n+p+1}) and
h ∈ Fb(Y^{n+1}). First apply (2.1) to obtain

    Eν [h(Y0:n ) f(X0:n+p )] = ∫ · · · ∫ f(x0:n+p )
        × [ν(dx0 ) g(x0 , y0 ) ∏_{k=1}^{n} Q(xk−1 , dxk ) g(xk , yk )] h(y0:n )
        × [∏_{l=n+1}^{n+p} Q(xl−1 , dxl ) g(xl , yl )] µ^{n+p}(dy0:n+p ) .

When integrating with respect to the sub-sequence yn+1:n+p , the third line of the pre-
vious equation reduces to ∏_{l=n+1}^{n+p} Q(xl−1 , dxl ) µ^n(dy0:n ). Finally use (2.3) and (2.5)
to obtain

    Eν [h(Y0:n ) f(X0:n+p )] = ∫ · · · ∫ h(y0:n ) f(x0:n+p )
        × φν,0:n|n (y0:n , dx0:n ) [∏_{k=n+1}^{n+p} Q(xk−1 , dxk )] Lν,n (y0:n ) µ^n(dy0:n ) ,    (2.8)

which concludes the proof.


Proposition 19 also implicitly defines all other particular cases of smoothing
kernels mentioned in Definition 18, as these are obtained by marginalization. For
instance, the marginal smoothing kernel φν,k|n for 0 ≤ k ≤ n is such that for any
y0:n ∈ Yn+1 and f ∈ Fb (X),
Z Z
def
φν,k|n (y0:n , f ) = · · · f (xk ) φν,0:n|n (y0:n , dx0:n ) , (2.9)
2.1. BASIC NOTATIONS AND DEFINITIONS 17

where φν,0:n|n is defined by (2.5).


Likewise, for any given y0:n ∈ Yn+1 , the p-step predictive distribution φν,n+p|n (y0:n , ·)
may be obtained by marginalization of the joint distribution φν,0:n+p|n (y0:n , ·) with
respect to all variables xk except the last one (the one with index k = n + p).
A closer examination of (2.6) together with the use of the Chapman-Kolmogorov
equations introduced in (1.1) (cf. Chapter 7) directly shows that φν,n+p|n (y0:n , ·) =
φν,n (y0:n , ·)Qp , where φν,n refers to the filter (conditional distribution of Xn given
Y0:n ).

2.1.3 The Forward-Backward Decomposition


Replacing φν,0:n|n in (2.9) by its expression given in (2.5) shows that it is always
possible to rewrite φν,k|n (y0:n , f), for functions f ∈ Fb(X), as

    φν,k|n (y0:n , f) = Lν,n (y0:n )^{−1} ∫ f(x) αν,k (y0:k , dx) βk|n (yk+1:n , x) ,    (2.10)

where αν,k and βk|n are defined below in (2.11) and (2.12), respectively. In simple
terms, αν,k corresponds to the factors in the multiple integral that are to be inte-
grated with respect to the state variables xl with indices l ≤ k, while βk|n gathers
the remaining factors (which are to be integrated with respect to xl for l > k). This
simple splitting of the multiple integration in (2.9) constitutes the forward-backward
decomposition.
Definition 20 (Forward-Backward "Variables"). For k ∈ {0, . . . , n}, define the
following quantities.

Forward Kernel αν,k is the non-negative finite kernel from (Y^{k+1}, Y^{⊗(k+1)}) to
(X, X ) such that

    αν,k (y0:k , f) = ∫ · · · ∫ f(xk ) ν(dx0 ) g(x0 , y0 ) ∏_{l=1}^{k} Q(xl−1 , dxl ) g(xl , yl ) ,    (2.11)

with the convention that the rightmost product term is empty for k = 0.

Backward Function βk|n is the non-negative measurable function on Y^{n−k} × X
defined by

    βk|n (yk+1:n , x) = ∫ · · · ∫ Q(x, dxk+1 ) g(xk+1 , yk+1 ) ∏_{l=k+2}^{n} Q(xl−1 , dxl ) g(xl , yl ) ,    (2.12)

for k ≤ n − 1 (with the same convention that the rightmost product is empty for
k = n − 1); βn|n (·) is set to the constant function equal to 1 on X.
The term “forward and backward variables” as well as the use of the symbols
α and β is part of the HMM credo and dates back to the seminal work of Baum
and his colleagues (Baum et al., 1970, p. 168). It is clear however that for a general
model as given in Definition 12, these quantities as defined in (2.11) and (2.12)
are very different in nature, and indeed sufficiently so to prevent the use of the
loosely defined term “variable”. In the original framework studied by Baum and
his coauthors where X is a finite set, both the forward measures αν,k (y0:k , ·) and the
backward functions βk|n (yk+1:n , ·) can be represented by vectors with non-negative
entries. Indeed, in this case αν,k (y0:k , x) has the interpretation Pν (Y0 = y0 , . . . , Yk =
yk , Xk = x) while βk|n (yk+1:n , x) has the interpretation P(Yk+1 = yk+1 , . . . , Yn =
yn | Xk = x). This way of thinking of αν,k and βk|n may be extended to general state
spaces: αν,k (y0:k , dx) is then the joint density (with respect to µ^{⊗(k+1)}) of Y0 , . . . , Yk
and distribution of Xk , while βk|n (yk+1:n , x) is the conditional joint density (with
respect to µ^{⊗(n−k)}) of Yk+1 , . . . , Yn given Xk = x. Obviously, these entities may then
not be represented as vectors of finite length, as when X is finite; this situation is
the exception rather than the rule.
Let us simply remark at this point that while the forward kernel at index k
is defined irrespectively of the length n of the observation sequence (as long as
n ≥ k), the same is not true for the backward functions. The sequence of backward
functions clearly depends on the index where the observation sequence stops. In
general, for instance, βk|n−1 differs from βk|n even if we assume that the same
sub-observation sequence y0:n−1 is considered in both cases. This is the reason for
adding the terminal index n to the notation used for the backward functions. This
notation also constitutes a departure from HMM traditions in which the backward
functions are simply indexed by k. For αν,k , the situation is closer to standard
practice and we simply add the subscript ν to recall that the forward kernel αν,k , in
contrast with the backward functions, does depend on the distribution ν postulated
for the initial state X0 .

2.1.4 Implicit Conditioning (Please Read This Section!)


We now pause to introduce a convention that will greatly simplify the exposition of
the material contained in the first part of the book (from this chapter on, starting
with the next section), both from terminological and notational points of view. This
convention would however generate an acute confusion in the mind of a hypothetical
reader who, having read Chapter 2 up to now, would decide to skip our friendly
encouragement to read what follows carefully.
In the rest of Part I (with the notable exception of Chapter 3), we focus on the
evaluation of quantities such as φν,0:n|n or φν,k|n for a given value of the observation
sequence y0:n . In this context, we expunge from our notations the fact that all
quantities depend on y0:n . In particular, we rewrite (2.5) for any f ∈ Fb Xn+1
more concisely as
Z Z n
Y
φν,0:n|n (f ) = L−1
ν,n ··· f (x0:n ) ν(dx0 )g0 (x0 ) Q(xi−1 , dxi )gi (xi ) , (2.13)
i=1

where gk are the data-dependent functions on X defined by gk (x) := g(x, yk ) for
the particular sequence y0:n under consideration. The sequence of functions {gk }
is about the only new notation that is needed as we simply re-use the previously
defined quantities omitting their explicit dependence on the observations. For in-
stance, in addition to writing Lν,n instead of Lν,n (y0:n ), we will also use φn (·) rather
than φn (y0:n , ·), βk|n (·) rather than βk|n (yk+1:n , ·), etc. This notational simplifica-
tion implies a corresponding terminological adjustment. For instance, αν,k will be
referred to as the forward measure at index k and considered as a positive finite
measure on (X, X ). In all cases, the conversion should be easy to do mentally, as in
the case of αν,k , for instance, what is meant is really “the measure αν,k (y0:k , ·), for
a particular value of y0:k ∈ Yk+1 ”.
At first sight, omitting the observations may seem a weird thing to do in a
statistically oriented book. However, for posterior state inference in HMMs, one
indeed works conditionally on a given fixed sequence of observations. Omitting the
observations from our notation will thus allow more concise expressions in most
parts of the book. There are of course some properties of the hidden Markov
model for which dependence with respect to the distribution of the observations
does matter (hopefully!). This is in particular the case of Chapter 3 on forgetting
and of Chapter 6, which deals with statistical properties of the estimates; there we
will make the dependence with respect to the observations explicit.

2.2 Forward-Backward

The forward-backward decomposition introduced in Section 2.1.3 is just a rewriting
of the multiple integral in (2.9) such that for f ∈ Fb(X),

    φν,k|n (f) = L_{ν,n}^{−1} ∫ f(x) αν,k (dx) βk|n (x) ,    (2.14)

where

    αν,k (f) = ∫ · · · ∫ f(xk ) ν(dx0 ) g0 (x0 ) ∏_{l=1}^{k} Q(xl−1 , dxl ) gl (xl )    (2.15)

and

    βk|n (x) = ∫ · · · ∫ Q(x, dxk+1 ) gk+1 (xk+1 ) ∏_{l=k+2}^{n} Q(xl−1 , dxl ) gl (xl ) .    (2.16)

The last expression is, by convention, equal to 1 for the final index k = n. Note
that we are now using the implicit conditioning convention discussed in the previous
section.

2.2.1 The Forward-Backward Recursions

The point of using the forward-backward decomposition for the smoothing problem
is that both the forward measures αν,k and the backward functions βk|n can be
expressed recursively rather than by their integral representations (2.15) and (2.16).
This is the essence of the forward-backward algorithm proposed by Baum et al.
(1970, p. 168), which we now describe.

Proposition 21 (Forward-Backward Recursions). The forward measures defined
by (2.15) may be obtained, for all f ∈ Fb(X), recursively for k = 1, . . . , n according
to

    αν,k (f) = ∬ f(x′) αν,k−1 (dx) Q(x, dx′) gk (x′)    (2.17)

with initial condition

    αν,0 (f) = ∫ f(x) g0 (x) ν(dx) .    (2.18)

Similarly, the backward functions defined by (2.16) may be obtained, for all x ∈ X,
by the recursion

    βk|n (x) = ∫ Q(x, dx′) gk+1 (x′) βk+1|n (x′)    (2.19)

operating on decreasing indices k = n − 1 down to 0; the initial condition is

    βn|n (x) = 1 .    (2.20)

Proof. The proof of this result is straightforward and similar for both recursions.
For αν,k for instance, simply rewrite (2.15) as

    αν,k (f) = ∫ f(xk ) [∫ · · · ∫ ν(dx0 ) g0 (x0 ) ∏_{l=1}^{k−1} Q(xl−1 , dxl ) gl (xl )] Q(xk−1 , dxk ) gk (xk ) ,

where the term in brackets is recognized as αν,k−1 (dxk−1 ).
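When X is a finite set, Proposition 21 reduces to matrix-vector recursions: αν,k and βk|n become vectors and Q a matrix. A minimal numpy sketch under that assumption (the function name and array layout are ours, not the book's); the observations enter only through the vectors gk(x) = g(x, yk):

    import numpy as np

    def forward_backward(nu, Q, g):
        # g has shape (n + 1, m): g[k, x] = g_k(x) for the observed y_0:n.
        n_plus_1, m = g.shape
        alpha = np.zeros((n_plus_1, m))
        beta = np.zeros((n_plus_1, m))
        alpha[0] = nu * g[0]                          # (2.18)
        for k in range(1, n_plus_1):
            alpha[k] = (alpha[k - 1] @ Q) * g[k]      # (2.17)
        beta[-1] = 1.0                                # (2.20)
        for k in range(n_plus_1 - 2, -1, -1):
            beta[k] = Q @ (g[k + 1] * beta[k + 1])    # (2.19)
        return alpha, beta

    # By Proposition 23 below, (alpha[k] * beta[k]).sum() equals the likelihood
    # L_{nu,n} for every k, and alpha[k] * beta[k] / L_{nu,n} is the marginal
    # smoothing distribution of (2.14).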

Remark 22 (Concise Markov Chain Notations). In the following, we shall often
quote the above results using the concise Markov chain notations introduced in
Chapter 1. For instance, instead of (2.17) and (2.19) one could write more simply
αν,k (f) = αν,k−1 Q(f gk ) and βk|n = Q(gk+1 βk+1|n ). Likewise, the decomposition
(2.14) may be rewritten as

    φν,k|n (f) = L_{ν,n}^{−1} αν,k (f βk|n ) .

The main shortcoming of the forward-backward representation is that the quan-
tities αν,k and βk|n do not have an immediate probabilistic interpretation. Recall,
in particular, that the first one is a finite (positive) measure but certainly not a
probability measure, as αν,k (1) ≠ 1 (in general). There is however an important
solidarity result between the forward and backward quantities αν,k and βk|n , which
is summarized by the following proposition.

Proposition 23. For all indices k ∈ {0, . . . , n},

    αν,k (βk|n ) = Lν,n

and

    αν,k (1) = Lν,k ,

where Lν,k refers to the likelihood of the observations up to index k (included) only,
under Pν .

Proof. Because (2.14) must hold in particular for f = 1 and the marginal smoothing
distribution φν,k|n is a probability measure,

    φν,k|n (1) = 1 = L_{ν,n}^{−1} αν,k (βk|n ) .

For the final index k = n, βn|n is the constant function equal to 1 and hence
αν,n (1) = Lν,n . This observation is however not specific to the final index n, as αν,k
only depends on the observations up to index k and thus any particular index may
be selected as a potential final index (in contrast to what happens for the backward
functions).

2.2.2 Filtering and Normalized Recursion


The forward and backward quantities αν,k and βk|n , as defined in previous sections,
are unnormalized in the sense that their scales are largely unknown. On the other
hand, we know that αν,k (βk|n ) is equal to Lν,n , the likelihood of the observations
up to index n under Pν .
The long-term behavior of the likelihood Lν,n , or rather its logarithm, is a result
known as the asymptotic equipartition property, or AEP (Cover and Thomas, 1991),
in the information theoretic literature and as the Shannon-McMillan-Breiman theo-
rem in the statistical literature. For HMMs, Proposition 112 (Chapter 6) shows that
under suitable mixing conditions on the underlying unobservable chain {Xk }k≥0 ,
the AEP holds in that n^{−1} log Lν,n converges Pν -a.s. to a limit as n tends to infinity.
The likelihood Lν,n will thus either grow to infinity or shrink to zero, depending on
the sign of the limit, exponentially fast in n.
The famous tutorial by Rabiner (1989) coined the term scaling to describe a
practical solution to this problem. Interestingly, scaling also partly answers the
question of the probabilistic interpretation of the forward and backward quantities.
Scaling as described by Rabiner (1989) amounts to normalizing αν,k and βk|n
by positive real numbers to keep the numeric values needed to represent αν,k and
βk|n within reasonable bounds. There are clearly a variety of options available,
especially if one replaces (2.14) by the equivalent auto-normalized form
    φν,k|n (f) = [αν,k (βk|n )]^{−1} ∫ f(x) βk|n (x) αν,k (dx) ,    (2.21)

assuming that αν,k (βk|n ) is indeed finite and non-zero.


In our view, the most natural scaling scheme (developed below) consists in
replacing the measure αν,k and the function βk|n by scaled versions ᾱν,k and β̄k|n
of these quantities, satisfying both

(i) ᾱν,k (1) = 1, and

(ii) ᾱν,k (β̄k|n ) = 1.

Item (i) implies that the normalized forward measures ᾱν,k are probability measures
that have a probabilistic interpretation given below. Item (ii) implies that the
normalized backward functions are such that φν,k|n (f) = ∫ f(x) β̄k|n (x) ᾱν,k (dx) for
all f ∈ Fb (X), without the need for a further renormalization. We note that this
scaling scheme differs slightly from the one described by Rabiner (1989).
To derive the probabilistic interpretation of ᾱν,k , observe that (2.14) and Propo-
sition 23, instantiated for the final index k = n, imply that the filtering distribution
φν,n at index n (recall that φν,n is used as a simplified notation for φν,n|n ) may be
written [αν,n (1)]−1 αν,n . This finding is of course not specific to the choice of the
index n as already discussed when proving the second statement of Proposition 23.
Thus, the normalized version ᾱν,k of the forward measure αν,k coincides with the
filtering distribution φν,k introduced in Definition 18. This observation together
with Proposition 23 implies that there is a unique choice of scaling scheme that
satisfies the two requirements of the previous paragraph, as
    ∫ f(x) φν,k|n (dx) = L_{ν,n}^{−1} ∫ f(x) αν,k (dx) βk|n (x)
                       = ∫ f(x) [L_{ν,k}^{−1} αν,k (dx)] [L_{ν,k} L_{ν,n}^{−1} βk|n (x)]

(the two bracketed factors being ᾱν,k (dx) and β̄k|n (x), respectively)

must hold for any f ∈ Fb (X). The following definition summarizes these conclu-
sions, using the notation φν,k rather than ᾱν,k , as these two definitions refer to the
same object—the filtering distribution at index k.

Definition 24 (Normalized Forward-Backward Variables). For k ∈ {0, . . . , n}, the
normalized forward measure ᾱν,k coincides with the filtering distribution φν,k and
satisfies

    φν,k = [αν,k (1)]^{−1} αν,k = L_{ν,k}^{−1} αν,k .

The normalized backward functions β̄k|n are defined by

    β̄k|n = [αν,k (1) / αν,k (βk|n )] βk|n = [Lν,k / Lν,n ] βk|n .
The above definition would be pointless if computing αν,k and βk|n was indeed
necessary to obtain the normalized variables φν,k and β̄k|n . The following result
shows that this is not the case.

Proposition 25 (Normalized Forward-Backward Recursions).

Forward Filtering Recursion The filtering measures may be obtained, for all
f ∈ Fb(X), recursively for k = 1, . . . , n according to

    cν,k = ∬ φν,k−1 (dx) Q(x, dx′) gk (x′) ,
    φν,k (f) = c_{ν,k}^{−1} ∬ f(x′) φν,k−1 (dx) Q(x, dx′) gk (x′) ,    (2.22)

with initial condition

    cν,0 = ∫ g0 (x) ν(dx) ,
    φν,0 (f) = c_{ν,0}^{−1} ∫ f(x) g0 (x) ν(dx) .

Normalized Backward Recursion The normalized backward functions may be
obtained, for all x ∈ X, by the recursion

    β̄k|n (x) = c_{ν,k+1}^{−1} ∫ Q(x, dx′) gk+1 (x′) β̄k+1|n (x′)    (2.23)

operating on decreasing indices k = n − 1 down to 0; the initial condition is
β̄n|n (x) = 1.

Once the two recursions above have been carried out, the smoothing distribution
at any given index k ∈ {0, . . . , n} is available via

    φν,k|n (f) = ∫ f(x) β̄k|n (x) φν,k (dx)    (2.24)

for all f ∈ Fb(X).


Proof. Proceeding by forward induction for φν,k and backward induction for βk|n ,
it is easily checked from (2.22) and (2.23) that
k
!−1 n
!−1
Y Y
φν,k = cν,l αν,k and β̄k|n = cν,l βk|n . (2.25)
l=0 l=k+1

Because φν,k is normalized,


φν,k(1) = 1 = ( ∏_{l=0}^{k} cν,l )−1 αν,k(1) .

Proposition 23 then implies that for any integer k,


Lν,k = ∏_{l=0}^{k} cν,l .      (2.26)

In other words, cν,0 = Lν,0 and for subsequent indices k ≥ 1, cν,k = Lν,k /Lν,k−1 .
Hence (2.25) coincides with the normalized forward and backward variables as spec-
ified by Definition 24.
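To fix ideas, the recursions of Proposition 25 take a particularly simple form when X is finite, with Q an r × r transition matrix and the values gk(x) = g(x, Yk) collected in an array. The following is a minimal Python/NumPy sketch of (2.22)–(2.24); the function name and array layout are our own illustration, not notation from the text.

import numpy as np

def normalized_forward_backward(nu, Q, G):
    # Normalized forward-backward recursions (Proposition 25), finite case.
    # nu: initial distribution, shape (r,); Q: transition matrix, shape (r, r);
    # G[k, x] = g_k(x) = g(x, Y_k), shape (n+1, r).
    n_plus_1, r = G.shape
    phi = np.zeros((n_plus_1, r))    # filtering distributions phi_{nu,k}
    c = np.zeros(n_plus_1)           # normalizing constants c_{nu,k}
    u = nu * G[0]                    # initial condition: nu(dx) g_0(x)
    c[0] = u.sum()
    phi[0] = u / c[0]
    for k in range(1, n_plus_1):     # forward filtering recursion (2.22)
        u = (phi[k - 1] @ Q) * G[k]  # predict with Q, then weight by g_k
        c[k] = u.sum()
        phi[k] = u / c[k]
    beta_bar = np.ones((n_plus_1, r))        # initial condition: bar beta_{n|n} = 1
    for k in range(n_plus_1 - 2, -1, -1):    # normalized backward recursion (2.23)
        beta_bar[k] = Q @ (G[k + 1] * beta_bar[k + 1]) / c[k + 1]
    smooth = phi * beta_bar          # marginal smoothing distributions, via (2.24)
    return phi, beta_bar, smooth, c

Each row of smooth sums to one, since φν,k(β̄k|n) = 1 by construction.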

We now pause to state a series of remarkable consequences of Proposition 25.


Remark 26. The forward recursion in (2.22) may also be rewritten to highlight
a two-step procedure involving both the predictive and filtering measures. Recall
our convention that φν,0|−1 refers to the predictive distribution of X0 when no
observation is available and is thus an alias for ν, the distribution of X0 . For
k ∈ {0, 1, . . . , n} and f ∈ Fb (X), (2.22) may be decomposed as
cν,k = φν,k|k−1(gk) ,
φν,k(f) = c−1ν,k φν,k|k−1(f gk) ,
φν,k+1|k = φν,k Q .      (2.27)
The equivalence of (2.27) with (2.22) is straightforward and is a direct consequence
of the fact that φν,k+1|k = φν,k Q, which follows from Proposition 19 in Sec-
tion 2.1.2. In addition, each of the two steps in (2.27) has a very transparent
interpretation.
Predictor to Filter : The first two equations in (2.27) may be summarized as
φν,k(f) ∝ ∫ f(x) g(x, Yk) φν,k|k−1(dx) ,      (2.28)

where the symbol ∝ means “up to a normalization constant” (such that


φν,k (1) = 1) and the full notation g(x, Yk ) is used in place of gk (x) to highlight
the dependence on the current observation Yk . Equation (2.28) is recognized
as Bayes’ rule applied to a very simple equivalent Bayesian pseudo-model in
which
• Xk is distributed a priori according to the predictive distribution φν,k|k−1 ,
• g is the conditional probability density function of Yk given Xk .
The filter φν,k is then interpreted as the posterior distribution of Xk given Yk
in this simple equivalent Bayesian pseudo-model.

Filter to Predictor : The last equation in (2.27) simply means that the updated
predictive distribution φν,k+1|k is obtained by applying the transition kernel Q
to the current filtering distribution φν,k . We are thus left with the very basic
problem of determining the one-step distribution of a Markov chain given its
initial distribution.
Remark 27. In many situations, using (2.27) to determine φν,k is indeed the goal
rather than simply a first step in computing smoothed distributions. In particular,
for sequentially observed data, one may need to take actions based on the observa-
tions gathered so far. In such cases, filtering (or prediction) is the method of choice
for inference about the unobserved states, a topic that will be developed further in
Chapter 4.
Remark 28. Another remarkable fact about the filtering recursion is that (2.26)
together with (2.27) provides a method for evaluating the likelihood Lν,k of the ob-
servations up to index k recursively in the index k. In addition, as cν,k = Lν,k /Lν,k−1
from (2.26), cν,k may be interpreted as the conditional likelihood of Yk given the pre-
vious observations Y0:k−1 . However, as discussed at the beginning of Section 2.2.2,
using (2.26) directly is generally impracticable for numerical reasons. In order to
avoid numerical under- or overflow, one can equivalently compute the log-likelihood
ℓν,k. Combining (2.26) and (2.27) gives the important formula

ℓν,k def= log Lν,k = ∑_{l=0}^{k} log φν,l|l−1(gl) ,      (2.29)

where φν,l|l−1 is the one-step predictive distribution computed according to (2.27)


(recalling that by convention, φν,0|−1 is used as an alternative notation for ν).
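Once the forward pass of the earlier sketch has been run, (2.29) is a one-liner. The toy run below uses randomly generated stand-ins for Q and for the likelihood values gk(x), so no specific model is implied; it assumes the function normalized_forward_backward defined above.

rng = np.random.default_rng(0)
r, n = 3, 50
Q = rng.random((r, r)); Q /= Q.sum(axis=1, keepdims=True)  # row-stochastic
nu = np.full(r, 1.0 / r)
G = rng.random((n + 1, r))       # stand-in for the values g(x, Y_k)
phi, beta_bar, smooth, c = normalized_forward_backward(nu, Q, G)
loglik = np.log(c).sum()         # l_{nu,n} = sum_l log c_{nu,l}, cf. (2.29)

Working with log cν,l rather than the product (2.26) is exactly what avoids the numerical under- or overflow discussed above.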

Remark 29. The normalized backward function β̄k|n does not have a simple proba-
bilistic interpretation when isolated from the corresponding filtering measure. How-
ever, (2.24) shows that the marginal smoothing distribution, φν,k|n , is dominated
by the corresponding filtering distribution φν,k and that β̄k|n is by definition the
Radon-Nikodym derivative of φν,k|n with respect to φν,k ,

β̄k|n = dφν,k|n / dφν,k .

As a consequence,

inf{M ∈ ℝ : φν,k({β̄k|n ≥ M}) = 0} ≥ 1

and

sup{M ∈ ℝ : φν,k({β̄k|n ≤ M}) = 0} ≤ 1 ,

with the conventions inf ∅ = ∞ and sup ∅ = −∞. Hence the values of β̄k|n cannot
all be simultaneously large or simultaneously close to zero, as was the case for βk|n,
although one cannot exclude the possibility that β̄k|n still has an important dynamic
range, absent further assumptions on the model.
The normalizing factor ∏_{l=k+1}^{n} cν,l = Lν,n/Lν,k by which β̄k|n differs from the
corresponding unnormalized backward function βk|n may be interpreted as the
conditional likelihood of the future observations Yk+1:n given the observations up to
index k, Y0:k .

2.3 Markovian Decompositions


The forward-backward recursions (Proposition 21) and their normalized versions
(Proposition 25) were probably already well-known to readers familiar with the
hidden Markov model literature. A less widely observed fact is that the smooth-
ing distributions may also be expressed using Markov transitions. In contrast to
the forward-backward algorithm, this second approach will already be familiar to
readers working with dynamic (or state-space) models (Kailath et al., 2000, Chap-
ter 10). Indeed, the method to be described in Section 2.3.2, when applied to
the specific case of Gaussian linear state-space models, is known as Rauch-Tung-
Striebel (sometimes abbreviated to RTS) smoothing after Rauch et al. (1965). The
important message here is that {Xk }k≥0 (as well as the index-reversed version of
{Xk }k≥0 , although greater care is needed to handle this second case) is a non-
homogeneous Markov chain when conditioned on some observed values {Yk }0≤k≤n .
The use of this approach for HMMs with finite state spaces as an alternative to the
forward-backward recursions is due to Askar and Derin (1981)—see also (Ephraim
and Merhav, 2002, Section V) for further references.

2.3.1 Forward Decomposition


Let n be a given positive index and consider the finite-dimensional distributions of
{Xk }k≥0 given Y0:n . Our goal will be to show that the distribution of Xk given
X0:k−1 and Y0:n reduces to that of Xk given Xk−1 only and Y0:n , this for any
positive index k. The following definition will be instrumental in decomposing the
joint posterior distributions φν,0:k|n .

Definition 30 (Forward Smoothing Kernels). Given n ≥ 0, define for indices


k ∈ {0, . . . , n − 1} the transition kernels
Fk|n(x, A) def= [βk|n(x)]−1 ∫A Q(x, dx′) gk+1(x′) βk+1|n(x′)   if βk|n(x) ≠ 0 ,
Fk|n(x, A) def= 0   otherwise ,      (2.30)

for any point x ∈ X and set A ∈ X . For indices k ≥ n, simply set


Fk|n def= Q ,      (2.31)
where Q is the transition kernel of the unobservable chain {Xk }k≥0 .
Note that for indices k ≤ n − 1, Fk|n depends on the future observations Yk+1:n
through the backward variables βk|n and βk+1|n only. The subscript n in the Fk|n
notation is meant to underline the fact that, like the backward functions βk|n , the
forward smoothing kernels Fk|n depend on the final index n where the observation
sequence ends. The backward recursion of Proposition 21 implies that [βk|n (x)]−1
is the correct normalizing constant. Thus, for any x ∈ X, A ↦ Fk|n(x, A) is a
probability measure on X . Because the functions x ↦ βk|n(x) are measurable on
(X, X ), for any set A ∈ X , x ↦ Fk|n(x, A) is X /B(ℝ)-measurable. Therefore, Fk|n
is indeed a Markov transition kernel on (X, X ). The next proposition provides a
probabilistic interpretation of this definition in terms of the posterior distribution of
the state at time k + 1, given the observations up to time n and the state sequence
up to time k.
Proposition 31. Given n, for any index k ≥ 0 and function f ∈ Fb (X),
Eν [f (Xk+1 ) | X0:k , Y0:n ] = Fk|n (Xk , f ) ,
where Fk|n is the forward smoothing kernel defined by (2.30) for indices k ≤ n − 1
and (2.31) for indices k ≥ n.
Proof. First consider an index 0 ≤ k ≤ n and let f and h denote functions in Fb(X)
and Fb(X^{k+1}), respectively. Then

Eν[f(Xk+1)h(X0:k) | Y0:n] = ∫···∫ f(xk+1) h(x0:k) φν,0:k+1|n(dx0:k+1) ,

which, using (2.13) and the definition (2.16) of the backward function, expands to

L−1ν,n ∫···∫ h(x0:k) ν(dx0) g0(x0) ∏_{i=1}^{k} Q(xi−1, dxi) gi(xi)
    × ∫ Q(xk, dxk+1) f(xk+1) gk+1(xk+1)
    × [ ∫···∫ ∏_{i=k+2}^{n} Q(xi−1, dxi) gi(xi) ] ,      (2.32)

the bracketed factor being βk+1|n(xk+1). From Definition 30,
∫ Q(xk, dxk+1) f(xk+1) gk+1(xk+1) βk+1|n(xk+1) equals Fk|n(xk, f) βk|n(xk).
Thus, (2.32) may be rewritten as

Eν[f(Xk+1)h(X0:k) | Y0:n] = L−1ν,n ∫···∫ Fk|n(xk, f) h(x0:k)
    × [ ν(dx0) g0(x0) ∏_{i=1}^{k} Q(xi−1, dxi) gi(xi) ] βk|n(xk) .      (2.33)

Using the definition (2.16) of βk|n again, this latter integral is easily seen to be
similar to (2.32) except for the fact that f (xk+1 ) has been replaced by Fk|n (xk , f ).
Hence
Eν [f (Xk+1 )h(X0:k ) | Y0:n ] = Eν [Fk|n (Xk , f )h(X0:k ) | Y0:n ] ,

for all functions h ∈ Fb(X^{k+1}), as requested.
For k ≥ n, the situation is simpler because (2.6) implies that φν,0:k+1|n =
φν,0:k|n Q. Hence,

Eν[f(Xk+1)h(X0:k) | Y0:n] = ∫···∫ h(x0:k) φν,0:k|n(dx0:k) ∫ Q(xk, dxk+1) f(xk+1) ,

and thus

Eν[f(Xk+1)h(X0:k) | Y0:n] = ∫···∫ h(x0:k) φν,0:k|n(dx0:k) Q(xk, f)
                          = Eν[Q(Xk, f)h(X0:k) | Y0:n] .

Remark 32. A key ingredient of the above proof is (2.32), which gives a repre-
sentation of the joint smoothing distribution of the state variables X0:k given the
observations up to index n, with n ≥ k. This representation, which states that

φν,0:k|n(f) = L−1ν,n ∫···∫ f(x0:k) [ ν(dx0) g0(x0) ∏_{i=1}^{k} Q(xi−1, dxi) gi(xi) ] βk|n(xk)      (2.34)

for all f ∈ Fb(X^{k+1}), is a generalization of the marginal forward-backward decomposition
as stated in (2.14).
Proposition 31 implies that, conditionally on the observations Y0:n , the state
sequence {Xk }k≥0 is a non-homogeneous Markov chain associated with the family
of Markov transition kernels {Fk|n }k≥0 and initial distribution φν,0|n . The fact that
the Markov property of the state sequence is preserved under conditioning may
seem surprising, because the (marginal) smoothing distribution of the state Xk depends
on both past and future observations. There is however nothing paradoxical here, as
the Markov transition kernels Fk|n indeed depend (and depend only) on the future
observations Yk+1:n .
As a consequence of Proposition 31, the joint smoothing distributions may be
rewritten in a form that involves the forward smoothing kernels using the Chapman-
Kolmogorov equations (1.1).

Proposition 33. For any integers n and m, function f ∈ Fb(X^{m+1}), and initial
probability ν on (X, X ),

Eν[f(X0:m) | Y0:n] = ∫···∫ f(x0:m) φν,0|n(dx0) ∏_{i=1}^{m} Fi−1|n(xi−1, dxi) ,      (2.35)

where {Fk|n }k≥0 are defined by (2.30) and (2.31) and φν,0|n is the marginal smooth-
ing distribution defined, for any A ∈ X , by
φν,0|n(A) = [ν(g0 β0|n)]−1 ∫A ν(dx) g0(x) β0|n(x) .      (2.36)

If one is only interested in computing the fixed-point marginal smoothing distributions,
(2.35) may also be used as the second phase of a smoothing approach,
which we recapitulate below.

Corollary 34 (Alternative Smoothing Algorithm). Backward Recursion: Compute
the backward variables βn|n down to β0|n by backward recursion according
to (2.19) in Proposition 21.
Forward Smoothing: φν,0|n is given by (2.36) and, for k ≥ 0,

φν,k+1|n = φν,k|n Fk|n ,

where the Fk|n are the forward kernels defined by (2.30).


For numerical implementation, Corollary 34 is definitely less attractive than the
normalized forward-backward approach of Proposition 25 because the backward
pass cannot be carried out in normalized form without first determining the forward
measures αν,k .
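For finite X the caveat can be sidestepped: once the normalizing constants cν,k from the forward pass are available, the unnormalized backward functions cancel from (2.30), since Fk|n(x, x′) = Q(x, x′) gk+1(x′) β̄k+1|n(x′)/(cν,k+1 β̄k|n(x)). The sketch below is our own illustration; it reuses the arrays Q, G, beta_bar, c, and smooth produced by the earlier Section 2.2 sketches, builds the forward smoothing kernels as row-stochastic matrices, and propagates the marginals as in Corollary 34.

def forward_smoothing_kernels(Q, G, beta_bar, c):
    # F[k, x, x'] = F_{k|n}(x, x') of Definition 30, written with the normalized
    # backward functions (valid where beta_bar[k] > 0, as in the toy data).
    n, r = G.shape[0] - 1, Q.shape[0]
    F = np.empty((n, r, r))
    for k in range(n):
        F[k] = Q * (G[k + 1] * beta_bar[k + 1])[None, :] \
               / (c[k + 1] * beta_bar[k])[:, None]
    return F

F = forward_smoothing_kernels(Q, G, beta_bar, c)
mu = smooth[0].copy()                      # phi_{nu,0|n}
for k in range(F.shape[0]):
    mu = mu @ F[k]                         # phi_{nu,k+1|n} = phi_{nu,k|n} F_{k|n}
    assert np.allclose(mu, smooth[k + 1])  # agrees with the forward-backward output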
On the other hand, Proposition 33 provides a general decomposition of the
joint smoothing distribution that will be instrumental in establishing some form of
ergodicity of the Markov chain that corresponds to the unobservable states {Xk }k≥0 ,
conditional on some observations Y0:n (see Chapter 3).

2.3.2 Backward Decomposition


In the previous section it was shown that, conditionally on the observations up to
index n, Y0:n , the state sequence {Xk }k≥0 is a Markov chain, with transition kernels
Fk|n . We now turn to the so-called time-reversal issue: is it true in general that
the unobserved chain with the indices in reverse order forms a non-homogeneous
Markov chain, conditionally on some observations Y0:n ?
We already discussed time-reversal for Markov chains in Section 1.1 where it
has been argued that the main technical difficulty consists in guaranteeing that the
reverse kernel does exist. For this, we require somewhat stronger assumptions on
the nature of X by assuming for the rest of this section that X is a Polish space
and that X is the associated Borel σ-field. From the discussion in Section 1.1 (see
Definition 2 and comment below), we then know that the reverse kernel does exist
although we may not be able to provide a simple closed-form expression for it. The
reverse kernel does have a simple expression, however, as soon as one assumes that
the kernel to be reversed and the initial distribution admit densities with respect
to some measure on X.
Let us now return to the smoothing problem. For positive indices k such that
k ≤ n − 1, the posterior distribution of (Xk , Xk+1 ) given the observations up to
time k satisfies
Eν[f(Xk, Xk+1) | Y0:k] = ∫∫ f(xk, xk+1) φν,k(dxk) Q(xk, dxk+1)      (2.37)

for all f ∈ Fb (X × X). From the previous discussion, there exists a Markov transi-
tion kernel Bν,k which satisfies Definition 2, that is
Bν,k def= {Bν,k(x, A), x ∈ X, A ∈ X }

such that for any function f ∈ Fb(X × X),

Eν[f(Xk, Xk+1) | Y0:k] = ∫∫ f(xk, xk+1) φν,k+1|k(dxk+1) Bν,k(xk+1, dxk) ,      (2.38)

where φν,k+1|k = φν,k Q is the one-step predictive distribution.



Proposition 35. Given a strictly positive index n, initial distribution ν, and index
k ∈ {0, . . . , n − 1},

Eν [f (Xk ) | Xk+1:n , Y0:n ] = Bν,k (Xk+1 , f )

for any f ∈ Fb (X). Here, Bν,k is the backward smoothing kernel defined in (2.38).

Before giving the proof of this result, we make a few remarks to provide some
intuitive understanding of the backward smoothing kernels.

Remark 36. Contrary to the forward kernel, the backward transition kernel is only
defined implicitly through the equality of the two representations (2.37) and (2.38).
This limitation is fundamentally due to the fact that the backward kernel involves a
non-trivial time-reversal operation.
Proposition 35 however allows a simple interpretation of the backward kernel:
Because Eν [f (Xk ) | Xk+1:n , Y0:n ] is equal to Bν,k (Xk+1 , f ) and thus depends neither
on Xl for l > k + 1 nor on Yl for l ≥ k + 1, the tower property of conditional ex-
pectation implies that not only is Bν,k (Xk+1 , f ) equal to Eν [f (Xk ) | Xk+1 , Y0:n ] but
also coincides with Eν [f (Xk ) | Xk+1 , Y0:k ], for any f ∈ Fb (X). In addition, the dis-
tribution of Xk+1 given Xk and Y0:k reduces to Q(Xk , ·) due to the particular form
of the transition kernel associated with a hidden Markov model (see Definition 11).
Recall also that the distribution of Xk given Y0:k is denoted by φν,k . Thus, Bν,k
can be interpreted as a Bayesian posterior in the equivalent pseudo-model where

• Xk is distributed a priori according to the filtering distribution φν,k ,

• The conditional distribution of Xk+1 given Xk is Q(Xk , ·).

Bν,k (Xk+1 , ·) is then interpreted as the posterior distribution of Xk given Xk+1 in


this equivalent pseudo-model.
In particular, for HMMs that are “fully dominated” in the sense of Definition 13,
Q has a transition probability density function q with respect to a measure λ on X.
This is then also the case for φν,k , which is a marginal of (2.13). In such cases, we
shall use the slightly abusive but unambiguous notation φν,k (dx) = φν,k (x) λ(dx)
(that is, φν,k denotes the probability density function with respect to λ rather
than the probability distribution). The backward kernel Bν,k (xk+1 , ·) then has a
probability density function with respect to λ, which is given by Bayes’ formula,

Bν,k(xk+1, x) = φν,k(x) q(x, xk+1) / ∫X φν,k(x′) q(x′, xk+1) λ(dx′) .      (2.39)

Thus, in many cases of interest, the backward transition kernel Bν,k can be
written straightforwardly as a function of φν,k and Q. In these situations, Propo-
sition 38 is the method of choice for smoothing, as it only involves normalized
quantities, whereas Corollary 34 is not normalized and thus can generally not be
implemented as it stands.


Proof (of Proposition 35). Let k ∈ {0, . . . , n − 1} and h ∈ Fb(X^{n−k}). Then

Eν[f(Xk)h(Xk+1:n) | Y0:n] = ∫···∫ f(xk) h(xk+1:n) φν,k:n|n(dxk:n) .      (2.40)

Using the definition (2.13) of the joint smoothing distribution φν,k:n|n yields

Eν[f(Xk)h(Xk+1:n) | Y0:n]
  = L−1ν,n ∫···∫ ν(dx0) g0(x0) ∏_{i=1}^{k} Q(xi−1, dxi) gi(xi) f(xk)
      × [ ∏_{i=k+1}^{n} Q(xi−1, dxi) gi(xi) ] h(xk+1:n)
  = (Lν,k/Lν,n) ∫∫ φν,k(dxk) Q(xk, dxk+1) f(xk) gk+1(xk+1)
      × ∫···∫ [ ∏_{i=k+2}^{n} Q(xi−1, dxi) gi(xi) ] h(xk+1:n) ,      (2.41)

which implies, by the definition (2.38) of the backward kernel, that

Eν[f(Xk)h(Xk+1:n) | Y0:n]
  = (Lν,k/Lν,n) ∫∫ Bν,k(xk+1, dxk) f(xk) φν,k+1|k(dxk+1) gk+1(xk+1)
      × ∫···∫ [ ∏_{i=k+2}^{n} Q(xi−1, dxi) gi(xi) ] h(xk+1:n) .      (2.42)

Taking f ≡ 1 shows that for any function h0 ∈ Fb(X^{n−k}),

Eν[h0(Xk+1:n) | Y0:n] = (Lν,k/Lν,n) ∫···∫ h0(xk+1:n)
    × φν,k+1|k(dxk+1) gk+1(xk+1) ∏_{i=k+2}^{n} Q(xi−1, dxi) gi(xi) .

Identifying h0 with h(xk+1:n) ∫ f(x) Bν,k(xk+1, dx), we find that (2.42) may be
rewritten as

Eν[f(Xk)h(Xk+1:n) | Y0:n] = Eν[ h(Xk+1:n) ∫ Bν,k(Xk+1, dx) f(x) | Y0:n ] ,

which concludes the proof.

The next result is a straightforward consequence of Proposition 35, which refor-


mulates the joint smoothing distribution φν,0:n|n in terms of the backward smooth-
ing kernels.

Corollary 37. For any integer n > 0 and initial probability ν,

Eν[f(X0:n) | Y0:n] = ∫···∫ f(x0:n) φν,n(dxn) ∏_{k=0}^{n−1} Bν,k(xk+1, dxk)      (2.43)

for all f ∈ Fb(X^{n+1}). Here, {Bν,k}0≤k≤n−1 are the backward smoothing kernels
defined in (2.38) and φν,n is the marginal filtering distribution corresponding to the
final index n.

It follows from Proposition 35 and Corollary 37 that, conditionally on Y0:n , the


joint distribution of the index-reversed sequence {X̄k }0≤k≤n , with X̄k = Xn−k , is
that of a non-homogeneous Markov chain with initial distribution φν,n and transi-
tion kernels {Bν,n−k }1≤k≤n . This is an exact analog of the forward decomposition
where the ordering of indices has been reversed, starting from the end of the obser-
vation sequence and ending with the first observation. Three important differences
versus the forward decomposition should however be kept in mind.
(i) The backward smoothing kernel Bν,k depends on the initial distribution ν
and on the observations up to index k but it depends neither on the future
observations nor on the index n where the observation sequence ends. As a
consequence, the sequence of backward transition kernels {Bν,k }0≤k≤n−1 may
be computed by forward recurrence on k, irrespective of the length of the
observation sequence. In other words, the backward smoothing kernel Bν,k
depends only on the filtering distribution φν,k , whereas the forward smoothing
kernel Fk|n had to be computed from the backward function βk|n .
(ii) Because Bν,k depends on φν,k rather than on the unnormalized forward mea-
sure αν,k , its computation involves only properly normalized quantities (Re-
mark 36). The backward decomposition is thus better suited to the actual
computation of the smoothing probabilities than the forward decomposition.
The necessary steps are summarized in the following result.

Proposition 38 (Forward Filtering/Backward Smoothing). Forward Filtering:
Compute, forward in time, the filtering distributions φν,0 to φν,n using the recursion
(2.22). At each index k, the backward transition kernel Bν,k may be computed
according to (2.38).
Backward Smoothing: From φν,n , compute, for k = n − 1, n − 2, . . . , 0,

φν,k|n = φν,k+1|n Bν,k ,

recalling that φν,n|n def= φν,n .
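When X is finite and λ is the counting measure, the backward kernel (2.39) is available in closed form from the filtering distributions, and Proposition 38 becomes the discrete-state analog of Rauch-Tung-Striebel smoothing. A minimal sketch (the function name and array layout are ours; it can be applied to the filtering output phi of the earlier toy run):

import numpy as np

def forward_filtering_backward_smoothing(Q, phi):
    # phi: (n+1, r) filtering distributions from the forward recursion (2.22).
    n_plus_1 = phi.shape[0]
    smooth = np.empty_like(phi)
    smooth[-1] = phi[-1]                      # phi_{nu,n|n} = phi_{nu,n}
    for k in range(n_plus_1 - 2, -1, -1):
        J = phi[k][:, None] * Q               # phi_{nu,k}(x) Q(x, x')
        B = J / J.sum(axis=0, keepdims=True)  # B[x, x'] = B_{nu,k}(x', x), cf. (2.39)
        smooth[k] = B @ smooth[k + 1]         # phi_{nu,k|n} = phi_{nu,k+1|n} B_{nu,k}
    return smooth

The column normalization of J is exactly the Bayes formula (2.39), with the filtering distribution playing the role of the prior on Xk.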

(iii) A more subtle difference between the forward and backward Markovian decom-
positions is the observation that Definition 30 does provide an expression of
the forward kernels Fk|n for any k ≥ 0, that is, also for indices after the end of
the observation sequence. Hence, the process {Xk }k≥0 , when conditioned on
some observations Y0:n , really forms a non-homogeneous Markov chain whose
finite-dimensional distributions are defined by Proposition 33. In contrast, the
backward kernels Bν,k are defined for indices k ∈ {0, . . . , n − 1} only, and
thus the index-reversed process {Xn−k } is also defined, by Proposition 35, for
indices k in the range {0, . . . , n} only. In order to define the index-reversed
chain for negative indices, a minimal requirement is that the underlying chain
{Xk } also be well defined for k < 0. Defining Markov chains {Xk } with in-
dices k ∈ Z is only meaningful in the stationary case, that is when ν is the
stationary distribution of Q. As both this stationarization issue and the for-
ward and backward Markovian decompositions play a key role in the analysis
of the statistical properties of the maximum likelihood estimator, we postpone
further discussion of this point to Chapter 6.
Chapter 3

Forgetting of the initial condition and filter stability

Recall from previous chapters that in a partially dominated HMM (see Definition 12),
we denote by
• Pν the probability associated with the Markov chain {Xk , Yk }k≥0 on the canonical
space ((X × Y)^ℕ, (X ⊗ Y)^⊗ℕ) with initial probability measure ν and transition
kernel T defined by (1.15);
• φν,k|n the distribution of the hidden state Xk conditionally on the observations
Y0:n , under the probability measure Pν .
Forgetting properties pertain to the dependence of φν,k|n with respect to the
initial distribution ν. A typical question is to ask whether φν,k|n and φν 0 ,k|n are
close (in some sense) for large values of k and arbitrary choices of ν and ν 0 . This
issue will play a key role both when studying the convergence of sequential Monte
Carlo methods (Chapter ??) and when analyzing the asymptotic behavior of the
maximum likelihood estimator (Chapter 6).
In the following, it is shown more precisely that, under appropriate conditions
on the kernel Q of the hidden chain and on the transition density function g, the
total variation distance ‖φν,k|n − φν′,k|n‖TV converges to zero as k tends to infinity.
Remember that, following the implicit conditioning convention (Section 2.1.4), we
usually omit to indicate explicitly that φν,k|n indeed depends on the observations
Y0:n. In this chapter, however, we cannot use this convention anymore, as we will
meet both situations in which, say, ‖φν,n − φν′,n‖TV converges to zero (as n tends
to infinity) for all possible values of the sequence {yn}n≥0 ∈ Y^ℕ (uniform forgetting)
and cases where ‖φν,n − φν′,n‖TV can be shown to converge to zero almost
surely only when {Yk}k≥0 is assumed to be distributed according to a specific distribution
(typically Pν⋆ for some initial distribution ν⋆). We thus make the dependence
with respect to the observations explicit by indicating the relevant subset
of observations between brackets, using, for instance, φν,k|n[y0:n] rather than φν,k|n.
We start by recalling some elementary facts and results about the total variation
norm of a signed measure, providing in particular useful characterizations of the
total variation as an operator norm over appropriately defined function spaces. We
then discuss the contraction property of Markov kernels, using the measure-theoretic
approach introduced in an early paper by Dobrushin (1956) and recently revisited
and extended by Del Moral et al. (2003). We finally present the applications of these
results to establish forgetting properties of the smoothing and filtering recursions
and discuss the implications of the technical conditions required to obtain these
results.


3.0.3 Total Variation


Let (X, X ) be a measurable space and let ξ be a signed measure on (X, X ). Then
there exists a measurable set H ∈ X , called a Jordan set, such that
(i) ξ(A) ≥ 0 for each A ∈ X such that A ⊆ H;
(ii) ξ(A) ≤ 0 for each A ∈ X such that A ⊆ X \ H.
The set H is not unique, but any other such set H′ ∈ X satisfies ξ(H △ H′) = 0.
Hence two Jordan sets differ by at most a set of zero measure. If X is finite or
countable and X = P(X) is the collection of all subsets of X, then H = {x : ξ(x) ≥ 0}
and H′ = {x : ξ(x) > 0} are two Jordan sets. As another example, if ξ is absolutely
continuous with respect to a measure ν on (X, X ) with Radon-Nikodym derivative
f , then {f ≥ 0} and {f > 0} are two Jordan sets. We define two measures on
(X, X ) by

ξ+ (A) = ξ(H ∩ A) and ξ− (A) = −ξ(H c ∩ A) , A∈X .

The measures ξ+ and ξ− are referred to as the positive and negative variations of
the signed measure ξ. By construction, ξ = ξ+ − ξ− . This decomposition of ξ into
its positive and negative variations is called the Hahn-Jordan decomposition of ξ.
The definition of the positive and negative variations above is easily shown to be
independent of the particular Jordan set chosen.
Definition 39 (Total Variation of a Signed Measure). Let (X, X ) be a measurable
space and let ξ be a signed measure on (X, X ). The total variation norm of ξ is
defined as
‖ξ‖TV = ξ+(X) + ξ−(X) ,
where (ξ+ , ξ− ) is the Hahn-Jordan decomposition of ξ.

If X is finite or countable and ξ is a signed measure on (X, P(X)), then
‖ξ‖TV = ∑_{x∈X} |ξ(x)|. If ξ has a density f with respect to a measure λ on (X, X ),
then ‖ξ‖TV = ∫ |f(x)| λ(dx).
Definition 40 (Total Variation Distance). Let (X, X ) be a measurable space and
let ξ and ξ′ be two measures on (X, X ). The total variation distance between ξ and
ξ′ is the total variation norm of the signed measure ξ − ξ′.
Denote by M(X, X ) the set of finite signed measures on the measurable space
(X, X ), by M1 (X, X ) the set of probability measures on (X, X ) and by M0 (X, X ) the
set of finite signed measures ξ on (X, X ) satisfying ξ(X) = 0. M(X, X ) is a Banach
space with respect to the total variation norm. In this Banach space, the subset
M1 (X, X ) is closed and convex.
Let Fb(X) denote the set of bounded measurable real functions on X. This set,
endowed with the supremum norm ‖f‖∞ = sup{|f(x)| : x ∈ X}, is also a Banach
space. For any ξ ∈ M(X, X ) and f ∈ Fb(X), we may define ξ(f) = ∫ f dξ.
Therefore any finite signed measure ξ in M(X, X ) defines a linear functional on the
Banach space (Fb(X), ‖·‖∞). We will use the same notation for the measure and
for the functional. The following lemma shows that the total variation of the signed
measure ξ agrees with the operator norm of ξ.
Lemma 41.
(i) For any ξ ∈ M(X, X ) and f ∈ Fb(X),

|∫ f dξ| ≤ ‖ξ‖TV ‖f‖∞ .

(ii) For any ξ ∈ M(X, X ),

‖ξ‖TV = sup{ξ(f) : f ∈ Fb(X), ‖f‖∞ = 1} .

(iii) For any f ∈ Fb(X),

‖f‖∞ = sup{ξ(f) : ξ ∈ M(X, X ), ‖ξ‖TV = 1} .

Proof. Let H be a Hahn-Jordan set of ξ. Then ξ+(H) = ξ(H) and ξ−(H^c) = −ξ(H^c).
For f ∈ Fb(X),

|ξ(f)| ≤ |ξ+(f)| + |ξ−(f)| ≤ ‖f‖∞ (ξ+(X) + ξ−(X)) = ‖f‖∞ ‖ξ‖TV ,

showing (i). It also shows that the suprema in (ii) and (iii) are no larger than ‖ξ‖TV
and ‖f‖∞, respectively. To establish equality in these relations, first note that
‖1H − 1H^c‖∞ = 1 and ξ(1H − 1H^c) = ξ(H) − ξ(H^c) = ‖ξ‖TV. This proves (ii).
Next pick f and let {xn} be a sequence in X such that limn→∞ |f(xn)| = ‖f‖∞.
Then ‖f‖∞ = limn→∞ |δxn(f)|, proving (iii).
The set M0(X, X ) possesses some interesting properties that will prove useful in
the sequel. Let ξ be in this set. Because ξ(X) = 0, for any f ∈ Fb(X) and any real c
it holds that ξ(f) = ξ(f − c). Therefore by Lemma 41(i), |ξ(f)| ≤ ‖ξ‖TV ‖f − c‖∞,
which implies that

|ξ(f)| ≤ ‖ξ‖TV inf_{c∈ℝ} ‖f − c‖∞ .

It is easily seen that for any f ∈ Fb(X), inf_{c∈ℝ} ‖f − c‖∞ is related to the oscillation
semi-norm of f, also called the global modulus of continuity,

osc(f) def= sup_{(x,x′)∈X×X} |f(x) − f(x′)| = 2 inf_{c∈ℝ} ‖f − c‖∞ .      (3.1)

The lemma below provides some additional insight into this result.

Lemma 42. For any ξ ∈ M(X, X ) and f ∈ Fb(X),

|ξ(f)| ≤ sup_{(x,x′)∈X×X} |ξ+(X)f(x) − ξ−(X)f(x′)| ,      (3.2)

where (ξ+, ξ−) is the Hahn-Jordan decomposition of ξ. In particular, for any
ξ ∈ M0(X, X ) and f ∈ Fb(X),

|ξ(f)| ≤ (1/2) ‖ξ‖TV osc(f) ,      (3.3)

where osc(f) is given by (3.1).
Proof. First note that

ξ(f) = ∫ f(x) ξ+(dx) − ∫ f(x) ξ−(dx)
     = ∫∫ f(x) ξ+(dx) ξ−(dx′) / ξ−(X) − ∫∫ f(x′) ξ+(dx) ξ−(dx′) / ξ+(X) .

Therefore

|ξ(f)| ≤ ∫∫ |f(x)/ξ−(X) − f(x′)/ξ+(X)| ξ+(dx) ξ−(dx′)
       ≤ sup_{(x,x′)∈X×X} |f(x)/ξ−(X) − f(x′)/ξ+(X)| ξ+(X) ξ−(X) ,

which shows (3.2). If ξ(X) = 0, then ξ+(X) = ξ−(X) = (1/2)‖ξ‖TV, showing (3.3).

Therefore, for ξ ∈ M0(X, X ), (1/2)‖ξ‖TV is the operator norm of ξ considered as an
operator over the space Fb(X) equipped with the oscillation semi-norm (3.1). As a
direct application of this result, if ξ and ξ′ are two probability measures on (X, X ),
then ξ − ξ′ ∈ M0(X, X ), which implies that for any f ∈ Fb(X),

|ξ(f) − ξ′(f)| ≤ (1/2) ‖ξ − ξ′‖TV osc(f) .      (3.4)

This inequality is sharper than the bound |ξ(f) − ξ′(f)| ≤ ‖ξ − ξ′‖TV ‖f‖∞ provided
by Lemma 41(i), because osc(f) ≤ 2‖f‖∞.
We conclude this section by establishing some alternative expressions for the
total variation distance between two probability measures.

Lemma 43. For any ξ and ξ′ in M1(X, X ),

(1/2) ‖ξ − ξ′‖TV = sup_A |ξ(A) − ξ′(A)|      (3.5)
                 = 1 − sup_{ν≤ξ,ξ′} ν(X)      (3.6)
                 = 1 − inf ∑_{i=1}^{n} ξ(Ai) ∧ ξ′(Ai) .      (3.7)

Here the supremum in (3.5) is taken over all measurable subsets A of X, the supremum
in (3.6) is taken over all finite signed measures ν on (X, X ) satisfying ν ≤ ξ and
ν ≤ ξ′, and the infimum in (3.7) is taken over all finite measurable partitions
A1 , . . . , An of X.

Proof. To prove (3.5), first write ξ(A) − ξ′(A) = (ξ − ξ′)(1A) and note that osc(1A) = 1.
Thus (3.4) shows that the supremum in (3.5) is no larger than (1/2)‖ξ − ξ′‖TV.
Now let H be a Jordan set of the signed measure ξ − ξ′. The supremum is bounded
from below by ξ(H) − ξ′(H) = (ξ − ξ′)+(X) = (1/2)‖ξ − ξ′‖TV. This establishes
equality in (3.5).
We now turn to (3.6). For any p, q ∈ ℝ, |p − q| = p + q − 2(p ∧ q). Therefore for
any A ∈ X ,

(1/2)|ξ(A) − ξ′(A)| = (1/2)(ξ(A) + ξ′(A)) − ξ(A) ∧ ξ′(A) .

Applying this relation to the sets H and H^c, where H is as above, shows that

(1/2)(ξ − ξ′)(H) = (1/2)[ξ(H) + ξ′(H)] − ξ(H) ∧ ξ′(H) ,
(1/2)(ξ′ − ξ)(H^c) = (1/2)[ξ(H^c) + ξ′(H^c)] − ξ(H^c) ∧ ξ′(H^c) .

For any measure ν such that ν ≤ ξ and ν ≤ ξ′, it holds that ν(H) ≤ ξ(H) ∧ ξ′(H)
and ν(H^c) ≤ ξ(H^c) ∧ ξ′(H^c), showing that

(1/2)(ξ − ξ′)(H) + (1/2)(ξ′ − ξ)(H^c) = (1/2)‖ξ − ξ′‖TV ≤ 1 − ν(X) .
Thus (3.6) is no smaller than the left-hand side. To show equality, let ν be the
measure defined by

ν(A) = ξ(A ∩ H^c) + ξ′(A ∩ H) .      (3.8)

By the definition of H, ξ(A ∩ H^c) ≤ ξ′(A ∩ H^c) and ξ′(A ∩ H) ≤ ξ(A ∩ H) for any
A ∈ X . Therefore ν(A) ≤ ξ(A) and ν(A) ≤ ξ′(A). In addition, ν(H) = ξ′(H) =
ξ(H) ∧ ξ′(H) and ν(H^c) = ξ(H^c) = ξ(H^c) ∧ ξ′(H^c), showing that (1/2)‖ξ − ξ′‖TV =
1 − ν(X) and concluding the proof of (3.6).
Finally, because ν(X) = ξ(H) ∧ ξ′(H) + ξ(H^c) ∧ ξ′(H^c), we have

sup_{ν≤ξ,ξ′} ν(X) ≥ inf ∑_{i=1}^{n} ξ(Ai) ∧ ξ′(Ai) .

Conversely, for any measure ν satisfying ν ≤ ξ and ν ≤ ξ′, and any partition
A1 , . . . , An ,

ν(X) = ∑_{i=1}^{n} ν(Ai) ≤ ∑_{i=1}^{n} ξ(Ai) ∧ ξ′(Ai) ,

showing that

sup_{ν≤ξ,ξ′} ν(X) ≤ inf ∑_{i=1}^{n} ξ(Ai) ∧ ξ′(Ai) .

The supremum and the infimum thus agree, and the proof of (3.7) follows from
(3.6).
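For probability vectors on a finite set, the expressions of Lemma 43 can be checked numerically in a few lines; the following sketch is our own illustration (the finest partition, into singletons, attains the infimum in (3.7)).

import numpy as np

rng = np.random.default_rng(1)
xi = rng.random(5); xi /= xi.sum()           # two probability vectors
xi_p = rng.random(5); xi_p /= xi_p.sum()

half_tv = 0.5 * np.abs(xi - xi_p).sum()      # (1/2) ||xi - xi'||_TV
H = xi >= xi_p                               # a Jordan set of xi - xi'
assert np.isclose(half_tv, xi[H].sum() - xi_p[H].sum())       # (3.5)
assert np.isclose(half_tv, 1.0 - np.minimum(xi, xi_p).sum())  # (3.7)

The pointwise minimum of the two vectors is precisely the maximal measure ν dominated by both, so the second assert is also an instance of (3.6).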

3.0.4 Lipschitz Contraction for Transition Kernels


In this section, we study the contraction property of transition kernels with respect
to the total variation distance. Such results have been discussed in a seminal paper
by Dobrushin (1956) (see Del Moral, 2004, Chapter 4, for a modern presentation
and extensions of these results to a general class of distance-like entropy criteria).
Let (X, X ) and (Y, Y) be two measurable spaces and let K be a transition kernel
from (X, X ) to (Y, Y) (see Definition 1). The kernel K is canonically associated to
two linear mappings:
(i) a mapping M(X, X ) → M(Y, Y) that maps any ξ in M(X, X ) to a (possibly
signed) measure ξK given by ξK(A) = ∫X ξ(dx) K(x, A) for any A ∈ Y;
(ii) a mapping Fb(Y) → Fb(X) that maps any f in Fb(Y) to the function Kf
given by Kf(x) = ∫ K(x, dy) f(y).
Here again, with a slight abuse in notation, we use the same notation K for these
two mappings. If we equip the spaces M(X, X ) and M(Y, Y) with the total variation
norm and the spaces Fb (X) and Fb (Y) with the supremum norm, a first natural
problem is to compute the operator norm(s) of the kernel K.
Lemma 44. Let (X, X ) and (Y, Y) be two measurable spaces and let K be a tran-
sition kernel from (X, X ) to (Y, Y). Then

1 = sup {kξKkTV : ξ ∈ M(X, X ), kξkTV = 1}

= sup {kKf k∞ : f ∈ Fb (Y) , kf k∞ = 1} .

Proof. By Lemma 41,

sup{‖ξK‖TV : ξ ∈ M(X, X ), ‖ξ‖TV = 1}
  = sup{ξKf : ξ ∈ M(X, X ), f ∈ Fb(Y), ‖f‖∞ = 1, ‖ξ‖TV = 1}
  = sup{‖Kf‖∞ : f ∈ Fb(Y), ‖f‖∞ = 1} ≤ 1 .

If ξ is a probability measure then so is ξK. Because the total variation of any
probability measure is one, we see that the left-hand side of this display is indeed
equal to one. Thus all members are equal to one, and the proof is complete.

To get sharper results, we will have to consider K as an operator acting on a


smaller set of finite measures than M(X, X ). Of particular interest is the subset
M0 (X, X ) of signed measures with zero total mass. Note that if ξ lies in this subset,
then ξK is in M0 (Y, Y). Below we will bound the operator norm of the restriction
of the operator K to M0 (X, X ).

Definition 45 (Dobrushin Coefficient). Let K be a transition kernel from (X, X )
to (Y, Y). Its Dobrushin coefficient δ(K) is given by

δ(K) = (1/2) sup_{(x,x′)∈X×X} ‖K(x, ·) − K(x′, ·)‖TV
     = sup_{(x,x′)∈X×X, x≠x′} ‖K(x, ·) − K(x′, ·)‖TV / ‖δx − δx′‖TV .

We remark that as K(x, ·) and K(x′, ·) are probability measures, it holds that
‖K(x, ·)‖TV = ‖K(x′, ·)‖TV = 1. Hence δ(K) ≤ (1/2)(1 + 1) = 1, so that the Dobrushin
coefficient satisfies 0 ≤ δ(K) ≤ 1.
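For a stochastic matrix on a finite space, Definition 45 can be evaluated directly: δ(K) is half the largest L1 distance between two rows. A small helper (our own code, reused in a numerical check further below):

import numpy as np

def dobrushin(K):
    # delta(K) = (1/2) max_{x,x'} ||K(x,.) - K(x',.)||_TV for a stochastic matrix
    diff = K[:, None, :] - K[None, :, :]
    return 0.5 * np.abs(diff).sum(axis=-1).max()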

Lemma 46. Let ξ be a finite signed measure on (X, X ) and let K be a transition
kernel from (X, X ) to (Y, Y). Then

‖ξK‖TV ≤ δ(K) ‖ξ‖TV + (1 − δ(K)) |ξ(X)| .      (3.9)

Proof. Pick ξ ∈ M(X, X ) and let, as usual, ξ+ and ξ− be its positive and negative
parts, respectively. If ξ−(X) = 0 (ξ is a measure), then ‖ξ‖TV = ξ(X) and (3.9)
becomes ‖ξK‖TV ≤ ‖ξ‖TV; this follows from Lemma 44. If ξ+(X) = 0, an analogous
argument applies.
Thus assume that both ξ+ and ξ− are non-zero. In view of Lemma 41(ii), it
suffices to prove that for any f ∈ Fb(Y) with ‖f‖∞ = 1,

|ξKf| ≤ δ(K)(ξ+(X) + ξ−(X)) + (1 − δ(K))|ξ+(X) − ξ−(X)| .      (3.10)

We shall suppose that ξ+(X) ≥ ξ−(X); if not, replace ξ by −ξ and (3.10) remains
the same. Then, as |ξ+(X) − ξ−(X)| = ξ+(X) − ξ−(X), (3.10) becomes

|ξKf| ≤ 2ξ−(X)δ(K) + ξ+(X) − ξ−(X) .      (3.11)

Now, by Lemma 42, for any f ∈ Fb(Y) it holds that

|ξKf| ≤ sup_{(x,x′)∈X×X} |ξ+(X)Kf(x) − ξ−(X)Kf(x′)|
      ≤ sup_{(x,x′)∈X×X} ‖ξ+(X)K(x, ·) − ξ−(X)K(x′, ·)‖TV ‖f‖∞ .

Finally (3.11) follows upon noting that

‖ξ+(X)K(x, ·) − ξ−(X)K(x′, ·)‖TV
  ≤ ξ−(X) ‖K(x, ·) − K(x′, ·)‖TV + [ξ+(X) − ξ−(X)] ‖K(x, ·)‖TV
  ≤ 2ξ−(X)δ(K) + ξ+(X) − ξ−(X) .

Corollary 47.

δ(K) = sup{‖ξK‖TV : ξ ∈ M0(X, X ), ‖ξ‖TV ≤ 1} .      (3.12)



Proof. If ξ(X) = 0, then (3.9) becomes ‖ξK‖TV ≤ δ(K)‖ξ‖TV, showing that

sup{‖ξK‖TV : ξ ∈ M0(X, X ), ‖ξ‖TV ≤ 1} ≤ δ(K) .

The converse inequality is obvious, as

δ(K) = sup{ (1/2)‖(δx − δx′)K‖TV : (x, x′) ∈ X × X }
     ≤ sup{‖ξK‖TV : ξ ∈ M0(X, X ), ‖ξ‖TV = 1} .

If ξ and ξ′ are two probability measures on (X, X ), Corollary 47 implies that

‖ξK − ξ′K‖TV ≤ δ(K) ‖ξ − ξ′‖TV .

Thus the Dobrushin coefficient is the norm of K considered as a linear operator
from M0(X, X ) to M0(Y, Y).
Proposition 48. The Dobrushin coefficient is sub-multiplicative. That is, if K :
(X, X ) → (Y, Y) and R : (Y, Y) → (Z, Z) are two transition kernels, then δ(KR) ≤
δ(K)δ(R).
Proof. This is a direct consequence of the fact that the Dobrushin coefficient is
an operator norm. By Corollary 47, if ξ ∈ M0(X, X ), then ξK ∈ M0(Y, Y) and
‖ξK‖TV ≤ δ(K)‖ξ‖TV. Likewise, ‖νR‖TV ≤ δ(R)‖ν‖TV holds for any ν ∈ M0(Y, Y).
Thus

‖ξKR‖TV = ‖(ξK)R‖TV ≤ δ(R)‖ξK‖TV ≤ δ(K)δ(R)‖ξ‖TV .
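Sub-multiplicativity is easy to observe numerically with the dobrushin helper defined after Definition 45 (random row-stochastic matrices; our own illustration):

rng = np.random.default_rng(2)
K = rng.random((4, 4)); K /= K.sum(axis=1, keepdims=True)
R = rng.random((4, 4)); R /= R.sum(axis=1, keepdims=True)
assert dobrushin(K @ R) <= dobrushin(K) * dobrushin(R) + 1e-12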

3.0.5 The Doeblin Condition and Uniform Ergodicity


Anticipating results on general state-space Markov chains presented in Chapter 7,
we will establish, using the contraction results developed in the previous section,
some ergodicity results for a class of Markov chains on (X, X ) satisfying the so-called
Doeblin condition.

Assumption 49 (Doeblin Condition). There exist an integer m ≥ 1, ε ∈ (0, 1),
and a transition kernel ν = {νx,x′ , (x, x′) ∈ X × X} from (X × X, X ⊗ X ) to (X, X )
such that for all (x, x′) ∈ X × X and A ∈ X ,

Qm(x, A) ∧ Qm(x′, A) ≥ ε νx,x′(A) .
We will frequently consider a strengthened version of this assumption.


Assumption 50 (Doeblin Condition Reinforced). There exist an integer m ≥ 1,
ε ∈ (0, 1), and a probability measure ν on (X, X ) such that for any x ∈ X and A ∈ X ,

Qm(x, A) ≥ ε ν(A) .

By Lemma 43, the Dobrushin coefficient of Qm may be equivalently written as

δ(Qm) = 1 − inf ∑_{i=1}^{n} Qm(x, Ai) ∧ Qm(x′, Ai) ,      (3.13)

where the infimum is taken over all (x, x′) ∈ X × X and all finite measurable partitions
A1 , . . . , An of X. Under the Doeblin condition, the sum in this display is
bounded from below by ε ∑_{i=1}^{n} νx,x′(Ai) = ε. Hence the following lemma is true.

Lemma 51. Under Assumption 49, δ(Qm) ≤ 1 − ε.

Stochastic processes that are such that for any k, the distribution of the ran-
dom vector (Xn , . . . , Xn+k ) does not depend on n are called stationary (see Defi-
nition 10). It is clear that in general a Markov chain will not be stationary. Nev-
ertheless, given a transition kernel Q, it is possible that with an appropriate choice
of the initial distribution ν we may produce a stationary process. Assuming that
such a distribution exists, the stationarity of the marginal distribution implies that
Eν [1A (X0 )] = Eν [1A (X1 )] for any A ∈ X . This can equivalently be written as
ν(A) = νQ(A), or ν = νQ. In such a case, the Markov property implies that all
finite-dimensional distributions of {Xk }k≥0 are also invariant under translation in
time. These considerations lead to the definition of invariant measure.

Definition 52 (Invariant Measure). If Q is a Markov kernel on (X, X ) and π is a


σ-finite measure satisfying πQ = π, then π is called an invariant measure.

If an invariant measure is finite, it may be normalized to an invariant probability


measure. In practice, this is the main situation of interest. If an invariant measure
has infinite total mass, its probabilistic interpretation is much more difficult. In
general, there may exist more than one invariant measure, and if X is not finite,
an invariant measure may not exist. As a trivial example, consider X = N and
Q(x, x + 1) = 1.
Invariant probability measures are important not merely because they define
stationary processes. Invariant probability measures also define the long-term or
ergodic behavior of a stationary Markov chain. Assume that for some initial mea-
sure ν, the sequence of probability measures {νQn }n≥0 converges to a probability
measure γν in total variation norm. This implies that for any function f ∈ Fb (X),
limn→∞ νQn (f ) = γν (f ). Therefore
γν(f) = lim_{n→∞} ∫∫ ν(dx) Qn(x, dx′) f(x′)
      = lim_{n→∞} ∫∫ ν(dx) Qn−1(x, dx′) Qf(x′) = γν(Qf) .

Hence, if a limiting distribution exists, it is an invariant probability measure, and if


there exists a unique invariant probability measure, then the limiting distribution
γν will be independent of ν, whenever it exists. These considerations lead to the
following definitions.

Definition 53. Let Q be a Markov kernel admitting a unique invariant probability
measure π. The chain is said to be ergodic if for all x in a set A ∈ X such that
π(A) = 1, lim_{n→∞} ‖Qn(x, ·) − π‖TV = 0. It is said to be uniformly ergodic if
lim_{n→∞} sup_{x∈X} ‖Qn(x, ·) − π‖TV = 0.

Note that when a chain is uniformly ergodic, it is indeed uniformly geometrically
ergodic: lim_{n→∞} sup_{x∈X} ‖Qn(x, ·) − π‖TV = 0 implies, by the triangle inequality,
that there exists an integer m such that (1/2) sup_{(x,x′)∈X×X} ‖Qm(x, ·) − Qm(x′, ·)‖TV < 1.
Hence the Dobrushin coefficient δ(Qm) is strictly less than 1, and Qm is
contractive with respect to the total variation distance by Lemma 46. Thus there
exist constants C < ∞ and ρ ∈ [0, 1) such that sup_{x∈X} ‖Qn(x, ·) − π‖TV ≤ Cρn for
all n.
The following result shows that if a power Qm of the Markov kernel Q satisfies
Doeblin’s condition, then the chain admits a unique invariant probability measure
and is uniformly ergodic.

Theorem 54. Under Assumption 49, Q admits a unique invariant probability mea-
sure π. In addition, for any ξ ∈ M1 (X, X ),

‖ξQn − π‖TV ≤ (1 − ε)^⌊n/m⌋ ‖ξ − π‖TV ,

where ⌊u⌋ denotes the integer part of u.


Proof. Let ξ and ξ 0 be two probability measures on (X, X ). Corollary 47, Proposi-
tion 48, and Lemma 51 yield that for all k ≥ 1,

ξQkm − ξ 0 Qkm TV
≤ δ k (Qm ) kξ − ξ 0 kTV ≤ (1 − )k kξ − ξ 0 kTV . (3.14)

Taking ξ 0 = ξQpm , we find that

ξQkm − ξQ(k+p)m ≤ (1 − )k ,


TV

showing that {ξQkm } is a Cauchy sequence in M1 (X, X ) endowed with the total
variation norm. Because this metric space is complete, there exists a probability
measure π such that ξQkm → π. In view of the discussion above, π is invariant
for Qm . Moreover, by (3.14) this limit does not depend on ξ. Thus Qm admits
π as unique invariant probability measure. The Chapman-Kolmogorov equations
imply that (πQ)Qm = (πQm )Q = πQ, showing that πQ is also invariant for Qm
and hence that πQ = π as claimed.
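Theorem 54 is easy to visualize numerically. Any kernel of the form Q = εν + (1 − ε)R, with R an arbitrary transition kernel, satisfies Assumption 50 with m = 1, and the theorem then guarantees ‖ξQⁿ − π‖TV ≤ (1 − ε)ⁿ ‖ξ − π‖TV ≤ 2(1 − ε)ⁿ. A toy check under this construction (our own illustration):

import numpy as np

rng = np.random.default_rng(3)
eps, r = 0.3, 4
nu = rng.random(r); nu /= nu.sum()
R = rng.random((r, r)); R /= R.sum(axis=1, keepdims=True)
Q = eps * nu[None, :] + (1 - eps) * R      # Q(x, A) >= eps * nu(A) for all x

w, V = np.linalg.eig(Q.T)                  # stationary distribution pi:
pi = np.real(V[:, np.argmax(np.real(w))])  # left Perron eigenvector of Q
pi /= pi.sum()

xi = np.eye(r)[0]                          # start from a point mass
for n in range(1, 25):
    xi = xi @ Q
    assert np.abs(xi - pi).sum() <= 2 * (1 - eps) ** n + 1e-10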
Remark 55. Classical uniform convergence to equilibrium for Markov processes
has been studied during the first half of the 20th century by Doeblin, Kolmogorov,
and Doob under various conditions. Doob (1953) gave a unifying form to these
conditions, which he named Doeblin type conditions. More recently, starting in
the 1970s, an increasing interest in non-uniform convergence of Markov processes
has arisen. An explanation for this interest is that many useful processes do not
converge uniformly to equilibrium, while they do satisfy weaker properties such as a
geometric convergence. It later became clear that non-uniform convergence relates
to local Doeblin-type conditions and to hitting times of so-called small sets. These
types of conditions are detailed in Chapter 7.

3.0.6 Forgetting Properties


Recall from Chapter 2 that the smoothing probability φν,k|n [Y0:n ] is defined by

φν,k|n [Y0:n ](f ) = Eν [f (Xk ) | Y0:n ] , f ∈ Fb (X) .

Here, k and n are integers, and ν is the initial probability measure on (X, X ).
The filtering probability is defined by φν,n [Y0:n ] = φν,n|n [Y0:n ]. In this section, we
will establish that under appropriate conditions on the transition kernel Q and on
the function g, the sequence of filtering probabilities satisfies a property referred
to in the literature as “forgetting of the initial condition”. This property can be
formulated as follows: given two probability measures ν and ν′ on (X, X ),

lim_{n→∞} ‖φν,n[Y0:n] − φν′,n[Y0:n]‖TV = 0   Pν⋆-a.s.,      (3.15)

where ν⋆ is the initial probability measure that defines the law of the observations
{Yk}. Forgetting is also a concept that applies to the smoothing distributions, as it
is often possible to extend the previous results by showing that

lim_{k→∞} sup_{n≥0} ‖φν,k|n[Y0:n] − φν′,k|n[Y0:n]‖TV = 0   Pν⋆-a.s.      (3.16)

Equation (3.16) can also be strengthened by showing that, under additional conditions,
the forgetting property is uniform with respect to the observed sequence Y0:n
in the sense that there exists a deterministic sequence {ρk} satisfying ρk → 0 and

sup_{y0:n∈Y^{n+1}} sup_{n≥0} ‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖TV ≤ ρk .

Several of the results to be proven in the sequel are of this latter type (uniform
forgetting).
As shown in (2.5), the smoothing distribution is defined as the ratio

φν,k|n[y0:n](f) = [ ∫···∫ f(xk) ν(dx0) g(x0, y0) ∏_{i=1}^{n} Q(xi−1, dxi) g(xi, yi) ]
                / [ ∫···∫ ν(dx0) g(x0, y0) ∏_{i=1}^{n} Q(xi−1, dxi) g(xi, yi) ] .

Therefore, the mapping that associates the probability measure φν,k|n[y0:n] to the
initial probability measure ν ∈ M1(X, X ) is non-linear. The theory developed above allows
one to control separately the numerator and the denominator of this quantity but
does not directly yield the forgetting properties (3.15) or (3.16). To achieve
this, we use the alternative representation of the smoothing probability φν,k|n [y0:n ]
introduced in Proposition 33, which states that
φν,k|n[y0:n](f) = ∫···∫ φν,0|n[y0:n](dx0) ∏_{i=1}^{k} Fi−1|n[yi:n](xi−1, dxi) f(xk)
               = φν,0|n[y0:n] ∏_{i=1}^{k} Fi−1|n[yi:n] f .      (3.17)

Here we have used the following notations and definitions from Chapter 2.
(i) Fi|n [yi+1:n ] are the forward smoothing kernels (see Definition 30) given for
i = 0, . . . , n − 1, x ∈ X and A ∈ X , by

Fi|n[yi+1:n](x, A) def= [βi|n[yi+1:n](x)]−1 ∫A Q(x, dxi+1) g(xi+1, yi+1) βi+1|n[yi+2:n](xi+1) ,      (3.18)

where βi|n [yi+1:n ](x) are the backward functions (see Definition 20)

βi|n[yi+1:n](x) = ∫ Q(x, dxi+1) g(xi+1, yi+1) βi+1|n[yi+2:n](xi+1) .      (3.19)

Recall that, by Proposition 31, {Fi|n }i≥0 are the transition kernels of the non-
homogeneous Markov chain {Xk } conditionally on Y0:n ,

Eν [f (Xi+1 ) | X0:i , Y0:n ] = Fi|n [Yi+1:n ](Xi , f ) .

(ii) φν,0|n [y0:n ] is the posterior distribution of the state X0 conditionally on Y0:n =
y0:n , defined for any A ∈ X by
φν,0|n[y0:n](A) = ∫A ν(dx0) g(x0, y0) β0|n[y1:n](x0) / ∫ ν(dx0) g(x0, y0) β0|n[y1:n](x0) .      (3.20)

We see that the non-linear mapping ν ↦ φν,k|n[y0:n] is the composition of two
mappings on M1(X, X ).

(i) The mapping ν ↦ φν,0|n[y0:n], which associates to the initial distribution ν the
posterior distribution of the state X0 given Y0:n = y0:n. This mapping consists
in applying Bayes’ formula, which we write as

φν,0|n[y0:n] = B[g(·, y0)β0|n[y1:n](·), ν] .

Here

B[φ, ξ](f) = ∫ f(x)φ(x) ξ(dx) / ∫ φ(x) ξ(dx) ,   f ∈ Fb(X) ,      (3.21)

for any probability measure ξ on (X, X ) and any non-negative measurable
function φ on X. Note that B[φ, ξ] is a probability measure on (X, X ). Because
of the normalization, this step is non-linear.
(ii) The mapping ξ ↦ ξ ∏_{i=1}^{k} Fi−1|n[yi:n], which is a linear mapping, being defined
as a product of Markov transition kernels.
For two initial probability measures ν and ν′ on (X, X ), the difference of the
associated smoothing distributions may thus be expressed as

φν,k|n[y0:n] − φν′,k|n[y0:n] =
  { B[g(·, y0)β0|n[y1:n], ν] − B[g(·, y0)β0|n[y1:n], ν′] } ∏_{i=1}^{k} Fi−1|n[yi:n] .      (3.22)

Note that the function g(x, y0)β0|n[y1:n](x), defined for x ∈ X, may also be interpreted
as the likelihood Lδx,n[y0:n] of the observations when starting from the initial
condition X0 = x (Proposition 23). In the sequel, we use the likelihood notation
whenever possible, writing, in addition, Lx,n[y0:n] rather than Lδx,n[y0:n] and
L•,n[y0:n] when referring to the whole function.
Using Corollary 47, (3.22) implies that

‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖TV
  ≤ ‖B[L•,n[y0:n], ν] − B[L•,n[y0:n], ν′]‖TV δ( ∏_{i=1}^{k} Fi−1|n[yi:n] ) ,      (3.23)

where the final factor is a Dobrushin coefficient. Because the Bayes operator B returns
probability measures, the total variation distance on the right-hand side of this
display is always bounded by 2. Although this bound may be sufficient, it is often
interesting to relate the total variation distance between B[φ, ξ] and B[φ, ξ′] to the
total variation distance between ξ and ξ′. The following lemma is adapted from
(Künsch, 2000)—see also (Del Moral, 2004, Theorem 4.3.1).
Lemma 56. Let ξ and ξ′ be two probability measures on (X, X ) and let φ be a
non-negative measurable function such that ξ(φ) > 0 or ξ′(φ) > 0. Then

‖B[φ, ξ] − B[φ, ξ′]‖TV ≤ [ ‖φ‖∞ / (ξ(φ) ∨ ξ′(φ)) ] ‖ξ − ξ′‖TV .      (3.24)
Proof. We may assume, without loss of generality, that ξ(φ) ≥ ξ′(φ). For any
f ∈ Fb(X),

B[φ, ξ](f) − B[φ, ξ′](f)
  = ∫ f(x)φ(x) (ξ − ξ′)(dx) / ∫ φ(x) ξ(dx)
    + [ ∫ f(x)φ(x) ξ′(dx) / ∫ φ(x) ξ′(dx) ] × [ ∫ φ(x) (ξ′ − ξ)(dx) / ∫ φ(x) ξ(dx) ]
  = (1/ξ(φ)) ∫ (ξ − ξ′)(dx) φ(x)(f(x) − B[φ, ξ′](f)) .

By Lemma 42,

|∫ (ξ − ξ′)(dx) φ(x)(f(x) − B[φ, ξ′](f))| ≤ ‖ξ − ξ′‖TV
    × (1/2) sup_{(x,x′)∈X×X} |φ(x)(f(x) − B[φ, ξ′](f)) − φ(x′)(f(x′) − B[φ, ξ′](f))| .

Because |B[φ, ξ′](f)| ≤ ‖f‖∞ and φ ≥ 0, the supremum on the right-hand side of
this display is bounded by 2‖φ‖∞‖f‖∞. This concludes the proof.
As mentioned by Künsch (2000), the Bayes operator may be non-contractive:
the numerical factor on the right-hand side of (3.24) is sometimes larger than one,
and the bound may be shown to be tight in particular examples. The intuition
that the posteriors should be at least as close as the priors when the same likelihood
(the same data) is applied is thus generally wrong.
Equation (3.17) also implies that for any integer j such that j ≤ k,

φν,k|n[y0:n] = φν,0|n[y0:n] ∏_{i=1}^{j} Fi−1|n[yi:n] ∏_{i=j+1}^{k} Fi−1|n[yi:n]
            = φν,j|n[y0:n] ∏_{i=j+1}^{k} Fi−1|n[yi:n] .      (3.25)

This decomposition and Corollary 47 show that for any 0 ≤ j ≤ k, any initial
distributions ν and ν′, and any sequence y0:n such that Lν,n[y0:n] > 0 and Lν′,n[y0:n] > 0,

‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖TV
  ≤ δ( ∏_{i=j+1}^{k} Fi−1|n[yi:n] ) ‖φν,j|n[y0:n] − φν′,j|n[y0:n]‖TV .

Because the Dobrushin coefficient of a Markov kernel is bounded by one, this rela-
tion implies that the total variation distance between the smoothing distributions
associated with two different initial distributions is non-expanding. To summarize
this discussion, we have obtained the following result.
Proposition 57. Let ν and ν′ be two probability measures on (X, X ). For any
non-negative integers j, k, and n such that j ≤ k and any sequence y0:n ∈ Y^{n+1}
such that Lν,n[y0:n] > 0 and Lν′,n[y0:n] > 0,

‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖TV
  ≤ δ( ∏_{i=j+1}^{k} Fi−1|n[yi:n] ) ‖φν,j|n[y0:n] − φν′,j|n[y0:n]‖TV ,      (3.26)

‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖TV
  ≤ [ ‖L•,n[y0:n]‖∞ / (Lν,n[y0:n] ∨ Lν′,n[y0:n]) ] δ( ∏_{i=1}^{k} Fi−1|n[yi:n] ) ‖ν − ν′‖TV .      (3.27)

Along the same lines, we can compare the posterior distribution of the state Xk
given observations Yj:n for different values of j. To avoid introducing new notations,
we will simply denote these conditional distributions by Pν ( Xk ∈ · | Yj:n = yj:n ).

As mentioned in the introduction of this chapter, it is sensible to expect that


Pν ( Xk ∈ · | Yj:n ) gets asymptotically close to Pν ( Xk ∈ · | Y0:n ) as k − j tends to
infinity. Here again, to establish this alternative form of the forgetting property, we
will use a representation of Pν ( Xk ∈ · | Yj:n ) similar to (3.17).
Because {(Xk , Yk )} is a Markov chain, and assuming that k ≥ j,

Pν ( Xk ∈ · | Xj , Yj:n ) = Pν ( Xk ∈ · | Xj , Y0:n ) .

Moreover, we know that conditionally on Y0:n , {Xk } is a non-homogeneous Markov


chain with transition kernels Fk|n [Yk+1:n ] where Fi|n = Q for i ≥ n (Proposition 31).
Therefore the Chapman-Kolmogorov equations show that for any function f ∈
Fb (X),

Eν[f(Xk) | Yj:n] = Eν[ Eν[f(Xk) | Xj, Yj:n] | Yj:n ]
  = Eν[ ∏_{i=j+1}^{k} Fi−1|n[Yi:n] f (Xj) | Yj:n ] = φ̃ν,j|n[Yj:n] ∏_{i=j+1}^{k} Fi−1|n[Yi:n] f ,

cf. (3.25), where the probability measure φ̃ν,j|n[Yj:n] is defined by

φ̃ν,j|n[Yj:n](f) = Eν[f(Xj) | Yj:n] ,   f ∈ Fb(X) .

Using (3.25) as well, we thus find that the difference between Pν(Xk ∈ · | Yj:n) and
Pν(Xk ∈ · | Y0:n) may be expressed as

Eν[f(Xk) | Yj:n] − Eν[f(Xk) | Y0:n] = (φ̃ν,j|n − φν,j|n) ∏_{i=j+1}^{k} Fi−1|n[Yi:n] f .

Proceeding as in Proposition 57, we may thus derive a bound on the total variation
distance between these probability measures.

Proposition 58. For any integers j, k, and n such that 0 ≤ j ≤ k and any
probability measure ν on (X, X ),

‖Pν(Xk ∈ · | Y0:n) − Pν(Xk ∈ · | Yj:n)‖TV ≤ 2 δ( ∏_{i=j+1}^{k} Fi−1|n[Yi:n] ) .      (3.28)

3.0.7 Uniform Forgetting Under Strong Mixing Conditions


In light of the discussion above, establishing forgetting properties amounts to de-
termining non-trivial bounds on the Dobrushin coefficient of products of forward
transition kernels and, if required, on ratio of likelihoods Lx,n (y0:n )/(Lν,n (y0:n ) ∨
Lν 0 ,n (y0:n )). To do so, we need to impose additional conditions on Q and g. We
consider in this section the following assumption, which was introduced by Le Gland
and Oudjane (2004, Section 2).
Assumption 59 (Strong Mixing Condition). There exist a transition kernel K :
(Y, Y) → (X, X ) and measurable functions ς− and ς+ from Y to (0, ∞) such that
for any x ∈ X, y ∈ Y, and A ∈ X ,

ς−(y)K(y, A) ≤ ∫A Q(x, dx′) g(x′, y) ≤ ς+(y)K(y, A) .      (3.29)

We first show that under this condition, one may derive a non-trivial upper
bound on the Dobrushin coefficient of the forward smoothing kernels.

Lemma 60. Under Assumption 59, the following hold true.

(i) For any non-negative integers k and n such that k < n and x ∈ X,

∏_{j=k+1}^{n} ς−(yj) ≤ βk|n[yk+1:n](x) ≤ ∏_{j=k+1}^{n} ς+(yj) .      (3.30)

(ii) For any non-negative integers k and n such that k < n and any probability
measures ν and ν′ on (X, X ),

ς−(yk+1)/ς+(yk+1) ≤ [ ∫X ν(dx) βk|n[yk+1:n](x) ] / [ ∫X ν′(dx) βk|n[yk+1:n](x) ] ≤ ς+(yk+1)/ς−(yk+1) .

(iii) For any non-negative integers k and n such that k < n, there exists a transition
kernel λk,n from (Y^{n−k}, Y^{⊗(n−k)}) to (X, X ) such that for any x ∈ X, A ∈ X ,
and yk+1:n ∈ Y^{n−k},

[ς−(yk+1)/ς+(yk+1)] λk,n(yk+1:n, A) ≤ Fk|n[yk+1:n](x, A)
    ≤ [ς+(yk+1)/ς−(yk+1)] λk,n(yk+1:n, A) .      (3.31)

(iv) For any non-negative integers k and n, the Dobrushin coefficient of the forward
smoothing kernel Fk|n[yk+1:n] satisfies

δ(Fk|n[yk+1:n]) ≤ ρ0(yk+1) for k < n ,   δ(Fk|n[yk+1:n]) ≤ ρ1 for k ≥ n ,

where for any y ∈ Y,

ρ0(y) def= 1 − ς−(y)/ς+(y)   and   ρ1 def= 1 − ∫ ς−(y) µ(dy) .      (3.32)
ς (y)

Proof. Take A = X in Assumption 59 to see that ∫X Q(x, dx′) g(x′, y) is bounded
from above and below by ς+(y) and ς−(y), respectively. Part (i) then follows from
(2.16).
Next, (2.19) shows that

∫ ν(dx) βk|n[yk+1:n](x) = ∫∫ ν(dx) Q(x, dxk+1) g(xk+1, yk+1) βk+1|n[yk+2:n](xk+1) .

This expression is bounded from above by

ς+(yk+1) ∫ K(yk+1, dxk+1) βk+1|n[yk+2:n](xk+1) ,

and a similar lower bound, with ς−(yk+1) in place of ς+(yk+1), holds too. These
bounds are independent of ν, and (ii) follows.
We turn to part (iii). Using the definition (2.30), the forward kernel Fk|n[yk+1:n]
may be expressed as

Fk|n[yk+1:n](x, A) = ∫A Q(x, dxk+1) g(xk+1, yk+1) βk+1|n[yk+2:n](xk+1)
                   / ∫X Q(x, dxk+1) g(xk+1, yk+1) βk+1|n[yk+2:n](xk+1) .

Using arguments as above, (3.31) holds with

λk,n(yk+1:n, A) def= ∫A K(yk+1, dxk+1) βk+1|n[yk+2:n](xk+1)
                   / ∫X K(yk+1, dxk+1) βk+1|n[yk+2:n](xk+1) .

Finally, part (iv) for k < n follows from part (iii) and Lemma 51. In the opposite
case, recall from (2.31) that Fk|n = Q for indices k ≥ n. Integrating (3.29) with
respect to µ and using ∫ g(x, y) µ(dy) = 1, we find that for any A ∈ X and any
x ∈ X,

Q(x, A) ≥ ∫ ς−(y)K(y, A) µ(dy) = ∫ ς−(y) µ(dy) × [ ∫ ς−(y)K(y, A) µ(dy) / ∫ ς−(y) µ(dy) ] ,

where the ratio on the right-hand side is a probability measure (as a function of A).
The proof of part (iv) again follows from Lemma 51.
The final part of the above lemma shows that under Assumption 59, the Dobrushin
coefficient of the transition kernel Q satisfies δ(Q) ≤ 1 − ε for some ε > 0.
This is in fact a rather stringent assumption, which fails to be satisfied in many of
the examples considered in Chapter ??. When X is finite, this condition is satisfied
if Q(x, x′) ≥ ε for any (x, x′) ∈ X × X. When X is countable, δ(Q) < 1 is satisfied
under the Doeblin condition (Assumption 49) with m = 1. When X ⊆ ℝ^d, or more
generally when X is a topological space, δ(Q) < 1 typically requires that X be compact,
which is, admittedly, a serious limitation.
Proposition 61. Under 59 the following hold true.
(i) For any non-negative integers k and n and any probability measures ν and ν 0
on (X, X ),

φν,k|n [y0:n ] − φν 0 ,k|n [y0:n ] TV


k∧n
Y
≤ ρ0 (yj ) × ρ1k−k∧n φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] TV
,
j=1

where ρ0 and ρ1 are defined in (3.32).


(ii) For any non-negative integer n and any
R probability measures ν and ν 0 on (X, X )
0
R
such that ν(dx0 ) g(x0 , y0 ) > 0 and ν (dx0 ) g(x0 , y0 ) > 0,

φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] TV


ς + (y1 ) kgk∞
≤ kν − ν 0 kTV .
ς − (y1 ) ν(g(·, y0 )) ∨ ν 0 (g(·, y0 ))

(iii) For any non-negative integers j, k, and n such that j ≤ k and any probability
measure ν on (X, X ),

kPν ( Xk ∈ · | Y0:n = y0:n ) − Pν (Xk ∈ · | Yj:n = yj:n )kTV


k∧n
k−j−(k∧n−j∧n)
Y
≤2 ρ0 (yi ) × ρ1 .
i=j∧n+1

Proof. Using Lemma 60(iv) and Proposition 48, we find that for j ≤ k,
k∧n
k−j−(k∧n−j∧n)
Y
δ(Fj|n [yj+1:n ] · · · Fk|n [yk+1:n ]) ≤ ρ0 (yi ) × ρ1 .
i=j∧n+1
46CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

Parts (i) and (iii) then follow from Propositions 57 and 58, respectively. Next we
note that (3.20) shows that
 
φν,0|n [y0:n ] = B β0|n [y1:n ](·), B[g(·, y0 ), ν] .
Apply Lemma 56 twice to this form to arrive at a bound on the total variation norm
of the difference φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] given by

β0|n [y1:n ] ∞ kg(·, y0 )k∞


× kν − ν 0 kTV .
B[g(·, y0 ), ν](β0|n [y1:n ]) ν(g(·, y0 )) ∨ ν 0 (g(·, y0 ))
Finally, bound the first ratio of this display using Lemma 60(ii); the supremum
norm is obtained by taking one of the initial measures as an atom at some point
x ∈ X. This completes the proof of part (ii).
From the above it is clear that forgetting properties stem from properties of the
product
k∧n
k−j−(k∧n−j∧n)
Y
ρ0 (Yi )ρ1 . (3.33)
i=j∧n+1

The situation is elementary when the factors of this product are (non-trivially)
upper-bounded uniformly with respect to the observations Y0:n . To obtain such
bounds, we consider the following strengthening of the strong mixing condition,
first introduced by Atar and Zeitouni (1997).
Assumption 62 (Strong Mixing Reinforced). (i) There exist two positive real num-
bers σ − and σ + and a probability measure κ on (X, X ) such that for any x ∈ X
and A ∈ X ,
σ − κ(A) ≤ Q(x, A) ≤ σ + κ(A) .
R
(ii) For all y ∈ Y, 0 < X κ(dx) g(x, y) < ∞.
It is easily seen that this implies Assumption 59.
Lemma 63.R Assumption 62 implies Assumption 59 with ς − (y) = σ −
R
X
κ(dx) g(x, y),
ς + (y) = σ + X κ(dx) g(x, y), and
R
κ(dx) g(x, y)
K(y, A) = RA .
X
κ(dx) g(x, y)
In particular, ς − (y)/ς + (y) = σ − /σ + for any y ∈ Y.
Proof. The proof follows immediately upon observing that
Z Z Z
− 0 0 0 0
σ κ(dx ) g(x , y) ≤ Q(x, dx ) g(x , y) ≤ σ +
κ(dx0 ) g(x0 , y) .
A A A

Replacing Assumption 59 by Assumption 62, Proposition 61 may be strength-


ened as follows.
Proposition 64. Under Assumption 62, the following hold true.
(i) For any non-negative integers k and n and any probability measures ν and ν 0
on (X, X ),

φν,k|n [y0:n ] − φν 0 ,k|n [y0:n ] TV


k∧n
σ−

≤ 1− + (1 − σ − )k−k∧n φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] TV
.
σ
47

0
(ii) For any non-negative R probability measures ν and ν on (X, X )
integer n and any
such that ν(dx0 ) g(x0 , y0 ) > 0 and ν 0 (dx0 ) g(x0 , y0 ) > 0,
R

φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] TV


σ+ kgk∞
≤ kν − ν 0 kTV .
σ − ν[g(·, y0 )] ∨ ν 0 [g(·, y0 )]

(iii) For any non-negative integers j, k, and n such that j ≤ k and any probability
measure ν on (X, X ),

kPν ( Xk ∈ · | Y0:n = y0:n ) − Pν ( Xk ∈ · | Yj:n = yj:n )kTV


k∧n−j∧n
σ−

k−j−(k∧n−j∧n)
≤2 1− + 1 − σ− .
σ

Thus, under Assumption 62 the filter and the smoother forget their initial condi-
tions exponentially fast, uniformly with respect to the observations. This property,
which holds under rather stringent assumptions, plays a key role in the sequel (see
for instance Chapters ?? and 6).
Of course, the product (3.33) can be shown to vanish asymptotically under
conditions that are less stringent than Assumption 62. A straightforward adaptation
of Lemma 63 shows that the following result is true.

Lemma 65. Assume 59 and that there exists a set C ∈ Y and constants 0 < σ − ≤
σ + < ∞ satisfying µ(C) > 0 and, for all y ∈ C, σ − ≤ ς − (y) ≤ ς + (y) ≤ σ + . Then,
ρ0 (y) ≤ 1 − σ − /σ + , ρ1 ≥ 1 − σ − µ(C) and

k∧n
k−j−(k∧n−j∧n)
Y
ρ0 (Yi )ρ1
i=j∧n+1
k∧n
Pi=j∧n+1 1C (Yi )  k−j−(k∧n−j∧n)
≤ 1 − σ − /σ + 1 − σ − µ(C) . (3.34)

In words, forgetting is guaranteed to occur when {Yk } visits a given set C in-
finitely often in the long run. Of course, such a property cannot hold true for all
possible sequences of observations but it may hold with probability one under appro-
priate assumptions on the law of {Yk }, assuming in particular that the observations
are distributed under the model, perhaps with a different initial distribution ν? .
To answer whether this happens or not requires additional results from the general
theory of Markov chains, and we postpone this discussion to Section 7.3 (see in
particular Proposition 208 on the recurrence of the joint chain in HMMs).

3.0.8 Forgetting Under Alternative Conditions


Because Assumptions 59 and 62 are not satisfied in many contexts of interest, it
is worthwhile to consider ways in which these assumptions can be weakened. This
happens to raise difficult mathematical challenges that largely remain unsolved
today. Perhaps surprisingly, despite many efforts in this direction, there is up to now
no truly satisfactory assumption that covers a reasonable fraction of the situations
of practical interest. The problem really is more complicated than appears at first
sight. In particular, Example 66 below shows that the forgetting property does not
necessarily hold under assumptions that imply that the underlying Markov chain
is uniformly ergodic. This last section on forgetting is more technical and requires
some knowledge of Markov chain theory as can be found in Chapter 7.
48CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

Example 66. This example was first discussed by Kaijser (1975) and recently
worked out by Chigansky and Lipster (2004). Let {Xk } be a Markov chain on
X = {0, 1, 2, 3}, defined by the recurrence equation Xk = (Xk−1 + Uk ) mod 4,
where {Uk } is an i.i.d. binary sequence with P(Bk = 0) = p and P(Bk = 1) = 1 − p
for some 0 < p < 1. For any (x, x0 ) ∈ X × X, Q4 (x, x0 ) > 0, which implies that
δ(Q4 ) < 1 and, by Theorem 54, that the chain is uniformly geometrically ergodic.
The observations {Yk } are a deterministic binary function of the chain, namely
Yk = 1{0,2} (Xk ) .

The function mapping Xk to Yk is not injective, but knowledge of Yk indicates two


possible values of Xk . The filtering distribution is given recursively by
φν,k [y0:k ](0) = yk {φν,k−1 [y0:k−1 ](0) + φν,k−1 [y0:k−1 ](3)} ,
φν,k [y0:k ](1) = (1 − yk ) {φν,k−1 [y0:k−1 ](1) + φν,k−1 [y0:k−1 ](0)} ,
φν,k [y0:k ](2) = yk {φν,k−1 [y0:k−1 ](2) + φν,k−1 [y0:k−1 ](1)} ,
φν,k [y0:k ](3) = (1 − yk ) {φν,k−1 [y0:k−1 ](3) + φν,k−1 [y0:k−1 ](2)} .
In particular, either one of the two sets {0, 2} and {1, 3} has null probability under
φν,k [y0:k ], depending on the value of yk , and irrespectively of the choice of ν. We
also notice that
yk φν,k [y0:k ](j) = φν,k [y0:k ](j) , for j = 0, 2,
(1 − yk ) φν,k [y0:k ](j) = φν,k [y0:k ](j) , for j = 1, 3. (3.35)
In addition, it is easily checked that, except when ν({0, 2}) or ν({1, 3}) equals 1
(which rules out one of the two possible values for y0 ), the likelihood Lν,n [y0:n ] is
strictly positive for any integer n and any sequence y0:n ∈ {0, 1}n+1 .
Dropping the dependence on y0:k for notational simplicity and using (3.35) we
obtain
|φν,k (0) − φν 0 ,k (0)|
= yk |φν,k−1 (0) − φν 0 ,k−1 (0) + φν,k−1 (3) − φν 0 ,k−1 (3)|
= yk {yk−1 |φν,k−1 (0) − φν 0 ,k−1 (0)| + (1 − yk−1 )|φν,k−1 (3) − φν 0 ,k−1 (3)|} .
Proceeding similarly, we also find that

|φν,k (1) − φν 0 ,k (1)| =


(1 − yk ) {(1 − yk−1 )|φν,k−1 (1) − φν 0 ,k−1 (1)| + yk−1 |φν,k−1 (0) − φν 0 ,k−1 (0)|} ,
|φν,k (2) − φν 0 ,k (2)| =
yk {yk−1 |φν,k−1 (2) − φν 0 ,k−1 (2)| + (1 − yk−1 )|φν,k−1 (1) − φν 0 ,k−1 (1)|} ,
|φν,k (3) − φν 0 ,k (3)| =
(1 − yk ) {(1 − yk−1 )|φν,k−1 (3) − φν 0 ,k−1 (3)| + yk−1 |φν,k−1 (2) − φν 0 ,k−1 (2)|} .
Adding the above equalities using (3.35) again shows that for any k = 1, . . . , n,
kφν,k [y0:k ] − φν 0 ,k [y0:k ]kTV = kφν,k−1 [y0:k−1 ] − φν 0 ,k−1 [y0:k−1 ]kTV
= kφν,0 [y0 ] − φν 0 ,0 [y0 ]kTV .
By construction, φν,0 [y0 ](j) = y0 ν(j)/(ν(0)+ν(2)) for j = 0 and 2, and φν,0 [y0 ](j) =
(1−y0 ) ν(j)/(ν(1)+ν(3)) for j = 1 and 3. This implies that kφν,0 [y0 ] − φν 0 ,0 [y0 ]kTV 6=
0 if ν 6= ν 0 .
In this model, the hidden Markov chain {Xk } is uniformly ergodic, but the
filtering distributions φν,k [y0:k ] never forget the influence of the initial distribution
ν, whatever the observed sequence.
49

In the above example, the kernel Q does not satisfy Assumption 62 with m = 1
(one-step minorization), but the condition is verified for a power Qm (here for
m = 4). This situation is the rule rather than the exception. In particular, a
Markov chain on a finite state space has a unique invariant probability measure
and is ergodic if and only if there exists an integer m > 0 such that Qm (x, x0 ) > 0
for all (x, x0 ) ∈ X × X (but the condition may not hold for m = 1). This suggests
considering the following assumption (see for instance Del Moral, 2004, Chapter 4).
Assumption 67.
(i) There exist an integer m, two positive real numbers σ − and σ + , and a proba-
bility measure κ on (X, X ) such that for any x ∈ X and A ∈ X ,

σ − κ(A) ≤ Qm (x, A) ≤ σ + κ(A) .

(ii) There exist two measurable functions g − and g − from Y to (0, ∞) such that
for any y ∈ Y,

g − (y) ≤ inf g(x, y) ≤ sup g(x, y) ≤ g + (y) .


x∈X x∈X

Compared to Assumption 62, the condition on the transition kernel has been
weakened, but at the expense of strengthening the assumption on the function g.
Note in particular that part (ii) is not satisfied in Example 66.
Using (3.17) and writing k = jm+r with 0 ≤ r < m, we may express φν,k|n [y0:n ]
as
 
Y (u+1)m−1
j−1 Y k−1
Y
φν,k|n [y0:n ] = φν,0|n [y0:n ]  Fi|n [yi+1:n ] Fi|n [yi+1:n ] .
u=0 i=um i=jm

This implies, using Corollary 47, that for any probability measures ν and ν 0 on
(X, X ) and any sequence y0:n satisfying Lν,n [y0:n ] > 0 and Lν 0 ,n [y0:n ] > 0,

φν,k|n [y0:n ] − φν 0 ,k|n [y0:n ] TV


 
j−1 (u+1)m−1
Y Y
≤ δ Fi|n [yi+1:n ] φν,0|n [y0:n ] − φν 0 ,0|n [y0:n ] TV
. (3.36)
u=0 i=um

Qum+m−1
This expression suggest computing a bound on δ( i=um Fi|n [yi+1:n ]) rather than
a bound on δ(Fi|n ). The following result shows that such a bound can be derived
under Assumption 67.
Lemma 68. Under Assumption 67, the following hold true.
(i) For any non-negative integers k and n such that k < n and x ∈ X,
n
Y n
Y
g − (yj ) ≤ βk|n [yk+1:n ](x) ≤ g + (yj ) , (3.37)
j=k+1 j=k+1

where βk|n is the backward function (2.16).


(ii) For any non-negative integers u and n such that 0 ≤ u < bn/mc and any
probability measures ν and ν 0 on (X, X ),
(u+1)m R (u+1)m
σ− Y g − (yi ) ν(dx) βum|n [yum+1:n ](x) σ+ Y g + (yi )
≤ RX ≤ .
σ+ i=um+1
+
g (yi ) X
ν 0 (dx) βum|n [yum+1:n ](x) σ− i=um+1
g − (yi )
50CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

(iii) For any non-negative integers u and n such that 0 ≤ u < bn/mc, there exists
a transition kernel λu,n from Y(n−(u+1)m) , Y ⊗(n−(u+1)m) to (X, X ) such that
for any x ∈ X, A ∈ X and yum+1:n ∈ Y(n−um) ,

(u+1)m (u+1)m−1
σ− Y g − (yi ) Y
λu,n (y(u+1)m+1:n , A) ≤ Fi|n [yi+1:n ](x, A)
σ+ i=um+1
g + (yi ) i=um
(u+1)m
σ+ Y g + (yi )
≤ − λu,n (y(u+1)m+1:n , A) . (3.38)
σ i=um+1
g − (yi )

(iv) For any non-negative integers u and n,


  (
(u+1)m−1
Y ρ0 (yum+1:(u+1)m ) u < bn/mc ,
δ Fi|n [yi+1:n ] ≤
i=um
ρ1 u ≥ dn/me ,

where for any yum+1:(u+1)m ∈ Ym ,


(u+1)m
def σ− Y g − (yi ) def
ρ0 (yum+1:(u+1)m ) = 1 − and ρ1 = 1 − σ − . (3.39)
σ+ i=um+1
g + (yi )

Proof. Part (i) can be proved using an argument similar to the one used for Lem-
ma 60(i).
Next notice that for 0 ≤ u < bn/mc,

βum|n [yum+1:n ](xum )


Z Z (u+1)m
Y
= ··· Q(xi−1 , dxi ) g(xi , yi ) β(u+1)m|n [y(u+1)m+1:n ](x(u+1)m ) .
i=um+1

Under Assumption 67, dropping the dependence on the ys for notational simplicity,
the right-hand side of this display is bounded from above by
(u+1)m Z Z (u+1)m
Y Y
g + (yi ) ··· Q(xi−1 , dxi ) β(u+1)m|n (x(u+1)m )
i=um+1 i=um+1
(u+1)m Z
Y
≤ σ+ g + (yi ) β(u+1)m|n (x(u+1)m ) κ(dx(u+1)m ) .
i=um+1

In a similar fashion, a lower bound may be obtained, containing σ − and g − rather


than σ + and g + . Thus part (ii) follows.
For part (iii), we use (2.30) to write
(u+1)m−1
Y
Fi|n [yi+1:n ](xum , A)
i=um

1
R R Q(u+1)m
··· i=um+1 Q(xi−1 , xi ) g(xi , yi ) A (x(u+1)m )β(u+1)m|n (x(u+1)m )
= R R Q(u+1)m .
··· i=um+1 Q(xi−1 , xi ) g(xi , yi )β(u+1)m|n (x(u+1)m )

The right-hand side is bounded from above by


(u+1)m R
σ+ Y g + (yi ) κ(dx) β(u+1)m|n [y(u+1)m+1:n ](x)
× RA .
σ− i=um+1
g − (y )
i κ(dx) β(u+1)m|n [y(u+1)m+1:n ](x)
51

We define λu,n as the second ratio of this expression. Again a corresponding lower
bound is obtained similarly, proving part (iii).
Part (iv) follows from part (iii) and Lemma 51.

Using this result together with (3.36), we may obtain statements analogous to
Proposition 61. In particular, if there exist positive real numbers γ − and γ + such
that for all y ∈ Y,
γ − ≤ g − (y) ≤ g + (y) ≤ γ + ,
then the smoothing and the filtering distributions both forget uniformly the initial
distribution.
Assumptions 62 and 67 are still restrictive and fail to hold in many interesting
situations. In both cases, we assume that either the one-step or the m-step transition
kernel is uniformly bounded from above and below. The following weaker condition
is a first step toward handling more general settings.

Assumption 69. Let Q be dominated by aR probability measure κ on (X, X ) such


that for any x ∈ X and A ∈ X , Q(x, A) = A qκ (x, x0 ) κ(dx0 ) for some transition
density function qκ . Assume in addition that

(i) There exists a set C ∈ X , two positive real numbers σ − and σ + such that for
all x ∈ C and x0 ∈ X,
σ − ≤ qκ (x, x0 ) ≤ σ + .

qκ (x, x0 ) g(x0 , y) κ(dx0 ) > 0;


R
(ii) For all y ∈ Y and all x ∈ X, C

(iii) There exists a (non-identically null) function α : Y → [0, 1] such that for any
(x, x0 ) ∈ X × X and y ∈ Y,

ρ[x, x0 ; y](x00 ) κ(dx00 )


R
RC ≥ α(y) ,
X
ρ[x, x0 ; y](x00 ) κ(dx00 )

where for (x, x0 , x00 ) ∈ X3 and y ∈ Y,

def
ρ[x, x0 ; y](x00 ) = qκ (x, x00 )g(x00 , y)qκ (x00 , x0 ) . (3.40)

Part (i) of this assumption implies that the set C is 1-small for the kernel Q
(see Definition 155). It it shown in Section 7.2.2 that such small sets do exist under
conditions that are weak and generally simple to check. Assumption 69 is trivially
satisfied under Assumption 62 using the whole state space X as the state C: in that
case, their exists a transition density function qκ (x, x0 ) that is bounded from above
and below for all (x, x0 ) ∈ X2 . It is more interesting to consider cases in which
the hidden chain is not uniformly ergodic. One such example, first addressed by
Budhiraja and Ocone (1997), is a Markov chain observed in noise with bounded
support.

Example 70 (Markov Chain in Additive Bounded Noise). We consider real states


{Xk } and observations {Yk }, assuming that the states form a Markov chain with
a transition density q(x, x0 ) with respect to Lebesgue measure. Furthermore we
assume the following.

(i) Yk = Xk + Vk , where {Vk } is an i.i.d. sequence of satisfying P(|V | ≥ M ) = 0


for some finite M (the essential supremum of the noise sequence is bounded).
In addition, Vk has a probability density g with respect to Lebesgue measure.
52CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

(ii) The transition density satisfies q(x, x0 ) > 0 for all (x, x0 ) and there exists a
positive constant A, a probability density h and positive constants σ − and σ +
such that for all x ∈ C = [−A − M, A + M ],
σ − h(x0 ) ≤ q(x, x0 ) ≤ σ + h(x0 ) .

The results below can readily be extended to cover the case Yk = ψ(Xk ) + Vk ,
provided that the level sets {x ∈ R : |ψ(x)| ≤ K} of the function ψ are compact.
This is equivalent to requiring |ψ(x)| → ∞ as |x| → ∞. Likewise extensions to
multivariate states and/or observations are obvious.
Under (ii), Assumption 69(i) is satisfied with C as above and κ(dx) = h(x) dx.
Denote by φ the probability density of the random variables Vk . Then g(x, y) =
φ(y − x). The density φ may be chosen such that suppφ ⊆ [−M, +M ], so that
g(x, y) > 0 if and only if x ∈ [y − M, y + M ]. To verify Assumption 69(iii), put
Γ = [−A, A]. For y ∈ Γ, we then have g(x, y) = 0 if x 6∈ [−A − M, A + M ], and thus
Z Z A+M
00 00 00 0 00
q(x, x )g(x , y)q(x , x ) dx = q(x, x00 )g(x00 , y)q(x00 , x0 ) dx00 .
−A−M
0
This implies that for all (x, x ) ∈ X × X,
q(x, x00 )g(x00 , y)q(x00 , x0 ) dx00
R
RC =1.
X
q(x, x00 )g(x00 , y)q(x00 , x0 ) dx00
The bounded noise case is of course very specific, because an observation Yk allows
locating the corresponding state Xk within a bounded set.
Under assumption 69, the lemma below establishes that the set C is a 1-small set
for the forward transition kernels Fk|n [yk+1:n ] and that it is also uniformly accessible
from the whole space X (for the same kernels).
Lemma 71. Under Assumption 69, the following hold true.
(i) For any initial
R probability measure ν on (X, X ) and any sequence y0:n ∈ Yn+1
satisfying C ν(dx0 ) g(x0 , y0 ) > 0,
Lν,n (y0:n ) > 0 .

(ii) For any non-negative integers k and n such that k < n and any y0:n ∈ Yn+1 ,
the set C is a 1-small set for the transitions kernels Fk|n . Indeed there exists a
transition kernel λk,n from (Y(n−k) , Y ⊗(n−k) ) to (X, X ) such that for all x ∈ C,
yk+1:n ∈ Yn−k and A ∈ X ,
σ−
Fk|n [yk+1:n ](x, A) ≥ λk,n [yk+1:n ](A) .
σ+
(iii) For any non-negative integers k and n such that n ≥ 2 and k < n − 1, and any
yk+1:n ∈ Yn−k ,
inf Fk|n [yk+1:n ](x, C) ≥ α(yk+1 ) .
x∈X

Proof. Write
Z Z n
Y
Lν,n (y0:n ) = ··· ν(dx0 ) g(x0 , y0 ) Q(xi−1 , dxi ) g(xi , yi )
i=1
Z Z n
Q(xi−1 , dxi ) g(xi , yi )1C (xi−1 )
Y
≥ ··· ν(dx0 ) g(x0 , y0 )
i=1
Z n Z
− n
 Y
≥ ν(dx0 ) g(x0 , y0 ) σ g(xi , yi ) κ(dxi ) ,
C i=1 C
53

showing part (i). The proof of (ii) is similar to that of Lemma 60(iii). For (iii),
write

Fk|n [yk+1:n ](x, C)


ρ[x, xk+2 ; yk+1 ](xk+1 )1C (xk+1 )ϕ[yk+2:n ](xk+2 ) κ(dxk+1:k+2 )
RR
= RR
ρ[x, xk+2 ; yk+1 ](xk+1 )ϕ[yk+2:n ](xk+2 ) κ(dxk+1:k+2 )
RR
Φ[yk+1 ](x, xk+2 )ρ[x, xk+2 ; yk+1 ](xk+1 )ϕ[yk+2:n ](xk+2 ) κ(dxk+1:k+2 )
= RR .
ρ[x, xk+2 ; yk+1 ](xk+1 )ϕ[yk+2:n ](xk+2 ) κ(dxk+1:k+2 )

where ρ is defined in (3.40) and

ϕ[yk+2:n ](xk+2 ) = g(xk+2 , yk+2 )βk+2|n [yk+3:n ](xk+2 ) ,


ρ[x, xk+2 ; yk+1 ](xk+1 )1C (xk+1 ) κ(dxk+1 )
R
Φ[yk+1 ](x, xk+2 ) = R .
ρ[x, xk+2 ; yk+1 ](xk+1 ) κ(dxk+1 )

Under Assumption 69, Φ(x, x0 ; y) ≥ α(y) for all (x, x0 ) ∈ X × X and y ∈ Y, which
concludes the proof.

The corollary below then shows that the whole set X is a 1-small set for the
composition Fk|n [yk+1:n ]Fk+1|n [yk+2:n ]. This generalizes a well-known result for
homogeneous Markov chains (see Proposition 157).

Corollary 72. Under Assumption 69, for positive indices 2 ≤ k ≤ n,

bk/2c−1 
σ−
Y 
φν,k|n [y0:n ] − φν 0 ,k|n [y0:n ] ≤2 1− α(y2j+1 ) .
TV
j=0
σ+

Proof. Because of Lemma 71(i), we may use the decomposition in (3.26) with j = 0
bounding the total variation distance by 2 to obtain
k−1
Y 
φν,k|n [y0:n ] − φν 0 ,k|n [y0:n ] TV
≤2 δ Fj|n [yj+1:n ] .
j=0

Now, using assertions (ii) and (iii) of Lemma 71,

Fj|n [yj+1:n ]Fj+1|n [yj+2:n ](x, A)


Z
≥ Fj|n [yj+1:n ](x, dx0 )Fj+1|n [yj+2:n ](x0 , A)
C
σ−
≥ α(yj+1 ) λj+1,n [yj+2:n ](A) ,
σ+
for all x ∈ X and A ∈ X . Hence the composition Fj|n [yj+1:n ]Fj+1|n [yj+2:n ] satis-
fies Doeblin’s condition (Assumption 50) and the proof follows by Application of
Lemma 51.

Corollary 72 is only useful in cases where the function α is such that the obtained
bound indeed decreases as k and n grow. In Example 70, one could set α(y) = 1Γ (y),
for an interval Γ. In such a case, it suffices that the joint chain {Xk , Yk }k≥0 be
recurrent under Pν? —which was the case in Example 70—to guarantee that 1Γ (Yk )
equals one infinitely often and thus that φν,k|n [Y0:n ] − φν 0 ,k|n [Y0:n ] TV tends to
zero Pν? -almost surely as k, n → ∞. The following example illustrates a slightly
more complicated situation in which Assumption 69 still holds.
54CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

Example 73 (Non-Gaussian Autoregressive Process in Gaussian Noise). In this


example, we consider a first-order non-Gaussian autoregressive process, observed in
Gaussian noise. This is a practically relevant example for which there is apparently
no results on forgetting available in the literature. The model is thus

Xk+1 = φXk + Uk , X0 ∼ ν ,
Yk = Xk + Vk ,

where

(i) {Uk }k≥0 is an i.i.d. sequence of random variables with Laplace (double expo-
nential) distribution with scale parameter λ;

(ii) {Vk }k≥0 is an i.i.d. sequence of Gaussian random variable with zero mean and
variance σ 2 .

We will see below that the fact that the tails of the Xs are heavier than the tails of
the observation noise is important for the derivations that follow. It is assumed that
|φ| < 1, which implies that the chain {Xk } is positive recurrent, that is, admits a sin-
gle invariant probability measure π. It may be shown (see Chapter 7) that although
the Markov chain {Xk } is geometrically ergodic, that is, kQn (x, ·) − πkTV → 0 geo-
metrically fast, it is not uniformly ergodic as lim inf n→∞ supx∈R kQn (x, ·) − πkTV >
0. We will nevertheless see that the forward smoothing kernel is uniformly geomet-
rically ergodic.
Under the stated assumptions,

1
q(x, x0 ) = exp (−λ|x0 − φx|) ,

(y − x)2
 
1
g(x, y) = √ exp − .
2πσ 2σ 2

Here we set, for some M > 0 to be specified later, C = [−M − 1/2, M + 1/2], and
we let y ∈ [−1/2, +1/2]. Note that

R M +1/2
−M −1/2
exp(−λ|u − φx| − |y − u|2 /2σ 2 − λ|x0 − φu|) du
R∞
−∞
exp(−λ|u − φx| − |y − u|2 /2σ 2 − λ|x0 − φu|) du
RM
exp(−λ|u − x| − u2 /2σ 2 − φλ|x0 − u|) du
≥ R−M
∞ ,
−∞
exp(−λ|u − x| − u2 /2σ 2 − φλ|x0 − u|) du

and to show Assumption 69(iii) it suffices to show that the right-hand side is
bounded from below. This in turn is equivalent to showing that sup(x,x0 )∈R×R R(x, x0 ) <
1, where
R R ∞
−M
−∞
+ M
exp(−α|u − x| − βu2 − γ|x0 − u|) du
0
R(x, x ) = R ∞ (3.41)
−∞
exp(−α|u − x| − βu2 − γ|x0 − u|) du

with α = λ, β = 1/2σ 2 and γ = φλ.


To do this, first note that any M > 0 we have sup{R(x, x0 ) : |x| ≤ M, |x0 | ≤
M } < 1, and we thus only need to study the behavior of this quantity when x
and/or x0 become large. We first show that

lim sup sup R(x, x0 ) < 1 . (3.42)


M →∞ x≥M, |x0 |≤M
55

For this we note that for |x0 | ≤ M and x ≥ M , it holds that


Z x Z ∞ 
exp −α|x − u| − βu2 − γ(u − x0 ) du
 
+
M x
exp[−βM 2 + (α − γ)M ] exp(−βx2 − γx)
≤ e−αx eγM + eγM ,
2βM − (α − γ) 2βx + (γ + α)
where we used the bound
Z ∞
exp(λu − βu2 ) du ≤ (2βy − λ) exp(−βy 2 + λy) ,
y

which holds as soon as 2βy − λ ≥ 0. Similarly, we have


Z −M
exp −α(x − u) − βu2 − γ(x0 − u) du
 
−∞
exp[−βM 2 − (γ + α)M ]
≤ e−αx eγM ,
2βM + (γ + α)

Z M
exp −α(x − u) − βu2 − γ|u − x0 | du
 
−M
Z M
≥ e−2γM e−αx exp(−βu2 + αu) du .
−M

Thus, (3.41) is bounded by


2 exp[−βM 2 + (α − γ)M ] exp[−βx2 + (α − γ)x]
+ sup
2βM + γ − α x≥M βx + (γ + α)
e3γM RM
−M
exp(−βu2 + αu) du
proving (3.42).
Next we show that
lim sup sup R(x, x0 ) < 1 . (3.43)
M →∞ x≥M, x0 ≥M

We consider the case M ≤ x ≤ x0 ; the other case can be handled similarly. The
denominator in (3.41) is then bounded by
Z M
−αx−γx0
e exp(−βu2 + (α + γ)u) du .
−M

The two terms in the numerator are bounded by, respectively,


Z −M
exp −α(x − u) − βu2 − γ(x0 − u) du
 
−∞
0 exp[−βM 2 − (α + γ)M ]
≤ e−αx−γx
2βM + α + γ
and
Z ∞
exp −α|x − u| − βu2 − γ|x0 − u| du

M
exp[−βM 2 + (α + γ)M ]
0
≤ e−αx−γx
2βM − α − γ
exp(−βx + γx − γx0 ) exp[−β(x0 )2 + αx − αx0 ]
2
+ + ,
2βx − γ + α 2βx0 + α + γ
56CHAPTER 3. FORGETTING OF THE INITIAL CONDITION AND FILTER STABILITY

and (3.43) follows by combining the previous bounds.


We finally have to check that

lim sup sup R(x, x0 ) < 1 .


M →∞ x0 ≤−M, x≥M

This can be done along the same lines.


Chapter 4

Sequential Monte Carlo


Methods

The use of Monte Carlo methods for non-linear filtering can be traced back to the
pioneering contributions of Handschin and Mayne (1969) and Handschin (1970).
These early attempts were based on sequential versions of the importance sampling
paradigm, a technique that amounts to simulating samples under an instrumental
distribution and then approximating the target distributions by weighting these
samples using appropriately defined importance weights. In the non-linear filtering
context, importance sampling algorithms can be implemented sequentially in the
sense that, by defining carefully a sequence of instrumental distributions, it is not
needed to regenerate the population of samples from scratch upon the arrival of
each new observation. This algorithm is called sequential importance sampling,
often abbreviated SIS. Although the SIS algorithm has been known since the early
1970s, its use in non-linear filtering problems was rather limited at that time. Most
likely, the available computational power was then too limited to allow convincing
applications of these methods. Another less obvious reason is that the SIS algorithm
suffers from a major drawback that was not clearly identified and properly cured
until the seminal paper by Gordon et al. (1993). As the number of iterations
increases, the importance weights tend to degenerate, a phenomenon known as
sample impoverishment or weight degeneracy. Basically, in the long run most of the
samples have very small normalized importance weights and thus do not significantly
contribute to the approximation of the target distribution. The solution proposed
by Gordon et al. (1993) is to allow rejuvenation of the set of samples by duplicating
the samples with high importance weights and, on the contrary, removing samples
with low weights.

The particle filter of Gordon et al. (1993) was the first successful application of
sequential Monte Carlo techniques to the field of non-linear filtering. Since then,
sequential Monte Carlo (or SMC) methods have been applied in many different
fields including computer vision, signal processing, control, econometrics, finance,
robotics, and statistics (Doucet et al., 2001; Ristic et al., 2004). This chapter reviews
the basic building blocks that are needed to implement a sequential Monte Carlo
algorithm, starting with concepts related to the importance sampling approach.
More specific aspects of sequential Monte Carlo techniques will be further discussed
in Chapter ??, while convergence issues will be dealt with in Chapter ??.

57
58 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

4.1 Importance Sampling and Resampling


4.1.1 Importance Sampling
Importance sampling is a method that dates back to, at least, Hammersley and
Handscomb (1965) and that is commonly used in several fields (for general references
on importance sampling, see Glynn and Iglehart, 1989, Geweke, 1989, Evans and
Swartz, 1995, or Robert and Casella, 2004.)
Throughout this section, µ will denote a probability measure of interest on a
measurable space (X, X ), which we shall refer to as the target distribution.
R As in
Chapter ??, the aim is to approximate integrals of the form µ(f ) = X f (x) µ(dx)
for real-valued measurable functions f . The Monte Carlo approach exposed in
Section ?? consists in drawing an i.i.d. sample ξ 1 , . . . , ξ N from the probability
PN
measure µ and then evaluating the sample mean N −1 i=1 f (ξ i ). Of course, this
technique is applicable only when it is possible (and reasonably simple) to sample
from the target distribution µ.
Importance sampling is based on the idea that in certain situations it is more
appropriate to sample from an instrumental distribution ν, and then to apply a
change-of-measure formula to account for the fact that the instrumental distribution
is different from the target distribution. More formally, assume that the target
probability measure µ is absolutely continuous with respect to an instrumental
probability measure ν from which sampling is easily feasible. Denote by dµ/dν
the Radon-Nikodym derivative of µ with respect to ν. Then for any µ-integrable
function f , Z Z

µ(f ) = f (x) µ(dx) = f (x) (x) ν(dx) . (4.1)

In particular, if ξ 1 , ξ 2 , . . . is an i.i.d. sample from ν, (4.1) suggests the following
estimator of µ(f ):
N
−1
X dµ
µ̃IS
ν,N (f ) = N f (ξ i ) (ξ i ) . (4.2)
i=1

Because this estimator is the sample mean of independent random variables, there
is a range of results to assess the quality of µ̃IS
ν,N (f ) as an estimator of µ(f ). First
of all, the strong law of large number implies that µ̃ISν,N (f ) converges to µ(f ) almost
surely as N tends to infinity. In addition, the central limit theorem for i.i.d. vari-
ables (or deviation inequalities) may serve as a guidance for selecting the proposal
distribution ν, beyond the obvious requirement that it should dominate the target
distribution µ. We postpone this issue and, more generally, considerations that
pertain to the behavior of the approximation for large values of N to Chapter ??.
In many situations, the target probability measure µ or the instrumental prob-
ability measure ν is known only up to a normalizing factor. As already discussed
in Remark ??, this is particularly true when applying importance sampling ideas
to HMMs and, more generally, in Bayesian statistics. The Radon-Nikodym deriva-
tive dµ/dν is then known up to a (constant) scaling factor only. It is however still
possible to use the importance sampling paradigm in that case, by adopting the
self-normalized form of the importance sampling estimator,
PN i dµ i
i=1 f (ξ ) dν (ξ )
bIS
µ ν,N (f ) = PN dµ i . (4.3)
i=1 dν (ξ )

This quantity is obviously free from any scale factor in dµ/dν. The self-normalized
importance sampling estimator µ bIS
ν,N (f ) is defined as a ratio of the sample means of
the functions f1 = f × (dµ/dν) and f2 = dµ/dν. The strong law of large numbers
PN PN
thus implies that N −1 i=1 f1 (ξ i ) and N −1 i=1 f2 (ξ i ) converge almost surely, to
4.1. IMPORTANCE SAMPLING AND RESAMPLING 59

bIS
µ(f1 ) and ν(dµ/dν) = 1, respectively, showing that µν,N (f ) is a consistent estimator
of µ(f ). Again, more precise results on the behavior of this estimator will be given
in Chapter ??. In the following, the term importance sampling usually refers to the
self-normalized form (4.3) of the importance sampling estimate.

4.1.2 Sampling Importance Resampling


Although importance sampling is primarily intended to overcome difficulties with
direct sampling from µ when approximating integrals of the form µ(f ), it can also be
used for (approximate) sampling from the distribution µ. The latter can be achieved
by the sampling importance resampling (or SIR) method due to Rubin (1987, 1988).
Sampling importance resampling is a two-stage procedure in which importance sam-
pling as discussed below is followed by an additional random sampling step. In the
first stage, an i.i.d. sample (ξ˜1 , . . . , ξ˜M ) is drawn from the instrumental distribution
ν, and one computes the normalized version of the importance weights,

(ξ˜i )
ω i = PMdν dµ ˜i
, i = 1, . . . , M . (4.4)
i=1 dν (ξ )

In the second stage, the resampling stage, a sample of size N denoted by ξ 1 , . . . , ξ N


is drawn from the intermediate set of points ξ˜1 , . . . , ξ˜M , taking into account the
weights computed in (4.4). The rationale is that points ξ˜i for which ω i in (4.4) is
large are most likely under the target distribution µ and should thus be selected
with higher probability during the resampling than points with low (normalized)
importance weights. This principle is illustrated in Figure 4.1.

TARGET

TARGET

Figure 4.1: Principle of resampling. Top plot: the sample drawn from ν with as-
sociated normalized importance weights depicted by bullets with radii proportional
to the normalized weights (the target density corresponding to µ is plotted in solid
line). Bottom plot: after resampling, all points have the same importance weight,
and some of them have been duplicated (M = N = 7).

There are several ways of implementing this basic idea, the most obvious ap-
proach being sampling with replacement with probability of sampling each ξ i equal
60 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

to the importance weight ω i . Hence the number of times N i each particular point
ξ˜i in the first-stage sample is selected follows a binomial Bin(N, ω i ) distribution.
The vector (N 1 , . . . , N M ) is distributed from Mult(N, ω 1 , . . . , ω M ), the multinomial
distribution with parameter N and probabilities of success (ω 1 , . . . , ω M ). In this
resampling step, the points in the first-stage sample that are associated with small
normalized importance weights are most likely to be discarded, whereas the best
points in the sample are duplicated in proportion to their importance weights. In
most applications, it is typical to choose M , the size of the first-stage sample, larger
(and sometimes much larger) than N . The SIR algorithm is summarized below.

Algorithm 74 (SIR: Sampling Importance Resampling). Sampling: Draw an i.i.d.


sample ξ˜1 , . . . , ξ˜M from the instrumental
distribution ν.

Weighting: Compute the (normalized) importance weights



(ξ˜i )
ω i = PMdν dµ ˜j
for i = 1, . . . , M .
j=1 dν (ξ )

Resampling:
• Draw, conditionally independently given (ξ˜1 , . . . , ξ˜M ), N discrete ran-
dom variables (I 1 , . . . , I N ) taking values in the set {1, . . . , M } with prob-
abilities (ω 1 , . . . , ω M ), i.e.,

P(I 1 = j) = ω j , j = 1, . . . , M . (4.5)
i
• Set, for i = 1, . . . , N , ξ i = ξ˜I .

The set (I 1 , . . . , I N ) is thus a multinomial trial process. Hence, this method of


selection is known as the multinomial resampling scheme.
At this point, it may not be obvious that the sample ξ 1 , . . . , ξ N obtained from
Algorithm 74 is indeed (approximately) i.i.d. from µ in any suitable sense. In
Chapter ??, it will be shown that the sample mean of the draws obtained using the
SIR algorithm,
N
1 X
µ̂SIR
ν,M,N (f ) = f (ξ i ) , (4.6)
N i=1

is a consistent estimator of µ(f ) for all functions f satisfying µ(|f |) < ∞. The
resampling step might thus be seen as a means to transform the weighted importance
sampling estimate µ bIS
ν,M (f ) defined by (4.3) into an unweighted sample average.
Recall that N is the number of times that the element ξ˜i is resampled. Rewriting
i

N M
1 X X N i ˜i
µ̂SIR
ν,M,N (f ) = f (ξ i ) = f (ξ ) ,
N i=1 i=1
N

it is easily seen that the sample mean µ̂SIRν,M,N (f ) of the SIR sample is, conditionally
˜1 ˜M
on the first-stage sample (ξ , . . . , ξ ), equal to the importance sampling estimator
bIS
µ ν,M (f ) defined in (4.3),
h i
E µ̂SIR ˜1 ˜M = µ
bIS
ν,M,N (f ) ξ , . . . , ξ ν,M (f ) .

As a consequence, the mean squared error of the SIR estimator µ̂SIR


ν,M,N (f ) is always
larger than that of the importance sampling estimator (4.3) due to the well-known
4.2. SEQUENTIAL IMPORTANCE SAMPLING 61

variance decomposition
h 2 i
E µ̂SIR
ν,M,N (f ) − µ(f )
h 2 i h 2 i
= E µ̂SIR bIS
ν,M,N (f ) − µν,M (f ) +E µbIS
ν,M (f ) − µ(f ) .

The variance E[(µ̂SIR bIS


ν,M,N (f ) − µ
2
ν,M (f )) ] may be interpreted as the price to pay
for converting the weighted importance sampling estimate into an unweighted ap-
proximation.
Showing that the SIR estimate (4.6) is a consistent and asymptotically normal
estimator of µ(f ) is not a trivial task, as ξ 1 , . . . , ξ N are no more independent due
to the normalization of the weights followed by resampling. As such, the elemen-
tary i.i.d. convergence results that underlie the theory of the importance sampling
estimator are of no use, and we refer to Section ?? for the corresponding proofs.
Remark 75. A closer examination of the numerical complexity of Algorithm 74
reveals that whereas all steps of the algorithm have a complexity that grows in
proportion to M and N , this is not quite true for the multinomial sampling step
whose numerical complexity is, a priori, growing faster than N (about N log2 M —
see Section 4.4.1 below for details). This is very unfortunate, as we know from
elementary arguments discussed in Section ?? that Monte Carlo methods are most
useful when N is large (or more appropriately that the quality of the approximation
improves rather slowly as N grows).
A clever use of elementary probabilistic results however makes it possible to de-
vise methods for sampling N times from a multinomial distribution with M possible
outcomes using a number of operations that grows only linearly with the maximum
of N and M . In order not to interrupt our exposition of sequential Monte Carlo, the
corresponding algorithms are discussed in Section 4.4.1 at the end of this chapter.
Note that we are here only discussing implementations issues. There are however
different motivations, also discussed in Section 4.4.2, for adopting sampling schemes
other than multinomial sampling.

4.2 Sequential Importance Sampling


4.2.1 Sequential Implementation for HMMs
We now specialize the sampling techniques considered above to hidden Markov
models. As in previous chapters, we adopt the hidden Markov model as specified
by Definition 12 where Q denotes the Markov transition kernel of the hidden chain,
ν is the distribution of the initial state X0 , and g(x, y) (for x ∈ X, y ∈ Y) denotes
the transition density function of the observation given the state, with respect to
the measure µ on (Y, Y). To simplify the mathematical expressions, we will also
use the shorthand notation gk (·) = g(·, Yk ) introduced in Section 2.1.4. We denote
the joint smoothing distribution by φ0:k|k , omitting the dependence with respect to
the initial distribution ν, which does not play an important role here. According to
(??), the joint smoothing distribution may be updated recursively in time according
to the relations
R
f (x0 ) g0 (x0 ) ν(dx0 )
φ0 (f ) = R for all f ∈ Fb (X) ,
g0 (x0 ) ν(dx0 )
Z Z
φ0:k+1|k+1 (fk+1 ) = ··· fk+1 (x0:k+1 ) φ0:k|k (dx0:k )Tku (xk , dxk+1 )

for all fk+1 ∈ Fb Xk+2 , (4.7)



62 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

where Tku is the transition kernel on (X, X ) defined by


 −1 Z
Lk+1
Tku (x, f ) = f (x0 ) Q(x, dx0 )gk+1 (x0 )
Lk
for all x ∈ X, f ∈ Fb (X) . (4.8)
The superscript “u” (for “unnormalized”) in the notation Tku is meant to highlight
the fact that Tku is not a probability transition kernel. This distinction is important
here because the normalized version Tk = Tku /Tku (1) of the kernel will play an
important role in the following. Note that except in some special cases discussed in
Chapter ??, the likelihood ratio Lk+1 /Lk can generally not be computed in closed
form, rendering analytic evaluation of Tku or φ0:k|k hopeless. The rest of this section
reviews importance sampling methods that make it possible to approximate φ0:k|k
recursively in k.
First, because importance sampling can be used when the target distribution
is known only up to a scaling factor, the presence of non-computable constants
such as Lk+1 /Lk does not preclude the use of the algorithm. Next, it is convenient
to choose the instrumental distribution as the probability measure associated with
a possibly non-homogeneous Markov chain on X. As seen below, this will make
it possible to derive a sequential version of the importance sampling technique.
Let {Rk }k≥0 denote a family of Markov transition kernels on (X, X ) and let ρ0
denote a probability measure on (X, X ). Further denote by {ρ0:k }k≥0 the family of
probability measures associated with the inhomogeneous Markov chain with initial
distribution ρ0 and transition kernels {Rk }k≥0 ,
Z Z k−1
def
Y
ρ0:k (fk ) = ··· fk (x0:k ) ρ0 (dx0 ) Rl (xl , dxl+1 ) .
l=0

In this context, the kernels Rk will be referred to as the instrumental kernels. The
term importance kernel is also used. The following assumptions will be adopted in
the sequel.
Assumption 76 (Sequential Importance Sampling). 1. The target distribution
φ0 is absolutely continuous with respect to the instrumental distribution ρ0 .
2. For all k ≥ 0 and all x ∈ X, the measure Tku (x, ·) is absolutely continuous with
respect to Rk (x, ·).

Then for any k ≥ 0 and any function fk ∈ Fb Xk+1 ,
(k−1 )
Z Z
dφ0 Y dT u (xl , ·)
l
φ0:k|k (fk ) = · · · fk (x0:k ) (x0 ) (xl+1 ) ρ0:k (dx0:k ) , (4.9)
dρ0 dRl (xl , ·)
l=0

which implies that the target distribution φ0:k|k is absolutely continuous with re-
spect to the instrumental distribution ρ0:k with Radon-Nikodym derivative given
by
k−1
Y dT u (xl , ·)
dφ0:k|k dφ0 l
(x0:k ) = (x0 ) (xl+1 ) . (4.10)
dρ0:k dρ0 dRl (xl , ·)
l=0
It is thus legitimate to use ρ0:k as an instrumental distribution to compute im-
portance sampling estimates for integrals with respect to φ0:k|k . Denoting by
1 N
ξ0:k , . . . , ξ0:k N i.i.d. random sequences with common distribution
 ρ0:k , the im-
portance sampling estimate of φ0:k|k (fk ) for fk ∈ Fb Xk+1 is defined as
PN i i
i=1 ωk fk (ξ0:k )
φ̂IS (f
0:k|k k ) = PN , (4.11)
i
i=1 ωk
4.2. SEQUENTIAL IMPORTANCE SAMPLING 63

where ωki are the unnormalized importance weights defined recursively by

dφ0 i
ω0i = (ξ ) for i = 1, . . . , N , (4.12)
dρ0 0

and, for k ≥ 0,

i dTku (ξki , ·) i
ωk+1 = ωki (ξ ) for i = 1, . . . , N . (4.13)
dRk (ξki , ·) k+1

The multiplicative decomposition of the (unnormalized) importance weights in (4.13)


implies that these weights may be computed recursively in time as successive ob-
servations become available. In the sequential Monte Carlo literature, the update
factor dTku /dRk is often called the incremental weight. As discussed previously in
Section 4.1.1, the estimator in (4.11) is left unmodified if the weights, or equiva-
lently the incremental weights, are evaluated up to a constant only. In particular,
one may omit the problematic scaling factor Lk+1 /Lk that we met in the definition
of Tku in (4.8). The practical implementation of sequential importance sampling
thus goes as follows.

Algorithm 77 (SIS: Sequential Importance Sampling). Initial State: Draw an i.i.d.


sample ξ01 , . . . , ξ0N from ρ0 and set

dν i
ω0i = g0 (ξ0i ) (ξ ) for i = 1, . . . , N .
dρ0 0

Recursion: For k = 0, 1, . . . ,

1 N j
• Draw (ξk+1 , . . . , ξk+1 ) conditionally independently given {ξ0:k , j = 1, . . . , N }
i
from the distribution ξk+1 ∼ Rk (ξki , ·). Append ξk+1
i i
to ξ0:k to form
i i i
ξ0:k+1 = (ξ0:k , ξk+1 ).
• Compute the updated importance weights

i dQ(ξki , ·) i
ωk+1 = ωki × gk+1 (ξk+1
i
) (ξ ), i = 1, . . . , N .
dRk (ξki , ·) k+1

At any iteration index k importance sampling estimates may be evaluated according


to (4.11).

An important feature of Algorithm 77, which corresponds to the method origi-


nally proposed in Handschin and Mayne (1969) and Handschin (1970), is that the
1 N
N trajectories ξ0:k , . . . , ξ0:k are independent and identically distributed for all time
indices k. Following the terminology in use in the non-linear filtering community,
we shall refer to the sample at time index k, ξk1 , . . . , ξkN , as the population (or sys-
i
tem) of particles and to ξ0:k for a specific value of the particle index i as the history
(or trajectory) of the ith particle. The principle of the method is illustrated in
Figure 4.2.

4.2.2 Choice of the Instrumental Kernel


Before discussing in Section 4.3 a serious drawback of Algorithm 77 that needs to
be fixed in order for the method to be applied to any problem of practical interest,
we examine strategies that may be helpful in selecting proper instrumental kernels
Rk in several models (or families of models) of interest.
64 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

FILT.

INSTR.

FILT. +1

Figure 4.2: Principle of sequential importance sampling (SIS). Upper plot: the curve
represents the filtering distribution, and the particles with weights are represented
along the axis by bullets, the radii of which being proportional to the normalized
weight of the particle. Middle plot: the instrumental distribution with resampled
particle positions. Bottom plot: filtering distribution at the next time index with
particle updated weights. The case depicted here corresponds to the choice Rk = Q.

Prior Kernel
The first obvious and often very simple choice of instrumental kernel Rk is that
of setting Rk = Q (irrespectively of k). In that case, the instrumental kernel
simply corresponds to the prior distribution of the new state in the absence of the
corresponding observation. The incremental weight then simplifies to

dTku (x, ·) 0 Lk
(x ) = gk+1 (x0 ) ∝ gk+1 (x0 ) for all (x, x0 ) ∈ X2 . (4.14)
dQ(x, ·) Lk+1

A distinctive feature of the prior kernel is that the incremental weight in (4.14)
does not depend on x, that is, on the previous position. The use of the prior kernel
Rk = Q is popular because sampling from the prior kernel Q is often straightforward,
and computing the incremental weight simply amounts to evaluating the conditional
likelihood of the new observation given the current particle position. The prior
kernel also satisfies the minimal requirement of importance sampling as stated in
Assumption 76. In addition, because the importance function reduces to gk+1 , it is
upper-bounded as soon as one can assume that supx∈X,y∈Y g(x, y) is finite, which
(often) is a very mild condition (see also Section ??). Despite these appealing
properties, the use of the prior kernel can sometimes lead to poor performance,
often manifesting itself as a lack of robustness with respect to the values taken by
the observed sequence {Yk }k≥0 . The following example illustrates this problem in
a very simple situation.

Example 78 (Noisy AR(1) Model). To illustrate the potential problems associated


with the use of the prior kernel, Pitt and Shephard (1999) consider the simple model
where the observations arise from a first-order linear autoregression observed in
4.2. SEQUENTIAL IMPORTANCE SAMPLING 65

noise,

Xk+1 = φXk + σU Uk , Uk ∼ N(0, 1) ,


Yk = Xk + σV Vk , Vk ∼ N(0, 1) ,
2
where φ = 0.9, σU = 0.01, σV2 = 1 and {Uk }k≥0 and {Vk }k≥0 are independent
Gaussian white noise processes. The initial distribution ν is the stationary distri-
bution of the Markov chain {Xk }k≥0 , that is, normal with zero mean and variance
2
σU /(1 − φ2 ).
In the following, we assume that n = 5 and simulate the first five observations
from the model, whereas the sixth observation is set to the arbitrary value 20. The
observed series is

(−0.652, −0.345, −0.676, 1.142, 0.721, 20) .

The last observation is located 20 standard deviations away from the mean (zero)
of the stationary distribution, which definitively corresponds to an aberrant value
from the model’s point of view. In a practical situation however, we would of course
like to be able to handle also data that does not necessarily come from the model
under consideration. Note also that in this toy example, one can evaluate the exact
smoothing distributions by means of the Kalman filtering recursion discussed in
Section ??.
Figure 4.3 displays box and whisker plots for the SIS estimate of the posterior
mean of the final state X5 as a function of the number N of particles when using
the prior kernel. These plots have been obtained from 125 independent replications
of the SIS algorithm. The vertical line corresponds to the true posterior mean of
X5 given Y0:5 , computed using the Kalman filter. The figure shows that the SIS
algorithm with the prior kernel grossly underestimates the values of the state even
when the number of particles is very large. This is a case where there is a conflict be-
tween the prior distribution and the posterior distribution: under the instrumental
distribution, all particles are proposed in a region where the conditional likelihood
function g5 is extremely low. In that case, the renormalization of the weights used
to compute the filtered mean estimate according to (4.11) may even have unexpect-
edly adverse consequences: a weight close to 1 does not necessarily correspond to
a simulated value that is important for the distribution of interest. Rather, it is
a weight that is large relative to other, even smaller weights (of particles even less
important for the filtering distribution). This is a logical consequence of the fact
that the weights must sum to one.

Optimal Instrumental Kernel


The mismatch between the instrumental distribution and the posterior distribution
observed in the previous example is the type of problem that one should try to
alleviate by a proper choice of the instrumental kernel. An interesting choice to
address this problem is the kernel
f (x0 ) Q(x, dx0 )gk+1 (x0 )
R
Tk (x, f ) = R for x ∈ X, f ∈ Fb (X), (4.15)
Q(x, dx0 )gk+1 (x0 )

which is just Tku defined in (4.8) properly normalized to correspond to a Markov


transition kernel (that is, Tk (x, 1) = 1 for all x ∈ X). The kernel Tk may be inter-
preted as a regular version of the conditional distribution of the hidden state Xk+1
given Xk and the current observation Yk+1 . In the sequel, we will refer to this kernel
as the optimal kernel, following the terminology found in the sequential importance
sampling literature. This terminology dates back probably to Zaritskii et al. (1975)
66 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

and Akashi and Kumamoto (1977) and is largely adopted by authors such as Liu
and Chen (1995), Chen and Liu (2000), Doucet et al. (2000), Doucet et al. (2001)
and Tanizaki (2003). The word “optimal” is somewhat misleading, and we refer to
Chapter ?? for a more precise discussion of optimality of the instrumental distribu-
tion in the context of importance sampling (which generally has to be defined for
a specific choice of the function f of interest). The main property of Tk as defined
in (4.15) is that

dTku (x, ·) 0 Lk
(x ) = γk (x) ∝ γk (x) for (x, x0 ) ∈ X2 , (4.16)
dTk (x, ·) Lk+1

where γk (x) is the denominator of Tk in (4.15):


Z
def
γk (x) = Q(x, dx0 )gk+1 (x0 ) . (4.17)

Equation (4.16) means that the incremental weight in (4.13) now depends on the
previous position of the particle only (and not on the new position proposed at index
k + 1). This is the exact opposite of the situation observed previously for the prior
kernel. The optimal kernel (4.15) is attractive because it incorporates information
both on the state dynamics and on the current observation: the particles move
“blindly” with the prior kernel, whereas they tend to cluster into regions where
the current local likelihood gk+1 is large when using the optimal kernel. There are
however two problems with using Tk in practice. First, drawing from this kernel
is usually not directly feasible. Second, calculation of the incremental importance
weight γk in (4.17) may be analytically intractable. Of course, the optimal kernel
takes a simple form with easy simulation and explicit evaluation of (4.17) in the
particular cases discussed in Chapter ??. It turns out that it can also be evaluated
for a slightly larger class of non-linear Gaussian state-space models, as soon as the
observation equation is linear (Zaritskii et al., 1975). Indeed, consider the state-
space model with non-linear state evolution equation

Xk+1 = A(Xk ) + R(Xk )Uk , Uk ∼ N(0, I) , (4.18)


Yk = BXk + SVk , Vk ∼ N(0, I) , (4.19)

where A and R are matrix-valued functions of appropriate dimensions. By applica-


tion of Proposition ??, the conditional distribution of the state vector Xk+1 given
Xk = x and Yk+1 is multivariate Gaussian with mean mk+1 (x) and covariance
matrix Σk+1 (x), given by
−1
Kk+1 (x) = R(x)Rt (x)B t BR(x)Rt (x)B t + SS t

,
mk+1 (x) = A(x) + Kk+1 (x) [Yk+1 − BA(x)] ,
Σk+1 (x) = [I − Kk+1 (x)B] R(x)Rt (x) .
i
Hence new particles ξk+1 need to be simulated from the distribution

N mk+1 (ξki ), Σk+1 (ξki ) ,



(4.20)

and the incremental weight for the optimal kernel is proportional to


Z
γk (x) = q(x, x0 )gk+1 (x0 ) dx0 ∝
 
−1/2 1 t −1
|Γk+1 (x)| exp − [Yk+1 − BA(x)] Γk+1 (x) [Yk+1 − BA(x)]
2
4.2. SEQUENTIAL IMPORTANCE SAMPLING 67

where
Γk+1 (x) = BR(x)Rt (x)B t + SS t .
In other situations, sampling from the kernel Tk and/or computing the normalizing
constant γk is a difficult task. There is no general recipe to solve this problem, but
rather a set of possible solutions that should be considered.
Example 79 (Noisy AR(1) Model, Continued). We consider the noisy AR(1) model
of Example 78 again using the optimal importance kernel, which corresponds to the
particular case where all variables are scalar and A and R are constant in (4.18)–
(4.19) above. Thus, the optimal instrumental transition density is given by
 2 2   2 2

σU σV φx Yk σU σV
tk (x, ·) = N 2 + σ2 2 + σ2 , 2
σU V σU V σU + σV2

and the incremental importance weights are proportional to

1 (Yk − φx)2
 
γk (x) ∝ exp − 2 + σ2 .
2 σU V

Figure 4.4 is the exact analog of Figure 4.3, also obtained from 125 independent
runs of the algorithm, for this new choice of instrumental kernel. The figure shows
that whereas the SIS estimate of posterior mean is still negatively biased, the op-
timal kernel tends to reduce the bias compared to the prior kernel. It also shows
that as soon as N = 400, there are at least some particles located around the true
filtered mean of the state, which means that the method should not get entirely lost
as subsequent new observations arrive.
To illustrate the advantages of the optimal kernel with respect to the prior kernel
graphically, we consider the model (4.18)–(4.19) again with φ = 0.9, σu2 = 0.4,
σv2 = 0.6, and (0, 2.6, 0.6) as observed series (of length 3). The initial distribution
is a mixture 0.6 N(−1, 0.3) + 0.4 N(1, 0.4) of two Gaussians, for which it is still
possible to evaluate the exact filtering distributions as the mixture of two Kalman
filters using, respectively, N(−1, 0.3) and N(1, 0.4) as the initial distribution of X0 .
We use only seven particles to allow for an interpretable graphical representation.
Figures 4.5 and 4.6 show the positions of the particles propagated using the prior
kernel and the optimal kernel, respectively. At time 1, there is a conflict between
the prior and the posterior as the observation does not agree with the particle
approximation of the predictive distribution. With the prior kernel (Figure 4.5),
the mass becomes concentrated on a single particle with several particles lost out
in the left tail of the distribution with negligible weights. In contrast, in Figure 4.6
most of the particles stay in high probability regions through the iterations with
several distinct particles having non-negligible weights. This is precisely because the
optimal kernel “pulls” particles toward regions where the current local likelihood
gk (x) = gk (x, Yk ) is large, whereas the prior kernel does not.

Accept-Reject Algorithm
Because drawing from the optimal kernel Tk is most often not feasible, a first nat-
ural idea consists in trying the accept-reject method (Algorithm ??), which is a
versatile approach to sampling from general distributions. To sample from the
optimal importance kernel Tk (x, ·) defined by (4.15), one needs an instrumental
kernel Rk (x, ·) from which it is easy to sample and such that there exists M sat-
dQ(x,·)
isfying dR k (x,·)
(x0 )gk (x0 ) ≤ M (for all x ∈ X). Note that because it is generally
impossible to evaluate the normalizing constant γk of Tk , we must resort here to
the unnormalized version of the accept-reject algorithm. The algorithm consists in
68 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS

generating pairs (ξ, U ) of independent random variables with ξ ∼ Rk (x, ·) and U


uniformly distributed on [0, 1] and accepting ξ if
1 dQ(x, ·)
U≤ (ξ)gk (ξ) .
M dRk (x, ·)
Recall that the distribution of the number of simulations required is geometric with
parameter
Q(x, dx0 )gk (x0 )
R
p(x) = .
M
The strength of the accept-reject technique is that, using any instrumental kernel Rk
satisfying the domination condition, one can obtain independent samples from the
optimal importance kernel Tk . When the conditional likelihood of the observation
gk (x)—viewed as a function of x—is bounded, one can for example use the prior
kernel Q as the instrumental distribution. In that case
dTk (x, ·) 0 gk (x0 ) sup 0 gk (x0 )
(x ) = R ≤ R x ∈X .
dQ(x, ·) gk (u) Q(x, du) gk (u) Q(x, du)

The algorithm then consists in drawing ξ from the prior kernel Q(x, ·), U uniformly
on [0, 1] and accepting the draw if U ≤ gk (ξ)/ supx∈X gk (x). The acceptance rate of
this algorithm is then given by
Q(x, dx0 )gk (x0 )
R
p(x) = X .
supx0 ∈X gk (x0 )
Unfortunately, it is not always possible to design an importance kernel Rk (x, ·) that
is easy to sample from, for which the bound M is indeed finite, and such that the
acceptance rate p(x) is reasonably large.

Local Approximation of the Optimal Importance Kernel


A different option consists in trying to approximate the optimal kernel Tk by a
simpler proposal kernel Rk that is handy for simulating. Ideally, Rk should be such
that $R_k(x,\cdot)$ both has heavier tails than $T_k(x,\cdot)$ and is close to $T_k(x,\cdot)$ around its modes, with the aim of keeping the ratio $\frac{dT_k(x,\cdot)}{dR_k(x,\cdot)}(x')$ as small as possible. To do so,
authors such as Pitt and Shephard (1999) and Doucet et al. (2000) suggest to first
locate the high-density regions of the optimal distribution Tk (x, ·) and then use an
over-dispersed (that is, with sufficiently heavy tails) approximation of Tk (x, ·). The
first part of this program mostly applies to the case where the distribution Tk (x, ·)
is known to be unimodal with a mode that can be located in some way. The overall
procedure will need to be repeated N times with x corresponding in turn to each
of the current particles. Hence the method used to construct the approximation
should be reasonably simple if the potential advantages of using a “good” proposal
kernel are not to be offset by an unbearable increase in computational cost.
A first remark of interest is that there is a large class of state-space models
for which the distribution Tk (x, ·) can effectively be shown to be unimodal using
convexity arguments. In the remainder of this section, we assume that X = Rd and
that the hidden Markov model is fully dominated (in the sense of Definition 13),
denoting by q the transition density function associated with the hidden chain.
Recall that for a certain form of non-linear state-space models given by (4.18)–
(4.19), we were able to derive the optimal kernel and its normalization constant
explicitly. Now consider the case where the state evolves according to (4.18), so that
$$ q(x, x') \propto \exp\left( -\frac{1}{2}\, \left( x' - A(x) \right)^t \left[ R(x) R^t(x) \right]^{-1} \left( x' - A(x) \right) \right)\;, $$

and g(x, y) is simply constrained to be a log-concave function of its x argument.


This of course includes the linear Gaussian observation model considered previ-
ously in (4.19) but also many other cases like the non-linear observation con-
sidered below in Example 80. Then the optimal transition density $t_k^u(x, x') = (L_{k+1}/L_k)^{-1}\, q(x, x')\, g_k(x')$ is also a log-concave function of its $x'$ argument, as its logarithm is the sum of two concave functions (and a constant term). This implies in particular that $x' \mapsto t_k^u(x, x')$ is unimodal and that its mode may be located using computationally efficient techniques such as Newton iterations.
The instrumental transition density function is usually chosen from a parametric
family {rθ }θ∈Θ of densities indexed by a finite-dimensional parameter θ. An obvious
choice is the multivariate Gaussian distribution with mean $m$ and covariance matrix $\Gamma$, in which case $\theta = (m, \Gamma)$. A better choice is a multivariate t-distribution with $\eta$ degrees of freedom, location $m$, and scale matrix $\Gamma$. Recall that the density of this distribution satisfies $r_\theta(x) \propto [\eta + (x - m)^t \Gamma^{-1} (x - m)]^{-(\eta + d)/2}$. The
choice η = 1 corresponds to a Cauchy distribution. This is a conservative choice that
ensures over-dispersion, but if X is high-dimensional, most draws from a multivariate
Cauchy might be too far away from the mode to reasonably approximate the target
distribution. In most situations, values such as η = 4 (three finite moments) are
more reasonable, especially if the underlying model does not feature heavy-tailed
distributions. Recall also that simulation from the multivariate t-distribution with $\eta$ degrees of freedom, location $m$, and scale $\Gamma$ can easily be achieved by first drawing from a multivariate Gaussian distribution with mean $m$ and covariance $\Gamma$ and then dividing the outcome (centered at $m$) by the square root of an independent chi-square draw with $\eta$ degrees of freedom divided by $\eta$.
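
In code, this simulation recipe is only a few lines; the following sketch (Python/NumPy, our transcription of the recipe above) returns one draw:

import numpy as np

def sample_multivariate_t(m, Gamma, eta, rng):
    # Multivariate t with location m, scale Gamma and eta degrees of
    # freedom: centered Gaussian draw divided by the square root of an
    # independent chi-square(eta) draw divided by eta.
    m = np.asarray(m, dtype=float)
    z = rng.multivariate_normal(np.zeros(len(m)), Gamma)
    return m + z / np.sqrt(rng.chisquare(eta) / eta)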
To choose the parameter θ of the instrumental distribution rθ , one should try
to minimize the supremum of the importance function,
$$ \min_{\theta \in \Theta} \sup_{x' \in \mathsf{X}} \frac{q(x, x')\, g_k(x')}{r_\theta(x')}\;. \tag{4.21} $$

This is a minimax guarantee by which $\theta$ is chosen to minimize an upper bound on the importance weights. Note that if $r_\theta$ was to be used for sampling from $t_k(x, \cdot)$
by the accept-reject algorithm, the value of θ for which the minimum is achieved
in (4.21) is also the one that would make the acceptance probability maximal. In
practice, solving the optimization problem in (4.21) is often too demanding, and a
more generic strategy consists in locating the mode of $x' \mapsto t_k(x, x')$ by an iterative algorithm and evaluating the Hessian of its logarithm at the mode. The parameter
θ is then selected in the following way.

Multivariate normal: fit the mean of the normal distribution to the mode of
tk (x, ·) and fit the covariance to minus the inverse of the Hessian of log tk (x, ·)
at the mode.

Multivariate t-distribution: fit the location and scale parameters as the mean
and covariance parameters in the normal case; the number of degrees of free-
dom is usually set arbitrarily (and independently of x) based on the arguments
discussed above.

We discuss below an important model for which this strategy is successful.

Example 80 (Stochastic Volatility Model). The state-space equations that define the model are

Xk+1 = φXk + σUk ,


Yk = β exp(Xk /2)Vk ,

We directly obtain
$$ q(x, x') = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x' - \phi x)^2}{2\sigma^2} \right)\;, $$
$$ g_k(x') = \frac{1}{\sqrt{2\pi\beta^2}} \exp\left( -\frac{Y_k^2}{2\beta^2} \exp(-x') - \frac{x'}{2} \right)\;. $$

Simulating from the optimal transition kernel $t_k(x, x')$ is difficult, but the function $x' \mapsto \log( q(x, x')\, g_k(x') )$ is indeed (strictly) concave. The mode $m_k(x)$ of $x' \mapsto t_k(x, x')$ is the unique solution of the non-linear equation
$$ -\frac{1}{\sigma^2}\, (x' - \phi x) + \frac{Y_k^2}{2\beta^2} \exp(-x') - \frac{1}{2} = 0\;, \tag{4.22} $$

which can be found using Newton iterations. Once at the mode, the (squared) scale $\sigma_k^2(x)$ is set as minus the inverse of the second-order derivative of $x' \mapsto \log( q(x, x')\, g_k(x') )$ evaluated at the mode $m_k(x)$. The result is
$$ \sigma_k^2(x) = \left( \frac{1}{\sigma^2} + \frac{Y_k^2}{2\beta^2} \exp[-m_k(x)] \right)^{-1}\;. \tag{4.23} $$
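
A direct numerical transcription of (4.22)–(4.23) might look as follows (Python; the starting point $\phi x$ for the Newton iterations is our own choice, not prescribed above):

import numpy as np

def sv_proposal_params(x, y, phi, sigma2, beta2, n_iter=20):
    # Newton iterations on (4.22) for the mode m_k(x), then the squared
    # scale sigma_k^2(x) from (4.23).
    m = phi * x  # starting point (assumption)
    for _ in range(n_iter):
        grad = -(m - phi * x) / sigma2 + y ** 2 / (2 * beta2) * np.exp(-m) - 0.5
        hess = -1.0 / sigma2 - y ** 2 / (2 * beta2) * np.exp(-m)
        m -= grad / hess
    scale2 = 1.0 / (1.0 / sigma2 + y ** 2 / (2 * beta2) * np.exp(-m))
    return m, scale2

Because the log-density is strictly concave, the iterations converge quickly in practice from essentially any reasonable starting point.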

In this example, a t-distribution with $\eta = 5$ degrees of freedom was used, with location $m_k(x)$ and scale $\sigma_k(x)$ obtained as above. The incremental importance weight is then given by
$$ \frac{ \exp\left[ -\frac{(x' - \phi x)^2}{2\sigma^2} - \frac{Y_k^2}{2\beta^2} \exp(-x') - \frac{x'}{2} \right] }{ \sigma_k^{-1}(x) \left\{ \eta + \frac{[x' - m_k(x)]^2}{\sigma_k^2(x)} \right\}^{-(\eta+1)/2} }\;. $$

Figure 4.7 shows a typical example of the type of fit that can be obtained
for the stochastic volatility model with this strategy using 1,000 particles. Note
that although the data used is the same as in Figure ??, the estimated distribu-
tions displayed in both figures are not directly comparable, as the MCMC method
in Figure ?? approximates the marginal smoothing distribution, whereas the se-
quential importance sampling approach used for Figure 4.7 provides a (recursive)
approximation to the filtering distributions.
When there is no easy way to implement the local linearization technique, a
natural idea explored by Doucet et al. (2000) and Van der Merwe et al. (2000)
consists in using classical non-linear filtering procedures to approximate tk . These
include in particular the so-called extended Kalman filter (EKF), which dates back
to the 1970s (Anderson and Moore, 1979, Chapter 10), as well as the unscented
Kalman filter (UKF) introduced by Julier and Uhlmann (1997)—see, for instance,
Ristic et al. (2004, Chapter 2) for a recent review of these techniques. We illustrate
below the use of the extended Kalman filter in the context of sequential importance
sampling.
We now consider the most general form of the state-space model with Gaussian
noises:

Xk+1 = a(Xk , Uk ) , Uk ∼ N(0, I) , (4.24)


Yk = b(Xk , Vk ) , Vk ∼ N(0, I) , (4.25)

where a, b are vector-valued measurable functions. It is assumed that {Uk }k≥0 and
{Vk }k≥0 are independent white Gaussian noises. As usual, X0 is assumed to be
N(0, Σν ) distributed and independent of {Uk } and {Vk }. The extended Kalman

filter proceeds by approximating the non-linear state-space equations (4.24)–(4.25)


by a non-linear Gaussian state-space model with linear measurement equation. We
are then back to a model of the form (4.18)–(4.19) for which the optimal kernel may
be determined exactly using Gaussian formulas. We will adopt the approximation

Xk ≈ a(Xk−1 , 0) + R(Xk−1 )Uk−1 , (4.26)


Yk ≈ b [a(Xk−1 , 0), 0] + B(Xk−1 ) [Xk − a(Xk−1 , 0)] + S(Xk−1 )Vk , (4.27)

where
• R(x) is the dx × du matrix of partial derivatives of a(x, u) with respect to u
and evaluated at (x, 0),

$$ [R(x)]_{i,j} \stackrel{\text{def}}{=} \frac{\partial\, [a(x, 0)]_i}{\partial u_j} \quad \text{for } i = 1, \dots, d_x \text{ and } j = 1, \dots, d_u\;; $$

• B(x) and S(x) are the dy × dx and dy × dv matrices of partial derivatives of


b(x, v) with respect to x and v respectively and evaluated at (a(x, 0), 0),

$$ [B(x)]_{i,j} = \frac{\partial\, \{ b[a(x, 0), 0] \}_i}{\partial x_j} \quad \text{for } i = 1, \dots, d_y \text{ and } j = 1, \dots, d_x\;, $$
$$ [S(x)]_{i,j} = \frac{\partial\, \{ b[a(x, 0), 0] \}_i}{\partial v_j} \quad \text{for } i = 1, \dots, d_y \text{ and } j = 1, \dots, d_v\;. $$

It should be stressed that the measurement equation in (4.27) differs from (4.19)
in that it depends both on the current state Xk and on the previous one Xk−1 .
The approximate model specified by (4.26)–(4.27) thus departs from the HMM
assumptions. On the other hand, when conditioning on the value of Xk−1 , the
structure of both models, (4.18)–(4.19) and (4.26)–(4.27), is exactly similar. Hence
the posterior distribution of the state Xk given Xk−1 = x and Yk is a Gaussian
distribution with mean mk (x) and covariance matrix Γk (x), which can be evaluated
according to
$$ K_k(x) = R(x) R^t(x) B^t(x) \left[ B(x) R(x) R^t(x) B^t(x) + S(x) S^t(x) \right]^{-1}\;, $$
$$ m_k(x) = a(x, 0) + K_k(x) \left\{ Y_k - b[a(x, 0), 0] \right\}\;, $$
$$ \Gamma_k(x) = \left[ I - K_k(x) B(x) \right] R(x) R^t(x)\;. $$

The Gaussian distribution with mean mk (x) and covariance Γk (x) may then be used
as a proxy for the optimal transition kernel Tk (x, ·). To improve the robustness of the
method, it is safe to increase the variance, that is, to use cΓk (x) as the simulation
variance, where c is a scalar larger than one. A perhaps preferable option consists in using, as previously, a proposal distribution with tails heavier
than the Gaussian, for instance, a multivariate t-distribution with location mk (x),
scale Γk (x), and four or five degrees of freedom.
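
The computation of $m_k(x)$ and $\Gamma_k(x)$ is easily coded. In the sketch below (Python/NumPy), the functions a and b and the Jacobian routines jac_R, jac_B, jac_S are hypothetical stand-ins for model-specific code; they are not defined in the text.

import numpy as np

def ekf_proposal(x, y, a, b, jac_R, jac_B, jac_S):
    # EKF-based Gaussian approximation to the optimal kernel T_k(x, .):
    # returns the mean m_k(x) and covariance Gamma_k(x) derived above.
    R, B, S = jac_R(x), jac_B(x), jac_S(x)   # linearization at (x, 0)
    RRt = R @ R.T
    K = RRt @ B.T @ np.linalg.inv(B @ RRt @ B.T + S @ S.T)   # gain K_k(x)
    x_pred = a(x, np.zeros(R.shape[1]))                      # a(x, 0)
    m = x_pred + K @ (y - b(x_pred, np.zeros(S.shape[1])))
    Gamma = (np.eye(len(m)) - K @ B) @ RRt
    return m, Gamma

In practice one would then sample from a Gaussian or t-distribution with these parameters, possibly inflated as discussed above.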
Example 81 (Growth Model). We consider the univariate growth model discussed
by Kitagawa (1987) and Polson et al. (1992) given, in state-space form, by

Xk = ak−1 (Xk−1 ) + σu Uk−1 , Uk ∼ N(0, 1) , (4.28)


Yk = bXk2 + σv Vk , Vk ∼ N(0, 1) , (4.29)

where {Uk }k≥0 and {Vk }k≥0 are independent white Gaussian noise processes and
$$ a_{k-1}(x) = \alpha_0 x + \alpha_1 \frac{x}{1 + x^2} + \alpha_2 \cos[1.2 (k-1)] \tag{4.30} $$

with α0 = 0.5, α1 = 25, α2 = 8, b = 0.05, and σv2 = 1 (the value of σu2 will be
discussed below). The initial state is known deterministically and set to X0 = 0.1.
This model is non-linear both in the state and in the measurement equation. Note
that the form of the likelihood adds an interesting twist to the problem: whenever
Yk ≤ 0, the conditional likelihood function

$$ g_k(x) \stackrel{\text{def}}{=} g(x; Y_k) \propto \exp\left( -\frac{b^2}{2\sigma_v^2} \left( x^2 - Y_k / b \right)^2 \right) $$

is unimodal and symmetric about 0; when $Y_k > 0$ however, the likelihood $g_k$ is symmetric about 0 with two modes located at $\pm (Y_k / b)^{1/2}$.
The EKF approximation to the optimal transition kernel is a Gaussian distri-
bution with mean mk (x) and variance Γk (x) given by
$$ K_k(x) = 2\sigma_u^2\, b\, a_{k-1}(x) \left[ 4\sigma_u^2 b^2 a_{k-1}^2(x) + \sigma_v^2 \right]^{-1}\;, $$
$$ m_k(x) = a_{k-1}(x) + K_k(x) \left[ Y_k - b\, a_{k-1}^2(x) \right]\;, $$
$$ \Gamma_k(x) = \frac{\sigma_v^2\, \sigma_u^2}{4\sigma_u^2 b^2 a_{k-1}^2(x) + \sigma_v^2}\;. $$

In Figure 4.8, the optimal kernel, the EKF approximation to the optimal kernel,
and the prior kernel for two different values of the state variance are compared.
This figure corresponds to the time index one, and Y1 is set to 6 (recall that the
initial state X0 is equal to 0.1). In the case where σu2 = 1 (left plot in Figure 4.8),
the prior distribution of the state, N(a0 (X0 ), σu2 ), turns out to be more informative
(more peaky, less diffuse) than the conditional likelihood g1 . In other words, the
observed Y1 does not carry a lot of information about the state X1 , compared to
the information provided by X0 ; this is because the measurement variance σv2 is
not small compared to σu2 . The optimal transition kernel, which does take Y1 into
account, is then very close to the prior kernel, and the differences between the three
kernels are minor. In such a situation, one should not expect much improvement
with the EKF approximation compared to the prior kernel.
In the case shown in the right plot of Figure 4.8 (σu2 = 10), the situation is
reversed. Now $\sigma_v^2$ is relatively small compared to $\sigma_u^2$, so that the information about $X_1$ contained in $g_1$ is large compared to that provided by the prior information on $X_0$. This
is the kind of situation where we expect the optimal kernel to improve considerably
on the prior kernel. Indeed, because Y1 > 0, the optimal kernel is bimodal, with the
second mode far smaller than the first one (recall that the plots are on log-scale); the
EKF kernel correctly picks the dominant mode. Figure 4.8 also illustrates the fact
that, in contrast to the prior kernel, the EKF kernel does not necessarily dominate
the optimal kernel in the tails; hence the need to simulate from an over-dispersed
version of the EKF approximation as discussed above.

4.3 Sequential Importance Sampling with Resampling
Despite quite successful results for short data records, as was observed in Exam-
ple 80, it turns out that the sequential importance sampling approach discussed so
far is bound to fail in the long run. We first substantiate this claim with a simple
illustrative example before examining solutions to this shortcoming based on the
concept of resampling introduced in Section 4.1.2.

4.3.1 Weight Degeneracy


The intuitive interpretation of the importance sampling weight $\omega_k^i$ is as a measure of the adequacy of the simulated trajectory $\xi_{0:k}^i$ to the target distribution $\phi_{0:k|n}$. A small importance weight implies that the trajectory is drawn far from the main body of the posterior distribution $\phi_{0:k|n}$ and will contribute only moderately to the importance sampling estimates of the form (4.11). Indeed, a particle such that the associated weight $\omega_k^i$ is orders of magnitude smaller than the sum $\sum_{j=1}^N \omega_k^j$ is practically ineffective. If there are too many ineffective particles, the particle approximation becomes both computationally and statistically inefficient: most of the computing effort is put on updating particles and weights that do not contribute significantly to the estimator; the variance of the resulting estimator will not reflect the large number of terms in the sum but only the small number of particles with non-negligible normalized weights.
Unfortunately, the situation described above is the rule rather than the exception, as the importance weights will (almost always) degenerate as the time index $k$ increases, with most of the normalized importance weights $\omega_k^i / \sum_{j=1}^N \omega_k^j$ close to 0 except for a few. We consider below the case of i.i.d. models, for which it is possible to show using simple arguments that the large sample variance of the importance sampling estimate can only increase with the time index $k$.
Example 82 (Weight Degeneracy in the I.I.D. Case). The simplest case of appli-
cation of the sequential importance sampling technique is when µ is a probability
distribution on (X, X ) and the sequence of target distributions corresponds to the
product distributions, that is, the sequence of distributions on $(\mathsf{X}^{k+1}, \mathcal{X}^{\otimes(k+1)})$ defined recursively by $\mu_0 = \mu$ and $\mu_k = \mu_{k-1} \otimes \mu$ for $k \geq 1$. Let $\nu$ be another probability distribution on $(\mathsf{X}, \mathcal{X})$ and assume that $\mu$ is absolutely continuous with respect to $\nu$ and that
$$ \int \left( \frac{d\mu}{d\nu}(x) \right)^2 \nu(dx) < \infty\;. \tag{4.31} $$

Finally, let $f$ be a bounded measurable function that is not ($\mu$-a.s.) constant, so that its variance under $\mu$, $\mu(f^2) - \mu^2(f)$, is strictly positive.
Consider the sequential importance sampling estimate given by
$$ \hat\mu^{\mathrm{IS}}_{k,N}(f) = \sum_{i=1}^N f(\xi_k^i)\, \frac{ \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^i) }{ \sum_{j=1}^N \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^j) }\;, \tag{4.32} $$

where the random variables $\{\xi_l^j\}$, $l = 0, \dots, k$, $j = 1, \dots, N$, are i.i.d. with common distribution $\nu$. As discussed in Section 4.2, the unnormalized importance weights
distribution ν. As discussed in Section 4.2, the unnormalized importance weights
may be computed recursively and hence (4.32) really corresponds to an estimator
of the form (4.11) in the particular case of a function fk that depends on the last
component only. This is of course a rather convoluted and very inefficient way
of constructing an estimate of µ(f ) but still constitutes a valid instance of the
sequential importance sampling approach (in a very particular case).
Now let $k$ be fixed and write
$$ N^{1/2} \left( \hat\mu^{\mathrm{IS}}_{k,N}(f) - \mu(f) \right) = \frac{ N^{-1/2} \sum_{i=1}^N \left( f(\xi_k^i) - \mu(f) \right) \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^i) }{ N^{-1} \sum_{i=1}^N \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^i) }\;. \tag{4.33} $$

Because
$$ \mathrm{E}\left[ \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^i) \right] = 1\;, $$
the weak law of large numbers implies that the denominator of the right-hand side
of (4.33) converges to 1 in probability as N increases. Likewise, under (4.31), the

central limit theorem shows that the numerator of the right-hand side of (4.33) converges in distribution to the normal $\mathrm{N}(0, \sigma_k^2(f))$ distribution, where
$$ \sigma_k^2(f) = \mathrm{E}\left[ \left\{ \left( f(\xi_k^1) - \mu(f) \right) \prod_{l=0}^k \frac{d\mu}{d\nu}(\xi_l^1) \right\}^2 \right] \tag{4.34} $$
$$ = \left[ \int \left( \frac{d\mu}{d\nu}(x) \right)^2 \nu(dx) \right]^k \int \left( \frac{d\mu}{d\nu}(x) \right)^2 \left[ f(x) - \mu(f) \right]^2 \nu(dx)\;. $$

Slutsky's lemma then implies that (4.33) also converges in distribution to the same $\mathrm{N}(0, \sigma_k^2(f))$ limit as $N$ grows. Now Jensen's inequality implies that
$$ 1 = \left( \int \frac{d\mu}{d\nu}(x)\, \nu(dx) \right)^2 \leq \int \left( \frac{d\mu}{d\nu}(x) \right)^2 \nu(dx)\;, $$
with equality if and only if $\mu = \nu$. Therefore, if $\mu \neq \nu$, the asymptotic variance $\sigma_k^2(f)$ grows exponentially with the iteration index $k$ for all functions $f$ such that
$$ \int \left( \frac{d\mu}{d\nu}(x) \right)^2 \left[ f(x) - \mu(f) \right]^2 \nu(dx) = \int \frac{d\mu}{d\nu}(x) \left[ f(x) - \mu(f) \right]^2 \mu(dx) \neq 0\;. $$
Because $\mu$ is absolutely continuous with respect to $\nu$, $\mu\{x \in \mathsf{X} : d\mu/d\nu(x) = 0\} = 0$, and the last integral is null if and only if $f$ has zero variance under $\mu$.
Thus in the i.i.d. case, the asymptotic variance of the importance sampling
estimate (4.32) increases exponentially with the time index k as soon as the proposal
and target differ (except for constant functions).
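
The exponential blow-up is easy to observe numerically. The following small simulation is our own construction (with $\mu = \mathrm{N}(0,1)$ and $\nu = \mathrm{N}(0,4)$, not taken from the text) and tracks the largest normalized weight over the iterations:

import numpy as np

rng = np.random.default_rng(0)
N = 1000
log_w = np.zeros(N)
for k in range(101):
    xi = 2.0 * rng.standard_normal(N)                # xi_k^i ~ nu = N(0, 4)
    log_w += -0.5 * xi ** 2 + 0.5 * (xi / 2.0) ** 2  # log dmu/dnu, up to const
    if k in (0, 10, 100):
        w = np.exp(log_w - log_w.max())
        print(k, np.max(w / w.sum()))                # largest normalized weight

The largest normalized weight rapidly approaches one, in line with the exponential growth of $\sigma_k^2(f)$.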
It is more difficult to characterize the degeneracy of the weights for general target
and instrumental distributions. There have been some limited attempts to study
more formally this phenomenon in some specific scenarios. In particular, Del Moral
and Jacod (2001) have shown the degeneracy of the sequential importance sampling
estimator of the posterior mean in Gaussian linear models when the instrumental
kernel is the prior kernel. Such results are in general difficult to derive (even in the
Gaussian linear models where most of the derivations can be carried out explicitly)
and do not provide much additional insight. Needless to say, in practice, weight
degeneracy is a prevalent and serious problem making the vanilla sequential impor-
tance sampling method discussed so far almost useless. The degeneracy can occur
after a very limited number of iterations, as illustrated by the following example.

Example 83 (Stochastic Volatility Model, Continued). Figure 4.9 displays the


histogram of the base 10 logarithm of the normalized importance weights after 1, 10,
and 100 time indices for the stochastic volatility model considered in Example 80
(using the same instrumental kernel). The number of particles is set to 1,000.
Figure 4.9 shows that, despite the choice of a reasonably good approximation to the
optimal importance kernel, the normalized importance weights quickly degenerate
as the number of iterations of the SIS algorithm increases. Clearly, the results
displayed in Figure 4.7 still are reasonable for k = 20 but would be disastrous for
larger time horizons such as k = 100.
Because the weight degeneracy phenomenon is so detrimental, it is of great
practical significance to set up tests that can detect this phenomenon. A simple
criterion is the coefficient of variation of the normalized weights used by Kong et al.
(1994), which is defined by
$$ \mathrm{CV}_N = \left[ \frac{1}{N} \sum_{i=1}^N \left( \frac{N\, \omega^i}{\sum_{j=1}^N \omega^j} - 1 \right)^2 \right]^{1/2}\;. \tag{4.35} $$

The coefficient of variation is minimal when the normalized weights are all equal to $1/N$, and then $\mathrm{CV}_N = 0$. The maximal value of $\mathrm{CV}_N$ is $\sqrt{N-1}$, which corresponds to one of the normalized weights being one and all others being null. Therefore, the
coefficient of variation is often interpreted as a measure of the number of ineffective
particles (those that do not significantly contribute to the estimate). A related
criterion with a simpler interpretation is the so-called effective sample size Neff
(Liu, 1996), defined as

N
!2 −1
i
X ω
Neff = PN  , (4.36)
i=1 j=1 ωj

which varies between 1 (all weights null but one) and N (equal weights). It is
straightforward to verify the relation
$$ N_{\mathrm{eff}} = \frac{N}{1 + \mathrm{CV}_N^2}\;. $$

Some additional insights and heuristics about the coefficient of variation are given
by Liu and Chen (1995).
Yet another possible measure of the weight imbalance is the Shannon entropy
of the importance weights,
$$ \mathrm{Ent} = -\sum_{i=1}^N \frac{\omega^i}{\sum_{j=1}^N \omega^j} \log_2\left( \frac{\omega^i}{\sum_{j=1}^N \omega^j} \right)\;. \tag{4.37} $$

When all the normalized importance weights are null except for one of them, the
entropy is null. On the contrary, if all the weights are equal to 1/N , then the
entropy is maximal and equal to log2 N .
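
All three criteria are computed in a few lines from the unnormalized weights; the following is a direct transcription of (4.35)–(4.37) in Python:

import numpy as np

def weight_diagnostics(w):
    # Coefficient of variation (4.35), effective sample size (4.36) and
    # Shannon entropy (4.37) of the normalized importance weights.
    p = np.asarray(w, dtype=float)
    p = p / p.sum()
    n = len(p)
    cv = np.sqrt(np.mean((n * p - 1.0) ** 2))
    n_eff = 1.0 / np.sum(p ** 2)
    ent = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return cv, n_eff, ent

One can check numerically that n_eff coincides, up to rounding, with n / (1 + cv**2), as stated above.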

Example 84 (Stochastic Volatility Model, Continued). Figure 4.10 displays the


coefficient of variation (left) and Shannon entropy (right) as a function of the time
index k under the same conditions as for Figure 4.9, that is, for the stochastic
volatility model of Example 80. The figure shows that the distribution of the weights steadily degenerates: the coefficient of variation increases and the entropy of the importance weights decreases. After 100 iterations, there are fewer than 50 particles (out of 1,000)
significantly contributing to the importance sampling estimator. Most particles
have importance weights that are zero to machine precision, which is of course a
tremendous waste in computational resource.

4.3.2 Resampling
The solution proposed by Gordon et al. (1993) to reduce the degeneracy of the
importance weights is based on the concept of resampling already discussed in the
context of importance sampling in Section 4.1.2. The basic method consists in
resampling in the current population of particles using the normalized weights as
probabilities of selection. Thus, trajectories with small importance weights are
eliminated, whereas those with large importance weights are duplicated. After
resampling, all importance weights are reset to one. Up to the first instant when re-
sampling occurs, the method can really be interpreted as an instance of the sampling
importance resampling (SIR) technique discussed in Section 4.1.2. In the context
of sequential Monte Carlo, however, the main motivation for resampling is to avoid
future weight degeneracy by resetting (periodically) the weights to equal values. The
resampling step has a drawback however: as emphasized in Section 4.1.2, resampling

introduces additional variance in Monte Carlo approximations. In some situations,


the additional variance may be far from negligible: when the importance weights
already are nearly equal for instance, resampling can only reduce the number of
distinct particles, thus degrading the accuracy of the Monte Carlo approximation.
The one-step effect of resampling is thus negative but, in the long term, resampling
is required to guarantee a stable behavior of the algorithm. This interpretation sug-
gests that it may be advantageous to restrict the use of resampling to cases where
the importance weights are becoming very uneven. The criteria defined in (4.35),
(4.36), or (4.37) are of course helpful for that purpose. The resulting algorithm,
which is generally known under the name of sequential importance sampling with
resampling (SISR), is summarized below.

Algorithm 85 (SISR: Sequential Importance Sampling with Resampling). Initial-


ize the particles as in Algorithm 77, optionally applying the resampling step below.
For subsequent time indices k ≥ 0, do the following.

Sampling:
• Draw $(\tilde\xi_{k+1}^1, \dots, \tilde\xi_{k+1}^N)$ conditionally independently given $\{\xi_{0:k}^j,\ j = 1, \dots, N\}$ from the instrumental kernel: $\tilde\xi_{k+1}^i \sim R_k(\xi_k^i, \cdot)$, $i = 1, \dots, N$.

• Compute the updated importance weights

$$ \omega_{k+1}^i = \omega_k^i\, g_{k+1}(\tilde\xi_{k+1}^i)\, \frac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(\tilde\xi_{k+1}^i)\;, \quad i = 1, \dots, N\;. $$

Resampling (Optional):
• Draw, conditionally independently given $\{(\xi_{0:k}^i, \tilde\xi_{k+1}^j),\ i, j = 1, \dots, N\}$, the multinomial trial $(I_{k+1}^1, \dots, I_{k+1}^N)$ with probabilities of success
$$ \frac{\omega_{k+1}^1}{\sum_j \omega_{k+1}^j}\,, \dots, \frac{\omega_{k+1}^N}{\sum_j \omega_{k+1}^j}\;. $$

• Reset the importance weights $\omega_{k+1}^i$ to a constant value for $i = 1, \dots, N$.

If resampling is not applied, set $I_{k+1}^i = i$ for $i = 1, \dots, N$.

Trajectory update: for $i = 1, \dots, N$,
$$ \xi_{0:k+1}^i = \left( \xi_{0:k}^{I_{k+1}^i},\ \tilde\xi_{k+1}^{I_{k+1}^i} \right)\;. \tag{4.38} $$

As discussed previously the resampling step in the algorithm above may be used
systematically (for all indices k), but it is often preferable to perform resampling
from time to time only. Usually, resampling is either used systematically but at a
lower rate (for one index out of m, where m is fixed) or at random instants based
on the values of the coefficient of variation or the entropy criteria defined in (4.35)
and (4.37), respectively. Note that in addition to arguments based on the variance
of the Monte Carlo approximation, there is usually also a computational incentive
for limiting the use of resampling; indeed, except in models where the evaluation of
the incremental weights is costly (think of large-dimensional multivariate observa-
tions for instance), the computational cost of the resampling step is not negligible.
Both Sections 4.4.1 and 4.4.2 discuss several implementations and variants of the
resampling step that may render the latter argument less compelling.
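
To fix ideas, a minimal sketch of one step of the SISR recursion, storing only the current generation of particles and triggering multinomial resampling on the coefficient-of-variation criterion, could read as follows (Python/NumPy; sample_r and incr_weight are hypothetical vectorized callables implementing $R_k$ and the incremental weight, not definitions from the text):

import numpy as np

def sisr_step(xi, w, sample_r, incr_weight, rng, cv_threshold=1.0):
    # One sampling / weighting / optional resampling step of Algorithm 85.
    n = len(xi)
    xi_new = sample_r(xi, rng)              # proposals from R_k(xi_k^i, .)
    w_new = w * incr_weight(xi, xi_new)     # updated importance weights
    p = w_new / w_new.sum()
    cv = np.sqrt(np.mean((n * p - 1.0) ** 2))   # criterion (4.35)
    if cv > cv_threshold:                   # resample multinomially ...
        idx = rng.choice(n, size=n, p=p)    # ... and reset the weights
        xi_new, w_new = xi_new[idx], np.ones(n)
    return xi_new, w_new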

The term particle filter is often used to refer to Algorithm 85 although the
terminology SISR is preferable, as particle filtering is sometimes also used more
generically for any sequential Monte Carlo method. Gordon et al. (1993) actually
proposed a specific instance of Algorithm 85 in which resampling is done systemati-
cally at each step and the instrumental kernel is chosen as the prior kernel Rk = Q.
This particular algorithm, commonly known as the bootstrap filter, is most often
very easy to implement because it only involves simulating from the transition kernel
Q of the hidden chain and evaluation of the conditional likelihood function g.
There is of course a whole range of variants and refinements of Algorithm 85,
many of which will be covered in some detail in the next chapter. A simple remark
though is that, as in the case of the simplest SIR method discussed in Section 4.1.2,
it is possible to resample N times from a larger population of M intermediate
samples. In practice, it means that Algorithm 85 should be modified as follows at
indices k for which resampling is to be applied.
SIS: For $i = 1, \dots, N$, draw $\alpha$ candidates $\tilde\xi_{k+1}^{i,1}, \dots, \tilde\xi_{k+1}^{i,\alpha}$ from each proposal distribution $R_k(\xi_k^i, \cdot)$.
Resampling: Draw $(N_{k+1}^{1,1}, \dots, N_{k+1}^{1,\alpha}, \dots, N_{k+1}^{N,1}, \dots, N_{k+1}^{N,\alpha})$ from the multinomial distribution with parameter $N$ and probabilities
$$ \frac{\omega_{k+1}^{i,j}}{\sum_{l=1}^N \sum_{m=1}^\alpha \omega_{k+1}^{l,m}} \quad \text{for } i = 1, \dots, N\,,\ j = 1, \dots, \alpha\;. $$

Remark 86 (Marginal Interpretation of SIS and SISR). Both Algorithms 77 and 85


have been introduced as methods to simulate whole trajectories $\{\xi_{0:k}^i\}_{1 \leq i \leq N}$ that approximate the joint smoothing distribution $\phi_{0:k|k}$. This was done quite easily in
the case of sequential importance sampling (Algorithm 77), as the trajectories are
simply extended independently of one another as new samples arrive. When using
resampling however, the process is more involved because it becomes necessary to
duplicate or discard some trajectories according to (4.38).
This presentation of the SIS and SISR methods has been adopted because it is
the most natural way to introduce sequential Monte Carlo methods. It does not
mean that, when implementing the SISR algorithm, storing the whole trajectories
is required. Neither do we claim that for large $k$, the approximation of the complete joint distribution $\phi_{0:k|k}$ provided by the particle trajectories $\{\xi_{0:k}^i\}_{1 \leq i \leq N}$ is accurate.
Most often, Algorithm 85 is implemented storing only the current generation of particles $\{\xi_k^i\}_{1 \leq i \leq N}$, and (4.38) simplifies to
$$ \xi_{k+1}^i = \tilde\xi_{k+1}^{I_{k+1}^i}\;, \quad i = 1, \dots, N\;. $$

In that case, the system of particles $\{\xi_k^i\}_{1 \leq i \leq N}$ with associated weights $\{\omega_k^i\}_{1 \leq i \leq N}$ provides an approximation to the filtering distribution $\phi_k$, which is the marginal of the joint smoothing distribution $\phi_{0:k|k}$.
The notation $\xi_k^i$ could be ambiguous when resampling is applied, as the first $k+1$ elements of the $i$th trajectory $\xi_{0:k+1}^i$ at time $k+1$ do not necessarily coincide with the $i$th trajectory $\xi_{0:k}^i$ at time $k$. By convention, $\xi_k^i$ always refers to the last point in the $i$th trajectory, as simulated at index $k$. Likewise, $\xi_{l:k}^i$ is the portion of the same trajectory that starts at index $l$ and ends at the last index (that is, $k$). When needed, we will use the notation $\xi_{0:k}^i(l)$ for the element of index $l$ in the $i$th particle trajectory at time $k$ to avoid ambiguity.
To conclude this section on the SISR algorithm, we briefly revisit two of the
examples already considered previously to contrast the results obtained with the
SIS and SISR approaches.

Example 87 (Stochastic Volatility Model, Continued). To illustrate the effective-


ness of the resampling strategy, we consider once again the stochastic volatility
model introduced in Example 80, for which the weight degeneracy phenomenon (in
the basic SIS approach) was patent in Figures 4.9 and 4.10.
Figures 4.11 and 4.12 are the counterparts of Figs. 4.10 and 4.9, respectively,
when resampling is applied whenever the coefficient of variation (4.35) of the normal-
ized weights exceeds one. Note that Figure 4.11 displays the coefficient of variation
and Shannon entropy computed, for each index k, before resampling, at indices for
which resampling do occur. Contrary to what happened in plain importance sam-
pling, the histograms of the normalized importance weights shown in Figure 4.12
are remarkably similar, showing that the weight degeneracy phenomenon is now
under control. Another important remark in this example is that both criteria (the
coefficient of variation and entropy) are strongly correlated. Triggering resampling
whenever the entropy gets below, say, 9.2 would thus be nearly equivalent to the rule based on the coefficient of variation, with resampling occurring, on average, once every ten time indices. The Shannon
entropy of the normalized importance weights evolves between 10 and 9, suggest-
ing that there are at least 500 particles that are significantly contributing to the
importance sampling estimate (out of 1,000).

Example 88 (Growth Model, Continued). Consider again the non-linear state-


space model of Example 81, with the variance σu2 of the state noise set to 10; this
makes the observations very informative relative to the prior distribution on the
hidden states. Figures 4.13 and 4.14 display the filtering distributions estimated
for the first 31 time indices when using the SIS method with the prior kernel Q as
instrumental kernel (Figure 4.13), and the corresponding SISR algorithm with sys-
tematic resampling—that is, the bootstrap filter—in Figure 4.14. Both algorithms
use 500 particles.
For each time index, the top plots of Figures 4.13 and 4.14 show the highest pos-
terior density (HPD) regions corresponding to the estimated filtering distribution,
where the lighter grey zone contains 95% of the probability mass and the darker
area corresponds to 50% of the probability mass. These HPD regions are based
on a kernel density estimate (using the Epanechnikov kernel with bandwidth 0.2)
computed from the weighted particles (that is, before resampling in the case of the
bootstrap filter). Up to k = 8, the two methods yield very similar results. With
the SIS algorithm however, the bottom panel of Figure 4.13 shows that the weights
degenerate quickly. Remember that the maximal value of the coefficient of variation (4.35) is $\sqrt{N-1}$, that is, about 22.3 in the case of Figure 4.13. Hence for k = 6
and for all indices after k = 12, the bottom panel of Figure 4.13 indeed means that
almost all normalized weights but one are null: the filtered estimate is concentrated
at one point, which sometimes severely departs from the actual state trajectory
shown by the crosses. In contrast, the bootstrap filter (Figure 4.14) appears to be
very stable and provides reasonable state estimates even at indices for which the
filtering distribution is strongly bimodal (see Example 81 for an explanation of this
latter feature).

4.4 Complements
As discussed above, resampling is a key ingredient of the success of sequential Monte
Carlo techniques. We discuss below two separate aspects related to this issue. First,
we show that there are several schemes based on clever probabilistic results that
may be exploited to reduce the computational load associated with multinomial
resampling. Next, we examine some variants of resampling that achieve lower

conditional variance than multinomial resampling. In this latter case, the aim is of
course to be able to decrease the number of particles without losing too much on
the quality of the approximation.
Throughout this section, we will assume that it is required to draw N samples
ξ 1 , . . . , ξ N out of a, usually larger, set {ξ˜1 , . . . , ξ˜M } according to the normalized im-
portance weights {ω 1 , . . . , ω M }. We denote by G a σ-field such that both ω 1 , . . . , ω M
and ξ˜1 , . . . , ξ˜M are G-measurable.

4.4.1 Implementation of Multinomial Resampling


Drawing from the multinomial distribution is equivalent to drawing N random
indices I 1 , . . . , I N conditionally independently given G from the set {1, . . . , M } and
such that P(I j = i | G) = ω i . This is of course the simplest example of use of the
inversion method, and each index may be obtained by first simulating a random
variable U with uniform distribution on [0, 1] and then determining the index I
such that $U \in \left( \sum_{j=1}^{I-1} \omega^j,\ \sum_{j=1}^{I} \omega^j \right]$ (see Figure 4.15). Determining the appropriate
index I thus requires on the average log2 M comparisons (using a simple binary
tree search). Therefore, the naive technique to implement multinomial resampling
requires the simulation of N independent uniform random variables and, on the
average, of the order N log2 M comparisons.
A nice solution to avoid the repeated sorting operations consists in pre-sorting
the uniform variables. Because the resampling is to be repeated N times, we need
N uniform random variables, which will be denoted by $U_1, \dots, U_N$, with $U_{(1)} \leq U_{(2)} \leq \dots \leq U_{(N)}$ denoting the associated order statistics. It is easily checked
that applying the inversion method from the ordered uniforms {U(i) } requires, in
the worst case, only M comparisons. The problem is that determining the order
statistics from the unordered uniforms {Ui } by sorting algorithms such as Heapsort
or Quicksort is an operation that requires, at best, of the order N log2 N comparisons
(Press et al., 1992, Chapter 8). Hence, except in cases where $N \ll M$, we have not
gained anything yet by pre-sorting the uniform variables prior to using the inversion
method. It turns out however that two distinct algorithms are available to sample
directly the ordered uniforms {U(i) } with a number of operations that scales linearly
with N .
Both of these methods are fully covered by Devroye (1986, Chapter 5), and
we only cite here the appropriate results, referring to Devroye (1986, pp. 207–215)
for proofs and further references on the methods.

Proposition 89 (Uniform Spacings). Let $U_{(1)} \leq \dots \leq U_{(N)}$ be the order statistics associated with an i.i.d. sample from the $\mathrm{U}([0,1])$ distribution. Then the increments
$$ S_i = U_{(i)} - U_{(i-1)}\;, \quad i = 1, \dots, N\;, \tag{4.39} $$
(where by convention $S_1 = U_{(1)}$) are called the uniform spacings and are distributed as
$$ \frac{E_1}{\sum_{i=1}^{N+1} E_i}\,, \dots, \frac{E_N}{\sum_{i=1}^{N+1} E_i}\;, $$
where $E_1, \dots, E_{N+1}$ is a sequence of i.i.d. exponential random variables.

Proposition 90 (Malmquist, 1950). Let $U_{(1)} \leq \dots \leq U_{(N)}$ be the order statistics of $U_1, U_2, \dots, U_N$, a sequence of i.i.d. uniform $[0,1]$ random variables. Then
$$ U_N^{1/N}\,,\ U_N^{1/N} U_{N-1}^{1/(N-1)}\,,\ \dots\,,\ U_N^{1/N} U_{N-1}^{1/(N-1)} \cdots U_1^{1/1} $$
is distributed as $U_{(N)}, \dots, U_{(1)}$.

The two sampling algorithms associated with these probabilistic results may be
summarized as follows.

Algorithm 91 (After Proposition 89).

For $i = 1, \dots, N+1$: Simulate $U_i \sim \mathrm{U}([0,1])$ and set $E_i = -\log U_i$.

Set $G = \sum_{i=1}^{N+1} E_i$ and $U_{(1)} = E_1 / G$.

For $i = 2, \dots, N$: $U_{(i)} = U_{(i-1)} + E_i / G$.

Algorithm 92 (After Proposition 90).

Generate $V_N \sim \mathrm{U}([0,1])$ and set $U_{(N)} = V_N^{1/N}$.

For $i = N-1$ down to 1: Generate $V_i \sim \mathrm{U}([0,1])$ and set $U_{(i)} = V_i^{1/i}\, U_{(i+1)}$.
Note that Devroye (1986) also discusses a third, slightly more complicated
algorithm—the bucket sort method of Devroye and Klincsek (1981)—which also
has an expected computation time of order N . Using any of these methods, the
computational cost of multinomial resampling scales only linearly in N and M ,
which makes the method practicable even when a large number of particles is used.
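
A compact implementation combining Proposition 89 with the inversion method might look as follows (Python/NumPy, our sketch; np.searchsorted performs a binary search here, and a single merge pass over the already sorted uniforms would make the whole step exactly linear in N + M):

import numpy as np

def multinomial_resampling(w, rng):
    # Draw N = len(w) indices with probabilities proportional to w, using
    # ordered uniforms built from exponential spacings (Algorithm 91).
    n = len(w)
    e = rng.exponential(size=n + 1)
    u_sorted = np.cumsum(e[:-1]) / e.sum()   # U_(1) <= ... <= U_(N)
    cum_w = np.cumsum(w) / np.sum(w)         # cumulative weights
    return np.searchsorted(cum_w, u_sorted)  # inversion method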

4.4.2 Alternatives to Multinomial Resampling


Instead of using the multinomial sampling scheme, it is also possible to use a dif-
ferent resampling (or reallocation) scheme. For i = 1, . . . , M , denote by N i the
number of times the ith element ξ˜i is selected. A resampling scheme will be said to
be unbiased with respect to $\mathcal{G}$ if
$$ \sum_{i=1}^M N^i = N\;, \tag{4.40} $$
$$ \mathrm{E}\left[ N^i \mid \mathcal{G} \right] = N \omega^i\;, \quad i = 1, \dots, M\;. \tag{4.41} $$
We focus here on resampling techniques that keep the number of particles con-
stant (see for instance Crisan et al., 1999, for unbiased sampling with a random
number of particles). There are many different conditions under which a resam-
pling scheme is unbiased. The simplest unbiased scheme is multinomial resampling,
for which $(N^1, \dots, N^M)$, conditionally on $\mathcal{G}$, has the multinomial distribution $\mathrm{Mult}(N, \omega^1, \dots, \omega^M)$. Because $I^1, \dots, I^N$ are conditionally i.i.d. given $\mathcal{G}$, it is easy to evaluate the conditional variance in the multinomial resampling scheme:
$$ \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^N f(\tilde\xi^{I^i}) \,\Big|\, \mathcal{G} \right] = \frac{1}{N} \sum_{i=1}^M \omega^i \left( f(\tilde\xi^i) - \sum_{j=1}^M \omega^j f(\tilde\xi^j) \right)^2 $$
$$ = \frac{1}{N} \left\{ \sum_{i=1}^M \omega^i f^2(\tilde\xi^i) - \left[ \sum_{i=1}^M \omega^i f(\tilde\xi^i) \right]^2 \right\}\;. \tag{4.42} $$

A sensible objective is to try to construct resampling schemes for which the conditional variance $\operatorname{Var}\!\left( \sum_{i=1}^M \frac{N^i}{N} f(\tilde\xi^i) \mid \mathcal{G} \right)$ is as small as possible and, in particular, smaller than (4.42), preferably for any choice of the function $f$.

Residual Resampling
Residual resampling, or remainder resampling, is mentioned by Whitley (1994) (see
also Liu and Chen, 1998) as a simple means to decrease the variance incurred by
the sampling step. In this scheme, for i = 1, . . . , M we set
$$ N^i = \lfloor N \omega^i \rfloor + \bar N^i\;, \tag{4.43} $$

where $\bar N^1, \dots, \bar N^M$ are distributed, conditionally on $\mathcal{G}$, according to the multinomial distribution $\mathrm{Mult}(N - R, \bar\omega^1, \dots, \bar\omega^M)$ with $R = \sum_{i=1}^M \lfloor N \omega^i \rfloor$ and
$$ \bar\omega^i = \frac{N \omega^i - \lfloor N \omega^i \rfloor}{N - R}\;, \quad i = 1, \dots, M\;. \tag{4.44} $$

This scheme is obviously unbiased with respect to G. Equivalently, for any measur-
able function $f$, the residual sampling estimator is
$$ \frac{1}{N} \sum_{i=1}^N f(\xi^i) = \sum_{i=1}^M \frac{\lfloor N \omega^i \rfloor}{N} f(\tilde\xi^i) + \frac{1}{N} \sum_{i=1}^{N-R} f(\tilde\xi^{J^i})\;, \tag{4.45} $$

where $J^1, \dots, J^{N-R}$ are conditionally independent given $\mathcal{G}$ with distribution $\mathrm{P}(J^i = k \mid \mathcal{G}) = \bar\omega^k$ for $i = 1, \dots, N-R$ and $k = 1, \dots, M$. Because the residual resampling estimator is the sum of one term that, given $\mathcal{G}$, is deterministic and one term that involves conditionally i.i.d. labels, the variance of residual resampling is given by
$$ \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^{N-R} f(\tilde\xi^{J^i}) \,\Big|\, \mathcal{G} \right] = \frac{N-R}{N^2} \operatorname{Var}\left[ f(\tilde\xi^{J^1}) \,\big|\, \mathcal{G} \right] \tag{4.46} $$
$$ = \frac{N-R}{N^2} \sum_{i=1}^M \bar\omega^i \left( f(\tilde\xi^i) - \sum_{j=1}^M \bar\omega^j f(\tilde\xi^j) \right)^2 $$
$$ = \frac{1}{N} \sum_{i=1}^M \omega^i f^2(\tilde\xi^i) - \sum_{i=1}^M \frac{\lfloor N \omega^i \rfloor}{N^2} f^2(\tilde\xi^i) - \frac{N-R}{N^2} \left\{ \sum_{i=1}^M \bar\omega^i f(\tilde\xi^i) \right\}^2\;. $$

Residual sampling dominates multinomial sampling also in the sense of having smaller conditional variance. Indeed, first write
$$ \sum_{i=1}^M \omega^i f(\tilde\xi^i) = \sum_{i=1}^M \frac{\lfloor N \omega^i \rfloor}{N} f(\tilde\xi^i) + \frac{N-R}{N} \sum_{i=1}^M \bar\omega^i f(\tilde\xi^i)\;. $$

Then note that the sum of the $M$ numbers $\lfloor N \omega^i \rfloor / N$ plus $(N-R)/N$ equals one, whence this sequence of $M+1$ numbers can be viewed as a probability distribution. Thus Jensen's inequality applied to the square of the right-hand side of the above display yields
$$ \left\{ \sum_{i=1}^M \omega^i f(\tilde\xi^i) \right\}^2 \leq \sum_{i=1}^M \frac{\lfloor N \omega^i \rfloor}{N} f^2(\tilde\xi^i) + \frac{N-R}{N} \left\{ \sum_{i=1}^M \bar\omega^i f(\tilde\xi^i) \right\}^2\;. $$

Combining with (4.46) and (4.42), this shows that the conditional variance of resid-
ual sampling is always smaller than that of multinomial sampling.
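
A sketch of residual resampling in this notation (Python/NumPy, our transcription; the residual multinomial draw is skipped in the degenerate case where all $N\omega^i$ are integers):

import numpy as np

def residual_resampling(w, n, rng):
    # Counts N^i = floor(n w^i) + Nbar^i as in (4.43), with the residual
    # part drawn multinomially with the weights (4.44).
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    counts = np.floor(n * w).astype(int)      # deterministic copies
    r = counts.sum()
    if n > r:
        resid = n * w - counts
        counts = counts + rng.multinomial(n - r, resid / resid.sum())
    return np.repeat(np.arange(len(w)), counts)   # selected indices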

Stratified Resampling
The inversion method for sampling a multinomial sequence of trials maps uniform
(0, 1) random variables U 1 , . . . , U N into indices I 1 , . . . , I N through a deterministic
function. For any function $f$,
$$ \sum_{i=1}^N f(\tilde\xi^{I^i}) = \sum_{i=1}^N \Phi_f(U^i)\;, $$

where the function Φf (which depends on both f and {ξ˜i }) is defined, for any
$u \in (0, 1]$, by
$$ \Phi_f(u) \stackrel{\text{def}}{=} f(\tilde\xi^{I(u)})\;, \quad I(u) = \sum_{i=1}^M i\, \mathbf{1}_{\left( \sum_{j=1}^{i-1} \omega^j,\ \sum_{j=1}^{i} \omega^j \right]}(u)\;. \tag{4.47} $$
Note that, by construction, $\int_0^1 \Phi_f(u)\, du = \sum_{i=1}^M \omega^i f(\tilde\xi^i)$. To reduce the conditional variance of $\sum_{i=1}^N f(\tilde\xi^{I^i})$, we may change the way in which the sample
U 1 , . . . , U N is drawn. A possible solution, commonly used in survey sampling,
is based on stratification (see Kitagawa, 1996, and Fearnhead, 1998, Section 5.3,
for discussion of the method in the context of particle filtering). The interval
(0, 1] is partitioned into different strata, assumed for simplicity to be intervals
(0, 1] = (0, 1/N ] ∪ (1/N, 2/N ] ∪ · · · ∪ ({N − 1}/N, 1]. More general partitions could
have been considered as well; in particular, the number of partitions does not have
to equal N , and the interval lengths could be made dependent on the ω i . One then
draws a sample $\tilde U^1, \dots, \tilde U^N$ conditionally independently given $\mathcal{G}$ from the distribution $\tilde U^i \sim \mathrm{U}((\{i-1\}/N,\ i/N])$ (for $i = 1, \dots, N$) and lets $\tilde I^i = I(\tilde U^i)$ with $I$ as in (4.47) (see Figure 4.16). By construction, the difference between $\tilde N^i = \sum_{j=1}^N \mathbf{1}\{\tilde I^j = i\}$ and the target (non-integer) value $N \omega^i$ is less than one in absolute value. It also
follows that
$$ \mathrm{E}\left[ \sum_{i=1}^N f(\tilde\xi^{\tilde I^i}) \,\Big|\, \mathcal{G} \right] = \mathrm{E}\left[ \sum_{i=1}^N \Phi_f(\tilde U^i) \,\Big|\, \mathcal{G} \right] = N \sum_{i=1}^N \int_{(i-1)/N}^{i/N} \Phi_f(u)\, du = N \int_0^1 \Phi_f(u)\, du = N \sum_{i=1}^M \omega^i f(\tilde\xi^i)\;, $$

showing that the stratified sampling scheme is unbiased. Because $\tilde U^1, \dots, \tilde U^N$ are conditionally independent given $\mathcal{G}$,
$$ \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^N f(\tilde\xi^{\tilde I^i}) \,\Big|\, \mathcal{G} \right] = \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^N \Phi_f(\tilde U^i) \,\Big|\, \mathcal{G} \right] = \frac{1}{N^2} \sum_{i=1}^N \operatorname{Var}\left[ \Phi_f(\tilde U^i) \,\big|\, \mathcal{G} \right] $$
$$ = \frac{1}{N} \sum_{i=1}^M \omega^i f^2(\tilde\xi^i) - \frac{1}{N^2} \sum_{i=1}^N \left[ N \int_{(i-1)/N}^{i/N} \Phi_f(u)\, du \right]^2\;; $$
here we used that $\int_0^1 \Phi_f^2(u)\, du = \int_0^1 \Phi_{f^2}(u)\, du = \sum_{i=1}^M \omega^i f^2(\tilde\xi^i)$. By Jensen's in-
equality,
$$ \frac{1}{N} \sum_{i=1}^N \left[ N \int_{(i-1)/N}^{i/N} \Phi_f(u)\, du \right]^2 \geq \left[ \sum_{i=1}^N \int_{(i-1)/N}^{i/N} \Phi_f(u)\, du \right]^2 = \left[ \sum_{i=1}^M \omega^i f(\tilde\xi^i) \right]^2\;, $$

showing that the conditional variance of stratified sampling is always smaller than
that of multinomial sampling.
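
In code, stratified resampling differs from the multinomial scheme only in the way the uniforms are produced, one per stratum; a sketch (Python/NumPy, our transcription):

import numpy as np

def stratified_resampling(w, n, rng):
    # One uniform in each stratum ((i-1)/n, i/n], mapped through the
    # inverse cdf I(.) of (4.47).
    u = (np.arange(n) + rng.uniform(size=n)) / n
    cum_w = np.cumsum(w) / np.sum(w)
    return np.searchsorted(cum_w, u)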
Remark 93. Note that stratified sampling may be coupled with the residual sam-
pling method discussed previously: the proof above shows that using stratified
sampling on the N − R residual indices that are effectively drawn randomly can only
decrease the conditional variance.

Systematic Resampling
Stratified sampling aims at reducing the discrepancy
$$ D_N^\star(U^1, \dots, U^N) \stackrel{\text{def}}{=} \sup_{a \in (0,1]} \left| \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{(0,a]}(U^i) - a \right| $$

of the sample $U^1, \dots, U^N$ from the uniform distribution function on (0, 1]. This is sim-
ply the Kolmogorov-Smirnov distance between the empirical distribution function
of the sample and the distribution function of the uniform distribution. The
Koksma-Hlawka inequality (Niederreiter, 1992) shows that for any function f having
bounded variation on $[0, 1]$,
$$ \left| \frac{1}{N} \sum_{i=1}^N f(u^i) - \int_0^1 f(u)\, du \right| \leq C(f)\, D_N^\star(u^1, \dots, u^N)\;, $$

where C(f ) is the variation of f . This inequality suggests that it is desirable to


design random sequences U 1 , . . . , U N whose expected discrepancy is as low as pos-
sible. This provides another explanation of the improvement brought by stratified
resampling (compared to multinomial resampling).

Pursuing in this direction, it makes sense to look for sequences with even smaller
average discrepancy. One such sequence is U i = U + (i − 1)/N , where U is drawn
from a uniform U((0, 1/N ]) distribution. In survey sampling, this method is known
as systematic sampling. It was introduced in the particle filter literature by Carpen-
ter et al. (1999) but is mentioned by Whitley (1994) under the name of universal
sampling. The interval (0, 1] is still divided into N sub-intervals ({i − 1}/N, i/N ]
and one sample is taken from each of them, as in stratified sampling. However, the
samples are no longer independent, as they have the same relative position within
each stratum (see Figure 4.17). This sampling scheme is obviously still unbiased.
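Compared with the stratified sketch given earlier, only the construction of the uniforms changes in code, since a single random shift is shared by all strata (Python/NumPy, our transcription):

import numpy as np

def systematic_resampling(w, n, rng):
    # Same relative position within each stratum: U^i = U + (i-1)/n with a
    # single shift U drawn (approximately) uniformly on (0, 1/n].
    u = (np.arange(n) + rng.uniform()) / n
    cum_w = np.cumsum(w) / np.sum(w)
    return np.searchsorted(cum_w, u)
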
Because the samples are not taken independently across strata, it is however not
possible to obtain simple formulas for the conditional variance (Künsch, 2003). It is
often conjectured that the conditional variance of systematic resampling is always
lower than that of multinomial resampling. This is not correct, as demonstrated by
the following example.
Example 94. Consider the case where the initial population of particles {ξ˜i }1≤i≤N
is composed of the interleaved repetition of only two distinct values x0 and x1 , with
identical multiplicities (assuming N to be even). In other words,

{ξ˜i }1≤i≤N = {x0 , x1 , x0 , x1 , . . . , x0 , x1 } .

We denote by 2ω/N the common value of the normalized weight ω i associated to


the N/2 particles ξ˜i that satisfy ξ˜i = x1 , so that the remaining ones (which are such
that ξ˜i = x0 ) share a common weight of 2(1 − ω)/N . Without loss of generality, we
assume that 1/2 ≤ ω < 1 and that the function of interest f is such that f (x0 ) = 0
and f (x1 ) = F .
Under multinomial resampling, (4.42) shows that the conditional variance of the estimate $N^{-1} \sum_{i=1}^N f(\xi^i)$ is given by
$$ \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^N f(\xi_{\mathrm{mult}}^i) \,\Big|\, \mathcal{G} \right] = \frac{1}{N} (1 - \omega)\, \omega\, F^2\;. \tag{4.48} $$

Because the value 2ω/N is assumed to be larger than 1/N , it is easily checked
that systematic resampling deterministically sets N/2 of the ξ i to be equal to x1 .

ω                                          0.51    0.55    0.6     0.65    0.70    0.75
Multinomial                                0.050   0.049   0.049   0.048   0.046   0.043
Residual, stratified                       0.010   0.021   0.028   0.032   0.035   0.035
Systematic                                 0.070   0.150   0.200   0.229   0.245   0.250
Systematic with prior random shuffling     0.023   0.030   0.029   0.029   0.028   0.025

Table 4.1: Standard deviations of various resampling methods for N = 100 and F = 1. The bottom line has been obtained by simulations, averaging 100,000 Monte Carlo replications.

Depending on the draw of the initial shift, all the N/2 remaining particles are either
set to x1 , with probability 2ω − 1, or to x0 , with probability 2(1 − ω). Hence the
variance is that of a single Bernoulli draw scaled by N/2, that is,
$$ \operatorname{Var}\left[ \frac{1}{N} \sum_{i=1}^N f(\xi_{\mathrm{syst}}^i) \,\Big|\, \mathcal{G} \right] = (\omega - 1/2)(1 - \omega) F^2\;. $$

Note that in this case, the conditional variance of systematic resampling is not only
larger than (4.48) for most values of ω (except when ω is very close to 1/2), but
it does not even decrease to zero as N grows! Clearly, this observation is very
dependent on the order in which the initial population of particles is presented.
Interestingly, this feature is common to the systematic and stratified sampling
schemes, whereas the multinomial and residual approaches are unaffected by the
order in which the particles are labelled. In this particular example, it is straight-
forward to verify that residual and stratified resampling are equivalent—which is
not the case in general—and amount to deterministically setting N/2 particles to
the value x1 , whereas the N/2 remaining ones are drawn by N/2 conditionally in-
dependent Bernoulli trials with probability of picking x1 equal to 2ω − 1. Hence
the conditional variance, for both the residual and stratified schemes, is equal to
N −1 (2ω − 1)(1 − ω)F 2 . It is hence always smaller than (4.48), as expected from
the general study of these two methods.
Once again, the failure of systematic resampling in this example is entirely due
to the specific order in which the particles are labeled: it is easy to verify, at least
empirically, that the problem vanishes upon randomly permuting the initial particles
before applying systematic resampling. Table 4.1 also shows that a common feature
of both the residual, stratified, and systematic resampling procedures is to become
very efficient in some particular configurations of the weights such as when ω = 0.51
for which the probabilities of selecting the two types of particles are almost equal
and the selection becomes quasi-deterministic. Note also that prior random shuffling
does somewhat compromise this ability in the case of systematic resampling.
Figure 4.3: Box and whisker plot of the posterior mean estimate of X5 obtained
from 125 replications of the SIS filter using the prior kernel and increasing numbers
of particles. The horizontal line represents the true posterior mean.


Figure 4.4: Box and whisker plot of the posterior mean estimate for X5 obtained
from 125 replications of the SIS filter using the optimal kernel and increasing num-
bers of particles. Same data and axes as Figure 4.3.
Figure 4.5: SIS using the prior kernel. The positions of the particles are indicated
by circles whose radii are proportional to the normalized importance weights. The
solid lines show the filtering distributions for three consecutive time indices.

Figure 4.6: SIS using the optimal kernel (same data and display as in Figure 4.5).
Figure 4.7: Waterfall representation of filtering distributions as estimated by SIS


with N = 1,000 particles (densities estimated with Epanechnikov kernel, bandwidth
0.2). Data is the same as in Figure ??.


Figure 4.8: Log-density of the optimal kernel (solid line), EKF approximation of
the optimal kernel (dashed-dotted line), and the prior kernel (dashed line) for two
different values of the state noise variance σu2 : left, σu2 = 1; right, σu2 = 10.

Figure 4.9: Histograms of the base 10 logarithm of the normalized importance


weights after (from top to bottom) 1, 10, and 100 iterations for the stochastic
volatility model of Example 80. Note that the vertical scale of the bottom panel
has been multiplied by 10.


Figure 4.10: Coefficient of variation (left) and entropy of the normalized importance
weights as a function of the number of iterations for the stochastic volatility model
of Example 80. Same model and data as in Figure 4.9.


Figure 4.11: Coefficient of variation (left) and entropy of the normalized importance
weights as a function of the number of iterations in the stochastic volatility model
of Example 80. Same model and data as in Figure 4.10. Resampling occurs when
the coefficient of variation gets larger than 1.

Figure 4.12: Histograms of the base 10 logarithm of the normalized importance


weights after (from top to bottom) 1, 10, and 100 iterations in the stochastic volatil-
ity model of Example 80. Same model and data as in Figure 4.9. Resampling occurs
when the coefficient of variation gets larger than 1.


Figure 4.13: SIS estimates of the filtering distributions in the growth model with
instrumental kernel being the prior one and 500 particles. Top: true state sequence
(×) and 95%/50% HPD regions (light/dark grey) of estimated filtered distribution.
Bottom: coefficient of variation of the normalized importance weights.

Figure 4.14: Same legend as for Figure 4.13, but with results for the corresponding bootstrap filter.


Figure 4.15: Multinomial sampling from uniform distribution by the inversion


method.


Figure 4.16: Stratified sampling: the interval (0, 1] is divided into N intervals ((i −
1)/N, i/N ]. One sample is drawn uniformly from each interval, independently of
samples drawn in the other intervals.

Figure 4.17: Systematic sampling: the unit interval is divided into N intervals
((i − 1)/N, i/N ] and one sample is drawn from each of them. Contrary to stratified
sampling, each sample has the same relative position within its stratum.
Part II

Parameter Inference

Chapter 5

Maximum Likelihood Inference, Part I: Optimization Through Exact Smoothing

In previous chapters, we have focused on structural results and methods for HMMs,
considering in particular that the models under consideration were always perfectly
known. In most situations, however, the model cannot be fully specified beforehand,
and some of its parameters need to be calibrated based on observed data. Except
for very simplistic instances of HMMs, the structure of the model is sufficiently
complex to prevent the use of direct estimators such as those provided by moment
or least squares methods. We thus focus in the following on computation of the
maximum likelihood estimator.
Given the specific structure of the likelihood function in HMMs, it turns out
that the key ingredient of any optimization method applicable in this context is
the ability to compute smoothed functionals of the unobserved sequence of states.
Hence the methods discussed in the first part of the book for evaluating smoothed
quantities are instrumental in devising parameter estimation strategies.
This chapter only covers the class of HMMs for which the smoothing recursions
may effectively be implemented on computers. For such models, the likelihood
function is computable, and hence our main task will be to optimize a possibly
complex but entirely known function. The topic of this chapter thus relates to the
more general field of numerical optimization. For models that do not allow for
exact numerical computation of smoothing distributions, this chapter provides a
framework from which numerical approximations can be built.

5.1 Likelihood Optimization in Incomplete Data Models
To describe the methods as concisely as possible, we adopt a very general viewpoint
in which we only assume that the likelihood function of interest may be written
as the marginal of a higher dimensional function. In the terminology introduced
by Dempster et al. (1977), this higher dimensional function is described as the
complete data likelihood; in this framework, the term incomplete data refers to
the actual observed data while the complete data is a (not fully observable) higher


dimensional random variable. In Section 5.2, we will exploit the specific structure
of the HMM, and in particular the fact that it corresponds to a missing data model
in which the observations simply are a subset of the complete data. We ignore these
specifics for the moment however and consider the general likelihood optimization
problem in incomplete data models.

5.1.1 Problem Statement and Notations


Given a σ-finite measure λ on (X, X ), we consider a family {f (·; θ)}θ∈Θ of non-
negative λ-integrable functions on X. This family is indexed by a parameter θ ∈ Θ,
where Θ is a subset of Rdθ (for some integer dθ ). The task under consideration is
the maximization of the integral

    L(θ) := ∫ f(x ; θ) λ(dx)    (5.1)

with respect to the parameter θ. The function f (· ; θ) may be thought of as an


unnormalized probability density with respect to λ. Thus L(θ) is the normalizing
constant for f (· ; θ). In typical examples, f (· ; θ) is a relatively simple function of θ.
In contrast, the quantity L(θ) usually involves high-dimensional integration and is
therefore sufficiently complex to prevent the use of simple maximization approaches;
even the direct evaluation of the function might turn out to be infeasible.
In Section 5.2, we shall consider more specifically the case where f is the joint
probability density function of two random variables X and Y , the latter being
observed while the former is not. Then X is referred to as the missing data, f is
the complete data likelihood, and L is the density of Y alone, that is, the likelihood
available for estimating θ. Note however that thus far, the dependence on Y is not
made explicit in the notation; this is reminiscent of the implicit conditioning con-
vention discussed in Section 2.1.4 in that the observations do not appear explicitly.
Having sketched these statistical ideas, we stress that we feel it is actually easier
to understand the basic mechanisms at work without relying on the probabilistic
interpretation of the above quantities. In particular, it is not required that L be
a likelihood, as any function satisfying (5.1) is a valid candidate for the methods
discussed here (cf. Remark ??).
In the following, we will assume that L(θ) is positive, and thus maximizing L(θ)
is equivalent to maximizing
    ℓ(θ) := log L(θ) .    (5.2)
In a statistical setting, ℓ is the log-likelihood. We also associate to each function f(· ; θ) the probability density function p(· ; θ) (with respect to the dominating measure λ) defined by

    p(x ; θ) := f(x ; θ)/L(θ) .    (5.3)

In the statistical setting sketched above, p(x ; θ) is the conditional density of X given Y.

5.1.2 The Expectation-Maximization Algorithm


The most popular method for solving the general optimization problem outlined
above is the EM (for expectation-maximization) algorithm introduced, in its full
generality, by Dempster et al. (1977) in their landmark paper. Given the literature
available on the topic, our aim is not to provide a comprehensive review of all the
results related to the EM algorithm but rather to highlight some of its key features
and properties in the context of hidden Markov models.

The Intermediate Quantity of EM


The central concept in the framework introduced by Dempster et al. (1977) is an
auxiliary function (or, more precisely, a family of auxiliary functions) known as the
intermediate quantity of EM.
Definition 95 (Intermediate Quantity of EM). The intermediate quantity of EM
is the family {Q(· ; θ0 )}θ0 ∈Θ of real-valued functions on Θ, indexed by θ0 and defined
by

    Q(θ ; θ0) := ∫ log f(x ; θ) p(x ; θ0) λ(dx) .    (5.4)

Remark 96. To ensure that Q(θ ; θ0 ) is indeed well-defined for all values of the
pair (θ, θ0 ), one needs regularity conditions on the family of functions {f (· ; θ)}θ∈Θ ,
which will be stated below (Assumption 97). To avoid trivial cases however, we use
the convention 0 log 0 = 0 in (5.4) and in similar relations below. In more formal
terms, for every measurable set N such that both f (x ; θ) and p(x ; θ0 ) vanish λ-a.e.
on N , set

    ∫_N log f(x ; θ) p(x ; θ0) λ(dx) := 0 .
With this convention, Q(θ ; θ0 ) stays well-defined in cases where there exists a non-
empty set N such that both f (x ; θ) and f (x ; θ0 ) vanish λ-a.e. on N .
The intermediate quantity Q(θ ; θ0 ) of EM may be interpreted as the expecta-
tion of the function log f (X ; θ) when X is distributed according to the probability
density function p(· ; θ0 ) indexed by a, possibly different, value θ0 of the parameter.
Using (5.2) and (5.3), one may rewrite the intermediate quantity of EM in (5.4) as

    Q(θ ; θ0) = ℓ(θ) − H(θ ; θ0) ,    (5.5)

where

    H(θ ; θ0) := − ∫ log p(x ; θ) p(x ; θ0) λ(dx) .    (5.6)

Equation (5.5) states that the intermediate quantity Q(θ ; θ0) of EM differs from the (log of the) objective function, ℓ(θ), by a quantity that has a familiar form. Indeed, H(θ0 ; θ0) is recognized as the entropy of the probability density function p(· ; θ0) (see for instance Cover and Thomas, 1991). More importantly, the increment of H(θ ; θ0),

    H(θ ; θ0) − H(θ0 ; θ0) = − ∫ log [p(x ; θ)/p(x ; θ0)] p(x ; θ0) λ(dx) ,    (5.7)

is recognized as the Kullback-Leibler divergence (or relative entropy) between the probability density functions p indexed by θ and θ0, respectively.
The last piece of notation needed is the following: the gradient and Hessian
of a function, say L, at θ0 will be denoted by ∇θ L(θ0 ) and ∇2θ L(θ0 ), respectively.
To avoid ambiguities, the gradient of H(· ; θ0 ) with respect to its first argument,
evaluated at θ00 , will be denoted by ∇θ H(θ ; θ0 )|θ=θ00 (where the same convention
will also be used, if needed, for the Hessian).
We conclude this introductory section by stating a minimal set of assumptions
that guarantee that all quantities introduced so far are indeed well-defined.
Assumption 97.
(i) The parameter set Θ is an open subset of Rdθ (for some integer dθ ).
(ii) For any θ ∈ Θ, L(θ) is positive and finite.
(iii) For any (θ, θ0) ∈ Θ × Θ, ∫ |∇θ log p(x ; θ)| p(x ; θ0) λ(dx) is finite.

Assumption 97(iii) implies in particular that the probability distributions in the
family {p(· ; θ) dλ}θ∈Θ are all absolutely continuous with respect to one another.
Any individual distribution p(· ; θ) dλ can only vanish on sets that are assigned null
probability by all other probability distributions in the family. Thus both H(θ ; θ0 )
and Q(θ ; θ0 ) are well-defined for all pairs of parameters.

The Fundamental Inequality of EM


We are now ready to state the fundamental result that justifies the standard con-
struction of the EM algorithm.
Proposition 98. Under Assumption 97, for any (θ, θ0 ) ∈ Θ × Θ,

    ℓ(θ) − ℓ(θ0) ≥ Q(θ ; θ0) − Q(θ0 ; θ0) ,    (5.8)

where the inequality is strict unless p(· ; θ) and p(· ; θ0 ) are equal λ-a.e.
Assume in addition that
(a) θ ↦ L(θ) is continuously differentiable on Θ;
(b) for any θ0 ∈ Θ, θ ↦ H(θ ; θ0) is continuously differentiable on Θ.
Then for any θ0 ∈ Θ, θ ↦ Q(θ ; θ0) is continuously differentiable on Θ and

    ∇θ ℓ(θ0) = ∇θ Q(θ ; θ0)|θ=θ0 .    (5.9)

Proof. The difference between the left-hand side and the right-hand side of (5.8)
is the quantity defined in (5.7), which we already recognized as a Kullback-Leibler
distance. Under Assumption 97(iii), this latter term is well-defined and known to
be strictly positive (by direct application of Jensen’s inequality) unless p(· ; θ) and
p(· ; θ0 ) are equal λ-a.e. (Cover and Thomas, 1991; Lehmann and Casella, 1998).
For (5.9), first note that Q(θ ; θ0 ) is a differentiable function of θ, as it is the dif-
ference of two functions that are differentiable under the additional assumptions (a)
and (b). Next, the previous discussion implies that H(θ ; θ0 ) is minimal for θ = θ0 ,
although this may not be the only point where the minimum is achieved. Thus its
gradient vanishes at θ0 , which proves (5.9).

The EM Algorithm
The essence of the EM algorithm, which is suggested by (5.5), is that Q(θ ; θ0) may be used as a surrogate for ℓ(θ). The two functions are not necessarily comparable but, in view of (5.8), any value of θ such that Q(θ ; θ0) is increased over its baseline Q(θ0 ; θ0) corresponds to an increase of ℓ (relative to ℓ(θ0)) that is at least as large.
The EM algorithm as proposed by Dempster et al. (1977) consists in iteratively
building a sequence {θi }i≥1 of parameter estimates given an initial guess θ0 . Each
iteration is classically broken into two steps as follows.
E-Step: Determine Q(θ ; θi );
M-Step: Choose θi+1 to be the (or any, if there are several) value of θ ∈ Θ that
maximizes Q(θ ; θi ).
Proposition 98 provides the two decisive arguments behind the EM algorithm. First, an immediate consequence of (5.8) is that, by the very definition of the sequence {θi}, the sequence {ℓ(θi)}i≥0 of log-likelihood values is non-decreasing. Hence EM is a monotone optimization algorithm. Second, if the iterations ever stop at a point θ⋆, then Q(θ ; θ⋆) has to be maximal at θ⋆ (otherwise it would still be possible to improve over θ⋆), and hence θ⋆ is such that ∇θ L(θ⋆) = 0; that is, θ⋆ is a stationary point of the likelihood.
Although this picture is largely correct, there is a slight flaw in the second half
of the above intuitive reasoning in that the if part (if the iterations ever stop at a
point) may indeed never happen. Stronger conditions are required to ensure that the
sequence of parameter estimates produced by EM from any starting point indeed
converges to a limit θ⋆ ∈ Θ. However, it is actually true that when convergence
to a point takes place, the limit has to be a stationary point of the likelihood. In
order not to interrupt our presentation of the EM framework, convergence results
pertaining to the EM algorithm are deferred to Section 5.5 at the end of this chapter;
see in particular Theorems 105 and 106.
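In pseudocode, the whole procedure is a short loop. The sketch below (Python) assumes two user-supplied, model-specific callables, `e_step` and `m_step`, which are placeholders rather than library functions; the E-step returns whatever summaries of p(· ; θi) the M-step needs.

```python
import numpy as np

def em(theta0, e_step, m_step, n_iter=100, tol=1e-8):
    # Generic EM loop: each iteration cannot decrease the log-likelihood.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        stats = e_step(theta)        # E-step: expectations under p(. ; theta)
        theta_new = m_step(stats)    # M-step: maximize Q(. ; theta)
        if np.max(np.abs(theta_new - theta)) < tol:
            break
        theta = theta_new
    return theta
```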

EM in Exponential Families
Definition 99 (Exponential Family). The family {f (· ; θ)}θ∈Θ defines an exponen-
tial family of positive functions on X if

f (x ; θ) = exp{ψ(θ)t S(x) − c(θ)}h(x) , (5.10)

where S and ψ are vector-valued functions (of the same dimension) on X and Θ
respectively, c is a real-valued function on Θ and h is a non-negative real-valued
function on X.
Here S(x) is known as the vector of natural sufficient statistics, and η = ψ(θ) is the natural parameterization. If {f(· ; θ)}θ∈Θ is an exponential family and if ∫ |S(x)| f(x ; θ) λ(dx) is finite for any θ ∈ Θ, the intermediate quantity of EM reduces to

    Q(θ ; θ0) = ψ(θ)t ∫ S(x) p(x ; θ0) λ(dx) − c(θ) + ∫ p(x ; θ0) log h(x) λ(dx) .    (5.11)

Note that the right-most term does not depend on θ and thus plays no role in
the maximization. It may as well be ignored, and in practice it is not required to
compute it. Except for this term, the right-hand side of (5.11) has an explicit form as
soon as it is possible to evaluate the expectation of the vector of sufficient statistics
S under p(· ; θ0 ). The other important feature of (5.11), ignoring the rightmost term,
is that Q(θ ; θ0), viewed as a function of θ, is similar to the logarithm of (5.10) for the particular value Sθ0 = ∫ S(x) p(x ; θ0) λ(dx) of the sufficient statistic.
In summary, if {f (· ; θ)}θ∈Θ is an exponential family, the two above general
conditions needed for the EM algorithm to be practicable reduce to the following.
E-Step: The expectation of the vector of sufficient statistics S(X) under p(· ; θ0 ) must
be computable.
M-Step: Maximization of ψ(θ)t s − c(θ) with respect to θ ∈ Θ must be feasible in closed
form for any s in the convex hull of S(X) (that is, for any valid value of the
expected vector of sufficient statistics).
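As a concrete (non-HMM) illustration of these two reduced steps, consider the classical incomplete data model given by a two-component Gaussian mixture with unit variances and equal weights 1/2, in which only the means are unknown; the complete likelihood is of the form (5.10) with sufficient statistics 1{x = j} and 1{x = j} y. The hedged sketch below, in Python, alternates the computation of expected sufficient statistics with the closed-form M-step.

```python
import numpy as np

def em_two_gaussians(y, mu, n_iter=50):
    # EM for a mixture of N(mu1, 1) and N(mu2, 1) with weights 1/2.
    for _ in range(n_iter):
        # E-step: expected sufficient statistics, i.e., posterior
        # component probabilities for each observation.
        logp = -0.5 * (y[:, None] - mu[None, :]) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # M-step: closed-form maximization given the expected statistics.
        mu = (p * y[:, None]).sum(axis=0) / p.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-1.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])
print(em_two_gaussians(y, np.array([0.0, 1.0])))
```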

5.1.3 Gradient-based Methods


A frequently ignored observation is that in any model where the EM strategy may
be applied, it is also possible to evaluate derivatives of the objective function ℓ(θ)
with respect to the parameter θ. This is obvious from (5.9), and we will expand on
this matter below. As a consequence, instead of resorting to a specific algorithm
such as EM, one may borrow tools from the (comprehensive and well-documented)
toolbox of gradient-based optimization methods.

Computing Derivatives in Incomplete Data Models


A first remark is that in cases where the EM algorithm is applicable, the objective function ℓ(θ) is actually computable: because EM requires the computation of expectations under the conditional density p(· ; θ), it is restricted to cases where the normalizing constant L(θ), and hence ℓ(θ) = log L(θ), is available. The two equalities below show that the same is also true of the first- and second-order derivatives of ℓ(θ).
Proposition 100 (Fisher’s and Louis’ Identities). Assume 97 and that the following conditions hold.
(a) θ ↦ L(θ) is twice continuously differentiable on Θ.
(b) For any θ0 ∈ Θ, θ ↦ H(θ ; θ0) is twice continuously differentiable on Θ. In addition, ∫ |∇θ^k log p(x ; θ)| p(x ; θ0) λ(dx) is finite for k = 1, 2 and any (θ, θ0) ∈ Θ × Θ, and

    ∇θ^k ∫ log p(x ; θ) p(x ; θ0) λ(dx) = ∫ ∇θ^k log p(x ; θ) p(x ; θ0) λ(dx) .

Then the following identities hold:

    ∇θ ℓ(θ0) = ∫ ∇θ log f(x ; θ)|θ=θ0 p(x ; θ0) λ(dx) ,    (5.12)

    −∇θ² ℓ(θ0) = − ∫ ∇θ² log f(x ; θ)|θ=θ0 p(x ; θ0) λ(dx)
                 + ∫ ∇θ² log p(x ; θ)|θ=θ0 p(x ; θ0) λ(dx) .    (5.13)

The second equality may be rewritten in the equivalent form

    ∇θ² ℓ(θ0) + {∇θ ℓ(θ0)} {∇θ ℓ(θ0)}t = ∫ [ ∇θ² log f(x ; θ)|θ=θ0
        + { ∇θ log f(x ; θ)|θ=θ0 } { ∇θ log f(x ; θ)|θ=θ0 }t ] p(x ; θ0) λ(dx) .    (5.14)

Equation (5.12) is sometimes referred to as Fisher’s identity (see the comment by


B. Efron in the discussion of Dempster et al., 1977, p. 29). In cases where the func-
tion L may be interpreted as the likelihood associated with some statistical model,
the left-hand side of (5.12) is the score function (gradient of the log-likelihood).
Equation (5.12) shows that the score function may be evaluated by computing the
expectation, under p(· ; θ0 ), of the function ∇θ log f (X ; θ)|θ=θ0 . This latter quan-
tity, in turn, is referred to as the complete score function in a statistical context, as
log f (x ; θ) is the joint log-likelihood of the complete data (X, Y ); again we remark
that at this stage, Y is not explicit in the notation.
Equation (5.13) is usually called the missing information principle after Louis
(1982) who first named it this way, although it was mentioned previously in a slightly
different form by Orchard and Woodbury (1972) and implicitly used in Dempster
et al. (1977). In cases where L is a likelihood, the left-hand side of (5.13) is the
associated observed information matrix, and the second term on the right-hand side
is easily recognized as the (negative of the) Fisher information matrix associated
with the probability density function p(· ; θ0 ).
Finally (5.14), which is here written in a form that highlights its symmetry, was
also proved by Louis (1982) and is thus known as Louis’ identity. Together with

(5.12), it shows that the first- and second-order derivatives of ℓ may be evaluated
by computing expectations under p(· ; θ0 ) of quantities derived from f (· ; θ). We now
prove these three identities.

Proof of Proposition 100. Equations (5.12) and (5.13) are just (5.5) where the right-hand side is differentiated once, using (5.9), and then twice under the integral sign.
To prove (5.14), we start from (5.13) and note that the second term on its right-
hand side is the negative of an information matrix for the parameter θ associated
with the probability density function p(· ; θ) and evaluated at θ0 . We rewrite this
second term using the well-known information matrix identity
    ∫ ∇θ² log p(x ; θ)|θ=θ0 p(x ; θ0) λ(dx)
        = − ∫ { ∇θ log p(x ; θ)|θ=θ0 } { ∇θ log p(x ; θ)|θ=θ0 }t p(x ; θ0) λ(dx) .

This is again a consequence of assumption (b) and the fact that p(· ; θ) is a proba-
bility density function for all values of θ, implying that
    ∫ ∇θ log p(x ; θ)|θ=θ0 p(x ; θ0) λ(dx) = 0 .

Now use the identity log p(x ; θ) = log f(x ; θ) − ℓ(θ) and (5.12) to conclude that

    ∫ { ∇θ log p(x ; θ)|θ=θ0 } { ∇θ log p(x ; θ)|θ=θ0 }t p(x ; θ0) λ(dx)
        = ∫ { ∇θ log f(x ; θ)|θ=θ0 } { ∇θ log f(x ; θ)|θ=θ0 }t p(x ; θ0) λ(dx)
        − {∇θ ℓ(θ0)} {∇θ ℓ(θ0)}t ,

which completes the proof.

The Steepest Ascent Algorithm


We briefly discuss the main features of gradient-based iterative optimization algo-
rithms, starting with the simplest, but certainly not most efficient, approach. We
restrict ourselves to the case where the optimization problem is unconstrained in the
sense that Θ = Rdθ , so that any parameter value produced by the algorithms below
is valid. For an in-depth coverage of the subject, we recommend the monographs
by Luenberger (1984) and Fletcher (1987).
The simplest method is the steepest ascent algorithm, in which the current estimate θi is updated by adding a multiple of the gradient ∇θ ℓ(θi), referred to as the search direction:

    θi+1 = θi + γi ∇θ ℓ(θi) .    (5.15)

Here the multiplier γi is a non-negative scalar that needs to be adjusted at each iteration to ensure, at a minimum, that the sequence {ℓ(θi)} is non-decreasing, as was the case for EM. The most sensible approach consists in choosing γi so as to maximize the objective function in the search direction:

    γi = arg max_{γ≥0} ℓ[θi + γ ∇θ ℓ(θi)] .    (5.16)

It can be shown (Luenberger, 1984, Chapter 7) that under mild assumptions, the steepest ascent method with multipliers (5.16) is globally convergent, with a set of limit points corresponding to the stationary points of ℓ (see Section 5.5 for precise definitions of these terms and a proof that this property holds for the EM algorithm).
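A hedged sketch of (5.15)–(5.16), with the exact line maximization replaced by a bounded scalar search over γ (here via SciPy's minimize_scalar; `ell` and `grad` are user-supplied callables for ℓ and ∇θ ℓ):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_ascent(ell, grad, theta0, n_iter=200, gamma_max=1.0, tol=1e-8):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        d = grad(theta)                # search direction as in (5.15)
        if np.linalg.norm(d) < tol:    # (approximately) stationary point
            break
        # approximate the ideal multiplier (5.16) by a bounded search
        res = minimize_scalar(lambda g: -ell(theta + g * d),
                              bounds=(0.0, gamma_max), method="bounded")
        theta = theta + res.x * d
    return theta
```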
Nevertheless, the use of the steepest ascent algorithm is not recommended, particularly in large-dimensional parameter spaces. The reason is that its speed of convergence is linear, in the sense that if the sequence {θi}i≥0 converges to a point θ⋆ such that the Hessian ∇θ² ℓ(θ⋆) is negative definite (see Section 5.5.2), then

    lim_{i→∞} |θi+1(k) − θ⋆(k)| / |θi(k) − θ⋆(k)| = ρk < 1 ;    (5.17)

here θ(k) denotes the kth coordinate of the parameter vector. For large-dimensional problems it frequently occurs that, at least for some components k, the factor ρk is close to one, resulting in very slow convergence of the algorithm. It should be stressed however that the same is true for the EM algorithm, whose speed of convergence is also linear, and often very poor (Dempster et al., 1977; Jamshidian and Jennrich, 1997; Meng, 1994; Lange, 1995; Meng and Van Dyk, 1997). For gradient-based methods however, there exists a whole range of approaches, based on the second-order properties of the objective function, that guarantee faster convergence.

Newton and Second-order Methods


The prototype of second-order methods is the Newton, or Newton-Raphson, algorithm:

    θi+1 = θi − H−1(θi) ∇θ ℓ(θi) ,    (5.18)

where H(θi) = ∇θ² ℓ(θi) is the Hessian of the objective function. The Newton iteration is based on the second-order approximation

    ℓ(θ) ≈ ℓ(θ0) + ∇ℓ(θ0)t (θ − θ0) + (1/2) (θ − θ0)t H(θ0) (θ − θ0) .

If the sequence {θi}i≥0 produced by the algorithm converges to a point θ⋆ at which the Hessian is negative definite, the convergence is, at least, quadratic in the sense that for sufficiently large i there exists a positive constant β such that ‖θi+1 − θ⋆‖ ≤ β‖θi − θ⋆‖². Therefore the procedure can be very efficient.
The practical use of the Newton algorithm is however hindered by two serious
difficulties. The first is analogous to the problem already encountered for the steep-
est ascent method: there is no guarantee that the algorithm meets the minimal
requirement to provide a final parameter estimate that is at least as good as the
starting point θ0 . To overcome this difficulty, one may proceed as for the steepest
ascent method and introduce a multiplier γi controlling the step-length in the search
direction, so that the method takes the form

    θi+1 = θi − γi H−1(θi) ∇θ ℓ(θi) .    (5.19)

Again, γi may be set to maximize ℓ(θi+1). In practice, it is most often impossible to obtain the exact maximum point called for by the ideal line-search, and one uses approximate directional maximization procedures. Generally speaking, a line-search algorithm is an algorithm to find a reasonable multiplier γi in a step of the form (5.19). A frequently used algorithm consists in determining the (approximate) maximum based on a polynomial interpolation of ℓ(θ) along the line-segment between the current point θi and the proposed update given by (5.18).
A more serious problem is that except in the particular case where the function ℓ(θ) is strictly concave, the direct implementation of (5.18) is prone to numerical instabilities: there may well be whole regions of the parameter space where the Hessian H(θ) is either non-invertible (or at least very badly conditioned) or not negative semi-definite (in which case −H−1(θi) ∇θ ℓ(θi) is not necessarily an ascent direction). To combat this difficulty, quasi-Newton methods1 use the modified recursion

    θi+1 = θi + γi Wi ∇ℓ(θi) ;    (5.20)

here Wi is a weight matrix that may be tuned at each iteration, just like the multiplier γi. The rationale is that if Wi becomes close to −H−1(θi) when convergence occurs, the modified algorithm will share the favorable convergence properties of the Newton algorithm. On the other hand, by using a weight matrix Wi different from −H−1(θi), numerical issues associated with the matrix inversion may be avoided. We again refer to Luenberger (1984) and Fletcher (1987) for a more precise discussion of the available approaches; usually these methods exploit only gradient information to construct Wi, for instance using finite difference calculations, without requiring direct evaluation of the Hessian H(θ).
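In practice, one rarely implements (5.20) by hand: quasi-Newton routines such as the BFGS implementation in SciPy build the weight matrix internally from successive gradient differences. A minimal hedged sketch (the optimizer minimizes, so we pass −ℓ and −∇ℓ):

```python
import numpy as np
from scipy.optimize import minimize

def quasi_newton_mle(ell, grad, theta0):
    # ell and grad are user-supplied callables for the log-likelihood
    # and its gradient; BFGS maintains the weight matrix of (5.20).
    res = minimize(lambda t: -ell(t), np.asarray(theta0, dtype=float),
                   jac=lambda t: -np.asarray(grad(t)), method="BFGS")
    return res.x, -res.fun  # maximizer and attained log-likelihood
```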
In some contexts, it may be possible to build explicit strategies that are not as
good as the Newton algorithm—failing in particular to reach quadratic convergence
rates—but yet significantly faster at converging than the basic steepest ascent ap-
proach. For incomplete data models, Lange (1995) suggested using in (5.20) a weight matrix Ic−1(θi) given by

    Ic(θ0) = − ∫ ∇θ² log f(x ; θ)|θ=θ0 p(x ; θ0) λ(dx) .    (5.21)

This is the first term on the right-hand side of (5.13). In many models of interest,
this matrix is positive definite for all θ0 ∈ Θ, and thus its inversion is not subject
to numerical instabilities. Based on (5.13), it is also to be expected that in some
circumstances, Ic(θ0) is a reasonable approximation to the negative Hessian −∇θ² ℓ(θ0) and hence that the weighted gradient algorithm converges faster than the steepest ascent or EM algorithms (see Lange, 1995, for further results and examples). In a statistical context, where f(x ; θ) is the joint density of two random variables X and Y, Ic(θ0) is the conditional expectation given Y of the observed information matrix associated with this pair.

5.2 Application to HMMs


We now return to our primary focus and discuss the application of the previous
methods to the specific case of hidden Markov models.

5.2.1 Hidden Markov Models as Missing Data Models


HMMs correspond to a sub-category of incomplete data models known as missing
data models. In missing data models, the observed data Y is a subset of some not
fully observable complete data (X, Y ). We here assume that the joint distribution
of X and Y , for a given parameter value θ, admits a joint probability density
function f (x, y ; θ) with respect to the product measure λ ⊗ µ. As mentioned in
Section 5.1.1, the function f is sometimes referred to as the complete data likelihood.
It is important to understand that f is a probability density function only when
considered as a function of both x and y. For a fixed value of y and considered as a
function of x only, f is a positive integrable function. Indeed, the actual likelihood

1 Conjugate gradient methods are another alternative approach that we do not discuss here.

of the observation, which is defined as the probability density function of Y with


respect to µ, is obtained by marginalization as
    L(y ; θ) = ∫ f(x, y ; θ) λ(dx) .    (5.22)

For a given value of y this is of course a particular case of (5.1), which served as the
basis for developing the EM framework in Section 5.1.2. In missing data models,
the family of probability density functions {p(· ; θ)}θ∈Θ defined in (5.3) may thus
be interpreted as

    p(x|y ; θ) = f(x, y ; θ) / ∫ f(x, y ; θ) λ(dx) ,    (5.23)
the conditional probability density function of X given Y .
In the last paragraph, slightly modified versions of the notations introduced
in (5.1) and (5.3) were used to reflect the fact that the quantities of interest now
depend on the observed variable Y . This is obviously mostly a change regarding
terminology, with no impact on the contents of Section 5.1.2, except that we may
now think of integrating with respect to p(· ; θ) dλ as taking the conditional expec-
tation with respect to the missing data X, given the observed data Y , in the model
indexed by the parameter value θ.

5.2.2 EM in HMMs
We now consider more specifically hidden Markov models using the notations in-
troduced in Section 1.2, assuming that observations Y0 to Yn (or, in short, Y0:n ) are
available. Because we only consider HMMs that are fully dominated in the sense
of Definition 13, we will use the notations ν and φk|n to refer to the probability
density functions of these distributions (of X0 and of Xk given Y0:n ) with respect
to the dominating measure λ. The joint probability density function of the hidden
states X0:n and associated observations Y0:n , with respect to the product measure
λ⊗(n+1) ⊗ µ⊗(n+1) , is given by

    fn(x0:n, y0:n ; θ) = ν(x0 ; θ) g(x0, y0 ; θ) q(x0, x1 ; θ) g(x1, y1 ; θ)
                         · · · q(xn−1, xn ; θ) g(xn, yn ; θ) ,    (5.24)

where we used the same convention as above to indicate dependence with respect
to the parameter θ.
Because we mainly consider estimation of the HMM parameter vector θ from a
single sequence of observations, it does not make much sense to consider ν as an
independent parameter. There is no hope to estimate ν consistently, as there is
only one random variable X0 (that is not even observed!) drawn from this density.
In the following, we shall thus consider that ν is either fixed (and known) or fully
determined by the parameter θ that appears in q and g. A typical example of the
latter consists in assuming that ν is the stationary distribution associated with the
transition function q(·, · ; θ) (if it exists). This option is generally practicable only
in very simple models (see Example ?? below for an example) because of the lack of
analytical expressions relating the stationary distribution of q(·, · ; θ) to θ for general
parameterized hidden chains. Irrespective of whether ν is fixed or determined by θ,
it is convenient to omit dependence with respect to ν in our notations, writing, for
instance, Eθ for expectations under the model parameterized by (θ, ν).
The likelihood of the observations Ln (y0:n ; θ) is obtained by integrating (5.24)
with respect to the x (state) variables under the measure λ⊗(n+1) . Note that here
we use yet another slight modification of the notations adopted in Section 5.1 to
acknowledge that both the observations and the hidden states are indeed sequences

with indices ranging from 0 to n (hence the subscript n). Upon taking the logarithm in (5.24),

    log fn(x0:n, y0:n ; θ) = log ν(x0 ; θ) + Σ_{k=0}^{n−1} log q(xk, xk+1 ; θ)
                             + Σ_{k=0}^{n} log g(xk, yk ; θ) ,

and hence the intermediate quantity of EM has the additive structure

    Q(θ ; θ0) = Eθ0[log ν(X0 ; θ) | Y0:n] + Σ_{k=0}^{n−1} Eθ0[log q(Xk, Xk+1 ; θ) | Y0:n]
                + Σ_{k=0}^{n} Eθ0[log g(Xk, Yk ; θ) | Y0:n] .

In the following, we will adopt the “implicit conditioning” convention that we have used extensively from Section 2.1.4 and onwards, writing gk(x ; θ) instead of g(x, Yk ; θ). With this notation, the intermediate quantity of EM may be rewritten as

    Q(θ ; θ0) = Eθ0[log ν(X0 ; θ) | Y0:n] + Σ_{k=0}^{n} Eθ0[log gk(Xk ; θ) | Y0:n]
                + Σ_{k=0}^{n−1} Eθ0[log q(Xk, Xk+1 ; θ) | Y0:n] .    (5.25)

Equation (5.25) shows that in great generality, evaluating the intermediate quantity
of EM only requires the computation of expectations under the marginal φk|n (· ; θ0 )
and bivariate φk:k+1|n (· ; θ0 ) smoothing distributions, given the parameter vector θ0 .
The required expectations may thus be computed using either any of the variants
of the forward-backward approach presented in Chapter 2 or the recursive smooth-
ing approach discussed in Section ??. To make the connection with the recursive
smoothing approach of Section ??, we simply rewrite (5.25) as Eθ0 [tn (X0:n ; θ) | Y0:n ],
where
t0 (x0 ; θ) = log ν(x0 ; θ) + log g0 (x0 ; θ) (5.26)

and

tk+1 (x0:k+1 ; θ) = tk (x0:k ; θ) + {log q(xk , xk+1 ; θ) + log gk+1 (xk+1 ; θ)} . (5.27)

Proposition ?? may then be applied directly to obtain the smoothed expectation of the sum functional tn.
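For a finite state space, once the marginal and bivariate smoothing probabilities are available, (5.25) amounts to a few array contractions. A hedged NumPy sketch (the array names and layout are our own conventions): `phi[k, i]` stands for φk|n(i ; θ0), `phi2[k, i, j]` for the joint smoothing probability of (Xk, Xk+1), and the `log_*` arrays are evaluated under the candidate parameter θ.

```python
import numpy as np

def intermediate_quantity(phi, phi2, log_nu, log_q, log_g):
    # Q(theta ; theta') as in (5.25), for X = {1, ..., r}:
    #   phi    : (n+1, r) marginal smoothing probabilities under theta'
    #   phi2   : (n, r, r) bivariate smoothing probabilities under theta'
    #   log_nu : (r,)      log nu(. ; theta)
    #   log_q  : (r, r)    log q(., . ; theta)
    #   log_g  : (n+1, r)  log g_k(. ; theta) at the observations
    return (phi[0] @ log_nu            # E[log nu(X_0) | Y_0:n]
            + np.sum(phi * log_g)      # sum_k E[log g_k(X_k) | Y_0:n]
            + np.sum(phi2 * log_q))    # sum_k E[log q(X_k, X_{k+1}) | Y_0:n]
```

Evaluating this quantity at successive EM iterates provides a convenient numerical check of the monotonicity property guaranteed by Proposition 98.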
Although the exact form taken by the M-step will obviously depend on the way
g and q depend on θ, the EM update equations follow a very systematic scheme
that does not change much with the exact model under consideration. For instance,
all discrete state space models for which the transition matrix q is parameterized
by its r × r elements and such that g and q do not share common parameters (or
parameter constraints) give rise to the same update equations for q, given in (5.34)
below. Several examples of the EM update equations will be reviewed in Sections 5.3
and 5.4.

5.2.3 Computing Derivatives


Recall that the Fisher identity (5.12) provides an expression for the gradient of the log-likelihood ℓn(θ) with respect to the parameter vector θ, closely related to the intermediate quantity of EM. In the HMM context, (5.12) reduces to

    ∇θ ℓn(θ) = Eθ[∇θ log ν(X0 ; θ) | Y0:n] + Σ_{k=0}^{n} Eθ[∇θ log gk(Xk ; θ) | Y0:n]
               + Σ_{k=0}^{n−1} Eθ[∇θ log q(Xk, Xk+1 ; θ) | Y0:n] .    (5.28)

Hence the gradient of the log-likelihood may also be evaluated using either the
forward-backward approach or the recursive technique discussed in Chapter 3. For
the latter, we only need to redefine the functional of interest, replacing (5.26)
and (5.27) by their gradients with respect to θ.
Louis’ identity (5.14) gives rise to more complicated expressions, and we only
consider here the case where g does depend on θ, whereas the state transition density
q and the initial distribution ν are assumed to be fixed and known (the opposite
situation is covered in detail in a particular case in Section 5.3.3). In this case,
(5.14) may be rewritten as

    ∇θ² ℓn(θ) + {∇θ ℓn(θ)} {∇θ ℓn(θ)}t    (5.29)
        = Σ_{k=0}^{n} Eθ[ ∇θ² log gk(Xk ; θ) | Y0:n ]
        + Σ_{k=0}^{n} Σ_{j=0}^{n} Eθ[ {∇θ log gk(Xk ; θ)} {∇θ log gj(Xj ; θ)}t | Y0:n ] .

The first term on the right-hand side of (5.29) is obviously an expression that can be
computed proceeding as for (5.28), replacing first- by second-order derivatives. The
second term is however more tricky because it (seemingly) requires the evaluation
of the joint distribution of Xk and Xj given the observations Y0:n for all pairs of
indices k and j, which is not obtainable by the smoothing approaches based on
some form of the forward-backward decomposition. The rightmost term of (5.29)
is however easily recognized as a squared sum functional similar to (??), which can
thus be evaluated recursively (in n) proceeding as in Example ??. Recall that the
trick consists in observing that if

    τn,1(x0:n ; θ) := Σ_{k=0}^{n} ∇θ log gk(xk ; θ) ,

    τn,2(x0:n ; θ) := { Σ_{k=0}^{n} ∇θ log gk(xk ; θ) } { Σ_{k=0}^{n} ∇θ log gk(xk ; θ) }t ,

then

    τn,2(x0:n ; θ) = τn−1,2(x0:n−1 ; θ) + {∇θ log gn(xn ; θ)} {∇θ log gn(xn ; θ)}t
                     + τn−1,1(x0:n−1 ; θ) {∇θ log gn(xn ; θ)}t
                     + {∇θ log gn(xn ; θ)} {τn−1,1(x0:n−1 ; θ)}t .
This last expression is of the general form given in Definition ??, and hence Propo-
sition ?? may be applied to update recursively in n
Eθ [τn,1 (X0:n ; θ) | Y0:n ] and Eθ [τn,2 (X0:n ; θ) | Y0:n ] .

To make this approach more concrete, we will describe below, in Section 5.3.3, its
application to a very simple finite state space HMM.

5.3 The Example of Normal Hidden Markov Models
In order to make the general principles outlined in the previous section more con-
crete, we now work out the details on selected examples of HMMs. We begin with
the case where the state space is finite and the observation transition function g
corresponds to a (univariate) Gaussian distribution. Only the most standard case
where the parameter vector is split into two sub-components that parameterize,
respectively, g and q, is considered.

5.3.1 EM Parameter Update Formulas


In the widely used normal HMM, X is a finite set, identified with {1, . . . , r}, Y = R, and g is a Gaussian probability density function (with respect to Lebesgue measure) given by

    g(x, y ; θ) = (2πυx)^(−1/2) exp{−(y − µx)²/(2υx)} .
By definition, gk (x ; θ) is equal to g(x, Yk ; θ). We first assume that the initial
distribution ν is known and fixed, before examining the opposite case briefly in
Section 5.3.2 below. The parameter vector θ thus encompasses the transition prob-
abilities qij for i, j = 1, . . . , r as well as the means µi and variances υi for i = 1, . . . , r. Note that in this section, because we will often need to differentiate with respect to υi, it is simpler to use the variances υi = σi² rather than the standard deviations σi as parameters. The means and variances are unconstrained, except for the positivity of the latter, but the transition probabilities are subject to the equality constraints Σ_{j=1}^{r} qij = 1 for i = 1, . . . , r (in addition to the obvious constraint that
qij should be non-negative). When considering the parameter vector denoted by θ0 ,
we will denote by µ0i , υi0 , and qij 0
its various elements.
For the model under consideration, (5.25) may be rewritten as

    Q(θ ; θ0) = C^st − (1/2) Eθ0[ Σ_{k=0}^{n} Σ_{i=1}^{r} 1{Xk = i} { log υi + (Yk − µi)²/υi } | Y0:n ]
                + Eθ0[ Σ_{k=1}^{n} Σ_{i=1}^{r} Σ_{j=1}^{r} 1{(Xk−1, Xk) = (i, j)} log qij | Y0:n ] ,

where the leading term does not depend on θ. Using the notations introduced in
Section 2.1 for the smoothing distributions, we may write
    Q(θ ; θ0) = C^st − (1/2) Σ_{k=0}^{n} Σ_{i=1}^{r} φk|n(i ; θ0) { log υi + (Yk − µi)²/υi }
                + Σ_{k=1}^{n} Σ_{i=1}^{r} Σ_{j=1}^{r} φk−1:k|n(i, j ; θ0) log qij .    (5.30)

Now, given the initial distribution ν and parameter θ0 , the smoothing distri-
butions appearing in (5.30) can be evaluated by any of the variants of forward-
backward smoothing discussed in Chapter 2. As already explained above, the E-
step of EM thus reduces to solving the smoothing problem. The M-step is specific

and depends on the model parameterization: the task consists in finding a global
optimum of Q(θ ; θ0) that satisfies the constraints mentioned above. For this, simply introduce the Lagrange multipliers λ1, . . . , λr that correspond to the equality constraints Σ_{j=1}^{r} qij = 1 for i = 1, . . . , r (Luenberger, 1984, Chapter 10). The first-order partial derivatives of the Lagrangian

    L(θ, λ ; θ0) = Q(θ ; θ0) + Σ_{i=1}^{r} λi ( 1 − Σ_{j=1}^{r} qij )

are given by

    ∂L(θ, λ ; θ0)/∂µi = (1/υi) Σ_{k=0}^{n} φk|n(i ; θ0) (Yk − µi) ,

    ∂L(θ, λ ; θ0)/∂υi = −(1/2) Σ_{k=0}^{n} φk|n(i ; θ0) { 1/υi − (Yk − µi)²/υi² } ,

    ∂L(θ, λ ; θ0)/∂qij = Σ_{k=1}^{n} φk−1:k|n(i, j ; θ0)/qij − λi ,

    ∂L(θ, λ ; θ0)/∂λi = 1 − Σ_{j=1}^{r} qij .    (5.31)

Equating all expressions in (5.31) to zero yields the parameter vector

    θ∗ = ( (µ∗i)i=1,...,r , (υ∗i)i=1,...,r , (q∗ij)i,j=1,...,r ) ,

which achieves the maximum of Q(θ ; θ0) under the applicable parameter constraints:

    µ∗i = Σ_{k=0}^{n} φk|n(i ; θ0) Yk / Σ_{k=0}^{n} φk|n(i ; θ0) ,    (5.32)

    υ∗i = Σ_{k=0}^{n} φk|n(i ; θ0) (Yk − µ∗i)² / Σ_{k=0}^{n} φk|n(i ; θ0) ,    (5.33)

    q∗ij = Σ_{k=1}^{n} φk−1:k|n(i, j ; θ0) / Σ_{k=1}^{n} Σ_{l=1}^{r} φk−1:k|n(i, l ; θ0)    (5.34)

for i, j = 1, . . . , r, where the last equation may be rewritten more concisely as

    q∗ij = Σ_{k=1}^{n} φk−1:k|n(i, j ; θ0) / Σ_{k=1}^{n} φk−1|n(i ; θ0) .    (5.35)

Equations (5.32)–(5.34) are emblematic of the intuitive form taken by the parameter update formulas derived through the EM strategy. These equations are simply the
maximum likelihood equations for the complete model in which both {Xk }0≤k≤n
and {Yk }0≤k≤n would be observed, except that the functions 1{Xk = i} and
1{Xk−1 = i, Xk = j} are replaced by their conditional expectations, φk|n (i ; θ0 )
and φk−1:k|n (i, j ; θ0 ), given the actual observations Y0:n and the available parame-
ter estimate θ0 . As discussed in Section 5.1.2, this behavior is fundamentally due to
the fact that the probability density functions associated with the complete model
form an exponential family. As a consequence, the same remark holds more gener-
ally for all discrete HMMs for which the conditional probability density functions
g(i, · ; θ) belong to an exponential family. A final word of warning about the way
in which (5.33) is written: in order to obtain a concise and intuitively interpretable

expression, (5.33) features the value of µ∗i as given by (5.32). It is of course possible
to rewrite (5.33) in a way that only contains the current parameter value θ0 and the
observations Y0:n by combining (5.32) and (5.33) to obtain

    υ∗i = Σ_{k=0}^{n} φk|n(i ; θ0) Yk² / Σ_{k=0}^{n} φk|n(i ; θ0)
          − [ Σ_{k=0}^{n} φk|n(i ; θ0) Yk / Σ_{k=0}^{n} φk|n(i ; θ0) ]² .    (5.36)
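In code, the M-step (5.32)–(5.34) reduces to weighted averages of the observations and a row normalization. A hedged NumPy sketch, with `phi` and `phi2` laid out as in the earlier snippet:

```python
import numpy as np

def normal_hmm_m_step(y, phi, phi2):
    # y: (n+1,) observations; phi: (n+1, r); phi2: (n, r, r),
    # all smoothing quantities computed under the current parameter.
    w = phi.sum(axis=0)                                   # occupation weights
    mu = phi.T @ y / w                                    # (5.32)
    v = (phi * (y[:, None] - mu[None, :]) ** 2).sum(axis=0) / w   # (5.33)
    q = phi2.sum(axis=0)
    q /= q.sum(axis=1, keepdims=True)                     # (5.34)
    return mu, v, q
```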

5.3.2 Estimation of the Initial Distribution


As mentioned above, in this chapter we generally assume that the initial distribution
ν, that is, the distribution of X0 , is fixed and known. There are cases when one
wants to treat this as an unknown parameter however, and we briefly discuss below
this issue in connection with the EM algorithm for the normal HMM. We shall
assume that ν = (νi )1≤i≤r is an unknown probability vector (that is, with non-
negative entries summing to unity), which we accommodate within the parameter
vector θ. The complete log-likelihood will then be as above, where the initial term

    log νX0 = Σ_{i=1}^{r} 1{X0 = i} log νi

goes into Q(θ ; θ0 ) as well, giving the additive contribution

    Σ_{i=1}^{r} φ0|n(i ; θ0) log νi

to (5.30). This sum is indeed part of (5.30) already, but hidden within C^st when ν is not a parameter to be estimated. Using Lagrange multipliers as above, it is straightforward to show that the M-step update of ν is ν∗i = φ0|n(i ; θ0).
It was also mentioned above that sometimes it is desirable to link ν to qθ as
being the stationary distribution of qθ . Then there is an additive contribution to
Q(θ ; θ0 ) as above, with the difference that ν can now not be chosen freely but is a
function of qθ . As there is no simple formula for the stationary distribution of qθ ,
the M-step is no longer explicit. However, once the sums (over k) in (5.30) have
been computed for all i and j, we are left with an optimization problem over the qij
for which we have an excellent initial guess, namely the standard update (ignoring
ν) (5.34). A few steps of a standard numerical optimization routine (optimizing over
the qij ) is then often enough to find the maximum of Q(· ; θ0 ) under the stationarity
assumption. Variants of the basic EM strategy, to be discussed in Section 5.5.3,
may also be useful in this situation.

5.3.3 Computation of the Score and Observed Information


For reasons discussed above, computing the gradient of the log-likelihood is not
a difficult task in finite state space HMMs and should preferably be done using
smoothing algorithms based on the forward-backward decomposition. The only new
requirement is to evaluate the derivatives with respect to θ that appear in (5.28). In
the case of the normal HMM, we already met the appropriate expressions in (5.31),
as Fisher’s identity (5.12) implies that the gradient of the intermediate quantity
at the current parameter estimate coincides with the gradient of the log-likelihood.

Hence

    ∂ℓn(θ)/∂µi = (1/υi) Σ_{k=0}^{n} φk|n(i ; θ) (Yk − µi) ,

    ∂ℓn(θ)/∂υi = −(1/2) Σ_{k=0}^{n} φk|n(i ; θ) { 1/υi − (Yk − µi)²/υi² } ,

    ∂ℓn(θ)/∂qij = Σ_{k=1}^{n} φk−1:k|n(i, j ; θ)/qij .

We now focus on the computation of the derivatives of the log-likelihood in the


model of Example ?? with respect to the transition parameters ρ0 and ρ1 . As they
play a symmetric role, it is sufficient to consider, say, ρ0 only. The variance υ is
considered as fixed so that the only quantities that depend on the parameter ρ0
are the initial distribution ν and the transition matrix Q. We will, as usual, use
the simplified notation gk (x) rather than g(x, Yk ) to denote the Gaussian density
function (2πυ)−1/2 exp{−(Yk − x)2 /(2υ)} for x ∈ {0, 1}. Furthermore, in order to
simplify the expressions below, we also omit to indicate explicitly the dependence
with respect to ρ0 in the rest of this section. Fisher’s identity (5.12) reduces to
" n−1
#
∂ ∂ X ∂
`n = E log ν(X0 ) + log qXk Xk+1 Y0:n ,
∂ρ0 ∂ρ0 ∂ρ0
k=0

where the notation qij refers to the element in the (1 + i)-th row and (1 + j)-th
column of the matrix Q (in particular, q00 and q11 are alternative notations for
ρ0 and ρ1 ). We are thus in the framework of Proposition ?? with a smoothing
functional tn,1 defined by


    t0,1(x) = (∂/∂ρ0) log ν(x) ,

    sk,1(x, x′) = (∂/∂ρ0) log q_{xx′} for k ≥ 0 ,

where the multiplicative functions {mk,1}k≥0 are equal to 1. Straightforward calculations yield

    t0,1(x) = (ρ0 + ρ1)^(−1) { (ρ1/ρ0) δ0(x) − δ1(x) } ,

    sk,1(x, x′) = (1/ρ0) δ(0,0)(x, x′) − (1/(1 − ρ0)) δ(0,1)(x, x′) .

Hence a first recursion, following Proposition ??.

Algorithm 101 (Computation of the Score in Example ??).

Initialization: Compute c0 = Σ_{i=0}^{1} ν(i) g0(i) and, for i = 0, 1,

    φ0(i) = c0^(−1) ν(i) g0(i) ,
    τ0,1(i) = t0,1(i) φ0(i) .

Recursion: For k = 0, 1, . . . , compute ck+1 = Σ_{i=0}^{1} Σ_{j=0}^{1} φk(i) qij gk+1(j) and, for j = 0, 1,

    φk+1(j) = ck+1^(−1) Σ_{i=0}^{1} φk(i) qij gk+1(j) ,

    τk+1,1(j) = ck+1^(−1) { Σ_{i=0}^{1} τk,1(i) qij gk+1(j)
                + φk(0) gk+1(0) δ0(j) − φk(0) gk+1(1) δ1(j) } .

At each index k, the log-likelihood is available via ℓk = Σ_{l=0}^{k} log cl, and its derivative with respect to ρ0 may be evaluated as

    ∂ℓk/∂ρ0 = Σ_{i=0}^{1} τk,1(i) .
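A direct transcription of Algorithm 101 for the two-state model with q00 = ρ0 and q11 = ρ1 follows; it is a hedged sketch in which the initial distribution ν and its log-derivative t0,1 (computed as above from the example's definition of ν) are passed in by the caller.

```python
import numpy as np

def score_rho0(y, rho0, rho1, upsilon, nu, t01):
    # Normalized filter phi_k and statistic tau_{k,1}; returns the
    # log-likelihood ell_n and its derivative with respect to rho0.
    Q = np.array([[rho0, 1.0 - rho0], [1.0 - rho1, rho1]])
    means = np.array([0.0, 1.0])          # state-dependent means
    g = np.exp(-(y[:, None] - means) ** 2 / (2.0 * upsilon)) \
        / np.sqrt(2.0 * np.pi * upsilon)
    c = np.sum(nu * g[0])
    phi = nu * g[0] / c
    tau = t01 * phi
    ell = np.log(c)
    for k in range(len(y) - 1):
        ck1 = phi @ Q @ g[k + 1]
        # sum_i phi_k(i) s_{k,1}(i, j) q_ij g_{k+1}(j) collapses to
        # +phi_k(0) g_{k+1}(0) for j = 0 and -phi_k(0) g_{k+1}(1) for j = 1
        extra = phi[0] * g[k + 1] * np.array([1.0, -1.0])
        tau = ((tau @ Q) * g[k + 1] + extra) / ck1
        phi = (phi @ Q) * g[k + 1] / ck1
        ell += np.log(ck1)
    return ell, tau.sum()
```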

For the second derivative, Louis’ identity (5.14) shows that

    ∂²ℓn/∂ρ0² + (∂ℓn/∂ρ0)² = E[ (∂²/∂ρ0²) log ν(X0) + Σ_{k=0}^{n−1} (∂²/∂ρ0²) log q_{Xk Xk+1} | Y0:n ]
        + E[ { (∂/∂ρ0) log ν(X0) + Σ_{k=0}^{n−1} (∂/∂ρ0) log q_{Xk Xk+1} }² | Y0:n ] .    (5.37)

The first term on the right-hand side of (5.37) is very similar to the case of τn,1 considered above, except that we now need to differentiate the functions twice, replacing t0,1 and sk,1 by (∂/∂ρ0) t0,1 and (∂/∂ρ0) sk,1, respectively. The corresponding smoothing functional tn,2 is thus defined by

    t0,2(x) = − {ρ1(2ρ0 + ρ1)/(ρ0²(ρ0 + ρ1)²)} δ0(x) + {1/(ρ0 + ρ1)²} δ1(x) ,

    sk,2(x, x′) = −(1/ρ0²) δ(0,0)(x, x′) − (1/(1 − ρ0)²) δ(0,1)(x, x′) .

The second term on the right-hand side of (5.37) is more difficult, and we need
to proceed as in Example ??: the quantity of interest may be rewritten as the
conditional expectation of
" n−1
#2
X
tn,3 (x0:n ) = t0,1 (x0 ) + sk,1 (xk , xk+1 ) .
k=0

Expanding the square in this equation yields the update formula

tk+1,3 (x0:k+1 ) = tk,3 (x0:k ) + s2k,1 (xk , xk+1 ) + 2tk,1 (x0:k )sk,1 (xk , xk+1 ) .

Hence tk,1 and tk,3 jointly are of the form prescribed by Definition ?? with incremental additive functions sk,3(x, x′) = sk,1²(x, x′) and multiplicative updates mk,3(x, x′) = 2 sk,1(x, x′). As a consequence, the following smoothing recursion holds.

Algorithm 102 (Computation of the Observed Information in Example ??).

Initialization: For i = 0, 1,

    τ0,2(i) = t0,2(i) φ0(i) ,
    τ0,3(i) = t0,1²(i) φ0(i) .

Recursion: For k = 0, 1, . . . , compute, for j = 0, 1,

    τk+1,2(j) = ck+1^(−1) { Σ_{i=0}^{1} τk,2(i) qij gk+1(j)
                − (1/ρ0) φk(0) gk+1(0) δ0(j) − (1/(1 − ρ0)) φk(0) gk+1(1) δ1(j) } ,

    τk+1,3(j) = ck+1^(−1) { Σ_{i=0}^{1} τk,3(i) qij gk+1(j)
                + 2 [ τk,1(0) gk+1(0) δ0(j) − τk,1(0) gk+1(1) δ1(j) ]
                + (1/ρ0) φk(0) gk+1(0) δ0(j) + (1/(1 − ρ0)) φk(0) gk+1(1) δ1(j) } .

At each index k, the second derivative of the log-likelihood satisfies

    ∂²ℓk/∂ρ0² + (∂ℓk/∂ρ0)² = Σ_{i=0}^{1} τk,2(i) + Σ_{i=0}^{1} τk,3(i) ,

where the second term on the left-hand side may be evaluated in the same recursion, following Algorithm 101.
To illustrate the results obtained with Algorithms 101–102, we consider the
model with parameters ρ0 = 0.95, ρ1 = 0.8, and υ = 0.1 (using the notations
introduced in Example ??). Figure 5.1 displays the typical aspect of two sequences
of length 200 simulated under slightly different values of ρ0 . One possible use of
the output of Algorithms 101–102 consists in testing for changes in the parameter
values. Indeed, under conditions to be detailed in Chapter 6 (and which hold here),
the normalized score n^(−1/2) ∂ℓn/∂ρ0 satisfies a central limit theorem with variance given by the limit of the normalized information −n^(−1) ∂²ℓn/∂ρ0². Hence it is expected that

    Rn = (∂ℓn/∂ρ0) / (−∂²ℓn/∂ρ0²)^(1/2)

be asymptotically N(0, 1)-distributed under the null hypothesis that ρ0 is indeed


equal to the value used for computing the score and information recursively with
Algorithms 101–102.
Figure 5.2 displays the empirical quantiles of Rn against normal quantiles for
n = 200 and n =1,000. For the longer sequences (n =1,000), the result is clearly as
expected with a very close fit to the normal quantiles. When n = 200, asymptotic
normality is not yet reached and there is a significant bias toward high values of
Rn . Looking back at Figure 5.1, even if υ was equal to zero—or in other words,
if we were able to identify without ambiguity the 0 and 1 states from the data—
there would not be much information about ρ0 to be gained from runs of length
200: when ρ0 = 0.95 and ρ1 = 0.8, the average number of distinct runs of 0s that
one can observe in 200 consecutive data points is only about 200/(20 + 5) = 8.
To construct a goodness of fit test from Rn, one can monitor values of Rn², which asymptotically has a chi-square distribution with one degree of freedom. Testing

[Plot: two panels of simulated data against time index, for ρ0 = 0.95 (top) and ρ0 = 0.92 (bottom).]

Figure 5.1: Two simulated trajectories of length n = 200 from the simplified ion channel model of Example ?? with ρ0 = 0.95, ρ1 = 0.8, and σ² = 0.1 (top), and ρ0 = 0.92, ρ1 = 0.8, and σ² = 0.1 (bottom).

the null hypothesis ρ0 = 0.95 gives p-values of 0.87 and 0.09 for the two sequences
in the top and bottom plots, respectively, of Figure 5.1. When testing at the 10%
level, both sequences thus lead to the correct decision: no rejection and rejection of
the null hypothesis, respectively. Interestingly, testing the other way around, that
is, postulating ρ0 = 0.92 as the null hypothesis, gives p-values of 0.20 and 0.55 for
the top and bottom sequences of Figure 5.1, respectively. The outcome of the test
is now obviously less clear-cut, which reveals an asymmetry in its discrimination
ability: it is easier to detect values of ρ0 that are smaller than expected than the
converse. This is because smaller values of ρ0 mean more changes (on average) in
the state sequence and hence more usable information about ρ0 to be obtained from
a fixed size record. This asymmetry is connected to the upward bias visible in the
left plot of Figure 5.2.

[Plot: two normal probability plots of the empirical distribution of Rn, for n = 200 (left) and n = 1,000 (right).]

Figure 5.2: QQ-plot of empirical quantiles of the test statistic Rn (abscissas) for the simplified ion channel model of Example ?? with ρ0 = 0.95, ρ1 = 0.8, and σ² = 0.1 vs. normal quantiles (ordinates). Sample sizes were n = 200 (left) and n = 1,000 (right), and 10,000 independent replications were used to estimate the empirical quantiles.

5.4 The Example of Gaussian Linear State-Space Models
We now consider more briefly the case of Gaussian linear state-space models that
form the other major class of hidden Markov models for which the methods dis-
cussed in Section 5.1 are directly applicable. It is worth mentioning that Gaussian
linear state-space models are perhaps the only important subclass of the HMM
family for which there exist reasonably simple non-iterative parameter estimation algorithms that are not based on maximum likelihood arguments but are nevertheless useful
in practical applications. These sub-optimal algorithms, proposed by Van Over-
schee and De Moor (1993), rely on the linear structure of the model and use only
eigendecompositions of empirical covariance matrices—a general principle usually
referred to under the denomination of subspace methods (Van Overschee and De
Moor, 1996). Keeping in line with the general topic of this chapter, we nonethe-
less consider below only algorithms for maximum likelihood estimation in Gaussian
linear state-space models.
The Gaussian linear state-space model is given by

    Xk+1 = A Xk + R Uk ,
    Yk = B Xk + S Vk ,

where X0 , {Uk }k≥0 and {Vk }k≥0 are jointly Gaussian. The parameters of the model
are the four matrices A, R, B, and S. Note that except for scalar models, it is not
possible to estimate R and S because both {Uk } and {Vk } are unobservable and
hence R and S are only identifiable up to an orthonormal matrix. In other words,
multiplying R or S by any orthonormal matrix of suitable dimension does not modify
the distribution of the observations. Hence the parameters that are identifiable are
the covariance matrices ΥR = RRt and ΥS = SS t , which we consider below.
Likewise, the matrices A and B are identifiable up to a similarity transformation
only. Indeed, setting Xk′ = T Xk for some invertible matrix T, that is, making a change of basis for the state process, it is straightforward to check that the joint process {(Xk′, Yk)} satisfies the model assumptions with TAT−1, BT−1, and TR
replacing A, B, and R, respectively. Nevertheless, we work with A and B in the
algorithm below. If a unique representation is desired, one may use, for instance,
the companion form of A given its eigenvalues; this matrix may contain complex
entries though. As in the case of finite state space HMMs (Section 5.2.2), it is not
sensible to consider the initial covariance matrix Σν as an independent parameter
when using a single observed sequence. On the other hand, for such models it is very
natural to assume that Σν is associated with the stationary distribution of {Xk }.
We shall also assume that both ΥR and ΥS are full rank covariance matrices so
that all Gaussian distributions admit densities with respect to (multi-dimensional)
Lebesgue measure.

5.4.1 The Intermediate Quantity of EM


With the previous notations, the intermediate quantity Q(θ ; θ0 ) of EM, defined
in (5.25), may be expressed as
" n−1
#
1 X
t −1
− Eθ0 n log |ΥR | + (Xk+1 − AXk ) ΥR (Xk+1 − AXk ) Y0:n
2
k=0
" n
#
1 X
− Eθ0 (n + 1) log |ΥS | + (Yk − BXk )t Υ−1
S (Yk − BXk ) Y0:n , (5.38)
2
k=0

up to terms that do not depend on the parameters. In order to elicit the M-step equations or to compute the score, we differentiate (5.38) using elementary perturbation calculus as well as the identity ∇C log |C| = C−t for an invertible matrix C, which is a consequence of the adjoint representation of the inverse (Horn and Johnson, 1985, Section 0.8.2):

    ∇A Q(θ ; θ0) = −ΥR−1 Eθ0[ Σ_{k=0}^{n−1} (A Xk Xkt − Xk+1 Xkt) | Y0:n ] ,    (5.39)

    ∇ΥR−1 Q(θ ; θ0) = −(1/2) { −n ΥR + Eθ0[ Σ_{k=0}^{n−1} (Xk+1 − AXk)(Xk+1 − AXk)t | Y0:n ] } ,    (5.40)

    ∇B Q(θ ; θ0) = −ΥS−1 Eθ0[ Σ_{k=0}^{n} (B Xk Xkt − Yk Xkt) | Y0:n ] ,    (5.41)

    ∇ΥS−1 Q(θ ; θ0) = −(1/2) { −(n + 1) ΥS + Eθ0[ Σ_{k=0}^{n} (Yk − BXk)(Yk − BXk)t | Y0:n ] } .    (5.42)

Note that in the expressions above, we differentiate with respect to the inverses of
ΥR and ΥS rather than with respect to the covariance matrices themselves, which
is equivalent, because we assume both of the covariance matrices to be positive
definite, but yields simpler formulas. Equating all derivatives simultaneously to
zero defines the EM update of the parameters. We will denote these updates by A∗ ,
B ∗ , Υ∗R , and Υ∗S , respectively. To write them down, denote X̂k|n (θ0 ) = Eθ0 [Xk | Y0:n ]
and Σk|n (θ0 ) = Eθ0 [Xk Xk0 | Y0:n ]− X̂k|n (θ0 )X̂k|n
t
(θ0 ), where we now indicate explicitly
that these first two smoothing moments indeed depend on the current estimates of
the model parameters (they also depend on the initial covariance matrix Σν , but
we ignore this fact here because this quantity is considered as being fixed). We also
need to evaluate the conditional covariances

    Ck,k+1|n(θ0) := Covθ0[Xk, Xk+1 | Y0:n]
                  = Eθ0[Xk Xk+1t | Y0:n] − X̂k|n(θ0) X̂k+1|nt(θ0) .

With these notations, the EM update equations are given by

    A∗ = [ Σ_{k=0}^{n−1} Ck,k+1|n(θ0) + X̂k|n(θ0) X̂k+1|nt(θ0) ]t
         [ Σ_{k=0}^{n−1} Σk|n(θ0) + X̂k|n(θ0) X̂k|nt(θ0) ]−1 ,    (5.43)

    Υ∗R = (1/n) Σ_{k=0}^{n−1} { [ Σk+1|n(θ0) + X̂k+1|n(θ0) X̂k+1|nt(θ0) ]
         − A∗ [ Ck,k+1|n(θ0) + X̂k|n(θ0) X̂k+1|nt(θ0) ] } ,    (5.44)

    B∗ = [ Σ_{k=0}^{n} X̂k|n(θ0) Ykt ]t
         [ Σ_{k=0}^{n} Σk|n(θ0) + X̂k|n(θ0) X̂k|nt(θ0) ]−1 ,    (5.45)

    Υ∗S = (1/(n + 1)) Σ_{k=0}^{n} [ Yk Ykt − B∗ X̂k|n(θ0) Ykt ] .    (5.46)

In obtaining the covariance update, we used the same remark that made it possible
to rewrite, in the case of normal HMMs, (5.33) as (5.36).
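A hedged NumPy sketch of the updates (5.43)–(5.46), assuming the smoothed means, covariances, and lag-one covariances have been stacked into arrays by a Kalman smoother run under the current parameter (the array layout is our own convention):

```python
import numpy as np

def lgssm_m_step(y, xhat, sig, cpair):
    # y: (n+1, dy); xhat: (n+1, dx); sig: (n+1, dx, dx);
    # cpair: (n, dx, dx) with cpair[k] = Cov[X_k, X_{k+1} | Y_0:n].
    n = len(y) - 1
    m = sig + np.einsum("ki,kj->kij", xhat, xhat)               # E[X_k X_k^t | Y]
    mp = cpair + np.einsum("ki,kj->kij", xhat[:-1], xhat[1:])   # E[X_k X_{k+1}^t | Y]
    A = mp.sum(axis=0).T @ np.linalg.inv(m[:-1].sum(axis=0))            # (5.43)
    ups_r = (m[1:].sum(axis=0) - A @ mp.sum(axis=0)) / n                 # (5.44)
    xy = np.einsum("ki,kj->ij", xhat, y)                        # sum_k xhat_k y_k^t
    B = xy.T @ np.linalg.inv(m.sum(axis=0))                              # (5.45)
    ups_s = (np.einsum("ki,kj->ij", y, y) - B @ xy) / (n + 1)            # (5.46)
    return A, ups_r, B, ups_s
```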

5.5 Complements
To conclude this chapter, we briefly return to an issue mentioned in Section 5.1.2
regarding the conditions that ensure that the EM iterations indeed converge to
stationary points of the likelihood.

5.5.1 Global Convergence of the EM Algorithm


As a consequence of Proposition 98, the EM algorithm described in Section 5.1.2 has
the property that the log-likelihood function ` can never decrease in an iteration.
Indeed,
`(θi+1 ) − `(θi ) ≥ Q(θi+1 ; θi ) − Q(θi ; θi ) ≥ 0 .
This class of algorithms, sometimes referred to as ascent algorithms (Luenberger,
1984, Chapter 6), can be treated in a unified manner following a theory developed
mostly by Zangwill (1969). Wu (1983) showed that this general theory applies to
the EM algorithm as defined above, as well as to some of its variants that he calls
generalized EM (or GEM). The main result is a strong stability guarantee known
as global convergence, which we discuss below.
We first need a mathematical formalism that describes the EM algorithm. This
is done by identifying any homogeneous (in the iterations) iterative algorithm with a
specific choice of a mapping M that associates θi+1 to θi . In the theory of Zangwill
(1969), one indeed considers families of algorithms by allowing for point-to-set maps
M that associate a set M (θ0 ) ⊆ Θ to each parameter value θ0 ∈ Θ. A specific
algorithm in the family is such that θi+1 is selected in M (θi ). In the example of
EM, we may define M as
    M(θ0) = { θ ∈ Θ : Q(θ ; θ0) ≥ Q(θ̃ ; θ0) for all θ̃ ∈ Θ } ,    (5.47)

that is, M (θ0 ) is the set of values θ that maximize Q(θ ; θ0 ) over Θ. Usually M (θ0 )
reduces to a singleton, and the mapping M is then simply a point-to-point map (a
usual function from Θ to Θ). But the use of point-to-set maps makes it possible to
deal also with cases where the intermediate quantity of EM may have several global
maxima, without going into the details of what is done in such cases. We next need
the following definition before stating the main convergence theorem.

Definition 103 (Closed Mapping). A map T from points of Θ to subsets of Θ


is said to be closed on a set S ⊆ Θ if for any converging sequences {θi }i≥0 and
{θ̃i }i≥0 , the conditions

(a) θi → θ ∈ S,

(b) θ̃i → θ̃ with θ̃i ∈ T (θi ) for all i ≥ 0,

imply that θ̃ ∈ T (θ).

Note that for point-to-point maps, that is, if T (θ) is a singleton for all θ, the
definition above is equivalent to the requirement that T be continuous on S. Defi-
nition 103 is thus a generalization of continuity for general (point-to-set) maps. We
are now ready to state the main result, which is proved in Zangwill (1969, p. 91)
or Luenberger (1984, p. 187).

Theorem 104 (Global Convergence Theorem). Let Θ be a subset of Rdθ and let
{θi }i≥0 be a sequence generated by θi+1 ∈ T (θi ) where T is a point-to-set map on
Θ. Let S ⊆ Θ be a given “solution” set and suppose that

(1) the sequence {θi }i≥0 is contained in a compact subset of Θ;

(2) T is closed over Θ \ S (the complement of S);

(3) there is a continuous “ascent” function s on Θ such that s(θ) ≥ s(θ0 ) for all
θ ∈ T (θ0 ), with strict inequality for points θ0 that are not in S.

Then the limit of any convergent subsequence of {θi } is in the solution set S. In
addition, the sequence of values of the ascent function, {s(θi )}i≥0 , converges mono-
tonically to s(θ? ) for some θ? ∈ S.

The final statement of Theorem 104 should not be misinterpreted: that {s(θi )}
converges to a value that is the image of a point in S is a simple consequence of
the first and third assumptions. It does however not imply that the sequence of
parameters {θi } is itself convergent in the usual sense, but only that the limit points
of {θi } have to be in the solution set S. An important property however is that
because {s(θi(l) )}l≥0 converges to s(θ? ) for any convergent subsequence {θi(l) }, all
limit points of {θi } must be in the set S? = {θ ∈ Θ : s(θ) = s(θ? )} (in addition
to being in S). This latter statement means that the sequence of iterates {θi } will
ultimately approach a set of points that are “equivalent” as measured by the ascent
function s.
The following convergence theorem, which follows the proof of Wu (1983), is a direct application of the previous theory to the case of EM.

Theorem 105. Suppose that in addition to the hypotheses of Proposition 98 (As-


sumptions 97 as well as parts (a) and (b) of Proposition 98), the following hold.

(i) H(θ ; θ0 ) is continuous in its second argument, θ0 , on Θ.



(ii) For any θ0 , the level set Θ0 = {θ ∈ Θ : `(θ) ≥ `(θ0 )} is compact and contained
in the interior of Θ.

Then all limit points of any instance {θi }i≥0 of an EM algorithm initialized at θ0
are in L0 = {θ ∈ Θ0 : ∇θ `(θ) = 0}, the set of stationary points of ` with log-likelihood at least that of θ0 . The sequence {`(θi )} of log-likelihoods converges
monotonically to `? = `(θ? ) for some θ? ∈ L0 .

Proof. This is a direct application of Theorem 104 using L0 as the solution set and
` as the ascent function. The first hypothesis of Theorem 104 follows from (ii) and
the third one from Proposition 98. The closedness assumption (2) follows from
Proposition 98 and (i): for the EM mapping M defined in (5.47), θ̃i ∈ M (θi )
amounts to the condition

Q(θ̃i ; θi ) ≥ Q(θ ; θi ) for all θ ∈ Θ ,

which is also satisfied by the limits of the sequences {θ̃i } and {θi } (if these converge)
by continuity of the intermediate quantity Q, which follows from that of ` and H
(note that it is here important that H be continuous with respect to both argu-
ments). Hence the EM mapping is indeed closed on Θ as a whole and Theorem 105
follows.

The assumptions of Proposition 98 as well as item (i) above are indeed very mild
in typical situations. Assumption (ii) however may be restrictive, even for models
in which the EM algorithm is routinely used. The practical implication of (ii) being
violated is that the EM algorithm may fail to converge to the stationary points of
the likelihood for some particularly badly chosen initial points θ0 .
Most importantly, the fact that θi+1 maximizes the intermediate quantity Q(· ; θi )
of EM does in no way imply that, ultimately, `? is the global maximum of ` over
Θ. There is even no guarantee that `? is a local maximum of the log-likelihood: it
may well only be a saddle point (Wu, 1983, Section 2.1). Also, the convergence of
the sequence `(θi ) to `? does not automatically imply the convergence of {θi } to a
point θ? .
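The ascent property also gives a cheap runtime diagnostic: if a monitored log-likelihood ever decreases, the E- or M-step implementation is wrong. A generic sketch of such a loop, with e_step, m_step and loglik as hypothetical model-specific callables:

```python
def run_em(theta0, e_step, m_step, loglik, max_iter=100, tol=1e-8):
    """Generic EM loop with an ascent-property sanity check.
    e_step, m_step, loglik are model-specific callables (assumed given)."""
    theta, ll = theta0, loglik(theta0)
    for _ in range(max_iter):
        theta_new = m_step(e_step(theta))   # maximize Q(. ; theta)
        ll_new = loglik(theta_new)
        # ascent property: a (non-numerical) decrease signals a bug
        assert ll_new >= ll - 1e-10, "ascent property violated"
        if ll_new - ll < tol:               # convergence of {l(theta^i)},
            break                           # not necessarily of {theta^i}
        theta, ll = theta_new, ll_new
    return theta, ll
```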
Pointwise convergence of the EM algorithm requires more stringent assumptions
that are difficult to verify in practice. As an example, a simple corollary of the global
convergence theorem states that if the solution set S in Theorem 104 is a single
point, θ? say, then the sequence {θi } indeed converges to θ? (Luenberger, 1984,
p. 188). The sketch of the proof of this corollary is that every subsequence of {θi }
has a convergent further subsequence because of the compactness assumption (1),
but such a subsequence admits s as an ascent function and thus converges to θ?
by Theorem 104 itself. In cases where the solution set is composed of several
points, further conditions are needed to ensure that the sequence of iterates indeed
converges and does not cycle through different solution points.
In the case of EM, pointwise convergence of the EM sequence may be guaranteed
under an additional condition given by Wu (1983) (see also Boyles, 1983, for an
equivalent result), stated in the following theorem.

Theorem 106. Under the hypotheses of Theorem 105, if

(iii) kθi+1 − θi k → 0 as i → ∞,

then all limit points of {θi } are in a connected and compact subset of L? = {θ ∈ Θ :
`(θ) = `? }, where `? is the limit of the log-likelihood sequence {`(θi )}.
In particular, if the connected components of L? are singletons, then {θi } con-
verges to some θ? in L? .

Proof. The set of limit points of a bounded sequence {θi } with ‖θi+1 − θi ‖ → 0 is
connected and compact (Ostrowski, 1966, Theorem 28.1). The claim follows because,
by Theorem 104, the limit points of {θi } must belong to L? .

5.5.2 Rate of Convergence of EM


Even if one can guarantee that the EM sequence {θ̂i } converges to some point
θ? , this limiting point can be either a local maximum, a saddle point, or even a
local minimum. The proposition below states conditions under which the stable
stationary points of EM coincide with local maxima only (see also Lange, 1995,
Proposition 1, for a similar statement). We here consider that the EM mapping M
is a point-to-point map, that is, that the maximizer in the M-step is unique.
To understand the meaning of the term “stable”, consider the following approx-
imation to the limit behavior of the EM sequence: it is sensible to expect that if the
EM mapping M is sufficiently regular in a neighborhood of the limiting fixed point
θ? , the asymptotic behavior of the EM sequence {θi } follows the tangent linear
dynamical system

(θi+1 − θ? ) = M (θi ) − M (θ? ) ≈ ∇θ M (θ? )(θi − θ? ) . (5.48)

Here ∇θ M (θ? ) is called the rate matrix (see for instance Meng and Rubin, 1991).
A fixed point θ? is said to be stable if the spectral radius of ∇θ M (θ? ) is less than 1.
In this case, the tangent linear system is asymptotically stable in the sense that the
sequence {ζ i } defined recursively by ζ i+1 = ∇θ M (θ? )ζ i tends to zero as i tends to
infinity (for any choice of ζ 0 ). The linear rate of convergence of EM is defined as
the largest modulus of the eigenvalues of ∇θ M (θ? ). This rate is an upper bound on
the factors ρk that appear in (5.17).
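Since M, and hence ∇θ M (θ? ), is rarely available in closed form, the rate matrix can be estimated by finite differences of the EM map around a (numerically located) fixed point; its spectral radius then estimates the linear rate of convergence. A sketch, where the EM map M is assumed to be supplied as a function:

```python
import numpy as np

def em_rate_matrix(M, theta_star, eps=1e-6):
    """Central finite-difference estimate of the rate matrix grad M(theta*),
    where M is the (point-to-point) EM map and theta* an estimated fixed point."""
    d = len(theta_star)
    J = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (M(theta_star + e) - M(theta_star - e)) / (2 * eps)
    return J

# The linear rate of EM is the spectral radius of the rate matrix:
# rate = max(abs(np.linalg.eigvals(em_rate_matrix(M, theta_star))))
```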
Proposition 107. Under the assumptions of Theorem 100, assume that Q(· ; θ)
has a unique maximizer for all θ ∈ Θ and that, in addition,
H(θ? ) = − ∫ ∇2θ log f (x ; θ)|θ=θ? p(x ; θ? ) λ(dx)   (5.49)

and

G(θ? ) = − ∫ ∇2θ log p(x ; θ)|θ=θ? p(x ; θ? ) λ(dx)   (5.50)

are positive definite matrices for all stationary points of EM (i.e., such that M (θ? ) =
θ? ). Then for all such points, the following hold true.
(i) ∇θ M (θ? ) is diagonalizable and its eigenvalues are positive real numbers.
(ii) The point θ? is stable for the mapping M if and only if it is a proper maximizer
of `(θ) in the sense that all eigenvalues of ∇2θ `(θ? ) are negative.
Proof. The EM mapping is defined implicitly through the fact that M (θ0 ) maximizes
Q(· ; θ0 ), which implies that
∫ ∇θ log f (x ; θ)|θ=M (θ0 ) p(x ; θ0 ) λ(dx) = 0 ,

using assumption (b) of Theorem 100. Careful differentiation of this relation at a
point θ0 = θ? , which is such that M (θ? ) = θ? and hence ∇θ `(θ)|θ=θ? = 0, gives
(see also Dempster et al., 1977; Lange, 1995)

∇θ M (θ? ) = [H(θ? )]^{−1} [ H(θ? ) + ∇2θ `(θ? ) ] ,

where H(θ? ) is defined in (5.49). The missing information principle—or Louis’
formula (see Proposition 100)—implies that G(θ? ) = H(θ? ) + ∇2θ `(θ? ) is positive
definite under our assumptions.
Thus ∇θ M (θ? ) is diagonalizable with positive eigenvalues that are the same
(counting multiplicities) as those of the matrix A? = I + B? , where B? = [H(θ? )]^{−1/2} ∇2θ `(θ? ) [H(θ? )]^{−1/2} .

Thus ∇θ M (θ? ) is stable if and only if B? has negative eigenvalues only. The
Sylvester law of inertia (see for instance Horn and Johnson, 1985) shows that B? has
the same inertia (number of positive, negative, and zero eigenvalues) as ∇2θ `(θ? ).
Thus all of B? ’s eigenvalues are negative if and only if the same is true for ∇2θ `(θ? ),
that is, if θ? is a proper maximizer of `.
The proof above implies that when θ? is stable, the eigenvalues of ∇θ M (θ? ) lie in
the interval (0, 1).

5.5.3 Generalized EM Algorithms


As discussed above, the type of convergence guaranteed by Theorem 105 is rather
weak but, on the other hand, this result is remarkable as it indeed covers not only
the original EM algorithm proposed by Dempster et al. (1977) but a whole class
of variants of the EM approach. One of the most useful extensions of EM is the
ECM (for expectation conditional maximization) by Meng and Rubin (1993), which
addresses situations where direct maximization of the intermediate quantity of EM
is intractable. Assume for instance that the parameter vector θ consists of two
sub-components θ1 and θ2 , which are such that maximization of Q((θ1 , θ2 ) ; θ0 ) with
respect to θ1 or θ2 only (the other sub-component being fixed) is easy, whereas joint
maximization with respect to θ = (θ1 , θ2 ) is problematic. One may then use the
following algorithm for updating the parameter estimate at iteration i.
E-step: Compute Q((θ1 , θ2 ) ; (θ1^i , θ2^i ));
CM-step: Determine

θ1^{i+1} = arg max_{θ1} Q((θ1 , θ2^i ) ; (θ1^i , θ2^i )) ,

and then

θ2^{i+1} = arg max_{θ2} Q((θ1^{i+1} , θ2 ) ; (θ1^i , θ2^i )) .

It is easily checked that for this algorithm, (5.8) is still verified and thus ` is an ascent
function; this implies that Theorem 105 holds under the same set of assumptions.
The example above is only the simplest case where the ECM approach may be
applied, and further extensions are discussed by Meng and Rubin (1993).
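A minimal sketch of the two-block ECM sweep described above; the conditional maximizers Q_max1 and Q_max2 are hypothetical callables that the model must supply:

```python
def ecm_step(theta1, theta2, Q_max1, Q_max2):
    """One ECM iteration for theta = (theta1, theta2).

    Q_max1(theta2_fixed, theta_prev) maximizes Q((., theta2_fixed); theta_prev)
    over the first block; Q_max2 likewise over the second block.
    Both are hypothetical, model-specific callables."""
    theta_prev = (theta1, theta2)
    theta1 = Q_max1(theta2, theta_prev)     # CM-step 1
    theta2 = Q_max2(theta1, theta_prev)     # CM-step 2
    return theta1, theta2
```

Because each CM-step can only increase Q(· ; θ_prev), the log-likelihood ` remains an ascent function for this scheme, which is exactly why Theorem 105 carries over.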
Chapter 6

Statistical Properties of the Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is one of the backbones of statistics, and
as we have seen in previous chapters, it is very much appropriate also for HMMs,
even though numerical approximations are required when the state space is not
finite. A standard result in statistics says that, except for “atypical cases”, the
MLE is consistent, asymptotically normal with asymptotic (scaled) variance equal
to the inverse Fisher information matrix, and efficient. The purpose of the current
chapter is to show that these properties are indeed true for HMMs as well, provided
some conditions of rather standard nature hold. We will also employ the asymptotic
results obtained to verify the validity of certain likelihood-based tests.
Recall that the distribution (law) P of {Yk }k≥0 depends on a parameter θ that
lies in a parameter space Θ, which we assume is a subset of Rdθ for some dθ . Com-
monly, θ is a vector containing some components that parameterize the transition
kernel of the hidden Markov chain—such as the transition probabilities if the state
space X is finite—and other components that parameterize the conditional distri-
butions of the observations given the states. Throughout the chapter, it is assumed
that the HMM model is, for all θ, fully dominated in the sense of Definition 13 and
that the underlying Markov chain is positive (see Definition 171).

Assumption 108.

(i) There exists a probability measure λ on (X, X ) such that for any x ∈ X and
any θ ∈ Θ, Qθ (x, ·) ≪ λ with transition density qθ . That is, Qθ (x, A) =
∫_A qθ (x, x′) λ(dx′) for A ∈ X .

(ii) There exists a probability measure µ on (Y, Y) such that for any x ∈ X and any
θ ∈ Θ, Gθ (x, ·) ≪ µ with transition density function gθ . That is, Gθ (x, A) =
∫_A gθ (x, y) µ(dy) for A ∈ Y.

(iii) For any θ ∈ Θ, Qθ is positive, that is, Qθ is phi-irreducible and admits a


(necessarily unique) invariant distribution denoted by πθ .

In this chapter, we will generally assume that Θ is compact. Furthermore, θ?


is used to denote the true parameter, that is, the parameter corresponding to the
data that we actually observe.


6.1 A Primer on MLE Asymptotics


The standard asymptotic properties of the MLE hinge on three basic results: a law
of large numbers for the log-likelihood, a central limit theorem for the score function,
and a law of large numbers for the observed information. More precisely,

(i) for all θ ∈ Θ, n−1 `n (θ) → `(θ) Pθ? -a.s. uniformly over compact subsets of Θ,
where `n (θ) is the log-likelihood of the parameter θ given the first n obser-
vations and `(θ) is a continuous deterministic function with a unique global
maximum at θ? ;

(ii) n−1/2 ∇θ `n (θ? ) → N(0, J (θ? )) Pθ? -weakly, where J (θ) is the Fisher informa-
tion matrix at θ (we do not provide a more detailed definition at the moment);

(iii) limδ→0 limn→∞ sup_{|θ−θ? |≤δ} ‖ −n−1 ∇2θ `n (θ) − J (θ? ) ‖ = 0 Pθ? -a.s.

The function ` in (i) is sometimes referred to as the contrast function. We note that
−n−1 ∇2θ `n (θ) in (iii) is the observed information matrix, so that (iii) says that the
observed information should converge to the Fisher information in a certain uniform
sense. This uniformity may be replaced by conditions on the third derivatives of
the log-likelihood, which is common in statistical textbooks, but as we shall see, it
is cumbersome enough even to deal with second derivatives of the log-likelihood for
HMMs, whence avoiding third derivatives is preferable.
Condition (i) assures strong consistency of the MLE, which can be shown using
an argument that goes back to Wald (1949). The idea of the argument is as follows.
Denote by θbn the ML estimator; then `n (θbn ) ≥ `n (θ) for any θ ∈ Θ.
Because ` has a unique global maximum at θ? , `(θ? ) − `(θ) ≥ 0 for any θ ∈ Θ and,
in particular, `(θ? ) − `(θbn ) ≥ 0. We now combine these two inequalities to obtain

0 ≤ `(θ? ) − `(θbn )
≤ `(θ? ) − n−1 `n (θ? ) + n−1 `n (θ? ) − n−1 `n (θbn ) + n−1 `n (θbn ) − `(θbn )
≤ 2 sup_{θ∈Θ} |`(θ) − n−1 `n (θ)| .

Therefore, by taking the compact subset in (i) above as Θ itself, `(θbn ) → `(θ? )
Pθ? -a.s. as n → ∞, which in turn implies, as ` is continuous with a unique global
maximum at θ? , that the MLE converges to θ? Pθ? -a.s. In other words, the MLE
is strongly consistent.
Provided strong consistency holds, properties (ii) and (iii) above yield asymp-
totic normality of the MLE. In fact, we must also assume that θ? is an interior point
of Θ and that the Fisher information matrix J (θ? ) is non-singular. Then we can
for sufficiently large n make a Taylor expansion around θ? , noting that the gradient
of `n vanishes at the MLE θbn because `n attains its maximum there,
0 = ∇θ `n (θbn ) = ∇θ `n (θ? ) + [ ∫_0^1 ∇2θ `n (θ? + t(θbn − θ? )) dt ] (θbn − θ? ) .

From this expansion we obtain


n^{1/2} (θbn − θ? ) = [ −n−1 ∫_0^1 ∇2θ `n (θ? + t(θbn − θ? )) dt ]^{−1} n−1/2 ∇θ `n (θ? ) .

Now θbn converges to θ? Pθ? -a.s. and so, using (iii), the first factor on the right-hand
side tends to J (θ? )−1 Pθ? -a.s. The second factor converges weakly to N(0, J (θ? ));

this is (ii). Cramér-Slutsky’s theorem hence tells us that n1/2 (θbn − θ? ) tends Pθ? -
weakly to N(0, J −1 (θ? )), and this is the standard result on asymptotic normality
of the MLE.
In an entirely similar way properties (ii) and (iii) also show that for any u ∈ Rdθ
(recall that Θ is a subset of Rdθ ),

`n (θ? + n−1/2 u) − `n (θ? ) = n−1/2 u^T ∇θ `n (θ? ) − (1/2) u^T [ −n−1 ∇2θ `n (θ? ) ] u + Rn (u) ,

where n−1/2 ∇θ `n (θ? ) and −n−1 ∇2θ `n (θ? ) converge as described above, and where
Rn (u) tends to zero Pθ? -a.s. Such an expansion is known as local asymptotic nor-
mality (LAN) of the model, cf. Ibragimov and Hasminskii (1981, Definition II.2.1).
Under this condition, it is known that so-called regular estimators (a property pos-
sessed by all “sensible” estimators) cannot have an asymptotic covariance matrix
smaller than J −1 (θ? ) (Ibragimov and Hasminskii, 1981, p. 161). Because this limit
is attained by the MLE, this estimator is efficient.
Later on in this chapter, we will also exploit properties (i)–(iii) to derive asymp-
totic properties of likelihood ratio and other tests for lower dimensional hypotheses
regarding θ.
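In practice, properties (ii) and (iii) are what justify plug-in standard errors: the observed information −∇2θ `n evaluated at the MLE approximates nJ (θ? ), and its inverse gives an approximate covariance for θbn. A sketch using central finite differences, with loglik an assumed callable returning `n (θ):

```python
import numpy as np

def observed_information(loglik, theta_hat, eps=1e-5):
    """Finite-difference Hessian of -loglik at the MLE theta_hat."""
    d = len(theta_hat)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = -(loglik(theta_hat + ei + ej)
                        - loglik(theta_hat + ei - ej)
                        - loglik(theta_hat - ei + ej)
                        + loglik(theta_hat - ei - ej)) / (4 * eps ** 2)
    return H

# By (iii), H / n approximates J(theta_star); approximate standard errors:
# se = np.sqrt(np.diag(np.linalg.inv(observed_information(loglik, theta_hat))))
```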

6.2 Stationary Approximations


In this section, we will introduce a way of obtaining properties (i)–(iii) for HMMs;
more detailed descriptions are given in subsequent sections.
Before proceeding, we will be precise on the likelihood we shall analyze. In this
chapter, we generally make the assumption that the sequence {Xk }k≥0 is stationary;
then {Xk , Yk }k≥0 is stationary as well. Then there is obviously a corresponding
likelihood. However, it is sometimes convenient to work with a likelihood Lx0 ,n (θ)
that is conditional on an initial state x0 ,
Lx0 ,n (θ) = gθ (x0 , Y0 ) ∫ · · · ∫ ∏_{i=1}^{n} qθ (xi−1 , xi ) gθ (xi , Yi ) λ(dxi ) .   (6.1)

We might also want to replace the fixed initial state by an initial distribution ν on (X, X ), giving

Lν,n (θ) = ∫_X Lx0 ,n (θ) ν(dx0 ) .

The stationary likelihood is then Lπθ ,n (θ), which we will simply denote by Ln (θ).
The advantage of working with the stationary likelihood is of course that it is
the correct likelihood for the model and may hence be expected to provide better
finite-sample performance. The advantage of assuming a fixed initial state x0 —and
hence adopting the likelihood Lx0 ,n (θ)—is that the stationary distribution πθ is not
always available in closed form when X is not finite. It is however important that
gθ (x0 , Y0 ) is positive Pθ? -a.s.; otherwise the log-likelihood may not be well-defined.
In fact, we shall require that gθ (x0 , Y0 ) is, Pθ? -a.s., bounded away from zero. In
the following, we always assume that this condition is fulfilled. A further advantage
of Lx0 ,n (θ) is that the methods described in the current chapter may be extended
to Markov-switching autoregressions (Douc et al., 2004), and then the stationary
likelihood is almost never computable, not even when X is finite. Throughout the
rest of this chapter, we will work with Lx0 ,n (θ) unless otherwise noted, where x0 ∈ X is
chosen to satisfy the above positivity assumption but otherwise arbitrarily. The
MLE arising from this likelihood has the same asymptotic properties as has the
MLE arising from Ln (θ), provided the initial stationary distribution πθ has smooth

second-order derivatives (cf. Bickel et al., 1998), whence from an asymptotic point
of view there is no loss in using the incorrect likelihood Lx0 ,n (θ).
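When X is finite, Lx0 ,n (θ) in (6.1) can be evaluated exactly by the normalized filtering recursion, accumulating log conditional densities for numerical stability. A minimal sketch under illustrative array conventions:

```python
import numpy as np

def loglik_x0(Q, g, x0):
    """log L_{x0,n}(theta) for finite X.
    Q is the r x r transition matrix (q_theta) and g[k, x] = g_theta(x, Y_k)."""
    n_plus_1, r = g.shape
    ll = np.log(g[0, x0])                     # g_theta(x0, Y_0)
    phi = np.zeros(r); phi[x0] = 1.0          # filter P(X_k = . | Y_0:k, X_0 = x0)
    for k in range(1, n_plus_1):
        pred = phi @ Q                        # one-step predictive distribution
        c = pred @ g[k]                       # conditional density of Y_k
        ll += np.log(c)
        phi = pred * g[k] / c                 # normalized filter update
    return ll
```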
We now return to the analysis of log-likelihood and items (i)–(iii) above. In the
setting of i.i.d. observations, the log-likelihood `n (θ) is a sum of i.i.d. terms, and
so (i) and (iii) follow from uniform versions of the strong law of large numbers and
(ii) is a consequence of the simplest central limit theorem. In the case of HMMs,
we can write `x0 ,n (θ) as a sum as well:
`x0 ,n (θ) = Σ_{k=0}^{n} log [ ∫ gθ (xk , Yk ) φx0 ,k|k−1 [Y0:k−1 ](dxk ; θ) ]   (6.2)
    = Σ_{k=0}^{n} log [ ∫ gθ (xk , Yk ) Pθ (Xk ∈ dxk | Y0:k−1 , X0 = x0 ) ] ,   (6.3)

where φx0 ,k|k−1 [Y0:k−1 ](· ; θ) is the predictive distribution of the state Xk given the
observations Y0:k−1 and X0 = x0 . These terms do not form a stationary sequence
however, so the law of large numbers—or rather the ergodic theorem—does not
apply directly. Instead we must first approximate `x0 ,n (θ) by the partial sum of a
stationary sequence.
When the joint Markov chain {Xk , Yk } has an invariant distribution, this chain is
stationary provided it is started from its invariant distribution. In this case, we can
(and will!) extend it to a stationary sequence {Xk , Yk }−∞<k<∞ with doubly infinite
time, as we can do with any stationary sequence. Having done this extension, we
can imagine a predictive distribution of the state Xk given the infinite past Y−∞:k−1
of observations. A key feature of these variables is that they now form a stationary
sequence, whence the ergodic theorem applies. Furthermore we can approximate
`x0 ,n (θ) by
`sn (θ) = Σ_{k=0}^{n} log [ ∫ gθ (xk , Yk ) Pθ (Xk ∈ dxk | Y−∞:k−1 ) ] ,   (6.4)

where superindex s stands for “stationary”. Heuristically, one would expect this
approximation to be good, as observations far in the past do not provide much
information about the current one, at least not if the hidden Markov chain enjoys
good mixing properties. What we must do is thus to give a precise definition of the
predictive distribution Pθ (Xk ∈ · | Y−∞:k−1 ) given the infinite past, and then show
that it approximates the predictive distribution φx0 ,k|k−1 (· ; θ) well enough that
the two sums (6.2) and (6.4), after normalization by n, have the same asymptotic
behavior. We can treat the score function similarly by defining a sequence that
forms a stationary martingale increment sequence; for sums of such sequences there
is a central limit theorem.
The cornerstone in this analysis is the result on conditional mixing stated in
Section 3. We will rephrase it here, but before doing so we state a first assumption.
It is really a variation of Assumption 62, adapted to the dominated setting and
uniform in θ.
Assumption 109.
(i) The transition density qθ (x, x0 ) of {Xk } satisfies 0 < σ − ≤ qθ (x, x0 ) ≤ σ + < ∞
for all x, x0 ∈ X and all θ ∈ Θ, and the measure λ is a probability measure.
(ii) For all y ∈ Y, the integral ∫_X gθ (x, y) λ(dx) is bounded away from 0 and ∞ on Θ.
Part (i) of this assumption often, but not always, holds when the state space X
is finite or compact. Note that Assumption 109 says that for all θ ∈ Θ, the whole

state space X is a 1-small set for the transition kernel Qθ , which implies that for
all θ ∈ Θ, the chain is phi-irreducible and strongly aperiodic (see Section 7.2 for
definitions). It also ensures that there exists a stationary distribution πθ for Qθ .
In addition, the chain is uniformly geometrically ergodic in the sense that for any
x ∈ X and n ≥ 0, kQnθ (x, ·) − πθ kTV ≤ (1 − σ − )n . Under Assumption 108, it holds
that πθ  λ, and we use the same notation for this distribution and its density
with respect to the dominating measure λ.
Using the results of Section 7.3, we conclude that the state space X×Y is 1-small
for the joint chain {Xk , Yk }. Thus the joint chain is also phi-irreducible and strongly
aperiodic, and it admits a stationary distribution with density πθ (x)gθ (x, y) with
respect to the product measure λ ⊗ µ on (X × Y, X ⊗ Y). The joint chain is also
uniformly geometrically ergodic.
Put ρ = 1 − σ − /σ + ; then 0 ≤ ρ < 1. The important consequence of Assump-
tion 109 that we need in the current chapter is Proposition 64. It says that if
Assumption 109 holds true, then for all k ≥ 1, all y0:n and all initial distributions
ν and ν′ on (X, X ),

‖ ∫_X Pθ (Xk ∈ · | X0 = x, Y0:n = y0:n ) [ν(dx) − ν′(dx)] ‖_TV ≤ ρ^k .   (6.5)
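The bound (6.5) is easy to observe numerically when X is finite: run the conditional (filtering) recursion from two different initial distributions ν and ν′ on the same observation record and track the total variation distance, which decays at least like ρ^k with ρ = 1 − σ−/σ+. A sketch:

```python
import numpy as np

def filter_tv_decay(Q, g, nu, nu_prime):
    """Track || P(X_k in . | Y_0:k, X_0 ~ nu) - (same from nu') ||_TV
    along one observation record, for finite X; g[k, x] = g_theta(x, Y_k)."""
    phi, phi2 = nu * g[0], nu_prime * g[0]
    phi, phi2 = phi / phi.sum(), phi2 / phi2.sum()
    tv = [0.5 * np.abs(phi - phi2).sum()]
    for k in range(1, len(g)):
        phi = (phi @ Q) * g[k]; phi /= phi.sum()
        phi2 = (phi2 @ Q) * g[k]; phi2 /= phi2.sum()
        tv.append(0.5 * np.abs(phi - phi2).sum())
    return np.array(tv)   # bounded by rho**k, rho = 1 - sigma_minus/sigma_plus
```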

6.3 Consistency
6.3.1 Construction of the Stationary Conditional Log-likelihood
We shall now construct Pθ (Xk ∈ dxk | Y−∞:k−1 ) and ∫ gθ (xk , Yk ) Pθ (Xk ∈ dxk | Y−∞:k−1 ).
The latter variable will be defined as the limit of

Hk,m,x (θ) = ∫ gθ (xk , Yk ) Pθ (Xk ∈ dxk | Y−m+1:k−1 , X−m = x)   (6.6)

as m → ∞. Note that Hk,m,x (θ) is the conditional density of Yk given Y−m+1:k−1
and X−m = x, under the law Pθ . Put

hk,m,x (θ) = log Hk,m,x (θ)   (6.7)

and consider the following assumption.


Assumption 110. b+ = supθ supx,y gθ (x, y) < ∞ and Eθ? |log b− (Y0 )| < ∞, where
b− (y) = inf θ ∫_X gθ (x, y) λ(dx).
Lemma 111. The following assertions hold true Pθ? -a.s. for all indices k, m and m′ such that k > −(m ∧ m′):

sup_{θ∈Θ} sup_{x,x′∈X} |hk,m,x (θ) − hk,m′,x′ (θ)| ≤ ρ^{k+(m∧m′)−1} / (1 − ρ) ,   (6.8)

sup_{θ∈Θ} sup_{m≥−(k−1)} sup_{x∈X} |hk,m,x (θ)| ≤ |log b+ | ∨ |log(σ − b− (Yk ))| .   (6.9)

Proof. Assume that m′ ≥ m and write

Hk,m,x (θ) = ∫∫ [ ∫ gθ (xk , Yk ) qθ (xk−1 , xk ) λ(dxk ) ]
    × Pθ (Xk−1 ∈ dxk−1 | Y−m+1:k−1 , X−m = x−m ) δx (dx−m ) ,   (6.10)

Hk,m′,x′ (θ) = ∫∫ [ ∫ gθ (xk , Yk ) qθ (xk−1 , xk ) λ(dxk ) ]
    × Pθ (Xk−1 ∈ dxk−1 | Y−m+1:k−1 , X−m = x−m )
    × Pθ (X−m ∈ dx−m | Y−m′+1:k−1 , X−m′ = x′ ) ,   (6.11)

and invoke (6.5) to see that

|Hk,m,x (θ) − Hk,m′,x′ (θ)| ≤ ρ^{k+m−1} sup_{xk−1} ∫ gθ (xk , Yk ) qθ (xk−1 , xk ) λ(dxk )
    ≤ ρ^{k+m−1} σ + ∫ gθ (xk , Yk ) λ(dxk ) .   (6.12)

Note that the step from the total variation bound to the bound on the difference
between the integrals does not need a factor “2”, because the integrands are non-
negative. Also note that (6.5) is stated with initial time index 0, but this index is
of course arbitrary. The integral in (6.10) can be bounded from below as

Hk,m,x (θ) ≥ σ − ∫ gθ (xk , Yk ) λ(dxk ) ,   (6.13)

and the same lower bound holds for (6.11). Combining (6.12) with these lower
bounds and the inequality |log x − log y| ≤ |x − y|/(x ∧ y) shows that

|hk,m,x (θ) − hk,m′,x′ (θ)| ≤ (σ +/σ −) ρ^{k+m−1} = ρ^{k+m−1}/(1 − ρ) ,

which is the first assertion of the lemma. Furthermore note that (6.10) and (6.13)
yield
σ − b− (Yk ) ≤ Hk,m,x (θ) ≤ b+ , (6.14)
which implies the second assertion.

Equation (6.8) shows that for any given k and x, {hk,m,x (θ)}m≥−(k−1) is a uni-
form (in θ) Cauchy sequence as m → ∞, Pθ? -a.s., whence there is a Pθ? -a.s. limit.
Moreover, again by (6.8), this limit does not depend on x, so we denote it by
hk,∞ (θ). Our interpretation of this limit is as log Eθ [ gθ (Xk , Yk ) | Y−∞:k−1 ]. Fur-
thermore (6.9) shows that provided Assumption 110 holds, {hk,m,x (θ)}m≥−(k−1) is
uniformly bounded in L1 (Pθ? ), so that hk,∞ (θ) is in L1 (Pθ? ) and, by the dominated
convergence theorem, the limit holds in this mode as well. Finally, by its definition
{hk,∞ (θ)}k≥0 is a stationary process, and it is ergodic because {Yk }−∞<k<∞ is. We
summarize these findings.

Proposition 112. Assume 108, 109, and 110 hold. Then for each θ ∈ Θ and
x ∈ X, the sequence {hk,m,x (θ)}m≥−(k−1) has, Pθ? -a.s., a limit hk,∞ (θ) as m → ∞.
This limit does not depend on x. In addition, for any θ ∈ Θ, hk,∞ (θ) belongs to
L1 (Pθ? ), and {hk,m,x (θ)}m≥−(k−1) also converges to hk,∞ (θ) in L1 (Pθ? ) uniformly
over θ ∈ Θ and x ∈ X.

Having come thus far, we can quantify the approximation of the log-likelihood
`x0 ,n (θ) by `sn (θ).

Proposition 113. For all n ≥ 0 and θ ∈ Θ,

|`x0 ,n (θ) − `sn (θ)| ≤ |log gθ (x0 , Y0 )| + |h0,∞ (θ)| + 1/(1 − ρ)^2   Pθ? -a.s.

Proof. Letting m′ → ∞ in (6.8) we obtain |hk,0,x0 (θ) − hk,∞ (θ)| ≤ ρ^{k−1}/(1 − ρ) for
k ≥ 1. Therefore, Pθ? -a.s.,

|`x0 ,n (θ) − `sn (θ)| = | Σ_{k=0}^{n} hk,0,x0 (θ) − Σ_{k=0}^{n} hk,∞ (θ) |
    ≤ |log gθ (x0 , Y0 )| + |h0,∞ (θ)| + Σ_{k=1}^{n} ρ^{k−1}/(1 − ρ) .
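For finite X, hk,m,x (θ) is itself computable by a filtering pass started at time −m, which makes the geometric stabilization (6.8) easy to check empirically: prepend more and more past observations and watch successive values settle. A sketch with illustrative indexing conventions:

```python
import numpy as np

def h_k_m_x(Q, gblock, x):
    """h_{k,m,x}(theta) = log p(Y_k | Y_{-m+1:k-1}, X_{-m} = x), finite X.
    gblock[j] is the vector (g_theta(i, Y_{-m+1+j}))_i for j = 0, ..., m+k-1,
    so gblock[-1] corresponds to Y_k."""
    phi = np.zeros(Q.shape[0]); phi[x] = 1.0   # point mass at X_{-m} = x
    for gy in gblock[:-1]:                     # filter forward up to time k-1
        phi = (phi @ Q) * gy; phi /= phi.sum()
    return np.log((phi @ Q) @ gblock[-1])      # predict X_k, integrate out Y_k
```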

6.3.2 The Contrast Function and Its Properties


Because hk,∞ (θ) is in L1 (Pθ? ) under the assumptions made above, we can define
the real-valued function `(θ) = Eθ? [hk,∞ (θ)]. It does not depend on k, by stationarity.
This is the contrast function `(θ) referred to above. By the ergodic theorem
n−1 `sn (θ) → `(θ) Pθ? -a.s., and by Proposition 113, n−1 `x0 ,n (θ) → `(θ) Pθ? -a.s. as
well. As noted above, however, we require this convergence to be uniform in θ,
which is not guaranteed so far. In addition, we require `(θ) to be continuous and
possess a unique global maximum at θ? ; the latter is an identifiability condition.
In the rest of this section, we address continuity and convergence; identifiability is
addressed in the next one.
To ensure continuity we need a natural assumption on continuity of the building
blocks of the likelihood.

Assumption 114. For all (x, x0 ) ∈ X × X and y ∈ Y, the functions θ 7→ qθ (x, x0 )


and θ 7→ gθ (x, y) are continuous.

The following result shows that hk,∞ (θ) is then continuous in L1 (Pθ? ).

Proposition 115. Assume 108, 109, 110, and 114. Then for any θ ∈ Θ,
" #
Eθ? sup |h0,∞ (θ0 ) − h0,∞ (θ)| → 0 as δ → 0 ,
θ 0 ∈Θ: |θ 0 −θ|≤δ

and θ 7→ `(θ) is continuous on Θ.

Proof. Recall that h0,∞ (θ) is the limit of h0,m,x (θ) as m → ∞. We first prove that
for any x ∈ X and any m > 0, the latter quantity is continuous in θ and then use
this to show continuity of the limit. Recall the interpretation of H0,m,x (θ) as a
conditional density and write

H0,m,x (θ) = [ ∫ · · · ∫ ∏_{i=−m+1}^{0} qθ (xi−1 , xi ) gθ (xi , Yi ) λ(dx−m+1 ) · · · λ(dx0 ) ]
    / [ ∫ · · · ∫ ∏_{i=−m+1}^{−1} qθ (xi−1 , xi ) gθ (xi , Yi ) λ(dx−m+1 ) · · · λ(dx−1 ) ]   (6.15)

The integrand in the numerator is, by assumption, continuous and bounded by


(σ + b+ )m , whence dominated convergence shows that the numerator is continuous
with respect to θ (recall that λ is assumed finite). Likewise the denominator is
continuous, and it is bounded from below by (σ − )^{m−1} ∏_{i=−m+1}^{−1} b− (Yi ) > 0 Pθ? -a.s.
Thus H0,m,x (θ) and h0,m,x (θ) are continuous as well. Because h0,m,x (θ) converges
to h0,∞ (θ) uniformly in θ as m → ∞, Pθ? -a.s., h0,∞ (θ) is continuous Pθ? -a.s. The
uniform bound (6.9) assures that we can invoke dominated convergence to obtain
the first part of the proposition.

The second part is a corollary of the first one, as

sup_{θ′: |θ′−θ|≤δ} |`(θ′) − `(θ)| = sup_{θ′: |θ′−θ|≤δ} | Eθ? [h0,∞ (θ′) − h0,∞ (θ)] |
    ≤ Eθ? [ sup_{θ′: |θ′−θ|≤δ} |h0,∞ (θ′) − h0,∞ (θ)| ] .

We can now proceed to show uniform convergence of n−1 `x0 ,n (θ) to `(θ).
Proposition 116. Assume 108, 109, 110, and 114. Then

sup |n−1 `x0 ,n (θ) − `(θ)| → 0 Pθ? -a.s. as n → ∞.


θ∈Θ

Proof. First note that because Θ is compact, it is sufficient to prove that for all
θ ∈ Θ,
limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n−1 `x0 ,n (θ′) − `(θ)| = 0 Pθ? -a.s.

Now write

limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n−1 `x0 ,n (θ′) − `(θ)|
    = limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n−1 `x0 ,n (θ′) − n−1 `sn (θ)|
    ≤ limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} n−1 |`x0 ,n (θ′) − `sn (θ′)|
    + limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} n−1 |`sn (θ′) − `sn (θ)| .

The first term on the right-hand side vanishes by Proposition 113 (note that Lemma 111
shows that supθ0 |h0,∞ (θ0 )| is in L1 (Pθ? ) and hence finite Pθ? -a.s.). The second term
is bounded by
limsup_{δ→0} limsup_{n→∞} sup_{θ′: |θ′−θ|≤δ} n−1 | Σ_{k=0}^{n} (hk,∞ (θ′) − hk,∞ (θ)) |
    ≤ limsup_{δ→0} limsup_{n→∞} n−1 Σ_{k=0}^{n} sup_{θ′: |θ′−θ|≤δ} |hk,∞ (θ′) − hk,∞ (θ)|
    = limsup_{δ→0} Eθ? [ sup_{θ′: |θ′−θ|≤δ} |h0,∞ (θ′) − h0,∞ (θ)| ] = 0 ,

with convergence Pθ? -a.s. The two final steps follow by the ergodic theorem and
Proposition 115 respectively. The proof is complete.
At this point, we thus know that n−1 `x0 ,n converges uniformly to `. The
same conclusion holds when other initial distributions ν are put on X0 , provided
supθ |log ∫ gθ (x, Y0 ) ν(dx)| is finite Pθ? -a.s. When ν is the stationary distribution πθ ,
uniform convergence can in fact be proved without this extra regularity assumption
by conditioning on the previous state X−1 to get rid of the first two terms in the
bound of Proposition 113; cf. Douc et al. (2004).
The uniform convergence of n−1 `x0 ,n (θ) to `(θ) can be used—with an argument
entirely similar to the one of Wald outlined in Section 6.1—to show that the MLE
converges a.s. to the set, Θ? say, of global maxima of `. Because ` is continuous,

we know that Θ? is closed and hence also compact. More precisely, for any (open)
neighborhood of Θ? , the MLE will be in that neighborhood for large n, Pθ? -a.s. We
say that the MLE converges to Θ? in the quotient topology. This way of describing
convergence was used, in the context of HMMs, by Leroux (1992). The purpose of
the identifiability constraint, that `(θ) has a unique global maximum at θ? , is thus
to ensure that Θ? consists of the single point θ? so that the MLE indeed converges
to the point θ? .

6.4 Identifiability
As became obvious in the previous section, the set of global maxima of ` is of
intrinsic importance, as this set constitutes the possible limit points of the MLE.
The definition of `(θ) as a limit is however usually not suitable for extracting relevant
information about the set of maxima, and the purpose of this section is to derive a
different characterization of the set of global maxima of `.

6.4.1 Equivalence of Parameters


We now introduce the notion of equivalence of parameters.

Definition 117. Two points θ, θ0 ∈ Θ are said to be equivalent if they govern


identical laws for the process {Yk }k≥0 , that is, if Pθ = Pθ0 .

We note that, by virtue of Kolmogorov’s extension theorem, θ and θ0 are equiv-


alent if and only if the finite-dimensional distributions Pθ (Y1 ∈ ·, Y2 ∈ ·, . . . , Yn ∈ ·)
and Pθ0 (Y1 ∈ ·, Y2 ∈ ·, . . . , Yn ∈ ·) agree for all n ≥ 1.
We will show that a parameter θ ∈ Θ is a global maximum point of ` if and only
if θ is equivalent to θ? . This implies that the limit points of the MLE are those
points θ that govern the same law for {Yk }k≥0 as does θ? . This is the best we can
hope for because there is no way—even with an infinitely large sample of Y s!—to
distinguish between the true parameter θ? and a different but equivalent parameter
θ. Naturally we would like to conclude that no parameter other than θ? itself is
equivalent to θ? . This is not always the case however, in particular when X is finite
and we can number the states arbitrarily. We will discuss this matter further after
proving the following result.

Theorem 118. Assume 108, 109, and 110. Then a parameter θ ∈ Θ is a global
maximum of ` if and only if θ is equivalent to θ? .

An immediate implication of this result is that θ? is a global maximum of `.

Proof. By the definition of `(θ) and Proposition 112,


`(θ? ) − `(θ) = Eθ? [ lim_{m→∞} h1,m,x (θ? ) ] − Eθ? [ lim_{m→∞} h1,m,x (θ) ]
    = lim_{m→∞} Eθ? [h1,m,x (θ? )] − lim_{m→∞} Eθ? [h1,m,x (θ)]
    = lim_{m→∞} Eθ? [h1,m,x (θ? ) − h1,m,x (θ)] ,

where hk,m,x (θ) is given in (6.7). Next, write

Eθ? [h1,m,x (θ? ) − h1,m,x (θ)] = Eθ? [ Eθ? [ log ( H1,m,x (θ? ) / H1,m,x (θ) ) | Y−m+1:0 , X−m = x ] ] ,

where Hk,m,x (θ) is given in (6.6). Recalling that H1,m,x (θ) is the conditional density
of Y1 given Y−m+1:0 and X−m = x, we see that the inner (conditional) expectation
on the right-hand side is a Kullback-Leibler divergence and hence non-negative.
Thus the outer expectation and the limit `(θ? ) − `(θ) are non-negative as well, so
that θ? is a global mode of `.
Now pick θ ∈ Θ such that `(θ) = `(θ? ). Throughout the remainder of the
proof, we will use the letter p to denote (possibly conditional) densities of random
variables, with the arguments of the density indicating which random variables are
referred to. For any k ≥ 1,

Eθ? [log pθ (Y1:k | Y−m+1:0 , X−m = x)] = Σ_{i=1}^{k} Eθ? [log pθ (Yi | Y−m+1:i−1 , X−m = x)] = Σ_{i=1}^{k} Eθ? [hi,m,x (θ)]

so that, employing stationarity,

lim_{m→∞} Eθ? [log pθ (Y1:k | Y−m+1:0 , X−m = x)] = k `(θ) .

Thus for any positive integer n < k,

0 = k (`(θ? ) − `(θ))
    = lim_{m→∞} Eθ? [ log ( pθ? (Y1:k | Y−m+1:0 , X−m = x) / pθ (Y1:k | Y−m+1:0 , X−m = x) ) ]
    = lim_{m→∞} ( Eθ? [ log ( pθ? (Yk−n+1:k | Y−m+1:0 , X−m = x) / pθ (Yk−n+1:k | Y−m+1:0 , X−m = x) ) ]
        + Eθ? [ log ( pθ? (Y1:k−n | Yk−n+1:k , Y−m+1:0 , X−m = x) / pθ (Y1:k−n | Yk−n+1:k , Y−m+1:0 , X−m = x) ) ] )
    ≥ limsup_{m→∞} Eθ? [ log ( pθ? (Y1:n | Yn−k−m+1:n−k , Xn−k−m = x) / pθ (Y1:n | Yn−k−m+1:n−k , Xn−k−m = x) ) ] ,

where the inequality follows by using stationarity for the first term and noting
that the second term is non-negative as an expectation of a (conditional) Kullback-
Leibler divergence as above. Hence we have inserted a gap between the variables
Y1:n whose density we examine and the variables Yn−k−m+1:n−k and Xn−k−m that
appear as a condition. The idea is now to let this gap tend to infinity and to show
that in the limit the condition has no effect. Next we shall thus show that
 
lim_{k→∞} sup_{m≥k} | Eθ? [ log ( pθ? (Y1:n | Y−m+1:−k , X−m = x) / pθ (Y1:n | Y−m+1:−k , X−m = x) ) ]
    − Eθ? [ log ( pθ? (Y1:n ) / pθ (Y1:n ) ) ] | = 0 .   (6.16)

Combining (6.16) with the previous inequality, it is clear that if `(θ) = `(θ? ), then
Eθ? {log[pθ? (Y1:n )/pθ (Y1:n )]} = 0, that is, the Kullback-Leibler divergence between
the n-dimensional densities pθ? (y1:n ) and pθ (y1:n ) vanishes. This implies, by the in-
formation inequality, that these densities coincide except on a set with µ⊗n -measure
zero, so that the n-dimensional laws of Pθ? and Pθ agree. Because n was arbitrary,
we find that θ? and θ are equivalent.
What remains to do is thus to prove (6.16). To that end, put Uk,m (θ) =
log pθ (Y1:n |Y−m+1:−k , X−m = x) and U (θ) = log pθ (Y1:n ). Obviously, it is enough

to prove that for all θ ∈ Θ,


 
lim_{k→∞} Eθ? [ sup_{m≥k} |Uk,m (θ) − U (θ)| ] = 0 .   (6.17)

To do that we write

pθ (Y1:n | Y−m+1:−k , X−m = x) = ∫∫ pθ (Y1:n | X0 = x0 ) Q^k_θ (x−k , dx0 ) Pθ (X−k ∈ dx−k | Y−m+1:−k , X−m = x)

and

pθ (Y1:n ) = ∫∫ pθ (Y1:n | X0 = x0 ) Q^k_θ (x−k , dx0 ) πθ (dx−k ) ,

where πθ is the stationary distribution of {Xk }. Realizing that pθ (Y1:n |X0 = x0 )


is bounded from above by (b+ )n (condition on X1:n !) and that the transition ker-
nel Qθ satisfies the Doeblin condition (see Definition 50) and is thus uniformly
geometrically ergodic (see Definition 53 and Lemma 51), we obtain
sup_{m≥k} |pθ (Y1:n | Y−m+1:−k , X−m = x) − pθ (Y1:n )| ≤ (b+ )^n (1 − σ − )^k   (6.18)

Pθ? -a.s. Moreover, the bound

pθ (Y1:n | X0 = x0 ) = ∫ · · · ∫ ∏_{i=1}^{n} qθ (xi−1 , xi ) gθ (xi , Yi ) λ(dxi ) ≥ (σ − )^n ∏_{i=1}^{n} b− (Yi )

implies that pθ (Y1:n |Y−m+1:−k , X−m = x) and pθ (Y1:n ) both obey the same lower
bound. Combined with the observation b− (Yi ) > 0 Pθ? -a.s., which follows from
Assumption 110, and the bound |log(x) − log(y)| ≤ |x − y|/(x ∧ y), (6.18) shows that

lim_{k→∞} sup_{m≥k} |Uk,m (θ) − U (θ)| = 0 Pθ? -a.s.

Now (6.17) follows from dominated convergence provided

Eθ? [ sup_k sup_{m≥k} |Uk,m (θ)| ] < ∞ .

Using the aforementioned bounds, we conclude that this expectation is indeed finite.

We remark that the basic structure of the proof is potentially applicable also to
models other than HMMs. Indeed, using the notation of the proof, we may define
` as `(θ) = limm→∞ Eθ? [log pθ (Y1 |Y−m:1 )], a definition that does not exploit the
HMM structure. Then the first part of the proof, up to (6.16), does not use the
HMM structure either, so that all that is needed, in a more general framework, is
to verify (6.16) (or, more precisely, a version thereof not containing X−m ). For
particular other processes, this could presumably be carried out using, for instance,
suitable mixing properties.
The above theorem shows that the points of global maxima of `—forming the
set of possible limit points of the MLE—are those that are statistically equivalent
to θ? . This result, although natural and important (but not trivial!), is however yet
of a somewhat “high level” character, that is, not verifiable in terms of “low level”
conditions. We would like to provide some conditions, expressed directly in terms of
the Markov chain and the conditional distributions gθ (x, y), that give information
about parameters that are equivalent to θ? and, in particular, when there is no
other such parameter than θ? . We will do this using the framework of mixtures of
distributions.

6.4.2 Identifiability of Mixture Densities


We first define what is meant by a mixture density.

Definition 119. Let fφ (y) be a parametric family of densities on Y with respect


to a common dominating measure µ and parameter φ in some set Φ. If π is a
probability measure on Φ, then the density
fπ (y) = ∫_Φ fφ (y) π(dφ)

is called a mixture density; the distribution π is called the mixing distribution.


We say that the class of (all) mixtures of (fφ ) is identifiable if fπ = fπ0 µ-a.e.
if and only if π = π 0 .
Furthermore we say that the class of finite mixtures of (fφ ) is identifiable if for
all probability measures π and π 0 with finite support, fπ = fπ0 µ-a.e. if and only if
π = π0 .

In other words, the class of all mixtures of (fφ ) is identifiable if the two distri-
butions with densities fπ and fπ0 respectively agree only when π = π 0 . Yet another
way to put this property is to say that identifiability means that the mapping
π 7→ fπ is one-to-one (injective). A way, slightly Bayesian, of thinking of a mixture
distribution that is often intuitive and fruitful is the following. Draw φ ∈ Φ with
distribution π and then Y from the density fφ . Then, Y has density fπ .
Many important and commonly used parametric classes of densities are identi-
fiable. We mention the following examples.

(i) The Poisson family (Feller, 1943). In this case, Y = Z+ , Φ = R+ , φ is the mean
of the Poisson distribution, µ is counting measure, and fφ (y) = φ^y e^{−φ} / y! .

(ii) The Gamma family (Teicher, 1961), with the mixture being either on the scale
parameter (with a fixed form parameter) or on the form parameter (with a
fixed scale parameter). The class of joint mixtures over both parameters is not
identifiable however, but the class of joint finite mixtures is identifiable.

(iii) The normal family (Teicher, 1960), with the mixture being either on the mean
(with fixed variance) or on the variance (with fixed mean). The class of joint
mixtures over both mean and variance is not identifiable however, but the class
of joint finite mixtures is identifiable.

(iv) The Binomial family Bin(N, p) (Teicher, 1963), with the mixture being on the
probability p. The class of finite mixtures is identifiable, provided the number
of components k of the mixture satisfies 2k − 1 ≤ N .

Further reading on identifiability of mixtures is found, for instance, in Titterington


et al. (1985, Section 3.1).
A very useful result on mixtures, taking identifiability in one dimension into
several dimensions, is the following.

Theorem 120 (Teicher, 1967). Assume that the class of all mixtures of the family
(fφ ) of densities on Y with parameter φ ∈ Φ is identifiable. Then the class of all
mixtures of the n-fold product densities f^{(n)}_φ (y) = fφ1 (y1 ) · · · fφn (yn ) on y ∈ Y^n
with parameter φ ∈ Φ^n is identifiable. The same conclusion holds true when “all
mixtures” is replaced by “finite mixtures”.

6.4.3 Application of Mixture Identifiability to Hidden Markov Models
Let us now explain how identifiability of mixture densities applies to HMMs. As-
sume that {Xk , Yk } is an HMM such that the conditional densities gθ (x, y) all belong
to a single parametric family. Then given Xk = x, Yk has conditional density gφ(x)
say, where φ(x) is a function mapping the current state x into the parameter space
Φ of the parametric family of densities. Now assume that the class of all mixtures
of this family of densities is identifiable, and that we are given a true parameter
θ? of the model as well as an equivalent other parameter θ. Associated with these
two parameters are two mappings φ? (x) and φ(x), respectively, as above. As θ?
and θ are equivalent, the n-dimensional restrictions of Pθ? and Pθ coincide; that
is, Pθ? (Y1:n ∈ ·) and Pθ (Y1:n ∈ ·) agree. Because the class of all mixtures of (gφ )
is identifiable, Theorem 120 tells us that the n-dimensional distributions of the
processes {φ? (Xk )} and {φ(Xk )} agree. That is, for all subsets A ⊆ Φn ,

Pθ? {(φ? (X1 ), φ? (X2 ), . . . , φ? (Xn )) ∈ A}


= Pθ {(φ(X1 ), φ(X2 ), . . . , φ(Xn )) ∈ A} .

This condition is often informative for concluding θ = θ? .


Example 121 (Normal HMM). Assume that X is finite, say X = {1, 2, . . . , r},
and that Yk |Xk = i ∼ N(µi , σ 2 ). The parameters of the model are the transition
probabilities qij of {Xk }, the µi and σ 2 . We thus identify φ(x) = µx . If θ? and
θ are two equivalent parameters, the laws of the processes {µ?Xk } and {µXk } are
thus the same, and in addition σ?2 = σ 2 . Here µ?i denotes the µi -component of θ? ,
etc. Assuming the µ?i to be distinct, this can only happen if the sets {µ?1 , . . . , µ?r }
and {µ1 , . . . , µr } are identical. We may thus conclude that the sets of means must
be the same for both parameters, but they need not be enumerated in the same
order. Thus there is a permutation {c(1), c(2), . . . , c(r)} of {1, 2, . . . , r} such that
µc(i) = µ?i for all i ∈ X. Now because the laws of {µ?Xk } under Pθ? and {µc(Xk ) }
under Pθ coincide with the µi s being distinct, we conclude that the laws of {Xk }
under Pθ? and of {c(Xk )} under Pθ also agree, which in turn implies q?ij = qc(i),c(j)
for all i, j ∈ X.
Hence any parameter θ that is equivalent to θ? is in fact identical, up to a
permutation of state indices. Sometimes the parameter space is restricted by, for
instance, requiring the means µi to be sorted: µ1 < µ2 < . . . < µr , which removes
the ambiguity.
In the current example, we could also have allowed the variance σ 2 to depend
on the state, Yk |Xk = i ∼ N(µi , σi2 ), reaching the same conclusion. The assumption
of conditional normality is of course not crucial either; any family of distributions
for which finite mixtures are identifiable would do.
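The conclusion of Example 121, equality up to a permutation of the state indices, can be checked mechanically for two fitted parameterizations. A brute-force sketch over all r! relabelings (adequate for small r):

```python
import itertools
import numpy as np

def equivalent_up_to_permutation(Q1, mu1, Q2, mu2, tol=1e-8):
    """Check whether (Q2, mu2) equals (Q1, mu1) after relabelling states,
    as in Example 121 (distinct means, common variance)."""
    r = len(mu1)
    for perm in itertools.permutations(range(r)):
        p = list(perm)
        # mu2[p[i]] == mu1[i]  and  Q2[p[i], p[j]] == Q1[i, j] for all i, j
        if (np.allclose(mu2[p], mu1, atol=tol)
                and np.allclose(Q2[np.ix_(p, p)], Q1, atol=tol)):
            return True
    return False
```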
Example 122 (General Stochastic Volatility). In this example, we consider a
stochastic volatility model of the form Yk |Xk = x ∼ N(0, σ 2 (x)), where σ 2 (x) is
a mapping from X to R+ . Thus, we identify φ(x) = σ 2 (x). Again assume that we
are given a true parameter θ? as well as another parameter θ, which is equivalent to
θ? . Because all variance mixtures of normal distributions are identifiable, the laws
of {σ?2 (Xk )} under Pθ? and of {σ 2 (Xk )} under Pθ agree. Assuming for instance
that σ?2 (x) = σ 2 (x) = x (and hence also X ⊆ R+ ), we conclude that the laws of
{Xk } under Pθ? and Pθ , respectively, agree. For particular models of the transition
kernel Q of {Xk }, such as the finite case of the previous example, we may then be
able to show that θ = θ? , possibly up to a permutation of state indices.
Example 123. Sometimes a model with finite state space is identifiable even
though the conditional densities g(x, ·) are identical for several x. For instance,

consider a model on the state space X = {0, 1, 2} with Yk |Xk = i ∼ N(µi , σ 2 ), the
constraints µ0 = µ1 < µ2 , and transition probability matrix
 
Q = [ q00  q01  0
      q10  q11  q12
      0    q21  q22 ] .
The Markov chain {Xk } is thus a (discrete-time) birth-and-death process in the
sense that it can change its state index by at most one in each step. This model
is similar to models used in modeling ion channel dynamics (cf. Fredkin and Rice,
1992). Because µ1 < µ2 , we could then think of states 0 and 1 as “closed” and of
state 2 as “open”.
Now assume that θ is equivalent to θ? . Just as in Example 121, we may then
conclude that the law of {µ?Xk } under Pθ? and that of {µXk } under Pθ agree, and
hence, because of the constraints on the µs, that the laws of {1(Xk ∈ {0, 1}) +
1(Xk = 2)} under Pθ? and Pθ agree. In other words, after lumping states 0 and
1 of the Markov chain we obtain processes with identical laws. This in particular
implies that the distributions under Pθ? and Pθ of the sojourn times in the state
aggregate {0, 1} coincide. The probability of such a sojourn having length 1 is q12 ,
whence q12 = q?12 must hold. For length 2, the corresponding probability is q11 q12 ,
whence q11 = q?11 follows and then also q10 = q?10 as rows of Q sum up to unity.
For length 3, the probability is q11^2 q12 + q10 q01 q12 , so that finally q01 = q?01 and
q00 = q?00 . We may thus conclude that θ = θ? , that is, the model is identifiable.
The reason that identifiability holds despite the means µi being non-distinct is the
special structure of Q. For further reading on identifiability of lumped Markov
chains, see Ito et al. (1992).

6.5 Asymptotic Normality of the Score and Convergence of the Observed Information
We now turn to asymptotic properties of the score function and the observed in-
formation. The score function will be discussed in some detail, whereas for the
information matrix we will just state the results.

6.5.1 The Score Function and Invoking the Fisher Identity


Define the score function
∇θ `x0 ,n (θ) = Σ_{k=0}^{n} ∇θ log [ ∫ gθ (xk , Yk ) Pθ (Xk ∈ dxk | Y0:k−1 , X0 = x0 ) ] .   (6.19)

To make sure that this gradient indeed exists and is well-behaved enough for our
purposes, we make the following assumptions.
Assumption 124. There exists an open neighborhood U = {θ : |θ − θ? | < δ} of θ?
such that the following hold.
(i) For all (x, x0 ) ∈ X × X and all y ∈ Y, the functions θ 7→ qθ (x, x0 ) and θ 7→
gθ (x, y) are twice continuously differentiable on U.
(ii) sup_{θ∈U} sup_{x,x′} ‖∇θ log qθ (x, x′)‖ < ∞ and sup_{θ∈U} sup_{x,x′} ‖∇2θ log qθ (x, x′)‖ < ∞ .

(iii) Eθ? [ sup_{θ∈U} sup_x ‖∇θ log gθ (x, Y1 )‖^2 ] < ∞ and Eθ? [ sup_{θ∈U} sup_x ‖∇2θ log gθ (x, Y1 )‖ ] < ∞ .

(iv) For µ-almost all y ∈ Y, there exists a function fy : X → R+ in L1 (λ) such that
supθ∈U gθ (x, y) ≤ fy (x).

(v) For λ-almost all x ∈ X, there exist functions fx1 : Y → R+ and fx2 : Y → R+ in
L1 (µ) such that k∇θ gθ (x, y)k ≤ fx1 (y) and k∇2θ gθ (x, y)k ≤ fx2 (y) for all θ ∈ U.

These assumptions assure that the log-likelihood is twice continuously differen-


tiable, and also that the score function and observed information have finite mo-
ments of order two and one, respectively, under Pθ? . The assumptions are natural
extensions of standard assumptions that are used to prove asymptotic normality of
the MLE for i.i.d. observations. The asymptotic results to be derived below are valid
also for likelihoods obtained using a distribution νθ for X0 (such as the stationary
one), provided this distribution satisfies conditions similar to the above ones: for all
x ∈ X, θ 7→ νθ (x) is twice continuously differentiable on U, and the first and second
derivatives of θ 7→ log νθ (x) are bounded uniformly over θ ∈ U and x ∈ X.
We shall now study the score function and its asymptotics in detail. Even
though the log-likelihood is differentiable, one must take some care to arrive at an
expression for the score function that is useful. A tool that is often useful in the
context of models with incompletely observed data is the so-called Fisher identity,
which we encountered in Section 5.1.3. Invoking this identity, which holds in a
neighborhood of θ? under Assumption 124, we find that (cf. (5.28))
" n
#
X
∇θ `x0 ,n (θ) = ∇θ log gθ (x0 , Y0 ) + Eθ φθ (Xk−1 , Xk , Yk ) Y0:n , X0 = x0 ,
k=1
(6.20)
where φθ (x, x′, y′) = ∇θ log[qθ (x, x′) gθ (x′, y′)]. However, just as when we obtained
a law of large numbers for the normalized log-likelihood, we want to express the
score function as a sum of increments, conditional scores. For that purpose we write
∇θ `x0 ,n (θ) = ∇θ `x0 ,0 (θ) + Σ_{k=1}^{n} { ∇θ `x0 ,k (θ) − ∇θ `x0 ,k−1 (θ) } = Σ_{k=0}^{n} ḣk,0,x0 (θ) ,   (6.21)

where ḣ0,0,x0 = ∇θ log gθ (x0 , Y0 ) and, for k ≥ 1,


" k
#
X
ḣk,0,x (θ) = Eθ φθ (Xi−1 , Xi , Yi ) Y0:k , X0 = x
i=1
" k−1 #
X
− Eθ φθ (Xi−1 , Xi , Yi ) Y0:k−1 , X0 = x .
i=1

Note that ḣk,0,x (θ) is the gradient with respect to θ of the conditional log-likelihood
hk,0,x (θ) as defined in (6.7). It is a matter of straightforward algebra to check that
(6.20) and (6.21) agree.
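For finite X, the Fisher identity (6.20) is also how the score is computed in practice: it is a sum of smoothed expectations of φθ . A sketch assuming the smoothed pair marginals and the gradients of log qθ and log gθ have been precomputed (array conventions are illustrative):

```python
import numpy as np

def score_fisher_identity(grad_log_g, grad_log_q, pair_smooth, x0):
    """Score via the Fisher identity (6.20) for finite X = {0, ..., r-1}.

    grad_log_g : (n+1, r, d)  gradient of log g_theta(x, Y_k)
    grad_log_q : (r, r, d)    gradient of log q_theta(x, x')
    pair_smooth: (n, r, r)    pair_smooth[k-1, x, x'] =
                              P(X_{k-1}=x, X_k=x' | Y_0:n, X_0=x0), k = 1..n
    All quantities are assumed precomputed under the current theta."""
    n = grad_log_g.shape[0] - 1
    score = grad_log_g[0, x0].copy()          # term for k = 0
    for k in range(1, n + 1):
        w = pair_smooth[k - 1]                # smoothed pair marginal
        # E_theta[ phi_theta(X_{k-1}, X_k, Y_k) | Y_0:n, X_0 = x0 ]
        score += np.einsum('ij,ijd->d', w, grad_log_q)
        score += np.einsum('ij,jd->d', w, grad_log_g[k])
    return score
```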

6.5.2 Construction of the Stationary Conditional Score


We can extend, for any integers k ≥ 1 and m ≥ 0, the definition of ḣk,0,x (θ) to
" k
#
X
ḣk,m,x (θ) = Eθ φθ (Xi−1 , Xi , Yi ) Y−m+1:k , X−m = x
i=−m+1
" k−1
#
X
− Eθ φθ (Xi−1 , Xi , Yi ) Y−m+1:k−1 , X−m = x
i=−m+1

with the aim, just as before, to let m → ∞. This will yield a definition of ḣk,∞ (θ);
the dependence on x will vanish in the limit. Note however that the construction
below does not show that this quantity is in fact the gradient of hk,∞ (θ), although
one can indeed prove that this is the case.
As noted in Section 6.1, we want to prove a central limit theorem (CLT) for the
score function evaluated at the true parameter. A quite general way to do that is to
recognize that the corresponding score increments form, under reasonable assump-
tions, a martingale increment sequence with respect to the filtration generated by
the observations. This sequence is not stationary though, so one must either use a
general martingale CLT or first approximate the sequence by a stationary martin-
gale increment sequence. We will take the latter approach, and our approximating
sequence is nothing but {ḣk,∞ (θ? )}.
We now proceed to the construction of ḣk,∞ (θ). First write ḣk,m,x (θ) as

ḣk,m,x (θ) = Eθ [φθ (Xk−1 , Xk , Yk ) | Y−m+1:k , X−m = x]
    + Σ_{i=−m+1}^{k−1} ( Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x] − Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k−1 , X−m = x] ) .   (6.22)

The following result shows that it makes sense to take the limit as m → ∞ in the
previous display.

Proposition 125. Assume 108, 109, and 124 hold. Then for any integers 1 ≤ i ≤
k, the sequence {Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x]}m≥0 converges Pθ? -a.s.
and in L2 (Pθ? ), uniformly with respect to θ ∈ U and x ∈ X, as m → ∞. The limit
does not depend on x.

We interpret and write this limit as Eθ [φθ (Xi−1 , Xi , Yi ) | Y−∞:k ].

Proof. The proof is entirely similar to that of Proposition 112. For any (x, x′) ∈
X × X and non-negative integers m′ ≥ m,

‖ Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x] − Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m′+1:k , X−m′ = x′] ‖
    = ‖ ∫∫∫ φθ (xi−1 , xi , Yi ) Qθ (xi−1 , dxi ) Pθ (Xi−1 ∈ dxi−1 | Y−m+1:k , X−m = x−m )
        × [ δx (dx−m ) − Pθ (X−m ∈ dx−m | Y−m′+1:k , X−m′ = x′) ] ‖
    ≤ 2 sup_{x,x′} ‖φθ (x, x′, Yi )‖ ρ^{(i−1)+m} ,   (6.23)

where the inequality stems from (6.5). Setting x = x′ in this display shows that
{Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x]}m≥0 is a Cauchy sequence, thus converging
Pθ? -a.s. The inequality also shows that the limit does not depend on x. Moreover,
for any non-negative integer m, x ∈ X and θ ∈ U,

‖ Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x] ‖ ≤ sup_{x,x′} ‖φθ (x, x′, Yi )‖ ,

with the right-hand side belonging to L2 (Pθ? ). The inequality (6.23) thus also
shows that {Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x]}m≥0 is a Cauchy sequence in
L2 (Pθ? ) and hence converges in L2 (Pθ? ).
With the sums arranged as in (6.22), we can let m → ∞ and define, for k ≥ 1,

ḣk,∞ (θ) = Eθ [φθ (Xk−1 , Xk , Yk ) | Y−∞:k ]
    + Σ_{i=−∞}^{k−1} ( Eθ [φθ (Xi−1 , Xi , Yi ) | Y−∞:k ] − Eθ [φθ (Xi−1 , Xi , Yi ) | Y−∞:k−1 ] ) .

The following result gives an L2 -bound on the difference between ḣk,m,x (θ) and
ḣk,∞ (θ).
Lemma 126. Assume 108, 109, 110, and 124 hold. Then for k ≥ 1,

( Eθ ‖ḣk,m,x (θ) − ḣk,∞ (θ)‖^2 )^{1/2} ≤ 12 ( Eθ [ sup_{x,x′∈X} ‖φθ (x, x′, Y1 )‖^2 ] )^{1/2} ρ^{(k+m)/2−1} / (1 − ρ) .

Proof. The idea of the proof is to match, for each index i of the sums expressing
ḣk,m,x (θ) and ḣk,∞ (θ), pairs of terms that are close. To be more precise, we match
1. The first terms of ḣk,m,x (θ) and ḣk,∞ (θ);
2. For i close to k,

Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x]

and
Eθ [φθ (Xi−1 , Xi , Yi ) | Y−∞:k ] ,
and similarly for the corresponding terms conditioned on Y−m+1:k−1 and
Y−∞:k−1 , respectively;
3. For i far from k,

Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x]

and
Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k−1 , X−m = x] ,
and similarly for the corresponding terms conditioned on Y−∞:k and Y−∞:k−1 ,
respectively.
We start with the second kind of matches (of which the first terms are a special
case). Taking the limit m′ → ∞ in (6.23), we see that

‖ Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x] − Eθ [φθ (Xi−1 , Xi , Yi ) | Y−∞:k ] ‖
    ≤ 2 sup_{x,x′∈X} ‖φθ (x, x′, Yi )‖ ρ^{(i−1)+m} .

This bound remains the same if k is replaced by k − 1. Obviously, it is small if i is
far away from −m, that is, close to k.
For the third kind of matches, we need a total variation bound that works
“backwards in time”. Such a bound reads

‖ Pθ (Xi ∈ · | Y−m+1:k , X−m = x) − Pθ (Xi ∈ · | Y−m+1:k−1 , X−m = x) ‖_TV ≤ ρ^{k−1−i} .

The proof of this bound is similar to that of Proposition 61 and uses the time-
reversed process. We postpone the proof to the end of this section. We may also let
m → ∞ and omit the condition on X−m without affecting the bound. As a result
of these bounds, we have

‖ Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k , X−m = x] − Eθ [φθ (Xi−1 , Xi , Yi ) | Y−m+1:k−1 , X−m = x] ‖
    ≤ 2 sup_{x,x′∈X} ‖φθ (x, x′, Yi )‖ ρ^{k−1−i} ,

with the same bound being valid if the conditioning is on Y−∞:k and Y−∞:k−1 ,
respectively. This bound is small if i is far away from k.
Combining these two kinds of bounds and using Minkowski’s inequality for the
L2 -norm, we find that ( Eθ ‖ḣk,m,x (θ) − ḣk,∞ (θ)‖^2 )^{1/2} is bounded by

2 ρ^{k+m−1} + 2 × 2 Σ_{i=−m+1}^{k−1} ( ρ^{k−i−1} ∧ ρ^{i+m−1} ) + 2 Σ_{i=−∞}^{−m} ρ^{k−i−1}
    ≤ 4 ρ^{k+m−1} / (1 − ρ) + 4 Σ_{−∞<i≤(k−m)/2} ρ^{k−i−1} + 4 Σ_{(k−m)/2≤i<∞} ρ^{i+m−1}
    ≤ 12 ρ^{(k+m)/2−1} / (1 − ρ)

up to the factor ( Eθ sup_{x,x′∈X} ‖φθ (x, x′, Yi )‖^2 )^{1/2} . The proof is complete.

We now establish the “backwards in time” uniform forgetting property, which


played a key role in the above proof.

Proposition 127. Assume 108, 109, and 110 hold. Then for any integers i, k,
and m such that m ≥ 0 and −m < i < k, any x−m ∈ X, y−m+1:k ∈ Yk+m , and
θ ∈ U,

‖Pθ(Xi ∈ · | Y−m+1:k = y−m+1:k , X−m = x−m) − Pθ(Xi ∈ · | Y−m+1:k−1 = y−m+1:k−1 , X−m = x−m)‖TV ≤ ρ^{k−1−i} .

Proof. The cornerstone of the proof is the observation that conditional on Y−m+1:k
and X−m , the time-reversed process X with indices from k down to −m is a non-
homogeneous Markov chain satisfying a uniform mixing condition. We shall indeed
use a slight variant of the backward decomposition developed in Section 2.3.2. For
any j = −m + 1, . . . , k − 1, we thus define the backward kernel (cf. (2.39)) by

Bx−m,j[y−m+1:j](x, f) =
    [ ∫···∫ ∏_{u=−m+1}^{j} q(xu−1, xu) g(xu, yu) λ(dxu) f(xj) q(xj, x) ]
    / [ ∫···∫ ∏_{u=−m+1}^{j} q(xu−1, xu) g(xu, yu) λ(dxu) q(xj, x) ]    (6.24)

for any f ∈ Fb(X). For brevity, we do not indicate the dependence of the quantities involved on θ. We note that the integral in the denominator of this display is bounded from below by (σ−)^{m+j} ∏_{u=−m+1}^{j} ∫ gθ(xu, yu) λ(dxu), and is hence positive Pθ?-a.s. under Assumption 110.
It is trivial that for any x ∈ X,

∫···∫ ∏_{u=−m+1}^{j} q(xu−1, xu) g(xu, yu) λ(dxu) f(xj) q(xj, x)
    = ∫···∫ ∏_{u=−m+1}^{j} q(xu−1, xu) g(xu, yu) λ(dxu) q(xj, x) Bx−m,j[y−m+1:j](x, f) ,

which implies that

Eθ[f(Xj) | Xj+1:k , Y−m+1:k = y−m+1:k , X−m = x] = Bx−m,j[y−m+1:j](Xj+1, f) .

This is the desired Markov property referred to above.


Along the same lines as in the proof of Proposition 64, we can show that the
backward kernels satisfy a Doeblin condition,

(σ−/σ+) νx−m,j[y−m+1:j] ≤ Bx−m,j[y−m+1:j](x, ·) ≤ (σ+/σ−) νx−m,j[y−m+1:j] ,

where for any f ∈ Fb (X),


νx−m,j[y−m+1:j](f) = [ ∫···∫ ∏_{u=−m+1}^{j} qθ(xu−1, xu) gθ(xu, yu) λ(dxu) f(xj) ] / [ ∫···∫ ∏_{u=−m+1}^{j} qθ(xu−1, xu) gθ(xu, yu) λ(dxu) ] .

Thus Lemma 51 shows that the Dobrushin coefficient of each backward kernel is
bounded by ρ = 1 − σ − /σ + .
Finally

Pθ(Xi ∈ · | Y−m+1:k−1 = y−m+1:k−1 , X−m = x−m)
    = ∫ Pθ(Xi ∈ · | Y−m+1:k−1 = y−m+1:k−1 , X−m = x−m , Xk−1 = xk−1)
        × Pθ(Xk−1 ∈ dxk−1 | Y−m+1:k−1 = y−m+1:k−1 , X−m = x−m)

and

Pθ(Xi ∈ · | Y−m+1:k = y−m+1:k , X−m = x−m)
    = ∫ Pθ(Xi ∈ · | Y−m+1:k−1 = y−m+1:k−1 , X−m = x−m , Xk−1 = xk−1)
        × Pθ(Xk−1 ∈ dxk−1 | Y−m+1:k = y−m+1:k , X−m = x−m) ,

so that the two distributions on the left-hand sides can be considered as the result
of running the above-described reversed conditional Markov chain from index k − 1
down to index i, using two different initial conditions. Therefore, by Proposition 48,
they differ by at most ρk−1−i in total variation distance. The proof is complete.

6.5.3 Weak Convergence of the Normalized Score


We now return to the question of a weak limit of the normalized score n^{−1/2} ∑_{k=0}^{n} ḣk,0,x0(θ?). Using Lemma 126 and Minkowski’s inequality, we see that

(Eθ? ‖n^{−1/2} ∑_{k=0}^{n} (ḣk,0,x0(θ?) − ḣk,∞(θ?))‖²)^{1/2}
    ≤ n^{−1/2} ∑_{k=0}^{n} [Eθ? ‖ḣk,0,x0(θ?) − ḣk,∞(θ?)‖²]^{1/2} → 0 as n → ∞ ,

whence the limiting behavior of the normalized score agrees with that of n^{−1/2} ∑_{k=0}^{n} ḣk,∞(θ?).
Now define the filtration F by Fk = σ(Yi, −∞ < i ≤ k) for all integers k. By conditional dominated convergence,

Eθ?[ ∑_{i=−∞}^{k−1} (Eθ?[φθ?(Xi−1, Xi, Yi) | Y−∞:k] − Eθ?[φθ?(Xi−1, Xi, Yi) | Y−∞:k−1]) | Fk−1 ] = 0 ,

and Assumption 124 implies that

Eθ?[φθ?(Xk−1, Xk, Yk) | Y−∞:k−1] = Eθ?[Eθ?[φθ?(Xk−1, Xk, Yk) | Y−∞:k−1 , Xk−1] | Fk−1] = 0 .

It is also immediate that ḣk,∞(θ?) is Fk-measurable. Hence the sequence {ḣk,∞(θ?)}k≥0 is a Pθ?-martingale increment sequence with respect to the filtration {Fk}k≥0 in L²(Pθ?). Moreover, this sequence is stationary because {Yk}−∞<k<∞ is. Any stationary martingale increment sequence in L²(Pθ?) satisfies a CLT (Durrett, 1996, p. 418), that is, n^{−1/2} ∑_{k=0}^{n} ḣk,∞(θ?) → N(0, J(θ?)) Pθ?-weakly, where

J(θ?) = Eθ?[ḣ1,∞(θ?) ḣ1,∞(θ?)^t]    (6.25)

is the limiting Fisher information.


Because the normalized score function has the same limiting behavior, the fol-
lowing result is immediate.
Theorem 128. Under Assumptions 108, 109, 110, and 124,

n^{−1/2} ∇θ ℓx0,n(θ?) → N(0, J(θ?)) Pθ?-weakly

for all x0 ∈ X, where J (θ? ) is the limiting Fisher information as defined above.
We remark that above, we have normalized sums with indices from 0 to n, that
is, with n + 1 terms, by n1/2 rather than by (n + 1)1/2 . This of course does not
affect the asymptotics. However, if J (θ? ) is estimated for the purpose of making a
confidence interval for instance, then one may well normalize it using the number
n + 1 of observed data.

6.5.4 Convergence of the Normalized Observed Information


We shall now very briefly discuss the asymptotics of the observed information matrix, −∇²θ ℓx0,n(θ). To handle this matrix, one can employ the so-called missing
information principle (see Section 5.1.3 and (5.29)). Because the complete infor-
mation matrix, just as the complete score, has a relatively simple form, this principle

allows us to study the asymptotics of the observed information in a fashion similar


to what was done above for the score function. The analysis becomes more difficult
however, as covariance terms, arising from the conditional variance of the complete
score, also need to be accounted for. In addition, we need the convergence to be
uniform in a certain sense. We state the following theorem, whose proof can be
found in Douc et al. (2004).

Theorem 129. Under Assumptions 108, 109, 110, and 124,

lim_{δ→0} lim sup_{n→∞} sup_{|θ−θ?|≤δ} ‖(−n^{−1} ∇²θ ℓx0,n(θ)) − J(θ?)‖ = 0    Pθ?-a.s.

for all x0 ∈ X.

6.5.5 Asymptotics of the Maximum Likelihood Estimator


The general arguments in Section 6.1 and the theorems above prove the following
result.

Theorem 130. Assume 108, 109, 110, 114, and 124, and that θ? is identifiable,
that is, θ is equivalent to θ? only if θ = θ? (possibly up to a permutation of states if
X is finite). Then the following hold true.

(i) The MLE θbn = θbx0 ,n is strongly consistent: θbn → θ? Pθ? -a.s. as n → ∞.

(ii) If the Fisher information matrix J (θ? ) defined above is non-singular and θ? is
an interior point of Θ, then the MLE is asymptotically normal:

n^{1/2}(θbn − θ?) → N(0, J(θ?)^{−1}) Pθ?-weakly as n → ∞

for all x0 ∈ X.

(iii) The normalized observed information at the MLE is a strongly consistent esti-
mator of J (θ? ):

−n^{−1} ∇²θ ℓx0,n(θbn) → J(θ?) Pθ?-a.s. as n → ∞.

As indicated above, the MLE θbn depends on the initial state x0 , but that de-
pendence will generally not be included in the notation.
The last part of the result is important, as it says that confidence intervals or regions and hypothesis tests based on the estimate −(n + 1)^{−1} ∇²θ ℓx0,n(θbn) of J(θ?) will asymptotically be of correct size. In general, there is no closed-form expression for J(θ?), so that it needs to be estimated in one way or another.
The observed information is obviously one way to do that, while another one is to simulate data Y1:N from the HMM, using the MLE, and then computing −(N + 1)^{−1} ∇²θ ℓx0,N(θbn) for this set of simulated data and some x0. An advantage of this approach is that N can be chosen arbitrarily large. Yet another approach, motivated by (6.25), is to estimate the Fisher information by the empirical covariance matrix of the conditional scores of (6.19) at the MLE, that is, by (n + 1)^{−1} ∑_{k=0}^{n} [Sk|k−1(θbn) − S̄(θbn)][Sk|k−1(θbn) − S̄(θbn)]^t with Sk|k−1(θ) = ∫ ∇θ log gθ(x, Yk) φx0,k|k−1[Y0:k−1](dx ; θ) and S̄(θ) = (n + 1)^{−1} ∑_{k=0}^{n} Sk|k−1(θ). This estimate can of course also be computed from simulated data, then using an arbitrary sample size. The conditional scores may be computed as Sk|k−1(θ) = ∇θ ℓx0,k(θ) − ∇θ ℓx0,k−1(θ), where the scores are computed using any of the methods of Section 5.2.3.
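As an illustration, the following sketch shows how the empirical-covariance estimate of J(θ?) could be assembled once the conditional scores are available. It is written in Python under the assumption that a user-supplied score_fn(theta, k) returns the increment ∇θ ℓx0,k(θ) − ∇θ ℓx0,k−1(θ); the function name and its interface are hypothetical, not part of the text.

    import numpy as np

    def estimate_fisher_information(score_fn, theta_hat, n):
        # Conditional scores S_{k|k-1}(theta_hat), k = 0, ..., n, obtained as
        # increments of the score function (see Section 5.2.3).
        S = np.array([score_fn(theta_hat, k) for k in range(n + 1)])
        S_bar = S.mean(axis=0)            # \bar{S}(theta_hat)
        centered = S - S_bar
        # (n+1)^{-1} sum_k [S_k - S_bar][S_k - S_bar]^t
        return centered.T @ centered / (n + 1)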

6.6 Applications to Likelihood-based Tests


The asymptotic properties of the score function and observed information have
immediate implications for the asymptotics of the MLE, as has been described in
previous sections. However, there are also other conclusions that can be drawn from
these convergence results.
One such application is the validity of some classical procedures for testing
whether θ? lies in some subset, Θ0 say, of the parameter space Θ. Suppose that
Θ0 is a (dθ − s)-dimensional subset that may be expressed in terms of constraints
Ri (θ) = 0, i = 1, 2, . . . , s, and that there is an equivalent formulation θi = bi (γ),
i = 1, 2, . . . , dθ , where γ is the “constrained parameter” lying in a subset Γ of Rdθ −s .
We also let γ? be a point such that θ? = b(γ? ). Each function Ri and bi is assumed
to be continuously differentiable and such that the matrices

Cθ = [∂Ri/∂θj]_{s×dθ}    and    Dγ = [∂bi/∂γj]_{dθ×(dθ−s)}

have full rank (s and dθ −s respectively) in a neighborhood of θ? and γ? , respectively.


Perhaps the simplest example is when we want to test a simple (point) null
hypothesis θ? = θ0 versus the alternative θ? 6= θ0 . Then, we take Ri (θ) = θi − θ0i
and bi (γ) = θi0 for i = 1, 2, . . . , dθ . In this case, γ is void as s = dθ and hence
dθ − s = 0. Furthermore, C is the identity matrix and D is void.
Now suppose that we want to test the equality θi = θi0 only for i in a subset K of the dθ coordinates of θ, where K has cardinality s. The constraints we employ are then Ri(θ) = θi − θi0 for i ∈ K; furthermore, γ comprises θi for i ∉ K and, using the dθ − s indices not in K for γ, bi(γ) = θi0 for i ∈ K and bi(γ) = γi otherwise. Again it is easy to check that C and D are constant and of full rank.

Example 131 (Normal HMM). A slightly more involved example concerns the
Gaussian hidden Markov model with finite state space {1, 2, . . . , r} and conditional
distributions Yk |Xk = i ∼ N(µi , σi2 ). Suppose that we want to test for equality of
all of the r component-wise conditional variances σi2 : σ12 = σ22 = . . . = σr2 . Then,
the R-functions are for instance σi2 − σr2 for i = 1, 2, . . . , r − 1. The parameter γ is
obtained by removing from θ all σi2 and then adding a common conditional variance
σ 2 ; those b-functions referring to any of the σi2 evaluate to σ 2 . The matrices C and
D are again constant and of full rank.

A further application, to test the structure of conditional covariance matrices in


a conditionally Gaussian HMM with multivariate output, can be found in Giudici
et al. (2000).
There are many different tests available for testing the null hypothesis θ? ∈ Θ0
versus the alternative θ? ∈ Θ \ Θ0 . One is the generalized likelihood ratio test,
which uses the test statistic

λn = 2 ( sup_{θ∈Θ} ℓx0,n(θ) − sup_{θ∈Θ0} ℓx0,n(θ) ) .

Another one is the Wald test, which uses the test statistic

Wn = n R(θbn)^t [C_{θbn} Jn(θbn)^{−1} C_{θbn}^t]^{−1} R(θbn) ,

where R(θ) is the s × 1 vector of R-functions evaluated at θ, and Jn(θ) = −n^{−1} ∇²θ ℓx0,n(θ) is the observed information evaluated at θ. Yet another test is based on the Rao statistic, defined as

Vn = n^{−1} Sn(θbn0) Jn(θbn0)^{−1} Sn(θbn0)^t ,

where θbn0 is the MLE over Θ0, that is, the point where ℓx0,n(θ) is maximized subject to the constraint Ri(θ) = 0, 1 ≤ i ≤ s, and Sn(θ) = ∇θ ℓx0,n(θ) is the score function
at θ. This test is also known under the names efficient score test and Lagrange
multiplier test. The Wald and Rao test statistics are usually defined using the true
Fisher information J (θ) rather than the observed one, but as J (θ) is generally
infeasible to compute for HMMs, we replace it by the observed counterpart.
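For concreteness, here is a minimal sketch of how the three statistics could be computed numerically. It is Python, and loglik, R, C, Jn, and score are all user-supplied callables standing in for the quantities defined above; their names and interfaces are assumptions, not from the text.

    import numpy as np
    from scipy.stats import chi2

    def lr_stat(loglik, theta_hat, theta_hat0):
        # lambda_n = 2 (sup over Theta of ell minus sup over Theta_0 of ell)
        return 2.0 * (loglik(theta_hat) - loglik(theta_hat0))

    def wald_stat(R, C, Jn, theta_hat, n):
        # W_n = n R(th)^t [C J_n(th)^{-1} C^t]^{-1} R(th)
        r, c = R(theta_hat), C(theta_hat)
        middle = c @ np.linalg.inv(Jn(theta_hat)) @ c.T
        return float(n * r @ np.linalg.solve(middle, r))

    def rao_stat(score, Jn, theta_hat0, n):
        # V_n = n^{-1} S_n(th0) J_n(th0)^{-1} S_n(th0)^t
        s = score(theta_hat0)
        return float(s @ np.linalg.solve(n * Jn(theta_hat0), s))

    # Approximate p-value under H0, with s degrees of freedom:
    # p = chi2.sf(statistic, df=s)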
Statistical theory for i.i.d. data suggests that the likelihood ratio, Wald and
Rao test statistics should all converge weakly to a χ² distribution with s degrees of freedom provided θ? ∈ Θ0 holds true, so that an approximate p-value of the test of this null hypothesis can be computed by evaluating the complementary distribution function of the χ²_s distribution at the point λn, Wn, or Vn, whichever is preferred.
We now state formally that this procedure is indeed correct.
Theorem 132. Assume 108, 109, 110, 114, and 124 as well as the conditions
stated on the functions Ri and bi above. Also assume that θ? is identifiable, that is,
θ is equivalent to θ? only if θ = θ? (possibly up to a permutation of states if X is
finite), that J (θ? ) is non-singular, and that θ? and γ? are interior points of Θ and
Γ, respectively. Then if θ? ∈ Θ0 holds true, each of the test statistics λn, Wn, and Vn converges Pθ?-weakly to the χ²_s distribution as n → ∞.
The proof of this result follows, for instance, Serfling (1980, Section 4.4). The im-
portant observation is that the validity of the proof does not hinge on independence
of the data but on asymptotic properties of the score function and the observed
information, properties that have been established for HMMs in this chapter.
It is important to realize that a key assumption for Theorem 132 to hold is that
θ? is identifiable, so that θbn converges to a unique point θ? . As a result, the theorem
does not apply to the problem of testing the number of components of a finite state
HMM. In the normal HMM for instance, with Yk |Xk = i ∼ N(µi , σi2 ), one can
indeed effectively remove one component by invoking the constraints µ1 −µ2 = 0 and
σ12 − σ22 = 0, say. In this way, within Θ0 , components 1 and 2 collapse into a single
one. However, any θ ∈ Θ0 is then non-identifiable as the transition probabilities q12
and q21 , among others, can be chosen arbitrarily without changing the dynamics of
the model.
Part III

Background and Complements
Chapter 7

Elements of Markov Chain Theory

7.1 Chains on Countable State Spaces


We review the key elements of the mathematical theory developed for studying the
limiting behavior of Markov chains. In this first section, we restrict ourselves to
the case where the state space X is countable, which is conceptually simpler. On
our way, we will also meet a number of important concepts to be used in the next
section when dealing with Markov chains on general state spaces.

7.1.1 Irreducibility
Let {Xk }k≥0 be a Markov chain on a countable state space X with transition matrix
Q. For any x ∈ X, we define the first hitting time σx on x and the return time τx
to x respectively as

σx = inf{n ≥ 0 : Xn = x} , (7.1)
τx = inf{n ≥ 1 : Xn = x} , (7.2)
where, by convention, inf ∅ = +∞. The successive hitting times σx^(n) and return times τx^(n), n ≥ 0, are defined inductively by

σx^(0) = 0,  σx^(1) = σx ,  σx^(n+1) = inf{k > σx^(n) : Xk = x} ,
τx^(0) = 0,  τx^(1) = τx ,  τx^(n+1) = inf{k > τx^(n) : Xk = x} .

For two states x and y, we say that state x leads to state y, which we write
x → y, if Px (σy < ∞) > 0. In words, x leads to y if the state y can be reached
from x. An alternative, equivalent definition is that there exists some integer n ≥ 0
such that the n-step transition probability Qn (x, y) > 0. If both x leads to y and y
leads to x, then we say that x and y communicate, which we write x ↔ y.

Theorem 133. The relation “↔” is an equivalence relation on X.

Proof. We need to prove that the relation ↔ is reflexive, symmetric, and transitive.
The first two properties are immediate because, by definition, for all x, y ∈ X, x ↔ x
(reflexivity), and x ↔ y if and only if y ↔ x (symmetry).
For any pairwise distinct x, y, z ∈ X, {σy + σz ◦ θσy < ∞} ⊂ {σz < ∞} (if the
chain reaches y at some time and later z, it certainly reaches z). The strong Markov


property (Theorem 6) implies that

Px (σz < ∞) ≥ Px (σy + σz ◦ θσy < ∞) = Ex [1{σy <∞} 1{σz <∞} ◦ θσy ]
= Ex [1{σy <∞} PXσy (σz < ∞)] = Px (σy < ∞) Py (σz < ∞) .

In words, if the chain can reach y from x and z from y, it can reach z from x by
going through y. Hence if x → y and y → z, then x → z (transitivity).
For x ∈ X, we denote the equivalence class of x with respect to the relation “↔”
by C(x). Because “↔” is an equivalence relation, there exists a collection {xi } of
states, which may be finite or infinite, such that the classes {C(xi )} form a partition
of the state space X.
Definition 134 (Irreducibility). If C(x) = X for some x ∈ X (and then for all
x ∈ X), the Markov chain is called irreducible.

7.1.2 Recurrence and Transience


When a state is visited by the Markov chain, it is natural to ask how often the state
is visited in the long-run. Define the occupation time of the state x as
ηx = ∑_{n=0}^{∞} 1x(Xn) = ∑_{j=1}^{∞} 1{σx^(j) < ∞} .

If the expected number of visits to x starting from x is finite, that is, if Ex [ηx ] < ∞,
then the state x is called transient. Otherwise, if Ex [ηx ] = ∞, x is said to be
recurrent. When X is countable, the recurrence or transience of a state x can be
expressed in terms of the probability Px (τx < ∞) that the chain started in x ever
returns to x.
Proposition 135. For any x ∈ X the following hold true,
(i) If x is recurrent, then Px (ηx = ∞) = 1 and Px (τx < ∞) = 1.
(ii) If x is transient, then Px (ηx < ∞) = 1 and Px (τx < ∞) < 1.
(iii) Ex [ηx ] = 1/[1 − Px (τx < ∞)], with 1/0 = ∞.
Proof. By construction,

Ex[ηx] = ∑_{k=1}^{∞} Px(ηx ≥ k) = ∑_{k=1}^{∞} Px(σx^(k) < ∞) .

Applying the strong Markov property (Theorem 6) for n > 1, we obtain

Px(σx^(n) < ∞) = Px(σx^(n−1) < ∞, τx ∘ θ_{σx^(n−1)} < ∞) = Ex[1{σx^(n−1) < ∞} P_{X_{σx^(n−1)}}(τx < ∞)] .

If σx^(n−1) < ∞, then X_{σx^(n−1)} = x Px-a.s., so that

Px(σx^(n) < ∞) = Px(τx < ∞) Px(σx^(n−1) < ∞) .


(n)
By definition Px (σx < ∞) = 1, whence Px (σx < ∞) = [Px (τx < ∞)]n−1 and

X
Ex [ηx ] = [Px (τx < ∞)]n−1 .
n=1

This proves part (iii).

Now assume x is recurrent. Then by definition Ex[ηx] = ∞, and hence Px(τx < ∞) = 1 and Px(σx^(n) < ∞) = 1 for all n ≥ 1. Thus ηx = ∞ Px-a.s.
If x is transient then Ex[ηx] < ∞, which implies Px(τx < ∞) < 1.
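As a quick numerical check of part (iii), the following Python sketch estimates E0[η0] by simulation; the three-state chain with an absorbing state is an arbitrary toy example, not from the text. For this chain P0(τ0 < ∞) = 2/3, so the mean occupation time should be close to 1/(1 − 2/3) = 3.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy chain on {0, 1, 2}; state 2 is absorbing, so state 0 is transient
    # with P_0(tau_0 < infinity) = 2/3 and hence E_0[eta_0] = 3.
    Q = np.array([[0.50, 0.50, 0.00],
                  [0.25, 0.25, 0.50],
                  [0.00, 0.00, 1.00]])

    def occupation_time(start, horizon=10_000):
        x, visits = start, 0
        for _ in range(horizon):
            visits += (x == start)   # the visit at time 0 counts as well
            if x == 2:               # absorbed: no further visits possible
                return visits
            x = rng.choice(3, p=Q[x])
        return visits

    print(np.mean([occupation_time(0) for _ in range(20_000)]))  # about 3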
For a recurrent state x, the occupation time of x is infinite with probability one
under Px ; essentially, once the chain started from x returns to x with probability
one, it returns a second time with probability one, and so on. Thus the occupation
time of a state has a remarkable property, not shared by all random variables: if the
expectation of the occupation time is infinite, then the actual number of returns is
infinite with probability one. The mean of the occupation time of a state obeys the
so-called maximum principle.
Proposition 136. For all x and y in X,

Ex [ηy ] = Px (σy < ∞) Ey [ηy ] , (7.3)

with the convention 0 × ∞ = 0.


Proof. It follows from the definition that ηy 1{σy =∞} = 0 and ηy 1{σy <∞} = ηy ◦
θσy 1{σy <∞} . Thus, applying the strong Markov property,

Ex [ηy ] = Ex [1{σy <∞} ηy ] = Ex [1{σy <∞} ηy ◦ θσy ]


= Ex [1{σy <∞} EXσy [ηy ]] = Px (σy < ∞) Ey [ηy ] .

Corollary 137. If Ex[ηy] = ∞ for some x, then y is recurrent. If X is finite, then there exists at least one recurrent state.

Proof. By Proposition 136, Ey[ηy] ≥ Ex[ηy], so that Ex[ηy] = ∞ implies that Ey[ηy] = ∞, that is, y is recurrent.
Next, obviously ∑_{y∈X} ηy = ∞ and thus for all x ∈ X, ∑_{y∈X} Ex[ηy] = ∞. Hence if X is finite, given x ∈ X there necessarily exists at least one y ∈ X such that Ex[ηy] = ∞, which implies that y is recurrent.
Our next result shows that a recurrent state can only lead to another recurrent
state.
Proposition 138. Let x be a recurrent state. Then for y ∈ X, either of the following
two statements holds true.
(i) x leads to y, Ex [ηy ] = ∞, y is recurrent and leads to x, and Px (τy < ∞) =
Py (τx < ∞) = 1;
(ii) x does not lead to y and Ex [ηy ] = 0.
Proof. Assume that x leads to y. Then there exists an integer k such that Q^k(x, y) > 0. Applying the Chapman–Kolmogorov equations, we obtain Q^{n+k}(x, y) ≥ Q^n(x, x) Q^k(x, y) for all n. Hence

Ex[ηy] ≥ ∑_{n=1}^{∞} Q^{n+k}(x, y) ≥ ∑_{n=1}^{∞} Q^n(x, x) Q^k(x, y) = Ex[ηx] Q^k(x, y) = ∞ .

Thus y is also recurrent by Corollary 137. Because x is recurrent, the strong Markov
property implies that

0 = Px (τx = ∞) ≥ Px (τy < ∞, τx = ∞)


= Px (τy < ∞, τx ◦ θτy = ∞) = Px (τy < ∞) Py (τx = ∞) .

Because x leads to y, Px (τy < ∞) > 0, whence Py (τx = ∞) = 0. Thus y leads to x


and moreover Py (τx < ∞) = 1. By symmetry, Px (τy < ∞) = 1.
If x does not lead to y then Proposition 136 shows that Ex [ηy ] = 0.

For a recurrent state x, the equivalence class C(x) (with respect to the relation
of communication defined in Section 7.1.1) may thus be equivalently defined as

C(x) = {y ∈ X : Ex [ηy ] = ∞} = {y ∈ X : Px (τy < ∞) = 1} . (7.4)

If y ∉ C(x), then Px(ηy = 0) = 1, which implies that Px(Xn ∈ C(x) for all
n ≥ 0) = 1. In words, the chain started from the recurrent state x forever stays in
C(x) and visits each state of C(x) infinitely many times.
The behavior of a Markov chain can thus be described as follows. If a chain is
not irreducible, there may exist several equivalence classes of communication. Some
of them contain only transient states, and some contain only recurrent states. The
latter are then called recurrence classes. If a chain starts from a recurrent state,
then it remains in its recurrence class forever. If it starts from a transient state, then
either it stays in the class of transient states forever, which implies that there exist
infinitely many transient states, or it reaches a recurrent state and then remains in
its recurrence class forever.
In contrast, if the chain is irreducible, then all the states are either transient or
recurrent. This is called the solidarity property of an irreducible chain. We now
summarize the previous results.

Theorem 139. Consider an irreducible Markov chain on a countable state space X.


Then every state is either transient, and the chain is called transient, or every state
is recurrent, and the chain is called recurrent. Moreover, either of the following two
statements holds true for all x and y in X.

(i) Px (τy < ∞) = 1, Ex [ηy ] = ∞ and the chain is recurrent.

(ii) Px (τx < ∞) < 1, Ex [ηy ] < ∞ and the chain is transient.

Remark 140. Note that in the transient case, we do not necessarily have Px (τy <
∞) < 1 for all x and y in X. For instance, if Q is a transition matrix on N such
that Q(n, n + 1) = 1 for all n, then Pk (τn < ∞) = 1 for all k < n. Nevertheless all
states are obviously transient because Xn = X0 + n.

7.1.3 Invariant Measures and Stationarity


For many purposes, we might want the marginal distribution of {Xk } not to depend
on k. If this is the case, then by the Markov property it follows that the finite-
dimensional distributions of {Xk } are invariant under translation in time, and {Xk }
is thus a stationary process. Such considerations lead us to invariant distributions.
A non-negative vector {π(x)}x∈X with the property
π(y) = ∑_{x∈X} π(x) Q(x, y) ,    y ∈ X ,

will be called invariant. If the invariant vector π is summable, then we assume it


is a probability distribution, that is, it sums to one. Such distributions are also
called stationary distributions or stationary probability measures. The key result
concerning the existence of invariant vectors is the following.

Theorem 141. Consider an irreducible recurrent Markov chain {Xk}k≥0 on a countable state space X. Then there exists a unique (up to a scaling factor) invariant measure π. Moreover 0 < π(x) < ∞ for all x ∈ X. This measure is summable if
and only if there exists a state x such that
Ex [τx ] < ∞ . (7.5)
In this case, Ey [τy ] < ∞ for all y ∈ X and the unique invariant probability measure
is given by
π(x) = 1/ Ex [τx ] , x∈X. (7.6)
Proof. Let Q be the transition matrix of the chain. Pick an arbitrary state x ∈ X
and define the measure λx by
"τ −1 # "τ #
x x

1y (Xk ) = Ex 1y (Xk ) .
X X
λx (y) = Ex (7.7)
k=0 k=1

That is, λx (y) is the expected number of visits to the state y before the first return
to x, given that the chain starts in x. Let f be a non-negative function on X. Then
"τ −1 # ∞
x

Ex 1{τx >k} f (Xk ) .


X X  
λx (f ) = Ex f (Xk ) =
k=0 k=0

Using this identity and the fact that Qf(Xk) = Ex[f(Xk+1) | FkX] Px-a.s. for all k ≥ 0, we find that

λx(Qf) = ∑_{k=0}^{∞} Ex[1{τx > k} Qf(Xk)] = ∑_{k=0}^{∞} Ex{1{τx > k} Ex[f(Xk+1) | FkX]}
    = ∑_{k=0}^{∞} Ex[1{τx > k} f(Xk+1)] = Ex[ ∑_{k=1}^{τx} f(Xk) ] ,

the third equality holding because {τx > k} is FkX-measurable,

showing that λx(Qf) = λx(f) − f(x) + Ex[f(Xτx)] = λx(f). Because f was arbitrary, we see that λxQ = λx; the measure λx is invariant. For any other state y, the chain may reach y before returning to x when starting in x, as it is irreducible. This proves that λx(y) > 0. Moreover, again by irreducibility, we can pick an m > 0 such that Q^m(y, x) > 0. By invariance λx(x) = ∑_{z∈X} λx(z) Q^m(z, x) ≥ λx(y) Q^m(y, x), and as λx(x) = 1, we see that λx(y) < ∞.
We now prove that the invariant measure is unique up to a scaling factor. The
first step consists in proving that if π is an invariant measure such that π(x) = 1,
then π ≥ λx . It suffices to show that, for any y ∈ X and any integer n,
π(y) ≥ ∑_{k=1}^{n} Ex[1y(Xk) 1{τx ≥ k}] .    (7.8)

The proof is by induction. The inequality is immediate for n = 1. Assume that (7.8) holds for some n ≥ 1. Then

π(y) = Q(x, y) + ∑_{z≠x} π(z) Q(z, y)
    ≥ Q(x, y) + ∑_{k=1}^{n} Ex[Q(Xk, y) 1{Xk ≠ x} 1{τx ≥ k}]
    ≥ Q(x, y) + ∑_{k=1}^{n} Ex[1y(Xk+1) 1{τx ≥ k+1}]
    = ∑_{k=1}^{n+1} Ex[1y(Xk) 1{τx ≥ k}] ,

showing the induction. We will now show that π = λx. The proof is by contradiction. Assume that π(z) > λx(z) for some z ∈ X. By irreducibility we can pick an m such that Q^m(z, x) > 0, and then

1 = π(x) = πQ^m(x) = ∑_{z′∈X} π(z′) Q^m(z′, x) > ∑_{z′∈X} λx(z′) Q^m(z′, x) = λx(x) = 1 ,

which cannot be true.


The measure λx is summable if and only if

∞ > ∑_{y∈X} λx(y) = Ex[ ∑_{y∈X} ∑_{k=0}^{τx−1} 1{Xk = y} ] = Ex[τx] .

Thus the unique invariant measure is summable if and only if a state x satisfying
this relation exists. On the other hand, if such a state x exists then, by uniqueness
of the invariant measure, Ey [τy ] < ∞ must hold for all states y. In this case,
the invariant probability measure, π say, satisfies π(x) = λx (x)/λx (X) = 1/ Ex [τx ].
Because the reference state x was in fact arbitrary, we find that π(y) = 1/Ey[τy] for all states y.
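The following Python sketch checks formula (7.6) numerically on an arbitrary irreducible three-state chain (the matrix is an illustrative choice): the stationary distribution computed by linear algebra is compared with the reciprocals of simulated mean return times.

    import numpy as np

    rng = np.random.default_rng(1)
    Q = np.array([[0.1, 0.6, 0.3],    # an arbitrary irreducible chain
                  [0.4, 0.2, 0.4],
                  [0.5, 0.3, 0.2]])

    # Invariant probability: left eigenvector of Q for eigenvalue 1.
    w, v = np.linalg.eig(Q.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()

    def mean_return_time(x, n_rep=50_000):
        total = 0
        for _ in range(n_rep):
            y, t = x, 0
            while True:
                y = rng.choice(3, p=Q[y])
                t += 1
                if y == x:
                    break
            total += t
        return total / n_rep

    print(pi)
    print([1.0 / mean_return_time(x) for x in range(3)])  # matches (7.6)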
It is natural to ask what can be inferred from the knowledge that a chain pos-
sesses an invariant probability measure. The next proposition gives a partial answer.
Proposition 142. Let Q be a transition matrix and π an invariant probability
measure. Then every state x such that π(x) > 0 is recurrent. If Q is irreducible,
then it is recurrent.
Proof. Let y ∈ X. If π(y) > 0 then ∑_{n=0}^{∞} πQ^n(y) = ∑_{n=0}^{∞} π(y) = ∞. On the other hand, by Proposition 136,

∑_{n=0}^{∞} πQ^n(y) = ∑_{x∈X} π(x) ∑_{n=0}^{∞} Q^n(x, y)
    = ∑_{x∈X} π(x) Ex[ηy] ≤ Ey[ηy] ∑_{x∈X} π(x) = Ey[ηy] .    (7.9)

Thus π(y) > 0 implies Ey [ηy ] = ∞, that is, y is recurrent.


Let {Xk } be an irreducible Markov chain. If there exists an invariant probability
measure, the chain is called positive recurrent; otherwise it is called null. Note that
null chains can be either null recurrent or transient. Transient chains are always
null, though they may admit an invariant measure.

7.1.4 Ergodicity
A key result for positive recurrent irreducible chains is that the transition laws
converge, in a suitable sense, to the invariant vector π. The classical result is the
following.
Proposition 143. Consider an irreducible and positive recurrent Markov chain on
a countable state space. Then for any states x and y,
n^{−1} ∑_{i=1}^{n} Q^i(x, y) → π(y) as n → ∞ .    (7.10)

The use of the Cesàro limit can be avoided if the chain is aperiodic. The simplest definition of aperiodicity is that a state x is aperiodic if Q^k(x, x) > 0 for all k sufficiently large or, equivalently, that the period of the state x is one. The period of x is defined as the greatest common divisor of the set I(x) = {n > 0 : Q^n(x, x) > 0}.
For irreducible chains, the following result holds true.

Proposition 144. If the chain is irreducible, then all states have the same period.
If the transition matrix Q is irreducible and aperiodic, then for all x and y in X,
there exists n(x, y) ∈ N such that Qk (x, y) > 0 for all k ≥ n(x, y).
Thus, an irreducible chain can be said to be aperiodic if the common period of
all states is one.
The traditional pointwise convergence (7.10) of transition probabilities has been
replaced in more recent research by convergence in total variation (see Defini-
tion 39). The convergence result may then be formulated as follows.
Theorem 145. Consider an irreducible and aperiodic positive recurrent Markov
chain on a countable state space X with transition matrix Q and invariant probability
distribution π. Then for all initial distributions ξ and ξ′ on X,

‖ξQ^n − ξ′Q^n‖TV → 0 as n → ∞ .    (7.11)

In particular, for any x ∈ X we may set ξ = δx and ξ′ = π to obtain

‖Q^n(x, ·) − π‖TV → 0 as n → ∞ .    (7.12)

The proof of this result, and indeed the focus on convergence in total variation, rests on the coupling technique. We postpone the presentation of this technique to Section 7.2.4 because essentially the same ideas can be applied to Markov chains on general state spaces.
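A small numerical illustration of (7.12): for an arbitrary irreducible aperiodic kernel on three states (chosen here purely for the example), the total variation distance between Q^n(x, ·) and π decreases geometrically; the total-mass convention for the distance is assumed to match Definition 39.

    import numpy as np

    Q = np.array([[0.2, 0.5, 0.3],    # arbitrary irreducible aperiodic kernel
                  [0.3, 0.4, 0.3],
                  [0.6, 0.1, 0.3]])

    w, v = np.linalg.eig(Q.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()

    Qn = np.eye(3)
    for n in range(1, 11):
        Qn = Qn @ Q
        tv = np.abs(Qn[0] - pi).sum()   # ||Q^n(0, .) - pi||_TV, total mass
        print(n, tv)                    # decays geometrically to zero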

7.2 Chains on General State Spaces


In this section, we extend the concepts and results pertaining to countable state
spaces to general ones. In the following, X is an arbitrary set, and we just require
that it is equipped with a countably generated σ-field X . By {Xk }k≥0 we denote
an X-valued Markov chain with transition kernel Q. It is defined on a probability
space (Ω, F, P), and FX = {FkX }k≥0 denotes the natural filtration of {Xk }.
For any set A ∈ X , we define the first hitting time σA and return time τA
respectively by

σA = inf{n ≥ 0 : Xn ∈ A} , (7.13)
τA = inf{n ≥ 1 : Xn ∈ A} , (7.14)
where, by convention, inf ∅ = +∞. The successive hitting times σA^(n) and return times τA^(n), n ≥ 0, are defined inductively by

σA^(0) = 0,  σA^(1) = σA ,  σA^(n+1) = inf{k > σA^(n) : Xk ∈ A} ,
τA^(0) = 0,  τA^(1) = τA ,  τA^(n+1) = inf{k > τA^(n) : Xk ∈ A} .

We again define the occupation time ηA as the number of visits by {Xk} to A,

ηA = ∑_{k=0}^{∞} 1A(Xk) .    (7.15)

7.2.1 Irreducibility
The first step to develop a theory on general state spaces is to define a suitable
concept of irreducibility. The definition of irreducibility adopted for countable state
spaces does not extend to general ones, as the probability of reaching a single point x in the state space is typically zero.

Definition 146 (Phi-irreducibility). The transition kernel Q, or the Markov chain


{Xk }k≥0 with transition kernel Q, is said to be phi-irreducible if there exists a
measure φ on (X, X ) such that for any A ∈ X with φ(A) > 0, Px (τA < ∞) > 0 for
all x ∈ X. Such a measure is called an irreducibility measure for Q.
Phi-irreducibility is a weaker property than irreducibility of a transition kernel
on a countable state space. If a transition kernel on a countable state space is
irreducible, then it is phi-irreducible, and any measure is an irreducibility measure.
The converse is not true. For instance, the transition kernel

Q = ( 0  1
      0  1 )

on {0, 1} is phi-irreducible (δ1 is an irreducibility measure for Q) but not irreducible.


In general, there are infinitely many irreducibility measures, and two irreducibil-
ity measures are not necessarily equivalent. For instance, if φ is an irreducibility
measure and φ̂ is absolutely continuous with respect to φ, then φ̂ is also an irre-
ducibility measure. Nevertheless, as shown in the next result, there exist maximal
irreducibility measures ψ, which are such that any irreducibility measure φ is abso-
lutely continuous with respect to ψ.
Theorem 147. Let Q be a phi-irreducible transition kernel on (X, X ). Then there
exists an irreducibility measure ψ such that all irreducibility measures are absolutely
continuous with respect to ψ and for all A ∈ X ,

ψ(A) > 0 ⇔ Px (τA < ∞) > 0 for all x ∈ X . (7.16)

Proof. Let φ be an irreducibility measure and ε ∈ (0, 1). Let φε be the measure defined by φε = φKε, where Kε is the resolvent kernel defined by

Kε(x, A) = (1 − ε) ∑_{k≥0} ε^k Q^k(x, A) ,    x ∈ X, A ∈ X .    (7.17)

We will first show that φε is an irreducibility measure. Let A ∈ X be such that φε(A) > 0 and define

Ā = {x ∈ X : Px(σA < ∞) > 0} = {x ∈ X : Kε(x, A) > 0} .    (7.18)

By definition, φε(A) > 0 implies that φ(Ā) > 0. Define Ām = {x ∈ X : Px(σA < ∞) ≥ 1/m}. By construction, Ā = ∪_{m>0} Ām, and because φ(Ā) > 0, there exists m such that φ(Ām) > 0. Because φ is an irreducibility measure, Px(τĀm < ∞) > 0 for all x ∈ X. Hence by the strong Markov property, for all x ∈ X,

Px(τA < ∞) ≥ Px(τĀm + σA ∘ θτĀm < ∞ , τĀm < ∞)
    = Ex[1{τĀm < ∞} P_{XτĀm}(σA < ∞)] ≥ (1/m) Px(τĀm < ∞) > 0 ,

showing that φε is an irreducibility measure.
Now for m ≥ 0 the Chapman–Kolmogorov equations imply

∫_X φε(dx) ε^m Q^m(x, A) = (1 − ε) ∫_X ∑_{n=m}^{∞} ε^n Q^n(x, A) φ(dx) ≤ φε(A) .

Therefore, if φε(A) = 0 then φεKε(A) = 0, which in turn implies φε(Ā) = 0. Summarizing the results above, for any A ∈ X ,

φε(A) > 0 ⇔ φε({x ∈ X : Px(σA < ∞) > 0}) > 0 .    (7.19)

This proves (7.16).
To conclude we must show that all irreducibility measures are absolutely continuous with respect to φε. Let φ̂ be an irreducibility measure and let C ∈ X be such that φ̂(C) > 0. Then φε({x ∈ X : Px(σC < ∞) > 0}) = φε(X) > 0, which, by (7.19), implies that φε(C) > 0. This exactly says that φ̂ is absolutely continuous with respect to φε.

A set A ∈ X is said to be accessible for the kernel Q (or Q-accessible, or simply


accessible if there is no risk of confusion) if Px (τA < ∞) > 0 for all x ∈ X. The
family of accessible sets is denoted X + . If ψ is a maximal irreducibility measure
the set A is accessible if and only if ψ(A) > 0.

Example 148 (Autoregressive Model). The first-order autoregressive model on R


is defined iteratively by Xn = φXn−1 + Un , where φ is a real number and {Un } is an
i.i.d. sequence. If Γ is the probability distribution of the noise sequence {Un }, the
transition kernel of this chain is given by Q(x, A) = Γ(A − φx). The autoregressive
model is phi-irreducible provided that the noise distribution has an everywhere
positive density with respect to Lebesgue measure λLeb. Taking λLeb itself as the candidate irreducibility measure, it is easy to see that whenever λLeb(A) > 0, we have Γ(A − φx) > 0 for any x, and so Q(x, A) > 0 in just one step.

Example 149 (Metropolis-Hastings Algorithm). For simplicity, we assume here that X = Rd, which we equip with the Borel σ-field X = B(Rd). Assume that we are given a probability density function π on X with respect to Lebesgue measure λLeb. Let r be a transition density kernel. Starting from Xn = x, a candidate transition x′ is generated from r(x, ·) and accepted with probability

α(x, x′) = [π(x′) r(x′, x)] / [π(x) r(x, x′)] ∧ 1 .    (7.20)

The transition kernel of the Metropolis-Hastings chain is given by

Q(x, A) = ∫_A α(x, x′) r(x, x′) λLeb(dx′) + 1x(A) ∫ [1 − α(x, x′)] r(x, x′) λLeb(dx′) .    (7.21)

There are various sufficient conditions for the Metropolis-Hastings algorithm to be


phi-irreducible (Roberts and Tweedie, 1996; Mengersen and Tweedie, 1996). For
the Metropolis-Hastings chain, it is simple to check that the chain is phi-irreducible if for λLeb-almost all x′ ∈ X, the condition π(x′) > 0 implies that r(x, x′) > 0 for any x ∈ X.
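Here is a minimal sketch of the corresponding transition in Python; log_pi, r_sample, and r_density are user-supplied stand-ins for π and r, and the target and proposal at the end are illustrative assumptions, not part of the text.

    import numpy as np

    rng = np.random.default_rng(3)

    def mh_step(x, log_pi, r_sample, r_density):
        # One Metropolis-Hastings transition from x, cf. (7.20)-(7.21).
        x_prop = r_sample(x)
        log_alpha = min(0.0, log_pi(x_prop) - log_pi(x)
                        + np.log(r_density(x_prop, x))
                        - np.log(r_density(x, x_prop)))
        return x_prop if np.log(rng.uniform()) < log_alpha else x

    # Illustration: standard normal target, Gaussian random-walk proposal;
    # both are positive everywhere, so the chain is phi-irreducible.
    log_pi = lambda x: -0.5 * x**2
    r_sample = lambda x: x + rng.normal()
    r_density = lambda x, y: np.exp(-0.5 * (y - x) ** 2)
    x = 10.0
    for _ in range(1_000):
        x = mh_step(x, log_pi, r_sample, r_density)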

7.2.2 Recurrence and Transience


In view of the discussion above, it is not sensible to define recurrence and transience
in terms of the expectation of the occupation measure of a state, but for phi-
irreducible chains it makes sense to consider the occupation measure of accessible
sets.

Definition 150 (Uniform Transience and Recurrence). A set A ∈ X is called


uniformly transient if supx∈A Ex [ηA ] < ∞. A set A ∈ X is called recurrent if
Ex [ηA ] = +∞ for all x ∈ A.

Obviously, if supx∈X Ex[ηA] < ∞, then A is uniformly transient. In fact the reverse implication holds true too, because if the chain is started outside A it cannot hit A more times, on average, than if it is started at “the most favorable location” in A. Thus an alternative definition of a uniformly transient set is supx∈X Ex[ηA] < ∞.
The main result on phi-irreducible transition kernels is the following recur-
rence/transience dichotomy, which parallels Theorem 139 for countable state-space
Markov chains.

Theorem 151. Let Q be a phi-irreducible transition kernel (or Markov chain).


Then either of the following two statements holds true.

(i) Every accessible set is recurrent, in which case we call Q recurrent.

(ii) There is a countable cover of X with uniformly transient sets, in which case we
call Q transient.

In the next section, we will prove Theorem 151 in the particular case where the
chain possesses an accessible atom (see Definition 152); the proof is then very similar
to that for countable state space. In the general case, the proof is more involved. It
is necessary to introduce small sets and the so-called splitting construction, which
relates the chain to one that does possess an accessible atom.

Transience and Recurrence for Chains Possessing an Accessible Atom


Definition 152 (Atom). A set α ∈ X is called an atom if there exists a probability
measure ν on (X, X ) such that Q(x, A) = ν(A) for all x ∈ α and A ∈ X .

Atoms behave the same way as do individual states in the countable state space
case. Although any singleton {x} is an atom, it is not necessarily accessible, so that
Markov chain theory on general state spaces differs from the theory of countable
state space chains.
If α is an atom for Q, then for any m ≥ 1 it is an atom for Qm . Therefore we
denote by Qm (α, ·) the common value of Qm (x, ·) for all x ∈ α. This implies that
if the chain starts from within the atom, the distribution of the whole chain does
not depend on the precise starting point. Therefore we will also use the notation
Pα instead of Px for any x ∈ α.

Example 153 (Random Walk on the Half-Line). The random walk on the half-line
(RWHL) is defined by an initial condition X0 ≥ 0 and the recursion

Xk+1 = (Xk + Wk+1 )+ , k≥0, (7.22)

where {Wk }k≥1 is an i.i.d. sequence of random variables, independent of X0 , with


distribution function Γ on R. This process is a Markov chain with transition kernel
Q defined by

Q(x, A) = Γ(A − x) + Γ((−∞ , −x])1A (0) , x ∈ R+ , A ∈ B(R+ ) ,

where A − x = {y − x : y ∈ A}. The set {0} is an atom, and it is accessible if and


only if Γ((−∞ , 0]) > 0.
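The following Python sketch simulates the RWHL; the increment law N(−0.5, 1) is an arbitrary choice with Γ((−∞ , 0]) > 0, so the atom {0} is accessible and is in fact visited repeatedly.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    x = np.empty(n)
    x[0] = 5.0
    for k in range(1, n):
        # W_k ~ N(-0.5, 1): negative-mean increments, Gamma((-inf,0]) > 0.
        x[k] = max(x[k - 1] + rng.normal(-0.5, 1.0), 0.0)

    print(np.sum(x == 0.0))   # number of visits to the atom {0}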

We now prove Theorem 151 when there exists an accessible atom.

Proposition 154. Let {Xk }k≥0 be a Markov chain that possesses an accessible
atom α, with associated probability measure ν. Then the chain is phi-irreducible, ν
is an irreducibility measure, and a set A ∈ X is accessible if and only if Pα (τA <
∞) > 0.
Moreover, α is recurrent if and only if Pα (τα < ∞) = 1 and (uniformly) tran-
sient otherwise, and the chain is recurrent if α is recurrent and transient otherwise.

Proof. For all A ∈ X and x ∈ X, the strong Markov property yields

Px (τA < ∞) ≥ Px (τα + τA ◦ θτα < ∞, τα < ∞)


= Ex [PXτα (τA < ∞)1{τα <∞} ]
= Pα (τA < ∞) Px (τα < ∞)
≥ ν(A) Px (τα < ∞) .

Because α is accessible, Px (τα < ∞) > 0 for all x ∈ X. Thus for any A ∈ X
satisfying ν(A) > 0, it holds that Px (τA < ∞) > 0 for all x ∈ X, showing that ν is
an irreducibility measure. The above display also shows that A is accessible if and only if Pα(τA < ∞) > 0.
Now let σα^(n) be the successive hitting times of α (see (7.13)). The strong Markov property implies that for any n > 1,

Pα(σα^(n) < ∞) = Pα(τα < ∞) Pα(σα^(n−1) < ∞) .

Hence, as for discrete state spaces, Pα(σα^(n) < ∞) = [Pα(τα < ∞)]^{n−1} and Eα[ηα] = 1/[1 − Pα(τα < ∞)]. This proves that α is recurrent if and only if Pα(τα < ∞) = 1.
Assume that α is recurrent. Because the atom α is accessible, for any x ∈ X, there exists r such that Q^r(x, α) > 0. If A ∈ X⁺ there exists s such that Q^s(α, A) > 0. By the Chapman–Kolmogorov equations,

∑_{n≥1} Q^{r+s+n}(x, A) ≥ Q^r(x, α) ( ∑_{n≥1} Q^n(α, α) ) Q^s(α, A) = ∞ .

Hence Ex[ηA] = ∞ for all x ∈ X and A is recurrent. Because A was an arbitrary accessible set, the chain is recurrent.
Assume now that α is transient, in which case Eα (ηα ) < ∞. Then, following the
same line of reasoning as in the discrete state space case (proof of Proposition 136),
we obtain that for all x ∈ X,

Ex [ηα ] = Px (τα < ∞) Eα [ηα ] ≤ Eα [ηα ] . (7.23)


Define Bj = {x : ∑_{n=1}^{j} Q^n(x, α) ≥ 1/j}. Then ∪_{j=1}^{∞} Bj = X because α is accessible. Applying the definition of the sets Bj and the Chapman–Kolmogorov equations, we find that

∑_{k=1}^{∞} Q^k(x, Bj) ≤ ∑_{k=1}^{∞} Q^k(x, Bj) inf_{y∈Bj} j ∑_{ℓ=1}^{j} Q^ℓ(y, α)
    ≤ j ∑_{k=1}^{∞} ∑_{ℓ=1}^{j} ∫_{Bj} Q^k(x, dy) Q^ℓ(y, α) ≤ j² ∑_{k=1}^{∞} Q^k(x, α) = j² Ex[ηα] < ∞ .

The sets Bj are thus uniformly transient. The proof is complete.

Small Sets and the Splitting Construction


We now return to the general phi-irreducible case. In order to prove Theorem 151,
we need to introduce the splitting technique. To do so, we need to define a class
of sets (containing accessible sets) that behave the same way in many respects as
do atoms. We shall see this in many of the results below, which exactly mimic the
atomic case results they generalize. These sets are called small sets.

Definition 155 (Small Set). Let Q and ν be a transition kernel and a probability measure, respectively, on (X, X ), let m be a positive integer and ε ∈ (0, 1]. A set C ∈ X is called an (m, ε, ν)-small set for Q, or simply a small set, if ν(C) > 0 and for all x ∈ C and A ∈ X ,

Q^m(x, A) ≥ εν(A) .

If ε = 1 then C is an atom for the kernel Q^m.
Trivially, any individual point is a small set, but small sets that are not accessible
are of limited interest. If the state space is countable and Q is irreducible, then
every finite set is small. The minorization measure associated to an accessible small
set provides an irreducibility measure.
Proposition 156. Let C be an accessible (m, ε, ν)-small set for the transition kernel Q on (X, X ). Then ν is an irreducibility measure.
Proof. Let A ∈ X be such that ν(A) > 0. The strong Markov property yields

Px (τA < ∞) ≥ Px (τC < ∞, τA ◦ θτC < ∞) = Ex [1{τC <∞} PXτC (τA < ∞)] .

Because C is a small set, for all y ∈ C it holds that

Py(τA < ∞) ≥ Py(Xm ∈ A) = Q^m(y, A) ≥ εν(A) .

Because C is accessible and ν(A) > 0, for all x ∈ X it holds that

Px(τA < ∞) ≥ εν(A) Px(τC < ∞) > 0 .

Thus A is accessible, whence ν is an irreducibility measure.


An important result due to Jain and Jamison (1967) states that if the transition
kernel is phi-irreducible, then small sets do exist. For a proof see Nummelin (1984,
p. 16) or Meyn and Tweedie (1993, Theorem 5.2.2).
Proposition 157. If the transition kernel Q on (X, X ) is phi-irreducible, then every
accessible set contains an accessible small set.
Given the existence of just one small set from Proposition 157, we may show that
it is possible to cover X with a countable number of small sets in the phi-irreducible
case.
Proposition 158. Let Q be a phi-irreducible transition kernel on (X, X ).
(i) If C ∈ X is an (m, , ν)-small set and for any x ∈ D we have Qn (x, C) ≥ δ,
then D is (m + n, δ, ν)-small set.
(ii) If Q is phi-irreducible
S then there exists a countable collection of small sets Ci
such that X = i Ci .
Proof. Using the Chapman–Kolmogorov equations, we find that for any x ∈ D,

Q^{n+m}(x, A) ≥ ∫_C Q^n(x, dy) Q^m(y, A) ≥ Q^n(x, C) εν(A) ≥ εδν(A) ,

showing part (i). Because Q is phi-irreducible, by Proposition 157 there exists an accessible (m, ε, ν)-small set C. Moreover, by the definition of phi-irreducibility, the sets C(n, m) = {x : Q^n(x, C) ≥ 1/m} cover X and, by part (i), each C(n, m) is small.
Proposition 159. If Q is phi-irreducible and transient, then every accessible small
set is uniformly transient.

Proof. Let C be an accessible (m, ε, ν)-small set. If Q is transient, there exists at least one A ∈ X⁺ that is uniformly transient. For δ ∈ (0, 1), by the Chapman–Kolmogorov equations,

Ex[ηA] = ∑_{k=0}^{∞} Q^k(x, A) ≥ (1 − δ) ∑_{p=0}^{∞} ∑_{k=0}^{∞} δ^p Q^{k+m+p}(x, A)
    ≥ (1 − δ) ∑_{p=0}^{∞} ∑_{k=0}^{∞} δ^p ∫_C Q^k(x, dx′) ∫ Q^m(x′, dx″) Q^p(x″, A)
    ≥ ε ∑_{k=0}^{∞} Q^k(x, C) × (1 − δ) ∑_{p=0}^{∞} δ^p νQ^p(A) = ε Ex[ηC] νKδ(A) ,

where Kδ is the resolvent kernel (7.17). Because C is an accessible small set,


Proposition 156 shows that ν is an irreducibility measure. By Theorem 147, νKδ is a
maximal irreducibility measure, so that νKδ (A) > 0. Thus supx∈X Ex [ηC ] < ∞ and
we conclude that C is uniformly transient (see the remark following Definition 150).

Example 160 (Autoregressive Process, Continued). Suppose that the noise distribution in Example 148 has an everywhere positive continuous density γ with respect to Lebesgue measure λLeb. If C = [−M, M] and ε = inf_{|u|≤(1+|φ|)M} γ(u), then for A ⊆ C,

Q(x, A) = ∫_A γ(x′ − φx) dx′ ≥ ελLeb(A) .
Hence the compact set C is small. Obviously R is covered by a countable collection
of small sets and every accessible set (here sets with non-zero Lebesgue measure)
contains a small set.
Example 161 (Metropolis-Hastings Algorithm, Continued). Similar results hold for the Metropolis-Hastings algorithm of Example 149 if π(x) and r(x, x′) are positive and continuous for all (x, x′) ∈ X × X. Suppose that C is compact with λLeb(C) > 0. By positivity and continuity, we then have d = sup_{x∈C} π(x) < ∞ and ε = inf_{(x,x′)∈C×C} r(x, x′) > 0. For any A ⊆ C, define

Rx(A) = { x′ ∈ A : π(x′) r(x′, x) / [π(x) r(x, x′)] < 1 } ,

the region of possible rejection. Then for any x ∈ C,

Q(x, A) ≥ ∫_A r(x, x′) α(x, x′) dx′
    ≥ ∫_{Rx(A)} [π(x′) r(x′, x) / π(x)] dx′ + ∫_{A\Rx(A)} r(x, x′) dx′
    ≥ (ε/d) ∫_{Rx(A)} π(x′) dx′ + (ε/d) ∫_{A\Rx(A)} π(x′) dx′
    = (ε/d) ∫_A π(x′) dx′ .
Thus C is small and, again, X can be covered by a countable collection of small
sets.
We now show that it is possible to define a Markov chain with an atom, the
so-called split chain, whose properties are directly related to those of the original
chain. This technique was introduced by Nummelin (1978) (Athreya and Ney, 1978, introduced, independently, a virtually identical concept) and allows extending results valid for Markov chains possessing an accessible atom to irreducible Markov chains that only possess small sets. The basic idea is as follows. Suppose the chain admits a (1, ε, ν)-small set C. Then as long as the chain does not enter C, the transition kernel Q is used to generate the trajectory. However, as soon as the chain hits C, say Xn ∈ C, a zero-one random variable dn is drawn, independent of everything else. The probability that dn = 1 is ε, and hence dn = 0 with probability 1 − ε. Then if dn = 1, the next value Xn+1 is drawn from ν; otherwise Xn+1 is drawn from the kernel

R(x, A) = [1 − ε1C(x)]^{−1} [Q(x, A) − ε1C(x)ν(A)] ,

with x = Xn. It is immediate that εν(A) + (1 − ε)R(x, A) = Q(x, A) for all x ∈ C, so Xn+1 is indeed drawn from the correct (conditional) distribution. Note also that R(x, ·) = Q(x, ·) for x ∉ C. So, what is gained by this approach? What is gained is that whenever Xn ∈ C and dn = 1, the next value of the chain will be independent of Xn (because it is drawn from ν). This is often called a regeneration time, as the joint chain {(Xk, dk)} in a sense “restarts” and forgets its history. In technical terms, the state C × {1} in the extended state space is an atom, and it will be accessible provided C is.
We now make this formal. Thus we define the so-called extended state space
as X̌ = X × {0, 1} and let X̌ be the associated product σ-field. We associate to
every measure µ on (X, X ) the split measure µ? on (X̌, X̌ ) as the unique measure
satisfying, for A ∈ X ,

µ? (A × {0}) = (1 − )µ(A ∩ C) + µ(A ∩ C c ) ,


µ? (A × {1}) = µ(A ∩ C) .

If Q is a transition kernel on (X, X ), we define the kernel Q? on X× X̌ by Q? (x, Ǎ) =


[Q(x, ·)]? (Ǎ) for x ∈ X and Ǎ ∈ X̌ .
Assume now that Q is a phi-irreducible transition kernel and let C be a (1, ε, ν)-small set. We define the split transition kernel Q̌ on X̌ × X̌ as follows. For any x ∈ X and Ǎ ∈ X̌ ,

Q̌((x, 0), Ǎ) = R?(x, Ǎ) ,    (7.24)
Q̌((x, 1), Ǎ) = ν?(Ǎ) .    (7.25)

Examining the above technicalities, we find that transitions into Cᶜ × {1} have zero probability from everywhere, so that dn = 1 can only occur if Xn ∈ C. Because dn = 1 indicates a regeneration time, from within C, this is logical. Likewise we find that given a transition to some y ∈ C, the conditional probability that dn = 1 is ε, wherever the transition took place from. Thus the above split transition kernel corresponds to the following simulation scheme for {(Xk, dk)}. Assume (Xk, dk) are given. If Xk ∉ C, then draw Xk+1 from Q(Xk, ·). If Xk ∈ C and dk = 1, then draw Xk+1 from ν, otherwise from R(Xk, ·). If the realized Xk+1 is not in C, then set dk+1 = 0; if Xk+1 is in C, then set dk+1 = 1 with probability ε, and otherwise set dk+1 = 0.
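The scheme just described translates line by line into code; the sketch below (Python; all sampler arguments are user-supplied stand-ins for Q, ν, R, and C) encodes only the branching logic.

    import numpy as np

    rng = np.random.default_rng(5)

    def split_chain_step(x, d, Q_sample, nu_sample, R_sample, in_C, eps):
        # One transition of the split chain {(X_k, d_k)} described above.
        if not in_C(x):
            x_next = Q_sample(x)      # outside C: the original kernel Q
        elif d == 1:
            x_next = nu_sample()      # regeneration: the past is forgotten
        else:
            x_next = R_sample(x)      # residual kernel R
        # The bell d_{k+1} rings with probability eps whenever X_{k+1} in C.
        d_next = 1 if (in_C(x_next) and rng.uniform() < eps) else 0
        return x_next, d_next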
Split measures operate on the split kernel in the following way. For any measure
µ on (X, X ),

µ? Q̌ = (µQ)? . (7.26)

For any probability measure µ̌ on X̌ , we denote by P̌µ̌ and Ěµ̌, respectively, the probability distribution and the expectation on the canonical space (X̌^N, X̌^⊗N) such that the coordinate process, denoted {(Xk, dk)}k≥0, is a Markov chain with initial probability measure µ̌ and transition kernel Q̌. We also denote by {F̌k}k≥0 the
natural filtration of this chain and, as usual, by {FkX }k≥0 the natural filtration of
{Xk }k≥0 .
Proposition 162. Let Q be a phi-irreducible transition kernel on (X, X ), let C be an accessible (1, ε, ν)-small set for Q and let µ be a probability measure on (X, X ). Then for any bounded X -measurable function f and any k ≥ 1,

Ěµ?[f(Xk) | F^X_{k−1}] = Qf(Xk−1)    P̌µ?-a.s.    (7.27)
Before giving the proof, we discuss the implications of this result. It implies that
under P̌µ? , {Xk }k≥0 is a Markov chain (with respect to its natural filtration) with
transition kernel Q and initial distribution µ. By abuse of notation, we can identify
{Xk } with the coordinate process associated to the canonical space XN . Denote
by Pµ the probability measure on (XN , X ⊗N ) such that {Xk }k≥0 is a Markov chain
with transition kernel Q and initial distribution µ (see Section 1.1.2) and denote by
Eµ the associated expectation operator. Then Proposition 162 yields the following identity. For any bounded F^X_∞-measurable random variable Y,

Ěµ?[Y] = Eµ[Y] .    (7.28)
Proof (of Proposition 162). We have, P̌µ?-a.s.,

Ěµ?[f(Xk) | F̌k−1] = 1{dk−1=1} ν(f) + 1{dk−1=0} Rf(Xk−1) .

Because P̌µ?(dk−1 = 1 | F^X_{k−1}) = ε1C(Xk−1) P̌µ?-a.s., it holds that

Ěµ?[f(Xk) | F^X_{k−1}] = Ěµ?{Ěµ?[f(Xk) | F̌k−1] | F^X_{k−1}}
    = ε1C(Xk−1)ν(f) + [1 − ε1C(Xk−1)]Rf(Xk−1)
    = Qf(Xk−1) .

Corollary 163. Under the assumptions of Proposition 162, X×{1} is an accessible


atom and ν ? is an irreducibility measure for the split kernel Q̌. More generally, if
B ∈ X is accessible for Q, then B × {0, 1} is accessible for the split kernel.
Proof. Because α̌ = X×{1} is an atom for the split kernel Q̌, Proposition 154 shows
that ν ? is an irreducibility measure if α̌ is accessible. Applying (7.28) we obtain for
x ∈ X,
P̌(x,1)(τα̌ < ∞) = P̌(x,1)(dn = 1 for some n ≥ 1) ≥ P̌(x,1)(d1 = 1) = εν(C) > 0 ,
P̌(x,0)(τα̌ < ∞) = P̌(x,0)((Xn, dn) ∈ C × {1} for some n ≥ 1)
    ≥ P̌(x,0)(τC×{0,1} < ∞ , d_{τC×{0,1}} = 1) = ε Px(τC < ∞) > 0 .

Thus α̌ is accessible and ν ? is an irreducibility measure for Q̌. This implies, by


Theorem 147, that for all η ∈ (0, 1), ν ? Ǩη is a maximal irreducibility measure
for the split kernel Q̌; here Kη is the resolvent kernel (7.17) associated to Q̌. By
straightforward applications of the definitions, it is easy to check that ν ? Ǩη =
(νKη )? . Moreover, ν is an irreducibility measure for Q, and νKη is a maximal
irreducibility measure for Q (still by Proposition 156 and Theorem 147). If B is
accessible, then νKη (B) > 0 and
ν ? Ǩη (B × {0, 1}) = (νKη )? (B × {0, 1}) = νKη (B) > 0.

Thus B × {0, 1} is accessible for Q̌.



Transience/Recurrence Dichotomy for General Phi-irreducible Chains


Using the splitting construction, we are now able to prove Theorem 151 for chains
not possessing accessible atoms. We first consider the simple case in which the chain
possesses a 1-small set.
Proposition 164. Let Q be a phi-irreducible transition kernel that admits an accessible (1, ε, ν)-small set C. Then Q is either recurrent or transient. It is recurrent if and only if the small set C is recurrent.
Proof. Because the split chain possesses an accessible atom, by Proposition 154 the
split chain is phi-irreducible and either recurrent or transient. Applying (7.28) we
can write
Ěδx? [ηB×{0,1} ] = Ex [ηB ] . (7.29)
Assume first that the split chain is recurrent. Let B be an accessible set for Q. By Corollary 163, B × {0, 1} is accessible for the split chain. Hence Ěδx?[ηB×{0,1}] = ∞ for all x ∈ B, so that, by (7.29), Ex[ηB] = ∞ for all x ∈ B.
Conversely, if the split chain is transient, then by Proposition 154 the atom α̌ is transient. For j ≥ 1, define Bj = {x : ∑_{l=1}^{j} Q̌^l((x, 0), α̌) ≥ 1/j}. Because α̌ is accessible, ∪_{j=1}^{∞} Bj = X. By the same argument as in the proof of Proposition 154, the sets Bj × {0, 1} are uniformly transient for the split chain. Hence, by (7.29), the sets Bj are uniformly transient for Q.
It remains to prove that if the small set C is recurrent, then the chain is recurrent.
We have just proved that Q is recurrent if and only if Q̌ is recurrent and, by
Proposition 154, this is true if and only if the atom α̌ is recurrent. Thus we only
need to prove that if C is recurrent then α̌ is recurrent. If C is recurrent, then
(7.29) yields for all x ∈ C,
Ěδx?[ηα̌] ≥ ε Ěδx?[ηC×{0,1}] = ε Ex[ηC] = ∞ .
Using the definition of δx? , this implies that there exists x̌ ∈ X̌ such that Ěx̌ [ηα̌ ] =
∞. This observation and (7.23) imply that Ěα̌ [ηα̌ ] = ∞, that is, the atom is
recurrent.
Using the resolvent kernel, the previous results can be extended to the general
case where an accessible small set exists, but not necessarily a 1-small one.
Proposition 165. Let Q be a transition kernel.

(i) If Q is phi-irreducible and admits an accessible (m, ε, ν)-small set C, then for any η ∈ (0, 1), C is an accessible (1, ε′, ν)-small set for the resolvent kernel Kη = (1 − η) ∑_{k=0}^{∞} η^k Q^k with ε′ = (1 − η)η^m ε.
(ii) A set is recurrent (resp. uniformly transient) for Q if and only if it is recurrent
(resp. uniformly transient) for Kη for some (hence for all) η ∈ (0, 1).
(iii) Q is recurrent (resp. transient) if and only if Kη is recurrent (resp. transient)
for some (hence for all) η ∈ (0, 1).
Proof. For any η > 0, x ∈ C, and A ∈ X ,

Kη(x, A) ≥ (1 − η)η^m Q^m(x, A) ≥ (1 − η)η^m εν(A) = ε′ν(A) .

Thus C is a (1, ε′, ν)-small set for Kη, showing part (i). The remaining claims follow from the identity

∑_{n≥1} Kη^n = ((1 − η)/η) ∑_{n≥0} Q^n .

Harris Recurrence
As for countable state spaces, it is sometimes useful to consider stronger recurrence
properties, expressed in terms of return probabilities rather than mean occupation
times.

Definition 166 (Harris Recurrence). A set A ∈ X is said to be Harris recurrent if


Px (τA < ∞) = 1 for any x ∈ X. A phi-irreducible Markov chain is said to be Harris
(recurrent) if any accessible set is Harris recurrent.

It is intuitively obvious that, as for countable state spaces, Harris recurrence


implies recurrence.

Proposition 167. A Harris recurrent set is recurrent.


Proof. Let A be a Harris recurrent set. Because for j ≥ 1, σA^(j+1) = σA^(j) + τA ∘ θ_{σA^(j)} on the set {σA^(j) < ∞}, the strong Markov property implies that for any x ∈ A,

Px(σA^(j+1) < ∞) = Ex[ P_{X_{σA^(j)}}(τA < ∞) 1{σA^(j) < ∞} ] = Px(σA^(j) < ∞) .

Because Px(σA^(1) < ∞) = 1 for x ∈ A, we obtain that for all x ∈ A and all j ≥ 1, Px(σA^(j) < ∞) = 1 and Ex[ηA] = ∑_{j=1}^{∞} Px(σA^(j) < ∞) = ∞.

Even though not every recurrent transition kernel is Harris recurrent, the following theorem provides a very useful decomposition of the state space of a recurrent phi-irreducible transition kernel. For a proof of this result, see Meyn and Tweedie (1993, Theorem 9.1.5).

Theorem 168. Let Q be a phi-irreducible recurrent transition kernel on a state


space X and let ψ be a maximal irreducibility measure. Then X = N ∪ H, where N
is covered by a countable family of uniformly transient sets, ψ(N) = 0 and every
accessible subset of H is Harris recurrent.

As a consequence, if A is an accessible set of a recurrent phi-irreducible chain, then there exists a set A0 ⊆ A such that ψ(A \ A0) = 0 for any maximal irreducibility measure ψ, and Px(τ_{A0} < ∞) = 1 for all x ∈ A0.
Example 169. To understand why a recurrent Markov chain can fail to be Harris, consider the following elementary example of a chain on X = N. Let the transition kernel Q be given by Q(0, 0) = 1 and, for x ≥ 1, Q(x, x + 1) = 1 − 1/x² and Q(x, 0) = 1/x². Thus the state 0 is absorbing. Because Q(x, 0) > 0 for any x ∈ X, δ0 is an irreducibility measure. In fact, by application of Theorem 147, this measure is maximal. The set {0} is an atom and because P0(τ{0} < ∞) = 1, the chain is recurrent by Proposition 154.
The chain is not Harris recurrent, however. Indeed, for any x ≥ 1 we have

Px(τ0 ≥ k) = Px(X1 ≠ 0, . . . , X_{k−1} ≠ 0) = Π_{j=x}^{x+k−2} (1 − 1/j²) .

Because Π_{j=2}^{∞} (1 − 1/j²) > 0, we obtain that Px(τ0 = ∞) = lim_{k→∞} Px(τ0 ≥ k) > 0 for any x ≥ 2, so that the accessible state 0 is not certainly reached from such an initial state. Comparing to Theorem 168, we see that the decomposition of the state space is given by H = {0} and N = {1, 2, . . .}.
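The failure of Harris recurrence is easy to observe numerically. The following minimal sketch (an illustration, not part of the text; the chain and constants are those of Example 169) estimates Px(τ0 = ∞) by Monte Carlo and compares it with the exact value Π_{j≥x}(1 − 1/j²) = (x − 1)/x, obtained by telescoping.

```python
import numpy as np

rng = np.random.default_rng(0)

def escapes(x, n_max=1_000):
    """Run the chain from x; True if 0 has not been hit after n_max steps."""
    for _ in range(n_max):
        if rng.random() < 1.0 / x**2:
            return False            # absorbed at 0
        x += 1
    return True                     # escape is then essentially certain

x0, n_rep = 2, 5_000
est = np.mean([escapes(x0) for _ in range(n_rep)])
exact = (x0 - 1) / x0               # telescoping: prod_{j>=x}(1 - 1/j^2) = (x-1)/x
print(f"P_{x0}(tau_0 = infinity): MC ~ {est:.3f}, exact = {exact:.3f}")
```

For x = 2 the chain escapes to infinity with probability 1/2, so the accessible state 0 is indeed not almost surely reached.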
7.2.3 Invariant Measures and Stationarity

On general state spaces, we again further classify chains using invariant measures. A σ-finite measure µ is called Q-sub-invariant if µ ≥ µQ and Q-invariant if µ = µQ.
Theorem 170. A phi-irreducible recurrent transition kernel (or Markov chain) admits a unique (up to a multiplicative constant) invariant measure, which is also a maximal irreducibility measure.

This result leads us to define the following classes of chains.

Definition 171 (Positive and Null Chains). A phi-irreducible transition kernel (or Markov chain) is called positive if it admits an invariant probability measure; otherwise it is called null.
We now prove the existence of an invariant measure when the chain admits an accessible atom. The invariant measure is defined as for countable state spaces, by replacing an individual state by the atom. Thus define the measure µα on X by

µα(A) = Eα[ Σ_{n=1}^{τα} 1A(Xn) ] ,   A ∈ X .   (7.30)
Proposition 172. Let α be an accessible atom for the transition kernel Q. Then
µα is Q-sub-invariant. It is invariant if and only if the atom α is recurrent. In
that case, any Q-invariant measure µ is proportional to µα , and µα is a maximal
irreducibility measure.
Proof. By the definition of µα and the strong Markov property,

µα Q(A) = Eα[ Σ_{k=1}^{τα} Q(Xk, A) ] = Eα[ Σ_{k=2}^{τα+1} 1A(Xk) ]
        = µα(A) − Pα(X1 ∈ A) + Eα[ 1A(X_{τα+1}) 1{τα < ∞} ] .

Applying the strong Markov property once again yields

Eα[ 1A(X_{τα+1}) 1{τα < ∞} ] = Eα{ Eα[ 1A(X1) ∘ θ_{τα} | F^X_{τα} ] 1{τα < ∞} }
  = Eα[ P_{X_{τα}}(X1 ∈ A) 1{τα < ∞} ] = Pα(X1 ∈ A) Pα(τα < ∞) .
Thus µα Q(A) = µα (A) − Pα (X1 ∈ A)[1 − Pα (τα < ∞)]. This proves that µα is
sub-invariant, and invariant if and only if Pα (τα < ∞) = 1.
Now let µ be an invariant non-trivial measure and let A be an accessible set such that µ(A) < ∞. Then there exists an integer n such that Qn(α, A) > 0. Because µ is invariant, it holds that µ = µQn, so that

∞ > µ(A) = µQn(A) ≥ µ(α) Qn(α, A) .

This implies that µ(α) < ∞. Without loss of generality, we can assume µ(α) > 0; otherwise we replace µ by µ + µα. Assuming µ(α) > 0, there is then no loss of generality in assuming µ(α) = 1.
The next step is to prove that if µ is an invariant measure such that µ(α) = 1, then µ ≥ µα. To prove this it suffices to prove that for all n ≥ 1,

µ(A) ≥ Σ_{k=1}^{n} Pα(Xk ∈ A, τα ≥ k) .
We prove this inequality by induction. For n = 1 we can write

µ(A) = µQ(A) ≥ µ(α) Q(α, A) = Q(α, A) = Pα(X1 ∈ A) .

Now assume that the inequality holds for some n ≥ 1. Then

µ(A) = Q(α, A) + ∫_{α^c} µ(dy) Q(y, A)
     ≥ Q(α, A) + Σ_{k=1}^{n} Eα[ Q(Xk, A) 1{τα ≥ k} 1{Xk ∉ α} ]
     ≥ Q(α, A) + Σ_{k=1}^{n} Eα[ Q(Xk, A) 1{τα ≥ k+1} ] .

Because {τα ≥ k + 1} ∈ F_k^X, the Markov property yields

Eα[ Q(Xk, A) 1{τα ≥ k+1} ] = Pα(X_{k+1} ∈ A, τα ≥ k + 1) ,

whence

µ(A) ≥ Q(α, A) + Σ_{k=2}^{n+1} Pα(Xk ∈ A, τα ≥ k) = Σ_{k=1}^{n+1} Pα(Xk ∈ A, τα ≥ k) .
This completes the induction, and we conclude that µ ≥ µα .
Assume now that there exists a set A such that µ(A) > µα(A). Both µ and µα are invariant for the resolvent kernel Kδ (see (7.17)), for any δ ∈ (0, 1). Because α is accessible, Kδ(x, α) > 0 for all x ∈ X. Hence ∫_A µ(dx) Kδ(x, α) > ∫_A µα(dx) Kδ(x, α), which implies that

1 = µ(α) = µKδ(α) = ∫_A µ(dx) Kδ(x, α) + ∫_{A^c} µ(dx) Kδ(x, α)
  > ∫_A µα(dx) Kδ(x, α) + ∫_{A^c} µα(dx) Kδ(x, α) = µα Kδ(α) = µα(α) = 1 .

This contradiction shows that µ = µα.
We finally prove that µα is a maximal irreducibility measure. Let ψ be a maximal irreducibility measure and assume that ψ(A) = 0. Then Px(τA < ∞) = 0 for ψ-almost all x ∈ X. This obviously implies that Px(τA < ∞) = 0 for ψ-almost all x ∈ α. Because Px(τA < ∞) is constant over α, we find that Px(τA < ∞) = 0 for all x ∈ α, and this yields µα(A) = 0. Thus µα is absolutely continuous with respect to ψ, hence an irreducibility measure. Let again Kδ be the resolvent kernel. By Theorem 147, µα Kδ is a maximal irreducibility measure. But, as noted above, µα Kδ = µα, and therefore µα is a maximal irreducibility measure.
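On a finite state space the content of Proposition 172 can be checked directly, since every single state is an atom. The sketch below (an illustration, not part of the text; the 3 × 3 matrix is an arbitrary example and α = {0}) estimates µα by simulating excursions and verifies both invariance and proportionality to the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.4, 0.4, 0.2]])

def mu_alpha(alpha=0, n_rep=100_000):
    """MC estimate of E_alpha[sum_{n=1}^{tau_alpha} 1_A(X_n)] for A = {0},{1},{2}."""
    counts = np.zeros(3)
    for _ in range(n_rep):
        x = alpha
        while True:
            x = rng.choice(3, p=Q[x])
            counts[x] += 1
            if x == alpha:
                break
    return counts / n_rep

mu = mu_alpha()
w, v = np.linalg.eig(Q.T)                          # stationary distribution of Q
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
print("mu_alpha       :", np.round(mu, 3))
print("mu_alpha Q     :", np.round(mu @ Q, 3))     # invariance: equals mu_alpha
print("pi / pi(alpha) :", np.round(pi / pi[0], 3)) # proportional to mu_alpha
```

Note that µα(α) = 1 by construction, so the normalized invariant measure is π = µα/µα(X), in agreement with Kac's formula.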
Proposition 173. Let Q be a recurrent phi-irreducible transition kernel that admits
an accessible (1, , ν)-small set C. Then it admits a non-trivial invariant measure,
unique up to multiplication by a constant and such that 0 < π(C) < ∞, and any
invariant measure is a maximal irreducibility measure.
Proof. By (7.26), (µQ)⋆ = µ⋆ Q̌, so that µ is Q-invariant if and only if µ⋆ is Q̌-invariant. Let µ̌ be a Q̌-invariant measure and define

µ = ∫_{C×{0}} µ̌(dx̌) R(x, ·) + ∫_{C^c×{0}} µ̌(dx̌) Q(x, ·) + µ̌(X × {1}) ν .

By application of the definition of the split kernel and measures, it can be checked that µ̌ Q̌ = µ⋆. Hence µ⋆ = µ̌ Q̌ = µ̌. We thus see that µ⋆ is Q̌-invariant, which, as
noted above, implies that µ is Q-invariant. Hence we have shown that there exists
a Q-invariant measure if and only if there exists a Q̌-invariant one.
If Q is recurrent then C is recurrent, and as appears in the proof of Proposition 164 this implies that the atom α̌ is recurrent for the split chain Q̌. Thus, by Proposition 154 the kernel Q̌ is recurrent, and by Proposition 172 it admits an
invariant measure that is unique up to a scaling factor. Hence Q also admits an
invariant measure, unique up to a scaling factor and such that 0 < π(C) < ∞.
Let µ be Q-invariant. Then µ⋆ is Q̌-invariant and hence, by Proposition 172, a maximal irreducibility measure. If µ(A) > 0, then µ⋆(A × {0, 1}) = µ(A) > 0. Thus A × {0, 1} is accessible, and this implies that A is accessible. We conclude that µ is an irreducibility measure, and it is maximal because it is Kη-invariant.
If the kernel Q is phi-irreducible and admits an accessible (m, ε, ν)-small set C, then, by Proposition 165, for any η ∈ (0, 1) the set C is an accessible (1, ε′, ν)-small set for the resolvent kernel Kη. If C is recurrent for Q, it is also recurrent for Kη and therefore, by Proposition 173, Kη admits an invariant measure, unique up to a scaling factor. The following result shows that this measure is invariant also for Q.
Lemma 174. A measure µ on (X, X) is Q-invariant if and only if µ is Kη-invariant for some (hence for all) η ∈ (0, 1).

Proof. If µQ = µ, then obviously µQn = µ for all n ≥ 0, so that µKη = µ. Conversely, assume that µKη = µ. Because Kη = ηQKη + (1 − η)Q^0 and QKη = KηQ, it holds that

µ = µKη = ηµQKη + (1 − η)µ = ηµKηQ + (1 − η)µ = ηµQ + (1 − η)µ .

Hence ηµQ = ηµ, which concludes the proof.
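On a finite state space, Lemma 174 can be verified numerically because the resolvent series can be summed in closed form. A small sketch (an illustration, not part of the text; the matrix is an arbitrary example):

```python
import numpy as np

Q = np.array([[0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.25, 0.25]])
eta = 0.7
# Resolvent: K_eta = (1-eta) sum_{k>=0} eta^k Q^k = (1-eta) (I - eta Q)^{-1}
K = (1 - eta) * np.linalg.inv(np.eye(3) - eta * Q)

def stationary(P):
    w, v = np.linalg.eig(P.T)
    p = np.real(v[:, np.argmin(np.abs(w - 1))])
    return p / p.sum()

print("pi(Q)    :", np.round(stationary(Q), 6))
print("pi(K_eta):", np.round(stationary(K), 6))   # identical, as Lemma 174 asserts
```

Changing eta leaves the invariant distribution unchanged, exactly as the lemma states.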
Drift Conditions
We first give a sufficient condition for a chain to be positive, based on the expectation
of the return time to an accessible small set.
Proposition 175. Let Q be a transition kernel that admits an accessible small set C such that

sup_{x∈C} Ex[τC] < ∞ .   (7.31)

Then the chain is positive and the invariant probability measure π satisfies, for all A ∈ X,

π(A) = ∫_C π(dy) Ey[ Σ_{k=0}^{τC−1} 1A(Xk) ] = ∫_C π(dy) Ey[ Σ_{k=1}^{τC} 1A(Xk) ] .   (7.32)
If f is a non-negative measurable function such that

sup_{x∈C} Ex[ Σ_{k=0}^{τC−1} f(Xk) ] < ∞ ,   (7.33)

then f is integrable with respect to π and

π(f) = ∫_C π(dy) Ey[ Σ_{k=0}^{τC−1} f(Xk) ] = ∫_C π(dy) Ey[ Σ_{k=1}^{τC} f(Xk) ] .
Proof. First note that by Proposition 156, Q is phi-irreducible. Equation (7.31) implies that Px(τC < ∞) = 1 for all x ∈ C, that is, C is Harris recurrent. By Proposition 167, C is recurrent, and so, by Proposition 164, Q is recurrent. Let π be an invariant measure such that 0 < π(C) < ∞, the existence of which is given by Proposition 173. Then define a measure µC on X by

µC(A) = ∫_C π(dy) Ey[ Σ_{k=1}^{τC} 1A(Xk) ] .
Because τC < ∞ Py-a.s. for all y ∈ C, it holds that µC(C) = π(C). One can then show that µC(A) = π(A) for all A ∈ X; the proof is along the same lines as the proof of Proposition 172 and is therefore omitted. Thus µC is invariant. In addition, we obtain that for any measurable set A,

∫_C π(dy) Ey[1A(X0)] = π(A ∩ C) = µC(A ∩ C) = ∫_C π(dy) Ey[1A(X_{τC})] ,

and this yields

µC(A) = ∫_C π(dy) Ey[ Σ_{k=1}^{τC} 1A(Xk) ] = ∫_C π(dy) Ey[ Σ_{k=0}^{τC−1} 1A(Xk) ] .
We thus obtain the following equivalent expressions for µC:

µC(A) = ∫_C π(dy) Ey[ Σ_{k=0}^{τC−1} 1A(Xk) ] = ∫_C µC(dy) Ey[ Σ_{k=0}^{τC−1} 1A(Xk) ]
      = ∫_C µC(dy) Ey[ Σ_{k=1}^{τC} 1A(Xk) ] = ∫_C π(dy) Ey[ Σ_{k=1}^{τC} 1A(Xk) ]
      = π(A) .
Hence

π(X) = ∫_C π(dy) Ey[ Σ_{k=0}^{τC−1} 1X(Xk) ] ≤ π(C) sup_{y∈C} Ey[τC] < ∞ ,
so that any invariant measure is finite and the chain is positive. Finally, under
(7.33) we obtain that
"τ −1 # "τ −1 #
Z C
X XC

π(f ) = π(dy) Ey f (Xk ) ≤ π(C) sup Ey f (Xk ) < ∞ .


C y∈C
k=0 k=1
Except in specific examples (where, for example, the invariant distribution is known in advance), it may be difficult to decide whether a chain is positive or null. To check such properties, it is convenient to use drift conditions.
Proposition 176. Assume that there exist a set C ∈ X, two measurable functions 1 ≤ f ≤ V, and a constant b > 0 such that

QV ≤ V − f + b1C .   (7.34)

Then

Ex[τC] ≤ V(x) + b1C(x) ,   (7.35)

Ex[V(X_{τC})] + Ex[ Σ_{k=0}^{τC−1} f(Xk) ] ≤ V(x) + b1C(x) .   (7.36)
If C is an accessible small set and V is bounded on C, then the chain is positive recurrent and π(f) < ∞.
Proof. Set, for n ≥ 1,

Mn = [ V(Xn) + Σ_{k=0}^{n−1} f(Xk) ] 1{τC ≥ n} .

Then

E[M_{n+1} | Fn] = [ QV(Xn) + Σ_{k=0}^{n} f(Xk) ] 1{τC ≥ n+1}
               ≤ [ V(Xn) − f(Xn) + b1C(Xn) + Σ_{k=0}^{n} f(Xk) ] 1{τC ≥ n+1}
               = [ V(Xn) + Σ_{k=0}^{n−1} f(Xk) ] 1{τC ≥ n+1} ≤ Mn ,

as 1C(Xn) 1{τC ≥ n+1} = 0. Hence {Mn}_{n≥1} is a non-negative super-martingale. For any integer n, τC ∧ n is a bounded stopping time, and Doob's optional stopping theorem shows that for any x ∈ X,
Ex [MτC ∧n ] ≤ Ex [M1 ] ≤ V (x) + b1C (x) . (7.37)
Applying this relation with f ≡ 1 yields for any x ∈ X and n ≥ 0,
Ex [τC ∧ n] ≤ V (x) + b1C (x) ,
and (7.35) follows using monotone convergence. This implies in particular that Px(τC < ∞) = 1 for any x ∈ X. The proof of (7.36) follows similarly from (7.37) by letting n → ∞, and π(f) is then finite by (7.33) and Proposition 175.
Example 177 (Random Walk on the Half-Line, Continued). Consider again the model of Example 153. Previously we have seen that sets of the form [0, c] are small: if Γ((−∞, −c]) > 0, then for x ∈ [0, c],

Q(x, A) ≥ Γ((−∞, −c]) 1A(0) ;

otherwise there exists an integer m such that Γ^{∗m}((−∞, −c]) > 0, whence

Q^m(x, A) ≥ Γ^{∗m}((−∞, −c]) 1A(0) .

To prove recurrence for µ < 0, we apply Proposition 176. Because µ < 0, there exists c > 0 such that ∫_{−c}^{∞} w Γ(dw) ≤ µ/2 < 0. Thus, taking V(x) = x for x > c, we obtain for x > c

QV(x) − V(x) = ∫_{−∞}^{∞} [(x + w)_+ − x] Γ(dw)
             = −x Γ((−∞, −x]) + ∫_{−x}^{∞} w Γ(dw) ≤ µ/2 .
Hence the chain is positive recurrent.

Consider now the case µ > 0. In view of Proposition 154, we have to show that the atom {0} is transient. For any n, Xn ≥ X0 + Σ_{i=1}^{n} Wi. Define Cn = {|n^{−1} Σ_{i=1}^{n} Wi − µ| ≥ µ/2} and write Dn for {Xn = 0}. The strong law of large numbers implies that P0(Dn i.o.) ≤ P0(Cn i.o.) = 0. Hence the atom {0} is transient, and so is the chain.

When µ = 0, additional assumptions on Γ are needed to prove the recurrence of the RWHL (see for instance Meyn and Tweedie, 1993, Lemma 8.5.2).
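The µ < 0 case is easy to illustrate by simulation. The sketch below (an illustration, not part of the text; it assumes Gaussian increments Wk ∼ N(−0.5, 1), which are not specified in Example 177) checks that the expected return time to the small set C = [0, c] remains bounded over starting points in C, in line with (7.31).

```python
import numpy as np

rng = np.random.default_rng(2)
mu, c = -0.5, 2.0                        # increment mean (negative) and small-set bound

def tau_C(x, n_max=100_000):
    """First hitting time of C = [0, c] for the RWHL started at x."""
    for n in range(1, n_max + 1):
        x = max(x + mu + rng.standard_normal(), 0.0)
        if x <= c:
            return n
    raise RuntimeError("no return within n_max steps")

for x0 in (0.0, 1.0, 2.0):
    taus = [tau_C(x0) for _ in range(5_000)]
    print(f"x0 = {x0}: E_x[tau_C] ~ {np.mean(taus):.2f}")
```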
Example 178 (Autoregressive Model, Continued). Consider again the model of Example 148 and assume that the noise process has zero mean and finite variance. Choosing V(x) = x² we have

QV(x) = E[(φx + U1)²] = φ²V(x) + E[U1²] ,

so that (7.34) holds with C = [−M, M] for some large enough M, provided |φ| < 1. Because we know that every compact set is small if the noise process has an everywhere continuous positive density, Proposition 176 shows that the chain is positive recurrent. Note that this approach provides an existence result but does not help us to determine π. If {Uk} are Gaussian with zero mean and variance σ², then one can check that the invariant distribution also is Gaussian, with zero mean and variance σ²/(1 − φ²).
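Both claims of this example are easy to confirm numerically. The sketch below (an illustration, not part of the text; φ and σ are arbitrary) estimates QV(x)/V(x) outside the compact set and the stationary variance of a long simulated path.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, sigma = 0.9, 1.0

# Drift: QV(x)/V(x) = phi^2 + sigma^2/x^2 < 1 once x^2 > sigma^2/(1 - phi^2)
M = 1.1 * sigma / np.sqrt(1 - phi**2)
x = 2 * M
qv = np.mean((phi * x + sigma * rng.standard_normal(1_000_000)) ** 2)
print(f"QV(x)/V(x) at x = {x:.2f}: {qv / x**2:.4f} (phi^2 = {phi**2:.4f})")

# Stationary variance from one long path
n = 1_000_000
X = np.empty(n); X[0] = 0.0
U = sigma * rng.standard_normal(n)
for k in range(n - 1):
    X[k + 1] = phi * X[k] + U[k]
print(f"empirical var = {X.var():.3f}; sigma^2/(1-phi^2) = {sigma**2/(1-phi**2):.3f}")
```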
Theorem 170 shows that if a chain is phi-irreducible and recurrent, then it admits an invariant measure that is unique up to a multiplicative constant; when this measure is finite, the chain is positive, that is, it admits a unique invariant probability measure π. In certain situations, and in particular when dealing with MCMC procedures, it is known that Q admits an invariant probability measure, but it is not known, a priori, that the chain is recurrent. The following result shows that positivity implies recurrence.
Proposition 179. If the Markov kernel Q is positive, then it is recurrent.

Proof. Suppose that the chain is positive and let π be an invariant probability measure. If Q is transient, the state space X is covered by a countable family {Aj} of uniformly transient subsets (see Theorem 151). For any j and k,

k π(Aj) = Σ_{n=1}^{k} πQ^n(Aj) ≤ ∫ π(dx) Ex[η_{Aj}] ≤ sup_{x∈X} Ex[η_{Aj}] .   (7.38)

The strong Markov property implies that

Ex[η_{Aj}] = Ex[ η_{Aj} 1{σ_{Aj} < ∞} ] ≤ Ex{ 1{σ_{Aj} < ∞} E_{X_{σ_{Aj}}}[η_{Aj}] } ≤ sup_{x∈Aj} Ex[η_{Aj}] Px(σ_{Aj} < ∞) .

Thus, the right-hand side of (7.38) is bounded as k → ∞. This implies that π(Aj) = 0, and hence π(X) = 0. This is a contradiction, so the chain cannot be transient.
7.2.4 Ergodicity

In this section, we study the convergence of the iterates Qn of the transition kernel to the invariant distribution. As in the discrete state space case, we first need to rule out periodic behavior, which prevents the iterates from converging. In the discrete case, the period of a state x is defined as the greatest common divisor of the set of time points {n ≥ 1 : Qn(x, x) > 0}. Of course this notion does not extend to general state spaces, but for phi-irreducible chains we may define the period of accessible small sets. More precisely, let Q be a phi-irreducible transition kernel with maximal irreducibility measure ψ. By Theorem 156, there exists an accessible (m, ε, ψ)-small set C. Because ψ is a maximal irreducibility measure, ψ(C) > 0, so that when the chain starts from C there is a positive probability that it will return to C at time m. Let

E_C def= {n ≥ 1 : the set C is (n, εn, ψ)-small for some εn > 0}   (7.39)
be the set of time points for which C is small with minorizing measure ψ. Note that for n and m in E_C, B ∈ X⁺ and x ∈ C,

Q^{n+m}(x, B) ≥ ∫_C Q^m(x, dx′) Q^n(x′, B) ≥ εm εn ψ(C) ψ(B) > 0 ,

showing that E_C is closed under addition. There is thus a natural period for E_C, given by its greatest common divisor. As in the discrete case (see Proposition 144), this period d may be shown to be independent of the particular choice of the small set C (see for instance Meyn and Tweedie, 1993, Theorem 5.4.4).
Proposition 180. Suppose that Q is phi-irreducible with maximal irreducibility measure ψ. Let C be an accessible (m, ε, ψ)-small set and let d be the greatest common divisor of the set E_C defined in (7.39). Then there exist disjoint sets D1, . . . , Dd (a d-cycle) such that

(i) for x ∈ Di, Q(x, D_{i+1}) = 1, i = 0, . . . , d − 1 (mod d);

(ii) the set N = (∪_{i=1}^{d} Di)^c is ψ-null.

The d-cycle is maximal in the sense that if D′1, . . . , D′_{d′} is a d′-cycle, then d′ divides d; and if d = d′, then, up to a permutation of indices, D′i and Di are ψ-almost equal.
It is obvious from this proposition that the period d does not depend on the choice of the small set C and that any small set must be contained (up to ψ-null sets) inside one specific member of a d-cycle. This in particular implies that if there exists an accessible (1, ε, ψ)-small set C, then d = 1. This suggests the following definition.

Definition 181 (Aperiodicity). Suppose that Q is a phi-irreducible transition kernel with maximal irreducibility measure ψ. The largest d for which a d-cycle exists is called the period of Q. When d = 1, the chain is called aperiodic. When there exists a (1, ε, ψ)-small set C, the chain is called strongly aperiodic.
In all the examples considered above, we have shown the existence of a 1-small
set; therefore all these Markov chains are strongly aperiodic.
Now we can state the main convergence result, formulated and proved by Athreya
et al. (1996). It parallels Theorem 145.
Theorem 182. Let Q be a phi-irreducible positive aperiodic transition kernel. Then for π-almost all x,

lim_{n→∞} ‖Qn(x, ·) − π‖_TV = 0 .   (7.40)

If Q is Harris recurrent, the convergence occurs for all x ∈ X.
Although this result does not provide information on the rate of convergence to the invariant distribution, its assumptions are quite minimal. In fact, it may be shown that these assumptions are essentially necessary and sufficient: if ‖Qn(x, ·) − π‖_TV → 0 for any x ∈ X, then by Nummelin (1984, Proposition 6.3) the chain is π-irreducible, aperiodic, positive Harris, and π is an invariant distribution. This form of the ergodicity theorem is of particular interest in cases where the invariant distribution is explicitly known, as in Markov chain Monte Carlo. It provides conditions that are simple and easy to verify, and under which an MCMC algorithm converges to its stationary distribution.

Of course, the exceptional null set for a non-Harris recurrent chain is a nuisance. The example below shows, however, that there is no way of getting rid of it.
Example 183. In the model of Example 169, π = δ0 is an invariant probability measure. Because Qn(x, 0) = Px(τ{0} ≤ n) for any n ≥ 0, lim_{n→∞} Qn(x, 0) = Px(τ{0} < ∞). We have previously shown that Px(τ{0} < ∞) = 1 − Px(τ{0} = ∞) < 1 for x ≥ 2, whence lim sup_n ‖Qn(x, ·) − π‖_TV > 0 for such x.
Fortunately, in many cases it is not hard to show that a chain is Harris.

A proof of Theorem 182 from first principles is given by Athreya et al. (1996). We give here a proof due to Rosenthal (1995), based on pathwise coupling (see Rosenthal, 2001; Roberts and Rosenthal, 2004). The same construction is used to compute bounds on ‖Qn(x, ·) − π‖_TV. Before proving the theorem, we briefly introduce the pathwise coupling construction for phi-irreducible Markov chains and present the associated coupling inequalities.
Pathwise Coupling and Coupling Inequalities

Suppose that we have two probability measures ξ and ξ′ on (X, X) such that (1/2)‖ξ − ξ′‖_TV ≤ 1 − ε for some ε ∈ (0, 1] or, equivalently (see (3.6)), that there exists a probability measure ν such that εν ≤ ξ ∧ ξ′. Because ξ and ξ′ are probability measures, we may construct a probability space (Ω, F, P) and X-valued random variables X and X′ such that P(X ∈ ·) = ξ(·) and P(X′ ∈ ·) = ξ′(·). By definition, for any A ∈ X,

|ξ(A) − ξ′(A)| = |P(X ∈ A) − P(X′ ∈ A)| = |E[1A(X) − 1A(X′)]|   (7.41)
               = |E[(1A(X) − 1A(X′)) 1{X ≠ X′}]| ≤ P(X ≠ X′) ,   (7.42)
so that the total variation distance between the laws of two random elements is bounded by the probability that they are unequal. Of course, this inequality is not in general sharp, but we can construct on an appropriately defined probability space (Ω̃, F̃, P̃) two X-valued random variables X and X′ with laws ξ and ξ′ such that P̃(X = X′) ≥ ε. The construction goes as follows. We draw a Bernoulli random variable d with probability of success ε. If d = 0, we then draw X and X′ independently from the distributions (1 − ε)^{−1}(ξ − εν) and (1 − ε)^{−1}(ξ′ − εν), respectively. If d = 1, we draw X from ν and set X′ = X. Note that for any A ∈ X,

P̃(X ∈ A) = P̃(X ∈ A | d = 0) P̃(d = 0) + P̃(X ∈ A | d = 1) P̃(d = 1)
          = (1 − ε){(1 − ε)^{−1}[ξ(A) − εν(A)]} + εν(A) = ξ(A)

and, similarly, P̃(X′ ∈ A) = ξ′(A). Thus, marginally the random variables X and X′ are distributed according to ξ and ξ′. By construction, P̃(X = X′) ≥ P̃(d = 1) = ε, showing that X and X′ are equal with probability at least ε. Therefore the coupling bound (7.41) can be made sharp by using an appropriate construction. Note that this construction may be used to derive bounds on distances between probability measures that generalize the total variation; we will consider in the sequel the V-total variation.
Definition 184 (V-Total Variation). Let V : X → [1, ∞) be a measurable function. The V-total variation distance between two probability measures ξ and ξ′ on (X, X) is

‖ξ − ξ′‖_V def= sup_{|f|≤V} |ξ(f) − ξ′(f)| .

If V ≡ 1, ‖ · ‖_V is the total variation distance.
When applied to Markov chains, the whole idea of coupling is to construct, on an appropriately defined probability space, two Markov chains {Xk} and {X′k} with transition kernel Q and initial distributions ξ and ξ′, respectively, in such a way
that Xn = X′n for all indices n after a random time T, referred to as the coupling time. The coupling procedure attempts to couple the two Markov chains when they simultaneously enter a coupling set.

Definition 185 (Coupling Set). Let C̄ ⊆ X × X, ε ∈ (0, 1] and let ν = {ν_{x,x′} : x, x′ ∈ X} be transition kernels from C̄ (endowed with the trace σ-field) to (X, X). The set C̄ is a (1, ε, ν)-coupling set if for all (x, x′) ∈ C̄ and all A ∈ X,

Q(x, A) ∧ Q(x′, A) ≥ ε ν_{x,x′}(A) .   (7.43)

By applying Lemma 43, this condition can be stated equivalently as: there exists ε ∈ (0, 1] such that for all (x, x′) ∈ C̄,

(1/2) ‖Q(x, ·) − Q(x′, ·)‖_TV ≤ 1 − ε .   (7.44)
For simplicity, only one-step minorization is considered in this chapter. Adaptations to m-step minorization (replacing Q by Qm in (7.43)) can be carried out as in Rosenthal (1995). Condition (7.43) is often satisfied by setting C̄ = C × C for a (1, ε, ν)-small set C. Indeed, in that case, for all (x, x′) ∈ C × C and A ∈ X,

Q(x, A) ∧ Q(x′, A) ≥ ε ν(A) .

The case ε = 1 needs some consideration. If there exists an atom, say α, i.e., a set such that, for some probability measure ν, Q(x, A) = ν(A) for all x ∈ α and A ∈ X, then C̄ = α × α is a (1, 1, ν)-coupling set with ν_{x,x′} = ν for all (x, x′) ∈ C̄. Conversely, assume that C̄ is a (1, 1, ν)-coupling set. The alternative characterization (7.44) shows that Q(x, ·) = Q(x′, ·) for all (x, x′) ∈ C̄; this implies that the set C̄ contains a set α1 × α2, where α1 and α2 are atoms for Q.
We now introduce the coupling construction. Let C̄ be a (1, ε, ν)-coupling set. Define X̄ = X × X and X̄ = X ⊗ X. Let Q̄ be a transition kernel on (X̄, X̄) given for all A and A′ in X by

Q̄(x, x′; A × A′) = Q(x, A) Q(x′, A′) 1_{C̄^c}(x, x′)
  + (1 − ε)^{−2} [Q(x, A) − ε ν_{x,x′}(A)] [Q(x′, A′) − ε ν_{x,x′}(A′)] 1_{C̄}(x, x′)   (7.45)

if ε < 1, and Q̄ = Q ⊗ Q if ε = 1. For any probability measure µ̄ on (X̄, X̄), let P̄_µ̄ be the probability measure on the canonical space (X̄^N, X̄^⊗N) such that the coordinate process {X̄k} is a Markov chain with respect to its natural filtration, with initial distribution µ̄ and transition kernel Q̄. As usual, denote the associated expectation operator by Ē_µ̄.
We now define a transition kernel Q̃ on the space X̃ def= X × X × {0, 1}, endowed with the product σ-field X̃, by, for any x, x′ ∈ X and A, A′ ∈ X,

Q̃((x, x′, 0), A × A′ × {0}) = [1 − ε 1_{C̄}(x, x′)] Q̄((x, x′), A × A′) ,   (7.46)
Q̃((x, x′, 0), A × A′ × {1}) = ε 1_{C̄}(x, x′) ν_{x,x′}(A ∩ A′) ,   (7.47)
Q̃((x, x′, 1), A × A′ × {1}) = Q(x, A ∩ A′) .   (7.48)

For any probability measure µ̃ on (X̃, X̃), let P̃_µ̃ be the probability measure on the canonical space (X̃^N, X̃^⊗N) such that the coordinate process {X̃k} is a Markov chain with transition kernel Q̃ and initial distribution µ̃. The corresponding expectation operator is denoted by Ẽ_µ̃.

The transition kernel Q̃ can be described algorithmically. Given X̃0 = (X0, X′0, d0) = (x, x′, d), X̃1 = (X1, X′1, d1) is obtained as follows.
• If d = 1, then draw X1 from Q(x, ·) and set X′1 = X1, d1 = 1.

• If d = 0 and (x, x′) ∈ C̄, flip a coin with probability of heads ε.

  – If the coin comes up heads, draw X1 from ν_{x,x′} and set X′1 = X1 and d1 = 1.

  – If the coin comes up tails, draw (X1, X′1) from Q̄(x, x′; ·) and set d1 = 0.

• If d = 0 and (x, x′) ∉ C̄, draw (X1, X′1) from Q̄(x, x′; ·) and set d1 = 0.

The variable dn is called the bell variable; it indicates whether coupling has occurred by time n (dn = 1) or not (dn = 0). The first index n at which dn = 1 is the coupling time,

T = inf{k ≥ 1 : dk = 1} .

If dn = 1, then Xk = X′k for all k ≥ n. The coupling construction is carried out in such a way that under P̃_{ξ⊗ξ′⊗δ0}, {Xk} and {X′k} are Markov chains with transition kernel Q and initial distributions ξ and ξ′, respectively; a sketch of this update rule in code is given below.
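The update rule translates directly into code. The sketch below (an illustration, not part of the text) specializes to a strictly positive 3 × 3 transition matrix, for which the whole space is (1, ε, ν)-small with ν proportional to the column minima of Q; the coupling set is then C̄ = X × X with ν_{x,x′} = ν, so in this special case the coupling time is geometric with success probability ε.

```python
import numpy as np

rng = np.random.default_rng(4)
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
col_min = Q.min(axis=0)
eps = col_min.sum()                  # minorization constant: Q(x, .) >= eps * nu
nu = col_min / eps                   # common minorizing probability measure
R = (Q - eps * nu) / (1 - eps)       # residual kernel (rows sum to one)

def coupled_step(x, xp, d):
    if d == 1:                       # bell has rung: chains move together, cf. (7.48)
        y = rng.choice(3, p=Q[x])
        return y, y, 1
    if rng.random() < eps:           # heads: couple through nu, cf. (7.47)
        y = rng.choice(3, p=nu)
        return y, y, 1
    # tails: independent moves from the residual kernels, cf. (7.45)
    return rng.choice(3, p=R[x]), rng.choice(3, p=R[xp]), 0

def coupling_time(x, xp):
    T, d = 0, 0
    while d == 0:
        x, xp, d = coupled_step(x, xp, d)
        T += 1
    return T

T = [coupling_time(0, 2) for _ in range(20_000)]
print(f"eps = {eps:.2f}; mean T ~ {np.mean(T):.2f} (geometric, mean 1/eps = {1/eps:.2f})")
```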
The coupling construction allows deriving quantitative bounds on the (V -)total
variation distance in terms of the tail probability of the coupling time.
Proposition 186. Assume that the transition kernel Q admits a (1, ε, ν)-coupling set. Then for any probability measures ξ and ξ′ on (X, X) and any measurable function V : X → [1, ∞),

‖ξQn − ξ′Qn‖_TV ≤ 2 P̃_{ξ⊗ξ′⊗δ0}(T > n) ,   (7.49)
‖ξQn − ξ′Qn‖_V ≤ 2 Ẽ_{ξ⊗ξ′⊗δ0}[ V̄(Xn, X′n) 1{T > n} ] ,   (7.50)

where V̄ : X × X → [1, ∞) is defined by V̄(x, x′) = [V(x) + V(x′)]/2.
Proof. We only need to prove (7.50), because (7.49) is obtained by setting V ≡ 1. Pick a function f such that |f| ≤ V and note that [f(Xn) − f(X′n)] 1{dn = 1} = 0. Hence

|ξQn f − ξ′Qn f| = |Ẽ_{ξ⊗ξ′⊗δ0}[f(Xn) − f(X′n)]|
                 = |Ẽ_{ξ⊗ξ′⊗δ0}[(f(Xn) − f(X′n)) 1{dn = 0}]|
                 ≤ 2 Ẽ_{ξ⊗ξ′⊗δ0}[ V̄(Xn, X′n) 1{dn = 0} ] .
We now provide an alternative expression of the coupling inequality that only involves the process {X̄k}. Let σ_C̄ be the hitting time of the coupling set C̄ by this process, define K0(ε) = 1, and for all n ≥ 1,

Kn(ε) = 1{σ_C̄ ≥ n}                           if ε = 1 ;
Kn(ε) = Π_{j=0}^{n−1} [1 − ε 1_{C̄}(X̄j)]       if ε ∈ (0, 1) .   (7.51)
Proposition 187. Assume that the transition kernel Q admits a (1, ε, ν)-coupling set. Let ξ and ξ′ be probability measures on (X, X) and let V : X → [1, ∞) be a measurable function. Then

‖ξQn − ξ′Qn‖_V ≤ 2 Ē_{ξ⊗ξ′}[ V̄(Xn, X′n) Kn(ε) ] ,   (7.52)

with V̄(x, x′) = [V(x) + V(x′)]/2.
Proof. We show that for any probability measure µ̄ on (X̄, X̄),

Ẽ_{µ̄⊗δ0}[ V̄(Xn, X′n) 1{T > n} ] = Ē_µ̄[ V̄(Xn, X′n) Kn(ε) ] .

To do this, we shall prove by induction that for any n ≥ 0 and any bounded X̄-measurable functions {fj}_{j≥0},

Ẽ_{µ̄⊗δ0}[ Π_{j=0}^{n} fj(Xj, X′j) 1{T > n} ] = Ē_µ̄[ Π_{j=0}^{n} fj(Xj, X′j) Kn(ε) ] .   (7.53)

This is obviously true for n = 0. For n ≥ 0, put χn = Π_{j=0}^{n} fj(Xj, X′j). The induction assumption and the identity {T > n + 1} = {d_{n+1} = 0} yield

Ẽ_{µ̄⊗δ0}[ χ_{n+1} 1{T > n+1} ] = Ẽ_{µ̄⊗δ0}[ χn f_{n+1}(X_{n+1}, X′_{n+1}) 1{d_{n+1} = 0} ]
  = Ẽ_{µ̄⊗δ0}{ χn Ẽ[ f_{n+1}(X_{n+1}, X′_{n+1}) 1{d_{n+1} = 0} | F̃n ] 1{dn = 0} }
  = Ẽ_{µ̄⊗δ0}{ χn [1 − ε 1_{C̄}(Xn, X′n)] Q̄f_{n+1}(Xn, X′n) 1{dn = 0} }
  = Ē_µ̄[ χn Q̄f_{n+1}(X̄n) K_{n+1}(ε) ] = Ē_µ̄[ χ_{n+1} K_{n+1}(ε) ] .

This concludes the induction and the proof.
Proof of Theorem 182
We preface the proof of Theorem 182 by two technical lemmas that establish some
elementary properties of a chain on the product space with transition kernel Q ⊗ Q.
Lemma 188. Suppose that Q is a phi-irreducible aperiodic transition kernel. Then
for any n, Qn is phi-irreducible and aperiodic.
Proof. Propositions 156 and 157 show that there exists an accessible (m, ε, ν)-small set C and that ν is an irreducibility measure. Because Q is aperiodic, there exist a sequence {εk} of positive numbers and an integer nC such that for all n ≥ nC, x ∈ C, and A ∈ X, Qn(x, A) ≥ εn ν(A). In addition, because C is accessible, there exists p such that Qp(x, C) > 0 for any x ∈ X. Therefore for any n ≥ nC and any A ∈ X such that ν(A) > 0,

Q^{n+p}(x, A) ≥ ∫_C Qp(x, dx′) Qn(x′, A) ≥ εn ν(A) Qp(x, C) > 0 .   (7.54)
Lemma 189. Let Q be an aperiodic positive transition kernel with invariant probability measure π. Then Q ⊗ Q is phi-irreducible, π ⊗ π is Q ⊗ Q-invariant, and Q ⊗ Q is positive. If C is an accessible (m, ε, ν)-small set for Q, then C × C is an accessible (m, ε², ν ⊗ ν)-small set for Q ⊗ Q.
Proof. Because Q is phi-irreducible and admits π as an invariant probability measure, π is a maximal irreducibility measure for Q. Let C be an accessible (m, ε, ν)-small set for Q. Then for (x, x′) ∈ C × C and A ∈ X ⊗ X,

(Q ⊗ Q)^m(x, x′; A) = ∫∫_A Qm(x, dy) Qm(x′, dy′) ≥ ε² ν ⊗ ν(A) .

Because ν ⊗ ν(C × C) = [ν(C)]² > 0, this shows that C × C is an (m, ε², ν ⊗ ν)-small set for Q ⊗ Q. By (7.54) there exists an integer nx such that Qn(x, C) > 0 for any n ≥ nx. This implies that for any (x, x′) ∈ X × X and any n ≥ nx ∨ nx′,

(Q ⊗ Q)^n(x, x′; C × C) = Qn(x, C) Qn(x′, C) > 0 ,
showing that C × C is accessible. Because C × C is a small set, Proposition 156 shows that Q ⊗ Q is phi-irreducible. In addition, π ⊗ π is invariant for Q ⊗ Q, so that π ⊗ π is a maximal irreducibility measure and Q ⊗ Q is positive.
We now have all the necessary ingredients to prove Theorem 182.

Proof of Theorem 182. By Lemma 188, Qm is phi-irreducible for any integer m. By Proposition 157, there exists an accessible (m, ε, ν)-small set C with ν(C) > 0. Lemma 46 shows that for all integers n,

‖Qn(x, ·) − Qn(x′, ·)‖_TV ≤ ‖Q^{m⌊n/m⌋}(x, ·) − Q^{m⌊n/m⌋}(x′, ·)‖_TV .

Hence it suffices to prove that (7.40) holds for Qm, and we may thus without loss of generality assume that m = 1.
For any probability measure µ on (X × X, X ⊗ X), let P⋆_µ denote the probability measure on the canonical space ((X × X)^N, (X ⊗ X)^⊗N) such that the canonical process {(Xk, X′k)}_{k≥0} is a Markov chain with transition kernel Q ⊗ Q and initial distribution µ. By Lemma 189, Q ⊗ Q is positive, and it is recurrent by Proposition 179. Because π ⊗ π(C × C) = [π(C)]² > 0, by Theorem 168 there exist two measurable sets C̄ ⊆ C × C and H̄ ⊆ X × X such that π ⊗ π(C × C \ C̄) = 0, π ⊗ π(H̄) = 1, and for all (x, x′) ∈ H̄, P⋆_{x,x′}(τ_C̄ < ∞) = 1. Moreover, the set C̄ is a (1, ε, ν)-coupling set with ν_{x,x′} = ν for all (x, x′) ∈ C̄.
Let the transition kernel Q̄ be defined by (7.45) if ε < 1 and by Q̄ = Q ⊗ Q if ε = 1. For ε = 1, P̄_{x,x′} = P⋆_{x,x′}. Now assume that ε ∈ (0, 1). For (x, x′) ∉ C̄, P̄_{x,x′}(τ_C̄ = ∞) = P⋆_{x,x′}(τ_C̄ = ∞). For (x, x′) ∈ C̄, noting that Q̄(x, x′; A) ≤ (1 − ε)^{−2} Q ⊗ Q(x, x′; A), we obtain

P̄_{x,x′}(τ_C̄ = ∞) = P̄_{x,x′}(τ_C̄ = ∞ | X̄1 ∉ C̄) Q̄(x, x′; C̄^c)
  ≤ (1 − ε)^{−2} Q ⊗ Q(x, x′; C̄^c) P⋆_{x,x′}(τ_C̄ = ∞ | X̄1 ∉ C̄)
  = (1 − ε)^{−2} P⋆_{x,x′}(τ_C̄ = ∞) = 0 .

Thus, for all ε ∈ (0, 1] the set C̄ is Harris recurrent for the kernel Q̄. This implies that lim_{n→∞} Ē_{x,x′}[Kn(ε)] = 0 for all (x, x′) ∈ H̄ and, using Proposition 187, we conclude that (7.40) is true.
7.2.5 Geometric Ergodicity and Foster-Lyapunov Conditions

Theorem 182 implies forgetting of the initial distribution and convergence to stationarity but does not provide us with rates of convergence. In this section, we show how to adapt the construction above to derive explicit bounds on ‖ξQn − ξ′Qn‖_V. We focus on conditions that imply geometric convergence.

Definition 190 (Geometric Ergodicity). A positive aperiodic transition kernel Q with invariant probability measure π is said to be V-geometrically ergodic if there exist constants ρ ∈ (0, 1) and M < ∞ such that

‖Qn(x, ·) − π‖_V ≤ M V(x) ρn   for π-almost all x.   (7.55)
We now present conditions that ensure geometric ergodicity.

Definition 191 (Foster-Lyapunov Drift Condition). A transition kernel Q is said to satisfy a Foster-Lyapunov drift condition outside a set C ∈ X if there exist a measurable function V : X → [1, ∞], bounded on C, and non-negative constants λ < 1 and b < ∞ such that

QV ≤ λV + b1C .   (7.56)
If Q is phi-irreducible and satisfies a Foster-Lyapunov condition outside a small set C, then C is accessible and, writing QV ≤ V − (1 − λ)V + b1C, Proposition 176 shows that Q is positive and π(V) < ∞.
Example 192 (Random Walk on the Half-Line, Continued). Assume that for the model of Example 153 there exists z > 0 such that E[e^{zW1}] < ∞. Then, because µ < 0, there exists z > 0 such that E[e^{zW1}] < 1. Define z0 = arg min_{z>0} E[e^{zW1}] and V(x) = e^{z0 x}, and choose x0 > 0 such that λ = E[e^{z0 W1}] + P(W1 < −x0) < 1. Then for x > x0,

QV(x) = E[e^{z0 (x+W1)_+}] = P(W1 ≤ −x) + e^{z0 x} E[e^{z0 W1} 1{W1 > −x}] ≤ λV(x) .

Hence the Foster-Lyapunov drift condition holds outside the small set [0, x0], and the RWHL is geometrically ergodic. For a sharper choice of the constants z0 and λ, see Scott and Tweedie (1996, Theorem 4.1).
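The constants of this example are straightforward to compute numerically. The sketch below (an illustration, not part of the text) assumes Gaussian increments W1 ∼ N(µ, 1) with µ = −0.5, for which the moment generating function is e^{zµ + z²/2} and z0 = −µ in closed form; SciPy is used for the minimization and the Gaussian tail.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

mu = -0.5                                     # E[W1] < 0
mgf = lambda z: np.exp(z * mu + z**2 / 2)     # E[e^{z W1}] for W1 ~ N(mu, 1)
z0 = minimize_scalar(mgf, bounds=(1e-6, 5.0), method="bounded").x
x0 = 2.0
lam = mgf(z0) + norm.cdf(-x0 - mu)            # P(W1 < -x0) for W1 ~ N(mu, 1)
print(f"z0 ~ {z0:.3f} (exact: -mu = {-mu}); E[e^(z0 W1)] = {mgf(z0):.4f}")
print(f"lambda = {lam:.4f} < 1: drift QV <= lambda V holds outside [0, {x0}]")
```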
Example 193 (Metropolis-Hastings Algorithm, Continued). Consider the Metropolis-Hastings algorithm of Example 149 with random walk proposal kernel r(x, x′) = r(|x − x′|). Geometric ergodicity of the Metropolis-Hastings algorithm on R^d is largely a property of the tails of the stationary distribution π. Conditions for geometric ergodicity can be shown to be, essentially, that the tails are exponential or lighter (Mengersen and Tweedie, 1996) and that in higher dimensions the contours of π are regular near ∞ (see for instance Jarner and Hansen, 2000). To understand how the tail conditions come into play, consider the case where π is a probability density on X = R⁺. We suppose that π is log-concave in the upper tail, that is, that there exist α > 0 and M such that for all x′ ≥ x ≥ M,

log π(x) − log π(x′) ≥ α(x′ − x) .   (7.57)

To simplify the proof, we assume that π is non-increasing, but this assumption is unnecessary. Define Ax = {x′ : π(x′) ≥ π(x)} and Rx = {x′ : π(x′) < π(x)} (with π ≡ 0 outside R⁺), the acceptance and (possible) rejection regions for the chain started from x. Because π is non-increasing, these sets are simple: Ax = [0, x] and Rx = (x, ∞) ∪ (−∞, 0). If we relax the monotonicity condition, the acceptance and rejection regions become more involved, but because π is log-concave and thus in particular monotone in the upper tail, Ax and Rx are essentially intervals when x is sufficiently large.
For any function V : R⁺ → [1, ∞) and x ∈ R⁺,

QV(x)/V(x) = 1 + ∫_{Ax} r(x′ − x) [V(x′)/V(x) − 1] dx′
           + ∫_{Rx} r(x′ − x) [π(x′)/π(x)] [V(x′)/V(x) − 1] dx′ .
We set V(x) = e^{sx} for some s ∈ (0, α). Because π is log-concave, π(x′)/π(x) ≤ e^{−α(x′−x)} when x′ ≥ x ≥ M. For x ≥ M, it follows from elementary calculations that

lim sup_{x→∞} QV(x)/V(x) ≤ 1 − ∫_0^∞ r(u) (1 − e^{−su}) [1 − e^{−(α−s)u}] du < 1 ,

showing that the random walk Metropolis-Hastings algorithm on the positive real line satisfies the Foster-Lyapunov condition when π is monotone and log-concave in the upper tail.
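The drift bound can be checked for a concrete target. The sketch below (an illustration, not part of the text) takes the hypothetical target π(x) ∝ e^{−αx} on R⁺, which is non-increasing and log-concave with the stated α, and a Gaussian random walk proposal; it compares a Monte Carlo evaluation of QV(x)/V(x) at a large x with the limiting expression above.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, s, tau = 1.0, 0.5, 1.0       # target decay rate, drift exponent, proposal std

def qv_over_v(x, n=1_000_000):
    u = tau * rng.standard_normal(n)             # proposal increments
    xp = x + u
    acc = np.where(xp < 0, 0.0,                  # proposals outside R+ are rejected
                   np.minimum(1.0, np.exp(-alpha * (xp - x))))
    return 1.0 + np.mean(acc * (np.exp(s * u) - 1.0))

# Limiting value 1 - int_0^inf r(u)(1 - e^{-su})(1 - e^{-(alpha-s)u}) du
u = np.linspace(0.0, 10.0, 100_001)
du = u[1] - u[0]
r = np.exp(-u**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
limit = 1.0 - np.sum(r * (1 - np.exp(-s * u)) * (1 - np.exp(-(alpha - s) * u))) * du
print(f"QV/V at x = 20: {qv_over_v(20.0):.4f}; limit as x -> infinity: {limit:.4f}")
```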
The main result guaranteeing geometric ergodicity is the following.
Theorem 194. Let Q be a phi-irreducible aperiodic positive transition kernel with invariant distribution π. Assume in addition that Q satisfies a Foster-Lyapunov drift condition outside a small set C with drift function V. Then π(V) is finite and Q is V-geometrically ergodic.
In fact, it follows from Meyn and Tweedie (1993, Theorems 15.0.1 and 16.0.1) that the converse is also true: if a phi-irreducible aperiodic kernel is V-geometrically ergodic, then there exists an accessible small set C such that V is a drift function outside C.

For the sake of brevity and simplicity, we now prove Theorem 194 under the additional assumption that the level sets of V are all (1, ε, ν)-small. In that case, it is possible to define a coupling set C̄ and a transition kernel Q̄ that satisfies a (bivariate) Foster-Lyapunov drift condition outside C̄. The geometric ergodicity of the transition kernel Q is then proved under this assumption. This is the purpose of the following propositions.
Proposition 195. Let Q be a kernel that satisfies the Foster-Lyapunov drift condition (7.56) with respect to a (1, ε, ν)-small set C and a function V whose level sets are (1, ε, ν)-small. Then for any d > 1, the set C′ = C ∪ {x ∈ X : V(x) ≤ d} is small, C′ × C′ is a (1, ε, ν)-coupling set, and the kernel Q̄ defined as in (7.45) satisfies the drift condition (7.58) with C̄ = C′ × C′, V̄(x, x′) = [V(x) + V(x′)]/2, and λ̄ = λ + b/(1 + d), provided λ̄ < 1.
Proof. For (x, x′) ∉ C̄ we have (1 + d)/2 ≤ V̄(x, x′). Therefore

Q̄V̄(x, x′) ≤ λV̄(x, x′) + b/2 ≤ [λ + b/(1 + d)] V̄(x, x′) ,

and for (x, x′) ∈ C̄ it holds that

Q̄V̄(x, x′) = [QV(x) + QV(x′) − 2εν(V)] / [2(1 − ε)]
           ≤ [λ/(1 − ε)] V̄(x, x′) + [b − εν(V)]/(1 − ε) .
Proposition 196. Assume that Q admits a (1, ε, ν)-coupling set C̄ and that there exists a choice of the kernel Q̄ for which there are a measurable function V̄ : X̄ → [1, ∞), λ̄ ∈ (0, 1) and b̄ > 0 such that

Q̄V̄ ≤ λ̄V̄ + b̄1_C̄ .   (7.58)

Let W : X → [1, ∞) be a measurable function such that W(x) + W(x′) ≤ 2V̄(x, x′) for all (x, x′) ∈ X × X. Then there exist ρ ∈ (0, 1) and c > 0 such that for all (x, x′) ∈ X × X,

‖Qn(x, ·) − Qn(x′, ·)‖_W ≤ c V̄(x, x′) ρn .   (7.59)
Proof. By Proposition 186, proving (7.59) amounts to proving the requested bound for Ē_{x,x′}[V̄(X̄n) Kn(ε)]. We only consider the case ε ∈ (0, 1), the case ε = 1 being easier. Write x̄ = (x, x′). By induction, the drift condition (7.58) implies that

Ē_x̄[V̄(X̄n)] = Q̄n V̄(x̄) ≤ λ̄n V̄(x̄) + b̄ Σ_{j=0}^{n−1} λ̄^j ≤ V̄(x̄) + b̄/(1 − λ̄) .   (7.60)
Recall that Kn(ε) = (1 − ε)^{ηn(C̄)} for ε ∈ (0, 1), where ηn(C̄) = Σ_{j=0}^{n−1} 1_{C̄}(X̄j) is the number of visits to the coupling set C̄ before time n. Hence Kn(ε) is F̄_{n−1}-measurable. Let j ≤ n + 1 be an arbitrary positive integer to be chosen later. Then (7.60) yields

Ē_x̄[ V̄(X̄n) Kn(ε) 1{ηn(C̄) ≥ j} ] ≤ (1 − ε)^j Ē_x̄[V̄(X̄n)] 1{j ≤ n}
                                 ≤ [V̄(x̄) + b̄/(1 − λ̄)] (1 − ε)^j 1{j ≤ n} .   (7.61)

Put M = sup_{x̄∈C̄} Q̄V̄(x̄)/V̄(x̄) and B = 1 ∨ [M(1 − ε)/λ̄]. For k = 0, . . . , n, define Zk = λ̄^{−k} [(1 − ε)/B]^{ηk(C̄)} V̄(X̄k). Because ηn(C̄) is F̄_{n−1}-measurable, we obtain

Ē_x̄[Zn | F̄_{n−1}] = λ̄^{−n} Q̄V̄(X̄_{n−1}) [(1 − ε)/B]^{ηn(C̄)}
  ≤ λ̄^{−n+1} V̄(X̄_{n−1}) [(1 − ε)/B]^{ηn(C̄)} 1_{C̄^c}(X̄_{n−1})
  + λ̄^{−n} M V̄(X̄_{n−1}) [(1 − ε)/B]^{ηn(C̄)} 1_{C̄}(X̄_{n−1}) .

Using the relations ηn(C̄) = η_{n−1}(C̄) + 1_{C̄}(X̄_{n−1}) and M(1 − ε) ≤ Bλ̄, we find that Ē_x̄[Zn | F̄_{n−1}] ≤ Z_{n−1} and, by induction, Ē_x̄[Zn] ≤ Ē_x̄[Z0] = V̄(x̄). Hence, as B ≥ 1,

Ē_x̄[ V̄(X̄n) Kn(ε) 1{ηn(C̄) < j} ] ≤ λ̄n B^j Ē_x̄[Zn] ≤ λ̄n B^j V̄(x̄) .   (7.62)

Gathering (7.61) and (7.62) yields

Ē_x̄[ V̄(X̄n) Kn(ε) ] ≤ [V̄(x̄) + b̄/(1 − λ̄)] [(1 − ε)^j 1{j ≤ n} + λ̄n B^j] .
If B = 1, choosing j = n + 1 yields (7.59) with ρ = λ̄; if B > 1, set j = ⌊αn⌋ with α ∈ (0, 1) such that log λ̄ + α log B < 0. This choice yields (7.59) with ρ = (1 − ε)^α ∨ (λ̄B^α) < 1.
Example 197 (Autoregressive Model, Continued). In the model of Example 148, we have verified that V(x) = 1 + x² satisfies (7.56) when the noise variance is finite. We can deduce from Theorem 194 a variety of results: the stationary distribution has finite variance, and the iterates Qn(x, ·) of the transition kernel converge to the stationary distribution π geometrically fast in V-total variation distance. Thus there exist constants C and ρ < 1 such that for any x ∈ X, ‖Qn(x, ·) − π‖_V ≤ C(1 + x²)ρn. This implies in particular that for any x ∈ X and any function f such that sup_{x∈X} (1 + x²)^{−1} |f(x)| < ∞, Ex[f(Xn)] converges geometrically fast to the limiting value

Eπ[f(Xn)] = ∫ √[(1 − φ²)/(2πσ²)] exp[ −(1 − φ²)x²/(2σ²) ] f(x) dx .

This applies for the mean, f(x) = x, and the second moment, f(x) = x² (though in this case convergence can be derived directly from the autoregression).
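Because the Gaussian case is fully explicit, the geometric decay can be observed directly. The sketch below (an illustration, not part of the text) uses the closed-form n-step law Qn(x, ·) = N(φⁿx, σ²(1 − φ^{2n})/(1 − φ²)) and evaluates the total variation distance to π by numerical integration.

```python
import numpy as np

phi, sigma, x = 0.9, 1.0, 5.0
v_inf = sigma**2 / (1 - phi**2)             # stationary variance
t = np.linspace(-30.0, 30.0, 200_001)
dt = t[1] - t[0]

def gauss(t, m, v):
    return np.exp(-(t - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for n in (1, 5, 10, 20, 40):
    m_n = phi**n * x                        # mean of Q^n(x, .)
    v_n = v_inf * (1 - phi**(2 * n))        # variance of Q^n(x, .)
    tv = 0.5 * np.sum(np.abs(gauss(t, m_n, v_n) - gauss(t, 0.0, v_inf))) * dt
    print(f"n = {n:2d}: TV distance to pi ~ {tv:.5f}")
```

The printed distances decrease geometrically in n, consistent with (7.55).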
7.2.6 Limit Theorems

One of the most important problems in probability theory is the investigation of limit theorems for appropriately normalized sums of random variables. The case of independent random variables is fairly well understood, but less is known about dependent random variables such as Markov chains. The purpose of this section is to study several basic limit theorems for additive functionals of Markov chains.
Law of Large Numbers

Suppose that {Xk} is a Markov chain with transition kernel Q and initial distribution ν. Assume that Q is phi-irreducible and aperiodic and has a stationary distribution π. Let f be a π-integrable function, π(|f|) < ∞. We say that the sequence {f(Xk)} satisfies a law of large numbers (LLN) if for any initial distribution ν on (X, X), the sample mean n^{−1} Σ_{k=1}^{n} f(Xk) converges to π(f) Pν-a.s.

For i.i.d. samples, classical theory shows that the LLN holds provided π(|f|) < ∞. The following theorem shows that the LLN holds for ergodic Markov chains; it does not require any conditions on the rate of convergence to the stationary distribution.
Theorem 198. Let Q be a positive Harris recurrent transition kernel with invariant distribution π. Then for any real π-integrable function f on X and any initial distribution ν on (X, X),

lim_{n→∞} n^{−1} Σ_{k=1}^{n} f(Xk) = π(f)   Pν-a.s.   (7.63)
The LLN can be obtained from general ergodic theorems for stationary processes.
An elementary proof can be given when the chain possesses an accessible atom. The
basic technique is then the regeneration method, which consists in dividing the chain
into blocks between the chain’s successive returns to the atom. These blocks are
independent (see Lemma 199 below) and standard limit theorems for i.i.d. random
variables yield the desired result. When the chain has no atom, one may still employ
this technique by replacing the atom by a suitably chosen small set and using the
splitting technique (see for instance Meyn and Tweedie, 1993, Chapter 17).
Lemma 199. Let Q be a positive Harris recurrent transition kernel that admits an accessible atom α. Define for any measurable function f,

sj(f) = ( Σ_{k=1}^{τα} f(Xk) ) ∘ θ_{τα^{(j−1)}} ,   j ≥ 1 .   (7.64)

Then for any initial distribution ν on (X, X), k ≥ 0 and functions {Ψj} in Fb(R),

Eν[ Π_{j=1}^{k} Ψj(sj(f)) ] = Eν[Ψ1(s1(f))] Π_{j=2}^{k} Eα[Ψj(sj(f))] .
Proof. Because the atom α is accessible and the chain is Harris recurrent, Px(τα^{(k)} < ∞) = 1 for any x ∈ X. By the strong Markov property, for any integer k,

Eν[Ψ1(s1(f)) · · · Ψk(sk(f))]
  = Eν[ Ψ1(s1(f)) · · · Ψ_{k−1}(s_{k−1}(f)) Eα[Ψk(sk(f)) | F_{τα^{(k−1)}}] 1{τα^{(k−1)} < ∞} ]
  = Eν[ Ψ1(s1(f)) · · · Ψ_{k−1}(s_{k−1}(f)) ] Eα[Ψk(s1(f))] .

The desired result is then obtained by induction.
Proof of Theorem 198 when there is an accessible atom. First assume that f is non-negative. Denote the accessible atom by α and define

ηn = Σ_{k=1}^{n} 1α(Xk) ,   (7.65)
the occupation time of the atom α up to time n. We now split the sum Σ_{k=1}^{n} f(Xk) into sums over the excursions between successive visits to α,

Σ_{k=1}^{n} f(Xk) = Σ_{j=1}^{ηn} sj(f) + Σ_{k=τα^{(ηn)}+1}^{n} f(Xk) .

This decomposition shows that

Σ_{j=1}^{ηn} sj(f) ≤ Σ_{k=1}^{n} f(Xk) ≤ Σ_{j=1}^{ηn+1} sj(f) .   (7.66)
Because Q is Harris recurrent and α is accessible, ηn → ∞ Pν-a.s. as n → ∞. Hence s1(f)/ηn → 0 and (ηn − 1)/ηn → 1 Pν-a.s. By Lemma 199, the variables {sj(f)}_{j≥2} are i.i.d. under Pν. In addition, Eν[sj(f)] = µα(f) for j ≥ 2, with µα, defined in (7.30), being an invariant measure. Because all invariant measures are constant multiples of µα and π(|f|) < ∞, Eα[sj(f)] is finite. Writing

(1/ηn) Σ_{j=1}^{ηn} sj(f) = s1(f)/ηn + [(ηn − 1)/ηn] (1/(ηn − 1)) Σ_{j=2}^{ηn} sj(f) ,

the LLN for i.i.d. random variables shows that

lim_{n→∞} (1/ηn) Σ_{j=1}^{ηn} sj(f) = µα(f)   Pν-a.s. ,

whence, by (7.66), the same limit holds for ηn^{−1} Σ_{k=1}^{n} f(Xk). Because π(1) = 1, µα(1) is finite too. Applying the above result with f ≡ 1 yields n/ηn → µα(1), so that n^{−1} Σ_{k=1}^{n} f(Xk) → µα(f)/µα(1) = π(f) Pν-a.s. This is the desired result when f ≥ 0. The general case is handled by splitting f into its positive and negative parts.
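The regeneration argument is easy to visualize on a finite state space, where every state is an atom. The sketch below (an illustration, not part of the text; the matrix and f are arbitrary examples) splits a simulated path at its visits to α = {0} and recovers π(f) as the ratio of the mean block sum to the mean block length, i.e., µα(f)/µα(1).

```python
import numpy as np

rng = np.random.default_rng(6)
Q = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.4, 0.4, 0.2]])
f = np.array([1.0, -2.0, 0.5])

n = 200_000
path = np.empty(n, dtype=int)
path[0] = 0
for k in range(n - 1):
    path[k + 1] = rng.choice(3, p=Q[path[k]])

visits = np.flatnonzero(path == 0)            # successive visits to the atom {0}
blocks = [f[path[a + 1:b + 1]].sum()          # s_j(f): block sums over excursions
          for a, b in zip(visits[:-1], visits[1:])]
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
print(f"sample mean of f    : {f[path].mean():.4f}")
print(f"mean block / length : {np.mean(blocks) / np.mean(np.diff(visits)):.4f}")
print(f"pi(f)               : {pi @ f:.4f}")
```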
Central Limit Theorems

We say that {f(Xk)} satisfies a central limit theorem (CLT) if there is a constant σ²(f) ≥ 0 such that the normalized sum n^{−1/2} Σ_{k=1}^{n} {f(Xk) − π(f)} converges Pν-weakly to a Gaussian distribution with zero mean and variance σ²(f) (we allow for the special case σ²(f) = 0, corresponding to weak convergence to the constant 0). CLTs are essential for understanding the error occurring when approximating π(f) by the sample mean n^{−1} Σ_{k=1}^{n} f(Xk) and are thus a topic of considerable importance.

For i.i.d. samples, classical theory guarantees a CLT as soon as π(|f|²) < ∞. This is not true in general for Markov chains; the CLTs that are available do require some additional assumptions on the rate of convergence and/or the existence of higher order moments of f under the stationary distribution.
Theorem 200. Let Q be a phi-irreducible aperiodic positive Harris recurrent transition kernel with invariant distribution π. Let f be a measurable function and assume that there exists an accessible small set C satisfying

∫_C π(dx) Ex[ ( Σ_{k=1}^{τC} |f|(Xk) )² ] < ∞   and   ∫_C π(dx) Ex[τC²] < ∞ .   (7.67)

Then π(f²) < ∞ and {f(Xk)} satisfies a CLT.
Proof. To start with, it follows from the expression (7.32) for the stationary distribution that

π(f²) = ∫_C π(dx) Ex[ Σ_{k=1}^{τC} f²(Xk) ] ≤ ∫_C π(dx) Ex[ ( Σ_{k=1}^{τC} |f(Xk)| )² ] < ∞ .

We now prove the CLT under the additional assumption that the chain admits an accessible atom α. The proof in the general phi-irreducible case can be obtained using the splitting construction. The proof is along the same lines as for the LLN. Put f̄ = f − π(f). By decomposing the sum Σ_{k=1}^{n} f̄(Xk) into excursions between successive visits to the atom α, we obtain

| n^{−1/2} Σ_{k=1}^{n} f̄(Xk) − n^{−1/2} Σ_{j=2}^{ηn} sj(f̄) | ≤ n^{−1/2} s1(|f̄|) + n^{−1/2} s_{ηn+1}(|f̄|) ,   (7.68)

where ηn and sj(f) are defined in (7.65) and (7.64). It is clear that the first term on the right-hand side of this display vanishes (in Pν-probability) as n → ∞. For the second one, the strong LLN (Theorem 198) shows that n^{−1} Σ_{j=1}^{n} sj²(|f̄|) has a Pν-a.s. finite limit, whence, Pν-a.s.,

lim sup_{n→∞} s_{n+1}²(|f̄|)/n = lim sup_{n→∞} [ ((n + 1)/n) (1/(n + 1)) Σ_{j=1}^{n+1} sj²(|f̄|) − (1/n) Σ_{j=1}^{n} sj²(|f̄|) ] = 0 .

The strong LLN with f = 1α also shows that ηn/n → π(α) Pν-a.s., so that s²_{ηn+1}(|f̄|)/n → 0 and n^{−1/2} s_{ηn+1}(|f̄|) → 0 Pν-a.s.

Thus n^{−1/2} Σ_{k=1}^{n} f̄(Xk) and n^{−1/2} Σ_{j=2}^{ηn} sj(f̄) have the same limiting behavior. By Lemma 199, the blocks {sj(f̄)}_{j≥2} are i.i.d. under Pν. Thus, by the CLT for i.i.d. random variables, n^{−1/2} Σ_{j=2}^{n} sj(f̄) converges Pν-weakly to a Gaussian law with zero mean and some variance σ² < ∞; that the variance is indeed finite follows as above, with the small set C being the accessible atom α. Anscombe's theorem (see for instance Gut, 1988, Theorem 3.1) then implies that ηn^{−1/2} Σ_{j=2}^{ηn} sj(f̄) converges Pν-weakly to the same Gaussian law. Thus we may conclude that n^{−1/2} Σ_{j=2}^{ηn} sj(f̄) = (ηn/n)^{1/2} ηn^{−1/2} Σ_{j=2}^{ηn} sj(f̄) converges Pν-weakly to a Gaussian law with zero mean and variance π(α)σ². By (7.68), so does n^{−1/2} Σ_{k=1}^{n} f̄(Xk).
The condition (7.67) is stated in terms of the second moment of the excursions between two successive visits to a small set and appears rather difficult to verify directly. More explicit conditions can be obtained, in particular if we assume that the chain is V-geometrically ergodic.

Proposition 201. Let Q be a phi-irreducible, aperiodic, positive Harris recurrent kernel that satisfies a Foster-Lyapunov drift condition (see Definition 191) outside an accessible small set C, with drift function V. Then any measurable function f such that |f|² ≤ V satisfies a CLT.
Proof. Minkowski's inequality implies that

{ Ex[ ( Σ_{k=0}^{τC−1} |f(Xk)| )² ] }^{1/2} ≤ Σ_{k=0}^{∞} { Ex[ f²(Xk) 1{τC > k} ] }^{1/2}
                                           ≤ Σ_{k=0}^{∞} { Ex[ V(Xk) 1{τC > k} ] }^{1/2} .

Put Mk = λ^{−k} V(Xk) 1{τC ≥ k}, where λ is as in (7.56). Then for k ≥ 1,

E[M_{k+1} | Fk] ≤ λ^{−(k+1)} E[V(X_{k+1}) | Fk] 1{τC ≥ k+1} ≤ λ^{−k} V(Xk) 1{τC ≥ k+1} ≤ Mk ,

showing that {Mk} is a super-martingale. Thus Ex[Mk] ≤ Ex[M1] for any x ∈ C, which implies that for k ≥ 1,

sup_{x∈C} Ex[ V(Xk) 1{τC ≥ k} ] ≤ λ^{k−1} ( λ sup_{x∈C} V(x) + b ) .

This bound decays geometrically in k, so the series above converges and the first condition in (7.67) holds; the second condition is obtained similarly, and the CLT then follows from Theorem 200.
7.3 Applications to Hidden Markov Models

As discussed in Section 1.2, an HMM is best defined as a Markov chain {Xk, Yk}_{k≥0} on the product space (X × Y, X ⊗ Y). The transition kernel of this joint chain has a simple structure reflecting the conditional independence assumptions that are imposed. Let Q and G denote, respectively, a Markov transition kernel on (X, X) and a transition kernel from (X, X) to (Y, Y). The transition kernel of the joint chain {Xk, Yk}_{k≥0} is given by

T[(x, y), C] = ∫∫_C Q(x, dx′) G(x′, dy′) ,   (x, y) ∈ X × Y, C ∈ X ⊗ Y .   (7.69)

This chain is said to be hidden because only one component (here {Yk}_{k≥0}) is observed. Of course, the process {Yk} is not a Markov chain, but nevertheless most of the properties of this process are inherited from stability properties of the hidden chain. In this section, we establish stability properties of the kernel T of the joint chain.
7.3.1 Phi-irreducibility

Phi-irreducibility of the joint chain is inherited from phi-irreducibility of the hidden chain, and the maximal irreducibility measures of the joint and hidden chains are related in a simple way. Before stating the precise result, we recall (see Section 1.1.1) that if φ is a measure on (X, X), the measure φ ⊗ G on (X × Y, X ⊗ Y) is defined by

φ ⊗ G(A) def= ∫∫_A φ(dx) G(x, dy) ,   A ∈ X ⊗ Y .

Proposition 202. Assume that Q is phi-irreducible, and let φ be an irreducibility measure for Q. Then φ ⊗ G is an irreducibility measure for T. If ψ is a maximal irreducibility measure for Q, then ψ ⊗ G is a maximal irreducibility measure for T.
Proof. Let A ∈ X ⊗ Y be a set such that φ ⊗ G(A) > 0. Denote by ΨA the function ΨA(x) = ∫_Y G(x, dy) 1A(x, y) for x ∈ X. By Fubini's theorem,

φ ⊗ G(A) = ∫∫ φ(dx) G(x, dy) 1A(x, y) = ∫ φ(dx) ΨA(x) ,

and the condition φ ⊗ G(A) > 0 implies that φ({ΨA > 0}) > 0. Because {ΨA > 0} = ∪_{m=1}^{∞} {ΨA ≥ 1/m}, we have φ({ΨA ≥ 1/m}) > 0 for some integer m. Because φ
is an irreducibility measure, for any x ∈ X there exists an integer k ≥ 0 such that Q^k(x, {ΨA ≥ 1/m}) > 0. Therefore for any y ∈ Y,

T^k[(x, y), A] = ∫∫ Q^k(x, dx′) G(x′, dy′) 1A(x′, y′) = ∫ Q^k(x, dx′) ΨA(x′)
  ≥ ∫_{ΨA ≥ 1/m} Q^k(x, dx′) ΨA(x′) ≥ (1/m) Q^k(x, {ΨA ≥ 1/m}) > 0 ,

showing that φ ⊗ G is an irreducibility measure for T.
Moreover, using Theorem 147, we see that a maximal irreducibility measure ψT for T is given by, for any δ ∈ (0, 1) and A ∈ X ⊗ Y,

ψT(A) = ∫∫ φ(dx) G(x, dy) (1 − δ) Σ_{m=0}^{∞} δ^m T^m[(x, y), A]
      = (1 − δ) Σ_{m=0}^{∞} δ^m ∫∫ φ(dx) Q^m(x, dx′) G(x′, dy′) 1A(x′, y′)
      = ∫∫ ψ(dx′) G(x′, dy′) 1A(x′, y′) = ψ ⊗ G(A) ,

where

ψ(B) = ∫ φ(dx) (1 − δ) Σ_{m=0}^{∞} δ^m Q^m(x, B) ,   B ∈ X .

By Theorem 147, ψ is a maximal irreducibility measure for Q. In addition, if ψ̂ is a maximal irreducibility measure for Q, then ψ̂ is equivalent to ψ. Because for any A ∈ X ⊗ Y,

ψ̂ ⊗ G(A) = ∫∫ ψ̂(dx) G(x, dy) 1A(x, y) = ∫∫ ψ ⊗ G(dx, dy) (dψ̂/dψ)(x) 1A(x, y) ,

ψ̂ ⊗ G(A) = 0 whenever ψ ⊗ G(A) = 0. Thus ψ̂ ⊗ G ≪ ψ ⊗ G. Exchanging ψ and ψ̂ shows that ψ ⊗ G and ψ̂ ⊗ G are indeed equivalent, which concludes the proof.
Example 203 (Stochastic Volatility Model). The canonical stochastic volatility model (see Example ??) is given by

X_{k+1} = φXk + σUk ,   Uk ∼ N(0, 1) ,
Yk = β exp(Xk/2) Vk ,   Vk ∼ N(0, 1) .

We have established (see Example 148) that because {Uk} has a positive density on R, the chain {Xk} is phi-irreducible and λ^Leb is an irreducibility measure. Therefore {Xk, Yk} is also phi-irreducible, and λ^Leb ⊗ λ^Leb is a maximal irreducibility measure.
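Sampling from the joint kernel T of (7.69) is straightforward: one alternates a Q-move of the hidden log-volatility with a G-draw of the observation. A short sketch (an illustration, not part of the text; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
phi, sigma, beta = 0.98, 0.2, 0.7
n = 10_000
X = np.empty(n); Y = np.empty(n)
X[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))   # hidden chain started at stationarity
Y[0] = beta * np.exp(X[0] / 2) * rng.standard_normal()
for k in range(n - 1):
    X[k + 1] = phi * X[k] + sigma * rng.standard_normal()            # Q-move
    Y[k + 1] = beta * np.exp(X[k + 1] / 2) * rng.standard_normal()   # G-draw
print(f"Var(X) ~ {X.var():.4f} (stationary value sigma^2/(1-phi^2) = {sigma**2/(1-phi**2):.4f})")
```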
7.3.2 Atoms and Small Sets

It is possible to relate atoms and small sets of the joint chain to those of the hidden chain. Examples of HMMs possessing accessible atoms are numerous, even when the state space of the joint chain is general. They include in particular the Markov chains whose hidden state space X is finite.

Example 204 (Normal HMM, Continued). For the normal HMM (see Example ??), it holds that T[(x, y), ·] = T[(x, y′), ·] for any (y, y′) ∈ R × R. Hence {x} × R is an atom for T.
When accessible atoms do not exist, it is important to determine small sets. Here again the small sets of the joint chain can easily be related to those of the hidden chain.

Lemma 205. Let m be a positive integer, ε > 0 and let η be a probability measure on (X, X). Let C ∈ X be an (m, ε, η)-small set for the transition kernel Q, that is, Q^m(x, A) ≥ ε 1C(x) η(A) for all x ∈ X and A ∈ X. Then C × Y is an (m, ε, η ⊗ G)-small set for the transition kernel T defined in (1.14), that is,

T^m[(x, y), A] ≥ ε 1C(x) η ⊗ G(A) ,   (x, y) ∈ X × Y, A ∈ X ⊗ Y .

Proof. Pick (x, y) ∈ C × Y. Then

T^m[(x, y), A] = ∫∫ Q^m(x, dx′) G(x′, dy′) 1A(x′, y′) ≥ ε ∫∫ η(dx′) G(x′, dy′) 1A(x′, y′) .
If the Markov transition kernel Q on (X, X) is phi-irreducible (with maximal irreducibility measure ψ), then we know from Proposition 157 that there exists an accessible small set C. That is, there exists a set C ∈ X with Px(τC < ∞) > 0 for all x ∈ X and such that C is (m, ε, η)-small for some triple (m, ε, η) with η(C) > 0. Then Lemma 205 shows that C × Y is an (m, ε, η ⊗ G)-small set for the transition kernel T.
Example 206 (Stochastic Volatility Model, Continued). We have shown in Example 148 that any compact set K ⊂ R is small for the first-order autoregression constituting the hidden chain of the stochastic volatility model of Example 203. Therefore any set K × R, where K is a compact subset of R, is small for the joint chain {Xk, Yk}.
The simple relations between the small sets of the joint chain and those of the hidden chain immediately imply that T and Q have the same period.

Proposition 207. Suppose that Q is phi-irreducible and has period d. Then T is phi-irreducible and has the same period d. In particular, if Q is aperiodic, then so is T.

Proof. Let C be an accessible (m, ε, η)-small set for Q with η(C) > 0. Define E_C as the set of time indices for which C is a small set with minorizing probability measure η,

E_C def= {n ≥ 1 : C is (n, εn, η)-small for some εn > 0} .

The period of the set C is given by the greatest common divisor of E_C. Proposition 180 shows that this value is in fact common to the chain as such and does not depend on the particular small set chosen. Lemma 205 shows that C × Y is an (m, ε, η ⊗ G)-small set for the joint Markov chain with transition kernel T, and that η ⊗ G(C × Y) = η(C) > 0. Using Lemma 205 again, the set E_{C×Y} of time indices for which C × Y is a small set for T with minorizing measure η ⊗ G is thus equal to E_C. Thus the period of the set C is also the period of the set C × Y. Because the period of T does not depend on the choice of the small set C × Y, it follows that the periods of Q and T coincide.
7.3.3 Recurrence and Positive Recurrence

As the following result shows, recurrence and transience of the joint chain follow directly from the corresponding properties of the hidden chain.

Proposition 208. Assume that the hidden chain is phi-irreducible. Then the following statements hold true.

(i) The joint chain is transient (recurrent) if and only if the hidden chain is transient (recurrent).

(ii) The joint chain is positive if and only if the hidden chain is positive. In addition, if the hidden chain is positive with stationary distribution π, then π ⊗ G is the stationary distribution of the joint chain.

Proof. First assume that the transition kernel Q is transient, that is, that there is a countable cover X = ∪i Ai of X with uniformly transient sets,

sup_{x∈Ai} Ex[ Σ_{n=1}^{∞} 1_{Ai}(Xn) ] < ∞ .

Then the sets {Ai × Y}_{i≥1} form a countable cover of X × Y, and these sets are uniformly transient because

E_{(x,y)}[ Σ_{n=1}^{∞} 1_{Ai×Y}(Xn, Yn) ] = Ex[ Σ_{n=1}^{∞} 1_{Ai}(Xn) ] .   (7.70)
Thus the joint chain is transient.

Conversely, assume that the joint chain is transient. Because the hidden chain is phi-irreducible, Proposition 158 shows that there is a countable cover X = ∪i Ai of X with sets that are small for Q. At least one of these, say A1, is accessible for Q. By Lemma 205, the sets Ai × Y are small. By Proposition 202, A1 × Y is accessible and, because T is transient, Proposition 159 shows that A1 × Y is uniformly transient. Equation (7.70) then shows that A1 is uniformly transient, and because A1 is accessible, we conclude that Q is transient.

Thus the hidden chain is transient if and only if the joint chain is. The transience/recurrence dichotomy (Theorem 151) then implies that the hidden chain is recurrent if and only if the joint chain is, which completes the proof of (i).
We now turn to (ii). First assume that the hidden chain is positive recurrent, that is, that there exists a unique stationary probability measure π satisfying πQ = π. Then the probability measure π ⊗ G is stationary for the transition kernel T of the joint chain, because

    (π ⊗ G)T(A) = ∫∫∫∫ π(dx) G(x, dy) Q(x, dx′) G(x′, dy′) 1_A(x′, y′)
                = ∫∫∫ π(dx) Q(x, dx′) G(x′, dy′) 1_A(x′, y′)
                = ∫∫ π(dx′) G(x′, dy′) 1_A(x′, y′) = π ⊗ G(A) .

Because the joint chain admits a stationary distribution it is positive, and by Proposition 179 it is recurrent.
Conversely, assume that the joint chain is positive. Denote by π̄ the (unique) stationary probability measure of T. Thus for any Ā ∈ X ⊗ Y, we have

    ∫∫ π̄(dx, dy) Q(x, dx′) G(x′, dy′) 1_Ā(x′, y′)
        = ∫∫ π̄(dx, Y) Q(x, dx′) G(x′, dy′) 1_Ā(x′, y′) = π̄(Ā) .

Setting Ā = A × Y for A ∈ X, this display implies that

    ∫ π̄(dx, Y) Q(x, A) = π̄(A × Y) .

This shows that π(A) = π̄(A × Y) is a stationary distribution for the hidden chain.
Hence the hidden chain is positive and recurrent.
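
Proposition 208(ii) is easy to verify numerically on a finite state space. The sketch below (with a hypothetical two-state chain and binary emissions) computes π by fixed-point iteration and checks that π ⊗ G is invariant for T and that its X-marginal is π.

    import numpy as np

    # Hypothetical two-state hidden chain with binary emissions.
    Q = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
    G = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    # Stationary distribution of the hidden chain by iterating pi -> pi Q.
    pi = np.ones(2) / 2
    for _ in range(500):
        pi = pi @ Q                      # converges since Q is primitive

    # Joint kernel T[(x, y), (x', y')] = Q(x, x') G(x', y').
    T = np.zeros((4, 4))
    for x in range(2):
        for y in range(2):
            for xp in range(2):
                for yp in range(2):
                    T[2 * x + y, 2 * xp + yp] = Q[x, xp] * G[xp, yp]

    # pi ⊗ G is stationary for T, and its X-marginal is pi.
    pibar = np.array([pi[x] * G[x, y] for x in range(2) for y in range(2)])
    assert np.allclose(pibar @ T, pibar)
    assert np.allclose(pibar.reshape(2, 2).sum(axis=1), pi)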

When the joint (or hidden) chain is positive, it is natural to study the rate at
which it converges to stationarity.

Proposition 209. Assume that the hidden chain satisfies a uniform Doeblin condition, that is, there exist a positive integer m, an ε > 0, and a family {η_{x,x′}, (x, x′) ∈ X × X} of probability measures such that

    Q^m(x, A) ∧ Q^m(x′, A) ≥ ε η_{x,x′}(A) ,    A ∈ X, (x, x′) ∈ X × X .

Then the joint chain also satisfies a uniform Doeblin condition. Indeed, for all (x, y) and (x′, y′) in X × Y and all Ā ∈ X ⊗ Y,

    T^m[(x, y), Ā] ∧ T^m[(x′, y′), Ā] ≥ ε η̄_{x,x′}(Ā) ,

where

    η̄_{x,x′}(Ā) = ∫∫ η_{x,x′}(dx) G(x, dy) 1_Ā(x, y) .

The proof is along the same lines as the proof of Lemma 205 and is omitted. This proposition in particular implies that the ergodicity coefficients of the kernels T^m and Q^m coincide: δ(T^m) = δ(Q^m). A straightforward but useful application of this result is when the hidden Markov chain is defined on a finite state space. If the transition matrix Q of this chain is primitive, that is, if there exists a positive integer m such that Q^m(x, x′) > 0 for all (x, x′) ∈ X × X (or, equivalently, if the chain Q is irreducible and aperiodic), then the joint Markov chain satisfies a uniform Doeblin condition and the ergodicity coefficient of the joint chain is bounded as δ(T^m) ≤ 1 − ε with

    ε = inf_{(x,x′)∈X×X} sup_{x′′∈X} [Q^m(x, x′′) ∧ Q^m(x′, x′′)] .
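
For a finite state space, both ε and the ergodicity (Dobrushin) coefficient can be computed directly. The following sketch, with a hypothetical primitive transition matrix, evaluates ε as displayed above and checks the bound δ(T^m) ≤ 1 − ε (here, in fact, δ(T^m) = δ(Q^m)).

    import numpy as np
    from itertools import product

    Q = np.array([[0.6, 0.4],           # hypothetical primitive matrix
                  [0.2, 0.8]])
    G = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    m = 1
    Qm = np.linalg.matrix_power(Q, m)

    # eps = inf over pairs (x, x') of sup over x'' of Q^m(x,.) ∧ Q^m(x',.).
    eps = min(max(min(Qm[x, z], Qm[xp, z]) for z in range(2))
              for x, xp in product(range(2), repeat=2))

    # Dobrushin (ergodicity) coefficient: maximal total variation distance
    # between two rows of the m-step kernel.
    def dobrushin(P):
        return max(0.5 * np.abs(P[s] - P[sp]).sum()
                   for s, sp in product(range(P.shape[0]), repeat=2))

    T = np.zeros((4, 4))
    for x, y, xp, yp in product(range(2), repeat=4):
        T[2 * x + y, 2 * xp + yp] = Q[x, xp] * G[xp, yp]
    Tm = np.linalg.matrix_power(T, m)

    print(eps, dobrushin(Qm), dobrushin(Tm))   # delta(T^m) = delta(Q^m) <= 1 - eps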

A similar result holds when the hidden chain satisfies a Foster-Lyapunov drift
condition instead of a uniform Doeblin condition. This result is of particular interest
when dealing with hidden Markov models on state spaces that are not finite or
bounded.

Proposition 210. Assume that Q is phi-irreducible, aperiodic, and satisfies a Foster-Lyapunov drift condition (Definition 191) with drift function V outside a set C. Then the transition kernel T also satisfies a Foster-Lyapunov drift condition with drift function V outside the set C × Y,

    T[(x, y), V] ≤ λ V(x) + b 1_{C×Y}(x, y) .

Here on the left-hand side, we wrote V also for the function on X × Y defined by V(x, y) = V(x).

The proof is straightforward: because V(x, y) = V(x) does not depend on the observation, T[(x, y), V] = ∫∫ Q(x, dx′) G(x′, dy′) V(x′) = QV(x), and the drift condition for Q yields the displayed bound. Proposition 195 yields an explicit bound on the rate of convergence of the iterates of the Markov chain to the stationary distribution. This result has many interesting consequences.

Proposition 211. Suppose that Q is phi-irreducible, aperiodic, and satisfies a Foster-Lyapunov drift condition with drift function V outside a small set C. Then the transition kernel T is positive and aperiodic with invariant distribution π ⊗ G, where π is the invariant distribution of Q. In addition, for any measurable function f : X × Y → R, the following statements hold true.

(i) If sup_{x∈X} [V(x)]^{-1} ∫ G(x, dy) |f(x, y)| < ∞, then there exist ρ ∈ (0, 1) and K < ∞ (not depending on f) such that for any n ≥ 0 and (x, y) ∈ X × Y,

    |T^n f(x, y) − π ⊗ G(f)| ≤ K ρ^n V(x) sup_{x′∈X} [V(x′)]^{-1} ∫ G(x′, dy) |f(x′, y)| .

(ii) If sup_{x∈X} [V(x)]^{-1} ∫ G(x, dy) f^2(x, y) < ∞, then E_{π⊗G}[f^2(X_0, Y_0)] < ∞ and there exist ρ ∈ (0, 1) and K < ∞ (not depending on f) such that for any n ≥ 0,

    |Cov_π[f(X_n, Y_n), f(X_0, Y_0)]| ≤ K ρ^n π(V) ( sup_{x∈X} [V(x)]^{-1/2} ∫ G(x, dy) |f(x, y)| )^2 .

Proof. First note that

    |T^n f(x, y) − π ⊗ G(f)| = | ∫∫ [Q^n(x, dx′) − π(dx′)] G(x′, dy′) f(x′, y′) |
                             ≤ ‖Q^n(x, ·) − π‖_V sup_{x′∈X} [V(x′)]^{-1} ∫ G(x′, dy) |f(x′, y)| .

Now part (i) follows from the geometric ergodicity of Q (Theorem 194). Next, because π(V) < ∞,

    E_{π⊗G}[f^2(X_0, Y_0)] = ∫∫ π(dx) G(x, dy) f^2(x, y)
                           ≤ π(V) sup_{x∈X} [V(x)]^{-1} ∫ G(x, dy) f^2(x, y) < ∞ ,

implying that |Cov_π[|f(X_n, Y_n)|, |f(X_0, Y_0)|]| ≤ Var_π[f(X_0, Y_0)] < ∞. In addition,

    Cov_π[f(X_n, Y_n), f(X_0, Y_0)] = E_π{ E[f(X_n, Y_n) − π ⊗ G(f) | F_0] f(X_0, Y_0) }
        = ∫∫ π ⊗ G(dx, dy) f(x, y) ∫∫ [Q^n(x, dx′) − π(dx′)] G(x′, dy′) f(x′, y′) .    (7.71)

By Jensen's inequality, ∫ G(x, dy) |f(x, y)| ≤ [∫ G(x, dy) f^2(x, y)]^{1/2}, and

    Q V^{1/2}(x) ≤ [QV(x)]^{1/2} ≤ [λ V(x) + b 1_C(x)]^{1/2} ≤ λ^{1/2} V^{1/2}(x) + b^{1/2} 1_C(x) ,

showing that Q also satisfies a Foster-Lyapunov condition outside C with drift function V^{1/2}. By Theorem 194, there exist ρ ∈ (0, 1) and a constant K such that

    | ∫∫ [Q^n(x, dx′) − π(dx′)] G(x′, dy′) f(x′, y′) |
        ≤ ‖Q^n(x, ·) − π‖_{V^{1/2}} sup_{x′∈X} [V(x′)]^{-1/2} ∫ G(x′, dy) |f(x′, y)|
        ≤ K ρ^n V^{1/2}(x) sup_{x′∈X} [V(x′)]^{-1/2} ∫ G(x′, dy) |f(x′, y)| .

Part (ii) follows by plugging this bound into (7.71).



Example 212 (Stochastic Volatility Model, Continued). In the model of Example 203, we set V(x) = e^{x^2/2δ^2} for δ^2 > σ_U^2. It is easily shown that

    QV(x) = (ρ/σ_U) exp[ (x^2/2δ^2) · φ^2(ρ^2 + δ^2)/δ^2 ] ,

where ρ^2 = σ_U^2 δ^2/(δ^2 − σ_U^2). We may choose δ large enough that φ^2(ρ^2 + δ^2)/δ^2 < 1. Then lim sup_{|x|→∞} QV(x)/V(x) = 0, so that Q satisfies a Foster-Lyapunov condition with drift function V(x) = e^{x^2/2δ^2} outside a compact set [−M, +M]. Because every compact set is small, the assumptions of Proposition 211 are satisfied, showing that the joint chain is positive. Set f(x, y) = |y|. Then ∫ G(x, dy) |y| = β e^{x/2} √(2/π). Proposition 211(ii) shows that Var_π(Y_0) < ∞ and that the autocovariance function Cov(|Y_n|, |Y_0|) decreases to zero exponentially fast.
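
Both the closed-form expression for QV and the exponential decay of Cov(|Y_n|, |Y_0|) lend themselves to a quick numerical check. The sketch below uses hypothetical parameter values for φ, σ_U, β, and δ (any choice with δ^2 > σ_U^2 and φ^2(ρ^2 + δ^2)/δ^2 < 1 will do): the first part compares a Monte Carlo estimate of QV(x) with the formula above, and the second simulates the joint chain and prints empirical autocovariances of |Y_n| at a few lags, which decay roughly geometrically.

    import numpy as np

    rng = np.random.default_rng(0)
    phi, sigma_u, beta, delta = 0.9, 0.5, 1.0, 2.0   # hypothetical values
    rho2 = sigma_u**2 * delta**2 / (delta**2 - sigma_u**2)
    lam = phi**2 * (rho2 + delta**2) / delta**2       # < 1 here, as required

    def V(x):
        return np.exp(x**2 / (2 * delta**2))

    # Monte Carlo check of QV(x) = (rho/sigma_u) exp[lam x^2/(2 delta^2)].
    x = 1.3
    u = rng.standard_normal(1_000_000)
    print(np.mean(V(phi * x + sigma_u * u)))          # simulated QV(x)
    print(np.sqrt(rho2) / sigma_u * np.exp(lam * x**2 / (2 * delta**2)))

    # Simulate the joint chain and watch Cov(|Y_n|, |Y_0|) decay with n.
    n = 500_000
    xs = np.empty(n)
    xs[0] = 0.0
    for k in range(1, n):
        xs[k] = phi * xs[k - 1] + sigma_u * rng.standard_normal()
    ys = beta * np.exp(xs / 2) * rng.standard_normal(n)
    a = np.abs(ys) - np.abs(ys).mean()
    for lag in (1, 5, 10, 20, 40):
        print(lag, np.mean(a[:-lag] * a[lag:]))   # roughly geometric in lag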