Inference in Hidden Markov Models
I State Inference

2 Filtering and Smoothing Recursions
  2.1 Basic Notations and Definitions
    2.1.1 Likelihood
    2.1.2 Smoothing
    2.1.3 The Forward-Backward Decomposition
    2.1.4 Implicit Conditioning (Please Read This Section!)
  2.2 Forward-Backward
    2.2.1 The Forward-Backward Recursions
    2.2.2 Filtering and Normalized Recursion
  2.3 Markovian Decompositions
    2.3.1 Forward Decomposition
    2.3.2 Backward Decomposition
  4.3.2 Resampling
  4.4 Complements
    4.4.1 Implementation of Multinomial Resampling
    4.4.2 Alternatives to Multinomial Resampling

II Parameter Inference

5 Maximum Likelihood Inference, Part I: Optimization Through Exact Smoothing
  5.1 Likelihood Optimization in Incomplete Data Models
    5.1.1 Problem Statement and Notations
    5.1.2 The Expectation-Maximization Algorithm
    5.1.3 Gradient-based Methods
  5.2 Application to HMMs
    5.2.1 Hidden Markov Models as Missing Data Models
    5.2.2 EM in HMMs
    5.2.3 Computing Derivatives
  5.3 The Example of Normal Hidden Markov Models
    5.3.1 EM Parameter Update Formulas
    5.3.2 Estimation of the Initial Distribution
    5.3.3 Computation of the Score and Observed Information
  5.4 The Example of Gaussian Linear State-Space Models
    5.4.1 The Intermediate Quantity of EM
  5.5 Complements
    5.5.1 Global Convergence of the EM Algorithm
    5.5.2 Rate of Convergence of EM
    5.5.3 Generalized EM Algorithms
We now formally describe hidden Markov models, setting the notations that will be
used throughout the book. We start by reviewing the basic definitions and concepts
pertaining to Markov chains.
• For any positive measure µ on (X, X) and any real measurable function f on (Y, Y),
\[ (\mu Q)(f) = \mu(Qf) = \iint \mu(dx)\, Q(x, dy)\, f(y) , \]
provided the integrals are well-defined. We may thus use the simplified notation µQf instead of (µQ)(f) or µ(Qf).
The reverse kernel does not necessarily exist and is not uniquely defined. Nevertheless, if $\overleftarrow{Q}_{\nu,1}$ and $\overleftarrow{Q}_{\nu,2}$ satisfy (1.3), then for all A ∈ X, $\overleftarrow{Q}_{\nu,1}(y, A) = \overleftarrow{Q}_{\nu,2}(y, A)$ for νQ-almost every y in Y. The reverse kernel does exist if X and Y are Polish spaces endowed with their Borel σ-fields. If Q admits a density q with respect to a measure µ on (Y, Y), then $\overleftarrow{Q}_\nu$ can be defined for all y such that $\int_{\mathsf{X}} q(z, y)\, \nu(dz) \neq 0$ by
\[ \overleftarrow{Q}_\nu(y, dx) = \frac{q(x, y)\, \nu(dx)}{\int_{\mathsf{X}} q(z, y)\, \nu(dz)} . \tag{1.4} \]
The values of $\overleftarrow{Q}_\nu$ on the set $\{ y \in \mathsf{Y} : \int_{\mathsf{X}} q(z, y)\, \nu(dz) = 0 \}$ are irrelevant because this set is νQ-negligible. In particular, if X is discrete and µ is counting measure, then for all (x, y) ∈ X × Y such that νQ(y) ≠ 0,
\[ \overleftarrow{Q}_\nu(y, x) = \frac{\nu(x)\, Q(x, y)}{\nu Q(y)} . \tag{1.5} \]
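For a discrete state space, (1.5) is a simple matrix computation. The sketch below (a hypothetical NumPy illustration, not part of the original text; the kernel and initial distribution are made up) computes the reverse kernel of a two-state chain and checks that its rows are probability distributions.

    import numpy as np

    # Transition kernel Q and initial distribution nu on X = {0, 1}
    Q = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    nu = np.array([0.3, 0.7])

    nuQ = nu @ Q                                   # marginal of X_1: (nu Q)(y)
    # Reverse kernel (1.5): Q_rev[y, x] = nu(x) Q(x, y) / (nu Q)(y)
    Q_rev = (nu[:, None] * Q).T / nuQ[:, None]
    assert np.allclose(Q_rev.sum(axis=1), 1.0)     # each row sums to one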
The distribution of X0 is called the initial distribution of the chain, and X is called
the state space.
If {Xk}k≥0 is F-adapted, then for all k ≥ 0 it holds that $\mathcal{F}^X_k \subseteq \mathcal{F}_k$; hence a
Markov chain with respect to a filtration F is also a Markov chain with respect to
its natural filtration. Hereafter, a Markov chain with respect to its natural filtration
will simply be referred to as a Markov chain. When there is no risk of confusion,
we will not mention the underlying probability measure P.
A fundamental property of a Markov chain is that its finite-dimensional distri-
butions, and hence the distribution of the process {Xk }k≥0 , are entirely determined
by the initial distribution and the transition kernel.
Proposition 4. Let {Xk}k≥0 be a Markov chain with initial distribution ν and transition kernel Q. For any k ≥ 0 and any bounded $\mathcal{X}^{\otimes(k+1)}$-measurable function f on $\mathsf{X}^{k+1}$,
\[ \mathrm{E}\left[ f(X_0, \ldots, X_k) \right] = \int f(x_0, \ldots, x_k)\, \nu(dx_0)\, Q(x_0, dx_1) \cdots Q(x_{k-1}, dx_k) . \]
In the following, we will use the generic notation f ∈ Fb(Z) to denote the fact that f is a measurable bounded function on (Z, Z). In the case of Proposition 4 for instance, one considers functions f that are in $F_b(\mathsf{X}^{k+1})$. More generally, we will usually describe measures and transition kernels on (Z, Z) by specifying the way they operate on the functions of Fb(Z).
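When X is finite, the integral in Proposition 4 is a finite sum over state paths. A minimal sketch (hypothetical two-state example, evaluating the expectation by brute-force enumeration):

    import itertools
    import numpy as np

    Q = np.array([[0.9, 0.1], [0.2, 0.8]])       # transition kernel
    nu = np.array([0.3, 0.7])                    # initial distribution
    f = lambda x0, x1, x2: float(x0 == x2)       # some bounded function of (X0, X1, X2)

    # E[f(X0, X1, X2)] = sum over paths of nu(x0) Q(x0, x1) Q(x1, x2) f(x0, x1, x2)
    E = sum(nu[x0] * Q[x0, x1] * Q[x1, x2] * f(x0, x1, x2)
            for x0, x1, x2 in itertools.product(range(2), repeat=3))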
Canonical Version
The iterates of the shift operator are defined inductively by θ^0 = Id (the identity), θ^1 = θ and θ^k = θ ∘ θ^{k−1} for k ≥ 1. If {Xk}k≥0 is the coordinate process with associated natural filtration F^X, then for all k, n ≥ 0, Xk ∘ θ^n = Xk+n, and more generally, for any $\mathcal{F}^X_k$-measurable random variable Y, Y ∘ θ^n is $\mathcal{F}^X_{n+k}$-measurable.
The following theorem, which is a particular case of the Kolmogorov consistency
theorem, states that it is always possible to define a Markov chain on the canonical
space.
Markov Properties
More generally, an induction argument easily yields the Markov property: for any $\mathcal{F}^X_\infty$-measurable random variable Y,
The Markov property can be extended to a specific class of random times known as
stopping times. Let N̄ = N∪{+∞} denote the extended integer set and let (Ω, F, F)
be a filtered space. Then, a mapping τ : Ω → N̄ is said to be an F-stopping time if
{τ = n} ∈ Fn for all n ≥ 0. Intuitively, this means that at any time n one should
be able to tell, based on the information Fn available at that time, if the stopping
time occurs at this time n (or before then) or not. The class $\mathcal{F}_\tau$ defined by
\[ \mathcal{F}_\tau = \{ B \in \mathcal{F}_\infty : B \cap \{\tau = n\} \in \mathcal{F}_n \text{ for all } n \geq 0 \} , \]
Theorem 6 (Strong Markov Property). Let {Xk}k≥0 be the canonical version of a Markov chain and let τ be an F^X-stopping time. Then for any bounded $\mathcal{F}^X_\infty$-measurable function Ψ,

\[ \mathrm{P}(X_{k+1} \in A \mid \mathcal{F}_k) = Q_k(X_k, A) . \]
For i ≤ j we define
\[ Q_{i,j} = Q_i Q_{i+1} \cdots Q_j . \]
With this notation, if ν denotes the distribution of X0 (which we refer to as the initial distribution as in the homogeneous case), the distribution of Xn is $\nu Q_{0,n-1}$.
An important example of a non-homogeneous Markov chain is the so-called reverse
chain. The construction of the reverse chain is based on the observation that if
{Xk }k≥0 is a Markov chain, then for any index n ≥ 1 the time-reversed (or, index-
reversed) process {Xn−k }nk=0 is a Markov chain too. The definition below provides
its transition kernels.
The relation (1.12) is referred to as the local balance equations (or detailed balance equations). If the state space is countable, these equations hold if for all x, x′ ∈ X,
\[ \nu(x)\, Q(x, x') = \nu(x')\, Q(x', x) . \tag{1.13} \]
Upon choosing a function f that only depends on the second variable in (1.12),
it is easily seen that νQ(f ) = ν(f ) for all functions f ∈ Fb (X). We can also write
this as ν = νQ. This equation is referred to as the global balance equations. By
induction, we find that νQn = ν for all n ≥ 0. The left-hand side of this equation
is the distribution of Xn , which thus does not depend on n when global balance
holds. This is a form of stationarity, obviously implied by local balance. We shall
tie this form of stationarity to the following customary definition.
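For a finite chain, both balance conditions are elementary matrix identities. A small numerical check (illustrative values only; note that every two-state chain is reversible with respect to its stationary distribution):

    import numpy as np

    Q = np.array([[0.9, 0.1], [0.2, 0.8]])
    nu = np.array([2 / 3, 1 / 3])                # stationary distribution of Q

    # Local balance (1.13): nu(x) Q(x, x') = nu(x') Q(x', x) for all x, x'
    flux = nu[:, None] * Q
    assert np.allclose(flux, flux.T)
    # Global balance: nu Q = nu, hence nu Q^n = nu for all n
    assert np.allclose(nu @ Q, nu)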
Definition 10 (Stationary Process). A stochastic process {Xk } is said to be sta-
tionary (under P) if its finite-dimensional distributions are translation invariant,
that is, if for all k, n ≥ 1 and all n1 , . . . , nk , the distribution of the random vector
(Xn1 +n , . . . , Xnk +n ) does not depend on n.
A stochastic process with index set N, stationary but otherwise general, can
always be extended to a process with index set Z, having the same finite-dimensional
distributions (and hence being stationary). This is a consequence of Kolmogorov’s
existence theorem for stochastic processes.
For a Markov chain, any multi-dimensional distribution can be expressed in
terms of the initial distribution and the transition kernel—this is Proposition 4—
and hence the characterization of stationarity becomes much simpler than above.
Indeed, a Markov chain is stationary if and only if its initial distribution ν and
transition kernel Q satisfy νQ = ν, that is, satisfy global balance. Much more will
be said about stationary distributions of Markov chains in Chapter 7.
Definition 11 (Hidden Markov Model). Let (X, X) and (Y, Y) be two measurable spaces and let Q and G denote, respectively, a Markov transition kernel on (X, X) and a transition kernel from (X, X) to (Y, Y). Consider the Markov transition kernel defined on the product space (X × Y, X ⊗ Y) by
\[ T[(x, y), C] = \iint_C Q(x, dx')\, G(x', dy') , \qquad (x, y) \in \mathsf{X} \times \mathsf{Y},\ C \in \mathcal{X} \otimes \mathcal{Y} . \tag{1.14} \]
The Markov chain {Xk, Yk}k≥0 with Markov transition kernel T and initial distribution ν ⊗ G, where ν is a probability measure on (X, X), is called a hidden Markov model.
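Definition 11 translates directly into a sampling procedure: move the hidden chain with Q, then draw each observation from G given the current state. A sketch for a two-state chain with Gaussian emissions (all numerical values are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    Q = np.array([[0.95, 0.05], [0.10, 0.90]])   # hidden-chain kernel
    nu = np.array([0.5, 0.5])                    # initial distribution of X0
    means, sigma = np.array([0.0, 2.0]), 1.0     # G(x, .) = N(means[x], sigma^2)

    n = 100
    X = np.empty(n, dtype=int)
    Y = np.empty(n)
    X[0] = rng.choice(2, p=nu)
    Y[0] = rng.normal(means[X[0]], sigma)        # Y0 ~ G(X0, .)
    for k in range(1, n):
        X[k] = rng.choice(2, p=Q[X[k - 1]])      # Xk ~ Q(X_{k-1}, .)
        Y[k] = rng.normal(means[X[k]], sigma)    # Yk ~ G(Xk, .)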
Although the definition above concerns the joint process {Xk , Yk }k≥0 , the term
hidden is only justified in cases where {Xk }k≥0 is not observable. In this respect,
{Xk }k≥0 can also be seen as a fictitious intermediate process that is useful only
in defining the distribution of the observed process {Yk }k≥0 . We shall denote by
Pν and Eν the probability measure and corresponding expectation associated with the process {Xk, Yk}k≥0 on the canonical space $((\mathsf{X} \times \mathsf{Y})^{\mathbb{N}}, (\mathcal{X} \otimes \mathcal{Y})^{\otimes \mathbb{N}})$. Notice that
this constitutes a slight departure from the Markov notations introduced previously,
as ν is a probability measure on X only and not on the state space X × Y of the
joint process. This slight abuse of notation is justified by the special structure of
the model considered here. Equation (1.14) shows that whatever the distribution
of the initial joint state (X0 , Y0 ), even if it were not of the form ν × G, the law of
{Xk , Yk }k≥1 only depends on the marginal distribution of X0 . Hence it makes sense
to index probabilities and expectations by this marginal initial distribution only.
If both X and Y are countable, the hidden Markov model is said to be discrete,
which is the case originally considered by Baum and Petrie (1966).
In the third part of the book (Chapter 5 and following) where we consider
statistical estimation for HMMs with unknown parameters, we will require even
stronger conditions and assume that the model is fully dominated in the following
sense.
Note that for such models, we will generally re-use the notation ν to denote the
probability density function of the initial state X0 (with respect to λ) rather than
the distribution itself.
Proposition 14. Let {Xk, Yk}k≥0 be a Markov chain over the product space X × Y with transition kernel T given by (1.14). Then, for any integer p, any ordered set {k1 < · · · < kp} of indices and all functions f1, . . . , fp ∈ Fb(Y),
\[ \mathrm{E}_\nu\left[ \prod_{i=1}^{p} f_i(Y_{k_i}) \,\middle|\, X_{k_1}, \ldots, X_{k_p} \right] = \prod_{i=1}^{p} \int_{\mathsf{Y}} f_i(y)\, G(X_{k_i}, dy) . \tag{1.17} \]
Because $\int G(x_i, dy_i) = 1$,
\[ \mathrm{E}_\nu\left[ \prod_{i=1}^{p} f_i(Y_{k_i})\, h(X_{k_1}, \ldots, X_{k_p}) \right] = \mathrm{E}_\nu\left[ h(X_{k_1}, \ldots, X_{k_p}) \prod_{i \in \{k_1, \ldots, k_p\}} \int f_i(y_i)\, G(X_i, dy_i) \right] . \]
Corollary 15.
(i) For any integer p and any ordered set {k1 < · · · < kp} of indices, the random variables Yk1, . . . , Ykp are Pν-conditionally independent given (Xk1, Xk2, . . . , Xkp).
(ii) For any integers k and p and any ordered set {k1 < · · · < kp} of indices such that k ∉ {k1, . . . , kp}, the random variables Yk and (Xk1, . . . , Xkp) are Pν-conditionally independent given Xk.
Proof. Part (i) is an immediate consequence of Proposition 14. To prove (ii), note that for any f ∈ Fb(Y) and h ∈ Fb(X^p),
\[ \mathrm{E}_\nu[f(Y_k)\, h(X_{k_1}, \ldots, X_{k_p}) \mid X_k] = \mathrm{E}_\nu\big\{ \mathrm{E}_\nu[f(Y_k) \mid X_{k_1}, \ldots, X_{k_p}, X_k]\, h(X_{k_1}, \ldots, X_{k_p}) \,\big|\, X_k \big\} = \mathrm{E}_\nu[f(Y_k) \mid X_k]\, \mathrm{E}_\nu[h(X_{k_1}, \ldots, X_{k_p}) \mid X_k] . \]
\[ (Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (Y_{k'_1}, \ldots, Y_{k'_{p'}}) \;\big|\; (X_{k_1}, \ldots, X_{k_p}, X_{k'_1}, \ldots, X_{k'_{p'}}) \quad [\mathrm{P}_\nu] \]
and
\[ (Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (X_{k'_1}, \ldots, X_{k'_{p'}}) \;\big|\; (X_{k_1}, \ldots, X_{k_p}) \quad [\mathrm{P}_\nu] . \]
Hence,
\[ (Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (X_{k'_1}, \ldots, X_{k'_{p'}}, Y_{k'_1}, \ldots, Y_{k'_{p'}}) \;\big|\; (X_{k_1}, \ldots, X_{k_p}) \quad [\mathrm{P}_\nu] , \]
Part I: State Inference

Chapter 2: Filtering and Smoothing Recursions
This chapter deals with a fundamental issue in hidden Markov modeling: given a
fully specified model and some observations Y0 , . . . , Yn , what can be said about the
corresponding unobserved state sequence X0 , . . . , Xn ? More specifically, we shall
be concerned with the evaluation of the conditional distributions of the state at
index k, Xk , given the observations Y0 , . . . , Yn , a task that is generally referred
to as smoothing. There are of course several options available for tackling this
problem (Anderson and Moore, 1979, Chapter 7) and we focus, in this chapter, on
the fixed-interval smoothing paradigm in which n is held fixed and it is desired to
evaluate the conditional distributions of Xk for all indices k between 0 and n. Note
that only the general mechanics of the smoothing problem are dealt with in this
chapter. In particular, most formulas will involve integrals over X. We shall not,
for the moment, discuss ways in which these integrals can be effectively evaluated,
or at least approximated, numerically.
The driving line of this chapter is the existence of a variety of smoothing ap-
proaches that involve a number of steps that only increase linearly with the number
of observations. This is made possible by the fact (to be made precise in Sec-
tion 2.3) that conditionally on the observations Y0 , . . . , Yn , the state sequence still
is a Markov chain, albeit a non-homogeneous one.
From a historical perspective, it is interesting to recall that most of the early
references on smoothing, which date back to the 1960s, focused on the specific case
of Gaussian linear state-space models, following the pioneering work by Kalman and
Bucy (1961). The classic book by Anderson and Moore (1979) on optimal filtering,
for instance, is fully devoted to linear state-space models—see also Chapter 10 of the
recent book by Kailath et al. (2000) for a more exhaustive set of early references on
the smoothing problem. Although some authors such as (for instance) Ho and Lee
(1964) considered more general state-space models, it is fair to say that the Gaus-
sian linear state-space model was the dominant paradigm in the automatic control
community1 . In contrast, the work by Baum and his colleagues on hidden Markov
models (Baum et al., 1970) dealt with the case where the state space X of the hidden
state is finite. These two streams of research (on Gaussian linear models and finite
state space models) remained largely separated. Approximately at the same time, in
the field of probability theory, the seminal work by Stratonovich (1960) stimulated
a number of contributions that were to compose a body of work generally referred to
1. Interestingly, until the early 1980s, the works that did not focus on the linear state-space model were usually advertised by the use of the words "Bayes" or "Bayesian" in their title—see, e.g., Ho and Lee (1964) or Askar and Derin (1981).
as filtering theory. The object of filtering theory is to study inference about partially
observable Markovian processes in continuous time. A number of early references in
this domain indeed consider some specific form of discrete state space continuous-
time equivalent of the HMM (Shiryaev, 1966; Wonham, 1965)—see also Lipster
and Shiryaev (2001), Chapter 9. Working in continuous time, however, implies the
use of mathematical tools that are definitely more complex than those needed to
tackle the discrete-time model of Baum et al. (1970). As a matter of fact, filtering
theory and hidden Markov models evolved as two mostly independent fields of re-
search. A poorly acknowledged fact is that the pioneering paper by Stratonovich
(1960) (translated from an earlier Russian publication) describes, in its first sec-
tion, an equivalent to the forward-backward smoothing approach of Baum et al.
(1970). It turns out, however, that the formalism of Baum et al. (1970) generalizes
well to models where the state space is not discrete anymore, in contrast to that
of Stratonovich (1960).
Here Lν,n is an important quantity, which we define below for future reference.
Definition 16 (Likelihood). The likelihood of the observations is the probability density function of Y0, Y1, . . . , Yn with respect to $\mu^n$ defined, for all (y0, . . . , yn) ∈ Y^{n+1}, by
\[ L_{\nu,n}(y_0, \ldots, y_n) = \int \cdots \int \nu(dx_0)\, g(x_0, y_0)\, Q(x_0, dx_1)\, g(x_1, y_1) \cdots Q(x_{n-1}, dx_n)\, g(x_n, y_n) . \tag{2.3} \]
In addition,
\[ \ell_{\nu,n} \stackrel{\text{def}}{=} \log L_{\nu,n} \tag{2.4} \]
is referred to as the log-likelihood function.
Remark 17 (Concise Notation for Sub-sequences). For the sake of conciseness, we will use in the following the notation Yl:m to denote the collection of consecutively indexed variables Yl, . . . , Ym wherever possible (proceeding the same way for the unobservable sequence {Xk}). In quoting (2.3) for instance, we shall write Lν,n(y0:n) rather than Lν,n(y0, . . . , yn). By transparent convention, Yk:k refers to the single variable Yk, although the second notation (Yk) is to be preferred in this particular case.
2.1.2 Smoothing
We first define generically what is meant by the word smoothing before deriving
the basic results that form the core of the techniques discussed in the rest of the
chapter.
Definition 18 (Smoothing, Filtering, Prediction). For positive indices k, l, and n with l ≥ k, denote by φν,k:l|n the conditional distribution of Xk:l given Y0:n, that is,
(a) φν,k:l|n is a transition kernel from $\mathsf{Y}^{n+1}$ to $\mathsf{X}^{l-k+1}$;
where the equality holds Pν-almost surely. Specific choices of k and l give rise to several particular cases of interest:
Joint Smoothing: φν,0:n|n, for n ≥ 0;
Filtering: φν,n|n, for n ≥ 0. Because the use of filtering will be preeminent in the following, we shall most often abbreviate φν,n|n to φν,n.
In more precise terms, φν,k:l|n is a version of the conditional distribution of Xk:l
given Y0:n . It is however not obvious that such a quantity indeed exists in great
generality. The proposition below complements Definition 18 by a constructive
approach to defining the smoothing quantities from the elements of the hidden
Markov model.
Proposition 19. Consider a hidden Markov model compatible with Definition 12, let n be a positive integer and y0:n ∈ Y^{n+1} a sequence such that Lν,n(y0:n) > 0. The joint smoothing distribution φν,0:n|n then satisfies
\[ \phi_{\nu,0:n|n}(y_{0:n}, f) = L_{\nu,n}(y_{0:n})^{-1} \int \cdots \int f(x_{0:n})\, \nu(dx_0)\, g(x_0, y_0) \prod_{k=1}^{n} Q(x_{k-1}, dx_k)\, g(x_k, y_k) \tag{2.5} \]
for all functions f ∈ Fb(X^{n+1}). Likewise, for indices p ≥ 0,
\[ \phi_{\nu,0:n+p|n}(y_{0:n}, f) = \int \cdots \int f(x_{0:n+p})\, \phi_{\nu,0:n|n}(y_{0:n}, dx_{0:n}) \prod_{k=n+1}^{n+p} Q(x_{k-1}, dx_k) \tag{2.6} \]
for all functions f ∈ Fb(X^{n+p+1}).
Proof. Equation (2.5) defines φν,0:n|n in a way that obviously satisfies part (a) of Definition 18. To prove the (b) part of the definition, consider a function h ∈ Fb(Y^{n+1}). By (2.1),
\[ \mathrm{E}_\nu[h(Y_{0:n})\, f(X_{0:n})] = \int \cdots \int h(y_{0:n})\, f(x_{0:n})\, \nu(dx_0)\, g(x_0, y_0) \left[ \prod_{k=1}^{n} Q(x_{k-1}, dx_k)\, g(x_k, y_k) \right] \mu^n(dy_{0:n}) . \]
Using Definition 16 of the likelihood Lν,n and (2.5) for φν,0:n|n yields
\[ \mathrm{E}_\nu[h(Y_{0:n})\, f(X_{0:n})] = \int \cdots \int h(y_{0:n})\, \phi_{\nu,0:n|n}(y_{0:n}, f)\, L_{\nu,n}(y_{0:n})\, \mu^n(dy_{0:n}) . \]
When integrating with respect to the subsequence yn+1:n+p, the third line of the previous equation reduces to $\prod_{l=n+1}^{n+p} Q(x_{l-1}, dx_l)\, \mu^n(dy_{0:n})$. Finally use (2.3) and (2.5) to obtain
\[ \mathrm{E}_\nu[h(Y_{0:n})\, f(X_{0:n+p})] = \int \cdots \int h(y_{0:n})\, f(x_{0:n+p}) \left[ \phi_{\nu,0:n|n}(y_{0:n}, dx_{0:n}) \prod_{k=n+1}^{n+p} Q(x_{k-1}, dx_k) \right] L_{\nu,n}(y_{0:n})\, \mu^n(dy_{0:n}) , \tag{2.8} \]
where αν,k and βk|n are defined below in (2.11) and (2.12), respectively. In simple terms, αν,k corresponds to the factors in the multiple integral that are to be integrated with respect to the state variables xl with indices l ≤ k, while βk|n gathers the remaining factors (which are to be integrated with respect to xl for l > k). This simple splitting of the multiple integration in (2.9) constitutes the forward-backward decomposition.
Definition 20 (Forward-Backward "Variables"). For k ∈ {0, . . . , n}, define the following quantities.
Forward Kernel: αν,k is the non-negative finite kernel from $(\mathsf{Y}^{k+1}, \mathcal{Y}^{\otimes(k+1)})$ to (X, X) such that
\[ \alpha_{\nu,k}(y_{0:k}, f) = \int \cdots \int f(x_k)\, \nu(dx_0)\, g(x_0, y_0) \prod_{l=1}^{k} Q(x_{l-1}, dx_l)\, g(x_l, y_l) , \tag{2.11} \]
with the convention that the rightmost product term is empty for k = 0.
Backward Function: βk|n is the non-negative measurable function on $\mathsf{Y}^{n-k} \times \mathsf{X}$ defined by
\[ \beta_{k|n}(y_{k+1:n}, x) = \int \cdots \int Q(x, dx_{k+1})\, g(x_{k+1}, y_{k+1}) \prod_{l=k+2}^{n} Q(x_{l-1}, dx_l)\, g(x_l, y_l) , \tag{2.12} \]
for k ≤ n − 1 (with the same convention that the rightmost product is empty for k = n − 1); βn|n(·) is set to the constant function equal to 1 on X.
The term “forward and backward variables” as well as the use of the symbols
α and β is part of the HMM credo and dates back to the seminal work of Baum
and his colleagues (Baum et al., 1970, p. 168). It is clear however that for a general
model as given in Definition 12, these quantities as defined in (2.11) and (2.12)
are very different in nature, and indeed sufficiently so to prevent the use of the
loosely defined term “variable”. In the original framework studied by Baum and
his coauthors where X is a finite set, both the forward measures αν,k (y0:k , ·) and the
backward functions βk|n (yk+1:n , ·) can be represented by vectors with non-negative
entries. Indeed, in this case αν,k (y0:k , x) has the interpretation Pν (Y0 = y0 , . . . , Yk =
yk , Xk = x) while βk|n (yk+1:n , x) has the interpretation P(Yk+1 = yk+1 , . . . , Yn =
yn | Xk = x). This way of thinking of αν,k and βk|n may be extended to general state
spaces: αν,k(y0:k, dx) is then the joint density (with respect to $\mu^{k+1}$) of Y0, . . . , Yk and distribution of Xk, while βk|n(yk+1:n, x) is the conditional joint density (with respect to $\mu^{n-k}$) of Yk+1, . . . , Yn given Xk = x. Obviously, these entities may then not be represented as vectors of finite length, as when X is finite; this situation is the exception rather than the rule.
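When X is finite, αν,k and βk|n really are vectors, and the recursions derived in Section 2.2 below reduce to matrix-vector products. A sketch of the unnormalized forward-backward computation (hypothetical Gaussian-emission set-up as in the sampling example above; g[k, x] stands for g(x, yk)); for long sequences these unnormalized quantities under- or overflow, which motivates the normalized recursion of Section 2.2.2:

    import numpy as np
    from scipy.stats import norm

    Q = np.array([[0.95, 0.05], [0.10, 0.90]])
    nu = np.array([0.5, 0.5])
    means, sigma = np.array([0.0, 2.0]), 1.0
    Y = np.array([0.1, 1.9, 2.2, -0.3])
    g = norm.pdf(Y[:, None], loc=means, scale=sigma)   # g[k, x] = g(x, y_k)
    n = len(Y) - 1

    alpha = np.zeros((n + 1, 2))                 # alpha[k, x] ~ alpha_{nu,k}
    alpha[0] = nu * g[0]
    for k in range(1, n + 1):
        alpha[k] = (alpha[k - 1] @ Q) * g[k]     # forward step

    beta = np.ones((n + 1, 2))                   # beta[k, x] ~ beta_{k|n}(x)
    for k in range(n - 1, -1, -1):
        beta[k] = Q @ (g[k + 1] * beta[k + 1])   # backward step, cf. (2.19)

    L = alpha[n].sum()                           # likelihood L_{nu,n}
    post = alpha * beta / L                      # post[k, x] = phi_{nu,k|n}(x)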
Let us simply remark at this point that while the forward kernel at index k is defined irrespective of the length n of the observation sequence (as long as
n ≥ k), the same is not true for the backward functions. The sequence of backward
functions clearly depends on the index where the observation sequence stops. In
general, for instance, βk|n−1 differs from βk|n even if we assume that the same
sub-observation sequence y0:n−1 is considered in both cases. This is the reason for
adding the terminal index n to the notation used for the backward functions. This
notation also constitutes a departure from HMM traditions in which the backward
functions are simply indexed by k. For αν,k , the situation is closer to standard
practice and we simply add the subscript ν to recall that the forward kernel αν,k, in contrast with the backward functions, does depend on the distribution ν postulated for the initial state X0.
where gk are the data-dependent functions on X defined by $g_k(x) \stackrel{\text{def}}{=} g(x, y_k)$ for the particular sequence y0:n under consideration. The sequence of functions {gk}
is about the only new notation that is needed as we simply re-use the previously
defined quantities omitting their explicit dependence on the observations. For in-
stance, in addition to writing Lν,n instead of Lν,n (y0:n ), we will also use φn (·) rather
than φn (y0:n , ·), βk|n (·) rather than βk|n (yk+1:n , ·), etc. This notational simplifica-
tion implies a corresponding terminological adjustment. For instance, αν,k will be
referred to as the forward measure at index k and considered as a positive finite
measure on (X, X ). In all cases, the conversion should be easy to do mentally, as in
the case of αν,k , for instance, what is meant is really “the measure αν,k (y0:k , ·), for
a particular value of y0:k ∈ Yk+1 ”.
At first sight, omitting the observations may seem a weird thing to do in a
statistically oriented book. However, for posterior state inference in HMMs, one
indeed works conditionally on a given fixed sequence of observations. Omitting the
observations from our notation will thus allow more concise expressions in most
parts of the book. There are of course some properties of the hidden Markov
model for which dependence with respect to the distribution of the observations
does matter (hopefully!). This is in particular the case of Section 3 on forgetting and of Chapter 6, which deals with statistical properties of the estimates; for these we will make the dependence with respect to the observations explicit.
2.2 Forward-Backward
The forward-backward decomposition introduced in Section 2.1.3 is just a rewriting of the multiple integral in (2.9) such that for f ∈ Fb(X),
\[ \phi_{\nu,k|n}(f) = L_{\nu,n}^{-1} \int f(x)\, \alpha_{\nu,k}(dx)\, \beta_{k|n}(x) , \tag{2.14} \]
where
\[ \alpha_{\nu,k}(f) = \int \cdots \int f(x_k)\, \nu(dx_0)\, g_0(x_0) \prod_{l=1}^{k} Q(x_{l-1}, dx_l)\, g_l(x_l) \tag{2.15} \]
and
\[ \beta_{k|n}(x) = \int \cdots \int Q(x, dx_{k+1})\, g_{k+1}(x_{k+1}) \prod_{l=k+2}^{n} Q(x_{l-1}, dx_l)\, g_l(x_l) . \tag{2.16} \]
The last expression is, by convention, equal to 1 for the final index k = n. Note that we are now using the implicit conditioning convention discussed in the previous section.
Similarly, the backward functions defined by (2.16) may be obtained, for all x ∈ X, by the recursion
\[ \beta_{k|n}(x) = \int Q(x, dx')\, g_{k+1}(x')\, \beta_{k+1|n}(x') . \tag{2.19} \]
Proof. The proof of this result is straightforward and similar for both recursions. For αν,k for instance, simply rewrite (2.15) as
\[ \alpha_{\nu,k}(f) = \int_{x_k \in \mathsf{X}} f(x_k) \int_{x_{k-1} \in \mathsf{X}} \left[ \int \cdots \int_{x_0, \ldots, x_{k-2} \in \mathsf{X}} \nu(dx_0)\, g_0(x_0) \prod_{l=1}^{k-1} Q(x_{l-1}, dx_l)\, g_l(x_l) \right] Q(x_{k-1}, dx_k)\, g_k(x_k) , \]
\[ \phi_{\nu,k|n}(f) = L_{\nu,n}^{-1}\, \alpha_{\nu,k}\!\left( f\, \beta_{k|n} \right) . \]
and
\[ \alpha_{\nu,k}(\mathbf{1}) = L_{\nu,k} , \]
where Lν,k refers to the likelihood of the observations up to index k (included) only, under Pν.
Proof. Because (2.14) must hold in particular for f = 1 and the marginal smoothing distribution φν,k|n is a probability measure,
\[ \phi_{\nu,k|n}(\mathbf{1}) = 1 = L_{\nu,n}^{-1}\, \alpha_{\nu,k}(\beta_{k|n}) . \]
For the final index k = n, βn|n is the constant function equal to 1 and hence
αν,n (1) = Lν,n . This observation is however not specific to the final index n, as αν,k
only depends on the observations up to index k and thus any particular index may
be selected as a potential final index (in contrast to what happens for the backward
functions).
Item (i) implies that the normalized forward measures ᾱν,k are probability measures that have a probabilistic interpretation given below. Item (ii) implies that the normalized backward functions are such that $\phi_{\nu,k|n}(f) = \int f(x)\, \bar{\beta}_{k|n}(x)\, \bar{\alpha}_{\nu,k}(dx)$ for all f ∈ Fb(X), without the need for a further renormalization. We note that this scaling scheme differs slightly from the one described by Rabiner (1989).
To derive the probabilistic interpretation of ᾱν,k, observe that (2.14) and Proposition 23, instantiated for the final index k = n, imply that the filtering distribution φν,n at index n (recall that φν,n is used as a simplified notation for φν,n|n) may be written $[\alpha_{\nu,n}(\mathbf{1})]^{-1} \alpha_{\nu,n}$. This finding is of course not specific to the choice of the index n, as already discussed when proving the second statement of Proposition 23. Thus, the normalized version ᾱν,k of the forward measure αν,k coincides with the filtering distribution φν,k introduced in Definition 18. This observation together with Proposition 23 implies that there is a unique choice of scaling scheme that satisfies the two requirements of the previous paragraph, as
\[ \int f(x)\, \phi_{\nu,k|n}(dx) = L_{\nu,n}^{-1} \int f(x)\, \alpha_{\nu,k}(dx)\, \beta_{k|n}(x) = \int f(x)\, \underbrace{L_{\nu,k}^{-1}\, \alpha_{\nu,k}(dx)}_{\bar{\alpha}_{\nu,k}(dx)}\; \underbrace{L_{\nu,n}^{-1} L_{\nu,k}\, \beta_{k|n}(x)}_{\bar{\beta}_{k|n}(x)} \]
must hold for any f ∈ Fb(X). The following definition summarizes these conclusions, using the notation φν,k rather than ᾱν,k, as these two definitions refer to the same object—the filtering distribution at index k.
operating on decreasing indices k = n − 1 down to 0; the initial condition is β̄n|n(x) = 1. Once the two recursions above have been carried out, the smoothing distribution at any given index k ∈ {0, . . . , n} is available via
\[ \phi_{\nu,k|n}(f) = \int f(x)\, \bar{\beta}_{k|n}(x)\, \phi_{\nu,k}(dx) . \tag{2.24} \]
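In the finite-state case, the normalized recursions and (2.24) take the following matrix form (a sketch under the same hypothetical set-up as the previous examples; c[k] stands for the normalization factor cν,k):

    import numpy as np
    from scipy.stats import norm

    Q = np.array([[0.95, 0.05], [0.10, 0.90]])
    nu = np.array([0.5, 0.5])
    Y = np.array([0.1, 1.9, 2.2, -0.3])
    g = norm.pdf(Y[:, None], loc=np.array([0.0, 2.0]), scale=1.0)
    n = len(Y) - 1

    phi = np.zeros((n + 1, 2))          # phi[k] = filtering distribution phi_{nu,k}
    c = np.zeros(n + 1)                 # c[k] = c_{nu,k} = L_{nu,k} / L_{nu,k-1}
    tmp = nu * g[0]
    c[0], phi[0] = tmp.sum(), tmp / tmp.sum()
    for k in range(1, n + 1):
        tmp = (phi[k - 1] @ Q) * g[k]   # predict one step, then correct by g_k
        c[k], phi[k] = tmp.sum(), tmp / tmp.sum()

    beta_bar = np.ones((n + 1, 2))      # normalized backward functions
    for k in range(n - 1, -1, -1):
        beta_bar[k] = Q @ (g[k + 1] * beta_bar[k + 1]) / c[k + 1]

    smooth = phi * beta_bar             # smooth[k, x] = phi_{nu,k|n}(x), cf. (2.24)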
In other words, cν,0 = Lν,0 and for subsequent indices k ≥ 1, cν,k = Lν,k /Lν,k−1 .
Hence (2.25) coincides with the normalized forward and backward variables as spec-
ified by Definition 24.
Filter to Predictor: The last equation in (2.27) simply means that the updated
predicting distribution φν,k+1|k is obtained by applying the transition kernel Q
to the current filtering distribution φν,k . We are thus left with the very basic
problem of determining the one-step distribution of a Markov chain given its
initial distribution.
Remark 27. In many situations, using (2.27) to determine φν,k is indeed the goal
rather than simply a first step in computing smoothed distributions. In particular,
for sequentially observed data, one may need to take actions based on the observa-
tions gathered so far. In such cases, filtering (or prediction) is the method of choice
for inference about the unobserved states, a topic that will be developed further in
Chapter 4.
Remark 28. Another remarkable fact about the filtering recursion is that (2.26)
together with (2.27) provides a method for evaluating the likelihood Lν,k of the ob-
servations up to index k recursively in the index k. In addition, as cν,k = Lν,k /Lν,k−1
from (2.26), cν,k may be interpreted as the conditional likelihood of Yk given the pre-
vious observations Y0:k−1 . However, as discussed at the beginning of Section 2.2.2,
using (2.26) directly is generally impracticable for numerical reasons. In order to
avoid numerical under- or overflow, one can equivalently compute the log-likelihood
ℓν,k. Combining (2.26) and (2.27) gives the important formula
\[ \ell_{\nu,k} \stackrel{\text{def}}{=} \log L_{\nu,k} = \sum_{l=0}^{k} \log \phi_{\nu,l|l-1}(g_l) , \tag{2.29} \]
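Formula (2.29) is exactly what the normalized recursion computes along the way, since each cν,l equals φν,l|l−1(gl). A compact standalone sketch (finite state space assumed; g is the array of emission densities, g[l, x] = g(x, yl)):

    import numpy as np

    def hmm_loglik(nu, Q, g):
        """log L_{nu,n} via (2.29): accumulate log phi_{nu,l|l-1}(g_l)."""
        ell, pred = 0.0, nu                  # pred = one-step predictive, initially nu
        for gl in g:
            c = pred @ gl                    # c_{nu,l} = phi_{nu,l|l-1}(g_l)
            ell += np.log(c)                 # stable: no product of tiny numbers
            pred = ((pred * gl) / c) @ Q     # filter update, then one-step prediction
        return ell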
Remark 29. The normalized backward function β̄k|n does not have a simple probabilistic interpretation when isolated from the corresponding filtering measure. However, (2.24) shows that the marginal smoothing distribution φν,k|n is dominated by the corresponding filtering distribution φν,k and that β̄k|n is by definition the Radon-Nikodym derivative of φν,k|n with respect to φν,k,
\[ \bar{\beta}_{k|n} = \frac{d\phi_{\nu,k|n}}{d\phi_{\nu,k}} . \]
As a consequence,
\[ \inf\left\{ M \in \mathbb{R} : \phi_{\nu,k}(\{\bar{\beta}_{k|n} \geq M\}) = 0 \right\} \geq 1 \]
and
\[ \sup\left\{ M \in \mathbb{R} : \phi_{\nu,k}(\{\bar{\beta}_{k|n} \leq M\}) = 0 \right\} \leq 1 , \]
with the conventions inf ∅ = ∞ and sup ∅ = −∞. As a consequence, the values of β̄k|n cannot all become simultaneously large or close to zero, as was the case for βk|n, although one cannot exclude the possibility that β̄k|n still has important dynamics without some further assumptions on the model.
The normalizing factor $\prod_{l=k+1}^{n} c_{\nu,l} = L_{\nu,n}/L_{\nu,k}$ by which β̄k|n differs from the corresponding unnormalized backward function βk|n may be interpreted as the conditional likelihood of the future observations Yk+1:n given the observations up to index k, Y0:k.
which, using (2.13) and the definition (2.16) of the backward function, expands to
\[ L_{\nu,n}^{-1} \int \cdots \int h(x_{0:k})\, \nu(dx_0)\, g_0(x_0) \prod_{i=1}^{k} Q(x_{i-1}, dx_i)\, g_i(x_i) \int Q(x_k, dx_{k+1})\, f(x_{k+1})\, g_{k+1}(x_{k+1})\, \underbrace{\int \cdots \int \prod_{i=k+2}^{n} Q(x_{i-1}, dx_i)\, g_i(x_i)}_{\beta_{k+1|n}(x_{k+1})} . \tag{2.32} \]
From Definition 30, $\int Q(x_k, dx_{k+1})\, f(x_{k+1})\, g_{k+1}(x_{k+1})\, \beta_{k+1|n}(x_{k+1})$ is equal to $F_{k|n}(x_k, f)\, \beta_{k|n}(x_k)$. Thus, (2.32) may be rewritten as
\[ \mathrm{E}_\nu[f(X_{k+1})\, h(X_{0:k}) \mid Y_{0:n}] = L_{\nu,n}^{-1} \int \cdots \int F_{k|n}(x_k, f)\, h(x_{0:k})\, \nu(dx_0)\, g_0(x_0) \left[ \prod_{i=1}^{k} Q(x_{i-1}, dx_i)\, g_i(x_i) \right] \beta_{k|n}(x_k) . \tag{2.33} \]
Using the definition (2.16) of βk|n again, this latter integral is easily seen to be similar to (2.32) except for the fact that f(xk+1) has been replaced by Fk|n(xk, f). Hence
\[ \mathrm{E}_\nu[f(X_{k+1})\, h(X_{0:k}) \mid Y_{0:n}] = \mathrm{E}_\nu[F_{k|n}(X_k, f)\, h(X_{0:k}) \mid Y_{0:n}] , \]
for all functions h ∈ Fb(X^{k+1}) as requested.
For k ≥ n, the situation is simpler because (2.6) implies that φν,0:k+1|n = φν,0:k|n Q. Hence,
and thus
\[ \mathrm{E}_\nu[f(X_{k+1})\, h(X_{0:k}) \mid Y_{0:n}] = \int \cdots \int h(x_{0:k})\, \phi_{\nu,0:k|n}(dx_{0:k})\, Q(x_k, f) , \]
Remark 32. A key ingredient of the above proof is (2.32), which gives a representation of the joint smoothing distribution of the state variables X0:k given the observations up to index n, with n ≥ k. This representation, which states that
\[ \phi_{\nu,0:k|n}(f) = L_{\nu,n}^{-1} \int \cdots \int f(x_{0:k}) \left[ \nu(dx_0)\, g_0(x_0) \prod_{i=1}^{k} Q(x_{i-1}, dx_i)\, g_i(x_i) \right] \beta_{k|n}(x_k) \tag{2.34} \]
for all f ∈ Fb(X^{k+1}), is a generalization of the marginal forward-backward decomposition as stated in (2.14).
Proposition 31 implies that, conditionally on the observations Y0:n , the state
sequence {Xk }k≥0 is a non-homogeneous Markov chain associated with the family
of Markov transition kernels {Fk|n }k≥0 and initial distribution φν,0|n . The fact that
the Markov property of the state sequence is preserved when conditioning sounds
surprising because the (marginal) smoothing distribution of the state Xk depends
on both past and future observations. There is however nothing paradoxical here, as
the Markov transition kernels Fk|n indeed depend (and depend only) on the future
observations Yk+1:n .
As a consequence of Proposition 31, the joint smoothing distributions may be
rewritten in a form that involves the forward smoothing kernels using the Chapman-
Kolmogorov equations (1.1).
Proposition 33. For any integers n and m, function f ∈ Fb(X^{m+1}) and initial probability ν on (X, X),
\[ \mathrm{E}_\nu[f(X_{0:m}) \mid Y_{0:n}] = \int \cdots \int f(x_{0:m})\, \phi_{\nu,0|n}(dx_0) \prod_{i=1}^{m} F_{i-1|n}(x_{i-1}, dx_i) , \tag{2.35} \]
where {Fk|n}k≥0 are defined by (2.30) and (2.31) and φν,0|n is the marginal smoothing distribution defined, for any A ∈ X, by
\[ \phi_{\nu,0|n}(A) = [\nu(g_0 \beta_{0|n})]^{-1} \int_A \nu(dx)\, g_0(x)\, \beta_{0|n}(x) . \tag{2.36} \]
If one is only interested in computing the fixed-point marginal smoothing distributions, (2.35) may also be used as the second phase of a smoothing approach which we recapitulate below.
for all f ∈ Fb(X × X). From the previous discussion, there exists a Markov transition kernel Bν,k which satisfies Definition 2, that is,
\[ B_{\nu,k} \stackrel{\text{def}}{=} \{ B_{\nu,k}(x, A),\ x \in \mathsf{X},\ A \in \mathcal{X} \} \]
Proposition 35. Given a strictly positive index n, initial distribution ν, and index
k ∈ {0, . . . , n − 1},
for any f ∈ Fb (X). Here, Bν,k is the backward smoothing kernel defined in (2.38).
Before giving the proof of this result, we make a few remarks to provide some
intuitive understanding of the backward smoothing kernels.
Remark 36. Contrary to the forward kernel, the backward transition kernel is only
defined implicitly through the equality of the two representations (2.37) and (2.38).
This limitation is fundamentally due to the fact that the backward kernel implies a
non-trivial time-reversal operation.
Proposition 35 however allows a simple interpretation of the backward kernel:
Because Eν [f (Xk ) | Xk+1:n , Y0:n ] is equal to Bν,k (Xk+1 , f ) and thus depends neither
on Xl for l > k + 1 nor on Yl for l ≥ k + 1, the tower property of conditional ex-
pectation implies that not only is Bν,k (Xk+1 , f ) equal to Eν [f (Xk ) | Xk+1 , Y0:n ] but
also coincides with Eν [f (Xk ) | Xk+1 , Y0:k ], for any f ∈ Fb (X). In addition, the dis-
tribution of Xk+1 given Xk and Y0:k reduces to Q(Xk , ·) due to the particular form
of the transition kernel associated with a hidden Markov model (see Definition 11).
Recall also that the distribution of Xk given Y0:k is denoted by φν,k . Thus, Bν,k
can be interpreted as a Bayesian posterior in the equivalent pseudo-model where
Thus, in many cases of interest, the backward transition kernel Bν,k can be
written straightforwardly as a function of φν,k and Q. In these situations, Propo-
sition 38 is the method of choice for smoothing, as it only involves normalized
quantities, whereas Corollary 34 is not normalized and thus can generally not be
implemented as it stands.
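For a finite state space the construction is fully explicit: Bν,k is the Bayesian reversal of Q with respect to φν,k, exactly as in (1.5). A sketch (assuming the filtering output phi, kernel Q, and horizon n from the earlier filtering example), including the resulting backward pass for the marginal smoothing distributions:

    import numpy as np

    def backward_kernel(phi_k, Q):
        # B[x', x] proportional to phi_k(x) Q(x, x'): posterior of X_k given X_{k+1} = x'
        joint = phi_k[:, None] * Q
        return joint.T / joint.sum(axis=0)[:, None]

    B = [backward_kernel(phi[k], Q) for k in range(n)]

    smooth = np.zeros_like(phi)            # marginal smoothing distributions
    smooth[n] = phi[n]                     # phi_{nu,n|n} = phi_{nu,n}
    for k in range(n - 1, -1, -1):
        smooth[k] = smooth[k + 1] @ B[k]   # integrate X_{k+1} out against B_{nu,k}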
Proof (of Proposition 35). Let k ∈ {0, . . . , n − 1} and h ∈ Fb(X^{n−k}). Then
\[ \mathrm{E}_\nu[f(X_k)\, h(X_{k+1:n}) \mid Y_{0:n}] = \int \cdots \int f(x_k)\, h(x_{k+1:n})\, \phi_{\nu,k:n|n}(dx_{k:n}) . \tag{2.40} \]
Using the definition (2.13) of the joint smoothing distribution φν,k:n|n yields
\[ \mathrm{E}_\nu[h'(X_{k+1:n}) \mid Y_{0:n}] = \frac{L_{\nu,k}}{L_{\nu,n}} \int \cdots \int h'(x_{k+1:n})\, \phi_{\nu,k+1|k}(dx_{k+1})\, g_{k+1}(x_{k+1}) \prod_{i=k+2}^{n} Q(x_{i-1}, dx_i)\, g_i(x_i) . \]
Identifying h′ with $h(x_{k+1:n}) \int f(x)\, B_{\nu,k}(x_{k+1}, dx)$, we find that (2.42) may be rewritten as
recalling that $\phi_{\nu,n|n} \stackrel{\text{def}}{=} \phi_{\nu,n}$.
(iii) A more subtle difference between the forward and backward Markovian decom-
positions is the observation that Definition 30 does provide an expression of
the forward kernels Fk|n for any k ≥ 0, that is, also for indices after the end of
the observation sequence. Hence, the process {Xk }k≥0 , when conditioned on
some observations Y0:n , really forms a non-homogeneous Markov chain whose
finite-dimensional distributions are defined by Proposition 33. In contrast, the
backward kernels Bν,k are defined for indices k ∈ {0, . . . , n − 1} only, and
thus the index-reversed process {Xn−k } is also defined, by Proposition 35, for
indices k in the range {0, . . . , n} only. In order to define the index-reversed
chain for negative indices, a minimal requirement is that the underlying chain
{Xk } also be well defined for k < 0. Defining Markov chains {Xk } with in-
dices k ∈ Z is only meaningful in the stationary case, that is when ν is the
stationary distribution of Q. As both this stationarization issue and the for-
ward and backward Markovian decompositions play a key role in the analysis
of the statistical properties of the maximum likelihood estimator, we postpone
further discussion of this point to Chapter 6.
Chapter 3: Forgetting of the Initial Condition and Filter Stability
Recall from previous chapters that in a partially dominated HMM (see Definition 12), we denote by
• Pν the probability associated to the Markov chain {Xk, Yk}k≥0 on the canonical space $((\mathsf{X} \times \mathsf{Y})^{\mathbb{N}}, (\mathcal{X} \otimes \mathcal{Y})^{\otimes \mathbb{N}})$ with initial probability measure ν and transition kernel T defined by (1.15);
• φν,k|n the distribution of the hidden state Xk conditionally on the observations
Y0:n , under the probability measure Pν .
Forgetting properties pertain to the dependence of φν,k|n with respect to the
initial distribution ν. A typical question is to ask whether φν,k|n and φν′,k|n are close (in some sense) for large values of k and arbitrary choices of ν and ν′. This
issue will play a key role both when studying the convergence of sequential Monte
Carlo methods (Chapter ??) and when analyzing the asymptotic behavior of the
maximum likelihood estimator (Chapter 6).
In the following, it is shown more precisely that, under appropriate conditions on the kernel Q of the hidden chain and on the transition density function g, the total variation distance $\|\phi_{\nu,k|n} - \phi_{\nu',k|n}\|_{\mathrm{TV}}$ converges to zero as k tends to infinity.
Remember that, following the implicit conditioning convention (Section 2.1.4), we
usually omit to indicate explicitly that φν,k|n indeed depends on the observations
Y0:n . In this section however we cannot use this convention anymore, as we will
meet both situations in which, say, kφν,n − φν 0 ,n kTV converges to zero (as n tends
to infinity) for all possible values of the sequence {yn }n≥0 ∈ YN (uniform forget-
ting) and cases where kφν,n − φν 0 ,n kTV can be shown to converge to zero almost
surely only when {Yk }k≥0 is assumed to be distributed under a specific distribution
(typically Pν? for some initial distribution ν? ). In this section, we thus make de-
pendence with respect to the observations explicit by indicating the relevant subset
of observation between brackets, using, for instance, φν,k|n [y0:n ] rather than φν,k|n .
We start by recalling some elementary facts and results about the total variation
norm of a signed measure, providing in particular useful characterizations of the
total variation as an operator norm over appropriately defined function spaces. We
then discuss the contraction property of Markov kernels, using the measure-theoretic
approach introduced in an early paper by Dobrushin (1956) and recently revisited
and extended by Del Moral et al. (2003). We finally present the applications of these
results to establish forgetting properties of the smoothing and filtering recursions
and discuss the implications of the technical conditions required to obtain these
results.
The measures ξ+ and ξ− are referred to as the positive and negative variations of
the signed measure ξ. By construction, ξ = ξ+ − ξ− . This decomposition of ξ into
its positive and negative variations is called the Hahn-Jordan decomposition of ξ.
The definition of the positive and negative variations above is easily shown to be
independent of the particular Jordan set chosen.
Definition 39 (Total Variation of a Signed Measure). Let (X, X ) be a measurable
space and let ξ be a signed measure on (X, X ). The total variation norm of ξ is
defined as
\[ \|\xi\|_{\mathrm{TV}} = \xi_+(\mathsf{X}) + \xi_-(\mathsf{X}) , \]
where (ξ+, ξ−) is the Hahn-Jordan decomposition of ξ.
showing (i). It also shows that the suprema in (ii) and (iii) are no larger than ‖ξ‖TV and ‖f‖∞, respectively. To establish equality in these relations, first note that $\|\mathbf{1}_H - \mathbf{1}_{H^c}\|_\infty = 1$ and $\xi(\mathbf{1}_H - \mathbf{1}_{H^c}) = \xi(H) - \xi(H^c) = \|\xi\|_{\mathrm{TV}}$. This proves (ii). Next pick f and let {xn} be a sequence in X such that $\lim_{n\to\infty} |f(x_n)| = \|f\|_\infty$. Then $\|f\|_\infty = \lim_{n\to\infty} |\delta_{x_n}(f)|$, proving (iii).
The set M0(X, X) possesses some interesting properties that will prove useful in the sequel. Let ξ be in this set. Because ξ(X) = 0, for any f ∈ Fb(X) and any real c it holds that ξ(f) = ξ(f − c). Therefore by Lemma 41(i), $|\xi(f)| \leq \|\xi\|_{\mathrm{TV}} \|f - c\|_\infty$, which implies that
\[ |\xi(f)| \leq \|\xi\|_{\mathrm{TV}} \inf_{c \in \mathbb{R}} \|f - c\|_\infty . \]
It is easily seen that for any f ∈ Fb(X), $\inf_{c \in \mathbb{R}} \|f - c\|_\infty$ is related to the oscillation semi-norm of f, also called the global modulus of continuity,
\[ \mathrm{osc}(f) \stackrel{\text{def}}{=} \sup_{(x, x') \in \mathsf{X} \times \mathsf{X}} |f(x) - f(x')| = 2 \inf_{c \in \mathbb{R}} \|f - c\|_\infty . \tag{3.1} \]
The lemma below provides some additional insight into this result.
Lemma 42. For any ξ ∈ M(X, X) and f ∈ Fb(X),
which shows (3.2). If ξ(X) = 0, then $\xi_+(\mathsf{X}) = \xi_-(\mathsf{X}) = \frac{1}{2} \|\xi\|_{\mathrm{TV}}$, showing (3.3).
\[ |\xi(f) - \xi'(f)| \leq \frac{1}{2} \|\xi - \xi'\|_{\mathrm{TV}}\, \mathrm{osc}(f) . \tag{3.4} \]
This inequality is sharper than the bound $|\xi(f) - \xi'(f)| \leq \|\xi - \xi'\|_{\mathrm{TV}} \|f\|_\infty$ provided by Lemma 41(i), because osc(f) ≤ 2‖f‖∞.
We conclude this section by establishing some alternative expressions for the total variation distance between two probability measures.
\begin{align}
\frac{1}{2} \|\xi - \xi'\|_{\mathrm{TV}} &= \sup_{A} |\xi(A) - \xi'(A)| \tag{3.5} \\
&= 1 - \sup_{\nu \leq \xi,\, \xi'} \nu(\mathsf{X}) \tag{3.6} \\
&= 1 - \inf \sum_{i=1}^{n} \xi(A_i) \wedge \xi'(A_i) . \tag{3.7}
\end{align}
Here the supremum in (3.5) is taken over all measurable subsets of X, the supremum in (3.6) is taken over all finite signed measures ν on (X, X) satisfying ν ≤ ξ and ν ≤ ξ′, and the infimum in (3.7) is taken over all finite measurable partitions A1, . . . , An of X.
Proof. To prove (3.5), first write ξ(A) − ξ′(A) = (ξ − ξ′)(1A) and note that osc(1A) = 1. Thus (3.4) shows that the supremum in (3.5) is no larger than (1/2)‖ξ − ξ′‖TV. Now let H be a Jordan set of the signed measure ξ − ξ′. The supremum is bounded from below by ξ(H) − ξ′(H) = (ξ − ξ′)+(X) = (1/2)‖ξ − ξ′‖TV. This establishes equality in (3.5).
We now turn to (3.6). For any p, q ∈ ℝ, |p − q| = p + q − 2(p ∧ q). Therefore for any A ∈ X,
\[ \frac{1}{2} |\xi(A) - \xi'(A)| = \frac{1}{2} \left( \xi(A) + \xi'(A) \right) - \xi(A) \wedge \xi'(A) . \]
Applying this relation to the sets H and H^c, where H is as above, shows that
\[ \frac{1}{2} (\xi - \xi')(H) = \frac{1}{2} \left[ \xi(H) + \xi'(H) \right] - \xi(H) \wedge \xi'(H) , \]
\[ \frac{1}{2} (\xi' - \xi)(H^c) = \frac{1}{2} \left[ \xi(H^c) + \xi'(H^c) \right] - \xi(H^c) \wedge \xi'(H^c) . \]
For any measure ν such that ν ≤ ξ and ν ≤ ξ′, it holds that ν(H) ≤ ξ(H) ∧ ξ′(H) and ν(H^c) ≤ ξ(H^c) ∧ ξ′(H^c), showing that
\[ \frac{1}{2} (\xi - \xi')(H) + \frac{1}{2} (\xi' - \xi)(H^c) = \frac{1}{2} \|\xi - \xi'\|_{\mathrm{TV}} \leq 1 - \nu(\mathsf{X}) . \]
Thus (3.6) is no smaller than the left-hand side. To show equality, let ν be the measure defined by
\[ \nu(A) = \xi(A \cap H^c) + \xi'(A \cap H) . \tag{3.8} \]
By the definition of H, ξ(A ∩ H^c) ≤ ξ′(A ∩ H^c) and ξ′(A ∩ H) ≤ ξ(A ∩ H) for any A ∈ X. Therefore ν(A) ≤ ξ(A) and ν(A) ≤ ξ′(A). In addition, ν(H) = ξ′(H) = ξ(H) ∧ ξ′(H) and ν(H^c) = ξ(H^c) = ξ(H^c) ∧ ξ′(H^c), so that ν(X) = 1 − (1/2)‖ξ − ξ′‖TV, which establishes equality in (3.6). Finally, for any such ν and any finite measurable partition A1, . . . , An of X, ν(X) = Σi ν(Ai) ≤ Σi ξ(Ai) ∧ ξ′(Ai), showing that
\[ \sup_{\nu \leq \xi,\, \xi'} \nu(\mathsf{X}) \leq \inf \sum_{i=1}^{n} \xi(A_i) \wedge \xi'(A_i) . \]
The supremum and the infimum thus agree, and the proof of (3.7) follows from (3.6).
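For distributions on a finite set, the three characterizations can be verified numerically (a sketch with made-up distributions; the optimal partition in (3.7) is the partition into singletons):

    import numpy as np

    xi = np.array([0.5, 0.3, 0.2])
    xi_p = np.array([0.2, 0.4, 0.4])

    tv = np.abs(xi - xi_p).sum()           # ||xi - xi'||_TV via Hahn-Jordan mass
    sup_A = (xi - xi_p)[xi > xi_p].sum()   # (3.5): attained at the Jordan set H
    overlap = np.minimum(xi, xi_p).sum()   # the supremum in (3.6) and infimum in (3.7)
    assert np.isclose(sup_A, 0.5 * tv)
    assert np.isclose(0.5 * tv, 1 - overlap)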
\[ \delta(K) = \frac{1}{2} \sup_{(x, x') \in \mathsf{X} \times \mathsf{X}} \|K(x, \cdot) - K(x', \cdot)\|_{\mathrm{TV}} = \sup_{(x, x') \in \mathsf{X} \times \mathsf{X},\, x \neq x'} \frac{\|K(x, \cdot) - K(x', \cdot)\|_{\mathrm{TV}}}{\|\delta_x - \delta_{x'}\|_{\mathrm{TV}}} . \]
We remark that as K(x, ·) and K(x′, ·) are probability measures, it holds that ‖K(x, ·)‖TV = ‖K(x′, ·)‖TV = 1. Hence δ(K) ≤ (1/2)(1 + 1) = 1, so that the Dobrushin coefficient satisfies 0 ≤ δ(K) ≤ 1.
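For a finite transition matrix, δ(K) is simply half the largest ℓ1 distance between two rows. A short sketch (hypothetical kernel):

    import numpy as np

    def dobrushin(K):
        # delta(K) = (1/2) max over (x, x') of ||K(x, .) - K(x', .)||_TV
        diffs = np.abs(K[:, None, :] - K[None, :, :]).sum(axis=-1)
        return 0.5 * diffs.max()

    Q = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(dobrushin(Q))   # 0.7 for this kernel; always lies in [0, 1]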
Lemma 46. Let ξ be a finite signed measure on (X, X) and let K be a transition kernel from (X, X) to (Y, Y). Then
Proof. Pick ξ ∈ M(X, X) and let, as usual, ξ+ and ξ− be its positive and negative part, respectively. If ξ−(X) = 0 (ξ is a measure), then ‖ξ‖TV = ξ(X) and (3.9) becomes ‖ξK‖TV ≤ ‖ξ‖TV; this follows from Lemma 44. If ξ+(X) = 0, an analogous argument applies.
Thus assume that both ξ+ and ξ− are non-zero. In view of Lemma 41(ii), it suffices to prove that for any f ∈ Fb(Y) with ‖f‖∞ = 1,
We shall suppose that ξ+(X) ≥ ξ−(X); if not, replace ξ by −ξ and (3.10) remains the same. Then, as |ξ+(X) − ξ−(X)| = ξ+(X) − ξ−(X), (3.10) becomes
Corollary 47.
Proof. If ξ(X) = 0, then (3.9) becomes ‖ξK‖TV ≤ δ(K)‖ξ‖TV, showing that
\[ Q^m(x, A) \geq \epsilon\, \nu(A) , \]
where the infimum is taken over all (x, x′) ∈ X × X and all finite measurable partitions A1, . . . , An of X. Under the Doeblin condition, the sum in this display is bounded from below by $\sum_{i=1}^{n} \nu_{x,x'}(A_i) = \epsilon$. Hence the following lemma is true.
Stochastic processes that are such that for any k, the distribution of the ran-
dom vector (Xn , . . . , Xn+k ) does not depend on n are called stationary (see Defi-
nition 10). It is clear that in general a Markov chain will not be stationary. Nev-
ertheless, given a transition kernel Q, it is possible that with an appropriate choice
of the initial distribution ν we may produce a stationary process. Assuming that
such a distribution exists, the stationarity of the marginal distribution implies that
Eν [1A (X0 )] = Eν [1A (X1 )] for any A ∈ X . This can equivalently be written as
ν(A) = νQ(A), or ν = νQ. In such a case, the Markov property implies that all
finite-dimensional distributions of {Xk }k≥0 are also invariant under translation in
time. These considerations lead to the definition of invariant measure.
Theorem 54. Under Assumption 49, Q admits a unique invariant probability measure π. In addition, for any ξ ∈ M1(X, X),
\[ \|\xi Q^{km} - \xi' Q^{km}\|_{\mathrm{TV}} \leq \delta^k(Q^m)\, \|\xi - \xi'\|_{\mathrm{TV}} \leq (1 - \epsilon)^k\, \|\xi - \xi'\|_{\mathrm{TV}} , \tag{3.14} \]
showing that {ξQ^{km}} is a Cauchy sequence in M1(X, X) endowed with the total variation norm. Because this metric space is complete, there exists a probability measure π such that ξQ^{km} → π. In view of the discussion above, π is invariant for Q^m. Moreover, by (3.14) this limit does not depend on ξ. Thus Q^m admits π as unique invariant probability measure. The Chapman-Kolmogorov equations imply that (πQ)Q^m = (πQ^m)Q = πQ, showing that πQ is also invariant for Q^m and hence that πQ = π as claimed.
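The geometric bound (3.14) is easy to observe numerically: pushing two initial distributions through Q contracts their total variation distance at rate at most δ(Q) per step (a sketch with m = 1, reusing the two-state kernel from the Dobrushin example, for which δ(Q) = 0.7):

    import numpy as np

    Q = np.array([[0.9, 0.1], [0.2, 0.8]])
    xi, xi_p = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # ||xi - xi'||_TV = 2
    delta = 0.7
    for k in range(1, 6):
        xi, xi_p = xi @ Q, xi_p @ Q
        tv = np.abs(xi - xi_p).sum()
        # (3.14) with m = 1: ||xi Q^k - xi' Q^k||_TV <= delta^k * 2
        assert tv <= 2 * delta ** k + 1e-12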
Remark 55. Classical uniform convergence to equilibrium for Markov processes
has been studied during the first half of the 20th century by Doeblin, Kolmogorov,
and Doob under various conditions. Doob (1953) gave a unifying form to these
conditions, which he named Doeblin type conditions. More recently, starting in
the 1970s, an increasing interest in non-uniform convergence of Markov processes
has arisen. An explanation for this interest is that many useful processes do not
converge uniformly to equilibrium, while they do satisfy weaker properties such as a
geometric convergence. It later became clear that non-uniform convergence relates
to local Doeblin type condition and to hitting times for so-called small sets. These
types of conditions are detailed in Chapter 7.
Here, k and n are integers, and ν is the initial probability measure on (X, X ).
The filtering probability is defined by φν,n [Y0:n ] = φν,n|n [Y0:n ]. In this section, we
will establish that under appropriate conditions on the transition kernel Q and on
the function g, the sequence of filtering probabilities satisfies a property referred
to in the literature as “forgetting of the initial condition”. This property can be
formulated as follows: given two probability measures ν and ν′ on (X, X),
where ν⋆ is the initial probability measure that defines the law of the observations {Yk}. Forgetting is also a concept that applies to the smoothing distributions, as it
is often possible to extend the previous results showing that
Equation (3.16) can also be strengthened by showing that, under additional condi-
tions, the forgetting property is uniform with respect to the observed sequence Y0:n
in the sense that there exists a deterministic sequence {ρk } satisfying ρk → 0 and
Several of the results to be proven in the sequel are of this latter type (uniform
forgetting).
As shown in (2.5), the smoothing distribution is defined as the ratio
\[ \phi_{\nu,k|n}[y_{0:n}](f) = \frac{\int \cdots \int f(x_k)\, \nu(dx_0)\, g(x_0, y_0) \prod_{i=1}^{n} Q(x_{i-1}, dx_i)\, g(x_i, y_i)}{\int \cdots \int \nu(dx_0)\, g(x_0, y_0) \prod_{i=1}^{n} Q(x_{i-1}, dx_i)\, g(x_i, y_i)} . \]
Here we have used the following notations and definitions from Chapter 2.
(i) Fi|n[yi+1:n] are the forward smoothing kernels (see Definition 30), given for i = 0, . . . , n − 1, x ∈ X and A ∈ X by
\[ F_{i|n}[y_{i+1:n}](x, A) \stackrel{\text{def}}{=} \beta_{i|n}[y_{i+1:n}](x)^{-1} \int_A Q(x, dx_{i+1})\, g(x_{i+1}, y_{i+1})\, \beta_{i+1|n}[y_{i+2:n}](x_{i+1}) , \tag{3.18} \]
where βi|n[yi+1:n](x) are the backward functions (see Definition 20). Recall that, by Proposition 31, {Fi|n}i≥0 are the transition kernels of the non-homogeneous Markov chain {Xk} conditionally on Y0:n.
(ii) φν,0|n[y0:n] is the posterior distribution of the state X0 conditionally on Y0:n = y0:n, defined for any A ∈ X by
\[ \phi_{\nu,0|n}[y_{0:n}](A) = \frac{\int_A \nu(dx_0)\, g(x_0, y_0)\, \beta_{0|n}[y_{1:n}](x_0)}{\int \nu(dx_0)\, g(x_0, y_0)\, \beta_{0|n}[y_{1:n}](x_0)} . \tag{3.20} \]
We see that the non-linear mapping ν ↦ φν,k|n[y0:n] is the composition of two mappings on M1(X, X).
(i) The mapping ν ↦ φν,0|n[y0:n], which associates to the initial distribution ν the posterior distribution of the state X0 given Y0:n = y0:n. This mapping consists in applying Bayes' formula, which we write as
\[ \phi_{\nu,0|n}[y_{0:n}] = \mathrm{B}\big[ g(\cdot, y_0)\, \beta_{0|n}[y_{1:n}](\cdot),\ \nu \big] . \]
Here
\[ \mathrm{B}[\phi, \xi](f) = \frac{\int f(x)\, \phi(x)\, \xi(dx)}{\int \phi(x)\, \xi(dx)} , \qquad f \in F_b(\mathsf{X}) , \tag{3.21} \]
for any probability measure ξ on (X, X) and any non-negative measurable function φ on X. Note that B[φ, ξ] is a probability measure on (X, X). Because of the normalization, this step is non-linear.
(ii) The mapping $\xi \mapsto \xi \prod_{i=1}^{k} F_{i-1|n}[y_{i:n}]$, which is a linear mapping defined as a product of Markov transition kernels.
For two initial probability measures ν and ν′ on (X, X), the difference of the associated smoothing distributions may thus be expressed as
Note that the function g(x, y0)β0|n[y1:n](x) defined for x ∈ X may also be interpreted as the likelihood Lδx,n[y0:n] of the observations when starting from the initial condition X0 = x (Proposition 23). In the sequel, we use the likelihood notation whenever possible, writing, in addition, Lx,n[y0:n] rather than Lδx,n[y0:n] and L•,n[y0:n] when referring to the whole function.
Using Corollary 47, (3.22) implies that
where the final factor is a Dobrushin coefficient. Because the Bayes operator B returns probability measures, the total variation distance in the right-hand side of this display is always bounded by 2. Although this bound may be sufficient, it is often interesting to relate the total variation distance between B[φ, ξ] and B[φ, ξ′] to the total variation distance between ξ and ξ′. The following lemma is adapted from Künsch (2000)—see also Del Moral (2004, Theorem 4.3.1).
Lemma 56. Let ξ and ξ′ be two probability measures on (X, X) and let φ be a non-negative measurable function such that ξ(φ) > 0 or ξ′(φ) > 0. Then
\[ \|\mathrm{B}[\phi, \xi] - \mathrm{B}[\phi, \xi']\|_{\mathrm{TV}} \leq \frac{\|\phi\|_\infty}{\xi(\phi) \vee \xi'(\phi)}\, \|\xi - \xi'\|_{\mathrm{TV}} . \tag{3.24} \]
Proof. We may assume, without loss of generality, that ξ(φ) ≥ ξ′(φ). For any f ∈ Fb(X),
\[ \mathrm{B}[\phi, \xi](f) - \mathrm{B}[\phi, \xi'](f) = \frac{\int f(x)\, \phi(x)\, (\xi - \xi')(dx)}{\int \phi(x)\, \xi(dx)} + \frac{\int f(x)\, \phi(x)\, \xi'(dx) \int \phi(x)\, (\xi' - \xi)(dx)}{\int \phi(x)\, \xi'(dx) \int \phi(x)\, \xi(dx)} = \frac{1}{\xi(\phi)} \int (\xi - \xi')(dx)\, \phi(x) \left( f(x) - \mathrm{B}[\phi, \xi'](f) \right) . \]
By Lemma 43,
\[ \left| \int (\xi - \xi')(dx)\, \phi(x) \left( f(x) - \mathrm{B}[\phi, \xi'](f) \right) \right| \leq \|\xi - \xi'\|_{\mathrm{TV}} \times \frac{1}{2} \sup_{(x, x') \in \mathsf{X} \times \mathsf{X}} \left| \phi(x)\big(f(x) - \mathrm{B}[\phi, \xi'](f)\big) - \phi(x')\big(f(x') - \mathrm{B}[\phi, \xi'](f)\big) \right| . \]
This decomposition and Corollary 47 show that for any 0 ≤ j ≤ k, any initial distributions ν and ν′ and any sequence y0:n such that Lν,n[y0:n] > 0 and Lν′,n[y0:n] > 0,
Because the Dobrushin coefficient of a Markov kernel is bounded by one, this relation implies that the total variation distance between the smoothing distributions associated with two different initial distributions is non-expanding. To summarize this discussion, we have obtained the following result.
Proposition 57. Let ν and ν′ be two probability measures on (X, X). For any non-negative integers j, k, and n such that j ≤ k and any sequence y0:n ∈ Y^{n+1} such that Lν,n[y0:n] > 0 and Lν′,n[y0:n] > 0,
Along the same lines, we can compare the posterior distribution of the state Xk
given observations Yj:n for different values of j. To avoid introducing new notations,
we will simply denote these conditional distributions by Pν ( Xk ∈ · | Yj:n = yj:n ).
\[ \mathrm{P}_\nu(X_k \in \cdot \mid X_j, Y_{j:n}) = \mathrm{P}_\nu(X_k \in \cdot \mid X_j, Y_{0:n}) . \]
Using (3.25) as well, we thus find that the difference between Pν(Xk ∈ · | Yj:n) and Pν(Xk ∈ · | Y0:n) may be expressed by
\[ \mathrm{E}_\nu[f(X_k) \mid Y_{j:n}] - \mathrm{E}_\nu[f(X_k) \mid Y_{0:n}] = (\tilde{\phi}_{\nu,j|n} - \phi_{\nu,j|n}) \prod_{i=j+1}^{k} F_{i-1|n}[Y_{i:n}]\, f . \]
Proceeding as in Proposition 57, we may thus derive a bound on the total variation distance between these probability measures.
Proposition 58. For any integers j, k, and n such that 0 ≤ j ≤ k and any probability measure ν on (X, X),
\[ \|\mathrm{P}_\nu(X_k \in \cdot \mid Y_{0:n}) - \mathrm{P}_\nu(X_k \in \cdot \mid Y_{j:n})\|_{\mathrm{TV}} \leq 2\, \delta\!\left( \prod_{i=j+1}^{k} F_{i-1|n}[Y_{i:n}] \right) . \tag{3.28} \]
We first show that under this condition, one may derive a non-trivial upper
bound on the Dobrushin coefficient of the forward smoothing kernels.
(i) For any non-negative integers k and n such that k < n and x ∈ X,
\[ \prod_{j=k+1}^{n} \varsigma^-(y_j) \leq \beta_{k|n}[y_{k+1:n}](x) \leq \prod_{j=k+1}^{n} \varsigma^+(y_j) . \tag{3.30} \]
(ii) For any non-negative integers k and n such that k < n and any probability measures ν and ν′ on (X, X),
\[ \frac{\varsigma^-(y_{k+1})}{\varsigma^+(y_{k+1})} \leq \frac{\int_{\mathsf{X}} \nu(dx)\, \beta_{k|n}[y_{k+1:n}](x)}{\int_{\mathsf{X}} \nu'(dx)\, \beta_{k|n}[y_{k+1:n}](x)} \leq \frac{\varsigma^+(y_{k+1})}{\varsigma^-(y_{k+1})} . \]
(iii) For any non-negative integers k and n such that k < n, there exists a transition kernel λk,n from $(\mathsf{Y}^{n-k}, \mathcal{Y}^{\otimes(n-k)})$ to (X, X) such that for any x ∈ X, A ∈ X, and yk+1:n ∈ Y^{n−k},
\[ \frac{\varsigma^-(y_{k+1})}{\varsigma^+(y_{k+1})}\, \lambda_{k,n}(y_{k+1:n}, A) \leq F_{k|n}[y_{k+1:n}](x, A) \leq \frac{\varsigma^+(y_{k+1})}{\varsigma^-(y_{k+1})}\, \lambda_{k,n}(y_{k+1:n}, A) . \tag{3.31} \]
(iv) For any non-negative integers k and n, the Dobrushin coefficient of the forward smoothing kernel Fk|n[yk+1:n] satisfies
\[ \delta(F_{k|n}[y_{k+1:n}]) \leq \begin{cases} \rho_0(y_{k+1}) & k < n , \\ \rho_1 & k \geq n , \end{cases} \]
from above and below by ς+(y) and ς−(y), respectively. Part (i) then follows from (2.16).
Next, (2.19) shows that
\[ \int \nu(dx)\, \beta_{k|n}[y_{k+1:n}](x) = \iint \nu(dx)\, Q(x, dx_{k+1})\, g(x_{k+1}, y_{k+1})\, \beta_{k+1|n}[y_{k+2:n}](x_{k+1}) , \]
and similarly a lower bound, with ς−(yk+1) rather than ς+(yk+1), holds too. These bounds are independent of ν, and (ii) follows.
We turn to part (iii). Using the definition (2.30), the forward kernel Fk|n[yk+1:n] may be expressed as
\[ F_{k|n}[y_{k+1:n}](x, A) = \frac{\int_A Q(x, dx_{k+1})\, g(x_{k+1}, y_{k+1})\, \beta_{k+1|n}[y_{k+2:n}](x_{k+1})}{\int_{\mathsf{X}} Q(x, dx_{k+1})\, g(x_{k+1}, y_{k+1})\, \beta_{k+1|n}[y_{k+2:n}](x_{k+1})} . \]
Finally, part (iv) for k < n follows from part (iii) and Lemma 51. In the opposite case, recall from (2.31) that Fk|n = Q for indices k ≥ n. Integrating (3.29) with respect to µ and using $\int g(x, y)\, \mu(dy) = 1$, we find that for any A ∈ X and any x ∈ X,
\[ Q(x, A) \geq \int \varsigma^-(y)\, K(y, A)\, \mu(dy) = \int \varsigma^-(y)\, \mu(dy) \times \frac{\int \varsigma^-(y)\, K(y, A)\, \mu(dy)}{\int \varsigma^-(y)\, \mu(dy)} , \]
where the ratio on the right-hand side is a probability measure. The proof of part (iv) again follows from Lemma 51.
The final part of the above lemma shows that under Assumption 59, the Dobrushin coefficient of the transition kernel Q satisfies δ(Q) ≤ 1 − ε for some ε > 0. This is in fact a rather stringent assumption, which fails to be satisfied in many of the examples considered in Chapter ??. When X is finite, this condition is satisfied if Q(x, x′) ≥ ε for any (x, x′) ∈ X × X. When X is countable, δ(Q) < 1 is satisfied under the Doeblin condition 49 with n = 1. When X ⊆ R^d or, more generally, is a topological space, δ(Q) < 1 typically requires that X is compact, which is, admittedly, a serious limitation.
Proposition 61. Under Assumption 59, the following hold true.
(i) For any non-negative integers k and n and any probability measures ν and ν 0
on (X, X ),
(iii) For any non-negative integers j, k, and n such that j ≤ k and any probability
measure ν on (X, X ),
Proof. Using Lemma 60(iv) and Proposition 48, we find that for j ≤ k,
δ(Fj|n[yj+1:n] ··· Fk|n[yk+1:n]) ≤ ∏_{i=j∧n+1}^{k∧n} ρ0(yi) × ρ1^{k−j−(k∧n−j∧n)} .
Parts (i) and (iii) then follow from Propositions 57 and 58, respectively. Next we note that (3.20) shows that
φν,0|n[y0:n] = B(β0|n[y1:n](·), B[g(·, y0), ν]) .
Apply Lemma 56 twice to this form to arrive at a bound on the total variation norm of the difference φν,0|n[y0:n] − φν′,0|n[y0:n] given by
The situation is elementary when the factors of this product are (non-trivially)
upper-bounded uniformly with respect to the observations Y0:n . To obtain such
bounds, we consider the following strengthening of the strong mixing condition,
first introduced by Atar and Zeitouni (1997).
Assumption 62 (Strong Mixing Reinforced). (i) There exist two positive real num-
bers σ − and σ + and a probability measure κ on (X, X ) such that for any x ∈ X
and A ∈ X ,
σ− κ(A) ≤ Q(x, A) ≤ σ+ κ(A) .
(ii) For all y ∈ Y, 0 < ∫_X κ(dx) g(x, y) < ∞.
It is easily seen that this implies Assumption 59.
Lemma 63. Assumption 62 implies Assumption 59 with ς−(y) = σ− ∫_X κ(dx) g(x, y), ς+(y) = σ+ ∫_X κ(dx) g(x, y), and
K(y, A) = [∫_A κ(dx) g(x, y)] / [∫_X κ(dx) g(x, y)] .
In particular, ς − (y)/ς + (y) = σ − /σ + for any y ∈ Y.
Proof. The proof follows immediately upon observing that
σ− ∫_A κ(dx′) g(x′, y) ≤ ∫_A Q(x, dx′) g(x′, y) ≤ σ+ ∫_A κ(dx′) g(x′, y) .
(ii) For any non-negative integer n and any probability measures ν and ν′ on (X, X) such that ∫ ν(dx0) g(x0, y0) > 0 and ∫ ν′(dx0) g(x0, y0) > 0,
(iii) For any non-negative integers j, k, and n such that j ≤ k and any probability
measure ν on (X, X ),
Thus, under Assumption 62 the filter and the smoother forget their initial condi-
tions exponentially fast, uniformly with respect to the observations. This property,
which holds under rather stringent assumptions, plays a key role in the sequel (see
for instance Chapters ?? and 6).
Of course, the product (3.33) can be shown to vanish asymptotically under conditions that are less stringent than Assumption 62. A straightforward adaptation of Lemma 63 shows that the following result holds.
Lemma 65. Assume Assumption 59 holds and that there exist a set C ∈ Y and constants 0 < σ− ≤ σ+ < ∞ satisfying µ(C) > 0 and, for all y ∈ C, σ− ≤ ς−(y) ≤ ς+(y) ≤ σ+. Then ρ0(y) ≤ 1 − σ−/σ+ for all y ∈ C, ρ1 ≤ 1 − σ− µ(C), and
∏_{i=j∧n+1}^{k∧n} ρ0(Yi) × ρ1^{k−j−(k∧n−j∧n)} ≤ (1 − σ−/σ+)^{∑_{i=j∧n+1}^{k∧n} 1_C(Yi)} (1 − σ− µ(C))^{k−j−(k∧n−j∧n)} . (3.34)
In words, forgetting is guaranteed to occur when {Yk } visits a given set C in-
finitely often in the long run. Of course, such a property cannot hold true for all
possible sequences of observations but it may hold with probability one under appro-
priate assumptions on the law of {Yk }, assuming in particular that the observations
are distributed under the model, perhaps with a different initial distribution ν⋆.
To answer whether this happens or not requires additional results from the general
theory of Markov chains, and we postpone this discussion to Section 7.3 (see in
particular Proposition 208 on the recurrence of the joint chain in HMMs).
Example 66. This example was first discussed by Kaijser (1975) and recently worked out by Chigansky and Liptser (2004). Let {Xk} be a Markov chain on X = {0, 1, 2, 3}, defined by the recurrence equation Xk = (Xk−1 + Uk) mod 4, where {Uk} is an i.i.d. binary sequence with P(Uk = 0) = p and P(Uk = 1) = 1 − p for some 0 < p < 1. For any (x, x′) ∈ X × X, Q^4(x, x′) > 0, which implies that δ(Q^4) < 1 and, by Theorem 54, that the chain is uniformly geometrically ergodic.
The observations {Yk } are a deterministic binary function of the chain, namely
Yk = 1{0,2} (Xk ) .
In the above example, the kernel Q does not satisfy Assumption 62 with m = 1 (one-step minorization), but the condition is verified for a power Q^m (here for m = 4). This situation is the rule rather than the exception. In particular, a Markov chain on a finite state space has a unique invariant probability measure and is ergodic if and only if there exists an integer m > 0 such that Q^m(x, x′) > 0 for all (x, x′) ∈ X × X (but the condition may not hold for m = 1). This suggests
considering the following assumption (see for instance Del Moral, 2004, Chapter 4).
Assumption 67.
(i) There exist an integer m, two positive real numbers σ− and σ+, and a probability measure κ on (X, X) such that for any x ∈ X and A ∈ X,
σ− κ(A) ≤ Q^m(x, A) ≤ σ+ κ(A) .
(ii) There exist two measurable functions g− and g+ from Y to (0, ∞) such that for any y ∈ Y and x ∈ X,
g−(y) ≤ g(x, y) ≤ g+(y) .
Compared to Assumption 62, the condition on the transition kernel has been
weakened, but at the expense of strengthening the assumption on the function g.
Note in particular that part (ii) is not satisfied in Example 66.
Using (3.17) and writing k = jm + r with 0 ≤ r < m, we may express φν,k|n[y0:n] as
φν,k|n[y0:n] = φν,0|n[y0:n] ∏_{u=0}^{j−1} ( ∏_{i=um}^{(u+1)m−1} Fi|n[yi+1:n] ) ∏_{i=jm}^{k−1} Fi|n[yi+1:n] .
This implies, using Corollary 47, that for any probability measures ν and ν′ on (X, X) and any sequence y0:n satisfying Lν,n[y0:n] > 0 and Lν′,n[y0:n] > 0, the total variation distance between φν,k|n[y0:n] and φν′,k|n[y0:n] is controlled by the Dobrushin coefficients of the blocks of m consecutive forward kernels appearing in this product. This expression suggests computing a bound on δ(∏_{i=um}^{um+m−1} Fi|n[yi+1:n]) rather than a bound on δ(Fi|n). The following result shows that such a bound can be derived under Assumption 67.
Lemma 68. Under Assumption 67, the following hold true.
(i) For any non-negative integers k and n such that k < n and x ∈ X,
∏_{j=k+1}^{n} g−(yj) ≤ βk|n[yk+1:n](x) ≤ ∏_{j=k+1}^{n} g+(yj) . (3.37)
(iii) For any non-negative integers u and n such that 0 ≤ u < ⌊n/m⌋, there exists a transition kernel λu,n from (Y^{n−(u+1)m}, Y⊗(n−(u+1)m)) to (X, X) such that for any x ∈ X, A ∈ X, and yum+1:n ∈ Y^{n−um},
(σ−/σ+) ∏_{i=um+1}^{(u+1)m} [g−(yi)/g+(yi)] λu,n(y(u+1)m+1:n, A) ≤ ∏_{i=um}^{(u+1)m−1} Fi|n[yi+1:n](x, A) ≤ (σ+/σ−) ∏_{i=um+1}^{(u+1)m} [g+(yi)/g−(yi)] λu,n(y(u+1)m+1:n, A) . (3.38)
Proof. Part (i) can be proved using an argument similar to the one used for Lemma 60(i). Next notice that for 0 ≤ u < ⌊n/m⌋,
Under Assumption 67, dropping the dependence on the ys for notational simplicity, the right-hand side of this display is bounded from above by
∏_{i=um+1}^{(u+1)m} g+(yi) ∫···∫ ∏_{i=um+1}^{(u+1)m} Q(x_{i−1}, dx_i) β_{(u+1)m|n}(x_{(u+1)m}) ≤ σ+ ∏_{i=um+1}^{(u+1)m} g+(yi) ∫ β_{(u+1)m|n}(x_{(u+1)m}) κ(dx_{(u+1)m}) .
Moreover,
∏_{i=um}^{(u+1)m−1} Fi|n(x, A) = [∫···∫ ∏_{i=um+1}^{(u+1)m} Q(x_{i−1}, dx_i) g(x_i, yi) 1_A(x_{(u+1)m}) β_{(u+1)m|n}(x_{(u+1)m})] / [∫···∫ ∏_{i=um+1}^{(u+1)m} Q(x_{i−1}, dx_i) g(x_i, yi) β_{(u+1)m|n}(x_{(u+1)m})] ,
and combining the two bounds above yields
∏_{i=um}^{(u+1)m−1} Fi|n(x, A) ≤ (σ+/σ−) ∏_{i=um+1}^{(u+1)m} [g+(yi)/g−(yi)] × [∫_A β_{(u+1)m|n}(x) κ(dx)] / [∫_X β_{(u+1)m|n}(x) κ(dx)] .
We define λu,n as the second ratio of this expression. Again a corresponding lower
bound is obtained similarly, proving part (iii).
Part (iv) follows from part (iii) and Lemma 51.
Using this result together with (3.36), we may obtain statements analogous to
Proposition 61. In particular, if there exist positive real numbers γ − and γ + such
that for all y ∈ Y,
γ − ≤ g − (y) ≤ g + (y) ≤ γ + ,
then the smoothing and the filtering distributions both forget the initial distribution uniformly.
Assumptions 62 and 67 are still restrictive and fail to hold in many interesting
situations. In both cases, we assume that either the one-step or the m-step transition
kernel is uniformly bounded from above and below. The following weaker condition
is a first step toward handling more general settings.
Assumption 69.
(i) There exist a set C ∈ X and two positive real numbers σ− and σ+ such that for all x ∈ C and x′ ∈ X,
σ− ≤ qκ(x, x′) ≤ σ+ .
(iii) There exists a (non-identically null) function α : Y → [0, 1] such that for any (x, x′) ∈ X × X and y ∈ Y,
∫_C ρ[x, x′; y](x″) κ(dx″) ≥ α(y) ∫_X ρ[x, x′; y](x″) κ(dx″) , where
ρ[x, x′; y](x″) def= qκ(x, x″) g(x″, y) qκ(x″, x′) . (3.40)
Part (i) of this assumption implies that the set C is 1-small for the kernel Q (see Definition 155). It is shown in Section 7.2.2 that such small sets do exist under conditions that are weak and generally simple to check. Assumption 69 is trivially satisfied under Assumption 62 using the whole state space X as the set C: in that case, there exists a transition density function qκ(x, x′) that is bounded from above and below for all (x, x′) ∈ X².
the hidden chain is not uniformly ergodic. One such example, first addressed by
Budhiraja and Ocone (1997), is a Markov chain observed in noise with bounded
support.
Example 70 (Markov Chain in Bounded Noise).
(ii) The transition density satisfies q(x, x′) > 0 for all (x, x′), and there exist a positive constant A, a probability density h, and positive constants σ− and σ+ such that for all x ∈ C = [−A − M, A + M],
σ− h(x′) ≤ q(x, x′) ≤ σ+ h(x′) .
The results below can readily be extended to cover the case Yk = ψ(Xk ) + Vk ,
provided that the level sets {x ∈ R : |ψ(x)| ≤ K} of the function ψ are compact.
This is equivalent to requiring |ψ(x)| → ∞ as |x| → ∞. Likewise extensions to
multivariate states and/or observations are obvious.
Under (ii), Assumption 69(i) is satisfied with C as above and κ(dx) = h(x) dx. Denote by φ the probability density of the random variables Vk; then g(x, y) = φ(y − x). The density φ may be chosen such that supp φ ⊆ [−M, +M], so that g(x, y) > 0 if and only if x ∈ [y − M, y + M]. To verify Assumption 69(iii), put Γ = [−A, A]. For y ∈ Γ, we then have g(x, y) = 0 if x ∉ [−A − M, A + M], and thus
∫ q(x, x″) g(x″, y) q(x″, x′) dx″ = ∫_{−A−M}^{A+M} q(x, x″) g(x″, y) q(x″, x′) dx″ .
This implies that for all (x, x′) ∈ X × X,
[∫_C q(x, x″) g(x″, y) q(x″, x′) dx″] / [∫_X q(x, x″) g(x″, y) q(x″, x′) dx″] = 1 .
The bounded noise case is of course very specific, because an observation Yk allows
locating the corresponding state Xk within a bounded set.
Under Assumption 69, the lemma below establishes that the set C is a 1-small set for the forward transition kernels Fk|n[yk+1:n] and that it is also uniformly accessible from the whole space X (for the same kernels).
Lemma 71. Under Assumption 69, the following hold true.
(i) For any initial probability measure ν on (X, X) and any sequence y0:n ∈ Y^{n+1} satisfying ∫_C ν(dx0) g(x0, y0) > 0,
Lν,n(y0:n) > 0 .
(ii) For any non-negative integers k and n such that k < n and any y0:n ∈ Y^{n+1}, the set C is a 1-small set for the transition kernels Fk|n. Indeed, there exists a transition kernel λk,n from (Y^{n−k}, Y⊗(n−k)) to (X, X) such that for all x ∈ C, yk+1:n ∈ Y^{n−k}, and A ∈ X,
Fk|n[yk+1:n](x, A) ≥ (σ−/σ+) λk,n[yk+1:n](A) .
(iii) For any non-negative integers k and n such that n ≥ 2 and k < n − 1, and any yk+1:n ∈ Y^{n−k},
inf_{x∈X} Fk|n[yk+1:n](x, C) ≥ α(yk+1) .
Proof. Write
Lν,n(y0:n) = ∫···∫ ν(dx0) g(x0, y0) ∏_{i=1}^{n} Q(x_{i−1}, dx_i) g(x_i, yi)
≥ ∫···∫ ν(dx0) g(x0, y0) ∏_{i=1}^{n} Q(x_{i−1}, dx_i) g(x_i, yi) 1_C(x_{i−1})
≥ ∫_C ν(dx0) g(x0, y0) (σ−)^n ∏_{i=1}^{n} ∫_C g(x_i, yi) κ(dx_i) ,
showing part (i). The proof of (ii) is similar to that of Lemma 60(iii). For (iii),
write
Under Assumption 69, Φ(x, x′; y) ≥ α(y) for all (x, x′) ∈ X × X and y ∈ Y, which
concludes the proof.
The corollary below then shows that the whole set X is a 1-small set for the
composition Fk|n [yk+1:n ]Fk+1|n [yk+2:n ]. This generalizes a well-known result for
homogeneous Markov chains (see Proposition 157).
Corollary 72. Under Assumption 69, for any non-negative integers k and n with k ≤ n, any probability measures ν and ν′ on (X, X), and any sequence y0:n such that Lν,n(y0:n) > 0 and Lν′,n(y0:n) > 0,
‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖_TV ≤ 2 ∏_{j=0}^{⌊k/2⌋−1} (1 − (σ−/σ+) α(y_{2j+1})) .
Proof. Because of Lemma 71(i), we may use the decomposition in (3.26) with j = 0, bounding the total variation distance by 2, to obtain
‖φν,k|n[y0:n] − φν′,k|n[y0:n]‖_TV ≤ 2 ∏_{j=0}^{k−1} δ(Fj|n[yj+1:n]) .
Corollary 72 is only useful in cases where the function α is such that the obtained bound indeed decreases as k and n grow. In Example 70, one could set α(y) = 1_Γ(y) for an interval Γ. In such a case, it suffices that the joint chain {Xk, Yk}k≥0 be recurrent under Pν⋆ (which was the case in Example 70) to guarantee that 1_Γ(Yk) equals one infinitely often and thus that ‖φν,k|n[Y0:n] − φν′,k|n[Y0:n]‖_TV tends to zero Pν⋆-almost surely as k, n → ∞. The following example illustrates a slightly more complicated situation in which Assumption 69 still holds.
Example 73. Consider the model
Xk+1 = φXk + Uk , X0 ∼ ν ,
Yk = Xk + Vk ,
where
(i) {Uk }k≥0 is an i.i.d. sequence of random variables with Laplace (double expo-
nential) distribution with scale parameter λ;
(ii) {Vk}k≥0 is an i.i.d. sequence of Gaussian random variables with zero mean and variance σ².
We will see below that the fact that the tails of the Xs are heavier than the tails of the observation noise is important for the derivations that follow. It is assumed that |φ| < 1, which implies that the chain {Xk} is positive recurrent, that is, admits a single invariant probability measure π. It may be shown (see Chapter 7) that although the Markov chain {Xk} is geometrically ergodic, that is, ‖Q^n(x, ·) − π‖_TV → 0 geometrically fast, it is not uniformly ergodic as lim inf_{n→∞} sup_{x∈R} ‖Q^n(x, ·) − π‖_TV > 0. We will nevertheless see that the forward smoothing kernel is uniformly geometrically ergodic.
Under the stated assumptions,
q(x, x′) = (λ/2) exp(−λ|x′ − φx|) ,
g(x, y) = (2πσ²)^{−1/2} exp(−(y − x)²/(2σ²)) .
Here we set, for some M > 0 to be specified later, C = [−M − 1/2, M + 1/2], and we let y ∈ [−1/2, +1/2]. Note that
[∫_{−M−1/2}^{M+1/2} exp(−λ|u − φx| − (y − u)²/2σ² − λ|x′ − φu|) du] / [∫_{−∞}^{∞} exp(−λ|u − φx| − (y − u)²/2σ² − λ|x′ − φu|) du]
≥ [∫_{−M}^{M} exp(−λ|u − x| − u²/2σ² − φλ|x′ − u|) du] / [∫_{−∞}^{∞} exp(−λ|u − x| − u²/2σ² − φλ|x′ − u|) du] ,
and to show Assumption 69(iii) it suffices to show that the right-hand side is bounded from below. This in turn is equivalent to showing that sup_{(x,x′)∈R×R} R(x, x′) < 1, where
R(x, x′) = [(∫_{−∞}^{−M} + ∫_{M}^{∞}) exp(−α|u − x| − βu² − γ|x′ − u|) du] / [∫_{−∞}^{∞} exp(−α|u − x| − βu² − γ|x′ − u|) du] . (3.41)
∫_{−M}^{M} exp(−α(x − u) − βu² − γ|u − x′|) du ≥ e^{−2γM} e^{−αx} ∫_{−M}^{M} exp(−βu² + αu) du .
We consider the case M ≤ x ≤ x′; the other case can be handled similarly. The denominator in (3.41) is then bounded by
e^{−αx−γx′} ∫_{−M}^{M} exp(−βu² + (α + γ)u) du .
The use of Monte Carlo methods for non-linear filtering can be traced back to the
pioneering contributions of Handschin and Mayne (1969) and Handschin (1970).
These early attempts were based on sequential versions of the importance sampling
paradigm, a technique that amounts to simulating samples under an instrumental
distribution and then approximating the target distributions by weighting these
samples using appropriately defined importance weights. In the non-linear filtering
context, importance sampling algorithms can be implemented sequentially in the sense that, by carefully defining a sequence of instrumental distributions, there is no need to regenerate the population of samples from scratch upon the arrival of each new observation. This algorithm is called sequential importance sampling,
often abbreviated SIS. Although the SIS algorithm has been known since the early
1970s, its use in non-linear filtering problems was rather limited at that time. Most
likely, the available computational power was then too limited to allow convincing
applications of these methods. Another less obvious reason is that the SIS algorithm
suffers from a major drawback that was not clearly identified and properly cured
until the seminal paper by Gordon et al. (1993). As the number of iterations
increases, the importance weights tend to degenerate, a phenomenon known as
sample impoverishment or weight degeneracy. Basically, in the long run most of the
samples have very small normalized importance weights and thus do not significantly
contribute to the approximation of the target distribution. The solution proposed
by Gordon et al. (1993) is to allow rejuvenation of the set of samples by duplicating
the samples with high importance weights and, on the contrary, removing samples
with low weights.
The particle filter of Gordon et al. (1993) was the first successful application of
sequential Monte Carlo techniques to the field of non-linear filtering. Since then,
sequential Monte Carlo (or SMC) methods have been applied in many different
fields including computer vision, signal processing, control, econometrics, finance,
robotics, and statistics (Doucet et al., 2001; Ristic et al., 2004). This chapter reviews
the basic building blocks that are needed to implement a sequential Monte Carlo
algorithm, starting with concepts related to the importance sampling approach.
More specific aspects of sequential Monte Carlo techniques will be further discussed
in Chapter ??, while convergence issues will be dealt with in Chapter ??.
This quantity is obviously free from any scale factor in dµ/dν. The self-normalized importance sampling estimator µ̂IS_{ν,N}(f) is defined as a ratio of the sample means of the functions f1 = f × (dµ/dν) and f2 = dµ/dν. The strong law of large numbers thus implies that N^{−1} ∑_{i=1}^{N} f1(ξ^i) and N^{−1} ∑_{i=1}^{N} f2(ξ^i) converge almost surely to µ(f) and ν(dµ/dν) = 1, respectively, showing that µ̂IS_{ν,N}(f) is a consistent estimator of µ(f). Again, more precise results on the behavior of this estimator will be given in Chapter ??. In the following, the term importance sampling usually refers to the self-normalized form (4.3) of the importance sampling estimate.
Figure 4.1: Principle of resampling. Top plot: the sample drawn from ν with associated normalized importance weights depicted by bullets with radii proportional to the normalized weights (the target density corresponding to µ is plotted in solid line). Bottom plot: after resampling, all points have the same importance weight, and some of them have been duplicated (M = N = 7).
There are several ways of implementing this basic idea, the most obvious approach being sampling with replacement with the probability of sampling each ξ̃^i equal to its importance weight ω^i. Hence the number of times N^i each particular point ξ̃^i in the first-stage sample is selected follows a binomial Bin(N, ω^i) distribution.
The vector (N 1 , . . . , N M ) is distributed from Mult(N, ω 1 , . . . , ω M ), the multinomial
distribution with parameter N and probabilities of success (ω 1 , . . . , ω M ). In this
resampling step, the points in the first-stage sample that are associated with small
normalized importance weights are most likely to be discarded, whereas the best
points in the sample are duplicated in proportion to their importance weights. In
most applications, it is typical to choose M , the size of the first-stage sample, larger
(and sometimes much larger) than N . The SIR algorithm is summarized below.
Resampling:
• Draw, conditionally independently given (ξ̃^1, ..., ξ̃^M), N discrete random variables (I^1, ..., I^N) taking values in the set {1, ..., M} with probabilities (ω^1, ..., ω^M), i.e.,
P(I^1 = j) = ω^j , j = 1, ..., M . (4.5)
• Set ξ^i = ξ̃^{I^i} for i = 1, ..., N.
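To make the resampling step concrete, here is a minimal Python sketch of multinomial resampling; the function name and the NumPy array representation of the particle set are our own conventions, not part of the algorithm statement above.

```python
import numpy as np

def multinomial_resample(particles, weights, n, rng):
    """Draw n indices with replacement, with probabilities equal to the
    normalized importance weights, and return the selected particles."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize the weights
    idx = rng.choice(len(w), size=n, p=w)    # the trial (I^1, ..., I^N) of (4.5)
    return particles[idx]                    # equally weighted second-stage sample
```

With len(particles) = M larger than n, this realizes the two-stage SIR scheme; the number of times N^i that index i is drawn is binomial Bin(n, ω^i), as noted above.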
is a consistent estimator of µ(f) for all functions f satisfying µ(|f|) < ∞. The resampling step might thus be seen as a means to transform the weighted importance sampling estimate µ̂IS_{ν,M}(f) defined by (4.3) into an unweighted sample average.
Recall that N^i is the number of times that the element ξ̃^i is resampled. Rewriting
µ̂SIR_{ν,M,N}(f) = (1/N) ∑_{i=1}^{N} f(ξ^i) = ∑_{i=1}^{M} (N^i/N) f(ξ̃^i) ,
it is easily seen that the sample mean µ̂SIR_{ν,M,N}(f) of the SIR sample has, conditionally on the first-stage sample (ξ̃^1, ..., ξ̃^M), expectation equal to the importance sampling estimator µ̂IS_{ν,M}(f) defined in (4.3):
E[ µ̂SIR_{ν,M,N}(f) | ξ̃^1, ..., ξ̃^M ] = µ̂IS_{ν,M}(f) .
This conditional unbiasedness implies the variance decomposition
E[ (µ̂SIR_{ν,M,N}(f) − µ(f))² ] = E[ (µ̂SIR_{ν,M,N}(f) − µ̂IS_{ν,M}(f))² ] + E[ (µ̂IS_{ν,M}(f) − µ(f))² ] .
In this context, the kernels Rk will be referred to as the instrumental kernels. The
term importance kernel is also used. The following assumptions will be adopted in
the sequel.
Assumption 76 (Sequential Importance Sampling). 1. The target distribution φ0 is absolutely continuous with respect to the instrumental distribution ρ0.
2. For all k ≥ 0 and all x ∈ X, the measure T_k^u(x, ·) is absolutely continuous with respect to Rk(x, ·).
Then for any k ≥ 0 and any function fk ∈ Fb(X^{k+1}),
φ0:k|k(fk) = ∫···∫ fk(x0:k) (dφ0/dρ0)(x0) { ∏_{l=0}^{k−1} [dT_l^u(x_l, ·)/dR_l(x_l, ·)](x_{l+1}) } ρ0:k(dx0:k) , (4.9)
which implies that the target distribution φ0:k|k is absolutely continuous with re-
spect to the instrumental distribution ρ0:k with Radon-Nikodym derivative given
by
(dφ0:k|k/dρ0:k)(x0:k) = (dφ0/dρ0)(x0) ∏_{l=0}^{k−1} [dT_l^u(x_l, ·)/dR_l(x_l, ·)](x_{l+1}) . (4.10)
It is thus legitimate to use ρ0:k as an instrumental distribution to compute im-
portance sampling estimates for integrals with respect to φ0:k|k . Denoting by
ξ^1_{0:k}, ..., ξ^N_{0:k} N i.i.d. random sequences with common distribution ρ0:k, the importance sampling estimate of φ0:k|k(fk) for fk ∈ Fb(X^{k+1}) is defined as
φ̂IS_{0:k|k}(fk) = [∑_{i=1}^{N} ω_k^i fk(ξ^i_{0:k})] / [∑_{i=1}^{N} ω_k^i] , (4.11)
where the importance weights are computed recursively as
ω_0^i = (dφ0/dρ0)(ξ_0^i) for i = 1, ..., N , (4.12)
and, for k ≥ 0,
ω_{k+1}^i = ω_k^i [dT_k^u(ξ_k^i, ·)/dR_k(ξ_k^i, ·)](ξ_{k+1}^i) for i = 1, ..., N . (4.13)
Initialization: Draw ξ_0^1, ..., ξ_0^N i.i.d. from ρ0 and compute the initial importance weights
ω_0^i = g0(ξ_0^i) (dν/dρ0)(ξ_0^i) for i = 1, ..., N .
Recursion: For k = 0, 1, ...,
• Draw (ξ_{k+1}^1, ..., ξ_{k+1}^N) conditionally independently given {ξ_{0:k}^j, j = 1, ..., N} from the distribution ξ_{k+1}^i ~ Rk(ξ_k^i, ·). Append ξ_{k+1}^i to ξ_{0:k}^i to form ξ_{0:k+1}^i = (ξ_{0:k}^i, ξ_{k+1}^i).
• Compute the updated importance weights
ω_{k+1}^i = ω_k^i × g_{k+1}(ξ_{k+1}^i) [dQ(ξ_k^i, ·)/dRk(ξ_k^i, ·)](ξ_{k+1}^i) , i = 1, ..., N .
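The recursion lends itself to a compact implementation. The sketch below is a minimal Python rendering of the SIS loop under simplifying assumptions of ours: the state is scalar, the initial weights are constant (ρ0 = ν and g0 ≡ 1), and the user supplies the instrumental sampler and the incremental weight function.

```python
import numpy as np

def sis(observations, n, rho0_sample, r_sample, incr_weight, rng):
    """Sequential importance sampling. r_sample(xi, k, rng) draws from
    R_k(xi, .); incr_weight(xi_prev, xi_new, y) evaluates
    g_{k+1}(xi_new) * dQ(xi_prev, .)/dR_k(xi_prev, .)(xi_new)."""
    xi = rho0_sample(n, rng)                  # xi_0^i ~ rho_0
    w = np.ones(n)                            # omega_0^i (constant by assumption)
    for k, y in enumerate(observations):
        xi_new = r_sample(xi, k, rng)         # propagate each particle
        w = w * incr_weight(xi, xi_new, y)    # weight update as in (4.13)
        xi = xi_new
    return xi, w / w.sum()                    # particles and normalized weights
```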
Figure 4.2: Principle of sequential importance sampling (SIS). Upper plot: the curve represents the filtering distribution, and the particles with weights are represented along the axis by bullets, the radii of which are proportional to the normalized weight of the particle. Middle plot: the instrumental distribution with resampled particle positions. Bottom plot: filtering distribution at the next time index with updated particle weights. The case depicted here corresponds to the choice Rk = Q.
Prior Kernel
The first obvious and often very simple choice of instrumental kernel Rk is that
of setting Rk = Q (irrespectively of k). In that case, the instrumental kernel
simply corresponds to the prior distribution of the new state in the absence of the
corresponding observation. The incremental weight then simplifies to
[dT_k^u(x, ·)/dQ(x, ·)](x′) = (Lk/Lk+1) g_{k+1}(x′) ∝ g_{k+1}(x′) for all (x, x′) ∈ X² . (4.14)
A distinctive feature of the prior kernel is that the incremental weight in (4.14)
does not depend on x, that is, on the previous position. The use of the prior kernel
Rk = Q is popular because sampling from the prior kernel Q is often straightforward,
and computing the incremental weight simply amounts to evaluating the conditional
likelihood of the new observation given the current particle position. The prior
kernel also satisfies the minimal requirement of importance sampling as stated in
Assumption 76. In addition, because the importance function reduces to gk+1 , it is
upper-bounded as soon as one can assume that supx∈X,y∈Y g(x, y) is finite, which
(often) is a very mild condition (see also Section ??). Despite these appealing
properties, the use of the prior kernel can sometimes lead to poor performance,
often manifesting itself as a lack of robustness with respect to the values taken by
the observed sequence {Yk }k≥0 . The following example illustrates this problem in
a very simple situation.
Example 78 (Noisy AR(1) Model). Consider a scalar Gaussian AR(1) process observed in additive Gaussian noise.
The last observation is located 20 standard deviations away from the mean (zero) of the stationary distribution, which definitely corresponds to an aberrant value from the model's point of view. In a practical situation, however, we would of course like to be able to handle data that does not necessarily come from the model under consideration. Note also that in this toy example, one can evaluate the exact
smoothing distributions by means of the Kalman filtering recursion discussed in
Section ??.
Figure 4.3 displays box and whisker plots for the SIS estimate of the posterior
mean of the final state X5 as a function of the number N of particles when using
the prior kernel. These plots have been obtained from 125 independent replications
of the SIS algorithm. The vertical line corresponds to the true posterior mean of
X5 given Y0:5 , computed using the Kalman filter. The figure shows that the SIS
algorithm with the prior kernel grossly underestimates the values of the state even
when the number of particles is very large. This is a case where there is a conflict be-
tween the prior distribution and the posterior distribution: under the instrumental
distribution, all particles are proposed in a region where the conditional likelihood
function g5 is extremely low. In that case, the renormalization of the weights used
to compute the filtered mean estimate according to (4.11) may even have unexpect-
edly adverse consequences: a weight close to 1 does not necessarily correspond to
a simulated value that is important for the distribution of interest. Rather, it is
a weight that is large relative to other, even smaller weights (of particles even less
important for the filtering distribution). This is a logical consequence of the fact
that the weights must sum to one.
and Akashi and Kumamoto (1977) and is largely adopted by authors such as Liu
and Chen (1995), Chen and Liu (2000), Doucet et al. (2000), Doucet et al. (2001)
and Tanizaki (2003). The word “optimal” is somewhat misleading, and we refer to
Chapter ?? for a more precise discussion of optimality of the instrumental distribu-
tion in the context of importance sampling (which generally has to be defined for
a specific choice of the function f of interest). The main property of Tk as defined
in (4.15) is that
[dT_k^u(x, ·)/dT_k(x, ·)](x′) = (Lk/Lk+1) γk(x) ∝ γk(x) for (x, x′) ∈ X² , (4.16)
Equation (4.16) means that the incremental weight in (4.13) now depends on the
previous position of the particle only (and not on the new position proposed at index
k + 1). This is the exact opposite of the situation observed previously for the prior
kernel. The optimal kernel (4.15) is attractive because it incorporates information
both on the state dynamics and on the current observation: the particles move
“blindly” with the prior kernel, whereas they tend to cluster into regions where
the current local likelihood gk+1 is large when using the optimal kernel. There are
however two problems with using Tk in practice. First, drawing from this kernel
is usually not directly feasible. Second, calculation of the incremental importance
weight γk in (4.17) may be analytically intractable. Of course, the optimal kernel
takes a simple form with easy simulation and explicit evaluation of (4.17) in the
particular cases discussed in Chapter ??. It turns out that it can also be evaluated
for a slightly larger class of non-linear Gaussian state-space models, as soon as the
observation equation is linear (Zaritskii et al., 1975). Indeed, consider the state-
space model with non-linear state evolution equation
where
Γ_{k+1}(x) = B R(x) R^t(x) B^t + S S^t .
In other situations, sampling from the kernel Tk and/or computing the normalizing
constant γk is a difficult task. There is no general recipe to solve this problem, but
rather a set of possible solutions that should be considered.
Example 79 (Noisy AR(1) Model, Continued). We consider the noisy AR(1) model
of Example 78 again using the optimal importance kernel, which corresponds to the
particular case where all variables are scalar and A and R are constant in (4.18)–
(4.19) above. Thus, the optimal instrumental transition density is given by
t_k(x, ·) = N( (σ_V² φx + σ_U² Y_k)/(σ_U² + σ_V²) , σ_U² σ_V²/(σ_U² + σ_V²) ) ,
and
γ_k(x) ∝ exp( −(Y_k − φx)² / (2(σ_U² + σ_V²)) ) .
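These two formulas make one SIS step with the optimal kernel for this model easy to implement. The Python sketch below is ours (a minimal sketch assuming scalar particles stored in a NumPy array); it draws from t_k(x, ·) and multiplies each weight by γ_k(x).

```python
import numpy as np

def ar1_optimal_step(xi, w, y, phi, su2, sv2, rng):
    """One SIS step for the noisy AR(1) model with the optimal kernel."""
    var = su2 * sv2 / (su2 + sv2)                      # variance of t_k(x, .)
    mean = (sv2 * phi * xi + su2 * y) / (su2 + sv2)    # mean of t_k(x, .)
    xi_new = mean + np.sqrt(var) * rng.standard_normal(xi.shape)
    w_new = w * np.exp(-0.5 * (y - phi * xi) ** 2 / (su2 + sv2))  # gamma_k, up to a constant
    return xi_new, w_new / w_new.sum()
```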
Figure 4.4 is the exact analog of Figure 4.3, also obtained from 125 independent
runs of the algorithm, for this new choice of instrumental kernel. The figure shows
that whereas the SIS estimate of posterior mean is still negatively biased, the op-
timal kernel tends to reduce the bias compared to the prior kernel. It also shows
that as soon as N = 400, there are at least some particles located around the true
filtered mean of the state, which means that the method should not get entirely lost
as subsequent new observations arrive.
To illustrate the advantages of the optimal kernel with respect to the prior kernel
graphically, we consider the model (4.18)–(4.19) again with φ = 0.9, σu2 = 0.4,
σv2 = 0.6, and (0, 2.6, 0.6) as observed series (of length 3). The initial distribution
is a mixture 0.6 N(−1, 0.3) + 0.4 N(1, 0.4) of two Gaussians, for which it is still
possible to evaluate the exact filtering distributions as the mixture of two Kalman
filters using, respectively, N(−1, 0.3) and N(1, 0.4) as the initial distribution of X0 .
We use only seven particles to allow for an interpretable graphical representation.
Figures 4.5 and 4.6 show the positions of the particles propagated using the prior
kernel and the optimal kernel, respectively. At time 1, there is a conflict between
the prior and the posterior as the observation does not agree with the particle
approximation of the predictive distribution. With the prior kernel (Figure 4.5),
the mass becomes concentrated on a single particle with several particles lost out
in the left tail of the distribution with negligible weights. In contrast, in Figure 4.6
most of the particles stay in high probability regions through the iterations with
several distinct particles having non-negligible weights. This is precisely because the
optimal kernel “pulls” particles toward regions where the current local likelihood
gk (x) = gk (x, Yk ) is large, whereas the prior kernel does not.
Accept-Reject Algorithm
Because drawing from the optimal kernel Tk is most often not feasible, a first nat-
ural idea consists in trying the accept-reject method (Algorithm ??), which is a
versatile approach to sampling from general distributions. To sample from the
optimal importance kernel Tk (x, ·) defined by (4.15), one needs an instrumental
kernel Rk(x, ·) from which it is easy to sample and such that there exists M satisfying [dQ(x, ·)/dRk(x, ·)](x′) gk(x′) ≤ M for all x ∈ X. Note that because it is generally impossible to evaluate the normalizing constant γk of Tk, we must resort here to the unnormalized version of the accept-reject algorithm. In the simplest case where the prior kernel is used as the instrumental kernel, Rk = Q, one may take M = sup_{x∈X} gk(x). The algorithm then consists in drawing ξ from the prior kernel Q(x, ·), U uniformly on [0, 1], and accepting the draw if U ≤ gk(ξ)/sup_{x∈X} gk(x). The acceptance rate of this algorithm is then given by
p(x) = [∫_X Q(x, dx′) gk(x′)] / [sup_{x′∈X} gk(x′)] .
Unfortunately, it is not always possible to design an importance kernel Rk (x, ·) that
is easy to sample from, for which the bound M is indeed finite, and such that the
acceptance rate p(x) is reasonably large.
Multivariate normal: fit the mean of the normal distribution to the mode of
tk (x, ·) and fit the covariance to minus the inverse of the Hessian of log tk (x, ·)
at the mode.
Multivariate t-distribution: fit the location and scale parameters as the mean
and covariance parameters in the normal case; the number of degrees of free-
dom is usually set arbitrarily (and independently of x) based on the arguments
discussed above.
We directly obtain
q(x, x′) = (2πσ²)^{−1/2} exp(−(x′ − φx)²/(2σ²)) ,
gk(x′) = (2πβ²)^{−1/2} exp( −(Yk²/(2β²)) exp(−x′) − x′/2 ) .
Simulating from the optimal transition kernel tk(x, x′) is difficult, but the function x′ ↦ log(q(x, x′) gk(x′)) is indeed (strictly) concave. The mode mk(x) of x′ ↦ tk(x, x′) is the unique solution of the non-linear equation
−(1/σ²)(x′ − φx) + (Yk²/(2β²)) exp(−x′) − 1/2 = 0 , (4.22)
which can be found using Newton iterations. Once at the mode, the (squared) scale σk²(x) is set as minus the inverse of the second-order derivative of x′ ↦ log(q(x, x′) gk(x′)) evaluated at the mode mk(x). The result is
σk²(x) = [ 1/σ² + (Yk²/(2β²)) exp(−mk(x)) ]^{−1} . (4.23)
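For concreteness, the following Python sketch (ours) performs the Newton search for mk(x) defined by (4.22) and returns the scale (4.23); the iteration count and the starting point (the prior mean φx) are our choices.

```python
import numpy as np

def sv_mode_scale(x, y, sigma2, phi, beta2, n_iter=20):
    """Newton iterations for the mode m_k(x) of x' -> log(q(x, x') g_k(x'))
    in the stochastic volatility model, then the squared scale (4.23)."""
    m = phi * x                                          # start at the prior mean
    c = y ** 2 / (2.0 * beta2)
    for _ in range(n_iter):
        grad = -(m - phi * x) / sigma2 + c * np.exp(-m) - 0.5  # left side of (4.22)
        hess = -1.0 / sigma2 - c * np.exp(-m)                  # second derivative (< 0)
        m -= grad / hess                                       # Newton update
    scale2 = 1.0 / (1.0 / sigma2 + c * np.exp(-m))             # (4.23)
    return m, scale2
```

Because the objective is strictly concave, the iteration converges for any starting point.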
Figure 4.7 shows a typical example of the type of fit that can be obtained
for the stochastic volatility model with this strategy using 1,000 particles. Note
that although the data used is the same as in Figure ??, the estimated distribu-
tions displayed in both figures are not directly comparable, as the MCMC method
in Figure ?? approximates the marginal smoothing distribution, whereas the se-
quential importance sampling approach used for Figure 4.7 provides a (recursive)
approximation to the filtering distributions.
When there is no easy way to implement the local linearization technique, a
natural idea explored by Doucet et al. (2000) and Van der Merwe et al. (2000)
consists in using classical non-linear filtering procedures to approximate tk . These
include in particular the so-called extended Kalman filter (EKF), which dates back
to the 1970s (Anderson and Moore, 1979, Chapter 10), as well as the unscented
Kalman filter (UKF) introduced by Julier and Uhlmann (1997)—see, for instance,
Ristic et al. (2004, Chapter 2) for a recent review of these techniques. We illustrate
below the use of the extended Kalman filter in the context of sequential importance
sampling.
We now consider the most general form of the state-space model with Gaussian noises:
X_{k+1} = a(X_k, U_k) , (4.24)
Y_k = b(X_k, V_k) , (4.25)
where a, b are vector-valued measurable functions. It is assumed that {Uk }k≥0 and
{Vk }k≥0 are independent white Gaussian noises. As usual, X0 is assumed to be
N(0, Σν ) distributed and independent of {Uk } and {Vk }. The extended Kalman
where
• R(x) is the dx × du matrix of partial derivatives of a(x, u) with respect to u
and evaluated at (x, 0),
It should be stressed that the measurement equation in (4.27) differs from (4.19)
in that it depends both on the current state Xk and on the previous one Xk−1 .
The approximate model specified by (4.26)–(4.27) thus departs from the HMM
assumptions. On the other hand, when conditioning on the value of Xk−1 , the
structure of both models, (4.18)–(4.19) and (4.26)–(4.27), are exactly similar. Hence
the posterior distribution of the state Xk given Xk−1 = x and Yk is a Gaussian
distribution with mean mk (x) and covariance matrix Γk (x), which can be evaluated
according to
Kk(x) = R(x) R^t(x) B^t(x) [ B(x) R(x) R^t(x) B^t(x) + S(x) S^t(x) ]^{−1} ,
mk(x) = a(x, 0) + Kk(x) { Yk − b[a(x, 0), 0] } ,
Γk(x) = [ I − Kk(x) B(x) ] R(x) R^t(x) .
The Gaussian distribution with mean mk (x) and covariance Γk (x) may then be used
as a proxy for the optimal transition kernel Tk (x, ·). To improve the robustness of the
method, it is safe to increase the variance, that is, to use cΓk (x) as the simulation
variance, where c is a scalar larger than one. A perhaps more recommendable
option consists in using as previously a proposal distribution with tails heavier
than the Gaussian, for instance, a multivariate t-distribution with location mk (x),
scale Γk (x), and four or five degrees of freedom.
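A sketch of the corresponding computation in Python follows. The callables a, b and the Jacobian factories B, R, S (evaluated at the linearization points as above) are assumptions supplied by the user for a particular model; the code only implements the algebra of the three displays above.

```python
import numpy as np

def ekf_proposal_moments(x, y, a, b, B, R, S):
    """Mean m_k(x) and covariance Gamma_k(x) of the EKF-type Gaussian
    approximation to the optimal kernel T_k(x, .)."""
    Bx, Rx, Sx = B(x), R(x), S(x)
    P = Rx @ Rx.T                                            # state-noise covariance term
    K = P @ Bx.T @ np.linalg.inv(Bx @ P @ Bx.T + Sx @ Sx.T)  # gain K_k(x)
    m = a(x, 0) + K @ (y - b(a(x, 0), 0))                    # proposal mean m_k(x)
    gamma = (np.eye(len(m)) - K @ Bx) @ P                    # proposal covariance Gamma_k(x)
    return m, gamma
```

In practice one would then sample from N(m, c·gamma) with c > 1, or from a t-distribution with these location and scale parameters, as recommended above.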
Example 81 (Growth Model). We consider the univariate growth model discussed by Kitagawa (1987) and Polson et al. (1992) given, in state-space form, by
Xk = a_{k−1}(X_{k−1}) + Uk , (4.28)
Yk = b Xk² + Vk , (4.29)
where {Uk}k≥0 and {Vk}k≥0 are independent white Gaussian noise processes and
a_{k−1}(x) = α0 x + α1 x/(1 + x²) + α2 cos[1.2(k − 1)] (4.30)
with α0 = 0.5, α1 = 25, α2 = 8, b = 0.05, and σv2 = 1 (the value of σu2 will be
discussed below). The initial state is known deterministically and set to X0 = 0.1.
This model is non-linear both in the state and in the measurement equation. Note that the form of the likelihood adds an interesting twist to the problem: the conditional likelihood function
gk(x) def= g(x; Yk) ∝ exp( −(b²/(2σv²)) (x² − Yk/b)² )
is, whenever Yk ≤ 0, maximal at x = 0, whereas for Yk > 0 it is bimodal with modes at ±(Yk/b)^{1/2}.
In Figure 4.8, the optimal kernel, the EKF approximation to the optimal kernel,
and the prior kernel for two different values of the state variance are compared.
This figure corresponds to the time index one, and Y1 is set to 6 (recall that the
initial state X0 is equal to 0.1). In the case where σu2 = 1 (left plot in Figure 4.8),
the prior distribution of the state, N(a0 (X0 ), σu2 ), turns out to be more informative
(more peaky, less diffuse) than the conditional likelihood g1 . In other words, the
observed Y1 does not carry a lot of information about the state X1 , compared to
the information provided by X0 ; this is because the measurement variance σv2 is
not small compared to σu2 . The optimal transition kernel, which does take Y1 into
account, is then very close to the prior kernel, and the differences between the three
kernels are minor. In such a situation, one should not expect much improvement
with the EKF approximation compared to the prior kernel.
In the case shown in the right plot of Figure 4.8 (σu² = 10), the situation is reversed. Now σv² is relatively small compared to σu², so that the information about X1 contained in g1 is large compared to that provided by the prior information on X0. This
is the kind of situation where we expect the optimal kernel to improve considerably
on the prior kernel. Indeed, because Y1 > 0, the optimal kernel is bimodal, with the
second mode far smaller than the first one (recall that the plots are on log-scale); the
EKF kernel correctly picks the dominant mode. Figure 4.8 also illustrates the fact
that, in contrast to the prior kernel, the EKF kernel does not necessarily dominate
the optimal kernel in the tails; hence the need to simulate from an over-dispersed
version of the EKF approximation as discussed above.
A particle whose normalized importance weight is negligible is practically ineffective. If there are too many ineffective particles, the particle
approximation becomes both computationally and statistically inefficient: most of
the computing effort is put on updating particles and weights that do not contribute
significantly to the estimator; the variance of the resulting estimator will not reflect
the large number of terms in the sum but only the small number of particles with
non-negligible normalized weights.
Unfortunately, the situation described above is the rule rather than the exception, as the importance weights will (almost always) degenerate as the time index k increases, with most of the normalized importance weights ωk^i/∑_{j=1}^{N} ωk^j close to 0 except for a few of them. We consider below the case of i.i.d. models for which it
is possible to show using simple arguments that the large sample variance of the
importance sampling estimate can only increase with the time index k.
Example 82 (Weight Degeneracy in the I.I.D. Case). The simplest case of appli-
cation of the sequential importance sampling technique is when µ is a probability
distribution on (X, X ) and the sequence of target distributions corresponds to the
product distributions, that is, the sequence of distributions on (Xk+1 , X ⊗(k+1) ) de-
fined recursively by µ0 = µ and µk = µk−1 ⊗ µ, for k ≥ 1. Let ν be another
probability distribution on (X, X ) and assume that µ is absolutely continuous with
respect to ν and that
∫ [dµ/dν(x)]² ν(dx) < ∞ . (4.31)
Finally, let f be a bounded measurable function that is not (µ-a.s.) constant, so that its variance under µ, µ(f²) − µ²(f), is strictly positive.
Consider the sequential importance sampling estimate given by
µ̂IS_{k,N}(f) = ∑_{i=1}^{N} f(ξk^i) [∏_{l=0}^{k} (dµ/dν)(ξl^i)] / [∑_{j=1}^{N} ∏_{l=0}^{k} (dµ/dν)(ξl^j)] , (4.32)
Because
E[ ∏_{l=0}^{k} (dµ/dν)(ξl^i) ] = 1 ,
the weak law of large numbers implies that the denominator of the right-hand side
of (4.33) converges to 1 in probability as N increases. Likewise, under (4.31), the
central limit theorem shows that the numerator of the right-hand side of (4.33) converges in distribution to the normal N(0, σk²(f)) distribution, where
σk²(f) = E[ {f(ξk^1) − µ(f)}² { ∏_{l=0}^{k} (dµ/dν)(ξl^1) }² ] (4.34)
= [ ∫ (dµ/dν(x))² ν(dx) ]^k ∫ (dµ/dν(x))² [f(x) − µ(f)]² ν(dx) .
Slutsky's lemma then implies that (4.33) also converges in distribution to the same N(0, σk²(f)) limit as N grows. Now Jensen's inequality implies that
1 = [ ∫ (dµ/dν)(x) ν(dx) ]² ≤ ∫ [dµ/dν(x)]² ν(dx) ,
with equality if and only if µ = ν. Therefore, if µ ≠ ν, the asymptotic variance σk²(f) grows exponentially with the iteration index k for all functions f such that
∫ (dµ/dν(x))² [f(x) − µ(f)]² ν(dx) = ∫ (dµ/dν)(x) [f(x) − µ(f)]² µ(dx) ≠ 0 .
Because µ is absolutely continuous with respect to ν, µ{x ∈ X : dµ/dν(x) = 0} = 0, and the last integral is null if and only if f has zero variance under µ.
Thus in the i.i.d. case, the asymptotic variance of the importance sampling
estimate (4.32) increases exponentially with the time index k as soon as the proposal
and target differ (except for constant functions).
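The exponential variance growth is easy to observe numerically. The short simulation below is our construction, not from the text: it takes µ = N(0, 1), ν = N(0, 1.5²) (so that (4.31) holds), and f(x) = x, and prints the spread of the self-normalized estimate (4.32) across independent replications for increasing k.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 1.5                                   # proposal std; target is N(0, 1)
n, reps = 1000, 200
for k in (1, 5, 10, 20):
    est = np.empty(reps)
    for r in range(reps):
        xi = s * rng.standard_normal((k + 1, n))      # xi_l^i i.i.d. from nu
        # log dmu/dnu(x) = log(s) - x^2/2 + x^2/(2 s^2), summed over l = 0..k
        logw = np.sum(np.log(s) - 0.5 * xi ** 2 + 0.5 * (xi / s) ** 2, axis=0)
        w = np.exp(logw - logw.max())                 # numerically stabilized weights
        est[r] = np.sum(w * xi[-1]) / np.sum(w)       # estimate of mu(f), f(x) = x
    print(k, est.std())                               # spread grows rapidly with k
```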
It is more difficult to characterize the degeneracy of the weights for general target and instrumental distributions. There have been some limited attempts to study this phenomenon more formally in some specific scenarios. In particular, Del Moral
and Jacod (2001) have shown the degeneracy of the sequential importance sampling
estimator of the posterior mean in Gaussian linear models when the instrumental
kernel is the prior kernel. Such results are in general difficult to derive (even in the
Gaussian linear models where most of the derivations can be carried out explicitly)
and do not provide much additional insight. Needless to say, in practice, weight
degeneracy is a prevalent and serious problem making the vanilla sequential impor-
tance sampling method discussed so far almost useless. The degeneracy can occur
after a very limited number of iterations, as illustrated by the following example.
The coefficient of variation is minimal when the normalized weights are all equal to 1/N, and then CVN = 0. The maximal value of CVN is √(N − 1), which corresponds to one of the normalized weights being one and all others being null. Therefore, the
coefficient of variation is often interpreted as a measure of the number of ineffective
particles (those that do not significantly contribute to the estimate). A related
criterion with a simpler interpretation is the so-called effective sample size Neff
(Liu, 1996), defined as
Neff = [ ∑_{i=1}^{N} ( ω^i / ∑_{j=1}^{N} ω^j )² ]^{−1} , (4.36)
which varies between 1 (all weights null but one) and N (equal weights). It is straightforward to verify the relation
Neff = N / (1 + CV²_N) .
Some additional insights and heuristics about the coefficient of variation are given
by Liu and Chen (1995).
Yet another possible measure of the weight imbalance is the Shannon entropy of the importance weights,
Ent = − ∑_{i=1}^{N} ( ω^i / ∑_{j=1}^{N} ω^j ) log2( ω^i / ∑_{j=1}^{N} ω^j ) . (4.37)
When all the normalized importance weights are null except for one of them, the
entropy is null. On the contrary, if all the weights are equal to 1/N , then the
entropy is maximal and equal to log2 N .
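These diagnostics are immediate to compute. The Python sketch below is ours; since the display defining CVN is not reproduced here, the code recovers it from the relation Neff = N/(1 + CV²N) stated above.

```python
import numpy as np

def weight_diagnostics(weights):
    """Coefficient of variation, effective sample size (4.36), and
    Shannon entropy (4.37) of the normalized importance weights."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    n = len(p)
    neff = 1.0 / np.sum(p ** 2)                    # between 1 and n
    cv = np.sqrt(n / neff - 1.0)                   # from Neff = n / (1 + CV^2)
    ent = -np.sum(p[p > 0] * np.log2(p[p > 0]))    # between 0 and log2(n)
    return cv, neff, ent
```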
4.3.2 Resampling
The solution proposed by Gordon et al. (1993) to reduce the degeneracy of the
importance weights is based on the concept of resampling already discussed in the
context of importance sampling in Section 4.1.2. The basic method consists in
resampling in the current population of particles using the normalized weights as
probabilities of selection. Thus, trajectories with small importance weights are
eliminated, whereas those with large importance weights are duplicated. After
resampling, all importance weights are reset to one. Up to the first instant when re-
sampling occurs, the method can really be interpreted as an instance of the sampling
importance resampling (SIR) technique discussed in Section 4.1.2. In the context
of sequential Monte Carlo, however, the main motivation for resampling is to avoid
future weight degeneracy by resetting (periodically) the weights to equal values. The
resampling step has a drawback however: as emphasized in Section 4.1.2, resampling
Sampling:
• Draw (ξ̃_{k+1}^1, ..., ξ̃_{k+1}^N) conditionally independently given {ξ_{0:k}^j, j = 1, ..., N} from the instrumental kernel: ξ̃_{k+1}^i ~ Rk(ξk^i, ·), i = 1, ..., N.
• Compute the updated importance weights
ω_{k+1}^i = ωk^i g_{k+1}(ξ̃_{k+1}^i) [dQ(ξk^i, ·)/dRk(ξk^i, ·)](ξ̃_{k+1}^i) , i = 1, ..., N .
Resampling (Optional):
• Draw, conditionally independently given {(ξ_{0:k}^i, ξ̃_{k+1}^j), i, j = 1, ..., N}, the multinomial trial (I_{k+1}^1, ..., I_{k+1}^N) with probabilities of success
ω_{k+1}^1/∑_j ω_{k+1}^j , ..., ω_{k+1}^N/∑_j ω_{k+1}^j .
• Reset the importance weights ω_{k+1}^i to a constant value for i = 1, ..., N. If resampling is not applied, set I_{k+1}^i = i for i = 1, ..., N.
As discussed previously the resampling step in the algorithm above may be used
systematically (for all indices k), but it is often preferable to perform resampling
from time to time only. Usually, resampling is either used systematically but at a
lower rate (for one index out of m, where m is fixed) or at random instants based
on the values of the coefficient of variation or the entropy criteria defined in (4.35)
and (4.37), respectively. Note that in addition to arguments based on the variance
of the Monte Carlo approximation, there is usually also a computational incentive
for limiting the use of resampling; indeed, except in models where the evaluation of
the incremental weights is costly (think of large-dimensional multivariate observa-
tions for instance), the computational cost of the resampling step is not negligible.
Both Sections 4.4.1 and 4.4.2 discuss several implementations and variants of the resampling step that may render the latter argument less compelling.
The term particle filter is often used to refer to Algorithm 85 although the
terminology SISR is preferable, as particle filtering is sometimes also used more
generically for any sequential Monte Carlo method. Gordon et al. (1993) actually
proposed a specific instance of Algorithm 85 in which resampling is done systemati-
cally at each step and the instrumental kernel is chosen as the prior kernel Rk = Q.
This particular algorithm, commonly known as the bootstrap filter, is most often very easy to implement because it only involves simulating from the transition kernel Q of the hidden chain and evaluating the conditional likelihood function g.
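As an illustration, here is a minimal Python sketch of the bootstrap filter for a scalar state. The callables nu_sample, q_sample, and g are model-specific assumptions supplied by the user, and resampling is applied systematically at every index, as in Gordon et al. (1993).

```python
import numpy as np

def bootstrap_filter(observations, n, nu_sample, q_sample, g, rng):
    """Bootstrap filter: R_k = Q, multinomial resampling at every step.
    Returns the estimated filtered means."""
    xi = nu_sample(n, rng)                    # xi_0^i ~ nu
    means = []
    for k, y in enumerate(observations):
        if k > 0:
            xi = q_sample(xi, rng)            # mutate: xi_k^i ~ Q(xi_{k-1}^i, .)
        w = g(xi, y)                          # incremental weight = local likelihood
        w = w / w.sum()
        means.append(np.sum(w * xi))          # estimate of the filtered mean
        xi = xi[rng.choice(n, size=n, p=w)]   # resample; weights implicitly reset to 1/n
    return np.array(means)
```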
There is of course a whole range of variants and refinements of Algorithm 85,
many of which will be covered in some detail in the next chapter. A simple remark
though is that, as in the case of the simplest SIR method discussed in Section 4.1.2,
it is possible to resample N times from a larger population of M intermediate
samples. In practice, it means that Algorithm 85 should be modified as follows at
indices k for which resampling is to be applied.
SIS: For i = 1, ..., N, draw α candidates ξ̃_{k+1}^{i,1}, ..., ξ̃_{k+1}^{i,α} from each proposal distribution Rk(ξk^i, ·).
Resampling: Draw (N_{k+1}^{1,1}, ..., N_{k+1}^{1,α}, ..., N_{k+1}^{N,1}, ..., N_{k+1}^{N,α}) from the multinomial distribution with parameter N and probabilities
ω_{k+1}^{i,j} / ∑_{l=1}^{N} ∑_{m=1}^{α} ω_{k+1}^{l,m} for i = 1, ..., N , j = 1, ..., α .
In that case, the system of particles {ξki }1≤i≤N with associated weights {ωki }1≤i≤N ,
provides an approximation to the filtering distribution φk , which is the marginal of
the joint smoothing distribution φ0:k|k .
The notation ξk^i could be ambiguous when resampling is applied, as the first k + 1 elements of the ith trajectory ξ_{0:k+1}^i at time k + 1 do not necessarily coincide with the ith trajectory ξ_{0:k}^i at time k. By convention, ξk^i always refers to the last point in the ith trajectory, as simulated at index k. Likewise, ξ_{l:k}^i is the portion of the same trajectory that starts at index l and ends at the last index (that is, k). When needed, we will use the notation ξ_{0:k}^i(l) for the element of index l in the ith particle trajectory at time k to avoid ambiguity.
To conclude this section on the SISR algorithm, we briefly revisit two of the
examples already considered previously to contrast the results obtained with the
SIS and SISR approaches.
4.4 Complements
As discussed above, resampling is a key ingredient of the success of sequential Monte
Carlo techniques. We discuss below two separate aspects related to this issue. First,
we show that there are several schemes based on clever probabilistic results that
may be exploited to reduce the computational load associated with multinomial
resampling. Next, we examine some variants of resampling that achieve lower conditional variance than multinomial resampling. In this latter case, the aim is of course to be able to decrease the number of particles without losing too much on the quality of the approximation.
Throughout this section, we will assume that it is required to draw N samples
ξ 1 , . . . , ξ N out of a, usually larger, set {ξ˜1 , . . . , ξ˜M } according to the normalized im-
portance weights {ω 1 , . . . , ω M }. We denote by G a σ-field such that both ω 1 , . . . , ω M
and ξ˜1 , . . . , ξ˜M are G-measurable.
(where by convention S1 = U(1)) are called the uniform spacings and are distributed as
( E1/∑_{i=1}^{N+1} Ei , ..., EN/∑_{i=1}^{N+1} Ei ) ,
where E1, ..., EN+1 are i.i.d. standard exponential random variables.
The two sampling algorithms associated with these probabilistic results may be
summarized as follows.
A sensible objective is to try to construct resampling schemes for which the conditional variance Var( ∑_{i=1}^{M} (N^i/N) f(ξ̃^i) | G ) is as small as possible and, in particular, smaller than (4.42), preferably for any choice of the function f.
Residual Resampling
Residual resampling, or remainder resampling, is mentioned by Whitley (1994) (see also Liu and Chen, 1998) as a simple means to decrease the variance incurred by the sampling step. In this scheme, for i = 1, ..., M we set
N^i = ⌊N ω^i⌋ + N̄^i , (4.43)
where N̄^1, ..., N̄^M are drawn from the multinomial distribution Mult(N − R, ω̄^1, ..., ω̄^M) with R = ∑_{i=1}^{M} ⌊N ω^i⌋ and residual weights
ω̄^i = (N ω^i − ⌊N ω^i⌋) / (N − R) , i = 1, ..., M . (4.44)
This scheme is obviously unbiased with respect to G. Equivalently, for any measurable function f, the residual sampling estimator is
(1/N) ∑_{i=1}^{N} f(ξ^i) = ∑_{i=1}^{M} (⌊N ω^i⌋/N) f(ξ̃^i) + (1/N) ∑_{i=1}^{N−R} f(ξ̃^{J^i}) , (4.45)
where J^1, ..., J^{N−R} denote the indices drawn in the residual multinomial step.
The conditional variance of this estimator is thus that of the residual multinomial part,
Var[ (1/N) ∑_{i=1}^{N} f(ξ^i) | G ] = ((N − R)/N²) ∑_{i=1}^{M} ω̄^i { f(ξ̃^i) − ∑_{j=1}^{M} ω̄^j f(ξ̃^j) }²
= (1/N) ∑_{i=1}^{M} ω^i f²(ξ̃^i) − (1/N²) ∑_{i=1}^{M} ⌊N ω^i⌋ f²(ξ̃^i) − ((N − R)/N²) { ∑_{i=1}^{M} ω̄^i f(ξ̃^i) }² . (4.46)
On the other hand,
∑_{i=1}^{M} ω^i f(ξ̃^i) = ∑_{i=1}^{M} (⌊N ω^i⌋/N) f(ξ̃^i) + ((N − R)/N) ∑_{i=1}^{M} ω̄^i f(ξ̃^i) .
Then note that the sum of the M numbers bN ω i c/N plus (N − R)/N equals one,
whence this sequence of M + 1 numbers can be viewed as a probability distribution.
Thus Jensen’s inequality applied to the square of the right-hand side of the above
display yields
{ ∑_{i=1}^{M} ω^i f(ξ̃^i) }² ≤ ∑_{i=1}^{M} (⌊N ω^i⌋/N) f²(ξ̃^i) + ((N − R)/N) { ∑_{i=1}^{M} ω̄^i f(ξ̃^i) }² .
Combining with (4.46) and (4.42), this shows that the conditional variance of resid-
ual sampling is always smaller than that of multinomial sampling.
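A Python sketch of this scheme follows (our code); it returns the N selected indices, combining the deterministic ⌊Nω^i⌋ copies with the N − R multinomial residual draws of (4.43)-(4.44).

```python
import numpy as np

def residual_resample(weights, n, rng):
    """Residual resampling indices, following (4.43)-(4.44)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    floors = np.floor(n * w).astype(int)       # deterministic part
    idx = np.repeat(np.arange(len(w)), floors)
    r = floors.sum()                           # R = sum_i floor(n w_i)
    if r < n:                                  # residual multinomial part
        residual = (n * w - floors) / (n - r)  # residual weights omega-bar
        idx = np.concatenate([idx, rng.choice(len(w), size=n - r, p=residual)])
    return idx
```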
Stratified Resampling
The inversion method for sampling a multinomial sequence of trials maps uniform
(0, 1) random variables U 1 , . . . , U N into indices I 1 , . . . , I N through a deterministic
function. For any function f,
∑_{i=1}^{N} f(ξ̃^{I^i}) = ∑_{i=1}^{N} Φf(U^i) ,
where the function Φf (which depends on both f and {ξ̃^i}) is defined, for any u ∈ (0, 1], by
Φf(u) def= f(ξ̃^{I(u)}) , I(u) = ∑_{i=1}^{M} i 1_{(∑_{j=1}^{i−1} ω^j, ∑_{j=1}^{i} ω^j]}(u) . (4.47)
Note that, by construction, ∫_0^1 Φf(u) du = ∑_{i=1}^{M} ω^i f(ξ̃^i). To reduce the conditional variance of ∑_{i=1}^{N} f(ξ̃^{I^i}), we may change the way in which the sample U^1, ..., U^N is drawn. A possible solution, commonly used in survey sampling, is based on stratification (see Kitagawa, 1996, and Fearnhead, 1998, Section 5.3, for discussion of the method in the context of particle filtering). The interval (0, 1] is partitioned into different strata, assumed for simplicity to be intervals, (0, 1] = (0, 1/N] ∪ (1/N, 2/N] ∪ ··· ∪ ({N − 1}/N, 1]. More general partitions could have been considered as well; in particular, the number of partitions does not have to equal N, and the interval lengths could be made dependent on the ω^i. One then draws a sample Ũ^1, ..., Ũ^N conditionally independently given G from the distributions Ũ^i ~ U(({i − 1}/N, i/N]) (for i = 1, ..., N) and lets Ĩ^i = I(Ũ^i) with I as in (4.47) (see Figure 4.16). By construction, the difference between Ñ^i = ∑_{j=1}^{N} 1{Ĩ^j = i} and the target (non-integer) value N ω^i is less than one in absolute value.
" N # " N #
˜ i
X X
E f (ξ˜ ) G = E
I i
Φf (Ũ ) G
i=1 i=1
N Z i/N
X Z 1 M
X
=N Φf (u) du = N Φf (u) du = N ω i f (ξ˜i ) ,
i=1 (i−1)/N 0 i=1
showing that the conditional variance of stratified sampling is always smaller than
that of multinomial sampling.
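In code, stratified resampling differs from multinomial resampling only in how the uniforms are drawn. The sketch below (ours) uses one uniform draw per stratum and inverts the cumulative weights, i.e., evaluates the function I of (4.47).

```python
import numpy as np

def stratified_resample(weights, n, rng):
    """Stratified resampling: U^i ~ U(((i-1)/n, i/n]), then inversion."""
    w = np.asarray(weights, dtype=float)
    u = (np.arange(n) + rng.random(n)) / n             # one draw per stratum
    return np.searchsorted(np.cumsum(w / w.sum()), u)  # indices I(U^i)
```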
Remark 93. Note that stratified sampling may be coupled with the residual sampling method discussed previously: the proof above shows that using stratified sampling on the N − R residual indices that are effectively drawn randomly can only decrease the conditional variance.
Systematic Resampling
Stratified sampling aims at reducing the discrepancy
D*_N(U^1, ..., U^N) def= sup_{a∈(0,1]} | N^{−1} ∑_{i=1}^{N} 1_{(0,a]}(U^i) − a |
of the sample U^1, ..., U^N from the uniform distribution function on (0, 1]. This is simply the Kolmogorov-Smirnov distance between the empirical distribution function of the sample and the distribution function of the uniform distribution. The
Koksma-Hlawka inequality (Niederreiter, 1992) shows that for any function f having bounded variation on [0, 1],
| N^{−1} ∑_{i=1}^{N} f(u^i) − ∫_0^1 f(u) du | ≤ C(f) D*_N(u^1, ..., u^N) ,
Pursuing in this direction, it makes sense to look for sequences with even smaller average discrepancy. One such sequence is U^i = U + (i − 1)/N, where U is drawn from a uniform U((0, 1/N]) distribution. In survey sampling, this method is known as systematic sampling. It was introduced in the particle filter literature by Carpenter et al. (1999) but is mentioned by Whitley (1994) under the name of universal sampling. The interval (0, 1] is still divided into N sub-intervals ({i − 1}/N, i/N], and one sample is taken from each of them, as in stratified sampling. However, the samples are no longer independent, as they have the same relative position within each stratum (see Figure 4.17). This sampling scheme is obviously still unbiased.
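Systematic sampling needs only a single uniform draw. A sketch (ours) is given below; it differs from the stratified version only in that the same shift U is shared by all strata.

```python
import numpy as np

def systematic_resample(weights, n, rng):
    """Systematic resampling: U^i = U + (i-1)/n with a single U ~ U((0, 1/n])."""
    w = np.asarray(weights, dtype=float)
    u = (np.arange(n) + rng.random()) / n              # common relative position
    return np.searchsorted(np.cumsum(w / w.sum()), u)  # indices I(U^i)
```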
Because the samples are not taken independently across strata, it is however not
possible to obtain simple formulas for the conditional variance (Künsch, 2003). It is
often conjectured that the conditional variance of systematic resampling is always
lower than that of multinomial resampling. This is not correct, as demonstrated by
the following example.
Example 94. Consider the case where the initial population of particles {ξ˜i }1≤i≤N
is composed of the interleaved repetition of only two distinct values x0 and x1 , with
identical multiplicities (assuming N to be even). In other words,
Because the value 2ω/N is assumed to be larger than 1/N , it is easily checked
that systematic resampling deterministically sets N/2 of the ξ i to be equal to x1 .
84 CHAPTER 4. SEQUENTIAL MONTE CARLO METHODS
Table 4.1: Standard deviations of various resampling methods for N = 100 and
F = 1. The bottom line has been obtained by simulations, averaging 100,000
Monte Carlo replications.
Depending on the draw of the initial shift, all the N/2 remaining particles are either set to x1, with probability 2ω − 1, or to x0, with probability 2(1 − ω). Hence the variance is that of a single Bernoulli draw scaled by N/2, that is,
Var[ (1/N) ∑_{i=1}^{N} f(ξ^i_syst) | G ] = (ω − 1/2)(1 − ω) F² .
Note that in this case, the conditional variance of systematic resampling is not only
larger than (4.48) for most values of ω (except when ω is very close to 1/2), but
it does not even decrease to zero as N grows! Clearly, this observation is very
dependent on the order in which the initial population of particles is presented.
Interestingly, this feature is common to the systematic and stratified sampling
schemes, whereas the multinomial and residual approaches are unaffected by the
order in which the particles are labelled. In this particular example, it is straight-
forward to verify that residual and stratified resampling are equivalent—which is
not the case in general—and amount to deterministically setting N/2 particles to
the value x1 , whereas the N/2 remaining ones are drawn by N/2 conditionally in-
dependent Bernoulli trials with probability of picking x1 equal to 2ω − 1. Hence
the conditional variance, for both the residual and stratified schemes, is equal to
N −1 (2ω − 1)(1 − ω)F 2 . It is hence always smaller than (4.48), as expected from
the general study of these two methods.
Once again, the failure of systematic resampling in this example is entirely due
to the specific order in which the particles are labeled: it is easy to verify, at least
empirically, that the problem vanishes upon randomly permuting the initial particles
before applying systematic resampling. Table 4.1 also shows that a common feature
of the residual, stratified, and systematic resampling procedures is that they become
very efficient in some particular configurations of the weights, such as ω = 0.51,
for which the probabilities of selecting the two types of particles are almost equal
and the selection becomes quasi-deterministic. Note also that prior random shuffling
does somewhat compromise this ability in the case of systematic resampling.
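A matching sketch of systematic resampling, together with an empirical check of the variance formula of Example 94 (our code and notation; here ω = 0.6 and f(x) = x, so that F = f(x1) − f(x0) = 1):

import numpy as np

def systematic_resample(weights, rng):
    # A single uniform shift, drawn once, shared by all N strata.
    N = len(weights)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0
    u = (np.arange(N) + rng.uniform()) / N
    return np.searchsorted(cdf, u)

rng = np.random.default_rng(1)
N, omega = 100, 0.6
xi = np.tile([0.0, 1.0], N // 2)  # interleaved population x0 = 0, x1 = 1
w = np.tile([2 * (1 - omega) / N, 2 * omega / N], N // 2)
reps = [xi[systematic_resample(w, rng)].mean() for _ in range(20000)]
print(np.var(reps), (omega - 0.5) * (1 - omega))  # both close to 0.04

Jointly permuting xi and w at random before resampling makes the empirical variance drop back to the O(1/N) regime, illustrating the effect of shuffling discussed above.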
Figure 4.3: Box and whisker plot of the posterior mean estimate of X5 obtained
from 125 replications of the SIS filter using the prior kernel and increasing numbers
of particles. The horizontal line represents the true posterior mean (0.907).
Figure 4.4: Box and whisker plot of the posterior mean estimate for X5 obtained
from 125 replications of the SIS filter using the optimal kernel and increasing num-
bers of particles. Same data and axes as Figure 4.3.
Figure 4.5: SIS using the prior kernel. The positions of the particles are indicated
by circles whose radii are proportional to the normalized importance weights. The
solid lines show the filtering distributions for three consecutive time indices.
Figure 4.6: SIS using the optimal kernel (same data and display as in Figure 4.5).
Figure 4.7: [Three-dimensional plot of density against time index and state; caption not recovered.]
Figure 4.8: Log-density of the optimal kernel (solid line), EKF approximation of
the optimal kernel (dashed-dotted line), and the prior kernel (dashed line) for two
different values of the state noise variance σu²: left, σu² = 1; right, σu² = 10.
Figure 4.9: [Histograms of the base 10 logarithm of the normalized importance weights; caption not recovered.]
Figure 4.10: Coefficient of variation (left) and entropy (right) of the normalized
importance weights as a function of the number of iterations for the stochastic
volatility model of Example 80. Same model and data as in Figure 4.9.
Figure 4.11: Coefficient of variation (left) and entropy (right) of the normalized
importance weights as a function of the number of iterations in the stochastic
volatility model of Example 80. Same model and data as in Figure 4.10. Resampling
occurs when the coefficient of variation gets larger than 1.
Figure 4.12: [Histograms of the base 10 logarithm of the normalized importance weights; caption not recovered.]
Figure 4.13: SIS estimates of the filtering distributions in the growth model with
instrumental kernel being the prior one and 500 particles. Top: true state sequence
(×) and 95%/50% HPD regions (light/dark grey) of estimated filtered distribution.
Bottom: coefficient of variation of the normalized importance weights.
Figure 4.14: Same legend as for Figure 4.13, but with results for the corresponding
bootstrap filter.
Figure 4.16: Stratified sampling: the interval (0, 1] is divided into N intervals ((i −
1)/N, i/N ]. One sample is drawn uniformly from each interval, independently of
samples drawn in the other intervals.
Figure 4.17: Systematic sampling: the unit interval is divided into N intervals
((i − 1)/N, i/N ] and one sample is drawn from each of them. Contrary to stratified
sampling, each sample has the same relative position within its stratum.
Part II
Parameter Inference
Chapter 5
Maximum Likelihood
Inference, Part I:
Optimization Through Exact
Smoothing
In previous chapters, we have focused on structural results and methods for HMMs,
considering in particular that the models under consideration were always perfectly
known. In most situations, however, the model cannot be fully specified beforehand,
and some of its parameters need to be calibrated based on observed data. Except
for very simplistic instances of HMMs, the structure of the model is sufficiently
complex to prevent the use of direct estimators such as those provided by moment
or least squares methods. We thus focus in the following on computation of the
maximum likelihood estimator.
Given the specific structure of the likelihood function in HMMs, it turns out
that the key ingredient of any optimization method applicable in this context is
the ability to compute smoothed functionals of the unobserved sequence of states.
Hence the methods discussed in the first part of the book for evaluating smoothed
quantities are instrumental in devising parameter estimation strategies.
This chapter only covers the class of HMMs for which the smoothing recursions
may effectively be implemented on computers. For such models, the likelihood
function is computable, and hence our main task will be to optimize a possibly
complex but entirely known function. The topic of this chapter thus relates to the
more general field of numerical optimization. For models that do not allow for
exact numerical computation of smoothing distributions, this chapter provides a
framework from which numerical approximations can be built.
dimensional random variable. In Section 5.2, we will exploit the specific structure
of the HMM, and in particular the fact that it corresponds to a missing data model
in which the observations simply are a subset of the complete data. We ignore these
specifics for the moment however and consider the general likelihood optimization
problem in incomplete data models.
Remark 96. To ensure that Q(θ ; θ0 ) is indeed well-defined for all values of the
pair (θ, θ0 ), one needs regularity conditions on the family of functions {f (· ; θ)}θ∈Θ ,
which will be stated below (Assumption 97). To avoid trivial cases however, we use
the convention 0 log 0 = 0 in (5.4) and in similar relations below. In more formal
terms, for every measurable set N such that both f (x ; θ) and p(x ; θ0 ) vanish λ-a.e.
on N, set
\[
\int_N \log f(x ; \theta)\, p(x ; \theta_0)\, \lambda(dx) \overset{\mathrm{def}}{=} 0 \;.
\]
With this convention, Q(θ ; θ0 ) stays well-defined in cases where there exists a non-
empty set N such that both f (x ; θ) and f (x ; θ0 ) vanish λ-a.e. on N .
The intermediate quantity Q(θ ; θ0 ) of EM may be interpreted as the expecta-
tion of the function log f (X ; θ) when X is distributed according to the probability
density function p(· ; θ0 ) indexed by a, possibly different, value θ0 of the parameter.
Using (5.2) and (5.3), one may rewrite the intermediate quantity of EM in (5.4) as
\[
Q(\theta ; \theta_0) = \ell(\theta) - H(\theta ; \theta_0) \;, \tag{5.5}
\]
where
\[
H(\theta ; \theta_0) \overset{\mathrm{def}}{=} - \int \log p(x ; \theta)\, p(x ; \theta_0)\, \lambda(dx) \;. \tag{5.6}
\]
Equation (5.5) states that the intermediate quantity Q(θ ; θ0) of EM differs from
the log of the objective function, ℓ(θ) = log L(θ), by a quantity that has a familiar
form. Indeed, H(θ0 ; θ0) is recognized as the entropy of the probability density
function p(· ; θ0) (see for instance Cover and Thomas, 1991). More importantly,
the increment of
H(θ ; θ0),
\[
H(\theta ; \theta_0) - H(\theta_0 ; \theta_0) = - \int \log \frac{p(x ; \theta)}{p(x ; \theta_0)}\, p(x ; \theta_0)\, \lambda(dx) \;, \tag{5.7}
\]
is recognized as the Kullback-Leibler divergence (or relative entropy) between the
probability density functions p indexed by θ and θ0 , respectively.
The last piece of notation needed is the following: the gradient and Hessian
of a function, say L, at θ0 will be denoted by ∇θ L(θ0 ) and ∇2θ L(θ0 ), respectively.
To avoid ambiguities, the gradient of H(· ; θ0) with respect to its first argument,
evaluated at some point θ̃, will be denoted by ∇θ H(θ ; θ0)|θ=θ̃ (the same convention
will also be used, if needed, for the Hessian).
We conclude this introductory section by stating a minimal set of assumptions
that guarantee that all quantities introduced so far are indeed well-defined.
Assumption 97.
(i) The parameter set Θ is an open subset of Rdθ (for some integer dθ ).
(ii) For any θ ∈ Θ, L(θ) is positive and finite.
(iii) For any (θ, θ0) ∈ Θ × Θ, ∫ |∇θ log p(x ; θ)| p(x ; θ0) λ(dx) is finite.
Assumption 97(iii) implies in particular that the probability distributions in the
family {p(· ; θ) dλ}θ∈Θ are all absolutely continuous with respect to one another.
Any individual distribution p(· ; θ) dλ can only vanish on sets that are assigned null
probability by all other probability distributions in the family. Thus both H(θ ; θ0 )
and Q(θ ; θ0 ) are well-defined for all pairs of parameters.
Proposition 98. Under Assumption 97, for any (θ, θ0) ∈ Θ × Θ,
\[
\ell(\theta) - \ell(\theta_0) \ge Q(\theta ; \theta_0) - Q(\theta_0 ; \theta_0) \;, \tag{5.8}
\]
where the inequality is strict unless p(· ; θ) and p(· ; θ0) are equal λ-a.e.
Assume in addition that
(a) θ 7→ L(θ) is continuously differentiable on Θ;
(b) for any θ0 ∈ Θ, θ 7→ H(θ ; θ0 ) is continuously differentiable on Θ.
Then for any θ0 ∈ Θ, θ ↦ Q(θ ; θ0) is continuously differentiable on Θ and
\[
\nabla_\theta \left. Q(\theta ; \theta_0) \right|_{\theta = \theta_0} = \nabla_\theta \ell(\theta_0) \;. \tag{5.9}
\]
Proof. The difference between the left-hand side and the right-hand side of (5.8)
is the quantity defined in (5.7), which we already recognized as a Kullback-Leibler
distance. Under Assumption 97(iii), this latter term is well-defined and known to
be strictly positive (by direct application of Jensen’s inequality) unless p(· ; θ) and
p(· ; θ0 ) are equal λ-a.e. (Cover and Thomas, 1991; Lehmann and Casella, 1998).
For (5.9), first note that Q(θ ; θ0 ) is a differentiable function of θ, as it is the dif-
ference of two functions that are differentiable under the additional assumptions (a)
and (b). Next, the previous discussion implies that H(θ ; θ0 ) is minimal for θ = θ0 ,
although this may not be the only point where the minimum is achieved. Thus its
gradient vanishes at θ0 , which proves (5.9).
The EM Algorithm
The essence of the EM algorithm, which is suggested by (5.5), is that Q(θ ; θ0 ) may
be used as a surrogate for `(θ). Both functions are not necessarily comparable but,
in view of (5.8), any value of θ such that Q(θ ; θ0 ) is increased over its baseline
Q(θ0 ; θ0 ) corresponds to an increase of ` (relative to `(θ0 )) that is at least as large.
The EM algorithm as proposed by Dempster et al. (1977) consists in iteratively
building a sequence {θi }i≥1 of parameter estimates given an initial guess θ0 . Each
iteration is classically broken into two steps as follows.
E-Step: Determine Q(θ ; θi );
M-Step: Choose θi+1 to be the (or any, if there are several) value of θ ∈ Θ that
maximizes Q(θ ; θi ).
Proposition 98 provides the two decisive arguments behind the EM algorithm. First,
an immediate consequence of (5.8) is that, by the very definition of the sequence
{θi }, the sequence {`(θi )}i≥0 of log-likelihood values is non-decreasing. Hence EM
is a monotone optimization algorithm. Second, if the iterations ever stop at a point
θ?, then θ? maximizes Q(· ; θ?) and hence, by (5.9), ∇θ ℓ(θ?) = 0: fixed points of
EM are stationary points of the log-likelihood.
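In pseudocode, the resulting procedure is a simple loop; the sketch below assumes user-supplied e_step and m_step routines (our names, not the book's) that respectively compute and maximize the intermediate quantity, the log-likelihood being used only to monitor the monotone ascent:

def em(theta, e_step, m_step, loglik, tol=1e-8, max_iter=500):
    # Generic EM loop: each pass can only increase loglik (Proposition 98).
    ll = loglik(theta)
    for _ in range(max_iter):
        theta = m_step(e_step(theta))  # E-step, then M-step
        ll_new = loglik(theta)
        if ll_new - ll < tol:          # ll_new >= ll up to numerical error
            break
        ll = ll_new
    return theta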
EM in Exponential Families
Definition 99 (Exponential Family). The family {f(· ; θ)}θ∈Θ defines an exponential
family of positive functions on X if
\[
f(x ; \theta) = \exp\{ \psi(\theta)^t S(x) - c(\theta) \}\, h(x) \;, \tag{5.10}
\]
where S and ψ are vector-valued functions (of the same dimension) on X and Θ
respectively, c is a real-valued function on Θ, and h is a non-negative real-valued
function on X.
Here S(x) is known as the vector of natural sufficient statistics, and η = ψ(θ)
is the natural parameterization. If {f(· ; θ)}θ∈Θ is an exponential family and if
∫ |S(x)| f(x ; θ) λ(dx) is finite for any θ ∈ Θ, the intermediate quantity of EM
reduces to
\[
Q(\theta ; \theta_0) = \psi(\theta)^t \int S(x)\, p(x ; \theta_0)\, \lambda(dx) - c(\theta) + \int p(x ; \theta_0) \log h(x)\, \lambda(dx) \;. \tag{5.11}
\]
Note that the right-most term does not depend on θ and thus plays no role in
the maximization. It may as well be ignored, and in practice it is not required to
compute it. Except for this term, the right-hand side of (5.11) has an explicit form as
soon as it is possible to evaluate the expectation of the vector of sufficient statistics
S under p(· ; θ0 ). The other important feature of (5.11), ignoring the rightmost term,
is that Q(θ ; θ0), viewed as a function of θ, is similar to the logarithm of (5.10) for
the particular value Sθ0 = ∫ S(x) p(x ; θ0) λ(dx) of the sufficient statistic.
In summary, if {f (· ; θ)}θ∈Θ is an exponential family, the two above general
conditions needed for the EM algorithm to be practicable reduce to the following.
E-Step: The expectation of the vector of sufficient statistics S(X) under p(· ; θ0 ) must
be computable.
M-Step: Maximization of ψ(θ)t s − c(θ) with respect to θ ∈ Θ must be feasible in closed
form for any s in the convex hull of S(X) (that is, for any valid value of the
expected vector of sufficient statistics).
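As a concrete illustration (our example, not the book's): for an i.i.d. Gaussian sample, S(x) = (x, x²) and the M-step simply inverts the map from the expected sufficient statistics to the parameters:

def m_step_gaussian(s1, s2, n):
    # s1 = E[sum_k X_k], s2 = E[sum_k X_k**2] under p(. ; theta'); n = sample size.
    mu = s1 / n
    sigma2 = s2 / n - mu ** 2
    return mu, sigma2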
\[
-\nabla_\theta^2 \ell(\theta_0) = - \int \left. \nabla_\theta^2 \log f(x ; \theta) \right|_{\theta=\theta_0} p(x ; \theta_0)\, \lambda(dx)
+ \int \left. \nabla_\theta^2 \log p(x ; \theta) \right|_{\theta=\theta_0} p(x ; \theta_0)\, \lambda(dx) \;. \tag{5.13}
\]
Together with (5.12), it shows that the first- and second-order derivatives of ℓ may
be evaluated by computing expectations under p(· ; θ0) of quantities derived from f(· ; θ). We now
prove these three identities.
Proof of Proposition 100. Equations (5.12) and (5.13) are just (5.5) where the right-hand
side is differentiated once, using (5.9), and then twice under the integral sign.
To prove (5.14), we start from (5.13) and note that the second term on its right-
hand side is the negative of an information matrix for the parameter θ associated
with the probability density function p(· ; θ) and evaluated at θ0 . We rewrite this
second term using the well-known information matrix identity
\[
\int \left. \nabla_\theta^2 \log p(x ; \theta) \right|_{\theta=\theta_0} p(x ; \theta_0)\, \lambda(dx)
= - \int \{ \left. \nabla_\theta \log p(x ; \theta) \right|_{\theta=\theta_0} \} \{ \left. \nabla_\theta \log p(x ; \theta) \right|_{\theta=\theta_0} \}^t\, p(x ; \theta_0)\, \lambda(dx) \;.
\]
This is again a consequence of assumption (b) and the fact that p(· ; θ) is a proba-
bility density function for all values of θ, implying that
\[
\int \left. \nabla_\theta \log p(x ; \theta) \right|_{\theta=\theta_0} p(x ; \theta_0)\, \lambda(dx) = 0 \;.
\]
Now use the identity log p(x ; θ) = log f(x ; θ) − ℓ(θ) and (5.12) to conclude that
\[
\int \{ \left. \nabla_\theta \log p(x ; \theta) \right|_{\theta=\theta_0} \} \{ \left. \nabla_\theta \log p(x ; \theta) \right|_{\theta=\theta_0} \}^t\, p(x ; \theta_0)\, \lambda(dx)
= \int \{ \left. \nabla_\theta \log f(x ; \theta) \right|_{\theta=\theta_0} \} \{ \left. \nabla_\theta \log f(x ; \theta) \right|_{\theta=\theta_0} \}^t\, p(x ; \theta_0)\, \lambda(dx)
- \{ \nabla_\theta \ell(\theta_0) \} \{ \nabla_\theta \ell(\theta_0) \}^t \;,
\]
which, combined with (5.13), proves (5.14).
It can be shown (Luenberger, 1984, Chapter 7) that under mild assumptions, the
steepest ascent method with multipliers (5.16) is globally convergent, with a set of
102 CHAPTER 5. MAXIMUM LIKELIHOOD INFERENCE, PART I
limit points corresponding to the stationary points of ` (see Section 5.5 for precise
definitions of these terms and a proof that this property holds for the EM algorithm).
It remains that the use of the steepest ascent algorithm is not recommended,
particularly in large-dimensional parameter spaces. The reason for this is that its
speed of convergence is only linear: if the sequence {θi}i≥0 converges to a
point θ? such that the Hessian ∇2θ ℓ(θ?) is negative definite (see Section 5.5.2), then
each coordinate of the error decreases asymptotically at a geometric rate ρk, cf. (5.17),
where θ(k) denotes the kth coordinate of the parameter vector. For large-dimensional
problems it frequently occurs that, at least for some components k, the factor ρk
is close to one, resulting in very slow convergence of the algorithm. It should
be stressed however that the same is true for the EM algorithm, which also ex-
hibits speed of convergence that is linear, and often very poor (Dempster et al.,
1977; Jamshidian and Jennrich, 1997; Meng, 1994; Lange, 1995; Meng and Van
Dyk, 1997). For gradient-based methods however, there exists a whole range of
approaches, based on the second-order properties of the objective function, to guar-
antee faster convergence.
Newton iterations may however run into numerical trouble when the
Hessian H(θ) is either non-invertible (or at least very badly conditioned) or not
negative semi-definite (in which case −H−1(θi)∇θ ℓ(θi) is not necessarily an ascent
direction). To combat this difficulty, quasi-Newton methods1 use the modified
recursion
θi+1 = θi + γi W i ∇`(θi ) ; (5.20)
here W i is a weight matrix that may be tuned at each iteration, just like the multi-
plier γi . The rationale is that if W i becomes close to −H −1 (θi ) when convergence
occurs, the modified algorithm will share the favorable convergence properties of the
Newton algorithm. On the other hand, by using a weight matrix W i different from
−H −1 (θi ), numerical issues associated with the matrix inversion may be avoided.
We again refer to Luenberger (1984) and Fletcher (1987) for a more precise discus-
sion of the available approaches and simply mention that these methods usually
exploit only gradient information to construct W i, for instance using
finite difference calculations, without requiring direct evaluation of the Hessian
H(θ).
In some contexts, it may be possible to build explicit strategies that are not as
good as the Newton algorithm—failing in particular to reach quadratic convergence
rates—but yet significantly faster at converging than the basic steepest ascent ap-
proach. For incomplete data models, Lange (1995) suggested to use in (5.20) a
weight matrix Ic−1(θi) given by
\[
I_c(\theta_0) = - \int \left. \nabla_\theta^2 \log f(x ; \theta) \right|_{\theta=\theta_0} p(x ; \theta_0)\, \lambda(dx) \;. \tag{5.21}
\]
This is the first term on the right-hand side of (5.13). In many models of interest,
this matrix is positive definite for all θ0 ∈ Θ, and thus its inversion is not subject
to numerical instabilities. Based on (5.13), it is also to be expected that in some
circumstances, Ic (θ0 ) is a reasonable approximation to the Hessian ∇2θ `(θ0 ) and
hence that the weighted gradient algorithm converges faster than the steepest ascent
or EM algorithms (see Lange, 1995, for further results and examples). In a statistical
context, where f(x ; θ) is the joint density of two random variables X and Y, Ic(θ0) is
the conditional expectation given Y of the observed information matrix associated
with this pair.
1 Conjugate gradient methods are another alternative approach that we do not discuss here.
For a given value of y this is of course a particular case of (5.1), which served as the
basis for developing the EM framework in Section 5.1.2. In missing data models,
the family of probability density functions {p(· ; θ)}θ∈Θ defined in (5.3) may thus
be interpreted as
f (x, y ; θ)
p(x|y ; θ) = R , (5.23)
f (x, y ; θ) λ(dx)
the conditional probability density function of X given Y .
In the last paragraph, slightly modified versions of the notations introduced
in (5.1) and (5.3) were used to reflect the fact that the quantities of interest now
depend on the observed variable Y . This is obviously mostly a change regarding
terminology, with no impact on the contents of Section 5.1.2, except that we may
now think of integrating with respect to p(· ; θ) dλ as taking the conditional expec-
tation with respect to the missing data X, given the observed data Y , in the model
indexed by the parameter value θ.
5.2.2 EM in HMMs
We now consider more specifically hidden Markov models using the notations in-
troduced in Section 1.2, assuming that observations Y0 to Yn (or, in short, Y0:n ) are
available. Because we only consider HMMs that are fully dominated in the sense
of Definition 13, we will use the notations ν and φk|n to refer to the probability
density functions of these distributions (of X0 and of Xk given Y0:n ) with respect
to the dominating measure λ. The joint probability density function of the hidden
states X0:n and associated observations Y0:n , with respect to the product measure
λ⊗(n+1) ⊗ µ⊗(n+1), is given by
\[
f_n(x_{0:n}, y_{0:n} ; \theta) = \nu(x_0 ; \theta) \prod_{k=0}^{n-1} q(x_k, x_{k+1} ; \theta) \prod_{k=0}^{n} g(x_k, y_k ; \theta) \;, \tag{5.24}
\]
where we used the same convention as above to indicate dependence with respect
to the parameter θ.
Because we mainly consider estimation of the HMM parameter vector θ from a
single sequence of observations, it does not make much sense to consider ν as an
independent parameter. There is no hope to estimate ν consistently, as there is
only one random variable X0 (that is not even observed!) drawn from this density.
In the following, we shall thus consider that ν is either fixed (and known) or fully
determined by the parameter θ that appears in q and g. A typical example of the
latter consists in assuming that ν is the stationary distribution associated with the
transition function q(·, · ; θ) (if it exists). This option is generally practicable only
in very simple models (see Example ?? below) because of the lack of
analytical expressions relating the stationary distribution of q(·, · ; θ) to θ for general
parameterized hidden chains. Irrespective of whether ν is fixed or determined by θ,
it is convenient to omit dependence with respect to ν in our notations, writing, for
instance, Eθ for expectations under the model parameterized by (θ, ν).
The likelihood of the observations Ln (y0:n ; θ) is obtained by integrating (5.24)
with respect to the x (state) variables under the measure λ⊗(n+1) . Note that here
we use yet another slight modification of the notations adopted in Section 5.1 to
acknowledge that both the observations and the hidden states are indeed sequences
5.2. APPLICATION TO HMMS 105
with indices ranging from 0 to n (hence the subscript n). Upon taking the logarithm
in (5.24),
\[
\log f_n(x_{0:n}, y_{0:n} ; \theta) = \log \nu(x_0 ; \theta) + \sum_{k=0}^{n-1} \log q(x_k, x_{k+1} ; \theta) + \sum_{k=0}^{n} \log g(x_k, y_k ; \theta) \;,
\]
so that the intermediate quantity of EM is
\[
Q(\theta ; \theta_0) = \mathrm{E}_{\theta_0}[ \log \nu(X_0 ; \theta) \mid Y_{0:n} ] + \sum_{k=0}^{n-1} \mathrm{E}_{\theta_0}[ \log q(X_k, X_{k+1} ; \theta) \mid Y_{0:n} ] + \sum_{k=0}^{n} \mathrm{E}_{\theta_0}[ \log g(X_k, Y_k ; \theta) \mid Y_{0:n} ] \;.
\]
Writing gk(x ; θ) as a shorthand for g(x, Yk ; θ), this may be rearranged as
\[
Q(\theta ; \theta_0) = \mathrm{E}_{\theta_0}[ \log \nu(X_0 ; \theta) \mid Y_{0:n} ] + \sum_{k=0}^{n} \mathrm{E}_{\theta_0}[ \log g_k(X_k ; \theta) \mid Y_{0:n} ] + \sum_{k=0}^{n-1} \mathrm{E}_{\theta_0}[ \log q(X_k, X_{k+1} ; \theta) \mid Y_{0:n} ] \;. \tag{5.25}
\]
Equation (5.25) shows that in great generality, evaluating the intermediate quantity
of EM only requires the computation of expectations under the marginal φk|n (· ; θ0 )
and bivariate φk:k+1|n (· ; θ0 ) smoothing distributions, given the parameter vector θ0 .
The required expectations may thus be computed using either any of the variants
of the forward-backward approach presented in Chapter 2 or the recursive smooth-
ing approach discussed in Section ??. To make the connection with the recursive
smoothing approach of Section ??, we simply rewrite (5.25) as Eθ0 [tn (X0:n ; θ) | Y0:n ],
where
t0 (x0 ; θ) = log ν(x0 ; θ) + log g0 (x0 ; θ) (5.26)
and
tk+1 (x0:k+1 ; θ) = tk (x0:k ; θ) + {log q(xk , xk+1 ; θ) + log gk+1 (xk+1 ; θ)} . (5.27)
Hence the gradient of the log-likelihood may also be evaluated using either the
forward-backward approach or the recursive technique discussed in Chapter 3. For
the latter, we only need to redefine the functional of interest, replacing (5.26)
and (5.27) by their gradients with respect to θ.
Louis’ identity (5.14) gives rise to more complicated expressions, and we only
consider here the case where g does depend on θ, whereas the state transition density
q and the initial distribution ν are assumed to be fixed and known (the opposite
situation is covered in detail in a particular case in Section 5.3.3). In this case,
(5.14) may be rewritten as
\[
\nabla_\theta^2 \ell_n(\theta) + \{ \nabla_\theta \ell_n(\theta) \} \{ \nabla_\theta \ell_n(\theta) \}^t
= \sum_{k=0}^{n} \mathrm{E}_\theta[ \nabla_\theta^2 \log g_k(X_k ; \theta) \mid Y_{0:n} ]
+ \sum_{k=0}^{n} \sum_{j=0}^{n} \mathrm{E}_\theta\bigl[ \{ \nabla_\theta \log g_k(X_k ; \theta) \} \{ \nabla_\theta \log g_j(X_j ; \theta) \}^t \mid Y_{0:n} \bigr] \;. \tag{5.29}
\]
The first term on the right-hand side of (5.29) is obviously an expression that can be
computed proceeding as for (5.28), replacing first- by second-order derivatives. The
second term is however more tricky because it (seemingly) requires the evaluation
of the joint distribution of Xk and Xj given the observations Y0:n for all pairs of
indices k and j, which is not obtainable by the smoothing approaches based on
some form of the forward-backward decomposition. The rightmost term of (5.29)
is however easily recognized as a squared sum functional similar to (??), which can
thus be evaluated recursively (in n) proceeding as in Example ??. Recall that the
trick consists in observing that if
\[
\tau_{n,1}(x_{0:n} ; \theta) \overset{\mathrm{def}}{=} \sum_{k=0}^{n} \nabla_\theta \log g_k(x_k ; \theta) \;, \qquad
\tau_{n,2}(x_{0:n} ; \theta) \overset{\mathrm{def}}{=} \Bigl\{ \sum_{k=0}^{n} \nabla_\theta \log g_k(x_k ; \theta) \Bigr\} \Bigl\{ \sum_{k=0}^{n} \nabla_\theta \log g_k(x_k ; \theta) \Bigr\}^t \;,
\]
then
\[
\tau_{n,2}(x_{0:n} ; \theta) = \tau_{n-1,2}(x_{0:n-1} ; \theta)
+ \{ \nabla_\theta \log g_n(x_n ; \theta) \} \{ \nabla_\theta \log g_n(x_n ; \theta) \}^t
+ \tau_{n-1,1}(x_{0:n-1} ; \theta) \{ \nabla_\theta \log g_n(x_n ; \theta) \}^t
+ \nabla_\theta \log g_n(x_n ; \theta) \{ \tau_{n-1,1}(x_{0:n-1} ; \theta) \}^t \;.
\]
This last expression is of the general form given in Definition ??, and hence Propo-
sition ?? may be applied to update recursively in n
Eθ [τn,1 (X0:n ; θ) | Y0:n ] and Eθ [τn,2 (X0:n ; θ) | Y0:n ] .
To make this approach more concrete, we will describe below, in Section 5.3.3, its
application to a very simple finite state space HMM.
For a normal HMM with r states, in which the conditional distribution of Yk given
Xk = i is Gaussian with mean µi and variance υi and the transition probabilities
are qij, the intermediate quantity of EM takes the form
\[
Q(\theta ; \theta_0) = C^{\mathrm{st}} - \frac{1}{2}\, \mathrm{E}_{\theta_0}\!\left[ \sum_{k=0}^{n} \sum_{i=1}^{r} \mathbf{1}\{X_k = i\} \Bigl( \log \upsilon_i + \frac{(Y_k - \mu_i)^2}{\upsilon_i} \Bigr) \,\Bigm|\, Y_{0:n} \right]
+ \mathrm{E}_{\theta_0}\!\left[ \sum_{k=1}^{n} \sum_{i=1}^{r} \sum_{j=1}^{r} \mathbf{1}\{(X_{k-1}, X_k) = (i, j)\} \log q_{ij} \,\Bigm|\, Y_{0:n} \right] \;,
\]
where the leading term does not depend on θ. Using the notations introduced in
Section 2.1 for the smoothing distributions, we may write
\[
Q(\theta ; \theta_0) = C^{\mathrm{st}} - \frac{1}{2} \sum_{k=0}^{n} \sum_{i=1}^{r} \phi_{k|n}(i ; \theta_0) \Bigl( \log \upsilon_i + \frac{(Y_k - \mu_i)^2}{\upsilon_i} \Bigr)
+ \sum_{k=1}^{n} \sum_{i=1}^{r} \sum_{j=1}^{r} \phi_{k-1:k|n}(i, j ; \theta_0) \log q_{ij} \;. \tag{5.30}
\]
Now, given the initial distribution ν and parameter θ0 , the smoothing distri-
butions appearing in (5.30) can be evaluated by any of the variants of forward-
backward smoothing discussed in Chapter 2. As already explained above, the E-
step of EM thus reduces to solving the smoothing problem. The M-step is specific
and depends on the model parameterization: the task consists in finding a global
optimum of Q(θ ; θ0) that satisfies the constraints mentioned above. For this, simply
introduce the Lagrange multipliers λ1, . . . , λr that correspond to the equality
constraints Σ_{j=1}^r qij = 1 for i = 1, . . . , r (Luenberger, 1984, Chapter 10). The
first-order partial derivatives of the Lagrangian
\[
L(\theta, \lambda ; \theta_0) = Q(\theta ; \theta_0) + \sum_{i=1}^{r} \lambda_i \Bigl( 1 - \sum_{j=1}^{r} q_{ij} \Bigr)
\]
are given by
\begin{align*}
\frac{\partial}{\partial \mu_i} L(\theta, \lambda ; \theta_0) &= \frac{1}{\upsilon_i} \sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)\, (Y_k - \mu_i) \;, \\
\frac{\partial}{\partial \upsilon_i} L(\theta, \lambda ; \theta_0) &= -\frac{1}{2} \sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0) \Bigl( \frac{1}{\upsilon_i} - \frac{(Y_k - \mu_i)^2}{\upsilon_i^2} \Bigr) \;, \\
\frac{\partial}{\partial q_{ij}} L(\theta, \lambda ; \theta_0) &= \sum_{k=1}^{n} \frac{\phi_{k-1:k|n}(i, j ; \theta_0)}{q_{ij}} - \lambda_i \;, \\
\frac{\partial}{\partial \lambda_i} L(\theta, \lambda ; \theta_0) &= 1 - \sum_{j=1}^{r} q_{ij} \;. \tag{5.31}
\end{align*}
Equating these derivatives to zero yields the solution θ∗ given below, which achieves the maximum of Q(θ ; θ0) under the applicable parameter constraints:
\begin{align*}
\mu_i^* &= \frac{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)\, Y_k}{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)} \;, \tag{5.32} \\
\upsilon_i^* &= \frac{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)\, (Y_k - \mu_i^*)^2}{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)} \;, \tag{5.33} \\
q_{ij}^* &= \frac{\sum_{k=1}^{n} \phi_{k-1:k|n}(i, j ; \theta_0)}{\sum_{k=1}^{n} \sum_{l=1}^{r} \phi_{k-1:k|n}(i, l ; \theta_0)} \;. \tag{5.34}
\end{align*}
Equations (5.32)–(5.34) are emblematic of the intuitive form taken by the parameter
update formulas derived through the EM strategy. These equations are simply the
maximum likelihood equations for the complete model in which both {Xk }0≤k≤n
and {Yk }0≤k≤n would be observed, except that the functions 1{Xk = i} and
1{Xk−1 = i, Xk = j} are replaced by their conditional expectations, φk|n (i ; θ0 )
and φk−1:k|n (i, j ; θ0 ), given the actual observations Y0:n and the available parame-
ter estimate θ0 . As discussed in Section 5.1.2, this behavior is fundamentally due to
the fact that the probability density functions associated with the complete model
form an exponential family. As a consequence, the same remark holds more gener-
ally for all discrete HMMs for which the conditional probability density functions
g(i, · ; θ) belong to an exponential family. A final word of warning about the way
in which (5.33) is written: in order to obtain a concise and intuitively interpretable
5.3. THE EXAMPLE OF NORMAL HIDDEN MARKOV MODELS 109
expression, (5.33) features the value of µ∗i as given by (5.32). It is of course possible
to rewrite (5.33) in a way that only contains the current parameter value θ0 and the
observations Y0:n by combining (5.32) and (5.33) to obtain
\[
\upsilon_i^* = \frac{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)\, Y_k^2}{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)}
- \left( \frac{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)\, Y_k}{\sum_{k=0}^{n} \phi_{k|n}(i ; \theta_0)} \right)^{\!2} \;. \tag{5.36}
\]
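In array form, with phi[k, i] = φk|n(i ; θ0) for k = 0, . . . , n and phi2[k−1, i, j] = φk−1:k|n(i, j ; θ0) for k = 1, . . . , n, computed by any of the forward-backward variants of Chapter 2, the updates (5.32)-(5.34) take a few lines (a sketch under our array conventions):

import numpy as np

def m_step_normal_hmm(Y, phi, phi2):
    # Y: (n+1,) observations; phi: (n+1, r); phi2: (n, r, r).
    denom = phi.sum(axis=0)                   # sum_k phi_{k|n}(i)
    mu = phi.T @ Y / denom                    # (5.32)
    upsilon = phi.T @ Y**2 / denom - mu**2    # (5.33), in the form (5.36)
    q = phi2.sum(axis=0)                      # sum_k phi_{k-1:k|n}(i, j)
    q = q / q.sum(axis=1, keepdims=True)      # (5.34)
    return mu, upsilon, q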
5.3.2 Estimation of the Initial Distribution
If the initial distribution ν = (ν1, . . . , νr) is also treated as a parameter to be
estimated, taking the conditional expectation of
\[
\log \nu_{X_0} = \sum_{i=1}^{r} \mathbf{1}\{X_0 = i\} \log \nu_i
\]
adds the sum
\[
\sum_{i=1}^{r} \phi_{0|n}(i ; \theta_0) \log \nu_i
\]
to (5.30). This sum is indeed part of (5.30) already, but hidden within C st when
ν is not a parameter to be estimated. Using Lagrange multipliers as above, it is
straightforward to show that the M-step update of ν is νi∗ = φ0|n(i ; θ0).
It was also mentioned above that sometimes it is desirable to link ν to qθ as
being the stationary distribution of qθ . Then there is an additive contribution to
Q(θ ; θ0 ) as above, with the difference that ν can now not be chosen freely but is a
function of qθ . As there is no simple formula for the stationary distribution of qθ ,
the M-step is no longer explicit. However, once the sums (over k) in (5.30) have
been computed for all i and j, we are left with an optimization problem over the qij
for which we have an excellent initial guess, namely the standard update (ignoring
ν) (5.34). A few steps of a standard numerical optimization routine (optimizing over
the qij ) is then often enough to find the maximum of Q(· ; θ0 ) under the stationarity
assumption. Variants of the basic EM strategy, to be discussed in Section 5.5.3,
may also be useful in this situation.
5.3.3 Computation of the Score and Observed Information
By (5.9), the score ∇θ ℓn(θ) coincides with the gradient of Q(· ; θ) evaluated at θ.
Hence
\begin{align*}
\frac{\partial}{\partial \mu_i} \ell_n(\theta) &= \frac{1}{\upsilon_i} \sum_{k=0}^{n} \phi_{k|n}(i ; \theta)\, (Y_k - \mu_i) \;, \\
\frac{\partial}{\partial \upsilon_i} \ell_n(\theta) &= -\frac{1}{2} \sum_{k=0}^{n} \phi_{k|n}(i ; \theta) \Bigl( \frac{1}{\upsilon_i} - \frac{(Y_k - \mu_i)^2}{\upsilon_i^2} \Bigr) \;, \\
\frac{\partial}{\partial q_{ij}} \ell_n(\theta) &= \sum_{k=1}^{n} \frac{\phi_{k-1:k|n}(i, j ; \theta)}{q_{ij}} \;.
\end{align*}
where the notation qij refers to the element in the (1 + i)-th row and (1 + j)-th
column of the matrix Q (in particular, q00 and q11 are alternative notations for
ρ0 and ρ1 ). We are thus in the framework of Proposition ?? with a smoothing
functional tn,1 defined by
\[
t_{0,1}(x) = \frac{\partial}{\partial \rho_0} \log \nu(x) \;, \qquad
s_{k,1}(x, x') = \frac{\partial}{\partial \rho_0} \log q_{x x'} \quad \text{for } k \ge 0 \;,
\]
where the multiplicative functions {mk,1}k≥0 are equal to 1. Straightforward calculations
yield
\[
t_{0,1}(x) = \frac{1}{\rho_0 + \rho_1} \Bigl( \frac{\rho_1}{\rho_0}\, \delta_0(x) - \delta_1(x) \Bigr) \;, \qquad
s_{k,1}(x, x') = \frac{1}{\rho_0}\, \delta_{(0,0)}(x, x') - \frac{1}{1 - \rho_0}\, \delta_{(0,1)}(x, x') \;.
\]
Algorithm 101.
Initialization: For i = 0, 1,
\[
\phi_0(i) = c_0^{-1}\, \nu(i)\, g_0(i) \;, \qquad \tau_{0,1}(i) = t_{0,1}(i)\, \phi_0(i) \;,
\]
where c0 = Σ_{i=0}^{1} ν(i) g0(i).
Recursion: For k = 0, 1, . . . , compute c_{k+1} = Σ_{i=0}^{1} Σ_{j=0}^{1} φk(i) qij g_{k+1}(j) and, for
j = 0, 1,
\begin{align*}
\phi_{k+1}(j) &= c_{k+1}^{-1} \sum_{i=0}^{1} \phi_k(i)\, q_{ij}\, g_{k+1}(j) \;, \\
\tau_{k+1,1}(j) &= c_{k+1}^{-1} \Bigl[ \sum_{i=0}^{1} \tau_{k,1}(i)\, q_{ij}\, g_{k+1}(j)
+ \phi_k(0)\, g_{k+1}(0)\, \delta_0(j) - \phi_k(0)\, g_{k+1}(1)\, \delta_1(j) \Bigr] \;.
\end{align*}
At each index k, the log-likelihood is available via ℓk = Σ_{l=0}^{k} log cl, and its
derivative with respect to ρ0 may be evaluated as
\[
\frac{\partial}{\partial \rho_0} \ell_k = \sum_{i=0}^{1} \tau_{k,1}(i) \;.
\]
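A direct transcription in NumPy (our vectorized notation; g[k, j] stands for gk(j), the likelihood of Yk in state j, and the incremental term in the τ recursion is the one displayed in Algorithm 101, which encodes this example's s_{k,1} with q00 = ρ0):

import numpy as np

def forward_score(nu, Q, g, t01):
    # nu: (2,) initial distribution; Q: (2, 2) transition matrix with
    # Q[0, 0] = rho_0; g: (n+1, 2) observation likelihoods g_k(j);
    # t01: (2,) values of t_{0,1}. Returns the log-likelihood and its
    # derivative with respect to rho_0.
    c = nu @ g[0]
    phi = nu * g[0] / c
    tau = t01 * phi
    loglik = np.log(c)
    for k in range(len(g) - 1):
        c = phi @ Q @ g[k + 1]
        incr = np.array([phi[0] * g[k + 1, 0], -phi[0] * g[k + 1, 1]])
        tau = ((tau @ Q) * g[k + 1] + incr) / c
        phi = (phi @ Q) * g[k + 1] / c
        loglik += np.log(c)
    return loglik, tau.sum()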
\begin{multline*}
\frac{\partial^2}{\partial \rho_0^2} \ell_n + \Bigl( \frac{\partial}{\partial \rho_0} \ell_n \Bigr)^2
= \mathrm{E}\left[ \frac{\partial^2}{\partial \rho_0^2} \log \nu(X_0) + \sum_{k=0}^{n-1} \frac{\partial^2}{\partial \rho_0^2} \log q_{X_k X_{k+1}} \,\Bigm|\, Y_{0:n} \right] \\
+ \mathrm{E}\left[ \Bigl( \frac{\partial}{\partial \rho_0} \log \nu(X_0) + \sum_{k=0}^{n-1} \frac{\partial}{\partial \rho_0} \log q_{X_k X_{k+1}} \Bigr)^{\!2} \,\Bigm|\, Y_{0:n} \right] \;. \tag{5.37}
\end{multline*}
The first term on the right-hand side of (5.37) is very similar to the case of τn,1
considered above, except that we now need to differentiate the functions twice,
replacing t0,1 and sk,1 by (∂/∂ρ0) t0,1 and (∂/∂ρ0) sk,1, respectively. The corresponding
smoothing functional tn,2 is thus now defined by
\[
t_{0,2}(x) = -\frac{\rho_1 (2\rho_0 + \rho_1)}{\rho_0^2\, (\rho_0 + \rho_1)^2}\, \delta_0(x) + \frac{1}{(\rho_0 + \rho_1)^2}\, \delta_1(x) \;, \qquad
s_{k,2}(x, x') = -\frac{1}{\rho_0^2}\, \delta_{(0,0)}(x, x') - \frac{1}{(1 - \rho_0)^2}\, \delta_{(0,1)}(x, x') \;.
\]
The second term on the right-hand side of (5.37) is more difficult, and we need
to proceed as in Example ??: the quantity of interest may be rewritten as the
conditional expectation of
\[
t_{n,3}(x_{0:n}) = \Bigl[ t_{0,1}(x_0) + \sum_{k=0}^{n-1} s_{k,1}(x_k, x_{k+1}) \Bigr]^2 \;,
\]
which satisfies the recursion
\[
t_{k+1,3}(x_{0:k+1}) = t_{k,3}(x_{0:k}) + s_{k,1}^2(x_k, x_{k+1}) + 2\, t_{k,1}(x_{0:k})\, s_{k,1}(x_k, x_{k+1}) \;.
\]
Hence tk,1 and tk,3 jointly are of the form prescribed by Definition ?? with
incremental additive functions sk,3(x, x′) = s²k,1(x, x′) and multiplicative updates
mk,3(x, x′) = 2 sk,1(x, x′). As a consequence, the following smoothing recursion
holds.
Initialization: For i = 0, 1,
where the second term on the left-hand side may be evaluated in the same recursion,
following Algorithm 101.
To illustrate the results obtained with Algorithms 101–102, we consider the
model with parameters ρ0 = 0.95, ρ1 = 0.8, and υ = 0.1 (using the notations
introduced in Example ??). Figure 5.1 displays the typical aspect of two sequences
of length 200 simulated under slightly different values of ρ0 . One possible use of
the output of Algorithms 101–102 consists in testing for changes in the parameter
values. Indeed, under conditions to be detailed in Chapter 6 (and which hold here),
the normalized score n^{-1/2} (∂/∂ρ0) ℓn satisfies a central limit theorem with variance given
by the limit of the normalized information −n^{-1} (∂²/∂ρ0²) ℓn. Hence it is expected
that
\[
R_n = \frac{\partial \ell_n / \partial \rho_0}{\sqrt{-\,\partial^2 \ell_n / \partial \rho_0^2}}
\]
is approximately standard normally distributed when the postulated value of ρ0 is
correct.
Figure 5.1: Two simulated trajectories of length n = 200 from the simplified ion
channel model of Example ?? with ρ0 = 0.95, ρ1 = 0.8, and σ 2 = 0.1 (top), and
ρ0 = 0.92, ρ1 = 0.8, and σ 2 = 0.1 (bottom).
Comparing Rn with the standard normal distribution under the null hypothesis ρ0 = 0.95 gives p-values of 0.87 and 0.09 for the two sequences
in the top and bottom plots, respectively, of Figure 5.1. When testing at the 10%
level, both sequences thus lead to the correct decision: no rejection and rejection of
the null hypothesis, respectively. Interestingly, testing the other way around, that
is, postulating ρ0 = 0.92 as the null hypothesis, gives p-values of 0.20 and 0.55 for
the top and bottom sequences of Figure 5.1, respectively. The outcome of the test
is now obviously less clear-cut, which reveals an asymmetry in its discrimination
ability: it is easier to detect values of ρ0 that are smaller than expected than the
converse. This is because smaller values of ρ0 mean more changes (on average) in
the state sequence and hence more usable information about ρ0 to be obtained from
a fixed size record. This asymmetry is connected to the upward bias visible in the
left plot of Figure 5.2.
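Given the score and observed information delivered by Algorithms 101-102, the p-values above are obtained by comparing Rn with the standard normal distribution; a minimal sketch using only the Python standard library:

from math import erf, sqrt

def score_test_pvalue(score, obs_info):
    # Two-sided p-value for R_n = score / sqrt(obs_info), R_n approx. N(0, 1).
    r = abs(score) / sqrt(obs_info)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(r / sqrt(2.0))))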
Figure 5.2: QQ-plot of empirical quantiles of the test statistic Rn (abscissas) for the
simplified ion channel model of Example ?? with ρ0 = 0.95, ρ1 = 0.8, and σ² = 0.1
vs. normal quantiles (ordinates). Sample sizes were n = 200 (left) and n = 1,000
(right), and 10,000 independent replications were used to estimate the empirical
quantiles.
where X0 , {Uk }k≥0 and {Vk }k≥0 are jointly Gaussian. The parameters of the model
are the four matrices A, R, B, and S. Note that except for scalar models, it is not
possible to estimate R and S because both {Uk } and {Vk } are unobservable and
hence R and S are only identifiable up to an orthonormal matrix. In other words,
multiplying R or S by any orthonormal matrix of suitable dimension does not modify
the distribution of the observations. Hence the parameters that are identifiable are
the covariance matrices ΥR = RRt and ΥS = SS t , which we consider below.
Likewise, the matrices A and B are identifiable up to a similarity transformation
only. Indeed, setting Xk0 = T Xk for some invertible matrix T , that is, making a
change of basis for the state process, it is straightforward to check that the joint
process {(Xk0 , Yk )} satisfies the model assumptions with T AT −1 , BT −1 , and T R
replacing A, B, and R, respectively. Nevertheless, we work with A and B in the
algorithm below. If a unique representation is desired, one may use, for instance,
the companion form of A given its eigenvalues; this matrix may contain complex
entries though. As in the case of finite state space HMMs (Section 5.2.2), it is not
sensible to consider the initial covariance matrix Σν as an independent parameter
when using a single observed sequence. On the other hand, for such models it is very
natural to assume that Σν is associated with the stationary distribution of {Xk }.
We shall also assume that both ΥR and ΥS are full rank covariance matrices so
that all Gaussian distributions admit densities with respect to (multi-dimensional)
Lebesgue measure.
Differentiating the intermediate quantity of EM with respect to A, Υ−1R, B, and Υ−1S yields
\begin{align*}
\nabla_A Q(\theta ; \theta_0) &= -\Upsilon_R^{-1}\, \mathrm{E}_{\theta_0}\left[ \sum_{k=0}^{n-1} ( A X_k X_k^t - X_{k+1} X_k^t ) \,\Bigm|\, Y_{0:n} \right] \;, \tag{5.39} \\
\nabla_{\Upsilon_R^{-1}} Q(\theta ; \theta_0) &= -\frac{1}{2} \left\{ -n \Upsilon_R + \mathrm{E}_{\theta_0}\left[ \sum_{k=0}^{n-1} (X_{k+1} - A X_k)(X_{k+1} - A X_k)^t \,\Bigm|\, Y_{0:n} \right] \right\} \;, \tag{5.40} \\
\nabla_B Q(\theta ; \theta_0) &= -\Upsilon_S^{-1}\, \mathrm{E}_{\theta_0}\left[ \sum_{k=0}^{n} ( B X_k X_k^t - Y_k X_k^t ) \,\Bigm|\, Y_{0:n} \right] \;, \tag{5.41} \\
\nabla_{\Upsilon_S^{-1}} Q(\theta ; \theta_0) &= -\frac{1}{2} \left\{ -(n+1) \Upsilon_S + \mathrm{E}_{\theta_0}\left[ \sum_{k=0}^{n} (Y_k - B X_k)(Y_k - B X_k)^t \,\Bigm|\, Y_{0:n} \right] \right\} \;. \tag{5.42}
\end{align*}
Note that in the expressions above, we differentiate with respect to the inverses of
ΥR and ΥS rather than with respect to the covariance matrices themselves, which
is equivalent, because we assume both of the covariance matrices to be positive
definite, but yields simpler formulas. Equating all derivatives simultaneously to
zero defines the EM update of the parameters. We will denote these updates by A∗ ,
B∗, Υ∗R, and Υ∗S, respectively. To write them down, denote X̂k|n(θ0) = Eθ0[Xk | Y0:n]
and Σk|n(θ0) = Eθ0[Xk Xk^t | Y0:n] − X̂k|n(θ0) X̂k|n^t(θ0), where we now indicate explicitly
that these first two smoothing moments indeed depend on the current estimates of
the model parameters (they also depend on the initial covariance matrix Σν, but
we ignore this fact here because this quantity is considered as being fixed). We also
need to evaluate the conditional covariances
\[
C_{k,k+1|n}(\theta_0) \overset{\mathrm{def}}{=} \mathrm{Cov}_{\theta_0}[X_k, X_{k+1} \mid Y_{0:n}]
= \mathrm{E}_{\theta_0}[X_k X_{k+1}^t \mid Y_{0:n}] - \hat{X}_{k|n}(\theta_0)\, \hat{X}_{k+1|n}^t(\theta_0) \;.
\]
\[
\Upsilon_S^* = \frac{1}{n+1} \sum_{k=0}^{n} \left[ Y_k Y_k^t - B^* \hat{X}_{k|n}(\theta_0)\, Y_k^t \right] \;. \tag{5.46}
\]
In obtaining the covariance update, we used the same remark that made it possible
to rewrite, in the case of normal HMMs, (5.33) as (5.36).
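For completeness, here is how the whole EM update may be assembled from the smoothed moments. Since the update formulas (5.43)-(5.45) for A∗, Υ∗R, and B∗ are not reproduced above, the sketch below uses their standard form for Gaussian linear state-space models (obtained by setting the gradients (5.39)-(5.42) to zero), which should be checked against the text; the array conventions are ours:

import numpy as np

def m_step_lgss(Y, xhat, Sigma, C):
    # Y: (n+1, dy) observations; xhat: (n+1, dx) smoothed means;
    # Sigma: (n+1, dx, dx) smoothed covariances; C: (n, dx, dx) lag-one
    # covariances C_{k,k+1|n}.
    Exx = Sigma + np.einsum('ki,kj->kij', xhat, xhat)        # E[X_k X_k^t | Y]
    Exx1 = C + np.einsum('ki,kj->kij', xhat[:-1], xhat[1:])  # E[X_k X_{k+1}^t | Y]
    S00, S01 = Exx[:-1].sum(axis=0), Exx1.sum(axis=0)
    A = np.linalg.solve(S00.T, S01).T            # A* = S01^t S00^{-1}
    UR = (Exx[1:].sum(axis=0) - A @ S01) / len(C)
    Sxx, Syx = Exx.sum(axis=0), np.einsum('ki,kj->ij', Y, xhat)
    B = np.linalg.solve(Sxx.T, Syx.T).T          # B* = Syx Sxx^{-1}
    US = (np.einsum('ki,kj->ij', Y, Y) - B @ Syx.T) / len(Y)  # cf. (5.46)
    return A, UR, B, US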
5.5 Complements
To conclude this chapter, we briefly return to an issue mentioned in Section 5.1.2
regarding the conditions that ensure that the EM iterations indeed converge to
stationary points of the likelihood.
Define the EM mapping M by
\[
M(\theta_0) = \mathop{\mathrm{arg\,max}}_{\theta \in \Theta} Q(\theta ; \theta_0) \;, \tag{5.47}
\]
understood as a point-to-set map when the maximizer is not unique;
that is, M(θ0) is the set of values θ that maximize Q(θ ; θ0) over Θ. Usually M(θ0)
reduces to a singleton, and the mapping M is then simply a point-to-point map (a
usual function from Θ to Θ). But the use of point-to-set maps makes it possible to
deal also with cases where the intermediate quantity of EM may have several global
maxima, without going into the details of what is done in such cases. We next need
the following definition before stating the main convergence theorem.
Definition 103 (Closed Map). A point-to-set map T on Θ is said to be closed on
S ⊆ Θ if, whenever
(a) θi → θ ∈ S,
(b) θ̃i ∈ T(θi) for all i,
(c) θ̃i → θ̃,
it follows that θ̃ ∈ T(θ).
Note that for point-to-point maps, that is, if T (θ) is a singleton for all θ, the
definition above is equivalent to the requirement that T be continuous on S. Defi-
nition 103 is thus a generalization of continuity for general (point-to-set) maps. We
are now ready to state the main result, which is proved in Zangwill (1969, p. 91)
or Luenberger (1984, p. 187).
Theorem 104 (Global Convergence Theorem). Let Θ be a subset of Rdθ and let
{θi }i≥0 be a sequence generated by θi+1 ∈ T (θi ) where T is a point-to-set map on
Θ. Let S ⊆ Θ be a given “solution” set and suppose that
(1) all points θi are contained in a compact subset of Θ;
(2) T is closed over the complement of S;
(3) there is a continuous “ascent” function s on Θ such that s(θ) ≥ s(θ0) for all
θ ∈ T(θ0), with strict inequality for points θ0 that are not in S.
Then the limit of any convergent subsequence of {θi } is in the solution set S. In
addition, the sequence of values of the ascent function, {s(θi )}i≥0 , converges mono-
tonically to s(θ? ) for some θ? ∈ S.
The final statement of Theorem 104 should not be misinterpreted: that {s(θi )}
converges to a value that is the image of a point in S is a simple consequence of
the first and third assumptions. It does however not imply that the sequence of
parameters {θi } is itself convergent in the usual sense, but only that the limit points
of {θi } have to be in the solution set S. An important property however is that
because {s(θi(l) )}l≥0 converges to s(θ? ) for any convergent subsequence {θi(l) }, all
limit points of {θi } must be in the set S? = {θ ∈ Θ : s(θ) = s(θ? )} (in addition
to being in S). This latter statement means that the sequence of iterates {θi } will
ultimately approach a set of points that are “equivalent” as measured by the ascent
function s.
The following general convergence theorem, which follows the proof by Wu (1983), is
a direct application of the previous theory to the case of EM.
Theorem 105. Assume that (i) ℓ is continuously differentiable and H is continuous
with respect to both of its arguments, and (ii) the level set Θ0 = {θ ∈ Θ : ℓ(θ) ≥ ℓ(θ0)}
is compact. Then all limit points of any instance {θi}i≥0 of an EM algorithm initialized at θ0
are in L0 = {θ ∈ Θ0 : ∇θ ℓ(θ) = 0}, the set of stationary points of ℓ with log-
likelihood larger than that of θ0. The sequence {ℓ(θi)} of log-likelihoods converges
monotonically to ℓ? = ℓ(θ?) for some θ? ∈ L0.
Proof. This is a direct application of Theorem 104 using L0 as the solution set and
` as the ascent function. The first hypothesis of Theorem 104 follows from (ii) and
the third one from Proposition 98. The closedness assumption (2) follows from
Proposition 98 and (i): for the EM mapping M defined in (5.47), θ̃i ∈ M (θi )
amounts to the condition
\[
Q(\tilde{\theta}^i ; \theta^i) \ge Q(\theta ; \theta^i) \quad \text{for all } \theta \in \Theta \;,
\]
which is also satisfied by the limits of the sequences {θ̃i} and {θi} (if these converge)
by continuity of the intermediate quantity Q, which follows from that of ` and H
(note that it is here important that H be continuous with respect to both argu-
ments). Hence the EM mapping is indeed closed on Θ as a whole and Theorem 105
follows.
The assumptions of Proposition 98 as well as item (i) above are indeed very mild
in typical situations. Assumption (ii) however may be restrictive, even for models
in which the EM algorithm is routinely used. The practical implication of (ii) being
violated is that the EM algorithm may fail to converge to the stationary points of
the likelihood for some particularly badly chosen initial points θ0 .
Most importantly, the fact that θi+1 maximizes the intermediate quantity Q(· ; θi )
of EM does in no way imply that, ultimately, `? is the global maximum of ` over
Θ. There is even no guarantee that `? is a local maximum of the log-likelihood: it
may well only be a saddle point (Wu, 1983, Section 2.1). Also, the convergence of
the sequence `(θi ) to `? does not automatically imply the convergence of {θi } to a
point θ? .
Pointwise convergence of the EM algorithm requires more stringent assumptions
that are difficult to verify in practice. As an example, a simple corollary of the global
convergence theorem states that if the solution set S in Theorem 104 is a single
point, θ? say, then the sequence {θi } indeed converges to θ? (Luenberger, 1984,
p. 188). The sketch of the proof of this corollary is that every subsequence of {θi }
has a convergent further subsequence because of the compactness assumption (1),
but such a subsequence admits s as an ascent function and thus converges to θ?
by Theorem 104 itself. In cases where the solution set is composed of several
points, further conditions are needed to ensure that the sequence of iterates indeed
converges and does not cycle through different solution points.
In the case of EM, pointwise convergence of the EM sequence may be guaranteed
under an additional condition given by Wu (1983) (see also Boyles, 1983, for an
equivalent result), stated in the following theorem.
Theorem 106. If, in addition to the assumptions of Theorem 105,
(iii) ‖θi+1 − θi‖ → 0 as i → ∞,
then all limit points of {θi } are in a connected and compact subset of L? = {θ ∈ Θ :
`(θ) = `? }, where `? is the limit of the log-likelihood sequence {`(θi )}.
In particular, if the connected components of L? are singletons, then {θi } con-
verges to some θ? in L? .
Proof. The set of limit points of a bounded sequence {θi } with kθi+1 − θi k → 0 is
connected and compact (Ostrowski, 1966, Theorem 28.1). The proof follows because,
by Theorem 105, the limit points of {θi} must belong to L?.
5.5.2 Rate of Convergence of EM
In a neighborhood of a fixed point θ? of the EM mapping M, the iteration θi+1 = M(θi)
may be linearized as θi+1 − θ? ≈ ∇θ M(θ?)(θi − θ?).
Here ∇θ M(θ?) is called the rate matrix (see for instance Meng and Rubin, 1991).
A fixed point θ? is said to be stable if the spectral radius of ∇θ M (θ? ) is less than 1.
In this case, the tangent linear system is asymptotically stable in the sense that the
sequence {ζ i } defined recursively by ζ i+1 = ∇θ M (θ? )ζ i tends to zero as n tends to
infinity (for any choice of ζ 0 ). The linear rate of convergence of EM is defined as
the largest moduli of the eigenvalues of ∇θ M (θ? ). This rate is an upper bound on
the factors ρk that appear in (5.17).
Proposition 107. Under the assumptions of Proposition 100, assume that Q(· ; θ)
has a unique maximizer for all θ ∈ Θ and that, in addition,
\[
H(\theta_\star) = - \int \left. \nabla_\theta^2 \log f(x ; \theta) \right|_{\theta=\theta_\star} p(x ; \theta_\star)\, \lambda(dx) \tag{5.49}
\]
and
\[
G(\theta_\star) = - \int \left. \nabla_\theta^2 \log p(x ; \theta) \right|_{\theta=\theta_\star} p(x ; \theta_\star)\, \lambda(dx) \tag{5.50}
\]
are positive definite matrices for all stationary points of EM (i.e., such that M (θ? ) =
θ? ). Then for all such points, the following hold true.
(i) ∇θ M (θ? ) is diagonalizable and its eigenvalues are positive real numbers.
(ii) The point θ? is stable for the mapping M if and only if it is a proper maximizer
of `(θ) in the sense that all eigenvalues of ∇2θ `(θ? ) are negative.
Proof. The EM mapping is defined implicitly through the fact that M (θ0 ) maximizes
Q(· ; θ0 ), which implies that
\[
\int \left. \nabla_\theta \log f(x ; \theta) \right|_{\theta = M(\theta_0)} p(x ; \theta_0)\, \lambda(dx) = 0 \;,
\]
Thus ∇θ M (θ? ) is stable if and only if B? has negative eigenvalues only. The
Sylvester law of inertia (see for instance Horn and Johnson, 1985) shows that B? has
the same inertia (number of positive, negative, and zero eigenvalues) as ∇2θ `(θ? ).
Thus all of B? ’s eigenvalues are negative if and only if the same is true for ∇2θ `(θ? ),
that is, if θ? is a proper maximizer of `.
The proof above implies that when θ? is stable, the eigenvalues of ∇θ M(θ?) lie in
the interval (0, 1).
5.5.3 Generalized EM Algorithms
A simple example is the ECM (expectation-conditional maximization) strategy, in
which the parameter is partitioned as θ = (θ1, θ2) and the M-step is replaced by
the two successive conditional maximizations
\[
\theta_1^{i+1} = \mathop{\mathrm{arg\,max}}_{\theta_1} Q((\theta_1, \theta_2^i) ; (\theta_1^i, \theta_2^i))
\]
and then
\[
\theta_2^{i+1} = \mathop{\mathrm{arg\,max}}_{\theta_2} Q((\theta_1^{i+1}, \theta_2) ; (\theta_1^i, \theta_2^i)) \;.
\]
It is easily checked that for this algorithm, (5.8) is still verified and thus ` is an ascent
function; this implies that Theorem 105 holds under the same set of assumptions.
The example above is only the simplest case where the ECM approach may be
applied, and further extensions are discussed by Meng and Rubin (1993).
Chapter 6
Statistical Properties of the Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) is one of the backbones of statistics, and
as we have seen in previous chapters, it is very much appropriate also for HMMs,
even though numerical approximations are required when the state space is not
finite. A standard result in statistics says that, except for “atypical cases”, the
MLE is consistent, asymptotically normal with asymptotic (scaled) variance equal
to the inverse Fisher information matrix, and efficient. The purpose of the current
chapter is to show that these properties are indeed true for HMMs as well, provided
some conditions of rather standard nature hold. We will also employ the asymptotic
results obtained to verify the validity of certain likelihood-based tests.
Recall that the distribution (law) P of {Yk }k≥0 depends on a parameter θ that
lies in a parameter space Θ, which we assume is a subset of Rdθ for some dθ . Com-
monly, θ is a vector containing some components that parameterize the transition
kernel of the hidden Markov chain—such as the transition probabilities if the state
space X is finite—and other components that parameterize the conditional distri-
butions of the observations given the states. Throughout the chapter, it is assumed
that the HMM model is, for all θ, fully dominated in the sense of Definition 13 and
that the underlying Markov chain is positive (see Definition 171).
Assumption 108.
(i) There exists a probability measure λ on (X, X) such that for any x ∈ X and
any θ ∈ Θ, Qθ(x, ·) ≪ λ with transition density qθ. That is, Qθ(x, A) =
∫_A qθ(x, x′) λ(dx′) for A ∈ X.
(ii) There exists a probability measure µ on (Y, Y) such that for any x ∈ X and any
θ ∈ Θ, Gθ(x, ·) ≪ µ with transition density function gθ. That is, Gθ(x, A) =
∫_A gθ(x, y) µ(dy) for A ∈ Y.
(i) for all θ ∈ Θ, n−1 ℓn(θ) → ℓ(θ) Pθ?-a.s. uniformly over compact subsets of Θ,
where ℓn(θ) is the log-likelihood of the parameter θ given the first n obser-
vations and ℓ(θ) is a continuous deterministic function with a unique global
maximum at θ?;
(ii) n−1/2 ∇θ ℓn(θ?) → N(0, J(θ?)) Pθ?-weakly, where J(θ) is the Fisher informa-
tion matrix at θ (we do not provide a more detailed definition at the moment);
(iii) limδ→0 limn→∞ sup|θ−θ?|≤δ ‖−n−1 ∇2θ ℓn(θ) − J(θ?)‖ = 0 Pθ?-a.s.
The function ` in (i) is sometimes referred to as the contrast function. We note that
−n−1 ∇2θ `n (θ) in (iii) is the observed information matrix, so that (iii) says that the
observed information should converge to the Fisher information in a certain uniform
sense. This uniformity may be replaced by conditions on the third derivatives of
the log-likelihood, which is common in statistical textbooks, but as we shall see, it
is cumbersome enough even to deal with second derivatives of the log-likelihood for
HMMs, whence avoiding third derivatives is preferable.
Condition (i) assures strong consistency of the MLE, which can be shown using
an argument that goes back to Wald (1949). The idea of the argument is as follows.
Denote by θ̂n the ML estimator; then ℓn(θ̂n) ≥ ℓn(θ) for any θ ∈ Θ.
Because ` has a unique global maximum at θ? , `(θ? ) − `(θ) ≥ 0 for any θ ∈ Θ and,
in particular, `(θ? ) − `(θbn ) ≥ 0. We now combine these two inequalities to obtain
\begin{align*}
0 \le \ell(\theta_\star) - \ell(\hat{\theta}_n)
&\le [\ell(\theta_\star) - n^{-1}\ell_n(\theta_\star)] + [n^{-1}\ell_n(\theta_\star) - n^{-1}\ell_n(\hat{\theta}_n)] + [n^{-1}\ell_n(\hat{\theta}_n) - \ell(\hat{\theta}_n)] \\
&\le 2 \sup_{\theta\in\Theta} |\ell(\theta) - n^{-1}\ell_n(\theta)| \;.
\end{align*}
Therefore, by taking the compact subset in (i) above as Θ itself, ℓ(θ̂n) → ℓ(θ?)
Pθ?-a.s. as n → ∞, which in turn implies, as ℓ is continuous with a unique global
maximum at θ?, that the MLE converges to θ? Pθ?-a.s. In other words, the MLE
is strongly consistent.
Provided strong consistency holds, properties (ii) and (iii) above yield asymptotic
normality of the MLE. In fact, we must also assume that θ? is an interior point
of Θ and that the Fisher information matrix J(θ?) is non-singular. Then we can,
for sufficiently large n, make a Taylor expansion around θ?; noting that the gradient
of ℓn vanishes at the MLE θ̂n (an interior maximum point for large n),
\[
0 = \nabla_\theta \ell_n(\hat{\theta}_n) = \nabla_\theta \ell_n(\theta_\star) + \int_0^1 \nabla_\theta^2 \ell_n[\theta_\star + t(\hat{\theta}_n - \theta_\star)]\, dt\; (\hat{\theta}_n - \theta_\star) \;.
\]
Rearranging, n1/2(θ̂n − θ?) is the product of [−n−1 ∫_0^1 ∇2θ ℓn[θ? + t(θ̂n − θ?)] dt]−1
and n−1/2 ∇θ ℓn(θ?). Because θ̂n converges to θ? Pθ?-a.s., (iii) shows that the first
factor tends to J(θ?)−1 Pθ?-a.s., while by (ii) the second factor converges weakly to
N(0, J(θ?)). Cramér-Slutsky's theorem hence tells us that n1/2(θ̂n − θ?) tends Pθ?-
weakly to N(0, J−1(θ?)), and this is the standard result on asymptotic normality
of the MLE.
In an entirely similar way properties (ii) and (iii) also show that for any u ∈ Rdθ
(recall that Θ is a subset of Rdθ ),
\[
\ell_n(\theta_\star + n^{-1/2} u) - \ell_n(\theta_\star) = n^{-1/2}\, u^T \nabla_\theta \ell_n(\theta_\star) - \frac{1}{2}\, u^T \bigl[ -n^{-1} \nabla_\theta^2 \ell_n(\theta_\star) \bigr] u + R_n(u) \;,
\]
where n−1/2 ∇θ `n (θ? ) and −n−1 ∇2θ `n (θ? ) converge as described above, and where
Rn (u) tends to zero Pθ? -a.s. Such an expansion is known as local asymptotic nor-
mality (LAN) of the model, cf. Ibragimov and Hasminskii (1981, Definition II.2.1).
Under this condition, it is known that so-called regular estimators (a property pos-
sessed by all “sensible” estimators) cannot have an asymptotic covariance matrix
smaller than J −1 (θ? ) (Ibragimov and Hasminskii, 1981, p. 161). Because this limit
is obtained by the MLE, this estimator is efficient.
Later on in this chapter, we will also exploit properties (i)–(iii) to derive asymp-
totic properties of likelihood ratio and other tests for lower dimensional hypotheses
regarding θ.
We could also want to replace the fixed initial state by an initial distribution ν on
(X, X), giving
\[
L_{\nu,n}(\theta) = \int_{\mathsf{X}} L_{x_0,n}(\theta)\, \nu(dx_0) \;.
\]
The stationary likelihood is then Lπθ ,n (θ), which we will simply denote by Ln (θ).
The advantage of working with the stationary likelihood is of course that it is
the correct likelihood for the model and may hence be expected to provide better
finite-sample performance. The advantage of assuming a fixed initial state x0 —and
hence adopting the likelihood Lx0 ,n (θ)—is that the stationary distribution πθ is not
always available in closed form when X is not finite. It is however important that
gθ (x0 , Y0 ) is positive Pθ? -a.s.; otherwise the log-likelihood may not be well-defined.
In fact, we shall require that gθ (x0 , Y0 ) is, Pθ? -a.s., bounded away from zero. In
the following, we always assume that this condition is fulfilled. A further advantage
of Lx0 ,n (θ) is that the methods described in the current chapter may be extended
to Markov-switching autoregressions (Douc et al., 2004), and then the stationary
likelihood is almost never computable, not even when X is finite. Throughout the
rest of this chapter, we will work with Lx0,n(θ) unless stated otherwise, where x0 ∈ X is
chosen to satisfy the above positivity assumption but otherwise arbitrarily. The
MLE arising from this likelihood has the same asymptotic properties as has the
MLE arising from Ln (θ), provided the initial stationary distribution πθ has smooth
124 CHAPTER 6. STATISTICAL PROPERTIES OF THE MLE
second-order derivatives (cf. Bickel et al., 1998), whence from an asymptotic point
of view there is no loss in using the incorrect likelihood Lx0 ,n (θ).
We now return to the analysis of log-likelihood and items (i)–(iii) above. In the
setting of i.i.d. observations, the log-likelihood `n (θ) is a sum of i.i.d. terms, and
so (i) and (iii) follow from uniform versions of the strong law of large numbers and
(ii) is a consequence of the simplest central limit theorem. In the case of HMMs,
we can write `x0 ,n (θ) as a sum as well:
\begin{align*}
\ell_{x_0,n}(\theta) &= \sum_{k=0}^{n} \log \int g_\theta(x_k, Y_k)\, \phi_{x_0,k|k-1}[Y_{0:k-1}](dx_k ; \theta) \tag{6.2} \\
&= \sum_{k=0}^{n} \log \int g_\theta(x_k, Y_k)\, \mathrm{P}_\theta(X_k \in dx_k \mid Y_{0:k-1}, X_0 = x_0) \;, \tag{6.3}
\end{align*}
where φx0 ,k|k−1 [Y0:k−1 ](· ; θ) is the predictive distribution of the state Xk given the
observations Y0:k−1 and X0 = x0 . These terms do not form a stationary sequence
however, so the law of large numbers—or rather the ergodic theorem—does not
apply directly. Instead we must first approximate `x0 ,n (θ) by the partial sum of a
stationary sequence.
When the joint Markov chain {Xk , Yk } has an invariant distribution, this chain is
stationary provided it is started from its invariant distribution. In this case, we can
(and will!) extend it to a stationary sequence {Xk , Yk }−∞<k<∞ with doubly infinite
time, as we can do with any stationary sequence. Having done this extension, we
can imagine a predictive distribution of the state Xk given the infinite past Y−∞:k−1
of observations. A key feature of these variables is that they now form a stationary
sequence, whence the ergodic theorem applies. Furthermore we can approximate
ℓx0,n(θ) by
\[
\ell_n^{\mathrm{s}}(\theta) = \sum_{k=0}^{n} \log \int g_\theta(x_k, Y_k)\, \mathrm{P}_\theta(X_k \in dx_k \mid Y_{-\infty:k-1}) \;, \tag{6.4}
\]
where superindex s stands for “stationary”. Heuristically, one would expect this
approximation to be good, as observations far in the past do not provide much
information about the current one, at least not if the hidden Markov chain enjoys
good mixing properties. What we must do is thus to give a precise definition of the
predictive distribution Pθ (Xk ∈ · | Y−∞:k−1 ) given the infinite past, and then show
that it approximates the predictive distribution φx0 ,k|k−1 (· ; θ) well enough that
the two sums (6.2) and (6.4), after normalization by n, have the same asymptotic
behavior. We can treat the score function similarly by defining a sequence that
forms a stationary martingale increment sequence; for sums of such sequences there
is a central limit theorem.
The cornerstone in this analysis is the result on conditional mixing stated in
Section 3. We will rephrase it here, but before doing so we state a first assumption.
It is really a variation of Assumption 62, adapted to the dominated setting and
uniform in θ.
Assumption 109.
(i) The transition density q_θ(x, x′) of {X_k} satisfies 0 < σ⁻ ≤ q_θ(x, x′) ≤ σ⁺ < ∞
for all x, x′ ∈ X and all θ ∈ Θ, and the measure λ is a probability measure.
(ii) For all y ∈ Y, the integral ∫_X g_θ(x, y) λ(dx) is bounded away from 0 and ∞ on
Θ.
Part (i) of this assumption often, but not always, holds when the state space X
is finite or compact. Note that Assumption 109 says that for all θ ∈ Θ, the whole
state space X is a 1-small set for the transition kernel Q_θ, which implies that for
all θ ∈ Θ, the chain is phi-irreducible and strongly aperiodic (see Section 7.2 for
definitions). It also ensures that there exists a stationary distribution π_θ for Q_θ.
In addition, the chain is uniformly geometrically ergodic in the sense that for any
x ∈ X and n ≥ 0, ‖Q^n_θ(x, ·) − π_θ‖_TV ≤ (1 − σ⁻)^n. Under Assumption 108, it holds
that π_θ ≪ λ, and we use the same notation for this distribution and its density
with respect to the dominating measure λ.
Using the results of Section 7.3, we conclude that the state space X × Y is 1-small
for the joint chain {X_k, Y_k}. Thus the joint chain is also phi-irreducible and strongly
aperiodic, and it admits a stationary distribution with density π_θ(x)g_θ(x, y) with
respect to the product measure λ ⊗ µ on (X × Y, X ⊗ Y). The joint chain is also
uniformly geometrically ergodic.
Put ρ = 1 − σ⁻/σ⁺; then 0 ≤ ρ < 1. The important consequence of Assumption 109
that we need in the current chapter is Proposition 64. It says that if
Assumption 109 holds true, then for all k ≥ 1, all y_{0:n} and all initial distributions
ν and ν′ on (X, X),
‖ ∫_X P_θ(X_k ∈ · | X_0 = x, Y_{0:n} = y_{0:n}) [ν(dx) − ν′(dx)] ‖_TV ≤ ρ^k .   (6.5)
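The forgetting property (6.5) can be observed numerically. The following sketch (an illustration, not the book's algorithm) runs the normalized filter of a small Gaussian-emission HMM from two different initial distributions on the same observation record; the total variation distance between the two conditional distributions decays at a geometric rate, in line with the bound ρ^k. All numerical values are arbitrary assumptions.

```python
# Forgetting of the initial condition, filtered analog of (6.5): two filters
# started from different initial distributions merge geometrically fast.
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
mu = np.array([-1.0, 0.0, 1.0])        # N(mu_x, 1) emission densities (assumed)
y = rng.normal(0.0, 1.0, size=30)      # an arbitrary observation record

def filters(nu):
    out = [nu]
    for yk in y:
        g = np.exp(-0.5 * (yk - mu) ** 2)   # unnormalized N(mu, 1) densities
        p = (out[-1] @ Q) * g
        out.append(p / p.sum())
    return out

f1 = filters(np.array([1.0, 0.0, 0.0]))
f2 = filters(np.array([0.0, 0.0, 1.0]))
for k in range(0, 31, 5):
    print(k, 0.5 * np.abs(f1[k] - f2[k]).sum())  # TV distance shrinks ~ rho^k
```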
6.3 Consistency
6.3.1 Construction of the Stationary Conditional Log-likelihood
We shall now construct P_θ(X_k ∈ dx_k | Y_{−∞:k−1}) and ∫ g_θ(x_k, Y_k) P_θ(X_k ∈ dx_k | Y_{−∞:k−1}).
The latter variable will be defined as the limit of
H_{k,m,x}(θ) ≝ ∫ g_θ(x_k, Y_k) P_θ(X_k ∈ dx_k | Y_{−m+1:k−1}, X_{−m} = x)   (6.6)
sup_{θ∈Θ} sup_{m≥−(k−1)} sup_{x∈X} |h_{k,m,x}(θ)| ≤ |log b⁺| ∨ |log(σ⁻ b⁻(Y_k))| .   (6.9)
Note that the step from the total variation bound to the bound on the difference
between the integrals does not need a factor “2”, because the integrands are non-negative.
Also note that (6.5) is stated for m = m′ = 0, but its initial time index is
of course arbitrary. The integral in (6.10) can be bounded from below as
H_{k,m,x}(θ) ≥ σ⁻ ∫ g_θ(x_k, Y_k) λ(dx_k) ,   (6.13)
and the same lower bound holds for (6.11). Combining (6.12) with these lower
bounds and the inequality |log x − log y| ≤ |x − y|/(x ∧ y) shows that
|h_{k,m,x}(θ) − h_{k,m′,x′}(θ)| ≤ (σ⁺/σ⁻) ρ^{k+m−1} = ρ^{k+m−1}/(1 − ρ) ,
which is the first assertion of the lemma. Furthermore, note that (6.10) and (6.13)
yield
σ⁻ b⁻(Y_k) ≤ H_{k,m,x}(θ) ≤ b⁺ ,   (6.14)
which implies the second assertion.
Equation (6.8) shows that for any given k and x, {h_{k,m,x}(θ)}_{m≥−(k−1)} is a uniform
(in θ) Cauchy sequence as m → ∞, P_{θ⋆}-a.s., whence there is a P_{θ⋆}-a.s. limit.
Moreover, again by (6.8), this limit does not depend on x, so we denote it by
h_{k,∞}(θ). Our interpretation of this limit is as log E_θ[g_θ(X_k, Y_k) | Y_{−∞:k−1}]. Furthermore,
(6.9) shows that provided Assumption 110 holds, {h_{k,m,x}(θ)}_{m≥−(k−1)} is
uniformly bounded in L¹(P_{θ⋆}), so that h_{k,∞}(θ) is in L¹(P_{θ⋆}) and, by the dominated
convergence theorem, the limit holds in this mode as well. Finally, by its definition,
{h_{k,∞}(θ)}_{k≥0} is a stationary process, and it is ergodic because {Y_k}_{−∞<k<∞} is. We
summarize these findings.
Proposition 112. Assume 108, 109, and 110 hold. Then for each θ ∈ Θ and
x ∈ X, the sequence {h_{k,m,x}(θ)}_{m≥−(k−1)} has, P_{θ⋆}-a.s., a limit h_{k,∞}(θ) as m → ∞.
This limit does not depend on x. In addition, for any θ ∈ Θ, h_{k,∞}(θ) belongs to
L¹(P_{θ⋆}), and {h_{k,m,x}(θ)}_{m≥−(k−1)} also converges to h_{k,∞}(θ) in L¹(P_{θ⋆}) uniformly
over θ ∈ Θ and x ∈ X.
Having come thus far, we can quantify the approximation of the log-likelihood
ℓ_{x_0,n}(θ) by ℓ^s_n(θ).
Proposition 113. Assume 108, 109, and 110 hold. Then
|ℓ_{x_0,n}(θ) − ℓ^s_n(θ)| ≤ |log g_θ(x_0, Y_0)| + |h_{0,∞}(θ)| + 1/(1 − ρ)²   P_{θ⋆}-a.s.
Proof. Letting m′ → ∞ in (6.8), we obtain |h_{k,0,x_0}(θ) − h_{k,∞}(θ)| ≤ ρ^{k−1}/(1 − ρ) for
k ≥ 1. Therefore, P_{θ⋆}-a.s.,
|ℓ_{x_0,n}(θ) − ℓ^s_n(θ)| = | Σ_{k=0}^{n} h_{k,0,x_0}(θ) − Σ_{k=0}^{n} h_{k,∞}(θ) |
≤ |log g_θ(x_0, Y_0)| + |h_{0,∞}(θ)| + Σ_{k=1}^{n} ρ^{k−1}/(1 − ρ) .
The following result shows that h_{k,∞}(θ) is then continuous in L¹(P_{θ⋆}).
Proposition 115. Assume 108, 109, 110, and 114 hold. Then for any θ ∈ Θ,
E_{θ⋆}[ sup_{θ′∈Θ: |θ′−θ|≤δ} |h_{0,∞}(θ′) − h_{0,∞}(θ)| ] → 0 as δ → 0 ,
Proof. Recall that h_{0,∞}(θ) is the limit of h_{0,m,x}(θ) as m → ∞. We first prove that
for any x ∈ X and any m > 0, the latter quantity is continuous in θ, and then use
this to show continuity of the limit. Recall the interpretation of H_{0,m,x}(θ) as a
conditional density and write
H_{0,m,x}(θ) = [ ∫···∫ ∏_{i=−m+1}^{0} q_θ(x_{i−1}, x_i) g_θ(x_i, Y_i) λ(dx_{−m+1}) ··· λ(dx_0) ]
             / [ ∫···∫ ∏_{i=−m+1}^{−1} q_θ(x_{i−1}, x_i) g_θ(x_i, Y_i) λ(dx_{−m+1}) ··· λ(dx_{−1}) ] ,   (6.15)
with x_{−m} = x.
We can now proceed to show uniform convergence of n⁻¹ℓ_{x_0,n}(θ) to ℓ(θ).
Proposition 116. Assume 108, 109, 110, and 114 hold. Then
lim_{n→∞} sup_{θ∈Θ} |n⁻¹ℓ_{x_0,n}(θ) − ℓ(θ)| = 0   P_{θ⋆}-a.s.
Proof. First note that because Θ is compact, it is sufficient to prove that for all
θ ∈ Θ,
lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n⁻¹ℓ_{x_0,n}(θ′) − ℓ(θ)| = 0   P_{θ⋆}-a.s.
Now write
lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n⁻¹ℓ_{x_0,n}(θ′) − ℓ(θ)|
= lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} |n⁻¹ℓ_{x_0,n}(θ′) − n⁻¹ℓ^s_n(θ)|
≤ lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} n⁻¹|ℓ_{x_0,n}(θ′) − ℓ^s_n(θ′)|
+ lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} n⁻¹|ℓ^s_n(θ′) − ℓ^s_n(θ)| ,
where the equality holds because n⁻¹ℓ^s_n(θ) → ℓ(θ) P_{θ⋆}-a.s. by the ergodic theorem.
The first term on the right-hand side vanishes by Proposition 113 (note that Lemma 111
shows that sup_{θ′} |h_{0,∞}(θ′)| is in L¹(P_{θ⋆}) and hence finite P_{θ⋆}-a.s.). The second term
is bounded by
lim sup_{δ→0} lim sup_{n→∞} sup_{θ′: |θ′−θ|≤δ} | n⁻¹ Σ_{k=0}^{n} (h_{k,∞}(θ′) − h_{k,∞}(θ)) |
≤ lim sup_{δ→0} lim sup_{n→∞} n⁻¹ Σ_{k=0}^{n} sup_{θ′: |θ′−θ|≤δ} |h_{k,∞}(θ′) − h_{k,∞}(θ)|
= lim sup_{δ→0} E_{θ⋆}[ sup_{θ′: |θ′−θ|≤δ} |h_{0,∞}(θ′) − h_{0,∞}(θ)| ] = 0 ,
with convergence P_{θ⋆}-a.s. The two final steps follow by the ergodic theorem and
Proposition 115, respectively. The proof is complete.
At this point, we thus know that n⁻¹ℓ_{x_0,n} converges uniformly to ℓ. The
same conclusion holds when other initial distributions ν are put on X_0, provided
sup_θ |log ∫ g_θ(x, Y_0) ν(dx)| is finite P_{θ⋆}-a.s. When ν is the stationary distribution π_θ,
uniform convergence can in fact be proved without this extra regularity assumption
by conditioning on the previous state X_{−1} to get rid of the first two terms in the
bound of Proposition 113; cf. Douc et al. (2004).
The uniform convergence of n⁻¹ℓ_{x_0,n}(θ) to ℓ(θ) can be used, with an argument
entirely similar to the one of Wald outlined in Section 6.1, to show that the MLE
converges a.s. to the set, Θ⋆ say, of global maxima of ℓ. Because ℓ is continuous,
we know that Θ⋆ is closed and hence also compact. More precisely, for any (open)
neighborhood of Θ⋆, the MLE will be in that neighborhood for large n, P_{θ⋆}-a.s. We
say that the MLE converges to Θ⋆ in the quotient topology. This way of describing
convergence was used, in the context of HMMs, by Leroux (1992). The purpose of
the identifiability constraint, that ℓ(θ) has a unique global maximum at θ⋆, is thus
to ensure that Θ⋆ consists of the single point θ⋆, so that the MLE indeed converges
to the point θ⋆.
6.4 Identifiability
As became obvious in the previous section, the set of global maxima of ` is of
intrinsic importance, as this set constitutes the possible limit points of the MLE.
The definition of `(θ) as a limit is however usually not suitable for extracting relevant
information about the set of maxima, and the purpose of this section is to derive a
different characterization of the set of global maxima of `.
Theorem 118. Assume 108, 109, and 110. Then a parameter θ ∈ Θ is a global
maximum of ℓ if and only if θ is equivalent to θ⋆.
where H_{k,m,x}(θ) is given in (6.6). Recalling that H_{1,m,x}(θ) is the conditional density
of Y_1 given Y_{−m+1:0} and X_{−m} = x, we see that the inner (conditional) expectation
on the right-hand side is a Kullback-Leibler divergence and hence non-negative.
Thus the outer expectation and the limit ℓ(θ⋆) − ℓ(θ) are non-negative as well, so
that θ⋆ is a global maximum of ℓ.
Now pick θ ∈ Θ such that ℓ(θ) = ℓ(θ⋆). Throughout the remainder of the
proof, we will use the letter p to denote (possibly conditional) densities of random
variables, with the arguments of the density indicating which random variables are
referred to. For any k ≥ 1,
0 = k(ℓ(θ⋆) − ℓ(θ))
= lim_{m→∞} E_{θ⋆}[ log ( p_{θ⋆}(Y_{1:k} | Y_{−m+1:0}, X_{−m} = x) / p_θ(Y_{1:k} | Y_{−m+1:0}, X_{−m} = x) ) ]
= lim_{m→∞} ( E_{θ⋆}[ log ( p_{θ⋆}(Y_{k−n+1:k} | Y_{−m+1:0}, X_{−m} = x) / p_θ(Y_{k−n+1:k} | Y_{−m+1:0}, X_{−m} = x) ) ]
+ E_{θ⋆}[ log ( p_{θ⋆}(Y_{1:k−n} | Y_{k−n+1:k}, Y_{−m+1:0}, X_{−m} = x) / p_θ(Y_{1:k−n} | Y_{k−n+1:k}, Y_{−m+1:0}, X_{−m} = x) ) ] )
≥ lim sup_{m→∞} E_{θ⋆}[ log ( p_{θ⋆}(Y_{1:n} | Y_{n−k−m+1:n−k}, X_{n−k−m} = x) / p_θ(Y_{1:n} | Y_{n−k−m+1:n−k}, X_{n−k−m} = x) ) ] ,
where the inequality follows by using stationarity for the first term and noting
that the second term is non-negative as an expectation of a (conditional) Kullback-
Leibler divergence as above. Hence we have inserted a gap between the variables
Y1:n whose density we examine and the variables Yn−k−m+1:n−k and Xn−k−m that
appear as a condition. The idea is now to let this gap tend to infinity and to show
that in the limit the condition has no effect. Next we shall thus show that
lim_{k→∞} sup_{m≥k} | E_{θ⋆}[ log ( p_{θ⋆}(Y_{1:n} | Y_{−m+1:−k}, X_{−m} = x) / p_θ(Y_{1:n} | Y_{−m+1:−k}, X_{−m} = x) ) ]
− E_{θ⋆}[ log ( p_{θ⋆}(Y_{1:n}) / p_θ(Y_{1:n}) ) ] | = 0 .   (6.16)
Combining (6.16) with the previous inequality, it is clear that if ℓ(θ) = ℓ(θ⋆), then
E_{θ⋆}{log[p_{θ⋆}(Y_{1:n})/p_θ(Y_{1:n})]} = 0, that is, the Kullback-Leibler divergence between
the n-dimensional densities p_{θ⋆}(y_{1:n}) and p_θ(y_{1:n}) vanishes. This implies, by the
information inequality, that these densities coincide except on a set of µ^{⊗n}-measure
zero, so that the n-dimensional laws of P_{θ⋆} and P_θ agree. Because n was arbitrary,
we find that θ⋆ and θ are equivalent.
What remains to do is thus to prove (6.16). To that end, put U_{k,m}(θ) =
log p_θ(Y_{1:n} | Y_{−m+1:−k}, X_{−m} = x) and U(θ) = log p_θ(Y_{1:n}). Obviously, it is enough
to show that sup_{m≥k} |U_{k,m}(θ) − U(θ)| tends to zero P_{θ⋆}-a.s. as k → ∞ and that
this supremum remains dominated in L¹(P_{θ⋆}). To do that we write
p_θ(Y_{1:n} | Y_{−m+1:−k}, X_{−m} = x) = ∬ p_θ(Y_{1:n} | X_0 = x_0) Q^k_θ(x_{−k}, dx_0) P_θ(X_{−k} ∈ dx_{−k} | Y_{−m+1:−k}, X_{−m} = x) ,
which implies that p_θ(Y_{1:n} | Y_{−m+1:−k}, X_{−m} = x) and p_θ(Y_{1:n}) both obey the same
lower bound. Combined with the observation b⁻(Y_i) > 0 P_{θ⋆}-a.s., which follows from
Assumption 110, and the bound |log x − log y| ≤ |x − y|/(x ∧ y), (6.18) shows that
lim_{k→∞} sup_{m≥k} |U_{k,m}(θ) − U(θ)| = 0   P_{θ⋆}-a.s.
Using the aforementioned bounds, we conclude that this expectation is indeed finite.
We remark that the basic structure of the proof is potentially applicable also to
models other than HMMs. Indeed, using the notation of the proof, we may define
ℓ as ℓ(θ) = lim_{m→∞} E_{θ⋆}[log p_θ(Y_1 | Y_{−m:0})], a definition that does not exploit the
HMM structure. Then the first part of the proof, up to (6.16), does not use the
HMM structure either, so that all that is needed, in a more general framework, is
to verify (6.16) (or, more precisely, a version thereof not containing X_{−m}). For
other particular processes, this could presumably be carried out using, for instance,
suitable mixing properties.
The above theorem shows that the points of global maxima of ℓ, forming the
set of possible limit points of the MLE, are those that are statistically equivalent
to θ⋆. This result, although natural and important (but not trivial!), is still
of a somewhat “high level” character, that is, not verifiable in terms of “low level”
conditions. We would like to provide some conditions, expressed directly in terms of
the Markov chain and the conditional distributions g_θ(x, y), that give information
about parameters that are equivalent to θ⋆ and, in particular, about when there is no
other such parameter than θ⋆. We will do this using the framework of mixtures of
distributions.
In other words, the class of all mixtures of (f_φ) is identifiable if the two distributions
with densities f_π and f_{π′}, respectively, agree only when π = π′. Yet another
way to put this property is to say that identifiability means that the mapping
π ↦ f_π is one-to-one (injective). A slightly Bayesian way of thinking of a mixture
distribution, which is often intuitive and fruitful, is the following (see the sketch
below): draw φ ∈ Φ with distribution π and then draw Y from the density f_φ.
Then Y has density f_π.
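This two-stage description translates directly into a sampling routine. The sketch below assumes a finite mixing distribution π and Gaussian components f_φ; both choices are illustrative.

```python
# Two-stage draw from a mixture: first phi ~ pi, then Y ~ f_phi.
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.3, 0.7])         # mixing distribution on Phi = {phi_1, phi_2}
phis = np.array([-2.0, 2.0])      # component parameters (here: means)

def draw_from_mixture(n):
    comp = rng.choice(len(pi), size=n, p=pi)   # first draw phi ~ pi ...
    return rng.normal(phis[comp], 1.0)         # ... then Y ~ f_phi

sample = draw_from_mixture(10_000)             # sample then has density f_pi
```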
Many important and commonly used parametric classes of densities are identi-
fiable. We mention the following examples.
(i) The Poisson family (Feller, 1943). In this case, Y = Z₊, Φ = R₊, φ is the mean
of the Poisson distribution, µ is counting measure, and f_φ(y) = φ^y e^{−φ}/y!.
(ii) The Gamma family (Teicher, 1961), with the mixture being either on the scale
parameter (with a fixed form parameter) or on the form parameter (with a
fixed scale parameter). The class of joint mixtures over both parameters is not
identifiable however, but the class of joint finite mixtures is identifiable.
(iii) The normal family (Teicher, 1960), with the mixture being either on the mean
(with fixed variance) or on the variance (with fixed mean). The class of joint
mixtures over both mean and variance is not identifiable however, but the class
of joint finite mixtures is identifiable.
(iv) The Binomial family Bin(N, p) (Teicher, 1963), with the mixture being on the
probability p. The class of finite mixtures is identifiable, provided the number
of components k of the mixture satisfies 2k − 1 ≤ N .
Theorem 120 (Teicher, 1967). Assume that the class of all mixtures of the family
(f_φ) of densities on Y with parameter φ ∈ Φ is identifiable. Then the class of all
mixtures of the n-fold product densities f^{(n)}_φ(y) = f_{φ_1}(y_1) ··· f_{φ_n}(y_n) on y ∈ Yⁿ
with parameter φ ∈ Φⁿ is identifiable. The same conclusion holds true when “all
mixtures” is replaced by “finite mixtures”.
consider a model on the state space X = {0, 1, 2} with Y_k | X_k = i ∼ N(µ_i, σ²), the
constraints µ_0 = µ_1 < µ_2, and transition probability matrix
Q = [ q_{00} q_{01} 0 ; q_{10} q_{11} q_{12} ; 0 q_{21} q_{22} ] .
The Markov chain {Xk } is thus a (discrete-time) birth-and-death process in the
sense that it can change its state index by at most one in each step. This model
is similar to models used in modeling ion channel dynamics (cf. Fredkin and Rice,
1992). Because µ1 < µ2 , we could then think of states 0 and 1 as “closed” and of
state 2 as “open”.
Now assume that θ is equivalent to θ⋆. Just as in Example 121, we may then
conclude that the law of {µ⋆_{X_k}} under P_{θ⋆} and that of {µ_{X_k}} under P_θ agree and
hence, because of the constraints on the µs, that the laws of {1(X_k ∈ {0, 1}) +
2·1(X_k = 2)} under P_{θ⋆} and P_θ agree. In other words, after lumping states 0 and
1 of the Markov chain, we obtain processes with identical laws. This in particular
implies that the distributions under P_{θ⋆} and P_θ of the sojourn times in the state
aggregate {0, 1} coincide. The probability of such a sojourn having length 1 is q_{12},
whence q_{12} = q_{⋆12} must hold. For length 2, the corresponding probability is q_{11}q_{12},
whence q_{11} = q_{⋆11} follows, and then also q_{10} = q_{⋆10} as rows of Q sum to unity.
For length 3, the probability is q_{11}²q_{12} + q_{10}q_{01}q_{12}, so that finally q_{01} = q_{⋆01} and
q_{00} = q_{⋆00} (the computation is sketched in code below). We may thus conclude
that θ = θ⋆, that is, the model is identifiable. The reason that identifiability holds
despite the means µ_i being non-distinct is the special structure of Q. For further
reading on identifiability of lumped Markov chains, see Ito et al. (1992).
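The sojourn-time probabilities used in this argument can be computed mechanically from Q by powers of the sub-matrix of transitions within the aggregate {0, 1}. The following sketch uses an arbitrary numerical Q respecting the birth-and-death structure and reproduces the three probabilities quoted above.

```python
# Probability that a sojourn in the aggregate {0, 1} (always entered at
# state 1, since the only entry from state 2 is via q21) lasts exactly l steps.
import numpy as np

def sojourn_probs(Q, lmax):
    A = Q[:2, :2]                    # transitions within the aggregate {0, 1}
    exit_ = Q[:2, 2]                 # exit probabilities into state 2
    start = np.array([0.0, 1.0])     # a sojourn always starts in state 1
    return [start @ np.linalg.matrix_power(A, l - 1) @ exit_
            for l in range(1, lmax + 1)]

Q = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])      # arbitrary matrix with the stated structure
print(sojourn_probs(Q, 3))           # [q12, q11*q12, q11^2*q12 + q10*q01*q12]
```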
To make sure that this gradient indeed exists and is well-behaved enough for our
purposes, we make the following assumptions.
Assumption 124. There exists an open neighborhood U = {θ : |θ − θ⋆| < δ} of θ⋆
such that the following hold.
(i) For all (x, x′) ∈ X × X and all y ∈ Y, the functions θ ↦ q_θ(x, x′) and θ ↦
g_θ(x, y) are twice continuously differentiable on U.
(ii) sup_{θ∈U} sup_{x,x′} ‖∇_θ log q_θ(x, x′)‖ < ∞ and sup_{θ∈U} sup_{x,x′} ‖∇²_θ log q_θ(x, x′)‖ < ∞.
(iii) E_{θ⋆}[ sup_{θ∈U} sup_x ‖∇_θ log g_θ(x, Y_1)‖² ] < ∞ and E_{θ⋆}[ sup_{θ∈U} sup_x ‖∇²_θ log g_θ(x, Y_1)‖ ] < ∞.
(iv) For µ-almost all y ∈ Y, there exists a function f_y : X → R₊ in L¹(λ) such that
sup_{θ∈U} g_θ(x, y) ≤ f_y(x).
(v) For λ-almost all x ∈ X, there exist functions f¹_x : Y → R₊ and f²_x : Y → R₊ in
L¹(µ) such that ‖∇_θ g_θ(x, y)‖ ≤ f¹_x(y) and ‖∇²_θ g_θ(x, y)‖ ≤ f²_x(y) for all θ ∈ U.
Note that ḣ_{k,0,x}(θ) is the gradient with respect to θ of the conditional log-likelihood
h_{k,0,x}(θ) as defined in (6.7). It is a matter of straightforward algebra to check that
(6.20) and (6.21) agree.
with the aim, just as before, of letting m → ∞. This will yield a definition of ḣ_{k,∞}(θ);
the dependence on x will vanish in the limit. Note, however, that the construction
below does not show that this quantity is in fact the gradient of h_{k,∞}(θ), although
one can indeed prove that this is the case.
As noted in Section 6.1, we want to prove a central limit theorem (CLT) for the
score function evaluated at the true parameter. A quite general way to do that is to
recognize that the corresponding score increments form, under reasonable assump-
tions, a martingale increment sequence with respect to the filtration generated by
the observations. This sequence is not stationary though, so one must either use a
general martingale CLT or first approximate the sequence by a stationary martin-
gale increment sequence. We will take the latter approach, and our approximating
sequence is nothing but {ḣ_{k,∞}(θ⋆)}.
We now proceed to the construction of ḣk,∞ (θ). First write ḣk,m,x (θ) as
The following result shows that it makes sense to take the limit as m → ∞ in the
previous display.
Proposition 125. Assume 108, 109, and 124 hold. Then for any integers 1 ≤ i ≤
k, the sequence {E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k}, X_{−m} = x]}_{m≥0} converges P_{θ⋆}-a.s.
and in L²(P_{θ⋆}), uniformly with respect to θ ∈ U and x ∈ X, as m → ∞. The limit
does not depend on x.
Proof. The proof is entirely similar to that of Proposition 112. For any (x, x′) ∈
X × X and non-negative integers m′ ≥ m,
where the inequality stems from (6.5). Setting x = x′ in this display shows that
{E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k}, X_{−m} = x]}_{m≥0} is a Cauchy sequence, thus converging
P_{θ⋆}-a.s. The inequality also shows that the limit does not depend on x. Moreover,
because for any non-negative integer m, x ∈ X and θ ∈ U,
with the right-hand side belonging to L²(P_{θ⋆}). The inequality (6.23) thus also
shows that {E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k}, X_{−m} = x]}_{m≥0} is a Cauchy sequence in
L²(P_{θ⋆}) and hence converges in L²(P_{θ⋆}).
With the sums arranged as in (6.22), we can let m → ∞ and define, for k ≥ 1,
The following result gives an L²-bound on the difference between ḣ_{k,m,x}(θ) and
ḣ_{k,∞}(θ).
Lemma 126. Assume 108, 109, 110, and 124 hold. Then for k ≥ 1,
Proof. The idea of the proof is to match, for each index i of the sums expressing
ḣ_{k,m,x}(θ) and ḣ_{k,∞}(θ), pairs of terms that are close. To be more precise, we match
1. The first terms of ḣ_{k,m,x}(θ) and ḣ_{k,∞}(θ);
2. For i close to k, E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k}, X_{−m} = x]
and
E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−∞:k}] ,
and similarly for the corresponding terms conditioned on Y_{−m+1:k−1} and
Y_{−∞:k−1}, respectively;
3. For i far from k, E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k}, X_{−m} = x]
and
E_θ[φ_θ(X_{i−1}, X_i, Y_i) | Y_{−m+1:k−1}, X_{−m} = x] ,
and similarly for the corresponding terms conditioned on Y_{−∞:k} and Y_{−∞:k−1},
respectively.
We start with the second kind of matches (of which the first terms are a special
case). Taking the limit m′ → ∞ in (6.23), we see that
The proof of this bound is similar to that of Proposition 61 and uses the time-reversed
process. We postpone the proof to the end of this section. We may also let
m → ∞ and omit the conditioning on X_{−m} without affecting the bound. As a result
of these bounds, we have
with the same bound being valid if the conditioning is on Y_{−∞:k} and Y_{−∞:k−1},
respectively. This bound is small if i is far away from k.
Combining these two kinds of bounds and using Minkowski's inequality for the
L²-norm, we find that (E_θ‖ḣ_{k,m,x}(θ) − ḣ_{k,∞}(θ)‖²)^{1/2} is bounded by
2ρ^{k+m−1} + 2 × 2 Σ_{i=−m+1}^{k−1} (ρ^{k−i−1} ∧ ρ^{i+m−1}) + 2 Σ_{i=−∞}^{−m} ρ^{k−i−1}
≤ 4 ρ^{k+m−1}/(1 − ρ) + 4 Σ_{−∞<i≤(k−m)/2} ρ^{k−i−1} + 4 Σ_{(k−m)/2≤i<∞} ρ^{i+m−1}
≤ 12 ρ^{(k+m)/2−1}/(1 − ρ)
up to the factor (E_θ sup_{x,x′∈X} ‖φ_θ(x, x′, Y_i)‖²)^{1/2}. The proof is complete.
Proposition 127. Assume 108, 109, and 110 hold. Then for any integers i, k,
and m such that m ≥ 0 and −m < i < k, any x_{−m} ∈ X, y_{−m+1:k} ∈ Y^{k+m}, and
θ ∈ U,
Proof. The cornerstone of the proof is the observation that, conditional on Y_{−m+1:k}
and X_{−m}, the time-reversed process X with indices from k down to −m is a non-homogeneous
Markov chain satisfying a uniform mixing condition. We shall indeed
use a slight variant of the backward decomposition developed in Section 2.3.2. For
any j = −m+1, ..., k−1, we thus define the backward kernel (cf. (2.39)) by
for any f ∈ F_b(X). For brevity, we do not indicate the dependence of the quantities
involved on θ. We note that the integral of the denominator of this display is
bounded from below by (σ⁻)^{m+j} ∏_{u=−m+1}^{j} ∫ g_θ(x_u, y_u) λ(dx_u), and is hence positive
P_{θ⋆}-a.s. under Assumption 110.
It is trivial that for any x ∈ X,
∫···∫ ∏_{u=−m+1}^{j} q(x_{u−1}, x_u) g(x_u, y_u) λ(dx_u) f(x_j) q(x_j, x)
= ∫···∫ ∏_{u=−m+1}^{j} q(x_{u−1}, x_u) g(x_u, y_u) λ(dx_u) q(x_j, x) B_{x_{−m},j}[y_{−m+1:j}](x, f) ,
(σ⁻/σ⁺) ν_{x_{−m},j}[y_{−m+1:j}] ≤ B_{x_{−m},j}[y_{−m+1:j}](x, ·) ≤ (σ⁺/σ⁻) ν_{x_{−m},j}[y_{−m+1:j}] ,
Thus Lemma 51 shows that the Dobrushin coefficient of each backward kernel is
bounded by ρ = 1 − σ⁻/σ⁺.
Finally
and
so that the two distributions on the left-hand sides can be considered as the result
of running the above-described reversed conditional Markov chain from index k − 1
down to index i, using two different initial conditions. Therefore, by Proposition 48,
they differ by at most ρk−1−i in total variation distance. The proof is complete.
It is also immediate that ḣ_{k,∞}(θ⋆) is F_k-measurable. Hence the sequence {ḣ_{k,∞}(θ⋆)}_{k≥0}
is a P_{θ⋆}-martingale increment sequence with respect to the filtration {F_k}_{k≥0} in
L²(P_{θ⋆}). Moreover, this sequence is stationary because {Y_k}_{−∞<k<∞} is. Any stationary
martingale increment sequence in L²(P_{θ⋆}) satisfies a CLT (Durrett, 1996,
p. 418), that is, n^{−1/2} Σ_{k=0}^{n} ḣ_{k,∞}(θ⋆) → N(0, J(θ⋆)) P_{θ⋆}-weakly, where
J(θ⋆) ≝ E_{θ⋆}[ḣ_{1,∞}(θ⋆) ḣ^t_{1,∞}(θ⋆)]   (6.25)
for all x_0 ∈ X, where J(θ⋆) is the limiting Fisher information as defined above.
We remark that above, we have normalized sums with indices from 0 to n, that
is, with n + 1 terms, by n^{1/2} rather than by (n + 1)^{1/2}. This of course does not
affect the asymptotics. However, if J(θ⋆) is estimated for the purpose of making a
confidence interval, for instance, then one may well normalize using the number
n + 1 of observed data.
lim_{δ→0} lim sup_{n→∞} sup_{|θ−θ⋆|≤δ} ‖(−n⁻¹ ∇²_θ ℓ_{x_0,n}(θ)) − J(θ⋆)‖ = 0   P_{θ⋆}-a.s.
for all x_0 ∈ X.
Theorem 130. Assume 108, 109, 110, 114, and 124, and that θ⋆ is identifiable,
that is, θ is equivalent to θ⋆ only if θ = θ⋆ (possibly up to a permutation of states if
X is finite). Then the following hold true.
(i) The MLE θ̂_n = θ̂_{x_0,n} is strongly consistent: θ̂_n → θ⋆ P_{θ⋆}-a.s. as n → ∞.
(ii) If the Fisher information matrix J(θ⋆) defined above is non-singular and θ⋆ is
an interior point of Θ, then the MLE is asymptotically normal:
n^{1/2}(θ̂_n − θ⋆) → N(0, J(θ⋆)⁻¹)   P_{θ⋆}-weakly
for all x_0 ∈ X.
(iii) The normalized observed information at the MLE is a strongly consistent estimator
of J(θ⋆): −n⁻¹ ∇²_θ ℓ_{x_0,n}(θ̂_n) → J(θ⋆) P_{θ⋆}-a.s. for all x_0 ∈ X.
As indicated above, the MLE θ̂_n depends on the initial state x_0, but that dependence
will generally not be included in the notation.
The last part of the result is important, as it says that confidence intervals
or regions and hypothesis tests based on the estimate −(n + 1)⁻¹ ∇²_θ ℓ_{x_0,n}(θ̂_n) of
J(θ⋆) will asymptotically be of correct size. In general, there is no closed-form
expression for J(θ⋆), so that it needs to be estimated in one way or another.
The observed information is obviously one way to do that, while another one
is to simulate data Y*_{1:N} from the HMM, using the MLE, and then to compute
−(N + 1)⁻¹ ∇²_θ ℓ_{x_0,N}(θ̂_n) for this set of simulated data and some x_0. An advantage
of this approach is that N can be chosen arbitrarily large. Yet another
approach, motivated by (6.25), is to estimate the Fisher information by the empirical
covariance matrix of the conditional scores of (6.19) at the MLE, that
is, by (n + 1)⁻¹ Σ_{k=0}^{n} [S_{k|k−1}(θ̂_n) − S̄(θ̂_n)][S_{k|k−1}(θ̂_n) − S̄(θ̂_n)]^t with S_{k|k−1}(θ) =
∇_θ log ∫ g_θ(x, Y_k) φ_{x_0,k|k−1}[Y_{0:k−1}](dx ; θ) and S̄(θ) = (n + 1)⁻¹ Σ_{k=0}^{n} S_{k|k−1}(θ). This
estimate can of course also be computed from simulated data, then using an arbitrary
sample size. The conditional scores may be computed as S_{k|k−1}(θ) =
∇_θ ℓ_{x_0,k}(θ) − ∇_θ ℓ_{x_0,k−1}(θ), where the scores are computed using any of the methods
of Section 5.2.3.
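As a sketch under stated assumptions: if a routine evaluating θ ↦ ℓ_{x_0,k}(θ) is available (below, the hypothetical user-supplied function loglik), the conditional scores can be approximated as successive differences of finite-difference score values and their empirical covariance formed directly. This is an illustration, not a prescription from the text.

```python
# Empirical-covariance estimate of the Fisher information from conditional
# scores S_{k|k-1}; `loglik(theta, y)` is a hypothetical routine returning
# l_{x0,k}(theta) for the record y of length k + 1.
import numpy as np

def fisher_from_conditional_scores(loglik, theta_hat, y, h=1e-5):
    d, n1 = len(theta_hat), len(y)             # n1 = n + 1 observations
    def score(k):                              # finite-difference grad of l_{x0,k}
        g = np.zeros(d)
        for i in range(d):
            e = np.zeros(d); e[i] = h
            g[i] = (loglik(theta_hat + e, y[:k + 1])
                    - loglik(theta_hat - e, y[:k + 1])) / (2.0 * h)
        return g
    scores = np.vstack([score(k) for k in range(n1)])
    S = np.diff(np.vstack([np.zeros(d), scores]), axis=0)  # S_{k|k-1} increments
    S_centered = S - S.mean(axis=0)
    return S_centered.T @ S_centered / n1
```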
Example 131 (Normal HMM). A slightly more involved example concerns the
Gaussian hidden Markov model with finite state space {1, 2, ..., r} and conditional
distributions Y_k | X_k = i ∼ N(µ_i, σ_i²). Suppose that we want to test for equality of
all of the r component-wise conditional variances: σ_1² = σ_2² = ... = σ_r². Then
the R-functions are, for instance, σ_i² − σ_r² for i = 1, 2, ..., r − 1. The parameter γ is
obtained by removing from θ all σ_i² and then adding a common conditional variance
σ²; those b-functions referring to any of the σ_i² evaluate to σ². The matrices C and
D are again constant and of full rank.
Another one is the Wald test, which uses the test statistic
where R(θ) is the s × 1 vector of R-functions evaluated at θ, and J_n(θ) = −n⁻¹ ∇²_θ ℓ_{x_0,n}(θ)
is the observed information evaluated at θ. Yet another test is based on the Rao
statistic, defined as
V_n = n⁻¹ S_n(θ̂⁰_n) J_n(θ̂⁰_n)⁻¹ S_n(θ̂⁰_n)^t ,
where θ̂⁰_n is the MLE over Θ_0, that is, the point where ℓ_{x_0,n}(θ) is maximized subject
to the constraints R_i(θ) = 0, 1 ≤ i ≤ s, and S_n(θ) = ∇_θ ℓ_{x_0,n}(θ) is the score function
at θ. This test is also known under the names efficient score test and Lagrange
multiplier test. The Wald and Rao test statistics are usually defined using the true
Fisher information J(θ) rather than the observed one, but as J(θ) is generally
infeasible to compute for HMMs, we replace it by its observed counterpart.
Statistical theory for i.i.d. data suggests that the likelihood ratio, Wald, and
Rao test statistics should all converge weakly to a χ² distribution with s degrees of
freedom provided θ⋆ ∈ Θ_0 holds true, so that an approximate p-value of the test of
this null hypothesis can be computed by evaluating the complementary distribution
function of the χ²_s distribution at the point λ_n, W_n, or V_n, whichever is preferred.
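Computing this approximate p-value is a one-liner; the sketch below assumes the test statistic (λ_n, W_n, or V_n) and the number s of constraints have already been computed.

```python
# Approximate p-value: complementary chi^2_s distribution function at the
# observed statistic.
from scipy.stats import chi2

def approx_p_value(stat, s):
    return chi2.sf(stat, df=s)

print(approx_p_value(7.81, 3))   # ~0.05: borderline at the 5% level for s = 3
```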
We now state formally that this procedure is indeed correct.
Theorem 132. Assume 108, 109, 110, 114, and 124 as well as the conditions
stated on the functions R_i and b_i above. Also assume that θ⋆ is identifiable, that is,
θ is equivalent to θ⋆ only if θ = θ⋆ (possibly up to a permutation of states if X is
finite), that J(θ⋆) is non-singular, and that θ⋆ and γ⋆ are interior points of Θ and
Γ, respectively. Then if θ⋆ ∈ Θ_0 holds true, each of the test statistics λ_n, W_n, and
V_n converges P_{θ⋆}-weakly to the χ²_s distribution as n → ∞.
The proof of this result follows, for instance, Serfling (1980, Section 4.4). The im-
portant observation is that the validity of the proof does not hinge on independence
of the data but on asymptotic properties of the score function and the observed
information, properties that have been established for HMMs in this chapter.
It is important to realize that a key assumption for Theorem 132 to hold is that
θ⋆ is identifiable, so that θ̂_n converges to a unique point θ⋆. As a result, the theorem
does not apply to the problem of testing the number of components of a finite state
HMM. In the normal HMM for instance, with Y_k | X_k = i ∼ N(µ_i, σ_i²), one can
indeed effectively remove one component by invoking the constraints µ_1 − µ_2 = 0 and
σ_1² − σ_2² = 0, say. In this way, within Θ_0, components 1 and 2 collapse into a single
one. However, any θ ∈ Θ_0 is then non-identifiable, as the transition probabilities q_{12}
and q_{21}, among others, can be chosen arbitrarily without changing the dynamics of
the model.
Part III
Background and Complements
Chapter 7
Elements of Markov Chain Theory
7.1 Chains on Countable State Spaces
7.1.1 Irreducibility
Let {Xk }k≥0 be a Markov chain on a countable state space X with transition matrix
Q. For any x ∈ X, we define the first hitting time σx on x and the return time τx
to x respectively as
σx = inf{n ≥ 0 : Xn = x} , (7.1)
τx = inf{n ≥ 1 : Xn = x} , (7.2)
where, by convention, inf ∅ = +∞. The successive hitting times σ_x^{(n)} and return
times τ_x^{(n)}, n ≥ 0, are defined inductively by
σ_x^{(0)} = 0 , σ_x^{(1)} = σ_x , σ_x^{(n+1)} = inf{k > σ_x^{(n)} : X_k = x} ,
τ_x^{(0)} = 0 , τ_x^{(1)} = τ_x , τ_x^{(n+1)} = inf{k > τ_x^{(n)} : X_k = x} .
For two states x and y, we say that state x leads to state y, which we write
x → y, if Px (σy < ∞) > 0. In words, x leads to y if the state y can be reached
from x. An alternative, equivalent definition is that there exists some integer n ≥ 0
such that the n-step transition probability Q^n(x, y) > 0. If both x leads to y and y
leads to x, then we say that x and y communicate, which we write x ↔ y.
Proof. We need to prove that the relation ↔ is reflexive, symmetric, and transitive.
The first two properties are immediate because, by definition, for all x, y ∈ X, x ↔ x
(reflexivity), and x ↔ y if and only if y ↔ x (symmetry).
For any pairwise distinct x, y, z ∈ X, {σ_y + σ_z ∘ θ_{σ_y} < ∞} ⊂ {σ_z < ∞} (if the
chain reaches y at some time and later z, it certainly reaches z). The strong Markov
property implies that
P_x(σ_z < ∞) ≥ P_x(σ_y + σ_z ∘ θ_{σ_y} < ∞) = E_x[1_{σ_y<∞} 1_{σ_z<∞} ∘ θ_{σ_y}]
= E_x[1_{σ_y<∞} P_{X_{σ_y}}(σ_z < ∞)] = P_x(σ_y < ∞) P_y(σ_z < ∞) .
In words, if the chain can reach y from x and z from y, it can reach z from x by
going through y. Hence if x → y and y → z, then x → z (transitivity).
For x ∈ X, we denote the equivalence class of x with respect to the relation “↔”
by C(x). Because “↔” is an equivalence relation, there exists a collection {xi } of
states, which may be finite or infinite, such that the classes {C(xi )} form a partition
of the state space X.
Definition 134 (Irreducibility). If C(x) = X for some x ∈ X (and then for all
x ∈ X), the Markov chain is called irreducible.
If the expected number of visits to x starting from x is finite, that is, if E_x[η_x] < ∞,
where η_x = Σ_{n=0}^{∞} 1{X_n = x} denotes the number of visits to x, then the state x
is called transient. Otherwise, if E_x[η_x] = ∞, x is said to be recurrent. When X is
countable, the recurrence or transience of a state x can be expressed in terms of the
probability P_x(τ_x < ∞) that the chain started in x ever returns to x.
Proposition 135. For any x ∈ X, the following hold true:
(i) If x is recurrent, then Px (ηx = ∞) = 1 and Px (τx < ∞) = 1.
(ii) If x is transient, then Px (ηx < ∞) = 1 and Px (τx < ∞) < 1.
(iii) Ex [ηx ] = 1/[1 − Px (τx < ∞)], with 1/0 = ∞.
Proof. By construction,
E_x[η_x] = Σ_{k=1}^{∞} P_x(η_x ≥ k) = Σ_{k=1}^{∞} P_x(σ_x^{(k)} < ∞) .
If σ_x^{(n−1)} < ∞, then X_{σ_x^{(n−1)}} = x P_x-a.s., so that
Thus y is also recurrent by Corollary 137. Because x is recurrent, the strong Markov
property implies that
For a recurrent state x, the equivalence class C(x) (with respect to the relation
of communication defined in Section 7.1.1) may thus be equivalently defined as
If y ∉ C(x), then P_x(η_y = 0) = 1, which implies that P_x(X_n ∈ C(x) for all
n ≥ 0) = 1. In words, the chain started from the recurrent state x forever stays in
C(x) and visits each state of C(x) infinitely many times.
The behavior of a Markov chain can thus be described as follows. If a chain is
not irreducible, there may exist several equivalence classes of communication. Some
of them contain only transient states, and some contain only recurrent states. The
latter are then called recurrence classes. If a chain starts from a recurrent state,
then it remains in its recurrence class forever. If it starts from a transient state, then
either it stays in the class of transient states forever, which implies that there exist
infinitely many transient states, or it reaches a recurrent state and then remains in
its recurrence class forever.
In contrast, if the chain is irreducible, then all the states are either transient or
recurrent. This is called the solidarity property of an irreducible chain. We now
summarize the previous results.
(ii) Px (τx < ∞) < 1, Ex [ηy ] < ∞ and the chain is transient.
Remark 140. Note that in the transient case, we do not necessarily have Px (τy <
∞) < 1 for all x and y in X. For instance, if Q is a transition matrix on N such
that Q(n, n + 1) = 1 for all n, then Pk (τn < ∞) = 1 for all k < n. Nevertheless all
states are obviously transient because Xn = X0 + n.
measure π. Moreover 0 < π(x) < ∞ for all x ∈ X. This measure is summable if
and only if there exists a state x such that
Ex [τx ] < ∞ . (7.5)
In this case, Ey [τy ] < ∞ for all y ∈ X and the unique invariant probability measure
is given by
π(x) = 1/ Ex [τx ] , x∈X. (7.6)
Proof. Let Q be the transition matrix of the chain. Pick an arbitrary state x ∈ X
and define the measure λx by
"τ −1 # "τ #
x x
1y (Xk ) = Ex 1y (Xk ) .
X X
λx (y) = Ex (7.7)
k=0 k=1
That is, λ_x(y) is the expected number of visits to the state y before the first return
to x, given that the chain starts in x. Let f be a non-negative function on X. Then
λ_x(f) = E_x[ Σ_{k=0}^{τ_x−1} f(X_k) ] = Σ_{k=0}^{∞} E_x[1_{τ_x>k} f(X_k)] .
Using this identity and the fact that Qf(X_k) = E_x[f(X_{k+1}) | F_k^X] P_x-a.s. for all
k ≥ 0, we find that
λ_x(Qf) = Σ_{k=0}^{∞} E_x[1_{τ_x>k} Qf(X_k)] = Σ_{k=0}^{∞} E_x{1_{τ_x>k} E_x[f(X_{k+1}) | F_k^X]}
= Σ_{k=0}^{∞} E_x[1_{τ_x>k} f(X_{k+1})] = E_x[ Σ_{k=1}^{τ_x} f(X_k) ] = λ_x(f) ,
showing that λ_x is invariant. We will now show that π = λ_x. The proof is by
contradiction. Assume that π(z) > λ_x(z) for some z ∈ X. Then
1 = π(x) = πQ(x) = Σ_{z∈X} π(z)Q(z, x) > Σ_{z∈X} λ_x(z)Q(z, x) = λ_x(x) = 1 ,
a contradiction.
Thus the unique invariant measure is summable if and only if a state x satisfying
this relation exists. On the other hand, if such a state x exists then, by uniqueness
of the invariant measure, E_y[τ_y] < ∞ must hold for all states y. In this case,
the invariant probability measure, π say, satisfies π(x) = λ_x(x)/λ_x(X) = 1/E_x[τ_x].
Because the reference state x was in fact arbitrary, we find that π(y) = 1/E_y[τ_y]
for all states y.
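The identity π(y) = 1/E_y[τ_y] is easy to check numerically for a small irreducible chain. The following sketch (with an arbitrary transition matrix) compares the left Perron eigenvector of Q with simulated mean return times.

```python
# Check pi(x) = 1 / E_x[tau_x]: invariant vector via eigendecomposition
# versus Monte Carlo mean return times.
import numpy as np

rng = np.random.default_rng(2)
Q = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmax(np.real(w))])   # eigenvector for eigenvalue 1
pi /= pi.sum()

def mean_return_time(x, n_rep=5000):
    total = 0
    for _ in range(n_rep):
        state, steps = x, 0
        while True:
            state = rng.choice(3, p=Q[state])
            steps += 1
            if state == x:
                break
        total += steps
    return total / n_rep

for x in range(3):
    print(x, pi[x], 1.0 / mean_return_time(x))   # the two columns agree
```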
It is natural to ask what can be inferred from the knowledge that a chain pos-
sesses an invariant probability measure. The next proposition gives a partial answer.
Proposition 142. Let Q be a transition matrix and π an invariant probability
measure. Then every state x such that π(x) > 0 is recurrent. If Q is irreducible,
then it is recurrent.
Proof. Let y ∈ X. If π(y) > 0, then Σ_{n=0}^{∞} πQ^n(y) = Σ_{n=0}^{∞} π(y) = ∞. On the other
hand, by Proposition 136,
Σ_{n=0}^{∞} πQ^n(y) = Σ_{x∈X} π(x) Σ_{n=0}^{∞} Q^n(x, y)
= Σ_{x∈X} π(x) E_x[η_y] ≤ E_y[η_y] Σ_{x∈X} π(x) = E_y[η_y] .   (7.9)
Hence E_y[η_y] = ∞, and y is recurrent.
7.1.4 Ergodicity
A key result for positive recurrent irreducible chains is that the transition laws
converge, in a suitable sense, to the invariant vector π. The classical result is the
following.
Proposition 143. Consider an irreducible and positive recurrent Markov chain on
a countable state space. Then for any states x and y,
n⁻¹ Σ_{i=1}^{n} Q^i(x, y) → π(y) as n → ∞ .   (7.10)
The use of the Césaro limit can be avoided if the chain is aperiodic. The simplest
definition of aperiodicity is that a state x is aperiodic if Qk (x, x) > 0 for all k
sufficiently large or, equivalently, that the period of the state x is one. The period of
x is defined as the greatest common divisor of the set I(x) = {n > 0 : Qn (x, x) > 0}.
For irreducible chains, the following result holds true.
Proposition 144. If the chain is irreducible, then all states have the same period.
If the transition matrix Q is irreducible and aperiodic, then for all x and y in X,
there exists n(x, y) ∈ N such that Qk (x, y) > 0 for all k ≥ n(x, y).
Thus, an irreducible chain can be said to be aperiodic if the common period of
all states is one.
The traditional pointwise convergence (7.10) of transition probabilities has been
replaced in more recent research by convergence in total variation (see Defini-
tion 39). The convergence result may then be formulated as follows.
Theorem 145. Consider an irreducible and aperiodic positive recurrent Markov
chain on a countable state space X with transition matrix Q and invariant probability
distribution π. Then for all initial distributions ξ and ξ′ on X,
‖ξQⁿ − ξ′Qⁿ‖_TV → 0 as n → ∞ .
The proof of this result, and indeed the focus on convergence in total variation,
follows using the coupling technique. We postpone the presentation of this technique
to Section 7.2.4 because essentially the same ideas can be applied to Markov
chains on general state spaces.
σA = inf{n ≥ 0 : Xn ∈ A} , (7.13)
τA = inf{n ≥ 1 : Xn ∈ A} , (7.14)
where, by convention, inf ∅ = +∞. The successive hitting times σ_A^{(n)} and return
times τ_A^{(n)}, n ≥ 0, are defined inductively by
σ_A^{(0)} = 0 , σ_A^{(1)} = σ_A , σ_A^{(n+1)} = inf{k > σ_A^{(n)} : X_k ∈ A} ,
τ_A^{(0)} = 0 , τ_A^{(1)} = τ_A , τ_A^{(n+1)} = inf{k > τ_A^{(n)} : X_k ∈ A} .
7.2.1 Irreducibility
The first step to develop a theory on general state spaces is to define a suitable
concept of irreducibility. The definition of irreducibility adopted for countable state
spaces does not extend to general ones, as the probability of reaching a single point
x in the state space is typically zero.
Proof. Let φ be an irreducibility measure and ε ∈ (0, 1). Let φ_ε be the measure
defined by φ_ε = φK_ε, where K_ε is the resolvent kernel defined by
K_ε(x, A) ≝ (1 − ε) Σ_{k≥0} ε^k Q^k(x, A) , x ∈ X, A ∈ X .   (7.17)
Example 149. For simplicity, we assume here that X = R^d, which we equip with
the Borel σ-field X = B(R^d). Assume that we are given a probability density
function π on X with respect to Lebesgue measure λ^Leb. Let r be a transition density
kernel. Starting from X_n = x, a candidate transition x′ is generated from r(x, ·)
and accepted with probability
α(x, x′) = [π(x′) r(x′, x)] / [π(x) r(x, x′)] ∧ 1 .   (7.20)
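One transition of this algorithm can be written down directly. In the sketch below, the target density target, the proposal sampler r_sample, and the proposal density r_density are user-supplied assumptions; only the acceptance rule (7.20) is fixed.

```python
# One Metropolis-Hastings step implementing the acceptance probability (7.20).
import numpy as np

rng = np.random.default_rng(3)

def mh_step(x, target, r_sample, r_density):
    x_prop = r_sample(x)                    # candidate from r(x, .)
    ratio = (target(x_prop) * r_density(x_prop, x)) \
            / (target(x) * r_density(x, x_prop))
    return x_prop if rng.uniform() < min(ratio, 1.0) else x

# Example: Gaussian random walk proposal targeting a standard normal.
target = lambda x: np.exp(-0.5 * x ** 2)
r_sample = lambda x: x + rng.normal(scale=0.5)
r_density = lambda x, xp: np.exp(-0.5 * ((xp - x) / 0.5) ** 2)
x = 0.0
for _ in range(1000):
    x = mh_step(x, target, r_sample, r_density)
```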
hit A more times, on average, than if it is started at “the most favorable location” in
A. Thus an alternative definition of a uniformly transient set is supx∈X Ex [ηA ] < ∞.
The main result on phi-irreducible transition kernels is the following recurrence/transience
dichotomy, which parallels Theorem 139 for countable state-space
Markov chains.
Theorem 151. Let Q be a phi-irreducible transition kernel on (X, X). Then one
of the following alternatives holds.
(i) Every accessible set is recurrent, in which case we call Q recurrent.
(ii) There is a countable cover of X with uniformly transient sets, in which case we
call Q transient.
In the next section, we will prove Theorem 151 in the particular case where the
chain possesses an accessible atom (see Definition 152); the proof is then very similar
to that for countable state space. In the general case, the proof is more involved. It
is necessary to introduce small sets and the so-called splitting construction, which
relates the chain to one that does possess an accessible atom.
Atoms behave in the same way as individual states do in the countable state space
case. However, although any singleton {x} is an atom, it is not necessarily accessible;
it is here that Markov chain theory on general state spaces differs from the theory
of countable state space chains.
If α is an atom for Q, then for any m ≥ 1 it is an atom for Qm . Therefore we
denote by Qm (α, ·) the common value of Qm (x, ·) for all x ∈ α. This implies that
if the chain starts from within the atom, the distribution of the whole chain does
not depend on the precise starting point. Therefore we will also use the notation
Pα instead of Px for any x ∈ α.
Example 153 (Random Walk on the Half-Line). The random walk on the half-line
(RWHL) is defined by an initial condition X_0 ≥ 0 and the recursion X_{k+1} =
(X_k + U_{k+1})⁺, where {U_k}_{k≥1} is an i.i.d. sequence independent of X_0 and a⁺ =
max(a, 0).
Proposition 154. Let {Xk }k≥0 be a Markov chain that possesses an accessible
atom α, with associated probability measure ν. Then the chain is phi-irreducible, ν
is an irreducibility measure, and a set A ∈ X is accessible if and only if Pα (τA <
∞) > 0.
Moreover, α is recurrent if and only if Pα (τα < ∞) = 1 and (uniformly) tran-
sient otherwise, and the chain is recurrent if α is recurrent and transient otherwise.
Because α is accessible, P_x(τ_α < ∞) > 0 for all x ∈ X. Thus for any A ∈ X
satisfying ν(A) > 0, it holds that P_x(τ_A < ∞) > 0 for all x ∈ X, showing that ν is
an irreducibility measure. The above display also shows that A is accessible if and
only if P_α(τ_A < ∞) > 0.
Now let σ_α^{(n)} be the successive hitting times of α (see (7.13)). The strong Markov
property implies that for any n > 1,
Σ_{k=1}^{∞} Q^k(x, B_j) ≤ Σ_{k=1}^{∞} Q^k(x, B_j) inf_{y∈B_j} j Σ_{ℓ=1}^{j} Q^ℓ(y, α)
≤ j Σ_{k=1}^{∞} Σ_{ℓ=1}^{j} ∫_{B_j} Q^k(x, dy) Q^ℓ(y, α) ≤ j² Σ_{k=1}^{∞} Q^k(x, α) = j² E_x[η_α] < ∞ .
Definition 155 (Small Set). Let Q and ν be a transition kernel and a probability
measure, respectively, on (X, X), let m be a positive integer, and let ε ∈ (0, 1]. A set
C ∈ X is called an (m, ε, ν)-small set for Q, or simply a small set, if ν(C) > 0 and,
for all x ∈ C and A ∈ X,
Q^m(x, A) ≥ εν(A) .
If ε = 1, then C is an atom for the kernel Q^m.
Trivially, any individual point is a small set, but small sets that are not accessible
are of limited interest. If the state space is countable and Q is irreducible, then
every finite set is small. The minorization measure associated with an accessible small
set provides an irreducibility measure.
Proposition 156. Let C be an accessible (m, ε, ν)-small set for the transition kernel
Q on (X, X). Then ν is an irreducibility measure.
Proof. Let A ∈ X be such that ν(A) > 0. The strong Markov property yields
Px (τA < ∞) ≥ Px (τC < ∞, τA ◦ θτC < ∞) = Ex [1{τC <∞} PXτC (τA < ∞)] .
Example 160 (Autoregressive Process, Continued). Suppose that the noise distribution
in Example 148 has an everywhere positive continuous density γ with respect
to Lebesgue measure λ^Leb. If C = [−M, M] and ε = inf_{|u|≤(1+|φ|)M} γ(u), then for
A ⊆ C,
Q(x, A) = ∫_A γ(x′ − φx) dx′ ≥ ε λ^Leb(A) .
Hence the compact set C is small. Obviously R is covered by a countable collection
of small sets and every accessible set (here sets with non-zero Lebesgue measure)
contains a small set.
Example 161 (Metropolis-Hastings Algorithm, Continued). Similar results hold
for the Metropolis-Hastings algorithm of Example 149 if π(x) and r(x, x′) are positive
and continuous for all (x, x′) ∈ X × X. Suppose that C is compact with
λ^Leb(C) > 0. By positivity and continuity, we then have d = sup_{x∈C} π(x) < ∞ and
ε = inf_{(x,x′)∈C×C} r(x, x′) > 0. For any A ⊆ C, define
R_x(A) ≝ { x′ ∈ A : π(x′)r(x′, x) / [π(x)r(x, x′)] < 1 } ,
the region of possible rejection. Then for any x ∈ C,
Q(x, A) ≥ ∫_A r(x, x′) α(x, x′) dx′
≥ ∫_{R_x(A)} π(x′) r(x′, x)/π(x) dx′ + ∫_{A\R_x(A)} r(x, x′) dx′
≥ (ε/d) ∫_{R_x(A)} π(x′) dx′ + (ε/d) ∫_{A\R_x(A)} π(x′) dx′
= (ε/d) ∫_A π(x′) dx′ .
Thus C is small and, again, X can be covered by a countable collection of small
sets.
We now show that it is possible to define a Markov chain with an atom, the
so-called split chain, whose properties are directly related to those of the original
chain. This technique was introduced by Nummelin (1978) (see also Athreya and
Ney, 1978).
Examining the above technicalities, we find that transitions into C^c × {1} have
zero probability from everywhere, so that d_n = 1 can only occur if X_n ∈ C. This
is logical, because d_n = 1 indicates a regeneration time from within C. Likewise we
find that given a transition to some y ∈ C, the conditional probability that d_n = 1
is ε, wherever the transition took place from. Thus the above split transition kernel
corresponds to the following simulation scheme for {(X_k, d_k)}. Assume (X_k, d_k) are
given. If X_k ∉ C, then draw X_{k+1} from Q(X_k, ·). If X_k ∈ C and d_k = 1, then draw
X_{k+1} from ν, otherwise from R(X_k, ·). If the realized X_{k+1} is not in C, then set
d_{k+1} = 0; if X_{k+1} is in C, then set d_{k+1} = 1 with probability ε and otherwise set
d_{k+1} = 0.
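This simulation scheme can be written down directly. In the sketch below, Q_sample, R_sample, nu_sample, and the small-set indicator in_C are assumed to be supplied by the model; R denotes the residual kernel on C.

```python
# One step of the split chain {(X_k, d_k)}, following the scheme in the text.
import numpy as np

rng = np.random.default_rng(4)

def split_chain_step(x, d, eps, in_C, Q_sample, R_sample, nu_sample):
    if not in_C(x):
        x_next = Q_sample(x)      # outside C: move according to Q
    elif d == 1:
        x_next = nu_sample()      # regeneration: draw from nu
    else:
        x_next = R_sample(x)      # inside C without the bell: draw from R
    # Update the bell variable d_{k+1}.
    d_next = 1 if (in_C(x_next) and rng.uniform() < eps) else 0
    return x_next, d_next
```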
Split measures operate on the split kernel in the following way. For any measure
µ on (X, X),
µ*Q̌ = (µQ)* .   (7.26)
For any probability measure µ̌ on X̌, we denote by P̌_µ̌ and Ě_µ̌, respectively, the
probability distribution and the expectation on the canonical space (X̌^N, X̌^⊗N) such
that the coordinate process, denoted {(X_k, d_k)}_{k≥0}, is a Markov chain with initial
probability measure µ̌ and transition kernel Q̌. We also denote by {F̌_k}_{k≥0} the
natural filtration of this chain and, as usual, by {F_k^X}_{k≥0} the natural filtration of
{X_k}_{k≥0}.
Proposition 162. Let Q be a phi-irreducible transition kernel on (X, X), let C be
an accessible (1, ε, ν)-small set for Q, and let µ be a probability measure on (X, X).
Then for any bounded X-measurable function f and any k ≥ 1,
Ě_{µ*}[f(X_k) | F_{k−1}^X] = Qf(X_{k−1})   P̌_{µ*}-a.s.   (7.27)
Before giving the proof, we discuss the implications of this result. It implies that
under P̌_{µ*}, {X_k}_{k≥0} is a Markov chain (with respect to its natural filtration) with
transition kernel Q and initial distribution µ. By abuse of notation, we can identify
{X_k} with the coordinate process associated to the canonical space X^N. Denote
by P_µ the probability measure on (X^N, X^⊗N) such that {X_k}_{k≥0} is a Markov chain
with transition kernel Q and initial distribution µ (see Section 1.1.2), and denote by
E_µ the associated expectation operator. Then Proposition 162 yields the following
identity. For any bounded F_∞^X-measurable random variable Y,
Ě_{µ*}[Y] = E_µ[Y] .   (7.28)
Proof (of Proposition 162). We have, P̌_{µ*}-a.s.,
Ě_{µ*}[f(X_k) | F̌_{k−1}] = 1{d_{k−1}=1} ν(f) + 1{d_{k−1}=0} Rf(X_{k−1}) .
Because P̌_{µ*}(d_{k−1} = 1 | F_{k−1}^X) = ε1_C(X_{k−1}) P̌_{µ*}-a.s., it holds that
Ě_{µ*}[f(X_k) | F_{k−1}^X] = Ě_{µ*}{Ě_{µ*}[f(X_k) | F̌_{k−1}] | F_{k−1}^X}
= ε1_C(X_{k−1})ν(f) + [1 − ε1_C(X_{k−1})]Rf(X_{k−1})
= Qf(X_{k−1}) .
Harris Recurrence
As for countable state spaces, it is sometimes useful to consider stronger recurrence
properties, expressed in terms of return probabilities rather than mean occupation
times.
Because P_x(σ_A^{(1)} < ∞) = 1 for x ∈ A, we obtain that for all x ∈ A and all j ≥ 1,
P_x(σ_A^{(j)} < ∞) = 1 and E_x[η_A] = Σ_{j=1}^{∞} P_x(σ_A^{(j)} < ∞) = ∞.
Even though all transition kernels may not be Harris recurrent, the following
theorem provides a very useful decomposition of the state space of a recurrent phi-
irreducible transition kernel. For a proof of this result, see Meyn and Tweedie (1993,
Theorem 9.1.5).
Example 169. To understand why a recurrent Markov chain can fail to be Harris,
consider the following elementary example of a chain on X = N. Let the transition
kernel Q be given by Q(0, 0) = 1 and, for x ≥ 1, Q(x, x + 1) = 1 − 1/x² and
Q(x, 0) = 1/x². Thus the state 0 is absorbing. Because Q(x, 0) > 0 for any x ∈ X,
δ0 is an irreducibility measure. In fact, by application of Theorem 147, this measure
is maximal. The set {0} is an atom and because P0 (τ{0} < ∞) = 1, the chain is
recurrent by Proposition 154.
The chain is not Harris recurrent, however. Indeed, for any x ≥ 1 we have
x+k−1
Y
Px (τ0 ≥ k) = Px (X1 6= 0, . . . , Xk−1 6= 0) = (1 − 1/j 2 ) .
j=x
Q∞
Because j=2 (1 − 1/j 2 ) > 0, we obtain that Px (τ0 = ∞) = limk→∞ Px (τ0 ≥ k) > 0
for any x ≥ 2, so that the accessible state 0 is not certainly reached from such an
initial state. Comparing to Theorem 168, we see that the decomposition of the state
space is given by H = {0} and N = {1, 2, . . .}.
We now prove the existence of an invariant measure when the chain admits an
accessible atom. The invariant measure is defined as for countable state spaces, by
replacing an individual state by the atom. Thus define the measure µα on X by
"τ #
α
1A (Xn ) ,
X
µα (A) = Eα A∈X . (7.30)
n=1
Proposition 172. Let α be an accessible atom for the transition kernel Q. Then
µα is Q-sub-invariant. It is invariant if and only if the atom α is recurrent. In
that case, any Q-invariant measure µ is proportional to µα , and µα is a maximal
irreducibility measure.
Proof. We have
µ_α Q(A) = E_α[ Σ_{k=1}^{τ_α} Q(X_k, A) ] = E_α[ Σ_{k=2}^{τ_α+1} 1_A(X_k) ]
= µ_α(A) − P_α(X_1 ∈ A) + E_α[1_A(X_{τ_α+1}) 1_{τ_α<∞}] .
By the strong Markov property,
E_α[1_A(X_{τ_α+1}) 1_{τ_α<∞}] = E_α{E_α[1_A(X_1) ∘ θ_{τ_α} | F_{τ_α}^X] 1_{τ_α<∞}}
= E_α[P_{X_{τ_α}}(X_1 ∈ A) 1_{τ_α<∞}] = P_α(X_1 ∈ A) P_α(τ_α < ∞) .
Thus µ_α Q(A) = µ_α(A) − P_α(X_1 ∈ A)[1 − P_α(τ_α < ∞)]. This proves that µ_α is
sub-invariant, and invariant if and only if P_α(τ_α < ∞) = 1.
Now let µ be an invariant non-trivial measure and let A be an accessible set such
that µ(A) < ∞. Then there exists an integer n such that Qn (α, A) > 0. Because µ
is invariant, it holds that µ = µQn , so that
This implies that µ(α) < ∞. Without loss of generality, we can assume µ(α) > 0;
otherwise we replace µ by µ + µα . Assuming µ(α) > 0, there is then no loss of
generality in assuming µ(α) = 1.
The next step is to prove that if µ is an invariant measure such that µ(α) = 1,
then µ ≥ µ_α. To prove this, it suffices to show that for all n ≥ 1,
µ(A) ≥ Σ_{k=1}^{n} P_α(X_k ∈ A, τ_α ≥ k) .
Assume now that the inequality holds for some n ≥ 1. Then
µ(A) = Q(α, A) + ∫_{α^c} µ(dy) Q(y, A)
≥ Q(α, A) + Σ_{k=1}^{n} E_α[Q(X_k, A) 1_{τ_α≥k} 1_{X_k∉α}]
≥ Q(α, A) + Σ_{k=1}^{n} E_α[Q(X_k, A) 1_{τ_α≥k+1}] ,
whence
µ(A) ≥ Q(α, A) + Σ_{k=2}^{n+1} P_α(X_k ∈ A, τ_α ≥ k) = Σ_{k=1}^{n+1} P_α(X_k ∈ A, τ_α ≥ k) .
By application of the definition of the split kernel and measures, it can be checked
that µ̌Q̌ = µ*. Hence µ* = µ̌Q̌ = µ̌. We thus see that µ* is Q̌-invariant, which, as
noted above, implies that µ is Q-invariant. Hence we have shown that there exists
a Q-invariant measure if and only if there exists a Q̌-invariant one.
If Q is recurrent, then C is recurrent and, as appears in the proof of Proposition 173,
this implies that the atom α̌ is recurrent for the split chain Q̌. Thus,
by Proposition 154 the kernel Q̌ is recurrent, and by Proposition 172 it admits an
invariant measure that is unique up to a scaling factor. Hence Q also admits an
invariant measure, unique up to a scaling factor and such that 0 < π(C) < ∞.
Let µ be Q-invariant. Then µ* is Q̌-invariant and hence, by Proposition 172, a
maximal irreducibility measure. If µ(A) > 0, then µ*(A × {0, 1}) = µ(A) > 0. Thus
A × {0, 1} is accessible, and this implies that A is accessible. We conclude that µ is
an irreducibility measure, and it is maximal because it is K_ε-invariant.
Drift Conditions
We first give a sufficient condition for a chain to be positive, based on the expectation
of the return time to an accessible small set.
Proposition 175. Let Q be a transition kernel that admits an accessible small set
C such that sup_{x∈C} E_x[τ_C] < ∞.
Then the chain is positive, and the invariant probability measure π satisfies, for all
A ∈ X,
π(A) = ∫_C π(dy) E_y[ Σ_{k=0}^{τ_C−1} 1_A(X_k) ] = ∫_C π(dy) E_y[ Σ_{k=1}^{τ_C} 1_A(X_k) ] .
Proof. Because τ_C < ∞ P_y-a.s. for all y ∈ C, it holds that µ_C(C) = π(C). Then we
can show that µ_C(A) = π(A) for all A ∈ X. The proof is along the same lines as
the proof of Proposition 172 and is therefore omitted. Thus µ_C is invariant. In
addition, we obtain that for any measurable set A,
∫_C π(dy) E_y[1_A(X_0)] = π(A ∩ C) = µ_C(A ∩ C) = ∫_C π(dy) E_y[1_A(X_{τ_C})] ,
so that
µ_C(A) = ∫_C π(dy) E_y[ Σ_{k=1}^{τ_C} 1_A(X_k) ] = ∫_C π(dy) E_y[ Σ_{k=0}^{τ_C−1} 1_A(X_k) ] = π(A) .
Hence
π(X) = ∫_C π(dy) E_y[τ_C] < ∞ ,
so that any invariant measure is finite and the chain is positive. Finally, under
(7.33) we obtain that
π(f) = ∫_C π(dy) E_y[ Σ_{k=0}^{τ_C−1} f(X_k) ] < ∞ .
Then
E[M_{n+1} | F_n] = [ QV(X_n) + Σ_{k=0}^{n} f(X_k) ] 1_{τ_C≥n+1}
≤ [ V(X_n) − f(X_n) + b1_C(X_n) + Σ_{k=0}^{n} f(X_k) ] 1_{τ_C≥n+1}
= [ V(X_n) + Σ_{k=0}^{n−1} f(X_k) ] 1_{τ_C≥n+1} ≤ M_n ,
so that (7.34) holds when C = [−M, M] for some large enough M, provided |φ| <
1. Because we know that every compact set is small if the noise process has an
everywhere positive continuous density, Proposition 176 shows that the chain is
positive recurrent. Note that this approach provides an existence result but does
not help us to determine π. If {U_k} are Gaussian with zero mean and variance σ²,
then one can check that the invariant distribution also is Gaussian, with zero mean
and variance σ²/(1 − φ²).
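The claimed invariant law is easy to check by simulation. The following sketch compares the long-run empirical variance of the AR(1) chain with σ²/(1 − φ²); the values of φ and σ are arbitrary choices.

```python
# Simulate X_{k+1} = phi * X_k + U_{k+1} with Gaussian noise and compare the
# empirical stationary variance with sigma^2 / (1 - phi^2).
import numpy as np

rng = np.random.default_rng(5)
phi, sigma, n = 0.8, 1.0, 100_000
x = np.empty(n)
x[0] = 0.0
for k in range(n - 1):
    x[k + 1] = phi * x[k] + rng.normal(scale=sigma)

print(x[1000:].var())                 # empirical variance after burn-in
print(sigma ** 2 / (1 - phi ** 2))    # theoretical value, ~2.78 here
```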
Theorem 170 shows that if a chain is phi-irreducible and recurrent then the chain
is positive, that is, it admits a unique invariant probability measure π. In certain
situations, and in particular when dealing with MCMC procedures, it is known that
Q admits an invariant probability measure, but it is not known, a priori, that the
chain is recurrent. The following result shows that positivity implies recurrence.
Proof. Suppose that the chain is positive and let π be an invariant probability
measure. If Q is transient, the state space X is covered by a countable family {Aj }
of uniformly transient subsets (see Theorem 151). For any j and k,
kπ(A_j) = Σ_{n=1}^{k} πQⁿ(A_j) ≤ ∫ π(dx) E_x[η_{A_j}] ≤ sup_{x∈X} E_x[η_{A_j}] .   (7.38)
Thus, the left-hand side of (7.38) is bounded as k → ∞. This implies that π(A_j) = 0
and hence π(X) = 0, which is a contradiction; the chain therefore cannot be transient.
7.2.4 Ergodicity
In this section, we study the convergence of the iterates Qⁿ of the transition kernel to
the invariant distribution. As in the discrete state space case, we first need to rule
out periodic behavior, which prevents the iterates from converging. In the discrete
case, the period of a state x is defined as the greatest common divisor of the set of
time points {n ≥ 0 : Qⁿ(x, x) > 0}. Of course this notion does not extend to general
state spaces, but for phi-irreducible chains we may define the period of accessible
small sets. More precisely, let Q be a phi-irreducible transition kernel with maximal
irreducibility measure ψ. By Theorem 156, there exists an accessible (m, ε, ψ)-small
set C. Because ψ is a maximal irreducibility measure, ψ(C) > 0, so that when the
chain starts from C there is a positive probability that it will return to C at
time m. Let
E_C ≝ {n ≥ 1 : the set C is (n, ε_n, ψ)-small for some ε_n > 0}   (7.39)
be the set of time points for which C is small with minorizing measure ψ. Note
that for n and m in E_C, B ∈ X⁺, and x ∈ C,
Q^{n+m}(x, B) ≥ ∫_C Q^m(x, dx′) Qⁿ(x′, B) ≥ ε_m ε_n ψ(C)ψ(B) > 0 ,
showing that EC is closed under addition. There is thus a natural period for EC ,
given by the greatest common divisor. Similar to the discrete case (see Proposi-
tion 144), this period d may be shown to be independent of the particular choice of
the small set C (see for instance Meyn and Tweedie, 1993, Theorem 5.4.4).
The d-cycle is maximal in the sense that if D′_1, ..., D′_{d′} is a d′-cycle, then d′ divides d,
and if d = d′, then up to a permutation of indices D′_i and D_i are ψ-almost equal.
It is obvious from this theorem that the period d does not depend on the
choice of the small set C and that any small set must be contained (up to ψ-null
sets) inside one specific member of a d-cycle. This in particular implies that if there
exists an accessible (1, ε, ψ)-small set C, then d = 1. This suggests the following
definition.
In all the examples considered above, we have shown the existence of a 1-small
set; therefore all these Markov chains are strongly aperiodic.
Now we can state the main convergence result, formulated and proved by Athreya
et al. (1996). It parallels Theorem 145.
Although this result does not provide information on the rate of convergence
to the invariant distribution, its assumptions are quite minimal. In fact, it may be
shown that these assumptions are essentially necessary and sufficient. If ‖Qⁿ(x, ·) − π‖_TV →
0 for any x ∈ X, then by Nummelin (1984, Proposition 6.3), the chain is π-irreducible,
aperiodic, positive Harris, and π is an invariant distribution. This
form of the ergodicity theorem is of particular interest in cases where the invariant
distribution is explicitly known, as in Markov chain Monte Carlo. It provides conditions
that are simple and easy to verify, and under which an MCMC algorithm
converges to its stationary distribution.
Of course, the exceptional null set for non-Harris recurrent chains is a nuisance.
The example below, however, shows that there is no way of getting rid of it.
so that the total variation distance between the laws of two random elements is
bounded by the probability that they are unequal. Of course, this inequality is
not in general sharp, but if there exist ε ∈ (0, 1] and a probability measure ν
such that ξ ≥ εν and ξ′ ≥ εν, we can construct on an appropriately defined
probability space (Ω̃, F̃, P̃) two X-valued random variables X and X′ with laws ξ
and ξ′ such that P̃(X = X′) ≥ ε. The construction goes as follows. We draw a
Bernoulli random variable d with probability of success ε. If d = 0, we draw X and
X′ independently from the distributions (1 − ε)⁻¹(ξ − εν) and (1 − ε)⁻¹(ξ′ − εν),
respectively. If d = 1, we draw X from ν and set X′ = X. Note that for any A ∈ X,

    P̃(X ∈ A) = εν(A) + (1 − ε)(1 − ε)⁻¹[ξ(A) − εν(A)] = ξ(A) ,

and, similarly, P̃(X′ ∈ A) = ξ′(A). Thus, marginally the random variables X and
X′ are distributed according to ξ and ξ′. By construction, P̃(X = X′) ≥ P̃(d = 1)
= ε, showing that X and X′ are equal with probability at least ε. Therefore the
coupling bound (7.41) can be made sharp by using an appropriate construction.
Note that this construction may be used to derive bounds on distances between
probability measures that generalize the total variation; we will consider in the
sequel the V-total variation.
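The construction above is straightforward to implement. The following minimal
sketch (ours, not from the text) carries it out for two discrete laws, taking ν
proportional to the pointwise minimum ξ ∧ ξ′, which yields the largest possible
ε; the names xi, xi_p, nu, and eps are illustrative.

```python
import numpy as np

# Sketch of the coupling construction for two discrete laws xi and xi'
# (names ours). We take nu = (xi ^ xi') / eps, the largest common component,
# so that P(X = X') >= eps = 1 - (1/2) ||xi - xi'||_TV.
def coupled_pair(xi, xi_p, rng):
    common = np.minimum(xi, xi_p)      # pointwise minimum xi ^ xi'
    eps = common.sum()                 # mass of the common component
    if rng.random() < eps:             # d = 1: draw X from nu and set X' = X
        x = rng.choice(len(xi), p=common / eps)
        return x, x
    # d = 0: draw X and X' independently from the normalized residuals
    x = rng.choice(len(xi), p=(xi - common) / (1.0 - eps))
    x_p = rng.choice(len(xi_p), p=(xi_p - common) / (1.0 - eps))
    return x, x_p

rng = np.random.default_rng(0)
xi, xi_p = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
pairs = [coupled_pair(xi, xi_p, rng) for _ in range(100_000)]
print(np.mean([x == x_p for x, x_p in pairs]))   # about eps = 0.7
```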
The coupling argument extends to Markov chains: one constructs, on an appropriate
probability space, two chains {X_n} and {X′_n} such that X_n = X′_n for all
indices n after a random time T, referred to as the coupling time. The coupling
procedure attempts to couple the two Markov chains when they simultaneously enter
a coupling set.
Definition 185 (Coupling Set). Let C̄ ⊆ X × X, ε ∈ (0, 1], and let ν = {ν_{x,x′} :
x, x′ ∈ X} be transition kernels from C̄ (endowed with the trace σ-field) to (X, X).
The set C̄ is a (1, ε, ν)-coupling set if for all (x, x′) ∈ C̄ and all A ∈ X,

    Q(x, A) ∧ Q(x′, A) ≥ ε ν_{x,x′}(A) .    (7.43)

By applying Lemma 43, this condition can be stated equivalently as: there exists
ε ∈ (0, 1] such that for all (x, x′) ∈ C̄,

    (1/2) ‖Q(x, ·) − Q(x′, ·)‖_TV ≤ 1 − ε .    (7.44)
For simplicity, only one-step minorization is considered in this chapter. Adap-
tations to m-step minorization (replacing Q by Q^m in (7.43)) can be carried out
as in Rosenthal (1995). Condition (7.43) is often satisfied by setting C̄ = C × C
for a (1, ε, ν)-small set C. Indeed, in that case, for all (x, x′) ∈ C × C and
A ∈ X, Q(x, A) ∧ Q(x′, A) ≥ εν(A), so that (7.43) holds with ν_{x,x′} = ν.
For any probability measure µ̃ on (X̃, X̃), let P̃_µ̃ be the probability measure on
the canonical space (X̃^ℕ, X̃^⊗ℕ) such that the coordinate process {X̃_k} is a
Markov chain with transition kernel Q̃ and initial distribution µ̃. The corre-
sponding expectation operator is denoted by Ẽ_µ̃.
The transition kernel Q̃ can be described algorithmically. Given X̃0 = (X0 , X00 , d0 ) =
(x, x0 , d), X̃1 = (X1 , X10 , d1 ) is obtained as follows.
• If d = 1, draw X₁ from Q(x, ·) and set X′₁ = X₁ and d₁ = 1.
• If d = 0 and (x, x′) ∈ C̄, flip a coin with probability of heads ε.
  – If the coin comes up heads, draw X₁ from ν_{x,x′} and set X′₁ = X₁ and
    d₁ = 1.
  – If the coin comes up tails, draw (X₁, X′₁) from Q̄(x, x′; ·) and set d₁ = 0.
• If d = 0 and (x, x′) ∉ C̄, draw (X₁, X′₁) from Q̄(x, x′; ·) and set d₁ = 0.
The variable dn is called the bell variable; it indicates whether coupling has occurred
by time n (dn = 1) or not (dn = 0). The first index n at which dn = 1 is the coupling
time;
T = inf{k ≥ 1 : dk = 1}.
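For concreteness, here is a minimal sketch (with our own illustrative choices of
Q, C, ε, and ν, and with C̄ = C × C) of one transition of the coupled chain just
described on a finite state space, using the residual kernels for Q̄ on C̄; it is
a sketch of the construction, not the book's code.

```python
import numpy as np

# One step of the coupled kernel described above, on a finite state space
# (all numerical values are illustrative). C_bar = C x C for a (1, eps, nu)-
# small set C, with nu_{x,x'} = nu and Q_bar taken as the product of the
# residual kernels (Q(x, .) - eps nu) / (1 - eps) on C_bar, Q x Q elsewhere.
def coupled_step(x, x_p, d, Q, C, eps, nu, rng):
    n = Q.shape[0]
    if d == 1:                                  # already coupled: move together
        x1 = rng.choice(n, p=Q[x])
        return x1, x1, 1
    if x in C and x_p in C:                     # inside the coupling set
        if rng.random() < eps:                  # coin comes up heads
            x1 = rng.choice(n, p=nu)
            return x1, x1, 1                    # bell rings: d_1 = 1
        res = lambda row: (row - eps * nu) / (1.0 - eps)
        return rng.choice(n, p=res(Q[x])), rng.choice(n, p=res(Q[x_p])), 0
    return rng.choice(n, p=Q[x]), rng.choice(n, p=Q[x_p]), 0

rng = np.random.default_rng(1)
Q = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.4, 0.6]])
C, eps, nu = {1}, 0.3, np.array([0.3, 0.4, 0.3])
assert all(np.all(Q[x] >= eps * nu) for x in C)  # the minorization (7.43)

x, x_p, d, T = 0, 2, 0, 0                        # run until the bell rings
while d == 0:
    x, x_p, d = coupled_step(x, x_p, d, Q, C, eps, nu, rng)
    T += 1
print("coupling time T =", T)
```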
Proposition 186. Assume that the transition kernel Q admits a (1, ε, ν)-coupling
set C̄. Then for any probability measures ξ and ξ′ on (X, X) and any measurable
function V : X → [1, ∞), the V-total variation distance ‖ξQ^n − ξ′Q^n‖_V is
bounded by 2 Ẽ_{ξ⊗ξ′⊗δ₀}[V̄(X_n, X′_n) 1{T > n}], where V̄(x, x′) =
(1/2)[V(x) + V(x′)]. Define

    K_n(ε) = 1{σ_C̄ ≥ n}                        if ε = 1 ,
    K_n(ε) = ∏_{j=0}^{n−1} [1 − ε 1_C̄(X̄_j)]    if ε ∈ (0, 1) .    (7.51)
Proposition 187. Assume that the transition kernel Q admits a (1, ε, ν)-coupling
set. Let ξ and ξ′ be probability measures on (X, X) and let V : X → [1, ∞) be a
measurable function. Then, with µ̄ = ξ ⊗ ξ′,

    Ẽ_{µ̄⊗δ₀}[V̄(X_n, X′_n) 1{T > n}] = Ē_µ̄[V̄(X_n, X′_n) K_n(ε)] .
To do this, we shall prove by induction that for any n ≥ 0 and any bounded
X̄-measurable functions {f_j}_{j≥0},

    Ẽ_{µ̄⊗δ₀}[ ∏_{j=0}^{n} f_j(X_j, X′_j) 1{T > n} ] = Ē_µ̄[ ∏_{j=0}^{n} f_j(X̄_j) K_n(ε) ] .    (7.53)

This is obviously true for n = 0. For n ≥ 0, put χ_n = ∏_{j=0}^{n} f_j(X_j, X′_j).
The induction assumption and the identity {T > n + 1} = {d_{n+1} = 0} yield
Lemma 189. Let Q be an aperiodic positive transition kernel with invariant prob-
ability measure π. Then Q ⊗ Q is phi-irreducible, π ⊗ π is Q ⊗ Q-invariant, and
Q ⊗ Q is positive. If C is an accessible (m, ε, ν)-small set for Q, then C × C is
an accessible (m, ε², ν ⊗ ν)-small set for Q ⊗ Q.
Proof. Because Q is phi-irreducible and admits π as an invariant probability mea-
sure, π is a maximal irreducibility measure for Q. Let C be an accessible (m, ε, ν)-
small set for Q. Then for (x, x′) ∈ C × C and A ∈ X ⊗ X,

    (Q ⊗ Q)^m(x, x′; A) = ∫∫_A Q^m(x, dy) Q^m(x′, dy′) ≥ ε² ν ⊗ ν(A) .

Hence it suffices to prove that (7.40) holds for Q^m, and we may thus without loss
of generality assume that m = 1.
For any probability measure µ on (X × X, X ⊗ X), let P*_µ denote the probability
measure on the canonical space ((X × X)^ℕ, (X ⊗ X)^⊗ℕ) such that the canonical
process {(X_k, X′_k)}_{k≥0} is a Markov chain with transition kernel Q ⊗ Q and
initial distribution µ. By Lemma 189, Q ⊗ Q is positive, and it is recurrent by
Proposition 179. Because π ⊗ π(C × C) = π²(C) > 0, by Theorem 168 there exist
two measurable sets C̄ ⊆ C × C and H̄ ⊆ X × X such that π ⊗ π(C × C \ C̄) = 0,
π ⊗ π(H̄) = 1, and for all (x, x′) ∈ H̄, P*_{x,x′}(τ_C̄ < ∞) = 1. Moreover, the
set C̄ is a (1, ε, ν)-coupling set with ν_{x,x′} = ν for all (x, x′) ∈ C̄.
    Let the transition kernel Q̄ be defined by (7.45) if ε < 1 and by Q̄ = Q ⊗ Q
if ε = 1. For ε = 1, P̄_{x,x′} = P*_{x,x′}. Now assume that ε ∈ (0, 1). For
(x, x′) ∉ C̄, P̄_{x,x′}(τ_C̄ = ∞) = P*_{x,x′}(τ_C̄ = ∞). For (x, x′) ∈ C̄, noting
that Q̄(x, x′; A) ≤ (1 − ε)⁻² Q ⊗ Q(x, x′; A), we obtain P̄_{x,x′}(τ_C̄ = ∞) ≤
(1 − ε)⁻² P*_{x,x′}(τ_C̄ = ∞). Thus, for all ε ∈ (0, 1] the set C̄ is Harris-
recurrent for the kernel Q̄. This implies that lim_{n→∞} Ē_{x,x′}[K_n(ε)] = 0 for
all (x, x′) ∈ H̄ and, using Proposition 187, we conclude that (7.40) is true.
    QV ≤ λV + b 1_C .    (7.56)
Example 192 (Random Walk on the Half-Line, Continued). Assume that for the
model of Example 153 there exists z > 0 such that E[e^{zW₁}] < ∞. Then, because
µ < 0, there exists z > 0 such that E[e^{zW₁}] < 1. Define z₀ = arg min_{z>0} E[e^{zW₁}]
and V(x) = e^{z₀x}, and choose x₀ > 0 such that λ = E[e^{z₀W₁}] + P(W₁ < −x₀) < 1.
Then for x > x₀,

    QV(x) = E[e^{z₀(x+W₁)₊}] = P(W₁ ≤ −x) + e^{z₀x} E[e^{z₀W₁} 1{W₁ > −x}] ≤ λV(x) .

Hence the Foster-Lyapunov drift condition holds outside the small set [0, x₀], and
the RWHL is geometrically ergodic. For a sharper choice of the constants z₀ and
λ, see Scott and Tweedie (1996, Theorem 4.1).
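As a numerical illustration (our own, for Gaussian increments; none of the
constants below come from the text), one can compute z₀ and check that λ < 1:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Numerical check of the drift constants in Example 192 for Gaussian
# increments W_1 ~ N(mu, sigma^2) with mu < 0 (values illustrative).
mu, sigma = -0.5, 1.0
mgf = lambda z: np.exp(mu * z + 0.5 * (sigma * z) ** 2)   # E[exp(z W_1)]

z0 = minimize_scalar(mgf, bounds=(1e-6, 10.0), method="bounded").x
print(z0, mgf(z0))          # analytically z0 = -mu / sigma^2 = 0.5

x0 = 3.0                    # pick x0 so that lambda < 1
lam = mgf(z0) + norm.cdf(-x0, loc=mu, scale=sigma)
print(lam)                  # ~ 0.889 < 1: drift holds for x > x0, V(x) = exp(z0 x)
```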
For the random walk Metropolis-Hastings algorithm on the positive real line,
decomposing over the acceptance region A_x and the possible-rejection region R_x
gives

    QV(x)/V(x) = 1 + ∫_{A_x} r(x′ − x) [V(x′)/V(x) − 1] dx′
                   + ∫_{R_x} r(x′ − x) (π(x′)/π(x)) [V(x′)/V(x) − 1] dx′ .

We set V(x) = e^{sx} for some s ∈ (0, α). Because π is log-concave, π(x′)/π(x) ≤
e^{−α(x′−x)} when x′ ≥ x ≥ M. For x ≥ M, it follows from elementary calculations
that

    lim sup_{x→∞} QV(x)/V(x) ≤ 1 − ∫₀^∞ r(u) (1 − e^{−su}) [1 − e^{−(α−s)u}] du < 1 ,

showing that the random walk Metropolis-Hastings algorithm on the positive real
line satisfies the Foster-Lyapunov condition when π is monotone and log-concave in
the upper tail.
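The drift bound can also be checked by simulation. The sketch below, with an
illustrative exponential target π(x) ∝ e^{−αx} (so log-concave with rate α) and
constants of our own choosing, estimates QV(x)/V(x) far in the tail by Monte
Carlo:

```python
import numpy as np

# Monte Carlo estimate of QV(x)/V(x) for random walk Metropolis-Hastings
# on [0, inf) targeting pi(x) = alpha exp(-alpha x); V(x) = exp(s x) with
# s in (0, alpha). All constants are illustrative.
rng = np.random.default_rng(2)
alpha, s, tau = 1.0, 0.5, 1.0

def log_pi(x):
    return -alpha * x if x >= 0 else -np.inf   # log target (up to a constant)

def mh_step(x):
    x_prop = x + tau * rng.standard_normal()   # symmetric increment density r
    if np.log(rng.random()) < log_pi(x_prop) - log_pi(x):
        return x_prop                          # accept
    return x                                   # reject: stay put

x = 20.0                                       # deep in the upper tail
ratio = np.mean([np.exp(s * (mh_step(x) - x)) for _ in range(200_000)])
print(ratio)                                   # markedly below 1
```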
In fact, it follows from Meyn and Tweedie (1993, Theorems 15.0.1 and 16.0.1)
that the converse is also true: if a phi-irreducible aperiodic kernel is V -geometrically
ergodic, then there exists an accessible small set C such that V is a drift function
outside C.
For the sake of brevity and simplicity, we now prove Theorem 194 under the
additional assumption that the level sets of V are all (1, ε, ν)-small. In that
case, it is possible to define a coupling set C̄ and a transition kernel Q̄ that
satisfies a (bivariate) Foster-Lyapunov drift condition outside C̄; the geometric
ergodicity of the transition kernel Q then follows. This is the purpose of the
following propositions.
Proposition 195. Let Q be a kernel that satisfies the Foster-Lyapunov drift con-
dition (7.56) with respect to a (1, ε, ν)-small set C and a function V whose level
sets are (1, ε, ν)-small. Then for any d > 1, the set C′ = C ∪ {x ∈ X : V(x) ≤ d}
is small, C′ × C′ is a (1, ε, ν)-coupling set, and the kernel Q̄, defined as in
(7.45), satisfies the drift condition (7.58) with C̄ = C′ × C′, V̄(x, x′) =
(1/2)[V(x) + V(x′)], and λ̄ = λ + b/(1 + d), provided λ̄ < 1. Indeed, for
(x, x′) ∈ C̄,

    Q̄V̄(x, x′) = (2(1 − ε))⁻¹ [QV(x) + QV(x′) − 2εν(V)]
              ≤ (λ/(1 − ε)) V̄(x, x′) + (b − εν(V))/(1 − ε) .
Proposition 196. Assume that Q admits a (1, ε, ν)-coupling set C̄ and that there
exists a choice of the kernel Q̄ for which there is a measurable function V̄ : X̄ →
[1, ∞), λ̄ ∈ (0, 1), and b̄ > 0 such that

    Q̄V̄ ≤ λ̄V̄ + b̄ 1_C̄ .    (7.58)

Let W : X → [1, ∞) be a measurable function such that W(x) + W(x′) ≤ 2V̄(x, x′)
for all (x, x′) ∈ X × X. Then there exist ρ ∈ (0, 1) and c > 0 such that for all
(x, x′) ∈ X × X,

    ‖Q^n(x, ·) − Q^n(x′, ·)‖_W ≤ c ρ^n V̄(x, x′) .    (7.59)
Proof. By Proposition 186, proving (7.59) amounts to proving the requested bound
for Ē_{x,x′}[V̄(X̄_n)K_n(ε)]. We only consider the case ε ∈ (0, 1), the case ε = 1
being easier. Write x̄ = (x, x′). By induction, the drift condition (7.58) implies
that

    Ē_x̄[V̄(X̄_n)] = Q̄^n V̄(x̄) ≤ λ̄^n V̄(x̄) + b̄ ∑_{j=0}^{n−1} λ̄^j ≤ V̄(x̄) + b̄/(1 − λ̄) .    (7.60)
Recall that K_n(ε) = (1 − ε)^{η_n(C̄)} for ε ∈ (0, 1), where η_n(C̄) = ∑_{j=0}^{n−1} 1_C̄(X̄_j)
is the number of visits to the coupling set C̄ before time n. Hence K_n(ε) is
F̄_{n−1}-measurable. Let j ≤ n + 1 be an arbitrary positive integer to be chosen
later. Then (7.60) yields

    Ē_x̄[V̄(X̄_n)K_n(ε)1{η_n(C̄) ≥ j}] ≤ (1 − ε)^j Ē_x̄[V̄(X̄_n)] 1{j ≤ n}
                                     ≤ [V̄(x̄) + b̄/(1 − λ̄)] (1 − ε)^j 1{j ≤ n} .    (7.61)

Using the relations η_n(C̄) = η_{n−1}(C̄) + 1_C̄(X̄_{n−1}) and M(1 − ε) ≤ Bλ̄, we
find that Ē_x̄[Z_n | F̄_{n−1}] ≤ Z_{n−1} and, by induction, Ē_x̄[Z_n] ≤ Ē_x̄[Z₀] =
V̄(x̄). Hence, as B ≥ 1,

    Ē_x̄[V̄(X̄_n)K_n(ε)1{η_n(C̄) < j}] ≤ λ̄^n B^j Ē_x̄[Z_n] ≤ λ̄^n B^j V̄(x̄) .    (7.62)

Combining (7.61) and (7.62), we obtain

    Ē_x̄[V̄(X̄_n)K_n(ε)] ≤ [V̄(x̄) + b̄/(1 − λ̄)] [(1 − ε)^j 1{j ≤ n} + λ̄^n B^j] .
geometrically fast. This applies to the mean, f(x) = x, and to the second moment,
f(x) = x² (though in this case convergence can be derived directly from the auto-
regression).
Theorem 198. Let Q be a positive Harris recurrent transition kernel with invariant
distribution π. Then for any real π-integrable function f on X and any initial
distribution ν on (X, X),

    lim_{n→∞} n⁻¹ ∑_{k=1}^{n} f(X_k) = π(f)    Pν-a.s.    (7.63)
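A quick simulation shows the LLN at work; the AR(1) model and constants below are
our own illustrative choices, not part of the theorem:

```python
import numpy as np

# Illustration of (7.63) for a Gaussian AR(1) chain X_{k+1} = phi X_k + Z_{k+1},
# Z i.i.d. N(0,1), which is positive Harris with pi = N(0, 1/(1 - phi^2)).
rng = np.random.default_rng(3)
phi, n = 0.9, 1_000_000
x, acc = 10.0, 0.0                 # deliberately start far from stationarity
for _ in range(n):
    x = phi * x + rng.standard_normal()
    acc += x * x                   # f(x) = x^2
print(acc / n, 1.0 / (1.0 - phi * phi))   # both close to 5.263
```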
The LLN can be obtained from general ergodic theorems for stationary processes.
An elementary proof can be given when the chain possesses an accessible atom. The
basic technique is then the regeneration method, which consists in dividing the chain
into blocks between the chain’s successive returns to the atom. These blocks are
independent (see Lemma 199 below) and standard limit theorems for i.i.d. random
variables yield the desired result. When the chain has no atom, one may still employ
this technique by replacing the atom by a suitably chosen small set and using the
splitting technique (see for instance Meyn and Tweedie, 1993, Chapter 17).
Lemma 199. Let Q be a positive Harris recurrent transition kernel that admits an
accessible atom α. Define for any measurable function f,

    s_j(f) = ( ∑_{k=1}^{τ_α} f(X_k) ) ∘ θ^{τ_α^{(j−1)}} ,    j ≥ 1 .    (7.64)

Then for any initial distribution ν on (X, X), k ≥ 0, and functions {Ψ_j} in F_b(ℝ),

    Eν[ ∏_{j=1}^{k} Ψ_j(s_j(f)) ] = Eν[Ψ₁(s₁(f))] ∏_{j=2}^{k} E_α[Ψ_j(s_j(f))] .
Proof. Because the atom α is accessible and the chain is Harris recurrent,
P_x(τ_α^{(k)} < ∞) = 1 for any x ∈ X. By the strong Markov property, for any
integer k,
We now prove Theorem 198 in the case where there is an accessible atom. First
assume that f is non-negative. Denote the accessible atom by α and define

    η_n = ∑_{k=1}^{n} 1_α(X_k) ,    (7.65)
the occupation time of the atom α up to time n. We now split the sum ∑_{k=1}^{n} f(X_k)
into sums over the excursions between successive visits to α,

    ∑_{k=1}^{n} f(X_k) = ∑_{j=1}^{η_n} s_j(f) + ∑_{k=τ_α^{(η_n)}+1}^{n} f(X_k) .

Because f is non-negative, this implies

    ∑_{j=1}^{η_n} s_j(f) ≤ ∑_{k=1}^{n} f(X_k) ≤ ∑_{j=1}^{η_n+1} s_j(f) .    (7.66)
By Lemma 199 and the strong LLN for i.i.d. variables, η_n⁻¹ ∑_{j=1}^{η_n} s_j(f) → µ_α(f)
Pν-a.s., whence, by (7.66), the same limit holds for η_n⁻¹ ∑_{k=1}^{n} f(X_k). Because
π(1) = 1, µ_α(1) is finite too. Applying the above result with f ≡ 1 yields
n/η_n → µ_α(1), so that n⁻¹ ∑_{k=1}^{n} f(X_k) → µ_α(f)/µ_α(1) = π(f) Pν-a.s. This
is the desired result when f ≥ 0. The general case is handled by splitting f into
its positive and negative parts.
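The regeneration argument is easy to see numerically. In the sketch below (a
finite chain, so every state is an atom; the transition matrix and f are our own
choices), the trajectory is cut at successive returns to the atom α = 0 and π(f)
is recovered as µ_α(f)/µ_α(1):

```python
import numpy as np

# Regeneration method: cut the trajectory at returns to the atom alpha = 0;
# the block sums s_j(f) are i.i.d. and pi(f) = E_alpha[s_1(f)] / E_alpha[tau_alpha].
rng = np.random.default_rng(4)
Q = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.3, 0.0, 0.7]])    # its invariant law is uniform
f = np.array([1.0, 2.0, 3.0])      # so pi(f) = 2

n, x, cur, blocks = 500_000, 0, 0.0, []
for _ in range(n):
    x = rng.choice(3, p=Q[x])
    cur += f[x]                    # accumulate f over the current excursion
    if x == 0:                     # return to the atom ends a block
        blocks.append(cur)
        cur = 0.0
print(np.mean(blocks) / (n / len(blocks)))   # ~ pi(f) = 2
```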
Proof. To start with, it follows from the expression (7.32) for the stationary
distribution that

    π(f²) = ∫_C π(dx) E_x[ ∑_{k=1}^{τ_C} f²(X_k) ] ≤ ∫_C π(dx) E_x[ ( ∑_{k=1}^{τ_C} |f(X_k)| )² ] < ∞ .
We now prove the CLT under the additional assumption that the chain admits
an accessible atom α. The proof in the general phi-irreducible case can be obtained
using the splitting construction. The proof is along the same lines as for the LLN.
Put f̄ = f − π(f). By decomposing the sum ∑_{k=1}^{n} f̄(X_k) into excursions between
successive visits to the atom α, we obtain

    | n^{−1/2} ∑_{k=1}^{n} f̄(X_k) − n^{−1/2} ∑_{j=2}^{η_n} s_j(f̄) | ≤ n^{−1/2} s₁(|f̄|) + n^{−1/2} s_{η_n+1}(|f̄|) ,    (7.68)
where η_n and s_j(f) are defined in (7.65) and (7.64). It is clear that the first
term on the right-hand side of this display vanishes (in Pν-probability) as
n → ∞. For the second one, the strong LLN (Theorem 198) shows that
n⁻¹ ∑_{j=1}^{n} s_j²(|f̄|) has a Pν-a.s. finite limit, whence, Pν-a.s.,

    lim sup_{n→∞} s_{n+1}²(|f̄|)/n
        = lim sup_{n→∞} [ ((n+1)/n) (n+1)⁻¹ ∑_{j=1}^{n+1} s_j²(|f̄|) − n⁻¹ ∑_{j=1}^{n} s_j²(|f̄|) ] = 0 .

The strong LLN with f = 1_α also shows that η_n/n → π(α) Pν-a.s., so that
s_{η_n+1}²(|f̄|)/η_n → 0 and n^{−1/2} s_{η_n+1}(|f̄|) → 0 Pν-a.s.
Thus n^{−1/2} ∑_{k=1}^{n} f̄(X_k) and n^{−1/2} ∑_{j=2}^{η_n} s_j(f̄) have the same
limiting behavior. By Lemma 199, the blocks {s_j(f̄)}_{j≥2} are i.i.d. under Pν.
Thus, by the CLT for i.i.d. random variables, n^{−1/2} ∑_{j=2}^{n} s_j(f̄) converges
Pν-weakly to a Gaussian law with zero mean and some variance σ² < ∞; that the
variance is indeed finite follows as above with the small set C being the accessible
atom α. The so-called Anscombe theorem (see for instance Gut, 1988, Theorem 3.1)
then implies that η_n^{−1/2} ∑_{j=2}^{η_n} s_j(f̄) converges Pν-weakly to the same
Gaussian law. Thus we may conclude that

    n^{−1/2} ∑_{j=2}^{η_n} s_j(f̄) = (η_n/n)^{1/2} η_n^{−1/2} ∑_{j=2}^{η_n} s_j(f̄)

converges Pν-weakly to a Gaussian law with zero mean and variance π(α)σ². By
(7.68), so does n^{−1/2} ∑_{k=1}^{n} f̄(X_k).
The condition (7.67) is stated in terms of the second moment of the excursion
between two successive visits to a small set and appears rather difficult to verify
directly. More explicit conditions can be obtained, in particular if we assume that
the chain is V -geometrically ergodic.
Proposition 201. Let Q be a phi-irreducible, aperiodic, positive Harris recurrent
kernel that satisfies a Foster-Lyapunov drift condition (see Definition 191) outside
an accessible small set C, with drift function V. Then any measurable function f
such that |f|² ≤ V satisfies a CLT.
Proof. Minkowski's inequality implies that

    { E_x[ ( ∑_{k=0}^{τ_C−1} |f(X_k)| )² ] }^{1/2} ≤ ∑_{k=0}^{∞} { E_x[ f²(X_k) 1{τ_C > k} ] }^{1/2}
                                                  ≤ ∑_{k=0}^{∞} { E_x[ V(X_k) 1{τ_C > k} ] }^{1/2} .
This chain is said to be hidden because only a component (here {Yk }k≥0 ) is observed.
Of course, the process {Yk } is not a Markov chain, but nevertheless most of the
properties of this process are inherited from stability properties of the hidden chain.
In this section, we establish stability properties of the kernel T of the joint chain.
7.3.1 Phi-irreducibility
Phi-irreducibility of the joint chain T is inherited from phi-irreducibility of the
hidden chain, and the maximal irreducibility measures of the joint and hidden
chains are related in a simple way. Before stating the precise result, we recall
(see Section 1.1.1) that if φ is a measure on (X, X), we define the measure φ ⊗ G
on (X × Y, X ⊗ Y) by

    φ ⊗ G(A) = ∫∫_A φ(dx) G(x, dy) ,    A ∈ X ⊗ Y .
Proof. Let A ∈ X ⊗ Y be a set such that φ ⊗ G(A) > 0. Denote by Ψ_A the function
Ψ_A(x) = ∫_Y G(x, dy) 1_A(x, y) for x ∈ X. By Fubini's theorem,

    φ ⊗ G(A) = ∫∫ φ(dx) G(x, dy) 1_A(x, y) = ∫ φ(dx) Ψ_A(x) ,

and the condition φ ⊗ G(A) > 0 implies that φ({Ψ_A > 0}) > 0. Because
{Ψ_A > 0} = ∪_{m=1}^{∞} {Ψ_A ≥ 1/m}, we have φ({Ψ_A ≥ 1/m}) > 0 for some integer
m. Because φ
where

    ψ(B) = ∫ φ(dx) ∑_{m=0}^{∞} (1 − δ) δ^m Q^m(x, B) ,    B ∈ X .
We have established (see Example 148) that because {Uk } has a positive density
on R+ , the chain {Xk } is phi-irreducible and λLeb is an irreducibility measure.
Therefore {Xk , Yk } is also phi-irreducible and λLeb ⊗λLeb is a maximal irreducibility
measure.
Example 204 (Normal HMM, Continued). For the normal HMM (see Exam-
ple ??), it holds that T [(x, y), ·] = T [(x, y 0 ), ·] for any (y, y 0 ) ∈ R × R. Hence
{x} × R is an atom for T .
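Algorithmically, a step of T is just "move the hidden state with Q, then emit
with G"; the current observation is never looked at, which is exactly the content
of the example. Below is a minimal sketch with a two-state normal HMM of our own
(all parameter values are illustrative):

```python
import numpy as np

# One step of the joint kernel T[(x, y), .] = int Q(x, dx') G(x', dy'):
# the current y is never used, so T[(x, y), .] = T[(x, y'), .] and
# {x} x R is an atom for T. Two-state normal HMM, illustrative parameters.
rng = np.random.default_rng(5)
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # hidden transition matrix
mu, sd = np.array([-1.0, 2.0]), 0.5        # G(x, .) = N(mu_x, sd^2)

def joint_step(x, y):
    x_new = rng.choice(2, p=Q[x])                      # X_{k+1} ~ Q(x, .)
    y_new = mu[x_new] + sd * rng.standard_normal()     # Y_{k+1} ~ G(X_{k+1}, .)
    return x_new, y_new

x, y = 0, 0.0
for k in range(5):
    x, y = joint_step(x, y)
    print(k + 1, x, round(y, 3))
```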
Lemma 205. Let m be a positive integer, ε > 0, and let η be a probability measure
on (X, X). Let C ∈ X be an (m, ε, η)-small set for the transition kernel Q, that
is, Q^m(x, A) ≥ ε 1_C(x) η(A) for all x ∈ X and A ∈ X. Then C × Y is an
(m, ε, η ⊗ G)-small set for the transition kernel T defined in (1.14), that is,

    T^m[(x, y), Ā] ≥ ε 1_{C×Y}(x, y) η ⊗ G(Ā)    for all (x, y) ∈ X × Y and Ā ∈ X ⊗ Y .
The simple relations between the small sets of the joint chain and those of the
hidden chain immediately imply that T and Q have the same period.

Proof. Let C be an accessible (m, ε, η)-small set for Q with η(C) > 0. Define E_C
as the set of time indices for which C is a small set with minorizing probability
measure η,

    E_C = {n ≥ 0 : C is (n, ε_n, η)-small for some ε_n > 0} .
The period of the set C is given by the greatest common divisor of EC . Propo-
sition 180 shows that this value is in fact common to the chain as such and does
not depend on the particular small set chosen. Lemma 205 shows that C × Y is an
(m, , η ⊗ G)-small set for the joint Markov chain with transition kernel T , and that
η ⊗ G(C × Y) = η(C) > 0. The set EC×Y of time indices for which C × Y is a small
set for T with minorizing measure η ⊗ G is thus, using Lemma 205 again, equal to
EC . Thus the period of the set C is also the period of the set C × Y. Because the
period of T does not depend on the choice of the small set C × Y, it follows that
the periods of Q and T coincide.
Then the sets {A_i × Y}_{i≥1} form a countable cover of X × Y, and these sets are
uniformly transient because

    E_x[ ∑_{n=1}^{∞} 1_{A_i×Y}(X_n, Y_n) ] = E_x[ ∑_{n=1}^{∞} 1_{A_i}(X_n) ] .    (7.70)
Because the joint chain admits a stationary distribution, it is positive, and by
Proposition 179 it is recurrent.
    Conversely, assume that the joint chain is positive. Denote by π̄ the (unique)
stationary probability measure of T. Thus for any Ā ∈ X ⊗ Y, we have

    ∫∫ π̄(dx, dy) Q(x, dx′) G(x′, dy′) 1_Ā(x′, y′)
        = ∫∫ π̄(dx, Y) Q(x, dx′) G(x′, dy′) 1_Ā(x′, y′) = π̄(Ā) .
This shows that π(A) = π̄(A × Y) is a stationary distribution for the hidden chain.
Hence the hidden chain is positive and recurrent.
When the joint (or hidden) chain is positive, it is natural to study the rate at
which it converges to stationarity.
Proposition 209. Assume that the hidden chain satisfies a uniform Doeblin con-
dition, that is, there exists a positive integer m, > 0 and a family {ηx,x0 , (x, x0 ) ∈
X × X} of probability measures such that
Then the joint chain also satisfies a uniform Doeblin condition. Indeed, for all (x, y)
and (x0 , y 0 ) in X × Y and all Ā ∈ X ⊗ Y,
where Z
η̄x,x0 (Ā) = ηx,x0 (dx) G(x, dy) 1Ā (x, y) .
The proof is along the same lines as the proof of Lemma 205 and is omitted.
This proposition in particular implies that the ergodicity coefficients of the
kernels T^m and Q^m coincide: δ(T^m) = δ(Q^m). A straightforward but useful
application of this result is when the hidden Markov chain is defined on a finite
state space. If the transition matrix Q of this chain is primitive, that is, if
there exists a positive integer m such that Q^m(x, x′) > 0 for all (x, x′) ∈ X × X
(or, equivalently, if the chain Q is irreducible and aperiodic), then the joint
Markov chain satisfies a uniform Doeblin condition, and the ergodicity coefficient
of the joint chain is bounded by δ(T^m) ≤ 1 − ε with

    ε = min_{(x,x′)∈X×X} ∑_{x″∈X} Q^m(x, x″) ∧ Q^m(x′, x″) > 0 .
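For a finite state space both quantities are elementary to compute. The sketch
below (illustrative matrix, our own helper names) evaluates the Dobrushin
coefficient δ(Q^m) and the Doeblin constant ε, which for finite chains satisfy
δ(Q^m) = 1 − ε exactly:

```python
import numpy as np

# Dobrushin ergodicity coefficient and Doeblin constant for a primitive
# finite transition matrix (values illustrative). For row vectors p, q,
# (1/2) sum |p - q| = 1 - sum min(p, q), so delta(Q^m) = 1 - eps here.
def dobrushin(P):
    n = P.shape[0]
    return 0.5 * max(np.abs(P[i] - P[j]).sum()
                     for i in range(n) for j in range(n))

Q = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
m = 2                                       # Q^2 already has all entries > 0
Qm = np.linalg.matrix_power(Q, m)
eps = min(np.minimum(Qm[i], Qm[j]).sum()
          for i in range(3) for j in range(3))
print(dobrushin(Qm), 1.0 - eps)             # equal
```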
A similar result holds when the hidden chain satisfies a Foster-Lyapunov drift
condition instead of a uniform Doeblin condition. This result is of particular interest
when dealing with hidden Markov models on state spaces that are not finite or
bounded.
(ii) If sup_{x∈X} [V(x)]⁻¹ ∫ G(x, dy) f²(x, y) < ∞, then E_{π⊗G}[f²(X₀, Y₀)] < ∞
and there exist ρ ∈ (0, 1) and K < ∞ (not depending on f) such that for any n ≥ 0,
Now part (i) follows from the geometric ergodicity of Q (Theorem 194). Next,
because π(V) < ∞,

    E_{π⊗G}[f²(X₀, Y₀)] = ∫∫ π(dx) G(x, dy) f²(x, y)
                        ≤ π(V) sup_{x∈X} [V(x)]⁻¹ ∫ G(x, dy) f²(x, y) < ∞ ,
implying that |Cov_π[|f(X_n, Y_n)|, |f(X₀, Y₀)|]| ≤ Var_π[f(X₀, Y₀)] < ∞. In
addition,

    Cov_π[f(X_n, Y_n), f(X₀, Y₀)]
        = E_π{ E[f(X_n, Y_n) − π ⊗ G(f) | F₀] f(X₀, Y₀) }
        = ∫∫ π ⊗ G(dx, dy) f(x, y) ∫∫ [Q^n(x, dx′) − π(dx′)] G(x′, dy′) f(x′, y′) .    (7.71)
By Jensen's inequality, ∫ G(x, dy) |f(x, y)| ≤ [∫ G(x, dy) f²(x, y)]^{1/2}, and

    QV^{1/2}(x) ≤ [QV(x)]^{1/2} ≤ [λV(x) + b 1_C(x)]^{1/2} ≤ λ^{1/2} V^{1/2}(x) + b^{1/2} 1_C(x) ,

showing that Q also satisfies a Foster-Lyapunov condition outside C with drift
function V^{1/2}. By Theorem 194, there exist ρ ∈ (0, 1) and a constant K such that

    | ∫∫ [Q^n(x, dx′) − π(dx′)] G(x′, dy′) f(x′, y′) |
        ≤ ‖Q^n(x, ·) − π‖_{V^{1/2}} sup_{x′∈X} V^{−1/2}(x′) ∫ G(x′, dy) |f(x′, y)|
        ≤ K ρ^n V^{1/2}(x) sup_{x′∈X} V^{−1/2}(x′) ∫ G(x′, dy) |f(x′, y)| .
where ρ² = σ²_U δ²/(δ² − σ²_U). We may choose δ large enough that φ²(ρ² + δ²)/δ² < 1.
Then lim sup_{|x|→∞} QV(x)/V(x) = 0, so that Q satisfies a Foster-Lyapunov condition
with drift function V(x) = e^{x²/2δ²} outside a compact set [−M, +M]. Because every
compact set is small, the assumptions of Proposition 211 are satisfied, showing
that the joint chain is positive. Set f(x, y) = |y|. Then ∫ G(x, dy) |y| =
βe^{x/2}√(2/π). Proposition 211(ii) shows that Var_π(Y₀) < ∞ and that the auto-
covariance function Cov(|Y_n|, |Y₀|) decreases to zero exponentially fast.
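This geometric decay is visible in simulation. The sketch below (with parameter
values of our own choosing) simulates the stochastic volatility model and prints
empirical autocovariances of |Y_k| at increasing lags:

```python
import numpy as np

# Stochastic volatility model: X_{k+1} = phi X_k + sigma_u U_{k+1},
# Y_k = beta exp(X_k / 2) V_k, with U, V i.i.d. N(0, 1). We estimate
# Cov(|Y_lag|, |Y_0|) empirically; parameter values are illustrative.
rng = np.random.default_rng(6)
phi, sigma_u, beta, n = 0.95, 0.3, 1.0, 1_000_000

xs = np.empty(n)
x = sigma_u / np.sqrt(1 - phi**2) * rng.standard_normal()  # stationary start
for k in range(n):
    xs[k] = x
    x = phi * x + sigma_u * rng.standard_normal()
y_abs = beta * np.exp(xs / 2.0) * np.abs(rng.standard_normal(n))

yc = y_abs - y_abs.mean()
for lag in (1, 5, 10, 20, 40):
    print(lag, np.dot(yc[:-lag], yc[lag:]) / (n - lag))    # roughly geometric decay
```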
Bibliography
Askar, M. and Derin, H. (1981) A recursive algorithm for the Bayes solution of the
smoothing problem. IEEE Trans. Automat. Control, 26, 558–561.
Atar, R. and Zeitouni, O. (1997) Exponential stability for nonlinear filtering. Ann.
Inst. H. Poincaré Probab. Statist., 33, 697–725.
Athreya, K. B. and Ney, P. (1978) A new approach to the limit theory of recurrent
Markov chains. Trans. Am. Math. Soc., 245, 493–501.
Baum, L. E., Petrie, T. P., Soules, G. and Weiss, N. (1970) A maximization tech-
nique occurring in the statistical analysis of probabilistic functions of Markov
chains. Ann. Math. Statist., 41, 164–171.
Bickel, P. J., Ritov, Y. and Rydén, T. (1998) Asymptotic normality of the maximum
likelihood estimator for general hidden Markov models. Ann. Statist., 26, 1614–
1635.
Campillo, F. and Le Gland, F. (1989) MLE for partially observed diffusions: Direct
maximization vs. the EM algorithm. Stoch. Proc. App., 33, 245–274.
Cappé, O., Buchoux, V. and Moulines, E. (1998) Quasi-Newton method for maxi-
mum likelihood estimation of hidden Markov models. In Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process., vol. 4, 2265–2268.
Carpenter, J., Clifford, P. and Fearnhead, P. (1999) An improved particle filter for
non-linear problems. IEE Proc., Radar Sonar Navigation, 146, 2–7.
Cérou, F., Le Gland, F. and Newton, N. (2001) Stochastic particle methods for
linear tangent filtering equations. In Optimal Control and PDE’s - Innovations
and Applications, in Honor of Alain Bensoussan’s 60th Anniversary (eds. J.-L.
Menaldi, E. Rofman and A. Sulem), 231–240. IOS Press.
Chen, R. and Liu, J. S. (2000) Mixture Kalman filter. J. Roy. Statist. Soc. Ser. B,
62, 493–508.
Crisan, D., Del Moral, P. and Lyons, T. (1999) Discrete filtering using branching
and interacting particle systems. Markov Process. Related Fields, 5, 293–318.
Del Moral, P. and Jacod, J. (2001) Interacting particle filtering with discrete-time
observations: Asymptotic behaviour in the Gaussian case. In Stochastics in Fi-
nite and Infinite Dimensions: In Honor of Gopinath Kallianpur (eds. T. Hida,
R. L. Karandikar, H. Kunita, B. S. Rajput, S. Watanabe and J. Xiong), 101–122.
Birkhäuser.
Del Moral, P., Ledoux, M. and Miclo, L. (2003) On contraction properties of Markov
kernels. Probab. Theory Related Fields, 126, 395–420.
Douc, R., Moulines, E. and Rydén, T. (2004) Asymptotic properties of the max-
imum likelihood estimator in autoregressive models with Markov regime. Ann.
Statist., 32, 2254–2304.
Doucet, A., De Freitas, N. and Gordon, N. (eds.) (2001) Sequential Monte Carlo
Methods in Practice. Springer.
Durrett, R. (1996) Probability: Theory and Examples. Duxbury Press, 2nd ed.
Ephraim, Y. and Merhav, N. (2002) Hidden Markov processes. IEEE Trans. Inform.
Theory, 48, 1518–1569.
Fearnhead, P. (1998) Sequential Monte Carlo methods in filter theory. Ph.D. thesis,
University of Oxford.
Handschin, J. (1970) Monte Carlo techniques for prediction and filtering of non-
linear stochastic processes. Automatica, 6, 555–563.
Handschin, J. and Mayne, D. (1969) Monte Carlo techniques to estimate the condi-
tional expectation in multi-stage non-linear filtering. Int. J. Control, 9,
547–559.
Kaijser, T. (1975) A limit theorem for partially observed Markov chains. Ann.
Probab., 3, 677–696.
Kalman, R. E. and Bucy, R. (1961) New results in linear filtering and prediction
theory. J. Basic Eng., Trans. ASME, Series D, 83, 95–108.
Kitagawa, G. (1996) Monte Carlo filter and smoother for non-Gaussian nonlinear
state space models. J. Comput. Graph. Statist., 5, 1–25.
Kong, A., Liu, J. S. and Wong, W. (1994) Sequential imputation and Bayesian
missing data problems. J. Am. Statist. Assoc., 89.
Künsch, H. R. (2000) State space and hidden Markov models. In Complex Stochastic
Systems (eds. O. E. Barndorff-Nielsen, D. R. Cox and C. Kluppelberg). CRC
Press.
Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (1992) Numerical Recipes
in C: The Art of Scientific Computing. Cambridge University Press, 2nd ed. URL
http://www.numerical-recipes.com/.
Rauch, H., Tung, F. and Striebel, C. (1965) Maximum likelihood estimates of linear
dynamic systems. AIAA Journal, 3, 1445–1450.
Ristic, B., Arulampalam, M. and Gordon, N. (2004) Beyond the Kalman Filter:
Particle Filters for Tracking Applications. Artech House.
Roberts, G. O. and Rosenthal, J. S. (2004) General state space Markov chains and
MCMC algorithms. Probab. Surv., 1, 20–71.
— (2001) A review of asymptotic convergence for general state space Markov chains.
Far East J. Theor. Stat., 5, 37–50.
Segal, M. and Weinstein, E. (1989) A new method for evaluating the log-likelihood
gradient, the Hessian, and the Fisher information matrix for linear dynamic sys-
tems. IEEE Trans. Inform. Theory, 35, 682–687.
Teicher, H. (1960) On the mixture of distributions. Ann. Math. Statist., 31, 55–73.
— (1961) Identifiability of mixtures. Ann. Math. Statist., 32, 244–248.
Van der Merwe, R., Doucet, A., De Freitas, N. and Wan, E. (2000) The unscented
particle filter. In Adv. Neural Inf. Process. Syst. (eds. T. K. Leen, T. G. Dietterich
and V. Tresp), vol. 13. MIT Press.
Van Overschee, P. and De Moor, B. (1993) Subspace algorithms for the stochastic
identification problem. Automatica, 29, 649–660.
— (1996) Subspace Identification for Linear Systems. Theory, Implementation, Ap-
plications. Kluwer.
Wald, A. (1949) Note on the consistency of the maximum likelihood estimate. Ann.
Math. Statist., 20, 595–601.
Weinstein, E., Oppenheim, A. V., Feder, M. and Buck, J. R. (1994) Iterative and
sequential algorithms for multisensor signal enhancement. IEEE Trans. Acoust.,
Speech, Signal Process., 42, 846–859.
Whitley, D. (1994) A genetic algorithm tutorial. Stat. Comput., 4, 65–85.