
STAT 5703

Statistical Inference and Modeling
for Data Science
Dobrin Marchev

Motivation
• Until now, all models assumed i.i.d.
random sample framework
• We now consider a dependence structure
between the observations X1, … , Xn
• This can be done with two approaches:
Markov chains or time series
• Markov models provide a very rich set of
tools for handling dependence

Prof. Andrei A. Markov (1856-1922) published his result in 1906.
Markov later used his theory to study the distribution of vowels in Onegin, written by Pushkin.
Data Science Example
• A common application of Markov chains in data science is
text prediction.
• It’s an area of NLP (Natural Language Processing) that is
commonly used in the tech industry by companies like
Google, LinkedIn and Instagram. When you’re writing
emails, Google predicts and suggests words or phrases to
autocomplete your email. And when you receive messages
on Instagram or LinkedIn, those apps suggest potential
replies.
• These are not the applications of a Markov chain we will
explore. The types of models these large-scale companies
use in production for these features are more
complicated. We will stick to simpler examples.
Example: the taxi problem
A taxi company has divided the city into three regions –
Northside, Downtown, and Southside. By keeping track of
pickups and drop-offs, the company has found that:

• Of the fares picked up in Northside, 50% stay in that region, 20% are taken to Downtown, and 30% go to Southside.
• Of the fares picked up Downtown, only 10% go to
Northside, 40% stay in Downtown, and 50% go to
Southside.
• Of the fares picked up in Southside, 30% go to each of
Northside and Downtown, while 40% stay in Southside.

We would like to know what the distribution of taxis will be over time as they pick up and drop off successive fares.
For example, what is the probability that a taxi starting off Downtown will be Downtown after letting off its seventh fare?
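A minimal sketch (not part of the original slides) of how this question can be answered numerically: propagate the taxi's location distribution fare by fare using the percentages above. The state ordering (N, D, S) and the variable names are my own choices.

```python
import numpy as np

# One-step transition probabilities taken from the bullets above,
# rows/columns ordered (Northside, Downtown, Southside).
P = np.array([[0.5, 0.2, 0.3],   # fare picked up in Northside
              [0.1, 0.4, 0.5],   # fare picked up Downtown
              [0.3, 0.3, 0.4]])  # fare picked up in Southside

dist = np.array([0.0, 1.0, 0.0])  # the taxi starts Downtown
for _ in range(7):                # one pickup/drop-off per step
    dist = dist @ P
print(dist[1])                    # P(Downtown after the 7th fare), roughly 0.30
```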
State Diagram
This information can be represented in a state diagram which includes
• the three states D, N, and S corresponding to the three regions of the city
• the probabilities of a taxi transitioning from one region/state to another
Properties
• If the location of the taxi at time n is denoted by Xn, then the sequence X1, X2, … consists of dependent variables and can be modeled with a Markov chain. The values the random variable Xn can take are known as the states of the chain.

• The probabilities of moving from state to state are constant and independent of the past behavior – this property of the system is called the Markov property. That is,

$$P(X_n = s_n \mid X_{n-1} = s_{n-1}, \ldots, X_0 = s_0) = P(X_n = s_n \mid X_{n-1} = s_{n-1})$$

• We assume that a transition – picking up and dropping off a fare – occurs each time the system is observed, and that observations occur at regular intervals. Systems with these characteristics are called Markov chains or Markov processes (when time is continuous).
Computing Transition Probabilities
• What is the probability that a taxi that starts off Downtown ends up in Northside after two fares?
• One possibility is that the taxi stays Downtown for the first fare and then transitions to Northside for the second. The probability of this occurring is then 0.4 × 0.1 = 0.04.
• But we could also have the taxi going to either Northside or Southside first, then transitioning to Northside: 0.1 × 0.5 = 0.05 and 0.5 × 0.3 = 0.15.
• Since the taxi could follow the first, second or third path, the probability of starting Downtown and ending in Northside after two fares is 0.04 + 0.05 + 0.15 = 0.24.
Transitions in More Steps
• If we want to know the probability of a taxi transitioning from one region to another after just three fares, the computation will have more possible paths.
• Suppose we were interested in the probability of a taxi both starting and ending up Downtown. We can use a tree diagram to represent this calculation.
• More generally, we might want to determine the probability of moving from state i to state j over m steps.
Tree Diagram: if we multiply along all the paths and sum the results, we find that this probability is 0.309.
Transition Matrix

• We can create a square matrix, P, called the transition matrix, by constructing rows for the probabilities going from Southside and Northside as well.

• An entry pij of this matrix is the probability of a transition from region i to region j. With the states ordered (D, S, N), for example, p32 is the probability of a fare that originates in Northside going to Southside. (Note that the entries across each row of P sum to 1.)
Transition Matrix

What results when we multiply the transition matrix by itself?

$$P^2 = \begin{pmatrix} 0.4 & 0.5 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}^2 = \begin{pmatrix} 0.4 & 0.5 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} \times \begin{pmatrix} 0.4 & 0.5 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} = \begin{pmatrix} 0.33 & 0.43 & 0.24 \\ 0.30 & 0.40 & 0.30 \\ 0.27 & 0.37 & 0.36 \end{pmatrix}$$

The highlighted entry, 0.24, results from the same computation that we already considered for a taxi going from D to N in two fares.

What are the other entries of P²? What are the entries of P³? Of Pⁿ?
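These questions are easy to answer numerically. A minimal sketch (not from the slides), again with the states ordered (D, S, N):

```python
import numpy as np

P = np.array([[0.4, 0.5, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])  # taxi chain, states ordered (D, S, N)

print(np.linalg.matrix_power(P, 2))   # matches the P^2 shown above
print(np.linalg.matrix_power(P, 3))   # three-fare transition probabilities
print(np.linalg.matrix_power(P, 20))  # rows become nearly identical as n grows
```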
Statistical Model
• A Markov chain is a stochastic process {Xt} taking values in a (finite discrete)
state space 𝒳 = {1, . . . , s} such that the distribution of Xt depends only on Xt-1.
In the taxi example, 𝒳 = {N, D, S}. Note that the state space can also be infinite.
• The observed sample data $X_{t_1}, \ldots, X_{t_n}$ are of the form $\{X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_n} = s_n\}$, where $0 < t_1 < \cdots < t_n$.

• Note that we can always write the likelihood (joint pdf) for any sample, using the multiplication rule for dependent random variables, as:

$$P(X_0 = s_0, \ldots, X_{t_n} = s_n) = P(X_0 = s_0) \prod_{i=1}^{n} P(X_{t_i} = s_i \mid X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{i-1}} = s_{i-1})$$

• What makes {Xt} a Markov chain is the extra assumption that each $P(X_{t_i} = s_i \mid X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{i-1}} = s_{i-1})$ depends only on the most recent observation $X_{t_{i-1}} = s_{i-1}$.
Statistical Model
Definition: The (first-order) Markov property assumes that “given the present,
the future is independent of the past”, meaning that

$$P(X_{t_i} = s_i \mid X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{i-1}} = s_{i-1}) = P(X_{t_i} = s_i \mid X_{t_{i-1}} = s_{i-1}), \quad \forall i$$

This simplifies the likelihood to:

$$P(X_0 = s_0, \ldots, X_{t_n} = s_n) = P(X_0 = s_0) \prod_{i=1}^{n} P(X_{t_i} = s_i \mid X_{t_{i-1}} = s_{i-1})$$

For example,
$$P(X_0 = a, X_1 = b, X_2 = c) = P(X_0 = a)\, P(X_1 = b \mid X_0 = a)\, P(X_2 = c \mid X_1 = b)$$
If the process does not possess the Markov property, then we can only write:
$$P(X_0 = a, X_1 = b, X_2 = c) = P(X_0 = a)\, P(X_1 = b \mid X_0 = a)\, P(X_2 = c \mid X_0 = a, X_1 = b)$$
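As a small numerical illustration (not on the slide), using the one-step probabilities from the taxi example and assuming the taxi starts Downtown with probability 1:

```python
# P(X0=D, X1=S, X2=N) = P(X0=D) * P(X1=S | X0=D) * P(X2=N | X1=S)
p_DS = 0.5   # Downtown -> Southside, from the taxi example
p_SN = 0.3   # Southside -> Northside, from the taxi example
prob = 1.0 * p_DS * p_SN
print(prob)  # 0.15
```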

The theory of Markov chains is very rich and complex. We must get through many definitions before we can do anything interesting.
Questions to be answered
• When does a Markov chain "settle down" into some sort of equilibrium?
• What does this even mean?
• How do we estimate the parameters of a Markov chain?
• What are the parameters of a Markov chain?
• How can we construct Markov chains that converge to a given equilibrium distribution?
• Why would we want to do that??
• Time to think: do you know what the following sequence converges to?

$$x_{n+1} = \frac{1}{2} x_n + \frac{1}{x_n}$$

[Figure: Two Markov chains. One of the chains does not settle down into an equilibrium. The other one does. Which one is which?]
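A quick worked step (not on the slide): if the sequence converges to some limit x > 0, the limit must satisfy the fixed-point equation obtained by setting $x_{n+1} = x_n = x$:

$$x = \frac{1}{2}x + \frac{1}{x} \;\Longrightarrow\; \frac{1}{2}x = \frac{1}{x} \;\Longrightarrow\; x^2 = 2 \;\Longrightarrow\; x = \sqrt{2}$$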
Transition Matrix
Definition: A Markov chain is a Markov process in discrete time, so we can
simply write ti = i, i = 1, … , n.

Definition: A transition matrix Pt for a Markov chain {Xt} at time t is a matrix containing information on the probability of transitioning between states. Given an ordering of the matrix's rows and columns by the state space 𝒳, the (i, j)th element of the matrix Pt is given by:

$$(\mathbf{P}_t)_{ij} = P(X_{t+1} = j \mid X_t = i)$$

This means each row of the matrix is a probability vector, and the sum of its
entries is 1.
Transition matrices have the property that the product of subsequent ones
describes a transition along the time interval spanned by the transition
matrices.
Note: The matrix depends on the time t.
Time homogeneous Markov chains
Definition: A Markov chain is time-independent, aka homogeneous, if

$$P(X_{t+1} = s \mid X_t = r) = P(X_1 = s \mid X_0 = r) = p_{rs}, \quad \forall t$$

That is, the conditional probabilities only depend on the time difference.
Note: $p_{rs}$ does not depend on t.

For a homogeneous Markov chain Xt, observed at discrete, equally spaced times t = 0, 1, . . . , n, we define the constant s×s transition matrix P such that the (i, j)th element pij is given by:

$$\mathbf{P}_{ij} = P(X_{t+1} = j \mid X_t = i) = P(X_1 = j \mid X_0 = i), \quad \forall t$$

Note: The transition matrix must satisfy
1. $p_{ij} \geq 0, \; \forall i, j \in \mathcal{X}$
2. $\sum_{j=1}^{s} p_{ij} = 1, \; \forall i \in \mathcal{X}$
Transition Matrix Properties
Denote the initial probabilities by $p_j = P(X_0 = j), \; j = 1, \ldots, s$. That is, we can form a vector p of all initial probabilities: $\mathbf{p} = (p_1, \ldots, p_s)'$.

Then, by the law of total probability, the marginal probability of the next step being k is

$$P(X_1 = k) = \sum_{j=1}^{s} P(X_0 = j)\, P(X_1 = k \mid X_0 = j) = \sum_{j=1}^{s} p_j\, p_{jk}$$

Notice this is the kth element of $\mathbf{p}'\mathbf{P}$.

Theorem: By induction, we can prove that the pmf of Xn is given by $\mathbf{p}'\mathbf{P}^n$. That is, $P(X_n = k)$ is the kth element of the vector $\mathbf{p}'\mathbf{P}^n$.

Note: This is different from $\mathbf{P}^n$ itself. The element in the ith row, jth column of $\mathbf{P}^n$ is equal to $P(X_n = j \mid X_0 = i)$.

Example: Given the chain is in state s, the probability of a run of exactly m ≥ 1 successive stays in state s is $p_{ss}^{m-1}(1 - p_{ss})$, so the run length follows a Geometric$(1 - p_{ss})$ distribution.
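A small simulation sketch (not from the slides) checking the run-length claim for the taxi chain. Starting Downtown, where the probability of staying is $p_{ss} = 0.4$, the run length should be Geometric(0.6) with mean 1/0.6 ≈ 1.67.

```python
import numpy as np

rng = np.random.default_rng(0)
p_ss = 0.4                       # probability of staying Downtown in the taxi chain
runs = []
for _ in range(100_000):
    m = 1                        # the chain is currently in state s
    while rng.random() < p_ss:   # stays with probability p_ss, leaves with 1 - p_ss
        m += 1
    runs.append(m)
print(np.mean(runs))             # close to 1 / (1 - p_ss), about 1.67
```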
Stationary Distribution
To understand the long-term behavior of our Markov chain, we introduce the
concept of a stationary distribution.
Definition: A Markov chain with transition matrix P is said to have a stationary distribution $\boldsymbol{\pi}$ (aka equilibrium or invariant distribution) if there exists a vector $\boldsymbol{\pi}$ such that $\boldsymbol{\pi}'\mathbf{P} = \boldsymbol{\pi}'$.
This equation implies that multiplying the equilibrium vector by the transition matrix P yields the same vector: if the chain is started from the distribution $\boldsymbol{\pi}$, then its distribution at every subsequent time is still $\boldsymbol{\pi}$.
Notice this means that $\boldsymbol{\pi}$ must be a left eigenvector of P, corresponding to eigenvalue 1. Note also that all transition matrices possess a right eigenvector corresponding to eigenvalue 1, since $\mathbf{P}\mathbf{1}_S = \mathbf{1}_S$, where $\mathbf{1}_S$ is a vector of S 1's.
Is the stationary distribution unique? Will the chain converge to it?
Theorem: All Markov chains defined on finite state spaces always possess at least one equilibrium distribution.
Theorem: Under some conditions (the Markov chain is ergodic), the stationary distribution satisfies
$$\lim_{n \to \infty} \mathbf{P}^n = \mathbf{1}\,\boldsymbol{\pi}'$$
Irreducibility
Definition: A state sj is said to be accessible from state si if a chain starting in state si has a positive probability to reach state sj at some future time point n. That is,
$$\exists\, n > 0: \; p_{ij}^{(n)} > 0$$
If sj is accessible from si and si is accessible from state sj then we say that si and sj
communicate.
A communicating class is defined to be a set of states that communicate.
Definition: If a discrete-time Markov chain is composed of only one communicating class (i.e., if all states in the chain communicate), then it is said to be irreducible.
Example: $P = \begin{pmatrix} 0.5 & 0.5 & 0 & 0 \\ 0.9 & 0.1 & 0 & 0 \\ 0 & 0 & 0.2 & 0.8 \\ 0 & 0 & 0.7 & 0.3 \end{pmatrix}$ is reducible and has 2 classes.

Note: Irreducibility does not guarantee the presence of limiting probabilities.
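A small sketch (not from the slides) that finds communicating classes by checking mutual reachability in the directed graph with an edge i → j whenever $p_{ij} > 0$. Applied to the example matrix above it reports two classes, corresponding to states {1, 2} and {3, 4} (the code uses 0-based indices).

```python
import numpy as np

def communicating_classes(P):
    """Group states by mutual reachability (i <-> j) in the graph with edges where p_ij > 0."""
    P = np.asarray(P)
    s = P.shape[0]
    adj = (P > 0).astype(int)
    # reach[i, j] is True if j can be reached from i in some number of steps
    reach = np.linalg.matrix_power(np.eye(s, dtype=int) + adj, s) > 0
    communicate = reach & reach.T
    classes, seen = [], set()
    for i in range(s):
        if i not in seen:
            cls = {j for j in range(s) if communicate[i, j]}
            classes.append(cls)
            seen |= cls
    return classes

P = [[0.5, 0.5, 0, 0], [0.9, 0.1, 0, 0], [0, 0, 0.2, 0.8], [0, 0, 0.7, 0.3]]
print(communicating_classes(P))  # [{0, 1}, {2, 3}]
```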

Periodicity
Definition: We define the period of state i, denoted d(i), to be the greatest common divisor of
$$J_i = \{\, n \geq 0 : p^{(n)}(i, i) > 0 \,\}$$
That is, any return to state i must occur in multiples of d(i) time steps.
We call an irreducible transition matrix P aperiodic if $d(i) = 1, \; \forall i$.
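A minimal sketch (not from the slides) that approximates d(i) as the gcd of the return times found among the first few powers of P; the horizon max_n is an arbitrary choice of mine.

```python
import numpy as np
from math import gcd
from functools import reduce

def state_periods(P, max_n=50):
    """Approximate d(i) = gcd{ n >= 1 : P^n[i, i] > 0 } using powers of P up to max_n."""
    P = np.asarray(P, dtype=float)
    s = P.shape[0]
    return_times = [[] for _ in range(s)]
    Q = np.eye(s)
    for n in range(1, max_n + 1):
        Q = Q @ P                      # Q is now P^n
        for i in range(s):
            if Q[i, i] > 1e-12:
                return_times[i].append(n)
    return [reduce(gcd, times) if times else 0 for times in return_times]

print(state_periods([[0, 1], [1, 0]]))  # [2, 2]: returns only at even times
```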
Periodicity
Theorem: If a stochastic matrix is irreducible and aperiodic, then there exists a
unique invariant distribution and it satisfies the ergodic theorem.

Examples:
$P = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ has infinitely many invariant distributions.
$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ has a unique invariant distribution but is not ergodic.
Which condition is violated in each case?

Exercise: Is the following transition matrix irreducible and aperiodic?

$$\mathbf{P} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ 0 & 0 & \tfrac{1}{3} & \tfrac{2}{3} \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
The fine print

Definition: $f_{ii}^{(n)} = P(X_n = i, X_1 \neq i, \ldots, X_{n-1} \neq i \mid X_0 = i)$ is the probability of first recurrence to i at the nth step.
Then $f_i = f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)}$ is the probability of recurrence.

Definition: A state i is called recurrent if $f_i = 1$, that is, eventual return is certain. Otherwise, if $f_i < 1$, the state is called transient.

Definition: If the mean time for recurrence is finite, the state is called positive
recurrent, otherwise it is null-recurrent.
Note: For infinite-state Markov chains, all 12 combinations of (irreducible or
not), (aperiodic or not), and (positive recurrent or null recurrent or transient)
are possible.

Theorem: If a general Markov chain is positive recurrent, then there exists a unique finite invariant measure.

Finding the stationary distribution

• Using a computer, you can apply “brute force”, meaning compute some very high power of P; each row should then be approximately equal to $\boldsymbol{\pi}'$.

• Eigen decomposition: Notice that $\boldsymbol{\pi}'\mathbf{P} = \boldsymbol{\pi}'$ implies that $\boldsymbol{\pi}$ is a left eigenvector of P, corresponding to eigenvalue 1. Therefore, if you can find the left eigenvector of P associated with eigenvalue 1 (and normalize it to sum to 1), you have found $\boldsymbol{\pi}$.

• You can solve $\boldsymbol{\pi}'\mathbf{P} = \boldsymbol{\pi}'$ as a system of linear equations subject to $\sum_i \pi_i = 1$.

A small sketch illustrating all three approaches follows.
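This is a minimal numpy sketch (not from the slides) applying the three approaches to the taxi matrix, with states ordered (D, S, N); the variable names are my own.

```python
import numpy as np

P = np.array([[0.4, 0.5, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])  # taxi chain, states ordered (D, S, N)

# 1) Brute force: a high power of P; every row approximates pi'.
pi_power = np.linalg.matrix_power(P, 100)[0]

# 2) Eigen decomposition: left eigenvector of P for eigenvalue 1
#    (a right eigenvector of P transpose), normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi_eig = v / v.sum()

# 3) Linear system: solve pi'(P - I) = 0 together with sum(pi) = 1.
s = P.shape[0]
A = np.vstack([P.T - np.eye(s), np.ones(s)])
b = np.concatenate([np.zeros(s), [1.0]])
pi_lin, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi_power, pi_eig, pi_lin)  # all three give approximately [0.3, 0.4, 0.3]
```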
Rate of convergence
Theorem: The eigenvalues $\lambda_1, \ldots, \lambda_s$ of a Markov transition matrix P satisfy the inequality $|\lambda_i| \leq 1, \; \forall i = 1, \ldots, s$.
Theorem: All Markov chain transition matrices have at least one eigenvalue equal to 1.

If all eigenvalues are positive and $\lambda_1 > \cdots > \lambda_s$, then $\lambda_1 = 1$ and the size of the next eigenvalue $\lambda_2$ indicates the speed of convergence as we approach equilibrium. The reason is that it describes how quickly the largest of the vanishing terms will approach zero in

$$\mathbf{P}^n = \sum_{i=1}^{s} \lambda_i^n\, \mathbf{r}_i\, \mathbf{l}_i'$$

where $\mathbf{r}_i$ are the right eigenvectors and $\mathbf{l}_i$ are the left eigenvectors.
Note: There is no direct formula for the left eigenvectors in terms of the individual right eigenvectors; in practice they are obtained by inverting the matrix of right eigenvectors, or by computing the eigenvectors of P′.
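A small numerical check (not from the slides) for the taxi chain: its eigenvalues come out as roughly 1, 0.3 and 0, and the distance of $\mathbf{P}^n$ from its limit $\mathbf{1}\boldsymbol{\pi}'$ shrinks like $|\lambda_2|^n = 0.3^n$.

```python
import numpy as np

P = np.array([[0.4, 0.5, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])   # taxi chain, states (D, S, N)
pi = np.array([0.3, 0.4, 0.3])   # its stationary distribution

print(sorted(np.abs(np.linalg.eigvals(P)), reverse=True))  # about [1.0, 0.3, 0.0]

# Distance to equilibrium shrinks at the rate |lambda_2|^n = 0.3^n.
for n in (1, 2, 4, 8):
    dist = np.max(np.abs(np.linalg.matrix_power(P, n) - np.outer(np.ones(3), pi)))
    print(n, dist, 0.3 ** n)
```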
Business Example
Coke and Pepsi are the only companies in country X. A soda company wants to tie up with one of these competitors. They hire a market research company to find which of the brands will have a higher market share after 1 month and after 2 months. Currently, Pepsi owns 55% and Coke owns 45% of the market share. The following are the conclusions drawn by the market research company:
Likelihood Inference
Assume we have observed data s0, s1, … , sn at times 0, 1, … , n from a stationary Markov chain Xt. Given that sequence, we can estimate the transition matrix using the MLE method.
Then the likelihood is:

$$L(\mathbf{P}) = P(X_0 = s_0, \ldots, X_n = s_n) = P(X_0 = s_0) \prod_{i=0}^{n-1} P(X_{i+1} = s_{i+1} \mid X_i = s_i)$$

$$= P(X_0 = s_0) \prod_{i=0}^{n-1} p_{s_i s_{i+1}} = p_0 \prod_{i=1}^{s} \prod_{j=1}^{s} p_{ij}^{n_{ij}}$$

where $n_{ij}$ is the observed frequency of transitions from i to j.
The loglikelihood is:

$$\ell(\mathbf{P}) = \sum_{i=1}^{s} \sum_{j=1}^{s} n_{ij} \log p_{ij} + \log p_0$$
MLE
If you take the derivatives,

$$\frac{\partial \ell(\mathbf{P})}{\partial p_{ij}} = \frac{n_{ij}}{p_{ij}}$$

It can be shown that the MLE of $p_{ij}$ is

$$\hat{p}_{ij} = \frac{n_{ij}}{n_{i\cdot}}$$

where $n_{i\cdot} = \sum_{j=1}^{s} n_{ij}$.

Note: there is a chi-square test of whether the transition probabilities are independent of the current state, i.e. $p_{ij} = p_j$. It can be used to test whether a data sequence is independent or a Markov chain, and also to test convergence to equilibrium.
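A minimal sketch (not from the slides) of the MLE by transition counting, checked against a simulated path of the taxi chain; the function name and simulation details are my own.

```python
import numpy as np

def fit_transition_matrix(seq, n_states):
    """MLE of a homogeneous transition matrix: p_hat[i, j] = n_ij / n_i."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1            # n_ij: observed i -> j transitions
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Simulate the taxi chain (states 0=D, 1=S, 2=N), then re-estimate P from the path.
rng = np.random.default_rng(0)
P = np.array([[0.4, 0.5, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]])
x = [0]
for _ in range(5000):
    x.append(rng.choice(3, p=P[x[-1]]))
print(fit_transition_matrix(x, 3))   # close to P
```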
Extensions

One can extend the idea of a first-order Markov chain to chains of order m, where the probability of transition into s depends on the m previous states:

$$P(X_j = s \mid X_0 = s_0, \ldots, X_{j-1} = s_{j-1}) = P(X_j = s \mid X_{j-m} = s_{j-m}, \ldots, X_{j-1} = s_{j-1})$$

When m = 1 we have s(s − 1) free parameters. For a chain of order m there are $s^m$ such conditional probability vectors now!
Extensions

The state space S can be countably infinite. Then the transition “matrix” is infinite as well! The theoretical results are a lot more complicated.
A famous example is the random walk.
For example, a random walk on the integers ℤ can be defined as being at position z and changing to either z + 1 or z − 1 with probabilities

$$p_{z, z-1} = \frac{1}{2} + \frac{1}{2}\,\frac{z}{c + |z|}, \qquad p_{z, z+1} = 1 - p_{z, z-1}, \quad z \in \mathbb{Z}$$

where c > 0.
Note that the set of all these p forms an infinite matrix where most of the elements are zeros.
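A small simulation sketch (not from the slides) of this walk, using the transition probabilities as reconstructed above; the drift toward 0 (moving down is more likely when z > 0, less likely when z < 0) keeps the walk near the origin.

```python
import numpy as np

def simulate_walk(n_steps, c=1.0, z0=0, seed=0):
    """Random walk on Z with p(z, z-1) = 1/2 + (1/2) * z / (c + |z|)."""
    rng = np.random.default_rng(seed)
    z, path = z0, [z0]
    for _ in range(n_steps):
        p_down = 0.5 + 0.5 * z / (c + abs(z))
        z = z - 1 if rng.random() < p_down else z + 1
        path.append(z)
    return path

path = simulate_walk(10_000)
print(min(path), max(path))   # stays within a modest range around 0
```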
Road ahead

• Markov chain theory is most widely used in Markov chain Monte Carlo (MCMC) methods in statistics.
• However, in MCMC the state space is usually uncountable, for
example, 𝑆 = (−∞, ∞), that is, 𝑋𝑛 ∈ 𝑆 would be a continuous
random variable.
• The transition matrix becomes a transition kernel
(A matrix can be regarded as an operator from vectors in ℝ𝑠 to ℝ𝑠 ,
whereas the kernel transforms one density function to another
density function)
• The equilibrium distribution is not a vector but a pdf.
• The conditions for ergodicity are more complicated
• The eigenvectors are called eigenfunctions.
• There are infinitely many eigenvalues!
Example

• Consider (again) a random walk defined as

$$X_n = \sum_{i=0}^{n} Z_i$$

where $Z_i \sim N(0, 1)$ are independent.
Then the transition kernel is a conditional density given by

$$k(\cdot \mid x) = N(x, 1)$$

This should be interpreted as the conditional density of $X_{n+1}$ given that $X_n = x$.
What does this mean?

$$P(a < X_{n+1} < b \mid X_n = x) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(y - x)^2}{2}}\, dy$$

Note that the Lebesgue measure is the invariant measure in this case.
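A small sketch (not from the slides) evaluating this one-step probability with the normal cdf and checking it by simulating one step of the walk; it assumes scipy is available.

```python
import numpy as np
from scipy.stats import norm

def kernel_prob(a, b, x):
    """P(a < X_{n+1} < b | X_n = x) for the Gaussian random-walk kernel N(x, 1)."""
    return norm.cdf(b, loc=x, scale=1.0) - norm.cdf(a, loc=x, scale=1.0)

# Monte Carlo check: one step of the walk from x, i.e. X_{n+1} = x + Z with Z ~ N(0, 1).
rng = np.random.default_rng(0)
x, a, b = 0.5, 0.0, 2.0
draws = x + rng.standard_normal(100_000)
print(kernel_prob(a, b, x), np.mean((a < draws) & (draws < b)))
```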
