Lectures 7 and 8

This document provides an overview of Markov chains, hidden Markov models, and Markov random fields. It discusses Gibbs sampling for Markov chains and the forward-backward algorithm for hidden Markov models. Key applications covered include state estimation, prediction, smoothing, and finding the most probable path in hidden Markov models.

Lecture 7

Markov Chain
Hidden Markov Models
(Markov Random Fields)
• Markov Chains
– Gibbs Sampling
• Hidden Markov Models
– State Estimation,
– Prediction,
– Smoothing,
– Most Probable Path
• (Markov Random Fields)
Background
In dynamic systems:
State Estimation – Estimating the current state
of the system given current knowledge.
Prediction – Estimating future state(s) of the
system given current knowledge.
Smoothing – Estimating prior states of the
system given current knowledge.
Background
Independence
P(A,B)=P(A)P(B)
Conditional Independence
P(A,B|C)=P(A|C)P(B|C)
P(A|B,C)=P(A|C)
Chain Rule
P(A,B,C)=P(C|A,B)P(B|A)P(A)
Markov Chains
• Initial state, or distribution over possible initial states.
• Transition probabilities
– Markov Condition: State at time t+1 depends only on the state
at time t. (Relaxing this leads to higher-order MCs.)
– I.e. the current state is conditionally independent of all prior
states except the preceding one.
          Bull    Stagnant   Bear
Bull      .9      .025       .075
Stagnant  .25     .5         .25
Bear      .15     .05        .8

[Figure: two nodes, Market at time t → Market at time t+1]
Markov Chains
• Using nodes to represent variables & conditional
distributions
• Conditioned upon variables indicated by edges.

1st Order Markov Chain: … → S_t → S_t+1 → S_t+2 → …
(each state conditioned on the one preceding state)

2nd Order Markov Chain: … S_t, S_t+1 → S_t+2 …
(each state conditioned on the two preceding states)
Markov Chain
Transition probabilities are collected in the transition matrix T.
Simple state prediction (1st order):

X_{t+n} = X_t T^n

(with X_t the state distribution at time t, written as a row vector).

The (left) eigenvector of T with eigenvalue 1 gives the stationary
(equilibrium) distribution, i.e. the 'long run' distribution of the MC.
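A minimal NumPy sketch (not from the slides) of n-step prediction and of reading off the equilibrium distribution as the eigenvector for eigenvalue 1, using the bull/stagnant/bear transition matrix above:

import numpy as np

# Transition matrix from the bull/stagnant/bear example (rows sum to 1).
T = np.array([[0.90, 0.025, 0.075],
              [0.25, 0.500, 0.250],
              [0.15, 0.050, 0.800]])

x0 = np.array([1.0, 0.0, 0.0])           # start in the "bull" state

# n-step prediction: X_{t+n} = X_t T^n
x10 = x0 @ np.linalg.matrix_power(T, 10)

# Stationary distribution: left eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

print(x10)   # distribution after 10 steps
print(pi)    # long-run (equilibrium) distribution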
Sampling From a Markov Chain
• Provides a sequence of sampled state values
corresponding to times 0 to t: S_0, S_1, S_2, …, S_t
• Sample from the initial state distribution to
obtain the sample value S_0.
• Construct the distribution P(S_i | S_{i-1}) using
S_{i-1} and the transition matrix. Then sample
from this distribution to obtain S_i.
• Note: Samples are not independent.
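A short sketch of this sampling procedure, assuming a generic transition matrix T and initial distribution p0 as above (illustrative code, not part of the course material):

import numpy as np

def sample_chain(T, p0, n_steps, rng=np.random.default_rng()):
    """Draw a state sequence S_0, ..., S_t from a first-order Markov chain.

    T  : transition matrix, T[i, j] = P(S_t = j | S_{t-1} = i)
    p0 : initial state distribution
    """
    states = [rng.choice(len(p0), p=p0)]                   # sample S_0 from p0
    for _ in range(n_steps):
        prev = states[-1]
        states.append(rng.choice(T.shape[1], p=T[prev]))   # sample S_i given S_{i-1}
    return states

# Example with the market chain above:
T = np.array([[0.90, 0.025, 0.075],
              [0.25, 0.500, 0.250],
              [0.15, 0.050, 0.800]])
print(sample_chain(T, p0=np.array([1/3, 1/3, 1/3]), n_steps=20))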
Markov Chain Monte Carlo
- Generate a (1st order) MC that (in its equilibrium
state) represents the target distribution.
- Proceed to generate samples from it by evolving
the MC.
- As the number of samples approaches infinity, the
sampled distribution approaches the actual
equilibrium distribution.
- Burn-in period: discard the initial samples generated before the chain reaches equilibrium.
- Keep only every nth sample to reduce the correlation between successive samples.
- Blackboard example...
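Since the blackboard example is not reproduced here, the following is a generic Metropolis-style MCMC sketch over a small discrete target distribution, illustrating the burn-in period and keeping every nth sample; the target values and parameters are made up for illustration:

import numpy as np

def metropolis(target, n_samples, burn_in=1000, thin=10, rng=np.random.default_rng()):
    """Metropolis sampler for an unnormalized discrete target distribution.

    target : array of unnormalized probabilities over states 0..K-1
    Returns samples taken after the burn-in period, keeping every `thin`-th one.
    """
    K = len(target)
    state = rng.integers(K)
    samples = []
    for i in range(burn_in + n_samples * thin):
        proposal = rng.integers(K)                          # symmetric proposal
        accept = min(1.0, target[proposal] / target[state])
        if rng.random() < accept:
            state = proposal
        if i >= burn_in and (i - burn_in) % thin == 0:
            samples.append(state)
    return np.array(samples)

# The empirical distribution of the samples approaches the normalized target.
target = np.array([3.0, 1.0, 6.0])                          # unnormalized
s = metropolis(target, n_samples=5000)
print(np.bincount(s, minlength=3) / len(s))                 # compare with target / target.sum()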
Hidden Markov Models
• State of system is hidden from us.
• Some observation related to the state is available
to us.
– Require sensor/emission probabilities, E.
– Assume observations depend only on current state.
(Conditionally independent of all other states and
observations.)

[Figure: chain S1 → S2 → S3 → … → Sn, with each state Si emitting an observation Oi]
Hidden Markov Models
• Prediction: Just as in Markov Chains...

P(S_{t+n} | S_t) = P(S_t) T^n


Hidden Markov Models
Note the * notation (marginalizing over the m possible values of S_{t-1}):

P*(S_t | S_{t-1}) = Σ_{i=1}^{m} P(S_t | S_{t-1} = i) P(S_{t-1} = i)
Hidden Markov Models
We will make use of Bayes Rule:
P(X | Y) = P(Y | X) P(X) / P(Y)

When Y is observed, this becomes:

P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y)
Hidden Markov Models
State Estimation, t > 0:

P(S_t | O_{1:t}, S_0) = P(S_t | S_{t-1}, O_t) P(S_{t-1} | O_{1:t-1}, S_0)

Note the recursion: the quantity on the left appears again on the right as
P(S_{t-1} | O_{1:t-1}, S_0), one step earlier.

So we can proceed iteratively, basing our estimation of S_t only on our
estimation of S_{t-1} and the observation O_t.
Hidden Markov Models
• State Estimation

P(S_t | S_{t-1}, O_t) = P(O_t | S_t) P*(S_t | S_{t-1}) / P(O_t)
                      ∝ P(O_t | S_t) P*(S_t | S_{t-1})

Remember: The previous state estimate carries all relevant information
from the past!

[Figure: S_{t-1} → S_t → O_t, with T: P(S_t | S_{t-1}) and E: P(O_t | S_t),
the sensor/emission model]
Hidden Markov Models
[Figure: S_{t-1} → S_t → O_t]

Transition P(S_t | S_{t-1}):        Sensor P(O_t | S_t):
S_{t-1}   S_t=T   S_t=F             S_t   O_t=T   O_t=F
T         .9      .1                T     .3      .7
F         .3      .7                F     .1      .9

P(S_t | S_{t-1}, O_t) ∝ P(O_t | S_t) P*(S_t | S_{t-1})

• Let our belief regarding S_0 be that it is 80% likely S_0 = T.
• Let us observe O_1 = F.

P(S_1 | S_0) = <(.8)(.9) + (.2)(.3), (.8)(.1) + (.2)(.7)> = <.78, .22>
P(O_1 = F | S_1) = <.7, .9>
P(S_1 | S_0, O_1 = F) ∝ <(.78)(.7), (.22)(.9)> = <.546, .198>
P(S_1 | S_0, O_1 = F) = <.546/(.546+.198), .198/(.546+.198)> ≈ <.734, .266>
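The same filtering step, written as a small NumPy check of the numbers above (illustrative, not required course code):

import numpy as np

T = np.array([[0.9, 0.1],    # P(S_t | S_{t-1}); rows: S_{t-1} = T, F
              [0.3, 0.7]])
E = np.array([[0.3, 0.7],    # P(O_t | S_t); rows: S_t = T, F; columns: O_t = T, F
              [0.1, 0.9]])

belief = np.array([0.8, 0.2])             # P(S_0)
predicted = belief @ T                     # P*(S_1 | S_0) = <.78, .22>
unnormalized = E[:, 1] * predicted         # observe O_1 = F, so take the O=F column
belief = unnormalized / unnormalized.sum()
print(belief)                              # ≈ [0.734, 0.266]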
Hidden Markov Models
[Figure: S_{t-1} → S_t, with S_t emitting three observations O1_t, O2_t, O3_t]

Transition P(S_t | S_{t-1}):
S_{t-1}   S_t=T   S_t=F
T         .6      .4
F         .5      .5

Emission densities (Normal, parameterized by mean and variance):
S_t   O1_t          O2_t          O3_t
T     N(3.5, 10)    N(45, 100)    N(0, .1)
F     N(5, 5)       N(55, 225)    N(0, .5)

P(S_t | S_{t-1}, O_t) ∝ ρ(O1_t | S_t) ρ(O2_t | S_t) ρ(O3_t | S_t) P*(S_t | S_{t-1})

• Let our belief regarding S_0 be that it is 50% likely S_0 = T.
• Let us observe O1_1 = 6.103, O2_1 = 54.7 and O3_1 = .154.

P(S_1 | S_0) = <(.5)(.6) + (.5)(.5), (.5)(.4) + (.5)(.5)> = <.55, .45>
ρ(O1_1 = 6.103 | S_1) ≈ <.089, .158>
ρ(O2_1 = 54.7 | S_1) ≈ <.025, .027>
ρ(O3_1 = .154 | S_1) ≈ <1.120, .551>
P(S_1 | S_0, O1_1, O2_1, O3_1) ∝ <(.55)(.089)(.025)(1.120), (.45)(.158)(.027)(.551)>
                               ≈ <.00137, .00106>
P(S_1 | S_0, O1_1, O2_1, O3_1) ≈ <.00137/(.00137+.00106), .00106/(.00137+.00106)>
                               ≈ <.564, .436>
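A corresponding check of the continuous-observation example using scipy.stats.norm densities (assuming, as above, that the second parameter of N(·,·) is the variance, which matches the quoted density values):

import numpy as np
from scipy.stats import norm

T = np.array([[0.6, 0.4],      # P(S_t | S_{t-1}); rows S_{t-1} = T, F
              [0.5, 0.5]])
# Emission densities per state (rows: T, F); variances as given on the slide.
means = np.array([[3.5, 45.0, 0.0],
                  [5.0, 55.0, 0.0]])
vars_ = np.array([[10.0, 100.0, 0.1],
                  [ 5.0, 225.0, 0.5]])

belief = np.array([0.5, 0.5])                        # P(S_0)
obs = np.array([6.103, 54.7, 0.154])                 # O1_1, O2_1, O3_1

predicted = belief @ T                               # <.55, .45>
likelihood = norm.pdf(obs, loc=means, scale=np.sqrt(vars_)).prod(axis=1)
unnormalized = likelihood * predicted
print(unnormalized / unnormalized.sum())             # ≈ [0.564, 0.436]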
Hidden Markov Models: Lab B
• State Estimation
– Given an initial state, transition and sensor probabilities,
we can iteratively calculate the distribution at each
subsequent state.
– We can do this online.
Note that for Lab B
– Vector of 3 observations (as in last example)
– Sparse transition matrix (many impossible transitions).
Assume random walk with possibility of staying still.
– Presumably uniform initial state.
– NOT real time.
Hidden Markov Models
Smoothing: The Forward-Backward Algorithm
P(S_{s≤t} | O_{0:t}, S_0) ∝ P(S_{s≤t} | O_{0:s}, S_0) P(S_{s≤t} | O_{s+1:t})

The Forward Algorithm:
• We have seen how, given an initial state, transition and sensor
probabilities, we can iteratively calculate P(S_s | O_{0:s}, S_0) for 1 ≤ s ≤ t.

The Backward Algorithm:
• Starting at t, we can iteratively calculate: P(S_s | O_{s+1:t})
Hidden Markov Models
P(S_s | O_{s+1:t}) = P(S_s | O_{s+1}, O_{s+2:t})
= Σ_{S_{s+1}} P(S_s | S_{s+1}) P(S_{s+1} | O_{s+1:t})
= Σ_{S_{s+1}} P(S_s | S_{s+1}) P(S_{s+1} | O_{s+1}, O_{s+2:t})
∝ Σ_{S_{s+1}} P(S_s | S_{s+1}) P(O_{s+1} | S_{s+1}) P(S_{s+1} | O_{s+2:t})
∝ P(S_s) Σ_{S_{s+1}} P(S_{s+1} | S_s) P(O_{s+1} | S_{s+1}) P(S_{s+1} | O_{s+2:t})

– Note the recursion.
– We have a base case since:
P(O_{t+1:t} | S_t) = P(∅ | S_t) = 1
– We actually iterate backwards from t.
Hidden Markov Models
The forward-backward algorithm for all states, in matrix form (O_i denotes
the diagonal matrix of sensor likelihoods P(O_i | S_i)):
Forward chain:
f_0 = S_0
f_i = f_{i-1} T O_i
Backward chain:
b_t = 1
b_i = O_{i+1} T b_{i+1}
Combination:
P(S_i) ∝ f_i b_i
Blackboard example...
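A sketch of the forward-backward recursions in NumPy, following the standard matrix formulation (per-step normalization is added for numerical convenience; the example tables, observation indices and the helper name forward_backward are illustrative, not course code):

import numpy as np

def forward_backward(T, E, observations, prior):
    """Smoothed state distributions P(S_i | O_{1:t}) for a discrete HMM.

    T            : T[i, j] = P(S_t = j | S_{t-1} = i)
    E            : E[i, k] = P(O_t = k | S_t = i)
    observations : observed symbol indices O_1 ... O_t
    prior        : P(S_0)
    """
    n, K = len(observations), len(prior)
    f = np.zeros((n + 1, K))
    b = np.ones((n + 1, K))
    f[0] = prior
    for i, o in enumerate(observations, start=1):           # forward pass
        f[i] = (f[i - 1] @ T) * E[:, o]
        f[i] /= f[i].sum()
    for i in range(n - 1, -1, -1):                           # backward pass
        o = observations[i]                                   # this is O_{i+1} (1-based)
        b[i] = T @ (E[:, o] * b[i + 1])
        b[i] /= b[i].sum()
    smoothed = f * b                                          # P(S_i) ∝ f_i b_i
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Example using the earlier T/F transition and sensor tables:
T = np.array([[0.9, 0.1], [0.3, 0.7]])
E = np.array([[0.3, 0.7], [0.1, 0.9]])   # columns: O = T, F
print(forward_backward(T, E, observations=[1, 0, 1], prior=[0.8, 0.2]))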
Hidden Markov Models
• Most Probable Path: Viterbi Algorithm
– Calculate the path probabilities as in the Dynamic
Programming for Path Finding.
• Difference: Multiplicative instead of additive
accumulation function.
• Does not find probability of most probable path unless
normalization (over all paths) occurs at each step.
– Using log probabilities is useful (products become sums, avoiding underflow).
– Blackboard example...
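A generic Viterbi sketch using log probabilities (not the blackboard example); conventions match the forward-backward sketch above, with the prior taken over S_0:

import numpy as np

def viterbi(T, E, observations, prior):
    """Most probable state path for a discrete HMM, using log probabilities."""
    logT, logE = np.log(T), np.log(E)
    n, K = len(observations), len(prior)
    # Best log-probability of any path ending in each state after O_1.
    delta = np.log(np.asarray(prior) @ T) + logE[:, observations[0]]
    backptr = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + logT                       # scores[s, s']: from s into s'
        backptr[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logE[:, observations[i]]
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):                             # trace back pointers
        path.append(int(backptr[i][path[-1]]))
    return path[::-1]

T = np.array([[0.9, 0.1], [0.3, 0.7]])
E = np.array([[0.3, 0.7], [0.1, 0.9]])
print(viterbi(T, E, observations=[1, 1, 0, 1], prior=[0.8, 0.2]))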
Markov Random Fields
An undirected network of models, each
specifying the conditional distribution for a
variable given its neighbors.
Note graph convention: Nodes represent variables, with a conditional
distribution associated with each node. Edges into a node indicate which
variables it is conditioned upon.

[Figure: five-node graph with node distributions P(A|B,D), P(B|A,D),
P(C|E), P(D|A,B,E), P(E|C,D)]
Markov Random Fields
• Inference:
Given the variables whose states we know, we infer the states of those we
do not know by sampling...
• Gibbs Sampler
– Divide domain into known and unknown variables.
– Assign unknown variables a random value.
– We iterate through unknown variables, calculating a new
value given the values assigned to their neighbors.
– After each iteration, we record a sample.
– We estimate distributions of interest from these samples.
Example on blackboard.
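A minimal Gibbs-sampler sketch for a small binary MRF (the toy graph, the conditional-probability function and the helper name gibbs_mrf are made up for illustration; this is not the blackboard example):

import numpy as np

def gibbs_mrf(neighbors, cond_prob, known, n_vars, n_sweeps=5000, burn_in=500,
              rng=np.random.default_rng()):
    """Gibbs sampling for a binary MRF.

    neighbors : dict, node -> list of neighboring nodes
    cond_prob : function (node, neighbor_values) -> P(node = 1 | neighbors)
    known     : dict of observed node values, node -> 0/1
    Returns Monte Carlo estimates of P(node = 1) for the unknown nodes.
    """
    # Assign unknown variables a random value; keep known ones fixed.
    state = {v: known.get(v, rng.integers(2)) for v in range(n_vars)}
    unknown = [v for v in range(n_vars) if v not in known]
    counts = {v: 0 for v in unknown}
    n_kept = 0
    for sweep in range(n_sweeps):
        for v in unknown:                       # resample each unknown variable in turn
            nb_vals = [state[u] for u in neighbors[v]]
            state[v] = int(rng.random() < cond_prob(v, nb_vals))
        if sweep >= burn_in:                    # record a sample after each full sweep
            n_kept += 1
            for v in unknown:
                counts[v] += state[v]
    return {v: counts[v] / n_kept for v in unknown}

# Toy 3-node chain 0 - 1 - 2 where each node tends to agree with its neighbors.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
agree = lambda v, nb: 0.9 if np.mean(nb) > 0.5 else (0.5 if np.mean(nb) == 0.5 else 0.1)
print(gibbs_mrf(neighbors, agree, known={0: 1}, n_vars=3))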
Markov Random Fields
• Example: Letter recognition
1. Gather lots of examples of written letters of the
relevant alphabet (and scale them to a normal
size).
2. Divide the letters into segments.
3. Pick a characteristic of the image found in these
segments: e.g. number of curves, or number of
vertices (meeting points of curves).
Markov Random Fields
• We notice that the probability distribution for a
given characteristic, for a given letter, is
dependent on its neighboring segments.
• The variables in the model are discrete. The
distributions are all conditional multinomials.

[Figure: two example letter 'A' images divided into segments]
Markov Random Fields
Training:
Train a model for each letter using Bayesian
methods:
• Use Dirichlet/count statistics to estimate the
true distributions from the training data for
each letter.
• This will give you models for expected
distribution of particular letters.
Example on blackboard.
Markov Random Fields
Classifying:
Given a new letter-image, we:
• Divide it into segments and classify each segment by
the chosen characteristic (e.g. number of curves).
• Calculate the probability of this set of characteristic
values for the segments under each of our letter models.
• Normalizing these values gives us the probability of the
letter-image being a given letter.
Example on blackboard.
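A minimal sketch of this classification step, assuming each letter model has been reduced to a per-segment categorical table (the conditioning on neighboring segments described earlier is omitted for brevity; the models and values are hypothetical):

# model[letter][segment][value] = P(characteristic = value | letter, segment)
def classify(models, segment_values, prior=None):
    letters = list(models)
    prior = prior or {l: 1.0 / len(letters) for l in letters}
    scores = {}
    for letter in letters:
        p = prior[letter]
        for seg, val in enumerate(segment_values):
            p *= models[letter][seg][val]        # probability of this segment's characteristic
        scores[letter] = p
    total = sum(scores.values())
    return {l: s / total for l, s in scores.items()}   # normalize to get P(letter | image)

# Hypothetical two-letter example with 2 segments and characteristic values 0..2:
models = {"A": [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]],
          "B": [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]}
print(classify(models, segment_values=[1, 1]))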
Categorical Distributions
Categorical distributions use n parameters to specify the probability
distribution of an n-valued random variable. Parameter i gives the
probability of the variable taking the ith value. The degrees of freedom of
such a distribution is n-1, since we have the constraint:

Σ_{j=1}^{n} P(X = x_j) = 1

Here is a three-valued categorical distribution representing the result of
a match:

Result:   Win   Draw   Loss
Prob.     .6    .3     .1
Conditional Categorical Distributions

Conditional categorical distributions P(Y|X) give a categorical distribution
for each possible value of the discrete variables being conditioned upon.
Here is a conditional categorical distribution representing the distribution
over the result of a match given the values taken by the location and
weather variables:
Location Weather Win Draw Loss
Home Raining .2 .7 .1
Home Normal .8 .15 .05
Home Hot .6 .2 .1
Away Raining .1 .8 .1
Away Normal .5 .4 .1
Away Hot .2 .6 .2

This is a classifier that gives a distribution for an output variable Y given input
variables X.
Maximum Likelihood & Count
Parameters
Take a discrete variable X: {x_1, x_2, …, x_n}, distributed cat(p_1, p_2, …, p_n).
Let us track the number of times that we have seen X take particular
values with the count parameters: {c_1, c_2, …, c_n}
The maximum likelihood value of the parameters p_1, p_2, …, p_n is the
value that makes the observations most probable. It is:

p_i = c_i / Σ_{j=1}^{n} c_j
• The ratio of the count parameters gives us our ML estimation for
the distribution parameters.
• As the count parameters increase, new observations will alter the
ML estimation of the distribution parameters less and less.
Count Parameters & Adaptation

Using counts makes it easy to adapt our parameter estimates: We simply
add to the counts as observations occur and adjust accordingly.

"My knowledge of this coin is given by the counts <2,1>, since I have
flipped it 3 times and it has come up heads 2 of those 3."

"Now my knowledge of this coin is given by the counts <3,1>, since I have
flipped it 4 times and it has come up heads 3 of those 4."

We can adapt to soft evidence too: If we hear that another coin toss has
occurred from someone who cannot remember the result for sure, but is
75% sure that it was heads, we would have the counts <2.75,1.25>.
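A small sketch of the count-parameter arithmetic above, including the soft-evidence update (values taken from the coin example; illustrative code only):

import numpy as np

def ml_estimate(counts):
    """Maximum-likelihood categorical parameters from count parameters."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

counts = np.array([2.0, 1.0])                 # <heads, tails> after 3 flips
print(ml_estimate(counts))                    # [0.667, 0.333]

print(ml_estimate(counts + [1, 0]))           # another head: counts <3, 1> -> [0.75, 0.25]

# Soft evidence: someone is 75% sure an extra toss came up heads.
print(ml_estimate(counts + [0.75, 0.25]))     # counts <2.75, 1.25> -> [0.688, 0.313]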
Count Parameters & Conditional
Distributions

For conditional distributions, we keep counts under each set of possible
conditions of the variables being conditioned upon.

"My beliefs about the ability of this team are given by the counts:

Rain   Win   Draw   Loss
T      0     1      4
F      3     0      0

Since I've seen them play in the rain 5 times, and of these they lost 4 and
drew 1, and I've seen them play without rain 3 times, and they won all 3!"

"They won in the rain, so now my beliefs are given by the counts:

Rain   Win   Draw   Loss
T      1     1      4
F      3     0      0"
Knowledge
Easy to encode an expert’s knowledge about
probabilities and their confidence in their
estimation:
Get them to specify their knowledge ’as if’ they
had seen a particular set of observations.
Count Parameters and Dirichlet
Distributions
The count parameters can be interpreted as the parameters of
a Dirichlet distribution over our belief regarding the correct
value of the parameters in the categorical distribution.

Dir(x; c_1, c_2, …, c_n) = [ Γ(Σ_{i=1}^{n} c_i) / Π_{i=1}^{n} Γ(c_i) ] Π_{i=1}^{n} x_i^(c_i - 1),

with support x_1, …, x_{n-1}, where x_i ∈ (0,1) and Σ_{i=1}^{n-1} x_i < 1.

Note: x_n is implicit from the constraint.


Just understand the relationship between the shape and the
parameters! See examples: abn::dir.plot(…).
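For intuition, scipy's Dirichlet can stand in for the abn::dir.plot examples used in the course: the mean of Dir(c_1, …, c_n) is c_i / Σ c_j, matching the ML count interpretation, and the distribution tightens as the counts grow (a sketch, not course material):

import numpy as np
from scipy.stats import dirichlet

counts = np.array([2.0, 1.0])                 # <heads, tails> interpreted as Dirichlet parameters

print(dirichlet.mean(counts))                 # [0.667, 0.333], same as the ML estimate
print(dirichlet.var(counts))                  # spread of our belief about the parameters
print(dirichlet.mean(counts * 10))            # same mean with ten times the counts, ...
print(dirichlet.var(counts * 10))             # ... but a much smaller variance (more confidence)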
Count Parameters and Dirichlet
Distributions
We can obtain confidence estimates for the parameters of our categorical
distribution given our observations and prior beliefs (pseudo-observation
counts).
Working with Dirichlets is beyond the scope of this course.
Ignorance & Conservatism
A common choice is to model ignorance ’as if’
we had seen all values occur once. This is
because otherwise we would jump to certainty
after a single observation! (Why?)
Dirichlet distributions accord with this
convention: Dirichlet distributions of all ones are
uniform over possible parameters.
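For example: starting from counts <0,0>, a single observed head gives counts <1,0> and the ML estimate <1, 0>, i.e. certainty that the coin never lands tails. Starting instead from the ignorance counts <1,1>, the same observation gives counts <2,1> and the estimate <2/3, 1/3>, which shifts toward heads without ruling tails out.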
