Lectures 7 and 8
Markov Chain
Hidden Markov Models
(Markov Random Fields)
• Markov Chains
– Gibbs Sampling
• Hidden Markov Models
– State Estimation,
– Prediction,
– Smoothing,
– Most Probable Path
• (Markov Random Fields)
Background
In dynamic systems:
State Estimation – Estimating the current state
of the system given current knowledge.
Prediction – Estimating future state(s) of the
system given current knowledge.
Smoothing – Estimating prior states of the
system given current knowledge.
Background
Independence
P(A,B)=P(A)P(B)
Conditional Independence
P(A,B|C)=P(A|C)P(B|C)
P(A|B,C)=P(A|C)
Chain Rule
P(A,B,C)=P(C|A,B)P(B|A)P(A)
Markov Chains
• Initial state, or distribution over possible initial states.
• Transition probabilities
– Markov Condition: the state at time t+1 depends only on the state
at time t. (Relaxing this gives higher-order MCs.)
– I.e., the current state is conditionally independent of all prior
states except the immediately preceding one.
Transition matrix T: market state at time t (rows) to market state at time t+1 (columns).

          Bull   Stagnant   Bear
Bull      .9     .025       .075
Stagnant  .25    .5         .25
Bear      .15    .05        .8
Markov Chains
• Nodes represent variables and their conditional distributions.
• Edges indicate the variables they are conditioned upon.

… → S_t → S_{t+1} → S_{t+2} → …
Markov Chain
The transition probabilities form the transition matrix T.
Simple state prediction (1st order), with X_t a column vector of state probabilities:

X_{t+n} = X_t^⊤ T^n
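A minimal Python sketch of this prediction (NumPy assumed), reusing the market transition matrix from the previous slide:

```python
import numpy as np

# Transition matrix T: rows are the state at time t, columns the state at t+1.
T = np.array([[0.90, 0.025, 0.075],   # from Bull
              [0.25, 0.500, 0.250],   # from Stagnant
              [0.15, 0.050, 0.800]])  # from Bear

x_t = np.array([1.0, 0.0, 0.0])       # current belief: certainly a bull market

# X_{t+n} = X_t^T T^n: propagate the belief n steps forward.
n = 5
x_t_plus_n = x_t @ np.linalg.matrix_power(T, n)
print(x_t_plus_n)                     # distribution over (Bull, Stagnant, Bear)
```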
Hidden Markov Models
• Prediction: Just as in Markov Chains...
[Diagram: S_{t-1} → S_t, with emission S_t → O_t]

Transition model T: P(S_t | S_{t-1})
Sensor model S: P(O_t | S_t)
Hidden Markov Models
[Diagram: S_{t-1} → S_t → O_t]

Transition model P(S_t | S_{t-1}):

            S_t=T   S_t=F
S_{t-1}=T   .9      .1
S_{t-1}=F   .3      .7

Sensor model P(O_t | S_t):

        O_t=T   O_t=F
S_t=T   .3      .7
S_t=F   .1      .9

P(S_t | S_{t-1}, O_t) ∝ P(O_t | S_t) · P(S_t | S_{t-1})

• Let our belief regarding S_0 be that it is 80% likely S_0=T.
• Let us observe O_1=F.

P(S_1 | S_0) = <(.8)(.9)+(.2)(.3), (.8)(.1)+(.2)(.7)> = <.78, .22>
P(O_1=F | S_1) = <.7, .9>
P(S_1 | S_0, O_1=F) ∝ <(.78)(.7), (.22)(.9)> = <.546, .198>
P(S_1 | S_0, O_1=F) = <.546/(.546+.198), .198/(.546+.198)> ≈ <.734, .266>
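The same calculation as a minimal Python sketch; the state order <T, F> and all numbers are taken from the example above:

```python
trans  = [[0.9, 0.1],   # P(S_t | S_{t-1}), rows S_{t-1} = T, F
          [0.3, 0.7]]
sensor = [[0.3, 0.7],   # P(O_t | S_t), rows S_t = T, F; columns O_t = T, F
          [0.1, 0.9]]

belief = [0.8, 0.2]     # prior belief over S_0

# Predict: P(S_1) = sum over s0 of P(S_1 | s0) P(s0)  ->  <.78, .22>
pred = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]

# Update on O_1 = F (column index 1): weight by P(O_1 = F | S_1), normalize.
obs = 1
unnorm = [pred[j] * sensor[j][obs] for j in range(2)]  # <.546, .198>
z = sum(unnorm)
belief = [u / z for u in unnorm]
print(belief)           # ~<.734, .266>
```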
Hidden Markov Models
[Diagram: S_{t-1} → S_t, with three observation nodes O¹_t, O²_t, O³_t]

Transition model P(S_t | S_{t-1}):

            S_t=T   S_t=F
S_{t-1}=T   .6      .4
S_{t-1}=F   .5      .5

Sensor models, written N(mean, variance):

        O¹_t         O²_t          O³_t
S_t=T   N(3.5, 10)   N(45, 100)    N(0, .1)
S_t=F   N(5, 5)      N(55, 225)    N(0, .5)

P(S_t | S_{t-1}, O_t) ∝ ρ(O¹_t | S_t) · ρ(O²_t | S_t) · ρ(O³_t | S_t) · P(S_t | S_{t-1})

• Let our belief regarding S_0 be that it is 50% likely S_0=T.
• Let us observe O¹_1 = 6.103, O²_1 = 54.7 and O³_1 = .154.

P(S_1 | S_0) = <(.5)(.6)+(.5)(.5), (.5)(.4)+(.5)(.5)> = <.55, .45>
ρ(O¹_1 = 6.103 | S_1) ≈ <.089, .158>
ρ(O²_1 = 54.7 | S_1) ≈ <.025, .027>
ρ(O³_1 = .154 | S_1) ≈ <1.120, .551>
P(S_1 | S_0, O¹_1, O²_1, O³_1) ∝ <(.55)(.089)(.025)(1.120), (.45)(.158)(.027)(.551)> ≈ <.00137, .00106>
P(S_1 | S_0, O¹_1, O²_1, O³_1) ≈ <.00137/(.00137+.00106), .00106/(.00137+.00106)> ≈ <.564, .436>

(ρ denotes a density value, so it can exceed 1, as ρ(O³_1 | S_1=T) ≈ 1.120 does.)
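A sketch of the same multi-sensor update in Python, assuming SciPy for the Gaussian densities. Note the tables above give N(mean, variance), while scipy.stats.norm takes a standard deviation:

```python
from math import sqrt
from scipy.stats import norm

trans = [[0.6, 0.4],              # P(S_t | S_{t-1}), rows S_{t-1} = T, F
         [0.5, 0.5]]
sensors = [                       # per sensor: (mean, variance) for S=T, S=F
    [(3.5, 10), (5, 5)],          # O1
    [(45, 100), (55, 225)],       # O2
    [(0, 0.1),  (0, 0.5)],        # O3
]
obs = [6.103, 54.7, 0.154]

belief = [0.5, 0.5]               # prior belief over S_0

# Predict: P(S_1 | S_0)  ->  <.55, .45>
pred = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]

# Update: multiply by each sensor's density at its observed value.
unnorm = list(pred)
for params, o in zip(sensors, obs):
    for j in range(2):
        mean, var = params[j]
        unnorm[j] *= norm.pdf(o, loc=mean, scale=sqrt(var))

z = sum(unnorm)
print([u / z for u in unnorm])    # ~<.564, .436>
```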
Hidden Markov Models: Lab B
• State Estimation
– Given an initial state, transition and sensor probabilities,
we can iteratively calculate the distribution at each
subsequent state.
– We can do this online.
• Note that for Lab B:
– Vector of 3 observations (as in last example)
– Sparse transition matrix (many impossible transitions).
Assume random walk with possibility of staying still.
– Presumably uniform initial state.
– NOT real time.
Hidden Markov Models
Smoothing: The Forward-Backward Algorithm
P(S_s | O_{0:t}, S_0) ∝ P(S_s | O_{0:s}, S_0) · P(S_s | O_{s+1:t}),  for s ≤ t
The Forward Algorithm:
• We have seen how, given an initial state,
transition and sensor probabilities, we can
iteratively calculate P(S_s | O_{0:s}, S_0) for 1 ≤ s ≤ t.
The Backward Algorithm:
• Starting at t, we can iteratively calculate P(S_s | O_{s+1:t}).
Hidden Markov Models
P(S_s | O_{s+1:t}) = P(S_s | O_{s+1}, O_{s+2:t})
  = Σ_{S_{s+1}} P(S_s | S_{s+1}) P(S_{s+1} | O_{s+1:t})
  = Σ_{S_{s+1}} P(S_s | S_{s+1}) P(S_{s+1} | O_{s+1}, O_{s+2:t})
  ∝ Σ_{S_{s+1}} P(S_s | S_{s+1}) P(O_{s+1} | S_{s+1}) P(S_{s+1} | O_{s+2:t})
  ∝ Σ_{S_{s+1}} P(S_s) P(S_{s+1} | S_s) P(O_{s+1} | S_{s+1}) P(S_{s+1} | O_{s+2:t})

(The sums marginalize out S_{s+1}; the last two steps apply Bayes' rule and hold up to a normalizing constant.)
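A minimal sketch of forward-backward smoothing in Python, reusing the discrete model from the earlier filtering example. For simplicity the backward pass computes the standard message P(O_{s+1:t} | S_s), which matches the slide's P(S_s | O_{s+1:t}) up to the P(S_s) factor and normalization:

```python
trans  = [[0.9, 0.1],   # P(S_t | S_{t-1})
          [0.3, 0.7]]
sensor = [[0.3, 0.7],   # P(O_t | S_t)
          [0.1, 0.9]]

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def smooth(prior, obs):
    """prior: belief over S_0; obs: observation column indices for O_1..O_t."""
    # Forward pass: f[s] = P(S_{s+1} | O_{1:s+1})  (filtering, as before)
    f, belief = [], prior
    for o in obs:
        pred = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]
        belief = normalize([pred[j] * sensor[j][o] for j in range(2)])
        f.append(belief)
    # Backward pass: b[s] = P(O_{s+2:t} | S_{s+1}), run from t down to 1.
    b = [[1.0, 1.0]] * len(obs)
    for s in range(len(obs) - 2, -1, -1):
        o = obs[s + 1]
        b[s] = [sum(trans[i][j] * sensor[j][o] * b[s + 1][j] for j in range(2))
                for i in range(2)]
    # Combine: smoothed P(S_s | O_{1:t}) ∝ forward * backward.
    return [normalize([f[s][j] * b[s][j] for j in range(2)])
            for s in range(len(obs))]

print(smooth([0.8, 0.2], obs=[1, 0, 1]))  # e.g. O_1=F, O_2=T, O_3=F
```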
Markov Random Fields
• Inference:
Given the variables whose values we know, we estimate the
values of those we do not know by sampling (see the sketch
after this slide)...
• Gibbs Sampler
– Divide domain into known and unknown variables.
– Assign unknown variables a random value.
– We iterate through the unknown variables, sampling a new
value for each given the values currently assigned to its neighbors.
– After each iteration, we record a sample.
– We estimate distributions of interest from these samples.
Example on blackboard.
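A minimal sketch of these steps on a small binary MRF; the 4-cycle graph and Ising-style agreement potential are illustrative assumptions, not from the lecture:

```python
import math
import random

neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # a 4-cycle MRF
coupling = 0.8            # how strongly neighbors prefer to agree
known = {0: 1}            # variables whose values we know (clamped)
unknown = [1, 2, 3]

# Assign the unknown variables a random value.
state = dict(known)
for v in unknown:
    state[v] = random.choice([0, 1])

def sample_var(v):
    # P(x_v | neighbors) ∝ exp(coupling * number-of-agreeing-neighbors).
    score = {x: math.exp(coupling * sum(state[u] == x for u in neighbors[v]))
             for x in (0, 1)}
    p1 = score[1] / (score[0] + score[1])
    return 1 if random.random() < p1 else 0

samples = []
for sweep in range(5000):
    for v in unknown:             # resample each unknown given its neighbors
        state[v] = sample_var(v)
    samples.append(dict(state))   # record a sample after each iteration

# Estimate a distribution of interest, e.g. P(x_2 = 1 | x_0 = 1).
print(sum(s[2] for s in samples) / len(samples))
```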
Markov Random Fields
• Example: Letter recognition
1. Gather many examples of written letters of the
relevant alphabet (and scale them to a standard
size).
2. Divide the letters into segments.
3. Pick a characteristic of the image found in these
segments: e.g. the number of curves, or the number of
vertices (meeting points of curves).
Markov Random Fields
• We notice that the probability distribution for a
given characteristic, for a given letter, is
dependent on its neighboring segments.
• The variables in the model are discrete. The
distributions are all conditional multinomials.
[Figure: two example images of the letter A]
Markov Random Fields
Training:
Train a model for each letter using Bayesian
methods:
• Use Dirichlet/count statistics to estimate the
true distributions from the training data for
each letter.
• This will give you models for expected
distribution of particular letters.
Example on blackboard.
Markov Random Fields
Classifying:
Given a new letter-image, we:
• Divide it into segments and classify each segment by
the chosen characteristic (e.g. number of curves).
• Calculate the probability of this set of characteristic
values for the segments under each of our letter models.
• Normalizing these values gives us the probability of the
letter-image being a given letter (sketched below).
Example on blackboard.
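A sketch of this classification step, with hypothetical per-letter models giving a categorical distribution over the characteristic value for each segment (the neighbor dependence between segments is omitted for brevity):

```python
# Each letter maps to one categorical distribution per segment over the
# characteristic value (here: 0, 1, or 2 curves). Numbers are made up.
models = {
    "A": [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2], [0.3, 0.5, 0.2], [0.6, 0.3, 0.1]],
    "B": [[0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.2, 0.4, 0.4], [0.2, 0.5, 0.3]],
}

def classify(segment_values):
    # P(letter | values) ∝ P(values | letter), then normalize.
    scores = {}
    for letter, segs in models.items():
        p = 1.0
        for seg, v in zip(segs, segment_values):
            p *= seg[v]
        scores[letter] = p
    z = sum(scores.values())
    return {letter: s / z for letter, s in scores.items()}

print(classify([2, 2, 1, 0]))  # characteristic value observed in each segment
```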
Categorical Distributions
Categorical distributions use n parameters to specify the
probability distribution of an n-valued random variable.
The ith parameter gives the probability of the variable
taking the ith value. The degrees of freedom of such a
distribution is n-1, since we have the constraint:

Σ_{j=1}^{n} P(X = x_j) = 1

Result   Win   Draw   Loss
Prob.    .6    .3     .1

A conditional categorical distribution acts as a classifier: it gives a
distribution for an output variable Y given input variables X.
Maximum Likelihood & Count
Parameters
Take discrete variable 𝑋: {𝑥1 , 𝑥2 , … , 𝑥𝑛 }, distributed 𝑐𝑎𝑡(𝑝1 , 𝑝2 , … , 𝑝𝑛 ).
Let us track the number of times that we have seen 𝑋 take particular
values with the count parameters:
{𝑐1 , 𝑐2 , … , 𝑐𝑛 }
The maximum likelihood value of the parameters 𝑝1 , 𝑝2 , … , 𝑝𝑛 is the
value that makes the observations most probable. It is:
p_i = c_i / Σ_{j=1}^{n} c_j
• The ratio of the count parameters gives us our ML estimation for
the distribution parameters.
• As the count parameters increase, new observations will alter the
ML estimation of the distribution parameters less and less.
Count Parameters & Adaption
We can adapt to soft evidence too: if we hear that another coin toss has
occurred from someone who cannot remember the result for sure, but is
75% sure that it was heads, we add fractional counts, so counts <2, 1>
become <2.75, 1.25>.
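A minimal sketch of the count-parameter bookkeeping, covering both the ML estimate and the soft-evidence update above:

```python
counts = [2.0, 1.0]                      # heads, tails observed so far

def ml_params(counts):
    # p_i = c_i / sum_j c_j: the ML estimate from count parameters.
    total = sum(counts)
    return [c / total for c in counts]

print(ml_params(counts))                 # [0.667, 0.333]

# Soft evidence: a toss reported as 75% likely heads adds fractional counts.
counts = [counts[0] + 0.75, counts[1] + 0.25]
print(counts)                            # [2.75, 1.25]
print(ml_params(counts))                 # updated estimate [0.6875, 0.3125]
```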
Count Parameters & Conditional Distributions
Dir(c_1, c_2, …, c_n) = ( Γ(Σ_{i=1}^{n} c_i) / Π_{i=1}^{n} Γ(c_i) ) · Π_{i=1}^{n} x_i^{c_i - 1}

(As with the categorical distribution, there are n-1 degrees of freedom, since
the x_i must sum to 1.)
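For reference, this density can be evaluated with SciPy; the counts <3, 2, 1> and the point on the simplex below are illustrative:

```python
from scipy.stats import dirichlet

# Evaluate Dir(3, 2, 1) at x = (.6, .3, .1); the x values must sum to 1.
print(dirichlet.pdf([0.6, 0.3, 0.1], alpha=[3, 2, 1]))
```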