Day & Time: Monday (10am-11am & 3pm-4pm)
Tuesday (10am-11am)
Wednesday (10am-11am & 3pm-4pm)
Friday (9am-10am, 11am-12pm, 2pm-3pm)
Dr. Srinivasa L. Chakravarthy
&
Smt. Jyotsna Rani Thota
Department of CSE
GITAM Institute of Technology (GIT)
Visakhapatnam – 530045
Email: [email protected] & [email protected]
Department of CSE, GIT
2 Nov 2020
EID 403: Machine Learning
Course objectives
● Explore the various disciplines connected with ML.
● Explore the efficiency of learning with inductive bias.
● Explore ML algorithms such as decision tree learning.
● Explore algorithms such as artificial neural networks,
genetic programming, Bayesian algorithms, the nearest-neighbor
algorithm, and hidden Markov models.
Learning Outcomes
● Identify the various applications connected with ML.
● Classify the efficiency of ML algorithms with the inductive bias
technique.
● Distinguish the purposes of the various ML algorithms.
● Analyze an application and correlate it with the available ML
algorithms.
● Choose an appropriate ML algorithm to develop a project.
Syllabus
20 August 2020
Reference book 1. Title: Machine Learning
Author: Tom M. Mitchell
Reference book 2. Title: Introduction to Machine Learning
Author: Ethem Alpaydin
Module 5
(Chapter 15 from the prescribed book by Ethem Alpaydin)
It includes:
Discrete Markov processes
Hidden Markov models
Three problems of HMMs
Evaluation problem
Finding the state sequence
Learning model parameters & continuous observations
HMM with output & model selection in HMMs
Introduction
So far, we assumed that the instances that form a sample are
independent and identically distributed, i.e., each random variable has the
same probability distribution as the others and all are mutually independent.
This assumption is not valid for applications where successive instances
are dependent.
For example, processes where the sequence of observations cannot be
modeled as independent draws from a probability distribution include:
1. Successive letters in a word are dependent.
2. Base pairs in a DNA sequence are dependent.
3. In speech recognition, phonemes in a word (dictionary) and
words in a sentence (syntax, semantics of the language) are dependent.
Introduction
Such a sequence is characterized by a parametric random process.
This chapter covers:
● How the modeling is done.
● How the parameters of such a model can be learned from a training sample of
example sequences.
Discrete Markov Processes
Consider a system that, at any time, is in one of a set of N distinct
states S1, S2, ..., SN.
The state at time t is denoted qt, t = 1, 2, ...
For example, qt = Si means that at time t the system is in state Si.
At regularly spaced discrete times, the system moves to a new state with a
probability that depends on the previous states:
P(qt+1 = Sj | qt = Si, qt-1 = Sk, ...)
Discrete Markov Processes(cont.)
In a first-order Markov model, the state at time t+1 depends only on the state at time t:
P(qt+1 = Sj | qt = Si, qt-1 = Sk, ...) = P(qt+1 = Sj | qt = Si)
This corresponds to saying that, given the present state, the future is
independent of the past.
Let us further assume that these probabilities, called the transition
probabilities, are independent of time:
aij ≡ P(qt+1 = Sj | qt = Si), with aij ≥ 0 and Σ_{j=1..N} aij = 1
So going from Si to Sj has the same probability aij at any time. The only special
case is the first state, which has an initial probability πi:
πi ≡ P(q1 = Si), with Σ_{i=1..N} πi = 1
Discrete Markov Processes(cont.)
Example of a Markov model with three states (a stochastic automaton).
In an observable Markov model, the states are observable. At any time, as the
system moves from one state to another, we get an observation sequence, i.e., a
sequence of states.
The output of the process is the set of states at each instant of time, where each
state corresponds to a physically observable event.
Discrete Markov Processes(cont.)
We have an observation sequence that is the state sequence O = Q = {q1, q2, ..., qT},
whose probability is
P(O = Q | Π, A) = P(q1) ∏_{t=2..T} P(qt | qt-1) = π_{q1} a_{q1 q2} ··· a_{qT-1 qT}
where π_{q1} is the probability of starting in q1 and a_{q1 q2} is the probability of
going from q1 to q2. We multiply these probabilities to get the probability of
the whole sequence.
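The product above can be sketched in a few lines of Python. The initial probabilities and transition matrix below are assumed toy values, not taken from the slides:

```python
import numpy as np

# Assumed (hypothetical) parameters of an observable Markov model with N = 3 states.
pi = np.array([0.5, 0.2, 0.3])            # pi_i = P(q1 = S_i)
A = np.array([[0.4, 0.3, 0.3],            # a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(seq, pi, A):
    """P(O) = pi_{q1} * a_{q1 q2} * ... * a_{q_{T-1} q_T}, states as 0-based indices."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev, cur]
    return p

# O = {S1, S1, S3, S3} -> 0-based indices 0, 0, 2, 2
print(sequence_probability([0, 0, 2, 2], pi, A))  # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```

Note how each factor is read straight off Π (first state) and A (each transition).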
Discrete Markov Processes(cont.)
Let us assume we have N urns/baskets, where each urn contains balls of only
one color.
So there is an urn of red balls, another of blue balls, and so on.
Let us say we have 3 states: S1 = red, S2 = blue, S3 = green,
with initial probabilities Π = [π1, π2, π3].
Let A = [aij] be an N × N matrix whose rows sum to 1.
aij is the probability of drawing from urn j (a ball of color j) after drawing a ball
of color i from urn i; A is the transition matrix.
Discrete Markov Processes(cont.)
Given 𝚷 and A, it is easy to generate K random sequences, each of length T.
Let us see how to calculate the probability of a sequence.
Assume that the first 4 balls drawn are "red, red, green, green".
This corresponds to the observation sequence O = {S1, S1, S3, S3}, whose
probability is
P(O | Π, A) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3) = π1 · a11 · a13 · a33
Discrete Markov Processes(cont.)
Now, let us see how we can learn the parameters 𝚷 and A.
Given K example sequences of length T, where qt^k is the state at time t of
sequence k, the initial probability estimate is the fraction of sequences that
start in Si:
π̂i = ( Σ_k 1(q1^k = Si) ) / K
where 1(b) is 1 if b is true and 0 otherwise. The transition probability estimate
âij is the fraction of transitions out of Si that go to Sj:
âij = ( Σ_k Σ_{t=1..T-1} 1(qt^k = Si and qt+1^k = Sj) ) / ( Σ_k Σ_{t=1..T-1} 1(qt^k = Si) )
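These counting estimators can be sketched as follows; the K example state sequences are assumed (hypothetical) data over N = 2 states:

```python
import numpy as np

# Assumed toy data: K = 3 state sequences of length T = 4 over N = 2 states.
sequences = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]  # q_t^k as 0-based indices
N = 2

K = len(sequences)
pi_hat = np.zeros(N)
counts = np.zeros((N, N))
for seq in sequences:
    pi_hat[seq[0]] += 1.0 / K              # fraction of sequences starting in S_i
    for prev, cur in zip(seq, seq[1:]):
        counts[prev, cur] += 1             # count S_i -> S_j transitions
A_hat = counts / counts.sum(axis=1, keepdims=True)  # normalize each row

print(pi_hat)   # [2/3, 1/3]
print(A_hat)
```

Each row of A_hat sums to 1 by construction, matching the constraint on A.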
Hidden Markov Models
In an HMM:
1. The states are not observable,
but when we visit a state, an observation is recorded that is a probabilistic
function of the state.
2. There are M discrete observation symbols {v1, v2, ..., vM} possible in each state.
3. The observation (emission) probability bj(m) is the probability that we observe
vm, m = 1...M, in state Sj:
bj(m) ≡ P(Ot = vm | qt = Sj)
We assume that these probabilities do not depend on t.
The values observed form the observation sequence O.
4. The state sequence Q is not observed, which is what makes the model "hidden",
but it can be inferred from the observation sequence O.
Hidden Markov Models(cont.)
Elements of an HMM
N: Number of states
M: Number of observation symbols
A = [aij]: N X N state transition probability matrix
B = bj(m): N X M observation probability matrix
Π = [πi]: N X 1 initial state probability vector
λ = (A, B, Π) is the parameter set of the HMM.
Given λ, the model can be used to generate an arbitrary number of
observation sequences of arbitrary length.
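Generating a sequence from λ can be sketched as follows; the parameter values (N = 2 states, M = 3 symbols) are assumed for illustration:

```python
import numpy as np

# Assumed toy parameter set lambda = (A, B, Pi).
rng = np.random.default_rng(0)
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],   # b_j(m) = P(O_t = v_m | q_t = S_j)
               [0.1, 0.3, 0.6]])

def generate(T, Pi, A, B, rng):
    """Walk the hidden chain from Pi/A; at each step emit a symbol from B."""
    states, obs = [], []
    q = rng.choice(len(Pi), p=Pi)                   # initial state ~ Pi
    for _ in range(T):
        states.append(int(q))
        obs.append(int(rng.choice(B.shape[1], p=B[q])))  # symbol ~ b_q(.)
        q = rng.choice(len(Pi), p=A[q])             # next state ~ row q of A
    return states, obs

states, obs = generate(5, Pi, A, B, rng)
print(states, obs)
```

Only obs would be visible to an observer; states is the hidden sequence Q.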
Three Basic Problems of HMMs
Given a number of sequences of observations, we are interested in three problems:
1. Evaluation: Given a model λ and an observation sequence O, evaluate the
probability P(O | λ).
2. State sequence: Given λ and O, find the state sequence Q* = {q1, q2, ..., qT}
that has the highest probability,
such that P(Q* | O, λ) = maxQ P(Q | O, λ).
3. Learning: Given a training set of observation sequences X = {Ok}k, learn the
model that maximizes the probability of X, i.e., find λ* = arg maxλ P(X | λ).
Hidden Markov Models(cont.)
1. Evaluation Problem
Given an observation sequence O = {O1, O2, ..., OT}, with hidden state sequence
Q = {q1, ..., qT} and HMM parameter set λ, we want P(O | λ).
To calculate P(O | λ) there is an efficient procedure called the forward-backward
procedure.
It is based on the idea of dividing the observation sequence into two parts:
1. from time 1 until time t, and
2. from time t+1 until time T.
Hidden Markov Models(cont.)
1. Evaluation Problem (cont.)
We define the forward variable αt(i) as the probability of observing the partial
sequence {O1 ... Ot} up to time t and being in Si at time t, given the model λ:
αt(i) ≡ P(O1 ··· Ot, qt = Si | λ)
The nice thing about it is that it can be calculated recursively by accumulating
results:
Initialization: α1(i) = πi bi(O1)
Recursion: αt+1(j) = [ Σ_{i=1..N} αt(i) aij ] bj(Ot+1)
Then P(O | λ) = Σ_{i=1..N} αT(i).
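A minimal sketch of the forward pass, using assumed toy values for Pi, A, B and the observation sequence:

```python
import numpy as np

# Assumed toy HMM (N = 2 states, M = 2 symbols) and observation sequence.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
O  = [0, 1, 0]                               # observed symbol indices

def forward(O, Pi, A, B):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda)."""
    T, N = len(O), len(Pi)
    alpha = np.zeros((T, N))
    alpha[0] = Pi * B[:, O[0]]               # initialization: pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # recursion step
    return alpha

alpha = forward(O, Pi, A, B)
print(alpha[-1].sum())                       # P(O | lambda) = sum_i alpha_T(i)
```

The recursion costs O(N²T), versus O(N^T) for summing over all state sequences directly.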
Hidden Markov Models(cont.)
1. Evaluation Problem (cont.)
We define the backward variable βt(i) as the probability of observing the partial
sequence Ot+1 ... OT, given that we are in Si at time t and the model λ:
βt(i) ≡ P(Ot+1 ··· OT | qt = Si, λ)
It can be calculated recursively, with time going in the backward direction:
Initialization: βT(i) = 1
Recursion: βt(i) = Σ_{j=1..N} aij bj(Ot+1) βt+1(j)
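The backward pass can be sketched on the same assumed toy model; as a consistency check, P(O | λ) computed from β matches the forward value:

```python
import numpy as np

# Same assumed toy HMM as in the forward sketch.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
O  = [0, 1, 0]

def backward(O, Pi, A, B):
    """beta[t, i] = P(O_{t+1}..O_T | q_t = S_i, lambda)."""
    T, N = len(O), len(Pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # recursion step
    return beta

beta = backward(O, Pi, A, B)
# Consistency check: P(O | lambda) = sum_i pi_i * b_i(O_1) * beta_1(i)
print((Pi * B[:, O[0]] * beta[0]).sum())
```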
Hidden Markov Models(cont.)
2. Finding the State Sequence
Let us define γt(i) as the probability of being in state Si at time t, given O
and λ, which can be computed as follows:
γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σ_{j=1..N} αt(j) βt(j)
To find a state sequence, choose the state that has the highest probability,
for each time step t:
qt* = arg maxi γt(i)
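Putting the pieces together, the posterior γ and the per-step choice qt* can be sketched like this (same assumed toy model, forward/backward recomputed inline):

```python
import numpy as np

# Same assumed toy HMM as before.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
O  = [0, 1, 0]
T, N = len(O), len(Pi)

alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = Pi * B[:, O[0]]                   # forward pass
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
beta[-1] = 1.0                               # backward pass
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)    # gamma_t(i) = P(q_t = S_i | O, lambda)
q_star = gamma.argmax(axis=1)                # individually most likely state per t
print(q_star)
```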
Hidden Markov Models(cont.)
2. Finding the State Sequence (cont.)
Choosing the most likely state at each step separately may give an infeasible
sequence, e.g., one containing a zero-probability transition. To find the single
best state sequence, we use the Viterbi algorithm, based on dynamic programming,
which takes the transition probabilities into account.
Given the state sequence Q and observation sequence O, we define
δt(i) ≡ max_{q1 q2 ··· qt-1} P(q1 q2 ··· qt-1, qt = Si, O1 ··· Ot | λ)
where δt(i) is the probability of the highest-probability path at time t that
accounts for the first t observations and ends in Si. It is computed recursively:
Initialization: δ1(i) = πi bi(O1)
Recursion: δt+1(j) = [ maxi δt(i) aij ] bj(Ot+1)
The best path is then recovered by backtracking from arg maxi δT(i).
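The Viterbi recursion with backtracking can be sketched on the same assumed toy model:

```python
import numpy as np

# Same assumed toy HMM as before.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
O  = [0, 1, 0]

def viterbi(O, Pi, A, B):
    T, N = len(O), len(Pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)             # best predecessor of each state
    delta[0] = Pi * B[:, O[0]]                    # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A         # delta_{t-1}(i) * a_ij
        psi[t] = trans.argmax(axis=0)             # best i for each j
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    q = [int(delta[-1].argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                 # backtrack through psi
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta[-1].max()

path, p = viterbi(O, Pi, A, B)
print(path, p)
```

On this tiny example the Viterbi path agrees with the per-step γ choice; with stronger transition constraints the two can differ.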
Hidden Markov Models(cont.)
3. Learning Model Parameters
We want to calculate the λ* that maximizes the likelihood of the sample X,
i.e., P(X | λ); this is done with the Baum-Welch algorithm, an instance of
expectation-maximization.
We define ξt(i, j) as the probability of being in Si at time t and in Sj at
time t+1, given the whole observation sequence O and λ:
ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ) = αt(i) aij bj(Ot+1) βt+1(j) / P(O | λ)
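As a sketch, here is the computation of ξ and one Baum-Welch re-estimation step for A, on the same assumed toy model (a full fit would iterate and also re-estimate Π and B):

```python
import numpy as np

# Same assumed toy HMM as before.
Pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
O  = [0, 1, 0]
T, N = len(O), len(Pi)

alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = Pi * B[:, O[0]]                     # forward pass
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
beta[-1] = 1.0                                 # backward pass
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

p_O = alpha[-1].sum()                          # P(O | lambda)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    # xi_t(i,j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1] / p_O

gamma = xi.sum(axis=2)                         # gamma_t(i), t = 1..T-1
A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]  # expected transitions, normalized
print(A_new)
```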
Hidden Markov Models(cont.)
Continuous Observations
So far we assumed discrete observations, modeled as a multinomial:
P(Ot | qt = Sj, λ) = ∏_{m=1..M} bj(m)^{r_t^m}, where r_t^m = 1 if Ot = vm and 0 otherwise.
If the inputs are continuous, one possibility is to discretize them by vector
quantization; the k-means used for vector quantization is the hard version of a
Gaussian mixture model.
For a scalar continuous observation, the easiest option is to assume a normal
distribution:
p(Ot | qt = Sj, λ) ~ N(μj, σj²)
Hidden Markov Models(cont.)
Model Selection in HMMs
Example of a left-right HMM, in which transitions go only from a state to itself
or to states with higher indices.
In classification, we estimate P(O | λi) with a separate HMM per class and
use Bayes' rule to get the class posterior:
P(λi | O) = P(O | λi) P(λi) / Σj P(O | λj) P(λj)
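The classification scheme can be sketched as follows; both class models and the priors are assumed toy values, and the likelihood is computed with the forward algorithm:

```python
import numpy as np

def likelihood(O, Pi, A, B):
    """P(O | lambda) via the forward algorithm."""
    alpha = Pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two hypothetical class models lambda_1, lambda_2 over the same 2 symbols:
# class 0 tends to emit symbol 0, class 1 tends to emit symbol 1.
models = [
    (np.array([0.9, 0.1]), np.array([[0.9, 0.1], [0.1, 0.9]]),
     np.array([[0.8, 0.2], [0.3, 0.7]])),
    (np.array([0.2, 0.8]), np.array([[0.5, 0.5], [0.5, 0.5]]),
     np.array([[0.1, 0.9], [0.4, 0.6]])),
]
priors = np.array([0.5, 0.5])                  # P(lambda_i)

O = [0, 0, 1, 0]
lik = np.array([likelihood(O, *m) for m in models])
post = lik * priors / (lik * priors).sum()     # Bayes' rule
print(post.argmax())                           # predicted class
```

Each HMM is trained only on its own class's sequences; at test time the posterior combines the per-class likelihoods with the priors.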
END OF MODULE-5 (Chapter 15)