Introduction to Hidden Markov Models
Hidden Markov Model (HMM)
• A Hidden Markov Model (HMM) is a statistical model
used to describe systems that are Markov processes with
unobserved (hidden) states.
• In an HMM, we assume that the system being modeled is
a Markov process where the state at any time depends
only on the previous state (Markov property), but unlike
a simple Markov model, the states are not directly
observable.
• Instead, what we observe are outcomes or emissions that
are probabilistically generated by each hidden state.
Components of an HMM
• States (S): These are the hidden states of the system. The
system transitions between these states according to certain
probabilities.
• Observations (O): These are the observable outputs emitted
by the hidden states. Each state emits an observation based
on a probability distribution.
• Transition Probabilities (A): The probability of moving
from one state to another, denoted as A = P(s_{t+1} | s_t), where
s_t is the current state and s_{t+1} is the next state.
• Emission Probabilities (B): The probability of observing a
particular output given the current state, B = P(o_t | s_t), where
o_t is the observation at time t.
• Initial State Probabilities (π): The probability of the
system starting in a particular state, π = P(s_1), where s_1 is the
initial state.
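• Putting these components together, the joint probability of a hidden state
sequence s_1, ..., s_T and an observation sequence o_1, ..., o_T factorizes as
P(o_1, ..., o_T, s_1, ..., s_T) = π(s_1) · B(o_1 | s_1) · ∏_{t=2}^{T} A(s_t | s_{t-1}) · B(o_t | s_t).
This follows directly from the Markov assumption on the states and the
assumption that each observation depends only on the current state.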
Example: Weather Prediction
• Imagine you are trying to predict the weather, but you can’t observe it directly.
• Instead, you can only see whether someone carries an umbrella, which gives
you clues about the weather.
• The weather can be sunny (S) or rainy (R), and each day’s weather depends
only on the previous day’s weather (Markov property).
• States: S = {Sunny, Rainy}
• Observations: O = {Umbrella, No Umbrella}
• Transition Probabilities (A):
• P(Sunny → Sunny) = 0.7
• P(Sunny → Rainy) = 0.3
• P(Rainy → Sunny) = 0.4
• P(Rainy → Rainy) = 0.6
• Emission Probabilities (B): If it is sunny, there's a 10% chance the person
will carry an umbrella and a 90% chance they won't.
• If it is rainy, there's an 80% chance the person will carry an umbrella and a 20%
chance they won't.
• So, for example, P(Umbrella | Sunny) = 0.1 and P(Umbrella | Rainy) = 0.8.
• Initial State Probabilities (π):
• P(Sunny)=0.6
• P(Rainy)=0.4
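As a concrete illustration, the parameters above can be written down directly in
code. The following is a minimal sketch in plain Python (the dictionary layout and
the function name joint_probability are just one possible representation, not taken
from any library); it encodes the weather HMM and evaluates the joint probability
of one fully specified state/observation path:

# Weather HMM parameters from the example above.
states = ["Sunny", "Rainy"]
pi = {"Sunny": 0.6, "Rainy": 0.4}                    # initial state probabilities
A = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},          # transition probabilities
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Umbrella": 0.1, "NoUmbrella": 0.9},  # emission probabilities
     "Rainy": {"Umbrella": 0.8, "NoUmbrella": 0.2}}

def joint_probability(state_seq, obs_seq):
    # P(states, observations) for one fully specified path, using the factorization
    # pi(s_1) * B(o_1 | s_1) * prod_t A(s_t | s_{t-1}) * B(o_t | s_t).
    p = pi[state_seq[0]] * B[state_seq[0]][obs_seq[0]]
    for t in range(1, len(state_seq)):
        p *= A[state_seq[t - 1]][state_seq[t]] * B[state_seq[t]][obs_seq[t]]
    return p

# Example: probability that it was Rainy, Rainy, Sunny while we observed
# Umbrella, Umbrella, NoUmbrella.
print(joint_probability(["Rainy", "Rainy", "Sunny"],
                        ["Umbrella", "Umbrella", "NoUmbrella"]))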
Problems Solved Using HMMs:
• Likelihood (Evaluation Problem): Given a sequence of
observations (e.g., umbrella/no umbrella over a few days), what is
the likelihood that the observed sequence was generated by the
model? This is solved using the Forward Algorithm (see the sketch
after this list).
• Decoding (State Prediction Problem): Given a sequence of
observations, what is the most probable sequence of hidden states
that generated these observations? This is solved using the Viterbi
Algorithm.
• Learning (Parameter Estimation Problem): Given a set of
observations, how do we adjust the HMM parameters (transition,
emission, and initial probabilities) to best fit the data? This is solved
using the Baum-Welch Algorithm, a form of the Expectation-
Maximization (EM) algorithm.
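To make the likelihood (evaluation) problem concrete, here is a minimal sketch of
the Forward Algorithm in plain Python, reusing the weather parameters from the
example above (the variable and function names are illustrative, not from any
particular library):

# Weather HMM parameters (same as in the earlier example).
states = ["Sunny", "Rainy"]
pi = {"Sunny": 0.6, "Rainy": 0.4}
A = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Umbrella": 0.1, "NoUmbrella": 0.9},
     "Rainy": {"Umbrella": 0.8, "NoUmbrella": 0.2}}

def forward_likelihood(obs_seq):
    # alpha[s] = P(observations seen so far, current state = s)
    alpha = {s: pi[s] * B[s][obs_seq[0]] for s in states}
    for obs in obs_seq[1:]:
        alpha = {s: sum(alpha[prev] * A[prev][s] for prev in states) * B[s][obs]
                 for s in states}
    return sum(alpha.values())  # sum over the final hidden state

# Likelihood of seeing Umbrella, Umbrella, NoUmbrella over three days:
print(forward_likelihood(["Umbrella", "Umbrella", "NoUmbrella"]))

Summing over all states at every step is what distinguishes the Forward Algorithm
(total likelihood over all hidden paths) from the Viterbi Algorithm (single best
path) illustrated next.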
Example of Decoding with the Viterbi Algorithm:
• Suppose over three days you observed the person carrying an umbrella
each day.
• Using the Viterbi algorithm, you can infer the most likely sequence of
weather conditions that caused these observations, given the transition
and emission probabilities.
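A minimal sketch of this decoding step, again in plain Python with the weather
parameters from before (the backpointer bookkeeping below is one common way to
recover the best path, not the only one):

def viterbi(obs_seq):
    # Weather HMM parameters (same as in the earlier example).
    states = ["Sunny", "Rainy"]
    pi = {"Sunny": 0.6, "Rainy": 0.4}
    A = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
    B = {"Sunny": {"Umbrella": 0.1, "NoUmbrella": 0.9},
         "Rainy": {"Umbrella": 0.8, "NoUmbrella": 0.2}}
    # delta[s] = probability of the best state path that ends in state s
    delta = {s: pi[s] * B[s][obs_seq[0]] for s in states}
    backpointers = []
    for obs in obs_seq[1:]:
        psi, new_delta = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] * A[p][s])
            psi[s] = best_prev
            new_delta[s] = delta[best_prev] * A[best_prev][s] * B[s][obs]
        backpointers.append(psi)
        delta = new_delta
    # Trace the best path back from the most probable final state.
    path = [max(delta, key=delta.get)]
    for psi in reversed(backpointers):
        path.insert(0, psi[path[0]])
    return path

print(viterbi(["Umbrella", "Umbrella", "Umbrella"]))
# With these parameters, the most likely sequence is Rainy, Rainy, Rainy.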
Applications of HMM:
• Speech Recognition: In speech, the hidden states are the phonemes
(sounds), while the observable events are the acoustic signals.
• Part-of-Speech Tagging: In Natural Language Processing (NLP),
HMMs are used to assign parts of speech to words in a sentence, where
the words are observable, and the parts of speech are hidden.
• Bioinformatics: HMMs are used for gene prediction, where the hidden
states are the types of DNA sequences (e.g., coding, non-coding), and the
observations are the actual sequences.
• This flexibility of HMMs to model systems with hidden states and
probabilistic outputs makes them powerful for a wide range of sequence
modeling tasks.
HMM for POS tagging
An HMM consists of:
1. States: In POS tagging, these are the part-of-speech tags (like
noun, verb, adjective, etc.), the "hidden" variables.
2. Observations: These are the words in the sentence, which are
visible or "observable."
3. Transition probabilities: The probability of moving from one
POS tag to another (e.g., transitioning from a noun to a verb).
4. Emission probabilities: The probability of a word being
generated by a particular POS tag (e.g., the probability that the
word "run" is a verb).
5. Initial probabilities: The probability of a particular POS tag
starting the sequence (e.g., the probability of a sentence starting
with a noun).
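In practice these five components are usually estimated from a tagged corpus by
counting. A minimal sketch of that estimation, assuming a toy corpus of (word, tag)
sentences (the corpus below is made up purely for illustration, and no smoothing is
applied):

from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (illustrative only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VB")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]

transition_counts = Counter()  # (prev_tag, tag) -> count
emission_counts = Counter()    # (tag, word) -> count
initial_counts = Counter()     # tag -> count at sentence start
tag_counts = Counter()         # tag -> total occurrences
prev_counts = Counter()        # tag -> occurrences as a predecessor

for sentence in corpus:
    prev_tag = None
    for i, (word, tag) in enumerate(sentence):
        tag_counts[tag] += 1
        emission_counts[(tag, word)] += 1
        if i == 0:
            initial_counts[tag] += 1
        else:
            transition_counts[(prev_tag, tag)] += 1
            prev_counts[prev_tag] += 1
        prev_tag = tag

# Relative-frequency estimates of the HMM parameters.
pi = {t: c / len(corpus) for t, c in initial_counts.items()}
A = {(p, t): c / prev_counts[p] for (p, t), c in transition_counts.items()}
B = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
print(pi, A, B)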
How HMM Works in POS Tagging
Given a sequence of words (observations), the goal is to determine the
most likely sequence of POS tags (states) that could have generated
the words.
Example:
Consider the sentence: Time flies like an arrow.
We want to assign a POS tag to each word in the sentence.
1. States (POS Tags): For this example, the possible tags for each word
might include:
• Time: {NN (noun), VB (verb)}
• flies: {VB (verb), NNS (plural noun)}
• like: {VB (verb), IN (preposition)}
• an: {DT (determiner)}
• arrow: {NN (noun)}
2. Observations (Words): The observed words are ["Time", "flies", "like", "an", "arrow"].
Transition Probabilities: These define the likelihood of transitioning
from one tag to another. For instance:
• P(NN → VB) could be small (nouns are less likely to transition to
verbs).
• P(DT → NN) could be high (determiners are often followed by
nouns).
Emission Probabilities: These define the likelihood of a word being
generated by a specific tag. For example:
• P("flies" | VB) = high (since "flies" can be a verb).
• P("flies" | NNS) = high (since "flies" can also be a plural noun).
Initial Probabilities: These are the probabilities that a sentence starts
with a particular POS tag. For example:
• P(NN as the first tag) could be high, since many sentences start with
a noun.
Finding the Most Likely Sequence (Viterbi Algorithm):
• The Viterbi algorithm is typically used to find the most probable
sequence of tags. It computes the highest-probability path (sequence of
tags) that results in the observed words.
1. Start with the first word "Time". Assume two possible POS tags:
NN (noun) or VB (verb).
2. For each possible POS tag of "Time", calculate the probability of
observing the next word "flies" given the previous tag (using
transition and emission probabilities).
3. Continue this process for all words in the sentence, keeping track of
the most likely sequence of POS tags.
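As a rough end-to-end sketch of these steps, the generic Viterbi recurrence can be
run over the full tag set with hypothetical transition and emission values (every
number below is invented for illustration; a real tagger would estimate them from a
tagged corpus). With these particular numbers the decoded sequence comes out as
NN, VB, IN, DT, NN, matching the final output described next:

# Hypothetical probabilities for the "Time flies like an arrow" example
# (all numbers are made up for illustration, not estimated from data).
tags = ["NN", "NNS", "VB", "IN", "DT"]

pi = {"NN": 0.4, "NNS": 0.1, "VB": 0.2, "IN": 0.1, "DT": 0.2}

# A[prev][next]; each row sums to 1, but the values are illustrative only.
A = {
    "NN":  {"NN": 0.1, "NNS": 0.1, "VB": 0.5, "IN": 0.2, "DT": 0.1},
    "NNS": {"NN": 0.1, "NNS": 0.1, "VB": 0.5, "IN": 0.2, "DT": 0.1},
    "VB":  {"NN": 0.2, "NNS": 0.1, "VB": 0.1, "IN": 0.3, "DT": 0.3},
    "IN":  {"NN": 0.2, "NNS": 0.1, "VB": 0.1, "IN": 0.1, "DT": 0.5},
    "DT":  {"NN": 0.8, "NNS": 0.1, "VB": 0.05, "IN": 0.025, "DT": 0.025},
}

# B[tag][word]; only the words in the sentence are listed, anything else is
# treated as probability 0 for simplicity.
B = {
    "NN":  {"Time": 0.3, "arrow": 0.3},
    "NNS": {"flies": 0.4},
    "VB":  {"Time": 0.1, "flies": 0.3, "like": 0.2},
    "IN":  {"like": 0.5},
    "DT":  {"an": 0.9},
}

def viterbi_tags(words):
    # delta[t] = probability of the best tag path ending in tag t
    delta = {t: pi[t] * B[t].get(words[0], 0.0) for t in tags}
    backpointers = []
    for w in words[1:]:
        psi, new_delta = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * A[p][t])
            psi[t] = best_prev
            new_delta[t] = delta[best_prev] * A[best_prev][t] * B[t].get(w, 0.0)
        backpointers.append(psi)
        delta = new_delta
    # Trace the best tag sequence back from the most probable final tag.
    path = [max(delta, key=delta.get)]
    for psi in reversed(backpointers):
        path.insert(0, psi[path[0]])
    return path

print(viterbi_tags(["Time", "flies", "like", "an", "arrow"]))
# -> ['NN', 'VB', 'IN', 'DT', 'NN'] with the illustrative numbers above.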
Final Output:
After running the Viterbi algorithm, you might get the following
tagging: Time/NN flies/VB like/IN an/DT arrow/NN. Here:
• "Time" is tagged as a noun (NN),
• "flies" as a verb (VB),
• "like" as a preposition (IN),
• "an" as a determiner (DT),
• "arrow" as a noun (NN).
This sequence represents the most probable path through the hidden
states (POS tags) given the observed words, according to the HMM.