Hidden Markov and Maximum Entropy Models
Introduction
Markov Chains
- Observed Markov model
- weighted finite-state automata
- Probabilistic graphical model
Hidden Markov model
- transition probability matrix
- Observed likelihood
- emission probability
- Left-to-right (Bakis) HMM
Maximum entropy models
- log-linear classifiers
- linear regression
- Logistic regression
- hyperplane
Maximum entropy Markov models
- MaxEnt model
- HMM tagging model
- MEMM tagging model
@Copyrights: Natural Language Processing (NLP) Organized by Dr. Ahmad Jalal (http://portals.au.edu.pk/imc/)
1. Introduction (Hidden Markov and Maximum Entropy Models)
Two important classes of statistical models for processing text & speech;
(1) Hidden Markov model (HMM),
(2) Maximum entropy model (MaxEnt), and particularly a Markov-related
variant of MaxEnt called the maximum entropy Markov model (MEMM).
HMMs and MEMMs are both sequence classifiers.
- A sequence classifier or sequence labeller is a model whose job is to assign
some label or class to each unit in a sequence.
- They compute a probability distribution over possible labels and choose the best
label sequence.
2. Markov Chains
Markov chains and hidden Markov models are both extensions of finite
automata.
A finite automaton is defined by a set of states and a set of transitions between
states.
A Markov chain is a special case of a weighted automaton in which the input
sequence uniquely determines which states the automaton will go through.
Because
it can’t represent inherently ambiguous problems, a Markov chain is only
useful for assigning probabilities to unambiguous sequences.
2. Markov Chains (Cont…)
Figure 6.1 (a) shows;
- a Markov chain for assigning a probability to a sequence of weather events (one state per
weather type), for which the vocabulary consists of HOT, COLD and RAINY.
Figure 6.1 (b) shows;
- another simple example of a Markov chain, this one for assigning a probability to a
sequence of words w1, w2, …, wn (one state per word).
2. Markov Chains (Alternative representation)
An alternative representation that is sometimes used for Markov chains doesn’t rely on a
start or end state,
- instead representing the distribution over initial states and accepting states explicitly.
Examples; compute the probability of each of the following sequences (a Python sketch of
the computation follows the list of sequences below);
o hot hot hot hot => P(hot hot hot hot) = .5 * .5 * .5 * .5 = 0.0625
o cold hot cold hot => P(cold hot cold hot) = .5 * .2 * .5 * .2 = 0.01
COLD-> COLD->WARM->WARM->WARM-> HOT-> COLD
HOT->COLD->HOT->HOT->WARM->COLD->COLD->WARM
WARM->HOT->COLD->WARM->COLD->HOT->WARM->HOT
HOT->COLD->COLD->WARM->WARM->HOT->COLD->WARM
COLD->HOT->WARM->WARM->COLD->HOT->WARM->HOT
WARM->COLD->HOT->WARM->COLD->COLD->HOT->WARM
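For these computations, a minimal Python sketch of the alternative representation is shown below; the initial distribution and transition table here are invented placeholders, not the values from the figures in the slides.

# Sketch: probability of a state sequence under a Markov chain, using the
# alternative representation (explicit initial distribution + transition matrix).
# The probabilities below are made-up placeholders, not the figures' values.
initial = {"HOT": 0.5, "COLD": 0.3, "WARM": 0.2}
trans = {
    "HOT":  {"HOT": 0.6, "COLD": 0.1, "WARM": 0.3},
    "COLD": {"HOT": 0.2, "COLD": 0.5, "WARM": 0.3},
    "WARM": {"HOT": 0.4, "COLD": 0.3, "WARM": 0.3},
}

def chain_probability(sequence):
    # P(s1 ... sn) = initial(s1) * product over i of P(s_i | s_{i-1})
    p = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        p *= trans[prev][curr]
    return p

print(chain_probability(["HOT", "HOT", "HOT", "HOT"]))
print(chain_probability(["COLD", "COLD", "WARM", "WARM", "WARM", "HOT", "COLD"]))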
2. Markov Chains (Class Participation)
How do we compute the probability of each of the following sentences using a
7-state model?
(a) Students did their assignment well at time (* highest likelihood).
(b) did their assignment student well at time (* 2nd-highest likelihood).
(c) At student assignment well did time their (* lowest likelihood).
How do we compute the probability of each of the following sentences using a
5-state model?
(a) Weather is hot and dry (* highest likelihood).
(b) and is hot weather dry (* 2nd-highest likelihood).
(c) hot weather and dry is (* lowest likelihood).
3. Hidden Markov Model (HMM)
A Hidden Markov Model (HMM) allows us to talk about both
- observed events (like words that we see in the input) and
- hidden events (like part-of-speech tags) that we think of as causal factors in our
probabilistic model.
A formal definition of a Hidden Markov Model focuses on how it
differs from a Markov chain.
- An HMM doesn’t rely on a start or end state,
- instead representing the distribution over initial and accepting states explicitly.
3. Hidden Markov Model (HMM) (Cont…)
A first-order hidden Markov model instantiates two simplifying assumptions;
First, the probability of a particular state depends only on the previous state (the Markov
assumption):
P(qi | q1 … qi−1) = P(qi | qi−1)
Second, the probability of an output observation oi
- depends only on the state qi that produced the observation and
- not on any other states or any other observations (output independence):
P(oi | q1 … qi … qT , o1 … oi … oT) = P(oi | qi)
3. Hidden Markov Model (HMM) [Example]
In Figure;
Two states : ‘Low’ and ‘High’ atmospheric
pressure.
Two observations : ‘Rain’ and ‘Dry’.
Transition probabilities: P(‘Low’|‘Low’)=0.3 ,
P(‘High’|‘Low’)=0.7 , P(‘Low’|‘High’)=0.2,
P(‘High’|‘High’)=0.8
Observation probabilities : P(‘Rain’|‘Low’)=0.6 ,
P(‘Dry’|‘Low’)=0.4 , P(‘Rain’|‘High’)=0.4 ,
P(‘Dry’|‘High’)=0.6 .
Initial probabilities: say P(‘Low’)=0.4 ,
P(‘High’)=0.6 .
3. Hidden Markov Model (HMM) [Example-1] (Cont…)
Calculation of the observation sequence probability;
Suppose we want to calculate the probability of a sequence of observations in our example,
{‘Dry’,’Rain’}, using the transition, observation, and initial probabilities given above.
Consider all possible hidden state sequences:
P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) + P({‘Dry’,’Rain’}, {‘Low’,’High’}) +
P({‘Dry’,’Rain’}, {‘High’,’Low’}) + P({‘Dry’,’Rain’}, {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’}, {‘Low’,’Low’}) = P(‘Dry’|’Low’) P(‘Low’) P(‘Rain’|’Low’) P(‘Low’|’Low’)
= 0.4 * 0.4 * 0.6 * 0.3 = 0.0288
(a Python sketch of this enumeration appears below).
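The same enumeration can be written as a short Python sketch, using the parameters of this example (with P(‘Dry’|‘High’) taken as 0.6 so that each state’s observation distribution sums to 1):

# Sketch: probability of observing {'Dry', 'Rain'} under the Low/High-pressure HMM,
# summed over all possible hidden state sequences.
from itertools import product

initial = {"Low": 0.4, "High": 0.6}
trans = {("Low", "Low"): 0.3, ("Low", "High"): 0.7,
         ("High", "Low"): 0.2, ("High", "High"): 0.8}
emit = {("Low", "Rain"): 0.6, ("Low", "Dry"): 0.4,
        ("High", "Rain"): 0.4, ("High", "Dry"): 0.6}

def observation_probability(observations):
    # Sum P(observations, states) over every possible hidden state sequence.
    total = 0.0
    for path in product(["Low", "High"], repeat=len(observations)):
        p = initial[path[0]] * emit[(path[0], observations[0])]
        for prev, curr, obs in zip(path, path[1:], observations[1:]):
            p *= trans[(prev, curr)] * emit[(curr, obs)]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))  # the {'Low','Low'} term alone is 0.0288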
3. Hidden Markov Model (HMM) [Example-2] (Cont…)
Typed word recognition; assume all characters are separated.
The character recognizer outputs the probability of an image being a particular
character, P(image|character).
3. Hidden Markov Model (HMM) [Example-3] (Cont…)
We can construct a single HMM for all words.
Hidden states = all characters in the alphabet.
Transition probabilities and initial probabilities are calculated from a language
model.
Observations and observation probabilities are as before.
Here we have to determine the best sequence of hidden states, the one that
most likely produced the word image.
This is an application of the decoding problem (a Viterbi sketch follows below).
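A minimal sketch of Viterbi decoding is shown below; the parameter dictionaries (initial, trans, emit) are assumed inputs, supplied in practice by the language model and by the character recognizer's P(image|character) scores.

# Sketch: Viterbi decoding - find the most probable hidden state (character) sequence.
def viterbi(observations, states, initial, trans, emit):
    # best[t][s] = (probability of the best path ending in state s at time t, backpointer)
    best = [{s: (initial[s] * emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans[p][s] * emit[s][obs], p) for p in states)
            column[s] = (prob, prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))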
4. Left-to-right (Bakis) HMM
In left-to-right (also called Bakis)
HMMs, the state transitions proceed from
left to right.
In a Bakis HMM,
- no transitions go from a higher-numbered state to a lower-numbered state.
Such models range from a single-state HMM to multi-state HMMs, as shown in the figure
(a transition-matrix sketch follows below);
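The left-to-right constraint shows up as an upper-triangular transition matrix; below is a small sketch for a 4-state Bakis HMM, with made-up probabilities.

# Sketch: a 4-state left-to-right (Bakis) transition matrix. Entries below the
# diagonal are zero, so no transition moves to a lower-numbered state.
# The specific probabilities are invented for illustration.
import numpy as np

A = np.array([
    [0.6, 0.4, 0.0, 0.0],   # state 1 -> stay or move to state 2
    [0.0, 0.5, 0.5, 0.0],   # state 2 -> stay or move to state 3
    [0.0, 0.0, 0.7, 0.3],   # state 3 -> stay or move to state 4
    [0.0, 0.0, 0.0, 1.0],   # state 4 is the final (absorbing) state
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution
assert np.allclose(A, np.triu(A))        # upper-triangular: left-to-right only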
4. Left-to-right (Bakis) HMM (Home Assignment)
Draw left-to-right HMM models with 2 states, 3 states, and 4 states for the
following problems;
[Figures (a), (b), (c): tennis posture detection images]
5. Maximum Entropy Models
The second probabilistic machine learning framework is called maximum entropy modelling.
- MaxEnt is more widely known as multinomial logistic regression.
MaxEnt belongs to the family of classifiers known as exponential or log-linear
classifiers.
- MaxEnt works by extracting some set of features from the input,
- combining them linearly (meaning that each feature is multiplied by a weight and then
added up), and using this sum as an exponent.
Example-1:
In text classification,
- we need to decide whether a particular email should be classified as spam, or
- determine whether a particular sentence or document expresses a positive or negative
opinion.
5. Maximum Entropy Models (Cont…)
Example-2: Assume that we have some input x (perhaps it is a word that needs to be tagged or
a document that needs to be classified).
- From input x, we extract some features fi.
- A feature for tagging might be “this word ends in -ing”.
- For each such feature fi, we have some weight wi.
Given the features and weights, our goal is to choose a class for the word.
- The probability of a particular class c given the observation x is;
p(c|x) = (1/Z) exp( Σi wi fi )
where Z is a normalization factor, used to make the probabilities correctly sum to 1.
Finally, in the actual MaxEnt model,
- the features f and weights w both depend on the class c (i.e., we’ll have different features and
weights for different classes);
p(c|x) = exp( Σi wci fi(c,x) ) / Σc′ exp( Σi wc′i fi(c′,x) )
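A minimal sketch of this computation is shown below; the feature functions and weights are invented purely for illustration and are not taken from the slides.

# Sketch: MaxEnt / multinomial logistic regression,
# p(c|x) = exp(sum_i w_{c,i} f_i(c, x)) / Z, with Z summing over all classes.
import math

def maxent_probability(x, classes, features, weights):
    scores = {
        c: math.exp(sum(w * f(c, x) for f, w in zip(features, weights[c])))
        for c in classes
    }
    z = sum(scores.values())              # normalization factor Z
    return {c: s / z for c, s in scores.items()}

# Hypothetical example: tag a word as VB or NN using two binary features.
features = [
    lambda c, x: 1.0 if x["word"].endswith("ing") and c == "VB" else 0.0,
    lambda c, x: 1.0 if x["prev_tag"] == "TO" and c == "VB" else 0.0,
]
weights = {"VB": [0.8, 1.2], "NN": [0.0, 0.0]}   # made-up weights
print(maxent_probability({"word": "racing", "prev_tag": "TO"}, ["VB", "NN"], features, weights))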
5.1 Linear Regression
In linear regression, we are given a set of observations;
- each observation is associated with some features,
- and we want to predict some real-valued outcome for each observation.
Example; predicting housing prices.
Levitt and Dubner showed that the words used in a real estate ad can be a good predictor of;
- whether a house will sell for more or less than its asking price.
- e.g., houses whose real estate ads have words like
fantastic, cute, or charming tend to sell for
lower prices,
- e.g., while houses whose ads have words like
maple and granite tend to sell for
higher prices.
5.1 Linear Regression (Cont…)
Figure shows;
- a graph of these points, with the feature (# of adjectives) on the x-axis and the price on the
y-axis, together with the regression line.
Suppose the weight vector that we had previously learned for this task was
w = (w0, w1, w2, w3) = (18000, −5000, −3000, −1.8).
Then the predicted value for this house would be computed by multiplying each feature by
its weight (see the sketch below).
The equation of any line is y = mx + b; as shown on the graph, the slope of this line is
m = −4900, while the intercept is b = 16550.
The feature x here is the number of adjectives.
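Below is a minimal sketch of the prediction step; the weight vector is the one given above, while the feature values are hypothetical, since the slide's feature table is not reproduced here.

# Sketch: linear-regression prediction, price = w0 + w1*f1 + w2*f2 + w3*f3.
w = [18000, -5000, -3000, -1.8]   # weight vector from the slide; w0 is the intercept
features = [1, 1, 2, 500]         # f0 = 1 for the intercept; f1-f3 are made-up feature values

predicted_price = sum(wi * fi for wi, fi in zip(w, features))
print(predicted_price)            # 18000 - 5000 - 6000 - 900 = 6100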
5.1 Linear Regression (Class Participation)
Example; Global warming may be reducing average
snowfall in your town, and you are asked to predict how
much snow you think will fall this year.
Looking at the following table, you might guess somewhere
around 10-20 inches. That’s a good guess, but you could make
a better guess by using regression.
- Find the linear-regression predictions for 2014, 2015, 2016, 2017 and
2018.
Hint:
- Regression also gives you a useful equation, which for this
chart is: y = −2.2923x + 4624.4.
- For example, for 2005:
y = −2.2923(2005) + 4624.4 = 28.3385 inches, which is
pretty close to the actual figure of 30 inches for that year.
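A small sketch of this calculation, using only the regression equation given in the hint (the snowfall table itself is a figure and is not reproduced here):

# Sketch: evaluating the regression line from the hint for the requested years.
def predicted_snowfall(year):
    # y = -2.2923x + 4624.4, with x = year and y = snowfall in inches
    return -2.2923 * year + 4624.4

for year in [2005, 2014, 2015, 2016, 2017, 2018]:
    print(year, round(predicted_snowfall(year), 4))
# 2005 -> 28.3385 inches, matching the worked example above.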
5.2 Logistic Regression
In logistic regression, we classify whether some observation x is in the class (true) or not in
the class (false).
Example; we are assigning a part-of-speech tag to the word “race” in;
Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
- We are just doing classification, not sequence classification, so let’s consider just this single
word.
- We would like to know whether to assign the class VB to race (or instead assign some other
class like NN).
Case 1: We can thus add a binary feature that is true if this is the case.
Case 2: Another feature would be whether the previous word “to” has the tag TO
(a logistic-regression sketch with two such features follows below).
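A minimal binary logistic-regression sketch in the spirit of Case 1 and Case 2 is shown below; the two indicator features and their weights are illustrative assumptions, not values from the slides.

# Sketch: binary logistic regression for the race/VB decision.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_is_vb(word, prev_word, prev_tag, weights, bias):
    # f1: the word itself is "race"; f2: the previous word "to" carries the tag TO
    f = [1.0 if word == "race" else 0.0,
         1.0 if prev_word == "to" and prev_tag == "TO" else 0.0]
    z = bias + sum(w * fi for w, fi in zip(weights, f))
    return sigmoid(z)            # P(class = VB | x); 1 minus this is P(not VB | x)

print(p_is_vb("race", "to", "TO", weights=[0.3, 1.5], bias=-0.5))   # made-up weights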
5.3 Maximum Entropy Markov Models
Previously, the HMM tagging model was based on probabilities of the form
P(tag|tag) and P(word|tag).
- That means that if we want to include some source of knowledge in the tagging process,
we must find a way to encode that knowledge into one of these two probabilities.
- But many knowledge sources are hard to fit into these models.
For example, when tagging unknown words,
- useful features include capitalization, the presence of hyphens, word endings, and so on.
- There is no easy way to fit probabilities like P(capitalization|tag), P(hyphen|tag),
P(suffix|tag), and so on into an HMM-style model.
- For an HMM to model the most probable part-of-speech tag sequence, we rely on Bayes’ rule:
T̂ = argmaxT P(T|W) = argmaxT P(W|T) P(T)
5.3 Maximum Entropy Markov Models (Cont…)
In an MEMM, we break down the probabilities as follows;
T̂ = argmaxT P(T|W) = argmaxT ∏i P(ti | wi, ti−1)
Fig. The dependency graph for a traditional HMM (left) and for a Maximum Entropy Markov
Model (right).
In the case of the HMM, its parameters are used to maximize the likelihood of the observation
sequence (see the figure at left).
In the MEMM, by contrast, the current state St depends on the current observation Ot and on
the previous state St−1.
5.3 Maximum Entropy Markov Models [Example-1] (Cont…)
More formally, in the HMM, we compute the
probability of the state sequence given the
observations (via Bayes’ rule, dropping the constant denominator) as;
P(Q|O) = ∏i P(oi|qi) × ∏i P(qi|qi−1)
In the MEMM, we compute the probability of the
state sequence given the observations directly as;
P(Q|O) = ∏i P(qi | qi−1, oi)
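The two factorizations can be written side by side as a small sketch; the parameter tables are assumed inputs, and a real MEMM would compute each P(qi | qi−1, oi) with a MaxEnt classifier over features of (qi−1, oi) rather than a lookup table.

# Sketch: scoring a candidate state sequence under the HMM and MEMM factorizations.
from math import prod

def hmm_score(states, observations, trans, emit):
    # prod_i P(o_i|q_i) * prod_i P(q_i|q_{i-1}); "START" marks the initial state.
    return prod(emit[(q, o)] for q, o in zip(states, observations)) * \
           prod(trans[(p, q)] for p, q in zip(["START"] + states[:-1], states))

def memm_score(states, observations, local):
    # prod_i P(q_i | q_{i-1}, o_i), each factor from a local conditional model.
    return prod(local[(p, o)][q]
                for p, q, o in zip(["START"] + states[:-1], states, observations))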
(Class Presentation)
Design a case study with proper examples of;
Linear Regression,
Logistic Regression,
Maximum Entropy Markov Models.
6. HMM vs. Maximum Entropy Markov Models [Example]
Text classification: Asia or Europe
[Training data: boxed example documents for the Europe and Asia classes, built from the
words Monaco, Hong, and Kong; see the original slide.]
NB model structure: Class → X1 = M
HMM factors: P(A) = P(E) = ; P(M|A) = ; P(M|E) =
HMM predictions: P(A,M) = ; P(E,M) =
MEMM predictions: P(A|M) = ; P(E|M) =
6. HMM vs. Maximum Entropy Markov Models [Example] (Cont…)
Text classification: Asia or Europe
[Training data: boxed example documents for the Europe and Asia classes, built from the
words Monaco, Hong, and Kong; see the original slide.]
NB model structure: Class → X1 = H, X2 = K
HMM factors: P(A) = P(E) = ; P(H|A) = , P(K|A) = ; P(H|E) = , P(K|E) =
HMM predictions: P(A,H,K) = ; P(E,H,K) =
MEMM predictions: P(A|H,K) = ; P(E|H,K) =
6. HMM vs. Maximum Entropy Markov Models [Example] (Cont…)
Text classification: Asia or Europe
[Training data: boxed example documents for the Europe and Asia classes, built from the
words Monaco, Hong, and Kong; see the original slide.]
NB model structure: Class → H, K, M
HMM factors: P(A) = P(E) = ; P(H|A) = , P(K|A) = , P(M|A) = ; P(H|E) = , P(K|E) = , P(M|E) =
HMM predictions: P(A,H,K,M) = ; P(E,H,K,M) =
MEMM predictions: P(A|H,K,M) = ; P(E|H,K,M) =
6. HMM vs. Maximum Entropy Markov Models [Example] (Cont…)
NLP relevance: we often have overlapping features….
HMM models multi-count correlated evidence
• Each feature is multiplied in, even when you have multiple features telling you the same thing
Maximum Entropy models (pretty much) solve this problem
• As we will see, this is done by weighting features so that model expectations match the observed
(empirical) expectations.
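To make the double-counting concrete, here is a small numerical sketch with invented counts: two perfectly correlated features (Hong and Kong) are each multiplied in by an NB/HMM-style model, while a MaxEnt-style model can split the weight between them so the combined evidence counts only once.

# Sketch: correlated features double-counted by an NB/HMM-style model vs. a
# MaxEnt-style model that shares weight between them. All numbers are invented.
import math

# NB/HMM-style: multiply in each feature's likelihood independently.
p_hk_given_asia, p_hk_given_europe = 0.9, 0.1   # same value for "Hong" and for "Kong"
prior_asia = prior_europe = 0.5
joint_asia = prior_asia * p_hk_given_asia * p_hk_given_asia       # Hong AND Kong counted twice
joint_europe = prior_europe * p_hk_given_europe * p_hk_given_europe
print("NB-style P(Asia|Hong,Kong):", joint_asia / (joint_asia + joint_europe))   # ~0.988

# MaxEnt-style: the two correlated features can share weight (w/2 each),
# so together they contribute what one feature's worth of evidence would.
w_single = math.log(0.9 / 0.1)        # weight one such feature "deserves"
score_asia = 2 * (w_single / 2)       # Hong and Kong each get half the weight
print("MaxEnt-style P(Asia|Hong,Kong):", 1 / (1 + math.exp(-score_asia)))        # ~0.9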