ISyE 6416: Computational Statistics
Spring 2023
Lecture 8: EM algorithm and
Gaussian Mixture Model
Prof. Yao Xie
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology
Expectation-Maximization (EM) Algorithm
▶ an algorithm for computing the maximum likelihood estimator in non-ideal cases: missing data, indirect observations
▶ missing data
▶ clustering (unknown label)
▶ hidden-states in HMM
▶ latent factors
▶ replace one difficult likelihood maximization with a sequence of easier maximizations
▶ in the limit, the answer to the original problem
Applications of EM
▶ Data clustering in machine learning
▶ Natural language processing (Baum-Welch algorithm to fit hidden Markov model)
▶ Imputing missing data
General set-up
[Diagram: hidden-state space S and observation space O]
▶ we do not observe S, only observe indirectly from O
▶ Joint distribution of state and observation f (S, O|θ)
Deriving EM
▶ Introduce
Q(θ; θ′ ) = E{log f (S, O|θ)|θ′ , O}
▶ The expectation uses the conditional distribution of S given O and an assumed value of the parameter θ′
Intuition
Given O, the “best guess” we could have for S is its conditional expectation with respect to S|O, θ (a notion of projection); but computing this expectation involves the parameter value. We take a guess and improve it in the next round.
Comment on the Q function
The Q-function (the conditional expected complete-data log-likelihood):
Q(θ; θ′ ) = E{log f (S, O|θ)|θ′ , O}
▶ The expectation is taken with respect to the conditional distribution f (S|O)
▶ O: observed data
▶ In this sense, it has a Bayesian flavor: we have to compute the posterior distribution of the state given the observation
▶ θ′ : assumed value of the parameter when deriving the posterior distribution f (S|O)
▶ θ is the parameter in the “log-likelihood” log f (S, O|θ) that we will maximize with respect to
▶ θ and θ′ are generally not the same within an iteration of the algorithm
E-M algorithm
▶ E-step: compute the expectation of the complete-data log-likelihood (observed data O, unknown state S)
Q(θ; θ′ ) = E{log f (S, O|θ)|θ′ , O}
▶ M-step: compute the maximum likelihood estimate using the expectation from the previous step
E-step ⇒ M-step ⇒ E-step ⇒ M-step ⇒ · · ·
▶ stop when ∥θ_{k+1} − θ_k∥ < ϵ or |Q(θ_{k+1}|θ_k) − Q(θ_k|θ_{k−1})| < ϵ
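A minimal sketch of this generic loop, assuming user-supplied e_step and m_step routines (hypothetical names, not from the lecture) and a simple parameter-change stopping rule:

```python
import numpy as np

def em(theta0, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop. `e_step(theta)` returns whatever posterior
    expectations the M-step needs; `m_step(expectations)` returns the
    maximizer of Q(. | theta). Stops when the parameter stops moving."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(theta)      # E-step: form the ingredients of Q(. | theta)
        theta_new = np.asarray(m_step(expectations), dtype=float)  # M-step
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```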
Example: EM for missing data
n = 4, p = 2
x1 = (0, 2)T , x2 = (1, 0)T , x3 = (2, 2)T , x4 = (∗, 4)T
Assume they are i.i.d. samples from the Gaussian
N([µ1 , µ2 ]^T , diag(σ1^2 , σ2^2 ))
Use EM algorithm to impute the missing data *.
Hidden state: Missing data.
Pattern classification, R. O. Duda, P. E. Hart, and D. G. Stork
(Cont.) Example: missing data
▶ Initialization: θ0 = (0, 0, 1, 1)T , i.e., mean [0, 0]T and covariance I2 .
▶ E-step
Q(θ|θ0 ) = E_{x41}[log p(x|θ) | x1 , x2 , x3 , x42 ]
= Σ_{i=1}^{3} log p(xi |θ) + ∫ log p([x41 , 4]^T |θ) · p([x41 , 4]^T |θ0 ) dx41
= Σ_{i=1}^{3} log p(xi |θ) − (1 + µ1^2)/(2σ1^2) − (4 − µ2)^2/(2σ2^2) − log(2πσ1 σ2 )
▶ M-step
θ1 = arg max_θ Q(θ|θ0 )
(Cont.) Example: missing data - iterations
θ1 = (0.75, 2.0, 0.938, 2.0)^T ⇒ µ1 = [0.75, 2.0]^T , Σ1 = diag(0.938, 2.0)
θ2 = (1.0, 2.0, 0.667, 2.0)^T
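A short code sketch of this example, using the closed-form M-step implied by the Q-function above (not the lecture's own code); it reproduces θ1 in the first iteration and converges to the values shown for θ2:

```python
import numpy as np

# Observed data; the first coordinate of x4 is missing.
x_obs = np.array([[0., 2.], [1., 0.], [2., 2.]])
x42 = 4.0                       # observed second coordinate of x4

mu = np.array([0., 0.])         # theta0 = (0, 0, 1, 1): mean [0, 0]^T,
var = np.array([1., 1.])        # diagonal variances (1, 1)

for _ in range(100):
    # E-step: with a diagonal covariance, x41 | x42, theta_k ~ N(mu1, sigma1^2),
    # so we only need E[x41] = mu1 and E[x41^2] = sigma1^2 + mu1^2.
    e_x41 = mu[0]
    e_x41_sq = var[0] + mu[0] ** 2

    # M-step: closed-form maximizer of Q(theta | theta_k).
    mu_new = np.array([(x_obs[:, 0].sum() + e_x41) / 4,
                       (x_obs[:, 1].sum() + x42) / 4])
    var_new = np.array([
        (((x_obs[:, 0] - mu_new[0]) ** 2).sum()
         + e_x41_sq - 2 * mu_new[0] * e_x41 + mu_new[0] ** 2) / 4,
        (((x_obs[:, 1] - mu_new[1]) ** 2).sum() + (x42 - mu_new[1]) ** 2) / 4,
    ])
    if np.allclose(mu_new, mu) and np.allclose(var_new, var):
        break
    mu, var = mu_new, var_new

# First iteration: mu = (0.75, 2.0), var = (0.938, 2.0);
# the iterates converge to mu = (1.0, 2.0), var = (0.667, 2.0).
print(mu, var)
```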
The absent-minded biologist
197 animals
Distributed into 4 categories
125 18 20 34
Multinomial model with 5 categories and unknown parameter θ:
(1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4)
Can we figure out the number of “Monkey A” animals based on the data?
(Cont.) The absent-minded biologist
▶ data y = (125, 18, 20, 34)
▶ now assume y1 = y11 + y12 = 125
▶ Likelihood function
f (y|θ) = n!/(y11 ! y12 ! y2 ! y3 ! y4 !) · (1/2)^{y11} (θ/4)^{y12} ((1 − θ)/4)^{y2} ((1 − θ)/4)^{y3} (θ/4)^{y4}
▶ log-likelihood
ℓ(θ|y) ∝ (y12 + y4 ) log θ + (y2 + y3 ) log(1 − θ)
▶ y12 unknown, cannot directly maximize ℓ(θ|y)
(Cont.) The absent-minded biologist: set-up EM
Q(θ|θ′ ) = Ey12 [(y12 + y4 ) log θ + (y2 + y3 ) log(1 − θ)|y1 , . . . , y4 , θ′ ]
= (Ey12 [y12 |y1 , θ′ ] + y4 ) log θ + (y2 + y3 ) log(1 − θ)
Conditional distribution of y12 given y1 : Binomial(y1 , (θ′/4)/(θ′/4 + 1/2))
E_{y12}[y12 |y1 , θ′ ] = y1 θ′/(2 + θ′ ) := y12^{θ′}
E-step:
Q(θ|θ′ ) = (y12^{θ′} + y4 ) log θ + (y2 + y3 ) log(1 − θ)
M-step:
θ_{k+1} = arg max_θ Q(θ|θ_k) = (y12^{θ_k} + y4 ) / (y12^{θ_k} + y2 + y3 + y4 )
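A brief sketch of the resulting iteration for the counts above (not the lecture's code; the limit quoted in the comment is the numerical fixed point of this update):

```python
import numpy as np

y = np.array([125, 18, 20, 34])     # observed counts (y1, y2, y3, y4)
theta = 0.5                          # initial guess

for _ in range(200):
    # E-step: y12 | y1, theta ~ Binomial(y1, (theta/4) / (theta/4 + 1/2)),
    # so E[y12 | y1, theta] = y1 * theta / (2 + theta).
    y12 = y[0] * theta / (2 + theta)
    # M-step: maximize (y12 + y4) log(theta) + (y2 + y3) log(1 - theta).
    theta_new = (y12 + y[3]) / (y12 + y[1] + y[2] + y[3])
    if abs(theta_new - theta) < 1e-10:
        break
    theta = theta_new

print(theta)   # approaches roughly 0.627 for these counts
```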
Fitting Gaussian mixture model (GMM)
xi ∼ Σ_{c=1}^{C} πc ϕ(xi |µc , Σc )
ϕ: density of multi-variate normal
▶ parameters {µc , Σc , πc }_{c=1}^{C}
▶ assume C is known.
▶ observed data {x1 , . . . , xn }
▶ complete data {(x1 , y1 ), . . . , (xn , yn )}
yi : “label” for each sample, missing.
[Figure: samples (xi , yi ) drawn from three Gaussian components with mixing weights π1 , π2 , π3 ]
EM for GMM
▶ If we know the label information yi , the likelihood function can be easily written:
π_{yi} ϕ(xi |µ_{yi} , Σ_{yi} )
▶ now yi is unknown; compute its expectation with respect to the current set of parameters:
Q(θ|θ′ ) = Σ_{i=1}^{n} E[log π_{yi} + log ϕ(xi |µ_{yi} , Σ_{yi} )|xi , θ′ ]
E-step
▶ (πc^{(k)} , µc^{(k)} , Σc^{(k)} ): parameter values in the kth iteration
▶ we need yi |xi , the posterior distribution of the label given observation xi :
pi,c := p(yi = c|xi ) ∝ πc^{(k)} ϕ(xi |µc^{(k)} , Σc^{(k)} )
and Σ_{c=1}^{C} p(yi = c|xi ) = 1
Q(θ|θ_k) = Σ_{i=1}^{n} E[log π_{yi} + log ϕ(xi |µ_{yi} , Σ_{yi} )|xi , θ_k]
= Σ_{i=1}^{n} Σ_{c=1}^{C} pi,c log πc + Σ_{i=1}^{n} Σ_{c=1}^{C} pi,c log ϕ(xi |µc , Σc )
Q: where is θ?
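(θ enters through πc , µc , Σc inside the two sums; the pi,c themselves are computed under the fixed θ_k.) A small sketch of the responsibility computation, with made-up toy parameters and scipy used for the normal density ϕ:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-D data and current (k-th iteration) parameter guesses -- made-up values.
X = np.array([[0.0, 0.1], [3.0, 2.9], [0.2, -0.1]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])

# Unnormalized posteriors pi_c^(k) * phi(x_i | mu_c^(k), Sigma_c^(k)) ...
resp = np.column_stack([
    pis[c] * multivariate_normal.pdf(X, mean=mus[c], cov=Sigmas[c])
    for c in range(len(pis))
])
# ... normalized so each row sums to one: p_{i,c} = p(y_i = c | x_i).
resp /= resp.sum(axis=1, keepdims=True)
print(resp.round(3))
```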
M-step
▶ Maximize Q(θ|θ_k) with respect to πc , µc , Σc (note that they can be maximized separately):
θ_{k+1} = arg max_θ Q(θ|θ_k)
▶ note that Σ_{c=1}^{C} πc = 1
µc^{(k+1)} = Σ_{i=1}^{n} pi,c xi / Σ_{i=1}^{n} pi,c
Σc^{(k+1)} = Σ_{i=1}^{n} pi,c (xi − µc^{(k+1)} )(xi − µc^{(k+1)} )^T / Σ_{i=1}^{n} pi,c
πc^{(k+1)} = (1/n) Σ_{i=1}^{n} pi,c
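A sketch of these closed-form M-step updates (a plain-numpy illustration, not the lecture's implementation); alternating it with the responsibility computation above gives one full EM iteration for the GMM:

```python
import numpy as np

def m_step_gmm(X, resp):
    """Closed-form M-step for a Gaussian mixture, given the E-step
    responsibilities. X: (n, p) data; resp: (n, C) matrix of p_{i,c}
    whose rows sum to one."""
    n = X.shape[0]
    Nc = resp.sum(axis=0)                 # expected number of samples per component
    pis = Nc / n                          # pi_c^(k+1)
    mus = (resp.T @ X) / Nc[:, None]      # mu_c^(k+1): soft-weighted means
    Sigmas = []
    for c in range(resp.shape[1]):
        diff = X - mus[c]
        Sigmas.append((resp[:, c, None] * diff).T @ diff / Nc[c])  # Sigma_c^(k+1)
    return pis, mus, np.array(Sigmas)
```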
Interpretation
▶ pi,c : probability of each sample belonging to component c
▶ πc^{(k+1)} : counts the expected number of samples belonging to component c
▶ soft assignment: xi belongs to component c with assignment probability pi,c
▶ µc^{(k+1)} : “average” centroid using soft assignment
▶ Σc^{(k+1)} : “average” covariance using soft assignment
[Figure: soft-assignment probabilities P(yi = j|xi ) of a sample xi across three components, e.g., 0.5, 0.3, 0.2]
k-means
▶ K-means: “hard” assignment
▶ EM algorithm: “soft” assignment: in the end, pi,c can be viewed as a soft label for each sample; convert into a hard label:
ĉi = arg max_{c=1,...,C} pi,c
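For example (toy responsibilities, not from the lecture), the conversion is a single argmax per row:

```python
import numpy as np

# Soft assignments p_{i,c} from the E-step (rows sum to one) -- toy values.
resp = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.8, 0.1]])
hard_labels = resp.argmax(axis=1)   # most probable component per sample -> [0, 1]
```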
Demo
▶ The wine data set was introduced by Forina et al. (1986)
▶ It originally includes the results of 27 chemical measurements on 178 wines made
in the same region of Italy but derived from three different cultivars: Barolo,
Grignolino and Barbera
▶ We use the first two principal components of the data
Mixture of 3 Gaussian components
▶ First run PCA to reduce the data dimension to 2
▶ Use pi,c , c = 1, 2, 3 as the proportion of “red”, “green”, and “blue” components
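A rough approximation of this demo using scikit-learn's bundled 13-feature copy of the wine data (an assumption; the lecture used the original 27 measurements and its own EM code):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# scikit-learn ships a 13-feature version of the Forina wine data
# (178 samples, 3 cultivars); the lecture used the original 27 measurements.
X = load_wine().data
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Fit a 3-component Gaussian mixture (EM under the hood); p_{i,c} = soft labels.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X2)
p = gmm.predict_proba(X2)     # (178, 3): proportions for "red", "green", "blue"
print(p[:5].round(3))
```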
Properties of EM
▶ The EM algorithm converges to a local maximum
▶ Heuristic: escape local maxima through random restarts
▶ EM works on improving Q(θ|θ′ ) rather than directly improving log f (x|θ)
▶ one can show that improvement on Q(θ|θ′ ) improves log f (x|θ)
▶ EM works well with exponential family
▶ E-step: sum of expectations of the sufficient statistics
▶ M-step: maximizing a linear function
usually possible to derive closed-form update
Convergence of EM
▶ Proof by A. Dempster, N. Laird and D. Rubin in 1977, later generalized by C. F. J. Wu in 1983
▶ Basic idea: find a sequence of lower bounds for the likelihood function
▶ EM monotonically increases the observed-data log-likelihood:
ℓ(θk+1 ) ≥ Q(θk+1 ; θk ) ≥ Q(θk ; θk ) = ℓ(θk )
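The chain of inequalities can be made precise with the standard decomposition below (a sketch of the usual argument, not necessarily the exact proof shown in lecture):

```latex
% Decompose the observed-data log-likelihood using any reference value theta':
%   ell(theta) = log f(O | theta) = Q(theta; theta') - H(theta; theta'),
% where H(theta; theta') = E{ log f(S | O, theta) | O, theta' }.
\[
\ell(\theta_{k+1}) - \ell(\theta_k)
 = \underbrace{\bigl[Q(\theta_{k+1};\theta_k) - Q(\theta_k;\theta_k)\bigr]}_{\ge 0
   \ \text{(M-step maximizes } Q(\cdot\,;\theta_k))}
 \;-\;
 \underbrace{\bigl[H(\theta_{k+1};\theta_k) - H(\theta_k;\theta_k)\bigr]}_{\le 0
   \ \text{(Jensen's inequality)}}
 \;\ge\; 0.
\]
```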