
Expectation Maximization
(Data Mining and Warehousing)

Outline
 Expectation Maximization
 How EM works
 Applications of EM
 Example – Probabilistic Clustering
 Fuzzy Clustering Using EM
 K-means vs EM in terms of Clustering
 Gaussian Mixture Model with EM
 Optimal Number of Clusters
 Advantages and Limitations
Expectation Maximization

 An iterative approach for finding (local) maximum likelihood or maximum a posteriori (MAP) estimates of model parameters.
 Can handle models with latent (unobserved) variables.
 Gaussian mixture models are an approach to density estimation in which the parameters of the component distributions are fit using the expectation-maximization algorithm.
General MLE Process Vs. EM

 Both Maximum Likelihood Estimation (MLE) and EM can find the "best-fit" parameters, but with different methodologies.
 MLE uses all of the observed data at once to estimate the parameters; EM starts with a guess at the parameters and then alternately refines the model to fit the guesses and the observed data.

 EM – The Chicken-and-Egg Problem
- We need the parameters (mean, covariance) to know which source (cluster) each point came from.
- We need to know the source of each point to estimate those parameters.
EM Algorithm – How It Works

Below are the steps of the Expectation Maximization Algorithm:


 The E-step: Estimate the values of the latent (missing) variables, given the observed data and the current parameter estimates.
 The M-step: Re-estimate the model parameters by maximizing the expected log-likelihood computed in the E-step.
These two steps are repeated until convergence; a minimal sketch of the loop is given below.
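The sketch below makes the E-step/M-step loop concrete for a two-component univariate Gaussian mixture. It is not part of the original slides: the synthetic data, the starting values, and the fixed iteration count are assumptions made purely for illustration.

```python
# A minimal EM loop for a 1-D mixture of two Gaussians (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two 1-D Gaussians (assumed for this demo).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)])

# Initial guesses for the mixing weights, means and standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):  # a fixed iteration budget stands in for a convergence test
    # E-step: responsibility of each component for each data point.
    dens = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and variances from the responsibilities.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", pi.round(3), "means:", mu.round(3), "stds:", sigma.round(3))
```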
EM Algorithm – How It Works (Contd.)
 The EM algorithm is mostly used in probabilistic clustering methods (unsupervised), especially in Fuzzy Clustering and Probabilistic Model Based Clustering.

Applications of The EM Algorithm
 Computer Vision and Machine Learning.
 Natural Language Processing (NLP).
 Estimating the parameters of Hidden Markov Model (HMM) classifiers.
 Reconstruction of medical images.
Probabilistic Clustering

 A method for deriving clusters in which each object is assigned a probability of belonging to each cluster.
 Data objects are assumed to come from some underlying distribution.
 If the probability is interpreted as a degree of membership, then these are fuzzy clustering techniques.
Fuzzy Clustering

Given a set of objects, X = {x1,…,xn}, a fuzzy set S is a subset of X that allows each object in X to have a membership degree between 0 and 1. Formally, a fuzzy set S can be modeled as a function FS: X -> [0,1].
This concept can be applied to clustering.
 Fuzzy clustering allows an object to belong to more than one cluster.
 The clustering can be represented using a partition matrix M, where each object is assigned a membership degree for each fuzzy cluster.
 Also called soft clustering.
 Used in text mining.
Fuzzy Clustering Using EM

Consider the following six points:
a(3,3), b(4,10), c(9,6), d(14,8), e(18,11) and f(21,7)
We randomly select two points, say c1 = a and c2 = b, as the initial centers of the two clusters.
Fuzzy Clustering Using EM (Contd.)

 1st E-step: Assign objects to the clusters.
 Calculate the weight (membership degree) of each object for each cluster: wij denotes the weight of object i in cluster j.
 For any data object o and the two current centers c1 and c2, the weights are

w(o, c1) = dist(o, c2)^2 / (dist(o, c1)^2 + dist(o, c2)^2)
w(o, c2) = dist(o, c1)^2 / (dist(o, c1)^2 + dist(o, c2)^2)

For data point c(9,6): dist(c, c1)^2 = 45 and dist(c, c2)^2 = 41, so
w(c, c1) = 41/86 ≈ 0.48 and w(c, c2) = 45/86 ≈ 0.52


Fuzzy Clustering Using EM (Contd.)

1st M-step: Update the previously assigned centroids using the weighted data points,

cj = ( Σo w(o, cj)^2 * o ) / ( Σo w(o, cj)^2 ),   where j = 1, 2.

According to this formula, after the 1st iteration we get the updated centers c1(8.47, 5.12) and c2(10.42, 8.99).
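To make these numbers easy to verify, here is a small NumPy sketch (not part of the original slides) that reproduces the first E-step and M-step of this example. The weights are rounded to two decimal places before the M-step, matching the partition matrix on the slide, so the printed centers agree with the values quoted above.

```python
# Reproduce the first E-step and M-step of the six-point fuzzy clustering example.
import numpy as np

points = np.array([[3, 3], [4, 10], [9, 6], [14, 8], [18, 11], [21, 7]], dtype=float)
c1, c2 = points[0].copy(), points[1].copy()   # initial centers: a and b

# E-step: membership weights from squared Euclidean distances.
d1 = ((points - c1) ** 2).sum(axis=1)
d2 = ((points - c2) ** 2).sum(axis=1)
w1 = np.round(d2 / (d1 + d2), 2)              # weight towards cluster 1
w2 = np.round(d1 / (d1 + d2), 2)              # weight towards cluster 2
print("weights for c1:", w1)                  # [1.   0.   0.48 0.42 0.41 0.47]
print("weights for c2:", w2)                  # [0.   1.   0.52 0.58 0.59 0.53]

# M-step: centers as means weighted by the squared membership degrees.
c1 = (w1[:, None] ** 2 * points).sum(axis=0) / (w1 ** 2).sum()
c2 = (w2[:, None] ** 2 * points).sum(axis=0) / (w2 ** 2).sum()
print("updated c1:", c1.round(2))             # [8.47 5.12]
print("updated c2:", c2.round(2))             # [10.42  8.99]
```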
Fuzzy Clustering Using EM (Contd.)

Partition matrix and updated centroids after three iterations:


Comparison of Clustering Methods (k-means vs. EM)

K-means:
1. Hard clustering.
2. Based on Euclidean distance.
3. Works on numeric data only.
4. Inherently non-robust; sensitive to outliers.

Expectation Maximization:
1. Soft clustering.
2. Based on probability density.
3. Works on both nominal and numeric data.
4. Robust method.
Probabilistic Model Based Clustering and Mixture Models

Probabilistic Model Based Clustering
 Cluster analysis aims to find the hidden categories underlying the data, based on generative models.
 Each hidden category is a distribution over the data space.
 Each category represents a probabilistic cluster.

Mixture Models
 A probabilistically grounded way of doing soft clustering.
 Each cluster is described by a generative model (e.g., Gaussian or multinomial).
 The model parameters (mean, covariance, etc.) are unknown and must be estimated.
Advantages and Limitations of EM

Advantages:
 The likelihood is guaranteed not to decrease with each iteration.
 The E-step and M-step are often quite easy to implement for many problems.
 Solutions to the M-step often exist in closed form.
Limitations:
 Slow convergence.
 It may converge to a local optimum only.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
EM – Dealing with The Local Maxima Problem

 Each EM iteration increases (or at least does not decrease) the likelihood, but the algorithm has a propensity to converge to a local maximum.
 Restarting the algorithm from several different initial "parameter guesses" can be a solution, as sketched below.
 Among all the (guessed) starting parameters, we choose the run that yields the greatest likelihood.
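A minimal sketch of this restart strategy using scikit-learn's GaussianMixture; the data and parameter values below are assumed for illustration only. The n_init argument repeats the EM run from several random initializations and keeps the best-scoring one.

```python
# Random restarts for EM via scikit-learn's GaussianMixture (illustrative sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# n_init=10 repeats the EM run from 10 random starts and keeps the best fit.
gm = GaussianMixture(n_components=2, n_init=10, init_params="random", random_state=0)
gm.fit(X)
print("best average log-likelihood per sample:", gm.score(X))
print("estimated means:\n", gm.means_.round(2))
```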
References

1. Jung YG, Kang MS, Heo J. Clustering performance comparison using K-means and expectation maximization algorithms. Biotechnol Biotechnol Equip. 2014;28(sup1):S44-S48. doi:10.1080/13102818.2014.949045
2. Gupta, Ujjwal Das, Vinay Menon, and Uday Babbar. "Detecting the number of clusters during expectation-maximization clustering using information criterion." In 2010 Second International Conference on Machine Learning and Computing, pp. 169-173. IEEE, 2010.
3. https://towardsdatascience.com/a-comparison-between-k-means-clustering-and-expectation-maximization-estimation-for-clustering-8c75a1193eb7
4. https://www.geeksforgeeks.org/ml-expectation-maximization-algorithm/
5. http://www.inf.ed.ac.uk/teaching/courses/iaml/2011/slides/em.pdf
6. https://machinelearningmastery.com/expectation-maximization-em-algorithm/
7. https://www.statisticshowto.com/em-algorithm-expectation-maximization/
Thank you!!
Univariate Gaussian Mixture Model with EM

Does it look more like a sample from the yellow Gaussian, or the blue one?

Bayesian Posterior

Each cluster follows a 1-D Gaussian distribution.
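The question posed by the figure is answered by the Bayesian posterior (the responsibility) of each component. The short sketch below is illustrative only: the mixing weights, means, and standard deviations are made-up numbers, not values from the slides.

```python
# Posterior probability that a point x came from each 1-D Gaussian component.
from scipy.stats import norm

x = 2.0                      # the query point
weights = [0.6, 0.4]         # mixing proportions (yellow, blue), assumed
means = [0.0, 5.0]           # component means, assumed
stds = [1.0, 1.5]            # component standard deviations, assumed

# Bayes' rule: P(k | x) is proportional to P(k) * N(x | mu_k, sigma_k).
joint = [w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
total = sum(joint)
posterior = [j / total for j in joint]
print("P(yellow | x) = %.3f, P(blue | x) = %.3f" % tuple(posterior))
```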
Optimal Number of Gaussians (k)

 What happens if we keep increasing the number of clusters?
- The dimensionality (number of parameters) of the model also increases.
- The likelihood increases monotonically.
 What if we focus only on maximizing the likelihood, with any number of clusters allowed?
- We may end up with k = n clusters for n data points.
 An information criterion is used to select among models with different numbers of parameters p.
- It introduces a penalty term for each parameter.
- Pick the simplest adequate model.
BIC : choose p that maximizes { L – ½*p*log(n) }
AIC : choose p that minimizes { 2*p – 2*L }
(L is the maximized log-likelihood, p the number of parameters, n the number of data points)
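As an illustration of model selection with an information criterion, here is a short scikit-learn sketch (synthetic data and assumed settings, not from the slides) that fits Gaussian mixtures for several values of k and keeps the one with the lowest BIC:

```python
# Choosing the number of mixture components with BIC (illustrative sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)),
               rng.normal(5, 1, (150, 2)),
               rng.normal((0, 6), 1, (150, 2))])

# Fit mixtures with k = 1..6 components and keep the k with the lowest BIC.
bics = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics.append(gm.bic(X))           # gm.aic(X) would give the AIC instead
    print(f"k={k}: BIC={bics[-1]:.1f}")

best_k = int(np.argmin(bics)) + 1
print("k chosen by BIC:", best_k)
```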
