UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING

Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -


bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based Learning: KNN,
Gaussian mixture models and Expectation maximization

Combining multiple learners

The “no free lunch” theorem states that there is no single learning algorithm that in any
domain always induces the most accurate learner. The usual approach in ML is to try many
different models and to choose the one that performs best on a separate validation set.

Each learning algorithm dictates a specific model (e.g. SVMs, NNs, probabilistic models)
that comes with a set of inherent assumptions.

This inductive bias leads to error if the assumptions do not hold for the data.

Oftentimes we attempt to mitigate the effect of these assumptions by adopting a methodology


that makes as few a priori assumptions as possible; using multiple learners, however, presents
an alternative solution.

By suitably combining multiple learned models as an ensemble, accuracy can be improved.

This leads to two fundamental questions for ensemble learning: (1) How do we generate base-learners (i.e. learned models using the same data/learning algorithm type) that complement each other? (2) How do we combine the outputs of the base-learners for maximum accuracy?
(*) Note that not all model combinations will necessarily increase test accuracy, but learning
extra models will always increase run-time and space complexity.

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.

Ensembles differ in training strategy and combination method: 1. Parallel training with different training sets: bagging. 2. Sequential training, iteratively re-weighting training examples so the current classifier focuses on hard examples: boosting. 3. Parallel training with an objective encouraging division of labor: mixture of experts.

Ensembles minimize two sources of error: 1. Variance: error from sensitivity to small fluctuations in the training set. 2. Bias: erroneous assumptions in the model.

Ensemble methods are based on one of two basic observations: 1. Variance reduction: if the training sets are completely independent, it always helps to average an ensemble, because this reduces variance without affecting bias (e.g., bagging) -- it reduces sensitivity to individual data points.

2. Bias reduction: for simple models, an average of models has much greater capacity than a single model (e.g., hyperplane classifiers, Gaussian densities). Averaging models can reduce bias substantially by increasing capacity, and control variance by fitting one component at a time (e.g., boosting).

Lots of different combination methods: Most popular are averaging and majority voting.

Voting is an ensemble method that combines the performances of multiple models to make
predictions.
Benefits of Voting

Incorporating voting comes with many advantages.

Firstly, since voting relies on the performance of many models, it will not be hindered by large errors or misclassifications from one model. A poor performance from one model can be offset by a strong performance from other models.

By combining models to make a prediction, you mitigate the risk of one model making an inaccurate prediction by having other models that can make the correct prediction. Such an approach makes the estimator more robust and less prone to overfitting.

In classification problems, there are two types of voting: hard voting and soft voting.

Hard voting entails picking the prediction with the highest number of votes, whereas soft
voting entails combining the probabilities of each prediction in each model and picking the
prediction with the highest total probability.

Voting in regression problems is somewhat different. Instead of finding the prediction with the
highest frequency, regression models built with voting take the predictions of each model and
compute their average value to derive a final prediction.
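Both voting schemes described above can be sketched with scikit-learn's `VotingClassifier` (an assumption about tooling; the synthetic dataset and the three base models are illustrative choices, not taken from the source):

```python
# A minimal sketch of hard vs. soft voting, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
base = [("lr", LogisticRegression()),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0))]

hard = VotingClassifier(base, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(base, voting="soft").fit(X, y)  # average predicted probabilities
print(hard.predict(X[:3]), soft.predict(X[:3]))
```

Hard voting counts class labels; soft voting requires each base model to expose predicted probabilities, which it then averages.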

In either classification or regression problems, voting serves as a means to enhance predictive performance.

Drawbacks of Voting
Firstly, there are cases where an individual model can outperform a group of models.
Secondly, since voting requires the use of multiple models, it is naturally more
computationally intensive. Thus, creating, training, and deploying such models will be much
more costly.
Finally, voting is only beneficial when the machine learning classifiers perform at similar
levels. A voting estimator built from models with contrasting levels of efficiency may perform
erratically.
Ensemble Learning:

Ensemble learning is a machine learning paradigm where multiple models (often called “weak
learners”) are trained to solve the same problem and combined to get better results. The main
hypothesis is that when weak models are correctly combined we can obtain more accurate
and/or robust models.

The idea of ensemble methods is to try reducing bias and/or variance of such weak learners by
combining several of them together in order to create a strong learner (or ensemble model)
that achieves better performances.

• Ensemble methods aim at improving predictability in models by combining several models to


make one very reliable model.
• The most popular ensemble methods are boosting, bagging, and stacking.
• Ensemble methods are ideal for regression and classification, where they reduce bias and
variance to boost the accuracy of models.

Combine weak learners

In order to set up an ensemble learning method, we first need to select our base models to be
aggregated. Most of the time (including in the well known bagging and boosting methods) a
single base learning algorithm is used so that we have homogeneous weak learners that are
trained in different ways. The ensemble model we obtain is then said to be “homogeneous”.
However, there also exist methods that use different types of base learning algorithms:
heterogeneous weak learners are then combined into a "heterogeneous ensemble
model".
One important point is that our choice of weak learners should be coherent with the way we
aggregate these models. If we choose base models with low bias but high variance, it should
be with an aggregating method that tends to reduce variance whereas if we choose base
models with low variance but high bias, it should be with an aggregating method that tends to
reduce bias.

This brings us to the question of how to combine these models. We can mention three major
kinds of meta-algorithms that aim at combining weak learners:

• bagging, that often considers homogeneous weak learners, learns them independently
from each other in parallel and combines them following some kind of deterministic
averaging process
• boosting, that often considers homogeneous weak learners, learns them sequentially in a
very adaptive way (a base model depends on the previous ones) and combines them
following a deterministic strategy

• stacking, that often considers heterogeneous weak learners, learns them in parallel and
combines them by training a meta-model to output a prediction based on the different
weak models predictions

Very roughly, we can say that bagging will mainly focus on obtaining an ensemble model with
less variance than its components, whereas boosting and stacking will mainly try to produce
strong models less biased than their components (even if variance can also be reduced).

Ensemble learning helps improve machine learning results by combining several models. This
approach allows the production of better predictive performance compared to a single model.
Basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.

Types of Ensemble Classifier

Bagging and Boosting are two types of Ensemble Learning. Both decrease the variance of a
single estimate by combining several estimates from different models, so the result may be a
model with higher stability. Let's understand these two terms at a glance.
1. Bagging: a homogeneous weak learners' model in which the learners are trained
independently of each other in parallel, and their predictions are combined by model averaging.
2. Boosting: also a homogeneous weak learners' model, but it works differently from
Bagging. In this model, learners are trained sequentially and adaptively to improve the
predictions of the learning algorithm.

Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Given a
set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement
from D (i.e., a bootstrap sample). A classifier model Mi is then learned from each training set
Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes
and assigns the class with the most votes to the unknown sample X.

Implementation steps of Bagging –

1. Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.

Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-
algorithm designed to improve the stability and accuracy of machine learning algorithms used
in statistical classification and regression. It decreases the variance and helps to
avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of
the model averaging approach.

Example of Bagging
The Random Forest model uses Bagging over decision tree models, which individually have
high variance. It also makes a random feature selection when growing each tree. Several such
random trees make a Random Forest.
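The relationship between plain bagging and Random Forest can be sketched as follows (a minimal sketch assuming scikit-learn; the dataset and `n_estimators` values are illustrative):

```python
# Bagging decision trees vs. Random Forest, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: each tree is trained on a bootstrap sample of the rows.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Random Forest = bagging + random feature selection at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(bag.score(X, y), rf.score(X, y))
```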
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a
number of weak classifiers. It does so by building models in series. First, a model is built
from the training data. Then a second model is built which tries to correct the errors present
in the first model. This procedure is continued and models are added until either the complete
training data set is predicted correctly or the maximum number of models is added.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage of the weak
learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that
won the prestigious Gödel Prize. AdaBoost was the first really successful boosting algorithm
developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting
and is a very popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.

Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points and decrease the weights of
correctly classified data points. And then normalize the weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
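The reweighting loop above is exactly what AdaBoost automates. A minimal sketch assuming scikit-learn, with one-split decision stumps as the weak classifiers (the dataset and parameter values are illustrative):

```python
# AdaBoost with decision stumps as weak classifiers, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)      # a one-split "weak classifier"
ada = AdaBoostClassifier(stump, n_estimators=100, random_state=0)
ada.fit(X, y)                                    # reweights misclassified points each round
print(ada.score(X, y))
```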

Similarities Between Bagging and Boosting


Bagging and Boosting are both commonly used methods and share the universal similarity of
being classified as ensemble methods. Here we will explain the similarities between them.
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them,
i.e., majority voting).
4. Both are good at reducing variance and provide higher stability.

Differences Between Bagging and Boosting


1. Bagging: the simplest way of combining predictions that belong to the same type.
   Boosting: a way of combining predictions that belong to different types.

2. Bagging: aims to decrease variance, not bias.
   Boosting: aims to decrease bias, not variance.

3. Bagging: each model receives equal weight.
   Boosting: models are weighted according to their performance.

4. Bagging: each model is built independently.
   Boosting: new models are influenced by the performance of previously built models.

5. Bagging: different training data subsets are selected by random sampling with
   replacement from the entire training dataset.
   Boosting: every new subset contains the elements that were misclassified by previous
   models.

6. Bagging: tries to solve the over-fitting problem.
   Boosting: tries to reduce bias.

7. Bagging: if the classifier is unstable (high variance), apply bagging.
   Boosting: if the classifier is stable and simple (high bias), apply boosting.

8. Bagging: base classifiers are trained in parallel.
   Boosting: base classifiers are trained sequentially.

9. Bagging: example: the Random Forest model uses Bagging.
   Boosting: example: AdaBoost uses Boosting techniques.

Main Types of Ensemble Methods

1. Bagging

Bagging, the short form for bootstrap aggregating, is mainly applied in classification
and regression. It increases the accuracy of models through decision trees, reducing
variance to a large extent. The reduction of variance increases accuracy, mitigating
overfitting, which is a challenge for many predictive models.

Bagging consists of two steps, i.e., bootstrapping and aggregation. Bootstrapping is a
sampling technique where samples are derived from the whole population (set) using the
replacement procedure. Sampling with replacement helps make the selection procedure
randomized. The base learning algorithm is run on each of the samples to complete the
procedure.
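Sampling with replacement can be illustrated in a few lines (a toy NumPy sketch; the data values are illustrative assumptions):

```python
# A toy illustration of bootstrapping: sampling with replacement from the
# original set, so some elements repeat and some are left out.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                                     # the original set D
sample = rng.choice(data, size=len(data), replace=True)  # one bootstrap sample Di
print(sample)                    # typically contains repeated elements
print(set(data) - set(sample))   # elements left out of this bootstrap sample
```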

Aggregation in bagging combines the outcomes predicted by all the bootstrapped models.
Without aggregation, predictions would rely on a single model and not all outcomes would be
taken into consideration. The aggregation is therefore based either on the probabilities
produced by the bootstrapped models or on a vote over all outcomes of the predictive models.

Bagging is advantageous since weak base learners are combined to form a single strong
learner that is more stable than the single learners. It also reduces variance, thereby
reducing the overfitting of models. One limitation of bagging is that it is computationally
expensive, and it can introduce more bias into models when the proper bagging procedure is
ignored.

2. Boosting

Boosting is an ensemble technique that learns from previous predictor mistakes to make better
predictions in the future. The technique combines several weak base learners to form one
strong learner, thus significantly improving the predictability of models. Boosting works by
arranging weak learners in a sequence, such that each weak learner learns from the mistakes
of the previous learner in the sequence to create better predictive models.
Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and
XGBoost (Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision
trees, which mostly include one split that is popularly known as decision stumps. AdaBoost’s
main decision stump comprises observations carrying similar weights.

Gradient boosting adds predictors sequentially to the ensemble, where each new predictor
corrects the errors of its predecessors, thereby increasing the model's accuracy. New
predictors are fit to counter the effects of errors in the previous predictors. Gradient
descent helps the gradient booster identify problems in the learners' predictions and counter
them accordingly.

XGBoost makes use of gradient-boosted decision trees, providing improved speed and
performance. It relies heavily on computational speed and on the performance of the target
model. Because model training must follow a sequence, the implementation of gradient-boosted
machines is comparatively slow.
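The sequential residual-correcting behaviour described above can be sketched with scikit-learn's gradient boosting (an assumption about tooling; the regression dataset and hyperparameter values are illustrative):

```python
# A minimal sketch of gradient boosting for regression, assuming scikit-learn.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, noise=5.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=2, random_state=0)
gbr.fit(X, y)             # trees are added one at a time, each fit to the
print(gbr.score(X, y))    # residual errors left by the trees before it
```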

3. Stacking

Stacking, another ensemble method, is often referred to as stacked generalization. This
technique works by allowing a training algorithm to combine the predictions of several other
learning algorithms. Stacking has been successfully implemented in regression, density
estimation, distance learning, and classification. It can also be used to measure the error
rate involved during bagging.

Variance Reduction

Ensemble methods are ideal for reducing the variance in models, thereby increasing the
accuracy of predictions. Variance is reduced when multiple models are combined to form a
single prediction chosen from all the possible predictions of the combined models. An
ensemble of models combines various models to ensure that the resulting prediction is the
best possible, based on the consideration of all predictions.

FIG. Stacking algorithm. The number of weak learners in the stack is variable.

Stacking: While bagging and boosting use homogeneous weak learners for the ensemble,
stacking often considers heterogeneous weak learners, learns them in parallel, and combines
them by training a meta-learner to output a prediction based on the different weak learners'
predictions. The meta-learner takes the weak learners' predictions as input features, with the
target being the ground-truth values in the data D (Fig. 2); it attempts to learn how to best
combine the input predictions to make a better output prediction.

In an averaging ensemble, e.g. Random Forest, the model combines the predictions from
multiple trained models. A limitation of this approach is that each model contributes the same
amount to the ensemble prediction, irrespective of how well the model performed. A
generalization of this approach replaces the linear weighted sum with Linear Regression
(regression problems) or Logistic Regression (classification problems) to combine the
predictions of the sub-models with any learning algorithm. This approach is called stacking:
an algorithm takes the outputs of the sub-models as input and attempts to learn how to best
combine the input predictions to make a better output prediction.
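This meta-learner arrangement can be sketched with scikit-learn's `StackingClassifier` (an assumption about tooling; the two heterogeneous weak learners and the dataset are illustrative choices):

```python
# Stacking heterogeneous weak learners under a Logistic Regression
# meta-learner, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
weak = [("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0))]
stack = StackingClassifier(estimators=weak,
                           final_estimator=LogisticRegression())
stack.fit(X, y)            # the meta-learner is trained on the weak
print(stack.score(X, y))   # learners' cross-validated predictions
```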

Stacking for Machine Learning

Fig. A stacked model with meta-learner = Logistic Regression and weak learners = 4 neural
networks.
Unsupervised learning:

Unsupervised learning is a type of machine learning in which models are trained on an
unlabeled dataset and are allowed to act on that data without any supervision.

The goal of unsupervised learning is to find the underlying structure of a dataset, group
the data according to similarities, and represent the dataset in a compressed format.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from data.
o Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding output, so
to solve such cases we need unsupervised learning.

Fig. Unsupervised learning

Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. This unlabeled input data is fed to the machine
learning model in order to train it. First, it will interpret the raw data to find the hidden
patterns in the data and then will apply suitable algorithms such as k-means clustering,
hierarchical clustering, etc.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between the
data objects and categorizes them as per the presence and absence of those
commonalities.
o Association: An association rule is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the set of
items that occur together in the dataset. Association rules make marketing strategy
more effective; for example, people who buy item X (say, bread) also tend to purchase
item Y (butter/jam). A typical example of association rules is Market Basket Analysis.

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning as it
does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data
is not labeled, and algorithms do not know the exact output in advance.
K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled


dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of the algorithm is to minimize the sum of distances between each data point and its
corresponding cluster centroid.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k must be predetermined in
this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids (they may be points other than those in the input
dataset).

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of
its cluster.

Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.

Step-7: The model is ready.
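The steps above can be sketched with scikit-learn's `KMeans` (an assumption about tooling; the blob dataset and K=3 are illustrative choices):

```python
# K-means clustering on synthetic blob data, assuming scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the K final centroids
print(km.labels_[:10])       # cluster assignment for each data point
```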

Instance Based Learning:

The machine learning systems categorized as instance-based learning are systems that learn
the training examples by heart and then generalize to new instances based on some similarity
measure. It is called instance-based because it builds its hypotheses from the training
instances. It is also known as memory-based learning or lazy learning (because such systems
delay processing until a new instance must be classified).

Advantages:
1. Instead of estimating the target function for the entire instance set, local
approximations can be made to the target function.
2. This algorithm can adapt easily to new data, which is collected as we go.

Disadvantages:

1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building
a local model from scratch.

Some of the instance-based learning algorithms are:


1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
There are several methods available for clustering:
• K Means Clustering
• Hierarchical Clustering
• Gaussian Mixture Models

K Nearest Neighbor (KNN)

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, but we want to know whether it is a cat or a dog. For this identification, we can use
the KNN algorithm, as it works on a similarity measure.
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
o Step-6: Our model is ready.
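The steps above can be sketched with scikit-learn's `KNeighborsClassifier` (an assumption about tooling; K=5 and the synthetic dataset are illustrative choices):

```python
# K-nearest neighbors classification, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: choose the number K
knn.fit(X, y)                               # "lazy": fit just stores the data
print(knn.predict(X[:5]))                   # Steps 2-5 happen at query time
```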

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.

Gaussian mixture models

What is a Gaussian Mixture Model?

Sometimes our data has multiple distributions, i.e., multiple peaks. It does not always have
one peak, and one can notice this by looking at the dataset: there may be two or more peak
points, with the data rising and falling two, three, or even four times. If there are multiple
Gaussian distributions that can represent this data, then we can build what is called
a Gaussian Mixture Model.

In other words, if we have three Gaussian distributions GD1, GD2, GD3 with means µ1, µ2, µ3
and variances σ1², σ2², σ3², then for a given set of data points the GMM will identify the
probability of each data point belonging to each of these distributions.
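This identification can be sketched with scikit-learn's `GaussianMixture` (an assumption about tooling; the two-peaked 1-D data and K=2 are illustrative choices):

```python
# Fitting a two-component Gaussian Mixture Model, assuming scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data with two peaks: samples drawn from two different Gaussians.
X = np.concatenate([rng.normal(-3, 1.0, 200),
                    rng.normal(4, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())        # estimated mean mu_k of each component
print(gmm.predict_proba(X[:3]))  # probability of belonging to each component
```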

It is a probability distribution that consists of multiple probability distributions, i.e.,
multiple Gaussians.

The probability density function of a d-dimensional Gaussian distribution is defined as:

    N(x | µ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number
of clusters is known and it is K). So the mean µk and covariance Σk are also estimated for
each k. Had it been only one distribution, they would have been estimated by the
maximum-likelihood method. But since there are K such clusters, the probability density is
defined as a linear function of the densities of all these K distributions, i.e.

    p(x) = Σ_{k=1..K} π_k · N(x | µ_k, Σ_k),

where the mixing coefficients π_k satisfy Σ_k π_k = 1.

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization algorithm can also be used to predict the values of latent
variables (variables that are not directly observable and are instead inferred from the values
of other observed variables), provided that the general form of the probability distribution
governing those latent variables is known to us. This algorithm underlies many unsupervised
clustering algorithms in the field of machine learning.

Expectation Maximization Algorithm: EM can be used for variables that are not directly
observable but can be deduced from the values of other observed variables. It can be used
with unlabeled data for classification, and it is one of the popular approaches to maximizing
the likelihood.

Basic idea of the EM algorithm: given a set of incomplete data and a set of starting
parameters:

E-Step: Using the given data and the current values of the parameters, estimate the values of
the hidden data.

M-Step: After the E-step, use the completed data to update the parameters by maximizing the
expected likelihood.

The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood
estimates for model parameters when the data is incomplete, has missing data points, or has
hidden variables. EM chooses some random values for the missing data points and estimates a
new set of data. These new values are then used recursively to estimate a better fit, by
filling in the missing points, until the values converge.
These are the two basic steps of the EM algorithm, namely the E Step, or Expectation Step
or Estimation Step, and M Step, or Maximization Step.

Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset, estimate
(guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step
is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

Estimation step
Initialize the means µk, covariances Σk, and mixing coefficients πk with random values, with
K-means clustering results, or with hierarchical clustering results. Then, for those given
parameter values, estimate the values of the latent variables (the responsibilities γk).
Maximization Step
Update the values of the parameters calculated using the ML method.
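The E-step/M-step loop above can be written out by hand for a 1-D two-component Gaussian mixture (a NumPy sketch; the data, initial values, and variable names are illustrative assumptions):

```python
# A hand-rolled EM loop for a 1-D two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(5, 1, 150)])

# Initial guesses for means, std devs, and mixing coefficients.
mu, sigma, pi = np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities gamma_k = P(point came from component k)
    dens = pi * normal_pdf(x[:, None], mu, sigma)   # shape (n, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibility-weighted data
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu)   # estimated component means
```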
Usage of EM algorithm –
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of Hidden Markov Model
(HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
• It is guaranteed that the likelihood will increase with each iteration.
• The E-step and M-step are often quite easy to implement for many problems.
• Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
• It has slow convergence.
• It converges to local optima only.
• It requires both the forward and backward probabilities (numerical optimization requires
only the forward probability).
