UNIT-IV
Voting
• The simplest way to combine multiple classifiers is by voting, which corresponds to taking
a linear combination of the learners. Voting is an ensemble machine learning algorithm.
• For regression, a voting ensemble involves making a prediction that is the average of
multiple other regression models.
• In classification, a hard voting ensemble involves summing the votes for crisp class labels
from other models and predicting the class with the most votes. A soft voting ensemble
involves summing the predicted probabilities for class labels and predicting the class label
with the largest sum probability.
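As a minimal sketch (the label and probability arrays below are placeholders standing in for the outputs of already-trained base models), both voting rules can be written directly with NumPy:

import numpy as np

# Predicted labels from three already-trained classifiers (placeholder values)
# for four samples; classes are 0, 1, 2.
hard_preds = np.array([[0, 1, 1, 2],
                       [0, 1, 2, 2],
                       [1, 1, 1, 2]])

# Hard voting: each model casts one vote per sample; the majority label wins.
hard_vote = np.array([np.bincount(col, minlength=3).argmax()
                      for col in hard_preds.T])
print(hard_vote)   # [0 1 1 2]

# Soft voting: sum the predicted class probabilities (placeholder values for
# one sample from the same three models) and take the argmax of the sums.
soft_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.3, 0.4, 0.3]])
soft_vote = soft_probs.sum(axis=0).argmax()
print(soft_vote)   # 1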
• Fig. shows Base-learners with their outputs.
• In these methods, the first step is to create multiple classification/regression models using some training dataset. Each base model can be created using different splits of the same training dataset and the same algorithm, using the same dataset with different algorithms, or by any other method.
• Learn multiple alternative definitions of a concept using different training data or different learning algorithms, then combine the decisions of the multiple definitions, e.g. by weighted voting.
Fig. shows general idea of Base-learners with model combiner.
• When combining multiple independent and diverse decisions each of which is at least more
accurate than random guessing, random errors cancel each other out, and correct decisions
are reinforced. Human ensembles are demonstrably better.
• Use a single, arbitrary learning algorithm but manipulate training data to make it learn
multiple models.
• The problem here is that if one of the base-learners makes an error, there may be a misclassification because the class code words are too similar. So the approach in error-correcting output codes is to have L > K and increase the Hamming distance between the code words.
• One possibility is pairwise separation of classes where there is a separate base-learner to
separate Ci from Cj, for i < j.
• For pairwise separation, L = K(K − 1)/2 base-learners are needed, one for each class pair (as sketched below).
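As a small illustration of the pairwise scheme (the logistic-regression base-learner and the function name are placeholders, not the text's prescribed choice), one binary separator is trained for each of the K(K − 1)/2 class pairs:

from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression  # placeholder base-learner

def train_pairwise(X, y):
    """Train one binary base-learner per class pair (Ci, Cj), i < j."""
    classes = np.unique(y)
    learners = {}
    for ci, cj in combinations(classes, 2):
        mask = (y == ci) | (y == cj)                  # keep only the two classes
        learners[(ci, cj)] = LogisticRegression().fit(X[mask], y[mask])
    return learners   # len(learners) == K*(K-1)//2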
Boosting
• Boosting is a very different method of generating multiple predictions (function estimates) and combining them linearly. Boosting refers to a general and provably effective method of producing a very accurate classifier by combining rough and moderately inaccurate rules of thumb.
• Boosting was originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5. The final result is the weighted sum of the results of the weak classifiers.
• A learner is weak if it produces a classifier that is only slightly better than random guessing,
while a learner is said to be strong if it produces a classifier that achieves a low error with
high confidence for a given concept.
• It was later revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance. Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.
• Boosting is a bias reduction technique. It typically improves the performance of a single
tree model. A reason for this is that we often cannot construct trees which are sufficiently
large due to thinning out of observations in the terminal nodes.
• Boosting is then a device to come up with a more complex solution by taking a linear combination of trees. In the presence of high-dimensional predictors, boosting is also very useful as a regularization technique for additive or interaction modeling.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak
learner. The boosting algorithm repeatedly calls this weak learner, each time feeding it a
different distribution over the training data. Each call generates a weak classifier and we must
combine all of these into a single classifier that, hopefully, is much more accurate than any
one of the rules.
• Train a set of weak hypotheses: h1,..., hT. The combined hypothesis H is a weighted
majority vote of the T weak hypotheses. During the training, focus on the examples that are
misclassified.
AdaBoost:
• AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the prestigious Gödel Prize in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance.
• It can be used to learn weak classifiers; the final classification is based on a weighted vote of the weak classifiers.
• It is a linear classifier (a linear combination of the weak classifiers) with all its desirable properties. It has good generalization properties.
• The idea is to use the weak learner to form a highly accurate prediction rule by calling the weak learner repeatedly on different distributions over the training examples.
• Initially, all weights are set equally, but on each round the weights of incorrectly classified examples are increased, so that the observations the previous classifier predicted poorly receive greater weight on the next iteration, as in the sketch below.
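A minimal sketch of this reweighting loop, using depth-one decision trees (stumps) from scikit-learn as the weak learners; the number of rounds T and the ±1 label encoding are assumptions, not part of the text above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """y is assumed to contain labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # start with equal example weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)                    # increase weight of misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    # Final classifier: weighted (linear) combination of the weak classifiers.
    return lambda X_new: np.sign(sum(a * s.predict(X_new) for a, s in zip(alphas, stumps)))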
• Advantages of AdaBoost:
1. Very simple to implement
2. Fairly good generalization
3. The prior error need not be known ahead of time.
• Disadvantages of AdaBoost:
1. Suboptimal solution.
2. Can overfit in the presence of noise.
Boosting Steps:
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified by C1 to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, and use them to train a third weak learner C3.
4. Combine all the weak learners via majority voting, as sketched below.
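A rough sketch of these four steps (the stump weak learner, the subset size m, integer class labels, and non-empty subsets at every step are all assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # placeholder weak learner

def three_learner_boost(X, y, m, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: random subset d1 (without replacement) -> weak learner C1.
    idx1 = rng.choice(len(y), m, replace=False)
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

    # Step 2: second subset, half of it drawn from examples C1 misclassified -> C2.
    wrong = np.flatnonzero(C1.predict(X) != y)
    right = np.flatnonzero(C1.predict(X) == y)
    idx2 = np.concatenate([rng.choice(wrong, min(m // 2, len(wrong)), replace=False),
                           rng.choice(right, m // 2, replace=False)])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

    # Step 3: train C3 on the examples where C1 and C2 disagree (assumed non-empty).
    d3 = np.flatnonzero(C1.predict(X) != C2.predict(X))
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[d3], y[d3])

    # Step 4: combine the three weak learners by majority vote (integer labels assumed).
    def predict(X_new):
        votes = np.stack([C1.predict(X_new), C2.predict(X_new), C3.predict(X_new)])
        return np.array([np.bincount(col).argmax() for col in votes.T])
    return predict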
Advantages of Boosting:
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of Boosting:
1. Prone to over-fitting.
2. Requires careful tuning of different hyperparameters.
Stacking
• Stacking, sometimes called stacked generalization, is an ensemble machine learning method
that combines multiple heterogeneous base or component models via a meta-model.
• The base models are trained on the complete training data, and then the meta-model is trained on the predictions of the base models. The advantage of stacking is the ability to explore the solution space with different models on the same problem.
• The stacking-based model can be visualized in levels and has at least two levels of models. The first level typically trains two or more base learners (which can be heterogeneous), and the second level might be a single meta-learner that takes the base models' predictions as input and gives the final result as output. A stacked model can have more than two such levels, but increasing the levels doesn't always guarantee better performance.
• In the classification tasks, often logistic regression is used as a meta learner, while linear
regression is more suitable as a meta learner for regression-based tasks.
• Stacking is concerned with combining multiple classifiers generated by different learning algorithms L1,..., LN on a single dataset S, which is composed of examples si = (xi, yi), where xi is a feature vector and yi its class label.
• The stacking process can be broken into two phases:
1. Generate a set of base-level classifiers C1,..., CN, where Ci = Li(S).
2. Train a meta-level classifier to combine the outputs of the base-level classifiers.
• Fig. shows stacking frame.
• The training set for the meta-level classifier is generated through a leave-one-out cross-validation process: for i = 1, ..., n and k = 1, ..., N, a classifier Cki = Lk(S − si) is learned on the dataset with the i-th example held out.
• The learned classifiers are then used to generate predictions for si: ŷki = Cki(xi)
• The meta-level dataset consists of examples of the form ((ŷ1i, ..., ŷNi), yi), where the features are the predictions of the base-level classifiers and the class is the correct class of the example in hand.
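A minimal two-level sketch, using scikit-learn estimators as placeholder base and meta learners and 5-fold out-of-fold predictions instead of strict leave-one-out to keep it short:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def stack(X, y, X_test):
    base_learners = [DecisionTreeClassifier(), SVC()]   # heterogeneous level-0 models
    # Phase 1: out-of-fold predictions of each base learner become meta-features.
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_learners])
    # Phase 2: the meta-learner (logistic regression for classification)
    # is trained on those predictions.
    meta = LogisticRegression().fit(meta_X, y)
    # At prediction time the base learners are refit on the full training set.
    fitted = [m.fit(X, y) for m in base_learners]
    meta_X_test = np.column_stack([m.predict(X_test) for m in fitted])
    return meta.predict(meta_X_test)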
• Why do ensemble methods work? They are based on one of two basic observations:
1. Variance reduction: If the training sets are completely independent, it always helps to average an ensemble, because this reduces variance without affecting bias (e.g. bagging) and reduces sensitivity to individual data points.
2. Bias reduction: For simple models, an average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, and variance can be controlled by fitting one component at a time.
Adaboost
• AdaBoost, also referred to as adaptive boosting, is a method in machine learning used as an ensemble method. The most common algorithm used with AdaBoost is a decision tree with one level, meaning a decision tree with only one split. These trees are also referred to as decision stumps.
Clustering
• To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids with the number of components in each cluster.
• Cluster centroid: The centroid of a cluster is a point whose parameter values are the mean
of the parameter values of all the points in the cluster. Each cluster has a well defined
centroid.
• Distance: The distance between two points is taken as a common metric to assess the similarity among the components of a population. The commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ..., pk) and q = (q1, q2, ..., qk) as
d(p, q) = √((p1 − q1)² + (p2 − q2)² + ... + (pk − qk)²)
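For instance, both quantities can be computed directly (the points are placeholder values):

import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])   # points assigned to one cluster
centroid = cluster.mean(axis=0)                            # mean of each parameter -> [2.0, 4.0]

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
d = np.sqrt(np.sum((p - q) ** 2))                          # Euclidean distance = 5.0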
• The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
• Clustering analysis helps construct meaningful partitionings of a large set of objects. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, etc.
• Clustering algorithms may be classified as listed below:
1. Exclusive clustering
2. Overlapping clustering
3. Hierarchical clustering
4. Probabilistic clustering.
• A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
• Clustering techniques types: The major clustering techniques are,
a) Partitioning methods
b) Hierarchical methods
c) Density-based methods.
• Firstly, we choose the number of neighbours; here we select k = 5.
• Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It may be calculated as shown in the sketch below.
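A small sketch of this distance computation for k = 5 (the stored points and the query point are placeholders):

import numpy as np

X = np.random.rand(20, 2)            # stored data points (placeholder)
query = np.array([0.5, 0.5])         # new point to be compared against them

# Euclidean distance from the query to every stored point.
dists = np.sqrt(((X - query) ** 2).sum(axis=1))

# Indices of the k = 5 nearest neighbours.
nearest = np.argsort(dists)[:5]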
Expectation-maximization
• In Gaussian mixture models, the expectation-maximization method is a powerful tool for estimating the parameters of the mixture. The expectation step is termed E and the maximization step is termed M.
• The E step computes the expected values of the latent variables that indicate which mixture component each data point belongs to, and the M step uses these expectations to re-estimate the parameters of each Gaussian component.
• The Expectation-Maximization (EM) algorithm is used in maximum likelihood estimation
where the problem involves two sets of random variables of which one, X, is observable and
the other, Z, is hidden.
• The goal of the algorithm is to find the parameter vector ϕ that maximizes the likelihood of the observed values of X, L(ϕ | X).
• But in cases where this is not feasible, we associate the extra hidden variables Z and express the underlying model using both, to maximize the likelihood of the joint distribution of X and Z, the complete likelihood Lc(ϕ | X, Z).
• Expectation-maximization (EM) is an iterative method used to find maximum likelihood
estimates of parameters in probabilistic models, where the model depends on unobserved,
also called latent, variables.
• EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.
• The parameters found on the M step are then used to start another E step, and the process is
repeated until some criterion is satisfied. EM is frequently used for data clustering like for
example in Gaussian mixtures.
• In the Expectation step, find the expected values of the latent variables (here you need to
use the current parameter values).
• In the Maximization step, first plug in the expected values of the latent variables in the log-
likelihood of the augmented data. Then maximize this log-likelihood to reevaluate the
parameters.
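A compact sketch of these two steps for a one-dimensional, two-component Gaussian mixture (the data, the initial parameter values, and the fixed number of iterations are placeholders):

import numpy as np
from scipy.stats import norm

x = np.random.randn(200)                                  # observed data (placeholder)
pi = np.array([0.5, 0.5])                                 # mixing weights
mu = np.array([-1.0, 1.0])                                # component means
sigma = np.array([1.0, 1.0])                              # component standard deviations

for _ in range(50):
    # E-step: expected value of the latent component indicator for each point
    # (the "responsibility" of each Gaussian for each observation).
    dens = pi * norm.pdf(x[:, None], mu, sigma)           # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters by maximizing the expected
    # complete-data log-likelihood with the responsibilities held fixed.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)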
• Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ in a model.
• The expectation-maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed.
• The EM algorithm is an elegant and powerful method for finding the maximum likelihood estimates of models with hidden variables. The key concept in the EM algorithm is that it iterates between the expectation step (E-step) and maximization step (M-step) until convergence.
• In the E-step, the algorithm estimates the posterior distribution of the hidden variables Q
given the observed data and the current parameter settings; and in the M-step the algorithm
calculates the ML parameter settings with Q fixed.
• At the end of each iteration, the lower bound on the likelihood is maximized with respect to the parameters (M-step) and the bound is then made tight, i.e. set equal to the likelihood (E-step), which guarantees an increase in the likelihood and convergence to a local maximum, or the global maximum if the likelihood function is unimodal.
• Generally, EM works best when the fraction of missing information is small and the
dimensionality of the data is not too large. EM can require many iterations, and higher
dimensionality can dramatically slow down the E-step.
• EM is useful for several reasons: conceptual simplicity, ease of implementation, and the fact that each iteration improves the log-likelihood ℓ(θ). The rate of convergence on the first few steps is typically quite good, but can become excruciatingly slow as you approach local optima.
• Sometimes the M-step is a constrained maximization, which means that there are
constraints on valid solutions not encoded in the function itself.
• Expectation maximization is an effective technique that is often used in data analysis to
manage missing data. Indeed, expectation maximization overcomes some of the limitations of
other techniques, such as mean substitution or regression substitution. These alternative
techniques generate biased estimates and, specifically, underestimate the standard errors. Expectation maximization overcomes this problem.