Lec 12

The document discusses generative models in machine learning, focusing on maximum likelihood estimation (MLE) and the differences between generative and discriminative algorithms. It highlights the advantages and disadvantages of generative algorithms, such as their ability to work with limited data and robustness to feature corruption. Additionally, it introduces more complex models like Gaussian mixtures and the Expectation Maximization (EM) algorithm for optimizing these models with latent variables.

Generative Models

General Recipe for MLE Algorithms

Given a problem with label set $\mathcal{Y}$, find a way to map data features $\mathbf{x}$ to PMFs $\mathbb{P}[y \mid \mathbf{x}, \boldsymbol{\theta}]$ with support $\mathcal{Y}$
The notation $\boldsymbol{\theta}$ captures parameters in the model (e.g. weight vectors, bias terms)
For binary classification, $\mathcal{Y} = \{-1, +1\}$ and the PMF has two entries; for multiclassification, $\mathcal{Y} = [C]$ and the PMF has $C$ entries
The function $\boldsymbol{\theta} \mapsto \mathbb{P}[y \mid \mathbf{x}, \boldsymbol{\theta}]$ is often called the likelihood function
The function $\boldsymbol{\theta} \mapsto -\log \mathbb{P}[y \mid \mathbf{x}, \boldsymbol{\theta}]$ is called the negative log likelihood function
Given data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$, find the model parameters that maximize the likelihood function, i.e. parameters that think the training labels are very likely:
$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^n \mathbb{P}[y^i \mid \mathbf{x}^i, \boldsymbol{\theta}] = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^n -\log \mathbb{P}[y^i \mid \mathbf{x}^i, \boldsymbol{\theta}]$
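To make the recipe concrete, here is a minimal sketch in Python, assuming (our choice for illustration, not something fixed by the slides) a sigmoid likelihood map for binary labels in $\{-1,+1\}$; the MLE is found by numerically minimizing the negative log likelihood on made-up data.

```python
# Minimal MLE-recipe sketch (illustrative): sigmoid likelihood for y in {-1,+1},
# i.e. P[y | x, w] = 1 / (1 + exp(-y * w^T x)), fitted by minimizing the NLL.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])                      # made-up "true" parameters
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true)), 1, -1)

def nll(w):
    # negative log likelihood: sum_i -log P[y_i | x_i, w], computed stably
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

w_mle = minimize(nll, x0=np.zeros(d)).x                  # numerical MLE
print("estimated parameters:", np.round(w_mle, 2))
```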
Generative Models

So far, we looked at probability theory as a tool to express the belief of an ML algorithm that the true label is such and such
Likelihood: given a model $\boldsymbol{\theta}$, it tells us $\mathbb{P}[y \mid \mathbf{x}, \boldsymbol{\theta}]$
We also looked at how to use probability theory to express our beliefs about which models are preferred by us and which are not
Prior: this just tells us $\mathbb{P}[\boldsymbol{\theta}]$
Notice that in all of this, the data features were always considered constant and never questioned as being random or flexible
Can we also talk about $\mathbb{P}[\mathbf{x}]$?
Generative Algorithms

ML algos that can learn distributions of the form $\mathbb{P}[\mathbf{x}]$ or $\mathbb{P}[\mathbf{x} \mid y]$ or $\mathbb{P}[\mathbf{x}, y]$
A slightly funny bit of terminology used in machine learning
Discriminative Algorithms: those that only use $\mathbb{P}[y \mid \mathbf{x}]$ to do their stuff
Generative Algorithms: those that use $\mathbb{P}[\mathbf{x}]$ or $\mathbb{P}[\mathbf{x}, y]$ etc. to do their stuff
Generative algorithms have their advantages and disadvantages
More expensive: slower train times, slower test times, larger models
An overkill: often we need only $\mathbb{P}[y \mid \mathbf{x}]$ to make predictions – discriminative algos are enough!
More frugal: can work even if we have very little training data (e.g. RecSys)
More robust: can work even if features are corrupted, e.g. some features missing
A recent application of generative techniques (GANs etc.) allows us to generate novel examples of a certain class of data points
A very simple generative model

Given a few feature vectors $\mathbf{x}^1, \dots, \mathbf{x}^n \in \mathbb{R}^d$ (never mind labels for now)
We wish to learn a probability distribution with support over $\mathbb{R}^d$
This distribution should capture interesting properties about the data in a way that allows us to do things like generate similar-looking feature vectors etc.
Let us try to learn a standard (unit-covariance) Gaussian as this distribution, i.e. we wish to learn $\boldsymbol{\mu} \in \mathbb{R}^d$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, I_d)$ explains the data well
One way is to look for a $\boldsymbol{\mu}$ that achieves maximum likelihood, i.e. MLE!!
As before, assume that our feature vectors were independently generated; the log likelihood is
$\sum_{i=1}^n \log \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}, I_d) = -\frac{1}{2}\sum_{i=1}^n \|\mathbf{x}^i - \boldsymbol{\mu}\|_2^2 + \text{const}$
which, upon applying first order optimality, gives us $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^i$
We just learnt $\mathcal{N}(\hat{\boldsymbol{\mu}}, I_d)$ as our generating dist. for data features!
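A small numerical sketch of this result on made-up 2-D data: under a unit-covariance Gaussian, the MLE of $\boldsymbol{\mu}$ is simply the sample mean.

```python
# MLE for the mean of N(mu, I_d): the sample mean (illustrative synthetic data).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=np.eye(d), size=n)

mu_hat = X.mean(axis=0)          # first-order optimality: mu_hat = (1/n) sum_i x_i

# sanity check: the sample mean scores at least as well as a nearby guess
log_lik = lambda mu: multivariate_normal(mean=mu, cov=np.eye(d)).logpdf(X).sum()
print(mu_hat.round(2), log_lik(mu_hat) >= log_lik(mu_hat + 0.1))
```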
A more powerful generative model

Suppose we are not satisfied with the above simple model
Suppose we wish to instead learn a $\boldsymbol{\mu} \in \mathbb{R}^d$ as well as a $\sigma > 0$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, \sigma^2 \cdot I_d)$ explains the data well
Log likelihood function (be careful – cannot ignore any terms now):
$\mathcal{LL}(\boldsymbol{\mu}, \sigma) = \sum_{i=1}^n \log \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}, \sigma^2 I_d) = -\frac{nd}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n \|\mathbf{x}^i - \boldsymbol{\mu}\|_2^2$
F.O. optimality w.r.t. $\boldsymbol{\mu}$, i.e. $\nabla_{\boldsymbol{\mu}} \mathcal{LL} = \mathbf{0}$, gives us $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^i$
F.O. optimality w.r.t. $\sigma$, i.e. $\frac{\partial \mathcal{LL}}{\partial \sigma} = 0$, gives us $\hat{\sigma}^2 = \frac{1}{nd}\sum_{i=1}^n \|\mathbf{x}^i - \hat{\boldsymbol{\mu}}\|_2^2$
Since this stationary point satisfies $\sigma > 0$ (whenever the data points are not all identical), this must be the global opt. too!
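A similar sketch for the $\mathcal{N}(\boldsymbol{\mu}, \sigma^2 I_d)$ model on made-up data: the closed-form MLE updates are the sample mean and the average squared deviation over all $n \cdot d$ coordinates.

```python
# MLE for N(mu, sigma^2 * I_d) via the closed-form updates derived above.
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(loc=5.0, scale=2.0, size=(n, d))    # made-up data: mu = (5,5,5), sigma = 2

mu_hat = X.mean(axis=0)
sigma2_hat = np.sum((X - mu_hat) ** 2) / (n * d)   # MLE of sigma^2 (divide by n*d, no n-1 correction)
print(mu_hat.round(2), round(float(np.sqrt(sigma2_hat)), 2))
```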
A still more powerful generative model

Suppose we wish to instead learn a $\boldsymbol{\mu} \in \mathbb{R}^d$ as well as a $\Sigma \succ 0$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ explains the data well (the notation $\succeq$ stands for PSD)
Log likelihood function:
$\mathcal{LL}(\boldsymbol{\mu}, \Sigma) = \sum_{i=1}^n \log \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}, \Sigma) = -\frac{n}{2}\log\det(2\pi\Sigma) - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}^i - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}^i - \boldsymbol{\mu})$
F.O.O. w.r.t. $\boldsymbol{\mu}$, i.e. $\nabla_{\boldsymbol{\mu}} \mathcal{LL} = \mathbf{0}$, gives $\Sigma^{-1}\sum_{i=1}^n (\mathbf{x}^i - \boldsymbol{\mu}) = \mathbf{0}$
This definitely gives $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^i$ when $\Sigma \succ 0$, i.e. when $\Sigma^{-1}$ exists
We may have other solutions in some other funny cases (e.g. a singular $\Sigma$), which basically means there may be multiple optima for this problem
F.O. optimality w.r.t. $\Sigma$, i.e. $\frac{\partial \mathcal{LL}}{\partial \Sigma} = 0$, requires more work
A still more powerful generative model

For a square matrix $A$, its trace is defined as the sum of its diagonal elements: $\operatorname{tr}(A) = \sum_i A_{ii}$
Easy result: if $\mathbf{v} \in \mathbb{R}^d$, then $\mathbf{v}^\top A \mathbf{v} = \operatorname{tr}(AB)$ where $B = \mathbf{v}\mathbf{v}^\top$
Not so easy result: if $B$ is a constant matrix, then $\frac{\partial \operatorname{tr}(AB)}{\partial A} = B^\top$
Recall: dims of derivatives always equal those of the quantity w.r.t. which the deriv is taken
Let us denote $S(\boldsymbol{\mu}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^i - \boldsymbol{\mu})(\mathbf{x}^i - \boldsymbol{\mu})^\top$ for convenience
New expression: $\mathcal{LL}(\boldsymbol{\mu}, \Sigma) = -\frac{n}{2}\log\det(2\pi\Sigma) - \frac{n}{2}\operatorname{tr}\!\left(\Sigma^{-1} S(\boldsymbol{\mu})\right)$
A still more powerful generative model

For any matrices of compatible dimensions we have the following (see "The Matrix Cookbook" – reference section on course webpage – for these results)
Symmetry: $\operatorname{tr}(AB) = \operatorname{tr}(BA)$
Linearity: $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$ and $\operatorname{tr}(cA) = c\operatorname{tr}(A)$
New expression: $\mathcal{LL}(\hat{\boldsymbol{\mu}}, \Sigma) = -\frac{n}{2}\log\det(2\pi\Sigma) - \frac{n}{2}\operatorname{tr}\!\left(\Sigma^{-1} S(\hat{\boldsymbol{\mu}})\right)$ where $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^i$ (assume $\Sigma$ symmetric)
F.O.O. w.r.t. $\Sigma$, i.e. $\frac{\partial \mathcal{LL}}{\partial \Sigma} = 0$, gives $-\frac{n}{2}\Sigma^{-1} + \frac{n}{2}\Sigma^{-1} S(\hat{\boldsymbol{\mu}})\, \Sigma^{-1} = 0$, which gives $\hat{\Sigma} = S(\hat{\boldsymbol{\mu}}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^i - \hat{\boldsymbol{\mu}})(\mathbf{x}^i - \hat{\boldsymbol{\mu}})^\top$
Since $\hat{\Sigma}$ is PSD as well as symmetric, this must be the global optimum!
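A sketch of the full-covariance case on synthetic data: $\hat{\Sigma}$ is the (divide-by-$n$) sample covariance, and as a sanity check the code verifies that a small symmetric perturbation of $\hat{\Sigma}$ can only lower the likelihood, consistent with it being the optimum.

```python
# Full-covariance Gaussian MLE: mu_hat = sample mean, Sigma_hat = sample covariance.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n, d = 2000, 2
true_cov = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=true_cov, size=n)

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n                 # note: divide by n, not n-1

def log_lik(mu, Sigma):
    return multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

perturb = 0.05 * np.array([[1.0, 0.2], [0.2, 1.0]])   # small symmetric PSD bump
print(Sigma_hat.round(2))
print(log_lik(mu_hat, Sigma_hat) >= log_lik(mu_hat, Sigma_hat + perturb))  # True
```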
MAP, Bayesian Generative Models?

The previous techniques allow us to learn the parameters of a Gaussian distribution (either $\boldsymbol{\mu}$, or $(\boldsymbol{\mu}, \sigma^2)$, or $(\boldsymbol{\mu}, \Sigma)$) that offer the highest likelihood of observed data features by computing the MLE
We can incorporate priors over $\boldsymbol{\mu}$ (e.g. Gaussian, Laplacian), priors over $\sigma^2$ (e.g. the inverse Gamma dist. which has support only over non-negative numbers) and over $\Sigma$ (e.g. the inverse Wishart dist. which has support only over PSD matrices) and compute the MAP
We can also perform full-blown Bayesian inference by computing posterior distributions over these quantities – calculations involving the predictive posterior get messy – beyond scope of CS771
However, we can make generative models more powerful in other ways
Still more powerful generative model?

Suppose we are concerned that a single Gaussian cannot capture all the variations in our data
Just as in LwP, when we realized that sometimes a single prototype is not enough
Can we learn 2 (or more) Gaussians to represent our data instead?
Such a generative model is often called a mixture of Gaussians
The Expectation Maximization (EM) algorithm is a very powerful technique for performing this and several other tasks
Soft clustering, learning Gaussian mixture models (GMM)
Robust learning, mixed regression
Learning a Mixture of Two Gaussians

We suspect that instead of one Gaussian, two Gaussians are involved in generating our feature vectors
Let us call them $\mathcal{N}(\boldsymbol{\mu}_1, I_d)$ and $\mathcal{N}(\boldsymbol{\mu}_2, I_d)$
Each of these is called a component of this GMM
Covariance matrices, and more than two components, can also be incorporated
Since we are unsure which data point came from which component, we introduce a latent variable $z^i \in \{1, 2\}$ per data point to denote this
This means that if someone tells us that $z^i = 1$, the first Gaussian is responsible for that data point and consequently the likelihood expression is $\mathbb{P}[\mathbf{x}^i \mid z^i = 1] = \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_1, I_d)$; similarly, if someone tells us that $z^i = 2$, the second Gaussian is responsible for that data point and the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_2, I_d)$
The English word "latent" means hidden or dormant or concealed
Nice name since this variable describes something that was hidden from us
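A small sketch of this two-component setup, assuming identity-covariance components and made-up candidate means: for each point we can evaluate the likelihood under either component and read off which latent value would explain it best.

```python
# Per-point likelihoods under each of the two identity-covariance components.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
mu1, mu2 = np.array([0.0, 0.0]), np.array([5.0, 5.0])   # illustrative candidate means
X = np.vstack([rng.normal(mu1, 1.0, size=(50, 2)),
               rng.normal(mu2, 1.0, size=(50, 2))])

lik1 = multivariate_normal(mean=mu1, cov=np.eye(2)).pdf(X)   # P[x | z = 1]
lik2 = multivariate_normal(mean=mu2, cov=np.eye(2)).pdf(X)   # P[x | z = 2]
z_hat = np.where(lik1 >= lik2, 1, 2)    # most likely latent value per point
print(z_hat[:5], z_hat[-5:])
```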
MLE with Latent Variables

We wish to obtain the maximum (log) likelihood models, i.e.
$\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2]$
Since we do not know the values of the latent variables, we force them into the expression using the law of total probability:
$\mathbb{P}[\mathbf{x}^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2] = \sum_{z^i \in \{1,2\}} \mathbb{P}[\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2]$
We did a similar thing (introducing models) in predictive posterior calculations
Very difficult optimization problem – NP-hard in general
However, two heuristics exist which work reasonably well in practice
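The sketch below writes out this marginalized log likelihood for the two-component model, assuming (an assumption made here for concreteness) equal mixing weights of $1/2$; note the "log of a sum" structure that makes direct maximization hard.

```python
# Marginal log likelihood of a two-component mixture with equal mixing weights:
# log P[x_i] = log( 0.5 * N(x_i; mu_1, I) + 0.5 * N(x_i; mu_2, I) ).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([5, 5], 1.0, size=(50, 2))])

def marginal_log_lik(mu1, mu2):
    log_p1 = multivariate_normal(mean=mu1, cov=np.eye(2)).logpdf(X) + np.log(0.5)
    log_p2 = multivariate_normal(mean=mu2, cov=np.eye(2)).logpdf(X) + np.log(0.5)
    # law of total probability over z, done stably in log space
    return logsumexp(np.stack([log_p1, log_p2]), axis=0).sum()

print(marginal_log_lik(np.array([0., 0.]), np.array([5., 5.])))   # good guess
print(marginal_log_lik(np.array([2., 2.]), np.array([3., 3.])))   # worse guess
```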
Heuristic 1: Alternating Optimization

Convert the original optimization problem
$\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \log \left( \sum_{z^i \in \{1,2\}} \mathbb{P}[\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2] \right)$
to a double maximization problem (assume the mixture proportions are constant)
$\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \max_{z^i \in \{1,2\}} \log \mathbb{P}[\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2]$
The most important difference between the original and the new problem is that the original has a log of a sum, which is very difficult to optimize, whereas the new problem gets rid of this and looks simply like an MLE problem. We know how to solve MLE problems very easily!
The intuition behind reducing things to a double optimization is that it may mostly be the case that only one of the terms in the summation will dominate, and if this is the case, then approximating the sum by the largest term should be okay
In several ML problems with latent vars, although the above double optimization problem is (still) difficult, the following two problems are easy
Step 1: Fix $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ and update the latent variables to their optimal values, i.e. $\hat{z}^i = \arg\max_{z^i \in \{1,2\}} \mathbb{P}[\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2]$
Step 2: Fix the latent variables and update $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ to their optimal values
Keep alternating between step 1 and step 2 till you are tired or till the process has converged!
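A sketch of Heuristic 1 for the two-component, identity-covariance mixture on synthetic data: Step 1 assigns each point to its most likely component (here, the nearer mean), Step 2 refits each mean on its assigned points, and the two steps alternate until convergence. The data, initial means, and iteration cap are illustrative choices.

```python
# Alternating optimization ("hard" assignments) for a two-component mixture.
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([5, 5], 1.0, size=(100, 2))])

mu = np.array([[-1.0, 1.0], [6.0, 4.0]])        # rough initial guesses for mu_1, mu_2
for _ in range(20):                             # "till tired or converged"
    # Step 1: with identity covariances, argmax_z P[x, z] = nearest mean
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # shape (n, 2)
    z = dists.argmin(axis=1)
    # Step 2: with assignments fixed, the MLE of each mean is a per-cluster mean
    # (assumes each component keeps at least one point, true for this data)
    new_mu = np.array([X[z == k].mean(axis=0) for k in (0, 1)])
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

print(mu.round(2))      # close to the true means (0, 0) and (5, 5)
```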
Heuristic 1 at Work

As discussed before, we assume a mixture of two Gaussians $\mathcal{N}(\boldsymbol{\mu}_1, I_d)$ and $\mathcal{N}(\boldsymbol{\mu}_2, I_d)$
Step 1 becomes: $\hat{z}^i = \arg\min_{k \in \{1,2\}} \|\mathbf{x}^i - \boldsymbol{\mu}_k\|_2$ (assign each point to its closer mean)
Step 2 becomes: $\boldsymbol{\mu}_k = \frac{1}{n_k}\sum_{i: \hat{z}^i = k} \mathbf{x}^i$
Thus, $\boldsymbol{\mu}_1 = \frac{1}{n_1}\sum_{i: \hat{z}^i = 1} \mathbf{x}^i$ and $\boldsymbol{\mu}_2 = \frac{1}{n_2}\sum_{i: \hat{z}^i = 2} \mathbf{x}^i$, where $n_k$ is the number of data points for which $\hat{z}^i = k$
Repeat!
Isn't this like the k-means clustering algorithm?
Not just "like" – this is the k-means algorithm! This means that the k-means algorithm is one heuristic way to compute an MLE which is difficult to compute directly!
Indeed! Notice that even here, instead of choosing just one value of the latent variables at each time step, we can instead use a distribution over their support
I have a feeling that the second heuristic will also give us something familiar!
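Since the two steps above are exactly the assignment and mean-update steps of k-means, a library k-means run should recover essentially the same centers; a quick check (assuming scikit-learn is available) on the same kind of synthetic data:

```python
# k-means as a library call: its centers match the hard-assignment heuristic's means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([5, 5], 1.0, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
print(centers.round(2))     # close to the true means (0, 0) and (5, 5)
```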
