Lecture 5 Variational Autoencoder
● Variational autoencoders (VAEs) are generative models, like Generative Adversarial Networks.
● Their association with this group of models derives mainly from the architectural affinity with the basic
autoencoder (the final training objective has an encoder and a decoder), but their mathematical
formulation differs significantly.
● VAEs are directed probabilistic graphical models (DPGMs) whose posterior is approximated by a neural
network, forming an autoencoder-like architecture (the formulation is sketched just after this list).
● Unlike discriminative modeling, which aims to learn a predictor given the observation, generative
modeling tries to simulate how the data are generated, in order to understand the underlying causal
relations.
● Causal relations indeed have the great potential of being generalisable.
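Concretely, in standard VAE notation (the slides do not spell this out), the generative model and its approximate posterior can be written as:

```latex
p_\theta(x, z) = p_\theta(x \mid z)\, p(z), \qquad p(z) = \mathcal{N}(0, I), \qquad
q_\phi(z \mid x) \approx p_\theta(z \mid x),
```

where the decoder network parameterizes the likelihood p_theta(x|z), the encoder network parameterizes the approximate posterior q_phi(z|x), and the prior over the latent code z is a standard Gaussian.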
In a lot of real-world problems, we have a whole bunch of data that we're looking at. It could be images, or text,
or audio, whatever it is. But the underlying process that generated the data could be much simpler, living in a
much lower dimensional space than the actual data that we're looking at. So a lot of techniques in machine
learning try to compress the dimensionality of your data into a smaller space. One very popular technique
that's used a lot in recent papers is the variational autoencoder. This is going to be a pretty technical lecture,
so I hope you're ready to dive into the mechanics of variational autoencoders.
So before we dive into the mechanics of variational autoencoders, I first want to introduce normal
autoencoders. I'm going to assume that you're already familiar with standard neural network architectures and
things like backpropagation. What an autoencoder does is take some kind of input data with a very high
dimensionality, it could be an image or a vector, anything at all, run it through a neural network, and try to
compress the data into a smaller representation.
It does this with two main components. The first component is what we call the encoder. The encoder is
simply a stack of layers, either fully connected or convolutional, that take the input and compress it down to a
smaller representation with fewer dimensions than the input; this is what we call the bottleneck.
The second component is the decoder, which tries to reconstruct the input from the bottleneck, again using
fully connected or convolutional layers.
And the loss function for training an autoencoder simply compares the reconstructed version at the end of
your decoder network with your input.
By computing pixel-to-pixel differences between the reconstruction and the input, we get a reconstruction
loss, and we can start training our network to compress images.
And so obviously you have simple autoencoders that use fully connected layers, but you can just as well swap
them out for convolutional layers if you're working with images or something like audio, for example.
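To make this concrete, here is a minimal sketch in TensorFlow/Keras, not the exact code from the lecture: it uses MNIST images flattened into 784-dimensional vectors, and the layer sizes are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Load MNIST and flatten each 28x28 image into a 784-dimensional vector in [0, 1].
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Encoder: compress the 784-dimensional input down to a small bottleneck.
inputs = layers.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(inputs)
bottleneck = layers.Dense(32, activation="relu")(h)   # the compressed representation

# Decoder: reconstruct the input from the bottleneck.
h = layers.Dense(256, activation="relu")(bottleneck)
outputs = layers.Dense(784, activation="sigmoid")(h)  # pixel values back in [0, 1]

autoencoder = Model(inputs, outputs)
# Reconstruction loss: compare output pixels to input pixels (the target is the input itself).
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```

Swapping the Dense layers for convolutional ones would give the convolutional variant mentioned above; the training setup stays the same.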
And if you look at what's going on here, if you train a deep convolutional network to do encoding and decoding
of a whole bunch of images, you're actually creating a whole new kind of compression algorithm. And Google is
actually thinking of using these types of networks for reducing the amount of bandwidth that you use on your
phone.
So if you download an image, then the full resolution image is first downscaled, then it's sent to you over the
wireless internet connection, and then in your phone there is actually a decoder that reconstructs the full
resolution image from the compressed representation.
And if you apply this to something like MNIST, for example, then it's very interesting to see what these hidden
representations are actually learning.
So here you can see a bunch of images where on the left side you can see the input digits that are being fed
through the network, and then on the right side all of those are reconstructed images. But you can see what
happens if we change the size of the hidden representation.
So if we use only a 2D hidden representation, that means that our bottleneck, you know, in the middle of the
network, is only two variables. Then we get reconstructions that look pretty okay, but they are very fuzzy, and
the fuzziness is because you force the entire information of your image to go through just two variables, so
when you reconstruct, you obviously lose some of that detail, and that is why the images look so fuzzy.
If you use more dimensions in your latent representation, you can get reconstructions that are much clearer
and much sharper but you need more information in that bottleneck.
And it's interesting to note that the exact same technique is applied to image segmentation as well. So you take
an input image, you run it through your convolutional encoder, it goes through a bottleneck representation, and
then it gets remapped to a full output image. But in this case, instead of reconstructing the original image,
you're actually trying to target a segmented version of your image.
And it's exactly this type of network that is used in self-driving cars to segment the different parts of the public
road into specific objects that a car needs to detect.
Okay so that's the basic idea behind autoencoders, but there are a few very clever tricks that you can apply to
an autoencoder to have it do some really fancy stuff.
So imagine that you start with a normal MNIST digit. It's a clean image, nothing's wrong with it.
But then you add a whole lot of noise to it, and you run that noisy image through your encoder network. You
get through the bottleneck representation and then you try to reconstruct the image.
But instead of reconstructing the noisy image, what you're going to do is try and reconstruct the original clean
image. And if you train this network on a whole bunch of these noisy MNIST digits, you force the encoding step
to actually get rid of the noise. And this is what we call a denoising autoencoder.
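As a minimal sketch of that idea, reusing the autoencoder and x_train from the earlier sketch (the noise level of 0.3 is just an illustrative choice, not from the lecture):

```python
import numpy as np

# Corrupt the inputs with Gaussian noise, but keep the clean images as targets.
noise = 0.3 * np.random.randn(*x_train.shape)
x_train_noisy = np.clip(x_train + noise, 0.0, 1.0)

# Same architecture and reconstruction loss as before; only the input changes:
# the network sees the noisy image but is asked to reproduce the clean one.
autoencoder.fit(x_train_noisy, x_train, epochs=10, batch_size=128)
```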
And so you can see here that by using this approach you can actually train a denoising auto-encoder that is very
good at removing noise from input images. And denoising images isn't the only thing that you can do with this
type of approach.
So in this case for example you take an input and instead of adding noise to it, you simply crop a rectangular
area out of the image, and you throw it away. You replace it with white or black pixels. You feed that input image
through the network, and you try to reconstruct the original full image.
And this technique is what we call neural inpainting: you take a small part of the image, throw it away, and
then ask the network to reconstruct whatever was there in the original image.
And with this approach you can do simple things like removing watermarks from images. But you could also
remove a parked car for example if you are filming on a movie set in a natural setting.
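A minimal sketch of that setup, again reusing x_train and the autoencoder from above (the 8x8 patch size and the choice of black pixels are arbitrary illustrative choices):

```python
import numpy as np

def mask_random_patch(images, patch=8):
    """Blank out a random square patch in each 28x28 image (here: black pixels)."""
    masked = images.copy().reshape(-1, 28, 28)
    for img in masked:
        top = np.random.randint(0, 28 - patch)
        left = np.random.randint(0, 28 - patch)
        img[top:top + patch, left:left + patch] = 0.0
    return masked.reshape(images.shape)

# Masked images go in, but the training target is still the full original image.
x_train_masked = mask_random_patch(x_train)
autoencoder.fit(x_train_masked, x_train, epochs=10, batch_size=128)
```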
Okay so now that we have the basic concept behind a normal autoencoder, let's introduce variational
autoencoders. The idea behind variational autoencoders is that instead of mapping any input to a fixed
vector, you want to map your input onto a distribution. And so the only thing that's different in a variational
autoencoder is that your normal bottleneck vector z is replaced by two separate vectors: one representing the
mean of your distribution, and the other one representing the standard deviation of that distribution. And so
whenever you need a vector to feed through your decoder network, the only thing you have to do is take a
sample from the distribution and then feed it to the decoder.
And so to train a variational autoencoder, the loss function in this case actually consists of two terms.
The first term represents the reconstruction loss, so this is really the same as the autoencoder step, except that
here there is an expectation operator because we are sampling from a distribution. The second part of the
loss function is what we call the KL divergence. I'm not going to go into all of the details, because there is a lot
of math involved, but basically what you want to make sure is that the distribution you're learning is not too
far removed from a normal (Gaussian) distribution. So you're going to try and force your latent distribution to
stay relatively close to a mean of zero and a standard deviation of one.
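Written out in standard VAE notation (the lecture does not show this formula explicitly, but it is the standard objective), the loss for a single input x is:

```latex
\mathcal{L}(\theta, \phi; x)
  = \underbrace{-\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)}_{\text{regularizer}},
\qquad
D_{\mathrm{KL}} = -\tfrac{1}{2} \sum_j \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big).
```

The closed-form KL term on the right is what pulls the means mu_j toward zero and the standard deviations sigma_j toward one.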
And so before we can start training our variational autoencoder, there is one final trick that we have to use.
Because if you look at the computation graph of our network right now, we have a problem: in the middle of
the network, after the bottleneck, we have a sampling operation. There is a node there that takes a sample
from a distribution and then feeds that sample through the decoder. But the problem is that you cannot run
backpropagation through a sampling node; you cannot push gradients through it. So in order to run your
gradients through the entire network and train everything end-to-end, we're going to use what we call the
reparameterization trick.
And the trick goes as follows. If you look at the latent vector that you're sampling, you can actually write that
vector as a fixed mu, which is a parameter that you're learning, plus some sigma, which is also a parameter
that you're learning, multiplied by an epsilon, and this epsilon is where we're going to put the stochastic part.
This epsilon is always going to be standard Gaussian: it always has zero mean and a standard deviation of one.
We're going to sample that epsilon, multiply it by sigma, add mu, and we have our latent vector.
And the clever thing here is that mu and sigma are the only things that we actually want to train, so those are
where we have to be able to compute gradients and run backpropagation. But that epsilon doesn't really
matter, because we never want to change it: epsilon is a fixed stochastic node. It's still stochastic, but we don't
have to run backpropagation through it, so it doesn't matter that it's a sampling operation.
And so this is the reparameterization trick: instead of having a full stochastic node that blocks all of your
gradients because you can't do backpropagation through it, you split it up into a part that you can backprop
through and another part that is still stochastic but that you don't want to train, because it's fixed.
Pretty clever, right?
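As a minimal sketch of the trick (assuming the encoder outputs a mean and a log-variance, which is a common convention; sample_latent is my own name, not something from the lecture):

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I).
    Gradients flow through mu and log_var; epsilon is just fixed standard-normal noise."""
    epsilon = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * epsilon
```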
So let's take a quick look at some code in TensorFlow. Here you can see the encoder network, which is training
two sets of parameters: the means and the standard deviations of our distribution.
And then in the actual autoencoder we do a sampling operation from that distribution to actually get our
latent vector.
And then you can see where the KL divergence is computed, after which you compute your loss and backprop
through it.
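The code shown on screen is not reproduced in these notes, but a minimal TensorFlow/Keras sketch along the same lines (layer sizes, the 784-dimensional input, and the binary cross-entropy reconstruction term are my own illustrative assumptions; it reuses sample_latent and x_train from the sketches above) could look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2

# Encoder: two output heads, the means and the log-variances of q(z|x).
enc_in = layers.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mu = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
encoder = Model(enc_in, [z_mu, z_log_var])

# Decoder: maps a sampled latent vector back to pixel space.
dec_in = layers.Input(shape=(latent_dim,))
h = layers.Dense(256, activation="relu")(dec_in)
dec_out = layers.Dense(784, activation="sigmoid")(h)
decoder = Model(dec_in, dec_out)

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        mu, log_var = encoder(x)
        z = sample_latent(mu, log_var)      # reparameterization trick from above
        x_hat = decoder(z)
        # Reconstruction term: binary cross-entropy summed over the 784 pixels.
        recon = 784 * tf.reduce_mean(tf.keras.losses.binary_crossentropy(x, x_hat))
        # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I).
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var), axis=1))
        loss = recon + kl
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

# Training loop over mini-batches of x_train:
for batch in tf.data.Dataset.from_tensor_slices(x_train).batch(128):
    train_step(batch)
```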
Alright and so before we go and look at some visual results of what you can do with variational autoencoders, I
want to note one final thing. There is a new class of variational autoencoders with a lot of promising results,
called disentangled variational autoencoders. The basic idea behind this disentanglement is that you want to
make sure that the different neurons in your latent representation are uncorrelated, that they all try to learn
something different about the input data. And to implement this, the only thing you have to change is to add
one hyperparameter (usually called beta) to your loss function that weighs how strongly the KL divergence
counts in the loss, as sketched below. And so in the disentangled version, the autoencoder will only use a
specific latent variable if it really has a benefit, and if it doesn't benefit the compression, it will simply stick to
the standard normal prior.
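In the same notation as before (again a standard formulation from the disentanglement literature rather than something shown in the lecture), the only change is the weight beta on the KL term:

```latex
\mathcal{L}_\beta(\theta, \phi; x)
  = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;+\; \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big), \qquad \beta \geq 1.
```

Setting beta = 1 recovers the ordinary variational autoencoder; a larger beta pushes each latent variable harder toward the standard normal prior, so a variable only gets "used" when it pays for itself in reconstruction quality.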
So in order to show the results of a disentangled representation, let's look at a very simple data set.
The data set consists of images that are generated from four latent factors. So you have the x position, the y
position, the size of the objects, and the rotation of the object. And by just picking a sample from that
distribution, you have four values, you can just generate an image that is generated from exactly that hidden
representation. And then the idea is if you train a disentangled variational autoencoder, what you would like to
see is that the autoencoder is able to reconstruct and come up with that exact mapping of those four latent
variables to encode the information in its inputs.
And it turns out that if you use the normal loss function of a variational autoencoder, it simply comes up with a
whole bunch of latent variables, but it's not really finding exactly those latent variables that we used to
generate the images.
But if you disentangle your representations, it gets much closer. So here on the left side, you can see that by
increasing that beta factor in your autoencoder, you're actually forcing your auto encoder to map the
information onto only a few of those latent variables. So instead of using all ten of them, the autoencoder only
uses five of the latent variables to encode the information. And you can see that the first one represents the Y
position, the second is the x position, then you have the scale, which is the third one, and then in fact there are
two latent representations that the autoencoder used for representing the rotation of an object.
But interestingly, all the other latent representations, even though they are there, are still stuck at the standard
Gaussian prior, and this is because they weren't really necessary to encode the information of our input.
And here you can see some really interesting results where researchers at Google DeepMind applied
variational autoencoders to their DeepMind Lab environment. So you can see a 3D world where an agent can
sort of run around. And what they did is compress the input images that the agent is seeing into the latent
space and then reconstruct them. But what you can also do is start changing the latent variables and then see
what happens to the reconstruction. And it turns out that if you use a disentangled variational autoencoder,
then changing the latent variables actually corresponds to some very interpretable things. So here you can see
that changing the first latent variable changes the color of the floor but nothing else. And then there are other
latent variables that correspond to turning to the left or turning to the right, and there are even some that
change the rotation and the identity of specific objects that the agent is looking at.
And in contrast, if you don't use this disentanglement, then whenever you start changing a latent
representation, everything starts blurring up in the image, and it's not really clear what this latent vector was
trying to encode.
And so I think this image sums everything up. On the left side, you have a disentangled variational autoencoder,
and you can see that if you change the first dimension in your latent space, then the face is rotated but nothing
else changes. If you do the same thing in a normal variational autoencoder, the face also rotates but you can
see that a lot of other stuff is changing as well.
And then as a comparison, on the right side you can see the results for a Generative Adversarial Network.
And so the holy grail of disentangled variational autoencoders is to have some kind of network that can extract
very useful, causal features from a very high dimensional space, and then use those features for whatever task
it's trying to learn. And the hope is then that those learned features will also generalize to domains outside of
your training data.
And so one of the common domains where people are trying to apply variational autoencoders is, for example,
reinforcement learning, because the whole problem in reinforcement learning is that you have very sparse
rewards and it takes a really long time to train anything. So by using a variational autoencoder as a sort of
feature extractor, the hope is that you can actually run your agent on the compressed representation instead
of on the full input space.
And so when using this in practice, there is actually a very clear trade-off. If you disentangle the latent space
too little, then your network tends to overfit: you give it too much freedom, so it can just learn how to
reconstruct your training data, but it won't generalize to unseen data in new cases. On the other hand,
if you disentangle too much, then you actually lose a lot of the high definition detail in your input, and this can
actually hurt performance in a lot of applications. So personally I find this a really interesting idea, and I'm very
curious if this will lead to some types of networks that can learn to extract very useful low dimensional
information from very high dimensional spaces. Because in the end we want to train agents that are able to
understand the world by compressing a whole lot of information and then learning useful behavior on that
latent space.