Jakub M. Tomczak

Deep Generative Modeling

Springer, 2022
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my beloved wife Ewelina,
my parents, and brother.
Foreword
In the last decade, with the advance of deep learning, machine learning has made
enormous progress. It has completely changed entire subfields of AI such as
computer vision, speech recognition, and natural language processing. And more
fields are being disrupted as we speak, including robotics, wireless communication,
and the natural sciences.
Most advances have come from supervised learning, where the input (e.g., an
image) and the target label (e.g., a “cat”) are available for training. Deep neural
networks have become uncannily good at predicting objects in visual scenes and
translating between languages. But obtaining labels to train such models is often
time consuming, expensive, unethical, or simply impossible. That’s why the field
has come to the realization that unsupervised (or self-supervised) methods are key
to making further progress.
This is no different for human learning: when human children grow up, the
amount of information that is consumed to learn about the world is mostly
unlabeled. How often does anyone really tell you what you see or hear in the
world? We must learn the regularities of the world unsupervised, and we do this
by searching for patterns and structure in the data.
And there is lots of structure to be learned! To illustrate this, imagine that we
choose the three color values of each pixel of an image uniformly at random. With
overwhelmingly large probability, the resulting image will look like gibberish.
The vast majority of image space is filled with images that do not look like anything
we see when we open our eyes. This means that there is a huge amount of structure
that can be discovered, and so there is a lot to learn for children!
Of course, kids do not just stare into the world. Instead, they constantly interact
with it. When children play, they test their hypotheses about the laws of physics,
sociology, and psychology. When predictions are wrong, they are surprised and
presumably update their internal models to make better predictions next time. It is
reasonable to assume that this interactive play of an embodied intelligence is key to
at least arrive at the type of human intelligence we are used to. This type of learning
has clear parallels with reinforcement learning, where machines make plans, say to
play a game of chess, observe if they win or lose, and update their models of the
world and strategies to act in them.
But it’s difficult to make robots move around in the world to test hypotheses and
actively acquire their own annotations. So, the more practical approach to learning
with lots of data is unsupervised learning. This field has gained a huge amount of
attention and has seen stunning progress recently. One only needs to look at the
kind of images of non-existent human faces that we can now generate effortlessly
to experience the uncanny sense of progress the field has made.
Unsupervised learning comes in many flavors. This book is about the kind we
call probabilistic generative modeling. The goal of this subfield is to estimate a
probabilistic model of the input data. Once we have such a model, we can generate
new samples from it (i.e., new images of faces of people that do not exist).
A second goal is to learn abstract representations of the input. This latter field is
called representation learning. The high-level representations self-organize the input
into “disentangled” concepts, which could be the objects we are familiar with, such
as cars and cats, and their relationships.
While disentangling has a clear intuitive meaning, it has proven to be a
rather slippery concept to properly define. In the 1990s, people were thinking of
statistically independent latent variables. The goal of the brain was to transform the
highly dependent pixel representation into a much more efficient and less redundant
representation of independent latent variables, which compresses the input and
makes the brain more energy and information efficient.
Learning and compression are deeply connected concepts. Learning requires
lossy compression of data because we are interested in generalization and not in
storing the data. At the level of datasets, machine learning itself is about transferring
a tiny fraction of the information present in a dataset into the parameters of a model
and forgetting everything else.
Similarly, at the level of a single datapoint, when we process for example an
input image, we are ultimately interested in the abstract high-level concepts present
in that image, such as objects and their relations, and not in detailed, pixel-level
information. With our internal models we can reason about these objects, manipulate
them in our head and imagine possible counterfactual futures for them. Intelligence
is about squeezing out the relevant predictive information from the correlated soup of
pixel-level information that hits our senses and representing that information in a
useful manner that facilitates mental manipulation.
But the objects that we are familiar with in our everyday lives are not really
all that independent. A cat that is chasing a bird is not statistically independent
of it. And so, people also made attempts to define disentangling in terms of
(subspaces of) variables that exhibit certain simple transformation properties when
we transform the input (a.k.a. equivariant representations), or as variables that one
can independently control in order to manipulate the world around us, or as causal
variables that are activating certain independent mechanisms that describe the world,
and so on.
The simplest way to train a model without labels is to learn a probabilistic
generative model (or density) of the input data. There are a number of techniques
able function under this barrage of fake news? One thing is certain, this field is one
of the hottest in town, and this book is an excellent introduction to start engaging
with it. But everyone should be keenly aware that mastering this technology comes
with new responsibilities towards society. Let’s progress the field with caution.
Preface

We live in a world where Artificial Intelligence (AI) has become a widely used
term: there are movies about AI, journalists writing about AI, and CEOs talking
about AI. Most importantly, there is AI in our daily lives, turning our phones,
TVs, fridges, and vacuum cleaners into smartphones, smart TVs, smart fridges,
and vacuum robots. We use AI; however, we still do not fully understand what
“AI” is and how to formulate it, even though AI was established as a separate
field in the 1950s. Since then, many researchers have pursued the holy grail of creating
an artificial intelligence system that is capable of mimicking, understanding, and
aiding humans through processing data and knowledge. In many cases, we have
succeeded in outperforming human beings on particular tasks in terms of speed and
accuracy! Current AI methods do not necessarily imitate human processing (neither
biologically nor cognitively) but rather aim at making quick and accurate
decisions, like navigating while cleaning a room or enhancing the quality of a displayed
movie. In such tasks, probability theory is key, since limited or poor-quality data
or the intrinsic behavior of a system forces us to quantify uncertainty. Moreover, deep
learning has become a leading learning paradigm that allows learning hierarchical
data representations. It draws its motivation from biological neural networks;
however, the correspondence between deep learning and biological neurons is rather
far-fetched. Nevertheless, deep learning has brought AI to the next level, achieving
state-of-the-art performance in many decision-making tasks. The next step seems
to be a combination of these two paradigms, probability theory and deep learning,
to obtain powerful AI systems that are able to quantify their uncertainties about
environments they operate in.
What Is This Book About Then? This book tackles the problem of formulating AI
systems by combining probabilistic modeling and deep learning. Moreover, it goes
beyond the typical predictive modeling and brings together supervised learning and
unsupervised learning. The resulting paradigm, called deep generative modeling,
utilizes the generative perspective on perceiving the surrounding world. It assumes
that each phenomenon is driven by an underlying generative process that defines
a joint distribution over random variables and their stochastic interactions, i.e.,
how events occur and in what order. The adjective “deep” comes from the fact
that the distribution is parameterized using deep neural networks. There are two
distinct traits of deep generative modeling. First, the application of deep neural
networks allows rich and flexible parameterization of distributions. Second, the
principled manner of modeling stochastic dependencies using probability theory
ensures rigorous formulation and prevents potential flaws in reasoning. Moreover,
probability theory provides a unified framework where the likelihood function plays
a crucial role in quantifying uncertainty and defining objective functions.
Who Is This Book for Then? The book is designed to appeal to curious students,
engineers, and researchers with a modest mathematical background in undergraduate
calculus, linear algebra, probability theory, and the basics of machine
learning, deep learning, and programming in Python and PyTorch (or other deep
learning libraries). It should appeal to students and researchers from a variety of
backgrounds, including computer science, engineering, data science, physics, and
bioinformatics, who wish to get familiar with deep generative modeling. In order
to engage the reader, the book introduces fundamental concepts with specific
examples and code snippets. The full code accompanying the book is available
online at:
https://github.com/jmtomczak/intro_dgm
The ultimate aim of the book is to outline the most important techniques in deep
generative modeling and, eventually, enable readers to formulate new models and
implement them.
The Structure of the Book The book consists of eight chapters that could be read
separately and in (almost) any order. Chapter 1 introduces the topic and highlights
important classes of deep generative models and general concepts. Chapters 2, 3
and 4 discuss modeling of marginal distributions, while Chaps. 5 and 6 outline the
material on modeling of joint distributions. Chapter 7 presents a class of latent
variable models that are not learned through the likelihood-based objective. The
last chapter, Chap. 8, indicates how deep generative modeling could be used in the
fast-growing field of neural compression. All chapters are accompanied by code
snippets to help understand how the presented methods could be implemented. The
references are generally meant to indicate the original source of the presented material
and to provide further reading. Deep generative modeling is a broad field of study,
and including all fantastic ideas is nearly impossible. Therefore, I would like to
apologize for missing any paper. If anyone feels left out, it was not intentional on
my part.
In the end, I would like to thank my wife, Ewelina, for her help and presence that
gave me the strength to carry on with writing this book. I am also grateful to my
parents for always supporting me, and my brother who spent a lot of time checking
the first version of the book and the code.
This book, like many other books, would not have been possible without the con-
tribution and help from many people. During my career, I was extremely privileged
and lucky to work on deep generative modeling with an amazing set of people
whom I would like to thank here (in alphabetical order): Tameem Adel, Rianne
van den Berg, Taco Cohen, Tim Davidson, Nicola De Cao, Luka Falorsi, Eliseo
Ferrante, Patrick Forré, Ioannis Gatopoulos, Efstratios Gavves, Adam Gonczarek,
Amirhossein Habibian, Leonard Hasenclever, Emiel Hoogeboom, Maximilian Ilse,
Thomas Kipf, Anna Kuzina, Christos Louizos, Yura Perugachi-Diaz, Ties van
Rozendaal, Victor Satorras, Jerzy Świątek, Max Welling, Szymon Zaręba, and
Maciej Zięba.
I would like to thank other colleagues with whom I worked on AI and had plenty
of fascinating discussions (in alphabetical order): Davide Abati, Ilze Auzina, Babak
Ehteshami Bejnordi, Erik Bekkers, Tijmen Blankevoort, Matteo De Carlo, Fuda van
Diggelen, A.E. Eiben, Ali El Hassouni, Arkadiusz Gertych, Russ Greiner, Mark
Hoogendoorn, Emile van Krieken, Gongjin Lan, Falko Lavitt, Romain Lepert, Jie
Luo, ChangYong Oh, Siamak Ravanbakhsh, Diederik Roijers, David W. Romero,
Annette ten Teije, Auke Wiggers, and Alessandro Zonta.
I am especially thankful to my brother, Kasper, who patiently read all sections,
and ran and checked every single line of code in this book. You can’t even imagine
my gratitude for that!
I would like to thank my wife, Ewelina, for supporting me all the time and giving
me the strength to finish this book. Without her help and understanding, it would
be nearly impossible to accomplish this project. I would like to also express my
gratitude to my parents, Elżbieta and Ryszard, for their support at different stages
of my life because without them I would never be who I am now.
Contents

3.1.5 Code
3.1.6 Is It All? Really?
3.1.7 ResNet Flows and DenseNet Flows
3.2 Flows for Discrete Random Variables
3.2.1 Introduction
3.2.2 Flows in R or Maybe Rather in Z?
3.2.3 Integer Discrete Flows
3.2.4 Code
3.2.5 What’s Next?
References
4 Latent Variable Models
4.1 Introduction
4.2 Probabilistic Principal Component Analysis
4.3 Variational Auto-Encoders: Variational Inference for Non-linear Latent Variable Models
4.3.1 The Model and the Objective
4.3.2 A Different Perspective on the ELBO
4.3.3 Components of VAEs
4.3.3.1 Parameterization of Distributions
4.3.3.2 Reparameterization Trick
4.3.4 VAE in Action!
4.3.5 Code
4.3.6 Typical Issues with VAEs
4.3.7 There Is More!
4.4 Improving Variational Auto-Encoders
4.4.1 Priors
4.4.1.1 Standard Gaussian
4.4.1.2 Mixture of Gaussians
4.4.1.3 VampPrior: Variational Mixture of Posterior Prior
4.4.1.4 GTM: Generative Topographic Mapping
4.4.1.5 GTM-VampPrior
4.4.1.6 Flow-Based Prior
4.4.1.7 Remarks
4.4.2 Variational Posteriors
4.4.2.1 Variational Posteriors with Householder Flows [20]
4.4.2.2 Variational Posteriors with Sylvester Flows [16]
4.4.2.3 Hyperspherical Latent Space
4.5 Hierarchical Latent Variable Models
4.5.1 Introduction
4.5.2 Hierarchical VAEs
4.5.2.1 Two-Level VAEs
Index
Chapter 1
Why Deep Generative Modeling?
Before we start thinking about (deep) generative modeling, let us consider a simple
example. Imagine we have trained a deep neural network that classifies images (x ∈
Z^D) of animals (y ∈ Y, with Y = {cat, dog, horse}). Further, let us assume that
this neural network is trained really well so that it always assigns the proper class
with high probability p(y|x). So far so good, right? A problem could occur,
though. As pointed out in [1], adding noise to images could result in a completely
false classification. An example of such a situation is presented in Fig. 1.1 where
adding noise could shift predicted probabilities of labels; however, the image is
barely changed (at least to us, human beings).
This example indicates that neural networks that are used to parameterize the
conditional distribution p(y|x) seem to lack semantic understanding of images.
Further, we even hypothesize that learning discriminative models is not enough for
proper decision making and creating AI. A machine learning system cannot rely on
learning how to make a decision without understanding the reality and being able to
express uncertainty about the surrounding world. How can we trust such a system
if even a small amount of noise could change its internal beliefs and also shift its
certainty from one decision to the other? How can we communicate with such a
system if it is unable to properly express its opinion about whether its surrounding
is new or not?
To motivate the importance of the concepts like uncertainty and understanding
in decision making, let us consider a system that classifies objects, but this time
into two classes: orange and blue. We assume we have some two-dimensional data
(Fig. 1.2, left) and a new datapoint to be classified (a black cross in Fig. 1.2). We
can make decisions using two approaches. First, a classifier could be formulated
explicitly by modeling the conditional distribution p(y|x) (Fig. 1.2, middle). Sec-
ond, we can consider a joint distribution p(x, y) that could be further decomposed
as p(x, y) = p(y|x) p(x) (Fig. 1.2, right).
Fig. 1.1 An example of adding noise to an almost perfectly classified image that results in a shift
of predicted label
Fig. 1.2 An example of data (left) and two approaches to decision making: (middle) a discrimi-
native approach and (right) a generative approach
After training a model using the discriminative approach, namely, the conditional
distribution p(y|x), we obtain a clear decision boundary. Then, we see that the black
cross is farther away from the orange region; thus, the classifier assigns a higher
probability to the blue label. As a result, the classifier is certain about the decision!
On the other hand, if we additionally fit a distribution p(x), we observe that the
black cross is not only far away from the decision boundary, but it is also distant
from the region where the blue datapoints lie. In other words, the black point is far away
from the region of high probability mass. As a result, the (marginal) probability
of the black cross p(x = black cross) is low, and the joint distribution p(x =
black cross, y = blue) will be low as well and, thus, the decision is uncertain!
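To make this concrete, consider a small numerical sketch (all probabilities below are hypothetical, made-up numbers): even a very confident classifier yields a tiny joint probability once the marginal p(x) is small.

# Hypothetical numbers for the black cross in Fig. 1.2.
p_y_given_x = {'blue': 0.95, 'orange': 0.05}   # a confident discriminative classifier
p_x = 1e-4                                     # the marginal: the point lies far from the data

# The generative view inspects the joint distribution p(x, y) = p(y|x) p(x).
p_joint = {y: p * p_x for y, p in p_y_given_x.items()}
print(p_joint)   # {'blue': ~9.5e-05, 'orange': ~5e-06} -- both tiny, so the decision is uncertain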
This simple example clearly indicates that if we want to build AI systems that
make reliable decisions and can communicate with us, human beings, they must
understand the environment first. For this purpose, they cannot simply learn how
to make decisions, but they should be able to quantify their beliefs about their
surrounding using the language of probability [2, 3]. In order to do that, we claim
that estimating the distribution over objects, p(x), is crucial.
From the generative perspective, knowing the distribution p(x) is essential
because:
• It could be used to assess whether a given object has been observed in the past or
not.
• It could help to properly weight the decision.
• It could be used to assess uncertainty about the environment.
1.2 Where Can We Use (Deep) Generative Modeling?
With the development of neural networks and the increase in computational power,
deep generative modeling has become one of the leading directions in AI. Its
applications vary from typical modalities considered in machine learning, i.e., text
analysis (e.g., [5]), image analysis (e.g., [6]), audio analysis (e.g., [7]), to problems
in active learning (e.g., [8]), reinforcement learning (e.g., [9]), graph analysis (e.g.,
[10]), and medical imaging (e.g., [11]). In Fig. 1.3, we present graphically potential
applications of deep generative modeling.
In some applications, it is indeed important to generate (synthesize) objects or
modify features of objects to create new ones (e.g., an app turns a young person
into an old one). However, in others, like active learning, it is important to ask for
uncertain objects, i.e., objects with low p(x), that should be labeled by an oracle. In
reinforcement learning, on the other hand, generating the next most likely situation
(states) is crucial for taking actions by an agent. For medical applications, explaining
a decision, e.g., in terms of the probability of the label and the object, is definitely
more informative to a human doctor than simply assisting with a diagnosis label.
If an AI system would be able to indicate how certain it is and also quantify
whether the object is suspicious (i.e., low p(x)) or not, then it might be used as
an independent specialist that outlines its own opinion.
These examples clearly indicate that many fields, if not all, could highly benefit
from (deep) generative modeling. Obviously, there are many mechanisms that AI
systems should be equipped with. However, we claim that the generative modeling
capability is definitely one of the most important ones, as outlined in the above-
mentioned cases.
At this point, after highlighting the importance and wide applicability of (deep)
generative modeling, we should ask ourselves how to formulate (deep) generative
models. In other words, how to express p(x) that we mentioned already multiple
times.
1.3 How to Formulate (Deep) Generative Modeling?

We can divide (deep) generative modeling into four main groups (see Fig. 1.4):
• Autoregressive generative models (ARM)
• Flow-based models
• Latent variable models
• Energy-based models
We use deep in brackets because most of what we have discussed so far could be
modeled without using neural networks. However, neural networks are flexible and
powerful and, therefore, they are widely used to parameterize generative models.
From now on, we focus entirely on deep generative models.
As a side note, please treat this taxonomy as a guideline that helps us to navigate
through this book, not something written in stone. Personally, I am not a big fan of
spending too much time on categorizing and labeling science because it very often
results in antagonizing and gatekeeping. Anyway, there is also a group of models
based on the score matching principle [12–14] that do not necessarily fit our simple
taxonomy. However, as pointed out in [14], these models share a lot of similarities
with latent variable models (if we treat consecutive steps of a stochastic process as
latent variables) and, thus, we treat them as such.
Fig. 1.4 Four groups of deep generative models: autoregressive models (e.g., PixelCNN), flow-based models (e.g., RealNVP), latent variable models, and energy-based models
The first group of deep generative models utilizes the idea of autoregressive
modeling (ARM). In other words, the distribution over x is represented in an
autoregressive manner:
$$p(\mathbf{x}) = p(x_0) \prod_{i=1}^{D} p(x_i|\mathbf{x}_{<i}), \qquad (1.1)$$
Integer discrete flows propose to use affine coupling layers with rounding
operators to ensure the integer-valued output [27]. A generalization of the affine
coupling layer was further investigated in [28].
All generative models that take advantage of the change of variables formula
are referred to as flow-based models or flows for short. We will discuss flows in
Chap. 3.
z ∼ p(z)
x ∼ p(x|z).
In other words, the latent variables correspond to hidden factors in data, and the
conditional distribution p(x|z) could be treated as a generator.
The most widely known latent variable model is the probabilistic Principal
Component Analysis (pPCA) [29] where p(z) and p(x|z) are Gaussian distribu-
tions, and the dependency between z and x is linear.
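As a small illustration of this linear generative process, the sketch below draws samples from a pPCA-like model; the matrix W, the offset b, and the noise level are arbitrary choices made for the example, not values taken from the text.

import torch

# Hypothetical pPCA parameters: 2-dimensional latents, 5-dimensional observations.
W = torch.randn(5, 2)    # linear mapping from z to x
b = torch.zeros(5)       # offset
sigma = 0.1              # standard deviation of the observation noise

z = torch.randn(100, 2)                             # z ~ N(0, I)
x = z @ W.t() + b + sigma * torch.randn(100, 5)     # x ~ N(Wz + b, sigma^2 I)
print(x.shape)                                      # torch.Size([100, 5])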
A non-linear extension of the pPCA with arbitrary distributions is the Varia-
tional Auto-Encoder (VAE) framework [30, 31]. To make the inference tractable,
variational inference is utilized to approximate the posterior p(z|x), and neural
networks are used to parameterize the distributions. Since the publication of the
seminal papers by Kingma and Welling [30] and Rezende et al. [31], there have been multiple
extensions of this framework, including working on more powerful variational
posteriors [19, 21, 22, 32], priors [33, 34], and decoders [35]. Interesting directions
include considering different topologies of the latent space, e.g., the hyperspherical
latent space [36]. In VAEs and the pPCA all distributions must be defined upfront
and, therefore, they are called prescribed models. We will pay special attention to
this group of deep generative models in Chap. 4.
So far, ARMs, flows, the pPCA, and VAEs are probabilistic models with
the objective function being the log-likelihood function that is closely related
to using the Kullback–Leibler divergence between the data distribution and the
model distribution. A different approach utilizes an adversarial loss in which a
discriminator D(·) determines a difference between real data and synthetic data
provided by a generator in the implicit form, namely, $p(\mathbf{x}|\mathbf{z}) = \delta\left(\mathbf{x} - G(\mathbf{z})\right)$,
where δ(·) is the Dirac delta. This group of models is called implicit models, and
Generative Adversarial Networks (GANs) [6] became one of the first successful
deep generative models for synthesizing realistic-looking objects (e.g., images). See
Chap. 7 for more details.
1.3.5 Overview
In Table 1.1, we compared all four groups of models (with a distinction between
implicit latent variable models and prescribed latent variable models) using arbitrary
criteria like:
• Whether training is typically stable
• Whether it is possible to calculate the likelihood function
• Whether one can use a model for lossy or lossless compression
• Whether a model could be used for representation learning
All likelihood-based models (i.e., ARMs, flows, EBMs, and prescribed models
like VAEs) can be trained in a stable manner, while implicit models like GANs
suffer from instabilities. In the case of the non-linear prescribed models like VAEs,
we must remember that the likelihood function cannot be exactly calculated, and
only a lower-bound could be provided. Similarly, EBMs require calculating the
partition function, which is an analytically intractable problem. As a result, we can get
the unnormalized probability or an approximation at best. ARMs constitute one
of the best likelihood-based models; however, their sampling process is extremely
slow due to the autoregressive manner of generating new content. EBMs require
running a Monte Carlo method to receive a sample. Since we operate on high-
dimensional objects, this is a great obstacle for using EBMs widely in practice. All
other approaches are relatively fast. In the case of compression, VAEs are models
that allow us to use a bottleneck (the latent space). On the other hand, ARMs and
flows could be used for lossless compression since they are density estimators
and provide the exact likelihood value. Implicit models cannot be directly used
for compression; however, recent works use GANs to improve image compression
[44]. Flows, prescribed models, and EBMs (if they use latent variables) could be used for
representation learning, namely, learning a set of random variables that summarize
data in some way and/or disentangle factors in data. The question of what constitutes a
good representation is a different story, and we refer the curious reader to the literature,
e.g., [45].
References
1. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International
Conference on Learning Representations, ICLR 2014, 2014.
2. Christopher M Bishop. Model-based machine learning. Philosophical Transactions of the
Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20120222,
2013.
3. Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature,
521(7553):452–459, 2015.
4. Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative
and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 1, pages 87–94. IEEE, 2006.
5. Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Language Learning, pages 10–21, 2016.
6. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint
arXiv:1406.2661, 2014.
7. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
8. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–
5981, 2019.
9. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
10. Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs
using variational autoencoders. In International Conference on Artificial Neural Networks,
pages 412–422. Springer, 2018.
11. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
12. Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.
13. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. arXiv preprint arXiv:1907.05600, 2019.
14. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
15. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
16. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
17. Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv preprint arXiv:1302.5125, 2013.
18. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
19. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv preprint arXiv:1611.09630, 2016.
20. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Inter-
national Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
21. Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 393–402. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
22. Emiel Hoogeboom, Victor Garcia Satorras, Jakub M Tomczak, and Max Welling. The
convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910,
2020.
23. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
24. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen.
Invertible residual networks. In International Conference on Machine Learning, pages 573–
582. PMLR, 2019.
25. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
26. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
Concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
27. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
28. Jakub M Tomczak. General invertible transformations for flow-based generative modeling.
INNF+, 2021.
29. Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622,
1999.
30. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
31. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In International conference on machine
learning, pages 1278–1286. PMLR, 2014.
32. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural
Information Processing Systems, 29:4743–4751, 2016.
33. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
34. Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
35. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
36. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
37. Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
38. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-
based learning. Predicting structured data, 1(0), 2006.
39. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
40. Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in Boltzmann
machines. Parallel distributed processing: Explorations in the microstructure of cognition,
1(282-317):2, 1986.
41. Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural
networks: Tricks of the trade, pages 599–619. Springer, 2012.
42. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on Machine learning, pages
536–543, 2008.
43. Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad
Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should
treat it like one. In International Conference on Learning Representations, 2019.
44. Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity
generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
45. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
Chapter 2
Autoregressive Models
2.1 Introduction
Before we start discussing how we can model the distribution p(x), we refresh our
memory about the core rules of probability theory, namely, the sum rule and the
product rule. Let us introduce two random variables x and y. Their joint distribution
is p(x, y). The sum rule tells us how to obtain the marginal distribution, namely, by
summing (or integrating) the joint distribution over one of the variables:

$$p(x) = \sum_{y} p(x, y).$$

The product rule allows us to factorize the joint distribution in two
manners, namely:

$$p(x, y) = p(x|y)\, p(y) = p(y|x)\, p(x).$$
These two rules will play a crucial role in probability theory and statistics and, in
particular, in formulating deep generative models.
Now, let us consider a high-dimensional random variable x ∈ X^D where X =
{0, 1, . . . , 255} (e.g., pixel values) or X = R. Our goal is to model p(x). Before we
jump into thinking of specific parameterization, let us first apply the product rule to
express the joint distribution in a different manner:
$$p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d|\mathbf{x}_{<d}), \qquad (2.4)$$
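To see what this factorization buys us, the snippet below evaluates the probability of a 3-dimensional binary vector as a product of conditionals; the conditional tables are made-up numbers used purely for illustration.

# p(x1), p(x2|x1), p(x3|x1, x2) for binary variables; the numbers are hypothetical.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x12 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
                  (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}

x = (1, 0, 1)
p_x = p_x1[x[0]] * p_x2_given_x1[x[0]][x[1]] * p_x3_given_x12[(x[0], x[1])][x[2]]
print(p_x)   # 0.4 * 0.2 * 0.6 = 0.048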
As mentioned earlier, we aim at modeling the joint distribution p(x) using
conditional distributions. A potential solution to the issue of using D separate models
is utilizing a single, shared model for the conditional distributions. However, we
need to make some assumptions to use such a shared model. In other words, we
look for an autoregressive model (ARM). In the next subsection, we outline ARMs
parameterized with various neural networks. After all, we are talking about deep
generative models, so using a neural network is not surprising, is it?
For instance, we can assume that each variable depends only on the two preceding variables:

$$p(\mathbf{x}) = p(x_1)\, p(x_2|x_1) \prod_{d=3}^{D} p(x_d|x_{d-1}, x_{d-2}). \qquad (2.5)$$
Then, we can use a small neural network, e.g., a multi-layer perceptron (MLP),
to predict the distribution of x_d. If X = {0, 1, . . . , 255}, the MLP takes x_{d−1} and x_{d−2}
and outputs θ_d, the probabilities of the categorical distribution of x_d.
Fig. 2.1 An example of applying a shared MLP depending on two last inputs. Inputs are denoted
by blue nodes (bottom), intermediate representations are denoted by orange nodes (middle), and
output probabilities are denoted by green nodes (top). Notice that a probability θd is not dependent
on xd
Fig. 2.2 An example of applying an RNN depending on two last inputs. Inputs are denoted by blue
nodes (bottom), intermediate representations are denoted by orange nodes (middle), and output
probabilities are denoted by green nodes (top). Notice that compared to the approach with a shared
MLP, there is an additional dependency between intermediate nodes hd
• If they are badly conditioned (i.e., the eigenvalues of a weight matrix are
larger or smaller than 1), then they suffer from exploding or vanishing gradients,
respectively, which hinders learning long-range dependencies.
There exist methods that help to train RNNs, like gradient clipping or, more generally,
gradient regularization [4], or orthogonal weights [5]. However, here we are not
interested in looking into rather specific solutions to these problems. We seek a
different parameterization that could solve our original problem, namely, modeling
long-range dependencies in an ARM.
In [6, 7] it was noticed that convolutional neural networks (CNNs) could be used
instead of RNNs to model long-range dependencies. To be more precise, one-
dimensional convolutional layers (Conv1D) could be stacked together to process
sequential data. The advantages of such an approach are the following:
• Kernels are shared (i.e., an efficient parameterization).
• The processing is done in parallel, which greatly speeds up computations.
• By stacking more layers, the effective kernel size grows with the network depth.
These three traits seem to place Conv1D-based neural networks as a perfect solution
to our problem. However, can we indeed use them straight away?
A Conv1D can be applied to calculate embeddings like in [7], but it cannot be
used for autoregressive models. Why? Because we need convolutions to be causal
[8]. Causal in this context means that a Conv1D layer is dependent on the last k
inputs excluding the current one (option A) or including the current one (option B). In other
words, we must “cut” the kernel in half and forbid it from looking into the next variables
Fig. 2.3 An example of applying causal convolutions. The kernel size is 2, but by applying dilation
in higher layers, a much larger input could be processed (red edges), thus, a larger memory is
utilized. Notice that the first layers must be option A to ensure proper processing
(look into the future). Importantly, the option A is required in the first layer because
the final output (i.e., the probabilities θd ) cannot be dependent on xd . Additionally,
if we are concerned about the effective kernel size, we can use dilation larger
than 1.
In Fig. 2.3 we present an example of a neural network consisting of 3 causal
Conv1D layers. The first CausalConv1D is of type A, i.e., it takes into
account only the last k inputs, without the current one. Then, in the next two layers,
we use CausalConv1D (option B) with dilations 2 and 3. Typically, the dilation
values are 1, 2, 4, and 8 (v.d. Oord et al., 2016a); however, taking 2 and 4 would
not nicely fit in a figure. We highlight in red all connections that go from the output
layer to the input layer. As we can notice, stacking CausalConv1D layers with the
dilation larger than 1 allows us to learn long-range dependencies (in this example,
by looking at 7 last inputs).
An example of an implementation of the CausalConv1D layer is presented below
(the convolution module and the forward pass follow the standard causal-convolution
recipe). If you are still confused about option A and option B, please analyze the code
snippet step-by-step.
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, A=False, **kwargs):
        super(CausalConv1d, self).__init__()

        # attributes:
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.A = A  # whether option A (A=True) or B (A=False)
        self.padding = (kernel_size - 1) * dilation + A * 1

        # a standard Conv1d; the causal (left-only) padding is applied manually in forward
        # (this module and the forward pass follow the standard causal-convolution recipe)
        self.conv1d = nn.Conv1d(in_channels, out_channels, kernel_size,
                                stride=1, padding=0, dilation=dilation, **kwargs)

    def forward(self, x):
        x = F.pad(x, (self.padding, 0))   # pad only on the left, i.e., the past
        x = self.conv1d(x)
        if self.A:
            return x[:, :, :-1]   # option A: drop the last position so the current input is excluded
        return x
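As a quick sanity check of the layer above, we can stack a few causal convolutions as in Fig. 2.3 (option A first, then option B with growing dilation) and verify that the sequence length is preserved; the channel sizes and dilations below are arbitrary choices.

import torch
import torch.nn as nn

# assumes the CausalConv1d class defined above
net = nn.Sequential(
    CausalConv1d(in_channels=1, out_channels=16, kernel_size=2, dilation=1, A=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=16, out_channels=16, kernel_size=2, dilation=2, A=False),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=16, out_channels=256, kernel_size=2, dilation=4, A=False),
)

x = torch.rand(8, 1, 64)   # a batch of 8 sequences of length 64 with a single channel
out = net(x)
print(out.shape)           # torch.Size([8, 256, 64]) -- same length, causal receptive field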
Alright, let us talk more about details and how to implement an ARM. Here, and in
the whole book, we focus on images, e.g., x ∈ {0, 1, . . . , 15}^64. Since images are
represented by integers, we will use the categorical distribution to represent them (in
next chapters, we will comment on the choice of distribution for images and present
some alternatives). We model p(x) using an ARM parameterized by CausalConv1D
layers. As a result, each conditional is a categorical distribution, p(x_d|x_{<d}) =
Categorical(x_d|θ_d(x_{<d})), whose probabilities θ_d(x_{<d}) are predicted by the shared
network. Writing the categorical log-likelihood with the Iverson bracket, the training
objective over a dataset (indexed by n) takes the form:

$$\sum_{n} \sum_{d} \sum_{l=1}^{L} \left[x_{n,d} = l\right] \ln \left(\theta_{d}(\mathbf{x}_{n,<d})\right)_{l}. \qquad (2.15)$$
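In code, objective (2.15) is just the categorical log-likelihood; a minimal sketch (with randomly generated, hypothetical probabilities theta in place of the network output):

import torch

# theta: categorical probabilities for every dimension, shape (batch, num_vals, D)
# x:     observed values in {0, ..., num_vals - 1},  shape (batch, D)
batch, num_vals, D = 4, 16, 64
theta = torch.softmax(torch.randn(batch, num_vals, D), dim=1)
x = torch.randint(0, num_vals, (batch, D))

# Picking ln(theta_{d, x_d}) for each d is exactly the Iverson-bracket sum in (2.15).
log_theta = torch.log(theta)
log_p = torch.gather(log_theta, dim=1, index=x.unsqueeze(1)).squeeze(1).sum(dim=-1)
print(log_p.shape)   # torch.Size([4]) -- the log-likelihood of each image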
2.3.1 Code
Uff... Alright, let’s take a look at some code. The full code is available under the
following: https://github.com/jmtomczak/intro_dgm. Here, we focus only on the
code for the model. We provide details in the comments.
class ARM(nn.Module):
    def __init__(self, net, D=2, num_vals=256):
        super(ARM, self).__init__()
        # constructor attributes (inferred from the signature): the causal network,
        # the dimensionality of x, and the number of possible pixel values
        self.net = net
        self.D = D
        self.num_vals = num_vals

    # ... the forward pass, the log-likelihood computation (ending with `return log_p`),
    # and the autoregressive sampling (ending with `return x_new`) are part of the full
    # implementation at https://github.com/jmtomczak/intro_dgm
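The sampling procedure hidden in the omitted part of the class is conceptually simple: we generate one dimension at a time, each time re-running the network on what has been generated so far. Below is a hedged sketch of such a loop; it assumes a network mapping an input of shape (batch, 1, D) to logits of shape (batch, num_vals, D), which may differ from the exact interface used in the accompanying code.

import torch

def sample_autoregressively(net, batch_size, D, num_vals):
    # start from zeros and fill in one dimension at a time
    x_new = torch.zeros((batch_size, 1, D))
    for d in range(D):
        logits = net(x_new)                                   # (batch, num_vals, D)
        theta_d = torch.softmax(logits[:, :, d], dim=1)       # probabilities of x_d
        x_new[:, 0, d] = torch.multinomial(theta_d, num_samples=1)[:, 0].float()
    return x_new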
Fig. 2.4 An example of outcomes after the training: (a) randomly selected real images, (b)
unconditional generations from the ARM, and (c) the validation curve during training
Fig. 2.5 An example of a masked 3×3 kernel (i.e., a causal 2D kernel): (left) A difference between
a standard kernel (all weights are used; denoted by green) and a masked kernel (some weights are
masked, i.e., not used; in red). For the masked kernel, we denoted the node (pixel) in the middle in
violet, because it is either masked (option A) or not (option B). (middle) An example of an image
(light orange nodes: zeros, light blue nodes: ones) and a masked kernel (option A). (right) The
result of applying the masked kernel to the image (with padding equal to 1)
Perfect! Now we are ready to run the full code. After training our ARM, we
should obtain results similar to those in Fig. 2.4.
References
1. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
2. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
3. Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural
networks. In ICML, 2011.
4. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. In International conference on machine learning, pages 1310–1318. PMLR,
2013.
5. Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks.
In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
6. Ronan Collobert and Jason Weston. A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of the 25th international
conference on Machine learning, pages 160–167, 2008.
7. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network
for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pages 212–217. Association for Computational Linguistics, 2014.
8. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolu-
tional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
9. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
10. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
11. Auke Wiggers and Emiel Hoogeboom. Predictive sampling with forecasting autoregressive
models. In International Conference on Machine Learning, pages 10260–10269. PMLR, 2020.
12. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with pixelcnn decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
13. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the
pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint
arXiv:1701.05517, 2017.
14. Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved
autoregressive generative model. In International Conference on Machine Learning, pages
864–872. PMLR, 2018.
15. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.
16. Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves,
and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine
Learning, pages 1771–1779. PMLR, 2017.
17. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
18. Yang Song, Chenlin Meng, Renjie Liao, and Stefano Ermon. Accelerating feedforward
computation via parallel nonlinear equation solving. In International Conference on Machine
Learning, pages 9791–9800. PMLR, 2021.
19. Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for
generative modeling. In International Conference on Machine Learning, pages 3936–3945.
PMLR, 2018.
20. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008, 2017.
21. Scott Reed, Aäron Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian
Chen, Dan Belov, and Nando Freitas. Parallel multiscale autoregressive density estimation. In
International Conference on Machine Learning, pages 2912–2921. PMLR, 2017.
22. Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel
networks and multidimensional upscaling. In International Conference on Learning Represen-
tations, 2018.
Chapter 3
Flow-Based Models
3.1 Flows for Continuous Random Variables

3.1.1 Introduction
So far, we have discussed a class of deep generative models that model the
distribution p(x) directly in an autoregressive manner. The main advantage of
ARMs is that they can learn long-range statistics and, as a consequence, are powerful
density estimators. However, their drawback is that they are parameterized in an
autoregressive manner; hence, sampling is a rather slow process. Moreover, they
lack a latent representation; therefore, it is not obvious how to manipulate their
internal data representation, which makes them less appealing for tasks like compression
or metric learning. In this chapter, we present a different approach to direct modeling
of p(x). However, before we start our considerations, we will discuss a simple
example.
Example 3.1 Let us take a random variable z ∈ R with π(z) = N(z|0, 1). Now,
we consider a new random variable after applying some linear transformation to z,
namely x = 0.75z + 1. Now the question is the following:
What is the distribution of x, p(x)?
We can guess the solution by using properties of Gaussians, or dig in our memory
about the change of variables formula to calculate this distribution, that is:
$$p(x) = \pi\left(z = f^{-1}(x)\right) \left|\frac{\partial f^{-1}(x)}{\partial x}\right|, \qquad (3.1)$$
where f is an invertible function (a bijection). What does it mean? It means that the
function maps one point to another, distinct point, and we can always invert the
function to obtain the original point.
In Fig. 3.1, we have an example of a bijection. Notice that the volumes of the domains
do not need to be the same! Keep it in mind and think about it in the context of
$\left|\frac{\partial f^{-1}(x)}{\partial x}\right|$.
Coming back to our example, we have
$$f^{-1}(x) = \frac{x - 1}{0.75}. \qquad (3.3)$$

Then, the derivative of the change of volume is

$$\left|\frac{\partial f^{-1}(x)}{\partial x}\right| = \frac{4}{3}.$$
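We can verify this result numerically: for the linear map x = 0.75z + 1 with z ~ N(0, 1), the change-of-variables formula must agree with the density of N(1, 0.75^2). A small sketch using torch.distributions:

import torch
from torch.distributions import Normal

pi = Normal(0.0, 1.0)                  # pi(z) = N(z | 0, 1)
x = torch.linspace(-2.0, 4.0, 7)

# Change of variables: p(x) = pi(f^{-1}(x)) * |df^{-1}/dx|, with f^{-1}(x) = (x - 1) / 0.75.
p_x = pi.log_prob((x - 1.0) / 0.75).exp() * (4.0 / 3.0)

# The known result for a linear transformation of a Gaussian: x ~ N(1, 0.75^2).
p_x_direct = Normal(1.0, 0.75).log_prob(x).exp()
print(torch.allclose(p_x, p_x_direct))   # True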
random variable with a known distribution, z ∼ p(z). The same holds for multiple
variables x, z ∈ R^D:
$$p(\mathbf{x}) = p\left(\mathbf{z} = f^{-1}(\mathbf{x})\right) \left|\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right|, \qquad (3.7)$$

where:

$$\left|\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right| = \left|\det J_{f^{-1}}(\mathbf{x})\right|. \qquad (3.8)$$
Moreover, we can also use the inverse function theorem that yields

$$J_{f^{-1}}(\mathbf{x}) = \left(J_{f}(\mathbf{x})\right)^{-1}. \qquad (3.10)$$
Since f is invertible, we can use the inverse function theorem to rewrite (3.7) as
follows:
$$p(\mathbf{x}) = p\left(\mathbf{z} = f^{-1}(\mathbf{x})\right) \left|\det J_{f}(\mathbf{x})\right|^{-1}. \qquad (3.11)$$
To get some insight into the role of the Jacobian-determinant, take a look at
Fig. 3.2. Here, there are three cases of invertible transformations that play around
with a uniform distribution defined over a square.
In the case on top, the transformation turns a square into a rhombus without
changing its volume. As a result, the Jacobian-determinant of this transformation
is 1. Such transformations are called volume-preserving. Notice that the resulting
distribution is still uniform and since there is no change of volume, it is defined over
the same volume as the original one, thus, the color is the same.
In the middle, the transformation shrinks the volume, therefore, the resulting
uniform distribution is “denser” (a darker color in Fig. 3.2). Additionally, the
Jacobian-determinant is smaller than 1.
In the last situation, the transformation enlarges the volume, hence, the uniform
distribution is defined over a larger area (a lighter color in Fig. 3.2). Since the volume
is larger, the Jacobian-determinant is larger than 1.
Notice that the shifting operator is volume-preserving. To see that, imagine adding an
arbitrary value (e.g., 5) to all points of the square. Does it change the volume? Not
at all! Thus, the Jacobian-determinant equals 1.
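These cases are easy to check numerically with automatic differentiation; the two maps below (a rotation and a shrinking) are arbitrary examples, not taken from the figure.

import math
import torch
from torch.autograd.functional import jacobian

def rotate(x):
    # a rotation by 45 degrees: volume-preserving
    angle = torch.tensor(math.pi / 4)
    R = torch.stack([torch.stack([torch.cos(angle), -torch.sin(angle)]),
                     torch.stack([torch.sin(angle), torch.cos(angle)])])
    return R @ x

def shrink(x):
    # shrinks each coordinate by half, i.e., the volume by a factor of 4
    return 0.5 * x

x = torch.tensor([1.0, 2.0])
print(torch.det(jacobian(rotate, x)).abs())   # ~1.0  -> volume-preserving
print(torch.det(jacobian(shrink, x)).abs())   # 0.25  -> the volume shrinks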
A natural question is whether we can utilize the idea of the change of variables to
model a complex and high-dimensional distribution over images, audio, or other
data sources. Let us consider a hierarchical model, or, equivalently, a sequence of
invertible transformations, f_k : R^D → R^D. We start with a known distribution
π(z0 ) = N(z0 |0, I). Then, we can sequentially apply the invertible transformations
to obtain a flexible distribution [1, 2]:
$$p(\mathbf{x}) = \pi\left(\mathbf{z}_0 = f^{-1}(\mathbf{x})\right) \prod_{i=1}^{K} \left|\det \frac{\partial f_i(\mathbf{z}_{i-1})}{\partial \mathbf{z}_{i-1}}\right|^{-1}, \qquad (3.12)$$

or, equivalently,

$$p(\mathbf{x}) = \pi\left(\mathbf{z}_0 = f^{-1}(\mathbf{x})\right) \prod_{i=1}^{K} \left|J_{f_i}(\mathbf{z}_{i-1})\right|^{-1}. \qquad (3.13)$$
Fig. 3.3 An example of transforming a unimodal distribution (the latent space) to a multimodal
distribution (the data space, e.g., the pixel space) through a series of invertible transformations fi
ln p(x) = ln N(z₀ = f⁻¹(x)|0, I) − Σ_{i=1}^{K} ln |J_{f_i}(z_{i−1})|.   (3.14)
Interestingly, we see that the first part, namely ln N(z₀ = f⁻¹(x)|0, I), corresponds to the Mean Square Error loss function between 0 and f⁻¹(x) plus a constant. The second part, Σ_{i=1}^{K} ln |J_{f_i}(z_{i−1})|, as in our example, ensures that the distribution is properly normalized. However, since it penalizes the change of volume (take a look again at the example above!), we can think of it as a kind of regularizer for the invertible transformations {f_i}.
Once we have laid down the foundations of the change of variables for expressing
density functions, now we must face two questions:
• How to model the invertible transformations?
• What is the difficulty here?
The answer to the first question could be neural networks because they are flexible and easy to train. However, we cannot take any neural network because of two reasons. First, the transformation must be invertible, thus, we must pick an invertible neural network. Second, even if a neural network is invertible, we face the problem of calculating the second part of (3.14), i.e., Σ_{i=1}^{K} ln |J_{f_i}(z_{i−1})|, which is non-trivial and computationally intractable for an arbitrary sequence of invertible transformations. As a result, we seek neural networks that are both invertible and whose logarithm of the Jacobian-determinant is (relatively) easy to calculate. The resulting model that consists of invertible transformations (neural networks) with tractable Jacobian-determinants is referred to as normalizing flows or flow-based models.
There are various possible invertible neural networks with tractable Jacobian-
determinants, e.g., Planar Normalizing Flows [1], Sylvester Normalizing Flows [3],
Residual Flows [4, 5], Invertible DenseNets [6]. However, here we focus on a very
important class of models: RealNVP, Real-valued Non-Volume Preserving flows
[7] that serve as a starting point for many other flow-based generative models (e.g.,
GLOW [8]).
The main component of RealNVP is a coupling layer. The idea behind this
transformation is the following. Let us consider an input to the layer that is divided
into two parts: x = [xa , xb ]. The division into two parts could be done by dividing
the vector x into x_{1:d} and x_{d+1:D}, or in a more sophisticated manner, e.g.,
a checkerboard pattern [7]. Then, the transformation is defined as follows:
y_a = x_a   (3.15)
y_b = exp(s(x_a)) ⊙ x_b + t(x_a),   (3.16)

where s(·) and t(·) are arbitrary neural networks called scaling and translation, respectively.
This transformation is invertible by design, namely:

x_a = y_a
x_b = (y_b − t(y_a)) ⊙ exp(−s(y_a)).

Moreover, the Jacobian of the coupling layer is lower-triangular, with ones and the elements of exp(s(x_a)) on the diagonal, that yields

det(J) = ∏_{j=1}^{D−d} exp(s(x_a))_j = exp(Σ_{j=1}^{D−d} s(x_a)_j).   (3.20)
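To make the coupling layer and Eq. (3.20) concrete, below is a minimal sketch of its forward pass, inverse pass, and log-Jacobian-determinant; the networks s and t are small placeholder MLPs, and the dimensionalities are illustrative (this is not the full RealNVP implementation discussed later in this chapter).

import torch
import torch.nn as nn

D, M = 4, 16  # illustrative dimensionality and hidden size
s = nn.Sequential(nn.Linear(D // 2, M), nn.Tanh(), nn.Linear(M, D // 2), nn.Tanh())
t = nn.Sequential(nn.Linear(D // 2, M), nn.Tanh(), nn.Linear(M, D // 2))

def coupling_forward(x):
    # Split the input into two parts: x = [x_a, x_b].
    xa, xb = torch.chunk(x, 2, dim=1)
    ya = xa                                  # Eq. (3.15)
    yb = torch.exp(s(xa)) * xb + t(xa)       # Eq. (3.16)
    log_det = s(xa).sum(dim=1)               # Eq. (3.20): the sum of the scale outputs
    return torch.cat([ya, yb], dim=1), log_det

def coupling_inverse(y):
    ya, yb = torch.chunk(y, 2, dim=1)
    xa = ya
    xb = (yb - t(ya)) * torch.exp(-s(ya))    # the inverse of Eq. (3.16)
    return torch.cat([xa, xb], dim=1)

x = torch.randn(8, D)
y, log_det = coupling_forward(x)
print(torch.allclose(coupling_inverse(y), x, atol=1e-5))  # True: the layer is invertible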
A simple yet effective transformation that could be combined with a coupling layer is a permutation layer. Since permutation is volume-preserving, i.e., its Jacobian-determinant is equal to 1, we can apply it each time after the coupling layer. For instance, we can reverse the order of variables.
An example of an invertible block, i.e., a combination of a coupling layer with a permutation layer, is schematically presented in Fig. 3.4.

Fig. 3.4 A combination of a coupling layer and a permutation layer that transforms [x_a, x_b] to [z_a, z_b]. (a) A forward pass through the block. (b) An inverse pass through the block

Fig. 3.5 A schematic representation of the uniform dequantization for two binary random variables: (left) the probability mass is assigned to points, (right) after the uniform dequantization, the probability mass is assigned to square areas. Colors correspond to probability values
3.1.3.3 Dequantization
Let us turn math into code! We will first discuss the log-likelihood function (i.e., the learning objective) and how mathematical formulas correspond to the code. First, it is extremely important to know what our learning objective is, i.e., the log-likelihood function. In the example, we use coupling layers as described earlier,
together with permutation layers. Then, we can plug the logarithm of the Jacobian-
determinant for the coupling layers (for the permutation layers it is equal to 1, so
ln(1) = 0) in Eq. (3.14) that yields
ln p(x) = ln N(z₀ = f⁻¹(x)|0, I) − Σ_{k=1}^{K} Σ_{j=1}^{D−d} s_k(x_a^k)_j,   (3.21)
where s_k is the scale network in the k-th coupling layer, and x_a^k denotes the input to the k-th coupling layer. Notice that exp in the log-Jacobian-determinant is cancelled
by applying the logarithm.
Let us think again about the learning objective from the implementation perspec-
tive. First, we definitely need to obtain z by calculating f −1 (x), and then we can
calculate ln N z0 = f −1 (x)|0, I . That is actually easy, and we get
ln N(z₀ = f⁻¹(x)|0, I) = −const − ½ ‖f⁻¹(x)‖²,   (3.22)

where const = (D/2) ln(2π) is the normalizing constant of the standard Gaussian, and ½ ‖f⁻¹(x)‖² = MSE(0, f⁻¹(x)).
1
Alright, now we should look into the second part of the objective, i.e., the log-Jacobian-determinants. As we can see, we have a sum over transformations, and for each coupling layer, we consider only the outputs of the scale nets. Hence, the only thing we must remember when implementing the coupling layers is to return not only the output but also the outcome of the scale network.
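Putting Eqs. (3.21) and (3.22) together, the objective boils down to a few lines of code. Below is a minimal, illustrative sketch under the assumption that the flow returns z = f⁻¹(x) together with the accumulated outputs of the scale networks (the variable names are placeholders, not the ones used in the full implementation).

import math
import torch

def negative_log_likelihood(z, sum_of_scales):
    # z: the output of f^{-1}(x), shape (batch_size, D).
    # sum_of_scales: the accumulated sum of s_k(x_a^k) over all coupling layers, shape (batch_size,).
    D = z.shape[1]
    # ln N(z|0, I) = -(D/2) ln(2*pi) - 0.5 * ||z||^2, cf. Eq. (3.22).
    log_standard_normal = -0.5 * D * math.log(2.0 * math.pi) - 0.5 * (z ** 2).sum(dim=1)
    # -ln p(x) averaged over the batch, cf. Eq. (3.21).
    return -(log_standard_normal - sum_of_scales).mean()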
3.1.5 Code
Now, we have all components to implement our own RealNVP! Below, there is code with a lot of comments that should help to understand every single line of it.
The full code (with auxiliary functions) that you can play with is available here:
https://github.com/jmtomczak/intro_dgm.
class RealNVP(nn.Module):
    def __init__(self, nets, nett, num_flows, prior, D=2, dequantization=True):
        super(RealNVP, self).__init__()
        # ... (the remaining parts of the class are elided here; see the full code in the repository)

# An example of the translation network nett (the scale network nets is defined
# analogously in the full code):
nett = lambda: nn.Sequential(nn.Linear(D//2, M), nn.LeakyReLU(),
                             nn.Linear(M, M), nn.LeakyReLU(),
                             nn.Linear(M, D//2))
Et voilà! Now we are ready to run the full code. After training our RealNVP, we should obtain results resembling those in Fig. 3.6.

Fig. 3.6 An example of outcomes after the training: (a) Randomly selected real images. (b) Unconditional generations from the RealNVP. (c) The validation curve during training
Yes and no. Yes in the sense it is the minimalistic example of an implementation
of the RealNVP. No, because there are many improvements over the instance of the
RealNVP presented here, namely:
• Factoring-out [7]: During the forward pass (from x to z), we can split the
variables and proceed with processing only a subset of them. This could help
to parameterize the base distribution by using the outputs of intermediate layers.
In other words, we can obtain an autoregressive base distribution.
• Rezero trick [11]: Introducing additional parameters to the coupling layer, e.g., y_b = exp(α s(x_a)) ⊙ x_b + β t(x_a), where α, β are initialized with 0's. This helps to ensure that the transformations act as identity maps in the beginning. It is shown in [12] that this trick helps to learn better transformations by maintaining information about the input through all layers in the beginning of the training process.
• Masking or Checkerboard pattern [7]: We can use a checkerboard pattern instead
of dividing an input into two parts like [x1:D/2 , xD/2+1:D ]. This encourages
learning local statistics better.
• Squeezing [7]: We can also play around with “squeezing” some dimensions. For instance, an image that consists of C channels, width W, and height H could be turned into 4C channels, width W/2, and height H/2 (see the sketch after this list).
• Learnable base distributions: instead of using a standard Gaussian base distribu-
tion, we can consider another model for that, e.g., an autoregressive model.
• Invertible 1x1 convolution [8]: A fixed permutation could be replaced with a
(learned) invertible 1x1 convolution as in the GLOW model [8].
• Variational dequantization [13]: We can also pick a different dequantization scheme, e.g., variational dequantization. This allows us to obtain much better scores. However, it is not for free because it leads to a lower bound on the log-likelihood function.
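As mentioned in the squeezing bullet above, the operation rearranges 2×2 spatial blocks into channels. A minimal sketch of such a squeeze operation and its inverse is given below; it illustrates the idea rather than reproducing the exact RealNVP code.

import torch

def squeeze(x):
    # Turn (B, C, H, W) into (B, 4C, H/2, W/2) by moving 2x2 spatial blocks into channels.
    B, C, H, W = x.shape
    x = x.view(B, C, H // 2, 2, W // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(B, 4 * C, H // 2, W // 2)

def unsqueeze(x):
    # The inverse operation: (B, 4C, H/2, W/2) back to (B, C, H, W).
    B, C4, H2, W2 = x.shape
    x = x.view(B, C4 // 4, 2, 2, H2, W2)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(B, C4 // 4, H2 * 2, W2 * 2)

x = torch.randn(1, 3, 8, 8)
print(torch.allclose(unsqueeze(squeeze(x)), x))  # True: squeezing is invertible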
Moreover, there are many new fascinating research directions! I will name them
here and point to papers where you can find more details:
• Data compression with flows [14]: Flow-based models are perfect candidates for compression since they allow calculating the exact likelihood. Ho et al. [14] proposed a scheme that allows using flows in a bits-back-like compression scheme.
• Conditional flows [15–17]: Here, we present the unconditional RealNVP. How-
ever, we can use a flow-based model for conditional distributions. For instance,
we can use the conditioning as an input to the scale network and the translation
network.
• Variational inference with flows [1, 3, 18–21]: Conditional flow-based models
could be used to form a flexible family of variational posteriors. Then, the lower
bound to the log-likelihood function could be tighter. We will come back to that
in Chap. 4, Sect. 4.4.2.
3.1 Flows for Continuous Random Variables 39
• Integer discrete flows [12, 22, 23]: Another interesting direction is a version of
the RealNVP for integer-valued data. We will explain this idea in Sect. 3.2.
• Flows on manifolds [24]: Typically, flow-based models are considered in the
Euclidean space. However, they could be considered in non-Euclidean spaces,
resulting in new properties of (partially) invertible transformations.
• Flows for ABC [25]: Approximate Bayesian Computation (ABC) assumes that
the posterior over quantities of interest is intractable. One possible approach to
mitigate this issue is to approximate it using flow-based models, e.g., masked
autoregressive flows [26], as presented in [25].
Much more interesting information on flow-based models can be found in the fantastic review by Papamakarios et al. [27].
where ‖·‖₂ is the ℓ₂ matrix norm. Then Lip(g) = K < 1 and Lip(F) < 1 + K. Only in this specific case does the Banach fixed-point theorem hold, and the ResNet layer F has a unique inverse. As a result, the inverse can be approximated by fixed-point iterations [4].
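To illustrate the idea, below is a minimal sketch of an invertible residual layer F(x) = x + g(x) whose inverse is approximated by fixed-point iterations. Spectral normalization together with an extra scaling factor keeps Lip(g) safely below 1; the sizes, the factor, and the number of iterations are illustrative choices, not the settings of [4].

import torch
import torch.nn as nn

class InvertibleResidualLayer(nn.Module):
    def __init__(self, dim=2, hidden=8):
        super().__init__()
        # g is built from spectrally normalized linear layers and a 1-Lipschitz
        # non-linearity; the extra factor 0.5 keeps Lip(g) safely below 1.
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, hidden)), nn.ELU(),
            nn.utils.spectral_norm(nn.Linear(hidden, dim)))

    def g(self, x):
        return 0.5 * self.net(x)

    def forward(self, x):
        # F(x) = x + g(x)
        return x + self.g(x)

    def inverse(self, y, num_iters=50):
        # Fixed-point iteration: x_{k+1} = y - g(x_k), starting from x_0 = y.
        x = y
        for _ in range(num_iters):
            x = y - self.g(x)
        return x

layer = InvertibleResidualLayer().eval()  # eval() freezes the spectral-norm power iteration
with torch.no_grad():
    x = torch.randn(4, 2)
    print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-5))  # True (up to numerical error)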
Estimating the log-determinant is computationally intractable, especially in high-dimensional spaces, due to expensive computations. However, since ResNet blocks have a constrained Lipschitz constant, the logarithm of the Jacobian-determinant is cheaper to compute, tractable, and approximated with guaranteed convergence [4]:
ln p(x) = ln p(f(x)) + Σ_{k=1}^{∞} ((−1)^{k+1} / k) tr([J_g(x)]^k),   (3.24)
where J_g(x) is the Jacobian of g at x that satisfies ‖J_g‖ < 1. The Skilling–Hutchinson trace estimator [30, 31] is used to compute the trace at a lower cost than fully computing the trace of the Jacobian. Residual Flows [5] use an improved method to estimate the power series at an even lower cost with an unbiased estimator based on the “Russian roulette” estimator of [32]. Intuitively, the method estimates the infinite sum of the power series by evaluating only a finite number of terms. In return, this leads to less computation compared to invertible residual networks. To avoid derivative saturation, which occurs when the second derivative is zero in large regions, the LipSwish activation is proposed [4].
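Below is a minimal sketch of the Skilling–Hutchinson estimator applied to a truncated version of the power series in Eq. (3.24). It uses vector-Jacobian products instead of the full Jacobian; the truncation length and the toy network g are illustrative, Lip(g) < 1 is not enforced here, and the unbiased “Russian roulette” variant of Residual Flows is not included.

import torch
import torch.nn as nn

def log_det_estimate(g, x, num_terms=5):
    # Estimates sum_{k=1}^{num_terms} (-1)^{k+1} / k * tr(J_g(x)^k) with a single
    # Hutchinson probe: tr(A) ~ E_v[v^T A v] for v ~ N(0, I).
    x = x.requires_grad_(True)
    gx = g(x)
    v = torch.randn_like(x)
    w = v
    log_det = torch.zeros(x.shape[0])
    for k in range(1, num_terms + 1):
        # w <- J_g(x)^T w, computed as a vector-Jacobian product (no full Jacobian is built).
        w = torch.autograd.grad(gx, x, grad_outputs=w, retain_graph=True)[0]
        # v^T J_g(x)^k v, per example in the batch.
        log_det = log_det + ((-1.0) ** (k + 1) / k) * (w * v).sum(dim=1)
    return log_det  # use create_graph=True above if this estimate must be differentiated

g = nn.Sequential(nn.Linear(2, 2), nn.Tanh())
print(log_det_estimate(g, torch.randn(4, 2)))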
DenseNet Flows [6]
Since it is possible to formulate a flow for a ResNet architecture, a natural question is whether it could be accomplished for densely connected networks (DenseNets) [33]. In [6], it was shown that indeed it is possible!
The main component of DenseNet flows is a DenseBlock that is defined as a function F : R^d → R^d with F(x) = x + g(x), where g consists of dense layers {h_i}_{i=1}^{n}. Note that an important modification to make the model invertible is to output x + g(x), whereas a standard DenseBlock would only output g(x). The function g is expressed as follows:

g(x) = h_{n+1} ∘ h_n ∘ · · · ∘ h_1(x),   (3.25)
where hn+1 represents a 1×1 convolution to match the output size of Rd . A layer hi
consists of two parts concatenated to each other. The upper part is a copy of the input
signal. The lower part consists of the transformed input, where the transformation
is a multiplication of (convolutional) weights Wi with the input signal, followed by
a non-linearity φ having Lip(φ) ≤ 1, such as ReLU, ELU, LipSwish, or tanh. As an
example, a dense layer h2 can be composed as follows:
h_1(x) = [x ; φ(W_1 x)],   h_2(h_1(x)) = [h_1(x) ; φ(W_2 h_1(x))],   (3.26)

where [· ; ·] denotes the concatenation of the upper and lower parts.
The DenseNet flows [6] rely on the same techniques for approximating the
Jacobian-determinant as in the ResNet flows. The main difference between the
DenseNet flows and the ResNet flows lies in normalizing weights so that the Lips-
chitz constant of the transformation is smaller than 1 and, thus, the transformation
is invertible. Formally, to satisfy Lip(g) < 1, we need to enforce Lip(hi ) < 1 for
all n layers, since Lip(g) ≤ Lip(hn+1 ) · . . . · Lip(h1 ). Therefore, we first need to
determine the Lipschitz constant for a dense layer h_i. We know that a function f is K-Lipschitz if for all points v and w the following holds:

‖f(v) − f(w)‖ ≤ K ‖v − w‖,   (3.27)
where function f1 is the upper part and f2 is the lower part. We can now find an
analytical form to express a limit on K for the dense layer in the form of Eq. (3.27):
where we know that the Lipschitz constant of h consists of two parts, namely Lip(f_1) = K_1 and Lip(f_2) = K_2. Therefore, the Lipschitz constant of layer h can be expressed as

Lip(h) = √(K_1² + K_2²).   (3.30)
With spectral normalization of Eq. (3.23), we know that we can enforce (convolu-
tional) weights Wi to be at most 1-Lipschitz. Hence, for all n dense layers we apply
the spectral normalization on the lower part which locally enforces Lip(f2 ) = K2 <
1. Further, since we enforce each layer hi to be at most 1-Lipschitz and we start with
h_1, where f_1(x) = x, we know that Lip(f_1) = 1. Therefore, the Lipschitz constant of an entire layer can be at most Lip(h) = √(1² + 1²) = √2; thus, dividing by this limit enforces each layer to be at most 1-Lipschitz. To read more about DenseNet
flows and further improvements, please see the original paper [6].
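To make this concrete, below is a minimal sketch of a single such layer: the lower part uses a spectrally normalized (and thus approximately 1-Lipschitz) linear map followed by a 1-Lipschitz non-linearity, and the concatenated output is divided by √2, as derived above. The sizes are illustrative and this is only a sketch of the idea from [6], not the original implementation.

import math
import torch
import torch.nn as nn

class LipschitzDenseLayer(nn.Module):
    # h(x) = concat(x, phi(W x)) / sqrt(2), so that Lip(h) <= 1.
    def __init__(self, in_dim, growth):
        super().__init__()
        self.linear = nn.utils.spectral_norm(nn.Linear(in_dim, growth))  # at most ~1-Lipschitz
        self.phi = nn.ELU()  # a 1-Lipschitz non-linearity (ReLU, tanh, or LipSwish also work)

    def forward(self, x):
        out = torch.cat([x, self.phi(self.linear(x))], dim=1)
        return out / math.sqrt(2.0)

layer = LipschitzDenseLayer(in_dim=4, growth=8)
print(layer(torch.randn(2, 4)).shape)  # torch.Size([2, 12])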
3.2.1 Introduction
Fig. 3.7 Examples of: (a) homeomorphic spaces, and (b) non-homeomorphic spaces. The red
cross indicates it is impossible to invert the transformation
Fig. 3.8 An example of “replacing” a ring (in blue) with a ball (in magenta)
Another example that I really like, and which is closer to the potential issues
of continuous flows, is transforming a ring into a ball as in Fig. 3.8. The goal is to
replace the blue ring with the magenta ball. In order to make the transformation
bijective, while transforming the blue ring in place of the magenta ball, we must
ensure that the new magenta “ring” is in fact “broken” so that the new blue “ball”
can get inside! Again, why? If the magenta ring is not broken, then we cannot say how the blue ball got inside, which destroys bijectivity! In the language of topology, it is impossible because the two spaces are non-homeomorphic.
Alright, but how does this affect the flow-based models? I hope that some of you asked this question, or maybe even imagined possible cases where this might hinder learning flows. In general, I would say it is fine, and we should not look for faults where there are none or almost none. However, if you work with flows that require dequantization, then you can spot cases like the one in Fig. 3.9. In this simple
example, we have two discrete random variables that after uniform dequantization
have two regions with equal probability mass, and the remaining two regions with
zero probability mass [10]. After training a flow-based model, we have a density
estimator that assigns non-zero probability mass where the true distribution has zero
density! Moreover, the transformation in the flow must be a bijection, therefore,
there is a continuity between the two squares (see Fig. 3.9, right). Where did we see
that? Yes, in Fig. 3.8! We must know how to invert the transformation, thus, there
must be a “trace” of how the probability mass moves between the regions.
Again, we can ask ourselves if it is bad. Well, I would say not really, but if
we think of a case with more random variables, and there is always some little error
here and there, this causes a probability mass leakage that could result in a far-from-
perfect model. And, overall, the model could err in proper probability assignment.
Before we consider any specific cases and discuss discrete flows, first we need to
answer whether there is a change of variables formula for discrete random variables.
The answer, fortunately, is yes! Let us consider x ∈ XD where X is a discrete space,
Fig. 3.9 An example of uniformly dequantized discrete random variables (left) and a flow-based
model (right). Notice that in these examples, the true distribution assigns equal probability mass
to the two regions in orange, and zero probability mass to the remaining two regions (in black).
However, the flow-based model assigns probability mass outside the original non-zero probability
regions
e.g., X = {0, 1} or X = Z. Then the change of variables takes the following form:
p(x) = π(z₀ = f⁻¹(x)),   (3.32)
Fig. 3.10 An example of a discrete flow for two binary random variables. Colors represent various
probabilities (i.e., the sum of all squares is 1)
Fig. 3.11 An example of a discrete flow for two binary random variables but in the extended
space. Colors represent various probabilities (i.e., the sum of all squares is 1)
applying the discrete flow, and learning the discrete flow is equivalent to learning
the base distribution π .1 So we are back to square one.
However, as pointed out by van den Berg et al. [12], the situation looks differently
if we consider an extended space (or infinite space like Z). The discrete flow can still
only shuffle the probabilities, but now it can re-organize them in such a way that the
probabilities can be factorized! In other words, it can help the base distribution to be a product of marginals, π(z) = ∏_{d=1}^{D} π_d(z_d|θ_d), and the dependencies among
variables are now encoded in the invertible transformations. An example of this case
is presented in Fig. 3.11. We refer to [12] for a more thorough discussion with an
appropriate lemma.
This is amazing information! It means that building a flow-based model in the
discrete space makes sense. Now we can think of how to build an invertible neural
network in discrete spaces and we have it!
1 Well, this is not entirely true; we can still learn some correlations, but it is definitely highly limited.
We know now that it makes sense to work with discrete flows and that they are
flexible as long as we use extended spaces or infinite spaces like Z. However, the
question is how to formulate an invertible transformation (or rather: an invertible
neural network) that will output discrete values.
Hoogeboom et al. [22] proposed to focus on integers since they can be seen as
discretized continuous values. As such, we consider coupling layers [7] and modify
them accordingly. Let us remind ourselves of the definition of bipartite coupling layers for x ∈ R^D:

y_a = x_a   (3.33)
y_b = exp(s(x_a)) ⊙ x_b + t(x_a),   (3.34)

where s(·) and t(·) are arbitrary neural networks called scaling and translation, respectively.
Considering integer-valued variables, x ∈ ZD , requires modifying this transfor-
mation. First, using scaling might be troublesome because multiplying by integers
is still possible, but when we invert the transformation, we divide by integers, and
dividing an integer by an integer does not necessarily result in an integer. Therefore,
we must remove scaling just in case. Second, we use an arbitrary neural network for the translation. However, this network must return integers! Hoogeboom et al. [22] utilize a relatively simple trick, namely, they say that we can round the output of t(·) to the closest integer. As a result, we add (in the forward pass) or subtract (in the inverse pass) integers from integers, which is perfectly fine (the outcome is still integer-valued).
Eventually, we get the following bipartite coupling layer:

y_a = x_a   (3.35)
y_b = x_b + ⌊t(x_a)⌉,   (3.36)

where ⌊·⌉ is the rounding operator. An inquisitive reader could ask at this point
whether the rounding operator still allows using the backpropagation algorithm.
In other words, whether the rounding operator is differentiable. The answer is no,
but [22] showed that using the straight-through estimator (STE) of the gradient is sufficient. As a side note, the STE in this case uses the rounded output ⌊t(x_a)⌉ in the forward pass of the network, but it utilizes t(x_a) in the backward pass (to calculate gradients). van den Berg et al. [12] further indicated that indeed the STE works well
and the bias does not hinder training too much. The implementation of the rounding
operator using the STE is presented below.
import torch

# We need to turn torch.round (i.e., the rounding operator) into a differentiable
# function. For this purpose, we use the rounding in the forward pass, but the
# original input for the backward pass. This is nothing else than the STE.
class RoundStraightThrough(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        rounded = torch.round(input, out=None)
        return rounded

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = grad_output.clone()
        return grad_input
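A quick way to use and sanity-check this class is shown below: the forward pass returns rounded values, while the gradient passes through as if rounding were the identity. This is only a usage sketch.

round_ste = RoundStraightThrough.apply

x = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
y = round_ste(x)
print(y)            # tensor([ 0.,  2., -2.])
y.sum().backward()
print(x.grad)       # tensor([1., 1., 1.]) -- the rounding is "ignored" by the gradient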
y_1 = x_1 ◦ f_1(∅, x_{2:D})
y_2 = (g_2(y_1) ⊙ x_2) ◦ f_2(y_1, x_{3:D})
...
y_d = (g_d(y_{1:d−1}) ⊙ x_d) ◦ f_d(y_{1:d−1}, x_{d+1:D})
...
y_D = (g_D(y_{1:D−1}) ⊙ x_D) ◦ f_D(y_{1:D−1}, ∅)
is invertible.
Proof In order to invert y to x, we start from the last element to obtain the following:
Then, we can proceed with the next expressions in decreasing order (i.e., from D − 1 to 1) to eventually obtain
For instance, we can divide x into four parts, x = [xa , xb , xc , xd ], and the
following transformation (a quadripartite coupling layer) is invertible [23]:
ya = xa + t (xb , xc , xd ) (3.37)
yb = xb + t (ya , xc , xd ) (3.38)
yc = xc + t (ya , yb , xd ) (3.39)
yd = xd + t (ya , yb , yc ). (3.40)
π(z) = ∏_{d=1}^{D} π_d(z_d)   (3.41)
     = ∏_{d=1}^{D} DL(z_d|μ_d, ν_d),   (3.42)
where πd (zd ) = DL(zd |μd , νd ) is the discretized logistic distribution that is defined
as a difference of CDFs of the logistic distribution as follows [34]:
π(z) = sigm ((z + 0.5 − μ)/ν) − sigm ((z − 0.5 − μ)/ν) , (3.43)
Fig. 3.12 An example of the discretized logistic distribution with μ = 0 and ν = 1. The magenta
area on the y-axis corresponds to the probability mass of a bin of size 1
where μ ∈ R and ν > 0 denote the mean and the scale, respectively, sigm(·) is
the sigmoid function. Notice that this is equivalent to calculating the probability of
z falling into a bin of length 1, therefore, we add 0.5 in the first CDF and subtract
0.5 from the second CDF. An example of the discretized distribution is presented in
Fig. 3.12 and the implementation follows. Interestingly, we can use this distribution
to replace the Categorical distribution in Chap. 2, as it was done in [18]. We can
even use a mixture of discretized logistic distribution to further improve the final
performance [22, 35].
# This function implements the log of the discretized logistic distribution.
# It uses log_min_exp, an auxiliary function from the accompanying repository.
def log_integer_probability(x, mean, logscale):
    scale = torch.exp(logscale)

    logp = log_min_exp(
        F.logsigmoid((x + 0.5 - mean) / scale),
        F.logsigmoid((x - 0.5 - mean) / scale))

    return logp
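The function above relies on the auxiliary helper log_min_exp, available in the accompanying repository. A minimal sketch of such a helper, computing ln(exp(a) − exp(b)) in a numerically stable way under the assumption that a > b element-wise, could look as follows.

import torch

def log_min_exp(a, b, epsilon=1e-8):
    # ln(exp(a) - exp(b)) = a + ln(1 - exp(b - a)); the epsilon guards against log(0).
    return a + torch.log(1.0 - torch.exp(b - a) + epsilon)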
ln p(x) = Σ_{d=1}^{D} ln DL(z_d = f⁻¹(x)_d | μ_d, ν_d)   (3.44)
        = Σ_{d=1}^{D} ln (sigm((z_d + 0.5 − μ_d)/ν_d) − sigm((z_d − 0.5 − μ_d)/ν_d)),   (3.45)
where we make all μd and νd learnable parameters. Notice that νd must be positive
(strictly larger than 0), therefore, in the implementation, we will consider the
logarithm of the scale because taking exp of the log-scale ensures having strictly
positive values.
3.2.4 Code
Now, we have all components to implement our own Integer Discrete Flow (IDF)!
Below, there is code with a lot of comments that should help to understand every single line of it.
# That's the class of the Integer Discrete Flows (IDFs).
# There are two options implemented:
# Option 1: The bipartite coupling layers as in (Hoogeboom et al., 2019).
# Option 2: A new coupling layer with 4 parts as in (Tomczak, 2021).
# We implement the second option explicitly, without any loop, so that it is
# very clear how it works.
class IDF(nn.Module):
    def __init__(self, netts, num_flows, D=2):
        super(IDF, self).__init__()

        print('IDF by JT.')

        # Option 1:
        if len(netts) == 1:
            self.t = torch.nn.ModuleList([netts[0]() for _ in range(num_flows)])
            self.idf_git = 1

        # Option 2:
        elif len(netts) == 4:
            self.t_a = torch.nn.ModuleList([netts[0]() for _ in range(num_flows)])
            self.t_b = torch.nn.ModuleList([netts[1]() for _ in range(num_flows)])
            self.t_c = torch.nn.ModuleList([netts[2]() for _ in range(num_flows)])
            self.t_d = torch.nn.ModuleList([netts[3]() for _ in range(num_flows)])
            self.idf_git = 4

        else:
            raise ValueError('You can provide either 1 or 4 translation nets.')

        # ... (code elided; see the full code in the repository)

        # Option 1:
        if self.idf_git == 1:
            (xa, xb) = torch.chunk(x, 2, 1)

            if forward:
                yb = xb + self.round(self.t[index](xa))
            else:
                yb = xb - self.round(self.t[index](xa))

        # Option 2:
        elif self.idf_git == 4:
            (xa, xb, xc, xd) = torch.chunk(x, 4, 1)

            if forward:
                ya = xa + self.round(self.t_a[index](torch.cat((xb, xc, xd), 1)))
                yb = xb + self.round(self.t_b[index](torch.cat((ya, xc, xd), 1)))
                yc = xc + self.round(self.t_c[index](torch.cat((ya, yb, xd), 1)))
                yd = xd + self.round(self.t_d[index](torch.cat((ya, yb, yc), 1)))
            else:
                yd = xd - self.round(self.t_d[index](torch.cat((xa, xb, xc), 1)))
                yc = xc - self.round(self.t_c[index](torch.cat((xa, xb, yd), 1)))
                yb = xb - self.round(self.t_b[index](torch.cat((xa, yc, yd), 1)))
                ya = xa - self.round(self.t_a[index](torch.cat((yb, yc, yd), 1)))

        # ...
        return z

        # ...
        return x
Below, we provide examples of neural networks that could be used to run the
IDFs.
# The number of invertible transformations
num_flows = 8

# ...

if idf_git == 1:
    nett = lambda: nn.Sequential(
        nn.Linear(D//2, M), nn.LeakyReLU(),
        nn.Linear(M, M), nn.LeakyReLU(),
        nn.Linear(M, D//2))
    netts = [nett]

elif idf_git == 4:
    nett_a = lambda: nn.Sequential(
        nn.Linear(3 * (D//4), M), nn.LeakyReLU(),
        nn.Linear(M, M), nn.LeakyReLU(),
        nn.Linear(M, D//4))
    # ... (nett_b, nett_c, and nett_d are defined analogously; see the full code in the repository)
# Init IDF
model = IDF(netts, num_flows, D=D)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, 64), show_input=False, show_hierarchical=False))

Fig. 3.13 An example of outcomes after the training: (a) Randomly selected real images. (b) Unconditional generations from the IDF with bipartite coupling layers. (c) Unconditional generations from the IDF with quadripartite coupling layers
And we are done, this is all we need to have! After running the code (take a
look at: https://github.com/jmtomczak/intro_dgm) and training the IDFs, we should
obtain results similar to those in Fig. 3.13.
References
1. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In
International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
2. Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv preprint arXiv:1302.5125, 2013.
3. Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 393–402. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
4. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen.
Invertible residual networks. In International Conference on Machine Learning, pages 573–
582. PMLR, 2019.
5. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
6. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
7. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
8. Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1×1
convolutions. In Proceedings of the 32nd International Conference on Neural Information
Processing Systems, pages 10236–10245, 2018.
9. Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative
models. arXiv preprint arXiv:1511.01844, 2015.
10. Emiel Hoogeboom, Taco S Cohen, and Jakub M Tomczak. Learning discrete distributions by
dequantization. arXiv preprint arXiv:2001.11235, 2020.
11. Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W
Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. arXiv
preprint arXiv:2003.04887, 2020.
12. Rianne van den Berg, Alexey A Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby, and Tim
Salimans. IDF++: Analyzing and improving integer discrete flows for lossless compression.
arXiv e-prints, 2020.
13. Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving
flow-based generative models with variational dequantization and architecture design. In
International Conference on Machine Learning, pages 2722–2730. PMLR, 2019.
14. Jonathan Ho, Evan Lohn, and Pieter Abbeel. Compression with flows via local bits-back
coding. arXiv preprint arXiv:1905.08500, 2019.
15. Michał Stypułkowski, Kacper Kania, Maciej Zamorski, Maciej Zięba, Tomasz Trzciński,
and Jan Chorowski. Representing point clouds with generative conditional invertible flow
networks. arXiv preprint arXiv:2010.11087, 2020.
16. Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods
with conditional normalizing flows. arXiv preprint arXiv:1912.00042, 2019.
17. Valentin Wolf, Andreas Lugmayr, Martin Danelljan, Luc Van Gool, and Radu Timofte. Deflow:
Learning complex image degradations from unpaired data with conditional flows. arXiv
preprint arXiv:2101.05796, 2021.
18. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural
Information Processing Systems, 29:4743–4751, 2016.
19. Emiel Hoogeboom, Victor Garcia Satorras, Jakub M Tomczak, and Max Welling. The
convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910,
2020.
20. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv preprint arXiv:1611.09630, 2016.
21. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using convex
combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.
22. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
23. Jakub M Tomczak. General invertible transformations for flow-based generative modeling.
INNF+, 2021.
24. Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density
estimation. arXiv preprint arXiv:2003.13913, 2020.
25. George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast
likelihood-free inference with autoregressive flows. In The 22nd International Conference on
Artificial Intelligence and Statistics, pages 837–848. PMLR, 2019.
26. George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for
density estimation. arXiv preprint arXiv:1705.07057, 2017.
27. George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji
Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv
preprint arXiv:1912.02762, 2019.
28. Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural
networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.
29. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normaliza-
tion for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
30. John Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and
Bayesian Methods, pages 455–466. Springer, 1989.
31. Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian
smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–
450, 1990.
32. Herman Kahn. Use of different Monte Carlo sampling techniques. Proceedings of Symposium
on Monte Carlo Methods, 1955.
33. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In IEEE Conference on Computer Vision and Pattern
Recognition, 2017.
34. Subrata Chakraborty and Dhrubajyoti Chakravarty. A new discrete probability distribution with
integer support on (−∞, ∞). Communications in Statistics-Theory and Methods, 45(2):492–
505, 2016.
35. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the
PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint
arXiv:1701.05517, 2017.
36. Emile van Krieken, Jakub M Tomczak, and Annette ten Teije. Storchastic: A framework
for general stochastic automatic differentiation. Advances in Neural Information Processing
Systems, 2021.
37. Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, and Jascha Sohl-Dickstein.
Guided evolutionary strategies: Augmenting random search with surrogate gradients. In
International Conference on Machine Learning, pages 4264–4273. PMLR, 2019.
Chapter 4
Latent Variable Models
4.1 Introduction
Fig. 4.1 A diagram presenting a latent variable model and a generative process. Notice the low-
dimensional manifold (here 2D) embedded in the high-dimensional space (here 3D)
In plain words, we first sample z (e.g., we imagine the size, the shape, and
the color of a horse) and then create an image with all necessary details, i.e., we
sample x from the conditional distribution p(x|z). One can ask whether we need
probabilities here but try to create precisely the same image at least two times. Due
to various external factors, it is almost impossible to create two identical images.
That is why probability theory is so beautiful and allows us to describe reality!
The idea behind latent variable models is that we introduce the latent variables
z and the joint distribution is factorized as follows: p(x, z) = p(x|z)p(z). This
naturally expresses the generative process described above. However, for training,
we have access only to x. Therefore, according to probabilistic inference, we should
sum out (or marginalize out) the unknown, namely, z. As a result, the (marginal)
likelihood function is the following:
p(x) = ∫ p(x|z)p(z) dz.   (4.1)
x = Wz + b + ε, (4.2)
where ε ∼ N(ε|0, σ²I). The property of the Gaussian distribution yields [1]

p(x|z) = N(x|Wz + b, σ²I).   (4.3)
Now, we are able to calculate the logarithm of the (marginal) likelihood function
ln p(x)! We refer to [1, 2] for more details on learning the parameters in the pPCA
model. Moreover, what is interesting about the pPCA is that, due to the properties
of Gaussians, we can also calculate the true posterior over z analytically:
p(z|x) = N(M⁻¹W⊤(x − μ), σ⁻²M),   (4.7)
Let us take a look at the integral one more time and think of a general case where we
cannot calculate it analytically. The simplest approach would be to use the Monte
Carlo approximation:
p(x) = ∫ p(x|z) p(z) dz   (4.8)
     = E_{z∼p(z)}[p(x|z)]   (4.9)
     ≈ (1/K) Σ_{k=1}^{K} p(x|z_k),   (4.10)
where, in the last line, we use samples from the prior over latents, zk ∼ p(z).
Such an approach is relatively easy and since our computational power grows so
fast, we can sample a lot of points in reasonably short time. However, as we know
from statistics, if z is multidimensional and its dimensionality M is relatively large, we get into a
trap of the curse of dimensionality, and to cover the space properly, the number of
samples grows exponentially with respect to M. If we take too few samples, then
the approximation is simply very poor.
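To see the naive estimator in code, below is a minimal sketch for a toy model with a standard Gaussian prior and a Gaussian likelihood; the model, the latent dimensionality, and the number of samples are purely illustrative.

import torch

M, K = 2, 1000                      # latent dimensionality and number of samples
prior = torch.distributions.Normal(torch.zeros(M), torch.ones(M))

def log_likelihood(x, z):
    # A toy conditional p(x|z) = N(x | z, I), just for illustration.
    return torch.distributions.Normal(z, torch.ones_like(z)).log_prob(x).sum(dim=-1)

x = torch.tensor([0.5, -0.3])
z = prior.sample((K,))              # z_k ~ p(z)
# p(x) ~ (1/K) * sum_k p(x|z_k); computed in log-space for numerical stability.
log_p_x = torch.logsumexp(log_likelihood(x, z), dim=0) - torch.log(torch.tensor(float(K)))
print(log_p_x.item())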
We can use more advanced Monte Carlo techniques [3]; however, they still suffer
from issues associated with the curse of dimensionality. An alternative approach is
the application of variational inference [4]. Let us consider a family of variational
distributions parameterized by φ, {qφ (z)}φ . For instance, we can consider Gaussians
with means and variances, φ = {μ, σ 2 }. We know the form of these distributions,
and we assume that they assign non-zero probability mass to all z ∈ ZM . Then, the
logarithm of the marginal distribution could be approximated as follows:
ln p(x) = ln ∫ p(x|z)p(z) dz   (4.11)
        = ln ∫ (qφ(z)/qφ(z)) p(x|z)p(z) dz   (4.12)
        = ln E_{z∼qφ(z)}[p(x|z)p(z)/qφ(z)]   (4.13)
        ≥ E_{z∼qφ(z)}[ln (p(x|z)p(z)/qφ(z))]   (4.14)
        = E_{z∼qφ(z)}[ln p(x|z) + ln p(z) − ln qφ(z)]   (4.15)
        = E_{z∼qφ(z)}[ln p(x|z)] − E_{z∼qφ(z)}[ln qφ(z) − ln p(z)].   (4.16)
For completeness, we provide also a different derivation of the ELBO that will help
us to understand why the lower bound might be tricky sometimes:
= E_{z∼qφ(z|x)}[ln p(x|z) − ln (qφ(z|x)/p(z)) + ln (qφ(z|x)/p(z|x))]   (4.23)
= E_{z∼qφ(z|x)}[ln p(x|z)] − KL[qφ(z|x)‖p(z)] + KL[qφ(z|x)‖p(z|x)].   (4.24)
Please note that in the derivation above we use the sum and the product rules together with multiplying by 1 = qφ(z|x)/qφ(z|x), nothing else, no dirty tricks here! Please
try to replicate this by yourself, step by step. If you understand this derivation well,
it would greatly help you to see where potential problems of the VAEs (and the
latent variable models in general) lie.
Once you have analyzed this derivation, let us take a closer look at it:

ln p(x) = E_{z∼qφ(z|x)}[ln p(x|z)] − KL[qφ(z|x)‖p(z)] + KL[qφ(z|x)‖p(z|x)],   (4.25)

where the first two terms constitute the ELBO and the last term is always ≥ 0.
The last component, KL qφ (z|x)p(z|x) , measures the difference between the
variational posterior and the real posterior, but we do not know what the real
posterior is! However, we can skip this part since the Kullback–Leibler divergence
is always equal to or greater than 0 (from its definition) and, thus, we are left with
the ELBO. We can think of KL qφ (z|x)p(z|x) as a gap between the ELBO and
the true log-likelihood.
Beautiful! But OK, why is this so important? Well, if we take a qφ(z|x) that is a bad approximation of p(z|x), then the KL term will be larger, and even if the ELBO is optimized well, the gap between the ELBO and the true log-likelihood could be huge! In plain words, if we take a too simplistic posterior, we can end up with a bad
VAE anyway. What is “bad” in this context? Let us take a look at Fig. 4.2. If the
ELBO is a loose lower bound of the log-likelihood, then the optimal solution of the
ELBO could be completely different than the solution of the log-likelihood. We will
comment on how to deal with that later on and, for now, it is enough to be aware of
that issue.
Let us wrap up what we know right now. First of all, we consider a class of amortized
variational posteriors {qφ (z|x)}φ that approximate the true posterior p(z|x). We can
see them as stochastic encoders. Second, the conditional likelihood p(x|z) could
be seen as a stochastic decoder. Third, the last component, p(z), is the marginal
distribution, also referred to as a prior. Lastly, the objective is the ELBO, a lower
bound to the log-likelihood function:
ln p(x) ≥ E_{z∼qφ(z|x)}[ln p(x|z)] − E_{z∼qφ(z|x)}[ln qφ(z|x) − ln p(z)].   (4.26)
There are two questions left to get the full picture of the VAEs:
1. How to parameterize the distributions?
2. How to calculate the expected values? After all, these integrals have not
disappeared!
As you can probably guess by now, we use neural networks to parameterize the
encoders and the decoders. But before we use the neural networks, we should know
what distributions we use! Fortunately, in the VAE framework we are almost free to
choose any distributions! However, we must remember that they should make sense
for a considered problem. So far, we have explained everything through images,
so let us continue that. If x ∈ {0, 1, . . . , 255}D , then we cannot use a Normal
distribution, because its support is totally different than the support of discrete-
valued images. A possible distribution we can use is the categorical distribution,
that is:

p(x|z) = Categorical(x|θ(z)),   (4.27)
where the probabilities are given by a neural network NN, namely, θ (z) =
softmax (NN(z)). The neural network NN could be an MLP, a convolutional neural
network, RNNs, etc.
The choice of a distribution for the latent variables depends on how we want to
express the latent factors in data. For convenience, typically z is taken as a vector
of continuous random variables, z ∈ RM . Then, we can use Gaussians for both the
variational posterior and the prior:
qφ(z|x) = N(z|μφ(x), diag[σφ²(x)]),   (4.28)
where μφ (x) and σφ2 (x) are outputs of a neural network, similarly to the case of
the decoder. In practice, we can have a shared neural network NN(x) that outputs
2M values that are further split into M values for the mean μ and M values for
the variance σ 2 . For convenience, we consider a diagonal covariance matrix. We
could use flexible posteriors (see Sect. 4.4.2). Moreover, here we take the standard
Gaussian prior. We will comment on that later (see Sect. 4.4.1).
So far, we played around with the log-likelihood and we ended up with the ELBO.
However, there is still a problem with calculating the expected value, because it
contains an integral! Therefore, the question is how we can calculate it and why it
is better than the MC-approximation of the log-likelihood without the variational
posterior. In fact, we will use the MC-approximation, but now, instead of sampling
from the prior p(z), we will sample from the variational posterior qφ (z|x). Is it
better? Yes, because the variational posterior assigns typically more probability
mass to a smaller region than the prior. If you play around with your code of
a VAE and examine the variance, you will probably notice that the variational
posteriors are almost deterministic (whether it is good or bad is rather an open
question). As a result, we should get a better approximation! However, there is still
an issue with the variance of the approximation. Simply speaking, if we sample z
from qφ (z|x), plug them into the ELBO, and calculate gradients with respect to the
parameters of a neural network φ, the variance of the gradient may still be pretty
large! A possible solution to that, first noticed by statisticians (e.g., see [8]), is
the approach of reparameterizing the distribution. The idea is to realize that we
can express a random variable as a composition of primitive transformations (e.g.,
arithmetic operations, logarithm, etc.) of an independent random variable with a
simple distribution. For instance, if we consider a Gaussian random variable z with a mean μ and a variance σ², and an independent random variable ε ∼ N(ε|0, 1), then the following holds (see Fig. 4.3):

z = μ + σ · ε.   (4.30)
Now, if we start sampling ε from the standard Gaussian and apply the above transformation, then we get a sample from N(z|μ, σ²)!
If you do not remember this fact from statistics, or you simply do not believe
me, write a simple code for that and play around with it. In fact, this idea could be
applied to many more distributions [9].
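As suggested above, this fact is easy to verify with a few lines of code; a minimal sketch is given below.

import torch

mu, sigma = 1.5, 0.5
eps = torch.randn(100000)          # eps ~ N(0, 1)
z = mu + sigma * eps               # the reparameterization, Eq. (4.30)
print(z.mean().item(), z.std().item())  # close to 1.5 and 0.5, i.e., z ~ N(mu, sigma^2)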
The reparameterization trick could be used in the encoder qφ (z|x). As observed
by Kingma and Welling [6] and Rezende et al. [7], we can drastically reduce
the variance of the gradient by using this reparameterization of the Gaussian
distribution. Why? Because the randomness comes from the independent source
p(), and we calculate gradient with respect to a deterministic function (i.e., a neural
network), not random objects. Even better, since we learn the VAE using stochastic
gradient descent, it is enough to sample z only once during training!
We went through a lot of theory and discussions, and you might think it is impossible
to implement a VAE. However, it is actually simpler than it might look. Let us sum
up what we know so far and focus on very specific distributions and neural networks.
First of all, we will use the following distributions:
• qφ(z|x) = N(z|μφ(x), σφ²(x));
• p(z) = N(z|0, I);
• pθ(x|z) = Categorical(x|θ(z)).
We assume that xd ∈ X = {0, 1, . . . , L − 1}.
Next, we will use the following networks:
• The encoder network:
Notice that the last layer outputs 2M values because we must have M values
for the mean and M values for the (log-)variance. Moreover, a variance must be
positive; therefore, instead, we consider the logarithm of the variance because it
can take real values then. As a result, we do not need to bother about variances
being always positive. An alternative is to apply a non-linearity like softplus.
• The decoder network:
Since we use the categorical distribution for x, the outputs of the decoder network
are probabilities. Thus, the last layer must output D · L values, where D is the
number of pixels and L is the number of possible values of a pixel. Then, we
must reshape the output to a tensor of the following shape: (B, D, L), where B
is the batch size. Afterward, we can apply the softmax activation function to obtain probabilities (a minimal sketch of both networks follows this list).
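Since the exact architectures are given in the accompanying code, here is only a minimal sketch of an encoder and a decoder network consistent with the description above; the sizes D, L, M, and the hidden width are illustrative placeholders.

import torch.nn as nn

D, L, M, H = 64, 17, 16, 256   # illustrative sizes: pixels, pixel values, latents, hidden units

# The encoder network outputs 2M values: M means and M log-variances.
encoder_net = nn.Sequential(nn.Linear(D, H), nn.LeakyReLU(),
                            nn.Linear(H, H), nn.LeakyReLU(),
                            nn.Linear(H, 2 * M))

# The decoder network outputs D * L values that are reshaped to (B, D, L)
# and passed through a softmax to obtain categorical probabilities.
decoder_net = nn.Sequential(nn.Linear(M, H), nn.LeakyReLU(),
                            nn.Linear(H, H), nn.LeakyReLU(),
                            nn.Linear(H, D * L))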
Finally, for a given dataset D = {x_n}_{n=1}^{N}, the training objective is the ELBO where we use a single sample from the variational posterior, zφ,n = μφ(x_n) + σφ(x_n) ⊙ ε. We must remember that in almost any available package we minimize by default, so we must take the negative sign, namely:

−ELBO(D; θ, φ) = Σ_{n=1}^{N} −{ ln Categorical(x_n|θ(zφ,n)) + [ ln N(zφ,n|0, I) − ln N(zφ,n|μφ(x_n), σφ²(x_n)) ] }.   (4.31)
So as you can see, the whole math boils down to a relatively simple learning
procedure:
1. Take x_n and apply the encoder network to get μφ(x_n) and ln σφ²(x_n).
2. Calculate zφ,n by applying the reparameterization trick, zφ,n = μφ(x_n) + σφ(x_n) ⊙ ε, where ε ∼ N(0, I).
3. Apply the decoder network to zφ,n to get the probabilities θ (zφ,n ).
4. Calculate the ELBO by plugging in xn , zφ,n , μφ (xn ), and ln σφ2 (xn ).
4.3.5 Code
Now, all components are ready to be turned into a code! For the full implementation,
please take a look at https://github.com/jmtomczak/intro_dgm. Here, we focus only
on the code for the VAE model. We provide details in the comments. We divide the
code into four classes: Encoder, Decoder, Prior, and VAE. It might look like overkill,
but it may help you to think of the VAE as a composition of three parts and better
comprehend the whole approach.
class Encoder(nn.Module):
    def __init__(self, encoder_net):
        super(Encoder, self).__init__()
        # ... (the remaining parts of the class are elided; see the full code in the repository)

    # Sampling procedure.
    def sample(self, x=None, mu_e=None, log_var_e=None):
        # If we don't provide a mean and a log-variance, we must first calculate it:
        if (mu_e is None) and (log_var_e is None):
            mu_e, log_var_e = self.encode(x)
        else:
            # Otherwise, both statistics must be provided.
            if (mu_e is None) or (log_var_e is None):
                raise ValueError('mu and log-var can`t be None!')
        # The final sample is obtained by applying the reparameterization trick.
        z = self.reparameterization(mu_e, log_var_e)
        return z


# Fragments of the Decoder class: both the categorical and the Bernoulli
# distributions are supported, and any other choice raises an error.
        # ...
        else:
            raise ValueError('Only: `categorical`, `bernoulli`')
        # ...
        return x_new
        # ...
        return log_p


class VAE(nn.Module):
    # ...
        print('VAE by JT.')
    # ...
        # ELBO
        RE = self.decoder.log_prob(x, z)
        KL = (self.prior.log_prob(z) - self.encoder.log_prob(mu_e=mu_e,
                                                             log_var_e=log_var_e, z=z)).sum(-1)

        if reduction == 'sum':
            return -(RE + KL).sum()
        else:
            return -(RE + KL).mean()
Perfect! Now we are ready to run the full code, and after training our VAE, we should obtain results similar to those in Fig. 4.4.

Fig. 4.4 An example of outcomes after the training: (a) Randomly selected real images. (b) Unconditional generations from the VAE. (c) The validation curve during training
VAEs constitute a very powerful class of models, mainly due to their flexibility.
Unlike flow-based models, they do not require the invertibility of neural networks
and, thus, we can use any arbitrary architecture for encoders and decoders. In
contrast to ARMs, they learn a low-dimensional data representation and we can
control the bottleneck (i.e., the dimensionality of the latent space). However, they
also suffer from several issues. Besides the ones mentioned before (i.e., the necessity of an efficient estimation of the integral, and the gap between the ELBO and the log-likelihood function for too simplistic variational posteriors), the potential problems are the following:
• Let us take a look at the ELBO and the regularization term. For a non-trainable
prior like the standard Gaussian, the regularization term will be minimized if
∀x qφ (z|x) = p(z). This may happen if the decoder is so powerful that it treats z
as a noise, e.g., a decoder is expressed by an ARM [10]. This issue is known as
the posterior collapse [11].
• Another issue is associated with a mismatch between the aggregated posterior, qφ(z) = (1/N) Σ_n qφ(z|x_n), and the prior p(z). Imagine that we have the standard Gaussian prior and the aggregated posterior (i.e., an average of variational posteriors over all training data). As a result, there are regions where the prior assigns high probability but the aggregated posterior assigns low probability, or the other way around. Then, sampling from these holes provides unrealistic latent
values and the decoder produces images of very low quality. This problem is
referred to as the hole problem [12].
• The last problem we want to discuss is more general and, in fact, it affects all
deep generative models. As it was noticed in [13], the deep generative models
(including VAEs) fail to properly detect out-of-distribution examples. Out-of-
distribution datapoints are examples that follow a totally different distribution
than the one a model was trained on. For instance, let us assume that our model
is trained on MNIST, then Fashion MNIST examples are out-of-distribution.
Thus, intuition tells us that a properly trained deep generative model should assign high probability to in-distribution examples and low probability to out-of-distribution points. Unfortunately, as shown in [13], this is not the case. The
out-of-distribution problem remains one of the main unsolved problems in deep
generative modeling [14].
There is a plethora of papers that extend VAEs and apply them to many problems. Below, we will list selected papers and only touch upon the vast literature on the topic!
ln p(x) ≈ ln ( (1/K) Σ_{k=1}^{K} p(x|z_k) p(z_k) / qφ(z_k|x) ),   (4.32)
where zk ∼ qφ (zk |x). Notice that the logarithm is outside the expected value. As
shown in [15], using importance weighting with sufficiently large K gives a good
estimate of the log-likelihood. In practice, K is taken to be 512 or more if the
computational budget allows.
Enhancing VAEs: Better Encoders After introducing the idea of VAEs, many
papers focused on proposing a flexible family of variational posteriors. The most
prominent direction is based on utilizing conditional flow-based models [16–21].
We discuss this topic more in Sect. 4.4.2.
Enhancing VAEs: Better Decoders VAEs allow using any neural network to
parameterize the decoder. Therefore, we can use fully connected networks, fully
convolutional networks, ResNets, or ARMs. For instance, in [22], a PixelCNN-based decoder was utilized in a VAE.
Enhancing VAEs: Better Priors As mentioned before, this could be a serious issue if there is a big mismatch between the aggregated posterior and the prior. There are many papers that try to alleviate this issue by using a multimodal prior mimicking the aggregated posterior (known as the VampPrior) [23], a flow-based prior (e.g., [24, 25]), an ARM-based prior [26], or an idea of resampling [27].
We present various priors in Sect. 4.4.1.
Extending VAEs Here, we present the unsupervised version of VAEs. However,
there is no restriction to that and we can introduce labels or other variables. In
[28] a semi-supervised VAE was proposed. This idea was further extended to the
concept of fair representations [29, 30]. In [30], the authors proposed a specific
latent representation that allows domain generalization in VAEs. In [31] variational
inference and the reparameterization trick were used for Bayesian Neural Nets. [31]
is not necessarily introducing a VAE, but a VAE-like way of dealing with Bayesian
neural nets.
VAEs for Non-image Data So far, we have explained everything using images. However, there is no restriction to that! In [11] a VAE was proposed to deal with sequential
data (e.g., text). The encoder and the decoder were parameterized by LSTMs. An
interesting application of the VAE framework was also presented in [32] where
VAEs were used for the molecular graph generation. In [26] the authors proposed a
VAE-like model for video compression.
component mentioned in some papers (e.g., [23, 48]), a different approach would
be to train the prior with an adversarial loss. Further, [47] presents various ideas on how auto-encoders could benefit from adversarial learning.
4.4.1 Priors
where we explicitly highlight the summation over training data, namely, the expected value with respect to x's from the empirical distribution p_data(x) = (1/N) Σ_{n=1}^{N} δ(x − x_n), and δ(·) is the Dirac delta.
The ELBO consists of two parts, namely, the reconstruction error:

RE = E_{x∼p_data(x)}[E_{qφ(z|x)}[ln pθ(x|z)]],   (4.34)

and the regularization term between the encoder and the prior:

E_{x∼p_data(x)}[E_{qφ(z|x)}[ln pλ(z) − ln qφ(z|x)]].   (4.35)
The regularization term can be further rewritten as follows:

= (1/N) Σ_{n=1}^{N} ∫ qφ(z|x_n) [ln pλ(z) − ln qφ(z|x_n)] dz   (4.40)
= (1/N) Σ_{n=1}^{N} ∫ qφ(z|x_n) ln pλ(z) dz − (1/N) Σ_{n=1}^{N} ∫ qφ(z|x_n) ln qφ(z|x_n) dz   (4.41)
= ∫ qφ(z) ln pλ(z) dz − (1/N) Σ_{n=1}^{N} ∫ qφ(z|x_n) ln qφ(z|x_n) dz   (4.42)
= −CE[qφ(z)‖pλ(z)] + H[qφ(z|x)],   (4.43)
where we use the property of the Dirac delta, ∫ δ(a − a′) f(a) da = f(a′), and we use the notion of the aggregated posterior [47, 48] defined as follows:

qφ(z) = (1/N) Σ_{n=1}^{N} qφ(z|x_n).   (4.44)
rewrite using three terms, with the total correlation [49]. We will not use it here,
so it is left as a “homework.”
Anyway, one may ask why it is useful to rewrite the ELBO. The answer is rather straightforward: We can analyze it from a different perspective! In this section, we
will focus on the prior, an important component in the generative part that is very
often neglected. Many Bayesians state that a prior should not be learned.
But VAEs are not Bayesian models, please remember that! Besides, who says we
cannot learn the prior? As we will see shortly, a non-learnable prior could be pretty
annoying, especially for the generation process.
What Does ELBO Tell Us About the Prior?
Alright, we see that the regularization term consists of the cross-entropy and the entropy. Let us start with the entropy since it is easier to analyze. While optimizing, we want to maximize the ELBO and, hence, we maximize the entropy:

H[qφ(z|x)] = −(1/N) Σ_{n=1}^{N} ∫ qφ(z|x_n) ln qφ(z|x_n) dz.   (4.45)
The cross-entropy term influences the VAE in a different manner. First, we can ask the question of how to interpret the cross-entropy between qφ(z) and pλ(z). In general, the cross-entropy tells us the average number of bits (or rather nats, because we use the natural logarithm) needed to identify an event drawn from qφ(z) if a coding scheme used for it is pλ(z). Notice that in the regularization term we have the negative cross-entropy.
Since we maximize the ELBO, it means that we aim for minimizing
CE[qφ(z)‖pλ(z)]. This makes sense because we would like qφ(z) to match pλ(z).
And we have accidentally touched upon the most important issue here: What do we
really want here? The cross-entropy forces the aggregated posterior to match the
prior! That is the reason why we have this term here. If you think about it, it is a
beautiful result that gives another connection between VAEs and the information
theory.
Alright, so we see what the cross-entropy does, but there are two possibilities here. First, the prior is fixed (non-learnable), e.g., the standard Gaussian prior. Then, optimizing the cross-entropy pushes the aggregated posterior to match the prior. This is schematically depicted in Fig. 4.6. The prior acts like an anchor and the amoeba of the aggregated posterior moves to fit the prior. In practice, this optimization process is troublesome because the decoder forces the encoder to be peaked and, in the end, it is almost impossible to match a fixed-shape prior. As a result, we obtain holes, namely, regions in the latent space where the aggregated posterior assigns low probability while the prior assigns (relatively) high probability (see the dark gray ellipse in Fig. 4.6). This issue is especially apparent in generation, because sampling from the prior, from a hole, may result in a sample of extremely low quality. You can read more about it in [12].
On the other hand, if we consider a learnable prior, the situation looks different. The optimization can change both the aggregated posterior and the prior. As a consequence, the two distributions try to match each other (see Fig. 4.7). The problem of holes is then less apparent, especially if the prior is flexible enough. However, we can face other optimization issues when the prior and the aggregated posterior chase each other. In practice, the learnable prior seems to be the better option, but it is still an open question whether training all components at once is the best approach. Moreover, the learnable prior does not impose any specific constraint on the latent representation, e.g., sparsity. This could result in other undesirable behavior (e.g., non-smooth encoders).
Eventually, we can ask the fundamental question: what is the best prior then?! The answer is already known and is hidden in the cross-entropy term: it is the aggregated posterior. If we take $p_{\lambda}(z) = \frac{1}{N} \sum_{n=1}^{N} q_{\phi}(z|x_n)$, then, theoretically, the cross-entropy equals the entropy of $q_{\phi}(z)$ and the regularization term is smallest. However, in practice, this is infeasible because:
• We cannot sum over tens of thousands of points and backpropagate through them.
• This result is fine from the theoretical point of view; however, the optimization process is stochastic and could introduce additional errors.
• As mentioned earlier, choosing the aggregated posterior as the prior does not constrain the latent representation in any obvious manner and, thus, the encoder could behave unpredictably.
• The aggregated posterior may work well if we get $N \to +\infty$ points, because then we can represent any distribution; however, this is not the case in practice, and it also contradicts the first point.
As a result, we can keep this theoretical solution in mind and formulate approxima-
tions to it that are computationally tractable. In the next sections, we will discuss a
few of them.
        self.L = L

        # params weights
        self.means = torch.zeros(1, L)
        self.logvars = torch.zeros(1, L)
$$p_{\lambda}(z) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z|\mu_k, \sigma_k^2), \qquad (4.47)$$
where $\lambda = \left( \{w_k\}, \{\mu_k\}, \{\sigma_k^2\} \right)$ are trainable parameters.
Similarly to the standard Gaussian prior, we trained a small VAE with the mixture of Gaussians prior (with K = 16) and a two-dimensional latent space. In Fig. 4.9, we present samples from the encoder for the test data (black dots) and the contour plot of the MoG prior. Compared to the standard Gaussian prior, the MoG prior fits the aggregated posterior better, allowing it to patch holes.
An example of the code is presented below:
class MoGPrior(nn.Module):
    def __init__(self, L, num_components):
        super(MoGPrior, self).__init__()

        self.L = L
        self.num_components = num_components

        # params ('multiplier' is an initialization scale assumed to be defined elsewhere)
        self.means = nn.Parameter(torch.randn(num_components, self.L) * multiplier)
        self.logvars = nn.Parameter(torch.randn(num_components, self.L))

        # mixing weights
        self.w = nn.Parameter(torch.zeros(num_components, 1, 1))

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)
        w = w.squeeze()

        # pick components
        indexes = torch.multinomial(w, batch_size, replacement=True)

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0)              # 1 x B x L
        means = means.unsqueeze(1)      # K x 1 x L
        logvars = logvars.unsqueeze(1)  # K x 1 x L

    # ...

        return log_prob
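The body of `log_prob` is elided in the excerpt above. A minimal sketch of how it could be completed is given below; it assumes `import torch.nn.functional as F` and a `log_normal_diag(z, mu, log_var)` helper (a diagonal-Gaussian log-density summed over the last dimension, as sketched earlier), and it is not the book's exact implementation:

```python
    def log_prob(self, z):
        # mixing probabilities
        w = F.softmax(self.w, dim=0)           # K x 1 x 1

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0)                     # 1 x B x L
        means = self.means.unsqueeze(1)        # K x 1 x L
        logvars = self.logvars.unsqueeze(1)    # K x 1 x L

        # per-component log-densities plus log mixing weights
        log_p = log_normal_diag(z, means, logvars) + torch.log(w).squeeze(-1)  # K x B

        # log p(z) = logsumexp over the K components
        return torch.logsumexp(log_p, dim=0)
```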
In [23], it was noticed that we can improve on the MoG prior and approximate the
aggregated posterior by introducing pseudo-inputs:
$$p_{\lambda}(z) = \frac{1}{K} \sum_{k=1}^{K} q_{\phi}(z|u_k), \qquad (4.48)$$
where $\lambda = \left( \phi, \{u_k\} \right)$ are trainable parameters and $u_k \in \mathcal{X}^D$ is a pseudo-input. Notice that $\phi$ is a part of the trainable parameters. The idea of pseudo-inputs is to randomly initialize objects that mimic observable variables (e.g., images) and learn them by backpropagation.

Fig. 4.10 An example of the VampPrior (contours) and the samples from the aggregated posterior (black dots)
This approximation to the aggregated posterior is called the variational mixture
of posterior prior, VampPrior for short. In [23] you can find some interesting
properties and further analysis of the VampPrior. The main drawback of the
VampPrior lies in initializing the pseudo-inputs; however, it serves as a good proxy
to the aggregated posterior that improves the generative quality of the VAE, e.g.,
[10, 50, 51].
Alemi et al. [10] presented a nice connection of the VampPrior with the
information-theoretic perspective on the VAE. They further proposed to introduce
learnable probabilities of the components:
$$p_{\lambda}(z) = \sum_{k=1}^{K} w_k \, q_{\phi}(z|u_k), \qquad (4.49)$$
        self.L = L
        self.D = D
        self.num_vals = num_vals

    # ...

        # pseudo-inputs
        u = torch.rand(num_components, D) * self.num_vals

    # ...

        # mixing weights
        self.w = nn.Parameter(torch.zeros(self.u.shape[0], 1, 1))  # K x 1 x 1

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)  # K x 1 x 1
        w = w.squeeze()

        # pick components
        indexes = torch.multinomial(w, batch_size, replacement=True)

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)  # K x 1 x 1

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0)                                # 1 x B x L
        mean_vampprior = mean_vampprior.unsqueeze(1)      # K x 1 x L
        logvar_vampprior = logvar_vampprior.unsqueeze(1)  # K x 1 x L

    # ...

        return log_prob
In fact, we can use any density estimator to model the prior. In [52], a density estimator called the generative topographic mapping (GTM) was proposed that defines a grid of K points in a low-dimensional space, $u \in \mathbb{R}^{C}$, namely:
$$p(u) = \sum_{k=1}^{K} w_k \, \delta(u - u_k), \qquad (4.50)$$

$$p_{\lambda}(z) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left(z \,\middle|\, \mu_g(u_k), \sigma_g^2(u_k)\right), \qquad (4.52)$$
        self.L = L

        # 2D manifold
    # ...
        k = 0
        for i in range(num_components):
            for j in range(num_components):
                self.u[k, 0] = u1[i]
                self.u[k, 1] = u2[j]
                k = k + 1

    # ...

        # mixing weights
        self.w = nn.Parameter(torch.zeros(num_components**2, 1, 1))

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)
        w = w.squeeze()

        # pick components
        indexes = torch.multinomial(w, batch_size, replacement=True)

    # ...

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0)                    # 1 x B x L
        mean_gtm = mean_gtm.unsqueeze(1)      # K**2 x 1 x L
        logvar_gtm = logvar_gtm.unsqueeze(1)  # K**2 x 1 x L

        w = F.softmax(self.w, dim=0)

    # ...

        return log_prob
4.4.1.5 GTM-VampPrior
As mentioned earlier, the main issue with the VampPrior is the initialization of the
pseudo-inputs. Instead, we can use the idea of the GTM to learn the pseudo-inputs.
Combining these two approaches, we get the following prior:
$$p_{\lambda}(z) = \sum_{k=1}^{K} w_k \, q_{\phi}\!\left(z \mid g_{\gamma}(u_k)\right), \qquad (4.53)$$
where we first define a grid in a low-dimensional space, $\{u_k\}$, and then transform the grid points to $\mathcal{X}^D$ using the transformation $g_{\gamma}$.
Now, we train a small VAE with the GTM-VampPrior (with K = 16, i.e., a 4 × 4 grid) and a two-dimensional latent space. In Fig. 4.12, we present samples from the encoder for the test data (black dots) and the contour plot of the GTM-VampPrior. Again, this mixture-based prior allows wrapping the points (the aggregated posterior) and assigning probability to the proper regions.
An example of an implementation of the GTM-VampPrior is presented below:
class GTMVampPrior(nn.Module):
    def __init__(self, L, D, gtm_net, encoder, num_points, u_min=-10., u_max=10., num_vals=255):
        super(GTMVampPrior, self).__init__()

        self.L = L
        self.D = D
        self.num_vals = num_vals

    # ...

        # 2D manifold
        self.u = torch.zeros(num_points**2, 2)  # K**2 x 2
        u1 = torch.linspace(u_min, u_max, steps=num_points)
    # ...
        k = 0
        for i in range(num_points):
            for j in range(num_points):
                self.u[k, 0] = u1[i]
                self.u[k, 1] = u2[j]
                k = k + 1

    # ...

        # mixing weights
        self.w = nn.Parameter(torch.zeros(num_points**2, 1, 1))

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)
        w = w.squeeze()

        # pick components
        indexes = torch.multinomial(w, batch_size, replacement=True)

    # ...

        # mixing probabilities
        w = F.softmax(self.w, dim=0)

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0)                                # 1 x B x L
        mean_vampprior = mean_vampprior.unsqueeze(1)      # K x 1 x L
        logvar_vampprior = logvar_vampprior.unsqueeze(1)  # K x 1 x L

    # ...

        return log_prob
The last distribution we want to discuss here is a flow-based prior. Since flow-based
models can be used to estimate any distribution, it is almost obvious to use them
for approximating the aggregated posterior. Here, we use the implementation of the
RealNVP presented before (see Chap. 3 for details).
As in the previous cases, we train a small VAE with the flow-based prior and
two-dimensional latent space. In Fig. 4.13, we present samples from the encoder
for the test data (black dots) and the contour plot for the flow-based prior. Similar
to the previous mixture-based priors, the flow-based prior allows approximating the
aggregated posterior very well. This is in line with many papers using flows as the
prior in the VAE [24, 25]; however, we must remember that the flexibility of the
flow-based prior comes with the cost of an increased number of parameters and
potential training issues inherited from the flows.
Fig. 4.13 An example of the flow-based prior (contours) and the samples from the aggregated
posterior (black dots)
        self.D = D

    # ...

        s = self.s[index](xa)
        t = self.t[index](xa)

        if forward:
            # yb = f^{-1}(x)
            yb = (xb - t) * torch.exp(-s)
        else:
            # xb = f(y)
            yb = torch.exp(s) * xb + t

    # ...

        return z, log_det_J

    # ...

        return x
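To use such a flow as a prior, we only need it to expose a log-density and a sampler in the latent space. Below is a minimal sketch of such a wrapper, assuming a RealNVP-like bijection object with methods `f` (latent-to-base direction, returning the base-space value and the log-Jacobian-determinant) and `f_inv` (base-to-latent direction); these method names are illustrative assumptions, not the book's exact API:

```python
import math
import torch
import torch.nn as nn

class FlowPrior(nn.Module):
    def __init__(self, flow, L):
        super(FlowPrior, self).__init__()
        self.flow = flow  # assumed bijection: f(z) -> (v, log_det_J), f_inv(v) -> z
        self.L = L

    def log_prob(self, z):
        # change of variables: log p(z) = log N(v|0, I) + log|det dv/dz|, with v = f(z)
        v, log_det_J = self.flow.f(z)
        log_base = (-0.5 * math.log(2. * math.pi) - 0.5 * v**2).sum(-1)
        return log_base + log_det_J

    def sample(self, batch_size=64):
        # sample base noise and push it through the inverse map to the latent space
        v = torch.randn(batch_size, self.L)
        return self.flow.f_inv(v)
```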
4.4.1.7 Remarks
In practice, we can use any density estimator to model $p_{\lambda}(z)$. For instance, we can use an autoregressive model [26] or more advanced approaches like resampled priors [27] or hierarchical priors [51]. Therefore, there are many options! However, there is still an open question of how to do that and what role the prior (the marginal) should play. As I mentioned in the beginning, Bayesians would say that the marginal should impose some constraints on the latent space or, in other words, express our prior knowledge about it. I am a Bayesian deep down in my heart, and this way of thinking is very appealing to me. However, it is still unclear what a good latent representation is. This question is as old as mathematical modeling. I think that it would be interesting to look at optimization techniques; maybe applying a gradient-based method to all parameters/weights at once is not the best solution. Anyhow, I am pretty sure that modeling the prior is more important than many people think and that it plays a crucial role in VAEs.
$$\ln p(x) \geq \mathbb{E}_{q(z^{(0)}|x)} \left[ \ln p(x|z^{(T)}) + \sum_{t=1}^{T} \ln \left| \det \frac{\partial f^{(t)}}{\partial z^{(t-1)}} \right| \right] - \mathrm{KL}\left[ q(z^{(0)}|x) \,\|\, p(z^{(T)}) \right]. \qquad (4.54)$$
In fact, the normalizing flow can be used to enrich the posterior of the VAE with few or even no modifications to the architecture of the encoder and the decoder.
Motivation
$$\Sigma = U D U^{\top}, \qquad (4.55)$$

$$U = I - Y S Y^{\top}. \qquad (4.56)$$
The value K is called the degree of the orthogonal matrix. Further, it can be shown that any orthogonal matrix of degree K can be expressed using the product of Householder transformations [53, 54], namely:

Theorem 4.2 Any orthogonal matrix with the basis acting on the K-dimensional subspace can be expressed as a product of exactly K Householder matrices:

$$U = H_K H_{K-1} \cdots H_1, \qquad (4.57)$$
Theoretically, Theorem 4.2 shows that we can model any orthogonal matrix in a principled fashion using K Householder transformations. Moreover, the Householder matrix $H_k$ is an orthogonal matrix itself [55]. Therefore, this property and Theorem 4.2 make the Householder transformation a perfect candidate for formulating a volume-preserving flow that allows us to approximate (or even capture) the true full-covariance matrix.
Householder Flows
where $H_t = I - 2 \frac{v_t v_t^{\top}}{\|v_t\|^2}$ is called the Householder matrix.

The most important property of $H_t$ is that it is an orthogonal matrix and, hence, the absolute value of the Jacobian determinant is equal to 1. This fact significantly simplifies the objective (4.54) because $\ln \left| \det \frac{\partial H_t z^{(t-1)}}{\partial z^{(t-1)}} \right| = 0$, for $t = 1, \ldots, T$.
Starting from a simple posterior with the diagonal covariance matrix for z(0) , the
series of T linear transformations given by (4.58) defines a new type of volume-
preserving flow that we refer to as the Householder flow (HF). The vectors vt ,
t = 1, . . . , T , are produced by the encoder network along with means and variances
using a linear layer with the input vt−1 , where v0 = h is the last hidden layer of
the encoder network. The idea of the Householder flow is schematically presented
in Fig. 4.14. Once the encoder returns the first Householder vector, the Householder flow requires T linear operations to produce a sample from a more flexible posterior with an approximate full-covariance matrix.

Fig. 4.14 A schematic representation of the encoder network with the Householder flow. (a) The general architecture of the VAE+HF: the encoder returns means and variances for the posterior and the first Householder vector that is further used to formulate the Householder flow. (b) A single step of the Householder flow that uses the linear Householder transformation. In both panels, solid lines correspond to the encoder network and dashed lines are additional quantities required by the HF
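Since $H_t$ never has to be built explicitly, a single step of the Householder flow boils down to a few vector operations. Below is a minimal sketch (not the book's code) of applying one Householder step to a batch of latent vectors, given the Householder vectors produced by the encoder:

```python
import torch

def householder_step(z, v):
    # z: B x L latent vectors, v: B x L Householder vectors (one per example)
    # H_t z = z - 2 v (v^T z) / ||v||^2; H_t is orthogonal, so the step is volume-preserving
    # and contributes zero to the log-Jacobian-determinant in (4.54).
    v_dot_z = (v * z).sum(dim=1, keepdim=True)      # B x 1
    v_norm_sq = (v * v).sum(dim=1, keepdim=True)    # B x 1
    return z - 2. * v * v_dot_z / v_norm_sq
```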
Motivation
The Householder flow can model only full-covariance Gaussians, which is still not necessarily a rich family of distributions. Now, we will look into a generalization of the Householder flow. For this purpose, let us consider the following transformation, similar to a single-layer MLP with M hidden units and a residual connection:

$$z^{(t)} = z^{(t-1)} + A\, h\!\left(B z^{(t-1)} + b\right),$$

with $A \in \mathbb{R}^{D \times M}$, $B \in \mathbb{R}^{M \times D}$, $b \in \mathbb{R}^{M}$, and $h$ an element-wise non-linearity. Since Sylvester's determinant identity plays a crucial role in the proposed family of normalizing flows, we will refer to them as Sylvester normalizing flows.
Parameterization of A and B
Q = (q1 . . . qM )
which can be computed in $\mathcal{O}(M)$, since $\tilde{R}R$ is also upper triangular. The following theorem gives a sufficient condition for this transformation to be invertible.

Theorem 4.4 Let R and $\tilde{R}$ be upper triangular matrices. Let $h : \mathbb{R} \to \mathbb{R}$ be a smooth function with bounded, positive derivative. Then, if the diagonal entries of R and $\tilde{R}$ satisfy $r_{ii}\, \tilde{r}_{ii} > -1/\|h'\|_{\infty}$ and $\tilde{R}$ is invertible, the transformation given by (4.63) is invertible.

The proof of this theorem can be found in [16].
Preserving Orthogonality of Q
with a sufficient condition for convergence given by $\|Q^{(0)\top} Q^{(0)} - I\|_2 < 1$. Here, the 2-norm of a matrix X refers to $\|X\|_2 = \lambda_{\max}(X)$, with $\lambda_{\max}(X)$ representing the largest singular value of X. In our experimental evaluations, we ran the iterative procedure until $\|Q^{(k)\top} Q^{(k)} - I\|_F \leq \epsilon$, with $\|X\|_F$ the Frobenius norm and $\epsilon$ a small convergence threshold. We observed that running this procedure for up to 30 steps was sufficient to ensure convergence with respect to this threshold. To minimize the computational overhead introduced by orthogonalization, we perform this orthogonalization in parallel for all flows.
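The iterative procedure itself is not shown in this excerpt; a minimal sketch is given below. It assumes the standard first-order Björck–Bowie update $Q^{(k+1)} = Q^{(k)}\big(I + \tfrac{1}{2}(I - Q^{(k)\top} Q^{(k)})\big)$ [57] together with the Frobenius-norm stopping rule described above; the exact variant used in [16] may differ in details:

```python
import torch

def orthogonalize(Q, eps=1e-6, max_steps=30):
    # Q: D x M (or batched ... x D x M) matrix whose columns should become orthonormal
    M = Q.shape[-1]
    I = torch.eye(M, device=Q.device, dtype=Q.dtype)
    for _ in range(max_steps):
        # assumed update: first-order Bjorck-Bowie iteration
        QtQ = Q.transpose(-2, -1) @ Q
        Q = Q @ (I + 0.5 * (I - QtQ))
        # stop once ||Q^T Q - I||_F is below the threshold
        if torch.norm(Q.transpose(-2, -1) @ Q - I, p='fro') <= eps:
            break
    return Q
```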
Motivation
In the VAE framework, choosing Gaussian priors and Gaussian posteriors out of mathematical convenience leads to a Euclidean latent space. However, such a choice could be limiting for the following reasons:
Fig. 4.15 Different amortization strategies for Sylvester normalizing flows and Inverse Autore-
gressive Flows. (a) Our inference network produces amortized flow parameters. This strategy is
also employed by planar flows. (b) Inverse Autoregressive Flow [18] introduces a measure of
x dependence through a context variable h(x). This context acts as an additional input for each
transformation. The flow parameters themselves are independent of x
The von Mises–Fisher (vMF) distribution is often described as the Gaussian distribution on a hypersphere. Analogously to a Gaussian, it is parameterized by $\mu \in \mathbb{R}^{m}$ indicating the mean direction, and $\kappa \in \mathbb{R}_{\geq 0}$ the concentration around $\mu$. For the special case of $\kappa = 0$, the vMF represents the uniform distribution on the hypersphere. The probability density function of the vMF distribution for a random unit vector $z \in \mathbb{R}^{m}$ (or $z \in \mathcal{S}^{m-1}$) is then defined as

$$\mathrm{vMF}(z|\mu, \kappa) = \mathcal{C}_m(\kappa) \exp\!\left(\kappa\, \mu^{\top} z\right), \quad \text{with} \quad \mathcal{C}_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2} I_{m/2-1}(\kappa)},$$

where $\|\mu\|_2 = 1$, $\mathcal{C}_m(\kappa)$ is the normalizing constant, and $I_v$ denotes the modified Bessel function of the first kind at order v.
Interestingly, since we define a distribution over a hypersphere, it is possible to formulate a uniform prior over the hypersphere. Then it turns out that if we take the vMF distribution as the variational posterior, it is possible to calculate the Kullback–Leibler divergence between the vMF distribution and the uniform distribution defined over $\mathcal{S}^{m-1}$ analytically [33]:

$$\mathrm{KL}\left[ \mathrm{vMF}(\mu, \kappa) \,\|\, \mathrm{Unif}(\mathcal{S}^{m-1}) \right] = \kappa \frac{I_{m/2}(\kappa)}{I_{m/2-1}(\kappa)} + \log \mathcal{C}_m(\kappa) - \log\!\left( \frac{2\left(\pi^{m/2}\right)}{\Gamma(m/2)} \right)^{-1}. \qquad (4.68)$$
To sample from the vMF, one can follow the procedure of [59]. Importantly, the reparameterization trick cannot be easily formulated for the vMF distribution. Fortunately, [60] allows extending the reparameterization trick to the wide class of distributions that can be simulated using rejection sampling, and [33] presents how to formulate the acceptance–rejection sampling reparameterization procedure. Being equipped with the sampling procedure and the reparameterization trick, and having an analytical form of the Kullback–Leibler divergence, we have everything we need to build a hyperspherical VAE. However, please note that all these procedures are less trivial than the ones for Gaussians. Therefore, a curious reader is referred to [33] for further details.
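As an illustration, the KL term in (4.68) can be evaluated numerically as sketched below (for $\kappa > 0$). This is only a NumPy/SciPy sketch based on the formulas above, not the reparameterized PyTorch implementation of [33]; exponentially scaled Bessel functions (scipy.special.ive) are used for numerical stability:

```python
import numpy as np
from scipy.special import ive, gammaln

def kl_vmf_uniform(kappa, m):
    # KL[vMF(mu, kappa) || Unif(S^{m-1})]; the value does not depend on mu.
    # E_q[mu^T z] is the ratio of modified Bessel functions I_{m/2}(kappa) / I_{m/2 - 1}(kappa).
    bessel_ratio = ive(m / 2., kappa) / ive(m / 2. - 1., kappa)
    # log C_m(kappa) = (m/2 - 1) log kappa - (m/2) log(2 pi) - log I_{m/2 - 1}(kappa),
    # where log I_v(kappa) = log(ive(v, kappa)) + kappa
    log_C = ((m / 2. - 1.) * np.log(kappa)
             - (m / 2.) * np.log(2. * np.pi)
             - (np.log(ive(m / 2. - 1., kappa)) + kappa))
    # minus the log-density of the uniform distribution = log of the surface area of S^{m-1}
    log_area = np.log(2.) + (m / 2.) * np.log(np.pi) - gammaln(m / 2.)
    return kappa * bessel_ratio + log_C + log_area
```

For $\kappa \to 0$ the expression approaches 0, as expected, since the vMF then coincides with the uniform distribution.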
4.5.1 Introduction
The main goal of AI is to formulate and implement systems that can interact with an environment and process, store, and transmit information. In other words, we wish for an AI system that understands the world around it by identifying and disentangling hidden factors in the observed low-level sensory data [61]. If we think about the problem of building such a system, we can formulate it as learning a probabilistic model, i.e., a joint distribution over observed data, x, and hidden factors, z, namely, p(x, z). Then learning a useful representation is equivalent to finding a posterior distribution over the hidden factors, p(z|x). However, it is rather unclear what we really mean by useful in this context. In a beautiful blog post [62], Ferenc Huszár outlines why learning a latent variable model by maximizing the likelihood function is not necessarily useful from the representation learning perspective. Here, we will use it as a good starting point for a discussion of why applying hierarchical latent variable models could be beneficial.
Let us start by defining the setup. We assume the empirical distribution pdata (x)
and a latent variable model pθ (x, z). The way we parameterize the latent variable
model is not constrained in any manner; however, we assume that the distribution
is parameterized using deep neural networks (DNNs). This is important for two
reasons:
1. DNNs are non-linear transformations and as such, they are flexible and allow
parameterizing a wide range of distributions.
2. We must remember that DNNs will not solve all problems for us! In the end, we need to think about the model as a whole, not only about the parameterization, that is, about which distributions we choose, how the random variables interact, and so on. DNNs are definitely helpful, but there are many potential pitfalls (we will discuss some of them later on) that even the largest and coolest DNN is unable to take care of.
It is worth remembering that the joint distribution can be factorized in two ways, namely, $p_{\theta}(x, z) = p_{\theta}(x|z)\, p_{\theta}(z)$ or $p_{\theta}(x, z) = p_{\theta}(z|x)\, p_{\theta}(x)$.

$$\ldots = -\frac{1}{N} \sum_{n=1}^{N} \ln p_{\theta}(x_n). \qquad (4.74)$$
Eventually, we have obtained the objective function we use all the time, namely, the
negative log-likelihood function.
If we think of the usefulness of a representation (i.e., hidden factors) z, we intuitively think of some kind of information that is shared between z and x. However, the unconstrained training problem we consider, i.e., the minimization of the negative log-likelihood function, does not necessarily say anything about the latent
zero. Still, we can obtain equally well-performing models at two different levels of usefulness, but at least the information flows from x to z. Obviously, the considered scenario is purely hypothetical, but it shows that the inductive bias of a model can greatly help to learn representations without being specified by the objective function. Please keep this thought in mind because it will play a crucial role later on!
The next situation is more tricky. Let us assume that we have a constrained class of models; however, the conditional likelihood p(x|z) is parameterized by a flexible, enormous DNN. A potential danger here is that this model could learn to completely disregard z, treating it as noise. As a result, p(x|z) becomes an unconditional distribution that mimics $p_{data}(x)$ almost perfectly. At first glance, this scenario sounds unrealistic, but it is a well-known phenomenon in the field. For instance, [10] conducted a thorough experiment with variational auto-encoders, and taking a PixelCNN++-based decoder resulted in a VAE that was unable to reconstruct images. Their conclusion was exactly the same, namely, that taking a class of models with a too flexible p(x|z) could lead to the model in the bottom-left corner in Fig. 4.18.
Let us start with a VAE with two latent variables: $z_1$ and $z_2$. The joint distribution could be factorized as follows:

$$p(x, z_1, z_2) = p(x|z_1)\, p(z_1|z_2)\, p(z_2).$$

Since we already know that even for a single latent variable calculating the posterior over latents is intractable (except in the linear Gaussian case, it is worth remembering that!), we can utilize variational inference with a family of variational posteriors $Q(z_1, z_2|x)$. Now, the main question is how to define the variational posteriors. A rather natural approach would be to reverse the dependencies and factorize the posterior in the following fashion:

$$Q(z_1, z_2|x) = q(z_1|x)\, q(z_2|z_1, x),$$

or we can even simplify it as follows (dropping the dependency on x for the second latent variable):

$$Q(z_1, z_2|x) = q(z_1|x)\, q(z_2|z_1).$$
A Potential Pitfall
$$\mathrm{ELBO}(x) = \mathbb{E}_{Q(z_1, z_2|x)} \left[ \ln p(x|z_1) - \mathrm{KL}\left[ q(z_1|x) \,\|\, p(z_1|z_2) \right] - \mathrm{KL}\left[ q(z_2|z_1) \,\|\, p(z_2) \right] \right]. \qquad (4.82)$$
To shed some light on the ELBO for the two-level VAE, we notice the following:
1. All conditioning variables ($z_1$, $z_2$, x) are either samples from $Q(z_1, z_2|x)$ or from $p_{data}(x)$.
2. We obtain the Kullback–Leibler divergence terms by looking at the variables per layer (for Gaussians, these terms can be computed analytically; see the sketch below). You are encouraged to derive the ELBO step by step; it is a great exercise to get familiar with variational inference.
3. It is worth remembering that the Kullback–Leibler divergence is always non-negative.
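Since all distributions involved are diagonal Gaussians, both KL terms in (4.82) have a closed form. A minimal helper is sketched below (this is the generic formula, not the book's exact code, written in terms of log-variances as in the snippets later in this chapter):

```python
import torch

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    # KL[ N(mu_q, diag(exp(log_var_q))) || N(mu_p, diag(exp(log_var_p))) ], summed over dimensions
    return 0.5 * (log_var_p - log_var_q
                  + (torch.exp(log_var_q) + (mu_q - mu_p) ** 2) / torch.exp(log_var_p)
                  - 1.).sum(-1)
```

For instance, the term $\mathrm{KL}\left[ q(z_2|z_1) \,\|\, p(z_2) \right]$ with a standard Gaussian prior corresponds to `mu_p = 0` and `log_var_p = 0`.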
Theoretically, everything should work perfectly fine, but there are a couple of potential problems. First, we initialize all DNNs that parameterize the distributions randomly. As a result, all Gaussians are basically standard Gaussians. Second, if the decoder is powerful and flexible, there is a huge danger that the model will try to take advantage of the optimum of the last KL term, $\mathrm{KL}\left[ q(z_2|z_1) \,\|\, p(z_2) \right]$, that is, $q(z_2|z_1) \approx p(z_2) \approx \mathcal{N}(0, 1)$. Then, since $q(z_2|z_1) \approx \mathcal{N}(0, 1)$, the second layer is not used at all (it is just Gaussian noise) and we get back to the same issues as in the one-level VAE architecture. It turns out that learning the two-level VAE is even more problematic than a VAE with a single latent, because even for a relatively simple decoder the second latent variable $z_2$ is mostly unused [15, 70]. This effect is called the posterior collapse.
A natural question is whether we can do better. You can already guess the answer, but before shouting it out loud, let us think for a second. In the generative part, we have top-down dependencies, going from the highest level of abstraction (latents) down to the observable variables. Let us repeat it here again:

$$p(x, z_1, z_2) = p(x|z_1)\, p(z_1|z_2)\, p(z_2).$$
Do you see any resemblance? Yes, the variational posteriors have the extra x, but the dependencies point in the same direction. Why could this be beneficial? Because now we could have a shared top-down path that would make the variational posteriors and the generative part tightly connected through a shared parameterization. That could be a very useful inductive bias!
This idea was originally proposed in ResNet VAEs [18] and Ladder VAEs [71],
and it was further developed in BIVA [44], NVAE [45], and the very deep VAE [46].
These approaches differ in their implementations and parameterizations used (i.e.,
architectures of DNNs); however, they all could be categorized as instantiations of
top-down VAEs. The main idea, as mentioned before, is to share the top-down path
between the variational posteriors and the generative distributions and use a side,
deterministic path going from x to the last latents. Alright, let us write this idea
down.
First, we have the top-down path that defines $p(x|z_1)$, $p(z_1|z_2)$, and $p(z_2)$. Thus, we need a DNN that outputs $\mu_1$ and $\sigma_1^2$ for a given $z_2$, and another DNN that outputs the parameters of $p(x|z_1)$ for a given $z_1$. Since $p(z_2)$ is an unconditional distribution (e.g., the standard Gaussian), we do not need a separate DNN here.
Second, we have a side, deterministic path that gives two deterministic variables: $r_1 = f_1(x)$ and $r_2 = f_2(r_1)$. Both transformations, $f_1$ and $f_2$, are DNNs. Then, we can use additional DNNs that return modifications of the means and the variances, namely, $\Delta\mu_1, \Delta\sigma_1^2$ and $\Delta\mu_2, \Delta\sigma_2^2$. These modifications could be defined in many ways. Here we follow the way it is done in NVAE [45], namely, the modifications are relative locations and scales of the values given in the top-down path. If you do not fully follow this idea, it should become clear once we define the variational posteriors.
Finally, we can define the whole procedure. We define the various neural networks by specifying different indices. For sampling, we use the top-down path:
1. $z_2 \sim \mathcal{N}(0, 1)$
2. $[\mu_1, \sigma_1^2] = NN_1(z_2)$
3. $z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$
4. $\vartheta = NN_x(z_1)$
5. $x \sim p_{\vartheta}(x|z_1)$
Now (please focus!) we calculate samples from the variational posteriors as follows (this mirrors the code below):
1. $r_1 = NN_{r,1}(x)$ and $r_2 = NN_{r,2}(r_1)$
2. $[\Delta\mu_1, \Delta\sigma_1^2] = NN_{\Delta,1}(r_1)$ and $[\Delta\mu_2, \Delta\sigma_2^2] = NN_{\Delta,2}(r_2)$
3. $z_2 \sim \mathcal{N}(\Delta\mu_2, \Delta\sigma_2^2)$
4. $[\mu_1, \sigma_1^2] = NN_1(z_2)$
5. $z_1 \sim \mathcal{N}(\mu_1 + \Delta\mu_1, \sigma_1^2 \cdot \Delta\sigma_1^2)$
4.5.2.3 Code
        # bottom-up path
        self.nn_r_1 = nn_r_1
        self.nn_r_2 = nn_r_2

    # ...

        # top-down path
        self.nn_z_1 = nn_z_1
        self.nn_x = nn_x

        # other params
        self.D = D  # dim of inputs

    # ...

        # bottom-up
        # step 1
        r_1 = self.nn_r_1(x)
        r_2 = self.nn_r_2(r_1)

        # step 2
        delta_1 = self.nn_delta_1(r_1)
        delta_mu_1, delta_log_var_1 = torch.chunk(delta_1, 2, dim=1)
        delta_log_var_1 = F.hardtanh(delta_log_var_1, -7., 2.)

        # step 3
        delta_2 = self.nn_delta_2(r_2)
        delta_mu_2, delta_log_var_2 = torch.chunk(delta_2, 2, dim=1)
        delta_log_var_2 = F.hardtanh(delta_log_var_2, -7., 2.)

    # ...

        # top-down
        # step 4
        z_2 = self.reparameterization(delta_mu_2, delta_log_var_2)

        # step 5
        h_1 = self.nn_z_1(z_2)
        mu_1, log_var_1 = torch.chunk(h_1, 2, dim=1)

        # step 6
        z_1 = self.reparameterization(mu_1 + delta_mu_1, log_var_1 + delta_log_var_1)

        # step 7
        h_d = self.nn_x(z_1)

    # ...

        # ===== ELBO
        # RE
        if self.likelihood_type == 'categorical':
            RE = log_categorical(x, mu_d, num_classes=self.num_vals,
                                 reduction='sum', dim=-1).sum(-1)

    # ...

        # KL
    # ...
        KL = KL_z_1 + KL_z_2

        # Final ELBO
        if reduction == 'sum':
            loss = -(RE - KL).sum()
        else:
            loss = -(RE - KL).mean()

        return loss

    # ...

        # step 4
        h_d = self.nn_x(z_1)
Fig. 4.21 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the top-down VAE. (c) The validation curve during training
That's it! Now we are ready to run the full code (take a look at: https://github.com/jmtomczak/intro_dgm). After training our top-down VAE, we should obtain results like in Fig. 4.21.
What we have discussed here only touches upon the topic. Hierarchical models in probabilistic modeling seem to be an important research direction and modeling paradigm. Moreover, the technical details are crucial for achieving state-of-the-art performance. I strongly suggest reading about NVAE [45], ResNet VAEs [18], Ladder VAEs [71], BIVA [44], and very deep VAEs [46] and comparing the various tricks and parameterizations used therein. These models share the same idea, but their implementations vary significantly.
The research on hierarchical generative modeling is very active and develops quickly. As a result, it is nearly impossible to mention even a fraction of the interesting papers. I will mention only a few worth noting:
• Pervez and Gavves [72] provide an insightful analysis of a potential problem with hierarchical VAEs, namely, that the KL divergence term is closely related to the harmonics of the parameterizing function. In other words, using DNNs results in high-frequency components of the KL term and, eventually, leads to posterior collapse. The authors propose to smooth the VAE by applying the Ornstein–Uhlenbeck (OU) semigroup. I refer to the original paper for details.
• Wu et al. [73] propose greedy layer-wise learning of a hierarchical VAE. The authors used this idea in the context of video prediction. The main motivation for utilizing greedy layer-wise learning is a limited amount of computational resources. However, the idea of greedy layer-wise training has been extensively utilized in the past [66–68].
• Gatopoulos and Tomczak [25] discuss incorporating pre-defined transformations like downscaling into the model. The idea is to learn a reversed transformation to, e.g., downscaling in a stochastic manner. The resulting VAE has a set of auxiliary variables (e.g., downscaled versions of observables) and a set of latent variables that encode the information missing from the auxiliary variables.
Originally, deep diffusion probabilistic models were proposed in [75], taking inspiration from non-equilibrium statistical physics. The main idea is to iteratively destroy the structure in data through a forward diffusion process and, afterward, to learn a reverse diffusion process that restores the structure in the data. In a follow-up
paper [74] recent developments in deep learning were used to train a powerful and
flexible diffusion-based deep generative model that achieved SOTA results in the
task of image synthesis. Here, we will abuse the original notation to make a clear
connection between hierarchical latent variable models and DDGMs. As previously,
we are interested in finding a distribution over data, pθ (x); however, we assume an
additional set of latent variables z1:T = [z1 , . . . , zT ]. The marginal likelihood is
defined by integrating out all latents:
$$p_{\theta}(x) = \int p_{\theta}(x, z_{1:T})\, \mathrm{d}z_{1:T}. \qquad (4.86)$$
where x ∈ RD and zi ∈ RD for i = 1, . . . , T . Please note that the latents have the
same dimensionality as the observables. This is the same situation as in the case of
flow-based models. We parameterize all distributions using DNNs.
So far, we have not introduced anything new! This is again a hierarchical latent
variable model. As in the case of hierarchical VAEs, we can introduce a family of
variational posteriors as follows:
$$Q_{\phi}(z_{1:T}|x) = q_{\phi}(z_1|x) \prod_{i=2}^{T} q_{\phi}(z_i|z_{i-1}). \qquad (4.88)$$
The key point is how we define these distributions. Before, we used normal
distributions parameterized by DNNs, but now we formulate them as the following
Gaussian diffusion process [75]:
$$q_{\phi}(z_i|z_{i-1}) = \mathcal{N}\!\left(z_i \,\middle|\, \sqrt{1 - \beta_i}\, z_{i-1},\; \beta_i I\right), \qquad (4.89)$$
where $z_0 = x$. Notice that a single step of the diffusion, $q_{\phi}(z_i|z_{i-1})$, works in a relatively easy way. Namely, it takes the previously generated object $z_{i-1}$, scales it by $\sqrt{1 - \beta_i}$, and then adds noise with variance $\beta_i$. To be even more explicit, we can write it using the reparameterization trick:

$$z_i = \sqrt{1 - \beta_i}\, z_{i-1} + \sqrt{\beta_i}\, \epsilon, \qquad (4.90)$$

where $\epsilon \sim \mathcal{N}(0, I)$.
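As a toy illustration of Eq. (4.90) (not the book's code), a single forward-diffusion step can be implemented with one line of PyTorch:

```python
import math
import torch

def forward_diffusion_step(z_prev, beta_i):
    # Eq. (4.90): z_i = sqrt(1 - beta_i) * z_{i-1} + sqrt(beta_i) * eps, with eps ~ N(0, I)
    eps = torch.randn_like(z_prev)
    return math.sqrt(1. - beta_i) * z_prev + math.sqrt(beta_i) * eps
```

Applying this step T times, starting from $z_0 = x$, produces the whole forward trajectory $z_1, \ldots, z_T$.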
$$\cdots - \mathbb{E}_{Q_{\phi}(z_{-1}|x)} \left[ \mathrm{KL}\left[ q_{\phi}(z_1|x) \,\|\, p_{\theta}(z_1|z_2) \right] \right] \;\stackrel{df}{=}\; \mathcal{L}(x; \theta, \phi). \qquad (4.92)$$
Example 4.1 Let us take T = 5. This is not much (e.g., [74] uses T = 1000), but it
is easier to explain the idea with a very specific model. Moreover, let us use a fixed
βt ≡ β. Then we have the following DDGM:
pθ (x, z1:5 ) = pθ (x|z1 )pθ (z1 |z2 )pθ (z2 |z3 )pθ (z3 |z4 )pθ (z4 |z5 )pθ (z5 ), (4.93)
Qφ (z1:5 |x) = qφ (z1 |x)qφ (z2 |z1 )qφ (z3 |z2 )qφ (z4 |z3 )qφ (z5 |z4 ). (4.94)
$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{Q_{\phi}(z_{1:5}|x)}\left[ \ln p_{\theta}(x|z_1) \right] - \sum_{i=2}^{4} \mathbb{E}_{Q_{\phi}(z_{-i}|x)} \left[ \mathrm{KL}\left[ q_{\phi}(z_i|z_{i-1}) \,\|\, p_{\theta}(z_i|z_{i+1}) \right] \right]$$
$$\qquad - \mathbb{E}_{Q_{\phi}(z_{-5}|x)} \left[ \mathrm{KL}\left[ q_{\phi}(z_5|z_4) \,\|\, p_{\theta}(z_5) \right] \right] - \mathbb{E}_{Q_{\phi}(z_{-1}|x)} \left[ \mathrm{KL}\left[ q_{\phi}(z_1|x) \,\|\, p_{\theta}(z_1|z_2) \right] \right], \qquad (4.95)$$

where $z_{-i}$ denotes all latent variables except $z_i$.
The last interesting question is how to model the inputs and, eventually, what distribution we should use to model $p(x|z_1)$. So far, we used the categorical distribution because pixels were integer-valued. However, for the DDGM, we assume that they are continuous, and we will use a simple trick: we normalize our inputs to values between −1 and 1 and apply a Gaussian distribution with unit variance and a mean constrained to [−1, 1] using the tanh non-linearity:

and only the reverse diffusion requires applying DNNs. But without any further mumbling, let us dive into the code!
4.5.3.3 Code
At this point, you might think that it is pretty complicated and a lot of math is
involved here. However, if you followed our previous discussions on VAEs, it should
be rather clear what we need to do here.
class DDGM(nn.Module):
    def __init__(self, p_dnns, decoder_net, beta, T, D):
        super(DDGM, self).__init__()

        print('DDGM by JT.')

    # ...

        # other params
        self.D = D  # the dimensionality of the inputs (necessary for sampling!)

    # ...

        # =====
        # Backward Diffusion
        # We start with the last z and proceed to x.
        # At each step, we calculate means and variances.
        mus = []
        log_vars = []

    # ...

        for i in range(len(mus)):
            KL_i = (log_normal_diag(zs[i], torch.sqrt(1. - self.beta) * zs[i], torch.log(self.beta))
                    - log_normal_diag(zs[i], mus[i], log_vars[i])).sum(-1)

            KL = KL + KL_i

        # Final ELBO
        if reduction == 'sum':
            loss = -(RE - KL).sum()
        else:
            loss = -(RE - KL).mean()

        return loss

    # Sampling is the reverse diffusion with sampling at each step.
    def sample(self, batch_size=64):
        z = torch.randn([batch_size, self.D])
        for i in range(len(self.p_dnns) - 1, -1, -1):
            h = self.p_dnns[i](z)
            mu_i, log_var_i = torch.chunk(h, 2, dim=1)
    # ...

        return mu_x

    # ...

        return zs[-1]
That's it! Now we are ready to run the full code (take a look at: https://github.com/jmtomczak/intro_dgm). After training our DDGM, we should obtain results like in Fig. 4.24.
4.5.3.4 Discussion
Extensions
Currently, DDGMs are very popular deep generative models. What we present
here is very close to the original formulation of the DDGMs [75]. However, [74]
introduced many interesting insights and improvements on the original idea, such
as:
• Since the forward diffusion consists of Gaussian distributions and linear trans-
formations of means, it is possible to analytically marginalize out intermediate
steps, which yields:
$$q(z_t|x) = \mathcal{N}\!\left(z_t \,\middle|\, \sqrt{\bar{\alpha}_t}\, x,\; (1 - \bar{\alpha}_t) I\right), \qquad (4.98)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
Fig. 4.24 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the DDGM. (c) A visualization of the last stochastic level after
applying the forward diffusion. As expected, the resulting images resemble pure noise. (d) An
example of a validation curve for the ELBO
$$\tilde{\mu}_t(z_t, x) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, z_t \qquad (4.100)$$

and

$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t. \qquad (4.101)$$
where $\epsilon_{\theta}(z_t)$ is parameterized by a DNN and aims at estimating the noise from $z_t$.
• Even further, each $L_t$ could be simplified to:

$$L_{t,\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_{\theta}\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right].$$
Ho et al. [74] provide empirical evidence that such an objective could be beneficial for training and for the final quality of synthesized images.
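A minimal sketch of this simplified objective is given below. The noise-prediction network `eps_model(z_t, t)` and the precomputed tensor `alphas_bar` (containing $\bar{\alpha}_t = \prod_{i \le t}(1 - \beta_i)$) are illustrative assumptions, not the book's code:

```python
import torch

def simple_diffusion_loss(eps_model, x0, alphas_bar):
    # x0: B x D normalized inputs; alphas_bar: tensor of length T
    B = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (B,))         # a random timestep per example
    a_bar = alphas_bar[t].view(B, *([1] * (x0.dim() - 1)))  # broadcast over data dimensions
    eps = torch.randn_like(x0)
    # sample z_t directly from q(z_t|x) using Eq. (4.98)
    z_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1. - a_bar) * eps
    # L_{t,simple}: squared error between the true and the predicted noise
    return ((eps - eps_model(z_t, t)) ** 2).sum(-1).mean()
```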
There have been many follow-ups on [74]; we mention only a few of them here:
• Improving DDGMs: Nichol and Dhariwal [84] introduce further tricks for improving the training stability and performance of DDGMs, such as learning the covariance matrices in the reverse diffusion and proposing a different noise schedule, among others. Interestingly, the authors of [76] propose to extend the observables (i.e., pixels) and latents with Fourier features as additional channels. The rationale behind this is that high-frequency features allow neural networks to cope better with noise. Moreover, they introduce a new noise schedule and apply certain functions to improve the numerical stability of the forward diffusion process.
• Sampling speed-up: Kong and Ping [85] and Watson et al. [86] focus on speeding up the sampling process.
• Super-resolution: Saharia et al. [77] use DDGMs for the task of super-resolution.
In the end, it is worth making a comparison of DDGMs with VAEs and flow-based
models. In Table 4.1, we provide a comparison based on rather arbitrary criteria:
• Whether the training procedure is stable or not
• Whether the likelihood could be calculated
• Whether a reconstruction is possible
• Whether a model is invertible
• Whether the latent representation could be lower dimensional than the input
space (i.e., a bottleneck in a model)
The three models share a lot of similarities. Overall, training is rather stable
even though numerical issues could arise in all models. Hierarchical VAEs could
be seen as a generalization of DDGMs. There is an open question of whether
it is indeed more beneficial to use fixed variational posteriors by sacrificing the
possibility of having a bottleneck. There is also a connection between flows and
DDGMs. Both classes of models aim for going from data to noise. Flows do
that by applying invertible transformations, while DDGMs accomplish that by a
diffusion process. In flows, we know the inverse but we pay the price of calculating
the Jacobian-determinant, while DDGMs require flexible parameterizations of the
reverse diffusion but there are no extra strings attached. Looking into connections
among these models is definitely an interesting research line.
References
21. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using convex
combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.
22. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
23. Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
24. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
25. Ioannis Gatopoulos and Jakub M Tomczak. Self-supervised variational auto-encoders.
Entropy, 23(6):747, 2021.
26. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.
27. Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In The 22nd
International Conference on Artificial Intelligence and Statistics, pages 66–75. PMLR, 2019.
28. Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-supervised
learning with deep generative models. In Proceedings of the 27th International Conference on
Neural Information Processing Systems, pages 3581–3589, 2014.
29. Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational
fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
30. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
31. Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncer-
tainty in neural network. In International Conference on Machine Learning, pages 1613–1622.
PMLR, 2015.
32. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for
molecular graph generation. In International Conference on Machine Learning, pages 2323–
2332. PMLR, 2018.
33. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
34. Tim R Davidson, Jakub M Tomczak, and Efstratios Gavves. Increasing expressivity of a
hyperspherical VAE. arXiv preprint arXiv:1910.02912, 2019.
35. Emile Mathieu, Charline Le Lan, Chris J Maddison, Ryota Tomioka, and Yee Whye Teh.
Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders. arXiv
preprint arXiv:1901.06033, 2019.
36. Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax.
arXiv preprint arXiv:1611.01144, 2016.
37. C Maddison, A Mnih, and Y Teh. The concrete distribution: A continuous relaxation of discrete
random variables. In Proceedings of the international conference on learning Representations.
International Conference on Learning Representations, 2017.
38. Emile van Krieken, Jakub M Tomczak, and Annette ten Teije. Storchastic: A framework
for general stochastic automatic differentiation. Advances in Neural Information Processing
Systems, 2021.
39. Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging
inference networks and posterior collapse in variational autoencoders. arXiv preprint
arXiv:1901.05534, 2019.
40. Adji B Dieng, Yoon Kim, Alexander M Rush, and David M Blei. Avoiding latent variable
collapse with generative skip models. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 2397–2405. PMLR, 2019.
References 125
41. Adji B Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M Blei. Variational
Inference via χ-Upper Bound Minimization. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 2729–2738, 2017.
42. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual
concepts with a constrained variational framework. ICLR, 2016.
43. Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Scholkopf.
From variational to deterministic autoencoders. In International Conference on Learning
Representations, 2019.
44. Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. BIVA: A Very Deep
Hierarchy of Latent Variables for Generative Modeling. In NeurIPS, 2019.
45. Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. arXiv
preprint arXiv:2007.03898, 2020.
46. Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on
images. arXiv preprint arXiv:2011.10650, 2020.
47. Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey.
Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
48. Matthew D Hoffman and Matthew J Johnson. ELBO surgery: Yet another way to carve up
the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian
Inference, NIPS, volume 1, page 2, 2016.
49. Ricky TQ Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of
disentanglement in VAEs. In Proceedings of the 32nd International Conference on Neural
Information Processing Systems, pages 2615–2625, 2018.
50. Frantzeska Lavda, Magda Gregorová, and Alexandros Kalousis. Data-dependent conditional
priors for unsupervised learning of multimodal data. Entropy, 22(8):888, 2020.
51. Shuyu Lin and Ronald Clark. Ladder: Latent data distribution modelling with a generative
prior. arXiv preprint arXiv:2009.00088, 2020.
52. Christopher M Bishop, Markus Svensén, and Christopher KI Williams. GTM: The generative
topographic mapping. Neural computation, 10(1):215–234, 1998.
53. Christian H Bischof and Xiaobai Sun. On orthogonal block elimination. Preprint MCS-P450-
0794, Mathematics and Computer Science Division, Argonne National Laboratory, page 4,
1994.
54. Xiaobai Sun and Christian Bischof. A basis-kernel representation of orthogonal matrices.
SIAM journal on matrix analysis and applications, 16(4):1184–1196, 1995.
55. Alston S Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the
ACM (JACM), 5(4):339–342, 1958.
56. Leonard Hasenclever, Jakub Tomczak, Rianne van den Berg, and Max Welling. Variational
inference with orthogonal normalizing flows. 2017.
57. Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an
orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971.
58. Zdislav Kovarik. Some iterative methods for improving orthonormality. SIAM Journal on
Numerical Analysis, 7(3):386–389, 1970.
59. Gary Ulrich. Computer generation of distributions on the m-sphere. Journal of the Royal
Statistical Society: Series C (Applied Statistics), 33(2):158–163, 1984.
60. Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization
gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and
Statistics, pages 489–498. PMLR, 2017.
61. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
62. Ferenc Huszár. Is maximum likelihood useful for representation learning?
63. Mary Phuong, Max Welling, Nate Kushman, Ryota Tomioka, and Sebastian Nowozin. The
mutual autoencoder: Controlling information in latent code representations.
126 4 Latent Variable Models
64. Samarth Sinha and Adji B Dieng. Consistency regularization for variational auto-encoders.
arXiv preprint arXiv:2105.14859, 2021.
65. Jakub M Tomczak. Learning informative features from restricted Boltzmann machines. Neural
Processing Letters, 44(3):735–750, 2016.
66. Yoshua Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009.
67. Ruslan Salakhutdinov. Learning deep generative models. Annual Review of Statistics and Its
Application, 2:361–385, 2015.
68. Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial
intelligence and statistics, pages 448–455. PMLR, 2009.
69. Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis.
Chapman and Hall/CRC, 1995.
70. Lars Maaløe, Marco Fraccaro, and Ole Winther. Semi-supervised generation with cluster-
aware generative models. arXiv preprint arXiv:1704.00637, 2017.
71. Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther.
Ladder variational autoencoders. Advances in Neural Information Processing Systems,
29:3738–3746, 2016.
72. Adeel Pervez and Efstratios Gavves. Spectral smoothing unveils phase transitions in
hierarchical variational autoencoders. In International Conference on Machine Learning, pages
8536–8545. PMLR, 2021.
73. Bohan Wu, Suraj Nair, Roberto Martin-Martin, Li Fei-Fei, and Chelsea Finn. Greedy
hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2318–2328, 2021.
74. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv
preprint arXiv:2006.11239, 2020.
75. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep
unsupervised learning using nonequilibrium thermodynamics. In International Conference
on Machine Learning, pages 2256–2265. PMLR, 2015.
76. Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.
arXiv preprint arXiv:2107.00630, 2021.
77. Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad
Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636,
2021.
78. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile
diffusion model for audio synthesis. In International Conference on Learning Representations,
2020.
79. Jacob Austin, Daniel Johnson, Jonathan Ho, Danny Tarlow, and Rianne van den Berg. Struc-
tured denoising diffusion models in discrete state-spaces. arXiv preprint arXiv:2107.03006,
2021.
80. Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax
flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint
arXiv:2102.05379, 2021.
81. Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusion-
based generative models and score matching. arXiv preprint arXiv:2106.02808, 2021.
82. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
83. Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent
Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
84. Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv
preprint arXiv:2102.09672, 2021.
85. Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint
arXiv:2106.00132, 2021.
86. Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently
sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021.
References 127
87. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. arXiv preprint arXiv:1907.05600, 2019.
88. Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint
arXiv:2101.03288, 2021.
89. Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space.
arXiv preprint arXiv:2106.05931, 2021.
90. Antoine Wehenkel and Gilles Louppe. Diffusion priors in variational autoencoders. In ICML
Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models,
2021.
Chapter 5
Hybrid Modeling
5.1 Introduction
In Chap. 1, I tried to convince you that learning the conditional distribution p(y|x) is not enough and, instead, we should focus on the joint distribution p(x, y) factorized as follows:

$$p(x, y) = p(y|x)\, p(x).$$

Why? Let me remind you of my reasoning. The conditional p(y|x) does not allow us to say anything about x but, instead, it will do its best to provide a decision. As a result, I can provide an object that has never been observed so far, and p(y|x) could still be pretty certain about its decision (i.e., assigning high probability to one class). On the other hand, once we have trained p(x), we should be able to, at least in theory, assess the probability of the given object and, eventually, determine whether our decision is reliable or not.
whether our decision is reliable or not.
In the previous chapters, we completely focused on answering the question
on how to learn p(x) alone. Since we had in mind the necessity of using it
for evaluating the probability, we discussed only the likelihood-based models,
namely the autoregressive models (ARMs), the flow-based models (flows), and the
Variational Auto-Encoders (VAEs). Now, the naturally arising question is how to
use a deep generative model together with a classifier (or a regressor). Let us focus
on a classification task for simplicity and think of possible approaches.
Let us start with an easy, almost naive approach. In the most straightforward way, we can train p(y|x) and p(x) separately. And that is it, we have a classifier,
sharing. Since training is stochastic, we really could worry about potential bad local optima, and our worries are even doubled now.
Would such an approach fail? Well, there is no simple answer to this question. It could probably even work pretty well, but it might lead to models far from optimal ones. Either way, who likes being uncertain about how their models are trained? At least not me.
Alright, so since I whine about sharing the parameterization, it is obvious that the second approach uses (drums here)... a shared parameterization! To be more precise, a partially shared parameterization assumes that there is a neural network that processes x and whose output is fed to two neural networks: one for the classifier, and one for the marginal distribution over x's. An example of this approach is depicted in Fig. 5.2 (the shared neural network is depicted in purple).
Now, taking the logarithm of the joint distribution gives

$$\ln p_{\alpha, \beta, \gamma}(x, y) = \ln p_{\alpha, \gamma}(y|x) + \ln p_{\beta, \gamma}(x),$$

where $\gamma$ denotes the shared parameters.
From the optimization perspective, the gradients flow through the γ network and, thus, it contains information about both x and y. This may greatly help in finding a better solution.
At first glance, there is nothing wrong with learning using the training objective expressed as

$$\ln p_{\alpha, \gamma}(y|x) + \ln p_{\beta, \gamma}(x).$$

If we think about it, during training the γ network obtains a much stronger signal from $\ln p_{\beta, \gamma}(x)$. Following our example of binary variables, let us assume that our neural networks return all probabilities equal to 0.5, so for the independent Bernoulli variables we get

$$\ln p_{\alpha, \gamma}(y|x) = \ln \mathrm{Bern}(y|0.5) = -\ln 2,$$

where we use the property of the logarithm ($\ln 0.5 = \ln 2^{-1} = -\ln 2$) and it does not matter what the value of y is, because the neural network returns 0.5 for both y = 0 and y = 1. Similarly, for x we get
$$\ln p_{\beta, \gamma}(x) = \ln \prod_{d=1}^{D} \mathrm{Bern}(x_d|0.5) = \sum_{d=1}^{D} \ln \mathrm{Bern}(x_d|0.5) = -D \ln 2.$$
Therefore, we see that the $\ln p_{\beta,\gamma}(x)$ part is D-times stronger than the $\ln p_{\alpha,\gamma}(y|x)$ part! For instance, for a 28 × 28 binary image we have D = 784, so the marginal part contributes roughly 543 nats compared to only about 0.69 nats from the classification part. How does this influence the final gradients during training? Try to visualize a bar of height ln 2 and another that is D-times higher. Now, imagine these bars "flow" through γ. Do you see it? Yes, the γ neural network will obtain more information from the marginal distribution, and this information could cripple the classification part. In other words, our final model will always be biased towards the marginal part. Can we do something about it? Fortunately, yes!
where λ ∈ [0, 1]. Unfortunately, this weighting scheme is not derived from a well-defined distribution, and it breaks the elegance of the likelihood-based approach. However, if you do not mind being inelegant, then this approach should work well!
A different approach is proposed in [2], where only ln p(x) is weighted:

$$\ell(x, y; \lambda) = \ln p(y|x) + \lambda \ln p(x).$$
the model. Hence, the flow-based model is well-informed about the label. Second,
the weighting λ allows controlling whether the model is more discriminative or
more generative. Third, we can use any flow-based model! GLOW was used in [2],
however, [6] used residual flows, and [7] applied invertible DenseNets. Fourth, as
presented by [2], we can use any classifier (or regressor), e.g., Bayesian classifiers.
A potential drawback of this approach lies in the necessity of determining λ. This
is an extra hyperparameter that requires tuning. Moreover, as noticed in previous
papers [2, 6, 7], the value of λ drastically changes the behavior of the model
from discriminative to generative. How to deal with that remains an open question.
Now it is time to be more specific and formulate a hybrid model. Let us start with the
classifier and consider a fully-connected neural network to model the conditional
distribution p(y|x), namely:
p(y|x) = \prod_{k=1}^{K} \theta_k(x)^{[y=k]},   (5.10)
where θk (x) is the softmax value for the k-th class, and [y = k] is the Iverson bracket
(i.e., [y = k] = 1 if y equals k, and 0—otherwise).
Next, we focus on modeling p(x). We can use any marginal model, e.g., we can
apply flows and the change of variables formula, namely:
p(x) = \pi\left(z = f^{-1}(x)\right) |\mathbf{J}_f(x)|^{-1},   (5.11)
where Jf (x) denotes the Jacobian of the transformation (i.e., neural network) f
evaluated at x. In the case of the flow, we typically use π(z) = N(z|0, 1), i.e., the
standard Gaussian distribution.
Plugging all these distributions into the hybrid modeling objective \ell(x, y; \lambda),
we get
\ell(x, y; \lambda) = \sum_{k=1}^{K} [y = k] \ln \theta_{k,g,f}(x) + \lambda \left( \ln \mathcal{N}\left(z = f^{-1}(x)\,\big|\,0, 1\right) - \ln |\mathbf{J}_f(x)| \right),   (5.12)
where we additionally highlight that θk,g,f is parameterized by two neural networks:
f from the flow and g for the final classification.
Now, if we wanted to follow [2], we could pick coupling layers as the
components of f and, eventually, we would model p(x) using RealNVP or GLOW,
for instance. However, we want to be a bit fancier, so we will utilize Integer Discrete
Flows (IDFs) [8, 9]. Why? Because we simply can, and also because IDFs do not require
calculating the Jacobian. Besides, we can practice formulating various hybrid
models a bit.
Let us quickly recall IDFs. First, they operate on \mathbb{Z}^D, i.e., on integers. Second, we
need to pick an appropriate π(z), which in this case could be the discretized logistic
(DL), DL(z|μ, ν), with mean μ and scale ν. Since the change of variables formula for
discrete random variables does not require calculating the Jacobian (remember: no
change of volume here!), we can rewrite the hybrid modeling objective as follows:
\ell(x, y; \lambda) = \sum_{k=1}^{K} [y = k] \ln \theta_{k,g,f}(x) + \lambda \ln \mathrm{DL}\left(z = f^{-1}(x)\,\big|\,\mu, \nu\right).   (5.13)
That’s it! Congratulations, if you have followed all these steps, you have arrived
at a new hybrid model that uses IDFs to model the distribution of x. Notice that the
classifier takes integers as inputs.
5.4 Code
We have all the components to implement our own Hybrid Integer Discrete Flow
(HybridIDF)! Below is the code with a lot of comments that should help you
understand every single line of it. The full code (with auxiliary functions) that you
can play with is available at: https://github.com/jmtomczak/intro_dgm.
class HybridIDF(nn.Module):
    def __init__(self, netts, classnet, num_flows, alpha=1., D=2):
        super(HybridIDF, self).__init__()

        print('HybridIDF by JT.')

        # ... (the prior parameters and other hyperparameters are initialized here)

        if len(netts) == 1:
            self.t = torch.nn.ModuleList([netts[0]() for _ in range(num_flows)])
            self.idf_git = 1
            self.beta = nn.Parameter(torch.zeros(len(self.t)))

        # ... (the case of 4 translation nets, idf_git = 4, is handled analogously)

        else:
            raise ValueError('You can provide either 1 or 4 translation nets.')

        # This contains extra layers for classification on top of z.
        self.classnet = classnet

    # ... (auxiliary methods omitted; see the full code in the repository)

    def coupling(self, x, index, forward=True):
        if self.idf_git == 1:
            (xa, xb) = torch.chunk(x, 2, 1)

            if forward:
                yb = xb + self.beta[index] * self.round(self.t[index](xa))
            else:
                yb = xb - self.beta[index] * self.round(self.t[index](xa))

            # ... (xa and yb are concatenated and returned)

        elif self.idf_git == 4:
            (xa, xb, xc, xd) = torch.chunk(x, 4, 1)

            if forward:
                ya = xa + self.beta[index] * self.round(self.t_a[index](torch.cat((xb, xc, xd), 1)))
                yb = xb + self.beta[index] * self.round(self.t_b[index](torch.cat((ya, xc, xd), 1)))
                yc = xc + self.beta[index] * self.round(self.t_c[index](torch.cat((ya, yb, xd), 1)))
                yd = xd + self.beta[index] * self.round(self.t_d[index](torch.cat((ya, yb, yc), 1)))
            else:
                yd = xd - self.beta[index] * self.round(self.t_d[index](torch.cat((xa, xb, xc), 1)))
                yc = xc - self.beta[index] * self.round(self.t_c[index](torch.cat((xa, xb, yd), 1)))
                yb = xb - self.beta[index] * self.round(self.t_b[index](torch.cat((xa, yc, yd), 1)))
                ya = xa - self.beta[index] * self.round(self.t_a[index](torch.cat((yb, yc, yd), 1)))

            # ... (the transformed chunks are concatenated and returned)

    # ... (the permutation, the flow f and its inverse, the prior, and the sampling
    #      procedures are omitted; see the full code in the repository)

    # The forward pass: now, we use the hybrid model objective!
    def forward(self, x, y, reduction='avg'):
        z = self.f(x)
        y_pred = self.classnet(z)  # output: probabilities (i.e., softmax)
        # ... (the classification loss and the weighted negative log-likelihood
        #      are combined here, following Eq. (5.13))
# Init HybridIDF
model = HybridIDF(netts, classnet, num_flows, D=D, alpha=alpha)
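How could the elided part of the forward pass be completed? Below is a minimal sketch of how the objective in Eq. (5.13) could be assembled; the helper names (class_loss, log_prior) and the use of self.alpha as the weighting factor are my assumptions, not necessarily the exact code from the repository.

    def forward(self, x, y, reduction='avg'):
        z = self.f(x)                      # IDF: map integers to integers
        y_pred = self.classnet(z)          # class probabilities (softmax output)

        # classification part: the cross-entropy, i.e., -sum_k [y=k] ln theta_k
        loss_class = self.class_loss(y_pred, y)
        # generative part: -ln DL(z | mu, nu), the discretized logistic prior
        loss_nll = -self.log_prior(z)

        # the hybrid objective (to be minimized): -ln p(y|x) - lambda * ln p(x)
        # (self.alpha is assumed to play the role of the weighting factor)
        loss = loss_class + self.alpha * loss_nll

        if reduction == 'sum':
            return loss.sum()
        else:
            return loss.mean()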
And we are done, this is all we need to have! After running the code and training
the HybridIDFs, we should obtain results similar to those in Fig. 5.4.
Hybrid VAE The hybrid modeling idea goes beyond using flows for p(x). Instead, we can,
e.g., pick a VAE and then, after applying variational inference, we get a
lower bound to the hybrid modeling objective:

\ell(x, y; \lambda) \geq \ln p(y|x) + \lambda\, \mathbb{E}_{q(z|x)}\left[ \ln p(x|z) + \ln p(z) - \ln q(z|x) \right].
Fig. 5.4 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the HybridIDF. (c) An example of a validation curve for the
classification error. (d) An example of a validation curve for the negative log-likelihood, i.e.,
− ln p(x)
The Factor λ As mentioned before, the fudge factor λ could be troublesome. First,
it does not follow from a proper probability distribution. Second, it must be tuned,
which is always extra trouble... Fortunately, [11] showed that we can get
rid of λ!
New Parameterizations An interesting open research direction is whether we can
get rid of λ by using a different learning algorithm and/or other parameterization
(e.g., some peculiar neural networks). I strongly believe it is possible and, one day,
we will get there!
Is This a Good Factorization? I am almost sure that some of you wonder whether
this factorization of the joint, i.e., p(x, y) = p(y|x) p(x) is indeed better than
p(x, y) = p(x|y) p(y). If I were to sample x from a specific class y, then the latter
is better. However, if you go back to Chap. 1, you will notice that I do not care about
generating. I prefer to have a good model that will assign proper probabilities to the
world. That is why I prefer p(x, y) = p(y|x) p(x).
References
1. Guillaume Bouchard and Bill Triggs. The tradeoff between generative and discriminative clas-
sifiers. In 16th IASC International Symposium on Computational Statistics (COMPSTAT’04),
pages 721–728, 2004.
2. Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshmi-
narayanan. Hybrid models with deep and invertible features. In International Conference
on Machine Learning, pages 4723–4732. PMLR, 2019.
3. Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-supervised
learning with deep generative models. In Proceedings of the 27th International Conference on
Neural Information Processing Systems, pages 3581–3589, 2014.
4. Sergey Tulyakov, Andrew Fitzgibbon, and Sebastian Nowozin. Hybrid VAE: Improving deep
generative models using partial observations. arXiv preprint arXiv:1711.11566, 2017.
5. Diederik P Kingma and Prafulla Dhariwal. Glow: generative flow with invertible 1× 1
convolutions. In Proceedings of the 32nd International Conference on Neural Information
Processing Systems, pages 10236–10245, 2018.
6. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
7. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
8. Rianne van den Berg, Alexey A Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby, and Tim
Salimans. Idf++: Analyzing and improving integer discrete flows for lossless compression.
arXiv e-prints, pages arXiv–2006, 2020.
9. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
10. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
11. Tom Joy, Sebastian M Schmon, Philip HS Torr, N Siddharth, and Tom Rainforth. Rethinking
semi-supervised learning in VAEs. arXiv preprint arXiv:2006.10102, 2020.
Chapter 6
Energy-Based Models
6.1 Introduction
So far, we have discussed various deep generative models for modeling the marginal
distribution over observable variables (e.g., images), p(x), such as, autoregressive
models (ARMs), flow-based models (flows, for short), Variational Auto-Encoders
(VAEs), and hierarchical models like hierarchical VAEs and diffusion-based deep
generative models (DDGMs). However, from the very beginning, we advocate for
using deep generative modeling in the context of finding the joint distribution over
observables and decision variables that is factorized as p(x, y) = p(y|x)p(x). After
taking the logarithm of the joint we obtain two additive components: ln p(x, y) =
ln p(y|x) + ln p(x). We outlined how such a joint model could be formulated
and trained in the hybrid modeling setting (see Chap. 5). The drawback of hybrid
modeling, though, is the necessity of weighting both distributions, i.e., \ell(x, y; \lambda) =
\ln p(y|x) + \lambda \ln p(x), and for λ ≠ 1 this objective does not correspond to the log-
likelihood of the joint distribution. The question is whether it is possible to formulate
a model that can be learned with λ = 1. Here, we are going to discuss a potential solution to
this problem using probabilistic energy-based models (EBMs) [1].
The history of EBMs is long and dates back to the 1980s, when
models dubbed Boltzmann Machines were proposed [2, 3]. Interestingly, the idea
behind Boltzmann machines is taken from statistical physics and was formulated
by cognitive scientists. A nice mix-up, isn’t it? In a nutshell, instead of proposing a
specific distribution like Gaussian or Bernoulli, we can define an energy function,
E(x), that assigns a value (energy) to a given state. There are no restrictions on the
energy function so you can already think of parameterizing it with neural networks.
Then, the probability distribution could be obtained by transforming the energy into
the unnormalized probability e^{-E(x)} and normalizing it by Z = \sum_{x} e^{-E(x)} (a.k.a. the
partition function), which yields the Boltzmann (also called Gibbs) distribution:
p(x) = \frac{e^{-E(x)}}{Z}.   (6.1)
If we consider continuous random variables, then the sum sign should be replaced
by an integral. In physics, the energy is scaled by the inverse temperature
[4]; however, we skip it to keep the notation uncluttered. Understanding how the
Boltzmann distribution works is relatively simple. Imagine a 5-by-5 grid. Then,
assign some value (energy) to each of the 25 points, where a larger value means
that a point has higher energy. Exponentiating the (negative) energy ensures that we do not
obtain negative values. To calculate the probability for each point, we need to
divide all exponentiated energies by their sum, in the same way as we do when
calculating the softmax. In the case of continuous random variables, we must normalize
by calculating an integral (i.e., a sum over all infinitesimal regions). For instance,
the Gaussian distribution could be also expressed as the Boltzmann distribution with
an analytically tractable partition function and the energy function of the following
form:

E(x; \mu, \sigma^2) = \frac{1}{2\sigma^2}(x - \mu)^2,   (6.2)

that yields

p(x) = \frac{e^{-E(x)}}{\int e^{-E(x)}\,\mathrm{d}x}   (6.3)
     = \frac{e^{-\frac{1}{2\sigma^2}(x-\mu)^2}}{\sqrt{2\pi\sigma^2}}.   (6.4)

So what do we gain by looking at distributions through the lens of energies? First, the energy function is
unconstrained, it could be any function! Yes, you have probably guessed already,
it could be a neural network! Second, notice that the energy function could be
multimodal without being explicitly defined as such (i.e., as opposed to, e.g., a mixture distribution).
Third, it makes no difference whether we define it over discrete or continuous variables.
I hope you see now that EBMs have a lot of advantages! They possess a lot of
deficiencies too but hey, let us stick to the positive aspects before we start, ok?
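Before moving on, here is a tiny illustration (my own sketch, not from the book's repository) of the 5-by-5 grid example from above: turning arbitrary energies into Boltzmann probabilities is just a softmax over negative energies.

import torch

energies = torch.randn(5, 5)        # an energy E(x) assigned to each of the 25 states
unnorm = torch.exp(-energies)       # unnormalized probabilities e^{-E(x)}
probs = unnorm / unnorm.sum()       # divide by the partition function Z
# equivalently: probs = torch.softmax(-energies.flatten(), dim=0).reshape(5, 5)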
For a classification problem with K classes, we can use a single neural network with K outputs, NN_θ(x), and define the energy function over the pair (x, y) as

E(x, y; \theta) = -\mathrm{NN}_\theta(x)[y],   (6.5)

where we indicate by [y] the specific output of the neural net NN_θ(x). Then, the
joint probability distribution is defined as the Boltzmann distribution:
p_\theta(x, y) = \frac{\exp\{\mathrm{NN}_\theta(x)[y]\}}{\sum_{x,y} \exp\{\mathrm{NN}_\theta(x)[y]\}}   (6.6)
              = \frac{\exp\{\mathrm{NN}_\theta(x)[y]\}}{Z_\theta},   (6.7)
where we define the partition function as Z_\theta = \sum_{x,y} \exp\{\mathrm{NN}_\theta(x)[y]\}.
Since we have the joint distribution, we can calculate the marginal distribution
and the conditional distribution. First, let us take a look at the marginal p(x).
Applying the sum rule to the joint distribution yields:
p_\theta(x) = \sum_{y} p_\theta(x, y)   (6.8)
            = \frac{\sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\}}{\sum_{x,y} \exp\{\mathrm{NN}_\theta(x)[y]\}}   (6.9)
            = \frac{\sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\}}{Z_\theta}.   (6.10)
Let us notice that we can express this distribution differently. First, we can rewrite
the numerator in the following manner:

\sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\} = \exp\left( \log \sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\} \right) = \exp\left( \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\} \right),   (6.11)

where LogSumExp_y{·} denotes the log-sum-exp operation over the outputs of the network. In other words, the marginal p_θ(x) is itself an energy-based model with the energy function E(x) = -LogSumExp_y{NN_θ(x)[y]}. Second, we can calculate the conditional distribution:

p_\theta(y|x) = \frac{p_\theta(x, y)}{p_\theta(x)}   (6.14)
             = \frac{\exp\{\mathrm{NN}_\theta(x)[y]\} / Z_\theta}{\sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\} / Z_\theta}   (6.15)
             = \frac{\exp\{\mathrm{NN}_\theta(x)[y]\}}{\sum_{y} \exp\{\mathrm{NN}_\theta(x)[y]\}}.   (6.16)
The last line should resemble something, do you see it? Yes, you are right, it is
the softmax function! We have shown that the energy-based model could be used
either as a classifier or as a marginal distribution. And it is enough to define a single
neural network for that! Isn't it beautiful? The same observation was made in [5]:
any classifier can be seen as an energy-based model.
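To see how little code this requires, here is a small sketch (my own, assuming the classifier's logits are available as a tensor of shape (batch, K)): the softmax of the logits gives p(y|x), and their log-sum-exp gives the unnormalized log-marginal, cf. Eqs. (6.10) and (6.16).

import torch
import torch.nn.functional as F

def log_probs_from_logits(logits):
    # logits = NN_theta(x), one value per class y
    log_p_y_given_x = F.log_softmax(logits, dim=1)      # ln p(y|x), Eq. (6.16)
    log_p_x_unnorm = torch.logsumexp(logits, dim=1)     # ln p(x) + ln Z_theta, cf. Eq. (6.10)
    return log_p_y_given_x, log_p_x_unnorm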
Interestingly, the logarithm of the joint distribution decomposes as follows:

\ln p_\theta(x, y) = \underbrace{\ln \frac{\exp\{\mathrm{NN}_\theta(x)[y]\}}{\sum_{y'} \exp\{\mathrm{NN}_\theta(x)[y']\}}}_{\ln p_\theta(y|x)} + \underbrace{\ln \frac{\sum_{y'} \exp\{\mathrm{NN}_\theta(x)[y']\}}{Z_\theta}}_{\ln p_\theta(x)}.
6.3 Training
We have a single neural network to train and the training objective is the logarithm
of the joint distribution. Since the training objective is a sum of the logarithm
of the conditional p_θ(y|x) and the logarithm of the marginal p_θ(x), calculating the
gradient with respect to the parameters θ requires taking the gradient of each of the
components separately. We know that there is no problem with learning a classifier,
so let us take a closer look at the second component, namely:

\nabla_\theta \ln p_\theta(x) = \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\} - \nabla_\theta \ln Z_\theta
                              = \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\} - \mathbb{E}_{x' \sim p_\theta(x')}\left[ \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x')[y]\} \right].
Let us decipher what has just happened here. The gradient of the first part,
\nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\}, is calculated for a given datapoint x. The log-sum-exp
function is differentiable, so we can apply autograd tools. However, the second
part, \mathbb{E}_{x' \sim p_\theta(x')}\left[ \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x')[y]\} \right], is a totally different story, for two
reasons:
• First, the gradient of the logarithm of the partition function turns into the expected
value over x distributed according to the model! That is really a problem because
the expected value cannot be analytically calculated and sampling from the
marginal distribution pθ (x) is non-trivial.
• Second, we need to calculate the expected value of the log-sum-exp of NNθ (x).
That is good news because we can do it using automatic differentiation tools.
Thus, the only problem lies in the expected value. Typically, it is approximated
by Monte Carlo samples, however, it is not clear how to sample effectively and
efficiently from an EBM. Grathwohl et al. [5] propose to use Langevin
dynamics [6], which is an MCMC method. The Langevin dynamics in our case starts
with a randomly initialized x_0 and then uses the information about the landscape of
the energy function (i.e., the gradient) to seek a new x, that is,

x_{t+1} = x_t + \alpha\, \nabla_{x_t} \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x_t)[y]\} + \sigma \cdot \epsilon,
where α > 0 is the step size, σ > 0 is the noise level, and ε ∼ N(0, I). The Langevin dynamics could be seen as
stochastic gradient descent in the observable space with a small amount of Gaussian noise
added at each step. Once we apply this procedure for η steps and obtain a sample x_η, we can approximate
the gradient as follows:

\nabla_\theta \ln p_\theta(x) \approx \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\} - \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x_\eta)[y]\}.

Putting everything together, the training objective is the logarithm of the joint distribution, \ln p_\theta(x, y) = \ln p_\theta(y|x) + \ln p_\theta(x),
where the first part is for learning a classifier, and the second part is for learning a
generator (so to speak). As a result, we can say we have a sum of two objectives for
a fully shared model. The gradient with respect to the parameters is the following:

\nabla_\theta \ln p_\theta(x, y) = \nabla_\theta \ln p_\theta(y|x) + \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x)[y]\} - \mathbb{E}_{x' \sim p_\theta(x')}\left[ \nabla_\theta \mathrm{LogSumExp}_y\{\mathrm{NN}_\theta(x')[y]\} \right].
The last two components come from calculating the gradient of the marginal
distribution. Remember that the problematic part is only the last component! We will
approximate this part using the Langevin dynamics (i.e., a sampling procedure) and
a single sample. The final training procedure is the following:
1. Sample x_n and y_n from the dataset.
2. Sample x_η from the model by running η steps of the Langevin dynamics.
3. Calculate the classification loss L_clf(θ), i.e., the cross-entropy for p_θ(y_n|x_n).
4. Calculate the generative loss L_gen(θ) by contrasting the energies of x_n and x_η, i.e., using the approximation above.
5. Calculate the total loss L(θ) = L_clf(θ) + L_gen(θ).
6. Apply the autograd tool to calculate gradients ∇θ L(θ ) and update the neural
network.
Notice that Lclf (θ ) is nothing else than the cross-entropy loss, and Lgen (θ ) is a
(crude) approximation to the log-marginal distribution over x’s.
6.4 Code
What do we need to code then? First, we must specify the neural network that
defines the energy function (let us call it the energy net). Classifying with the
energy net is rather straightforward. The main problematic part is sampling from
the model using the Langevin dynamics. Fortunately, the autograd tools allow us to
easily access the gradient with respect to x! In fact, it is a single line in the code below.
Then, we need to write a loop that runs the Langevin dynamics for η iterations with
the step size α and the noise level σ. In the code, we assume the data are
normalized and scaled to [−1, 1], similarly to [5].
class EBM(nn.Module):
    def __init__(self, energy_net, alpha, sigma, ld_steps, D):
        super(EBM, self).__init__()

        print('EBM by JT.')

        # the energy net: a neural network with K outputs, NN(x)[y]
        self.energy_net = energy_net

        # ... (the classification loss is defined here)

        # hyperparams
        self.D = D
        self.sigma = sigma
        self.alpha = alpha
        self.ld_steps = ld_steps

    # ... (the classification, the generative loss, and the sampling procedure
    #      are defined in methods omitted here; see the repository)

    def forward(self, x, y, reduction='avg'):
        # ... (f_xy, the outputs of the energy net for x, are calculated here)

        # =====
        # discriminative part
        # - calculate the discriminative loss: the cross-entropy
        L_clf = self.class_loss(f_xy, y)

        # =====
        # generative part
        # - calculate the generative loss: E(x) - E(x_sample)
        L_gen = self.gen_loss(x, f_xy)

        # =====
        # Final objective
        if reduction == 'sum':
            loss = (L_clf + L_gen).sum()
        else:
            loss = (L_clf + L_gen).mean()

        return loss

    # ... (two more methods are omitted: one computes the gradient of the negative
    #      energy with respect to x and returns x_i_grad, the other runs the
    #      Langevin dynamics loop and returns x_new; see the repository)
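The omitted sampling part could be implemented roughly as follows. This is a sketch under the assumptions stated in the text (the negative energy is the LogSumExp of the energy net's outputs, η = ld_steps, step size α, noise level σ, data scaled to [−1, 1]); the method names are mine and not necessarily those used in the repository.

    def energy_gradient(self, x):
        # gradient of -E(x) = LogSumExp_y NN(x)[y] with respect to x
        x_i = x.clone().detach().requires_grad_(True)
        neg_energy = torch.logsumexp(self.energy_net(x_i), dim=1).sum()
        x_i_grad = torch.autograd.grad(neg_energy, x_i)[0]
        return x_i_grad

    def sample(self, batch_size=64):
        # Langevin dynamics: start from noise and follow the energy landscape
        x_new = 2. * torch.rand(batch_size, self.D) - 1.   # data scaled to [-1, 1]
        for _ in range(self.ld_steps):
            grad = self.energy_gradient(x_new)
            x_new = x_new + self.alpha * grad + self.sigma * torch.randn_like(x_new)
        return x_new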
And we are done, this is all we need to have! After running the code (take a
look at: https://github.com/jmtomczak/intro_dgm) and training the EBM, we should
obtain results similar to those in Fig. 6.2.
6.5 Restricted Boltzmann Machines
The idea of defining a model through the energy function is the foundation of a
broad family of Boltzmann machines (BMs) [2, 7]. Boltzmann machines define
an energy function as follows:
Fig. 6.2 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the EBM after applying η = 20 steps of the Langevin dynamics.
(c) An example of a validation curve of the objective (Lclf + Lgen ). (d) An example of a validation
curve of the generative objective Lgen
E(x; \theta) = -\left( x^\top W x + b^\top x \right),   (6.32)

where θ = {W, b}, W is the weight matrix, and b is the bias vector (bias weights).
This is the same energy function as in Hopfield networks and Ising models. The problem
with BMs is that they are hard to train (due to the partition function). However, we
can alleviate the problem by introducing latent variables and restricting connections
among observables.
Restricting BMs
Let us consider a BM that consists of binary observable variables x ∈ {0, 1}D and
binary latent (hidden) variables z ∈ {0, 1}M . The relationships among variables are
specified through the following energy function:
E(x, z; \theta) = -x^\top W z - b^\top x - c^\top z,   (6.33)

where θ = {W, b, c}. The joint distribution is then the Boltzmann distribution:

p(x, z|\theta) = \frac{1}{Z_\theta} \exp\left( -E(x, z; \theta) \right),   (6.34)
where

Z_\theta = \sum_{x} \sum_{z} \exp\left( -E(x, z; \theta) \right)   (6.35)

is the partition function. The marginal probability over observables (the likelihood
of an observation) is

p(x|\theta) = \frac{1}{Z_\theta} \exp\left( -F(x; \theta) \right),   (6.36)

where F(x; \theta) = -b^\top x - \sum_{j=1}^{M} \mathrm{softplus}\left( c_j + x^\top W_{\cdot j} \right) is the free energy of the binary RBM.
Learning RBMs
For given data \mathcal{D} = \{x_n\}_{n=1}^{N}, we can train an RBM using the maximum likelihood
approach that seeks the maximum of the log-likelihood function:

\ell(\theta) = \frac{1}{N} \sum_{x_n \in \mathcal{D}} \log p(x_n|\theta).   (6.40)
The gradient of the learning objective \ell(\theta) with respect to θ takes the following
form:

\nabla_\theta \ell(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \left( \nabla_\theta F(x_n; \theta) - \sum_{\hat{x}} p(\hat{x}|\theta)\, \nabla_\theta F(\hat{x}; \theta) \right).   (6.41)
1 We use the following notation: for a given matrix A, A_{ij} is its element at location (i, j), A_{\cdot j} denotes
its jth column, A_{i\cdot} denotes its ith row, and for a given vector a, a_i is its ith element.
In general, the gradient in Eq. (6.41) cannot be computed analytically because the
second term requires summing over all configurations of observables. One way
to sidestep this issue is the standard stochastic approximation of replacing the
expectation under p(x|θ ) by a sum over S samples {x̂1 , . . . , x̂S } drawn according
to p(x|θ ) [8]:
\nabla_\theta \ell(\theta) \approx -\left( \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta F(x_n; \theta) - \frac{1}{S} \sum_{s=1}^{S} \nabla_\theta F(\hat{x}_s; \theta) \right).   (6.42)
In Contrastive Divergence (CD), each datapoint x_n is paired with a single sample \bar{x}_n obtained by running a (short) block-Gibbs chain initialized at x_n, which yields:

\nabla_\theta \ell(\theta) \approx -\frac{1}{N} \sum_{n=1}^{N} \left( \nabla_\theta F(x_n; \theta) - \nabla_\theta F(\bar{x}_n; \theta) \right).   (6.43)
The original CD [9] uses K steps of the Gibbs chain, starting from each datapoint
x_n to obtain a sample \bar{x}_n, and the chain is restarted after every parameter update. An alternative
approach, Persistent Contrastive Divergence (PCD), does not restart the chain after
each update; this typically results in a slower convergence rate but eventually better
performance [10].
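For concreteness, below is a small sketch (my own, not the book's code) of a CD-k update for a binary RBM, using the free energy given above and autograd; W, b, and c are assumed to be tensors created with requires_grad=True.

import torch
import torch.nn.functional as F

def free_energy(x, W, b, c):
    # F(x) = -b^T x - sum_j softplus(c_j + x^T W_{.j}) for a binary RBM
    return -x @ b - F.softplus(x @ W + c).sum(dim=1)

def cd_k_step(x_data, W, b, c, lr=1e-2, k=1):
    # run a short block-Gibbs chain starting from the data (no gradients needed here)
    with torch.no_grad():
        x_neg = x_data.clone()
        for _ in range(k):
            h = torch.bernoulli(torch.sigmoid(x_neg @ W + c))      # sample p(h|x)
            x_neg = torch.bernoulli(torch.sigmoid(h @ W.t() + b))  # sample p(x|h)

    # gradient ascent on the log-likelihood, i.e., descent on F(data) - F(negatives),
    # which matches the approximation in Eq. (6.43)
    loss = free_energy(x_data, W, b, c).mean() - free_energy(x_neg, W, b, c).mean()
    loss.backward()
    with torch.no_grad():
        for p in (W, b, c):
            p -= lr * p.grad
            p.grad.zero_()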
Defining Higher-Order Relationships Through the Energy Function
The energy function is an interesting concept because it allows modeling higher-
order dependencies among variables. For instance, the binary RBM could be
extended to third-order multiplicative interactions by introducing two kinds of
hidden variables, i.e., subspace units and gate units. The subspace units are hidden
variables that reflect variations of a feature, and, thus, they are more robust to
invariances. The gate units are responsible for activating the subspace units and
they can be seen as pooling features composed of the subspace features.
Let us consider the following random variables: x ∈ {0, 1}D , h ∈ {0, 1}M , S ∈
{0, 1}M×K . We are interested in the situation where there are three variables
connected, namely one observable xi and two types of hidden binary units, a gate
unit hj and a subspace unit sj k . Each gate unit is associated with a group of subspace
hidden units. The energy function of a joint configuration is then defined as follows:2

E(x, h, S; \theta) = -\sum_{i=1}^{D} \sum_{j=1}^{M} \sum_{k=1}^{K} W_{ijk}\, x_i h_j s_{jk} - \sum_{i=1}^{D} b_i x_i - \sum_{j=1}^{M} c_j h_j - \sum_{j=1}^{M} \sum_{k=1}^{K} h_j D_{jk} s_{jk},   (6.44)
2 Unlike in other cases, we use sums instead of matrix products because now we have third-order
multiplicative interactions.

The conditional distribution of a gate unit given the observables then takes the form:

p(h_j = 1|x) = \mathrm{sigm}\left( -K \log 2 + c_j + \sum_{k=1}^{K} \mathrm{softplus}\left( \sum_{i} W_{ijk} x_i + D_{jk} \right) \right),   (6.47)
The paper by Grathwohl et al. [5] is definitely a milestone in the EBM literature because it shows that
we can use any energy function parameterized by a neural network. However, before we got
to that point, there was a lot of work on energy-based models.
Restricted Boltzmann Machines RBMs possess a couple of useful traits. First,
the bipartite structure allows for block-Gibbs sampling, which could be further used to formulate an
efficient training procedure for RBMs called contrastive divergence [9]. As mentioned earlier, a chain is initialized
either at a random point or at a sample of the latents and then, conditionally, the other set
of variables is sampled. Similar to a ping-pong game, we sample some variables
given the others until convergence or until we decide to stop. Second, the distribution
3 softplus(a) = log(1 + exp(a)).
References
1. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-
based learning. Predicting structured data, 1(0), 2006.
2. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
3. Paul Smolensky. Information processing in dynamical systems: Foundations of harmony
theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
4. Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
5. Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad
Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should
treat it like one. In International Conference on Learning Representations, 2019.
6. Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics.
In Proceedings of the 28th international conference on machine learning (ICML-11), pages
681–688. Citeseer, 2011.
7. Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in Boltzmann
machines. Parallel distributed processing: Explorations in the microstructure of cognition,
1(282-317):2, 1986.
8. Benjamin Marlin, Kevin Swersky, Bo Chen, and Nando Freitas. Inductive principles for
restricted Boltzmann machine learning. In Proceedings of the thirteenth International
Conference on Artificial Intelligence and Statistics, pages 509–516, 2010.
9. Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural
computation, 14(8):1771–1800, 2002.
10. Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the
likelihood gradient. In ICML, pages 1064–1071, 2008.
11. Jakub M Tomczak and Adam Gonczarek. Learning invariant features using subspace restricted
Boltzmann machine. Neural Processing Letters, 45(1):173–182, 2017.
12. Roland Memisevic and Geoffrey E Hinton. Learning to represent spatial transformations with
factored higher-order Boltzmann machines. Neural computation, 22(6):1473–1492, 2010.
158 6 Energy-Based Models
13. Aaron Courville, James Bergstra, and Yoshua Bengio. A spike and slab restricted Boltzmann
machine. In Proceedings of the fourteenth international conference on artificial intelligence
and statistics, pages 233–241. JMLR Workshop and Conference Proceedings, 2011.
14. Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines
for collaborative filtering. In Proceedings of the 24th international conference on Machine
learning, pages 791–798, 2007.
15. Kyung Hyun Cho, Tapani Raiko, and Alexander Ilin. Gaussian-Bernoulli deep Boltzmann
machine. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages
1–7. IEEE, 2013.
16. Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted
Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–
1608, 2009.
17. Graham W Taylor, Leonid Sigal, David J Fleet, and Geoffrey E Hinton. Dynamical binary
latent variable models for 3d human pose tracking. In 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages 631–638. IEEE, 2010.
18. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on Machine learning, pages
536–543, 2008.
19. Hugo Larochelle, Michael Mandel, Razvan Pascanu, and Yoshua Bengio. Learning algorithms
for the classification restricted Boltzmann machine. The Journal of Machine Learning
Research, 13(1):643–669, 2012.
20. Jakub M Tomczak. Learning informative features from restricted Boltzmann machines. Neural
Processing Letters, 44(3):735–750, 2016.
21. Jakub M Tomczak. On some properties of the low-dimensional Gumbel perturbations in the
Perturb-and-MAP model. Statistics & Probability Letters, 115:8–15, 2016.
22. Jakub M Tomczak, Szymon Zar˛eba, Siamak Ravanbakhsh, and Russell Greiner. Low-
dimensional perturb-and-map approach for learning restricted Boltzmann machines. Neural
Processing Letters, 50(2):1401–1419, 2019.
23. Jascha Sohl-Dickstein, Peter B Battaglino, and Michael R DeWeese. New method for
parameter estimation in probabilistic models: Minimum Probability Flow. Physical review
letters, 107(22):220601, 2011.
24. Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint
arXiv:2101.03288, 2021.
25. Yoshua Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009.
26. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of
the 26th annual international conference on machine learning, pages 609–616, 2009.
27. Ruslan Salakhutdinov. Learning deep generative models. Annual Review of Statistics and Its
Application, 2:361–385, 2015.
28. Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In
Proceedings of the 25th international conference on Machine learning, pages 872–879, 2008.
29. Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with
neural networks. science, 313(5786):504–507, 2006.
30. Max Welling and Yee Whye Teh. Approximate inference in Boltzmann machines. Artificial
Intelligence, 143(1):19–50, 2003.
31. Jonathan S Yedidia, William T Freeman, and Yair Weiss. Constructing free-energy approxima-
tions and generalized belief propagation algorithms. IEEE Transactions on information theory,
51(7):2282–2312, 2005.
32. Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. A new class of upper bounds on
the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
33. Tamir Hazan and Tommi Jaakkola. On the partition function and random maximum a-
posteriori perturbations. In Proceedings of the 29th International Conference on International
Conference on Machine Learning, pages 1667–1674, 2012.
Chapter 7
Generative Adversarial Networks
7.1 Introduction
Once we discussed latent variable models, we claimed that they naturally define
a generative process by first sampling latents z ∼ p(z) and then generating
observables x ∼ pθ (x|z). That is nice! However, the problem appears when we start
thinking about training. To be more precise, the training objective is an issue. Why?
Well, the probability theory tells us to get rid of all unobserved random variables
by marginalizing them out. In the case of latent variable models, this is equivalent
to calculating the (marginal) log-likelihood function in the following form:
\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, \mathrm{d}z.   (7.1)
As we mentioned already in the section about VAEs (see Sect. 4.3), the
problematic part is calculating the integral because it is not analytically tractable
unless all distributions are Gaussian and the dependency between x and z is linear.
However, let us forget for a moment about all these issues and take a look at what
we can do here. First, we can approximate the integral using Monte Carlo samples
from the prior p(z) that yields
\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, \mathrm{d}z   (7.2)
                \approx \log \frac{1}{S} \sum_{s=1}^{S} p_\theta(x|z_s)   (7.3)
                = \log \sum_{s=1}^{S} \exp\left( \log p_\theta(x|z_s) \right) - \log S   (7.4)
mean considering one point at a time and then summing all individual errors instead
of comparing samples (i.e., collections of individuals) that we can refer to as a global
comparison. However, we do not need to stick to the KL divergence! Instead, we can
use other metrics that look at a set of points (i.e., distributions represented by a set of
points) like integral probability metrics [2] (e.g., the Maximum Mean Discrepancy
[MMD] [3]) or use other divergences [4].
Still, all of the mentioned metrics rely on defining explicitly how we measure the
error. The question is whether we can parameterize our loss function and learn it
alongside our model. Since we talk all the time about neural networks, can we go
even further and utilize a neural network to calculate differences?
Getting Rid of Prescribed Distributions
Alright, we agreed on the fact that the KL divergence is only one of many possible
loss functions. Moreover, we asked ourselves whether we can use a learnable loss
function. However, there is also one question floating in the air, namely, do we
need to use the prescribed models in the first place? The reasoning is the following.
Since we know that density networks take noise and turn it into a distribution in
the observable space, do we really need to output a full distribution? What if we
return a single point? In other words, what if we define the conditional likelihood as
Dirac's delta:

p_\theta(x|z) = \delta\left( x - \mathrm{NN}_\theta(z) \right)?
Then, let us understand what is going on! The marginal distribution is an infinite
mixture of delta peaks. In other words, we take a single z and plot a peak (or a point
in 2D, which is easier to imagine) in the observable space. As we proceed towards infinity,
the observable space will be covered by more and more points, and
some regions will be denser than others. This kind of modeling a distribution is
also known as implicit modeling.
So where is the problem then? Well, the problem in the prescribed modeling
setting is that the term log δ (x − NNθ (z)) is ill-defined and cannot be used in
many probability measures, including the KL-term, because we cannot calculate
the loss function. Therefore, we can ask ourselves whether we can define our own
loss function, perhaps? And, even more, parameterize it with neural networks! You
must admit it sounds appealing! So how to accomplish that?
Adversarial Loss
Let us start with the following story. There is a con artist (a fraud) and a friend of the
fraud (an expert) who knows a little about art. Moreover, there is a real artist who
has passed away (e.g., Pablo Picasso). The fraud tries to mimic the style of Pablo
Picasso as well as possible. The friend expert browses for paintings of Picasso and
compares them to the paintings provided by the fraud. Hence, the fraud tries to
fool his friend, while the expert tries to distinguish real paintings of Picasso from
fakes. Over time, the fraud becomes better and better and the expert also learns how
to decide whether a given painting is a fake. Eventually, and unfortunately to the
world of art, work of the fraud may become indistinguishable from Picasso and the
expert may be completely uncertain about the paintings and whether they are fakes.
Now, let us formalize this wicked game. We call the expert a discriminator that
takes an object x and returns a probability whether it is real (i.e., coming from the
empirical distribution), Dα : X → [0, 1]. We refer to the fraud as a generator that
takes noise and turns it into an object x, Gβ : Z → X. All x’s coming from the
empirical distribution pdata (x) are called real and all x’s generated by Gβ (z) are
dubbed fake. Then, we construct the objective function as follows:
• We have two sources of data: x \sim p_\theta(x) = \int \delta\left(x - G_\beta(z)\right) p(z)\, \mathrm{d}z and x \sim p_{data}(x).
• The discriminator solves the classification task by assigning 0 to all fake
datapoints and 1 to all real datapoints.
• Since the discriminator can be seen as a classifier, we can use the binary cross-
entropy loss function in the following form:
\ell(\alpha, \beta) = \mathbb{E}_{x \sim p_{real}}\left[ \log D_\alpha(x) \right] + \mathbb{E}_{z \sim p(z)}\left[ \log\left( 1 - D_\alpha\left( G_\beta(z) \right) \right) \right].   (7.8)
The left part corresponds to the real data source, and the right part contains the
fake data source.
• We try to maximize ℓ(α, β) with respect to α (i.e., the discriminator). In plain
words, we want the discriminator to be as good as possible.
• The generator tries to fool the discriminator and, thus, it tries to minimize ℓ(α, β)
with respect to β (i.e., the generator).
Eventually, we face the following learning objective:
\min_{\beta} \max_{\alpha}\; \mathbb{E}_{x \sim p_{real}}\left[ \log D_\alpha(x) \right] + \mathbb{E}_{z \sim p(z)}\left[ \log\left( 1 - D_\alpha\left( G_\beta(z) \right) \right) \right].   (7.9)
We refer to ℓ(α, β) as the adversarial loss, since there are two actors trying to achieve
two opposite goals.
GANs
Let us put everything together:
• We have a generator that turns noise into fake data.
• We have a discriminator that classifies given input as either fake or real.
• We parameterize the generator and the discriminator using deep neural networks.
Fig. 7.2 A schematic representation of GANs. Please note the part of the generator and its
resemblance to density networks
• We learn the neural networks using the adversarial loss (i.e., we optimize the
min–max problem).
The resulting class of models is called Generative Adversarial Networks (GANs)
[5]. In Fig. 7.2, we present the idea of GANs and how they are connected to density
networks. Notice that the generator part constitutes an implicit distribution, i.e.,
a distribution from an unknown family of distributions, and its analytical form is
unknown as well; however, we can sample from it.
7.3 Implementing GANs
Believe it or not, we have all the components to implement GANs. Let us look into
all of them step-by-step. In fact, the easiest way to understand them is to implement
them.
Generator
The first part is the generator, Gβ (z), which is simply a deep neural network. The
code for a class of the generator is presented below. Notice that we distinguish
between a function for generating, namely, transforming z into x, and sampling, which
first samples z ∼ N(0, I) and then calls generate.
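Since the listing itself is not reproduced here, below is a minimal sketch of such a class (my own; the repository's version may differ in details such as argument names):

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, generator_net, z_size):
        super(Generator, self).__init__()
        self.generator_net = generator_net   # a neural network mapping z to x
        self.z_size = z_size

    def generate(self, z):
        # turn noise into an observable (a fake datapoint)
        return self.generator_net(z)

    def sample(self, batch_size=64):
        # first sample z ~ N(0, I), then call generate
        z = torch.randn(batch_size, self.z_size)
        return self.generate(z)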
Discriminator
The second component is the discriminator. Here, the code is even simpler because
it consists of a single neural network. The code for a class of the discriminator is
provided below:
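As the listing is not reproduced here either, a minimal sketch of the discriminator class could look as follows (again, my own illustration under the assumption that the network already ends with a sigmoid):

import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, discriminator_net):
        super(Discriminator, self).__init__()
        # a network mapping x to the probability of being real
        self.discriminator_net = discriminator_net

    def forward(self, x):
        return self.discriminator_net(x)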
GAN
Now, we are ready to combine these two components. In our implementation, a
GAN outputs the adversarial loss either for the generator or for the discriminator.
Maybe the code below is overkill; however, it is better to write a few more lines and
properly understand what is going on than to apply some unclear tricks.
class GAN(nn.Module):
    def __init__(self, generator, discriminator):
        super(GAN, self).__init__()

        print('GAN by JT.')

        # ... (the generator and the discriminator are stored here)

    def forward(self, x_real, reduction='avg', mode='discriminator'):
        # ... (the adversarial loss is calculated here, depending on the mode)

        if reduction == 'sum':
            return loss.sum()
        else:
            return loss.mean()
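The elided middle of forward, i.e., how the adversarial loss of Eq. (7.8) could be computed in each mode, might look roughly like the sketch below; the attribute names (self.generator, self.discriminator) and the EPS constant are my assumptions, not necessarily the repository's exact code.

# inside GAN.forward, before the reduction block:
EPS = 1.e-7  # for numerical stability of the logarithms
if mode == 'discriminator':
    # the discriminator maximizes log D(x_real) + log(1 - D(G(z))),
    # so we minimize the negative of Eq. (7.8)
    x_fake = self.generator.sample(x_real.shape[0]).detach()
    d_real = self.discriminator(x_real)
    d_fake = self.discriminator(x_fake)
    loss = -(torch.log(d_real + EPS) + torch.log(1. - d_fake + EPS))
elif mode == 'generator':
    # the generator tries to fool the discriminator, i.e., minimize log(1 - D(G(z)))
    x_fake = self.generator.sample(x_real.shape[0])
    d_fake = self.discriminator(x_fake)
    loss = torch.log(1. - d_fake + EPS)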
# - generator
# ... (generator_net, an nn.Sequential mapping the latents to D outputs, is defined here)

# - discriminator
discriminator_net = nn.Sequential(nn.Linear(D, M), nn.ReLU(),
                                  nn.Linear(M, 1), nn.Sigmoid())
Training
One might think that the training procedure for GANs is more complicated than for
any of the likelihood-based models. However, it is not the case. The only difference
is that we need two optimizers instead of one. An example of a training loop is
presented below:
# - Discriminator
# Notice that we call our model with the 'discriminator' mode.
loss_dis = model.forward(batch, mode='discriminator')

optimizer_dis.zero_grad()
optimizer_gen.zero_grad()
loss_dis.backward(retain_graph=True)
optimizer_dis.step()

# - Generator
# Notice that we call our model with the 'generator' mode.
loss_gen = model.forward(batch, mode='generator')

optimizer_dis.zero_grad()
optimizer_gen.zero_grad()
loss_gen.backward(retain_graph=True)
optimizer_gen.step()
Fig. 7.3 Examples of results after running the code for GANs. (a) Real images. (b) Fake images.
(c) The validation curve for the discriminator. (d) The validation curve for the generator
Fig. 7.4 Generated images after (a) 10 epochs of training and (b) 50 epochs of training
7.4 There Are Many GANs Out There!
Since the publication of the seminal paper on GANs [5] (although the idea of the
adversarial problem could be traced back to [6]), there has been a flood of GAN-based
ideas and papers. I would not even dare to mention more than a small fraction of them. The
field of implicit modeling with GANs is growing constantly. I will try to point to a
few important papers:
• Conditional GANs: An important extension of GANs is allowing them to
generate data conditionally [7].
• GANs with encoders: An interesting question is whether we can extend condi-
tional GANs to a framework with encoders. It turns out that it is possible; see
BiGAN [8] and ALI [9] for details.
• StyleGAN and CycleGAN: The flexibility of GANs could be utilized in formulat-
ing specialized image synthesizers. For instance, StyleGAN is formulated in such
a way to transfer style between images [10], while CycleGAN tries to “translate”
one image into another, e.g., a horse into a zebra [11].
170 7 Generative Adversarial Networks
• Wasserstein GANs: In [12] it was claimed that the adversarial loss could be
formulated differently using the Wasserstein distance (a.k.a. the earth-mover
distance), that is:
\ell_W(\alpha, \beta) = \mathbb{E}_{x \sim p_{real}}\left[ D_\alpha(x) \right] - \mathbb{E}_{z \sim p(z)}\left[ D_\alpha\left( G_\beta(z) \right) \right],   (7.10)
where D_α(·) must be a 1-Lipschitz function. The simplest way to achieve that is
by clipping the weights of the discriminator to some small value c. Alternatively,
spectral normalization could be applied [13] by using the power iteration
method. Overall, constraining the discriminator to be a 1-Lipschitz function
stabilizes training; however, it is still hard to comprehend the learning process.
• f-GANs: The Wasserstein GAN indicated that we can look elsewhere for
alternative formulations of the adversarial loss. In [14], it is advocated to use
f-divergences for that.
• Generative Moment Matching Networks [15, 16]: As mentioned earlier, we could
use other metrics instead of the likelihood function. We can fix the discriminator
and define it as the Maximum Mean Discrepancy with a given kernel function.
The resulting problem is simpler because we do not train the discriminator and,
thus, we get rid of the cumbersome min–max optimization. However, the final
quality of synthesized images is typically poorer.
• Density difference vs. Density ratio: An interesting perspective is presented in
[17, 18] where we can see various GANs either as a difference of densities or as
a ratio of densities. I refer to the original papers for further details.
• Hierarchical implicit models: The idea of defining implicit models could be
extended to hierarchical models [19].
• GANs and EBMs: If you recall the EBMs, you may notice that there is a clear
connection between the adversarial loss and the logarithm of the Boltzmann
distribution. In [20, 21] it was noticed that introducing a variational distribution
over observables, q(x), leads to an objective of the form

\min_{E} \max_{q}\; \mathbb{E}_{x \sim p_{data}}\left[ E(x) \right] - \mathbb{E}_{x \sim q(x)}\left[ E(x) \right] + \mathbb{H}\left[ q(x) \right],

where E(·) is the energy function and H[·] is the entropy. The problem again
boils down to the min–max optimization problem, namely, minimizing with
respect to the energy function and maximizing with respect to the variational dis-
tribution. The second difference between the adversarial loss and the variational
lower bound here is the entropy term that is typically intractable.
• What GAN to use?: That is the question! Interestingly, it seems that training
GANs greatly depends on the initialization and the neural nets rather than the
adversarial loss or other tricks. You can read more about it in [22].
• Training instabilities: The main problem of GANs is unstable learning and a
phenomenon called mode collapse, namely, a GAN samples beautiful images
but only from some regions of the observable space. This problem has been
studied for a long time by many (e.g., [23–25]); however, it still remains an open
question.
References
1. David JC MacKay and Mark N Gibbs. Density networks. Statistics and neural networks:
advances at the interface, pages 129–145, 1999.
2. Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG
Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv
preprint arXiv:0901.2698, 2009.
3. Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A
kernel method for the two-sample-problem. Advances in Neural Information Processing
Systems, 19:513–520, 2006.
4. Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE
Transactions on Information Theory, 60(7):3797–3820, 2014.
5. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint
arXiv:1406.2661, 2014.
6. Jürgen Schmidhuber. Making the world differentiable: On using fully recurrent self-
supervised neural networks for dynamic reinforcement learning and planning in non-stationary
environments. Institut für Informatik, Technische Universität München. Technical Report FKI-
126, 90, 1990.
7. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
8. Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv
preprint arXiv:1605.09782, 2016.
9. Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Mar-
tin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint
arXiv:1606.00704, 2016.
10. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4401–4410, 2019.
11. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks. In Proceedings of the IEEE international
conference on computer vision, pages 2223–2232, 2017.
12. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial
networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
13. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normaliza-
tion for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
14. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural
samplers using variational divergence minimization. In Advances in Neural Information
Processing Systems, pages 271–279, 2016.
172 7 Generative Adversarial Networks
15. Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative
neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-
First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015.
16. Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In
International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.
17. Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint
arXiv:1702.08235, 2017.
18. Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv
preprint arXiv:1610.03483, 2016.
19. Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and
likelihood-free variational inference. Advances in Neural Information Processing Systems,
2017:5524–5534, 2017.
20. Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based
probability estimation. arXiv preprint arXiv:1606.03439, 2016.
21. Shuangfei Zhai, Yu Cheng, Rogerio Feris, and Zhongfei Zhang. Generative adversarial
networks as variational training of energy based models. arXiv preprint arXiv:1611.01799,
2016.
22. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are
GANs created equal? A large-scale study. Advances in Neural Information Processing Systems,
31, 2018.
23. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training GANs. Advances in neural information processing systems,
29:2234–2242, 2016.
24. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs
do actually converge? In International Conference on Machine Learning, pages 3481–3490.
PMLR, 2018.
25. Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with
auxiliary classifier GANs. In International conference on machine learning, pages 2642–2651.
PMLR, 2017.
26. Adji B Dieng, Francisco JR Ruiz, David M Blei, and Michalis K Titsias. Prescribed generative
adversarial networks. arXiv preprint arXiv:1910.04302, 2019.
27. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Info-
GAN: Interpretable representation learning by information maximizing generative adversarial
nets. In Proceedings of the 30th International Conference on Neural Information Processing
Systems, pages 2180–2188, 2016.
Chapter 8
Deep Generative Modeling for Neural
Compression
8.1 Introduction
In December 2020, Facebook reported having around 1.8 billion daily active users
and around 2.8 billion monthly active users [1]. Assuming that users uploaded, on
average, a single photo each day, the resulting volume of data would give a very
rough (let me stress it: a very rough) estimate of around 3000TB of new images
per day. This single case of Facebook alone already shows us potential great costs
associated with storing and transmitting data. In the digital era we can simply say
this: efficient and effective manner of handling data (i.e., faster and smaller) means
more money in the pocket.
The most straightforward way of dealing with these issues (i.e., smaller and
faster) is based on applying compression, and, in particular, image compression
algorithms (codecs) that allow us to decrease the size of an image. Instead of
changing infrastructure, we can efficiently and effectively store and transmit images
by making them simply smaller! Let us be honest, the more we compress an image,
the more and faster we can send and the less disk memory we need!
If we think of image compression, probably the first association is JPEG or
PNG, standards used on a daily basis by everyone. I will not go into the details
of these standards (e.g., see [2, 3] for an introduction), but what is important
to know is that they use some pre-defined math like the Discrete Cosine Transform.
The main advantage of standard codecs like JPEG is that they are interpretable,
i.e., all steps are hand-designed and their behavior can be predicted. However,
this comes at the cost of insufficient flexibility that could drastically decrease
their performance. So how can we increase the flexibility of the transformations? Any
idea? Anyone? Do I hear deep learning [4, 5]? Indeed! Many of today’s image
compression algorithms are enhanced by neural networks.
The emerging field of compression algorithms using neural networks is called
neural compression. Neural compression becomes a leading trend in developing
new codecs where neural networks replace parts of the standard codecs [6], or
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 173
J. M. Tomczak, Deep Generative Modeling,
https://doi.org/10.1007/978-3-030-93158-2_8
174 8 Deep Generative Modeling for Neural Compression
neural-based codecs are trained [7] together with quantization [8] and entropy
coding [9–12]. We will discuss the general compression scheme in detail in the next
subsection but here it is important to understand why deep generative modeling is
important in the context of neural compression. The answer was given a long time
ago by Claude Shannon, who showed in [13] that (informally):
The length of a message representing some source data is proportional
to the entropy of this data.
We do not know the entropy of data because we do not know the probability
distribution of data, p(x), but we can estimate it using one of the deep generative
models we have discussed so far! Because of that, there has recently been increasing
interest in using deep generative modeling to improve neural compression. We can
use deep generative models for modeling the probability distribution for entropy coders
[9–12], but also to significantly increase the final reconstruction and compression
quality by incorporating new schemes for inference [14] and reconstruction [15].
The Objective
The final performance of a codec is evaluated in terms of reconstruction error and
compression ratio. The reconstruction error is called distortion measure and is
calculated as a difference between the original image and the reconstructed image
using the mean squared error (MSE) (typically, the peak signal-to-noise ratio, \mathrm{PSNR} = 10 \log_{10}\frac{255^2}{\mathrm{MSE}}, is reported) or perceptual metrics like the multi-scale
structural similarity index (MS-SSIM) [19]. The compression ratio, called rate,
is usually expressed by the bits per pixel (bpp), i.e., the total size in bits of the
encoder output divided by the total size in pixels of the encoder input [17]. Typically,
the performance of codecs is compared by inspecting the rate-distortion plane (i.e.,
plotting curves on a plane with the rate on the x-axis and the distortion on the y-
axis).
Formally, we assume an auto-encoder architecture (see Fig. 8.1 again) with an
encoding transformation, fe : X → Y, that takes an input x and returns a discrete
signal y (a code). After sending the message, a reconstruction x̂ is given by a
decoder, fd : Y → X. Moreover, there is an (adaptive) entropy coding model
that learns the distribution p(y) and is further used to turn the discrete signal y
into a bitstream by an entropy coder (e.g., Huffman coding, arithmetic coding). If
a compression method has any adaptive (hyper)parameters, it could be learned by
optimizing the following objective function:
L(x) = d\left(x, \hat{x}\right) + \beta\, r(y),   (8.1)

where d(·, ·) is the distortion measure (e.g., PSNR, MS-SSIM), r(·) is the
rate measure (e.g., r(y) = − ln p(y)), and β > 0 is a weighting factor controlling the
balance between rate and distortion. Notice that the distortion term requires both the encoder
and the decoder, while the rate term requires the encoder and the entropy model.
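As a small illustration (my own sketch, not the book's code), the rate-distortion objective of Eq. (8.1) could be computed as follows, assuming flattened images, an MSE distortion, and a log-probability of the code supplied by the entropy model:

import torch
import torch.nn.functional as F

def rate_distortion_loss(x, x_hat, log_p_y, beta):
    # d(x, x_hat): the squared error summed over pixels, per image
    distortion = F.mse_loss(x_hat, x, reduction='none').sum(dim=1)
    # r(y) = -ln p(y), where log_p_y comes from the (adaptive) entropy model
    rate = -log_p_y
    return (distortion + beta * rate).mean()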
We have discussed all necessary concepts of image compression and now we can
delve into neural compression. However, before we do that, the first question we
face is whether we can get any benefit from using neural networks for compression
and where and how we can use them in this context. As mentioned already, standard
codecs utilize a series of pre-defined transformations and mathematical operations.
But how does it work?
Let us quickly discuss one of the most commonly used codecs: JPEG. In the
JPEG codec, an RGB image is first linearly transformed to the YCbCr format:
\begin{bmatrix} Y \\ Cb \\ Cr \end{bmatrix} = \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix} + \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.168736 & -0.331264 & 0.5 \\ 0.5 & -0.418688 & -0.081312 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}   (8.2)
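For the curious, Eq. (8.2) is just a fixed linear map, which could be applied as in the following sketch (my own illustration; rgb is assumed to hold values in [0, 255] with shape (N, 3)):

import torch

OFFSET = torch.tensor([0., 128., 128.])
M = torch.tensor([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])

def rgb_to_ycbcr(rgb):
    # the linear RGB -> YCbCr transform from Eq. (8.2)
    return OFFSET + rgb @ M.t()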
Then, the Cb and Cr channels are downscaled, typically two or three times (the
first compression stage). After that, each channel is split into, e.g., 8×8 blocks,
and fed to the discrete cosine transform (DCT) that is eventually quantized (the
second compression stage). After all, the Huffman coding could be used. To decode
the signal, the inverse DCT is used, the Cb and Cr channels are upscaled and the
RGB representation is recovered. The whole system is presented in Fig. 8.2. As
you can notice, each step is easy to follow and if you know how DCT works, the
whole procedure is a white box. There are some hyperparameters but, again, they
have a very clear interpretation (e.g., how many times the Cb and Cr channels are
downscaled, the size of blocks).
Alright, we know now how a standard codec works. However, one of the problems
with standard codecs is that they are not necessarily flexible. One may ask
whether DCT is the optimal transformation for all images. The answer is, with
high probability, no. If we are willing to give up the nicely designed white box,
we can turn it into a black box by replacing all mathematical operations with
neural networks. The potential gain is increased flexibility and potentially better
performance (both in terms of distortion and rate).
We need to remember, though, that learning neural networks requires the differen-
tiability of the whole approach. However, we also require a discrete output of the neural
network, which could break backpropagation! For this purpose, we must formulate
a differentiable quantization procedure. Additionally, to obtain a powerful model,
we need an adaptive entropy coding model. This is an important component of the
neural compression pipeline because it not only optimizes the compression ratio
(i.e., the rate) but also helps to learn a useful codebook. Next, we will discuss both
components in detail.
8.4 Neural Compression: Components
Encoders and Decoders In neural compression, unlike in VAEs, the encoder and
the decoder consist of neural networks with no additional functions. As a result, we
focus on architectures rather than how to parameterize a distribution, for instance.
The output of the encoder is a continuous code (floats) and the output of the decoder
is a reconstruction of an image. Below, we present PyTorch classes for an encoder
and a decoder, with examples of neural networks.
import torch
import torch.nn as nn

# The encoder is simply a neural network that takes an image and
# outputs a corresponding code.
class Encoder(nn.Module):
    def __init__(self, encoder_net):
        super(Encoder, self).__init__()
        self.encoder = encoder_net

    def forward(self, x):
        return self.encoder(x)
# (a symmetric Decoder class, not shown here, wraps decoder_net in the same way)
# ENCODER
e_net = nn.Sequential(
    nn.Linear(D, M * 2), nn.BatchNorm1d(M * 2), nn.ReLU(),
    nn.Linear(M * 2, M), nn.BatchNorm1d(M), nn.ReLU(),
    nn.Linear(M, M // 2), nn.BatchNorm1d(M // 2), nn.ReLU(),
    nn.Linear(M // 2, C))

# DECODER (the last layers are assumed to mirror the encoder)
d_net = nn.Sequential(
    nn.Linear(C, M // 2), nn.BatchNorm1d(M // 2), nn.ReLU(),
    nn.Linear(M // 2, M), nn.BatchNorm1d(M), nn.ReLU(),
    nn.Linear(M, M * 2), nn.BatchNorm1d(M * 2), nn.ReLU(),
    nn.Linear(M * 2, D))
Listing 8.2 Examples of neural networks for the encoder and the decoder
Quantization then maps each entry of the encoder output to its nearest value in a codebook c of K values. Writing this with a (one-hot) similarity matrix Ŝ that selects, for each code dimension, one codebook value, we obtain

ŷ = Ŝc. (8.3)
The resulting code, ŷ, consists of values from the codebook only.
Fig. 8.3 An example of the quantization of codes. (a) Distances. (b) Indices. (c) Quantized code

We can ask ourselves whether we actually gain anything, because the values of ŷ are still floats. So, in other words, where is the discrete signal that we want to turn into the bitstream? We can answer this in two ways. First, there are only K possible values in the codebook, so the values are discrete but represented by a finite number of floats. Second, the real magic happens when we calculate the matrix Ŝ. Notice that this matrix is indeed discrete! In each row, there is a single position with a 1 and 0's elsewhere (see Fig. 8.3). As a result, whether we look at it from the codebook perspective or from the similarity-matrix perspective, we should be convinced that we can indeed turn any real-valued vector into a vector whose values are still real but come from a finite set. Most importantly, this whole quantization procedure allows us to apply the backpropagation algorithm! This quantization approach (or a very similar one) was used in many neural compression methods, e.g., [8, 11] or [10]. We can also use other differentiable quantization techniques, e.g., vector quantization [16]. However, we prefer to stick to a simple codebook that turns out to be pretty effective in practice.
class Quantizer(nn.Module):
    def __init__(self, input_dim, codebook_dim, temp=1.e7):
        super(Quantizer, self).__init__()
        # temperature for the softmax over negative distances
        self.temp = temp
        self.input_dim = input_dim
        self.codebook_dim = codebook_dim
        # a learnable codebook with codebook_dim scalar values
        # (the listing is abridged; this initialization is an assumption)
        self.codebook = nn.Parameter(torch.randn(1, codebook_dim) * 0.1)
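To make the mechanics behind Eq. (8.3) concrete, here is a minimal, self-contained sketch (the function name and the toy sizes are ours, not taken from the full listing): squared distances between every code entry and the codebook values are turned, via a softmax with a very high temperature, into an (almost exactly) one-hot matrix Ŝ, and ŷ = Ŝc then contains codebook values only.

import torch

def quantize(y, codebook, temp=1.e7):
    # y: B x M real-valued codes; codebook: a 1D tensor with K values
    # squared distances between every code entry and every codebook value: B x M x K
    dist = (y.unsqueeze(-1) - codebook.view(1, 1, -1)) ** 2
    # a softmax with a very high temperature is (numerically) the one-hot matrix S_hat
    S_hat = torch.softmax(-temp * dist, dim=-1)
    # y_hat = S_hat c (Eq. (8.3)): each entry is replaced by its nearest codebook value
    y_hat = torch.matmul(S_hat, codebook)
    return S_hat, y_hat

# usage: a codebook with K = 4 values and a batch of 2 codes of length 3
codebook = torch.tensor([-1., -0.25, 0.25, 1.])
S_hat, y_hat = quantize(torch.randn(2, 3), codebook)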
Adaptive Entropy Coding Model The last piece of the whole puzzle is entropy coding. We rely on entropy coders like Huffman coding or arithmetic coding. Either way, these entropy coders require an estimate of the probability distribution over codes, p(y). Once they have it, they can losslessly compress the discrete signal into a bitstream. In general, we can encode discrete symbols into bits separately (e.g., Huffman coding) or encode the whole discrete signal into a single bit representation (e.g., arithmetic coding). In compression systems, arithmetic coding is typically preferred over Huffman coding because it achieves better compression (i.e., rates closer to the entropy) [16].
We will not review and explain in detail how arithmetic coding works; we refer to [16] (or any other book on data compression) for details. There are two facts we need to know and remember. First, if we provide the probabilities of the symbols, then arithmetic coding does not need to make an extra pass through the signal to estimate them. Second, there is an adaptive variant of arithmetic coding that allows modifying the probabilities while compressing symbols sequentially (also known as progressive coding).
These two remarks are important for us because, as mentioned earlier, we can estimate p(y) using a deep generative model. Once we learn the deep generative model, arithmetic coding can use it for the lossless compression of codes. Moreover, if we use a model that factorizes the distribution, e.g., an autoregressive model, then we can also utilize the idea of progressive coding.
In our example, we use an autoregressive model that takes a quantized code and returns the probability of each value in the codebook (i.e., of each index). In other words, the autoregressive model outputs probabilities over the codebook values. It is worth mentioning, though, that in our implementation we use the term "entropy coding" while we really mean an entropy coding model. Moreover, it is worth noting that there are specialized distributions designed for compression purposes, e.g., the scale hyperprior [9], but here we are interested in deep generative modeling for neural compression.
class ARMEntropyCoding(nn.Module):
    def __init__(self, code_dim, codebook_dim, arm_net):
        super(ARMEntropyCoding, self).__init__()
        self.code_dim = code_dim
        self.codebook_dim = codebook_dim
        # arm_net takes B x 1 x code_dim and outputs B x codebook_dim x code_dim
        self.arm_net = arm_net

    def f(self, y):
        # probabilities over the codebook values for every position of the code
        # (a sketch of the abridged method; the name f is an assumption)
        p = torch.softmax(self.arm_net(y.unsqueeze(1)), dim=1)
        return p

    def sample(self, quantizer, B=10):
        # autoregressive sampling of quantized codes, position by position
        # (a sketch of the abridged method; it assumes the quantizer exposes its codebook)
        x_new = torch.zeros(B, self.code_dim)
        for d in range(self.code_dim):
            p = self.f(x_new)                                       # B x codebook_dim x code_dim
            indx_d = torch.multinomial(p[:, :, d], num_samples=1)   # B x 1
            x_new[:, d] = quantizer.codebook[0, indx_d[:, 0]].detach()
        return x_new
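Because adaptive arithmetic coding approaches the model's negative log-probability, we can estimate the rate (in bits) directly from the entropy model's output without running an actual coder. The helper below is a small sketch under that assumption; the tensor shapes follow the comment in the listing above.

import torch

def code_length_in_bits(p, indices):
    # p: B x codebook_dim x code_dim probabilities over the codebook values
    # indices: B x code_dim (long) indices of the quantized code
    p_selected = torch.gather(p, 1, indices.unsqueeze(1)).squeeze(1)     # B x code_dim
    return -torch.log2(p_selected.clamp_min(1e-12)).sum(dim=1)           # bits per example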
If we look into the objective, we can immediately notice that the first component, E_{x∼p_data(x)}[(x − x̂)²], is the Mean Squared Error (MSE) loss; in other words, it is the reconstruction error. The second part, E_{x∼p_data(x)}[− ln p_λ(ŷ)], is the cross-entropy between q(ŷ) = ∫ p_data(x) δ(Q(f_{e,φ}(x); c) − ŷ) dx and p_λ(ŷ), where δ(·) denotes Dirac's delta. To see this clearly, let us write it down step by step:

$$\mathbb{E}_{x\sim p_{data}(x)}\left[-\ln p_{\lambda}(\hat{y})\right] = -\int q(\hat{y})\,\ln p_{\lambda}(\hat{y})\,\mathrm{d}\hat{y} \approx -\frac{1}{N}\sum_{n=1}^{N} \ln p_{\lambda}\!\left(Q(f_{e,\varphi}(x_n); c)\right). \qquad (8.8)$$
In the training objective, we have a sum of the distortion and the rate. Please note that during training we do not need to run entropy coding at all; however, it is necessary if we want to use neural compression in practice.
Additionally, it is beneficial to discuss how the number of bits per pixel (bpp) is calculated. The bpp is defined as the total size in bits of the encoder output divided by the total size in pixels of the encoder input. In our case, the encoder returns a code of size M and each value is mapped to one of K codebook values (let us assume that K = 2^κ). As a result, we can represent the quantized code using indices, i.e., integers. Since we have K possible integers, we can use κ bits to represent each of them, so the code is described by κ × M bits. In other words, using a uniform distribution over codes, i.e., p(ŷ) = (1/K)^M = 2^(−κ×M), gives a bpp equal to −log₂ p(ŷ)/D = (κ × M)/D. However, we can improve on this by using entropy coding: we take the rate value and divide it by the size of the image, D, to obtain the bpp, i.e., −log₂ p(ŷ)/D with p(ŷ) given by the entropy model.
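As a quick sanity check of this bookkeeping (with toy, made-up sizes), the following lines compute the uniform-coding bpp and the bpp implied by a model that assigns a given probability to a quantized code:

import math

D, M, kappa = 32 * 32, 64, 4            # hypothetical image size, code length, and bits per code entry
bpp_uniform = kappa * M / D             # 4 * 64 / 1024 = 0.25 bpp without an entropy model

p_code = 2. ** (-200)                   # a hypothetical model probability of one quantized code
bpp_model = -math.log2(p_code) / D      # 200 / 1024 ~ 0.195 bpp, i.e., better than uniform coding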
class NeuralCompressor(nn.Module):
    def __init__(self, encoder, decoder, entropy_coding, quantizer, beta=1., detaching=False):
        super(NeuralCompressor, self).__init__()

        # we keep all components of the compressor
        self.encoder = encoder
        self.decoder = decoder
        self.entropy_coding = entropy_coding
        self.quantizer = quantizer
        self.beta = beta
        self.detaching = detaching

    def forward(self, x, reduction='avg'):
        # encoding and quantization (a sketch of the abridged part: the quantizer is
        # assumed to return the indices, the one-hot matrix, and the quantized code)
        z = self.encoder(x)
        quantizer_out = self.quantizer(z)

        # decoding
        x_rec = self.decoder(quantizer_out[2])

        # distortion (MSE) and rate (negative log2-probability of the quantized code
        # under the entropy model; the call f and the tuple layout are assumptions)
        Distortion = torch.mean(torch.pow(x - x_rec, 2), dim=1)
        p = self.entropy_coding.f(quantizer_out[2])
        p_selected = torch.gather(p, 1, quantizer_out[0].unsqueeze(1)).squeeze(1)
        Rate = -torch.log2(p_selected.clamp_min(1e-12)).sum(dim=1)

        # Objective
        objective = Distortion + self.beta * Rate

        if reduction == 'sum':
            return objective.sum(), Distortion.sum(), Rate.sum()
        else:
            return objective.mean(), Distortion.mean(), Rate.mean()
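A hypothetical training loop could then look as follows; here `model` is a NeuralCompressor as above, while `train_loader` and `num_epochs` are assumed to be defined elsewhere (this is a sketch, not the book's experimental setup).

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(num_epochs):
    for x in train_loader:
        optimizer.zero_grad()
        objective, distortion, rate = model(x)
        objective.backward()   # gradients flow through the decoder, quantizer, and encoder
        optimizer.step()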
Fig. 8.5 An example of outcomes after the training: (a) a distortion curve, (b) a rate curve, (c) real images (left column), their reconstructions (middle column), and samples from the autoregressive entropy coder (right column)
Interestingly, since the entropy coder is also a deep generative model, we can sample from it. In Fig. 8.5c, four samples are presented. They indicate that the model can indeed learn the distribution of the quantized codes!
References
1. Facebook. Facebook reports fourth quarter and full year 2020 results.
2. Rashid Ansari, Christine Guillemot, and Nasir Memon. JPEG and JPEG2000. In The Essential Guide to Image Processing, pages 421–461. Elsevier, 2009.
3. Zixiang Xiong and Kannan Ramchandran. Wavelet image compression. In The Essential Guide
to Image Processing, pages 463–493. Elsevier, 2009.
4. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
5. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–
117, 2015.
6. Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski. Faster neural networks straight from JPEG. Advances in Neural Information Processing Systems, 31:3933–3944, 2018.
7. Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression
with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
8. Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte,
Luca Benini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learning
compressible representations. In Proceedings of the 31st International Conference on Neural
Information Processing Systems, pages 1141–1151, 2017.
9. Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Varia-
tional image compression with a scale hyperprior. In International Conference on Learning
Representations, 2018.
10. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.
11. Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool.
Conditional probability models for deep image compression. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4394–4402, 2018.
12. David Minnen, Johannes Ballé, and George Toderici. Joint autoregressive and hierarchical
priors for learned image compression. arXiv preprint arXiv:1809.02736, 2018.
13. Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical
journal, 27(3):379–423, 1948.
14. Yibo Yang, Robert Bamler, and Stephan Mandt. Improving inference for neural image
compression. Advances in Neural Information Processing Systems, 33, 2020.
15. Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity
generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
16. David Salomon. Data compression: the complete reference. Springer Science & Business
Media, 2004.
17. LJ Karam. Lossless image compression. In Al Bovik, editor, The Essential Guide to Image
Processing. Elsevier Academic Press, 2009.
18. Pierre Moulin. Multiscale image decompositions and wavelets. In The essential guide to image
processing, pages 123–142. Elsevier, 2009.
19. Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for
image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems
& Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
20. Johannes Ballé, Philip A Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur
Agustsson, Sung Jin Hwang, and George Toderici. Nonlinear transform coding. IEEE Journal
of Selected Topics in Signal Processing, 15(2):339–353, 2020.
21. Adam Golinski, Reza Pourreza, Yang Yang, Guillaume Sautiere, and Taco S Cohen. Feedback
recurrent autoencoder for video compression. In Proceedings of the Asian Conference on
Computer Vision, 2020.
22. Yang Yang, Guillaume Sautière, J Jon Ryu, and Taco S Cohen. Feedback recurrent
autoencoder. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 3347–3351. IEEE, 2020.
Appendix A
Useful Facts from Algebra and Calculus
Norm Definition
A norm is a function ‖·‖ : X → ℝ₊ with the following properties:
1. ‖x‖ = 0 ⇔ x = 0,
2. ‖αx‖ = |α| ‖x‖, where α ∈ ℝ,
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖.
Inner Product Definition
The inner product is a function ⟨·, ·⟩ : X × X → ℝ with the following properties:
1. ⟨x, x⟩ ≥ 0,
2. ⟨x, y⟩ = ⟨y, x⟩,
3. ⟨αx, y⟩ = α⟨x, y⟩,
4. ⟨x + z, y⟩ = ⟨x, y⟩ + ⟨z, y⟩.
Chosen Properties of the Norm and the Inner Product
• ⟨x, x⟩ = ‖x‖²,
• |⟨x, y⟩| ≤ ‖x‖ ‖y‖ (for vectors in ℝᴰ, ⟨x, y⟩ = ‖x‖ ‖y‖ cos θ),
• ‖x + y‖² = ‖x‖² + 2⟨x, y⟩ + ‖y‖²,
• ‖x − y‖² = ‖x‖² − 2⟨x, y⟩ + ‖y‖².
Linear Dependency
Let φ_m be a non-linear transformation and x ∈ ℝᴹ. A linear product of these two vectors is:
Matrix Inverse (2 × 2 and 3 × 3)

$$A^{-1} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

$$A^{-1} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & k \end{bmatrix}^{-1} = \frac{1}{\det A} \begin{bmatrix} (ek - fh) & (ch - bk) & (bf - ce) \\ (fg - dk) & (ak - cg) & (cd - af) \\ (dh - eg) & (bg - ah) & (ae - bd) \end{bmatrix}$$
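As a quick worked check of the 2 × 2 formula (our own example):

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}^{-1} = \frac{1}{1\cdot 4 - 2\cdot 3} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1.5 & -0.5 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} -2 & 1 \\ 1.5 & -0.5 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$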
Appendix B
Useful Facts from Probability Theory and Statistics
Bernoulli Distribution
$B(x|\theta) = \theta^{x}(1-\theta)^{1-x}$, where $x \in \{0, 1\}$ and $\theta \in [0, 1]$
$\mathbb{E}[x] = \theta$
$\mathrm{Var}[x] = \theta(1-\theta)$
Categorical (Multinoulli) Distribution
$M(x|\theta) = \prod_{d=1}^{D} \theta_d^{x_d}$, where $x_d \in \{0, 1\}$ and $\theta_d \in [0, 1]$ for all $d = 1, 2, \ldots, D$, with $\sum_{d=1}^{D} \theta_d = 1$
$\mathbb{E}[x_d] = \theta_d$
$\mathrm{Var}[x_d] = \theta_d(1-\theta_d)$
Normal Distribution
$N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
$\mathbb{E}[x] = \mu$
$\mathrm{Var}[x] = \sigma^2$
Multivariate Normal Distribution
$N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$
$\mathbb{E}[x] = \mu$
$\mathrm{Cov}[x] = \Sigma$
Beta Distribution
$\mathrm{Beta}(x|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1}$,
where $x \in [0, 1]$, $a > 0$ and $b > 0$, and $\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\, \mathrm{d}t$
$\mathbb{E}[x] = \frac{a}{a+b}$
$\mathrm{Var}[x] = \frac{ab}{(a+b)^2 (a+b+1)}$
Marginal Distribution
In the continuous case:
$p(x) = \int p(x, y)\, \mathrm{d}y$
Conditional Distribution
$p(y|x) = \frac{p(x, y)}{p(x)}$
Sum Rule
$p(x) = \sum_{y} p(x, y)$
Product Rule
$p(x, y) = p(x|y)p(y) = p(y|x)p(x)$
Bayes' Rule
$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$
B.2 Statistics
For i.i.d. data $\mathcal{D} = \{x_1, \ldots, x_N\}$, the likelihood function is:
$p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(x_n|\theta).$
The logarithm of the likelihood function $p(\mathcal{D}|\theta)$ is given by the following expression:
$\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \log p(x_n|\theta).$
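As a small worked example (ours, not from the appendix): for i.i.d. Bernoulli observations, maximizing the log-likelihood gives the sample mean as the estimate of θ:

$$\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N}\left[x_n \log\theta + (1 - x_n)\log(1-\theta)\right], \qquad \frac{\partial}{\partial\theta}\log p(\mathcal{D}|\theta) = 0 \;\Rightarrow\; \hat{\theta} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$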
Index
D: Decoder, 61; Decoder (codec), 175; Deep Boltzmann machines, 156
G: Gaussian diffusion process, 114; Generative modeling, 3; Generative process, 57
T: Top-down VAEs, 105; Transformers, 24; Triangular Sylvester flows, 97
U: Uncertainty, 1; Uniform dequantization, 33
V: VampPrior, 82; Variational autoencoder (VAE), 6, 60; Variational dequantization, 38; Variational inference, 58, 60, 92; Volume-preserving transformations, 29, 32; von Mises–Fisher distribution, 98