
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 12, December 2018

Information Dropout: Learning Optimal Representations Through Noisy Computation

Alessandro Achille and Stefano Soatto

Abstract—The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal
representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term,
which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the
common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout,
a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit
architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational
Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference.
Finally, we prove that we can promote the creation of optimal disentangled representations simply by enforcing a factorized prior, a fact
that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find
that Information Dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller
models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.

Index Terms—Representation learning, deep learning, information bottleneck, nuisances, invariants, minimality

1 INTRODUCTION

We call "representation" any function of the data that is useful for a task. An optimal representation is most useful (sufficient), parsimonious (minimal), and minimally affected by nuisance factors (invariant). Do deep neural networks approximate such sufficient invariants?

The cross-entropy loss most commonly used in deep learning does indeed enforce the creation of sufficient representations, but the other defining properties of optimal representations do not seem to be explicitly enforced by the commonly used training procedures. However, we show that this can be done by adding a regularizer, which is related to the injection of multiplicative noise in the activations, with the surprising result that noisy computation facilitates the approximation of optimal representations. In this paper we establish connections between the theory of optimal representations for classification tasks, variational inference, dropout, and "disentangling" in deep neural networks. Our contributions can be summarized in the following steps:

1) We define optimal representations using established principles of statistical decision and information theory: sufficiency, minimality, invariance (cf. [1], [2]) (Section 3).
2) We relate the defining properties of optimal representations for classification to the loss function most commonly used in deep learning, but with an added regularizer (Section 4, Eq. (3)).
3) We show that, counter-intuitively, injecting multiplicative noise into the computation improves the properties of a representation and results in a better approximation of an optimal one (Section 6).
4) We relate such multiplicative noise to the regularizer, and show that in the special case of Bernoulli noise the regularization reduces to dropout [3], thus establishing a connection to information-theoretic principles. We also provide a more efficient alternative, called Information Dropout, that makes better use of limited capacity, adapts to the data, and is related to Variational Dropout [4] (Section 6).
5) We show that, when the task is reconstruction, the procedure above yields a generalization of the Variational Autoencoder, which is instead derived from a Bayesian inference perspective [5]. This establishes a connection between information-theoretic and Bayesian representations, where the former explains the use of a multiplier used in practice but unexplained by Bayesian theory (Section 7).
6) We show that "disentanglement of the hidden causes," an often-cited but seldom formalized desideratum for deep networks, can be achieved by assuming a factorized prior for the components of the optimal representation. Specifically, we prove that computing the regularizer term under the simplifying assumption of an independent prior has the effect of minimizing the total correlation of the components, a phenomenon previously observed empirically by [6] (Section 5).

(The authors are with the Department of Computer Science, University of California, Los Angeles, 405 Hilgard Ave, Los Angeles, CA 90095. E-mail: {achille, soatto}@cs.ucla.edu.)

7) We validate the theory with several experiments, including: improved insensitivity/invariance to nuisance factors using Information Dropout on (a) Cluttered MNIST [7] and (b) MNIST+CIFAR, a newly introduced dataset to test sensitivity to occlusion phenomena critical in vision applications; (c) we show improved efficiency of Information Dropout compared to regular dropout for limited-capacity networks; (d) we show that Information Dropout favors disentangled representations; (e) we show that Information Dropout adapts to the data and allows different amounts of information to flow between different layers in a deep network (Section 8).

In the next section we introduce the basic formalism to make the above statements more precise, which we do in subsequent sections.

2 PRELIMINARIES

In the general supervised setting, we want to learn the conditional distribution p(y|x) of some random variable y, which we refer to as the task, given (samples of the) input data x. In typical applications, x is often high dimensional (for example an image or a video), while y is low dimensional, such as a label or a coarsely-quantized location. In such cases, a large part of the variability in x is actually due to nuisance factors that affect the data, but are otherwise irrelevant for the task [1]. Since by definition these nuisance factors are not predictive of the task, they should be disregarded during the inference process. However, it often happens that modern machine learning algorithms, in part due to their high flexibility, will fit spurious correlations, present in the training data, between the nuisances and the task, thus leading to poor generalization performance.

In view of this, [8] argue that the success of deep learning is in part due to the capability of neural networks to build incrementally better representations that expose the relevant variability, while at the same time discarding nuisances. This interpretation is intriguing, as it establishes a connection between machine learning, probabilistic inference, and information theory. However, common training practice does not seem to stem from this insight, and indeed deep networks may maintain even in the top layers dependencies on easily ignorable nuisances (see for example Fig. 2).

To bring the practice in line with the theory, and to better understand these connections, we introduce a modified cost function, which can be seen as an approximation of the Information Bottleneck Lagrangian of [2], and which encourages the creation of representations of the data that are increasingly disentangled and insensitive to the action of nuisances. We show that this loss can be minimized using a new layer, which we call Information Dropout, that allows the network to selectively introduce multiplicative noise in the layer activations, and thus to control the flow of information. As we show in various experiments, this method improves the generalization performance by building better representations and preventing overfitting, and it considerably improves over binary dropout on smaller models, since, unlike dropout, Information Dropout also adapts the noise to the structure of the network and to the individual sample at test time.

Apart from the practical interest of Information Dropout, one of our main results is that Information Dropout can be seen as a generalization of several existing dropout methods, providing a unified framework to analyze them, together with some additional insights on empirical results. As we discuss in Section 3, the introduction of noise to prevent overfitting has already been studied from several points of view. For example, the original formulation of dropout of [3], which introduces binary multiplicative noise, was motivated as a way of efficiently training an ensemble of exponentially many networks that would be averaged at testing time. Kingma et al. [4] introduce Variational Dropout, a dropout method which closely resembles ours, and is instead derived from a Bayesian analysis of neural networks. Information Dropout gives an alternative information-theoretic interpretation of those methods.

As we show in Section 7, other than being very closely related to Variational Dropout, Information Dropout directly yields a variational autoencoder as a special case when the task is the reconstruction of the input. This result is in part expected, since our loss function seeks an optimal representation of the input for the task of reconstruction, and the representation given by the latent variables of a variational autoencoder fits the criteria. However, it still raises the question of exactly what and how deep the links are between information theory, representation learning, variational inference and nuisance invariance. This work can be seen as a small step in answering this question.

3 RELATED WORK

The main contribution of our work is to establish how two seemingly different areas of research, namely dropout methods to prevent overfitting and the study of optimal representations, can be linked through the Information Bottleneck principle.

Dropout was introduced by Srivastava et al. [3]. The original motivation was that by randomly dropping the activations during training, we can effectively train an ensemble of exponentially many networks, which are then averaged during testing, therefore reducing overfitting. Wang et al. [9] suggested that dropout could be seen as performing a Monte-Carlo approximation of an implicit loss function, and that instead of multiplying the activations by binary noise, as in the original dropout, multiplicative Gaussian noise with mean 1 can be used as a way of better approximating the implicit loss function. This led to comparable performance but faster training than binary dropout.

Kingma et al. [4] take a similar view of dropout as introducing multiplicative (Gaussian) noise, but instead study the problem from a Bayesian point of view. In this setting, given a training dataset D = (x_i, y_i)_{i=1,...,N} and a prior distribution p(w), we want to compute the posterior distribution p(w|D) of the weights w of the network. As is customary in variational inference, the true posterior can be approximated by minimizing the negative variational lower bound L(θ) of the marginal log-likelihood of the data,

L(θ) = (1/N) ∑_{i=1}^N E_{w∼p_θ(w|D)}[−log p(y_i | x_i, w)] + (1/N) KL(p_θ(w|D) ‖ p(w)).   (1)

This minimization is difficult to perform, since it requires repeatedly sampling new weights for each sample of the dataset. As an alternative, [4] suggest that the uncertainty about the weights that is expressed by the posterior distribution p_θ(w|D) can equivalently be encoded as multiplicative noise in the activations of the layers (the so-called local reparametrization trick). As we will see in the following sections, this loss function closely resembles that of Information Dropout, which however is derived from a purely information-theoretic argument based on the Information Bottleneck principle. One difference is that we allow the parameters of the noise to change on a per-sample basis (which, as we show in the experiments, can be useful to deal with nuisances), and that we allow a scaling constant β in front of the KL-divergence term, which can be changed freely. Interestingly, even if the Bayesian derivation does not allow a rescaling of the KL-divergence, Kingma et al. notice that choosing a different scale for the KL-divergence term can indeed lead to improvements in practice. A related method, but derived from an information-theoretic perspective, was also suggested previously by [10].

The interpretation of deep neural networks as a way of creating successively better representations of the data has already been suggested and explored by many. Most recently, Tishby et al. [8] put forth an interpretation of deep neural networks as creating sufficient representations of the data that are increasingly minimal. In parallel simultaneous work, [11] approximate the information bottleneck similarly to us, but focus on empirical analysis of robustness to adversarial perturbations rather than tackling disentanglement, invariance and minimality analytically.

Sufficient dimensionality reduction [12] and Optimal Component Analysis [13] follow a similar idea to ours, in that they focus on finding the smallest (usually linear) statistic of the data that is sufficient for a given task. However, while they define "small" in terms of the dimension of the representation, we focus on finding a (non-linear) representation with minimal information content, but whose dimension can, in fact, be even larger than that of the original data. By allowing large non-linear representations, we can exploit the full representational power of deep networks, while the minimality of the information content still promotes nuisance invariance and prevents overfitting. Our framework also has connections with Independent Component Analysis (ICA), which we discuss further in Section 7.

Some have focused on creating representations that are maximally invariant to nuisances, especially when they have the structure of a (possibly infinite-dimensional) group acting on the data, like [14], or, when the nuisance is a locally compact group acting on each layer, by successive approximations implemented by hierarchical convolutional architectures, like [15] and [16]. In these cases, which cover common nuisances such as translations and rotations of an image (affine group), or small diffeomorphic deformations due to a slight change of point of view (group of diffeomorphisms), the representation is equivalent to the data modulo the action of the group. However, when the nuisances are not a group, as is the case for occlusions, it is not possible to achieve such equivalence, that is, there is a loss. To address this problem, [1] defined optimal representations not in terms of maximality, but in terms of sufficiency, and characterized representations that are both sufficient and invariant. They argue that the management of nuisance factors common in visual data, such as changes of viewpoint, local deformations, and changes of illumination, is directly tied to the specific structure of deep convolutional networks, where local marginalization of simple nuisances at each layer results in marginalization of complex nuisances in the network as a whole.

Our work fits in this last line of thinking, where the goal is not equivalence to the data up to the action of (group) nuisances, but instead sufficiency for the task. Our main contribution in this sense is to show that injecting noise into the layers, and therefore using a non-deterministic function of the data, can actually simplify the theoretical analysis and lead to disentangling and improved insensitivity to nuisances. This is an alternate explanation to that put forth by the references above.

4 OPTIMAL REPRESENTATIONS AND THE INFORMATION BOTTLENECK LOSS

Given some input data x, we want to compute some (possibly nondeterministic) function of x, called a representation, that has some desirable properties in view of the task y, for instance by being more convenient to work with, exposing relevant statistics, or being easier to store. Ideally, we want this representation to be as good as the original data for the task, and not squander resources modeling parts of the data that are irrelevant to the task. Formally, this means that we want to find a random variable z satisfying the following conditions:

i) z is a representation of x; that is, its distribution depends only on x, as expressed by the Markov chain y → x → z;
ii) z is sufficient for the task y, that is I(x; y) = I(z; y), as expressed by the Markov chain x → z → y;
iii) among all random variables satisfying these requirements, the mutual information I(x; z) is minimal. This means that z discards all variability in the data that is not relevant to the task.

Using the identity I(x; y) − I(z; y) = I(x; y|z), where I denotes the mutual information, it is easy to see that the above conditions are equivalent to finding a distribution p(z|x) which solves the optimization problem

minimize I(x; z)   subject to I(x; y|z) = 0.

The minimization above is difficult in general. For this reason, Tishby et al. have introduced a generalization known as the Information Bottleneck Principle and the associated Lagrangian to be minimized [2]:

L = I(x; y|z) + β I(x; z).
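The identity I(x; y) − I(z; y) = I(x; y|z), which underlies the formulation above, can be checked numerically. The sketch below (a toy example with arbitrary small discrete distributions, chosen only for illustration) builds a joint p(x, y, z) in which z depends only on x and verifies the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Entropy (in nats) of a probability table, summed over all entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint p(x, y) on small alphabets and a stochastic representation p(z|x).
p_xy = rng.random((4, 3)); p_xy /= p_xy.sum()
p_z_given_x = rng.random((4, 2)); p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)

# p(x, y, z) = p(x, y) p(z|x): z depends only on x, i.e., the Markov chain y -> x -> z holds.
p_xyz = p_xy[:, :, None] * p_z_given_x[:, None, :]

# Marginals needed for the mutual information terms.
p_x, p_y, p_z = p_xyz.sum((1, 2)), p_xyz.sum((0, 2)), p_xyz.sum((0, 1))
p_xz, p_yz = p_xyz.sum(1), p_xyz.sum(0)

I_xy = H(p_x) + H(p_y) - H(p_xy)                       # I(x; y)
I_zy = H(p_z) + H(p_y) - H(p_yz)                       # I(z; y)
I_xy_given_z = H(p_xz) + H(p_yz) - H(p_z) - H(p_xyz)   # I(x; y | z)

assert np.isclose(I_xy - I_zy, I_xy_given_z)
print(I_xy, I_zy, I_xy_given_z)
```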

When y is a discrete random variable, such as a label, as we will often assume throughout this work, we can further use the identity I(x; y|z) = H(y|z) − H(y|x) and the fact that H(y|x) is constant to obtain the equivalent Lagrangian

L = H(y|z) + β I(x; z),   (2)

where β is a positive constant that manages the trade-off between sufficiency (the performance on the task, as measured by the first term) and minimality (the complexity of the representation, measured by the second term). It is easy to see that, in the limit β → 0+, this is equivalent to the original problem, where z is a minimal sufficient statistic. When all random variables are discrete and z = T(x) is a deterministic function of x, the algorithm proposed by [2] can be used to minimize the IB Lagrangian efficiently. However, no algorithm is known to minimize the IB Lagrangian for non-Gaussian, high-dimensional continuous random variables.

One of our key results is that, when we restrict to the family of distributions obtained by injecting noise into one layer of a neural network, we can efficiently approximate and minimize the IB Lagrangian.¹ As we will show, this process can be effectively implemented through a generalization of the dropout layer that we call Information Dropout.

¹ Since we restrict the family of distributions, there is no guarantee that the resulting representation will be optimal. We can, however, iterate the process to obtain incrementally improved approximations.

To set the stage, we rewrite the IB Lagrangian as a per-sample loss function. Let p(x, y) denote the true distribution of the data, from which the training set (x_i, y_i)_{i=1,...,N} is sampled, and let p_θ(z|x) and p_θ(y|z) denote the unknown distributions that we wish to estimate, parametrized by θ. Then, we can write the two terms in the IB Lagrangian as

H(y|z) ≈ E_{x,y∼p(x,y)} [ E_{z∼p_θ(z|x)}[−log p_θ(y|z)] ]
I(x; z) = E_{x∼p(x)} [ KL(p_θ(z|x) ‖ p_θ(z)) ],

where KL denotes the Kullback-Leibler divergence. We can therefore approximate the IB Lagrangian empirically as

L = (1/N) ∑_{i=1}^N { E_{z∼p_θ(z|x_i)}[−log p_θ(y_i|z)] + β KL(p_θ(z|x_i) ‖ p_θ(z)) }.   (3)

Notice that the first term is simply the average cross-entropy, which is the most commonly used loss function in deep learning. The second term can then be seen as a regularization term. In fact, many classical regularizers, like the L2 penalty, can be expressed in the form of Eq. (3) (see also [17]). In this work, we interpret the KL term as a regularizer that penalizes the transfer of information from x to z. In the next section, we discuss ways to control such information transfer through the injection of noise.

Remark (Deterministic versus stochastic representations). Aside from being easier to work with, stochastic representations can attain a lower value of the IB Lagrangian than any deterministic representation. For example, consider the task of reconstructing a single random bit y given a noisy observation x. The only deterministic representations are equivalent to either the noisy observation itself or the trivial constant map. It is not difficult to check that, for opportune values of β and of the noise, neither realizes the optimal tradeoff reached by a suitable stochastic representation.

Remark (Approximate sufficiency). The quantity I(x; y|z) = H(y|z) − H(y|x) ≥ 0 can be seen as a measure of the distance between p(x, y, z) and the closest distribution q(x, y, z) such that x → z → y is a Markov chain. Therefore, by minimizing Eq. (2) we find representations that are increasingly "more sufficient", meaning that they are closer to an actual Markov chain.

5 DISENTANGLEMENT

In addition to sufficiency and minimality, "disentanglement of hidden factors" is often cited as a desirable property of a representation [18], but seldom formalized. We may think that the observed data is generated by a complex interplay of independent causes, or factors. Ideally, the components of the learned representation should capture these independent factors by disentangling the correlations in the observed data. We can then quantify disentanglement by measuring the total correlation [19], also known as multi-information [20],² defined as

TC(z) := KL(q(z) ‖ ∏_j q_j(z_j)).

² As pointed out by a reviewer, multi-information would be a more appropriate name for this quantity. We chose to use Total Correlation both for historical reasons, after its introduction in [19], and to emphasize the relation with disentanglement also in recent work on unsupervised learning [21]. Other measures of independence are of course possible. Total Correlation has the advantage of being enforced naturally when optimizing other information-theoretic quantities.

Notice that the components of z are mutually independent if and only if TC(z) is zero. Adding this as a penalty in the IB Lagrangian, with a factor γ, yields

L = (1/N) ∑_{i=1}^N { E_{z∼p(z|x_i)}[−log p(y_i|z)] + β KL(p_θ(z|x_i) ‖ p_θ(z)) } + γ TC(z).   (4)

In general, minimizing this augmented loss is intractable, since to compute both the KL term and the total correlation we need to know the marginal distribution p_θ(z), which is not easily computable. However, the following proposition, which we prove in Appendix B (available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2784440), shows that if we choose γ = β, then the problem simplifies, and can be easily solved by adding an auxiliary variable.

Proposition 1. The minimization problem

minimize over p:   (1/N) ∑_{i=1}^N E_{z∼p(z|x_i)}[−log p(y_i|z)] + β { KL(p(z|x_i) ‖ p(z)) + TC(z) }

is equivalent to the following minimization in two variables

minimize over p, q:   (1/N) ∑_{i=1}^N E_{z∼p(z|x_i)}[−log p(y_i|z)] + β KL(p(z|x_i) ‖ ∏_{j=1}^{|z|} q_j(z_j)).

In other words, minimizing the standard IB Lagrangian assuming that the activations are independent, i.e., having q(z) = ∏_j q_j(z_j), is equivalent to enforcing disentanglement of the hidden factors.
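As a concrete illustration of the objective in Proposition 1, the sketch below evaluates the per-sample loss of Eq. (3) for a Gaussian p(z|x_i), replacing ∏_j q_j(z_j) with a factorized unit Gaussian so that the KL term has a closed form (the toy batch, the Gaussian family, and the unit Gaussian prior are assumptions made only for this example).

```python
import numpy as np

rng = np.random.default_rng(0)

def factorized_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || prod_j N(0, 1) ): the factorized KL term of Proposition 1."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

def ib_loss(logits, labels, mu, sigma, beta):
    """Empirical IB Lagrangian of Eq. (3): average cross-entropy plus beta times the KL term."""
    log_softmax = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    cross_entropy = -log_softmax[np.arange(len(labels)), labels]
    return np.mean(cross_entropy + beta * factorized_kl(mu, sigma))

# Toy batch: 8 samples, a 10-dimensional stochastic representation, 3 classes.
mu = rng.normal(size=(8, 10))                    # mean of p(z | x_i)
sigma = np.exp(0.1 * rng.normal(size=(8, 10)))   # standard deviation of p(z | x_i)
logits = rng.normal(size=(8, 3))                 # classifier head applied to a sample of z
labels = rng.integers(0, 3, size=8)

print(ib_loss(logits, labels, mu, sigma, beta=0.1))
```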

It is interesting to note that this independence assumption is often already adopted by practitioners on grounds of simplicity, since the actual marginal p(z) = ∫ p(x, z) dx is often incomputable. That using a factorized model results in "disentanglement" was also observed empirically by [6], which, however, introduced an ad-hoc metric based on classifiers of low VC-dimension, rather than the more natural Total Correlation adopted here.

In view of the previous proposition, from now on we will assume that the activations are independent and ignore the total correlation term.

6 INFORMATION DROPOUT

Guided by the analysis in the previous sections, and to emphasize the role of stochasticity, we consider representations z obtained by computing a deterministic map f(x) of the data (for instance a sequence of convolutional and/or fully-connected layers of a neural network), and then multiplying the result component-wise by a random sample ε drawn from a parametric noise distribution p_α with unit mean and variance that depends on the input x:

ε ∼ p_{α(x)}(ε),   z = ε ⊙ f(x),

where "⊙" denotes the element-wise product. Notice that, if p_{α(x)}(ε) is a Bernoulli distribution rescaled to have mean 1, this reduces exactly to the classic binary dropout layer. As we discussed in Section 3, there are also variants of dropout that use different distributions.

A natural choice for the distribution p_{α(x)}(ε), which also simplifies the theoretical analysis, is the log-normal distribution p_{α(x)}(ε) = log N(0, α_θ²(x)). Once we fix this noise distribution, given the above expression for z, we can easily compute the distribution p_θ(z|x) that appears in Eq. (3). However, to be able to compute the KL-divergence term, we still need to fix a prior distribution q_θ(z). The choice of this prior largely depends on the expected distribution of the activations f(x). Recall that, by Section 5, we can assume that all activations are independent, thus simplifying the computation. Here, we concentrate on two of the most common activation functions: the rectified linear unit (ReLU), which is easy to compute and works well in practice, and the Softplus function, which can be seen as a strictly positive and differentiable approximation of ReLU.

A network implemented using only ReLU and a final Softmax layer has the remarkable property of being scale-invariant, meaning that multiplying all weights, biases, and activations by a constant does not change the final result. Therefore, from a theoretical point of view, it would be desirable to use a scale-invariant prior. The only such prior is the improper log-uniform, q(log(z)) = c, or equivalently q(z) = c/z, which was also suggested by [4], but as a prior for the weights of the network, rather than the activations. Since the ReLU activations are frequently zero, we also assume q(z = 0) = q_0 for some constant 0 ≤ q_0 ≤ 1. Therefore, the final prior has the form q(z) = q_0 δ_0(z) + c/z, where δ_0 is the Dirac delta in zero. In Fig. 1a, we compare this prior distribution with the actual empirical distribution p(z) of a network with ReLU activations.

In a network implemented using Softplus activations, a log-normal is a good fit of the distribution of the activations. This is to be expected, especially when using batch-normalization, since the pre-activations will approximately follow a normal distribution with zero mean, and the Softplus approximately resembles a scaled exponential near zero. Therefore, in this case we suggest using a log-normal distribution as our prior q(z). In Fig. 1b, we compare this prior with the empirical distribution p(z) of a network with Softplus activations.

Fig. 1. Comparison of the empirical distribution p(z) of the post-noise activations with our proposed prior when using: (a) ReLU activations, for which we propose a log-uniform prior, and (b) Softplus activations, for which we propose a log-normal prior. In both cases, the empirical distribution approximately follows the proposed prior. Both histograms were obtained from the last dropout layer of the All-CNN-32 network described in Table 2, trained on CIFAR-10.

Using these priors, we can finally compute the KL-divergence term in Eq. (3) for both ReLU activations and Softplus activations. We prove the following two propositions in Appendix A, available online.

Proposition 2 (Information Dropout cost for ReLU). Let z = ε ⊙ f(x), where ε ∼ p_α(ε), and assume p(z) = q δ_0(z) + c/z. Then, assuming f(x) ≠ 0, we have

KL(p_θ(z|x) ‖ p(z)) = −H(p_{α(x)}(log ε)) + log c.

In particular, if p_α(ε) is chosen to be the log-normal distribution p_α(ε) = log N(0, α_θ²(x)), we have

KL(p_θ(z|x) ‖ p(z)) = −log α_θ(x) + const.   (5)

If instead f(x) = 0, we have

KL(p_θ(z|x) ‖ p(z)) = −log q.

Proposition 3 (Information Dropout cost for Softplus). Let z = ε ⊙ f(x), where ε ∼ p_α(ε) = log N(0, α_θ²(x)), and assume p_θ(z) = log N(μ, σ²). Then, we have

KL(p_θ(z|x) ‖ p(z)) = (α_θ²(x) + μ²) / (2σ²) − log(α_θ(x)/σ) − 1/2.   (6)

Substituting the expression for the KL divergence in Eq. (5) inside Eq. (3), and ignoring for simplicity the special case f(x) = 0, we obtain the following loss function for ReLU activations

L = (1/N) ∑_{i=1}^N { E_{z∼p_θ(z|x_i)}[−log p(y_i|z)] − β log α_θ(x_i) },   (7)

and a similar expression for Softplus. Notice that the first expectation can be approximated by sampling (in the experiments we use one single sample, as is customary for dropout), and is just the average cross-entropy term that is typical in deep learning. The second term, which is new, penalizes the network for choosing a low variance for the noise, i.e., for letting more information pass through to the next layer. This loss can be optimized easily using stochastic gradient descent and the reparametrization trick of [5] to back-propagate the gradient through the sampling operation.

7 CONNECTIONS WITH OTHER FRAMEWORKS

In this section, we outline strong connections between Information Dropout, Variational Autoencoders (VAEs) [5], and Independent Component Analysis (ICA) [22], [23].

Variational autoencoders aim to reconstruct, given a training dataset D = {x_i}, a latent random variable z such that the observed data x can be thought of as being generated by the, usually simpler, variable z through some unknown generative process p_θ(x|z). In practice, this is done by minimizing the negative variational lower-bound to the marginal log-likelihood of the data

L(θ) = (1/N) ∑_{i=1}^N { E_{z∼p_θ(z|x_i)}[−log p_θ(x_i|z)] + KL(p_θ(z|x_i) ‖ ∏_j q_θ(z_j)) },

where the optimization is joint over the prior q_θ(z), which is often assumed to be factorized, and the posterior p_θ(z|x). The optimization can be performed easily through sampling using the SGVB method of [5].

We now show that this procedure can be seen as a special case of Information Dropout. Consider again the loss in Eq. (4) in the special case y = x, that is, when the task is the reconstruction of the input:

L = (1/N) ∑_{i=1}^N { E_{z∼p(z|x_i)}[−log p(x_i|z)] + β KL(p_θ(z|x_i) ‖ p_θ(z)) } + γ TC(z).   (8)

By Proposition 1, in the special case β = γ, this reduces to

L(θ) = (1/N) ∑_{i=1}^N { E_{z∼p_θ(z|x_i)}[−log p_θ(x_i|z)] + β KL(p_θ(z|x_i) ‖ ∏_j q_θ(z_j)) },   (9)

where again the optimization is joint over the prior q_θ(z) and the posterior p_θ(z|x), leading to the same optimization problem as a VAE when β = 1, that is, when all quantities have the same weight in the loss function. This derivation also provides some additional insights: when using a factorized prior, a VAE will try to find a representation of the data which is sufficient for reconstruction (cross-entropy term), maximally compressed (KL term) and disentangled (total correlation term). We can also see that, while using a non-factorized prior instead increases the complexity of the optimization problem, it spares the VAE from having to find a disentangled representation, allowing it to obtain a better compression result [24]. In the same setting as Eq. (9), we can use larger values of β to force Information Dropout, and hence, in the case of reconstruction, a VAE, to recover representations that are increasingly more compressed and also disentangled. This fact is implicitly used in contemporary work [6], which derives the loss in Eq. (9) taking inspiration from experimental evidence in neuroscience. They empirically verify that, as expected from this theoretical derivation, for higher values of β the representation z recovered by the VAE is increasingly more disentangled.

Eq. (8) has two other important cases. As we have already seen, the case γ = 0 and β > 0 is the standard Information Bottleneck Lagrangian: a VAE trained with this loss will focus purely on compression of the input, without squandering resources to also disentangle the representation. In the case β = 0 and γ > 0 we obtain instead the standard loss function of Independent Component Analysis (ICA), whereby we try to reconstruct a perfectly disentangled representation of the data, without any constraint on its complexity (quantity of information). While both cases are important in their own right, Proposition 1 does not apply to them, thus the loss function does not generally simplify and cannot be computed in closed form.

8 EXPERIMENTS

The goal of our experiments is to validate the theory, by showing that increasing the noise level indeed yields reduced dependency on nuisance factors and a more disentangled representation, and that by adapting the noise level to the data we can better exploit architectures of limited capacity.

To this end, we first compare Information Dropout with the dropout baseline on several standard benchmark datasets using different network architectures, and highlight a few key properties. All the models were implemented using TensorFlow [25]. As [4] also notice, letting the variance of the noise grow excessively leads to poor generalization. To avoid this problem, we constrain α(x) < 0.7, so that the maximum variance of the log-normal error distribution will be approximately 1, the same as binary dropout when using a drop probability of 0.5. In all experiments we divide the KL-divergence term by the number of training samples, so that for β = 1 the scaling of the KL-divergence term is similar to the one used by Variational Dropout (see Section 3).

Cluttered MNIST. To visually assess the ability of Information Dropout to create a representation that is increasingly insensitive to nuisance factors, we train the All-CNN-96 network (Table 2) for classification on the Cluttered MNIST dataset [7], consisting of 96×96 images containing a single MNIST digit together with 21 distractors. The dataset is divided into 50,000 training images and 10,000 testing images. As shown in Fig. 2, for small values of β, the network lets both the objects of interest (digits) and the distractors through to the upper layers. By increasing the value of β, we force the network to disregard the least discriminative components of the data, thereby building a better representation for the task. This behavior depends on the ability of Information Dropout to learn the structure of the nuisances in the dataset, which, unlike other methods, is facilitated by the ability to select the noise level on a per-sample basis.

Fig. 2. Plot of the total KL-divergence at each spatial location in the first three Information Dropout layers (of sizes 48×48, 24×24 and 12×12, respectively) of All-CNN-96 (see Table 2) trained on Cluttered MNIST with different values of β. This measures how much information from each part of the image the Information Dropout layer is transmitting to the next layer. For small β, information about the nuisances is transmitted to the next layers, while for higher values of β the dropout layers drop the information as soon as the receptive field is big enough to recognize it as a nuisance. The resulting representation is thus more robust to nuisances, improving generalization. Notice that the noise added by Information Dropout is tailored to the specific sample, to the point that the digit can be localized from the noise mask.

Occluded CIFAR. Occlusions are a fundamental phenomenon in vision, for which it is difficult to hand-design invariant representations. To assess whether the approximate minimal sufficient representation produced by Information Dropout has this invariance property, we created a new dataset by occluding images from CIFAR-10 with digits from MNIST (Fig. 4). We train the All-CNN-32 network (Table 2) to classify the CIFAR image. The information relative to the occluding MNIST digit is then a nuisance for the task, and therefore should be excluded from the final representation. To test this, we train a secondary network to classify the nuisance MNIST digit using only the representation learned for the main task. When training with small values of β, the network has very little pressure to limit the effect of nuisances in the representation, so we expect the nuisance classifier to perform better. On the other hand, increasing the value of β, we expect its performance to degrade, since the representation will become increasingly minimal, and therefore invariant to nuisances. The results in Fig. 4 confirm this intuition.

Fig. 4. A few samples from our Occluded CIFAR dataset and the plot of the testing error on the main task (classifying the CIFAR image) and on the nuisance task (classifying the occluding MNIST digit) as β varies. For both tasks, we use the same representation of the data, trained for the main task using Information Dropout. For larger values of β the representation is increasingly more invariant to nuisances, making the nuisance classification task harder, but improving the performance on the main task by preventing overfitting. For the nuisance task, we test using the learned noisy representation of the data, since we are interested specifically in the effects of the noise. For the main task, we show the result both using the noisy representation (N), and the deterministic representation (D) obtained by disabling the noise at testing time.

MNIST and CIFAR-10. Similar to [4], to see the effect of Information Dropout on different network sizes and architectures, we train on MNIST a network with three fully connected hidden layers with a variable number of hidden units, and we train on CIFAR-10 [26] the All-CNN-32 convolutional network described in Table 2, using a variable percentage of all the filters. The fully connected network was trained for 80 epochs, using stochastic gradient descent with momentum, with initial learning rate 0.07 and dropping the learning rate by 0.1 at 30 and 70 epochs. The CNN was trained for 160 epochs with initial learning rate 0.1, dropping the learning rate by 0.1 at 80 and 120 epochs. We show the results in Fig. 3. Information Dropout is comparable to or outperforms binary dropout, especially on smaller networks. A possible explanation is that dropout severely reduces the already limited capacity of the network, while Information Dropout can adapt the amount of noise to the data and to the size of the network so that the relevant information can still flow to the successive layers. Fig. 6 shows how the amount of transmitted information adapts to the size and hierarchical level of the layer.

Fig. 3. (a) Average classification error on MNIST over 3 runs of several dropout methods applied to a fully connected network with three hidden layers and ReLU activations. Information Dropout outperforms binary dropout, especially on smaller networks, possibly because dropout severely reduces the already limited capacity of the network, while Information Dropout can adapt the amount of noise to the data and the size of the network. Information Dropout also outperforms a dropout layer that uses constant log-normal noise with the same variance, confirming the benefits of adaptive noise. (b) Classification error on CIFAR-10 for several dropout methods applied to the All-CNN-32 network (see Table 2) using Softplus activations.

Disentangling. As we saw in Section 6, in the case of Softplus activations the logarithm of the activations approximately follows a normal distribution. We can then approximate the total correlation using the associated covariance matrix Σ. Precisely, we have

TC(z) = −(1/2) log |Σ_0^{-1} Σ|,

where Σ_0 = diag Σ is the variance of the marginal distribution. In Fig. 5 we plot the testing error and the total correlation of the representation learned by All-CNN-32 on CIFAR-10 when using 25 percent of the filters, for different values of β. As predicted, when β increases the total correlation diminishes, that is, the representation becomes disentangled, and the testing error improves, since we prevent overfitting. When β is too large, the information flow is insufficient and the testing error rapidly increases.

Fig. 5. For different values of β, plot of the test error and total correlation of the final layer of the All-CNN-32 network with Softplus activations trained on CIFAR-10 with 25 percent of the filters. Increasing β, the test error decreases (we prevent overfitting) and the representation becomes increasingly disentangled. When β is too large, it prevents information from passing through, jeopardizing sufficiency and causing a drastic increase in error.

VAE. To validate Section 7, we replicate the basic variational autoencoder of [5], implementing it both with Gaussian latent variables, as in the original, and with an Information Dropout layer. We trained both implementations for 300 epochs, dropping the learning rate by 0.1 at 30 and 120 epochs. We report the results in Table 1. The Information Dropout implementation has similar performance to the original, confirming that a variational autoencoder can be considered a special case of Information Dropout.

TABLE 1
Average Variational Lower-Bound L on the Testing Dataset for a Simple VAE, Where the Size of the Latent Variable z Is 256·k and the Encoder/Decoder Each Contain 512·k Hidden Units

k | Gaussian | Information
1 | 98.8     | 100.0
2 | 99.0     | 99.1
3 | 98.7     | 99.1

The latent variable z is implemented either using a Gaussian vector or using Information Dropout. Both methods achieve a similar performance.

Fig. 6. Plots of (a) the total information transmitted through the two dropout layers of an All-CNN-32 network with Softplus activations trained on CIFAR and (b) the average quantity of information transmitted through each unit in the two layers. From (a) we see that the total quantity of information transmitted does not vary much with the number of filters and that, as expected, the second layer transmits less information than the first layer, since prior to it more nuisances have been disentangled and discarded. In (b) we see that when we decrease the number of filters, we force each single unit to let more information flow (i.e., we apply less noise), and that the units in the top dropout layer contain on average more information relevant to the task than the units in the bottom dropout layer.

TABLE 2
Structure of the Networks Used in the Experiments

(a) All-CNN-32
Input 32×32
3×3 conv 96 ReLU
3×3 conv 96 ReLU
3×3 conv 96 ReLU, stride 2
dropout
3×3 conv 192 ReLU
3×3 conv 192 ReLU
3×3 conv 192 ReLU, stride 2
dropout
3×3 conv 192 ReLU
1×1 conv 192 ReLU
1×1 conv 10 ReLU
spatial average
softmax

(b) All-CNN-96
Input 96×96
3×3 conv 32 ReLU
3×3 conv 32 ReLU
3×3 conv 32 ReLU, stride 2
dropout
3×3 conv 64 ReLU
3×3 conv 64 ReLU
3×3 conv 64 ReLU, stride 2
dropout
3×3 conv 96 ReLU
3×3 conv 96 ReLU
3×3 conv 96 ReLU, stride 2
dropout
3×3 conv 192 ReLU
3×3 conv 192 ReLU
3×3 conv 192 ReLU, stride 2
dropout
3×3 conv 192 ReLU
1×1 conv 192 ReLU
1×1 conv 10 ReLU
spatial average
softmax

The design of the networks is based on [27], but we also add batch normalization before the activations of each layer. Depending on the experiment, the ReLU activations are replaced by Softplus activations, and the dropout layer is implemented with Binary Dropout, Information Dropout, or completely removed.

9 DISCUSSION

We relate the Information Bottleneck principle and its associated Lagrangian to seemingly unrelated practices and concepts in deep learning, including dropout, disentanglement, and variational autoencoding. For classification tasks, we show how an optimal representation can be achieved by injecting multiplicative noise into the activation functions, and therefore into the gradient computation during learning.

A special case of noise (Bernoulli) results in dropout, which is standard practice originally motivated by ensemble averaging rather than information-theoretic considerations. Better (adaptive) noise models result in better exploitation of limited capacity, leading to a method we call Information Dropout. We also establish connections with variational inference and variational autoencoding, and show that "disentangling of the hidden causes" can be measured by total correlation and achieved simply by enforcing independence of the components in the representation prior.

So, what may be done by necessity in some computational systems (noisy computation) turns out to be beneficial towards achieving invariance and minimality. Analogously, what has been done for convenience (assuming a factorized prior) turns out to be beneficial towards achieving "disentanglement."

Another interpretation of Information Dropout is as a way of biasing the network towards reconstructing representations of the data that are compatible with a Markov chain generative model, making it more suited to data coming from hierarchical models; in this sense it is complementary to architectural constraints, such as convolutions, that instead bias the model toward geometric tasks.

It should be noticed that injecting multiplicative noise into the activations can be thought of as a particular choice of a class of minimizers of the loss function, but it can also be interpreted as a regularization term added to the cost function, or as a particular procedure utilized to carry out the optimization. So the same operation can be interpreted as any of the three key ingredients in the optimization: the function to be minimized, the family over which to minimize, and the procedure with which to minimize. This highlights the intimate interplay between the choice of models and algorithms in deep learning.

ACKNOWLEDGMENTS

Work supported by ARO, ONR, AFOSR. We are very grateful to the reviewers for their thorough analysis of the paper. In particular, we would like to thank one of the anonymous reviewers for providing an elegant alternative proof of Proposition 1, as well as thoughtful critiques of the theorems. Dedicated to Naftali Tishby on the occasion of the conference Information, Control and Learning held in his honor in Jerusalem, September 26-28, 2016. Registered as Tech Report UCLA-CSD160009 and arXiv:1611.01353 on November 6, 2016.

REFERENCES

[1] S. Soatto and A. Chiuso, "Visual representations: Defining properties and deep approximations," in Proc. Int. Conf. Learn. Representations, May 2016.
[2] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proc. 37th Annu. Allerton Conf. Commun., Control Comput., 1999.
[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[4] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2575–2583.
[5] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. 2nd Int. Conf. Learn. Representations, 2014.
[6] I. Higgins, et al., "Beta-VAE: Learning basic visual concepts with a constrained variational framework," in Proc. Int. Conf. Learn. Representations, 2017.
[7] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
[8] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in Proc. IEEE Inf. Theory Workshop, 2015, pp. 1–5.
[9] S. Wang and C. Manning, "Fast dropout training," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 118–126.
[10] G. E. Hinton and D. Van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in Proc. 6th Annu. Conf. Comput. Learn. Theory, 1993, pp. 5–13.
[11] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," in Proc. Int. Conf. Learn. Representations, 2017.
[12] K. P. Adragni and R. D. Cook, "Sufficient dimension reduction and prediction in regression," Philosoph. Trans. Roy. Soc. London A: Math., Phys. Eng. Sci., vol. 367, no. 1906, pp. 4385–4405, 2009.
[13] X. Liu, A. Srivastava, and K. Gallivan, "Optimal linear representations of images for object recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2003, vol. 1, pp. I–I.
[14] G. Sundaramoorthi, P. Petersen, V. S. Varadarajan, and S. Soatto, "On the set of images modulo viewpoint and contrast changes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 832–839.
[15] F. Anselmi, L. Rosasco, and T. Poggio, "On invariance and selectivity in representation learning," Inf. Inference, vol. 5, no. 2, pp. 134–158, 2016.
[16] J. Bruna and S. Mallat, "Classification with scattering operators," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1561–1566.
[17] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," in Proc. 4th Int. Conf. Learn. Representations (ICLR) Workshop Track, 2016.
[18] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[19] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM J. Res. Develop., vol. 4, no. 1, pp. 66–82, 1960.
[20] M. Studený and J. Vejnarová, "The multiinformation function as a tool for measuring stochastic dependence," in Learning in Graphical Models. Berlin, Germany: Springer, 1998, pp. 261–297.
[21] G. V. Steeg, "Unsupervised learning via total correlation explanation," in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 5151–5155, https://doi.org/10.24963/ijcai.2017/740
[22] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, pp. 287–314, 1994.
[23] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Comput., vol. 7, pp. 1129–1159, 1995.
[24] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4743–4751.
[25] M. Abadi, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://tensorflow.org/
[26] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009, https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
[27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," in ICLR Workshop Track, 2015, arXiv:1412.6806, http://arxiv.org/abs/1412.6806

Alessandro Achille received the master's degree in pure math (with honors) from the Scuola Normale Superiore di Pisa and the University of Pisa. He is working towards the PhD degree at UCLA, where he joined Prof. Soatto's research group in 2015. His research interests include machine learning, variational inference and information theory, with particular focus on artificial and embodied intelligence.

Stefano Soatto received the DIng degree (highest honors) from the University of Padova, Italy, in 1992, and the PhD degree in control and dynamical systems from the California Institute of Technology, in 1996. He joined UCLA in 2000 after being an assistant and then an associate professor of electrical and biomedical engineering with Washington University, and a research associate in applied sciences with Harvard University. From 1995 to 1998, he was also ricercatore in the Department of Mathematics and Computer Science, University of Udine, Italy. His general research interests include computer vision and nonlinear estimation and control theory. In particular, he is interested in ways for computers to use sensory information (e.g. vision, sound, touch) to interact with humans and the environment.
