Information Dropout: Learning Optimal Representations Through Noisy Computation
Abstract—The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal
representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term,
which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the
common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout,
a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit
architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational
Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference.
Finally, we prove that we can promote the creation of optimal disentangled representations simply by enforcing a factorized prior, a fact
that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find
that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller
models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
Index Terms—Representation learning, deep learning, information bottleneck, nuisances, invariants, minimality
1 INTRODUCTION
7) We validate the theory with several experiments, including: improved insensitivity/invariance to nuisance factors using Information Dropout on (a) Cluttered MNIST [7] and (b) MNIST+CIFAR, a newly introduced dataset to test sensitivity to occlusion phenomena, which are critical in vision applications; (c) we show improved efficiency of Information Dropout compared to regular dropout for limited-capacity networks; (d) we show that Information Dropout favors disentangled representations; (e) we show that Information Dropout adapts to the data and allows different amounts of information to flow between different layers in a deep network (Section 8).

In the next section we introduce the basic formalism to make the above statements more precise, which we do in subsequent sections.

2 PRELIMINARIES

In the general supervised setting, we want to learn the conditional distribution p(y|x) of some random variable y, which we refer to as the task, given (samples of the) input data x. In typical applications, x is often high-dimensional (for example an image or a video), while y is low-dimensional, such as a label or a coarsely-quantized location. In such cases, a large part of the variability in x is actually due to nuisance factors that affect the data but are otherwise irrelevant for the task [1]. Since by definition these nuisance factors are not predictive of the task, they should be disregarded during the inference process. However, it often happens that modern machine learning algorithms, in part due to their high flexibility, fit spurious correlations, present in the training data, between the nuisances and the task, thus leading to poor generalization performance.

In view of this, [8] argue that the success of deep learning is in part due to the capability of neural networks to build incrementally better representations that expose the relevant variability, while at the same time discarding nuisances. This interpretation is intriguing, as it establishes a connection between machine learning, probabilistic inference, and information theory. However, common training practice does not seem to stem from this insight, and indeed deep networks may maintain, even in the top layers, dependencies on easily ignorable nuisances (see for example Fig. 2).

To bring the practice in line with the theory, and to better understand these connections, we introduce a modified cost function, which can be seen as an approximation of the Information Bottleneck Lagrangian of [2], and which encourages the creation of representations of the data that are increasingly disentangled and insensitive to the action of nuisances. We show that this loss can be minimized using a new layer, which we call Information Dropout, that allows the network to selectively introduce multiplicative noise in the layer activations, and thus to control the flow of information. As we show in various experiments, this method improves generalization performance by building better representations and preventing overfitting, and it considerably improves over binary dropout on smaller models, since, unlike dropout, Information Dropout also adapts the noise to the structure of the network and to the individual sample at test time.

Apart from the practical interest of Information Dropout, one of our main results is that Information Dropout can be seen as a generalization of several existing dropout methods, providing a unified framework to analyze them, together with some additional insights on empirical results. As we discuss in Section 3, the introduction of noise to prevent overfitting has already been studied from several points of view. For example, the original formulation of dropout of [3], which introduces binary multiplicative noise, was motivated as a way of efficiently training an ensemble of exponentially many networks that would be averaged at testing time. Kingma et al. [4] introduce Variational Dropout, a dropout method which closely resembles ours but is instead derived from a Bayesian analysis of neural networks. Information Dropout gives an alternative, information-theoretic interpretation of those methods.

As we show in Section 7, other than being very closely related to Variational Dropout, Information Dropout directly yields a variational autoencoder as a special case when the task is the reconstruction of the input. This result is in part expected, since our loss function seeks an optimal representation of the input for the task of reconstruction, and the representation given by the latent variables of a variational autoencoder fits the criteria. However, it still raises the question of exactly what, and how deep, the links between information theory, representation learning, variational inference and nuisance invariance are. This work can be seen as a small step in answering this question.

3 RELATED WORK

The main contribution of our work is to establish how two seemingly different areas of research, namely dropout methods to prevent overfitting and the study of optimal representations, can be linked through the Information Bottleneck principle.

Dropout was introduced by Srivastava et al. [3]. The original motivation was that by randomly dropping the activations during training, we can effectively train an ensemble of exponentially many networks, which are then averaged during testing, therefore reducing overfitting. Wang et al. [9] suggested that dropout could be seen as performing a Monte-Carlo approximation of an implicit loss function, and that instead of multiplying the activations by binary noise, as in the original dropout, multiplicative Gaussian noise with mean 1 can be used as a way of better approximating the implicit loss function. This led to comparable performance but faster training than binary dropout.

Kingma et al. [4] take a similar view of dropout as introducing multiplicative (Gaussian) noise, but instead study the problem from a Bayesian point of view. In this setting, given a training dataset D = {(x_i, y_i)}_{i=1,...,N} and a prior distribution p(w), we want to compute the posterior distribution p(w|D) of the weights w of the network. As is customary in variational inference, the true posterior can be approximated by minimizing the negative variational lower bound L(θ) of the marginal log-likelihood of the data,

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} E_{w \sim p_\theta(w|D)}\big[ -\log p(y_i \mid x_i, w) \big] + \frac{1}{N} KL\big( p_\theta(w|D) \,\|\, p(w) \big).    (1)
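To make Eq. (1) concrete, the following minimal sketch (ours, not the authors' code) estimates the bound by naive Monte Carlo for a single linear classification layer, assuming a factorized Gaussian posterior over its weights; the names, shapes, and the standard Gaussian prior are assumptions of the sketch.

import math
import torch
import torch.nn.functional as F

def variational_bound(x, y, mu, log_sigma, prior_std=1.0, n_samples=8):
    """x: (N, D) inputs, y: (N,) integer labels, mu/log_sigma: (D, C) posterior parameters."""
    nll = 0.0
    for _ in range(n_samples):
        # A fresh set of weights is sampled at every pass (strictly, it should be
        # resampled for every data point): this is the costly part.
        w = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        nll = nll + F.cross_entropy(x @ w, y, reduction="mean")
    nll = nll / n_samples  # Monte-Carlo estimate of the average -log p(y_i | x_i, w)
    # Closed-form KL(N(mu, sigma^2) || N(0, prior_std^2)), summed over all weights.
    sigma2 = torch.exp(2.0 * log_sigma)
    kl = 0.5 * torch.sum(sigma2 / prior_std**2 + mu**2 / prior_std**2
                         - 1.0 - 2.0 * log_sigma + 2.0 * math.log(prior_std))
    return nll + kl / x.shape[0]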
This minimization is difficult to perform, since it requires repeatedly sampling new weights for each sample of the dataset. As an alternative, [4] suggest that the uncertainty about the weights that is expressed by the posterior distribution p_θ(w|D) can equivalently be encoded as multiplicative noise in the activations of the layers (the so-called local reparametrization trick). As we will see in the following sections, this loss function closely resembles the one of Information Dropout, which, however, is derived from a purely information-theoretic argument based on the Information Bottleneck principle. One difference is that we allow the parameters of the noise to change on a per-sample basis (which, as we show in the experiments, can be useful to deal with nuisances), and that we allow a scaling constant β in front of the KL-divergence term, which can be changed freely. Interestingly, even if the Bayesian derivation does not allow a rescaling of the KL-divergence, Kingma et al. notice that choosing a different scale for the KL-divergence term can indeed lead to improvements in practice. A related method, but derived from an information-theoretic perspective, was also suggested previously by [10].

The interpretation of deep neural networks as a way of creating successively better representations of the data has already been suggested and explored by many. Most recently, Tishby et al. [8] put forth an interpretation of deep neural networks as creating sufficient representations of the data that are increasingly minimal. In simultaneous work, [11] approximate the information bottleneck similarly to us, but focus on an empirical analysis of robustness to adversarial perturbations rather than tackling disentanglement, invariance and minimality analytically.

Sufficient dimensionality reduction [12] and Optimal Component Analysis [13] follow an idea similar to ours, in that they focus on finding the smallest (usually linear) statistic of the data that is sufficient for a given task. However, while they define "small" in terms of the dimension of the representation, we focus on finding a (non-linear) representation with minimal information content, whose dimension can, in fact, be even larger than that of the original data. By allowing large non-linear representations, we can exploit the full representational power of deep networks, while the minimality of the information content still promotes nuisance invariance and prevents overfitting. Our framework also has connections with Independent Component Analysis (ICA), which we discuss further in Section 7.

Some have focused on creating representations that are maximally invariant to nuisances, especially when they have the structure of a (possibly infinite-dimensional) group acting on the data, like [14], or, when the nuisance is a locally compact group acting on each layer, by successive approximations implemented by hierarchical convolutional architectures, like [15] and [16]. In these cases, which cover common nuisances such as translations and rotations of an image (affine group), or small diffeomorphic deformations due to a slight change of point of view (group of diffeomorphisms), the representation is equivalent to the data modulo the action of the group. However, when the nuisances are not a group, as is the case for occlusions, it is not possible to achieve such equivalence, that is, there is a loss. To address this problem, [1] defined optimal representations not in terms of maximality, but in terms of sufficiency, and characterized representations that are both sufficient and invariant. They argue that the management of nuisance factors common in visual data, such as changes of viewpoint, local deformations, and changes of illumination, is directly tied to the specific structure of deep convolutional networks, where local marginalization of simple nuisances at each layer results in marginalization of complex nuisances in the network as a whole.

Our work fits in this last line of thinking, where the goal is not equivalence to the data up to the action of (group) nuisances, but instead sufficiency for the task. Our main contribution in this sense is to show that injecting noise into the layers, and therefore using a non-deterministic function of the data, can actually simplify the theoretical analysis and lead to disentangling and improved insensitivity to nuisances. This is an alternate explanation to that put forth by the references above.

4 OPTIMAL REPRESENTATIONS AND THE INFORMATION BOTTLENECK LOSS

Given some input data x, we want to compute some (possibly nondeterministic) function of x, called a representation, that has some desirable properties in view of the task y, for instance by being more convenient to work with, exposing relevant statistics, or being easier to store. Ideally, we want this representation to be as good as the original data for the task, and not to squander resources modeling parts of the data that are irrelevant to the task. Formally, this means that we want to find a random variable z satisfying the following conditions:

i) z is a representation of x; that is, its distribution depends only on x, as expressed by the Markov chain y → x → z;
ii) z is sufficient for the task y, that is, I(x; y) = I(z; y), as expressed by the Markov chain x → z → y;
iii) among all random variables satisfying these requirements, the mutual information I(x; z) is minimal. This means that z discards all variability in the data that is not relevant to the task.

Using the identity I(x; y) − I(z; y) = I(x; y|z), where I denotes the mutual information, it is easy to see that the above conditions are equivalent to finding a distribution p(z|x) which solves the optimization problem

\text{minimize}\; I(x; z) \quad \text{subject to}\; I(x; y \mid z) = 0.

The minimization above is difficult in general. For this reason, Tishby et al. have introduced a generalization known as the Information Bottleneck Principle and the associated Lagrangian to be minimized [2]:

L = I(x; y \mid z) + \beta\, I(x; z).
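As a concrete illustration (ours, not from the paper), the quantities entering this Lagrangian can be computed exactly for a small discrete example, using the identity I(x; y|z) = I(x; y) − I(z; y) introduced above; the joint distribution and the encoder below are arbitrary toy choices.

import numpy as np

def mutual_info(pab):
    # Mutual information (in nats) of a joint distribution given as a 2-D array.
    pa = pab.sum(1, keepdims=True)
    pb = pab.sum(0, keepdims=True)
    nz = pab > 0
    return float((pab[nz] * np.log(pab[nz] / (pa @ pb)[nz])).sum())

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])            # binary x and y, positively correlated
p_z_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])     # a noisy one-bit encoder z of x

p_xyz = p_xy[:, :, None] * p_z_given_x[:, None, :]   # p(x, y, z) under y -> x -> z
I_xz = mutual_info(p_xyz.sum(axis=1))                # I(x; z)
I_xy_given_z = mutual_info(p_xy) - mutual_info(p_xyz.sum(axis=0))  # I(x;y) - I(z;y)
beta = 0.5
L = I_xy_given_z + beta * I_xz   # the IB Lagrangian for this particular encoder

Sweeping the encoder's noise level makes the trade-off that β controls visible: noisier encoders decrease I(x; z) but increase I(x; y|z).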
When y is a discrete random variable, such as a label, as we will often assume throughout this work, we can further use the identity I(x; y|z) = H(y|z) − H(y|x) and the fact that H(y|x) is constant to obtain the equivalent Lagrangian

L = H(y \mid z) + \beta\, I(x; z),    (2)

where β is a positive constant that manages the trade-off between sufficiency (the performance on the task, as measured by the first term) and minimality (the complexity of the representation, measured by the second term). It is easy to see that, in the limit β → 0⁺, this is equivalent to the original problem, where z is a minimal sufficient statistic. When all random variables are discrete and z = T(x) is a deterministic function of x, the algorithm proposed by [2] can be used to minimize the IB Lagrangian efficiently. However, no algorithm is known to minimize the IB Lagrangian for non-Gaussian, high-dimensional continuous random variables.

One of our key results is that, when we restrict to the family of distributions obtained by injecting noise into one layer of a neural network, we can efficiently approximate and minimize the IB Lagrangian.¹ As we will show, this process can be effectively implemented through a generalization of the dropout layer that we call Information Dropout.

¹ Since we restrict the family of distributions, there is no guarantee that the resulting representation will be optimal. We can, however, iterate the process to obtain incrementally improved approximations.

To set the stage, we rewrite the IB Lagrangian as a per-sample loss function. Let p(x, y) denote the true distribution of the data, from which the training set {(x_i, y_i)}_{i=1,...,N} is sampled, and let p_θ(z|x) and p_θ(y|z) denote the unknown distributions that we wish to estimate, parametrized by θ. Then, we can write the two terms of the IB Lagrangian as

H(y \mid z) \simeq E_{x,y \sim p(x,y)}\, E_{z \sim p_\theta(z|x)}\big[ -\log p_\theta(y \mid z) \big],
I(x; z) = E_{x \sim p(x)}\big[ KL\big( p_\theta(z|x) \,\|\, p_\theta(z) \big) \big],

where KL denotes the Kullback-Leibler divergence. We can therefore approximate the IB Lagrangian empirically as

L = \frac{1}{N} \sum_{i=1}^{N} \Big( E_{z \sim p_\theta(z|x_i)}\big[ -\log p_\theta(y_i \mid z) \big] + \beta\, KL\big( p_\theta(z|x_i) \,\|\, p_\theta(z) \big) \Big).    (3)

Notice that the first term is simply the average cross-entropy, which is the most commonly used loss function in deep learning. The second term can then be seen as a regularization term. In fact, many classical regularizers, like the L2 penalty, can be expressed in the form of Eq. (3) (see also [17]). In this work, we interpret the KL term as a regularizer that penalizes the transfer of information from x to z. In the next section, we discuss ways to control such information transfer through the injection of noise.

Remark (Deterministic versus stochastic representations). Aside from being easier to work with, stochastic representations can attain a lower value of the IB Lagrangian than any deterministic representation. For example, consider the task of reconstructing a single random bit y given a noisy observation x. The only deterministic representations are equivalent either to the noisy observation itself or to the trivial constant map. It is not difficult to check that, for opportune values of β and of the noise, neither realizes the optimal trade-off reached by a suitable stochastic representation.

Remark (Approximate sufficiency). The quantity I(x; y|z) = H(y|z) − H(y|x) ≥ 0 can be seen as a measure of the distance between p(x, y, z) and the closest distribution q(x, y, z) such that x → z → y is a Markov chain. Therefore, by minimizing Eq. (2) we find representations that are increasingly "more sufficient", meaning that they are closer to an actual Markov chain.

5 DISENTANGLEMENT

In addition to sufficiency and minimality, "disentanglement of hidden factors" is often cited as a desirable property of a representation [18], but it is seldom formalized. We may think that the observed data is generated by a complex interplay of independent causes, or factors. Ideally, the components of the learned representation should capture these independent factors by disentangling the correlations in the observed data. We can then quantify disentanglement by measuring the total correlation [19], also known as multi-information [20],² defined as

TC(z) := KL\big( q(z) \,\|\, \textstyle\prod_j q_j(z_j) \big).

² As pointed out by a reviewer, multi-information would be a more appropriate name for this quantity. We chose to use Total Correlation both for historical reasons, after its introduction in [19], and to emphasize the relation with disentanglement also in recent work on unsupervised learning [21]. Other measures of independence are of course possible. Total Correlation has the advantage of being enforced naturally when optimizing other information-theoretic quantities.

Notice that the components of z are mutually independent if and only if TC(z) is zero. Adding this as a penalty in the IB Lagrangian, with a factor γ, yields

L = \frac{1}{N} \sum_{i=1}^{N} E_{z \sim p(z|x_i)}\big[ -\log p(y_i \mid z) \big] + \beta\, KL\big( p_\theta(z|x_i) \,\|\, p_\theta(z) \big) + \gamma\, TC(z).    (4)

In general, minimizing this augmented loss is intractable, since to compute both the KL term and the total correlation we need to know the marginal distribution p_θ(z), which is not easily computable. However, the following proposition, which we prove in Appendix B (available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2784440), shows that if we choose γ = β, then the problem simplifies and can be easily solved by adding an auxiliary variable.

Proposition 1. The minimization problem

\min_{p}\; \frac{1}{N} \sum_{i=1}^{N} E_{z \sim p(z|x_i)}\big[ -\log p(y_i \mid z) \big] + \beta\, \big\{ KL\big( p(z|x_i) \,\|\, p(z) \big) + TC(z) \big\}

is equivalent to the following minimization in two variables

\min_{p,\,q}\; \frac{1}{N} \sum_{i=1}^{N} E_{z \sim p(z|x_i)}\big[ -\log p(y_i \mid z) \big] + \beta\, KL\Big( p(z|x_i) \,\Big\|\, \textstyle\prod_{j=1}^{|z|} q_j(z_j) \Big).
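For intuition, we add a short proof sketch here (the paper's full argument is in Appendix B). For any factorized q, the penalty of the second problem decomposes as

E_x\, KL\big( p(z|x) \,\|\, \textstyle\prod_j q_j(z_j) \big) = E_x\, KL\big( p(z|x) \,\|\, p(z) \big) + TC(z) + \sum_j KL\big( p_j(z_j) \,\|\, q_j(z_j) \big),

where E_x denotes the same average over the training samples x_i, p(z) = E_x[p(z|x)] is the aggregate marginal of the representation, and p_j(z_j) is the marginal of its j-th component. Minimizing over q sets q_j = p_j, so the last term vanishes and exactly the KL-plus-total-correlation penalty of the first problem remains.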
In other words, minimizing the standard IB Lagrangian assuming that the activations are independent, i.e., having q(z) = ∏_j q_j(z_j), is equivalent to enforcing disentanglement of the hidden factors. It is interesting to note that this independence assumption is already often adopted by practitioners on grounds of simplicity, since the actual marginal p(z) = ∫ p(x, z) dx is often incomputable. That using a factorized model results in "disentanglement" was also observed empirically by [6], which, however, introduced an ad-hoc metric based on classifiers of low VC-dimension, rather than the more natural Total Correlation adopted here.

In view of the previous proposition, from now on we will assume that the activations are independent and ignore the total correlation term.

6 INFORMATION DROPOUT

Guided by the analysis in the previous sections, and to emphasize the role of stochasticity, we consider representations z obtained by computing a deterministic map f(x) of the data (for instance a sequence of convolutional and/or fully-connected layers of a neural network), and then multiplying the result component-wise by a random sample drawn from a parametric noise distribution p_α with unit mean and a variance that depends on the input x:

\varepsilon \sim p_{\alpha(x)}(\varepsilon), \qquad z = \varepsilon \odot f(x),

where ⊙ denotes the element-wise product. Notice that, if p_{α(x)}(ε) is a Bernoulli distribution rescaled to have mean 1, this reduces exactly to the classic binary dropout layer. As we discussed in Section 3, there are also variants of dropout that use different distributions.

A natural choice for the distribution p_{α(x)}(ε), which also simplifies the theoretical analysis, is the log-normal distribution p_{α(x)}(ε) = log N(0, α²_θ(x)). Once we fix this noise distribution, given the above expression for z, we can easily compute the distribution p_θ(z|x) that appears in Eq. (3). However, to be able to compute the KL-divergence term, we still need to fix a prior distribution q_θ(z). The choice of this prior largely depends on the expected distribution of the activations f(x). Recall that, by Section 5, we can assume that all activations are independent, thus simplifying the computation. Now, we concentrate on two of the most common activation functions: the rectified linear unit (ReLU), which is easy to compute and works well in practice, and the Softplus function, which can be seen as a strictly positive and differentiable approximation of ReLU.

A network implemented using only ReLU and a final Softmax layer has the remarkable property of being scale-invariant, meaning that multiplying all weights, biases, and activations by a constant does not change the final result. Therefore, from a theoretical point of view, it would be desirable to use a scale-invariant prior. The only such prior is the improper log-uniform, q(log(z)) = c, or equivalently q(z) = c/z, which was also suggested by [4], but as a prior for the weights of the network rather than for the activations.

Fig. 1. Comparison of the empirical distribution p(z) of the post-noise activations with our proposed prior when using: (a) ReLU activations, for which we propose a log-uniform prior, and (b) Softplus activations, for which we propose a log-normal prior. In both cases, the empirical distribution approximately follows the proposed prior. Both histograms were obtained from the last dropout layer of the All-CNN-32 network described in Table 2, trained on CIFAR-10.

Since the ReLU activations are frequently zero, we also assume q(z = 0) = q_0 for some constant 0 ≤ q_0 ≤ 1. Therefore, the final prior has the form q(z) = q_0 δ_0(z) + c/z, where δ_0 is the Dirac delta in zero. In Fig. 1a, we compare this prior distribution with the actual empirical distribution p(z) of a network with ReLU activations.

In a network implemented using Softplus activations, a log-normal is a good fit of the distribution of the activations. This is to be expected, especially when using batch normalization, since the pre-activations will approximately follow a normal distribution with zero mean, and the Softplus approximately resembles a scaled exponential near zero. Therefore, in this case we suggest using a log-normal distribution as our prior q(z). In Fig. 1b, we compare this prior with the empirical distribution p(z) of a network with Softplus activations.

Using these priors, we can finally compute the KL-divergence term in Eq. (3) for both ReLU activations and Softplus activations. We prove the following two propositions in Appendix A, available online.

Proposition 2 (Information Dropout cost for ReLU). Let z = ε ⊙ f(x), where ε ∼ p_α(ε), and assume p(z) = q_0 δ_0(z) + c/z. Then, assuming f(x) ≠ 0, we have

KL\big( p_\theta(z|x) \,\|\, p(z) \big) = -H\big( p_{\alpha(x)}(\log \varepsilon) \big) + \log c.

In particular, if p_α(ε) is chosen to be the log-normal distribution p_α(ε) = log N(0, α²_θ(x)), we have

KL\big( p_\theta(z|x) \,\|\, p(z) \big) = -\log \alpha_\theta(x) + \mathrm{const}.    (5)

If instead f(x) = 0, we have

KL\big( p_\theta(z|x) \,\|\, p(z) \big) = -\log q_0.

Proposition 3 (Information Dropout cost for Softplus). Let z = ε ⊙ f(x), where ε ∼ p_α(ε) = log N(0, α²_θ(x)), and assume p_θ(z) = log N(μ, σ²). Then, we have

KL\big( p_\theta(z|x) \,\|\, p(z) \big) = \frac{1}{2\sigma^2}\big( \alpha^2(x) + \mu^2 \big) - \log\frac{\alpha(x)}{\sigma} - \frac{1}{2}.    (6)

Substituting the expression for the KL divergence in Eq. (5) inside Eq. (3), and ignoring for simplicity the special case f(x) = 0, we obtain the following loss function for ReLU activations

L = \frac{1}{N} \sum_{i=1}^{N} \Big( E_{z \sim p_\theta(z|x_i)}\big[ -\log p_\theta(y_i \mid z) \big] - \beta \log \alpha_\theta(x_i) \Big) + \mathrm{const}.
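To make the construction concrete, here is a minimal sketch (ours, not the authors' implementation) of an Information Dropout layer for the ReLU/log-uniform case; the 1×1 convolutional head that predicts α_θ(x), the cap max_alpha, the small floor on α, and the batch averaging are illustrative assumptions of the sketch.

import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    def __init__(self, channels, max_alpha=0.7):
        super().__init__()
        # Small head that predicts the per-sample, per-unit noise level alpha(x).
        self.alpha_head = nn.Conv2d(channels, channels, kernel_size=1)
        self.max_alpha = max_alpha

    def forward(self, f_x):
        # f_x: post-ReLU activations of the preceding block, shape (N, C, H, W).
        alpha = self.max_alpha * torch.sigmoid(self.alpha_head(f_x)) + 1e-3
        if self.training:
            # eps ~ logN(0, alpha^2): multiplicative log-normal noise, per unit.
            eps = torch.exp(alpha * torch.randn_like(f_x))
            z = f_x * eps
        else:
            # At test time we simply propagate the noiseless activations.
            z = f_x
        # Per-sample information cost of Eq. (5), up to an additive constant:
        # sum of -log alpha over units, averaged over the batch.
        info_cost = (-torch.log(alpha)).flatten(1).sum(dim=1).mean()
        return z, info_cost

During training, each layer returns its post-noise activations together with its information cost; the total objective is then the average cross-entropy plus β times the sum of these costs over all Information Dropout layers, mirroring the loss above.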
TABLE 1
Average Variational Lower-Bound L on the Testing Dataset for a Simple VAE, Where the Size of the Latent Variable z Is 256 k and the Encoder/Decoder Each Contain 512 k Hidden Units

k    Gaussian    Information
1    98.8        100.0
2    99.0        99.1
3    98.7        99.1
So, what may be done by necessity in some computational systems (noisy computation) turns out to be beneficial towards achieving invariance and minimality. Analogously, what has been done for convenience (assuming a factorized prior) turns out to be beneficial towards achieving "disentanglement."

Another interpretation of Information Dropout is as a way of biasing the network towards reconstructing representations of the data that are compatible with a Markov chain generative model, making it more suited to data coming from hierarchical models; in this sense it is complementary to architectural constraints, such as convolutions, that instead bias the model toward geometric tasks.

It should be noticed that injecting multiplicative noise into the activations can be thought of as a particular choice of a class of minimizers of the loss function, but it can also be interpreted as a regularization term added to the cost function, or as a particular procedure utilized to carry out the optimization. So the same operation can be interpreted as any of the three key ingredients of the optimization: the function to be minimized, the family over which to minimize, and the procedure with which to minimize. This highlights the intimate interplay between the choice of models and algorithms in deep learning.

ACKNOWLEDGMENTS

Work supported by ARO, ONR, AFOSR. We are very grateful to the reviewers for their thorough analysis of the paper. In particular, we would like to thank one of the anonymous reviewers for providing an elegant alternative proof of Proposition 1, as well as thoughtful critiques of the theorems. Dedicated to Naftali Tishby on the occasion of the conference Information, Control and Learning held in his honor in Jerusalem, September 26-28, 2016. Registered as Tech Report UCLA-CSD160009 and arXiv:1611.01353 on November 6, 2016.

REFERENCES

[1] S. Soatto and A. Chiuso, "Visual representations: Defining properties and deep approximations," in Proc. Int. Conf. Learn. Representations, May 2016.
[2] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proc. 37th Annu. Allerton Conf. Commun., Control Comput., 1999.
[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[4] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2575–2583.
[5] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. 2nd Int. Conf. Learn. Representations, 2014.
[6] I. Higgins, et al., "beta-VAE: Learning basic visual concepts with a constrained variational framework," in Proc. Int. Conf. Learn. Representations, 2017.
[7] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
[8] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in Proc. IEEE Inf. Theory Workshop, 2015, pp. 1–5.
[9] S. Wang and C. Manning, "Fast dropout training," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 118–126.
[10] G. E. Hinton and D. Van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in Proc. 6th Annu. Conf. Comput. Learn. Theory, 1993, pp. 5–13.
[11] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," in Proc. Int. Conf. Learn. Representations, 2017.
[12] K. P. Adragni and R. D. Cook, "Sufficient dimension reduction and prediction in regression," Philosoph. Trans. Roy. Soc. London A: Math., Phys. Eng. Sci., vol. 367, no. 1906, pp. 4385–4405, 2009.
[13] X. Liu, A. Srivastava, and K. Gallivan, "Optimal linear representations of images for object recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2003, vol. 1, pp. I–I.
[14] G. Sundaramoorthi, P. Petersen, V. S. Varadarajan, and S. Soatto, "On the set of images modulo viewpoint and contrast changes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 832–839.
[15] F. Anselmi, L. Rosasco, and T. Poggio, "On invariance and selectivity in representation learning," Inf. Inference, vol. 5, no. 2, pp. 134–158, 2016.
[16] J. Bruna and S. Mallat, "Classification with scattering operators," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1561–1566.
[17] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," in Proc. 4th Int. Conf. Learn. Representations (ICLR) Workshop Track, 2016.
[18] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[19] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM J. Res. Develop., vol. 4, no. 1, pp. 66–82, 1960.
[20] M. Studený and J. Vejnarová, "The multiinformation function as a tool for measuring stochastic dependence," in Learning in Graphical Models. Berlin, Germany: Springer, 1998, pp. 261–297.
[21] G. V. Steeg, "Unsupervised learning via total correlation explanation," in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 5151–5155, https://doi.org/10.24963/ijcai.2017/740
[22] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, pp. 287–314, 1994.
[23] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Comput., vol. 7, pp. 1129–1159, 1995.
[24] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4743–4751.
[25] M. Abadi, et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://tensorflow.org/
[26] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009, https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
[27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," in ICLR Workshop Track, 2015, arXiv:1412.6806, http://arxiv.org/abs/1412.6806

Alessandro Achille received the master's degree in pure math (with honors) from the Scuola Normale Superiore di Pisa and the University of Pisa. He is working towards the PhD degree at UCLA, where he joined Prof. Soatto's research group in 2015. His research interests include machine learning, variational inference and information theory, with particular focus on artificial and embodied intelligence.

Stefano Soatto received the DIng degree (highest honors) from the University of Padova, Italy, in 1992, and the PhD degree in control and dynamical systems from the California Institute of Technology in 1996. He joined UCLA in 2000 after being an assistant and then an associate professor of electrical and biomedical engineering with Washington University, and a research associate in applied sciences with Harvard University. From 1995 to 1998, he was also ricercatore in the Department of Mathematics and Computer Science, University of Udine, Italy. His general research interests include computer vision and nonlinear estimation and control theory. In particular, he is interested in ways for computers to use sensory information (e.g., vision, sound, touch) to interact with humans and the environment.