
Hyperspherical Variational Auto-Encoders

Tim R. Davidson∗ Luca Falorsi∗ Nicola De Cao∗ Thomas Kipf Jakub M. Tomczak

University of Amsterdam

Abstract

The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or S-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, N-VAE, in low dimensions on other data types.

1 INTRODUCTION

First introduced by Kingma and Welling (2014a); Rezende et al. (2014), the Variational Auto-Encoder (VAE) is an unsupervised generative model that presents a principled fashion for performing variational inference using an auto-encoding architecture. Applying the non-centered parameterization of the variational posterior (Kingma and Welling, 2014b) further simplifies sampling and allows to reduce bias in calculating gradients for training. Although the default choice of a Gaussian prior is mathematically convenient, we can show through a simple example that in some cases it breaks the assumption of an uninformative prior leading to unstable results. Imagine a dataset on the circle Z ⊂ S^1, that is subsequently embedded in R^N using a transformation f to obtain f : Z → X ⊂ R^N. Given two hidden units, an autoencoder quickly discovers the latent circle, while a normal VAE becomes highly unstable. This is to be expected as a Gaussian prior is concentrated around the origin, while the KL-divergence tries to reconcile the differences between S^1 and R^2.

The fact that some data types like directional data are better explained through spherical representations is long known and well-documented (Mardia, 1975; Fisher et al., 1987), with examples spanning from protein structure, to observed wind directions. Moreover, for many modern problems such as text analysis or image classification, data is often first normalized in a preprocessing step to focus on the directional distribution. Yet, few machine learning methods explicitly account for the intrinsically spherical nature of some data in the modeling process. In this paper, we propose to use the von Mises-Fisher (vMF) distribution as an alternative to the Gaussian distribution. This replacement leads to a hyperspherical latent space as opposed to a hyperplanar one, where the Uniform distribution on the hypersphere is conveniently recovered as a special case of the vMF. Hence this approach allows for a truly uninformative prior, and has a clear advantage in the case of data with a hyperspherical interpretation. This was previously attempted by Hasnat et al. (2017), but crucially they do not learn the concentration parameter around the mean, κ.

In order to enable training of the concentration parameter, we extend the reparameterization trick for rejection sampling as recently outlined in Naesseth et al. (2017) to allow for n additional transformations. We then combine this with the rejection sampling procedure proposed by Ulrich (1984) to efficiently reparameterize the VAE¹.

We demonstrate the utility of replacing the normal distribution with the von Mises-Fisher distribution for generating latent representations by conducting a range of experiments in three distinct settings.

∗ Equal contribution. Correspondence to: Nicola De Cao <[email protected]>.
¹ Code freely available on: https://github.com/nicola-decao/s-vae
First, we show that our S-VAEs outperform VAEs with the Gaussian variational posterior (N-VAEs) in recovering a hyperspherical latent structure. Second, we conduct a thorough comparison with N-VAEs on the MNIST dataset through an unsupervised learning task and a semi-supervised learning scenario. Finally, we show that S-VAEs can significantly improve link prediction performance on citation network datasets in combination with a Variational Graph Auto-Encoder (VGAE) (Kipf and Welling, 2016).

2 VARIATIONAL AUTO-ENCODERS

2.1 FORMULATION

In the VAE setting, we have a latent variable model for data, where z ∈ R^M denotes latent variables, x is a vector of D observed variables, and p_φ(x, z) is a parameterized model of the joint distribution. Our objective is to optimize the log-likelihood of the data, log ∫ p_φ(x, z) dz. When p_φ(x, z) is parameterized by a neural network, marginalizing over the latent variables is generally intractable. One way of solving this issue is to maximize the Evidence Lower Bound (ELBO)

log ∫ p_φ(x, z) dz ≥ E_{q(z)}[log p_φ(x|z)] − KL(q(z) || p(z)),   (1)

where q(z) is the approximate posterior distribution, belonging to a family Q. The bound is tight if q(z) = p(z|x), meaning q(z) is optimized to approximate the true posterior. While in theory q(z) should be optimized for every data point x, to make inference more scalable to larger datasets the VAE setting introduces an inference network q_ψ(z|x; θ) parameterized by a neural network that outputs a probability distribution for each data point x. The final objective is therefore to maximize

L(φ, ψ) = E_{q_ψ(z|x;θ)}[log p_φ(x|z)] − KL(q_ψ(z|x; θ) || p(z)),   (2)

In the original VAE both the prior and the posterior are defined as normal distributions. We can further efficiently approximate the ELBO by Monte Carlo estimates, using the reparameterization trick (Kingma and Welling, 2014a; Rezende et al., 2014). This is done by expressing a sample of z ∼ q_ψ(z|x; θ) as z = h(θ, ε, x), where h is a reparameterization transformation and ε ∼ s(ε) is some noise random variable independent from θ.
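As a concrete illustration of Equations 1-2 and the reparameterization trick, the following minimal NumPy sketch (our own illustration, not the authors' implementation) computes a single-sample Monte Carlo estimate of the ELBO for a Gaussian approximate posterior; the encoder and decoder are stand-in callables:

    import numpy as np

    def gaussian_elbo_estimate(x, encode, decode_logp, rng):
        # Single-sample ELBO estimate (Eq. 1-2) for q(z|x) = N(mu, diag(sigma^2))
        # with a standard normal prior, using z = mu + sigma * eps, eps ~ N(0, I).
        mu, log_var = encode(x)                      # parameters of q_psi(z|x)
        eps = rng.standard_normal(mu.shape)          # eps ~ s(eps)
        z = mu + np.exp(0.5 * log_var) * eps         # reparameterized sample
        rec = decode_logp(x, z)                      # log p_phi(x|z), one sample
        # analytic KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
        kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
        return rec - kl

In the Gaussian case the KL term is available in closed form; the rest of the paper is concerned with obtaining the analogous quantities when the approximate posterior is a vMF distribution.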
2.2 THE LIMITATIONS OF A GAUSSIAN DISTRIBUTION PRIOR

Low dimensions: origin gravity  In low dimensions, the Gaussian density presents a concentrated probability mass around the origin, encouraging points to cluster in the center. This is particularly problematic when the data is divided into multiple clusters. Although an ideal latent space should separate clusters for each class, the normal prior will encourage all the cluster centers towards the origin. An ideal prior would only stimulate the variance of the posterior without forcing its mean to be close to the center. A prior satisfying these properties is a uniform over the entire space. Such a uniform prior, however, is not well defined on the hyperplane.

High dimensions: soap bubble effect  It is a well-known phenomenon that the standard Gaussian distribution in high dimensions tends to resemble a uniform distribution on the surface of a hypersphere, with the vast majority of its mass concentrated on the hyperspherical shell. Hence it would appear interesting to compare the behavior of a Gaussian approximate posterior with an approximate posterior already naturally defined on the hypersphere. This is also motivated from a theoretical point of view, since the Gaussian definition is based on the L2 norm that suffers from the curse of dimensionality.

2.3 BEYOND THE HYPERPLANE

Once we let go of the hyperplanar assumption, the possibility of a uniform prior on the hypersphere opens up. Mirroring our discussion in the previous subsection, such a prior would exhibit no pull towards the origin allowing clusters of data to evenly spread over the surface with no directional bias. Additionally, in higher dimensions, the cosine similarity is a more meaningful distance measure than the Euclidean norm.

Manifold mapping  In general, exploring VAE models that allow a mapping to distributions in a latent space not homeomorphic to R^D is of fundamental interest. Consider data lying in a small M-dimensional manifold M, embedded in a much higher dimensional space X = R^N. For most real data, this manifold will likely not be homeomorphic to R^M. An encoder can be considered as a smooth map enc : X → Z = R^D from the original space to Z. The restriction of the encoder to M, enc|_M : M → Z, will also be a smooth mapping. However since M is not homeomorphic to Z if D ≤ M, then enc|_M cannot be a homeomorphism. That is, there exists no invertible and globally continuous mapping between the coordinates of M and the ones of Z. Conversely if D > M then M can be smoothly embedded in Z for D sufficiently big², such that enc|_M : M → enc|_M(M) =: emb(M) ⊂ Z is a homeomorphism, where emb(M) denotes the embedding of M.

² By the Whitney embedding theorem any smooth real M-dimensional manifold can be smoothly embedded in R^{2M}.
Figure 1: Plots of the original latent space (a) and learned latent space representations in different settings: (a) Original, (b) Autoencoder, (c) N-VAE, (d) N-VAE with β = 0.1, (e) S-VAE, where β is a re-scaling factor for weighting the KL divergence. (Best viewed in color)

Yet, since D > M, when taking random points in the latent space they will most likely not be in emb(M), resulting in a poorly reconstructed sample.

The VAE tries to solve this problem by forcing M to be mapped into an approximate posterior distribution that has support in the entire Z. Clearly, this approach is bound to fail since the two spaces have a fundamentally different structure. This can likely produce two behaviors: first, the VAE could just smooth the original embedding emb(M) leaving most of the latent space empty, leading to bad samples. Second, if we increase the KL term the encoder will be pushed to occupy all the latent space, but this will create instability and discontinuity, affecting the convergence of the model. To validate our intuition we performed a small proof of concept experiment using M = S^1, which is visualized in Figure 1. Note that as expected the auto-encoder in Figure 1(b) mostly recovers the original latent space of Figure 1(a) as there are no distributional restrictions. In Figure 1(c) we clearly observe for the N-VAE that points collapse around the origin due to the KL, which is much less pronounced in Figure 1(d) when its contribution is scaled down. Lastly, the S-VAE almost perfectly recovers the original circular latent space. The observed behavior confirms our intuition.

To solve this problem the best option would be to directly specify a Z homeomorphic to M and distributions on M. However, for real data discovering the structure of M will often be a difficult inference task. Nevertheless, we believe this shows that investigating VAE architectures that map to posterior distributions defined on manifolds different than the Euclidean space is a topic worth to be explored. In that sense, this work represents an initial step in this research direction.

3 REPLACING GAUSSIAN WITH VON MISES-FISHER

3.1 VON MISES-FISHER DISTRIBUTION

The von Mises-Fisher (vMF) distribution is often described as the Normal Gaussian distribution on a hypersphere. Analogous to a Gaussian, it is parameterized by µ ∈ R^m indicating the mean direction, and κ ∈ R_{≥0} the concentration around µ. For the special case of κ = 0, the vMF represents a Uniform distribution. The probability density function of the vMF distribution for a random unit vector z ∈ R^m (or z ∈ S^{m−1}) is then defined as

q(z|µ, κ) = C_m(κ) exp(κ µ^T z),   (3)
C_m(κ) = κ^{m/2−1} / ( (2π)^{m/2} I_{m/2−1}(κ) ),   (4)

where ||µ||_2 = 1, C_m(κ) is the normalizing constant, and I_v denotes the modified Bessel function of the first kind at order v.

3.2 KL DIVERGENCE

As previously emphasized, one of the main advantages of using the vMF distribution as an approximate posterior is that we are able to place a uniform prior on the latent space. The KL divergence term KL(vMF(µ, κ) || U(S^{m−1})) to be optimized is:

κ I_{m/2}(κ) / I_{m/2−1}(κ) + log C_m(κ) − log( 2 π^{m/2} / Γ(m/2) )^{−1},   (5)

see Appendix B for the complete derivation. Notice that since the KL term does not depend on µ, this is only optimized in the reconstruction term. The above expression cannot be handled by automatic differentiation packages because of the modified Bessel function in C_m(κ). Thus, to optimize this term we derive the gradient with respect to the
concentration parameter ∇_κ KL(vMF(µ, κ) || U(S^{m−1})):

(1/2) κ [ I_{m/2+1}(κ) / I_{m/2−1}(κ) − I_{m/2}(κ) ( I_{m/2−2}(κ) + I_{m/2}(κ) ) / I_{m/2−1}(κ)^2 + 1 ],   (6)

where the modified Bessel functions can be computed without numerical instabilities using the exponentially scaled modified Bessel function.
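For illustration, a small SciPy sketch (our own, not the reference implementation) evaluates Equation 5 and Equation 6 with the exponentially scaled Bessel function ive(v, κ) = I_v(κ)·exp(−κ), and checks the gradient against a finite difference; the function names are ours:

    import numpy as np
    from scipy.special import ive, gammaln

    def vmf_kl_to_uniform(m, kappa):
        # KL( vMF(mu, kappa) || U(S^{m-1}) ), Eq. (5); Bessel ratios computed
        # with ive for numerical stability (the exp(-kappa) factors cancel).
        ratio = ive(m / 2, kappa) / ive(m / 2 - 1, kappa)
        log_c = ((m / 2 - 1) * np.log(kappa) - (m / 2) * np.log(2 * np.pi)
                 - (np.log(ive(m / 2 - 1, kappa)) + kappa))      # log C_m(kappa)
        log_uniform = -(np.log(2.0) + (m / 2) * np.log(np.pi) - gammaln(m / 2))
        return kappa * ratio + log_c - log_uniform

    def vmf_kl_grad_kappa(m, kappa):
        # d KL / d kappa, Eq. (6).
        num = ive(m / 2, kappa) * (ive(m / 2 - 2, kappa) + ive(m / 2, kappa))
        return 0.5 * kappa * (ive(m / 2 + 1, kappa) / ive(m / 2 - 1, kappa)
                              - num / ive(m / 2 - 1, kappa) ** 2 + 1.0)

    # finite-difference check of Eq. (6) against Eq. (5)
    m, kappa, eps = 10, 5.0, 1e-4
    fd = (vmf_kl_to_uniform(m, kappa + eps) - vmf_kl_to_uniform(m, kappa - eps)) / (2 * eps)
    assert np.isclose(fd, vmf_kl_grad_kappa(m, kappa), rtol=1e-3)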
3.3 SAMPLING PROCEDURE

Algorithm 1 vMF sampling
  Input: dimension m, mean µ, concentration κ
  sample v ∼ U(S^{m−2})
  sample ω ∼ g(ω|κ, m) ∝ exp(ωκ)(1 − ω²)^{(m−3)/2}   {acceptance-rejection sampling}
  z′ ← (ω; (√(1 − ω²)) v^⊤)^⊤
  U ← Householder(e_1, µ)   {Householder transform}
  Return: U z′

To sample from the vMF we follow the procedure of Ulrich (1984), outlined in Algorithm 1. We first sample from a vMF q(z|e_1, κ) with modal vector e_1 = (1, 0, ···, 0). Since the vMF density is uniform in all the (m − 2)-dimensional sub-hyperspheres {x ∈ S^{m−1} | e_1^⊤ x = ω}, the sampling technique reduces to sampling the value ω from the univariate density g(ω|κ, m) ∝ exp(κω)(1 − ω²)^{(m−3)/2}, ω ∈ [−1, 1], using an acceptance-rejection scheme. After getting a sample from q(z|e_1, κ) an orthogonal transformation U(µ) is applied such that the transformed sample is distributed according to q(z|µ, κ). This can be achieved using a Householder reflection such that U(µ)e_1 = µ. A more in-depth explanation of the sampling technique can be found in Appendix A.

It is worth noting that the sampling technique does not suffer from the curse of dimensionality, as the acceptance-rejection procedure is only applied to a univariate distribution. Moreover in the case of S^2, the density g(ω|κ, 3) reduces to g(ω|κ, 3) ∝ exp(κω) 1_{[−1,+1]}(ω), which can be directly sampled without rejection.
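Since the inverse CDF of g(ω|κ, 3) is available in closed form, this special case can be sampled directly; a minimal sketch (our own illustration, written for numerical stability at large κ):

    import numpy as np

    def sample_w_s2(kappa, rng):
        # Inverse-CDF sample of w ~ g(w|kappa, 3) ∝ exp(kappa * w) on [-1, 1],
        # so no acceptance-rejection step is needed for S^2.
        u = rng.uniform()
        return 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa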
3.4 N-TRANSFORMATION REPARAMETERIZATION TRICK

While the reparameterization trick is easily implementable in the normal case, unfortunately it can only be applied to a handful of distributions. However a recent technique introduced by Naesseth et al. (2017) allows to extend the reparameterization trick to the wide class of distributions that can be simulated using rejection sampling. Dropping the dependence from x for simplicity, assume the approximate posterior is of the form g(ω|θ) and that it can be sampled by making proposals from r(ω|θ). If the proposal distribution can be reparameterized we can still perform the reparameterization trick. Let ε ∼ s(ε), and ω = h(ε, θ), a reparameterization of the proposal distribution, r(ω|θ). Performing the reparameterization trick for g(ω|θ) is made possible by the fundamental lemma proven in (Naesseth et al., 2017):

Lemma 1. Let f be any measurable function and ε ∼ π(ε|θ) = s(ε) g(h(ε, θ)|θ) / r(h(ε, θ)|θ) the distribution of the accepted sample. Then:

E_{π(ε|θ)}[f(h(ε, θ))] = ∫ f(h(ε, θ)) π(ε|θ) dε = ∫ f(ω) g(ω|θ) dω = E_{g(ω|θ)}[f(ω)],   (7)

Then the gradient can be taken using the log derivative trick:

∇_θ E_{g(ω|θ)}[f(ω)] = ∇_θ E_{π(ε|θ)}[f(h(ε, θ))]
  = E_{π(ε|θ)}[∇_θ f(h(ε, θ))] + E_{π(ε|θ)}[ f(h(ε, θ)) ∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) ],   (8)

However, in the case of the vMF a different procedure is required. After performing the transformation h(ε, θ) and accepting/rejecting the sample, we sample another random variable v ∼ π_2(v), and then apply a transformation z = T(h(ε, θ), v; θ), such that z ∼ q_ψ(z|θ) is distributed as the approximate posterior (in our case a vMF). Effectively this entails applying another reparameterization trick after the acceptance/rejection step. To still be able to perform the reparameterization we show that Lemma 1 fundamentally still holds in this case as well.

Lemma 2. Let f be any measurable function and ε ∼ π_1(ε|θ) = s(ε) g(h(ε, θ)|θ) / r(h(ε, θ)|θ) the distribution of the accepted sample. Also let v ∼ π_2(v), and T a transformation that depends on the parameters such that if z = T(ω, v; θ) with ω ∼ g(ω|θ), then z ∼ q(z|θ):

E_{(ε,v)∼π_1(ε|θ)π_2(v)}[f(T(h(ε, θ), v; θ))] = ∫ f(z) q(z|θ) dz = E_{q(z|θ)}[f(z)],   (9)

Proof. See Appendix C.

With this result we are able to derive a gradient expression similarly as done in equation 8. We refer to Appendix D for a complete derivation.
Figure 2: Latent space visualization of the 10 MNIST digits in 2 dimensions of both the N-VAE (left: (a) R^2 latent space) and the S-VAE (right: (b) Hammer projection of the S^2 latent space). (Best viewed in color)

3.5 BEHAVIOR IN HIGH DIMENSIONS

The surface area of a hypersphere is defined as

S(m − 1) = ( 2 π^{m/2} / Γ(m/2) ) r^{m−1},   (10)

where m is the dimensionality and r the radius. Notice that S(m − 1) → 0, as m → ∞. However, even for m > 20 we observe a vanishing surface problem (see Figure 6 in Appendix E). This could thus lead to unstable behavior of hyperspherical models in high dimensions.
explicitly optimizing the KL divergence term in the loss.
The method then only optimizes the reconstruction error
4 RELATED WORK by adding vMF noise to the encoder output in the latent
space to still allow generation. Moreover, using a fixed
Extending the VAE The majority of VAE extensions global κ for all the approximate posteriors severely limits
focus on increasing the flexibility of the approximate the flexibility and the expressiveness of the model.
posterior. This is usually achieved through normalizing
flows (Rezende and Mohamed, 2015), a class of invertible Non-Euclidean Latent Space In Liu and Zhu (2018),
transformations applied sequentially to an initial repa- a general model to perform Bayesian inference in Rieman-
rameterizable density q0 (z0 ), allowing for more complex nian Manifolds is proposed. Following other Stein-related
posteriors. Normalizing flows can be considered orthog- approaches, the method does not explicitly define a poste-
onal to our proposed approach. In fact, while allowing rior density but approximates it with a number of particles.
for a more flexible posterior, they do not modify the stan- Despite its generality and flexibility, it requires the choice
dard normal prior assumption. They could be perfectly of a kernel on the manifold and multiple particles to have
combined with S-VAEs allowing for more flexible distri- a good approximation of the posterior distribution. The
butions on the hypersphere. former is not necessarily straightforward, while the latter
quickly becomes computationally unfeasible.
One approach to obtain a more flexible prior is to use a
simple mixture of Gaussians (MoG) prior (Dilokthanakul Another approach by Nickel and Kiela (2017), capital-
et al., 2016). The recently introduced VampPrior model izes on the hierarchical structure present in some data
(Tomczak and Welling, 2018) outlines several advantages types. By learning the embeddings for a graph in a
over the MoG and instead tries to learn a more flexible non-euclidean negative curvature hyperbolical space, they
prior by expressing it as a mixture of approximate pos- show this topology has clear advantages over embedding
teriors. A non-parametric prior is proposed in Nalisnick these objects in a Euclidean space. Although they did not
and Smyth (2017), utilizing a truncated stick-breaking use a VAE-based approach, that is, they did not build a
Table 1: Summary of results (mean and standard-deviation over 10 runs) of unsupervised model on MNIST. RE and KL
correspond respectively to the reconstruction and the KL part of the ELBO. Best results are highlighted only if they
passed a student t-test with p < 0.01.

Method    N-VAE                                                    S-VAE
          LL           L[q]         RE           KL                LL           L[q]         RE           KL
d=2 -135.73±.83 -137.08±.83 -129.84±.91 7.24±.11 -132.50±.73 -133.72±.85 -126.43±.91 7.28±.14
d=5 -110.21±.21 -112.98±.21 -100.16±.22 12.82±.11 -108.43±.09 -111.19±.08 -97.84±.13 13.35±.06
d = 10 -93.84±.30 -98.36±.30 -78.93±.30 19.44±.14 -93.16±.31 -97.70±.32 -77.03±.39 20.67±.08
d = 20 -88.90±.26 -94.79±.19 -71.29±.45 23.50±.31 -89.02±.31 -96.15±.32 -67.65±.43 28.50±.22
d = 40 -88.93±.30 -94.91±.18 -71.14±.56 23.77±.49 -90.87±.34 -101.26±.33 -67.75±.70 33.50±.45

A Hyperspherical Perspective  As noted before, a distinction must be made between models dealing with the challenges of intrinsically hyperspherical data like omnidirectional video, and those attempting to exploit some latent hyperspherical manifold. A recent example of the first can be found in Cohen et al. (2018), where spherical CNNs are introduced. While flattening a spherical image produces unavoidable distortions, the newly defined convolutions take into account its geometrical properties.

The most general implementation of the second model type was proposed by Gopal and Yang (2014), who introduced a suite of models to improve cluster performance of high-dimensional data based on mixtures of vMF distributions. They showed that reducing an object representation to its directional components increases clusterability over standard methods like K-Means or Latent Dirichlet Allocation (Blei et al., 2001).

Specific applications of the vMF can be further found ranging from computer vision, where it is used to infer structure from motion (Guan and Smith, 2017) in spherical video, or structure from texture (Wilson et al., 2014), to natural language processing, where it is utilized in text analysis (Banerjee et al., 2003, 2005) and topic modeling (Banerjee and Basu, 2007; Reisinger et al., 2010).

Additionally, modeling data by restricting it to a hypersphere provides some natural regularizing properties as noted in (Liu et al., 2017). Finally, Aytekin et al. (2018) show on a variety of deep auto-encoder models that adding L2 normalization to the latent space during training, i.e. forcing the latent space on a hypersphere, improves clusterability.

5 EXPERIMENTS

In this section, we first perform a series of experiments to investigate the theoretical properties of the proposed S-VAE compared to the N-VAE. In a second experiment, we show how S-VAEs can be used in semi-supervised tasks to create a better separable latent representation to enhance classification. In the last experiment, we show that the S-VAE indeed presents a promising alternative to N-VAEs for data with a non-Euclidean latent representation of low dimensionality, on a link prediction task for three citation networks. All architecture and hyperparameter details are given in Appendix F.

5.1 RECOVERING HYPERSPHERICAL LATENT REPRESENTATIONS

In this first experiment we build on the motivation developed in Subsection 2.3, by confirming with a synthetic data example the difference in behavior of the N-VAE and S-VAE in recovering latent hyperspheres. We first generate samples from a mixture of three vMFs on the circle, S^1, as shown in Figure 1(a), which subsequently are mapped into the higher dimensional R^100 by applying a noisy, non-linear transformation. After this, we in turn train an auto-encoder, a N-VAE, and a S-VAE. We further investigate the behavior of the N-VAE by training a model using a scaled down KL divergence.
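For concreteness, a sketch of one way to generate such a synthetic dataset (the component means, concentration, and the form of the noisy non-linear embedding below are our own illustrative choices; they are not specified in the text):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_vmf_mixture_on_circle(n, kappa=20.0):
        # Mixture of three vMF (von Mises) components on S^1, returned as
        # 2-D unit vectors; means and concentration are illustrative.
        means = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
        comps = rng.integers(0, 3, size=n)
        angles = rng.vonmises(means[comps], kappa)
        return np.stack([np.cos(angles), np.sin(angles)], axis=1)

    def embed_noisily(z, dim=100, noise=0.05):
        # A stand-in noisy, non-linear embedding of S^1 into R^100.
        W1 = rng.standard_normal((2, 32))
        W2 = rng.standard_normal((32, dim))
        return np.tanh(z @ W1) @ W2 + noise * rng.standard_normal((len(z), dim))

    x = embed_noisily(sample_vmf_mixture_on_circle(1000))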
Results  The resulting latent spaces, displayed in Figure 1, clearly confirm the intuition built in Subsection 2.3. As expected, in Figure 1(b) the auto-encoder is perfectly capable to embed in low dimensions the original underlying data structure. However, most parts of the latent space are not occupied by points, critically affecting the ability to generate meaningful samples.

In the N-VAE setting we observe two types of behaviours, summarized by Figures 1(c) and 1(d). In the first we observe that if the prior is too strong it will force the posterior to match the prior shape, concentrating the samples in the center.
Table 2: Summary of results (mean accuracy and standard-deviation over 20 runs) of semi-supervised K-NN on MNIST.
Best results are highlighted only if they passed a student t-test with p < 0.01.

           100 labels              600 labels              1000 labels
Method     N-VAE      S-VAE       N-VAE      S-VAE        N-VAE      S-VAE
d=2 72.6±2.1 77.9±1.6 80.8±0.5 84.9±0.6 81.7±0.5 85.6±0.5
d=5 81.8±2.0 87.5±1.0 90.9±0.4 92.8±0.3 92.0±0.2 93.4±0.2
d = 10 75.7±1.8 80.6±1.3 88.4±0.5 91.2±0.4 90.2±0.4 92.8±0.3
d = 20 71.3±1.9 72.8±1.6 88.3±0.5 89.1±0.6 90.1±0.4 91.1±0.3
d = 40 72.3±1.6 67.7±2.3 88.0±0.5 87.4±0.7 90.3±0.5 90.4±0.4

However, this prevents the N-VAE to correctly represent the true shape of the data and creates instability problems for the decoder around the origin. On the contrary, if we scale down the KL term, we observe that the samples from the approximate posterior maintain a shape that reflects the S^1 structure smoothed with Gaussian noise. However, as the approximate posterior differs strongly from the prior, obtaining meaningful samples from the latent space again becomes problematic.

The S-VAE on the other hand, almost perfectly recovers the original dataset structure, while the samples from the approximate posterior closely match the prior distribution. This simple experiment confirms the intuition that having a prior that matches the true latent structure of the data is crucial in constructing a correct latent representation that preserves the ability to generate meaningful samples.

5.2 EVALUATION OF EXPRESSIVENESS

To compare the behavior of the N-VAE and S-VAE on a data set that does not have a clear hyperspherical latent structure, we evaluate both models on a reconstruction task using dynamically binarized MNIST (Salakhutdinov and Murray, 2008). We analyze the ELBO, KL, negative reconstruction error, and marginal log-likelihood (LL) for both models on the test set. The LL is estimated using importance sampling with 500 sample points (Burda et al., 2016).
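The importance-sampled LL estimate takes the usual log-mean-exp form; a minimal sketch with stand-in callables (our own illustration):

    import numpy as np
    from scipy.special import logsumexp

    def log_likelihood_estimate(x, sample_q, log_q, log_prior, log_px_given_z, K=500):
        # log p(x) ~= logsumexp_k[ log p(x|z_k) + log p(z_k) - log q(z_k|x) ] - log K,
        # with z_k ~ q(z|x) (importance sampling, Burda et al., 2016).
        log_w = np.empty(K)
        for k in range(K):
            z = sample_q(x)
            log_w[k] = log_px_given_z(x, z) + log_prior(z) - log_q(z, x)
        return logsumexp(log_w) - np.log(K)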
Results  Results are shown in Table 1. We first note that in terms of negative reconstruction error the S-VAE outperforms the N-VAE in all dimensions. Since the S-VAE uses a uniform prior, the KL divergence increases more strongly with dimensionality, which results in a higher ELBO. However in terms of log-likelihood (LL) the S-VAE clearly has an edge in low dimensions (d = 2, 5, 10) and performs comparable to the N-VAE in d = 20. This empirically confirms the hypothesis of Subsection 2.2, showing the positive effect of having a uniform prior in low dimensions. In the absence of any origin pull, the data is able to cluster naturally, utilizing the entire latent space, which can be observed in Figure 2. Note that in Figure 2(a) all mass is concentrated around the center, since the prior mean is zero. Conversely, in Figure 2(b) all available space is evenly covered due to the uniform prior, resulting in more separable clusters in S^2 compared to R^2. However, as dimensionality increases, the Gaussian distribution starts to approximate a hypersphere, while its posterior becomes more expressive than the vMF due to the higher number of variance parameters. Simultaneously, as described in Subsection 3.5, the surface area of the vMF starts to collapse, limiting the available space.

In Figure 7 and 8 of Appendix G, we present randomly generated samples from the N-VAE and the S-VAE, respectively. Moreover, in Figure 9 of Appendix G, we show 2-dimensional manifolds for the two models. Interestingly, the manifold given by the S-VAE indeed results in a latent space where digits occupy the entire space and there is a sense of continuity from left to right.

5.3 SEMI-SUPERVISED LEARNING

Having observed the S-VAE's ability to increase clusterability of data points in the latent space, we wish to further investigate this property using a semi-supervised classification task. For this purpose we re-implemented the M1 and M1+M2 models as described in (Kingma et al., 2014), and evaluate the classification accuracy of the S-VAE and the N-VAE on dynamically binarized MNIST. In the M1 model, a classifier utilizes the latent features obtained using a VAE as in experiment 5.2. The M1+M2 model is constructed by stacking the M2 model on top of M1, where M2 is the result of augmenting the VAE by introducing a partially observed variable y, and combining the ELBO and classification objective. This concatenated model is trained end-to-end³.

³ It is worth noting that in the original implementation by Kingma et al. (2014) the stacked model did not converge well using end-to-end training, and used the extracted features of the M1 model as inputs for the M2 model instead.
Figure 3: Latent space of unsupervised N-VGAE ((a) R^2 latent space) and S-VGAE ((b) Hammer projection of the S^2 latent space) models trained on the Cora citation network. Colors denote document classes, which are not provided during training. (Best viewed in color)

This last model also allows for a combination of the two topologies due to the presence of two distinct latent variables, z_1 and z_2. Since in the M2 latent space the class assignment is expressed by the variable y, while z_2 only needs to capture the style, it naturally follows that the N-VAE is more suited for this objective due to its higher number of variance parameters. Hence, besides comparing the S-VAE against the N-VAE, we additionally run experiments for the M1+M2 model by modeling z_1, z_2 respectively with a vMF and normal distribution.

Results  As can be seen in Table 2, for M1 the S-VAE outperforms the N-VAE in all dimensions up to d = 40. This result is amplified for a low number of observed labels. Note that for both models absolute performance drops as the dimensionality increases, since K-NN used as the classifier suffers from the curse of dimensionality. Besides reconfirming superiority of the S-VAE in d < 20, its better performance than the N-VAE for d = 20 was unexpected. This indicates that although the log-likelihood might be comparable (see Table 1) for higher dimensions, the S-VAE latent space better captures the cluster structure.

In the concatenated model M1+M2, we first observe in Table 3 that either the pure S-VAE or the S+N-VAE model yields the best results, where the S-VAE almost always outperforms the N-VAE. Our hypothesis regarding the merit of a S+N-VAE model is further confirmed, as displayed by the stable, strong performance across all different dimensions. Furthermore, the clear edge in clusterability of the S-VAE in low dimensional z_1, as already observed in Table 2, is again evident. As the dimensionality of z_1, z_2 increases, the accuracy of the N-VAE improves, reducing the performance gap with the S-VAE. As previously noticed, the S-VAE performance drops when dim z_2 = 50, with the best result being obtained for dim z_1 = dim z_2 = 10. In fact, it is worth noting that for this setting the S-VAE obtains comparable results to the original settings of (Kingma et al., 2014), while needing a considerably smaller latent space. Finally, the end-to-end trained S+N-VAE model is able to reach a significantly higher classification accuracy than the original results reported by Kingma et al. (2014), 96.7±.1.

The M1+M2 model allows for conditional generation. Similarly to (Kingma et al., 2014), we set the latent variable z_2 to the value inferred from the test image by the inference network, and then varied the class label y. In Figure 10 of Appendix H we notice that the model is able to disentangle the style from the class.

Table 3: Summary of results of semi-supervised model M1+M2 on MNIST.

                                 100 labels
dim z_1    dim z_2    N+N        S+S        S+N
5          5          90.0±.4    94.0±.1    93.8±.1
5          10         90.7±.3    94.1±.1    94.8±.2
5          50         90.7±.1    92.7±.2    93.0±.1
10         5          90.7±.3    91.7±.5    94.0±.4
10         10         92.2±.1    96.0±.2    95.9±.3
10         50         92.9±.4    95.1±.2    95.7±.1
50         5          92.0±.2    91.7±.4    95.8±.1
50         10         93.0±.1    95.8±.1    97.1±.1
50         50         93.2±.2    94.2±.1    97.4±.1


5.4 LINK PREDICTION ON GRAPHS

In this experiment, we aim at demonstrating the ability of the S-VAE to learn meaningful embeddings of nodes in a graph, showing the advantages of embedding objects in a non-Euclidean space. We test hyperspherical reparameterization on the recently introduced Variational Graph Auto-Encoder (VGAE) (Kipf and Welling, 2016), a VAE model for graph-structured data. We perform training on a link prediction task on three popular citation network datasets (Sen et al., 2008): Cora, Citeseer and Pubmed. Dataset statistics and further experimental details are summarized in Appendix F.3. The models are trained in an unsupervised fashion on a masked version of these datasets where some of the links have been removed. All node features are provided and efficacy is measured in terms of average precision (AP) and area under the ROC curve (AUC) on a test set of previously removed links. We use the same training, validation, and test splits as in Kipf and Welling (2016), i.e. we assign 5% of links for validation and 10% of links for testing.

Table 4: Results for link prediction in citation networks.

Dataset     Metric    N-VGAE     S-VGAE
Cora        AUC       92.7±.2    94.1±.1
Cora        AP        93.2±.4    94.1±.3
Citeseer    AUC       90.3±.5    94.7±.2
Citeseer    AP        91.5±.5    95.2±.2
Pubmed      AUC       97.1±.0    96.0±.1
Pubmed      AP        97.1±.0    96.0±.1

Results  In Table 4, we show that our model outperforms the N-VGAE baseline on two out of the three datasets by a significant margin. The log-probability of a link is computed as the dot product of two embeddings. In a hypersphere, this can be interpreted as the cosine similarity between vectors. Indeed we find that the choice of a dot product scoring function for link prediction is problematic in combination with the normal distribution on the latent space. If embeddings are close to the zero-center, noise during training can have a large destabilizing effect on the angle information between two embeddings. In practice, the model finds a solution where embeddings are "pushed" away from the zero-center, as demonstrated in Figure 3(a). This counteracts the pull towards the center arising from the standard prior and can overall lead to poor modeling performance. By constraining the embeddings to the surface of a hypersphere, this effect is mitigated, and the model can find a good separation of the latent clusters, as shown in Figure 3(b).
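A minimal sketch of this inner-product link scoring (our own illustration; row i of Z is the embedding of node i):

    import numpy as np

    def link_log_probs(Z):
        # log sigmoid(z_i^T z_j) for every node pair; on the hypersphere, where
        # ||z_i|| = 1, the logit z_i^T z_j equals the cosine similarity.
        logits = Z @ Z.T
        return -np.logaddexp(0.0, -logits)   # elementwise log-sigmoid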
On Pubmed, we observe that the S-VAE converges to a lower score than the N-VAE. The Pubmed dataset is significantly larger than Cora and Citeseer, and hence more complex. The N-VAE has a larger number of variance parameters for the posterior distribution, which might have played an important role in better modeling the relationships between nodes. We further hypothesize that not all graphs are necessarily better embedded in a hyperspherical space and that this depends on some fundamental topological properties of the graph. For instance, the already mentioned work from Nickel and Kiela (2017) shows that hyperbolical space is better suited for graphs with a hierarchical, tree-like structure. These considerations prefigure an interesting research direction that will be explored in future work.

6 CONCLUSION

With the S-VAE we set an important first step in the exploration of hyperspherical latent representations for variational auto-encoders. Through various experiments, we have shown that S-VAEs have a clear advantage over N-VAEs for data residing on a known hyperspherical manifold, and are competitive with or surpass N-VAEs for data with a non-obvious hyperspherical latent representation in lower dimensions. Specifically, we demonstrated S-VAEs improve separability in semi-supervised classification and that they are able to improve results on state-of-the-art link prediction models on citation graphs, by merely changing the prior and posterior distributions as a simple drop-in replacement.

We believe that the presented research paves the way for various promising areas of future work, such as exploring more flexible approximate posterior distributions through normalizing flows on the hypersphere, or hierarchical mixture models combining hyperspherical and hyperplanar space. Further research should be done in increasing the performance of S-VAEs in higher dimensions; one possible solution could be to dynamically learn the radius of the latent hypersphere in a full Bayesian setting.

Acknowledgements

We would like to thank Rianne van den Berg, Jonas Köhler, Pim de Haan, Taco Cohen, Marco Federici, and Max Welling for insightful discussions. T.K. is supported by the SAP Innovation Center Network. J.M.T. was funded by the European Commission within the Marie Skłodowska-Curie Individual Fellowship (Grant No. 702666, "Deep learning and Bayesian inference for medical imaging").
References

Aytekin, C., Ni, X., Cricri, F., and Aksu, E. (2018). Clustering and unsupervised anomaly detection with L2 normalized deep auto-encoder representations. arXiv preprint, abs/1802.00187.

Banerjee, A. and Basu, S. (2007). Topic models over text streams: A study of batch and online unsupervised learning. ICDM, pages 431–436.

Banerjee, A., Dhillon, I., Ghosh, J., and Sra, S. (2003). Generative model-based clustering of directional data. SIGKDD, pages 19–28.

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2001). Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14, pages 601–608. MIT Press.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21. Association for Computational Linguistics.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In International Conference on Learning Representations (ICLR).

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. (2018). Spherical CNNs. In International Conference on Learning Representations (ICLR).

Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. (2016). Deep unsupervised clustering with Gaussian mixture variational autoencoders. CoRR, abs/1611.02648.

Fisher, N. I., Lewis, T., and Embleton, B. J. (1987). Statistical Analysis of Spherical Data. Cambridge University Press.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS, pages 249–256.

Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 154–162.

Guan, H. and Smith, W. A. (2017). Structure-from-motion in spherical video using the von Mises-Fisher distribution. IEEE Transactions on Image Processing, 26(2):711–723.

Guu, K., Hashimoto, T. B., Oren, Y., and Liang, P. (2018). Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450.

Hasnat, M., Bohné, J., Milgram, J., Gentric, S., Chen, L., et al. (2017). von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv preprint, abs/1706.04264.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589.

Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).

Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through transformations between Bayes nets and neural nets. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1782–1790.

Kipf, T. N. and Welling, M. (2016). Variational graph auto-encoders. NIPS Bayesian Deep Learning Workshop.

Liu, C. and Zhu, J. (2018). Riemannian Stein variational gradient descent for Bayesian inference. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 3627–3634.

Liu, W., Zhang, Y., Li, X., Liu, Z., Dai, B., Zhao, T., and Song, L. (2017). Deep hyperspherical learning. In Advances in Neural Information Processing Systems 30, pages 3950–3960.

Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society, Series B (Methodological), pages 349–393.

Naesseth, C. A., Ruiz, F. J. R., Linderman, S. W., and Blei, D. M. (2017). Reparameterization gradients through acceptance-rejection sampling algorithms. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 489–498.

Nalisnick, E. T. and Smyth, P. (2017). Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR).

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6338–6347.

Reisinger, J., Waters, A., Silverthorn, B., and Mooney, R. J. (2010). Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 903–910.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1530–1538.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1278–1286.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 872–879.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3):93.

Tomczak, J. M. and Welling, M. (2018). VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1214–1223.

Ulrich, G. (1984). Computer generation of distributions on the m-sphere. Journal of the Royal Statistical Society, Series C (Applied Statistics), 33(2):158–163.

Wilson, R. C., Hancock, E. R., Pekalska, E., and Duin, R. P. (2014). Spherical and hyperbolic embeddings of data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2255–2269.
A SAMPLING PROCEDURE

Figure 4: Overview of von Mises-Fisher sampling procedure. Note that as ω is a scalar, the procedure does not suffer
from the curse of dimensionality.

The general algorithm for sampling from a vMF has been outlined in Algorithm 1. The exact form of the univariate distribution g(ω|κ) is:

g(ω|κ) = C_m(κ) · ( 2 π^{m/2} / Γ(m/2) ) · exp(ωκ) (1 − ω²)^{(m−3)/2} / B(1/2, (m − 1)/2),   (11)

Samples from this distribution are drawn using an acceptance/rejection algorithm when m ≠ 3. The complete procedure is described in Algorithm 2. The Householder reflection (see Algorithm 3 for details) simply finds an orthonormal transformation that maps the modal vector e_1 = (1, 0, ···, 0) to µ. Since an orthonormal transformation preserves distances, all the points on the hypersphere will stay on the surface after mapping. Notice that even the transform U z′ = (I − 2uu^⊤)z′ can be executed in O(m) by rearranging the terms.

Figure 5: Geometric representation of a single sample in S^2, where ω ∼ g(ω|κ) and v ∼ U(S^1).


Algorithm 2 g(ω|κ) acceptance-rejection sampling
  Input: dimension m, concentration κ
  Initialize values:
    b ← ( −2κ + √(4κ² + (m − 1)²) ) / (m − 1)
    a ← ( (m − 1) + 2κ + √(4κ² + (m − 1)²) ) / 4
    d ← 4ab / (1 + b) − (m − 1) ln(m − 1)
  repeat
    Sample ε ∼ Beta((m − 1)/2, (m − 1)/2)
    ω ← h(ε, κ) = (1 − (1 + b)ε) / (1 − (1 − b)ε)
    t ← 2ab / (1 − (1 − b)ε)
    Sample u ∼ U(0, 1)
  until (m − 1) ln(t) − t + d ≥ ln(u)
  Return: ω

Algorithm 3 Householder transform
  Input: mean µ, modal vector e_1
  u′ ← e_1 − µ
  u ← u′ / ||u′||
  U ← I − 2uu^⊤
  Return: U
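Putting Algorithms 1-3 together, a compact NumPy sketch (our own transcription, not the reference implementation; it assumes µ ≠ e_1 and makes no claims about numerical robustness for very large κ):

    import numpy as np

    def sample_w(m, kappa, rng):
        # Acceptance-rejection sampler for w ~ g(w|kappa) (Algorithm 2).
        b = (-2 * kappa + np.sqrt(4 * kappa**2 + (m - 1)**2)) / (m - 1)
        a = ((m - 1) + 2 * kappa + np.sqrt(4 * kappa**2 + (m - 1)**2)) / 4
        d = 4 * a * b / (1 + b) - (m - 1) * np.log(m - 1)
        while True:
            eps = rng.beta(0.5 * (m - 1), 0.5 * (m - 1))
            w = (1 - (1 + b) * eps) / (1 - (1 - b) * eps)   # w = h(eps, kappa)
            t = 2 * a * b / (1 - (1 - b) * eps)
            if (m - 1) * np.log(t) - t + d >= np.log(rng.uniform()):
                return w

    def householder(mu):
        # Reflection U with U e1 = mu (Algorithm 3); assumes mu != e1.
        e1 = np.zeros_like(mu)
        e1[0] = 1.0
        u = e1 - mu
        u = u / np.linalg.norm(u)
        return np.eye(len(mu)) - 2 * np.outer(u, u)

    def sample_vmf(mu, kappa, rng):
        # One sample from vMF(mu, kappa) on S^{m-1} (Algorithm 1).
        m = len(mu)
        v = rng.standard_normal(m - 1)
        v = v / np.linalg.norm(v)                           # v ~ U(S^{m-2})
        w = sample_w(m, kappa, rng)
        z_prime = np.concatenate(([w], np.sqrt(1 - w**2) * v))
        return householder(mu) @ z_prime

Counting how many Beta proposals sample_w draws before accepting reproduces estimates in the style of Table 5.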

Table 5: Expected number of samples needed before acceptance, computed using a Monte Carlo estimate with 1000 samples while varying the dimensionality and concentration parameters. Notice that the sampling complexity increases in κ, but decreases as the dimensionality, d, increases.

κ=1 κ=5 κ = 10 κ = 50 κ = 100 κ = 500 κ = 1000 κ = 5000 κ = 10000


d=5 1.020 1.171 1.268 1.398 1.397 1.426 1.458 1.416 1.440
d = 10 1.008 1.094 1.154 1.352 1.411 1.407 1.369 1.402 1.419
d = 20 1.001 1.031 1.085 1.305 1.342 1.367 1.409 1.410 1.407
d = 40 1.000 1.011 1.027 1.187 1.288 1.397 1.433 1.402 1.423
d = 100 1.000 1.000 1.006 1.092 1.163 1.317 1.360 1.398 1.416

B KL DIVERGENCE DERIVATION

The KL divergence between a von Mises-Fisher distribution q(z|µ, κ) and a uniform distribution on the hypersphere (one divided by the surface area of S^{m−1}), p(z) = ( 2 π^{m/2} / Γ(m/2) )^{−1}, is:

KL[q(z|µ, κ) || p(z)] = ∫_{S^{m−1}} q(z|µ, κ) log ( q(z|µ, κ) / p(z) ) dz   (12)
  = ∫_{S^{m−1}} q(z|µ, κ) ( log C_m(κ) + κ µ^T z − log p(z) ) dz   (13)
  = κ µ^T E_q[z] + log C_m(κ) − log ( 2 π^{m/2} / Γ(m/2) )^{−1}   (14)
  = κ I_{m/2}(κ) / I_{m/2−1}(κ) + ( (m/2 − 1) log κ − (m/2) log(2π) − log I_{m/2−1}(κ) ) + (m/2) log π + log 2 − log Γ(m/2),   (15)
B.1 GRADIENT OF KL DIVERGENCE

Using

∇_κ I_v(κ) = (1/2) ( I_{v−1}(κ) + I_{v+1}(κ) ),   (16)

and

∇_κ log C_m(κ) = ∇_κ ( (m/2 − 1) log κ − (m/2) log(2π) − log I_{m/2−1}(κ) )   (17)
  = − I_{m/2}(κ) / I_{m/2−1}(κ),   (18)

then

∇_κ KL[q(z|µ, κ) || p(z)] = ∇_κ ( κ I_{m/2}(κ) / I_{m/2−1}(κ) ) + ∇_κ log C_m(κ)   (19)
  = I_{m/2}(κ) / I_{m/2−1}(κ) + κ ∇_κ ( I_{m/2}(κ) / I_{m/2−1}(κ) ) − I_{m/2}(κ) / I_{m/2−1}(κ)   (20)
  = (1/2) κ [ I_{m/2+1}(κ) / I_{m/2−1}(κ) − I_{m/2}(κ) ( I_{m/2−2}(κ) + I_{m/2}(κ) ) / I_{m/2−1}(κ)^2 + 1 ],   (21)

Notice that we can use I^{exp}_{m/2}(κ) = exp(−κ) I_{m/2}(κ) for numerical stability.

C PROOF OF LEMMA 2
Lemma 2 (restated). Let f be any measurable function and ε ∼ π_1(ε|θ) = s(ε) g(h(ε, θ)|θ) / r(h(ε, θ)|θ) the distribution of the accepted sample. Also let v ∼ π_2(v), and T a transformation that depends on the parameters such that if z = T(ω, v; θ) with ω ∼ g(ω|θ), then z ∼ q(z|θ):

E_{(ε,v)∼π_1(ε|θ)π_2(v)}[f(T(h(ε, θ), v; θ))] = ∫ f(z) q(z|θ) dz = E_{q(z|θ)}[f(z)],   (22)

Proof.

E_{(ε,v)∼π_1(ε|θ)π_2(v)}[f(T(h(ε, θ), v; θ))] = ∬ f(T(h(ε, θ), v; θ)) π_1(ε|θ) π_2(v) dε dv,   (23)

Using the same argument employed by Naesseth et al. (2017) we can apply the change of variables ω = h(ε, θ) and rewrite the expression as:

  = ∬ f(T(ω, v; θ)) g(ω|θ) π_2(v) dω dv =* ∫ f(z) q(z|θ) dz,   (24)

where in * we applied the change of variables z = T(ω, v; θ).

D REPARAMETRIZATION GRADIENT DERIVATION

D.1 GENERAL EXPRESSION DERIVATION

We can then proceed as in Equation 8, using Lemma 2 and the log derivative trick to compute the gradient of the expectation term ∇_θ E_{q(z|θ)}[f(z)]:

∇_θ E_{q(z|θ)}[f(z)] = ∇_θ ∬ f(T(h(ε, θ), v; θ)) π_1(ε|θ) π_2(v) dε dv   (25)
  = ∇_θ ∬ f(T(h(ε, θ), v; θ)) s(ε) ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) π_2(v) dε dv   (26)
  = ∬ s(ε) π_2(v) ∇_θ ( f(T(h(ε, θ), v; θ)) g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv   (27)
  = ∬ s(ε) π_2(v) ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) ∇_θ ( f(T(h(ε, θ), v; θ)) ) dε dv
    + ∬ s(ε) π_2(v) f(T(h(ε, θ), v; θ)) ∇_θ ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv   (28)
  = ∬ π_1(ε|θ) π_2(v) ∇_θ ( f(T(h(ε, θ), v; θ)) ) dε dv
    + ∬ s(ε) π_2(v) f(T(h(ε, θ), v; θ)) ∇_θ ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv   (29)
  = E_{(ε,v)∼π_1(ε|θ)π_2(v)}[ ∇_θ f(T(h(ε, θ), v; θ)) ]   (= g_rep)
    + E_{(ε,v)∼π_1(ε|θ)π_2(v)}[ f(T(h(ε, θ), v; θ)) ∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) ],   (30)   (= g_cor)

where g_rep is the reparameterization term and g_cor the correction term. Since h is invertible in ε, Naesseth et al. (2017) show that ∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) in g_cor simplifies to:

∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) = ∇_θ log g(h(ε, θ)|θ) + ∇_θ log | ∂h(ε, θ) / ∂ε |,   (31)

D.2 GRADIENT CALCULATION

In our specific case we want to take the gradient w.r.t. θ of the expression:

E_{q_ψ(z|x;θ)}[log p_φ(x|z)],   where θ = (µ, κ),   (32)

The gradient can be computed using Lemma 2 and the subsequent gradient derivation with f(z) = log p_φ(x|z). As specified in Section 3.4 we optimize unbiased Monte Carlo estimates of the gradient. Therefore, having fixed one datapoint x and sampled (ε, v) ∼ π_1(ε|θ) π_2(v), the gradient is:

∇_θ E_{q_ψ(z|x;θ)}[log p_φ(x|z)] = g_rep + g_cor,   (33)

with

g_rep ≈ ∇_θ log p_φ(x | T(h(ε, θ), v; θ)),   (34)

g_cor ≈ log p_φ(x | T(h(ε, θ), v; θ)) ( ∇_θ log g(h(ε, θ)|θ) + ∇_θ log | ∂h(ε, θ) / ∂ε | ),   (35)

where g_rep is simply the gradient of the reconstruction loss w.r.t. θ and can be easily handled by automatic differentiation packages.

For what concerns g_cor, we notice that the terms g(·) and h(·) do not depend on µ. Thus the g_cor term w.r.t. µ is 0 and all the following calculations will be only w.r.t. κ. We therefore have:

∂h(ε, κ) / ∂ε = −2b / ((b − 1)ε + 1)^2,   where b = ( −2κ + √(4κ² + (m − 1)²) ) / (m − 1),   (36)

and

∇_κ log g(ω|κ) = ∇_κ ( log C_m(κ) + ωκ + (1/2)(m − 3) log(1 − ω²) )   (37)
  = ∇_κ log C_m(κ) + ∇_κ ( ωκ + (1/2)(m − 3) log(1 − ω²) ).   (38)

So, putting everything together we have:

g_cor = log p_φ(x|z) · ( − I_{m/2}(κ) / I_{m/2−1}(κ) + ∇_κ ( ωκ + (1/2)(m − 3) log(1 − ω²) + log | −2b / ((b − 1)ε + 1)^2 | ) ),   (39)

where

b = ( −2κ + √(4κ² + (m − 1)²) ) / (m − 1),   (40)
ω = h(ε, θ) = (1 − (1 + b)ε) / (1 − (1 − b)ε),   (41)
z = T(h(ε, θ), v; θ),   (42)

and the term ∇_κ ( ωκ + (1/2)(m − 3) log(1 − ω²) + log | −2b / ((b − 1)ε + 1)^2 | ) can be computed by automatic differentiation packages.

E COLLAPSE OF THE SURFACE AREA

Figure 6: Plot of the unit hyperspherical surface area against dimensionality. The surface area has a maximum for
m = 7.

F EXPERIMENTAL DETAILS: ARCHITECTURE AND HYPERPARAMETERS

F.1 EXPERIMENT 5.2

Architecture and hyperparameters For both the encoder and the decoder we use MLPs with 2 hidden layers of
respectively, [256, 128] and [128, 256] hidden units. We trained until convergence using early-stopping with a look
ahead of 50 epochs. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3, and mini-batches
of size 64. Additionally, we used a linear warm-up for 100 epochs (Bowman et al., 2016). The weights of the neural
network were initialized according to (Glorot and Bengio, 2010).

F.2 EXPERIMENT 5.3

Architecture and Hyperparameters For M1 we reused the trained models of the previous experiment, and used
K-nearest neighbors (K-NN) as a classifier with k = 5. In the N -VAE case we used the Euclidean distance as a
distance metric. For the S-VAE the geodesic distance arccos(x^T y) was employed. The performance was evaluated for
N = [100, 600, 1000] observed labels.
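A minimal sketch of such a geodesic K-NN classifier (our own illustration; rows of the embedding matrices are assumed to be unit-normalized and labels are non-negative integers):

    import numpy as np

    def knn_predict_geodesic(Z_train, y_train, Z_test, k=5):
        # K-NN using the geodesic distance arccos(x^T y) on the hypersphere.
        cos = np.clip(Z_test @ Z_train.T, -1.0, 1.0)
        dist = np.arccos(cos)
        nn = np.argsort(dist, axis=1)[:, :k]        # k nearest training points
        votes = y_train[nn]
        return np.array([np.bincount(v).argmax() for v in votes])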
The stacked M1+M2 model uses the same architecture as outlined by Kingma et al. (2014), where the MLPs utilized
in the generative and inference models are constructed using a single hidden layer, each with 500 hidden units. The
latent space dimensionalities of z1, z2 were both varied in [5, 10, 50]. We used the rectified linear unit (ReLU) as an
activation function. Training was continued until convergence using early-stopping with a look ahead of 50 epochs on
the validation set. We used the Adam optimizer with a learning rate of 1e-3, and mini-batches of size 100. All neural
network weights were initialized according to (Glorot and Bengio, 2010). N was set to 100, and the α parameter used to
scale the classification loss was chosen between [0.1, 1.0]. Crucially, we train this model end-to-end instead of by parts.

F.3 EXPERIMENT 5.4

Architecture and Hyperparameters We are training a Variational Graph Auto-encoder (VGAE) model, a state-of-
the-art link prediction model for graphs, as proposed in Kipf and Welling (2016). For a fair comparison, we use the same
architecture as in the original paper and we just change the way the latent space is generated using the vMF distribution
instead of a normal distribution. All models are trained for 200 epochs on Cora and Citeseer, and 400 epochs on Pubmed
with the Adam optimizer. Optimal learning rate lr ∈ {0.01, 0.005, 0.001}, dropout rate pdo ∈ {0, 0.1, 0.2, 0.3, 0.4} and
number of latent dimensions dz ∈ {8, 16, 32, 64} are determined via grid search based on validation AUC performance.
For S-VGAE, we omit the dz = 64 setting as some of our experiments ran out of memory. The model is trained with a
single hidden layer with 32 units and with document features as input, as in Kipf and Welling (2016). The weights
of the neural network were initialized according to (Glorot and Bengio, 2010). For testing, we report performance of
the model selected from the training epoch with highest AUC score on the validation set. Different from (Kipf and
Welling, 2016), we train both the N -VGAE and the S-VGAE models using negative sampling in order to speed up
training, i.e. for each positive link we sample, uniformly at random, one negative link during every training epoch. All
experiments are repeated 5 times, and we report mean and standard error values.
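A small sketch of the negative-sampling step described above (our own illustration; it resamples on collisions with existing edges):

    import numpy as np

    def sample_negative_edges(pos_edges, num_nodes, rng):
        # For every positive edge, draw one (i, j) uniformly at random that is
        # neither a self-loop nor an observed edge (in either direction).
        pos = set(map(tuple, pos_edges))
        neg = []
        while len(neg) < len(pos_edges):
            i, j = rng.integers(0, num_nodes, size=2)
            if i != j and (i, j) not in pos and (j, i) not in pos:
                neg.append((int(i), int(j)))
        return np.array(neg)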

F.3.1 FURTHER EXPERIMENTAL DETAILS

Dataset statistics are summarized in Table 6. Final hyperparameter choices found via grid search on the validation splits
are summarized in Table 7.

Table 6: Dataset statistics for citation network datasets.

Dataset Nodes Edges Features


Cora 2,708 5,429 1,433
Citeseer 3,327 4,732 3,703
Pubmed 19,717 44,338 500
Table 7: Best hyperparameter settings found for citation network datasets.

Dataset     Model     lr      pdo    dz

Cora        N-VAE     0.005   0.4    64
Cora        S-VAE     0.001   0.1    32
Citeseer    N-VAE     0.01    0.4    64
Citeseer    S-VAE     0.005   0.2    32
Pubmed      N-VAE     0.001   0.2    32
Pubmed      S-VAE     0.01    0.0    32

G VISUALIZATION OF SAMPLES AND LATENT SPACES

Figure 7: Random samples from the N-VAE on MNIST for different dimensionalities of the latent space: (a) d = 2, (b) d = 5, (c) d = 10, (d) d = 20.

Figure 8: Random samples from the S-VAE on MNIST for different dimensionalities of the latent space: (a) d = 2, (b) d = 5, (c) d = 10, (d) d = 20.
Figure 9: Visualization of the 2-dimensional manifold of MNIST for both the (a) N-VAE and (b) S-VAE. Notice that the N-VAE has a clear center and all digits are spread around it. Conversely, in the S-VAE all digits occupy the entire space and there is a sense of continuity from left to right.

H VISUALIZATION OF CONDITIONAL GENERATION

Figure 10: Visualization of handwriting styles learned by the model, using conditional generation on MNIST of M1+M2
with dim(z1 ) = 50, dim(z2 ) = 50, S+N . Following Kingma et al. (2014), the left most column shows images from
the test set. The other columns show analogical fantasies of x by the generative model, where in each row the latent
variable z2 is set to the value inferred from the test image by the inference network and the class label y is varied per
column.
