Hyperspherical Variational Auto-Encoders
Tim R. Davidson∗ Luca Falorsi∗ Nicola De Cao∗ Thomas Kipf Jakub M. Tomczak
University of Amsterdam
Figure 1: Plots of the original latent space (a) and learned latent space representations in different settings, where β is a
re-scaling factor for weighting the KL divergence. (Best viewed in color)
M. Yet, since D > M, when taking random points in the latent space they will most likely not be in emb(M), resulting in a poorly reconstructed sample.

The VAE tries to solve this problem by forcing M to be mapped into an approximate posterior distribution that has support in the entire Z. Clearly, this approach is bound to fail since the two spaces have a fundamentally different structure. This can likely produce two behaviors: first, the VAE could just smooth the original embedding emb(M), leaving most of the latent space empty and leading to bad samples. Second, if we increase the KL term, the encoder will be pushed to occupy all of the latent space, but this will create instability and discontinuity, affecting the convergence of the model. To validate our intuition we performed a small proof of concept experiment using M = S^1, which is visualized in Figure 1. Note that, as expected, the auto-encoder in Figure 1(b) mostly recovers the original latent space of Figure 1(a), as there are no distributional restrictions. In Figure 1(c) we clearly observe for the N-VAE that points collapse around the origin due to the KL, which is much less pronounced in Figure 1(d) when its contribution is scaled down. Lastly, the S-VAE almost perfectly recovers the original circular latent space. The observed behavior confirms our intuition.

To solve this problem the best option would be to directly specify a Z homeomorphic to M and distributions on M. However, for real data, discovering the structure of M will often be a difficult inference task. Nevertheless, we believe this shows that investigating VAE architectures that map to posterior distributions defined on manifolds other than Euclidean space is a topic worth exploring. In that sense, this work represents an initial step in this research direction.

3 REPLACING GAUSSIAN WITH VON MISES-FISHER

3.1 VON MISES-FISHER DISTRIBUTION

The von Mises-Fisher (vMF) distribution is often described as the Normal (Gaussian) distribution on a hypersphere. Analogous to a Gaussian, it is parameterized by µ ∈ R^m indicating the mean direction, and κ ∈ R_{≥0} the concentration around µ. For the special case of κ = 0, the vMF represents a Uniform distribution. The probability density function of the vMF distribution for a random unit vector z ∈ R^m (or z ∈ S^{m−1}) is then defined as

q(z|µ, κ) = C_m(κ) exp(κ µ^T z),    (3)
C_m(κ) = κ^{m/2−1} / ( (2π)^{m/2} I_{m/2−1}(κ) ),    (4)

where ||µ||_2 = 1, C_m(κ) is the normalizing constant, and I_v denotes the modified Bessel function of the first kind at order v.

3.2 KL DIVERGENCE

As previously emphasized, one of the main advantages of using the vMF distribution as an approximate posterior is that we are able to place a uniform prior on the latent space. The KL divergence term KL(vMF(µ, κ) || U(S^{m−1})) to be optimized is:

κ I_{m/2}(κ) / I_{m/2−1}(κ) + log C_m(κ) − log ( 2 π^{m/2} / Γ(m/2) )^{−1},    (5)
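As a concrete illustration of Equations (3)-(4), the log-density and normalizing constant can be evaluated with exponentially scaled Bessel functions to avoid overflow for large κ. The following is a minimal NumPy/SciPy sketch; the helper names and the use of scipy.special.ive are our own choices, not part of the original implementation.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def vmf_log_norm_const(m, kappa):
    # log C_m(kappa) from Eq. (4), assuming kappa > 0.
    # Uses log I_v(kappa) = log(ive(v, kappa)) + kappa for numerical stability.
    return ((m / 2 - 1) * np.log(kappa)
            - (m / 2) * np.log(2 * np.pi)
            - (np.log(ive(m / 2 - 1, kappa)) + kappa))

def vmf_log_pdf(z, mu, kappa):
    # log q(z | mu, kappa) from Eq. (3) for a unit vector z on S^{m-1}.
    m = mu.shape[-1]
    return vmf_log_norm_const(m, kappa) + kappa * np.dot(mu, z)

# Example: log-density of a point on S^2 (m = 3).
mu = np.array([0.0, 0.0, 1.0])
z = np.array([0.0, 1.0, 0.0])
print(vmf_log_pdf(z, mu, kappa=10.0))
```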
Figure 2: Latent space visualization of the 10 MNIST digits in 2 dimensions of both N -VAE (left) and S-VAE (right).
(Best viewed in color)
3.5 BEHAVIOR IN HIGH DIMENSIONS

The surface area of a hypersphere is defined as

S(m − 1) = ( 2 π^{m/2} / Γ(m/2) ) r^m,    (10)

where m is the dimensionality and r the radius. Notice that S(m − 1) → 0 as m → ∞. However, even for m > 20 we observe a vanishing surface problem (see Figure 6 in Appendix E). This could thus lead to unstable behavior of hyperspherical models in high dimensions.

4 RELATED WORK

Extending the VAE  The majority of VAE extensions focus on increasing the flexibility of the approximate posterior. This is usually achieved through normalizing flows (Rezende and Mohamed, 2015), a class of invertible transformations applied sequentially to an initial reparameterizable density q_0(z_0), allowing for more complex posteriors. Normalizing flows can be considered orthogonal to our proposed approach. In fact, while allowing for a more flexible posterior, they do not modify the standard normal prior assumption. They could be perfectly combined with S-VAEs, allowing for more flexible distributions on the hypersphere.

One approach to obtain a more flexible prior is to use a simple mixture of Gaussians (MoG) prior (Dilokthanakul et al., 2016). The recently introduced VampPrior model (Tomczak and Welling, 2018) outlines several advantages over the MoG and instead tries to learn a more flexible prior by expressing it as a mixture of approximate posteriors. A non-parametric prior is proposed in Nalisnick and Smyth (2017), utilizing a truncated stick-breaking process. Opposite to these approaches, we aim at using a non-informative prior to simplify the inference.

The closest approach to ours is a VAE with a vMF distribution in the latent space, used for a sentence generation task by Guu et al. (2018). While formally this approach is cast as a variational approach, the proposed model does not reparameterize and learn the concentration parameter κ, treating it instead as a constant value that remains the same for every approximate posterior. Critically, as indicated in Equation 5, the KL divergence term only depends on κ; therefore, leaving κ constant means never explicitly optimizing the KL divergence term in the loss. The method then only optimizes the reconstruction error by adding vMF noise to the encoder output in the latent space to still allow generation. Moreover, using a fixed global κ for all the approximate posteriors severely limits the flexibility and the expressiveness of the model.

Non-Euclidean Latent Space  In Liu and Zhu (2018), a general model to perform Bayesian inference on Riemannian manifolds is proposed. Following other Stein-related approaches, the method does not explicitly define a posterior density but approximates it with a number of particles. Despite its generality and flexibility, it requires the choice of a kernel on the manifold and multiple particles to have a good approximation of the posterior distribution. The former is not necessarily straightforward, while the latter quickly becomes computationally unfeasible.

Another approach, by Nickel and Kiela (2017), capitalizes on the hierarchical structure present in some data types. By learning the embeddings for a graph in a non-Euclidean, negative-curvature hyperbolic space, they show this topology has clear advantages over embedding these objects in a Euclidean space. Although they did not use a VAE-based approach, that is, they did not build a
Table 1: Summary of results (mean and standard deviation over 10 runs) of the unsupervised model on MNIST. RE and KL correspond respectively to the reconstruction and the KL part of the ELBO. Best results are highlighted only if they passed a Student's t-test with p < 0.01.
                     N-VAE                                                      S-VAE
Method     LL            L[q]          RE            KL            LL            L[q]          RE            KL
d = 2      -135.73±.83   -137.08±.83   -129.84±.91    7.24±.11     -132.50±.73   -133.72±.85   -126.43±.91    7.28±.14
d = 5      -110.21±.21   -112.98±.21   -100.16±.22   12.82±.11     -108.43±.09   -111.19±.08    -97.84±.13   13.35±.06
d = 10      -93.84±.30    -98.36±.30    -78.93±.30   19.44±.14      -93.16±.31    -97.70±.32    -77.03±.39   20.67±.08
d = 20      -88.90±.26    -94.79±.19    -71.29±.45   23.50±.31      -89.02±.31    -96.15±.32    -67.65±.43   28.50±.22
d = 40      -88.93±.30    -94.91±.18    -71.14±.56   23.77±.49      -90.87±.34   -101.26±.33    -67.75±.70   33.50±.45
posterior to match the prior shape, concentrating the samples in the center. However, this prevents the N-VAE from correctly representing the true shape of the data and creates instability problems for the decoder around the origin. On the contrary, if we scale down the KL term, we observe that the samples from the approximate posterior maintain a shape that reflects the S^1 structure smoothed with Gaussian noise. However, as the approximate posterior differs strongly from the prior, obtaining meaningful samples from the latent space again becomes problematic.

The S-VAE, on the other hand, almost perfectly recovers the original dataset structure, while the samples from the approximate posterior closely match the prior distribution. This simple experiment confirms the intuition that having a prior that matches the true latent structure of the data is crucial in constructing a correct latent representation that preserves the ability to generate meaningful samples.

5.2 EVALUATION OF EXPRESSIVENESS

To compare the behavior of the N-VAE and S-VAE on a data set that does not have a clear hyperspherical latent structure, we evaluate both models on a reconstruction task using dynamically binarized MNIST (Salakhutdinov and Murray, 2008). We analyze the ELBO, KL, negative reconstruction error, and marginal log-likelihood (LL) for both models on the test set. The LL is estimated using importance sampling with 500 sample points (Burda et al., 2016).

Results  Results are shown in Table 1. We first note that in terms of negative reconstruction error the S-VAE outperforms the N-VAE in all dimensions. Since the S-VAE uses a uniform prior, the KL divergence increases more strongly with dimensionality, which results in a higher ELBO. However, in terms of log-likelihood (LL) the S-VAE clearly has an edge in low dimensions (d = 2, 5, 10) and performs comparably to the N-VAE in d = 20. This empirically confirms the hypothesis of Subsection 2.2, showing the positive effect of having a uniform prior in low dimensions. In the absence of any origin pull, the data is able to cluster naturally, utilizing the entire latent space, which can be observed in Figure 2. Note that in Figure 2(a) all mass is concentrated around the center, since the prior mean is zero. Conversely, in Figure 2(b) all available space is evenly covered due to the uniform prior, resulting in more separable clusters in S^2 compared to R^2. However, as dimensionality increases, the Gaussian distribution starts to approximate a hypersphere, while its posterior becomes more expressive than the vMF due to the higher number of variance parameters. Simultaneously, as described in Subsection 3.5, the surface area of the vMF starts to collapse, limiting the available space.

In Figures 7 and 8 of Appendix G, we present randomly generated samples from the N-VAE and the S-VAE, respectively. Moreover, in Figure 9 of Appendix G, we show 2-dimensional manifolds for the two models. Interestingly, the manifold given by the S-VAE indeed results in a latent space where digits occupy the entire space and there is a sense of continuity from left to right.

5.3 SEMI-SUPERVISED LEARNING

Having observed the S-VAE's ability to increase clusterability of data points in the latent space, we wish to further investigate this property using a semi-supervised classification task. For this purpose we re-implemented the M1 and M1+M2 models as described in (Kingma et al., 2014), and evaluate the classification accuracy of the S-VAE and the N-VAE on dynamically binarized MNIST. In the M1 model, a classifier utilizes the latent features obtained using a VAE as in experiment 5.2. The M1+M2 model is constructed by stacking the M2 model on top of M1, where M2 is the result of augmenting the VAE by introducing a partially observed variable y, and combining the ELBO and classification objective. This concatenated model is trained end-to-end.³

³ It is worth noting that in the original implementation by Kingma et al. (2014) the stacked model did not converge well using end-to-end training, and used the extracted features of the M1 model instead.
(a) R^2 latent space of the N-VGAE. (b) Hammer projection of the S^2 latent space of the S-VGAE.
Figure 3: Latent space of the unsupervised N-VGAE and S-VGAE models trained on the Cora citation network. Colors denote document classes, which are not provided during training. (Best viewed in color)
This last model also allows for a combination of the two topologies due to the presence of two distinct latent variables, z1 and z2. Since in the M2 latent space the class assignment is expressed by the variable y, while z2 only needs to capture the style, it naturally follows that the N-VAE is more suited for this objective due to its higher number of variance parameters. Hence, besides comparing the S-VAE against the N-VAE, we additionally run experiments for the M1+M2 model by modeling z1, z2 respectively with a vMF and a normal distribution.

Results  As can be seen in Table 2, for M1 the S-VAE outperforms the N-VAE in all dimensions up to d = 40. This result is amplified for a low number of observed labels. Note that for both models absolute performance drops as the dimensionality increases, since the K-NN used as the classifier suffers from the curse of dimensionality. Besides reconfirming the superiority of the S-VAE in d < 20, its better performance than the N-VAE for d = 20 was unexpected. This indicates that although the log-likelihood might be comparable (see Table 1) for higher dimensions, the S-VAE latent space better captures the cluster structure.

In the concatenated model M1+M2, we first observe in Table 3 that either the pure S-VAE or the S+N-VAE model yields the best results, where the S-VAE almost always outperforms the N-VAE. Our hypothesis regarding the merit of a S+N-VAE model is further confirmed, as displayed by the stable, strong performance across all different dimensions. Furthermore, the clear edge in clusterability of the S-VAE in low dimensional z1, as already observed in Table 2, is again evident. As the dimensionality of z1, z2 increases, the accuracy of the N-VAE improves, reducing the performance gap with the S-VAE. As previously noticed, the S-VAE performance drops when dim z2 = 50, with the best result being obtained for dim z1 = dim z2 = 10. In fact, it is worth noting that for this setting the S-VAE obtains comparable results to the original settings of (Kingma et al., 2014), while needing a considerably smaller latent space. Finally, the end-to-end trained S+N-VAE model is able to reach a significantly higher classification accuracy than the original results reported by Kingma et al. (2014), 96.7±.1.

The M1+M2 model allows for conditional generation. Similarly to (Kingma et al., 2014), we set the latent variable z2 to the value inferred from the test image by the inference network, and then varied the class label y. In Figure 10 of Appendix H we notice that the model is able to disentangle the style from the class.

Table 3: Summary of results of the semi-supervised model M1+M2 on MNIST. Accuracy is reported for 100 observed labels; the columns N+N, S+S, and S+N denote the distributions used for z1 and z2, respectively.

dim z1   dim z2   N+N        S+S        S+N
5        5        90.0±.4    94.0±.1    93.8±.1
5        10       90.7±.3    94.1±.1    94.8±.2
5        50       90.7±.1    92.7±.2    93.0±.1
10       5        90.7±.3    91.7±.5    94.0±.4
10       10       92.2±.1    96.0±.2    95.9±.3
10       50       92.9±.4    95.1±.2    95.7±.1
50       5        92.0±.2    91.7±.4    95.8±.1
50       10       93.0±.1    95.8±.1    97.1±.1
50       50       93.2±.2    94.2±.1    97.4±.1
Figure 4: Overview of von Mises-Fisher sampling procedure. Note that as ω is a scalar, the procedure does not suffer
from the curse of dimensionality.
The general algorithm for sampling from a vMF is outlined in Algorithm 1. The exact form of the univariate distribution g(ω|k) is:

g(ω|k) = C_m(k) ( 2 π^{m/2} exp(ωk) (1 − ω^2)^{(m−3)/2} ) / ( Γ(m/2) B(1/2, (m − 1)/2) ),    (11)

Samples from this distribution are drawn using an acceptance/rejection algorithm when m ≠ 3. The complete procedure is described in Algorithm 2. The Householder reflection (see Algorithm 3 for details) simply finds an orthonormal transformation that maps the modal vector e1 = (1, 0, · · · , 0) to µ. Since an orthonormal transformation preserves distances, all points on the hypersphere remain on its surface after the mapping. Notice that even the transform U z′ = (I − 2uu^T)z′ can be executed in O(m) by rearranging the terms.
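To make the procedure concrete, below is a minimal NumPy sketch of one possible implementation. The rejection step follows the standard Wood (1994)-style scheme, which is consistent with the expressions for b and ω = h(ε, θ) given later in Equations (36) and (41); whether it matches Algorithm 2 line for line is an assumption, and the helper name and the numerical guard for µ = e1 are our own additions.

```python
import numpy as np

def sample_vmf(mu, kappa, rng=None):
    # Draw one sample from vMF(mu, kappa) on S^{m-1}: rejection-sample the scalar
    # omega ~ g(omega|kappa) (Eq. 11), draw a tangent direction uniformly, then
    # Householder-reflect e1 onto mu (cf. Algorithms 1-3).
    rng = np.random.default_rng() if rng is None else rng
    m = mu.shape[0]
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + (m - 1)**2)) / (m - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (m - 1) * np.log(1 - x0**2)

    while True:  # acceptance/rejection step for omega
        eps = rng.beta((m - 1) / 2, (m - 1) / 2)
        w = (1 - (1 + b) * eps) / (1 - (1 - b) * eps)
        if kappa * w + (m - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
            break

    # Uniform direction v on S^{m-2}, combined with the accepted scalar w.
    v = rng.normal(size=m - 1)
    v /= np.linalg.norm(v)
    z_prime = np.concatenate(([w], np.sqrt(1 - w**2) * v))

    # Householder reflection mapping e1 onto mu, executed in O(m).
    e1 = np.zeros(m); e1[0] = 1.0
    u = e1 - mu
    norm = np.linalg.norm(u)
    if norm < 1e-12:          # mu already equals e1; no reflection needed
        return z_prime
    u /= norm
    return z_prime - 2 * np.dot(u, z_prime) * u

# Example: one sample around a mean direction on S^4.
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
print(sample_vmf(mu, kappa=20.0))
```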
Table 5: Expected number of samples needed before acceptance, computed using a Monte Carlo estimate with 1000 samples for varying dimensionality and concentration parameters. Notice that the sampling complexity increases with κ, but decreases as the dimensionality, d, increases.
B KL DIVERGENCE DERIVATION

The KL divergence between a von Mises-Fisher distribution q(z|µ, k) and a uniform distribution on the hypersphere (one divided by the surface area of S^{m−1}), p(z) = ( 2 π^{m/2} / Γ(m/2) )^{−1}, is:

KL[q(z|µ, k) || p(z)] = ∫_{S^{m−1}} q(z|µ, k) log ( q(z|µ, k) / p(z) ) dz    (12)
  = ∫_{S^{m−1}} q(z|µ, k) ( log C_m(k) + k µ^T z − log p(z) ) dz    (13)
  = k µ^T E_q[z] + log C_m(k) − log ( 2 π^{m/2} / Γ(m/2) )^{−1}    (14)
  = k I_{m/2}(k) / I_{m/2−1}(k) + (m/2 − 1) log k − (m/2) log(2π) − log I_{m/2−1}(k)
    + (m/2) log π + log 2 − log Γ(m/2),    (15)

where in the last step we used E_q[z] = ( I_{m/2}(k) / I_{m/2−1}(k) ) µ with µ^T µ = 1, and expanded log C_m(k).
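For reference, Equation (15) can be evaluated directly. The following SciPy sketch is our own (not taken from the paper's code) and uses the exponentially scaled Bessel function ive for numerical stability.

```python
import numpy as np
from scipy.special import ive, gammaln

def kl_vmf_uniform(m, kappa):
    # KL( vMF(mu, kappa) || U(S^{m-1}) ) following Eq. (15), assuming kappa > 0.
    # The exp(-kappa) factors of ive cancel in the Bessel ratio.
    bessel_ratio = ive(m / 2, kappa) / ive(m / 2 - 1, kappa)
    log_norm_const = ((m / 2 - 1) * np.log(kappa)
                      - (m / 2) * np.log(2 * np.pi)
                      - (np.log(ive(m / 2 - 1, kappa)) + kappa))   # log C_m(kappa)
    log_surface_area = np.log(2) + (m / 2) * np.log(np.pi) - gammaln(m / 2)
    return kappa * bessel_ratio + log_norm_const + log_surface_area

print(kl_vmf_uniform(m=10, kappa=5.0))
```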
B.1 GRADIENT OF KL DIVERGENCE

Using

∇_k I_v(k) = (1/2) ( I_{v−1}(k) + I_{v+1}(k) ),    (16)

and

∇_k log C_m(k) = ∇_k ( (m/2 − 1) log k − (m/2) log(2π) − log I_{m/2−1}(k) )    (17)
  = − I_{m/2}(k) / I_{m/2−1}(k),    (18)

then

∇_k KL[q(z|µ, k) || p(z)] = ∇_k ( k I_{m/2}(k) / I_{m/2−1}(k) ) + ∇_k log C_m(k)    (19)
  = I_{m/2}(k) / I_{m/2−1}(k) + k ∇_k ( I_{m/2}(k) / I_{m/2−1}(k) ) − I_{m/2}(k) / I_{m/2−1}(k)    (20)
  = (1/2) k ( I_{m/2+1}(k) / I_{m/2−1}(k) − I_{m/2}(k) ( I_{m/2−2}(k) + I_{m/2}(k) ) / I_{m/2−1}(k)^2 + 1 ),    (21)

Notice that we can use the exponentially scaled Bessel function, I^{exp}_{m/2}(k) = exp(−k) I_{m/2}(k), for numerical stability.
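A corresponding sketch of Equation (21), again in SciPy and again our own: every term is a ratio of Bessel functions of the same argument, so the exp(−k) factors of ive cancel exactly, which is the numerical-stability trick noted above.

```python
import numpy as np
from scipy.special import ive

def grad_kl_wrt_kappa(m, kappa):
    # d KL / d kappa following Eq. (21); ive replaces I_v so that the
    # exponential scaling cancels in each ratio.
    i_low = ive(m / 2 - 1, kappa)
    return 0.5 * kappa * (ive(m / 2 + 1, kappa) / i_low
                          - ive(m / 2, kappa) * (ive(m / 2 - 2, kappa) + ive(m / 2, kappa)) / i_low**2
                          + 1.0)

print(grad_kl_wrt_kappa(m=10, kappa=5.0))
```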
C PROOF OF LEMMA 2

Lemma 3 (2). Let f be any measurable function and ε ∼ π_1(ε|θ) = ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) s(ε) the distribution of the accepted sample. Also let v ∼ π_2(v), and T a transformation that depends on the parameters such that if z = T(ω, v; θ) with ω ∼ g(ω|θ), then z ∼ q(z|θ):

E_{(ε,v)∼π_1(ε|θ)π_2(v)} [ f(T(h(ε, θ), v; θ)) ] = ∫ f(z) q(z|θ) dz = E_{q(z|θ)}[ f(z) ],    (22)
Proof.

E_{(ε,v)∼π_1(ε|θ)π_2(v)} [ f(T(h(ε, θ), v; θ)) ] = ∬ f(T(h(ε, θ), v; θ)) π_1(ε|θ) π_2(v) dε dv,    (23)

Using the same argument employed by Naesseth et al. (2017), we can apply the change of variables ω = h(ε, θ) and rewrite the expression as:

  = ∬ f(T(ω, v; θ)) g(ω|θ) π_2(v) dω dv = ∫ f(z) q(z|θ) dz,    (24)

where the final equality follows from the definition of T, since z = T(ω, v; θ) with ω ∼ g(ω|θ) implies z ∼ q(z|θ).
We can then proceed as in Equation 8, using Lemma 2 and the log-derivative trick to compute the gradient of the expectation term ∇_θ E_{q(z|θ)}[f(z)]:

∇_θ E_{q(z|θ)}[f(z)] = ∇_θ ∬ f(T(h(ε, θ), v; θ)) π_1(ε|θ) π_2(v) dε dv    (25)
  = ∇_θ ∬ f(T(h(ε, θ), v; θ)) s(ε) ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) π_2(v) dε dv    (26)
  = ∬ s(ε) π_2(v) ∇_θ ( f(T(h(ε, θ), v; θ)) g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv    (27)
  = ∬ s(ε) π_2(v) ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) ∇_θ ( f(T(h(ε, θ), v; θ)) ) dε dv
    + ∬ s(ε) π_2(v) f(T(h(ε, θ), v; θ)) ∇_θ ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv    (28)
  = ∬ π_1(ε|θ) π_2(v) ∇_θ ( f(T(h(ε, θ), v; θ)) ) dε dv
    + ∬ s(ε) π_2(v) f(T(h(ε, θ), v; θ)) ∇_θ ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) dε dv    (29)
  = E_{(ε,v)∼π_1(ε|θ)π_2(v)} [ ∇_θ f(T(h(ε, θ), v; θ)) ]    [= g_rep]
    + E_{(ε,v)∼π_1(ε|θ)π_2(v)} [ f(T(h(ε, θ), v; θ)) ∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) ],    (30)    [second term = g_cor]

where g_rep is the reparameterization term and g_cor the correction term. Since h is invertible in ε, Naesseth et al. (2017) show that ∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) in g_cor simplifies to:

∇_θ log ( g(h(ε, θ)|θ) / r(h(ε, θ)|θ) ) = ∇_θ log g(h(ε, θ)|θ) + ∇_θ log | ∂h(ε, θ)/∂ε |,    (31)
In our specific case, we want to take the gradient w.r.t. θ of the expression:

E_{q_ψ(z|x;θ)} [ log p_φ(x|z) ],  where θ = (µ, κ),    (32)

The gradient can be computed using Lemma 2 and the subsequent gradient derivation with f(z) = log p_φ(x|z). As specified in Section 3.4, we optimize unbiased Monte Carlo estimates of the gradient. Therefore, for a fixed datapoint x and a sample (ε, v) ∼ π_1(ε|θ) π_2(v), the correction term is:

g_cor ≈ log p_φ(x|T(h(ε, θ), v; θ)) ( ∇_θ log g(h(ε, θ)|θ) + ∇_θ log | ∂h(ε, θ)/∂ε | ),    (35)

while g_rep is simply the gradient of the reconstruction loss w.r.t. θ and can be easily handled by automatic differentiation packages.

Regarding g_cor, we notice that the terms g(·) and h(·) do not depend on µ. Thus the g_cor term w.r.t. µ is 0, and all the following calculations will be only w.r.t. κ. We therefore have:

∂h(ε, k)/∂ε = −2b / ( (b − 1)ε + 1 )^2,  where  b = ( −2k + √(4k^2 + (m − 1)^2) ) / (m − 1),    (36)
and

∇_κ log g(ω|k) = ∇_κ ( log C_m(k) + ωk + (1/2)(m − 3) log(1 − ω^2) )    (37)
  = ∇_k log C_m(k) + ∇_κ ( ωk + (1/2)(m − 3) log(1 − ω^2) ).    (38)

So, putting everything together we have:

g_cor = log p_φ(x|z) · ( − I_{m/2}(k) / I_{m/2−1}(k) + ∇_κ ( ωκ + (1/2)(m − 3) log(1 − ω^2) + log | −2b / ((b − 1)ε + 1)^2 | ) ),    (39)

where

b = ( −2k + √(4k^2 + (m − 1)^2) ) / (m − 1)    (40)
ω = h(ε, θ) = ( 1 − (1 + b)ε ) / ( 1 − (1 − b)ε )    (41)
z = T(h(ε, θ), v; θ),    (42)

and the term ∇_κ ( ωκ + (1/2)(m − 3) log(1 − ω^2) + log | −2b / ((b − 1)ε + 1)^2 | ) can be computed by automatic differentiation packages.
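As an illustration of that last remark, here is a minimal PyTorch sketch, entirely our own, that obtains the ∇_κ term of Equation (39) by automatic differentiation; eps and kappa are assumed to be scalar tensors.

```python
import torch

def grad_kappa_term(eps, kappa, m):
    # d/d(kappa) of  omega*kappa + 0.5*(m - 3)*log(1 - omega^2) + log|dh/d(eps)|,
    # i.e. the bracketed term of Eq. (39), with b and omega as in Eqs. (40)-(41).
    kappa = kappa.clone().requires_grad_(True)
    b = (-2 * kappa + torch.sqrt(4 * kappa**2 + (m - 1)**2)) / (m - 1)
    omega = (1 - (1 + b) * eps) / (1 - (1 - b) * eps)
    log_det = torch.log(torch.abs(-2 * b / ((b - 1) * eps + 1)**2))
    term = omega * kappa + 0.5 * (m - 3) * torch.log(1 - omega**2) + log_det
    (grad,) = torch.autograd.grad(term, kappa)
    return grad

# Example with an accepted epsilon in (0, 1).
print(grad_kappa_term(torch.tensor(0.3), torch.tensor(10.0), m=10))
```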
Figure 6: Plot of the unit hyperspherical surface area against dimensionality. The surface area has a maximum for
m = 7.
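The behavior shown in Figure 6 is easy to reproduce from Equation (10) with r = 1; the helper name below is hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def unit_sphere_area(m):
    # Surface area of S^{m-1} with r = 1, from Eq. (10): 2 * pi^(m/2) / Gamma(m/2).
    return np.exp(np.log(2) + (m / 2) * np.log(np.pi) - gammaln(m / 2))

ms = np.arange(1, 101)
areas = unit_sphere_area(ms)
print("dimension with maximal area:", ms[np.argmax(areas)])  # m = 7
print("area at m = 20:", areas[ms == 20][0])                 # already small and shrinking fast
```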
Architecture and hyperparameters  For both the encoder and the decoder we use MLPs with 2 hidden layers of [256, 128] and [128, 256] hidden units, respectively. We trained until convergence using early stopping with a look-ahead of 50 epochs. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3, and mini-batches of size 64. Additionally, we used a linear warm-up for 100 epochs (Bowman et al., 2016). The weights of the neural network were initialized according to (Glorot and Bengio, 2010).
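For concreteness, a hedged PyTorch sketch of encoder/decoder MLPs with these layer sizes follows. The output heads (a normalized mean direction plus a softplus-positive concentration shifted by +1) and the ReLU activations are our own assumptions, not a verbatim reproduction of the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVAEEncoder(nn.Module):
    # Encoder with [256, 128] hidden units producing (mu, kappa) for a vMF posterior.
    def __init__(self, x_dim=784, z_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU())
        self.fc_mu = nn.Linear(128, z_dim)
        self.fc_kappa = nn.Linear(128, 1)

    def forward(self, x):
        h = self.net(x)
        mu = F.normalize(self.fc_mu(h), dim=-1)      # unit vector on the hypersphere
        kappa = F.softplus(self.fc_kappa(h)) + 1.0   # strictly positive concentration (assumed shift)
        return mu, kappa

class Decoder(nn.Module):
    # Decoder with [128, 256] hidden units mapping z to Bernoulli logits over pixels.
    def __init__(self, x_dim=784, z_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, z):
        return self.net(z)
```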
Architecture and Hyperparameters  For M1 we reused the trained models of the previous experiment, and used K-nearest neighbors (K-NN) as a classifier with k = 5. In the N-VAE case we used the Euclidean distance as the distance metric. For the S-VAE the geodesic distance arccos(x^T y) was employed. The performance was evaluated for N = [100, 600, 1000] observed labels.
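A small scikit-learn sketch of this classifier setup; the random unit vectors below are stand-ins for the actual latent codes produced by the trained encoders, and the clipping inside the geodesic metric is our own numerical safeguard.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def geodesic(x, y):
    # Arc length between two unit vectors; clip guards against rounding outside [-1, 1].
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

# Placeholder latent codes: random points on S^9 with random labels.
rng = np.random.default_rng(0)
z_train = rng.normal(size=(1000, 10))
z_train /= np.linalg.norm(z_train, axis=1, keepdims=True)
y_train = rng.integers(0, 10, size=1000)

knn = KNeighborsClassifier(n_neighbors=5, metric=geodesic)  # Euclidean metric for the N-VAE case
knn.fit(z_train, y_train)
print(knn.predict(z_train[:5]))
```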
The stacked M1+M2 model uses the same architecture as outlined by Kingma et al. (2014), where the MLPs utilized in the generative and inference models are constructed using a single hidden layer, each with 500 hidden units. The latent space dimensionalities of z1, z2 were both varied in [5, 10, 50]. We used the rectified linear unit (ReLU) as the activation function. Training was continued until convergence using early stopping with a look-ahead of 50 epochs on the validation set. We used the Adam optimizer with a learning rate of 1e-3, and mini-batches of size 100. All neural network weights were initialized according to (Glorot and Bengio, 2010). N was set to 100, and the α parameter used to scale the classification loss was chosen from [0.1, 1.0]. Crucially, we train this model end-to-end instead of by parts.
Architecture and Hyperparameters  We train a Variational Graph Auto-Encoder (VGAE) model, a state-of-the-art link prediction model for graphs, as proposed in Kipf and Welling (2016). For a fair comparison, we use the same architecture as in the original paper and only change the way the latent space is generated, using the vMF distribution instead of a normal distribution. All models are trained for 200 epochs on Cora and Citeseer, and 400 epochs on Pubmed, with the Adam optimizer. The optimal learning rate lr ∈ {0.01, 0.005, 0.001}, dropout rate p_do ∈ {0, 0.1, 0.2, 0.3, 0.4}, and number of latent dimensions d_z ∈ {8, 16, 32, 64} are determined via grid search based on validation AUC performance. For the S-VGAE, we omit the d_z = 64 setting as some of our experiments ran out of memory. The model is trained with a single hidden layer with 32 units and with document features as input, as in Kipf and Welling (2016). The weights of the neural network were initialized according to (Glorot and Bengio, 2010). For testing, we report the performance of the model selected from the training epoch with the highest AUC score on the validation set. Different from (Kipf and Welling, 2016), we train both the N-VGAE and the S-VGAE models using negative sampling in order to speed up training, i.e., for each positive link we sample, uniformly at random, one negative link during every training epoch. All experiments are repeated 5 times, and we report mean and standard error values.
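A possible implementation of this per-epoch negative sampling step is sketched below; the helper name is hypothetical, and skipping self-loops and already-existing edges is our own (optional) refinement beyond what the text specifies.

```python
import numpy as np

def sample_negative_links(pos_edges, num_nodes, rng=None):
    # Draw one candidate negative link per positive link, uniformly at random,
    # to be resampled at every training epoch.
    rng = np.random.default_rng() if rng is None else rng
    existing = {tuple(e) for e in pos_edges}
    negatives = []
    for i, _ in pos_edges:
        while True:
            j = int(rng.integers(num_nodes))
            if j != i and (i, j) not in existing and (j, i) not in existing:
                negatives.append((i, j))
                break
    return np.array(negatives)

# Example with a toy edge list.
edges = [(0, 1), (1, 2), (2, 3)]
print(sample_negative_links(edges, num_nodes=5))
```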
Dataset statistics are summarized in Table 6. Final hyperparameter choices found via grid search on the validation splits
are summarized in Table 7.
Figure 7: Random samples from the N-VAE on MNIST for different dimensionalities of the latent space.
Figure 8: Random samples from the S-VAE on MNIST for different dimensionalities of the latent space.
(a) N-VAE (b) S-VAE
Figure 9: Visualization of the 2-dimensional manifold of MNIST for both the N-VAE and the S-VAE. Notice that the N-VAE has a clear center and all digits are spread around it. Conversely, in the S-VAE all digits occupy the entire space and there is a sense of continuity from left to right.
Figure 10: Visualization of handwriting styles learned by the model, using conditional generation on MNIST of M1+M2 with dim(z1) = 50, dim(z2) = 50, S+N. Following Kingma et al. (2014), the leftmost column shows images from the test set. The other columns show analogical fantasies of x by the generative model, where in each row the latent variable z2 is set to the value inferred from the test image by the inference network and the class label y is varied per column.