Riemannian Diffusion Models
Abstract
Diffusion models are recent state-of-the-art methods for image generation and
likelihood estimation. In this work, we generalize continuous-time diffusion mod-
els to arbitrary Riemannian manifolds and derive a variational framework for
likelihood estimation. Computationally, we propose new methods for computing
the Riemannian divergence which is needed in the likelihood estimation. More-
over, in generalizing the Euclidean case, we prove that maximizing this variational
lower-bound is equivalent to Riemannian score matching. Empirically, we demon-
strate the expressive power of Riemannian diffusion models on a wide spectrum
of smooth manifolds, such as spheres, tori, hyperboloids, and orthogonal groups.
Our proposed method achieves new state-of-the-art likelihoods on all benchmarks.
1 Introduction
By learning to transmute noise, generative models seek to uncover the underlying generative
factors that give rise to observed data. These factors can often be cast as inherently geometric
quantities as the data itself need not lie on a flat Euclidean space. Indeed, in many scientific
domains such as high-energy physics (Brehmer & Cranmer, 2020), directional statistics (Mardia &
Jupp, 2009), geoscience (Mathieu & Nickel, 2020), computer graphics (Kazhdan et al., 2006), and
linear biopolymer modeling such as protein and RNA (Mardia et al., 2008; Boomsma et al., 2008;
Frellsen et al., 2009), data is best represented on a Riemannian manifold with a non-zero curvature.
Naturally, to effectively capture the generative factors of these data, we must take into account the
geometry of the space when designing a learning framework.
Recently, diffusion based generative models have emerged as an attractive model class that not
only achieve likelihoods comparable to state-of-the-art autoregressive models (Kingma et al., 2021)
but match the sample quality of GANs without the pains of adversarial optimization (Dhariwal &
Nichol, 2021). Succinctly, a diffusion model consists of a fixed Markov chain that progressively
transforms data to a prior defined by the inference path, and a generative model which is another
Markov chain that is learned to invert the inference process (Ho et al., 2020; Song et al., 2021b).
While conceptually simple, the learning framework can have a variety of perspectives and goals.
For example, Huang et al. (2021) provide a variational framework for general continuous-time
diffusion processes on Euclidean manifolds as well as a functional Evidence Lower Bound (ELBO)
that can be equivalently shown to be minimizing an implicit score matching objective. At present,
however, much of the success of diffusion based generative models and its accompanying variational
framework is purpose built for Euclidean spaces, and more specifically, image data. It does not
easily translate to general Riemannian manifolds.
In this paper, we introduce Riemannian Diffusion Models (RDM)—generalizing conventional
diffusion models on Euclidean spaces to arbitrary Riemannian manifolds. Departing from diffusion
models on Euclidean spaces, our approach uses the Stratonovich SDE formulation, for which the usual chain rule of calculus holds (see Section 2.2).
2 Background
In this section, we provide the necessary background on diffusion models and key concepts
from Riemannian geometry that we utilize to build RDMs. For a short review of the latter, see
Appendix A or Ratcliffe (1994) for a more comprehensive treatment of the subject matter.
2.1 Euclidean diffusion models
A diffusion model can be defined as the solution to the (Itô) SDE (Øksendal, 2003),
dX = µ dt + σ dBt , (1)
with the initial condition X0 following some unstructured prior p0 such as the standard normal
distribution, where Bt is a standard Brownian motion, and µ and σ are the drift and diffusion coef-
ficients of the diffusion process, which control the deterministic forces driving the evolution and the
amount of noise injected at each time step. This provides us a way to sample from the model, via
numerically solving the dynamics from t = 0 to t = T for some fixed termination time T . To train
the model via maximum likelihood, we require an expression for the log marginal density of XT ,
denoted by log p(x, T ), which is generally intractable.
The marginal likelihood can be represented using a stochastic instantaneous change-of-variable for-
mula, by applying the Feynman-Kac theorem to the Fokker-Planck PDE of the density. An applica-
tion of Girsanov’s theorem followed by an application of Jensen’s inequality leads to the following
variational lower bound (Huang et al., 2021; Song et al., 2021a):
    log p(x, T) ≥ E[ log p0(YT) − ∫_0^T ( (1/2)‖a(Ys, s)‖₂² + ∇ · µ(Ys, T − s) ) ds | Y0 = x ],    (2)
where a is the variational degree of freedom, ∇· denotes the (Euclidean) divergence operator, and
Ys follows the inference SDE (the generative coefficients are evaluated in reversed time, i.e. T − s)
with B̂s being another Brownian motion. This is known as the continuous-time evidence lower
bound, or the CT-ELBO for short.
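To make the simulation step concrete, here is a minimal sketch (not the paper's implementation) of sampling from the generative SDE (1) with the Euler-Maruyama scheme; the drift, diffusion, prior, and step count below are illustrative placeholders.

```python
# A minimal sketch of sampling from the generative SDE (1),
# dX = mu dt + sigma dB_t, using the Euler-Maruyama scheme.
# `mu`, `sigma`, the prior, and the step count are illustrative placeholders.
import torch

def euler_maruyama(mu, sigma, x0, T=1.0, n_steps=1000):
    x, dt = x0, T / n_steps
    for i in range(n_steps):
        t = i * dt
        dB = torch.randn_like(x) * dt ** 0.5        # Brownian increment
        x = x + mu(x, t) * dt + sigma(x, t) * dB    # Euler-Maruyama update
    return x

# Toy usage: mean-reverting drift, unit diffusion, standard normal prior p_0.
x0 = torch.randn(16, 2)
x_T = euler_maruyama(lambda x, t: -x, lambda x, t: 1.0, x0)
```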
2.2 Riemannian manifolds
Unlike Euclidean spaces, Riemannian manifolds generally do not possess a vector space structure.
This prevents the direct application of the usual (stochastic) calculus. We can resolve this by defining
the process via test functions. Specifically, let Vk be a family of smooth vector fields on M, and let
Z^k be a family of semimartingales (Protter, 2005). Symbolically, we write
    dXt = Σ_k Vk(Xt) ∘ dZt^k    if    df(Xt) = Σ_k Vk(f)(Xt) ∘ dZt^k    (4)
for any f ∈ C ∞ (M) (Hsu, 2002). The ◦ in the second differential equation is to be interpreted in
the Stratonovich sense (Protter, 2005). The use of the Stratonovich integral is the first step deviating
from the Euclidean diffusion model (1), as the Itô integral does not follow the usual chain rule.
Working with this abstract definition is not always convenient, so instead we work with specific coordinates of M. Let ϕ be a chart, and let ṽ = (ṽ_jk) be a matrix representing the coefficients of Vk in the coordinate basis, i.e. Vk(f) = Σ_{j=1}^d ṽ_jk ∂/∂x̃j (f ∘ ϕ⁻¹)|_{x̃=ϕ(x)}. This allows us to write dϕ(Xt) = ṽ ∘ dZ. Similarly, suppose M is a submanifold embedded in R^m, and denote by v = (v_ik) the coefficients wrt the Euclidean basis. v and ṽ are related by v = (dϕ⁻¹/dx̃) ṽ. Then we can express the dynamics of X as a regular SDE using the Euclidean space's coefficients, dX = v ∘ dZ. Notably, by the relation between v and ṽ, the column vectors of v are required to lie in the span of the column vectors of the Jacobian dϕ⁻¹/dx̃, which restricts the dynamics to move tangentially on M.
Our first result is a stochastic instantaneous change-of-variable formula for the Riemannian SDE by
applying the Feynman-Kac theorem to the Fokker Planck PDE of the density p(x, t).
¹The multiplication is interpreted similarly to matrix-vector multiplication, i.e. V ∘ dBt = Σ_{k=1}^w Vk ∘ dBt^k.
Theorem 1 (Marginal Density). The density p(x, t) of the SDE (5) can be written as
    p(x, t) = E[ p0(Yt) exp( −∫_0^t ∇g · ( V0 − (1/2)(V · ∇g)V ) ds ) | Y0 = x ]    (7)
where the expectation is taken wrt the following process induced by a Brownian motion B′s:
    dY = ( −V0 + (V · ∇g)V ) ds + V ∘ dB′s.    (8)
For effective likelihood maximization, we require access to log p and its gradient. Towards this goal,
we prove the following Riemannian CT-ELBO which serves as our training objective and follows
from an application of change of measure (Girsanov’s theorem) and Jensen’s inequality.
Theorem 2 (Riemannian CT-ELBO). Let B̂s be a w-dimensional Brownian motion, and let
Ys be a process solving the following
Inference SDE: dY = (−V0 + (V · ∇g )V + V a) ds + V ◦ dB̂s , (9)
where a : R^m × [0, T] → R^m is the variational degree of freedom. Then we have
    log p(x, T) ≥ E[ log p0(YT) − ∫_0^T ( (1/2)‖a(Ys, s)‖₂² + ∇g · ( V0 − (1/2)(V · ∇g)V ) ) ds | Y0 = x ],    (10)
where all the generative degrees of freedom Vk are evaluated in the reversed time direction.
Similar to the Euclidean case, computing the Riemannian CT-ELBO requires computing the diver-
gence “∇g ·” of a vector field, which can be achieved by applying the following identity.
Proposition 1 (Riemannian divergence identity). Let (M, g) be a d-dimensional Rieman-
nian manifold. For any smooth vector field Vk ∈ X(M), the following identity holds:
    ∇g · Vk = Σ_{j=1}^d ⟨∇_{Ẽj} Vk, Ẽ^j⟩_g.    (11)
Intrinsic coordinates. The patch-space formula (6) can be used to compute the Riemannian divergence. This view was adopted by Mathieu & Nickel (2020), who combined the Hutchinson trace identity with the intrinsic coordinate formula to estimate the divergence. The drawbacks of this framework include: (1) obtaining local coordinates may be difficult for some manifolds, hindering generality in practice; (2) we might need to change patches, which complicates implementations; and (3) the inverse scaling by √|G| might result in numerical instability and high variance.
Closest-point projection. The coordinate-free expression (11) leads to the closest-point projection method proposed by Rozen et al. (2021). Concretely, define the closest-point projection by π(x) := arg min_{y∈M} ‖x − y‖, where ‖·‖ is the Euclidean norm. Let Vk(x) be the derivation corresponding to the ambient space vector vk(x) = P_{π(x)} u(π(x)) for some unconstrained u : R^m → R^m. Rozen et al. (2021) showed that ∇g · Vk(x) = ∇ · vk(x), since vk is infinitesimally constant in the direction normal to Tx M. This allows us to compute the divergence directly in the ambient space. However, the closest-point projection map π may not always be easily obtained.
QR decomposition. An alternative to the closest-point projection is to instead search for an orthogonal basis for Tx M. Let Q = [e1, · · · , ed, n1, · · · , n_{m−d}] be an orthogonal matrix whose first d columns span Tx M, and whose remaining m − d columns span its orthogonal complement Tx M⊥. To construct Q we can simply sample d vectors in the ambient space, e.g. from N(0, 1), and orthogonally project them onto Tx M using Px. These vectors, although not orthogonal yet, form a basis for Tx M. Next we concatenate them with m − d random vectors and apply a simple QR decomposition to retrieve an orthogonal basis. Using Q we may rewrite equation (12) as follows:
    (∇g · Vk)(x) = tr( Q Qᵀ Px (dvk/dx) Px ) = tr( (Px Q)ᵀ (dvk/dx) (Px Q) ) = Σ_{j=1}^d ejᵀ (dvk/dx) ej    (13)
where we used (1) the orthogonality of Q, (2) the cyclic property of the trace, and (3) the fact that Px ej = ej and Px nj = 0. In practice, concatenation with the remaining m − d vectors is not needed as they are effectively not used in computing the divergence, speeding up computation when m ≫ d. Moreover, each vector-Jacobian product can be computed in O(m) time using reverse-mode autograd, and importantly the closest-point projection π is not required.
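The estimator in (13) maps directly onto reverse-mode autodiff. The sketch below assumes an ambient, tangent-valued vector field v : R^m → R^m and a tangential projection matrix P(x); both are placeholders for the reader's own manifold and model, not the paper's code.

```python
# Sketch of the QR-based divergence (13): project d Gaussian vectors onto
# T_x M, orthonormalize them with QR, and sum vector-Jacobian products.
# `v` (ambient vector field) and `P` (projection matrix function) are
# assumptions standing in for the model and manifold at hand.
import torch

def divergence_qr(v, P, x, d):
    m = x.shape[-1]
    basis, _ = torch.linalg.qr(P(x) @ torch.randn(m, d))  # e_1, ..., e_d
    div = x.new_zeros(())
    for j in range(d):
        e = basis[:, j]
        _, vjp = torch.autograd.functional.vjp(v, x, e)   # e^T (dv/dx)
        div = div + torch.dot(vjp, e)                      # e^T (dv/dx) e
    return div

# Toy usage on S^2 embedded in R^3, with v(x) = P_x (A x) for a fixed linear map A.
A = torch.randn(3, 3)
P_sphere = lambda x: torch.eye(3) - torch.outer(x, x) / x.dot(x)
v = lambda x: P_sphere(x) @ (A @ x)
x = torch.randn(3); x = x / x.norm()
print(divergence_qr(v, P_sphere, x, d=2))
```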
Projected Hutchinson. When QR is too expensive for higher dimensional problems, the Hutchin-
son trace estimator (Hutchinson, 1989) can be employed within the extrinsic view representa-
tion (12). For example, let z be a standard normal vector (or a Rademacher vector); then we have (∇g · Vk)(x) = E_{z∼N, z′=Px z}[ z′ᵀ (dvk/dx) z′ ]. Different from a direct application of the trace estimator to the closest-point method, we directly project the random vector onto the tangent subspace. Therefore, the closest-point projection is again not needed.
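A corresponding sketch of the projected Hutchinson estimator, under the same placeholder v and P as in the previous snippet; the number of probes is an arbitrary choice.

```python
# Sketch of the projected Hutchinson estimator: draw a Gaussian (or
# Rademacher) probe z, project it onto T_x M, and take z'^T (dv/dx) z'.
import torch

def divergence_hutchinson(v, P, x, n_probes=8):
    est = x.new_zeros(())
    for _ in range(n_probes):
        z = P(x) @ torch.randn_like(x)                    # z' = P_x z
        _, vjp = torch.autograd.functional.vjp(v, x, z)   # z'^T (dv/dx)
        est = est + torch.dot(vjp, z)                      # z'^T (dv/dx) z'
    return est / n_probes
```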
Following prior work (Sohl-Dickstein et al., 2015; Ho et al., 2020; Huang et al., 2021), we let the
inference SDE (9) be defined as a simple noise process taking observed data to unstructured noise:
dY = U0 dt + V ◦ dB̂s , (14)
where U0 = (1/2)∇g log p0 and V is the tangential projection matrix; that is, Vk(f)(x) = Σ_{j=1}^m (Px)_jk ∂f/∂xj for any smooth function f. This is known as the Riemannian Langevin diffusion (Girolami & Calderhead, 2011). As long as p0 satisfies a log-Sobolev inequality, the marginal
distribution of Ys (i.e. the aggregated posterior) converges to p0 at a linear rate in the KL divergence
(Wang et al., 2020). For compact manifolds, we set p0 to be the uniform density, which means
U0 = 0, and (14) is reduced to the extrinsic construction of Brownian motion on M (Hsu, 2002,
Section 1.2). The benefits of this fixed-inference parameterization are the following:
Stable and Efficient Training. With the fixed-inference parameterization we do not need to opti-
mize the vector fields that generate Ys , and the Riemannian CT-ELBO can be rewritten as:
    E[log p0(YT)] − ∫_0^T E_{Ys}[ (1/2)‖a(Ys, s)‖₂² + ∇g · ( V0 − (1/2)(V · ∇g)V ) | Y0 = x ] ds,    (15)
where the first term is a constant wrt the model parameters (or it can be optimized separately if we
want to refine the prior), and the time integral of the second term can be estimated via importance
sampling (see Section 3.3). A sample of Ys can be drawn cheaply by numerically integrating (14),
without requiring a stringent error tolerance (see Section 5.2 for an empirical analysis), which allows
us to estimate the time integral in (15) by evaluating a(Ys , s) at a single time step s only.
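As an illustration of how cheaply Ys can be drawn, the sketch below integrates (14) on the 2-sphere with a uniform prior (so U0 = 0): each Gaussian increment is projected onto the tangent space and the iterate is renormalized back onto the sphere. The renormalization retraction and the step count are implementation assumptions rather than prescriptions from the paper.

```python
# Sketch of numerically integrating the inference SDE (14) on S^2 with a
# uniform prior (U_0 = 0): projected Euler steps followed by renormalization.
import torch

def diffuse_on_sphere(y0, s=1.0, n_steps=100):
    y, ds = y0, s / n_steps
    for _ in range(n_steps):
        dB = torch.randn_like(y) * ds ** 0.5
        dB_tan = dB - (dB * y).sum(-1, keepdim=True) * y  # P_y dB for unit-norm y
        y = y + dB_tan
        y = y / y.norm(dim=-1, keepdim=True)              # retract onto the sphere
    return y

y0 = torch.randn(32, 3)
y0 = y0 / y0.norm(dim=-1, keepdim=True)   # a batch of "data" points on S^2
y_s = diffuse_on_sphere(y0, s=0.5)        # noisy samples at an intermediate time
```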
Simplified Riemannian CT-ELBO. The CT-ELBO can be simplified, since the differential operator V · ∇g applied to V yields the zero vector when V is the tangential projection.
Proposition 2. If V is the tangential projection matrix, then (V · ∇g)V = 0.
This means that we can express the generative SDE drift V0 using the variational parameter a, via
    dX = ( V a(X, T − t) − U0(X, T − t) ) dt + V ∘ dB̂t,    (16)
with the corresponding Riemannian CT-ELBO:
    E[log p0(YT)] − ∫_0^T E_{Ys}[ (1/2)‖a‖₂² + ∇g · (V a − U0) | Y0 = x ] ds.    (17)
The inference process can be more generally defined to account for a time reparameterization. In
fact, this leads to an equivalent model if one can find an invariant representation of the temporal
variable. Learning this time rescaling can help to reduce variance (Kingma et al., 2021).
In principle, we can adopt the same methodology, but this would further complicate the parameteri-
zation of the model. Alternatively, we opt for a simpler view of variance reduction via importance sampling. We estimate the time integral ∫ … ds in (17) using the following estimator:
    I := (1/q(s)) [ (1/2)‖a‖₂² + ∇g · (V a − U0) ],    where s ∼ q(s) and Ys ∼ q(Ys | Y0),    (18)
where q(s) is a proposal density supported on [0, T ]. We parameterize q(s) using a 1D monotone
flow (Huang et al., 2018). As the expected value of this estimator is the same as the time integral
in (17), it is unbiased. However, this means we cannot train the proposal distribution q(s) by max-
imizing this objective, since the gradient wrt the parameters of q(s) is zero in expectation. Instead,
we minimize the variance of the estimator by following the stochastic gradient wrt q(s)
    ∇_{q(s)} Var(I) = ∇_{q(s)} E[I²] − ∇_{q(s)} E[I]² = ∇_{q(s)} E[I²].    (19)
The latter can be optimized using the reparameterization trick (Kingma & Welling, 2014) and is a
well-known variance reduction method in a multitude of settings (Luo et al., 2020; Tucker et al.,
2017). It can be seen as minimizing the χ2 -divergence from a density proportional to the magnitude
of EYs [I] (Dieng et al., 2017; Müller et al., 2019).
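The sketch below illustrates the estimator (18) and the variance objective (19) on a toy problem: a closed-form integrand stands in for the actual ELBO term so the example is self-contained, and a single-parameter truncated-exponential proposal stands in for the monotone flow used in the paper.

```python
# Toy sketch of importance sampling the time integral (18) and training the
# proposal by minimizing E[I^2], eq. (19). The integrand below stands in for
# 0.5*||a||^2 + div_g(Va - U_0); the truncated-exponential proposal is an
# assumption replacing the paper's monotone flow.
import torch

def integrand(s):                       # placeholder for the ELBO integrand
    return torch.exp(-4.0 * s)

log_rate = torch.zeros(1, requires_grad=True)

def sample_proposal(T=1.0):
    u = torch.rand(1)
    rate = log_rate.exp()
    z = 1.0 - torch.exp(-rate * T)
    s = -torch.log1p(-u * z) / rate                  # reparameterized s ~ q(s) on [0, T]
    q_s = rate * torch.exp(-rate * s) / z            # proposal density q(s)
    return s, q_s

opt = torch.optim.Adam([log_rate], lr=1e-2)
for _ in range(500):
    s, q_s = sample_proposal()
    I = integrand(s) / q_s              # unbiased estimate of the time integral
    opt.zero_grad()
    (I ** 2).mean().backward()          # gradient of E[I^2] equals gradient of Var(I)
    opt.step()
```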
In the Euclidean case, it can be shown that maximizing the variational lower bound of the fixed-
inference diffusion model (16) is equivalent to score matching (Ho et al., 2020; Huang et al., 2021;
Song et al., 2021a). In this section, we extend this connection to its Riemannian counterpart.
Let q(ys , s) be the density of Ys following (14), marginalizing out the data distribution q(y0 , 0).
The score function is the Riemannian gradient of the log-density ∇g log q. The following theorem
tells us that we can create a family of inference and generative SDEs that induce the same marginal
distributions over Ys and XT −s as (16) if we have access to its score.
Theorem 3 (Marginally equivalent SDEs). For λ ≤ 1, the marginal distributions of XT −s
and Ys of the processes defined as below
    dY = ( U0 − (λ/2) ∇g log q ) ds + √(1 − λ) V ∘ dB̂s,    Y0 ∼ q(·, 0)    (20)
    dX = ( (1 − λ/2) ∇g log q − U0 ) dt + √(1 − λ) V ∘ dB̂t,    X0 ∼ q(·, T)    (21)
both have the density q(·, s). In particular, λ = 1 gives rise to an equivalent ODE.
This suggests if we can approximate the score function, and plug it into the reverse process (21), we
obtain a time-reversed process that induces approximately the same marginals.
Theorem 4 (Score matching equivalency). For λ < 1, let Eλ∞ denote the Riemannian CT-
ELBO of the generative process (21), with ∇g log q replaced by an approximate score Sθ , and
with (20) being the inference SDE. Assume Sθ is a compactly supported smooth vector field. Then
    E_{Y0}[E_λ^∞] = −C1 ∫_0^T E_{Ys}[ ‖Sθ − ∇g log q‖²_g ] ds + C2    (22)
Figure 1: Density of models trained on earth datasets. Red dots are samples from the test set.
where C1 > 0 and C2 are constants wrt θ.
The first implication of the theorem is that maximizing the Riemannian CT-ELBO of the plug-in
reverse process is equivalent to minimizing the Riemannian score-matching loss. Second, if we set
λ = 0, from (135) (in the appendix), we have V a = Sθ , which is exactly the fixed-inference training
in §3.2. That is, the vector V a trained using equation (17) is actually an approximate score, allowing
us to extract an equivalent ODE by substituting V a for ∇g log q in (20,21) by setting λ = 1.
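To illustrate the plug-in reverse process, the sketch below runs the λ = 0 case of (21) on the 2-sphere with a uniform prior (U0 = 0), replacing ∇g log q with a placeholder network standing in for the learned V a; the projection and renormalization retraction follow the earlier sketch and are implementation assumptions.

```python
# Sketch of plug-in reverse sampling on S^2: with U_0 = 0 and lambda = 0,
# (21) reduces to dX = S_theta dt + V o dB_t. `score_net` is a placeholder
# for the learned V a (approximate score), not the paper's architecture.
import torch

score_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.SiLU(),
                                torch.nn.Linear(64, 3))

def reverse_sample(n, T=1.0, n_steps=100):
    x = torch.randn(n, 3)
    x = x / x.norm(dim=-1, keepdim=True)          # X_0 ~ uniform prior on S^2
    dt = T / n_steps
    for i in range(n_steps):
        t = torch.full((n, 1), T - i * dt)        # reversed-time input
        drift = score_net(torch.cat([x, t], dim=-1))
        step = drift * dt + torch.randn_like(x) * dt ** 0.5
        step = step - (step * x).sum(-1, keepdim=True) * x  # project onto T_x M
        x = x + step
        x = x / x.norm(dim=-1, keepdim=True)      # retract onto the sphere
    return x

with torch.no_grad():
    samples = reverse_sample(16)
```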
4 Related work
Diffusion models. Diffusion models can be viewed from two different but ultimately complementary
perspectives. The first approach leverages score based generative models (Song & Ermon, 2019;
Song et al., 2021b), while the second approach treats generative modeling as inverting a fixed noise-
injecting process (Sohl-Dickstein et al., 2015; Ho et al., 2020). Finally, continuous-time diffusion
models can also be embedded within a maximum likelihood framework (Huang et al., 2021; Song
et al., 2021a), which represents the special case of prescribing a flat geometry—i.e. Euclidean—to
the generative model and is completely generalized by the theory developed in this work.
Riemannian Generative Models. Generative models beyond Euclidean manifolds have recently
risen to prominence with early efforts focusing on constant curvature manifolds (Bose et al., 2020;
Rezende et al., 2020). Another line of work extends continuous-time flows (Chen et al., 2018a) to
more general Riemannian manifolds (Lou et al., 2020; Mathieu & Nickel, 2020; Falorsi & Forré,
2020). To avoid explicitly solving an ODE during training, Rozen et al. (2021) propose Moser
Flow whose objective involves computing the Riemannian divergence of a parametrized vector field.
Concurrent to our work, De Bortoli et al. (2022) develop Riemannian score-based generative models
for compact manifolds like the sphere. While similar in endeavor, RDMs are couched within the maximum likelihood framework. As a result, our approach is directly amenable to variance reduction techniques via importance sampling and to likelihood estimation. Moreover, our approach is also applicable to non-compact manifolds such as hyperbolic spaces, and we demonstrate this in our experiments on a larger variety of manifolds, including the orthogonal group and tori.
5 Experiments
We investigate the empirical caliber of RDMs on a range of manifolds. We instantiate RDMs by
parametrizing a in (16) using an MLP and maximize the CT-ELBO (17). We report our detailed
training procedure—including selected hyperparameters—for all models in §D.
5.1 Sphere
For spherical manifolds, we model the datasets compiled by Mathieu & Nickel (2020), which consist
of earth and climate science events on the surface of the earth such as volcanoes (NGDC/WDS,
2022b), earthquakes (NGDC/WDS, 2022a), floods (Brakenridge, 2017), and fires (EOSDIS, 2020).
Volcano Earthquake Flood Fire
Mixture of Kent −0.80±0.47 0.33±0.05 0.73±0.07 −1.18±0.06
Riemannian CNF (Mathieu & Nickel, 2020) −0.97±0.15 0.19±0.04 0.90±0.03 −0.66±0.05
Moser Flow (Rozen et al., 2021) −2.02±0.42 −0.09±0.02 0.62±0.04 −1.03±0.03
Stereographic Score-Based −4.18±0.30 −0.04±0.11 1.31±0.16 0.28±0.20
Riemannian Score-Based (De Bortoli et al., 2022) −5.56±0.26 −0.21±0.03 0.52±0.02 −1.24±0.07
RDM −6.61±0.97 −0.40±0.05 0.43±0.07 −1.38±0.05
Dataset size 827 6120 4875 12809
Table 1: NLL scores for each method on earth datasets. Bold shows best results (up to statistical significance).
Means and standard deviations are calculated over 5 runs. Baselines taken from De Bortoli et al. (2022).
Figure 2: Variance reduction with importance sampling.
Figure 3: Direct sampling vs numerical integration of Brownian motion. Numbers in legends indicate the number of time steps.
Figure 4: Ramachandran contour plot of the model density for protein datasets. Red dots are test set samples.
In Table 1, for each dataset we report the average and standard deviation of the test negative log-likelihood over 5 runs with different splits of the dataset. In Figure 1 we plot the model density in blue, while the test data are depicted with red dots.
Variance reduction. We demonstrate the effect of applying variance reduction on optimizing the
Riemannian CT-ELBO (17) using the earthquake dataset. As shown in Figure 2, learning an impor-
tance sampling proposal effectively lowers the variance and speeds up training.
5.2 Tori
For tori, we use the list of 500 high-resolution proteins compiled in Lovell et al. (2003) and select
113 RNA sequences listed in Murray et al. (2003). Each macromolecule is divided into multiple
monomers, and the joint structure is discarded—we model the lower dimensional density of the
backbone conformation of the monomer. For the protein data, this corresponds to 3 torsion angles
of the amino acid. As one of the angles is normally 180°, we also discard it, and model the density
over the 2D torus. For the RNA data, the monomer is a nucleotide described by 7 torsion angles in
the backbone, represented by a 7D torus. For protein, we divide the dataset by the type of side chain
attached to the amino acid, resulting in 4 datasets, and we discard the nucleobases of the RNA.
In Table 2 we report the NLL of our model. Our baseline is a mixture of 4,096 power spherical
distributions (De Cao & Aziz, 2020, MoPS). We observe that RDM outperforms the baseline across
the board, and the difference is most noticeable for the RNA data, which has a higher dimensionality.
Numerical integration ablation. We estimate the loss (17) by integrating the inference SDE on M.
To study the effect of integration error, we experiment with various numbers of time steps evenly
spaced between [0, s] on Glycine. Also, as we can directly sample the Brownian motion on tori
General Glycine Proline Pre-Pro RNA
MoPS 1.15±0.002 2.08±0.009 0.27±0.008 1.34±0.019 4.08±0.368
RDM 1.04±0.012 1.97±0.012 0.12±0.011 1.24±0.004 −3.70±0.592
Dataset size 138208 13283 7634 6910 9478
Table 2: Negative test log-likelihood for each method on Tori datasets. Bold shows best results (up to statistical
significance). Means and standard deviations are calculated over 5 runs.
without numerical integration, we use it as a reference (termed the direct loss) for comparison. Figure 3 shows that while fewer time steps tend to underestimate the loss, the model trained with 100 time steps is already indistinguishable from the one trained with direct sampling. We also find numerical
integration is not a significant overhead as each experiment takes approximately the same wall-clock
time with identical setups. This is because the inference path does not involve the neural module a.
6 Conclusion
In this paper, we introduce RDMs that extend continuous-time diffusion models to arbitrary
Riemannian manifolds—including challenging non-compact manifolds like hyperbolic spaces. We
provide a variational framework to train RDMs by optimizing a novel objective, the Riemannian
Continuous-Time ELBO. To enable efficient and stable training we provide several key tools such as
a fixed-inference parameterization of the SDE in the ambient space, new methodological techniques
to compute the Riemannian divergence, as well as an importance sampling procedure with respect
to the time integral to reduce the variance of the loss. On a theoretical front, we also show deep
connections between our proposed variational framework and Riemannian score matching through
the construction of marginally equivalent SDEs. Finally, we complement our theory by constructing
RDMs that achieve state-of-the-art performance on density estimation on geoscience datasets,
protein/RNA data on toroidal manifolds, and synthetic data on hyperbolic and orthogonal-group manifolds.
References
Boomsma, W., Mardia, K. V., Taylor, C. C., Ferkinghoff-Borg, J., Krogh, A., and Hamelryck, T. A
generative, probabilistic model of local protein structure. Proceedings of the National Academy
of Sciences, 105(26):8932–8937, 2008.
Bose, J., Smofsky, A., Liao, R., Panangaden, P., and Hamilton, W. Latent variable modelling with
hyperbolic normalizing flows. In International Conference on Machine Learning, pp. 1045–1055.
PMLR, 2020.
Brakenridge, G. Global active archive of large flood events. http://floodobservatory.
colorado.edu/Archives/index.html, 2017.
Brehmer, J. and Cranmer, K. Flows for simultaneous manifold learning and density estimation.
Advances in Neural Information Processing Systems, 33:442–453, 2020.
Brofos, J. A., Brubaker, M. A., and Lederman, R. R. Manifold density estimation via generalized
dequantization. arXiv preprint arXiv:2102.07143, 2021.
Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint
arXiv:1509.00519, 2015.
Burrage, K., Burrage, P., and Tian, T. Numerical methods for strong solutions of stochastic differen-
tial equations: an overview. Proceedings of the Royal Society of London. Series A: Mathematical,
Physical and Engineering Sciences, 460(2041):373–402, 2004.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems,
pp. 6572–6583, 2018a.
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equa-
tions. Advances in Neural Information Processing Systems, 2018b.
Chen, R. T. Q., Amos, B., and Nickel, M. Learning neural event functions for ordinary differential
equations. International Conference on Learning Representations, 2021.
Chirikjian, G. S. Stochastic Models, Information Theory, and Lie Groups, Volume 1: Classical
Results and Geometric Methods. Springer Science & Business Media, 2009.
De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., and Doucet, A. Riemannian
score-based generative modeling. arXiv preprint arXiv:2202.02763, 2022.
De Cao, N. and Aziz, W. The power spherical distribution, 2020.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. arXiv preprint
arXiv:2105.05233, 2021.
Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. Variational inference via χ upper bound minimization. Advances in Neural Information Processing Systems, 30, 2017.
EOSDIS. Land, atmosphere near real-time capability for eos (lance) system operated by
nasa’s earth science data and information system (esdis). https://earthdata.nasa.gov/
earth-observation-data/near-real-time/firms/active-fire-data, 2020.
Falorsi, L. and Forré, P. Neural ordinary differential equations on manifolds. arXiv preprint
arXiv:2006.06663, 2020.
Frellsen, J., Moltke, I., Thiim, M., Mardia, K. V., Ferkinghoff-Borg, J., and Hamelryck, T. A
probabilistic model of rna conformational space. PLoS computational biology, 5(6):e1000406,
2009.
Girolami, M. and Calderhead, B. Riemann manifold langevin and hamiltonian monte carlo methods.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
Gunther, M. Isometric embeddings of riemannian manifolds, kyoto, 1990. In Proc. Intern. Congr.
Math., pp. 1137–1143. Math. Soc. Japan, 1991.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in neural
information processing systems, 2020.
Hsu, E. P. Stochastic analysis on manifolds. Number 38. American Mathematical Soc., 2002.
Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In Interna-
tional Conference on Machine Learning, pp. 2078–2087, 2018.
Huang, C.-W., Lim, J. H., and Courville, A. A variational perspective on diffusion-based generative
models and score matching. arXiv preprint arXiv:2106.02808, 2021.
Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):
90, 2007.
Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing
splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
Inc., P. T. Collaborative data science, 2015. URL https://plot.ly.
Kazhdan, M., Bolitho, M., and Hoppe, H. Poisson surface reconstruction. In Proceedings of the
fourth Eurographics symposium on Geometry processing, volume 7, 2006.
Kidger, P., Foster, J., Li, X., Oberhauser, H., and Lyons, T. Neural SDEs as Infinite-Dimensional
GANs. International Conference on Machine Learning, 2021.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference
on Learning Representations, 2015.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances
in neural information processing systems, 31, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on
Learning Representations, 2014.
Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. arXiv preprint
arXiv:2107.00630, 2021.
Lee, J. M. Introduction to Smooth Manifolds. Springer, 2013.
Lee, J. M. Introduction to Riemannian manifolds. Springer, 2018.
Li, X., Wong, T.-K. L., Chen, R. T., and Duvenaud, D. Scalable gradients for stochastic differential
equations. In International Conference on Artificial Intelligence and Statistics, pp. 3870–3882.
PMLR, 2020.
Lou, A., Lim, D., Katsman, I., Huang, L., Jiang, Q., Lim, S. N., and De Sa, C. M. Neural manifold
ordinary differential equations. Advances in Neural Information Processing Systems, 33:17548–
17558, 2020.
Lovell, S. C., Davis, I. W., Arendall III, W. B., De Bakker, P. I., Word, J. M., Prisant, M. G.,
Richardson, J. S., and Richardson, D. C. Structure validation by cα geometry: φ, ψ and cβ
deviation. Proteins: Structure, Function, and Bioinformatics, 50(3):437–450, 2003.
Luo, Y., Beatson, A., Norouzi, M., Zhu, J., Duvenaud, D., Adams, R. P., and Chen, R. T. Sumo:
Unbiased estimation of log marginal probability for latent variable models. arXiv preprint
arXiv:2004.00353, 2020.
Mardia, K. V. and Jupp, P. E. Directional statistics, volume 494. John Wiley & Sons, 2009.
Mardia, K. V., Hughes, G., Taylor, C. C., and Singh, H. A multivariate von mises distribution with
applications to bioinformatics. Canadian Journal of Statistics, 36(1):99–109, 2008.
Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. arXiv preprint
arXiv:2006.10605, 2020.
Met Office. Cartopy: a cartographic python library with a Matplotlib interface. Exeter, Devon,
2010 - 2015. URL https://scitools.org.uk/cartopy.
Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling.
ACM Transactions on Graphics (TOG), 38(5):1–19, 2019.
Murray, L. J., Arendall, W. B., Richardson, D. C., and Richardson, J. S. Rna backbone is rotameric.
Proceedings of the National Academy of Sciences, 100(24):13904–13909, 2003.
NGDC/WDS. Ncei/wds global significant earthquake database. https://www.ncei.noaa.gov/
access/metadata/landing-page/bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153,
2022a.
NGDC/WDS. Ncei/wds global significant volcanic eruptions database. https:
//www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.
ngdc.mgg.hazards:G10147, 2022b.
Øksendal, B. Stochastic differential equations. In Stochastic differential equations, pp. 65–84.
Springer, 2003.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein,
N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In
Advances in neural information processing systems, pp. 8026–8037, 2019.
Protter, P. E. Stochastic Integration and Differential Equations, volume 21 of Stochastic Mod-
elling and Applied Probability. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. ISBN
9783642055607 9783662100615. doi: 10.1007/978-3-662-10061-5.
Ratcliffe, J. G. Foundations of Hyperbolic Manifolds. Number 149 in Graduate Texts in Mathemat-
ics. Springer-Verlag, 1994.
Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M., Kanwar, G., Shanahan, P., and Cran-
mer, K. Normalizing flows on tori and spheres. In International Conference on Machine Learning,
pp. 8083–8092. PMLR, 2020.
Rozen, N., Grover, A., Nickel, M., and Lipman, Y. Moser flow: Divergence-based generative
modeling on manifolds. Advances in Neural Information Processing Systems, 34, 2021.
Rudin, W. Real and Complex Analysis. McGraw-Hill, 1987.
Rudin, W. et al. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1976.
Skopek, O., Ganea, O.-E., and Bécigneul, G. Mixed-curvature variational autoencoders. arXiv
preprint arXiv:1911.08411, 2019.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp.
2256–2265. PMLR, 2015.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In
Advances in neural information processing systems, 2019.
Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based
diffusion models. Advances in Neural Information Processing Systems, 34, 2021a.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based
generative modeling through stochastic differential equations. In International Conference on
Learning Representations, 2021b.
Thalmaier, A. Sde and pde: Solving pde by running a brownian motion. 2021.
Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. Rebar: Low-variance,
unbiased gradient estimates for discrete latent variable models. Advances in Neural Information
Processing Systems, 30, 2017.
Wang, X., Lei, Q., and Panageas, I. Fast convergence of langevin dynamics on manifold: Geodesics
meet log-sobolev. Advances in Neural Information Processing Systems, 33:18894–18904, 2020.
Checklist
(d) Did you discuss whether and how consent was obtained from people whose data
you’re using/curating? [No]
(e) Did you discuss whether the data you are using/curating contains personally identifi-
able information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
A Riemannian manifolds
Notation Convention. There are a number of different notations used in differential geometry, and all have their place. The most abstract level works with tensors (of which forms and vectors are special cases) and is best for establishing general properties; we used an index-free notation in the main paper. A coordinate-based, but still intrinsic, description uses local charts, which describe explicit coordinate systems for patches of the manifold; for the most part we use intrinsic coordinates in the main paper. For computational purposes it is convenient to view manifolds as hypersurfaces embedded in R^m, even though this obscures the geometric meaning; these are called extrinsic coordinates, which we use for actual implementations.
We use capital letters to denote vectors, and tilded letters to denote vectors and variables defined on
the local patch.
We recall some preliminaries of smooth manifolds. See Lee (2013) for a more detailed and compre-
hensive account.
A smooth d-manifold is a topological space M (assumed to be paracompact, Hausdorff and second
countable) and a family of pairs {(Ui , ϕi )}, where the Ui are open sets that together cover all of M
and each ϕi is a homeomorphism from Ui to an open set in Rd ; these pairs are called charts. They
are required to satisfy a compatibility condition: if Ui and Uj have non-empty intersection, say V, then ϕi ∘ ϕj⁻¹|_V has to be an infinitely differentiable map from ϕj(V) ⊂ R^d to ϕi(V) ⊂ R^d. The use of charts allows one to talk about differentiability of functions or vector fields, by moving to R^d as needed. A smooth function f on M has type M → R and is such that for any chart (U, ϕ) the map f ∘ ϕ⁻¹ : R^d → R is smooth². The set of smooth functions on M is denoted C^∞(M).
Let M be a smooth manifold, and fix a point x in M. A derivation at x is a linear operator
D : C^∞(M) → R satisfying the product rule
    D(fg) = f(x)D(g) + g(x)D(f)    (23)
for all f, g ∈ C^∞(M). The set of all derivations at x is a d-dimensional real vector space called the tangent space Tx M, and the elements of Tx M are called the tangent vectors (or tangents) at x.
For the Euclidean space M = R^d, we have that Tx R^d = span{∂/∂x1, · · · , ∂/∂xd}. We now see how to use the Euclidean derivations to induce the tangent space of arbitrary Riemannian manifolds.
Let N be another smooth manifold. For any tangent V ∈ Tx M and smooth map ϕ : M → N , the
differential dϕx : Tx M → T_{ϕ(x)} N is defined as the pushforward of V acting on f ∈ C^∞(N):
    dϕx(V)(f) = V(f ∘ ϕ).    (24)
Note that, if ϕ is a diffeomorphism, dϕx is an isomorphism between Tx M and Tϕ(x) N , and the
inverse map satisfies (dϕx )−1 = d(ϕ−1 )ϕ(x) . Furthermore, differentials follow the chain rule, i.e.
the differential of a composite is the composite of the differentials.
Let x̃ = (x̃1, · · · , x̃d) = ϕ(x) be a local coordinate. Since dϕx : Tx M → T_{ϕ(x)} R^d is an isomorphism, we can characterize Tx M via inversion. We define the basis vector Ẽi of Tx M by
    Ẽi = (dϕx)⁻¹ ∂/∂x̃i = (dϕ⁻¹)_{ϕ(x)} ∂/∂x̃i,    (25)
which means
    Ẽi(f) = ∂/∂x̃i f(ϕ⁻¹(x̃)).    (26)
The tangent space Tx M of M at x is spanned by {Ẽ1, · · · , Ẽd}. This means any tangent vector V can be represented by Σ_{i=1}^d ṽi Ẽi for some coordinate-dependent coefficients ṽi.
²Strictly speaking this map has to be restricted to ϕ(U), but we will assume that the appropriate restrictions are always intended rather than cluttering up the notation with restrictions all the time.
A manifold M is said to be embedded in Rm if there is an inclusion map ι : M → Rm such that
M is homeomorphic to ι(M) and the differential at every point is injective. Every smooth manifold
can be embedded in some Rm with m > d for some suitably chosen m.
When M is embedded in Rm , we can view Tx M as a linear subspace of Tx Rm ; note that this map
has trivial kernel. Let ι : M → Rm denote the inclusion map, i.e. ι(x) = x ∈ Rm for x ∈ M.
Then
    Ẽi = (dϕ⁻¹)_{ϕ(x)} ∂/∂x̃i = (dι⁻¹)_{ι(x)} ( d(ι ∘ ϕ⁻¹) )_{ϕ(x)} ∂/∂x̃i = Σ_{j=1}^m (∂ϕ⁻¹_j/∂x̃i) ∂/∂xj.    (27)
This means we can rewrite a tangent vector using the ambient space's basis
    Σ_{i=1}^d ṽi Ẽi = Σ_{i=1}^d Σ_{j=1}^m ṽi (∂ϕ⁻¹_j/∂x̃i) ∂/∂xj = Σ_{j=1}^m v̄j ∂/∂xj    (28)
where v̄j = Σ_{i=1}^d ṽi ∂ϕ⁻¹_j/∂x̃i is the coefficient corresponding to the j'th ambient space coordinate. What exactly is ϕ⁻¹_j? Note that ι ∘ ϕ⁻¹ is a map from R^d to R^m and it takes ϕ(x) to ι(x). It is the j'th component of this map that we write as ϕ⁻¹_j.
In matrix-vector form, we can write v̄ = (dϕ⁻¹/dx̃) ṽ, where v̄ is a vector that represents the m-dimensional coefficients in the ambient space. This also means v̄ lies in the linear subspace spanned by the column vectors of the Jacobian ∂ϕ⁻¹/∂x̃. This linear subspace is isomorphic to Tx M, which itself is a subspace of Tx R^m. We refer to this linear subspace as the tangential linear subspace. Intuitively, this means a particle traveling at speed v̄ and position x can only move tangentially on the surface. Therefore it is restricted to move on the manifold.
A vector field V is a continuous map that assigns a tangent vector to each point on the manifold; that is, V(x) ∈ Tx M. We abuse the notation a bit and use capital letters to denote both vector fields and vectors; it should be clear from the context whether it is meant to be a function of points on the manifold or not. Such a vector field can also map a smooth function to a function, via the assignment x ∈ M ↦ V(x)(f) ∈ R. If it maps smooth functions to smooth functions we say that the vector field is smooth. The space of smooth vector fields on M is denoted by X(M).
Generally, given a set of basis vectors, such as the Ẽi, the metric tensor can be represented in matrix form, via
    g_ij := ⟨Ẽi, Ẽj⟩_g.    (31)
This allows us to write the metric using the patch coordinates
    ⟨U, V⟩_g = Σ_{i,j} ũi ṽj ⟨Ẽi, Ẽj⟩_g = Σ_{i,j} ũi ṽj g_ij = ũᵀ G ṽ,    (32)
where G is the matrix whose (i, j)'th entry is g_ij.
Using the components of the metric tensor, we can define the dual basis Ẽ^i = Σ_j g^{ij} Ẽj, where g^{ij} stands for the (i, j)'th entry of the inverse matrix G⁻¹. (Ẽ^1, · · · , Ẽ^d) is called the dual basis for (Ẽ1, · · · , Ẽd) since they form a bi-orthogonal system, meaning
    ⟨Ẽ^i, Ẽj⟩_g = ⟨ Σ_k g^{ik} Ẽk, Ẽj ⟩_g = Σ_k g^{ik} ⟨Ẽk, Ẽj⟩_g = Σ_k g^{ik} g_{kj} = (G⁻¹G)_{ij} = δ_{ij}.    (33)
That is, if ψ = ϕ⁻¹ is the inverse map of ϕ, we can write G = (dψ/dx̃)ᵀ (dψ/dx̃), which can be equivalently deduced from equating (30) and (32).
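As a concrete illustration of this formula (an example added here, not taken from the paper), consider the unit 2-sphere with the inverse chart ψ(θ, φ) = (sin θ cos φ, sin θ sin φ, cos θ):

```latex
% Worked example: induced metric of S^2 in spherical coordinates.
\frac{d\psi}{d\tilde{x}} =
\begin{pmatrix}
\cos\theta\cos\varphi & -\sin\theta\sin\varphi\\
\cos\theta\sin\varphi & \sin\theta\cos\varphi\\
-\sin\theta & 0
\end{pmatrix},
\qquad
G = \left(\frac{d\psi}{d\tilde{x}}\right)^{\!\top}\frac{d\psi}{d\tilde{x}}
  = \begin{pmatrix} 1 & 0\\ 0 & \sin^2\theta \end{pmatrix},
\qquad
\sqrt{|G|} = |\sin\theta|.
```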
An important use of the metric is to define a measure over measurable subsets of the manifold. Let (U, ϕ) be a chart and consider all smooth functions f supported in U. Then
    f ↦ ∫_{ϕ(U)} ( f √|det G| ) ∘ ϕ⁻¹ dx̃
is a positive linear functional. Since M is Hausdorff and locally compact, by the Riesz representation theorem (Rudin, 1987, Theorem 2.14), there exists a unique Borel measure µg (over U) such that ∫_U f dµg is equal to the evaluation of the functional above. We can then apply a partition of unity (Lee, 2013, Theorem 2.23) to extend this construction of µg to be defined over the entire M. The partition-of-unity theorem says that for any open cover {Ui} of M, there exists a set of continuous functions Ψi satisfying the following properties:
1. 0 ≤ Ψi ≤ 1 for every i.
2. supp Ψi ⊆ Ui for every i.
3. Σ_i Ψi = 1 on M.
4. Any x ∈ M has a neighborhood that intersects with only finitely many supp Ψi.
By means of the partition, we can consider the following positive linear functional instead:
    f ∈ Cc(M) ↦ Σ_i ∫_{ϕi(Ui)} ( Ψi f √|det G| ) ∘ ϕi⁻¹ dx̃,    (35)
which is always well-defined since f is compactly supported in M (only finitely many summands are non-zero). √|det G| is called the volume density. We write |G| = |det G| for short. A probability density p over M can be thought of as a non-negative integrable function satisfying ∫_M p dµg = 1.
A.3 Riemannian gradient and divergence
Riemannian gradient. Another crucial structure closely related to the metric is the Riemannian gradient. The definition of the Riemannian gradient ∇g : f ∈ C^∞(M) ↦ ∇g f ∈ X(M) is motivated by the directional derivative in Euclidean space, satisfying
    ⟨∇g f, V⟩_g = V(f),    (36)
where we let ũi and ṽj denote the coefficients of the gradient and of V, respectively. And,
    V(f) = Σ_{j=1}^d ṽj ∂/∂x̃j ( f ∘ ϕ⁻¹ ).    (38)
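For the embedded (extrinsic) view used in the implementations, the Riemannian gradient can also be read off in the ambient space; the identity below is a standard consequence of the induced metric, stated here for convenience rather than quoted from the paper:

```latex
% For M embedded in R^m with the induced metric and a smooth extension
% \bar{f} of f to a neighborhood of M, the Riemannian gradient is the
% tangential projection of the Euclidean gradient:
\nabla_g f(x) = P_x\,\bar{\nabla}\bar{f}(x),
\qquad
\langle \nabla_g f, V\rangle_g
 = \big(P_x \bar{\nabla}\bar{f}(x)\big)^{\top}\bar{v}
 = \bar{\nabla}\bar{f}(x)^{\top}\bar{v}
 = V(f).
```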
Riemannian divergence Recall that we define the Riemannian divergence using the patch coor-
dinates in (6), which we later show has a coordinate-free form (11) and can be computed in the
ambient space (12) if the manifold is embedded. The following theorem extends the Stokes theorem
to Riemannian manifolds.
Theorem 5 (Divergence theorem). For any compactly supported f ∈ X(M), ∫_M ∇g · f dµg = 0.
Proof. Let {(Ψi , Ui )} be a partition of unity. By compactness, we can choose a finite subcover over
the support of f , so the index set of i is finite.
    ∫_M ∇g · f dµg = ∫_M ∇g · ( Σ_i Ψi f ) dµg    (40)
    = Σ_i ∫_{Ui} ∇g · (Ψi f) dµg    (41)
    = Σ_i ∫_{ϕi(Ui)} ( ∇ · ( |G|^{1/2} Ψi f ) ) ∘ ϕi⁻¹ dx̃.    (42)
All of the finitely many summands equal 0 by an application of Stokes' theorem in R^d (Rudin et al., 1976, Theorem 10.33). This is because the support of Ψi ∘ ϕi⁻¹ is contained in ϕi(Ui); therefore at the boundary of ϕi(Ui), Ψi ∘ ϕi⁻¹ is equal to 0.
Proof (of the product rule (43): ∇g · (f V) = f ∇g · V + V(f) for f ∈ C^∞(M) and V ∈ X(M)). Using (11) and the product rule of the affine connection (see Appendix A.4),
    ∇g · (f V) = Σ_{j=1}^d ⟨∇_{Ẽj}(f V), Ẽ^j⟩_g    (44)
    = Σ_{j=1}^d ⟨ f ∇_{Ẽj} V + Ẽj(f) V, Ẽ^j ⟩_g    (45)
    = f Σ_{j=1}^d ⟨∇_{Ẽj} V, Ẽ^j⟩_g + Σ_{j=1}^d Ẽj(f) ⟨ Σ_{j′=1}^d ṽ_{j′} Ẽ_{j′}, Ẽ^j ⟩_g    (46)
    = f ∇g · V + Σ_{j,j′=1}^d Ẽj(f) ṽ_{j′} ⟨Ẽ_{j′}, Ẽ^j⟩_g    (47)
    = f ∇g · V + Σ_{j,j′=1}^d Ẽj(f) ṽ_{j′} δ_{jj′}    (48)
    = f ∇g · V + Σ_{j=1}^d Ẽj(f) ṽj = f ∇g · V + V(f).    (49)
An affine connection allows us to compare values of a vector field at nearby points. It is a dif-
ferential operator denoted by ∇ : X(M) × X(M) → X(M) and written as U, V 7→ ∇U V for
U, V ∈ X(M), satisfying the following defining properties: ∇_U V is linear over C^∞(M) in U, linear over R in V, and satisfies the product rule ∇_U(fV) = f ∇_U V + U(f) V for f ∈ C^∞(M).
Then for any U, V ∈ X(M), we have
    ∇_U V = ∇_U ( Σ_{j=1}^d ṽj Ẽj )    (53)
    = Σ_{j=1}^d ( ṽj ∇_U Ẽj + U(ṽj) Ẽj )    (54)
    = Σ_{i,j=1}^d ũi ṽj ∇_{Ẽi} Ẽj + Σ_{j=1}^d U(ṽj) Ẽj    (55)
    = Σ_{i,j,k=1}^d ũi ṽj Γ^k_{ij} Ẽk + Σ_{j=1}^d U(ṽj) Ẽj.    (56)
The first condition looks messy but it essentially says that the Levi-Civita connection leaves the
metric invariant. It is equivalent to saying that the covariant derivative of g in any direction is
zero.
Theorem 6 (Fundamental Theorem of Riemannian Geometry). Let (M, g) be a Rieman-
nian manifold. There exists a unique Levi-Civita connection of g.
See Lee (2018, Theorem 5.10) for proof. The connection coefficients of the Levi-Civita connection
are called the Christoffel symbols of g. They are symmetric in the lower indices, i.e. Γkij = Γkji . A
by-product of the proof of the fundamental theorem is the following identity, which will turn out to
be useful in deriving the identity for the Riemannian divergence:
    ∂g_{ki}/∂x̃j = Σ_{l=1}^d ( Γ^l_{jk} g_{li} + Γ^l_{ji} g_{lk} ).    (57)
An example of a Levi-Civita connection is the Euclidean connection of (Rd , ḡ). It can be checked
that ∇ is both symmetric and compatible with ḡ. Furthermore, for any d-submanifold M embedded
in R^m for m > d, we can define a tangential connection
    ∇^⊤_U V = P ( ∇_{Ū} V̄ )    (58)
for U, V ∈ X(M), where Ū and V̄ are any³ smooth extensions of U and V to R^m. P is the tangential projection defined as
    (P V)(x) = Σ_{j=1}^m (Px v̄)_j ∂/∂xj    (59)
for any V ∈ X(R^m). Recall that Px is the orthogonal projection onto the tangent space spanned by the ∂ψ/∂x̃i. The tangential connection ∇^⊤ is the Levi-Civita connection on the embedded submanifold M (Lee, 2018, Proposition 5.12).
³The value of the tangential connection is independent of the extensions chosen, so ∇^⊤ is well-defined.
B Proofs
Theorem 1 (Marginal Density). The density p(x, t) of the SDE (5) can be written as
    p(x, t) = E[ p0(Yt) exp( −∫_0^t ∇g · ( V0 − (1/2)(V · ∇g)V ) ds ) | Y0 = x ]    (7)
where the expectation is taken wrt the following process induced by a Brownian motion B′s:
    dY = ( −V0 + (V · ∇g)V ) ds + V ∘ dB′s.    (8)
Proof. Our first step is to express the time derivative of the density using derivations (spatial deriva-
tives); this gives us a partial differential equation (PDE) on the manifold. Second, we apply the
Feynman-Kac formula (Thalmaier, 2021, Proposition 3.1) to the solution of the PDE.
We denote by dX̃t = ṽ0 dt + ṽ ◦ dBt the Stratonovich SDE defined on the patch. The density p of
the process satisfies the Fokker-Planck equation (Chirikjian, 2009, Equation (8.16)):
    ∂t p(x̃, t) = −|G|^{−1/2} ∇ · ( |G|^{1/2} ṽ0 p ) + (1/2) |G|^{−1/2} Σ_{i=1}^d ∂/∂x̃i Σ_{j=1}^d Σ_{k=1}^w ṽ_{i,k} ∂/∂x̃j ( |G|^{1/2} ṽ_{j,k} p ),    (60)
where we refer to the two summands on the RHS as the first and second term, respectively. We would like to re-express the RHS using the abstract vectors V0 and V. Note the first term can be written as −∇g · (pV0). We now show that we can also rewrite the second term in terms of the Riemannian divergence:
    (1/2) |G|^{−1/2} Σ_{i=1}^d Σ_{j=1}^d ∂/∂x̃i Σ_{k=1}^w ṽ_{i,k} ∂/∂x̃j ( |G|^{1/2} ṽ_{j,k} p )    (61)
    = (1/2) |G|^{−1/2} Σ_{k=1}^w Σ_{i=1}^d ∂/∂x̃i ( ṽ_{i,k} Σ_{j=1}^d ∂/∂x̃j ( |G|^{1/2} ṽ_{j,k} p ) )    (62)
    = (1/2) |G|^{−1/2} Σ_{k=1}^w Σ_{i=1}^d ∂/∂x̃i ( ṽ_{i,k} |G|^{1/2} |G|^{−1/2} Σ_{j=1}^d ∂/∂x̃j ( |G|^{1/2} ṽ_{j,k} p ) )    (63)
    = (1/2) |G|^{−1/2} Σ_{k=1}^w Σ_{i=1}^d ∂/∂x̃i ( ṽ_{i,k} |G|^{1/2} ∇g · (pVk) )    (64)
    = (1/2) Σ_{k=1}^w |G|^{−1/2} Σ_{i=1}^d ∂/∂x̃i ( |G|^{1/2} ṽ_{i,k} ∇g · (pVk) )    (65)
    = (1/2) Σ_{k=1}^w ∇g · ( (∇g · (pVk)) Vk ).    (66)
Therefore,
    ∂t p(x, t) = −∇g · (pV0) + (1/2) Σ_{k=1}^w ∇g · ( (∇g · (pVk)) Vk ).    (67)
Next, we expand the above formula using the product rule (43):
    ∂t p(x, t) = −∇g · (pV0) + (1/2) Σ_{k=1}^w ∇g · ( (∇g · (pVk)) Vk )    (68)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w ∇g · ( ( Vk(p) + p ∇g · Vk ) Vk )    (69)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w [ Vk( Vk(p) + p ∇g · Vk ) + ( Vk(p) + p ∇g · Vk ) ∇g · Vk ]    (71)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w [ Vk(Vk(p)) + Vk(p ∇g · Vk) + Vk(p) ∇g · Vk + p ∇g · Vk ∇g · Vk ]    (72)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w [ Vk(Vk(p)) + Vk(p ∇g · Vk) + Vk(p) ∇g · Vk + p (∇g · Vk)² ]    (73)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w [ Vk²(p) + Vk(p ∇g · Vk) + Vk(p) ∇g · Vk + p (∇g · Vk)² ]    (74)
    = −V0(p) − p ∇g · V0 + (1/2) Σ_{k=1}^w [ Vk²(p) + Vk(p) ∇g · Vk + p Vk(∇g · Vk) + Vk(p) ∇g · Vk + p (∇g · Vk)² ]    (75)
    = −V0(p) − p ∇g · V0 + Σ_{k=1}^w Vk(p) ∇g · Vk + (1/2) Σ_{k=1}^w [ Vk²(p) + p Vk(∇g · Vk) + p (∇g · Vk)² ]    (77)
    = −V0(p) − p ∇g · V0 + ( Σ_{k=1}^w (∇g · Vk) Vk )(p) + (1/2) Σ_{k=1}^w [ Vk²(p) + p Vk(∇g · Vk) + p (∇g · Vk)² ]    (78)
    = −V0(p) − p ∇g · V0 + ((V · ∇g)V)(p) + (1/2) Σ_{k=1}^w [ Vk²(p) + p Vk(∇g · Vk) + p (∇g · Vk)² ]    (79)
    = −V0(p) − p ∇g · V0 + ((V · ∇g)V)(p) + (1/2) Σ_{k=1}^w [ Vk²(p) + p ∇g · ( (∇g · Vk) Vk ) ]    (80)
    = −V0(p) − p ∇g · V0 + ((V · ∇g)V)(p) + (1/2) Σ_{k=1}^w Vk²(p) + (1/2) Σ_{k=1}^w p ∇g · ( (∇g · Vk) Vk )    (82)
    = −V0(p) − p ∇g · V0 + ((V · ∇g)V)(p) + (1/2) Σ_{k=1}^w Vk²(p) + (1/2) p ∇g · ( (V · ∇g)V ).    (83)
In order to apply the Feynman-Kac formula, we group all the terms by the order of differentiation (of p), which gives us
    ∂t p(x, t) = −V0(p) − p ∇g · V0 + ((V · ∇g)V)(p) + (1/2) Σ_{k=1}^w Vk²(p) + (1/2) p ∇g · ((V · ∇g)V)    (84)
    = p ( −∇g · V0 + (1/2) ∇g · ((V · ∇g)V) ) + ( −V0 + (V · ∇g)V )(p) + (1/2) Σ_{k=1}^w Vk²(p),    (85)
where the zeroth-order coefficient 𝒱 := −∇g · V0 + (1/2) ∇g · ((V · ∇g)V) plays the role of the Feynman-Kac potential. Now the above is a parabolic PDE, which can be solved using the Feynman-Kac formula (Thalmaier, 2021, Proposition 3.1). Let Y be the process induced by (8), restated below:
    dY = ( −V0 + (V · ∇g)V ) dt + Σ_{k=1}^w Vk ∘ dB′s,    Y0 = x.    (86)
Then p(x, t) is given by
    p(x, t) = E[ exp( ∫_0^t 𝒱(Ys(x)) ds ) p0(Yt) | Y0 = x ].    (87)
Theorem 2 (Riemannian CT-ELBO). Let B̂s be a w-dimensional Brownian motion, and let
Ys be a process solving the following
Inference SDE: dY = (−V0 + (V · ∇g )V + V a) ds + V ◦ dB̂s , (9)
where a : R^m × [0, T] → R^m is the variational degree of freedom. Then we have
    log p(x, T) ≥ E[ log p0(YT) − ∫_0^T ( (1/2)‖a(Ys, s)‖₂² + ∇g · ( V0 − (1/2)(V · ∇g)V ) ) ds | Y0 = x ],    (10)
where all the generative degrees of freedom Vk are evaluated in the reversed time direction.
Proof. Let P be the probability measure under which B′ is a Brownian motion. Let
    dB̂ = −a ds + dB′s,    (88)
where a is the variational degree of freedom. Let Q be defined as
    dQ = exp( ∫_0^T a(Ys, s) dB′s − (1/2) ∫_0^T ‖a(Ys, s)‖₂² ds ) dP.    (89)
Note that the first term is an Itô integral. Then by the Girsanov theorem (Øksendal, 2003, Theorem 8.6.3), B̂ is a Brownian motion wrt Q. Therefore, applying the change of measure from P to Q to the expression in Theorem 1 yields
    log p(x, t) = log E_Q[ (dP/dQ) · p0(Yt) exp( −∫_0^t ∇g · ( V0 − (1/2)(V · ∇g)V ) ds ) | Y0 = x ].
Applying Jensen's inequality then yields the lower bound (10),
where we used the definition of Q (89), the definition of dB̂ (88), and the Martingale property of
the Itô integral (Øksendal, 2003, Corollary 3.2.6). This concludes the proof.
Proposition 1 (Riemannian divergence identity). Let (M, g) be a d-dimensional Rieman-
nian manifold. For any smooth vector field Vk ∈ X(M), the following identity holds:
    ∇g · Vk = Σ_{j=1}^d ⟨∇_{Ẽj} Vk, Ẽ^j⟩_g.    (11)
Proof. We drop the index on k (since the statement is for any smooth vector field). Using the product rule, the LHS of (11) is equal to
    Σ_{j=1}^d ( ∂ṽj/∂x̃j + ṽj |G|^{−1/2} ∂/∂x̃j |G|^{1/2} ).    (94)
Using the chain rule, Jacobi's formula, and the identity (57), we have
    ṽj |G|^{−1/2} ∂/∂x̃j |G|^{1/2} = ṽj (1/2) |G|^{−1} ∂/∂x̃j det G    (95)
    = ṽj (1/2) tr( G^{−1} ∂G/∂x̃j )    (96)
    = ṽj (1/2) Σ_{i,k=1}^d g^{ik} ∂g_{ki}/∂x̃j    (97)
    = ṽj (1/2) Σ_{i,k=1}^d g^{ik} Σ_{l=1}^d ( Γ^l_{jk} g_{li} + Γ^l_{ji} g_{lk} )    (98)
    = (1/2) Σ_{i,k,l=1}^d ṽj Γ^l_{jk} g^{ik} g_{li} + (1/2) Σ_{i,k,l=1}^d ṽj Γ^l_{ji} g^{ik} g_{lk}    (99)
    = (1/2) Σ_{k,l=1}^d ṽj Γ^l_{jk} δ_{kl} + (1/2) Σ_{i,l=1}^d ṽj Γ^l_{ji} δ_{il}    (100)
    = (1/2) Σ_{k=1}^d ṽj Γ^k_{jk} + (1/2) Σ_{i=1}^d ṽj Γ^i_{ji}    (101)
    = Σ_{k=1}^d ṽj Γ^k_{jk}.    (102)
Now we express the covariant derivative on the RHS using the connection coefficients (56):
    ∇_{Ẽj} V = Σ_{i,k=1}^d ṽi Γ^k_{ji} Ẽk + Σ_{i=1}^d (∂ṽi/∂x̃j) Ẽi,    (104)
which means
    ⟨∇_{Ẽj} V, Ẽ^j⟩_g = Σ_{i,k=1}^d ṽi Γ^k_{ji} ⟨Ẽk, Ẽ^j⟩_g + Σ_{i=1}^d (∂ṽi/∂x̃j) ⟨Ẽi, Ẽ^j⟩_g    (105)
    = Σ_{i,k=1}^d ṽi Γ^k_{ji} δ_{kj} + Σ_{i=1}^d (∂ṽi/∂x̃j) δ_{ij}    (106)
    = Σ_{i=1}^d ṽi Γ^j_{ji} + ∂ṽj/∂x̃j.    (107)
Relabeling i → j and j → k in the first term shows this is equal to the LHS.
For the second half of the theorem, recall that the Levi-Civita connection is equal to the tangential connection. Therefore, changing the basis via Ẽj = Σ_{k=1}^m (∂ψk/∂x̃j) ∂/∂xk, we can rewrite it as
    (∇_{Ẽj} V)(x) = P ( Σ_{i=1}^m Σ_{k=1}^m (∂ψk/∂x̃j) (∂v̄i/∂xk) ∂/∂xi )(x) = Σ_{i=1}^m ( Px (dv̄/dx)(dψ/dx̃) )_{ij} ∂/∂xi.    (109)
Since g is the induced metric, the summation over j = 1, · · · , d is equivalent to the Frobenius inner product ⟨·, ·⟩_F of the two m × d matrices:
    Σ_{j=1}^d ⟨∇_{Ẽj} V, Ẽ^j⟩_g = ⟨ Px (dv̄/dx)(dψ/dx̃), (dψ/dx̃) ( (dψ/dx̃)ᵀ (dψ/dx̃) )^{−1} ⟩_F    (111)
    = tr( Px (dv̄/dx) (dψ/dx̃) ( (dψ/dx̃)ᵀ (dψ/dx̃) )^{−1} (dψ/dx̃)ᵀ )    (112)
    = tr( Px (dv̄/dx) Px ).    (113)
Proposition 2. If V is the tangential projection matrix, then (V · ∇g)V = 0.
Proof. By definition,
    (V · ∇g)V = Σ_{j=1}^m (∇g · Vj) Vj.    (114)
Denote by (Px):j the j'th column of Px. Applying the resulting tangent vector to any smooth function f (evaluated at x) and applying (12) gives
    ((V · ∇g)V)(f)(x) = Σ_{j=1}^m (∇g · Vj)(x) Vj(f)(x)    (115)
    = Σ_{j=1}^m tr( Px (d(Px):j/dx) Px ) Σ_{i=1}^m (Px)_{ij} ∂f/∂xi    (116)
    = Σ_{i=1}^m Σ_{j=1}^m (Px)_{ij} tr( Px (d(Px):j/dx) Px ) ∂f/∂xi.    (117)
That is, the resulting tangent vector's coefficients correspond to the tangential projection of the vector
    [ tr( Px (d(Px):1/dx) Px ), · · · , tr( Px (d(Px):m/dx) Px ) ]ᵀ.    (118)
    Px (d(Px):j/dx) Px = − Σ_{r=1}^{m−d} ( (nx)_{jr} Px [ ∇x(nx)_{1r}ᵀ; · · · ; ∇x(nx)_{mr}ᵀ ] Px + Px (nx):r ∇x(nx)_{jr}ᵀ Px ),    (122)
where the second term vanishes since Px (nx):r = 0. Lastly, let
    τr = tr( Px [ ∇x(nx)_{1r}ᵀ; · · · ; ∇x(nx)_{mr}ᵀ ] Px ),    (123)
which means (118) is simply
    − Σ_{r=1}^{m−d} (nx):r τr.    (124)
This implies the claim is true, since this is nothing more than a linear combination of the column
vectors of nx , which is orthogonal to the tangential linear subspace.
Theorem 3 (Marginally equivalent SDEs). For λ ≤ 1, the marginal distributions of XT −s
and Ys of the processes defined as below
    dY = ( U0 − (λ/2) ∇g log q ) ds + √(1 − λ) V ∘ dB̂s,    Y0 ∼ q(·, 0)    (20)
    dX = ( (1 − λ/2) ∇g log q − U0 ) dt + √(1 − λ) V ∘ dB̂t,    X0 ∼ q(·, T)    (21)
both have the density q(·, s). In particular, λ = 1 gives rise to an equivalent ODE.
That is, U0(f) = Σ_k (P r)_k ∂/∂x̃k ( f ∘ ψ ), and V is the tangential projection. The marginal density q follows the Fokker-Planck PDE
    ∂s q = −∇g · (qU0) + (1/2) Σ_{k=1}^m ∇g · ( (∇g · (qVk)) Vk )    (126)
    = −∇g · (qU0) + (1/2) Σ_{k=1}^m ∇g · ( ( Vk(q) + q ∇g · Vk ) Vk )    (127)
    = −∇g · (qU0) + (1/2) Σ_{k=1}^m ∇g · ( Vk(q) Vk )    (128)
    = −∇g · (qU0) + (1/2) Σ_{k=1}^m ∇g · ( q Vk(log q) Vk )    (129)
    = −∇g · (qU0) + (1/2) ∇g · ( q ∇g log q ),    (130)
where we have used the product rule, Proposition 2, the chain rule, and Proposition 4.
For λ ≤ 1, we can rearrange the Fokker-Planck equation and get
    ∂s q = −∇g · ( q ( U0 − (λ/2) ∇g log q ) ) + ((1 − λ)/2) ∇g · ( q ∇g log q ),    (131)
which is the Fokker-Planck equation of the process (20).
To construct a reverse process inducing the same family of marginal densities, we mirror the diffusion term around 0:
$$\partial_s q = -\nabla_g\cdot\left(q\left(U_0 - \left(1-\frac{\lambda}{2}\right)\nabla_g\log q\right)\right) - \frac{1-\lambda}{2}\nabla_g\cdot\big(q\,\nabla_g\log q\big). \tag{132}$$
Now we apply a change of variable of time via $p(x,t) = q(x, T-t)$, which means $\partial_t p = -\partial_s q|_{s=T-t}$ and thus
$$\partial_t p = -\nabla_g\cdot\left(q\left(\left(1-\frac{\lambda}{2}\right)\nabla_g\log q - U_0\right)\right) + \frac{1-\lambda}{2}\nabla_g\cdot\big(q\,\nabla_g\log q\big), \tag{133}$$
which is the Fokker–Planck equation of (21).
Theorem 4 (Score matching equivalency). For λ < 1, let $\mathcal{E}_\lambda^\infty$ denote the Riemannian CT-ELBO of the generative process (21), with $\nabla_g\log q$ replaced by an approximate score $S_\theta$, and with (20) being the inference SDE. Assume $S_\theta$ is a compactly supported smooth vector field. Then
$$\mathbb{E}_{Y_0}\!\left[\mathcal{E}_\lambda^\infty\right] = -C_1\int_0^T \mathbb{E}_{Y_s}\!\left[\left\|S_\theta - \nabla_g\log q\right\|_g^2\right]ds + C_2. \tag{22}$$
Proof. Approximating $\nabla_g\log q$ in (21) using $S_\theta$ and plugging it into (5), and (20) into (9), we get
$$V_0 = \left(1-\frac{\lambda}{2}\right)S_\theta - U_0, \tag{134}$$
$$\sqrt{1-\lambda}\,Va = (1-\lambda)\,S_\theta + \frac{\lambda}{2}\left(S_\theta - \nabla_g\log q\right), \tag{135}$$
where $E_j$ denotes the ambient-space Euclidean derivation $\frac{\partial}{\partial x_j}$.
Thus, we have
$$\frac{1}{2}\|Pa\|_2^2 = \frac{1}{2(1-\lambda)}\left((1-\lambda)^2\|S_\theta\|_g^2 + (1-\lambda)\,\lambda\,\langle S_\theta,\,S_\theta - \nabla_g\log q\rangle_g + \frac{\lambda^2}{4}\|S_\theta - \nabla_g\log q\|_g^2\right)$$
$$= \left(1-\frac{\lambda}{2}\right)\frac{1}{2}\|S_\theta\|_g^2 + \frac{\lambda}{2}\cdot\frac{1}{2}\|S_\theta\|_g^2 - \frac{\lambda}{2}\langle S_\theta,\,\nabla_g\log q\rangle_g + \frac{\lambda^2}{4(1-\lambda)}\cdot\frac{1}{2}\|S_\theta - \nabla_g\log q\|_g^2,$$
$$\nabla\cdot V_0 = \left(1-\frac{\lambda}{2}\right)\nabla\cdot\left(S_\theta - \frac{2}{2-\lambda}\,U_0\right).$$
Summing up these two parts gives us $\mathcal{E}_\lambda^\infty$. Taking the expectation over $q(\cdot,0)$ and applying the
divergence theorem gives the desired identity.
C Manifolds
Spheres are defined as submanifolds of a Euclidean space consisting of points with unit Euclidean norm.
Precisely, a $d$-sphere is $S^d = \{x\in\mathbb{R}^{d+1} : \|x\|_2 = 1\}$. Therefore the ambient space dimensionality
of a $d$-sphere is $m = d+1$. Tori are products of 1-spheres (or circles); that is, $T^d = \prod_{i=1}^{d} S^1$.
Naturally, we can embed a $d$-torus in an $m = 2d$-dimensional ambient space.
Tangential projection Without loss of generality, we derive the orthogonal projection onto the tangent space of spheres. The tangential projection of tori is just the same linear operator applied independently to each of the $d$ constituent $\mathbb{R}^2$ vectors.
To derive the tangential projection, we note that any incremental change in $x$, denoted by $dx$, will
need to leave the norm $\|x\|_2$ unchanged. That is,
$$d\,\|x\|_2^2 = 2x^{\top}dx = 0. \tag{141}$$
This means $x$ is normal to the tangential linear subspace. We can find the orthogonal projection onto
the tangent space by subtracting the normal component, via $P_x = I - \frac{xx^{\top}}{\|x\|_2^2}$.
Closest-point projection The closest-point projection onto the sphere is $\pi(x) = \frac{x}{\|x\|_2}$. One can
verify this is the point on $S^d$ that minimizes the Euclidean distance from $x\in\mathbb{R}^{d+1}\setminus\{0\}$.
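For concreteness, both maps are one-liners; the following sketch (function names are placeholders, and it assumes the torus coordinates are stored as $d$ consecutive $\mathbb{R}^2$ pairs) mirrors the formulas above:

```python
import torch

def sphere_tangent_proj(x, u):
    # P_x u = u - (x^T u / ||x||^2) x, the tangential projection at x.
    return u - (x @ u) / (x @ x) * x

def sphere_closest_point(x):
    # pi(x) = x / ||x||_2 for x != 0.
    return x / x.norm()

def torus_closest_point(x):
    # Treat the 2d ambient coordinates as d independent R^2 vectors and
    # normalize each one back onto its circle.
    pairs = x.view(-1, 2)
    return (pairs / pairs.norm(dim=1, keepdim=True)).reshape(-1)
```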
We work with the Lorentzian model of the hyperbolic manifold, which, like the $d$-spheres, is a
$d$-manifold embedded in $\mathbb{R}^{d+1}$, defined as
$$\mathbb{H}^d_K := \{x = (x_0,\dots,x_d)\in\mathbb{R}^{d+1} : \langle x, x\rangle_L = 1/K,\ x_0 > 0\}, \tag{142}$$
where $K < 0$ is the curvature of the manifold, and $\langle\cdot,\cdot\rangle_L$ is the Lorentzian inner product
$$\langle x, y\rangle_L = -x_0 y_0 + x_1 y_1 + \cdots + x_d y_d. \tag{143}$$
In our experiments, $K = -1$.
The $(d+1)$-dimensional Euclidean space endowed with the Lorentzian inner product, $(\mathbb{R}^{d+1}, \langle\cdot,\cdot\rangle_L)$,
is known as the Minkowski space. The Lorentz inner product is in general indefinite; therefore,
technically it is not an inner product. But it is positive definite when restricted to $\mathbb{H}^d_K$, and as a result
induces a valid Riemannian metric $g_L$. Equation (12), however, relies on the Euclidean geometry
of the ambient space. Therefore, we model the density $p_E$ associated with the metric tensor $g_E$
induced by the regular Euclidean inner product. That is, $p_E$ is a probability density on the manifold
$(\mathbb{H}^d_K, g_E)$. Note that all the data points still lie on the same topological space $\mathbb{H}^d_K$, and the density
can be translated via $p_E = p_L\sqrt{\frac{|G_L|}{|G_E|}}$, where $G_L$ and $G_E$ are the component matrices of $g_L$
and $g_E$, and $p_L$ is the actual density on the hyperbolic manifold $(\mathbb{H}^d_K, g_L)$. This change-of-volume
relation implies that instead of maximizing the likelihood $\log p_L$, we can simply maximize $\log p_E$.
Alternatively, one can also compute the Riemannian divergence wrt the metric gL using the internal
coordinates, as is done in (Lou et al., 2020). In this case, the learned density will be the actual
density pL on the hyperbolic manifold.
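Both metric matrices can be obtained from the Jacobian of any chart $\psi$, since $G_E = J^\top J$ and $G_L = J^\top\,\mathrm{diag}(-1,1,\dots,1)\,J$ with $J = d\psi/d\tilde x$. A small sketch of the resulting log-density conversion (the chart Jacobian is assumed given, and the function name is a placeholder):

```python
import torch

def log_pL_from_log_pE(log_p_E, J):
    # J: (d+1) x d Jacobian of the embedding chart at the point of interest.
    minkowski = torch.diag(torch.tensor([-1.0] + [1.0] * (J.shape[0] - 1)))
    G_E = J.T @ J                 # induced Euclidean metric
    G_L = J.T @ minkowski @ J     # induced Lorentzian metric (positive definite on H^d_K)
    # p_E = p_L * sqrt(|G_L| / |G_E|)  =>  log p_L = log p_E - 0.5 log|G_L| + 0.5 log|G_E|
    return log_p_E - 0.5 * torch.logdet(G_L) + 0.5 * torch.logdet(G_E)
```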
Tangential projection Similar to the spheres, we analyze the contribution of the differential $dx$:
$$d\,\langle x, x\rangle_L = 2\,n_x^{\top}dx = 0, \tag{144}$$
where $n_x = (-x_0, x_1,\dots,x_d)$ is the normal vector. Subtracting the normal contribution gives rise
to the tangential projection $P_x = I - \frac{n_x n_x^{\top}}{\|n_x\|_2^2}$.
Note that this is different from the usual "Lorentz" orthogonal projection $P_x^L(u) = u - \frac{\langle x, u\rangle_L}{\langle x, x\rangle_L}x$
(Ratcliffe, 1994); the latter is not orthogonal in the Euclidean inner product.
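In code, this Euclidean-orthogonal projection is again a rank-one correction; a minimal sketch with placeholder helper names:

```python
import torch

def lorentz_normal(x):
    # n_x = (-x_0, x_1, ..., x_d), the Euclidean normal of the hyperboloid at x.
    n = x.clone()
    n[0] = -n[0]
    return n

def hyperboloid_tangent_proj(x, u):
    # P_x u = u - (n_x^T u / ||n_x||^2) n_x; note this differs from the
    # Lorentz-orthogonal projection P^L_x discussed above.
    n = lorentz_normal(x)
    return u - (n @ u) / (n @ n) * n
```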
Closest-point projection We first derive the closest-point projection wrt the Lorentz inner product.
For any $x\in\{x' : \langle x', x'\rangle_L < 0\}$,
$$\pi(x) = \underset{y\in\mathbb{H}^d_K}{\arg\min}\ \|x - y\|_L^2, \tag{145}$$
where $\|x\|_L := \sqrt{\langle x, x\rangle_L}$ is the Lorentz norm. To deal with the constraint $y\in\mathbb{H}^d_K$, we can
introduce the Lagrange multiplier $\lambda$, and find the stationary point of the function
$$\|x - y\|_L^2 + \lambda\left(\langle y, y\rangle_L - 1/K\right). \tag{146}$$
Figure 7: Closest-point projection of the point (1.0, 0.9) onto the hyperbolic manifold $\mathbb{H}^1_{-1}$ in the
Lorentz norm. This projection is clearly not the closest one in Euclidean distance.
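Solving the stationarity condition of (146) shows (under our reading, and for $K = -1$ as used in the experiments) that the minimizer is simply a rescaling of $x$ onto the hyperboloid, with the sign fixed so that the first coordinate is positive:

```python
import torch

def lorentz_inner(x, y):
    # <x, y>_L = -x_0 y_0 + x_1 y_1 + ... + x_d y_d
    return -x[0] * y[0] + x[1:] @ y[1:]

def hyperboloid_closest_point(x):
    # Closest point on H^d_{-1} in the Lorentz norm, assuming <x, x>_L < 0:
    # rescale x so that <y, y>_L = -1, keeping y_0 > 0.
    y = x / torch.sqrt(-lorentz_inner(x, x))
    return y if y[0] > 0 else -y
```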
The orthogonal groups are defined as $O(n) = \{X\in\mathbb{R}^{n\times n} : X^{\top}X = XX^{\top} = I\}$. The determinant of $X$ is either 1 or $-1$. The subgroup with determinant 1 is called the special orthogonal group,
denoted by $SO(n)$. Naturally, $\mathbb{R}^{n\times n}$ is an ambient space of the orthogonal groups.
Equating the last step with 0 yields
$$V = U + \frac{\Lambda + \Lambda^{\top}}{2}X. \tag{151}$$
Substituting this into the tangency constraint $XV^{\top} + VX^{\top} = 0$ (obtained by differentiating $XX^{\top} = I$) gives
$$XV^{\top} + VX^{\top} = XU^{\top} + \frac{\Lambda + \Lambda^{\top}}{2} + UX^{\top} + \frac{\Lambda + \Lambda^{\top}}{2} \tag{152}$$
$$= XU^{\top} + UX^{\top} + \Lambda + \Lambda^{\top} = 0, \tag{153}$$
which means
$$\Lambda + \Lambda^{\top} = -XU^{\top} - UX^{\top}. \tag{154}$$
Substituting this back into (151) gives
$$V = \frac{U - XU^{\top}X}{2}. \tag{155}$$
That is, $P_X(U) = \frac{U - XU^{\top}X}{2}$ for orthogonal groups.
Closest-point projection Again, using the Lagrange multiplier $\Lambda$ for the constraint that the projection $M$ of $X$ should satisfy $M^{\top}M = I$, we try to find the stationary point of the following
quantity
$$\|M - X\|_F^2 + \langle\Lambda,\ M^{\top}M - I\rangle_F. \tag{156}$$
That is, $\pi(X) = UV^{\top}$ for orthogonal groups, where $U, V$ are the left and right singular matrices of
$X$.
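Both maps for the orthogonal group are short to write down explicitly; a sketch (helper names are placeholders):

```python
import torch

def orthogonal_tangent_proj(X, U):
    # P_X(U) = (U - X U^T X) / 2, from (155).
    return (U - X @ U.T @ X) / 2

def orthogonal_closest_point(X):
    # pi(X) = U V^T with U, V the left and right singular matrices of X.
    U, _, Vh = torch.linalg.svd(X)
    return U @ Vh
```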
Manifold         | Activation | Hidden layers | Embedding size | ActNorm first
---------------- | ---------- | ------------- | -------------- | -------------
Sphere           | Sine       | 5             | 512            | False
Tori             | Swish      | 4             | 256            | False
Hyperbolic       | Swish      | 2             | 512            | True
Orthogonal group | Swish      | 256           | 256            | False

Table 3: Architectures of the variational function ($a$ network) for the different manifolds in our experiments.
D Experimental details
D.1 Architecture
In our experiments, we parameterize the $a$ network as a multi-layer perceptron (MLP) with either the
sinusoidal or the swish activation function. For the hyperbolic experiments, the first layer of the MLP
has an additional ActNorm layer (Kingma & Dhariwal, 2018), which we find adds extra numerical
stability. The ActNorm layer is initialized before training with one batch such that its output has
a mean of zero and a standard deviation of one. In an analogous manner to training the MLP,
the ActNorm parameters are updated via backpropagation. For the orthogonal group experiments,
we flatten the input matrix into a vector before passing it to the MLP. The details of our various
models are given in Table 3. For our importance sampler, which is used to represent a differentiable
distribution over [0, T], we use a deep sigmoidal flow (Huang et al., 2018) (without the final logit
activation) followed by a fixed scaling flow, which represents the range [0, T]. We disconnect the
gradient from the numerical solver to save compute; i.e., $Y_s$ is not differentiable. This results
in slightly biased gradient updates for minimizing the variance of the importance estimator, but we
still observe a substantial reduction in variance (see Figure 2). Finally, we use PyTorch (Paszke et al.,
2019) as our deep learning framework.
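The following is a rough sketch of the kind of $a$ network described above, not the implementation used in our experiments; the time-conditioning scheme, the output head, and the ActNorm initialization details are our own assumptions:

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    # Per-dimension affine layer, initialized on the first batch so that its
    # output has zero mean and unit standard deviation, then trained by backprop.
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            with torch.no_grad():
                self.bias.copy_(-x.mean(dim=0))
                self.scale.copy_(1.0 / (x.std(dim=0) + 1e-6))
            self.initialized = True
        return (x + self.bias) * self.scale

def make_a_net(ambient_dim, hidden=512, n_layers=2, act=nn.SiLU, actnorm_first=True):
    # Input: flattened manifold point concatenated with a time feature (our choice).
    in_dim = ambient_dim + 1
    layers = [ActNorm(in_dim)] if actnorm_first else []
    dims = [in_dim] + [hidden] * n_layers
    for i in range(n_layers):
        layers += [nn.Linear(dims[i], dims[i + 1]), act()]
    layers += [nn.Linear(hidden, ambient_dim)]  # output an ambient-space vector
    return nn.Sequential(*layers)

a_net = make_a_net(ambient_dim=3, hidden=512, n_layers=2, act=nn.SiLU, actnorm_first=True)
```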
Computational Resources. We run all of our experiments either on a single NVIDIA Tesla V100
or a single NVIDIA Quadro RTX 8000 GPU for a maximum of 30 hours.
D.2 Optimization
We use the Adam (Kingma & Ba, 2015) optimizer to train the a network. The learning rate and mo-
mentum parameters used for each manifold is mentioned in the Table 4. For the sphere experiments,
we slowly decrease the learning rate during training using a cosine scheduler. For optimization of
our importance sampler, we use Adam with a fixed learning rate of 0.01. We update the importance
sampler every 500 steps of our training loop for the a network. Lastly, to optimize our mixture of
power spherical distributions for the tori experiments we use Adam with a learning rate of 0.03 with
β1 = 0.9 and β2 = 0.999.
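A schematic of this optimization loop is sketched below; the module definitions, the step count, and the a-network learning rate are placeholders (the per-manifold values live in Table 4), while the 0.01 sampler learning rate and the 500-step update interval are from the text:

```python
import torch
import torch.nn as nn

a_net = nn.Linear(4, 4)               # placeholder for the a network
importance_sampler = nn.Linear(1, 1)  # placeholder for the deep sigmoidal flow
num_steps = 10_000                    # placeholder

a_opt = torch.optim.Adam(a_net.parameters(), lr=1e-4)                         # per-manifold values in Table 4
a_sched = torch.optim.lr_scheduler.CosineAnnealingLR(a_opt, T_max=num_steps)  # sphere experiments only
sampler_opt = torch.optim.Adam(importance_sampler.parameters(), lr=0.01)

for step in range(num_steps):
    # ... compute the negative Riemannian CT-ELBO and call .backward() ...
    a_opt.step()
    a_opt.zero_grad()
    a_sched.step()
    if step % 500 == 0:
        # ... variance-minimization objective for the importance sampler ...
        sampler_opt.step()
        sampler_opt.zero_grad()
```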
D.3 KELBO
The gap between the exact likelihood of the data given the model, i.e. $\log p(x)$, and the Riemannian
CT-ELBO may be large. This evaluation gap makes empirical validation of the models using the
Riemannian CT-ELBO imprecise. We obtain a tighter lower bound by using $K > 1$ samples and
importance sampling, similar to Burda et al. (2015). In detail, we know from (90) that
" Z T ! #
dP 1
log p(x, t) = log EQ · p0 (Yt ) exp − ∇g · V0 − (V · ∇g )V ds Y0 = x .
dQ 0 2
Manifold         | Integration steps during training
---------------- | ---------------------------------
Sphere           | 100
Tori             | 1000
Hyperbolic       | 100
Orthogonal group | 100

Table 5: Details of training integration.
Writing $L(Y)$ for the term inside the expectation above, we estimate this quantity with $K$ i.i.d. trajectories $Y^i$ sampled from $Q$. We call the resulting lower bound KELBO. Note that this is a tighter lower bound because we can write
$$\text{KELBO} = \mathbb{E}_Q\!\left[\log\frac{1}{K}\sum_{i=1}^{K}L(Y^i)\right] \ge \mathbb{E}_Q\!\left[\frac{1}{K}\sum_{i=1}^{K}\log L(Y^i)\right] = \mathbb{E}_Q\!\left[\log L(Y)\right] = \text{Riemannian CT-ELBO},$$
where the inequality follows from Jensen's inequality. In fact, this lower bound increases monotonically to the true likelihood as $K\to\infty$. We use KELBO
with $K = 100$ for evaluating all of our models. We experimented with values of $K$ up to 1000
and found that the results stop changing much for $K > 100$.
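Numerically, the $K$-sample bound is a log-mean-exp of per-trajectory estimates; a small sketch (the function name and placeholder inputs are ours):

```python
import math
import torch

def kelbo_estimate(log_L):
    # log_L[i] = log L(Y^i) for K i.i.d. trajectories Y^i ~ Q; returns the
    # Monte-Carlo estimate of log (1/K) sum_i L(Y^i), computed stably.
    K = log_L.shape[0]
    return torch.logsumexp(log_L, dim=0) - math.log(K)

log_L = torch.randn(100)   # placeholder per-trajectory log-likelihood estimates, K = 100
print(kelbo_estimate(log_L))
```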
During training and evaluation, we numerically integrate the SDE on each respective manifold using
the Stratonovich–Heun method as described in Burrage et al. (2004). Each iteration is followed by
the closest-point projection (in the case of $\mathbb{H}^d_K$, we use the closest-point projection wrt the Lorentz inner
product). The number of integration steps for each manifold during training is reported in Table 5.
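For reference, one projected Stratonovich–Heun step has the predictor–corrector form sketched below (a generic illustration of the scheme, not the torchsde internals; the drift and diffusion callables are assumed given):

```python
import torch

def projected_heun_step(y, drift, diffusion, dt, closest_point):
    # Stratonovich-Heun predictor-corrector step, followed by the
    # closest-point projection back onto the manifold.
    #   drift(y):     R^m -> R^m
    #   diffusion(y): R^m -> R^{m x m}   (e.g. the tangential projection)
    dB = torch.randn_like(y) * dt ** 0.5
    y_pred = y + drift(y) * dt + diffusion(y) @ dB
    y_next = y + 0.5 * (drift(y) + drift(y_pred)) * dt \
               + 0.5 * (diffusion(y) + diffusion(y_pred)) @ dB
    return closest_point(y_next)
```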
During evaluation, as described in D.3, we numerically integrate the data from $s = 0$ to $s = T$,
and the Itô integral involved in the KELBO is approximated using the Euler–Maruyama scheme
(note that the dynamics are still generated using Stratonovich–Heun). As computing KELBO requires
forward passes through the $a$ network, it may not be as smooth as just integrating the inference SDE.
Therefore, we use an adaptive step size for integration. We adapted the torchsde library (Kidger
et al., 2021; Li et al., 2020) to calculate errors and adapt the step size accordingly. The error tolerance
and minimum step size used in integration for all the experiments are reported in Table 6. Also, for
plotting densities, we use the exact log-likelihood of the equivalent ODE. To numerically integrate
the ODE for computing the exact likelihood, we use the default dopri5 solver from the torchdiffeq
library (Chen et al., 2018b, 2021). Finally, we use cartopy (Met Office, 2010–2015), matplotlib
(Hunter, 2007), and plotly (Inc., 2015) for visualization.