Pennec - Intrinsic Statistics On Riemannian Manifolds
Abstract
In medical image analysis and high level computer vision, there is an intensive use of geo-
metric features like orientations, lines, and geometric transformations ranging from simple ones
(orientations, lines, rigid body or affine transformations, etc.) to very complex ones like curves,
surfaces, or general diffeomorphic transformations. The measurement of such geometric prim-
itives is generally noisy in real applications and we need to use statistics either to reduce the
uncertainty (estimation), to compare observations, or to test hypotheses. Unfortunately, even
simple geometric primitives often belong to manifolds that are not vector spaces. In previous
works [1, 2], we investigated invariance requirements to build some statistical tools on transformation groups and homogeneous manifolds that avoid paradoxes. In this paper, we consider
finite dimensional manifolds with a Riemannian metric as the basic structure. Based on this
metric, we develop the notions of mean value and covariance matrix of a random element, nor-
mal law, Mahalanobis distance and χ2 law. We provide a new proof of the characterization of
Riemannian centers of mass and an original gradient descent algorithm to efficiently compute
them. The notion of Normal law we propose is based on the maximization of the entropy know-
ing the mean and covariance of the distribution. The resulting family of pdfs spans the whole
range from uniform (on compact manifolds) to the point mass distribution. Moreover, we were
able to provide tractable approximations (with their limits) for small variances which show that
we can effectively implement and work with these definitions.
1 Introduction
To represent the results of a random experiment, one theoretically considers a probability measure on the space of all events. Although this probabilized space contains all the information about the random experiment, one often has access only to some measurements depending on the outcome of the experiment. The mathematical way to formalize this is to investigate random variables or observables, which are maps from the probabilized space into R. One usually simplifies further by restricting to random variables that have a probability density function (pdf).
However, from a computational point of view, the pdf is still too informative and we have to restrict the measurements to a few numeric characteristics of a random variable. Thus, one usually approximates a unimodal pdf by a central value and a dispersion value around it. The most used central value is the mean value or expectation of the random variable: x̄ = E[x] = ∫ x dPr = ∫ y p_x(y) dy. The corresponding dispersion value is the variance σ_x² = E[(x − x̄)²].
In real problems, we can have several simultaneous measurements of the same random exper-
iment. If we arrange these n random variables xi into a vector x = (x1 . . . xn ), we obtain a
random vector. As the expectation is a linear operator, it is easily generalized to vector or matrix functions in order to define the mean value and the covariance matrix of a random vector: Σ_xx = E[(x − x̄)(x − x̄)ᵀ] = ∫ (y − x̄)(y − x̄)ᵀ p_x(y) dy. If one has to assume a probability dis-
tribution, the Gaussian distribution is usually well adapted, as it is completely determined by the
mean and the covariance. It is moreover the entropy maximizing distribution knowing only these
moments. Then, one can use a statistical distance between distributions such as the Mahalanobis
distance and the associated statistical tests.
The problem we investigate in this article is to generalize this framework to measurements in
finite dimensional Riemannian manifolds instead of measurements in a vector space. Examples of
manifolds we routinely use in medical imaging applications are 3D rotations, 3D rigid transforma-
tions, frames (a 3D point and an orthonormal trihedron), semi- or non-oriented frames (where 2
(resp. 3) of the trihedron unit vectors are given up to their sign) [3, 4], oriented or directed points
[5, 6], positive definite symmetric matrices coming from diffusion tensor imaging [7, 8, 9, 10, 11]
or from variability measurements [12]. We have already shown in [13, 2] that this is not an easy
problem and that some paradoxes can arise. In particular, we cannot generalize the expectation
to give a mean value since it would be an integral with value in the manifold: a new definition of
mean is needed, which implies revisiting an important part of the theory of statistics.
Statistical analysis on manifolds is a relatively new domain at the confluence of several mathematical and application domains. Its goal is to statistically study geometric objects living in differential manifolds. It is linked to the theory of statistical manifolds [14, 15], which aims at providing a Riemannian structure to the space of parameters of statistical distributions. However, the targeted geometrical objects are usually different. Directional statistics [16, 17, 18, 19] provide a first approach to statistics on manifolds. As the manifolds considered here are spheres and projective
spaces, the tools developed were mostly extrinsic, i.e. relying on the embedding of the manifold in
the ambient Euclidean space. More complex objects are obtained when we consider the “shape” of
a set of k points, i.e. what remains invariant under the action of a given group of transformation
(usually rigid body ones or similarities). The statistics on these shape spaces [20, 21, 22, 23] raised
the need for intrinsic tools. However, the link between the tools developed in these works, the
metric used and the space structure was not always very clear.
Another mathematical approach was provided by the study of stochastic processes on Lie groups.
For instance, [24] derived central limit theorems on different families of groups and semi-groups
with specific algebraic properties. Since then, several authors in the area of stochastic differential
geometry and stochastic calculus on manifolds proposed results related to mean values [25, 26,
27, 28, 29, 30]. On the applied mathematics and computer science side, people got interested in computing and optimizing on specific manifolds, like rotations and rigid body transformations [4, 31, 32, 33, 34], Stiefel and Grassmann manifolds [35], etc.
Over the last years, several groups attempted to federate some of the above approaches in a
general statistical framework, with different objectives in mind. For instance, [36] and [15] aimed
at characterizing the performances of statistical parametric estimators, like the bias and the mean
square error. [36] considered extrinsic statistics, based on the Euclidean distance of the embedding
space, while [15] considered the intrinsic Riemannian distance, and refined the Cramer-Rao lower
bound using bounds on the sectional curvature of the manifold. In [37, 38], the authors focused on
the asymptotic consistency properties of the extrinsic and intrinsic means and variances for large
sample sizes, and were able to propose a central limit theorem for flat manifolds. Here, in view of
computer vision and medical image analysis applications, our concern is quite different: we aim at
developing computational tools that can consistently deal with geometric features, or that provide
at least good approximations. As we often have few measurements, we are interested in small
sample sizes rather than large ones, and we prefer to obtain approximations rather than bounds on the quality of the estimation. Thus, one of our special interests is to develop Taylor expansions with
respect to the variance, in order to evaluate the quality of the computations with respect to the
curvature of the manifold. In all cases, the chosen framework is the one of geodesically complete
Riemannian manifolds, which appears to be powerful enough to support an interesting theory. To
ensure a maximal consistency of the theory, we rely in this paper only on intrinsic properties of
the Riemannian manifold, thus excluding methods based on the embedding of the manifold in an
ambient Euclidean space.
We review in Section 2 some basic notions of differential and Riemannian geometry that will
be needed afterward. This synthesis was inspired from [39, chap. 9], [40, 41, 42], and the reader
can refer to these books to find more details. In the remainder of the paper, we consider that our Riemannian manifold is connected and geodesically complete. We first detail in Section 3 the measure induced by the Riemannian metric on the manifold, which allows us to define probability density functions, in particular the uniform one. Then, we turn in Section 4 to the expectation of a
random point. We provide a quite comprehensive survey of the definitions that have been proposed.
Among them, we focus on the Karcher and Fréchet means, defined as the set of points minimizing
locally or globally the variance (the expected Riemannian distance). We provide a new proof of
the barycentric characterization theorem and an original Gauss-Newton gradient descent algorithm
to practically compute the mean. Once the mean value is determined, one can easily define the covariance matrix of a random point (and possibly higher order moments) using the exponential map at the mean point (Section 5). To generalize the Gaussian distribution, we propose in Section
6 a new family of distributions based on a maximum entropy approach. Under some reasonable hypotheses, we show that it amounts to taking a truncated Gaussian distribution in the exponential chart at the mean point. We illustrate the properties of this pdf family on the circle, and provide
computationally tractable approximations for concentrated distributions. Last but not least, we
investigate in Section 7 the generalization of the Mahalanobis distance and the χ2 law. A careful
analysis shows that, with our definition of the generalized Gaussian, the χ2 law remains independent
of the variance and of the manifold curvature, up to the order 3. This demonstrates that the whole framework is computationally sound and particularly consistent.
2 Differential geometry background
2.1 Riemannian metric, distance and geodesics
In the geometric framework, one specifies the structure of a manifold M by a Riemannian metric. This is a continuous collection of dot products ⟨·|·⟩_x on the tangent space T_xM at each point x of the manifold. A local coordinate system x = (x¹, ..., xⁿ) induces a basis (∂₁, ..., ∂ₙ) of the tangent spaces (∂ᵢ is a shorter notation for ∂/∂xⁱ). Thus, we can express the metric in this basis by a symmetric positive definite matrix G(x) = [g_ij(x)] where each element is given by the dot product of the tangent vectors to the coordinate curves: g_ij(x) = ⟨∂ᵢ | ∂ⱼ⟩. This matrix is called the local representation of the Riemannian metric in the chart x, and the dot product of two vectors v and w in T_xM is then ⟨v | w⟩_x = vᵀ G(x) w.
If we consider a curve γ(t) on the manifold, we can compute at each point its instantaneous
speed vector γ̇(t) and its norm, the instantaneous speed. To compute the length of the curve, we
can proceed as usual by integrating this value along the curve:
    L_a^b(γ) = ∫_a^b ‖γ̇(t)‖ dt = ∫_a^b ⟨γ̇(t) | γ̇(t)⟩_{γ(t)}^{1/2} dt
The Riemannian metric is the intrinsic way of measuring length on a manifold. The extrinsic way is to consider the manifold as embedded in a larger vector space E (think for instance of the sphere S2 in R3) and compute the length of a curve in M as for any curve in E. In this case, the
corresponding Riemannian metric is the restriction of the dot product of E onto the tangent space
at each point of the manifold. By Whitney’s theorem, there always exists such an embedding for a
large enough vector space E (dim(E) ≤ 2dim(M) + 1).
To obtain a distance between two points of a connected Riemannian manifold, we simply have
to take the minimum length among the smooth curves joining these points:
    dist(x, y) = min_γ { L_a^b(γ) : γ(a) = x, γ(b) = y }        (1)
The curves realizing this minimum for any two points of the manifold are called geodesics¹. Let [g^ij] = [g_ij]^(-1) denote the inverse of the metric matrix (in a given coordinate system x) and Γ^i_jk = (1/2) g^im (∂_k g_mj + ∂_j g_mk − ∂_m g_jk) the Christoffel symbols (using the Einstein summation convention, which implicitly sums over each index appearing both up and down in a formula). The calculus of variations shows that the geodesics are the curves satisfying the following second order differential system (in the chart x = (x¹, ..., xⁿ)):
    γ̈^i + Γ^i_jk γ̇^j γ̇^k = 0
The manifold is said to be geodesically complete if the definition domain of all geodesics can be extended to R. This means that the manifold has no boundary nor any singular point that we can reach in a finite time (for instance, Rn − {0} with the usual metric is not geodesically complete, but Rn or Sn are). As an important consequence, the Hopf-Rinow-De Rham theorem states that such a manifold is complete for the induced distance (equation 1), and that there always exists at least one minimizing geodesic between any two points of the manifold (i.e. whose length is the distance between the two points). From now on, we will assume that the manifold is geodesically complete.
¹ In fact, geodesics are defined as the critical points of the energy functional E(γ) = (1/2) ∫_a^b ‖γ̇(t)‖² dt. It turns out that they also optimize the length functional, but they are moreover parameterized proportionally to arc-length.
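To make the geodesic equation concrete, here is a minimal numerical sketch (our own illustration, not part of the paper) that integrates this second order system on the sphere S2 in spherical coordinates (θ, φ), where the metric is diag(1, sin²θ) and the only non-zero Christoffel symbols are Γ^θ_φφ = −sin θ cos θ and Γ^φ_θφ = Γ^φ_φθ = cot θ:

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(t, state):
    # state = (theta, phi, dtheta, dphi) on S^2 with ds^2 = dtheta^2 + sin(theta)^2 dphi^2.
    theta, phi, dtheta, dphi = state
    # Geodesic equation: ddot(x)^i = -Gamma^i_{jk} dot(x)^j dot(x)^k
    ddtheta = np.sin(theta) * np.cos(theta) * dphi ** 2
    ddphi = -2.0 * (np.cos(theta) / np.sin(theta)) * dtheta * dphi
    return [dtheta, dphi, ddtheta, ddphi]

# Start on the equator with an oblique initial speed: the solution is a great circle.
sol = solve_ivp(geodesic_rhs, (0.0, np.pi), [np.pi / 2, 0.0, 0.5, 0.5], rtol=1e-9)
```

One can check on the output that the speed ‖γ̇(t)‖ stays constant along the integrated curve, as expected for a geodesic parameterized proportionally to arc-length.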
3
2.2 Exponential map and cut locus
From the theory of second order differential equations, we know that there exists one and only one geodesic γ_(x,∂v) going through the point x ∈ M at t = 0 with tangent vector ∂v ∈ T_xM. This geodesic is theoretically defined in a sufficiently small interval around zero, but since the manifold is geodesically complete, its definition domain can be extended to R. Thus, the point γ_(x,∂v)(t) is defined for every vector ∂v ∈ T_xM and every parameter t. The exponential map maps each vector ∂v to the point of the manifold reached in a unit time:
    exp_x : T_xM → M,   ∂v ↦ exp_x(∂v) = γ_(x,∂v)(1)
This function realizes a local diffeomorphism from a sufficiently small neighborhood of 0 in T_xM into a neighborhood of the point x ∈ M. We denote by log_x = exp_x^(-1) the inverse map, or simply →xy = log_x(y). In this chart, the geodesics going through x are represented by the lines going through the origin: log_x γ_(x,→xy)(t) = t →xy. Moreover, the distance with respect to the development point x is preserved:
    dist(x, y) = ‖→xy‖ = ⟨ →xy | →xy ⟩_x^{1/2}
Thus, the exponential chart at x can be seen as the development of the manifold in the tangent
space at a given point along the geodesics. This is also called a normal coordinate system if it
is provided with an orthonormal basis. At the origin of such a chart, the metric reduces to the
identity matrix and the Christoffel symbols vanish.
Now, it is natural to search for the maximal domain where the exponential map is a diffeomorphism. If we follow a geodesic γ_(x,∂v)(t) = exp_x(t ∂v) from t = 0 to infinity, it is either minimizing all along or it is minimizing up to a time t₀ < ∞ and not any more afterwards (thanks to the geodesic completeness). In this last case, the point z = γ_(x,∂v)(t₀) is called a cut point and the corresponding tangent vector t₀ ∂v a tangential cut point. The set of all cut points of all geodesics starting from x is the cut locus C(x) ⊂ M, and the set of corresponding vectors is the tangential cut locus 𝒞(x) ⊂ T_xM. Thus, we have C(x) = exp_x(𝒞(x)), and the maximal definition domain for the exponential chart is the domain D(x) containing 0 and delimited by the tangential cut locus.
It is easy to see that this domain is connected and star-shaped with respect to the origin of T_xM. Its image by the exponential map covers all the manifold except the cut locus, and the segment [0, →xy] is transformed into the unique minimizing geodesic from x to y. Hence, the exponential chart at x is a chart centered at x with a connected and star-shaped definition domain that covers all the manifold except the cut locus C(x):
    D(x) ⊂ Rn  ←→  M − C(x)
    →xy = log_x(y)  ←→  y = exp_x(→xy)
From a computational point of view, it is often interesting to extend this representation to include the tangential cut locus. However, we have to take care of the multiple representations: points of the cut locus where several minimizing geodesics meet are represented by several points on the tangential cut locus, as these geodesics start with different tangent vectors (e.g. the rotation of π around the axes ±n for 3D rotations, or the antipodal point on the sphere). This multiplicity problem cannot be avoided as the set of such points is dense in the cut locus.
The size of this definition domain is quantified by the injectivity radius i(M, x) = dist(x, C(x)), which is the maximal radius of centered balls in T_xM on which the exponential map is one-to-one. The injectivity radius of the manifold i(M) is the infimum of the injectivity radius over the manifold. It may be zero, in which case the manifold somehow tends towards a singularity (think e.g. of the surface z = 1/√(x² + y²) as a sub-manifold of R3).
Example: On the sphere Sn (center 0 and radius 1) with the canonical Riemannian metric (induced by the ambient Euclidean space Rn+1), the geodesics are the great circles and the cut locus of a point x is its antipodal point x̄ = −x. The exponential chart is obtained by rolling the sphere onto its tangent space so that the great circles going through x become lines. The maximal definition domain is thus the open ball D = Bn(π). On its boundary ∂D = 𝒞 = Sn−1(π), all the points represent x̄ = −x.
Figure 1: Exponential chart and cut locus for the sphere S2 and the projective space P2
For the real projective space Pn (obtained by identification of antipodal points of the sphere
Sn ), the geodesics are still the great circles, but the cut locus of the point {x, −x} is now the
equator of the two points where antipodal points are still identified (thus the cut locus is Pn−1 ).
The definition domain of the exponential chart is the open ball D = Bn ( π2 ), and the tangential cut
locus is the sphere ∂D = Sn−1 ( π2 ) where antipodal points are identified.
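As a concrete illustration, here is a minimal sketch (our own code, with hypothetical function names) of the exponential map and its inverse on the unit sphere Sn embedded in Rn+1, using the extrinsic representation of points as unit vectors:

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map exp_x(v) on the unit sphere S^n (x unit vector, v tangent at x)."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_log(x, y):
    """Inverse map log_x(y); defined for y != -x (the cut locus of x)."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(c)              # geodesic distance dist(x, y)
    w = y - c * x                     # component of y orthogonal to x
    nw = np.linalg.norm(w)
    if nw < 1e-12:                    # y == x, or y == -x where the log is undefined
        return np.zeros_like(x)
    return theta * (w / nw)
```

One verifies that dist(x, y) = ‖log_x(y)‖ < π inside the definition domain, as stated above.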
where Jf_x = [∂_i f] and Hf_x = [∂_ij f]. Since we are in a normal coordinate system, we have f_x(v) = f(exp_x(v)). Moreover, the metric at the origin reduces to the identity, Jf_x = grad fᵀ, and the Christoffel symbols vanish, so that the matrix of second derivatives Hf_x corresponds to the Hessian Hess f. Thus, the Taylor expansion can be written in any coordinate system:
    f(exp_x(v)) = f(x) + grad f(v) + (1/2) Hess f(v, v) + O(‖v‖³)        (2)
As in the real or vectorial case, we can now forget about the original space Ω and directly work with the induced probability measure on M.
One can show that the cut locus has a null measure. This means that we can integrate indifferently in M or in any exponential chart. If f is an integrable function of the manifold and f_x(→xy) = f(exp_x(→xy)) is its image in the exponential chart at x, we have:
    ∫_M f(x) dM = ∫_{D(x)} f_x(z) √(det G_x(z)) dz
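As a quick sanity check (our own sketch, not from the paper), one can verify this formula numerically on S2, where the metric in the exponential chart gives √(det G_x(z)) = sin(‖z‖)/‖z‖, so that the integral of the constant function 1 over D(x) = B2(π) must recover the total volume 4π:

```python
import numpy as np

# Monte-Carlo integration over the exponential chart D(x) = B_2(pi) of S^2,
# where sqrt(det G(z)) = sin(||z||)/||z||. The result should approach Vol(S^2) = 4*pi.
rng = np.random.default_rng(0)
v = rng.uniform(-np.pi, np.pi, (1_000_000, 2))   # sample the enclosing square
r = np.linalg.norm(v, axis=1)
weight = np.where(r < np.pi, np.sin(r) / r, 0.0) # keep only the chart domain B_2(pi)
print(weight.mean() * (2 * np.pi) ** 2)          # approx 12.566 = 4*pi
```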
A simple example of a pdf is the uniform pdf in a bounded set X:
    p_X(y) = 1_X(y) / ∫_X dM = 1_X(y) / Vol(X)
One must be careful that this pdf is uniform with respect to the measure dM and is not uniform for another measure on the manifold. This problem is the basis of the Bertrand paradox for geometrical probabilities [43, 44, 2] and raises the question of which measure to choose on the manifold. In our case, the measure is induced by the Riemannian metric, but the problem is only lifted: which Riemannian metric do we have to choose? For transformation groups and homogeneous manifolds, we showed in [1] that an invariant metric is a good geometric choice, even if such a metric does not always exist for homogeneous manifolds or if it leads in general to a partial consistency only between the geometric and statistical operations in non compact transformation groups [45].
This notion of expectation corresponds to the one we defined on real random variables and vectors.
However, we cannot directly extend it to define the mean value of the distribution since we have
no way to generalize this integral in R into an integral with value in the manifold.
In a vector space, the expected squared distance E[‖y − x‖²] is minimized for the mean vector x̄ = E[x]. The major point for the generalization is that the expectation of a real valued function is well defined for our connected and geodesically complete Riemannian manifold M.
Definition 3 (Variance of a random point)
Let x be a random point of pdf p_x. The variance σ_x²(y) is the expectation of the squared distance between the random point and the fixed point y:
    σ_x²(y) = E[dist(y, x)²] = ∫_M dist(y, z)² p_x(z) dM(z)        (3)
If there exists at least one mean point x̄, we call variance the minimal value σ_x² = σ_x²(x̄) and standard deviation its square-root.
Similarly, one defines the empirical or discrete mean point of a set of measures x1 , . . . xn :
    E[{x_i}] = arg min_{y∈M} E[dist(y, x_i)²] = arg min_{y∈M} (1/n) Σ_i dist(y, x_i)²
If there exists at least a mean point x̄, one calls empirical variance the minimal value s² = (1/n) Σ_i dist(x̄, x_i)², and empirical standard deviation or RMS (for Root Mean Square) its square-root.
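For instance, with the hypothetical sphere_log map sketched in Section 2.2, this empirical variance can be evaluated directly in a chart (again our own illustrative code):

```python
import numpy as np

def empirical_variance(y, xs, log_map):
    """s^2(y) = (1/n) sum_i dist(y, x_i)^2, using dist(y, x) = ||log_y(x)||."""
    return float(np.mean([np.linalg.norm(log_map(y, x)) ** 2 for x in xs]))
```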
Following the same principle, one can define other types of central values. The mean deviation
at order α is
    σ_{x,α}(y) = ( E[dist(y, x)^α] )^{1/α} = ( ∫_M dist(y, z)^α p_x(z) dM(z) )^{1/α}
If this function is bounded on M, one calls central point at order α every point x̄_α minimizing it.
For instance, the modes are obtained for α = 0. Exactly like in a vector space, they are the points
where the density is maximal on the manifold (which is generally not a maximum for the density
on the charts). The median point is obtained for α = 1. For α → ∞, we obtain the “barycenter” of
the distribution support (which has to be compact).
The definition of these central values can be extended to the discrete case easily, except perhaps
for the modes and for α → ∞. We note that the Fréchet expectation is defined for any metric space and not only for Riemannian manifolds.
Following [25], we call Riemannian centers of mass the points minimizing locally the variance. As global minima are local minima, the Fréchet expected points are a subset of the Riemannian centers of mass. However, the use of local minima allows one to characterize the Riemannian centers of mass using only local derivatives of order two.
Using this extended definition, Karcher [25] and Kendall [48] established conditions on the
manifold and the distribution to ensure the existence and uniqueness of the mean. We just recall
here the results without the proofs.
Definition 5 (Regular geodesic balls) The ball B(x, r) = {y ∈ M | dist(x, y) < r} is said to be geodesic if it does not meet the cut locus of its center. This means that there exists a unique minimizing geodesic from the center to any point of a geodesic ball. The ball is said to be regular if its radius verifies 2r√κ < π, where κ is the maximum of the Riemannian curvature in this ball.
For instance, on the sphere S2 with radius one, the curvature is constant and equal to 1. A geodesic ball is regular if r < π/2. Such a ball can almost cover a hemisphere, but not the equator.
In a Riemannian manifold with non positive curvature, a regular geodesic ball can cover the whole
manifold (according to the Cartan-Hadamard theorem, such a manifold is diffeomorphic to Rn if it
is simply connected and complete).
• Kendall [48] If the support of px is included in a regular geodesic ball B(y, r), then there
exists one and only one Riemannian center of mass x on this ball.
• Karcher [25] If the support of px is included in a geodesic ball B(y, r) and if the ball of double
radius B(y, 2 r) is still geodesic and regular, then the variance σx2 (z) is a convex function of z
and has only one critical point on B(y, r), necessarily the Riemannian center of mass.
These conditions ensure a correct behavior of the mean for sufficiently localized distributions.
However, they are quite restrictive as they only address pdfs with a compact support. Kendall’s
existence and uniqueness theorem was extended by [30] to distributions with non-compact support
in manifolds with Ψ-convexity. This notion, already introduced by Kendall in his original proof,
is here extended to the whole manifold. Unfortunately, this type of argument can only be used for a restricted class of manifolds, as a non-compact connected and geodesically complete Ψ-convex manifold is diffeomorphic to Rm. Nevertheless, this extension of the theorem applies to the important class of Hadamard manifolds (i.e. simply connected, complete and with non-positive sectional curvature) whose curvature is bounded from below.
Following Doss, one observes that, for a real random variable x, the only point x̄ verifying
    ∀y ∈ R,   |y − x̄| ≤ E[|y − x|]
is the mean value E[x]. Thus, in a metric space, the mean according to Doss is defined as the set of points x̄ ∈ M verifying:
    ∀y ∈ M,   dist(y, x̄) ≤ E[dist(y, x)]
Herer shows in [50, 51] that this definition includes the classical expectation in a Banach space (with possibly other points) and develops on this basis a conditional expectation.
A similar definition, using convex functions on the manifold instead of metric properties, was proposed by Emery [27] and Arnaudon [52, 28]. A function from M to R is convex if its restriction to every geodesic is convex (considered as a function from R to R). The convex barycenter of a random point x with density p_x is the set B(x) of points y ∈ M such that α(y) ≤ E[α(x)] holds for every real bounded and convex function α on a neighborhood of the support of p_x.
This definition seems to be of little interest in our case since for compact manifolds, such as
the sphere or SO3 (the manifold of 3D rotations), the geodesics are closed and the only convex
functions on the manifold are the constant ones. Thus, every random point for which the support
of the distribution is the whole manifold has the whole manifold as convex barycenter.
However, in the case where the support of the distribution is included in a strongly convex open set U², Emery [27] showed that the exponential barycenters, defined as the critical points of the variance σ_x²(y), are a subset of the convex barycenter B(x). Local and global minima being particular critical points, the exponential barycenters include the Riemannian centers of mass, which themselves include the Fréchet means.
Picard [29] realized a good synthesis of most of these notions of mean value and showed that the definition of a “barycenter” (i.e. a mean value) is linked to a connector, which itself determines a connection, and thus possibly a metric. An interesting property brought by this formulation is that the distance between two barycenters (with different definitions) is of the order of O(σ_x). Thus, for sufficiently centered random points, all these values are close.
The variance σ²(y) is differentiable at any point y ∈ M where it is finite and where the cut locus C(y) has a null probability measure (Theorem 2):
    P(C(y)) = ∫_{C(y)} dP(z) = 0    and    σ²(y) = ∫_M dist(y, z)² dP(z) < ∞
² Here, strongly convex means that for every two points of U there is a unique minimizing geodesic joining them that depends in a C∞ way on the two points.
At such a point, it has the following gradient:
    (grad σ²)(y) = −2 ∫_{M∖C(y)} →yz dP(z) = −2 E[→yx]
Now, we know that the variance is continuous but may not be differentiable at the points where the cut locus has a non-zero probability measure. At these points, the variance can have an extremum (think for instance of ‖x‖ in vector spaces). Thus, the extrema of σ² are characterized by (grad σ²)(y) = 0 if this gradient is defined, or by P(C(y)) > 0.
If the manifold does not have a cut locus (for instance in Hadamard manifolds), we have no differentiation problem. One could think of going one step further and computing the Hessian matrix. Indeed, we have in the vector case Hess σ_x²(y) = 2 Id everywhere, which proves that any extremum of the variance is a minimum. In Riemannian manifolds, one has to be more careful because the Hessian is modified by the curvature of the manifold. One solution is to compute the Hessian matrix using the theory of Jacobi fields, and then take estimates of its eigenvalues based on bounds on the curvature. This is essentially the idea exploited in [25] to show the uniqueness of the mean in small enough geodesic balls, and by [54] to exhibit an example of a manifold without cut locus that is strongly convex (i.e. there is one and only one minimizing geodesic joining any two points), but that supports finite mass measures with non-unique Riemannian centers of mass. Thus, the absence of a cut locus is not enough: one should also have some constraints on the curvature of the manifold. In order to remain simple, we stick in this paper to the existence and uniqueness theorem provided by [30] for simply connected and complete manifolds whose curvature is non-positive (i.e. Hadamard) and bounded from below.
Results similar to Theorem 2 and the above corollaries have been derived independently. [15]
defined the mean values in manifolds as the exponential barycenters. To relate them with the
Riemannian centers of mass, they determined the gradient of the variance. However, they only
investigate the relation between the two notions when the probability is dominated by the Rie-
mannian measure, which excludes explicitly point-mass distributions. In [37, 38], the gradient of
the variance is also determined and the existence of the mean is established for simply connected
Riemannian manifolds with non-positive curvature.
Basically, the characterization of the Riemannian center of mass is the same as in Euclidean spaces if the curvature of the manifold is non-positive (and bounded from below), in which case there is no cut locus (we assumed that the manifold was complete and simply connected). If the sectional curvature becomes positive, a cut locus may appear, and a non-zero probability on this cut locus
induces some discontinuities in the first derivative of the variance. This corresponds to something
like a Dirac measure on the second order derivative, which is an additional difficulty for computing the Hessian matrix of the variance on these manifolds.
Consider for instance a random angle θ on the unit circle with density p(θ) = cos(θ)²/π, for which a direct computation gives σ²(α) = π²/3 + cos(2α)/2. Its derivative is rather easy to compute: grad σ²(α) = −2 cos(α) sin(α), and the second order derivative is H(α) = 4 sin(α)² − 2. Solving for grad σ²(α) = 0, we get four critical points: 0, ±π/2 and π. Since H is negative at 0 and π and positive at ±π/2, there are two relative (and here absolute) minima: E[θ] = {−π/2, π/2}.
Let us use now the general framework developed on Riemannian manifolds. According to
theorem 2, the gradient of the variance is
    grad σ²(α) = −2 E[→αθ] = −2 ∫_{α−π}^{α+π} →αθ p(θ) dθ = −2 ∫_{α−π}^{α+π} (θ − α) (cos(θ)²/π) dθ = −2 cos(α) sin(α),
which is in accordance with our previous computations. Now, differentiating once again under the sum, we get:
    ∫_{α−π}^{α+π} (∂² dist(α, θ)² / ∂α²) p(θ) dθ = −2 ∫_{α−π}^{α+π} (∂ →αθ / ∂α) p(θ) dθ = 2 ∫_{α−π}^{α+π} p(θ) dθ = 2,
which is clearly different from our direct calculation. One way to see the problem is the following: the vector field →αθ is continuous and differentiable on the circle except at the cut locus of α (i.e. at θ = α ± π), where it has a jump of 2π. Thus, the second order derivative of the squared distance should be −2(−1 + 2π δ_(α±π)(θ)), where δ is the Dirac distribution, and the integral becomes:
    H(α) = −2 ∫_{α−π}^{α+π} (−1 + 2π δ_(α±π)(θ)) p(θ) dθ = 2 − 4π p(α ± π) = 2 − 4 cos(α)² = 4 sin(α)² − 2
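As a sanity check, a short numerical sketch (our own illustration, using the density p(θ) = cos(θ)²/π above) confirms that a finite-difference Hessian of σ² matches 4 sin(α)² − 2 rather than the naive value 2:

```python
import numpy as np

thetas, dt = np.linspace(-np.pi, np.pi, 400001, retstep=True)
p = np.cos(thetas) ** 2 / np.pi                    # density on the unit circle

def sigma2(a):
    # Variance sigma^2(a) by quadrature, with the geodesic distance on S^1.
    d = np.abs((thetas - a + np.pi) % (2 * np.pi) - np.pi)
    return np.sum(d ** 2 * p) * dt

a, h = 0.3, 1e-3
hess_fd = (sigma2(a + h) - 2 * sigma2(a) + sigma2(a - h)) / h ** 2
print(hess_fd, 4 * np.sin(a) ** 2 - 2)             # both approx -1.65
```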
For this minimization problem, a Gauss-Newton gradient descent algorithm seems to be perfectly adapted. In this section, we assume that the conditions of theorem 2 are fulfilled.
Let y be an estimation of the mean of the random point x and f (y) = σx2 (y) the variance.
A practical gradient descent algorithm is to minimize the second order approximation of the cost
function at the current point. According to the Taylor expansion of equation (2), the second order
approximation of f at y is:
    f(exp_y(v)) ≈ f(y) + grad f(v) + (1/2) Hess f(v, v)
This is a function of the vector v ∈ Ty M. Assuming that Hess f is positive definite, this function
is convex and thus has a minimum characterized by a null gradient. Let H_f(v) denote the linear form verifying H_f(v)(w) = Hess f(v, w) for all w, and H_f^(-1) denote the inverse map. The minimum is characterized by v = −H_f^(-1)(grad f). For the variance, the gradient is −2 E[→yx] and the Hessian can be approximated by 2 Id, so that the iteration and its discrete (empirical) version are:
    y_{t+1} = exp_{y_t}( E[→y_t x] )    and    y_{t+1} = exp_{y_t}( (1/n) Σ_i →y_t x_i )
We note that in the case of a vector space, these two formulas simplify to y_{t+1} = E[x] and y_{t+1} = (1/n) Σ_i x_i, which are the definitions of the mean value and the barycenter. Moreover, the
algorithm converges in a single step.
An important point for this algorithm is to determine a good starting point. In the case of a set of observations {x_i}, one can choose at random one of the observations as the starting point. Another solution is to compute for each point x_i its mean distance to the other points (or the median distance to be robust) and choose as the starting point the minimizing point. From a computer science point of view, the complexity is k² (where k is the number of observations), but the method can be randomized efficiently [57, 58].
To verify the uniqueness of the solution, we can repeat the algorithm from several starting
points (for instance all the observations xi ). If we know the Riemannian curvature of the manifold
(for instance if it is constant or if there is an upper bound κ), we can use theorem (1, Section 4.2).
We just have to verify that the maximum distance between the observations and the mean value
we have found is sufficiently small so that all observations fit into a regular geodesic ball, i.e.:
    r = max_i dist(x̄, x_i) < π/(2√κ)
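A minimal sketch of this Gauss-Newton iteration (our own code, reusing the hypothetical sphere_exp and sphere_log maps of Section 2.2) alternates the computation of the barycenter in the current exponential chart and the re-centering of the chart:

```python
import numpy as np

def karcher_mean(xs, exp_map, log_map, tol=1e-10, max_iter=100):
    """Gauss-Newton iteration y_{t+1} = exp_y(mean_i log_y(x_i))."""
    y = xs[0].copy()                   # start from one of the observations
    for _ in range(max_iter):
        v = np.mean([log_map(y, x) for x in xs], axis=0)  # barycenter in the chart
        y = exp_map(y, v)              # re-center the chart at the new estimate
        if np.linalg.norm(v) < tol:    # null gradient: critical point reached
            break
    return y

# Example: mean of three unit vectors on S^2 (expects (1,1,1)/sqrt(3) by symmetry).
pts = [np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.array([0, 0, 1.0])]
# mean = karcher_mean(pts, sphere_exp, sphere_log)
```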
5 Covariance matrix
With the mean value, we have a dispersion value: the variance. To go one step further, we observe
that the covariance matrix of a random vector x with respect to a point y is the directional dispersion
of the “difference” vector →yx = x − y:
    Cov_x(y) = E[→yx →yxᵀ] = ∫_{Rn} (→yx)(→yx)ᵀ p_x(x) dx
This definition is readily extendible to a complete Riemannian manifold using the random vector →yx in T_yM and the Riemannian measure. In fact, we are usually interested in the covariance relative to the mean value:
Definition 6 (Covariance)
Let x be a random point and x̄ ∈ E [ x ] a mean value that we assume to be unique to simplify the
notations (otherwise we have to keep a reference to the mean value). We define the covariance Σxx
by the expression:
    Σ_xx = Cov_x(x̄) = E[→x̄x →x̄xᵀ] = ∫_{D(x̄)} (→x̄x)(→x̄x)ᵀ p_x(x) dM(x)
The empirical covariance is defined in the same way using the discrete version of the expectation
operator.
We observe that the covariance depends on the basis used for the exponential chart if we see it
as a matrix, but it does not depend on it if we consider it as a bilinear form over the tangent plane.
The covariance is related to the variance just as in the vector case:
    Tr(Σ_xx) = E[Tr(→x̄x →x̄xᵀ)] = E[dist(x̄, x)²] = σ_x²
This formula is still valid relative to any fixed point: Tr(Cov_x(y)) = σ_x²(y).
Figure 2: The covariance is defined in the tangent plane of S2 at the mean point as the classical covariance matrix of the random vector “deviation from the mean”: Σ_xx = E[→x̄x →x̄xᵀ].
In fact, as soon as we have found a (or the) mean value and the probability of its cut locus is null, everything appears to be similar to the case of a centered random vector by developing the manifold onto the tangent space at the mean value. Indeed, →x̄x is a random vector of pdf ρ_x(y) = p_x(y) √|G(y)| with respect to the Lebesgue measure in the connected and star-shaped domain D(x̄) ⊂ T_x̄M. We know that its expectation is E[→x̄x] = 0 and its covariance matrix is defined as usual. Thus, we could define higher order moments of the distribution by tensors on this tangent space, just as we have done for the covariance.
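Concretely, the empirical covariance can be computed in the exponential chart at the mean with a few lines (our own sketch, reusing a log map such as the hypothetical sphere_log above):

```python
import numpy as np

def empirical_covariance(xbar, xs, log_map):
    """Sigma = (1/n) sum_i log_xbar(x_i) log_xbar(x_i)^T in the chart at the mean."""
    vs = np.array([log_map(xbar, x) for x in xs])
    return vs.T @ vs / len(vs)
```

Its trace recovers the empirical variance, in accordance with the formula above.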
This definition is coherent with the measure inherited from our Riemannian metric, since the pdf p_U that maximizes the entropy when we only know that the measure is in a compact set U is the uniform density in this set:
    p_U(x) = 1_U(x) / ∫_U dM(y)
The normal law is then defined as the distribution that maximizes the entropy H[x | x̄ ∈ E[x], Σ_xx = Σ] knowing the mean and the covariance. Neglecting the cut locus constraints, one shows that it takes the form N(y) = k exp( −→x̄yᵀ Γ →x̄y / 2 ),
where the normalization constant and the covariance are related to the concentration matrix by:
    k^(-1) = ∫_M exp( −→x̄yᵀ Γ →x̄y / 2 ) dM(y)    and    Σ = k ∫_M →x̄y →x̄yᵀ exp( −→x̄yᵀ Γ →x̄y / 2 ) dM(y)
From the concentration matrix, we can compute the covariance of the random point, at least
numerically, but the reverse is more difficult.
It is interesting to have a look at the limit properties: if the circle radius goes to infinity, the circle becomes the real line and we obtain the usual Gaussian with the relation σ² = 1/γ, as expected. Now, let us consider a circle with a fixed radius. As anticipated, if the concentration γ goes to infinity, the variance goes to zero and the density tends toward a point mass distribution (see figure 3). On the other hand, the variance cannot become infinite (as in the real case) when the concentration parameter γ goes to zero, because the circle is compact: a Taylor expansion gives σ² = a²/3 + O(γ), where 2a is the length of the circle. Thus, the maximal variance on the circle is
    σ₀² = lim_{γ→0} σ² = a²/3,    with the density    N_(0,0)(x) = 1/(2a)
Interestingly, the normal density of concentration 0 is the uniform density. In fact, this result can be generalized to all compact manifolds:
Figure 3: Variance σ² with respect to the concentration parameter γ on the circle of radius 1 and on the real line. This variance tends toward σ₀² = π²/3 for the uniform distribution on the circle (γ = 0), whereas it tends to infinity for the uniform measure on R. For a strong concentration (γ > 1), the variance on the circle can be accurately approximated by σ² ≃ 1/γ, as in the real case.
It follows immediately that the limit of the normal distribution is the uniform one, and that the
limit covariance is finite.
Theorem 4 Let M be a compact manifold. The limit of the normal distribution for small concentration matrices is the uniform density: N(y) = 1/Vol(M) + O(Tr(Γ)). Moreover, the covariance matrix tends towards a finite value:
    Σ = (1/Vol(M)) ∫_M →x̄y →x̄yᵀ dM + O(Tr(Γ)) < +∞
The Taylor expansion of the metric is given in [63, p. 84]. We easily deduce the Taylor expansion of the measure around the origin (Ric is the Ricci (or scalar) curvature matrix in the considered normal coordinate system):
    dM(y) = √(det G(y)) dy = ( 1 − yᵀ Ric y / 6 + O(‖y‖³) ) dy
Substituting this Taylor expansion in the integrals and manipulating the formulas (see appendix
B) leads to the following theorem.
    k = (1 + O(σ³) + ε(σ/r)) / √((2π)ⁿ det(Σ))    and    Γ = Σ^(-1) − (1/3) Ric + O(σ) + ε(σ/r)
Here, ε(x) is a function that is an O(x^k) for any positive k (with the convention that ε(σ/r) = ε(0) = 0 for r = +∞). More precisely, this is a function such that ∀k ∈ R⁺, lim_{x→0⁺} x^(−k) ε(x) = 0.
6.7 Discussion
The maximum entropy approach to generalize the normal distribution to Riemannian manifolds is
interesting since we obtain a whole family of densities going from the Dirac measure to the uniform
distribution (or the uniform measure if the manifold is only locally compact). Unfortunately, this distribution is generally not differentiable at the cut locus, and often not even continuous.
However, even if the relation between the parameters and the moments of the distribution is not as simple as in the vector case (but can we expect something simpler in the general case of Riemannian manifolds?), the approximation for small covariances turns out to be rather simple. Thus, this approximate distribution can be handled quite easily for computational purposes. It is likely that similar approximations hold for wrapped Gaussians, but this remains to be established.
of error when saying that the observation does not come from the distribution. The distribution of x is usually assumed to be Gaussian, as this distribution minimizes the added information (i.e. maximizes the entropy) when we only know the mean and the covariance. In that case, we know that the Mahalanobis distance should be χ²_n distributed if the observation is correct (n is the dimension of the vector space). If the probability of the current distance is too small (i.e. μ² is too large), the observation x̂ can safely be considered as an outlier.
The definition of the Mahalanobis distance can be easily generalized to complete Riemannian
manifolds with our tools. We note that it is well defined for any distribution of the random point
and not only the normal one.
Definition 7 (Mahalanobis distance)
We call Mahalanobis distance between a random point x ∼ (x̄, Σ_xx) and a (deterministic) point y on the manifold the value:
    μ_x²(y) = →x̄yᵀ Σ_xx^(-1) →x̄y
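In a chart at the mean, this is a two-line computation (our own sketch, with a generic log map):

```python
import numpy as np

def mahalanobis_sq(xbar, Sigma, y, log_map):
    """Squared Mahalanobis distance mu^2(y) = log_xbar(y)^T Sigma^{-1} log_xbar(y)."""
    v = log_map(xbar, y)
    return float(v @ np.linalg.solve(Sigma, v))
```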
7.1 Properties
Since μ_x² is a function from M to R, μ_x²(y) is a real random variable for a random point y. The expectation of this random variable is well defined and turns out to be quite simple:
    E[μ_x²(y)] = ∫_M μ_x²(z) p_y(z) dM(z) = ∫_M →x̄zᵀ Σ_xx^(-1) →x̄z p_y(z) dM(z)
               = Tr( Σ_xx^(-1) ∫_M →x̄z →x̄zᵀ p_y(z) dM(z) ) = Tr( Σ_xx^(-1) Cov_y(x̄) )
The expectation of the Mahalanobis distance of a random point with itself is even simpler:
    E[μ_x²(x)] = Tr(Σ_xx^(-1) Σ_xx) = Tr(Id_n) = n
Theorem 6 The expected Mahalanobis distance of a random point with itself is independent of the distribution and depends only on the dimension of the manifold: E[μ_x²(x)] = n.
This identity can be used to verify with a posteriori observations that the covariance matrix has been correctly estimated. It can be compared with the expectation of the “normalized” squared distance, which is by definition: E[dist(x, x̄)²/σ_x²] = 1.
The χ² probability can be computed using the incomplete gamma function: Pr{χ² ≤ α²} = P(n/2, α²/2) (see for instance [64]).
In practice, one often uses this law to test if an observation x̂ has been drawn from a random point x that we assume to be Gaussian: if the hypothesis is true, the value μ_x²(x̂) will be less than α² with probability γ = Pr{χ² ≤ α²}. Thus, one chooses a confidence level γ (for instance 95% or 99%), then finds the value α(γ) such that γ = Pr{χ² ≤ α²}, and accepts the hypothesis if μ_x²(x̂) ≤ α².
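A sketch of this test (our own code, using the χ² quantile from scipy):

```python
import numpy as np
from scipy.stats import chi2

def is_inlier(xbar, Sigma, x_hat, log_map, confidence=0.95):
    """Accept x_hat as a draw from the generalized Gaussian (xbar, Sigma)
    if mu^2(x_hat) <= alpha^2, where Pr{chi2_n <= alpha^2} = confidence."""
    v = log_map(xbar, x_hat)
    mu2 = float(v @ np.linalg.solve(Sigma, v))
    alpha2 = chi2.ppf(confidence, df=len(v))  # n = dimension of the manifold
    return mu2 <= alpha2
```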
8 Discussion
On a (geodesically complete) Riemannian manifold, it is easy to define probability density functions
associated to random points, thanks to the availability of a metric. However, as soon as the
expectation is concerned, we may only define the expectation of an observable (a real or vectorial
function of the random point). Thus, the definition of a mean value for a random point is much
more complex than for the vectorial case and it requires a distance-based variational formulation:
the Riemannian center of mass basically minimizes the variance locally. As the mean is now defined through a minimization procedure, its existence and uniqueness are not ensured any more (except for distributions with a sufficiently small compact support). In practice, one mean value almost always exists, and it is unique as soon as the distribution is sufficiently peaked. The properties of the mean are very similar to those of the modes (that can be defined as central values of order 0) in the vectorial case. We present here a new proof of the barycentric characterization theorem that is valid for distributions with any kind of support. The main difference with the vector case is
that we have to ensure a null probability measure of the cut locus. To compute the mean value,
we designed an original Gauss-Newton gradient descent algorithm that essentially alternates the
computation of the barycenter in the exponential chart centered at the current estimation, and a
re-centering step of the chart at the newly computed barycenter.
To define higher moments of the distribution, we used the exponential chart at the mean point
(which may be seen as the development of the manifold onto its tangent space at this point along
the geodesics): the random point is thus represented as a random vector with null mean in a
star-shaped and symmetric domain. With this representation, there is no more problem to define
the covariance matrix and potentially higher order moments. Based on this covariance matrix,
we defined a Mahalanobis distance between a random and a deterministic point that basically
weights the distance between the deterministic point and the mean point using the inverse of the
covariance matrix. Interestingly, the expected Mahalanobis distance of a random point with itself
is independent of the distribution and is equal to the dimension of the manifold, as in the vectorial
case.
Like for the mean, we choose a variational approach to generalize the Normal law: we define
it as the maximum entropy distribution knowing the mean and the covariance. Neglecting the
cut-locus constraints, we show that this amounts to considering a truncated Gaussian distribution on
the exponential chart centered at the mean point. However, the relation between the concentration
matrix (the “metric” used in the exponential of the pdf) and the covariance matrix is slightly more
complex than the simple inversion of the vectorial case, as it has to be corrected for the curvature
of the manifold.
Last but not least, using the Mahalanobis distance of a Normally distributed random point, we can generalize the χ² law: we were able to show that it has the same density as in the vectorial case up to an order 3 in σ. This opens the way to the generalization of many other statistical tests,
as we may expect similarly simple approximations for sufficiently centered distributions.
In this paper, we focused on purpose on the theoretical formulation of the statistical framework
in geodesically complete Riemannian spaces, eluding application examples. The interested reader
will find practical applications in computer vision to compute the mean rotation [32, 33] or for
the generalization of matching algorithms to arbitrary geometric features [65]. In medical image
analysis, selected applications cover the validation of the rigid registration accuracy [4, 66, 67],
shape statistics [7] and more recently tensor computing, either for processing and analyzing diffusion
tensor images [10, 8, 9, 11], or to model the brain variability [12]. One can even find applications
in rock mechanics with the analysis of fracture geometry [68].
In the theory presented here, all definitions are derived from the Riemannian metric of the manifold. A natural question is how to choose this metric. Invariance requirements provide a partial answer for connected Lie groups and homogeneous manifolds [1, 45]. However, an invariant metric does not always exist on a homogeneous manifold. Likewise, left- and right-invariant metrics are generally different in non-compact Lie groups, so that we only have a partial consistency between the geometric and statistical operations. Another way to choose the metric could be to estimate it from empirical data.
More generally, we could design other definitions of the mean value using the notion of connector
introduced in [29]. This connector formalizes a relationship between the manifold and its tangent
space at one point, exactly in the way we used the exponential map of the Riemannian metric. Thus,
we could generalize easily the higher order moments and the other statistical operations we defined
by replacing everywhere the exponential map with the connector. One important point is that
these connectors can model extrinsic distances (like the Euclidean distance on unit vectors), and
could lead to very efficient approximations of the mean value for sufficiently peaked distributions.
For instance, the “barycenter / re-centering” algorithm we designed will most probably converge toward a first order approximation of the Riemannian mean if we use any chart that is consistent with the exponential chart at the first order (e.g. Euler angles on re-centered 3D rotations). We believe
that this research track may become one of the most productive from the practical point of view.
Acknowledgments
The main part of this work was done at the MIT Artificial Intelligence Lab in 1997. Although all the ideas presented in this paper were in place at that time [69], the formula for the gradient of the variance (appendix A) remained a conjecture. The author would especially like to thank Prof. Maillot for providing a first proof of the differentiability of the variance for the uniform distribution on compact manifolds [53]. Its generalization to the non-compact case with arbitrary distributions considerably delayed the submission of this paper. In the meantime, other (and simpler) proofs were proposed [15, 38]. The author would also like to thank two anonymous referees for very valuable advice and for the detection of a number of technical errors.
Hypotheses
Let P be a probability on the Riemannian manifold M. We assume that the cut locus C(y) of
the derivation point y ∈ M has a null measure with respect to this probability (as it has with the
Riemannian measure) and that the variance is finite at that point:
    P(C(y)) = ∫_{C(y)} dP(z) = 0    and    σ²(y) = ∫_M dist(y, z)² dP(z) < ∞
Establishing the gradient formula corresponds to a derivation under the sum, but the usual conditions of the Lebesgue theorem are not fulfilled: the zero measure set C(y) varies with y. Thus, we have to come back to the original definition of the gradient.
    ∀w ∈ T_yM:   ⟨(grad σ²)(y) | w⟩ = ∂_w σ²(y) = lim_{t→0} ( σ²(γ(t)) − σ²(y) ) / t
Since tangent vectors are defined as equivalence classes, we can choose the geodesic curve γ(t) = exp_y(t w). Using v = t w, the above condition can then be rewritten:
    σ²(exp_y(v)) = σ²(y) + ⟨(grad σ²)(y) | v⟩ + o(‖v‖),
which can be rephrased as: for all η ∈ R⁺, there exists ε sufficiently small such that:
    ‖v‖ ≤ ε   ⟹   | σ²(exp_y(v)) − σ²(y) − ⟨(grad σ²)(y) | v⟩ | ≤ η ‖v‖
General idea
Let ∆(z, v) be the integrated function (for z ∉ C(y)):
    ∆(z, v) = dist(exp_y(v), z)² − dist(y, z)² + 2 ⟨→yz | v⟩
and H(v) = ∫_{M∖C(y)} ∆(z, v) dP(z) be the function to bound.
The idea is to split this integral in two in order to bound ∆ on a small neighborhood W around
the cut locus of y and to use the standard Lebesgue theorem to bound the integral of ∆ on M\W .
Lemma 1 W_ε = ∪_{x∈B(y,ε)} C(x) is a continuous family of nested open sets, decreasing with ε, all containing C(y) and converging towards it as ε goes to zero.
Lemma 2 Let y ∈ M and α > 0. Then there exists ε > 0 and an open neighborhood W_ε of C(y) such that:
(i) for all x ∈ B(y, ε), C(x) ⊂ W_ε;
(ii) P(W_ε) = ∫_{W_ε} dP(z) < α;
(iii) ∫_{W_ε} dist(y, z) dP(z) < α.
By hypothesis, the cut locus C(y) has a null measure for the measure dP(z). The distance being a measurable function, its integral over the cut locus is null: ∫_{C(y)} dist(y, z) dP(z) = 0. Thus, the functions ε ↦ ∫_{W_ε} dP(z) and ε ↦ ∫_{W_ε} dist(y, z) dP(z) converge toward zero as ε goes to zero. By continuity, we can make both terms smaller than any positive α by choosing ε sufficiently small.
Bounding ∆ on Wε
Let W_ε, ε and α verify the conditions of lemma 2, and let x, x′ ∈ B(y, ε). We have dist(z, x) ≤ dist(z, x′) + dist(x′, x). Using the symmetry of x and x′ and the inequalities dist(x, x′) ≤ 2ε and dist(z, x′) ≤ dist(z, y) + ε, we obtain:
    | dist(z, x)² − dist(z, x′)² | ≤ 2 dist(x, x′) ( 2ε + dist(z, y) )
Applying this bound to x = exp_y(v) and x′ = y, and bounding the last term of ∆(z, v) by |⟨→yz | v⟩| ≤ dist(y, z) ‖v‖, we get |∆(z, v)| ≤ 4 (ε + dist(z, y)) ‖v‖, and its integral over W_ε is bounded by:
    ∫_{W_ε} |∆(z, v)| dP(z) ≤ 4 ‖v‖ ∫_{W_ε} ( ε + dist(z, y) ) dP(z) < 8 α ‖v‖
Bounding ∆ on M∖Wε
Let x = exp_y(v) ∈ B(y, ε). We know from lemma 2 that the cut locus C(x) of such a point is contained in W_ε. Thus, the integration domain M∖W_ε is now independent of y and we can use the usual Lebesgue theorem to differentiate under the sum:
    grad( ∫_{M∖W_ε} dist(y, z)² dP(z) ) = −2 ∫_{M∖W_ε} →yz dP(z)
Conclusion
Thus, for ‖v‖ small enough, we have |∫_M ∆(z, v) dP(z)| < 9 α ‖v‖, which means that the variance has a derivative at the point y:
    (grad σ²)(y) = −2 ∫_{M∖C(y)} →yz dP(z)
We now want to invert these relations in order to obtain a Taylor expansion of the concentration matrix and the normalization coefficient k with respect to the variance. This can be realized thanks to the Taylor expansion of the Riemannian metric around the origin in a normal coordinate system [63, section 2.8, corollary 2.3, p.84]: det(g_ij(exp v)) = 1 − Ric(v, v)/3 + O(‖v‖³), where Ric is the expression of the Ricci tensor of scalar curvatures in the exponential chart. Thus [42, p.144]:
    dM(y) = √(det G(y)) dy = ( 1 − yᵀ Ric y / 6 + R_dM(y) ) dy    with    lim_{y→0} R_dM(y)/‖y‖³ = 0
Substituting this expansion into the normalization condition of the density gives:
    k^(-1) = ∫_D exp( −yᵀ Γ y / 2 ) ( 1 − yᵀ Ric y / 6 + R_dM(y) ) dy
where D is the definition domain of the exponential chart. Since the concentration matrix Γ is a
symmetric positive definite matrix, we have a unique symmetric positive definite square root Γ^{1/2}, which is obtained by changing the eigenvalues of Γ into their square roots in a diagonalization. Using the change of variable z = Γ^{1/2} y in the first two terms and the matrix T = Γ^{−1/2} Ric Γ^{−1/2} to simplify the notations, we have:
    k^(-1) = det(Γ)^{−1/2} ∫_{D′} exp( −‖z‖²/2 ) ( 1 − zᵀ T z / 6 ) dz + ∫_D exp( −yᵀ Γ y / 2 ) R_dM(y) dy
where the new definition domain is D′ = Γ^{1/2} D. Likewise, we have for the covariance:
    k^(-1) Σ = det(Γ)^{−1/2} ∫_{D′} Γ^{−1/2} z zᵀ Γ^{−1/2} exp( −‖z‖²/2 ) ( 1 − zᵀ T z / 6 ) dz + ∫_D y yᵀ exp( −yᵀ Γ y / 2 ) R_dM(y) dy
Assuming first that there is no cut locus, so that the domains D and D′ are the whole space Rn, a classical Gaussian computation gives:
    ∫_{Rn} zᵀ T z exp( −‖z‖²/2 ) dz = Tr( T ∫_{Rn} z zᵀ exp( −‖z‖²/2 ) dz ) = (2π)^{n/2} Tr(T)
so that:
    k^(-1) = ( (2π)^{n/2}/√(det Γ) ) ( 1 − Tr(Γ^(-1) Ric)/6 ) + ∫_{Rn} exp( −yᵀ Γ y / 2 ) R_dM(y) dy
Unfortunately, the last term is not so easy to handle, as we only know that the remainder R_dM(y) behaves as a O(‖y‖³) for small values of y. The idea is to split this integral into one part around zero, say for ‖y‖ < α, where we know how to bound the remainder, and to show that the Gaussian kernel dominates the remainder elsewhere. Let γ_m be the smallest eigenvalue of Γ. As yᵀ Γ y ≥ γ_m ‖y‖², we have:
    | ∫_{Rn} exp( −yᵀ Γ y / 2 ) R_dM(y) dy | ≤ ∫_{Rn} exp( −γ_m ‖y‖² / 2 ) |R_dM(y)| dy
For any η > 0, there exists a constant α > 0 such that ‖y‖ < α implies |R_dM(y)| < η ‖y‖³. Thus:
    ∫_{‖y‖<α} exp( −γ_m ‖y‖²/2 ) |R_dM(y)| dy < η ∫_{‖y‖<α} exp( −γ_m ‖y‖²/2 ) ‖y‖³ dy
        < η ∫_{Rn} exp( −γ_m ‖y‖²/2 ) ‖y‖³ dy = 2 η Vol(S^{n−1}) / γ_m²
Thus, we know that this part of the integral behaves as a O(γ_m^{−2}). For the other part, as exp is a monotonous function, one has exp(−γ ‖y‖²) < exp(−‖y‖²)/γ² for ‖y‖² > 2 log(γ)/(γ − 1). Moreover, as the limit of log(γ)/(γ − 1) is zero at γ = +∞, this threshold can be made smaller than α/2 provided that γ is large enough. Thus, we have (for γ_m large enough):
    ∫_{‖y‖≥α} exp( −γ_m ‖y‖²/2 ) |R_dM(y)| dy < (1/γ_m²) ∫_{‖y‖≥α} exp( −‖y‖²/2 ) |R_dM(y)| dy
        < (1/γ_m²) ∫_{Rn} exp( −‖y‖²/2 ) |R_dM(y)| dy
The last integral is a constant which is finite, since we assumed that k^(-1) was finite for all Γ. Thus, this part of the integral also behaves as a O(γ_m^{−2}) when γ_m goes to infinity. We have obtained:
    ∫_{Rn} exp( −yᵀ Γ y / 2 ) R_dM(y) dy = O(γ_m^{−2})        (4)
so that
    k^(-1) = ( (2π)^{n/2}/√(det Γ) ) ( 1 − Tr(Γ^(-1) Ric)/6 ) + O(γ_m^{−2})
As Tr(Γ^(-1) Ric) is a term in γ_m^{−1} and √(det Γ) a term in γ_m^{1/2}, we finally have:
    k = ( √(det Γ)/(2π)^{n/2} ) ( 1 + Tr(Γ^(-1) Ric)/6 ) + O(γ_m^{−3/2})        (5)
The covariance matrix: The principle is the same as for the normalization coefficient, with
slightly more complex integrals. We basically have to compute:
    k^(-1) Σ = det(Γ)^{−1/2} Γ^{−1/2} ( I − J/6 ) Γ^{−1/2} + ∫_D y yᵀ exp( −yᵀ Γ y / 2 ) R_dM(y) dy
with
    I = ∫_{Rn} z zᵀ exp( −‖z‖²/2 ) dz    and    J = ∫_{Rn} Tr(T z zᵀ) z zᵀ exp( −‖z‖²/2 ) dz
These are classical calculations in multivariate statistics: for the first integral, the off-diagonal elements I_ij (with i ≠ j) vanish because we integrate antisymmetric functions:
    ∫_{Rn} z_i z_j exp( −‖z‖²/2 ) dz = 0    for i ≠ j
The diagonal elements are:
    I_ii = ∫_{Rn} z_i² exp( −‖z‖²/2 ) dz = ( ∫_R z_i² exp( −z_i²/2 ) dz_i ) ∏_{j≠i} ∫_R exp( −z_j²/2 ) dz_j = (2π)^{n/2}
so that we have I = (2π)^{n/2} Id. Since Tr(T z zᵀ) = Σ_{k,l} T_kl z_k z_l, the elements of the second integral are:
    J_ij = Σ_{k,l} T_kl ∫_{Rn} z_k z_l z_i z_j exp( −‖z‖²/2 ) dz
Let us investigate first the off-diagonal elements (i ≠ j): whenever {k, l} ≠ {i, j}, we integrate an antisymmetric function and the result is zero. Since the Ricci curvature matrix is symmetric, the matrix T is also symmetric and the sum reduces to a single term:
    J_ij = (T_ij + T_ji) ∫_{Rn} z_i² z_j² exp( −‖z‖²/2 ) dz = 2 T_ij (2π)^{n/2}    for i ≠ j
The diagonal terms are J_ii = Σ_{k,l} T_kl ∫_{Rn} z_i² z_k z_l exp( −‖z‖²/2 ) dz. For k ≠ l, we integrate antisymmetric functions and the result is zero. Using ∫_{Rn} z_i⁴ exp( −‖z‖²/2 ) dz = 3 (2π)^{n/2}, we are left with J_ii = (2π)^{n/2} ( Tr(T) + 2 T_ii ), so that J = (2π)^{n/2} ( Tr(T) Id + 2 T ) and:
    k^(-1) Σ = ( (2π)^{n/2}/√(det Γ) ) Γ^{−1/2} ( Id − ( Tr(T) Id + 2 T )/6 ) Γ^{−1/2} + ∫_D y yᵀ exp( −yᵀ Γ y / 2 ) R_dM(y) dy
Like for the normalization constant, we now have to bound the integral of the remainder. The principle is the same. Let γ_m be the smallest eigenvalue of Γ. As yᵀ Γ y ≥ γ_m ‖y‖², we have:
    ‖ ∫_{Rn} y yᵀ exp( −yᵀ Γ y / 2 ) R_dM(y) dy ‖ ≤ ∫_{Rn} ‖y‖² exp( −γ_m ‖y‖²/2 ) |R_dM(y)| dy
For any η > 0, we can find a constant α > 0 such that ‖y‖ < α implies |R_dM(y)| < η ‖y‖³, i.e.:
    ∫_{‖y‖<α} ‖y‖² exp( −γ_m ‖y‖²/2 ) |R_dM(y)| dy < η ∫_{‖y‖<α} exp( −γ_m ‖y‖²/2 ) ‖y‖⁵ dy
        < η ∫_{Rn} exp( −γ_m ‖y‖²/2 ) ‖y‖⁵ dy = 8 η Vol(S^{n−1}) / γ_m³
Thus, we know that this part of the integral behaves as a $O(\gamma_m^{-3})$. For the other part, as $\exp$ is a monotonic function, one has $\exp(-\gamma \|y\|^2) < \exp(-\|y\|^2)/\gamma^3$ for $\|y\|^2 > 3\log(\gamma)/(\gamma - 1)$. Moreover, as the limit of $\log(\gamma)/(\gamma - 1)$ is zero at $\gamma = +\infty$, it can be made smaller than $\alpha/3$ provided that $\gamma$ is large enough. Thus, we have (for $\gamma_m$ large enough):
\[
\int_{\|y\|^2>\alpha} \|y\|^2 \exp\Bigl(-\gamma_m \frac{\|y\|^2}{2}\Bigr) |R_{dM}(y)|\, dy \;<\; \frac{1}{\gamma_m^3} \int_{\|y\|^2>\alpha} \|y\|^2 \exp\Bigl(-\frac{\|y\|^2}{2}\Bigr) |R_{dM}(y)|\, dy \;<\; \frac{1}{\gamma_m^3} \int_{\mathbb{R}^n} \|y\|^2 \exp\Bigl(-\frac{\|y\|^2}{2}\Bigr) |R_{dM}(y)|\, dy
\]
The last integral is a constant which is finite, since we assumed that $\Sigma$ was finite for all $\Gamma$. Thus, this part of the integral also behaves as a $O(\gamma_m^{-3})$ when $\gamma_m$ goes to infinity. As $1/\sqrt{\det(\Gamma)}$ is a term in $\gamma_m^{-1/2}$, we finally have:
\[
k^{-1}\,\Sigma = \frac{(2\pi)^{n/2}}{\sqrt{\det(\Gamma)}} \Bigl( \Gamma^{-1} - \frac{\mathrm{Tr}(\Gamma^{-1}\,\mathrm{Ric})}{6}\, \Gamma^{-1} - \frac{\Gamma^{-1}\,\mathrm{Ric}\; \Gamma^{-1}}{3} \Bigr) + O\bigl(\gamma_m^{-5/2}\bigr)
\]
Simplifying this expression using the previous Taylor expansion of $k$ (Eq. 5), we thus obtain the following relation between the covariance and the concentration matrices:
\[
\Sigma = \Gamma^{-1} - \frac{1}{3}\, \Gamma^{-1}\,\mathrm{Ric}\; \Gamma^{-1} + O\bigl(\gamma_m^{-5/2}\bigr) \tag{6}
\]
Inverting the variables of the Taylor expansions: This relation can be inverted to obtain the Taylor expansion of the concentration matrix with respect to the covariance. First, we shall note that, from the above equation, $O(\gamma_m^{-5/2}) = O(\sigma_{\max}^5)$, where $\sigma_{\max}$ is the square root of the largest eigenvalue of $\Sigma$. However, a global variable such as the variance $\sigma^2 = \mathrm{Tr}(\Sigma)$ is more appropriate. Since $\sigma_{\max}^2 \le \sum_i \sigma_i^2 = \sigma^2 \le n\, \sigma_{\max}^2$, we have $O(\sigma_{\max}) = O(\sigma)$. Then, one easily verifies that the Taylor expansion of $\Gamma^{-1}$ is $\Gamma^{-1} = \Sigma + \Sigma\,\mathrm{Ric}\;\Sigma/3 + O(\sigma^5)$, and finally:
\[
\Gamma = \Sigma^{-1} - \frac{1}{3}\,\mathrm{Ric} + O(\sigma) \tag{7}
\]
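A quick numerical sanity check of this inversion (a sketch of ours, with an arbitrary symmetric matrix standing in for Ric): composing the truncated expansions (6) and (7) should reproduce $\Sigma$ up to higher-order terms in $\sigma$:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 3
    A = rng.standard_normal((n, n))
    Ric = (A + A.T) / 2                      # arbitrary symmetric "curvature" matrix
    Sigma = 1e-2 * np.eye(n)                 # small covariance, sigma^2 = Tr(Sigma)

    Gamma = np.linalg.inv(Sigma) - Ric / 3   # Eq. 7, truncated
    Gi = np.linalg.inv(Gamma)
    Sigma_back = Gi - Gi @ Ric @ Gi / 3      # Eq. 6, truncated
    print(np.abs(Sigma_back - Sigma).max())  # much smaller than the entries of Sigma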
To express $k$ with respect to $\Sigma$ instead of $\Gamma$, we have to compute $\mathrm{Tr}(\Gamma^{-1}\,\mathrm{Ric})$ and $\sqrt{\det(\Gamma)}$:
\[
\mathrm{Tr}(\Gamma^{-1}\,\mathrm{Ric}) = \mathrm{Tr}\Bigl(\Sigma\,\mathrm{Ric} + \frac{1}{3}\,\Sigma\,\mathrm{Ric}\;\Sigma\,\mathrm{Ric} + O(\sigma^5)\Bigr) = \mathrm{Tr}(\Sigma\,\mathrm{Ric}) + O(\sigma^3)
\]
For the determinant, one verifies that if $A$ is a differentiable map from $\mathbb{R}$ to the matrix group, one has $\det(A)' = \det(A)\; \mathrm{Tr}(A'\, A^{-1})$, so that $\det(\mathrm{Id} + \eta\, B) = 1 + \eta\, \mathrm{Tr}(B) + O(\eta^2)$. Since $\Gamma\,\Sigma = \mathrm{Id} - \frac{1}{3}\,\mathrm{Ric}\;\Sigma + O(\sigma^3)$ and $\Sigma$ is a term in $O(\sigma^2)$, we have
\[
\det(\Gamma)\,\det(\Sigma) = \det\Bigl(\mathrm{Id} - \frac{1}{3}\,\mathrm{Ric}\;\Sigma + O(\sigma^3)\Bigr) = 1 - \frac{1}{3}\,\mathrm{Tr}(\Sigma\,\mathrm{Ric}) + O(\sigma^3)
\]
and thus
\[
\sqrt{\det(\Gamma)} = \frac{1}{\sqrt{\det(\Sigma)}} \Bigl(1 - \frac{1}{6}\,\mathrm{Tr}(\Sigma\,\mathrm{Ric}) + O(\sigma^3)\Bigr)
\]
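The first-order determinant expansion $\det(\mathrm{Id} + \eta B) = 1 + \eta\,\mathrm{Tr}(B) + O(\eta^2)$ used above can be checked numerically (a small sketch of ours with an arbitrary matrix $B$):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((3, 3))
    for eta in [1e-1, 1e-2, 1e-3]:
        exact = np.linalg.det(np.eye(3) + eta * B)
        first_order = 1 + eta * np.trace(B)
        print(eta, exact - first_order)      # the error shrinks like eta^2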
Substituting this expression for $\sqrt{\det(\Gamma)}$ in equation 5, we obtain:
\[
k = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \Bigl(1 - \frac{1}{6}\,\mathrm{Tr}(\Sigma\,\mathrm{Ric}) + O(\sigma^3)\Bigr) \Bigl(1 + \frac{1}{6}\,\mathrm{Tr}(\Sigma\,\mathrm{Ric}) + O(\sigma^3)\Bigr) = \frac{1 + O(\sigma^3)}{\sqrt{(2\pi)^n \det(\Sigma)}}
\]
Summary of the approximate normal density: In a manifold without a cut locus at the mean point, the normal density in a normal coordinate system at the mean value is:
\[
N(y) = k \exp\Bigl(-\frac{y^T\, \Gamma\, y}{2}\Bigr)
\]
The normalization constant and the concentration matrix are approximated by the following expressions for a covariance matrix $\Sigma$ of small variance $\sigma^2 = \mathrm{Tr}(\Sigma)$:
\[
k = \frac{1 + O(\sigma^3)}{\sqrt{(2\pi)^n \det(\Sigma)}} \qquad\text{and}\qquad \Gamma = \Sigma^{-1} - \frac{1}{3}\,\mathrm{Ric} + O(\sigma)
\]
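As an illustration, the summary translates directly into code. The following sketch (ours; dropping the $O(\sigma^3)$ and $O(\sigma)$ remainders is our approximation) computes the approximate parameters $(k, \Gamma)$ from a small covariance $\Sigma$ and the Ricci curvature matrix at the mean point, and evaluates the density in normal coordinates:

    import numpy as np

    def approx_normal_params(Sigma, Ric):
        """Approximate (k, Gamma) for N(y) = k exp(-y^T Gamma y / 2),
        dropping the O(sigma^3) and O(sigma) remainder terms."""
        n = Sigma.shape[0]
        Gamma = np.linalg.inv(Sigma) - Ric / 3
        k = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
        return k, Gamma

    def approx_normal_density(y, Sigma, Ric):
        k, Gamma = approx_normal_params(Sigma, Ric)
        return k * np.exp(-0.5 * y @ Gamma @ y)

    # Example: unit 2-sphere at the mean point (Ric = Id), small isotropic covariance.
    Sigma = 0.05 * np.eye(2)
    print(approx_normal_density(np.array([0.1, 0.0]), Sigma, np.eye(2)))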
Now, for the second integral, we have:
\[
\int_{-\sqrt{\gamma_m}\, r/\sqrt{n}}^{+\sqrt{\gamma_m}\, r/\sqrt{n}} z_i^2 \exp\Bigl(-\frac{z_i^2}{2}\Bigr) dz_i = \sqrt{2\pi} - \sqrt{2\pi}\; \mathrm{erfc}\Bigl(\frac{\sqrt{\gamma_m}\, r}{\sqrt{2n}}\Bigr) - \frac{2\,\sqrt{\gamma_m}\, r}{\sqrt{n}} \exp\Bigl(-\frac{\gamma_m\, r^2}{2n}\Bigr) = \sqrt{2\pi} + \varepsilon\bigl(\gamma_m^{-1/2}\, r^{-1}\bigr)
\]
We obtain thus:
\[
\int_{D'} z_i^2 \exp\Bigl(-\frac{\|z\|^2}{2}\Bigr) dz = (2\pi)^{n/2} + \varepsilon\bigl(\gamma_m^{-1/2}\, r^{-1}\bigr)
\]
In fact, it is easy to see that every integral we computed over $\mathbb{R}^n$ in the previous paragraphs has the same value over $D'$ up to a term whose absolute value is of the order of $\varepsilon(\gamma_m^{-1/2}\, r^{-1}) = \varepsilon(\sigma/r)$. Thus, we can directly generalize the previous results by replacing $O(\sigma^k)$ with $O(\sigma^k) + \varepsilon(\sigma/r)$.
Summary of the approximate normal density: In a manifold with injectivity radius $r$ at the mean point, the normal density in a normal coordinate system at this mean value is:
\[
N(y) = k \exp\Bigl(-\frac{y^T\, \Gamma\, y}{2}\Bigr)
\]
The normalization constant and the concentration matrix are approximated by the following expressions for a covariance matrix $\Sigma$ of small variance $\sigma^2 = \mathrm{Tr}(\Sigma)$:
\[
k = \frac{1 + O(\sigma^3) + \varepsilon(\sigma/r)}{\sqrt{(2\pi)^n \det(\Sigma)}} \qquad\text{and}\qquad \Gamma = \Sigma^{-1} - \frac{1}{3}\,\mathrm{Ric} + O(\sigma) + \varepsilon\Bigl(\frac{\sigma}{r}\Bigr)
\]
The last integral, involving the remainder of the metric, is obviously bounded by the integral over the whole tangent space $k \int_{\mathbb{R}^n} \exp(-y^T \Gamma\, y/2)\, |R_{dM}(y)|\, dy$, which we have shown to be a $O(\sigma^3)$ in appendix B.

Let $\Sigma^{1/2}$ be the positive symmetric square root of $\Sigma$. With the change of variable $y = \Sigma^{1/2} x$, this probability becomes:
\[
\Pr\{\chi^2 \le \alpha^2\} = k \sqrt{\det(\Sigma)} \int_{\|x\| \le \alpha} \exp\Bigl(-\frac{1}{2}\, x^T \Sigma^{1/2}\, \Gamma\, \Sigma^{1/2} x\Bigr) \Bigl(1 - \frac{1}{6}\, x^T \Sigma^{1/2}\,\mathrm{Ric}\;\Sigma^{1/2} x\Bigr) dx + O(\sigma^3)
\]
Using $S = \frac{1}{3}\, \Sigma^{1/2}\,\mathrm{Ric}\;\Sigma^{1/2}$ and the fact that $k \sqrt{\det(\Sigma)} = (2\pi)^{-n/2} (1 + O(\sigma^3))$ (from Eq. 8), we end up with
\[
\Pr\{\chi^2 \le \alpha^2\} = \frac{1 + O(\sigma^3)}{(2\pi)^{n/2}} \int_{\|x\| \le \alpha} \exp\Bigl(-\frac{1}{2}\, x^T \Sigma^{1/2}\, \Gamma\, \Sigma^{1/2} x\Bigr) \Bigl(1 - \frac{1}{2}\, x^T S\, x\Bigr) dx + O(\sigma^3)
\]
The goal is now to show that the above integrated term is basically $\exp(-\|x\|^2/2)$ plus a remainder which integrates into a $O(\sigma^3)$. In this process, we will have to compute the integrals:
\[
I_k(\alpha) = \int_{\|x\|<\alpha} \exp\Bigl(-\frac{\|x\|^2}{2}\Bigr) \|x\|^{2k}\, dx
\]
By changing into polar coordinates ($r = \|x\|$ is the radius and $u$ is the corresponding unit vector, so that $x = r\, u$), we have $dx = r^{n-1}\, dr\, du$ and thus:
\[
I_k(\alpha) = \Bigl(\int_{S_{n-1}} du\Bigr) \int_0^\alpha r^{2k+n-1} \exp(-r^2/2)\, dr = \frac{\pi^{n/2}}{\Gamma(n/2)} \int_0^{\alpha^2} t^{\,k-1+\frac{n}{2}} \exp\Bigl(-\frac{t}{2}\Bigr) dt
\]
with the change of variable $t = r^2$. In this formula, $\Gamma(x) = \int_0^{+\infty} t^{x-1} \exp(-t)\, dt$ is the Gamma function, which can be computed recursively from $\Gamma(x+1) = x\, \Gamma(x)$ with $\Gamma(1) = 1$ and $\Gamma(1/2) = \sqrt{\pi}$.
The value $I_0(\alpha)$ is in fact the standard cumulative probability function of the $\chi^2$ law (up to a normalization factor). For $\alpha = +\infty$, the remaining integral can be computed using the Gamma function:
\[
I_k(+\infty) = \frac{(2\pi)^{n/2}\; 2^k\; \Gamma(k + \frac{n}{2})}{\Gamma(\frac{n}{2})}
\]
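The closed form for $I_k(+\infty)$ is easy to validate numerically through its radial form (a sketch of ours):

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma as gamma_fn

    def Ik_closed(k, n):
        # I_k(+inf) = (2 pi)^{n/2} 2^k Gamma(k + n/2) / Gamma(n/2)
        return (2 * np.pi) ** (n / 2) * 2**k * gamma_fn(k + n / 2) / gamma_fn(n / 2)

    def Ik_radial(k, n):
        # Vol(S_{n-1}) * int_0^inf r^{2k+n-1} exp(-r^2/2) dr
        vol = 2 * np.pi ** (n / 2) / gamma_fn(n / 2)
        val, _ = quad(lambda r: r ** (2 * k + n - 1) * np.exp(-r**2 / 2), 0, np.inf)
        return vol * val

    for k, n in [(0, 2), (1, 3), (2, 5)]:
        print(k, n, Ik_closed(k, n), Ik_radial(k, n))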
Series expansion of the exponential: From Eq. 7, we have $\Sigma^{1/2}\, \Gamma\, \Sigma^{1/2} = \mathrm{Id} - S - R_\Gamma$, where the remainder $R_\Gamma$ behaves as a $O(\sigma^3)$. Thus, we have
\[
\exp\Bigl(-\frac{x^T \Sigma^{1/2}\, \Gamma\, \Sigma^{1/2} x}{2}\Bigr) = \exp\Bigl(-\frac{\|x\|^2}{2}\Bigr) \Bigl(1 + \frac{x^T S\, x}{2} + R_{\exp}\Bigr)
\]
Integral of the remainder $R_{\exp}$: $S = \frac{1}{3}\, \Sigma^{1/2}\,\mathrm{Ric}\;\Sigma^{1/2}$ is a term of the order $O(\sigma^2)$ and the remainder $R_\Gamma$ behaves as a $O(\sigma^3)$. Thus, we can find positive constants $C_1$ and $C_2$ such that, for $\sigma$ sufficiently small, we have $x^T R_\Gamma\, x \le C_1\, \sigma^3 \|x\|^2$ and $x^T (S + R_\Gamma)\, x \le C_2\, \sigma^2 \|x\|^2$. This means that we can bound the integral of the remainder by:
\[
\int_{\|x\|<\alpha} \exp\Bigl(-\frac{\|x\|^2}{2}\Bigr) |R_{\exp}(x)|\, dx \;\le\; C_1\, \sigma^3\, I_1(+\infty) + \sum_{k=2}^{+\infty} \frac{C_2^k\, \sigma^{2k}}{2^k\, k!}\, I_k(+\infty) \;\le\; C_1'\, \sigma^3 + \frac{(2\pi)^{n/2}}{\Gamma(\frac{n}{2})} \sum_{k=2}^{+\infty} a_k
\]
where $a_k = C_2^k\, \sigma^{2k}\, \Gamma(k + n/2)/k!$. To show that the series converges, let us investigate the ratio of successive terms:
\[
\frac{a_{k+1}}{a_k} = \frac{C_2\, \sigma^2}{k+1}\; \frac{\Gamma(k + 1 + n/2)}{\Gamma(k + n/2)} = C_2\, \sigma^2\; \frac{k + n/2}{k + 1}
\]
The limit for $k \to +\infty$ is obviously $C_2\, \sigma^2$, which can be made smaller than one for $\sigma$ sufficiently small. Thus, by d'Alembert's ratio test, the series converges. Moreover, the lowest-order term of the series is in $\sigma^4$ (for $k = 2$), which shows that the integral of the remainder is finally dominated by the first term, a $O(\sigma^3)$.
Putting everything together, the density of the $\chi^2$ law (the squared Mahalanobis distance) is thus, up to higher-order terms, the classical one with $n$ degrees of freedom:
\[
p_{\chi^2}(u) = \frac{1}{2\,\Gamma(\frac{n}{2})} \Bigl(\frac{u}{2}\Bigr)^{\frac{n}{2}-1} \exp\Bigl(-\frac{u}{2}\Bigr) + O(\sigma^3)
\]
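To this order, the squared Mahalanobis distance therefore follows the classical $\chi^2$ law with $n$ degrees of freedom; the sketch below (ours) checks the density above against the standard implementation in scipy:

    import numpy as np
    from scipy.stats import chi2
    from scipy.special import gamma as gamma_fn

    def p_chi2(u, n):
        # (1 / (2 Gamma(n/2))) (u/2)^{n/2 - 1} exp(-u/2): the classical chi2_n density
        return (u / 2) ** (n / 2 - 1) * np.exp(-u / 2) / (2 * gamma_fn(n / 2))

    n = 3
    u = np.linspace(0.5, 10.0, 5)
    print(p_chi2(u, n))
    print(chi2.pdf(u, df=n))  # identical values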
In a manifold with injectivity radius $r$ at the mean point, the same computation has to be done on the restricted domain:
\[
\Pr\{\chi^2 \le \alpha^2\} = (2\pi)^{-\frac{n}{2}} \int_{B(\alpha) \cap D'} \exp\Bigl(-\frac{\|x\|^2}{2}\Bigr) dx + O(\sigma^3) + \varepsilon\Bigl(\frac{\sigma}{r}\Bigr)
\]
As we have $B(\sqrt{\gamma_m}\, r) \subset D' \subset \mathbb{R}^n$, we can enclose the domain of interest as follows: $B(\min(\sqrt{\gamma_m}\, r, \alpha)) \subset B(\alpha) \cap D' \subset B(\alpha)$. For $\alpha \le \sqrt{\gamma_m}\, r$ there is no problem, but for $\alpha > \sqrt{\gamma_m}\, r$ we have:
\[
\Pr\{\chi^2 \le \alpha^2\} \ge (2\pi)^{-\frac{n}{2}} \int_{B(\sqrt{\gamma_m}\, r)} \exp\Bigl(-\frac{\|x\|^2}{2}\Bigr) dx + O(\sigma^3) + \varepsilon\Bigl(\frac{\sigma}{r}\Bigr)
\]
and we have already seen that this integral is $1 + \varepsilon(\sigma/r)$. As $\alpha > \sqrt{\gamma_m}\, r$, the same integral over $B(\alpha)$ is itself of the same order, and we obtain in all cases the same result as without a cut locus, where $O(\sigma^3)$ is replaced by $O(\sigma^3) + \varepsilon(\sigma/r)$.
References
[1] Xavier Pennec. L’incertitude dans les problèmes de reconnaissance et de recalage – Appli-
cations en imagerie médicale et biologie moléculaire. Thèse de sciences (PhD thesis), Ecole
Polytechnique, Palaiseau (France), December 1996.
[2] X. Pennec and N. Ayache. Uniform distribution, distance and expectation problems for ge-
ometric features processing. Journal of Mathematical Imaging and Vision, 9(1):49–67, July
1998.
[3] X. Pennec, N. Ayache, and J.-P. Thirion. Landmark-based registration using features identified
through differential geometry. In I. Bankman, editor, Handbook of Medical Imaging, chapter 31,
pages 499–513. Academic Press, September 2000.
[4] X. Pennec, C.R.G. Guttmann, and J.-P. Thirion. Feature-based registration of medical images:
Estimation and validation of the pose accuracy. In Proc. of First Int. Conf. on Medical Image
Computing and Computer-Assisted Intervention (MICCAI’98), volume 1496 of LNCS, pages
1107–1114, Cambridge, USA, October 1998. Springer Verlag.
[5] S. Granger, X. Pennec, and A. Roche. Rigid point-surface registration using an EM variant of
ICP for computer guided oral implantology. In W.J. Niessen and M.A. Viergever, editors, 4th
Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI’01),
volume 2208 of LNCS, pages 752–761, Utrecht, The Netherlands, October 2001.
[6] S. Granger and X. Pennec. Statistiques exactes et approchées sur les normales aléatoires.
Research report RR-4533, INRIA, 2002.
[7] P.T. Fletcher, S. Joshi, C. Lu, and S. Pizer. Gaussian distributions on Lie groups and their application to statistical shape analysis. In Proc. of Information Processing in Medical Imaging (IPMI'2003), pages 450–462, 2003.
[8] P.T. Fletcher and S.C. Joshi. Principal geodesic analysis on symmetric spaces: Statistics of
diffusion tensors. In Proc. of CVAMIA and MMBIA Workshops, Prague, Czech Republic, May
15, 2004, LNCS 3117, pages 87–98. Springer, 2004.
[9] Ch. Lenglet, M. Rousson, R. Deriche, and O. Faugeras. Statistics on multivariate normal
distributions: A geometric approach and its application to diffusion tensor MRI. Research
Report 5242, INRIA, 2004.
[11] Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor
computing. International Journal of Computer Vision, 65(1), October 2005.
[12] Pierre Fillard, Vincent Arsigny, Xavier Pennec, Paul Thompson, and Nicholas Ayache. Ex-
trapolation of sparse tensor fields: Application to the modeling of brain variability. In Gary
Christensen and Milan Sonka, editors, Proc. of Information Processing in Medical Imaging
2005 (IPMI'05), volume 3565 of LNCS, pages 27–38, Glenwood Springs, Colorado, USA, July
2005. Springer.
[13] X. Pennec and J.-P. Thirion. A framework for uncertainty and validation of 3D registra-
tion methods based on points and frames. Int. Journal of Computer Vision, 25(3):203–229,
December 1997.
[15] J.M. Oller and J.M. Corcuera. Intrinsic analysis of statistical estimation. Annals of Statistics,
23(5):1562–1581, 1995.
[16] C. Bingham. An antipodally symmetric distribution on the sphere. The Annals of Statistics,
2(6):1201–1225, 1974.
[17] P.E. Jupp and K.V. Mardia. A unified view of the theory of directional statistics, 1975-1988.
Int. Statistical Review, 57(3):261–294, 1989.
[18] J.T. Kent. The Art of Statistical Science, chapter 10: New Directions in Shape Analysis, pages 115–127. John Wiley & Sons, 1992. K.V. Mardia, ed.
[19] K.V. Mardia. Directional statistics and shape analysis. Journal of Applied Statistics, 26:949–957, 1999.
[20] D.G. Kendall. A survey of the statistical theory of shape (with discussion). Statist. Sci.,
4:87–120, 1989.
[21] I.L. Dryden and K.V. Mardia. Theoretical and distributional aspects of shape analysis. In Prob-
ability Measures on Groups, X (Oberwolfach, 1990), pages 95–116, New York, 1991. Plenum.
[22] H. Le and D.G. Kendall. The Riemannian structure of Euclidean shape space: a novel environment for statistics. Ann. Statist., 21:1225–1271, 1993.
[23] C.G. Small. The Statistical Theory of Shapes. Springer series in statistics. Springer, 1996.
[25] H. Karcher. Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math,
30:509–541, 1977.
[26] W.S. Kendall. Convexity and the hemisphere. Journ. London Math. Soc., 43(2):567–576, 1991.
[27] M. Emery and G. Mokobodzki. Sur le barycentre d'une probabilité dans une variété. In J. Azéma, P.A. Meyer, and M. Yor, editors, Séminaire de probabilités XXV, volume 1485 of Lect. Notes in Math., pages 220–233. Springer-Verlag, 1991.
[28] M. Arnaudon. Barycentres convexes et approximations des martingales continues dans les variétés. In J. Azéma, P.A. Meyer, and M. Yor, editors, Séminaire de probabilités XXIX, volume 1613 of Lect. Notes in Math., pages 70–85. Springer-Verlag, 1995.
[29] J. Picard. Barycentres et martingales sur une variété. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 30(4):647–702, 1994.
[30] R.W.R. Darling. Martingales on non-compact manifolds: maximal inequalities and prescribed
limits. Ann. Inst. H. Poincaré Proba. Statistics, 32(4):431–454, 1996.
[31] U. Grenander, M.I. Miller, and A. Srivastava. Hilbert-Schmidt lower bounds for estimators on matrix Lie groups for ATR. IEEE Trans. on PAMI, 20(8):790–802, 1998.
[32] X. Pennec. Computing the mean of geometric features - application to the mean rotation.
Research Report RR-3371, INRIA, March 1998.
[33] C. Gramkow. On averaging rotations. Int. Jour. Computer Vision, 42(1-2):7–16, April/May
2001.
[34] M. Moakher. Means and averaging in the group of rotations. SIAM J. of Matrix Anal. Appl.,
24(1):1–16, 2002.
[35] A. Edelman, T. Arias, and S.T. Smith. The geometry of algorithms with orthogonality con-
straints. SIAM Journal of Matrix Analysis and Applications, 20(2):303–353, 1998.
[36] H. Hendricks. A Cramer-Rao type lower bound for estimators with values in a manifold.
Journal of Multivariate Analysis, 38:245–261, 1991.
[38] R. Bhattacharya and V. Patrangenaru. Large sample theory of intrinsic and extrinsic sample
means on manifolds, I. Annals of Statistics, 31(1):1–29, 2003.
[39] M. Spivak. Differential Geometry, volume 1. Publish or Perish, Inc., 2nd edition, 1979.
[40] W. Klingenberg. Riemannian Geometry. Walter de Gruyter, Berlin, New York, 1982.
[41] M. do Carmo. Riemannian Geometry. Mathematics. Birkhäuser, Boston, Basel, Berlin, 1992.
[42] S. Gallot, D. Hulin, and J. Lafontaine. Riemannian Geometry. Springer Verlag, 2nd edition, 1993.
[44] M.G. Kendall and P.A.P. Moran. Geometrical probability. Number 10 in Griffin’s statistical
monographs and courses. Charles Griffin & Co. Ltd., 1963.
[45] Xavier Pennec. Probabilities and Statistics on Riemannian Manifolds: A Geometric approach.
Research Report 5093, INRIA, January 2004.
[46] M. Fréchet. L’intégrale abstraite d’une fonction abstraite d’une variable abstraite et son ap-
plication à la moyenne d’un élément aléatoire de nature quelconque. Revue Scientifique, pages
483–512, 1944.
[47] M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst.
H. Poincaré, 10:215–310, 1948.
[48] W.S. Kendall. Probability, convexity, and harmonic maps with small image I: uniqueness and
fine existence. Proc. London Math. Soc., 61(2):371–406, 1990.
[49] S. Doss. Sur la moyenne d’un élément aléatoire dans un espace distancié. Bull. Sc. Math.,
73:48–72, 1949.
[50] W. Herer. Espérance mathématique au sens de Doss d’une variable aléatoire à valeur dans un
espace métrique. C. R. Acad. Sc. Paris, Série I, t.302(3):131–134, 1986.
[51] W. Herer. Espérance mathématique d’une variable aléatoire à valeur dans un espace métrique
à courbure négative. C. R. Acad. Sc. Paris, Série I, t.306:681–684, 1988.
[53] H. Maillot. Différentielle de la variance et centrage de la plaque de coupure sur une variété riemannienne compacte. Personal communication, 1997.
[54] W.S. Kendall. The propeller: a counterexample to a conjectured criterion for the existence of
certain harmonic functions. Journal of the London Mathematical Society, 46:364–374, 1992.
[55] Ernst Hairer, Ch. Lubich, and Gerhard Wanner. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations, volume 31 of Springer series in computational mathematics. Springer, 2002.
[56] J.-P. Dedieu, G. Malajovich, and P. Priouret. Newton method on Riemannian manifolds:
Covariant alpha-theory. IMA Journal of Numerical Analysis, 23:395–419, 2003.
[58] P.J. Rousseeuw and A.M. Leroy. Robust Regression and Outlier Detection. Wiley series in prob. and math. stat. J. Wiley and Sons, 1987.
[60] Alexander Grigor'yan. Heat kernels on weighted manifolds and applications. Cont. Math, 2005. To appear. http://www.ma.ic.ac.uk/~grigor/pubs.htm.
[61] K.V. Mardia and P.E. Jupp. Directional statistics. Wiley, Chichester, 2000.
[62] A.M. Kagan, Y.V. Linnik, and C.R. Rao. Characterization Problems in Mathematical Statistics. Wiley-Interscience, New York, 1973.
[63] I. Chavel. Riemannian geometry - A modern introduction, volume 108 of Cambridge tracts in mathematics. Cambridge University Press, 1993.
[64] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C. Cambridge Univ. Press, 1991.
[65] X. Pennec. Toward a generic framework for recognition based on uncertain geometric features.
Videre: Journal of Computer Vision Research, 1(2):58–87, 1998.
[66] A. Roche, X. Pennec, G. Malandain, and N. Ayache. Rigid registration of 3D ultrasound with MR images: a new approach combining intensity and gradient information. IEEE Transactions on Medical Imaging, 20(10):1038–1049, October 2001.
[67] S. Nicolau, X. Pennec, L. Soler, and N. Ayache. Evaluation of a new 3D/2D registration crite-
rion for liver radio-frequencies guided by augmented reality. In N. Ayache and H. Delingette, ed-
itors, International Symposium on Surgery Simulation and Soft Tissue Modeling (IS4TM’03),
volume 2673 of Lecture Notes in Computer Science, pages 270–283, Juan-les-Pins, France,
2003. INRIA Sophia Antipolis, Springer-Verlag.
[68] V. Rasouli. Application of Riemannian multivariate statistics to the analysis of rock fracture
surface roughness. PhD thesis, University of London, 2002.
[69] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric
measurements. In A.E. Cetin, L. Akarun, A. Ertuzun, M.N. Gurcan, and Y. Yardimci, editors,
Proc. of Nonlinear Signal and Image Processing (NSIP’99), volume 1, pages 194–198, June
20-23, Antalya, Turkey, 1999. IEEE-EURASIP.