Geometric Theory of Information

Signals and Communication Technology
Editor
Frank Nielsen
Sony Computer Science Laboratories Inc
Shinagawa-Ku, Tokyo
Japan
and
Laboratoire d’Informatique (LIX)
Ecole Polytechnique
Palaiseau Cedex
France
Acknowledgments
First of all, I would like to thank the chapter contributors for providing us with the
latest advances in information geometry, its computational methods, and applications.
I express my gratitude to the peer reviewers for their careful feedback, which led to
this polished, revised work. Each chapter received from two to eight review reports,
with three to five reviews per chapter on average.
I thank the following reviewers (in alphabetical order of their first name):
Akimichi Takemura, Andrew Wood, Anoop Cherian, Arnaud Dessein, Atsumi
Ohara, Bijan Afsari, Frank Critchley, Frank Nielsen, Giovanni Pistone, Hajime
Reference

1. Amari, S., Nagaoka, H.: Methods of Information Geometry. AMS Monograph. Oxford University Press, Oxford (2000)
Chapter 1
Divergence Functions and Geometric Structures
They Induce on a Manifold
Jun Zhang
1.1 Introduction
J. Zhang (B)
Department of Psychology and Department of Mathematics, University of Michigan,
Ann Arbor, MI 48109, USA
e-mail: [email protected]
Under a coordinate transform $x \mapsto \tilde{x}$, the new coefficients $\tilde{\Gamma}$ are related to the old ones $\Gamma$ via

$$\tilde{\Gamma}^{l}_{mn}(\tilde{x}) = \sum_{k}\left( \sum_{i,j} \frac{\partial x^i}{\partial \tilde{x}^m}\, \frac{\partial x^j}{\partial \tilde{x}^n}\, \Gamma^k_{ij}(x) + \frac{\partial^2 x^k}{\partial \tilde{x}^m\, \partial \tilde{x}^n} \right) \frac{\partial \tilde{x}^l}{\partial x^k}; \qquad (1.3)$$
1 A holonomic coordinate system means that the coordinates have been properly “scaled” in unit-length with respect to each other, so that the directional derivatives commute: their Lie bracket $[\partial_i, \partial_j] = \partial_i\partial_j - \partial_j\partial_i = 0$, i.e., the mixed partial derivatives are exchangeable in their order of application.
A curve whose tangent vectors are parallel along the curve is said to be
“auto-parallel”.
As a primitive on a manifold, affine connections can be characterized in terms of their (i) torsion and (ii) curvature. The torsion $T$ of a connection $\Gamma$, which is itself a tensor, is given by the asymmetric part of the connection, $T(\partial_i, \partial_j) = \nabla_{\partial_i}\partial_j - \nabla_{\partial_j}\partial_i = \sum_k T^k_{ij}\,\partial_k$, where $T^k_{ij} = \Gamma^k_{ij} - \Gamma^k_{ji}$ is its local representation.

By definition, $R^l_{kij}$ is anti-symmetric under $i \leftrightarrow j$.

A connection is said to be flat when $R^l_{kij}(x) \equiv 0$ and $T^k_{ij} \equiv 0$. Note that this is a tensorial condition, so the flatness of a connection $\nabla$ is a coordinate-independent property, even though the local expression of the connection (in terms of $\Gamma$) is coordinate-dependent. For any flat connection, there exists a local coordinate system under which $\Gamma^k_{ij}(x) \equiv 0$ in a neighborhood; this is the affine coordinate system for the given flat connection.
In the above discussion, metric and connections are treated as separate structures on a manifold. When both are defined on the same manifold, it is convenient to express the affine connection $\Gamma$ in its “covariant” form

$$\Gamma_{ij,k} = g(\nabla_{\partial_i}\partial_j, \partial_k) = \sum_l g_{lk}\,\Gamma^l_{ij}. \qquad (1.4)$$

Though $\Gamma^k_{ij}$ is the more primitive quantity that does not involve the metric, $\Gamma_{ij,k}$ represents the projection of $\Gamma$ onto the basis vectors $\partial_k$. The covariant form of the Riemann curvature is (cf. footnote 2)

$$R_{lkij} = \sum_m g_{lm}\, R^m_{kij}.$$
2 This component-wise notation of the Riemann curvature tensor follows standard differential geometry textbooks, such as [16]. Information geometers, such as [2], adopt instead the notation $R(\partial_i, \partial_j)\partial_k = \sum_l R^l_{ijk}\,\partial_l$, with $R_{ijkl} = \sum_m R^m_{ijk}\, g_{ml}$.
$$\partial_k\, g(\partial_i, \partial_j) = g(\hat{\nabla}_{\partial_k}\partial_i, \partial_j) + g(\partial_i, \hat{\nabla}_{\partial_k}\partial_j). \qquad (1.5)$$

The Levi-Civita connection $\hat{\Gamma}$ is compatible with the metric $g$, in the sense that it treats tangent vectors of the shortest curves on a manifold as being parallel (equivalently speaking, it treats geodesics as auto-parallel curves).
It turns out that one can define a kind of “compatibility” relation more general than that expressed by (1.5), by introducing the notion of “conjugacy” (denoted by ∗) between two connections. A connection $\nabla^*$ is said to be conjugate (or dual) to $\nabla$ with respect to $g$ if

$$\partial_k\, g(\partial_i, \partial_j) = g(\nabla_{\partial_k}\partial_i, \partial_j) + g(\partial_i, \nabla^*_{\partial_k}\partial_j). \qquad (1.6)$$

Clearly, $(\nabla^*)^* = \nabla$. Moreover, $\hat{\nabla}$, which satisfies (1.5), is special in the sense that it is self-conjugate: $(\hat{\nabla})^* = \hat{\nabla}$.
Because the metric tensor $g$ provides a one-to-one mapping between points in the tangent space (i.e., vectors) and points in the cotangent space (i.e., co-vectors), (1.6) can also be seen as characterizing how co-vector fields are to be parallel-transported in order to preserve their dual pairing $\langle \cdot, \cdot \rangle$ with vector fields.
so that

$$\Gamma^*_{kj,i} = g(\nabla^*_{\partial_j}\partial_k, \partial_i) = \sum_l g_{il}\,\Gamma^{*l}_{kj}.$$
$$C_{ijk} = \frac{\partial g_{ij}}{\partial x^k} - \Gamma_{ki,j} - \Gamma_{kj,i} \;\;\big(= \Gamma^*_{kj,i} - \Gamma_{kj,i}\big).$$

From its definition, $C_{ijk} = C_{jik}$; that is, $C$ is symmetric with respect to its first two indices. It can further be shown that

$$C_{ijk} - C_{ikj} = \sum_l g_{il}\big(T^l_{jk} - T^{*l}_{jk}\big),$$
Two torsion-free connections $\Gamma$ and $\Gamma'$ are said to be projectively equivalent if there exists a function $\tau$ such that

$$\Gamma'^k_{ij} = \Gamma^k_{ij} + \delta^k_i\,\partial_j\tau + \delta^k_j\,\partial_i\tau,$$

where $\delta^k_i$ is the Kronecker delta. When two connections are projectively equivalent, their corresponding auto-parallel curves have identical shape (i.e., considered as unparameterized curves); these so-called “pre-geodesics” differ only by a change of parameterization $\tau$.
Two torsion-free connections $\Gamma$ and $\Gamma'$ are said to be dual-projectively equivalent if there exists a function $\tau$ such that

$$\Gamma'_{ij,k} = \Gamma_{ij,k} - g_{ij}\,(\partial_k\tau).$$

When two connections are dual-projectively equivalent, their conjugate connections (with respect to $g$) have identical pre-geodesics (identical shape).
Recall that two Riemannian metrics $g, g'$ are conformally equivalent if there exists a function $\tau$ such that $g'_{ij} = e^{\tau}\,g_{ij}$.

When $\phi = \mathrm{const}$ (or $\psi = \mathrm{const}$), the corresponding connections are projectively (respectively, dual-projectively) equivalent.
$$(\nabla_{\partial_i}\Omega)(\partial_1, \ldots, \partial_n) \equiv \partial_i\big(\Omega(\partial_1, \ldots, \partial_n)\big) - \sum_{l=1}^n \Omega(\ldots, \nabla_{\partial_i}\partial_l, \ldots).$$
or

$$\sum_l \Gamma^l_{il}(x) = \frac{\partial \log \Omega(x)}{\partial x^i}. \qquad (1.10)$$
One immediately sees that the existence of a function $\Omega$ satisfying (1.10) is equivalent to the right-hand side of (1.12) being identically zero.
Making use of (1.10), it is easy to show that the parallel volume form of a Levi-Civita connection $\hat{\Gamma}$ is given by

$$\hat{\Omega}(x) = \sqrt{\det[g_{ij}(x)]}.$$
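As a one-line check of this claim (our addition; the text states the formula without proof), combine (1.10) with the Jacobi formula for the derivative of a determinant:

$$\sum_l \hat{\Gamma}^l_{il} = \frac{1}{2}\sum_{l,m} g^{lm}\,\partial_i g_{lm} = \frac{1}{2}\,\partial_i \log\det[g_{jk}] = \partial_i \log\sqrt{\det[g_{jk}]},$$

where the first equality holds because the $\partial_l g_{im}$ and $\partial_m g_{il}$ terms of the Christoffel symbols cancel under the symmetric contraction; this is exactly (1.10) with $\Omega = \hat{\Omega}$.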
Making use of (1.7), the parallel volume forms $\Omega, \Omega^*$ associated with $\Gamma$ and $\Gamma^*$ satisfy (apart from a multiplicative constant, which must be positive)

$$\Omega(x)\,\Omega^*(x) = (\hat{\Omega}(x))^2 = \det[g_{ij}(x)]. \qquad (1.13)$$
The equiaffine condition can also be expressed using a quantity related to the cubic form $C_{ijk}$. We may introduce the Tchebychev form (also known as the first Koszul form), expressed in local coordinates as

$$T_i = \sum_{j,k} C_{ijk}\, g^{jk}. \qquad (1.14)$$
$$\frac{\partial T_i}{\partial x^j} = \frac{\partial T_j}{\partial x^i}. \qquad (1.15)$$

This expresses the integrability condition. When Eq. (1.15) is satisfied, there exists a function $\tau$ such that $T_i = \partial_i\tau$. Furthermore, it can be shown that

$$\tau = -2\log(\Omega/\hat{\Omega}).$$
Proposition 1 ([13, 25]) The necessary and sufficient condition for a torsion-free connection $\nabla$ to be equiaffine is that any of the following holds:

1. There exists a $\nabla$-parallel volume element $\Omega$: $\nabla\Omega = 0$.
2. The Ricci tensor of $\nabla$ is symmetric: $\mathrm{Ric}_{ij} = \mathrm{Ric}_{ji}$.
3. The curvature tensor satisfies $\sum_k R^k_{kij} = 0$.
4. The Tchebychev 1-form $T$ is closed: $dT = 0$.
5. There exists a function $\tau$, called the Tchebychev potential, such that $T_i = \partial_i\tau$.

It is known that the Ricci tensor of the Levi-Civita connection is always symmetric; this is why the Riemannian volume form $\hat{\Omega}$ always exists.
$$\Gamma^{(\alpha)k}_{ij} = \frac{1+\alpha}{2}\,\Gamma^k_{ij} + \frac{1-\alpha}{2}\,\Gamma^{*k}_{ij}. \qquad (1.16)$$

Obviously, $\Gamma^{(0)} = \hat{\Gamma}$ is the Levi-Civita connection. Using the cubic form, this amounts to $\nabla^{(\alpha)}g = \alpha C$. The $\alpha$-parallel volume element is given by

$$\Omega^{(\alpha)} = e^{-\frac{\alpha}{2}\tau}\,\hat{\Omega}.$$
So $\Gamma$ is flat if and only if $\Gamma^*$ is flat. In this case, the manifold is said to be “dually flat”. When $\Gamma, \Gamma^*$ are dually flat, $\Gamma^{(\alpha)}$ is called “$\alpha$-transitively flat” [21]. In that case, $\{M, g, \Gamma^{(\alpha)}, \Gamma^{(-\alpha)}\}$ is called an $\alpha$-Hessian structure [26]. These structures are all compatible with a metric $g$ induced from a strictly convex (potential) function; see the next subsection.
For an $\alpha$-Hessian manifold, the Tchebychev form (1.14) is given by

$$T_i = \frac{\partial \log(\det[g_{kl}])}{\partial x^i}$$

and its derivative (known as the second Koszul form) is

$$\beta_{ij} = \frac{\partial T_i}{\partial x^j} = \frac{\partial^2 \log(\det[g_{kl}])}{\partial x^i\,\partial x^j}.$$
$$\partial^i \equiv \frac{\partial}{\partial u_i} = \sum_l \frac{\partial x^l}{\partial u_i}\,\frac{\partial}{\partial x^l} = \sum_l F^{li}\,\partial_l,$$

$$F_{ij}(x) = \frac{\partial u_i}{\partial x^j}, \qquad F^{ij}(u) = \frac{\partial x^i}{\partial u_j}, \qquad \sum_l F_{il}\,F^{lj} = \delta^j_i, \qquad (1.17)$$

where $\delta^j_i$ is the Kronecker delta (taking the value 1 when $i = j$ and 0 otherwise). If the
new coordinate system u = [u1 , . . . , un ] (with components expressed by subscripts)
is such that
Fij (x) = gij (x), (1.18)
then the x-coordinate system and the u-coordinate system are said to be “biorthogo-
nal” to each other since, from the definition of metric tensor (1.1),
$$g(\partial_i, \partial^j) = g\Big(\partial_i, \sum_l F^{lj}\partial_l\Big) = \sum_l F^{lj}\, g(\partial_i, \partial_l) = \sum_l F^{lj}\, g_{il} = \delta^j_i.$$
and

$$\Gamma^{rs,t}(u) = \sum_{i,j,k} \frac{\partial x^r}{\partial u_i}\,\frac{\partial x^s}{\partial u_j}\,\frac{\partial x^t}{\partial u_k}\,\Gamma_{ij,k}(x) + \frac{\partial^2 x^t}{\partial u_r\,\partial u_s}. \qquad (1.22)$$

Similar relations hold between $\Gamma^{*rs}_t(u)$ and $\Gamma^{*k}_{ij}(x)$, and between $\Gamma^{*rs,t}(u)$ and $\Gamma^*_{ij,k}(x)$.
In analogy with (1.7), we have the following identity:

$$\frac{\partial^2 x^t}{\partial u_s\,\partial u_r} = \frac{\partial g^{rt}(u)}{\partial u_s} = \Gamma^{rs,t}(u) + \Gamma^{*ts,r}(u).$$
Therefore, we have

$$\Gamma^{rs,t}(u) = -\sum_{i,j,k} g^{ri}(u)\,g^{sj}(u)\,g^{tk}(u)\,\Gamma^*_{ij,k}(x) \qquad (1.23)$$

and

$$\Gamma^{*ts}_r(u) = -\sum_j g^{js}(u)\,\Gamma^t_{jr}(x). \qquad (1.24)$$
Let us now express the parallel volume forms $\Omega(x), \Omega^*(u)$ in the biorthogonal coordinates $x$ and $u$. Contracting the index $t$ with $r$ in (1.24), and invoking (1.10), we obtain, after integration,

$$\Omega^*(u)\,\Omega(x) = \mathrm{const}. \qquad (1.25)$$
The relations (1.25) and (1.26) indicate that the volume forms of the pair of conjugate
connections, when expressed in biorthogonal coordinates respectively, are inversely
proportional to each other.
The $\Gamma^{(\alpha)}$-parallel volume element $\Omega^{(\alpha)}$ can be shown to be given by (in either the $x$- or the $u$-coordinates)

$$\Omega^{(\alpha)} = \Omega^{\frac{1+\alpha}{2}}\,(\Omega^*)^{\frac{1-\alpha}{2}}.$$

Clearly,

$$\Omega^{(\alpha)}(x)\,\Omega^{(-\alpha)}(x) = \det[g_{ij}(x)] \;\longleftrightarrow\; \Omega^{(\alpha)}(u)\,\Omega^{(-\alpha)}(u) = \det[g^{ij}(u)].$$
such that

$$\frac{\partial u_i(x)}{\partial x^j} = g_{ij}(x) = g_{ji}(x) = \frac{\partial u_j(x)}{\partial x^i}.$$
The above identity implies that there exists a function $\Phi$ such that $u_i = \partial_i\Phi$ and, by the positive definiteness of $g_{ij}$, $\Phi$ must be a strictly convex function! In this case, the $x$- and $u$-variables satisfy (1.37), and the pair of convex functions, $\Phi$ and its conjugate $\tilde{\Phi}$, are related to $g_{ij}$ and $g^{ij}$ by

$$g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i\,\partial x^j} \;\longleftrightarrow\; g^{ij}(u) = \frac{\partial^2 \tilde{\Phi}(u)}{\partial u_i\,\partial u_j}.$$
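As a concrete numerical sketch of this correspondence (our illustration, with a potential of our choosing): take $\Phi(x) = \sum_i e^{x_i}$, whose conjugate is $\tilde{\Phi}(u) = \sum_i (u_i\log u_i - u_i)$; the two Hessians are then mutually inverse matrices, exactly as biorthogonality requires.

```python
import numpy as np

# Convex potential Phi(x) = sum_i exp(x_i); the dual (biorthogonal)
# coordinates are u_i = dPhi/dx^i = exp(x_i).
x = np.array([0.3, -1.2, 0.7])
u = np.exp(x)

# g_ij(x) = Hessian of Phi = diag(exp(x));
# g^ij(u) = Hessian of the conjugate tilde-Phi(u) = diag(1/u).
g_x = np.diag(np.exp(x))
g_u = np.diag(1.0 / u)

# Biorthogonality g(partial_i, partial^j) = delta_i^j: the Hessians are inverses.
assert np.allclose(g_x @ g_u, np.eye(3))
print(g_x @ g_u)
```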
It follows from the above Lemma that a necessary and sufficient condition for a Riemannian manifold to admit biorthogonal coordinates is that its Levi-Civita connection is given by

$$\hat{\Gamma}_{ij,k}(x) \equiv \frac{1}{2}\left(\frac{\partial g_{ik}}{\partial x^j} + \frac{\partial g_{jk}}{\partial x^i} - \frac{\partial g_{ij}}{\partial x^k}\right) = \frac{1}{2}\,\frac{\partial g_{ij}}{\partial x^k}.$$
In other words, biorthogonal coordinates are affine coordinates for the dually flat pair of connections. In fact, we can now define a pair of torsion-free connections by

$$\gamma_{ij,k}(x) = 0, \qquad \gamma^*_{ij,k}(x) = \frac{\partial g_{ij}}{\partial x^k}$$

and show that they are conjugate with respect to $g$, that is, they satisfy (1.6). This is to say that we select an affine connection $\gamma$ such that $x$ is its affine coordinate system. From (1.22), when $\gamma^*$ is expressed in the $u$-coordinates,
$$\alpha = \sum_{i=1}^n p_i\, dx^i$$
$$\omega(\partial_i, \partial_j) = \omega(\tilde{\partial}_i, \tilde{\partial}_j) = 0; \qquad \omega(\partial_i, \tilde{\partial}_j) = -\omega(\tilde{\partial}_j, \partial_i) = \omega_{ij}. \qquad (1.28)$$
is given by

$$\tilde{G}_{ij}(z, \bar{z}) = \frac{\partial^2 \Phi}{\partial z^i\,\partial \bar{z}^j}.$$

Here the real-valued function $\Phi$ (of complex variables) is called the “Kähler potential”.
It is known that the tangent bundle $TM$ of a manifold $M$ with a flat connection on it admits a complex structure [7]. As [18] pointed out, a Hessian manifold can be seen as a “real” Kähler manifold.

3 Conjugate connections which admit torsion have recently been studied by Calin et al. [5] and Matsuzoe [15].
$$\omega_x = \sum_i dx^i \wedge d\xi_i.$$
(Recall that the comma separates the variable in the first slot from that in the second slot for differentiation.) It is easy to check that, in a neighborhood of the diagonal $\Delta_M \subset M \times M$, the map $L_D$ is a diffeomorphism, since the Jacobian matrix of the map,

$$\begin{pmatrix} \delta_{ij} & D_{ij} \\ 0 & D_{i,j} \end{pmatrix},$$

is non-degenerate there. Furthermore,

$$R_D^*\,\omega_y = -D_{i,j}(x, y)\, dx^i \wedge dy^j.$$
It was Barndorff-Nielsen and Jupp [3] who first proposed (1.29) as an induced symplectic form on $M \times M$, apart from a minus sign; the divergence function $D$ was called a “york”. As an example, the Bregman divergence $B_\Phi$ (given by (1.33) below) induces the symplectic form $\sum_{i,j}\Phi_{ij}\, dx^i \wedge dy^j$.
we require

$$D_{i,j} = D_{j,i}, \qquad (1.31)$$

or explicitly

$$\frac{\partial^2 D}{\partial x^i\,\partial y^j} = \frac{\partial^2 D}{\partial x^j\,\partial y^i}.$$
Note that this condition is always satisfied on $\Delta_M$, by the definition of a divergence function $D$, which has allowed us to define a Riemannian structure on $\Delta_M$ (Proposition 6). We now require it to be satisfied on $M \times M$ (at least in a neighborhood of $\Delta_M$).
For divergence functions satisfying (1.31), we can consider inducing a metric $G_D$ on $M \times M$; the induced Riemannian (Hermitian) metric $G_D$ is defined by

$$G_D(X, Y) = \omega_D(X, JY).$$
It is easy to verify that $G_D$ is invariant under the almost complex structure $J$. The metric components are given by:

$$G_{ij} = G_D\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right) = \omega_D\!\left(\frac{\partial}{\partial x^i}, J\frac{\partial}{\partial x^j}\right) = \omega_D\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial y^j}\right) = -D_{i,j},$$

$$G_{i'j'} = G_D\!\left(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial y^j}\right) = \omega_D\!\left(\frac{\partial}{\partial y^i}, J\frac{\partial}{\partial y^j}\right) = \omega_D\!\left(\frac{\partial}{\partial y^i}, -\frac{\partial}{\partial x^j}\right) = -D_{j,i},$$

$$G_{ij'} = G_D\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial y^j}\right) = \omega_D\!\left(\frac{\partial}{\partial x^i}, J\frac{\partial}{\partial y^j}\right) = \omega_D\!\left(\frac{\partial}{\partial x^i}, -\frac{\partial}{\partial x^j}\right) = 0,$$

$$G_{i'j} = G_D\!\left(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial x^j}\right) = \omega_D\!\left(\frac{\partial}{\partial y^i}, J\frac{\partial}{\partial x^j}\right) = \omega_D\!\left(\frac{\partial}{\partial y^i}, \frac{\partial}{\partial y^j}\right) = 0.$$
Now introduce complex coordinates $z = x + \sqrt{-1}\,y$, and write

$$D(x, y) = D\!\left(\frac{z + \bar{z}}{2},\; \frac{z - \bar{z}}{2\sqrt{-1}}\right) \equiv \tilde{D}(z, \bar{z}),$$
so

$$\frac{\partial^2 \tilde{D}}{\partial z^i\,\partial \bar{z}^j} = \frac{1}{4}\,\big(D_{ij} + D_{,ij}\big).$$
If $D$ satisfies

$$D_{ij} + D_{,ij} = \kappa\, D_{i,j}, \qquad (1.32)$$

where $\kappa$ is a constant, then $M \times M$ admits a Kähler potential (and hence is a Kähler manifold), with

$$ds^2 = \frac{\kappa}{2}\,\frac{\partial^2 \tilde{D}}{\partial z^i\,\partial \bar{z}^j}\; dz^i \otimes d\bar{z}^j.$$
$$g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i\,\partial x^j}$$

and the dual connections

$$\Gamma_{ij,k}(x) = 0, \qquad \Gamma^*_{ij,k}(x) = \frac{\partial^3 \Phi(x)}{\partial x^i\,\partial x^j\,\partial x^k}$$
are induced from a convex potential function $\Phi$. In the (biorthogonal) $u$-coordinates, these geometric quantities can be expressed as

$$g^{ij}(u) = \frac{\partial^2 \tilde{\Phi}(u)}{\partial u_i\,\partial u_j}, \qquad \Gamma^{*ij,k}(u) = 0, \qquad \Gamma^{ij,k}(u) = \frac{\partial^3 \tilde{\Phi}(u)}{\partial u_i\,\partial u_j\,\partial u_k},$$

$$\langle x, u\rangle = \sum_{i=1}^n x^i u_i. \qquad (1.34)$$
for $x \neq y$.

Recall that, when $\Phi$ is convex, its convex conjugate $\tilde{\Phi}: \tilde{V} \subseteq \mathbb{R}^n \to \mathbb{R}$ is defined through the Legendre transform

$$\tilde{\Phi}(u) = \langle x, u\rangle - \Phi(x)\,\big|_{x = (\partial\Phi)^{-1}(u)}, \qquad (1.36)$$

with $\tilde{\tilde{\Phi}} = \Phi$ and $(\partial\Phi) = (\partial\tilde{\Phi})^{-1}$. The function $\tilde{\Phi}$ is also convex, and through it (1.35) precisely expresses the Fenchel inequality

$$\Phi(x) + \tilde{\Phi}(u) - \langle x, u\rangle \ge 0$$

for any $x \in V$, $u \in \tilde{V}$, with equality holding if and only if

$$u = (\partial\Phi)(x) = (\partial\tilde{\Phi})^{-1}(x) \;\longleftrightarrow\; x = (\partial\tilde{\Phi})(u) = (\partial\Phi)^{-1}(u). \qquad (1.37)$$
With the aid of conjugate variables, we can introduce the “canonical divergence” $A_\Phi: V \times \tilde{V} \to \mathbb{R}_+$ (and $A_{\tilde{\Phi}}: \tilde{V} \times V \to \mathbb{R}_+$),

$$A_\Phi(x, v) = \Phi(x) + \tilde{\Phi}(v) - \langle x, v\rangle,$$

which relates to the Bregman divergence via

$$B_\Phi\big(x, (\partial\Phi)^{-1}(v)\big) = A_\Phi(x, v) = B_{\tilde{\Phi}}\big((\partial\Phi)(x), v\big).$$
for all $x \neq y$ and any $|\alpha| < 1$ (the inequality sign is reversed when $|\alpha| > 1$). Assume $\Phi$ to be sufficiently smooth (differentiable up to fourth order).

Zhang [23] introduced the following family of functions on $V \times V$, indexed by $\alpha \in \mathbb{R}$:

$$D^{(\alpha)}_\Phi(x, y) = \frac{4}{1 - \alpha^2}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right). \qquad (1.40)$$
From its construction, $D^{(\alpha)}_\Phi(x, y)$ is non-negative for $|\alpha| < 1$ due to Eq. (1.39), and for $|\alpha| = 1$ due to Eq. (1.35). For $|\alpha| > 1$, assuming $\left(\frac{1-\alpha}{2}x + \frac{1+\alpha}{2}y\right) \in V$, the non-negativity of $D^{(\alpha)}_\Phi(x, y)$ can also be proven, due to the inequality (1.39) reversing its sign. Furthermore, $D^{(\pm 1)}_\Phi(x, y)$ is defined by taking $\lim_{\alpha \to \pm 1}$:

$$D^{(1)}_\Phi(x, y) = D^{(-1)}_\Phi(y, x) = B_\Phi(x, y),$$
$$D^{(-1)}_\Phi(x, y) = D^{(1)}_\Phi(y, x) = B_\Phi(y, x).$$
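These limits are easy to check numerically. The following sketch (ours, with an arbitrary convex potential chosen for illustration) evaluates Eq. (1.40) near $\alpha = 1$ and compares it against the Bregman divergence, and also checks the referential duality relation stated just below:

```python
import numpy as np

# A strictly convex potential and its gradient (chosen for illustration).
Phi  = lambda x: np.sum(np.exp(x))
dPhi = lambda x: np.exp(x)

def D_alpha(alpha, x, y):
    """The divergence family of Eq. (1.40)."""
    m = 0.5 * (1 - alpha) * x + 0.5 * (1 + alpha) * y
    return (4.0 / (1 - alpha**2)) * (
        0.5 * (1 - alpha) * Phi(x) + 0.5 * (1 + alpha) * Phi(y) - Phi(m))

def bregman(x, y):
    """B_Phi(x, y) = Phi(x) - Phi(y) - <x - y, dPhi(y)>."""
    return Phi(x) - Phi(y) - np.dot(x - y, dPhi(y))

x, y = np.array([0.2, -0.5]), np.array([1.0, 0.3])
print(D_alpha(1 - 1e-6, x, y), bregman(x, y))   # alpha -> 1 limit: B_Phi(x, y)
print(D_alpha(0.5, x, y), D_alpha(-0.5, y, x))  # equal, by referential duality
```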
Note that $D^{(\alpha)}_\Phi(x, y)$ satisfies the relation (called “referential duality” in [24])

$$D^{(\alpha)}_\Phi(x, y) = D^{(-\alpha)}_\Phi(y, x);$$

that is, exchanging the asymmetric status of the two points (in the directed distance) amounts to $\alpha \leftrightarrow -\alpha$.
We start by reviewing a main result from [23] linking the divergence function $D^{(\alpha)}_\Phi(x, y)$ defined in (1.40) and the $\alpha$-Hessian structure.

Proposition 9 ([23]) The manifold $\{M, g(x), \Gamma^{(\alpha)}(x), \Gamma^{(-\alpha)}(x)\}$4 associated with $D^{(\alpha)}_\Phi(x, y)$ is given by

$$g_{ij}(x) = \Phi_{ij} \qquad (1.41)$$
4 The functional argument $x$ (or $u$ below) indicates that the $x$-coordinate system (or $u$-coordinate system) is being used. Recall from Sect. 1.2.5 that under the $x$ (resp. $u$) local coordinates, $g$ and $\Gamma$, in component form, are expressed with lower (resp. upper) indices.
and

$$\Gamma^{(\alpha)}_{ij,k}(x) = \frac{1-\alpha}{2}\,\Phi_{ijk}, \qquad \Gamma^{*(\alpha)}_{ij,k}(x) = \frac{1+\alpha}{2}\,\Phi_{ijk}. \qquad (1.42)$$

Here, $\Phi_{ij}, \Phi_{ijk}$ denote, respectively, the second and third partial derivatives of $\Phi(x)$:

$$\Phi_{ij} = \frac{\partial^2 \Phi(x)}{\partial x^i\,\partial x^j}, \qquad \Phi_{ijk} = \frac{\partial^3 \Phi(x)}{\partial x^i\,\partial x^j\,\partial x^k}.$$
Recall that an $\alpha$-Hessian manifold is equipped with an $\alpha$-independent metric and a family of $\alpha$-transitively flat connections $\Gamma^{(\alpha)}$ (i.e., $\Gamma^{(\alpha)}$ satisfying (1.16), with $\Gamma^{(\pm 1)}$ dually flat). From (1.42),

$$\Gamma^{*(\alpha)}_{ij,k} = \Gamma^{(-\alpha)}_{ij,k}, \qquad \hat{\Gamma}_{ij,k}(x) = \frac{1}{2}\,\Phi_{ijk}.$$
A straightforward calculation shows:

Proposition 10 ([26]) For the $\alpha$-Hessian manifold $\{M, g(x), \Gamma^{(\alpha)}(x), \Gamma^{(-\alpha)}(x)\}$,

(i) the Riemann curvature tensor of the $\alpha$-connection is given by

$$R^{(\alpha)}_{\mu\nu ij}(x) = \frac{1 - \alpha^2}{4}\,\sum_{l,k}\big(\Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu}\big)\,\Psi^{lk} = R^{*(\alpha)}_{ij\mu\nu}(x),$$
It is worth pointing out that while the $D_\Phi$-divergence induces the $\alpha$-Hessian structure, it is not unique in doing so: the same structure can arise from the following divergence function, a mixture of Bregman divergences in conjugate forms:

$$\frac{1-\alpha}{2}\,B_\Phi(x, y) + \frac{1+\alpha}{2}\,B_\Phi(y, x).$$
$$\sum_i \Phi_{ki}\,\frac{d^2 x^i}{ds^2} + \frac{1-\alpha}{2}\sum_{i,j}\Phi_{kij}\,\frac{dx^i}{ds}\frac{dx^j}{ds} = 0 \;\longleftrightarrow\; \frac{d^2}{ds^2}\,\Phi_k\!\left(\frac{1-\alpha}{2}\,x\right) = 0,$$

whose solutions satisfy $\Phi_k\!\left(\frac{1-\alpha}{2}\,x\right) = a_k s + b_k$, where the scalar $s$ is the arc length and $a_k, b_k$, $k = 1, 2, \ldots, n$, are constants (determined by a point and the direction along which the auto-parallel curve flows). For $\alpha = -1$, the auto-parallel curves are given by $u_k = \Phi_k(x) = a_k s + b_k$, which are affine coordinates as previously noted.
Note that the metric and conjugate connections in the forms (1.41) and (1.42) are induced from (1.40). Using the convex conjugate $\tilde{\Phi}: \tilde{V} \to \mathbb{R}$ given by (1.36), we introduce the following family of divergence functions $\tilde{D}^{(\alpha)}_\Phi$, defined by

$$\tilde{D}^{(\alpha)}_\Phi(u, v) = D^{(\alpha)}_\Phi\big((\partial\Phi)^{-1}(u),\, (\partial\Phi)^{-1}(v)\big).$$

A straightforward calculation shows that $\tilde{D}^{(\alpha)}_\Phi$ induces the $\alpha$-Hessian structure $\{M, g, \Gamma^{(-\alpha)}, \Gamma^{(\alpha)}\}$, where $\Gamma^{(\mp\alpha)}$ are given by (1.42); that is, the pair of $\alpha$-connections are themselves “conjugate” (in the sense of $\alpha \leftrightarrow -\alpha$) to those induced by $D^{(\alpha)}_\Phi(x, y)$.
Explicitly written,

$$\tilde{D}^{(\alpha)}_\Phi(u, v) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,\Phi\big((\partial\Phi)^{-1}(u)\big) + \frac{1+\alpha}{2}\,\Phi\big((\partial\Phi)^{-1}(v)\big) - \Phi\!\left(\frac{1-\alpha}{2}\,(\partial\Phi)^{-1}(u) + \frac{1+\alpha}{2}\,(\partial\Phi)^{-1}(v)\right)\right).$$
Proposition 11 ([23]) The $\alpha$-Hessian manifold $\{M, g(u), \Gamma^{(\alpha)}(u), \Gamma^{(-\alpha)}(u)\}$ associated with $\tilde{D}^{(\alpha)}_\Phi(u, v)$ is given by

$$g^{ij}(u) = \tilde{\Phi}^{ij}(u), \qquad (1.43)$$

$$\Gamma^{(\alpha)ij,k}(u) = \frac{1+\alpha}{2}\,\tilde{\Phi}^{ijk}, \qquad \Gamma^{*(\alpha)ij,k}(u) = \frac{1-\alpha}{2}\,\tilde{\Phi}^{ijk}. \qquad (1.44)$$

Here, $\tilde{\Phi}^{ij}, \tilde{\Phi}^{ijk}$ denote, respectively, the second and third partial derivatives of $\tilde{\Phi}(u)$:

$$\tilde{\Phi}^{ij}(u) = \frac{\partial^2 \tilde{\Phi}(u)}{\partial u_i\,\partial u_j}, \qquad \tilde{\Phi}^{ijk}(u) = \frac{\partial^3 \tilde{\Phi}(u)}{\partial u_i\,\partial u_j\,\partial u_k}.$$
We remark that the same metric (1.43) and the same $\alpha$-connections (1.44) are induced by $D^{(-\alpha)}_{\tilde{\Phi}}(u, v) \equiv D^{(\alpha)}_{\tilde{\Phi}}(v, u)$; this follows as a simple application of the Eguchi relation.
An application of (1.23) gives rise to the following relations:

$$\Gamma^{(\alpha)mn,l}(u) = -\sum_{i,j,k} g^{im}(u)\,g^{jn}(u)\,g^{kl}(u)\,\Gamma^{(-\alpha)}_{ij,k}(x),$$

$$\Gamma^{*(\alpha)mn,l}(u) = -\sum_{i,j,k} g^{im}(u)\,g^{jn}(u)\,g^{kl}(u)\,\Gamma^{(\alpha)}_{ij,k}(x),$$

$$R^{(\alpha)klmn}(u) = \sum_{i,j,\mu,\nu} g^{ik}(u)\,g^{jl}(u)\,g^{\mu m}(u)\,g^{\nu n}(u)\,R^{(\alpha)}_{ij\mu\nu}(x).$$
$$\omega^{(\alpha)}(u) = \det[\tilde{\Phi}^{ij}(u)]^{\frac{1+\alpha}{2}}.$$
which is symmetric in $i, j$. Both (1.31) and (1.32) are satisfied. The symplectic form, under the complex coordinates, is given by

$$\omega^{(\alpha)} = \Phi^{(\alpha)}_{ij}\, dx^i \wedge dy^j = \frac{4\sqrt{-1}}{1+\alpha^2}\,\frac{\partial^2 \tilde{\Phi}^{(\alpha)}}{\partial z^i\,\partial \bar{z}^j}\; dz^i \wedge d\bar{z}^j,$$

and the Riemannian metric by

$$ds^2 = \frac{8}{1+\alpha^2}\,\frac{\partial^2 \tilde{\Phi}^{(\alpha)}}{\partial z^i\,\partial \bar{z}^j}\; dz^i \otimes d\bar{z}^j,$$

with Kähler potential $\frac{2}{1+\alpha^2}\,\tilde{\Phi}^{(\alpha)}(z, \bar{z})$. Here, $\Phi^{(\alpha)}_{ij} = \Phi_{ij}\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)$.
For the diagonal manifold $\Delta_M = \{(x, x) : x \in M\}$, a basis of its tangent space $T_{(x,x)}\Delta_M$ can be selected as

$$e_i = \frac{1}{\sqrt{2}}\left(\frac{\partial}{\partial x^i} + \frac{\partial}{\partial y^i}\right).$$
1.5 Summary
References
1. Amari, S.: Differential Geometric Methods in Statistics. Lecture Notes in Statistics, vol. 28.
Springer, New York (1985) (Reprinted in 1990)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. AMS Monograph. Oxford University Press, Oxford (2000)
3. Barndorff-Nielsen, O.E., Jupp, P.E.: Yorks and symplectic structures. J. Stat. Plan. Inference
63, 133–146 (1997)
4. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its
application to the solution of problems in convex programming. USSR Comput. Math. Phys.
7, 200–217 (1967)
5. Calin, O., Matsuzoe, H., Zhang. J.: Generalizations of conjugate connections. In: Sekigawa,
K., Gerdjikov, V., Dimiev, S. (eds.) Trends in Differential Geometry, Complex Analysis and
Mathematical Physics: Proceedings of the 9th International Workshop on Complex Structures
and Vector Fields, pp. 24–34. World Scientific Publishing, Singapore (2009)
6. Csiszár, I.: On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica 2, 329–339 (1967)
7. Dombrowski, P.: On the geometry of the tangent bundle. Journal für die reine und angewandte Mathematik 210, 73–88 (1962)
8. Eguchi, S.: Second order efficiency of minimum contrast estimators in a curved exponential
family. Ann. Stat. 11, 793–803 (1983)
9. Eguchi, S.: Geometry of minimum contrast. Hiroshima Math. J. 22, 631–647 (1992)
10. Lauritzen, S.: Statistical manifolds. In: Amari, S., Barndorff-Nielsen, O., Kass, R., Lauritzen,
S., Rao, C.R. (eds.) Differential Geometry in Statistical Inference. IMS Lecture Notes, vol. 10,
pp. 163–216. Institute of Mathematical Statistics, Hayward (1987)
11. Matsuzoe, H.: On realization of conformally-projectively flat statistical manifolds and the
divergences. Hokkaido Math. J. 27, 409–421 (1998)
12. Matsuzoe, H., Inoguchi, J.: Statistical structures on tangent bundles. Appl. Sci. 5, 55–65 (2003)
13. Matsuzoe, H., Takeuchi, J., Amari, S.: Equiaffine structures on statistical manifolds and
Bayesian statistics. Differ. Geom. Appl. 24, 567–578 (2006)
14. Matsuzoe, H.: Computational geometry from the viewpoint of affine differential geometry.
In: Nielsen, F. (ed.) Emerging Trends in Visual Computing, pp. 103–123. Springer, Berlin,
Heidelberg (2009)
15. Matsuzoe, H.: Statistical manifolds and affine differential geometry. Adv. Stud. Pure Math. 57, 303–321 (2010)
16. Nomizu, K., Sasaki, T.: Affine Differential Geometry—Geometry of Affine Immersions. Cam-
bridge University Press, Cambridge (1994)
17. Ohara, A., Matsuzoe, H., Amari, S.: Conformal geometry of escort probability and its appli-
cations. Mod. Phys. Lett. B 26, 1250063 (2012)
18. Shima, H.: Hessian Geometry. Shokabo, Tokyo (2001) (in Japanese)
19. Shima, H., Yagi, K.: Geometry of Hessian manifolds. Differ. Geom. Appl. 7, 277–290 (1997)
20. Simon, U.: Affine differential geometry. In: Dillen, F., Verstraelen, L. (eds.) Handbook of
Differential Geometry, vol. I, pp. 905–961. Elsevier Science, Amsterdam (2000)
21. Uohashi, K.: On α-conformal equivalence of statistical manifolds. J. Geom. 75, 179–184 (2002)
22. Yano, K., Ishihara, S.: Tangent and Cotangent Bundles: Differential Geometry, vol. 16. Dekker,
New York (1973)
23. Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16, 159–195
(2004)
24. Zhang, J.: Referential duality and representational duality on statistical manifolds. Proceedings
of the 2nd International Symposium on Information Geometry and Its Applications, Tokyo,
pp. 58–67 (2006)
25. Zhang, J.: A note on curvature of α-connections on a statistical manifold. Ann. Inst. Stat. Math.
59, 161–170 (2007)
26. Zhang, J., Matsuzoe, H.: Dualistic differential geometry associated with a convex function. In:
Gao, D.Y., Sherali, H.D. (eds.) Advances in Applied Mathematics and Global Optimization
(Dedicated to Gilbert Strang on the occasion of his 70th birthday), Advances in Mechanics and
Mathematics, vol. III, Chap. 13, pp. 439–466. Springer, New York (2009)
27. Zhang, J.: Nonparametric information geometry: From divergence function to referential-
representational biduality on statistical manifolds. Entropy 15, 5384–5418 (2013)
28. Zhang, J., Li, F.: Symplectic and Kähler structures on statistical manifolds induced from divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) Proceedings of the 1st International Conference on Geometric Science of Information (GSI2013), pp. 595–603 (2013)
Chapter 2
Geometry on Positive Definite Matrices
Deformed by V-Potentials and Its Submanifold
Structure
Abstract In this paper we investigate the dually flat structure of the space of positive definite matrices induced by a class of convex functions called V-potentials, from the viewpoint of information geometry. It is proved that the geometry is invariant under special linear group actions and naturally introduces a foliated structure. Each leaf is proved to be a homogeneous statistical manifold with a negative constant curvature, and to enjoy a special decomposition property of the canonically defined divergence. As an application to statistics, we finally give the correspondence between the obtained geometry on the space and the geometry on elliptical distributions induced from a certain Bregman divergence.
2.1 Introduction
The space of positive definite matrices has been studied both geometrically and
algebraically, e.g., as a Riemannian symmetric space and a cone of squares in Jordan
algebra ([9, 12, 17, 37, 45] and many others), and the results have broad applications
to analysis, statistics, physics and convex optimization and so on.
In the development of Riemannian geometry on the space of positive definite
matrices P, the function α(P) = −k log det P with a positive constant k plays a cen-
A. Ohara (B)
Department of Electrical and Electronics Engineering,
University of Fukui, Fukui 910-8507, Japan
e-mail: [email protected]
S. Eguchi
The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-0014, Japan
e-mail: [email protected]
tral role. Important facts are that the function α is strictly convex, and that the Riemannian metric defined as the Hessian of α and the associated Levi-Civita connection are invariant under automorphism group actions on the space, i.e., congruent transformations by the general linear group [37, 45]. The group invariance, i.e., homogeneity of the space, is also crucial in several applications, such as multivariate statistical analysis [23] and the mathematical program called the semidefinite program (SDP) [28, 46].
On the other hand, the space of positive definite matrices admits dualistic geometry (information geometry [1, 2], Hessian geometry [40]), which involves a certain pair of affine connections instead of the single Levi-Civita connection. The geometry on the space then not only maintains the group invariance but also reveals an abundant dualistic structure, e.g., the Legendre transform, dual flatness, divergences, Pythagorean-type relations and so on [34]. Such concepts and results have proved very natural also in the study of applications, e.g., structures of stable matrices [31], properties of means defined as midpoints of dual geodesics [30], or the iteration complexity of interior point methods for SDP [15]. In this case, the function α is still the essential source of the structure and induces the dualistic geometry as a potential function.
The first purpose of this paper is to investigate the geometry of the space of positive definite matrices induced by a special class of convex functions, including α, from the viewpoint of dualistic geometry. We here consider the class of functions of the form α(V)(P) = V(det P), with V a smooth function on the positive real numbers R++, and call α(V) a V-potential [32].1 While the dualistic geometry induced from the log-potential α = α(−log) is standard in the sense that it is invariant under the action of the entire automorphism group, the geometries induced from general V-potentials lose this prominent property. We show, however, that they still preserve invariance under the special linear group actions. On the basis of this fact, and by means of the affine hypersurface theory [29], we exploit a foliated structure and discuss the common properties of the leaves as homogeneous submanifolds.
Another purpose is to show that the dualistic geometry on the space of positive definite matrices induced from V-potentials naturally corresponds to the geometry of a fairly general class of multivariate statistical models:
For a given n×n positive definite matrix P, the n-dimensional Gaussian distribution with zero mean vector is defined by the density function

$$f(x, P) = \exp\Big\{-\frac{1}{2}\,x^T P x - \alpha(P)\Big\}, \qquad \alpha(P) = -\frac{1}{2}\log\det P + \frac{n}{2}\log 2\pi,$$
which has a wide variety of applications in the field of probability theory, statistics,
physics and so forth. Information geometry successfully provides elegant geometric
insight, such as the Fisher information metric, e-connection and m-connection, with
the theory of exponential family of distributions including the Gaussian distribution
family [1, 2, 16, 25]. Note that in this geometry, the structure is derived from the
1 The reference [32] is a conference version of this chapter, omitting proofs and the whole Sect. 2.5.
We recall the basic notions and results of information geometry [2], i.e., the dualistic differential geometry of Hessian domains [40] and statistical manifolds [19–21].

For a torsion-free affine connection ∇ and a pseudo-Riemannian metric g on a manifold M, the triple (M, ∇, g) is called a statistical (Codazzi) manifold if it admits another torsion-free connection ∇∗ satisfying

$$X\,g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z)$$

for arbitrary X, Y and Z in X(M), where X(M) is the set of all tangent vector fields on M. We call ∇ and ∇∗ duals of each other with respect to g, and (g, ∇, ∇∗) is called a dualistic structure on M.
A statistical manifold (M, ∇, g) is said to be of constant curvature k if the curvature tensor R of ∇ satisfies

$$R(X, Y)Z = k\,\{g(Y, Z)X - g(X, Z)Y\}.$$

When the constant k is zero, the statistical manifold is said to be flat, or dually flat, because the curvature tensor R∗ of ∇∗ then also vanishes automatically [2].
For τ ∈ R, statistical manifolds (M, ∇, g) and (M, ∇′, g′) are said to be τ-conformally equivalent if there exists a function δ on M such that

$$g'(X, Y) = e^{\delta}\, g(X, Y),$$

$$g(\nabla'_X Y, Z) = g(\nabla_X Y, Z) - \frac{1+\tau}{2}\,d\delta(Z)\,g(X, Y) + \frac{1-\tau}{2}\,\big\{d\delta(X)\,g(Y, Z) + d\delta(Y)\,g(X, Z)\big\}.$$
The canonical divergence on Ω is defined by

$$D(p, q) = \alpha(x(p)) + \alpha^*(x^*(q)) - \langle x^*(q), x(p)\rangle$$

for two points p and q on Ω, where x(·) and x∗(·) respectively represent the x- and x∗-coordinates of a point. If g(α) is positive definite and Ω is a convex domain, then the conditions (2.2) hold, because the unique maximum α∗(x∗(q)) of $\sup_{p\in\Omega}\{\langle x^*(q), x(p)\rangle - \alpha(x(p))\}$ is attained when $x(p) = (\mathrm{grad}\,\alpha)^{-1}(x^*(q))$ holds.
Conversely, for a function D : M × M → R satisfying the two conditions (2.2) for any points p and q in M, define the positive semidefinite form g(D) and two affine connections ∇(D) and ∗∇(D) by

$$g^{(D)}(X, Y) = -D[X|Y], \quad g\big(\nabla^{(D)}_X Y, Z\big) = -D[XY|Z], \quad g\big({}^*\nabla^{(D)}_X Y, Z\big) = -D[Z|XY] \qquad (2.3)$$

for X, Y and Z ∈ X(M), where the notation on the right-hand sides stands for

$$D[X_1\cdots X_k \,|\, Y_1\cdots Y_l](p) = (X_1)_p\cdots(X_k)_p\,(Y_1)_r\cdots(Y_l)_r\, D(p, r)\,\big|_{p=r}.$$
Consider the vector space of n × n real symmetric matrices, denoted by Sym(n, R), with the inner product (X|Y) = tr(XY) for X, Y ∈ Sym(n, R). We identify an element Y∗ of the dual space Sym(n, R)∗ with Y∗ (denoted by the same symbol) in Sym(n, R) via ⟨Y∗, X⟩ = (Y∗|X). Let {E1, ..., E_{n(n+1)/2}} be basis matrices of Sym(n, R); consider the affine coordinate system {x^1, ..., x^{n(n+1)/2}} with respect to them and the canonical flat affine connection ∇. Denote by PD(n, R) the convex cone of positive definite matrices in Sym(n, R).
Let X(PD(n, R)) and T_P PD(n, R) be the set of all tangent vector fields on PD(n, R) and the tangent vector space at P ∈ PD(n, R), respectively. By identifying E_i with the natural basis (∂/∂x^i)_P, we represent X_P ∈ T_P PD(n, R) by X ∈ Sym(n, R). Similarly, we regard a Sym(n, R)-valued smooth function X on PD(n, R) as X ∈ X(PD(n, R)) via the identification of the constant function E_i with ∂/∂x^i.
Let φ_G denote the congruent transformation by a matrix G, i.e., φ_G X = G X G^T. The differential of φ_G is denoted by φ_{G∗}. If G is nonsingular, the transformation φ_G is an element of the automorphism group, which acts transitively on PD(n, R).

In Sects. 2.3 and 2.4, we consider the dually flat structure on PD(n, R) as a Hessian domain induced from a certain class of potential functions.
Definition 1 Let V (s) be a smooth function on positive real numbers s ∈ R++ . The
function defined by
α(V ) (P) = V (det P) (2.4)
$$\beta_i(s) = s\,\frac{d\beta_{i-1}(s)}{ds}, \quad i = 1, 2, 3, \qquad \text{where } \beta_0(s) = V(s).$$

$$\text{(i)}\;\; \beta_1(s) < 0 \;\;(s > 0), \qquad \text{(ii)}\;\; \gamma^{(V)}(s) = \frac{\beta_2(s)}{\beta_1(s)} < \frac{1}{n} \;\;(s > 0), \qquad (2.5)$$
β1 (s) n
which are later shown to ensure the convexity of α(V ) (P) on PD(n, R). Note that
the first condition β1 (s) < 0 for all s > 0 implies the function V (s) is strictly
decreasing on s > 0. Important examples of V (s) satisfying (2.5) are − log s or
c1 + c2 sγ for real parameters c1 , c2 , γ with c2 γ < 0 and γ < 1/n. Another example
V (s) = c log(cs + 1) − log s with 0 ≤ c < 1 is proved useful for the quasi-Newton
updates [14].
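A quick symbolic check of these examples (our sketch; sympy is used only for the differentiations in Definition 1):

```python
import sympy as sp

s = sp.symbols('s', positive=True)

def betas(V, k=3):
    """beta_0 = V and beta_i(s) = s * d(beta_{i-1})/ds, as in Definition 1."""
    bs = [V]
    for _ in range(k):
        bs.append(sp.simplify(s * sp.diff(bs[-1], s)))
    return bs

# Log-potential V(s) = -log s: beta_1 = -1 < 0 and gamma^(V) = beta_2/beta_1 = 0 < 1/n.
print(betas(-sp.log(s)))          # [-log(s), -1, 0, 0]

# Power potential V(s) = c1 + c2*s**g: beta_1 = c2*g*s**g (< 0 iff c2*g < 0)
# and gamma^(V)(s) = g, so condition (2.5)(ii) reads g < 1/n.
c1, c2, g = sp.symbols('c1 c2 g')
b = betas(c1 + c2 * s**g)
print(sp.simplify(b[2] / b[1]))   # -> g
```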
Using the formula grad det P = (det P)P⁻¹, we have the gradient mapping grad α(V) and the differential form dα(V), respectively, as

$$\mathrm{grad}\,\alpha^{(V)}(P) = \beta_1(\det P)\,P^{-1}, \qquad d\alpha^{(V)}(X) = \beta_1(\det P)\,\mathrm{tr}(P^{-1}X).$$
The Hessian of α(V) at P ∈ PD(n, R), which we denote by $g^{(V)}_P$, is calculated as

$$g^{(V)}_P(X, Y) = d\big(d\alpha^{(V)}(X)\big)(Y) = -\beta_1(\det P)\,\mathrm{tr}(P^{-1}XP^{-1}Y) + \beta_2(\det P)\,\mathrm{tr}(P^{-1}X)\,\mathrm{tr}(P^{-1}Y).$$
Theorem 1 The Hessian g(V ) is positive definite on PD(n, R) if and only if the
conditions (2.5) hold.
The proof can be found in the Appendix A.
Remark 1 The Hessian g(V) is SL(n, R)-invariant, i.e., $g^{(V)}_{P'}(X', Y') = g^{(V)}_P(X, Y)$ for any G ∈ SL(n, R), where $P' = \varphi_G P$, $X' = \varphi_{G*}X$ and $Y' = \varphi_{G*}Y$.
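The following numerical sketch (ours) checks both Theorem 1 and Remark 1 for the log-potential V(s) = −log s, where β1 = −1 and β2 = 0; in this special case the second term drops out and the metric is in fact invariant under all of GL(n, R), as noted in the introduction:

```python
import numpy as np
rng = np.random.default_rng(0)

# Hessian metric for V(s) = -log s (beta1 = -1, beta2 = 0):
# g_P(X, Y) = tr(P^{-1} X P^{-1} Y).
def g_log(P, X, Y):
    Pi = np.linalg.inv(P)
    return np.trace(Pi @ X @ Pi @ Y)

def random_sym(n):
    A = rng.standard_normal((n, n))
    return A + A.T

n = 4
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)              # a point of PD(n, R)
X, Y = random_sym(n), random_sym(n)

assert g_log(P, X, X) > 0                 # positive definiteness (Theorem 1)

G = rng.standard_normal((n, n))           # a generic nonsingular G
tau = lambda M: G @ M @ G.T               # congruent transformation phi_G
assert np.isclose(g_log(tau(P), tau(X), tau(Y)), g_log(P, X, Y))
```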
The conjugate function of α(V), denoted by α(V)∗, is defined by the Legendre transform $\alpha^{(V)*}(P^*) = \sup_P\{\langle P^*, P\rangle - \alpha^{(V)}(P)\}$; since grad α(V) is invertible by the positive definiteness of g(V), we obtain the explicit expression (2.9) for α(V)∗ in terms of P.
Let ∇ be the canonical flat affine connection on Sym(n, R). To discuss the dually flat structure on PD(n, R), regarding g(V) as a positive definite Riemannian metric, we derive the dual connection ∗∇(V) with respect to g(V) introduced in Sect. 2.2.

Consider a smooth curve ω = {P_t | −ξ < t < ξ} in PD(n, R) satisfying

$$(P_t)_{t=0} = P \in PD(n, \mathbb{R}), \qquad \left(\frac{dP_t}{dt}\right)_{t=0} = X \in T_P\,PD(n, \mathbb{R}).$$
Proof Differentiate the Legendre transform $P^*_t = \mathrm{grad}\,\alpha^{(V)}(P_t)$ along the curve ω; then, writing $\beta_1' = d\beta_1/ds$ and evaluating both sides at t = 0, we see that the statement holds.
Theorem 2 Let π_t denote the parallel-shift operator of the connection ∗∇(V). Then the parallel shift π_t(Y) of the tangent vector Y (= π_0(Y)) ∈ T_P PD(n, R) along the curve ω satisfies

$$\left(\frac{d\pi_t(Y)}{dt}\right)_{t=0} = XP^{-1}Y + YP^{-1}X + \Phi(X, Y, P) + \Phi^\perp(X, Y, P),$$

where

$$\Phi(X, Y, P) = \frac{\beta_2(s)\,\mathrm{tr}(P^{-1}X)}{\beta_1(s)}\,Y + \frac{\beta_2(s)\,\mathrm{tr}(P^{-1}Y)}{\beta_1(s)}\,X, \qquad (2.11)$$

$$\Phi^\perp(X, Y, P) = \eta\,P, \qquad (2.12)$$

and s = det P.
Remark 2 The denominator β1 (s){β1 (s) − nβ2 (s)} in η is always positive because
(2.5) is assumed.
Corollary 1 Let E_i be the matrix representation of the natural basis vector field ∂/∂x^i described at the beginning of Sect. 2.3. Then, their covariant derivatives at P defined by the dual connection ∗∇(V) have the following matrix representations:

$$\left({}^*\nabla^{(V)}_{\partial/\partial x^i}\,\frac{\partial}{\partial x^j}\right)_P = -E_iP^{-1}E_j - E_jP^{-1}E_i - \Phi(E_i, E_j, P) - \Phi^\perp(E_i, E_j, P),$$
Proof Recall the identification (∂/∂x^i)_P = E_i for any P ∈ PD(n, R). Then the statement follows from Theorem 2 and the definition of the covariant derivative, i.e.,

$$\left({}^*\nabla^{(V)}_{\partial/\partial x^i}\,\frac{\partial}{\partial x^j}\right)_P = \left(\frac{d\,\pi_{-t}\big((\partial/\partial x^j)_{P_t}\big)}{dt}\right)_{t=0}, \qquad P_t \in PD(n, \mathbb{R}),\; P = P_0.$$
holds for any G ∈ SL(n, R), where $P' = \varphi_G P$, $X' = \varphi_{G*}X$ and $Y' = \varphi_{G*}Y$. In particular, both of the connections induced from the power potential α(V), defined via V(s) = c1 + c2 s^γ with real constants c1, c2 and γ, are GL(n, R)-invariant. In addition, so is the orthogonality with respect to g(V). Hence, we conclude that both the ∇- and ∗∇(V)-projections [1, 2] are GL(n, R)-invariant for the power potentials, while g(V) itself is not.
The power potential function α(V) with the normalizing conditions V(1) = 0 and β1(1) = −1, i.e., V(s) = (1 − s^γ)/γ, is called the beta potential. In this case, β1(s) = −s^γ, β2(s) = −γs^γ, β3(s) = −γ²s^γ and γ(V)(s) = γ. Note that setting γ to zero leads to V(s) = −log s, which recovers the standard dualistic geometry induced by the logarithmic characteristic function α(−log)(P) = −log det P on PD(n, R) [34]. See [7, 33] for detailed discussion related to the power potential function.
$$PD(n, \mathbb{R}) = \bigcup_{s>0} \mathcal{L}_s, \qquad \mathcal{L}_s = \{P \mid P > 0,\; \det P = s\}.$$

Denote by $\mathcal{R}_P$ the ray through P in the convex cone PD(n, R), i.e., $\mathcal{R}_P = \{Q \mid Q = \kappa P,\; 0 < \kappa \in \mathbb{R}\}$. Another foliated structure we consider is

$$PD(n, \mathbb{R}) = \bigcup_{P \in \mathcal{L}_s} \mathcal{R}_P.$$
Proof The tangent vectors of $\mathcal{R}_P$ are just scalar multiples of P. Hence, for X ∈ T_P L_s (so that tr(P⁻¹X) = 0, since det is constant on L_s), we have

$$g^{(V)}_P(X, P) = -\beta_1(s)\,\mathrm{tr}(P^{-1}X) + n\beta_2(s)\,\mathrm{tr}(P^{-1}X) = 0.$$
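Numerically (our sketch), a tangent vector to $\mathcal{L}_s$ at P is a symmetric X with tr(P⁻¹X) = 0, and the orthogonality above holds for every admissible pair (β1, β2):

```python
import numpy as np
rng = np.random.default_rng(1)

def g_V(P, X, Y, b1, b2):
    """g^{(V)}_P(X,Y) = -b1 tr(P^-1 X P^-1 Y) + b2 tr(P^-1 X) tr(P^-1 Y)."""
    Pi = np.linalg.inv(P)
    return (-b1 * np.trace(Pi @ X @ Pi @ Y)
            + b2 * np.trace(Pi @ X) * np.trace(Pi @ Y))

n = 3
A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)
Pi = np.linalg.inv(P)

B = rng.standard_normal((n, n)); X = B + B.T
X = X - (np.trace(Pi @ X) / n) * P      # now tr(P^{-1} X) = 0, i.e. X in T_P L_s

# The ray direction P is g^{(V)}-orthogonal to T_P L_s for any V-potential.
for b1, b2 in [(-1.0, 0.0), (-2.0, 0.3)]:
    assert abs(g_V(P, X, P, b1, b2)) < 1e-9
```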
Proof The statement follows from the fact that $\mathcal{R}_P$ and its Legendre transform $\{Q^* \mid Q^* = \kappa^* P^*,\; 0 < \kappa^* \in \mathbb{R}\}$ are, respectively, straight lines with respect to the affine coordinate systems of ∇ and ∗∇(V).
We know that the mean of dual connections is metric and torsion-free [2], i.e.,

$$\hat{\nabla}^{(V)} = \frac{1}{2}\big(\nabla + {}^*\nabla^{(V)}\big) \qquad (2.13)$$
is the Levi-Civita connection for g(V). Proposition 3 implies that every $\mathcal{R}_P$ is always a $\hat{\nabla}^{(V)}$-geodesic for any V satisfying (2.5). On the other hand, we have the following:

Proposition 4 Each leaf $\mathcal{L}_s$ is autoparallel with respect to the Levi-Civita connection $\hat{\nabla}^{(V)}$ if and only if β2(s) = 0, i.e., V(s) = −k log s for a positive constant k.
The proof can be found in the Appendix C.
It is seen that νs is the involutive isometry of (Ls , g̃(V ) ) with the isolated fixed point
κI ∈ Ls , where κ = s1/n . In particular, −gradα(V ) (P) is the involutive isometry of
(Ls , g̃(V ) ) satisfying s = s∗ . Since isometries φG where G ∈ SL(n, R) act transitively
on Ls , each manifold (Ls , g̃(V ) ) is globally Riemannian symmetric [12].
Proposition 6 For any V(s) satisfying (2.5), the following statements hold:

(i) The metric $\tilde{g}^{(V)}$ on $\mathcal{L}_s$ is given by $\tilde{g}^{(V)} = -\beta_1(s)\,\tilde{g}^{(-\log)}$;
(ii) For P, Q ∈ $\mathcal{L}_s$, it holds that $D^{(V)}(P, Q) = -\beta_1(s)\,D^{(-\log)}(P, Q)$;
(iii) The induced connections $\tilde{\nabla}^{(V)}, {}^*\tilde{\nabla}^{(V)}$ on $\mathcal{L}_s$ are actually independent of V(s), i.e., ${}^*\tilde{\nabla}^{(V)} = {}^*\tilde{\nabla}^{(-\log)}$ for all V(s) (see the remark below).
In [11, 40, 43, 44], level surfaces of the potential function in a Hessian domain were studied via the affine hypersurface theory [29]. Since the statistical submanifold $(\mathcal{L}_s, \tilde{\nabla}, \tilde{g}^{(V)})$ is a level surface of α(V) in the Hessian domain (PD(n, R), ∇, g(V)), the general results in the literature can be applied. The immediate consequences in our context are summarized as follows:
Consider the gradient vector field E of α(V), defined by $g^{(V)}(X, E) = d\alpha^{(V)}(X)$, and the scaled gradient vector field

$$N = -\frac{1}{d\alpha^{(V)}(E)}\,E.$$

Explicitly,

$$E = \frac{\beta_1(\det P)}{n\beta_2(\det P) - \beta_1(\det P)}\,P, \qquad N = -\frac{1}{n\beta_1(\det P)}\,P. \qquad (2.14)$$
Using N as the transversal vector field, the results in [11, 43] show that the geometry $(\mathcal{L}_s, \tilde{\nabla}, \tilde{g}^{(V)})$ coincides with the one realized by the affine immersion [29] of $\mathcal{L}_s$ into Sym(n, R) with the canonical flat connection ∇:

$$\nabla_X Y = \tilde{\nabla}_X Y + h(X, Y)\,N, \qquad (2.15)$$
$$\nabla_X N = -AX + \phi(X)\,N. \qquad (2.16)$$

Here h and φ are, respectively, the affine fundamental form and the transversal connection form, satisfying the relations

$$h = \tilde{g}^{(V)}, \qquad \phi = 0, \qquad (2.17)$$
holds for μ ∈ R satisfying R∗ = μQ∗ , i.e., μ = κ−1 β1 (det R)/β1 (det Q) > 0.
Using these preliminaries, we have the main result in this section as follows:
Proof (i) Note that $\mathcal{L}_s$ is a level surface not only of α(V)(P) but also of α(V)∗(P∗) in (2.9), which implies that both $(\mathcal{L}_s, \tilde{\nabla}, \tilde{g}^{(V)})$ and $(\mathcal{L}_s, \tilde{\nabla}^*, \tilde{g}^{(V)})$ are 1-conformally flat, by Proposition 7. By the duality of τ-conformal equivalence, they are also −1-conformally flat.

(ii) For arbitrary X ∈ X($\mathcal{L}_s$), the equalities

$$\nabla_X N = -\frac{1}{\beta_1(s)\,n}\,\nabla_X P = -\frac{1}{\beta_1(s)\,n}\,X$$

hold at each P ∈ $\mathcal{L}_s$. The second equality follows from the fact that ∇ is the canonical flat connection of the vector space Sym(n, R). Comparing this equation with (2.16) and (2.17), we have $A = k_s I$. Substituting into the Gauss equation for the affine immersion [29]:
Kurose proved that a modified form of the Pythagorean relation holds for canonical
divergences on statistical manifolds with constant curvature [20]. As a consequence
of (ii) in Theorem 3, his result holds on each Ls .
D(V ) (P, R) = D(V ) (P, Q) + D(V ) (Q, R) − ks D(V ) (P, Q)D(V ) (Q, R). (2.19)
Lemma 2 Suppose that the three points P, Q and R and the parameter κ satisfy the same assumptions as in Proposition 8. Then the following relation holds:

$${}^*D^{(V)}(P, R) = {}^*D^{(V)}(Q, R) + \kappa\,{}^*D^{(V)}(P, Q).$$

Since

$${}^*D^{(V)}(P, Q) = D^{(V)}(Q, P) = \alpha^{(V)*}(P^*) + \alpha^{(V)}(Q) - \langle P^*, Q\rangle,$$

this is the dual result to Proposition 8 with respect to $(\mathcal{L}_s, \tilde{\nabla}^*, \tilde{g}^{(V)})$. Consequently,

$$D^{(V)}(P, S) = D^{(V)}(P, R) + \kappa'\,D^{(V)}(R, S), \qquad \kappa' = \kappa\{1 - k_s\,D^{(V)}(Q, R)\}, \qquad (2.21)$$

i.e.,

$$D^{(V)}(P, S) = D^{(V)}(P, R) + \kappa\{1 - k_s\,D^{(V)}(Q, R)\}\,D^{(V)}(R, S), \qquad (2.22)$$
Definition 2 [4] Let U(s) be a smooth convex function with positive derivatives u(s) = U′(s) and u′(s) on R or a (semi-infinite) interval thereof, and let ε be the inverse function of u there. The functional, for two functions f(x) and g(x) on $\mathbb{R}^n$,

$$D_U(f, g) = \int \Big\{ U(\varepsilon(g)) - U(\varepsilon(f)) - \big(\varepsilon(g) - \varepsilon(f)\big)\,f \Big\}\, dx$$

is called the U-divergence. If U(s) is proportional to $(1 + \gamma s)^{(\gamma+1)/\gamma}$, $s > -1/\gamma$, then the corresponding U-divergence is the beta-divergence defined in (2.23).
When we consider a family of functions parametrized by the elements of a manifold M, the U-divergence induces a dualistic structure on M in the same way as in Proposition 1. Here, we confine our attention to the family of multivariate probability density functions specified by P ∈ PD(n, R). The family is natural in the sense that it is a dually flat statistical manifold with respect to the dualistic geometry induced by the U-divergence.
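As a sanity check on Definition 2 (our sketch): taking U(s) = e^s, so that u = exp and ε = log, the U-divergence reduces to $D_U(f, g) = \int\{g - f - (\log g - \log f)\,f\}\,dx$, which equals the Kullback–Leibler divergence for normalized densities:

```python
import numpy as np

xs = np.linspace(-10, 10, 200001)

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# U(s) = exp(s): U(eps(g)) = g, U(eps(f)) = f, (eps(g)-eps(f))*f = f log(g/f).
f, g = gauss(xs, 0.0, 1.0), gauss(xs, 1.0, 1.5)
D_U = np.trapz(g - f - (np.log(g) - np.log(f)) * f, xs)

# Closed-form KL(f || g) between the two Gaussians, for comparison.
mu0, s0, mu1, s1 = 0.0, 1.0, 1.0, 1.5
kl = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5
print(D_U, kl)   # agree to quadrature accuracy
```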
Definition 3 [4] Let U and u be the functions given in Definition 2. The family of elliptical distributions with density functions

$$M_U = \left\{ f(x, P) = u\!\left(-\frac{1}{2}\,x^T P x - c_U(\det P)\right) \;\middle|\; P \in PD(n, \mathbb{R}) \right\}$$
is called the (zero-mean) U-model associated with the U-divergence. Here, we set f(x, P) = 0 where the right-hand side is undefined, and $c_U(\det P)$ is a normalizing constant satisfying

$$\int f(x, P)\, dx = (\det P)^{-\frac{1}{2}} \int u\!\left(-\frac{1}{2}\,y^T y - c_U(\det P)\right) dy = 1,$$
where η satisfies $u(-\eta/2 - c_U(\det P)) = 0$. Hence, the normalizing constant $c_U(\det P)$ is obtained by solving this normalization condition; after a spherical change of variables, it depends on P only through det P, via the factor $\Gamma(\tfrac{n}{2})\,(\det P)^{\frac{1}{2}}\big/\big(2\pi^{\frac{n}{2}}\big)$.
A similar argument is also valid for the case of unbounded support. See [33] for examples of the calculation.

Note that if the function U(s) satisfies a sort of self-similarity [33], the density function f in the U-model can be expressed in the usual form of an elliptical distribution [8, 23], i.e.,

$$f(x, P) = c_f\,(\det P)^{\frac{1}{2}}\,u\!\left(-\frac{1}{2}\,x^T P x\right)$$

with a constant $c_f$.
Now we consider the correspondence between the dualistic geometry induced by $D_U$ on the U-model and that on PD(n, R) induced by the V-potential function discussed in Sects. 2.3 and 2.4.

Assume that V satisfies the conditions (2.5); then the dualistic structure $(g^{(V)}, \nabla, {}^*\nabla^{(V)})$ on PD(n, R) coincides with that on the U-model induced by the U-divergence in the same way as in Proposition 1.
2.7 Conclusion
We have studied the dualistic geometry on positive definite matrices induced from a V-potential instead of the standard characteristic function.

First, we derived the associated Riemannian metric, the mutually dual affine connections and the canonical divergence. The induced geometry is, in general, proved SL(n, R)-invariant, while it is GL(n, R)-invariant in the case of the characteristic function (V(s) = −log s). However, when V is of the power form, it is shown that orthogonality and a pair of mutually dual connections are GL(n, R)-invariant.

Next, we investigated a foliated structure via the induced geometry. Each leaf (the set of positive definite matrices with a constant determinant) is proved to be a homogeneous statistical manifold with a constant curvature depending on V. As a consequence, we gave a new decomposition relation for the canonical divergence that would be useful for finding the nearest point on a specified leaf.

Finally, to apply the induced geometry to robust statistical inference, we established a relation with the geometry of the U-model (or symmetric elliptical densities).

Further applications of such structures and investigation of other forms of potential functions are left for future work.
Acknowledgments We thank the anonymous referees for their constructive comments and careful
checks of the original manuscript.
Appendices
A Proof of Theorem 1
$$g^{(V)}_P(X, Y) = -\beta_1(\det P)\,\big\{\mathrm{tr}(P^{-1}XP^{-1}Y) - \gamma^{(V)}(\det P)\,\mathrm{tr}(P^{-1}X)\,\mathrm{tr}(P^{-1}Y)\big\}$$
$$= -\beta_1(\det P)\,\mathrm{vec}^T(\tilde{X})\,\Big(I_{n^2} - \gamma^{(V)}(\det P)\,\mathrm{vec}(I_n)\,\mathrm{vec}^T(I_n)\Big)\,\mathrm{vec}(\tilde{Y}).$$

Here $\tilde{X} = P^{-1/2}XP^{-1/2}$, $\tilde{Y} = P^{-1/2}YP^{-1/2}$, vec(·) is the operator that maps $A = (a_{ij}) \in \mathbb{R}^{n\times n}$ to $[a_{11}, \ldots, a_{n1}, a_{12}, \ldots, a_{n2}, \ldots, a_{1n}, \ldots, a_{nn}]^T \in \mathbb{R}^{n^2}$, and $I_n$ and $I_{n^2}$ denote the unit matrices of order n and n², respectively. By congruently transforming the matrix $I_{n^2} - \gamma^{(V)}(\det P)\,\mathrm{vec}(I_n)\,\mathrm{vec}^T(I_n)$ with a proper permutation matrix, we see that the positive definiteness of g(V) is equivalent to $-\beta_1(\det P) > 0$ and

$$I_n - \gamma^{(V)}(\det P)\,\mathbf{1}\mathbf{1}^T > 0, \qquad \text{where } \mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^n.$$
Let W be an orthogonal matrix that has $\mathbf{1}/\sqrt{n}$ as its first column vector. Since the eigen-decomposition

$$I_n - \gamma^{(V)}(\det P)\,\mathbf{1}\mathbf{1}^T = W \begin{pmatrix} 1 - n\gamma^{(V)}(\det P) & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix} W^T$$

holds, the conditions (2.5) are necessary and sufficient for the positive definiteness of g(V). Thus, the statement follows.
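A quick numerical confirmation of this eigen-decomposition (our sketch):

```python
import numpy as np

# Eigenvalues of I_n - gamma * 1 1^T: 1 - n*gamma (eigenvector 1/sqrt(n))
# and 1 with multiplicity n - 1, as used in the proof above.
n, gamma = 5, 0.15
M = np.eye(n) - gamma * np.ones((n, n))
vals = np.sort(np.linalg.eigvalsh(M))
print(vals)                              # [0.25, 1., 1., 1., 1.]
assert np.isclose(vals[0], 1 - n * gamma)
# M > 0 iff 1 - n*gamma > 0, i.e., gamma < 1/n: precisely condition (2.5)(ii).
```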
B Proof of Theorem 2

for any t. From Lemma 1, this implies

$$\frac{d}{dt}\Big[\beta_2(\det P_t)\,\mathrm{tr}\{P_t^{-1}\pi_t(Y)\}\,P_t^{-1} - \beta_1(\det P_t)\,P_t^{-1}\pi_t(Y)\,P_t^{-1}\Big] = 0,$$

$$\big(\beta_1(s) - n\beta_2(s)\big)\,\mathrm{tr}\left\{P^{-1}\left(\frac{d\pi_t(Y)}{dt}\right)_{t=0}\right\} \qquad (2.26)$$
$$= \big(2\beta_1(s) - n\beta_2(s)\big)\,\mathrm{tr}\big(P^{-1}XP^{-1}Y\big) + \big(n\beta_3(s) - 2\beta_2(s)\big)\,\mathrm{tr}\big(P^{-1}X\big)\,\mathrm{tr}\big(P^{-1}Y\big).$$
C Proof of Proposition 4

Since the geometric structure $(\mathcal{L}_s, g^{(V)})$ is invariant under the transformations φ_G with G ∈ SL(n, R), it suffices to consider the point κI ∈ $\mathcal{L}_s$, where κ = s^{1/n}.
Let $\tilde{X} \in \mathcal{X}(\mathcal{L}_s)$ be the vector field defined at each P ∈ $\mathcal{L}_s$ by

$$\tilde{X} = \sum_i \tilde{X}^i(P)\,\frac{\partial}{\partial x^i} = P^{1/2}XP^{1/2}, \qquad X \in T_I\mathcal{L}_1 = \{X \mid \mathrm{tr}(X) = 0,\; X = X^T\},$$

$$\tilde{Y}_{P_t} = P_t^{1/2}\,Y_t\,P_t^{1/2} = \sum_i \tilde{Y}^i(P_t)\,\frac{\partial}{\partial x^i},$$
For the third equality we have used that Φ(κX, κY , κI) = 0 for any X and Y ∈ TI L1 .
Since it holds that

$$g^{(V)}_{\kappa I}\Big(\big(\hat{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I},\; I\Big) = \kappa^{-1}\big(-\beta_1(s) + n\beta_2(s)\big)\,\mathrm{tr}\Big(\big(\hat{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I}\Big)$$

and $-\beta_1(s) + n\beta_2(s) \neq 0$ by (2.5), the $(T_{\kappa I}\mathcal{L}_s)^\perp$-component of $\big(\hat{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I}$ vanishes for any X and Y ∈ $T_I\mathcal{L}_1$ if and only if

$$\mathrm{tr}\Big(\big(\hat{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I}\Big) = -\frac{1}{2}\,\mathrm{tr}\,\Phi^\perp(\kappa X, \kappa Y, \kappa I) = 0.$$
D Proof of Proposition 6

The statements (i) and (ii) follow from direct calculations. Since $\big({}^*\tilde{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_P$ is the orthogonal projection of $\big({}^*\nabla^{(V)}_{\tilde{X}}\tilde{Y}\big)_P$ to $T_P\mathcal{L}_s$ with respect to $g^{(V)}_P$, it can be represented by

$$\big({}^*\tilde{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_P = \big({}^*\nabla^{(V)}_{\tilde{X}}\tilde{Y}\big)_P - \delta P, \qquad \delta \in \mathbb{R}.$$

Since $\Phi^\perp(\kappa X, \kappa Y, \kappa I) \in (T_{\kappa I}\mathcal{L}_s)^\perp$ and $(dY_t/dt)_{t=0} \in T_{\kappa I}\mathcal{L}_s$, the orthogonal projection of $\big({}^*\nabla^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I}$ to $T_{\kappa I}\mathcal{L}_s$ is that of $\kappa\,(dY_t/dt)_{t=0} - \kappa\,(YX + XY)/2$. Thus, from the orthogonality condition, we have

$$\big({}^*\tilde{\nabla}^{(V)}_{\tilde{X}}\tilde{Y}\big)_{\kappa I} = \kappa\left(\frac{d}{dt}Y_t\right)_{t=0} - \frac{\kappa}{2}\,(XY + YX) + \frac{\kappa}{n}\,\mathrm{tr}(XY)\,I,$$
E Proof of Theorem 4
For P and Q in PD(n, R), we write the two density functions in the U-model as $f_P(x) = f(x, P)$ and $f_Q(x) = f(x, Q)$ for short.
It suffices to show that the dual canonical divergence ${}^*D^{(V)}(P, Q) = D^{(V)}(Q, P)$ of (PD(n, R), ∇, g(V)) given by (2.10) coincides with $D_U(f_P, f_Q)$. Note that an exchange of the order of the two arguments in a divergence only exchanges the definitions of the primal and dual affine connections in (2.3) and does not affect the whole dualistic structure of the induced geometry.
Recalling (2.6), we have

$$\mathrm{grad}\,\alpha^{(V)}(P) = V'(\det P)\,\det P\;P^{-1} = \beta_1(\det P)\,P^{-1},$$

where V′ denotes the derivative of V with respect to s. On the other hand, we can directly differentiate α(V)(P) defined via (2.24):

$$\mathrm{grad}\,\alpha^{(V)}(P) = \mathrm{grad}\left\{\int U\!\left(-\frac{1}{2}\,x^TPx - c_U(\det P)\right)dx + c_U(\det P)\right\}$$
$$= \int f_P(x)\left\{-\frac{1}{2}\,xx^T - c_U'(\det P)\,\det P\;P^{-1}\right\}dx + c_U'(\det P)\,\det P\;P^{-1}$$
$$= -\frac{1}{2}\int f_P(x)\,xx^T\,dx = -\frac{1}{2}\,E_P(xx^T),$$

hence

$$\beta_1(\det P)\,P^{-1} = -\frac{1}{2}\,E_P(xx^T). \qquad (2.27)$$
Note that

$$\varepsilon(f_P) = -\frac{1}{2}\,x^TPx - c_U(\det P), \qquad \varepsilon(f_Q) = -\frac{1}{2}\,x^TQx - c_U(\det Q),$$

because ε(u(·)) is the identity. From the definition, the U-divergence is
$$D_U(f_P, f_Q) = \int\left[ U\!\left(-\frac{1}{2}x^TQx - c_U(\det Q)\right) - U\!\left(-\frac{1}{2}x^TPx - c_U(\det P)\right) \right.$$
$$\left. -\, f_P(x)\left\{-\frac{1}{2}x^TQx - c_U(\det Q) + \frac{1}{2}x^TPx + c_U(\det P)\right\}\right]dx$$
$$= \alpha^{(V)}(Q) - \alpha^{(V)}(P) + \frac{1}{2}\,E_P\big(x^TQx - x^TPx\big).$$

Using (2.27), the third term is expressed as

$$\frac{1}{2}\,E_P\big(x^TQx - x^TPx\big) = \frac{1}{2}\,\mathrm{tr}\big\{E_P(xx^T)(Q - P)\big\} = \beta_1(\det P)\,\mathrm{tr}\big(P^{-1}(P - Q)\big).$$
References
1. Amari, S.: Differential-geometrical methods in statistics, Lecture notes in Statistics. vol. 28,
Springer, New York (1985)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. AMS & Oxford University Press, Oxford (2000)
3. Dawid, A.P.: The geometry of proper scoring rules. Ann. Inst. Stat. Math. 59, 77–93 (2007)
4. Eguchi, S.: Information geometry and statistical pattern recognition. Sugaku Expositions Amer.
Math. Soc. 19, 197–216 (2006) (originally Sūgaku, 56, 380–399 (2004) in Japanese)
5. Eguchi, S.: Information divergence geometry and the application to statistical machine learning.
In: Emmert-Streib, F., Dehmer, M. (eds.) Information Theory and Statistical Learning, pp. 309–
332. Springer, New York (2008)
6. Eguchi, S., Copas, J.: A class of logistic-type discriminant functions. Biometrika 89(1), 1–22
(2002)
7. Eguchi, S., Komori, O., Kato, S.: Projective power entropy and maximum tsallis entropy dis-
tributions. Entropy 13, 1746–1764 (2011)
8. Fang, K.T., Kotz, S., Ng, K.W.: Symmetric Multivariate and Related Distributions. Chapman
and Hall, London (1990)
9. Faraut, J., Korányi, A.: Analysis on Symmetric Cones. Oxford University Press, New York
(1994)
10. Grünwald, P.D., Dawid, A.P.: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Stat. 32, 1367–1433 (2004)
11. Hao, J.H., Shima, H.: Level surfaces of nondegenerate functions in R^{n+1}. Geom. Dedicata 50(2), 193–204 (1994)
12. Helgason, S.: Differential Geometry and Symmetric Spaces. Academic Press, New York (1962)
13. Higuchi, I., Eguchi, S.: Robust principal component analysis with adaptive selection for tuning
parameters. J. Mach. Learn. Res. 5, 453–471 (2004)
14. Kanamori, T., Ohara, A.: A bregman extension of quasi-newton updates I: an information
geometrical framework. Optim. Methods Softw. 28(1), 96–123 (2013)
15. Kakihara, S., Ohara, A., Tsuchiya, T.: Information geometry and interior-point algorithms in
semidefinite programs and symmetric cone programs. J. Optim. Theory Appl. 157(3), 749–780
(2013)
16. Kass, R.E., Vos, P.W.: Geometrical Foundations of Asymptotic Inference. Wiley, New York
(1997)
17. Koecher, M.: The Minnesota Notes on Jordan Algebras and their Applications. Springer, Berlin
(1999)
18. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
19. Kurose, T.: Dual connections and affine geometry. Math. Z. 203(1), 115–121 (1990)
20. Kurose, T.: On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J.
46(3), 427–433 (1994)
21. Lauritzen, S.: Statistical manifolds. In: Amari, S.-I., et al. (eds.) Differential Geometry in
Statistical Inference, Institute of Mathematical Statistics, Hayward (1987)
22. Minami, M., Eguchi, S.: Robust blind source separation by beta-divergence. Neural Comput.
14, 1859–1886 (2002)
23. Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)
24. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of u-boost and
bregman divergence. Neural Comput. 16, 1437–1481 (2004)
25. Murray, M.K., Rice, J.W.: Differential Geometry and Statistics. Chapman & Hall, London
(1993)
26. Naudts, J.: Continuity of a class of entropies and relative entropies. Rev. Math. Phys. 16,
809–822 (2004)
27. Naudts, J.: Estimators, escort probabilities, and φ-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 5, 102 (2004)
28. Nesterov, Y.E., Todd, M.J.: Primal-dual interior-point methods for self-scaled cones. SIAM J.
Optim. 8, 324–364 (1998)
29. Nomizu, K., Sasaki, T.: Affine differential geometry. Cambridge University Press, Cambridge
(1994)
30. Ohara, A.: Geodesics for dual connections and means on symmetric cones. Integr. Eqn. Oper.
Theory 50, 537–548 (2004)
31. Ohara, A., Amari, S.: Differential geometric structures of stable state feedback systems with
dual connections. Kybernetika 30(4), 369–386 (1994)
32. Ohara, A., Eguchi, S.: Geometry on positive definite matrices induced from V-potential func-
tion. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information; Lecture Notes in
Computer Science 8085, pp. 621–629. Springer, Berlin (2013)
33. Ohara, A., Eguchi, S.: Group invariance of information geometry on q-gaussian distributions
induced by beta-divergence. Entropy 15, 4732–4747 (2013)
34. Ohara, A., Suda, N., Amari, S.: Dualistic differential geometry of positive definite matrices
and its applications to related problems. Linear Algebra Appl. 247, 31–53 (1996)
35. Ohara, A., Wada, T.: Information geometry of q-Gaussian densities and behaviors of solutions
to related diffusion equations. J. Phys. A: Math. Theor. 43, 035002 (18pp.) (2010)
36. Ollila, E., Tyler, D., Koivunen, V., Poor, V.: Complex elliptically symmetric distributions: survey, new results and applications. IEEE Trans. Signal Process. 60(11), 5597–5623 (2012)
37. Rothaus, O.S.: Domains of positivity. Abh. Math. Sem. Univ. Hamburg 24, 189–235 (1960)
38. Sasaki, T.: Hyperbolic affine hyperspheres. Nagoya Math. J. 77, 107–123 (1980)
39. Scott, D.W.: Parametric statistical modeling by minimum integrated square error. Technomet-
rics 43, 274–285 (2001)
40. Shima, H.: The geometry of Hessian structures. World Scientific, Singapore (2007)
41. Takenouchi, T., Eguchi, S.: Robustifying adaboost by adding the naive error rate. Neural Com-
put. 16(4), 767–787 (2004)
42. Tsallis, C.: Introduction to Nonextensive Statistical Mechanics. Springer, New York (2009)
43. Uohashi, K., Ohara, A., Fujii, T.: 1-conformally flat statistical submanifolds. Osaka J. Math.
37(2), 501–507 (2000)
44. Uohashi, K., Ohara, A., Fujii, T.: Foliations and divergences of flat statistical manifolds.
Hiroshima Math. J. 30(3), 403–414 (2000)
45. Vinberg, E.B.: The theory of convex homogeneous cones. Trans. Moscow Math. Soc. 12,
340–430 (1963)
46. Wolkowicz, H., et al. (eds.): Handbook of Semidefinite Programming. Kluwer Academic Pub-
lishers, Boston (2000)
Chapter 3
Hessian Structures and Divergence Functions
on Deformed Exponential Families
3.1 Introduction
H. Matsuzoe (B)
Department of Computer Science and Engineering, Graduate School of Engineering,
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan
e-mail: [email protected]
M. Henmi
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
e-mail: [email protected]
space [1] or a flat statistical manifold [12]. A pair of dually flat affine connections plays an essential role in the geometric theory of statistical inference. In addition, a Hessian manifold has an asymmetric squared-distance-like function, called the canonical divergence. On an exponential family, the canonical divergence coincides with the Kullback-Leibler divergence, or the relative entropy. (See Sect. 3.3.)

A deformed exponential family is a generalization of exponential families, which was introduced in anomalous statistical physics [22]. (See also [23, 32] and [33].)
A deformed exponential family naturally has two kinds of dualistic Hessian structures, and such geometric structures have been studied independently in machine learning theory [21] and statistical physics [3, 26], etc. For example, a q-exponential family is a typical example of deformed exponential families. One of the Hessian structures on a q-exponential family is related to the geometry of β-divergences (or density power divergences [5]). The other Hessian structure is related to the geometry of α-divergences. (In the q-exponential case, these geometries are studied in [18].) In addition, conformal structures of statistical manifolds play important roles in the geometry of deformed exponential families.

In this paper, we summarize such Hessian structures and conformal structures on deformed exponential families. We then construct a generalized relative entropy from the viewpoint of estimating functions. As an application, we consider a generalization of independence of random variables, and elucidate the geometry of the maximum q-likelihood estimator. This paper is based on the proceedings paper [19].
3.2 Preliminaries
In this paper, we assume that all objects are smooth, and a manifold M is an open
domain in Rn .
Let (M, h) be a semi-Riemannian manifold; that is, h is assumed to be nondegenerate but not necessarily positive definite (e.g., the Lorentzian metric in relativity). Let ∇ be an affine connection on M. We define the dual connection ∇∗ of ∇ with respect to h by

$$X\,h(Y, Z) = h(\nabla_X Y, Z) + h(Y, \nabla^*_X Z),$$

$$R(X, Y)Z := \nabla_X\nabla_Y Z - \nabla_Y\nabla_X Z - \nabla_{[X,Y]}Z,$$
$$T(X, Y) := \nabla_X Y - \nabla_Y X - [X, Y],$$
Assume any two of the above conditions, then the others hold.
From now on, we assume that an affine connection ⊂ is torsion-free.
We say that an affine connection ⊂ is flat if ⊂ is curvature-free. For a flat affine con-
nection ⊂, there exists a coordinate system {θi } on M locally such that the connection
coefficients {Γij⊂ k } (i, j, k = 1, . . . , n) of ⊂ vanish on its coordinate neighbourhood.
We call such a coordinate system {θi } an affine coordinate system.
Let (M, h) be a semi-Riemannian manifold, and let ⊂ be a flat affine connection
on M. We say that the pair (⊂, h) is a Hessian structure on M if there exists a function
ψ, at least locally, such that h = ⊂dψ [28]. In the coordinate form, the following
formula holds:
∂2
hij (p(θ)) = i j ψ(p(θ)),
∂θ ∂θ
(0) 1
h(⊂X Y , Z) := h(⊂X Y , Z) − C(X, Y , Z), (3.1)
2
(0) 1
h(⊂X∗ Y , Z) := h(⊂X Y , Z) + C(X, Y , Z), (3.2)
2
where ⊂ (0) is the Levi-Civita connection with respect to h. In this case, ⊂h and ⊂ ∗ h
are totally symmetric. Hence (M, ⊂, h) and (M, ⊂ ∗ , h) are statistical manifolds.
A triplet (M, ⊂, h) is a flat statistical manifold if and only if it is a Hessian manifold
(cf. [28]). Suppose that R and R∗ are curvature tensors of ⊂ and ⊂ ∗ , respectively.
Then we have
h(R(X, Y )Z, V ) = −h(Z, R∗ (X, Y )V ).
60 H. Matsuzoe and M. Henmi
Hence the condition that the triplet (M, ⊂, h) is a Hessian manifold is equivalent to
that the quadruplet (M, h, ⊂, ⊂ ∗ ) is a dually flat space [1].
For a Hessian manifold (M, ⊂, h), we suppose that {θi } is a ⊂-affine coordinate
system on M. Then there exists a ⊂ ∗ -affine coordinate system {ηi } such that
∂ ∂
h , = δji .
∂θi ∂ηj
∂ψ ∂φ n
= ηi , = θi , ψ(p) + φ(p) − θi (p)ηi (p) = 0, (p ∈ M), (3.3)
∂θi ∂ηi
i=1
∂2ψ ∂2φ
hij = , h ij
= ,
∂θi ∂θj ∂ηi ∂ηj
∂3ψ
Cijk = (3.4)
∂θi ∂θj ∂θk
is the cubic form of (M, ⊂, h).
For proof, see [1] and [28]. The functions ψ and φ are called the θ-potential and
the η-potential, respectively. From the above proposition, the Hessians of θ-potential
and η-potential coincide with the semi-Riemannian metric h:
In addition, we obtain the original flat connection ⊂ and its dual ⊂ ∗ from the potential
function ψ. From Eq. (3.4), we have the cubic form of Hessian manifold (M, ⊂, h).
Then we obtain two affine connections ⊂ and ⊂ ∗ by Eqs. (3.1), (3.2) and (3.4).
Under the same assumptions as in Proposition 2, we define a function D on M ×M
by
n
D(p, r) := ψ(p) + φ(r) − θi (p)ηi (r), (p, r ∈ M).
i=1
1. D[ | ](p) = D(p, p) = 0,
2. D[X| ](p) = D[ |X](p) = 0,
3. h(X, Y ) := −D[X|Y ] (3.6)
is a semi-Riemannian metric on M.
By differentiating Eq. (3.6), two affine connections ⊂ and ⊂ ∗ are mutually dual with
respect to h. We can check that ⊂ and ⊂ ∗ are torsion-free, and ⊂h and ⊂ ∗ h are totally
symmetric. Hence triplets (M, ⊂, h) and (M, ⊂ ∗ , h) are statistical manifolds. We call
(M, ⊂, h) the induced statistical manifold from a contrast function D. If (M, ⊂, h) is
a Hessian manifold, we say that (M, ⊂, h) is the induced Hessian manifold from D.
Proof From the definition and Eq. (3.3), we have D[ | ] = 0 and D[X| ] = D[ |X] = 0.
Let {θi } be a ⊂-affine coordinate and {ηj } the dual affine coordinate of {θj }. Set
∂i = ∂/∂θi . From Eqs. (3.3) and (3.5), we have
D[∂i |∂j ](p) = (∂i )p (∂j )r D(p, q)|p=r = (∂j )r {ηi (p) − ηi (r)} |p=r
= −(∂j )r ηi (r)|p=r = −hij (p).
This implies that the canonical divergence D is a contrast function on M ×M. Induced
affine connections are given by
⎛ ⎝
Γij,k = −D[∂i ∂j |∂k ] = (∂i )p (∂k )r ηj (p) − ηj (r) |p=r
= −(∂i )p (∂k )r ηj (r)|p=r = 0,
62 H. Matsuzoe and M. Henmi
∗
⎛ ⎝
Γik,j = −D[∂j |∂i ∂k ] = (∂i )r (∂k )r ηj (p) − ηj (r) |p=r
= −(∂i )r (∂k )r ηj (r)|p=r = −(∂i )r (∂k )r (∂j )r ψ(r)|p=r
= Cikj ,
h̄(X, Y ) = eϕ h(X, Y ),
1+α 1−α
⊂¯ X Y = ⊂X Y − h(X, Y )gradh ϕ + {dϕ(Y ) X + dϕ(X) Y } ,
2 2
where gradh ϕ is the gradient vector field of ϕ with respect to h, that is,
h(gradh ϕ, X) := Xϕ.
(The vector field gradh ϕ is often called the natural gradient of ϕ in neurosciences,
etc.) We say that a statistical manifold (M, ⊂, h) is α-conformally flat if it is locally
α-conformally equivalent to some Hessian manifold [12].
Suppose that D and D̄ are contrast functions on M × M. We say that D and D̄ are
α-conformally equivalent if there exists a function ϕ on M such that
⎞ ⎠ ⎞ ⎠
1+α 1−α
D̄(p, r) = exp ϕ(p) exp ϕ(r) D(p, r).
2 2
In this case, induced statistical manifolds (M, ⊂, h) and (M, ⊂, ¯ h̄) from D and D̄,
respectively, are α-conformally equivalent.
Historically, conformal equivalence of statistical manifolds was introduced in
asymptotic theory of sequential estimation [27]. (See also [11].) Then it is generalized
in affine differential geometry (e.g. [10, 12, 13] and [17]). As we will see in Sects. 3.5
and 3.6, conformal structures on a deformed exponential family play important roles.
(See also [2, 20, 24] and [25].)
For a statistical model S, we define the Fisher information matrix gF (ξ) = (gijF (ξ))
by
⎥
∂ ∂
gijF (ξ) := log p(x; ξ) log p(x; ξ) p(x; ξ) dx (3.7)
∂ξ i ∂ξ j
Ω
= Ep [∂i lξ ∂j lξ ],
where ∂i = ∂/∂ξ i , lξ = l(x; ξ) = log p(x; ξ), and Ep [f ] is the expectation of f (x)
with respect to p(x; ξ). The Fisher information matrix gF is semi-positive definite
in general. Assuming that gF is positive definite and all components are finite, then
gF can be regarded as a Riemannian metric on S. We call gF the Fisher metric on S.
The Fisher metric gF has the following representations:
⎥
∂ ∂
gijF (ξ) = p(x; ξ) log p(x; ξ) dx (3.8)
∂ξ i ∂ξ j
Ω
⎥
1 ∂ ∂
= p(x; ξ) p(x; ξ) dx. (3.9)
p(x; ξ) ∂ξ i ∂ξ j
Ω
(α)
where Γij,k is the Christoffel symbol of the first kind of ⊂ (α) .
We remark that ⊂ (0) is the Levi-Civita connection with respect to the Fisher
metric gF . The connection ⊂ (e) := ⊂ (1) is called the the exponential connection and
⊂ (m) := ⊂ (−1) is called the mixture connection. Two connections ⊂ (e) and ⊂ (m) are
expressed as follows:
64 H. Matsuzoe and M. Henmi
⎥
(e)
Γij,k = Ep [(∂i ∂j lξ )(∂k lξ )] = ∂i ∂j log p(x; ξ)∂k p(x; ξ)dx, (3.10)
Ω
⎥
(m)
Γij,k = Ep [((∂i ∂j lξ + ∂i lξ ∂j lξ )(∂k lξ )] = ∂i ∂j p(x; ξ)∂k log p(x; ξ)dx. (3.11)
Ω
We can check that the α-connection ⊂ (α) is torsion-free and ⊂ (α) gF is totally
symmetric. These imply that (S, ⊂ (α) , gF ) forms a statistical manifold. In addition,
it is known that the Fisher metric gF and the α-connection ⊂ (α) are independent
of choice of dominating measures on Ω. Hence we call the triplet (S, ⊂ (α) , gF ) an
invariant statistical manifold. The cubic form C F of the invariant statistical manifold
(S, ⊂ (e) , gF ) is given by
(m) (e)
F
Cijk = Γij,k − Γij,k .
under a choice of suitable dominating measure, where F1 (x), . . . , Fn (x) are functions
on the sample space Ω, θ = (θ1 , . . . , θn ) is a parameter, and ψ(θ) is a function of θ for
normalization. The following proposition is well-known in information geometry [1].
4. Set the expectation of Fi (x) by ηi := Ep [Fi (x)]. Then {ηi } is the dual affine
coordinate system of {θi } with respect to gF .
5. Set φ(η) := Ep [log p(x; θ)]. Then φ(η) is the potential of gF with respect to {ηi }.
We call s(x; ξ) the score function of p(x; ξ) with respect to ξ. In information geometry,
si (x; ξ) is called the e-(exponential) representation of ∂/∂ξ i , and ∂/∂ξ i p(x; ξ) is
called the m-(mixture) representation. The duality of e- and m-representations is
important. In fact, Eq. (3.8) implies that the Fisher metric gF is nothing but an L 2
inner product of e- and m-representations.
Construction of the Kullback-Leibler divergence is as follows. We define a cross
entropy dKL (p, r) by
dKL (p, r) := −Ep [log r(x)].
A cross entropy dKL (p, r) gives a bias of information − log r(x) with respect to p(x).
A cross entropy is also called a yoke on S [4]. Intuitively, a yoke measures a dissim-
ilarity of two probability density functions on S. We should also note that the cross
entropy is obtained by taking the expectation with respect to p(x) of the integrated
score function at r(x). Then we have the Kullback-Leibler divergence by
In this section, we review the deformed exponential family. For more details, see
[3, 22, 23] and [26]. Geometry of deformed exponential families relates to so-called
U-geometry [21].
Let χ be a strictly increasing function from (0, ≡) to (0, ≡). We define a deformed
logarithm function (or a χ-logarithm function) by
66 H. Matsuzoe and M. Henmi
⎥s
1
logχ (s) := dt.
χ(t)
1
We remark that logχ (s) is strictly increasing and satisfies logχ (1) = 0. The do-
main and the target of logχ (s) depend on the function χ(t). Set U = {s ∈
(0, ≡) | | logχ (s)| < ≡} and V = {logχ (s) | s ∈ U}. Then logχ (s) is a function
from U to V . We also remark that the deformed logarithm is usually called the
φ-logarithm [23]. However, we use φ as the dual potential on a Hessian manifold.
A deformed exponential function (or a χ-exponential function) is defined by the
inverse of the deformed logarithm function logχ (s):
⎥t
expχ (t) := 1 + λ(s)ds,
0
s1−q − 1
logq (s) := , (s > 0),
1−q
1
expq (t) := (1 + (1 − q)t) 1−q , (1 + (1 − q)t > 0).
The function logq (s) is called the q-logarithm and expq (t) the q-exponential. Taking
the limit q ◦ 1, the standard logarithm and the standard exponential are recovered,
respectively.
A statistical model Sχ is said to be a deformed exponential family (or a χ-
exponential family) if
⎤ n ⎜
⎤
⎤
Sχ := p(x; θ) ⎤p(x; θ) = expχ θ Fi (x) − ψ(θ) , θ ∈ Θ ⊂ R ,
i n
⎤
i=1
under a choice of suitable dominating measure, where F1 (x), . . . , Fn (x) are functions
on the sample space Ω, θ = {θ1 , . . . , θn } is a parameter, and ψ(θ) is the function of θ
for normalization. We assume that Sχ is a statistical model in the sense of [1]. That is,
p(x; θ) has support entirely on Ω, there exits a one-to-one correspondence between
the parameter θ and the probability distribution p(x; θ), and differentiation and inte-
gration are interchangeable. In addition, functions {Fi (x)}, ψ(θ) and parameters {θi }
must satisfy the anti-exponential condition. For example, in the q-exponential case,
these functions satisfy
n
1
θi Fi (x) − ψ(θ) < .
q−1
i=1
3 Hessian Structures and Divergence Functions on Deformed Exponential Families 67
Then we can regard that Sχ is a manifold with local coordinate system {θi }. We also
assume that the function ψ is strictly convex since we consider Hessian metrics on
Sχ later. A deformed exponential family has several different definitions. See [30]
and [34], for example.
For a deformed exponential probability density p(x; θ) ∈ Sχ , we define the escort
distribution Pχ (x; θ) of p(x; θ) by
1
Pχ (x; θ) := χ{p(x; θ)},
Zχ (θ)
Set θi = logχ p(xi ) − logχ p(x0 ) = logχ ηi − logχ η0 , Fi (x) = δi (x) and ψ(θ) =
− logχ η0 . Then the χ-logarithm of p(x) ∈ Sn is written by
68 H. Matsuzoe and M. Henmi
n
logχ p(x) = logχ ηi − logχ η0 δi (x) + logχ (η0 )
i=1
n
= θi Fi (x) − ψ(θ).
i=1
⎞ ⎠ 1
1 1 − q (x − μ)2 1−q
pq (x; μ, σ) := 1− ,
Zq (σ) 3 − q σ2 +
where [∗]+ := max{0, ∗}, {μ, σ} are parameters −≡ < μ < ≡, 0 < σ < ≡, and
Zq (σ) is the normalization defined by
⎧→
⎪ →3 − q B 2 − q , 1 σ,
⎪
⎨ (−≡ < q < 1),
Zq (σ) := →1 − q 1 − q 2
⎪
⎪ 3−q 3−q 1
⎩→ B , σ, (1 ∅ q < 3).
q−1 2(q − 1) 2
Here, B (∗, ∗) is the beta function. We restrict ourselves to consider the case q ← 1.
Then the probability distribution pq (x; μ, σ) has its support entirely on R and the set
of q-normal distributions Sq is a statistical model. Set
2 μ 1 1
θ1 := {Zq (σ)}q−1 2 , θ2 := − {Zq (σ)}q−1 2 ,
3−q σ 3−q σ
(θ1 )2 {Zq (σ)}q−1 − 1
ψ(θ) := − 2 − ,
4θ 1−q
then we have
1
logq pq (x; θ) = ({pq (x; θ)}1−q − 1)
1−q
⎟
1 1 1 − q (x − μ)2
= 1 − − 1
1 − q {Zq (σ)}1−q 3 − q σ2
2μ{Zq (σ)}q−1 {Zq (σ)}q−1 2
= x − x
(3 − q)σ 2 (3 − q)σ 2
{Zq (σ)}q−1 μ2 {Zq (σ)}q−1 − 1
− · 2+
3−q σ 1−q
= θ1 x + θ2 x 2 − ψ(θ).
3 Hessian Structures and Divergence Functions on Deformed Exponential Families 69
μq = Eq,p [x] = μ,
σq2 = Eq,p (x − μ)2 = σ 2 .
∂
(sχ )i (x; θ) := logχ p(x; θ), (i = 1, . . . , n). (3.12)
∂θi
We call sχ (x; θ) the χ-score function of p(x; θ). Using the χ-score function, we
define a (0, 2)-tensor field gM on Sχ by
⎥
∂
gijM (θ) := ∂i p(x; θ)∂j logχ p(x; θ) dx, ∂i = i . (3.13)
∂θ
Ω
Proof From the definitions of gM and logχ , the tensor field gM is written as
⎥
gijM (θ) = χ(p(x; θ)) (Fi (x) − ∂i ψ(θ)) Fj (x) − ∂j ψ(θ) dx. (3.14)
Ω
⎥
gijE (θ) := ∂i logχ p(x; θ) ∂j logχ p(x; θ) Pχ (x; θ)dx
Ω
= Eχ,p [∂i lχ (θ)∂j lχ (θ)],
⎥
1
gij (θ) :=
N
(∂i p(x; θ)) ∂j p(x; θ) dx,
Pχ (x; θ)
Ω
where lχ (θ) = logχ p(x; θ). Obviously, gE and gN are generalizations of the Fisher
metic with respect to the representations (3.7) and (3.9), respectively.
Proposition 4 Let Sχ be a deformed exponential family. Then Riemannian metrics
gE , gM and gN are mutually conformally equivalent. In particular, the following
formulas hold:
1
Zχ (θ)gE (θ) = gM (θ) = gN (θ),
Zχ (θ)
From the above formula and the definitions of Riemannian metrics gE and gN , we
have
⎥
1
gijE (θ) = χ(p(x; θ)) (Fi (x) − ∂i ψ(θ)) Fj (x) − ∂j ψ(θ) dx,
Zχ (θ)
Ω
⎥
gijN (θ) = Zχ (θ) χ(p(x; θ)) (Fi (x) − ∂i ψ(θ)) Fj (x) − ∂j ψ(θ) dx.
Ω
These equations and Eq. (3.14) imply that Riemannian metrics gE , gM and gN are
mutually conformally equivalent.
Among the three possibilities of generalizations of the Fisher metric, gM is espe-
cially associated with a Hessian structure on Sχ , as we will see below. Although the
3 Hessian Structures and Divergence Functions on Deformed Exponential Families 71
From the definitions of the deformed exponential family and the deformed log-
M(e)
arithm function, Γij,k vanishes identically. Hence the connection ⊂ M(e) is flat,
and (⊂ M(e) , gM ) is a Hessian structure on Sχ . Denote by C M the cubic form of
(Sχ , ⊂ M(e) , gM ), that is,
⎥t
Vχ (t) := logχ (s) ds.
1
We assume that Vχ (0) = limt◦+0 Vχ (t) is finite. Then the generalized entropy
functional Iχ and the generalized Massieu potential Ψ are defined by
⎥
⎛ ⎝
Iχ (pθ ) := − Vχ (p(x; θ)) + (p(x; θ) − 1)Vχ (0) dx,
Ω
⎥
Ψ (θ) := p(x; θ) logχ p(x; θ)dx + Iχ (pθ ) + ψ(θ),
Ω
4. Set the expectation of Fi (x) by ηi := Ep [Fi (x)]. Then {ηi } is a ⊂ M(m) -affine
coordinate system on Sχ and the dual of {θi } with respect to gM .
5. Set Φ(η) := −Iχ (pθ ). Then Φ(η) is the potential of gM with respect to {ηi }.
Then we have
⎥s
d
Vχ (s) = s logχ (s) − t logχ (t) dt
dt
1
logχ (s)
⎥
= s logχ (s) − expχ (u)du
0
= s logχ (s) − Uχ (logχ (s)).
Since ∂/∂θi Vχ (p(x; θ)) = (∂/∂θi p(x; θ)) logχ p(x; θ), we have
∂ ∂
p(x; θ) logχ p(x; θ) = i Uχ (logχ p(x; θ)).
∂θi ∂θ
Hence, by integrating the bias corrected χ-score function at r(x; θ) ∈ Sχ with respect
to θ, and by taking the standard expectation with respect to p(x; θ), we define a
χ-cross entropy of Bregman type by
⎥ ⎥
dχM (p, r) = − p(x) logχ r(x)dx + Uχ (logχ r(x))dx.
Ω Ω
This score function is nothing but a weighted score function in robust statistics. The
χ-divergence constructed from the bias corrected q-score function coincides with
the β-divergence (β = 1 − q):
respectively [3]. In the q-exponential case, we denote the χ-Fisher metric by gq , and
the χ-cubic form by C q . We call gq and C q a q-Fisher metric and a q-cubic form,
respectively.
Let ⊂ χ(0) be the Levi-Civita connection with respect to the χ-Fisher metric gχ .
Then a χ-exponential connection⊂ χ(e) and a χ-mixture connection⊂ χ(m) are defined
by
χ(e) 1
χ(0)
gχ (⊂X Y , Z) := gχ (⊂X
Y , Z) − C χ (X, Y , Z),
2
χ(m) χ(0) 1
gχ (⊂X Y , Z) := gχ (⊂X Y , Z) + C χ (X, Y , Z),
2
74 H. Matsuzoe and M. Henmi
Suppose that sχ (x; θ) is the χ-score function defined by (3.12). The χ-score is unbi-
ased with respect to χ-expectation, that is, Eχ,p [(sχ )i (x; θ)] = 0. Hence we regard
that sχ (x; θ) is a generalization of unbiased estimating functions.
By integrating a χ-score function, we define the χ-cross entropy by
The generalized relative entropy Dχ (p, r) coincides with the canonical divergence
D(r, p) for (Sχ , ⊂ χ(e) , gχ ). In fact, from (3.15), we can check that
n n ⎜
χ ≤ ≤ i ≤
D (p(θ), p(θ )) = Eχ,p θ Fi (x) − ψ(θ) −
i
(θ ) Fi (x) − ψ(θ )
i=1 i=1
n
n
= ψ(θ≤ ) + θi ηi − ψ(θ) − (θ≤ )i ηi = D(p(θ≤ ), p(θ)).
i=1 i=1
Divergence functions for (Sq , ⊂ q(e) , gq ) and (Sq , ⊂ (2q−1) , gF ) are given as fol-
lows. The α-divergence D(α) (p, r) with α = 1 − 2q is defined by
⎧ ⎫
⎨ ⎥ ⎬
1
D(1−2q) (p, r) := 1− p(x)q r(x)1−q dx .
q(1 − q) ⎩ ⎭
Ω
On the other hand, the normalized Tsallis relative entropy DqT (p, r) is defined by
⎥
DqT (p, r) := Pq (x) logq p(x) − logq r(x) dx
Ω
= Eq,p [logq p(x) − logq r(x)].
We remark that the invariant statistical manifold (Sq , ⊂ (1−2q) , gF ) is induced from
the α-divergence with α = 1 − 2q, and that the Hessian manifold (Sq , ⊂ q(e) , gq )
is induced from the dual of the normalized Tsallis relative entropy. In fact, for a
q-exponential family Sq , divergence functions have the following relations:
In this section, we generalize the maximum likelihood method from the viewpoint of
generalized independence. To avoid complicated arguments, we restrict ourselves to
consider the q-exponential case. However, we can generalize it to the χ-exponential
case (cf. [8, 9]).
Let X and Y be random variables which follow probability distributions p1 (x) and
p2 (y), respectively. We say that two random variables X and Y are independent if the
joint probability p(x, y) is decomposed by a product of marginal distributions p1 (x)
and p2 (Y ):
p(x, y) = p1 (x)p2 (y).
When p1 (x) > 0 and p2 (y) > 0, the independence can be written with an exponential
function and a logarithm function by
p(x, y) = exp log p1 (x) + log p2 (x) .
in other words,
N
logq Lq (ξ) = logq p(xi ; ξ).
i=1
n
N
= θi Fi (xj ) − Nψ(θ).
i=1 j=1
N
∂i logq Lq (θ) = Fi (xj ) − N∂i ψ(θ) = 0.
j=1
1
N
η̂i = Fi (xj ).
N
j=1
On the other hand, the canonical divergence for (Sq , ⊂ q(e) , gq ) can be calculated
as
78 H. Matsuzoe and M. Henmi
3.8 Conclusion
In this paper, we considered two Hessian structures from the viewpoints of the stan-
dard expectation and the χ-expectation. Though the former and the later are known as
U-geometry ([21, 26]) and χ-geometry ([3]), respectively, they turn out to be different
Hessian structures in the same deformed exponential family through a comparison
of each other.
We note that, from the viewpoint of estimating functions, the former is geometry
of bias-corrected χ-score functions with the standard expectation, whereas the later
is geometry of unbiased χ-score functions with the χ-expectation.
As an application to statistics, we considered generalization of maximum like-
lihood method for q-exponential family. We used the normalized Tsallis relative
entropy for orthogonal projection, whereas the previous results used χ-divergences
of Bregman type.
Acknowledgments The authors would like to express their sincere gratitude to the anonymous
reviewers for constructive comments for preparation of this paper. The first named author is partially
supported by JSPS KAKENHI Grant Number 23740047.
3 Hessian Structures and Divergence Functions on Deformed Exponential Families 79
References
1. Amari, S., Nagaoka, H.: Method of Information Geometry. American Mathematical Society,
Providence, Oxford University Press, Oxford (2000)
2. Amari, S., Ohara, A.: Geometry of q-exponential family of probability distributions. Entropy
13, 1170–1185 (2011)
3. Amari, S., Ohara, A., Matsuzoe, H.: Geometry of deformed exponential families: invariant,
dually-flat and conformal geometry. Phys. A. 391, 4308–4319 (2012)
4. Barondorff-Nielsen, O.E., Jupp, P.E.: Statistics, yokes and symplectic geometry. Ann. Facul.
Sci. Toulouse 6, 389–427 (1997)
5. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimising
a density power divergence. Biometrika 85, 549–559 (1998)
6. Borgesa, E.P.: A possible deformed algebra and calculus inspired in nonextensive thermosta-
tistics. Phys. A 340, 95–101 (2004)
7. Eguchi, S.: Geometry of minimum contrast. Hiroshima Math. J. 22, 631–647 (1992)
8. Fujimoto, Y., Murata, N.: A generalization of independence in naive bayes model. Lect. Notes
Comp. Sci. 6283, 153–161 (2010)
9. Fujimoto Y., Murata N.: A generalisation of independence in statistical models for categorical
distribution. Int. J. Data Min. Model. Manage. 2(4), 172–187 (2012)
10. Ivanov, S.: On dual-projectively flat affine connections. J. Geom. 53, 89–99 (1995)
11. Kumon, M., Takemura, A., Takeuchi, K.: Conformal geometry of statistical manifold with
application to sequential estimation. Sequential Anal. 30, 308–337 (2011)
12. Kurose, T.: On the divergences of 1-conformally flat statistical manifolds. Tôhoku Math. J. 46,
427–433 (1994)
13. Kurose, T.: Conformal-projective geometry of statistical manifolds. Interdiscip. Inform. Sci.
8, 89–100 (2002)
14. Lauritzen, S. L.: Statistical Manifolds, Differential Geometry in Statistical Inferences, IMS
Lecture Notes Monograph Series, vol. 10, pp. 96–163. Hayward, California (1987)
15. Matsuzoe, H.: Geometry of contrast functions and conformal geometry. Hiroshima Math. J.
29, 175–191 (1999)
16. Matsuzoe, H.: Geometry of statistical manifolds and its generalization. In: Proceedings of the
8th International Workshop on Complex Structures and Vector Fields, pp. 244–251. World
Scientific, Singapore (2007)
17. Matsuzoe, H.: Computational geometry from the viewpoint of affine differential geometry.
Lect. Notes Comp. Sci. 5416, 103–113 (2009)
18. Matsuzoe, H.: Statistical manifolds and geometry of estimating functions, pp. 187–202. Recent
Progress in Differential Geometry and Its Related Fields World Scientific, Singapore (2013)
19. Matsuzoe, H., Henmi, M.: Hessian structures on deformed exponential families. Lect. Notes
Comp. Sci. 8085, 275–282 (2013)
20. Matsuzoe, H., Ohara, A.: Geometry for q-exponential families. In: Recent progress in differ-
ential geometry and its related fields, pp. 55–71. World Scientific, Singapore (2011)
21. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of u-boost and
bregman divergence. Neural Comput. 16, 1437–1481 (2004)
22. Naudts, J.: Estimators, escort probabilities, and φ-exponential families in statistical physics. J.
Ineq. Pure Appl. Math. 5, 102 (2004)
23. Naudts, J.: Generalised Thermostatistics, Springer, New York (2011)
24. Ohara, A.: Geometric study for the legendre duality of generalized entropies and its application
to the porous medium equation. Euro. Phys. J. B. 70, 15–28 (2009)
25. Ohara, A., Matsuzoe H., Amari S.: Conformal geometry of escort probability and its applica-
tions. Mod. Phys. Lett. B. 10, 26:1250063 (2012)
26. Ohara A., Wada, T.: Information geometry of q-Gaussian densities and behaviors of solutions
to related diffusion equations. J. Phys. A: Math. Theor. 43, 035002 (2010)
27. Okamoto, I., Amari, S., Takeuchi, K.: Asymptotic theory of sequential estimation procedures
for curved exponential families. Ann. Stat. 19, 961–961 (1991)
80 H. Matsuzoe and M. Henmi
28. Shima, H.: The Geometry of Hessian Structures, World Scientific, Singapore (2007)
29. Suyari, H., Tsukada, M.: Law of error in tsallis statistics. IEEE Trans. Inform. Theory 51,
753–757 (2005)
30. Takatsu, A.: Behaviors of ϕ-exponential distributions in wasserstein geometry and an evolution
equation. SIAM J. Math. Anal. 45, 2546–2546 (2013)
31. Tanaka, M.: Meaning of an escort distribution and τ -transformation. J. Phys.: Conf. Ser. 201,
012007 (2010)
32. Tsallis, C.: Possible generalization of boltzmann—gibbs statistics. J. Stat. Phys. 52, 479–487
(1988)
33. Tsallis, C.: Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World.
Springer, New York (2009)
34. Vigelis, R.F., Cavalcante, C.C.: On φ-families of probability distributions. J. Theor. Probab.
21, 1–25 (2011)
Chapter 4
Harmonic Maps Relative to α-Connections
Keiko Uohashi
Abstract In this paper, we study harmonic maps relative to α-connections, but not
necessarily relative to Levi-Civita connections, on Hessian domains. For the purpose,
we review the standard harmonic map and affine harmonic maps, and describe the
conditions for harmonicity of maps between level surfaces of a Hessian domain in
terms of the parameter α and the dimension n. To illustrate the theory, we describe
harmonic maps between the level surfaces of convex cones.
4.1 Introduction
Harmonic maps are important objects in certain branches of geometry and physics.
Geodesics on Riemannian manifolds and holomorphic maps between Kähler man-
ifolds are typical examples of harmonic maps. In addition a harmonic map has a
variational characterization by the energy of smooth maps between Riemannian
manifolds and several existence theorems for harmonic maps are already known. On
the other hand the notion of a Hermitian harmonic map from a Hermitian manifold
to a Riemannian manifold was introduced and investigated by [4, 8, 10]. It is not
necessary a harmonic map if the domain Hermitian manifold is non-Kähler. The sim-
ilar results are pointed out for affine harmonic maps, which is analogy to Hermitian
harmonic maps [7].
Statistical manifolds have mainly been studied in terms of their affine geome-
try, information geometry, and statistical mechanics [1]. For example, Shima estab-
lished conditions for harmonicity of gradient mappings of level surfaces on a Hessian
domain, which is a typical example of a dually flat statistical manifold [14]. Level
surfaces on a Hessian domain are known as 1- and (−1)-conformally flat statistical
K. Uohashi (B)
Department of Mechanical Engineering and Intelligent Systems, Faculty of Engineering,
Tohoku Gakuin University, 1-13-1 Chuo, Tagajo, Miyagi 985-8537, Japan
e-mail: [email protected]
manifolds for primal and dual connections, respectively [17, 19]. The gradient
mappings are then considered to be harmonic maps relative to the dual connection,
i.e., the (−1)-connection [13].
In this paper, we review the notions of harmonic maps, affine harmonic maps
and α-affine harmonic maps, and investigate different kinds of harmonic maps rela-
tive to α-connections. In Sect. 4.2, we give definitions of an affine harmonic map, a
harmonic map and the standard Laplacian. In Sect. 4.3, we explain the generalized
Laplacian which defines a harmonic map relative to an affine connection. In Sect. 4.4,
we present the Laplacian of a gradient mapping on a Hessian domain, as an example
of the generalized Laplacian. Moreover, we compare the harmonic map defined by
Shima with an affine harmonic map defined in Sect. 4.2. In Sect. 4.5, α-connections
of statistical manifolds are explained. In Sect. 4.6, we define α-affine harmonic maps
which are generalization of affine harmonic maps and also a generalization of har-
monic maps defined by Shima. In Sect. 4.7, we describe the α-conformal equiva-
lence of statistical manifolds and a harmonic map relative to two α-connections. In
Sect. 4.8, we review α-conformal equivalence of level surfaces of a Hessian domain.
In Sect. 4.9, we study harmonic maps of level surfaces relative to two α-connections,
for examples of a harmonic map in Sect. 4.7, and provide examples on level surfaces
of regular convex cones.
Shima [13] investigated harmonic maps of n-dimensional level surfaces into
an (n + 1)-dimensional dual affine space, rather than onto other level surfaces.
Although Nomizu and Sasaki calculated the Laplacian of centro-affine immer-
sions into an affine space, which generate projectively flat statistical manifolds
(i.e. (−1)-conformally flat statistical manifolds), they did not discuss any harmonic
maps between two centro-affine hypersurfaces [12]. Then, we study harmonic maps
between hypersurfaces with the same dimension relative to general α-connections
that may not satisfy α = −1 or 0 (where the 0-connection implies the Levi-Civita
connection). In particular, we demonstrate the existence of non-trivial harmonic maps
between level surfaces of a Hessian domain with α-parameters and the dimension n.
g = gij dx i dx j
on M satisfying locally
∂2ϕ
gij = (4.1)
∂x i ∂x j
4 Harmonic Maps Relative to α-Connections 83
for a convex function ϕ, M is said to be a Kähler affine manifold [2, 7]. A matrix [gij ] is
positive definite and defines a Riemannian metric. Then for the Kähler affine manifold
M, (M, D, g) is a Hessian manifold, where D is a canonical flat affine connection for
{x 1 , . . . , x m }. We will mention details of Hessian manifolds and Hessian domains in
later sections of this paper.
The Kähler affine structure (4.1) defines an affinely invariant operator L by
m
∂2
L= gij . (4.2)
∂x i ∂x j
i,j=1
Lf = 0.
For a Kähler affine manifold (M, g) and a Riemannian manifold (N, h), a smooth
map φ : M ⊂ N is said to be affine harmonic if
⎛
m
∂ 2 φγ
n
γ ∂φ δ ∂φβ
g ij i j + Γˆδβ i ⎝ = 0, γ = 1, . . . , n, (4.3)
∂x ∂x ∂x ∂x j
i,j=1 δ,β=1
Δf = 0,
m
τ (φ)(x) = (∈˜ eLC
i
φ∇ ei − φ∇ ∈eLC
i
ei )(x), x ∈ M, (4.6)
i=1
∈˜ eLC
i
φ∇ ei = ∈ˆ φLC φ e;
∇ ei ∇ i
the pull-back connection,
and ∈ LC , ∈ˆ LC are the Levi-Civita connections for g, h, respectively. For local coor-
dinate systems {x 1 , . . . , x m } and {y1 , . . . , yn } on M and N, the γ-th component of
τ (φ) at x ∈ M is described by
⎩ ⎫
m ⎤ ∂ 2 φγ
m
∂φ γ
n
∂φ ∂φ ⎬
δ β
τ (φ)γ (x) = g ij − Γ k
(x) + Γˆ γ (φ(x))
⎥ ∂x i ∂x j ij δβ ∂x i ∂x j ⎭
∂x k
i,j=1 k=1 δ,β=1
(4.7)
m
n
γ ∂φδ ∂φβ
= Δφγ + g ij Γˆδβ (φ(x)) ,
∂x i ∂x j
i,j=1 δ,β=1
φδ = yδ ◦ φ, γ = 1, . . . , n,
where
n
∂
τ (φ)(x) = τ (φ)γ (x) ,
∂yγ
γ=1
γ
and Γijk , Γˆδβ are the Christoffel symbols of ∈ LC , ∈ˆ LC , respectively. The original
definition of a harmonic map is described in [3, 21], and so on.
Remark 1 Term (4.5) is not equal to the definition (4.2). Hence an affine harmonic
function is not necessary a harmonic function.
Remark 2 Term (4.7) is not equal to the definition (4.3). Hence an affine harmonic
map is not necessary a harmonic map.
4 Harmonic Maps Relative to α-Connections 85
m
τ (φ) = (∈ˆ ei (φ∇ ei ) − φ∇ (∈eLC
i
ei )) ∈ Γ (φ−1 TN) (4.8)
i=1
m ⎞ ⎠ ⎞ ⎠
∂ LC ∂
= g ∈ˆ
ij
∂ φ ∇ j − φ∇ ∈ ∂ ,
∂x i ∂x ∂x i ∂x
j
i,j=1
m
τ (φ) = (∈ˆ ei (φ∇ ei ) − φ∇ (∈eLC
i
ei )) ≡ 0.
i=1
ˆ φ = τ (φ) : M ⊂ V .
Δφ = Δ(g,∈) (4.9)
For V = R, Δφ defined by Eqs. (4.8) and (4.9) coincides with the standard Laplacian
for a function defined by (4.4).
See in [12] for an affine immersion and the Laplacian of a map, and see in [13,
14] for the gradient mapping and the Laplacian on a Hessian domain.
of An+1 , the gradient mapping ι from a Hessian domain (Ω, D, g = Ddϕ) into
(A∇n+1 , D∇ ) is defined by
∂ϕ
xi∇ ◦ ι = − i .
∂x
where DX∇ ι∇ (Y ) denotes the covariant derivative along ι induced by the canonical flat
affine connection D∇ on A∇n+1 .
The Laplacian of ι with respect to (g, D∇ ) is given by
⎞ ⎞ ⎠⎠
∂ ∂
Δ(g,D∇ ) ι = g ij D∇∂ ι∇ j − ι∇ ∈ LC ∂
∂x i ∂x ∂x i ∂x
j
i,j
⎩ ⎫
⎤ ∂ ⎬
= ι∇ g ij (D→ − ∈ LC ) ∂ (4.11)
⎥ ∂x i ∂x ⎭
j
i,j
⎩ ⎫
⎤ ∂ ⎬
= ι∇ g ij (∈ LC − D) ∂ (4.12)
⎥ ∂x i ∂x ⎭
j
i,j
⎧ ⎨
∂
= ι∇ αKi i ,
∂x
i
D + D→
= ∈ LC .
2
Details of dual affine connections are described in later sections.
Let ∈ LC∇ be the Levi-Civita connection
⎜ for the Hessian metric g ∇ = D∇ dϕ∇ on the
∇ ∇
dual domain Ω = ι(Ω), where ϕ = i x (∂ϕ/∂x i ) − ϕ is the Legendre transform
i
∂ ∂
Δ(g,D∇ ) ι = g ij {∈ LC∇
∂ (ι∇ ) − ι∇ (D ∂ )}.
∂x i ∂x j ∂x i ∂x
j
i,j
where ιi (x) = xi∇ ◦ ι(x). Therefore, if the gradient mapping ι is a harmonic map with
respect to (g, D∇ ), i.e., if Δ(g,D∇ ) ι ≡ 0, we have
⎛
∂ 2 ιγ γ ∂ιδ ∂ιβ
g ij + Γδβ ⎝ = 0, γ = 1, . . . , n + 1. (4.13)
∂x i ∂x j ∂x i ∂x j
i,j δ,β
In [13, 14], Shima studied an affine harmonic map with the restriction of the
gradient mapping ι to a level surface of a convex function ϕ.
The author does not clearly distinguish a phrase “relative to something” with a
phrase “with respect to something”.
We recall some definitions that are essential to the theory of statistical manifolds and
relate α-connections to Hessian domains.
Given a torsion-free affine connection ∈ and a pseudo-Riemannian metric h on a
manifold N, the triple (N, ∈, h) is said to be a statistical manifold if ∈h is symmetric.
If the curvature tensor R of ∈ vanishes, (N, ∈, h) is said to be flat.
Let (N, ∈, h) be a statistical manifold and let ∈ → be an affine connection on N
such that
where Γ (TN) is the set of smooth tangent vector fields on N. The affine connection
∈ → is torsion free and ∈ → h is symmetric. Then ∈ → is called the dual connection of
88 K. Uohashi
∈. The triple (N, ∈ → , h) is the dual statistical manifold of (N, ∈, h), and (∈, ∈ → , h)
defines the dualistic structure on N. The curvature tensor of ∈ → vanishes if and only if
the curvature tensor of ∈ also vanishes. Under these conditions, (∈, ∈ → , h) becomes
a dually flat structure.
Let N be a manifold with a dualistic structure (∈, ∈ → , h). For any α ∈ R, an affine
connection defined by
1+α 1−α →
∈ (α) := ∈+ ∈ (4.14)
2 2
is called an α-connection of (N, ∈, h). The triple (N, ∈ (α) , h) is also a statistical
manifold, and ∈ (−α) is the dual connection of ∈ (α) . The 1-connection ∈ (1) , the
(−1)-connection ∈ (−1) , and the 0-connection ∈ (0) correspond to the ∈, ∈ → , and the
Levi-Civita connection of (N, h), respectively. An α-connection does not need to be
flat.
A Hessian domain is a flat statistical manifold. Conversely, a local region of a
flat statistical manifold is a Hessian domain. For the dual connection D→ defined by
(4.10), (Ω, D→ , g) is the dual statistical manifold of (Ω, D, g) if a Hessian domain
(Ω, D→ , g) is a statistical manifold [1, 13, 14].
1+α 1−α →
D(α) = D+ D
2 2
⎞ ⎠ ⎞ ⎠
(α)∇ ∂ LC ∂
Δ(g,D(α)∇ ) ι = g D∂
ij
ι ∇ j − ι∇ ∈ ∂
∂x i ∂x ∂x i ∂x
j
i,j
⎩ ⎫
⎤ ∂ ⎬
= ι∇ g ij (D(−α) − ∈ LC ) ∂
⎥ ∂x i ∂x ⎭
j
i,j
⎩ ⎫
⎤ ∂ ⎬
= ι∇ g ij (∈ LC − D(α) ) ∂
⎥ ∂x i ∂x ⎭
j
i,j
4 Harmonic Maps Relative to α-Connections 89
⎞ ⎠
∂ (α) ∂
= g ij ∈ LC∇
∂ (ι ∇ ) − ι ∇ D ∂ .
∂x i ∂x j ∂x i ∂x
j
i,j
If Δ(g,D(α)∇ ) ι ≡ 0, we have
⎩ ⎫
⎤ ∂ 2 ιγ ∂ι γ γ ∂ιδ ∂ιβ ⎬
g ij − (1 − α) Γijk k + Γˆδβ i j = 0,
⎥ ∂x i ∂x j ∂x ∂x ∂x ⎭
i,j k δ,β
γ = 1, . . . , n + 1.
Definition 1 For a Kähler affine manifold (M, g) and a Riemannian manifold (N, h),
a map φ : M ⊂ N is said to be an α-affine harmonic map if
⎛
∂ 2 φγ ∂φγ γ ∂φδ ∂φβ
g ij − (1 − α) Γijk + Γˆδβ ⎝ = 0, (4.15)
∂x i ∂x j ∂x k ∂x i ∂x j
i,j k δ,β
γ = 1, . . . , dim N.
Then we obtain that the gradient mapping ι is a harmonic map with respect to
(g, D(α)∇ ) if and only if the map ι : (Ω, D(α) ) ⊂ (A∇n+1 , ∈ ∇ ) is an α-affine harmonic
map.
Remark 3 For α = 1, a 1-affine harmonic map is an affine harmonic map.
They are problems to find applications of α-affine harmonic maps and to investi-
gate them.
for X, Y and Z ∈ Γ (TN). Two statistical manifolds (N, ∈, h) and (N, ∈, ¯ h̄) are
α-conformally equivalent if and only if the dual statistical manifolds (N, ∈ → , h) and
(N, ∈¯ → , h̄) are (−α)-conformally equivalent. A statistical manifold (N, ∈, h) is said
to be α-conformally flat if (N, ∈, h) is locally α-conformally equivalent to a flat
statistical manifold [19].
Let (N, ∈, h) and (N, ∈, ¯ h̄) be α-conformally equivalent statistical manifolds of
dim n ∅ 2, and {x , . . . x } a local coordinate system on N. Suppose that h and h̄
1 n
are Riemannian metrices. We set hij = h(∂/∂x i , ∂/∂x j ) and [hij ] = [hij ]−1 . Let
πid : (N, ∈, h) ⊂ (N, ∈, ¯ h̄) be the identity map, i.e., πid (x) = x for x ∈ N, and πid∇
the differential of πid .
We define a harmonic map relative to (h, ∈, ∈) ¯ as follows:
¯ (πid ) ≡ 0,
τ(h,∈,∈)
the map πid : (N, ∈, h) ⊂ (N, ∈, ¯ h̄) is said to be a harmonic map relative to
¯ where the tension field is defined by
(h, ∈, ∈),
n ⎞ ⎠ ⎞ ⎠
∂ ∂ −1
τ(h,∈,∈)
¯ (πid ) : = hij ∈¯ ∂ πid∇ ( j ) − πid∇ ∈ ∂ ∈ Γ (πid TN)
∂x i ∂x ∂x i ∂x
j
i,j=1
n
∂ ∂
= hij (∈¯ ∂ −∈ ∂ ) ∈ Γ (TN). (4.18)
∂x i ∂x j ∂x i ∂x j
i,j=1
n ⎞ ⎠ ⎞ ⎠
1+α ∂ ∂ ∂
= h −
ij
dφ h ,
2 ∂x k ∂x i ∂x j
i,j=1
⎞ ⎠ ⎞ ⎠
1−α ∂ ∂ ∂
+ dφ h ,
2 ∂x i ∂x j ∂x k
⎞ ⎠ ⎞ ⎠
∂ ∂ ∂
+ dφ h ,
∂x j ∂x i ∂x k
n ⎞ ⎠
1 + α ∂φ 1 − α ∂φ ∂φ
= h −
ij
hij + hjk + j hik
2 ∂x k 2 ∂x i ∂x
i,j=1
⎩ ⎛⎫
⎤ 1+α ∂φ 1 − α ∂φ
n n
∂φ ⎝⎬
= − ·n· k + δik + δjk
⎥ 2 ∂x 2 ∂x i ∂x j ⎭
i=1 j=1
⎞ ⎠
1+α 1−α ∂φ
= − ·n+ ·2
2 2 ∂x k
1 ∂φ
= − {(n + 2) α + (n − 2)} k ,
2 ∂x
ι̂ ◦ π = eλ ι,
92 K. Uohashi
where ι (as denoted above) is the restriction of the gradient mapping ι to M. Let D̄→
be an affine connection on M defined by
The following theorem has been proposed elsewhere (cf. [9, 11]).
Theorem 3 ([20]) For affine connections D→ and D̄→ on M, the following are true:
From the duality of D̂ and D̂→ , D̄ is the dual connection of D̄→ on M. Then the next
theorem holds (cf. [6, 9]).
For α-connections D(α) and D̄(α) = D(−α) defined similarly to (4.14), we obtain
the following corollary by Theorem 3, Theorem 4, and Eq. (4.17) with φ = λ [15].
Corollary 1 For affine connections D(α) and D̄(α) on M, (M, D(α) , g) and (M, D̄(α) ,
ḡ) are α-conformally equivalent.
Definition 3 ([16, 18]) If a tension field τ(g,D(α) ,D̂(α) ) (π) vanishes on M, i.e.,
the map π : (M, D(α) , g) ⊂ (M̂, D̂(α) , ĝ) is said to be a harmonic map relative to
(g, D(α) , D̂(α) ), where the tension field is defined by
⎞ ⎠ ⎧ ⎨
n
(α) ∂ (α) ∂
τ(g,D(α) ,D̂(α) ) (π) := g ij D̂ ∂ π∇ ( j ) − π ∇ D ∂ ∈ Γ (π −1 T M̂).
∂x i
∂x ∂x i ∂x j
i,j=1
(4.19)
Theorem 5 ([16, 18]) Let (M, D(α) , g) and (M̂, D̂(α) , ĝ) be simply connected
n-dimensional level surfaces of an (n + 1)-dimensional Hessian domain (Ω, D, g)
with n ∅ 2. If α = −(n − 2)/(n + 2) or λ is a constant function on M, a map
π : (M, D(α) , g) ⊂ (M̂, D̂(α) , ĝ) is a harmonic map relative to (g, D(α) , D̂(α) ),
where
ι̂ ◦ π = eλ ι, (eλ )(p) = eλ(p) , eλ(p) ι(p) ∈ ι̂(M̂), p ∈ M,
and ι, ι̂ are the restrictions of the gradient mappings on Ω to M and M̂, respectively.
Proof The tension field of the map π relative to (g, D(α) , D̂(α) ) is described by the
pull-back of (M̂, D̂(α) , ĝ), namely (M, D̄(α) , ḡ), as follows:
n ⎞ ⎞ ⎠⎠ ⎞ ⎠
(α) ∂ (α) ∂
τ(g,D(α) ,D̂(α) ) (π) = g ij D̂ ∂ π∇ − π∇ D ∂
∂x i ∂x j ∂x i ∂x
j
i,j=1
n ⎞ ⎠ ⎞ ⎠
(α) ∂ (α) ∂
= g ij π∇ D̄ ∂ − π ∇ D ∂
∂x i ∂x ∂x i ∂x
j j
i,j=1
⎛
n ⎞ ⎠
(α) ∂ (α) ∂
= π∇ g ij D̄ ∂ −D ∂ ⎝
∂x i ∂x ∂x i ∂x
j j
i,j=1
n ⎞ ⎠
λ (α) ∂ (α) ∂
τ(g,D(α) ,D̂(α) ) (π) = e g D̄ ∂
ij
−D ∂ .
∂x i ∂x ∂x i ∂x
j j
i,j=1
By Corollary 1, (M, D(α) , g) and (M, D̄(α) , ḡ) are α-conformally equivalent, so that
Eq. (4.17) holds with φ = λ, h = g, ∈ = D(α) , and ∈¯ = D̄(α) for X, Y and
Z ∈ Γ (TM). Thus, for all k ∈ {1, . . . , n},
⎞ ⎠
∂
g τ(g,D(α) ,D̂(α) ) (π) , k
∂x
94 K. Uohashi
⎛
n ⎞ ⎠
(α) ∂ (α) ∂ ∂
= g eλ g ij D̄ ∂ −D ∂ , k⎝
∂x i ∂x ∂x i ∂x ∂x
j j
i,j=1
n ⎞ ⎠ ⎞ ⎠
λ 1+α ∂ ∂ ∂
=e g −ij
dλ g ,
2 ∂x k ∂x i ∂x j
i,j=1
⎞ ⎠ ⎞ ⎠
1−α ∂ ∂ ∂
+ dλ g ,
2 ∂x i ∂x j ∂x k
⎞ ⎠ ⎞ ⎠
∂ ∂ ∂
+ dλ g ,
∂x j ∂x i ∂x k
n ⎞ ⎠
λ 1 + α ∂λ 1 − α ∂λ ∂λ
=e g −
ij
gij + gjk + j gik
2 ∂x k 2 ∂x i ∂x
i,j=1
⎩ ⎛⎫
⎤ 1+α ∂λ 1 − α n
∂λ n
∂λ ⎬
= eλ − ·n· k + δ ik + δ jk ⎝
⎥ 2 ∂x 2 ∂x i ∂x j ⎭
i=1 j=1
⎞ ⎠
1+α 1−α ∂λ
= − ·n+ · 2 eλ k
2 2 ∂x
1 ∂λ
= − {(n + 2) α + (n − 2)} eλ k .
2 ∂x
Example 1 (Regular convex cone) Let Ω and ψ be a regular convex cone and its
characteristic function, respectively. On the Hessian domain (Ω, D, g = Dd log ψ),
d log ψ is invariant under a 1-parameter group of dilations at the vertex p of Ω, i.e.,
x −⊂ et (x − p) + p, t ∈ R [5, 14]. Then, under these dilations, each map between
level surfaces of log ψ is also a dilated map in the dual coordinate system. Hence,
each dilated map between level surfaces of log ψ in the primal coordinate system is
a harmonic map relative to an α-connection for any α ∈ R.
4 Harmonic Maps Relative to α-Connections 95
Example 2 (Symmetric cone) Let Ω and ψ = Det be a symmetric cone and its
characteristic function, respectively, where Det is the determinant of the Jordan
algebra that generates the symmetric cone. Then, similar to Example 1, each dilated
map at the origin between level surfaces of log ψ on the Hessian domain (Ω, D, g =
Dd log ψ) is a harmonic map relative to an α-connection for any α ∈ R
It is an important problem to find applications of non-trivial harmonic maps rel-
ative to α-connections.
Acknowledgments The author thanks the referees for their helpful comments.
References
1. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society,
Providence, Oxford University Press, Oxford (2000)
2. Cheng, S.Y., Yau, S.T.: The real Monge-Ampère equation and affine flat structures. In: Chern,
S.S., Wu, W.T. (eds.) Differential Geometry and Differential Equations, Proceedings of the
1980 Beijing Symposium, Beijing, pp. 339–370 (1982)
3. Eelles, J., Lemaire, L.: Selected Topics in Harmonic Maps. American Mathematical Society,
Providence (1983)
4. Grunau, H.C., Kühnel, M.: On the existence of Hermitian-harmonic maps from complete
Hermitian to complete Riemannian manifolds. Math. Z. 249, 297–327 (2005)
5. Hao, J.H., Shima, H.: Level surfaces of non-degenerate functions in r n+1 . Geom. Dedicata 50,
193–204 (1994)
6. Ivanov, S.: On dual-projectively flat affine connections. J. Geom. 53, 89–99 (1995)
7. Jost, J., Şimşir, F.M.: Affine harmonic maps. Analysis 29, 185–197 (2009)
8. Jost, J., Yau, S.T.: A nonlinear elliptic system for maps from Hermitian to Riemannian manifolds
and rigidity theorems in Hermitian geometry. Acta Math. 170, 221–254 (1993)
9. Kurose, T.: On the divergence of 1-conformally flat statistical manifolds. Tôhoku Math. J. 46,
427–433 (1994)
10. Ni, L.: Hermitian harmonic maps from complete Hermitian manifolds to complete Riemannian
manifolds. Math. Z. 232, 331–335 (1999)
11. Nomizu, K., Pinkal, U.: On the geometry and affine immersions. Math. Z. 195, 165–178 (1987)
12. Nomizu, K., Sasaki, T.: Affine Differential Geometry: Geometry of Affine Immersions. Cam-
bridge University Press, Cambridge (1994)
13. Shima, H.: Harmonicity of gradient mapping of level surfaces in a real affine space. Geom.
Dedicata 56, 177–184 (1995)
14. Shima, H.: The Geometry of Hessian Structures. World Scientific Publishing, Singapore (2007)
15. Uohashi, K.: On α-conformal equivalence of statistical submanifolds. J. Geom. 75, 179–184
(2002)
16. Uohashi, K.: Harmonic maps relative to α-connections on statistical manifolds. Appl. Sci. 14,
82–88 (2012)
17. Uohashi, K.: A Hessian domain constructed with a foliation by 1-conformally flat statistical
manifolds. Int. Math. Forum 7, 2363–2371 (2012)
18. Uohashi, K.: Harmonic maps relative to α-connections on Hessian domains. In: Nielsen, F.,
Barbaresco, F. (eds.) Geometric Science of Information, First International Conference, GSI
2013, Paris, France, 28–30 August 2013. Proceedings, LNCS, vol. 8085, pp. 745–750. Springer,
Heidelberg (2013)
19. Uohashi, K., Ohara, A., Fujii, T.: 1-Conformally flat statistical submanifolds. Osaka J. Math.
37, 501–507 (2000)
96 K. Uohashi
20. Uohashi, K., Ohara, A., Fujii, T.: Foliations and divergences of flat statistical manifolds.
Hiroshima Math. J. 30, 403–414 (2000)
21. Urakawa, H.: Calculus of Variations and Harmonic Maps. Shokabo, Tokyo (1990) (in Japanese)
Chapter 5
A Riemannian Geometry in the q-Exponential
Banach Manifold Induced by q-Divergences
5.1 Introduction
The infinite-dimensional case, was initially developed by Pistone and Sempi [19],
constructing a manifold for the exponential family with the use of Orlicz spaces as
coordinate space. Gibilisco and Pistone [10] defined the exponential connection as the
natural connection induced by the use of Orlicz spaces and show that the exponential
and mixture connections are in duality relationship with the α-connections just as in
the parametric case. In these structures, every point of the manifold is a probability
density on some sample measurable space, with particular focus on the exponential
family since almost every model of the non-extensive physics belongs to this family.
Nevertheless, some models (complex models, see [4]) do not fit in this representations
so a new family of distributions must be defined to contain those models, one of
these families is the q-exponential family based on a deformation of the exponential
function which has been used in several applications using the Tsalli’s index q, for
references see [4, 21].
In order to give a geometric structure for this models, Amari and Ohara [3] studied
the geometry of the q-exponential family in the finite dimensional setting and they
found this family to have a dually flat geometrical structure derived from Legendre
transformation and is understood by means of conformal geometry. In 2013, Loaiza
and Quiceno [15] constructed for this family, a non-parametric statistical manifold
modeled on essentially bounded function spaces, such that each q-exponential para-
metric model is identified with the tangent space and the coordinate maps are natu-
rally defined in terms of relative entropies in the context of Tsallis; and in [14], the
Riemannian structure is characterized.
The manifold constructed in [14, 15]; is characterized by the fact that when
q ∗ 1 then the non-parametric exponential models are obtained and the mani-
fold constructed by Pistone and Sempi, is recovered, up to continuous embeddings
on the modeling space, which means that the manifolds are related by some map
L Δ ( p · μ) ↔ L ≡ ( p · μ); which should be investigated.
As mentioned, some complex phenomena do not fit the Gibbs distribution but the
power law, the Tsalli’s q-entropy is an example capturing such systems. Based on
the Tsalli’s entropy index q, it has been constructed a generalized exponential family
named q-exponential family, which is being presented, see [4, 16, 21].
Given a real number q, we consider the q-deformed exponential and logarithmic
functions which are respectively defined by
1 −1
eqx = [1 + (1 − q)x] 1−q if ∇x (5.1)
1−q
x 1−q − 1
lnq (x) = if x > 0. (5.2)
1−q
5 A Riemannian Geometry in the q-Exponential Banach Manifold 99
The above functions satisfy similar properties of the natural exponential and log-
arithmic functions (Fig. 5.1).
It is necessary to show the basic properties of the q-deformed exponential and loga-
rithm functions. Take the definitions given in (5.1) and (5.2) (Fig. 5.2).
Proposition 1 1. For q < 0, x ◦ [0, ≡), expq (x) is positive, continuous, increas-
ing, concave and such that:
2. For 0 < q < 1, x ◦ [0, ≡), expq (x) is positive, continuous, increasing, convex
and such that:
lim expq (x) = ≡.
x∗≡
3. For 1 < q, x ◦ 0, q−1
1
, expq (x) is positive, continuous, increasing, convex
and such that:
lim expq (x) = ≡.
−
1
x∗ q−1
Some graphics are shown for different values of the index q, to illustrate the behavior
of expq (x) (Fig. 5.3).
100 H. R. Quiceno et al.
2. For 0 < q < 1, x ◦ [0, ≡), lnq (x) is continuous, increasing, concave and such
that:
lim lnq (x) = ≡.
x∗≡
3. For 1 < q, x ◦ (0, ≡), lnq (x) is increasing, continuous, concave and such that:
1
lim lnq (x) = .
x∗≡ q −1
Some graphics are shown for different values of the index q, to illustrate the behavior
of lnq (x) (Fig. 5.4).
The following proposition shows that the deformed functions, share similar prop-
erties to the natural ones (Fig. 5.5).
Proposition 3 1. Product
2. Quotient
expq (x) x−y
= expq . (5.4)
expq (y) 1 + (1 − q)y
3. Power law
(expq x)n = exp1− (1−q) (nx). (5.5)
n
4. Inverse
−1 −x
expq (x) = expq = exp2−q (−x). (5.6)
1 + (1 − q)x
5. Derivative
d
expq (x) = (expq (x))q = exp2− 1 (q x). (5.7)
dx q
6. Integral
1
expq (nx)d x = (expq (nx))2−q . (5.8)
(2 − q)n
7. Product
lnq (x y) = lnq (x) + lnq (y) − (1 − q) lnq (x) lnq (y). (5.9)
8. Quotient
x lnq (x) − lnq (y)
lnq = . (5.10)
y 1 + (1 − q) lnq (y)
5 A Riemannian Geometry in the q-Exponential Banach Manifold 103
9. Power law n
lnq (x n ) = ln1−n (x 1−q ). (5.11)
1−q
11. Derivative
d 1
lnq (x) = q . (5.13)
dx x
12. Integral
x(lnq (x) − 1)
lnq (x)d x = . (5.14)
2−q
5.2.2 q-Algebra
For real numbers, there are two binary operations given in terms of the index q, as
follows (Fig. 5.6).
1. The q-sum →q : R2 ∗ R, is given by:
x →q y = x + y + (1 − q)x y. (5.15)
1
x ∅q y = [x 1−q + y 1−q − 1]+1−q x > 0 and y > 0. (5.16)
104 H. R. Quiceno et al.
Then
x →q (≤q x) = 0.
y (1 − q)x y
x ≤q y = x →q (≤q y) = x − −
1 + (1 − q)y 1 + (1 − q)y
x−y
= . (5.17)
1 + (1 − q)y
x ≤q y = ≤q y →q x. (5.18)
x ≤q (y ≤q z) = (x ≤q y) →q z. (5.19)
Proposition 6 For x > 0 exist an unique inverse for ∅q , denoted qx and given
1
by b = (2 − x 1−q ) 1−q .
x q (y q z) = (x q y) ∅q z = (x ∅q z) q y
if z 1−q
− 1 ∇ y 1−q ∇ x 1−q + 1. (5.22)
This definitions among with Proposition 5.3, allow to prove the next proposition.
5 A Riemannian Geometry in the q-Exponential Banach Manifold 105
Proposition 7 1.
expq (x) expq (y) = expq (x →q y). (5.23)
expq (x)
= expq (x ≤q y). (5.24)
expq (y)
2.
lnq (x y) = lnq (x) →q lnq (y). (5.27)
x
lnq = lnq (x) ≤q lnq (y). (5.28)
y
Some interesting models of statistical physics can be written into the following form,
readers interested in this examples must see [16]. If the q-exponential in the r.h.s.
diverges then f β (x) = 0 is assumed. The function H (x) is the Hamiltonian. The
parameter β is usually the inverse temperature. The normalization α(β) is written
inside the q-exponential. The function c(x) is the prior distribution. It is a reference
measure and must not depend on the parameter β.
If a model is the above form then it is said to belong to the q-exponential family.
In the limit q ∗ 1 these are the models of the standard exponential family. In that
case the expression (5.31) reduces to
with
≡
cq = expq (−x 2 ) d x
−≡
π Φ − 2 + q−1
1 1
= if 1 < q < 3,
q −1 Φ q−1 1
π Φ 1 + q−1
1
= if q < 1.
q −1 Φ 3 + 1
2 1−q
Note that this distribution vanishes outside the interval [−σ, σ]. Then, the q ∗ 1
case, reproduces the conventional Gauss distribution. For q = 2 one obtains
1 σ
f (x) = . (5.33)
π x + σ2
2
This is known as the Cauchy distribution. The function (5.33) is also called a
Lorentzian. In the range 1 ∇ q < 3 the q-Gaussian is strictly positive on the whole
5 A Riemannian Geometry in the q-Exponential Banach Manifold 107
line. For q < 1 it is zero outside an interval. For q ⊆ 3 the distribution cannot be
normalized because
1
f (x) ∧ 2
as |x| ∗ ≡ .
|x| q−1
1 v2
f (v) = 1+k . (5.34)
A(κ)v03 1 v2
1+ κ−a v 2
0
and
v2 1 v2
f (v) = expq − . (5.35)
A(κ)v03 2 − q − (q − 1)a v02
However, in order to be of the form (5.31), the pre-factor of (5.35) should not
depend on the parameter v0 . Introduce an arbitrary constant c > 0 with the dimen-
sions of a velocity. The one can write
4πv 2 v 3 v 3 q−1
0 0
f (v) = 3 expq − ln2−q 4π A(κ) − 4π A(κ) h(q, v) .
c c c
v2
where h(q, v) = 1
2−q−(q−1)a v 2 .
0
In the case q ∗ 1, one obtains the Maxwell-distribution.
1 1
f (v) = .
π v2 − v2
0
It diverges when |v| approaches its maximal value v0 and vanished for |v| > v0 .
This distribution can e written into the form (5.31) of a q-exponential family with
q = 3. To do so, let x = v and
108 H. R. Quiceno et al.
√
2
c(x) = ,
π|v|
1
β = mv02 ,
2
1 1
H (x) = 1 = ,
2 K
2 mv
3
α(β) = − .
2
On 2011, Amari and Ohara [3], constructed a finite dimensional manifold for the
family of q-exponential distributions with the properties of a dually flat geometry
derived from Legendre transformations and such that the maximizer of the q-scort
distribution is a Bayesian estimator.
Let Sn denote the family of discrete distributions over (n + 1) elements X =
{x0 , x1 , . . . , xn }; put pi = Prob {x = xi } and denote the probability distribution
vector p = ( p0 , p1 , . . . , pn ).
The probability of x is written as
n
p(x) = pi δi (x).
i=0
Proposition 8 The family Sn , has the structure of a q-exponential family for any q.
The proof is based on the fact that the q-exponential family is written as
n
ψq (θ) = − lnq ( p0 ), and p0 = 1 − pi δi (x).
i=0
It should be noted that the family in (5.36), is not in general of the form given in
[16] since it does not have the pre factor (prior distribution) c(x), as in (5.31).
The geometry is induced by a q-divergence defined by the q-potential. Since ψ is a
convex function, it defines a divergence of the Bregman-type
which simplifies to
5 A Riemannian Geometry in the q-Exponential Banach Manifold 109
1
n
q 1−q
Dq [ p : r ] = 1− pi ri ,
(1 − q)h q ( p)
i=0
n
q
where p, r are two discrete distributions and h q ( p) = pi .
i=0
For two points on the manifold, infinitesimally close, the divergence is
q
Dq [ p(x, θ) : p(x, θ + dθ)] = gi j (θ)dθi dθ j ,
q
where gi j = ∂i ∂ j ψ(θ) is the q-metric tensor.
Note that, when q = 1, this metric reduces to the usual Fisher metric.
Proposition 9 The q-metric is given by a conformal transformation of the Fisher
information metric giFj , as
q q
gi j = gF .
h q (θ) i j
With this Riemannian geometry, the geodesics curves of the manifold, given by
allows to find the maximizer of the q-score function as a Bayesian estimator, see [3].
Let (Γ, Ω, μ) be a probability space and q a real number such that 0 < q < 1.
Denote by Mμ the set of strictly positive probability densities μ-a.e. For each p ◦ Mμ
consider the probability space (Γ, Ω, p · μ), where p · μ is the probability measure
given by
( p · μ)(A) = pdμ.
A
B p := {u ◦ L ≡ ( p · μ) : E p [u] = 0},
110 H. R. Quiceno et al.
which (with the essential supremum norm) is a closed normed subspace of the Banach
space L ≡ ( p · μ), thus B p is a Banach space.
Two probability densities p, z ◦ Mμ are connected by a one-dimensional
q-exponential model if there exist r ◦ Mμ , u ◦ L ≡ (r · μ), a real function of
real variable ψ and δ > 0 such that for all t ◦ (−δ, δ), the function f defined by
tu≤q ψ(t)
f (t) = eq r,
satisfies that there are t0 , t1 ◦ (−δ, δ) for which p = f (t0 ) and z = f (t1 ). The
function f is called one-dimensional q-exponential model, since it is a deformation
of the model f (t) = etu−ψ(t) r , see [18].
Define the mapping M p by
M p (u) = E p [eq(u) ],
Using results on convergence of series in Banach spaces, see [12], it can be proven
that the domain D M p of M p contains the open unit ball B p,≡ (0, 1) √ L ≡ ( p · μ).
Also if restricting M p to B p,≡ (0, 1), this function is analytic and infinitely Fréchet
differentiable.
Let (Γ, Ω, μ) be a probability space and q a real number with 0 < q < 1. Let
be V p := {u ◦ B p : ∀u∀ p,≡ < 1}, for each p ◦ Mμ . Define the maps
eq, p : V p ∗ Mμ
by
(u≤q K p (u))
eq, p (u) := eq p,
which are injective and their ranges are denoted by U p . For each p ◦ Mμ the map
sq, p : U p ∗ V p
given by
z z
sq, p (z) := lnq ≤q E p lnq ,
p p
is precisely the inverse map of eq, p . Maps sq, p are the coordinate maps for the
manifold and the family of pairs (U p , sq, p ) p◦Mμ define an atlas on Mμ ; and the
transition maps (change of coordinates), for each
u ◦ sq, p1 (U p1 U p2 ),
5 A Riemannian Geometry in the q-Exponential Banach Manifold 111
are given by
Given
u ◦ sq, p1 (Uq, p1 Uq, p2 ),
where A(u), B(u) are functions depending on u. This, allows to establish the main
result of [15] (Theorem 14), that is, the collection of pairs {(U p , sq, p )} p◦Mμ is a
C ≡ -atlas modeled on B p , and the corresponding manifold is called q−exponential
statistical Banach manifold.
Finally, the tangent bundle of the manifold, is characterized, (Proposition 15) [15],
by regular curves on the manifold, as follows.
Let g(t) be a regular curve for Mμ where g(t0 ) = p, and u(t) ◦ Vz be its coordinate
representation over sq,z .
Then
[u(t)≤q K z (u(t))]
g(t) = eq z
and:
g(t)
1. d
dt lnq p t=t = T u ⊗ (t) − Q[M p (u(t))]1−q E p [u ⊗ (t)] for some constants T
0
and Q.
2. If z = p, i.e. the charts are centered in the same point;the tangent vectors are
d
identified with the q-score function in t given by dt lnq g(t)
p = T u ⊗ (t0 ).
t=t0
3. Consider a two dimensional q-exponential model
(tu≤q K p (tu))
f (t, q) = eq p (5.37)
In this section, we will find a metric and then the connections of the manifold,
derived from the q-divergence functional and characterizing the geometry of the
q-exponential manifold. For further details see [22].
The q-divergence functional is given as follows [15].
Let f be a function, defined for all t ←= 0 and 0 < q < 1, by
1
f (t) = −t lnq
t
which is the Tsallis’ divergence functional [9]. Some properties of this functional
are well known, for example that it is equal to the α−divergence functional up to
a constant factor where α = 1 − 2q, satisfying the invariance criterion. Moreover,
when q ∗ 0 then I (q) (z|| p) = 0 and if q ∗ 1 then I (q) (z|| p) = K (z|| p) which is
the Kullback-Leibler divergence functional [13]. As a consequence of Proposition
(17) in [15], the manifold is related with the q-divergence functional as
1 z (q)
sq, p (z) = lnq + I ( p||z) .
1 + (q − 1)I (q) ( p||z) p
where the subscript p, z means that the directional derivative is taken with respect
to the first and the second arguments in I (q) (z|| p), respectively, along the direction
u ◦ Tz (Mμ ) or v ◦ T p (Mμ ).
5 A Riemannian Geometry in the q-Exponential Banach Manifold 113
it follows
1
(du )z I (q) (z|| p) = q − qp (1−q) z (q−1) udμ
1−q Γ
and
1
(dv ) p I (q) (z|| p) = (1 − q) − (1 − q) p (−q) z (q) vdμ;
1−q Γ
and
(q) z ⊗
I (z|| p) ∇ (z − p) f dμ.
Γ p
uv
g(u, v) = q dμ.
Γ p
1
(dv ) p I (q) (z|| p) = (1 − q) − (1 − q) p (−q) z (q) vdμ
1−q Γ
and
(du )z (dv ) p I (q) (z|| p) = −q p (−q) z (q−1) uvdμ,
Γ
114 H. R. Quiceno et al.
so by (5.39), it follows
uv
g(u, v) = q dμ.
Γ p
∓
↔
Note that when q ∗ 1, this metric reduces to the one induced by the (α, β)-
divergence functional which induces the Fisher metric on parametric models.
The connections are characterized as follows.
Proposition 12 The family of covariant derivatives (connections)
⊂(q)
w u : Ω(Mμ ) × Ω(Mμ ) ∗ Ω(Mμ ),
are given as
1−q
⊂(q)
w u = dw u − uw.
p
Proof Considering (du )z (dv ) p I (q) (z|| p) as in proof of Proposition 5.11, we get
−(dw )z (du )z (dv ) p I (q) (z|| p) = q p (−q) (q − 1)z (q−2) uw + z (q−1) dw u vdμ.
Γ
(q)
⊂w u
g ⊂(q)
w u, v = q vdμ
Γ p
and then
g ⊂(q) (q)
w u, v = ∨⊂w u, v˘.
Then,
(q)
⊂w u
q p −1 (q − 1) p −1 uw + dw u = q ,
p
so
1−q
⊂(q)
w u = dw u − uw.
p
∓
↔
ˇ(q)
It is easy to prove that the associated conjugate connection is given by ⊂w u
=
dw u − qp uw. Notice that taking q = 1−α2 yields to the Amaris’s one-parameter family
of α−connections in the form
(α) 1+α
⊂w u = dw u − uw;
2p
5 A Riemannian Geometry in the q-Exponential Banach Manifold 115
where
Φ : Ω(Mμ ) × Ω(Mμ ) ∗ Ω(Mμ )
and
moreover,
R(u, v, w) = Φ (u, Φ(v, w)) − Φ (v, Φ(u, w)) + du Φ(v, w) − dv Φ(u, w),
116 H. R. Quiceno et al.
and
T (u, v) = Φ(u, v) − Φ(v, u).
and
1−q 1−q
T (u, v) = − uv + vu = 0.
p p
∓
↔
Since the mapping ⊂(q) ↔ ⊂(α) is smooth, it is expected that the geodesic
curves and parallel transports obtained from the q-connections preserves a smooth
isomorphism with the curves given by α-connections. Also, it must be investigated
if the metric tensor field in Proposition 5.11 is given by a conformal transformation
of the Fisher information metric.
References
11. Jeffreys, H.: An invariant form for the prior probability in estimation problems. Proc. Roy. Soc.
A 186, 453–61 (1946)
12. Kadets, M.I., Kadets, V.M.: Series in Banach spaces. In: Conditional and Undconditional
Convergence. Birkaaauser Verlang, Basel (1997) (Traslated for the Russian by Andrei Iacob)
13. Kulback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)
14. Loaiza, G., Quiceno, H.: A Riemannian geometry in the q-exponential Banach manifold
induced by q-divergences. Geometric science of information, In: Proceedings of First Interna-
tional Conference on GSI 2013, pp. 737–742. Springer, Paris (2013)
15. Loaiza, G., Quiceno, H.R.: A q-exponential statistical Banach manifold. J. Math. Anal. Appl.
398, 446–6 (2013)
16. Naudts, J.: The q-exponential family in statistical physics. J. Phys. Conf. Ser. 201, 012003
(2010)
17. Pistone, G.: k-exponential models from the geometrical viewpoint. Eur. Phys. J. B 70, 29–37
(2009)
18. Pistone, G., Rogantin, M.-P.: The exponential statistical manifold: Parameters, orthogonality
and space transformations. Bernoulli 4, 721–760 (1999)
19. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the
probability measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
20. Rao, C.R: Information and accuracy attainable in estimation of statistical parameters. Bull.
Calcutta Math. Soc. 37, 81–91 (1945)
21. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52, 479–487
(1988)
22. Zhang, J.: Referential duality and representational duality on statistical manifolds. In: Proceed-
ings of the 2nd International Symposium on Information Geometry and its Applications, pp.
58–67, Tokyo, (2005)
Chapter 6
Computational Algebraic Methods
in Efficient Estimation
6.1 Introduction
Information geometry gives geometric insights and methods for studying the statis-
tical efficiency of estimators, testing, prediction and model selection. The field of
algebraic statistics has proceeded somewhat separately but recently a positive effort
is being made to bring the two subjects together, notably [15]. This paper should be
seen as part of this effort.
A straightforward way of linking the two areas is to ask how far algebraic methods
can be used when the statistical manifolds of information geometry are algebraic,
that is algebraic varieties or derived forms, such as rational quotients. We call such
models “algebraic statistical models” and will give formal definitions.
K. Kobayashi (B)
The Institute of Statistical Mathematics, 10-3, Midori-cho,
Tachikawa, Tokyo, Japan
e-mail: [email protected]
H. P. Wynn
London School of Economics, London, UK
e-mail: [email protected]
In this section, we introduce the standard setting of statistical estimation theory, via
information geometry. See [2, 4] for details. It is recognized that the ideas go back to
at least the work of Rao [23], Efron [13] and Dawid [10]. The subject of information
geometry was initiated by Amari and his collaborators [3, 5].
Central to this family of ideas is that the rates of convergence of statistical esti-
mators and other test statistics depend on the metric and curvature of the parametric
manifolds in a neighborhood of the MLE or the null hypothesis. In addition Amari
realized the importance of two special models, the affine exponential model and the
affine mixture model, e and m frame respectively. In this paper we concentrate on
the exponential family model but also look at curved subfamilies. By extending the
dimension of the parameter space of the exponential family, we are able to cover
some classes of mixture models. The extension of the exponential model to infinite
dimensions is covered by [21].
{d P(x|α) | α ⊂ VΘ }.
Let w := (u, v) and use indexes {i, j, . . .} for α and δ, {a, b, . . .} for u, {ψ, φ, . . .}
for v and {β, γ, . . .} for w. The following are used for expressing conditions for
asymptotic efficiency of estimators, where Einstein notation is used.
Under some regularity conditions on the carrier measure τ, potential function ∂ and
the manifolds VΘ or V E , the asymptotic theory below is available. These condi-
tions are for guaranteeing the finiteness of the moments and the commuting of the
6 Computational Algebraic Methods in Efficient Estimation 123
ω
expectation and the partial derivative ωα E α [ f ] = E α [ ωωαf ]. For more details of the
required regularity conditions, see Sect. 2.1 of [4].
1. If û is a consistent estimator (i.e. P(◦û − u◦ > η) ∇ 0 as N ∇ → for any
η > 0), the squared error matrix of û is
Here [·]−1 means the matrix inverse. Thus, if gaψ = 0 for all a and ψ, the main
term in the r.h.s. becomes minimum. We call such an estimator as a first-order
efficient estimator.
2. The bias term becomes
for each a where ba (u) := Γ (m)acd (u)g cd (u). Then, the bias corrected estimator
ǔ a := û a − ba (û) satisfies E u [ǔ a − u a ] = O(N −2 ).
3. Assume gaψ = 0 for all a and ψ, then the square error matrix is represented by
⎛ ⎝
(m) (e) (m)
1 1
E u [(ǔ a − u a )(ǔ b − u b )] = g ab + Γ M2ab +2 HM2ab
+ H A2ab + o(N −2 ).
N 2N 2
See Theorem 5.3 of [4] and Theorem 4.4 of [2] for the definition of the terms in
the r.h.s. Of the four dominating terms in the r.h.s., only
(m)
H A2ab := g ψμ g φτ H (m)aψφ H (m)bμτ
Here H (m)aψφ is an embedding curvature and equal to Γ (m)aψφ when gaψ = 0 for
(m)
every a and ψ. Since H A2ab is the square of Γ (m)aψφ , the square error matrix
attains the minimum in the sense of positive definiteness if and only if
⎞ ⎞
⎞ ω2 ω i ⎞
Γ (m) ψφ,a (w)⎞ = δ (w) α (w) ⎞ = 0. (6.1)
v=0 ωv ψ ωv φ
i
ωu a ⎞
v=0
This section studies asymptotic efficiency for statistical models and estimators which
are defined algebraically. Many models in statistics are defined algebraically. Perhaps
most well known are polynomial regression models and algebraic conditions on
probability models such as independence and conditional independence. Recently
there has been considerable interest in marginal models [7] which are typically linear
restrictions on raw probabilities. In time series autoregressive models expressed by
linear transfer functions induce algebraic restrictions on covariance matrices. Our
desire is to have a definition of algebraic statistical model which can be expressed
from within the curved exponential family framework but is sufficiently broad to
cover cases such as those just mentioned. Our solution is to allow algebraic conditions
in the natural parameter α, mean parameter δ or both. The second way in which
algebra enters is in the form of the estimator.
We say a curved exponential family is algebraic if the following two conditions are
satisfied.
(C1) VΘ or V E is represented by a real algebraic variety, i.e. VΘ := V( f 1 , . . . , f k )
= {α ⊂ Rd | f 1 (α) = · · · = f k (α) = 0} or similarly V E := V(g1 , . . . , gk ) for
f i ⊂ R[α1 , . . . , αd ] and gi ⊂ R[δ1 , . . . , δd ].
(C2) α ≡∇ δ(α) or δ ≡∇ α(δ) is represented by some algebraic equations, i.e. there are
h 1 , . . . , h k ⊂ R[α, δ] such that locally in VΘ × V E , h i (α, δ) = 0 iff δ(α) = δ
or α(δ) = α.
Here R[α1 , . . . , αd ] means a polynomial of α1 , . . . , αd over the real number field R
and R[α, δ] means R[α1 , . . . , αd , δ1 , . . . , δd ]. The integer k, the size of the genera-
tors, is not necessarily equal to d − p but we assume VΘ (or V E ) has dimension p
around the true parameter. Note that if ∂(α) is a rational form or the logarithm of a
rational form, (C2) is satisfied.
The parameter set VΘ (or V E ) is sometimes singular for algebraic models. But
throughout the following analysis, we assume non-singularity around the true para-
meter α∗ ⊂ VΘ (or δ ∗ ⊂ V E respectively).
Following the discussion at the end of Sect. 6.2.1. We call α(u, v) or δ(u, v) an
algebraic estimator if
(C3) w ≡∇ δ(w) or w ≡∇ α(w) is represented algebraically.
6 Computational Algebraic Methods in Efficient Estimation 125
We remark that the MLE for an algebraic curved exponential family is an algebraic
estimator.
If conditions (C1), (C2) and (C3) hold, then all of the geometrical entities in
Sect. 6.2.2 are characterized by special polynomial equations. Furthermore, if ∂(α) ⊂
R(α) ← log R(α) and α(w) ⊂ R(w) ← log R(w), then the geometrical objects have
the additional property of being rational.
Consider an algebraic estimator δ(u, v) ⊂ R[u, v]d satisfying the following vector
equation:
⎠
d ⎠
p
X = δ(u, 0) + vi− p ei (u) + c · f j (u, v)e j (u) (6.2)
i= p+1 j=1
⎠
d ⎠
p
δ(w) = δ(u, 0) + vi− p ei (u) + c · f j (u, v)e j (u)
i= p+1 j=1
where {ẽ j (u, δu ) ⊂ R[u, δu ]d ; j = 1, . . . , p} span ((∈u δ(u, 0))∧Ḡ )∧ E for every
u and h j (X, u, δu , t) ⊂ R[X, u, δu ][t]3 (degree = 3 in t) for j = 1, . . . , p. The
constant c is to control the perturbation. The notation Ḡ represents the Fisher metric
on the full-exponential family with respect to δ. The notation (∈u δ(u, 0))∧Ḡ means
the subspace orthogonal to span(ωa δ(u, 0))a=1 with respect to Ḡ and (·)∧ E means
p
the orthogonal complement in the sense of Euclidean vector space. Here, the term
“degree” of a polynomial means the maximum degree of its terms. Note that the
case (X − δu )∅ ẽ j (u, δu ) = 0 for j = 1, . . . , p gives a special set of the estimating
equations of the MLE.
Proof Take the Euclidean inner product of both sides of (6.2) with each ẽ j which is
a vector Euclidean orthogonal to the subspace span({ei |i ≥= j}) and obtain a system
of polynomial equations. By eliminating variables v from the polynomial equations,
an algebraic version is obtained.
6 Computational Algebraic Methods in Efficient Estimation 127
(δ(u, v) − δ(u, 0))∅ ẽ j (u) + c · h j (δ(u, v), u, δ(u, 0), δ(u, v) − δ(u, 0)) = 0.
since each term of h j (δ(u, v), u, δ(u, 0), δ(u, v) − δ(u, 0)) has degree more than 3
in its third component (δi (u, v) − δi (u, 0))i=1
d and δ(u, v) − δ(u, 0)|
v=0 = 0. Since
∧
span{ẽ j (u); j = 1, . . . , p} = ((∈u δ(u, 0)) ) Ḡ ∧ E = span{Ḡωu a δ; a = 1, . . . , p},
we obtain
⎞ ⎞
(m) ⎞ ω 2 δi i j ωδ j ⎞⎞
Γψφa ⎞ = g = 0.
v=0 ωv φ ωv ψ ωu a ⎞v=0
By Theorems 1, 2 and 3, the relationship between the three forms of the second-
order efficient algebraic estimators is summarized as
(6.3) locally uniquely exists for small c, i.e. there is a neighborhood G(u ∗ ) ∗ Rd of
δ(u ∗ ) and κ > 0 such that for every fixed X ⊂ G(u ∗ ) and −κ < c < κ, a unique
estimate exists.
Proof Under the condition of the theorem, the MLE always exists locally. Further-
more, because of the nonsingular Fisher matrix, the MLE is locally bijective (by the
implicit representation theorem). Thus (u 1 , . . . , u p ) ≡∇ (g1 (x −δu ), . . . , g p (x −δu ))
for g j (x − δu ) := (X − δu )∅ ẽ j (u, δu ) in (6.3) is locally bijective. Since {gi } and
{h i } are continuous, we can select κ > 0 for (6.3) to be locally bijective for every
−κ < c < κ.
Input:
• a potential function ∂ satisfying (C2),
• polynomial equations of δ, u and v satisfying (C3),
• m 1 , . . . , m d− p ⊂ R[δ] such that V E = V (m 1 , . . . , m d− p ) gives the model,
• f j ⊂ R[u][v]⊆3 and c ⊂ R for a vector version
Step 1 Compute ∂ and α(δ), G(δ), (Γ (m) (δ) for bias correction)
Step 2 Compute f ai ⊂ R[δ][ν11 , . . . , ν pd ]1 s.t. f a j (ν11 , . . . , ν pd ) :=
ωu a m j for νbi := ωu b δi .
Step 3 Find e p+1 , . . . , ed ⊂ (∈u δ)∧Ḡ by eliminating {νa j } from
≤ei , ωu a δḠ = eik (δ)g k j (δ)νa j = 0 and f a j (ν11 , . . . , ν pd ) = 0.
Step 4 Select e1 , . . . , e p ⊂ R[δ] s.t. e1 (δ), . . . , ed (δ) are linearly
independent.
Step 5 Eliminate v from
⎠d ⎠p
X = δ(u, 0) + vi− p ei + c · f j (u, v)e j
i= p+1 j=1
Output(Vector version):
⎠d ⎠p
X = δ(u, 0) + vi− p ei (δ) + c · f j (u, v)e j (δ).
i= p+1 j=1
6 Computational Algebraic Methods in Efficient Estimation 129
Output(Algebraic version):
(X − δ)∅ ẽ + c · h(X − δ) = 0.
As we noted in Sect. 6.3.4, if we set h j = 0 for all j, the estimator becomes the MLE.
In this sense, ch j can be recognized as a perturbation from the likelihood equations.
If we select each h j (X, u, δu , t) ⊂ R[X, u, δu ][t]3 tactically, we can reduce the
degree of the polynomial estimating equation. For algebraic background, the reader
refers to Appendix A.
Here, we assume u ⊂ R[δu ]. For example, we can set u i = δi . Then ẽ j (u, δu ) is
a function of δu , so we write it as ẽ j (δ). Define an ideal I3 of R[X, δ] as
Proof Assume r j has a monomial term whose degree is more than 2 with respect
to δ and represent the term as δa δb δc q(δ, X ) with a polynomial q ⊂ R(δ, X ) and a
combination of indices a, b, c. Then {δa δb δc +(X a −δa )(X a −δa )(X a −δa )}q(δ, X )
has a smaller polynomial order than δa δb δc q(δ, X ) since ↔ is pure lexicographic
satisfying δ1 ∓ · · · ∓ δd ∓ X 1 ∓ · · · ∓ X d . Therefore by subtracting
(X a −δa )(X a −δa )(X a −δa )}q(δ, X ) ⊂ I3 from r j , the polynomial degree decreases.
This contradicts the fact r j is the normal form so each r j has degree at most 2.
Furthermore each polynomial in I3 is in R[X, u, δu ][X − δ]3 and therefore by
taking the normal form, the condition for the algebraic version (6.3) of second-order
efficiency still holds.
The reduction of the degree is important when we use algebraic algorithms such
as homotopy continuation methods [18] to solve simultaneous polynomial equations
since computational cost depends highly on the degree of the polynomials.
130 K. Kobayashi and H. P. Wynn
It is not surprising that, for first-order efficiency, almost the same arguments hold as
for second-order efficiency.
By Theorem 5.2 of [4], a consistent estimator is first-order efficient if and only if
gψa = 0. (6.4)
Consider an algebraic estimator δ(u, v) ⊂ R[u, v]d satisfying the following vector
equation:
⎠
d ⎠
p
X = δ(u, 0) + vi− p ei (u) + c · f j (u, v)e j (u) (6.5)
i= p+1 j=1
where {ẽ j (u, δu ) ⊂ R[u, δu ]d ; j = 1, . . . , p} span ((∈u δ(u, 0))∧Ḡ )∧ E for every u
and h j (X, u, δu , t) ⊂ R[X, u, δu ][t]2 (degree = 2 w.r.t. t) for j = 1, . . . , p. Here,
the only difference between (6.3) for the second-order efficiency and (6.6) for the
first-order efficiency is the degree of the h j (X, u, δu , t) with respect to t.
Then the relation between the three different forms of first-order efficiency can
be proved in the same way manner as for Theorem 1, 2 and 3.
Theorem 5 (i) Vector version (6.5) satisfies the first-order efficiency.
(ii) An estimator defined by a vector version (6.5) of the first-order efficient estimators
is also represented by an algebraic version (6.6).
(iii) Every algebraic version (6.6) gives a first-order efficient estimator.
The relationship between the three forms of the first-order efficient algebraic esti-
mators is summarized as (4) ∀ (5) √ (6) √ (4). Furthermore, if we assume the
estimator has a form δ ⊂ R(u)[v], the forms (6.4), (6.5) and (6.6) are equivalent.
Let R := Z[X, δ] and define
6.5 Examples
In this section, we show how to use the algebraic computation to design asymptoti-
cally efficient estimators for two simple examples. The examples satisfy the algebraic
conditions (C1), (C2) and (C3) so it is verified that necessary geometric entities have
an algebraic form as mentioned in Sect. 6.3.2.
The following periodic Gaussian model shows how to compute second-order effi-
cients estimators and their biases.
• Statistical Model:
⎧ ⎤ ⎧ ⎤
0 1 a a2 a
⎨0⎥ ⎨a 1 a a2⎥
X ∨ N (μ, Σ(a)) with μ = ⎨ ⎥ ⎨
⎩0⎫ and Σ(a) = ⎩a 2
⎥ for 0 ⊗ a < 1.
a 1 a⎫
0 a a2 a 1
Here, the dimension of the full exponential family and the curved exponential
family are d = 3 and p = 1, respectively.
• Curved exponential family:
• Potential function:
• Natural parameter:
⎬ ⎭∅
1 a a2
α(a) = ,− , ,
1 − 2a + 4a
2 4 1 − 2a + 4a 1 − 2a 2 + 4a 4
2 4
132 K. Kobayashi and H. P. Wynn
• A set of vectors ei ⊂ R3 :
x − δ + v1 · e1 + v2 · e2 + c · v13 · e0 = 0.
g(a) := 8(a − 1)2 (a + 1)2 (1 + 2a 2 )2 (4a 5 −8a 3 + 2a 3 x3 − 3x2 a 2 + 4a + 4ax1 + 2ax3 −x2 )
and
h(a) := (2a 4 + a 3 x2 − a 2 x3 + 2a 2 + ax2 − 2x1 − x3 − 4)3 .
• Bias correction term for an estimator â: â(â 8 − 4â 6 + 6â 4 − 4â 2 + 1)/(1 + 2â 2 )2 .
Here, we consider a log marginal model. See [7] for more on marginal models.
• Statistical model (Poisson regression):
i.i.d
X i j ∨ Po(N pi j ) s.t. pi j ⊂ (0, 1) for i = 1, 2 and j = 1, 2, 3 with model con-
straints:
Condition (6.7) can appear in a statistical test of whether acceleration of the ratio
p1 j / p2 j is constant.
• A set of vectors ei ⊂ R6 :
⎧ ⎤
δ22 (δ4 − δ6 )
⎨ −δ 2 (δ4 − δ6 ) ⎥
⎨ 2 ⎥
⎨ 0 ⎥
⎨
e0 := ⎨ ⎥
−δ δ 2 − 2δ δ δ ⎥ ⊂ (∈u δ),
⎨ 3 5 2 4 6⎥
⎩ 0 ⎫
δ3 δ52 + 2δ2 δ4 δ6
⎧⎧ ⎤ ⎧ ⎤ ⎧ ⎤⎤
δ1 δ1 (−δ1 δ52 + δ3 δ52 ) δ1 (δ1 δ52 − δ3 δ52 )
⎨⎨δ2 ⎥ ⎨δ2 (−δ1 δ 2 − 2δ2 δ4 δ6 )⎥ ⎨δ2 (δ1 δ 2 + 2δ2 δ4 δ6 )⎥⎥
⎨⎨ ⎥ ⎨ 5 ⎥ ⎨ 5 ⎥⎥
⎨⎨δ3 ⎥ ⎨ 0 ⎥ ⎨ 0 ⎥⎥
[e1 , e2 , e3 ] : = ⎨⎨ ⎥,⎨ ⎥,⎨
⎨⎨ 0 ⎥ ⎨ δ4 (δ 2 δ4 − δ 2 δ6 ) ⎥ ⎨δ4 (2δ1 δ3 δ5 + δ 2 δ6 )⎥⎥
⎥⎥
⎨⎨ ⎥ ⎨ 2 2 ⎥ ⎨ 2 ⎥⎥
⎩⎩ 0 ⎫ ⎩ δ5 (δ 2 δ4 + 2δ1 δ3 δ5 ) ⎫ ⎩ 0 ⎫⎫
2
0 0 δ6 (δ2 δ4 + 2δ1 δ3 δ5 )
2
⊂ ((∈u δ)∧Ḡ )3
134 K. Kobayashi and H. P. Wynn
X − δ + v1 · e1 + v2 · e2 + v3 · e3 + c · v13 · e0 = 0.
{x1 δ2 2 δ4 2 δ6 − x1 δ2 2 δ4 δ6 2 − x2 δ1 δ2 δ4 2 δ6 + x2 δ1 δ2 δ4 δ6 2 − 2 x4 δ1 δ2 δ4 δ6 2 −
x4 δ1 δ3 δ5 2 δ6 + 2 x6 δ1 δ2 δ4 2 δ6 + x6 δ1 δ3 δ4 δ5 2 ,
−x2 δ2 δ3 δ4 2 δ6 + x2 δ2 δ3 δ4 δ6 2 + x3 δ2 2 δ4 2 δ6 − x3 δ2 2 δ4 δ6 2 − x4 δ1 δ3 δ5 2 δ6 −
2 x4 δ2 δ3 δ4 δ6 2 + x6 δ1 δ3 δ4 δ5 2 + 2 x6 δ2 δ3 δ4 2 δ6 ,
−2 x4 δ1 δ3 δ5 2 δ6 − x4 δ2 2 δ4 δ5 δ6 + x5 δ2 2 δ4 2 δ6 − x5 δ2 2 δ4 δ6 2 + 2 x6 δ1 δ3 δ4 δ5 2 +
x6 δ2 2 δ4 δ5 δ6 ,
δ1 δ3 δ5 2 − δ2 2 δ4 δ6 , δ1 + δ2 + δ3 − δ4 − δ5 − δ6 , −δ1 − δ2 − δ3 − δ4 − δ5 − δ6 + 1}
{−3 x1 x2 x4 2 x6 δ2 +6 x1 x2 x4 2 x6 δ6 + x1 x2 x4 2 δ2 δ6 −2 x1 x2 x4 2 δ6 2 +3 x1 x2 x4 x6 2 δ2 −
6 x1 x2 x4 x6 2 δ4 + 2 x1 x2 x4 x6 δ2 δ4 − 2 x1 x2 x4 x6 δ2 δ6 − x1 x2 x6 2 δ2 δ4 + 2 x1 x2 x6 2 δ4 2 +
3 x1 x3 x4 x5 2 δ6 − 2 x1 x3 x4 x5 δ5 δ6 − 3 x1 x3 x5 2 x6 δ4 + 2 x1 x3 x5 x6 δ4 δ5 + x1 x4 2 x6 δ2 2 −
2 x1 x4 2 x6 δ2 δ6 − x1 x4 x5 2 δ3 δ6 − x1 x4 x6 2 δ2 2 + 2 x1 x4 x6 2 δ2 δ4 + x1 x5 2 x6 δ3 δ4 +
3 x2 2 x4 2 x6 δ1 − x2 2 x4 2 δ1 δ6 − 3 x2 2 x4 x6 2 δ1 − 2 x2 2 x4 x6 δ1 δ4 + 2 x2 2 x4 x6 δ1 δ6 +
x2 2 x6 2 δ1 δ4 − x2 x4 2 x6 δ1 δ2 − 2 x2 x4 2 x6 δ1 δ6 + x2 x4 x6 2 δ1 δ2 + 2 x2 x4 x6 2 δ1 δ4 −
x3 x4 x5 2 δ1 δ6 + x3 x5 2 x6 δ1 δ4 ,
3 x1 x3 x4 x5 2 δ6 −2 x1 x3 x4 x5 δ5 δ6 −3 x1 x3 x5 2 x6 δ4 +2 x1 x3 x5 x6 δ4 δ5 − x1 x4 x5 2 δ3 δ6 +
x1 x5 2 x6 δ3 δ4 + 3 x2 2 x4 2 x6 δ3 − x2 2 x4 2 δ3 δ6 − 3 x2 2 x4 x6 2 δ3 − 2 x2 2 x4 x6 δ3 δ4 +
2 x2 2 x4 x6 δ3 δ6 + x2 2 x6 2 δ3 δ4 − 3 x2 x3 x4 2 x6 δ2 + 6 x2 x3 x4 2 x6 δ6 + x2 x3 x4 2 δ2 δ6 −
2 x2 x3 x4 2 δ6 2 +3 x2 x3 x4 x6 2 δ2 −6 x2 x3 x4 x6 2 δ4 +2 x2 x3 x4 x6 δ2 δ4 −2 x2 x3 x4 x6 δ2 δ6 −
x2 x3 x6 2 δ2 δ4 + 2 x2 x3 x6 2 δ4 2 − x2 x4 2 x6 δ2 δ3 − 2 x2 x4 2 x6 δ3 δ6 + x2 x4 x6 2 δ2 δ3 +
2 x2 x4 x6 2 δ3 δ4 +x3 x4 2 x6 δ2 2 −2 x3 x4 2 x6 δ2 δ6 −x3 x4 x5 2 δ1 δ6 −x3 x4 x6 2 δ2 2 +2 x3 x4 x6 2
δ2 δ4 + x3 x5 2 x6 δ1 δ4 ,
6 x1 x3 x4 x5 2 δ6 −4 x1 x3 x4 x5 δ5 δ6 −6 x1 x3 x5 2 x6 δ4 +4 x1 x3 x5 x6 δ4 δ5 −2 x1 x4 x5 2 δ3 δ6 +
2 x1 x5 2 x6 δ3 δ4 + 3 x2 2 x4 2 x6 δ5 − x2 2 x4 2 δ5 δ6 − 3 x2 2 x4 x5 x6 δ4 + 3 x2 2 x4 x5 x6 δ6 +
x2 2 x4 x5 δ4 δ6 − x2 2 x4 x5 δ6 2 − 3 x2 2 x4 x6 2 δ5 − x2 2 x4 x6 δ4 δ5 + x2 2 x4 x6 δ5 δ6 + x2 2 x5
x6 δ4 2 − x2 2 x5 x6 δ4 δ6 + x2 2 x6 2 δ4 δ5 − 2 x2 x4 2 x6 δ2 δ5 + 2 x2 x4 x5 x6 δ2 δ4 − 2 x2 x4 x5 x6
δ2 δ6 + 2 x2 x4 x6 2 δ2 δ5 − 2 x3 x4 x5 2 δ1 δ6 + 2 x3 x5 2 x6 δ1 δ4 ,
δ1 δ3 δ5 2 − δ2 2 δ4 δ6 , δ1 + δ2 + δ3 − δ4 − δ5 − δ6 , −δ1 − δ2 − δ3 − δ4 − δ5 − δ6 + 1}.
6.6 Computation
To obtain estimates based on the method of this paper, we need fast algorithms to
find the solution of polynomial equations. The authors have carried out computations
using homotopy continuation method (matlab program HOM4PS2 by Lee, Li and
Tsuai [18]) for the log marginal model in Sect. 6.5.2 and a data X̄ = (1, 1, 1, 1, 1, 1).
The run time to compute each estimate on a standard laptop (Intel(R) Core (TM)
i7-2670QM CPU, 2.20 GHz, 4.00 GB memory) is given by Table 6.1. The computa-
tion is repeated 10 times and the averages and the standard deviations are displayed.
Note the increasing of the speed for the second-order efficient estimators is due to
the degree reduction technique. The term “path” in the table heading refers to a prim-
itive iteration step within the homotopy method. In the faster polyhedron version,
the solution region is subdivided into polyhedral domains.
Figure 6.3 shows the mean squared error and the computational time of the MLE,
the first-order estimator and the second-order efficient estimator of Sect. 6.5.2. The
true parameter is set δ ∗ = (1/6, 1/4, 1/12, 1/12, 1/4, 1/6), a point in the model
manifold, and N random samples are generated i.i.d. from the distribution with the
parameter. The computation is repeated for exponentially increasing sample sizes
N = 1, . . . , 105 . In general, there are multiple roots for polynomial equations
and here we selected the root closest to the sample mean by the Euclidean norm.
Figure 6.3(1) also shows that the mean squared error is approximately the same for
the three estimators, but (2) shows that the computational time is much more for the
MLE.
136 K. Kobayashi and H. P. Wynn
(a) (b)
Fig. 6.3 The mean squared error and computation time for each estimate by the homotopy contin-
uation method
6.7 Discussion
Acknowledgments This paper has benefited from conversations with and advice from a number of
colleagues. We should thank Satoshi Kuriki, Tomonari Sei, Wicher Bergsma and Wilfred Kendall.
The first author acknowledges support by JSPS KAKENHI Grant 20700258, 24700288 and the
second author acknowledges support from the Institute of Statistical Mathematics for two visits
in 2012 and 2013 and from UK EPSRC Grant EP/H007377/1. A first version of this paper was
delivered at the WOGAS3 meeting at the University of Warwick in 2011. We thank the sponsors.
The authors also thank the referees of the short version in GSI2013 and the referees of the first long
version of the paper for insightful suggestions.
A Normal Forms
A basic text for the materials in this section is [9]. The rapid growth of modern
computational algebra can be credited to the celebrated Buchberger’s algorithm [8].
A monomial ideal I in a polynomial ring K [x1 , . . . , xn ] over a field K is an ideal
for which there is a collection of monomials f 1 , . . . , f m such that any g ⊂ I can be
expressed as a sum
⎠
m
g= gi (x) f i (x)
i=1
⎠
m
f = si (x)gi (x) + r (x).
i=1
We call the remainder r (x) the normal form of f with respect to I and write N F( f ).
Or, to stress the fact that it may depend on ↔, we write N F( f, ↔). Given a monomial
ordering ↔, a polynomial f = β⊂L αβ x β for some L is a normal form with respect
to ↔ if x β ⊂/ ≤L T ( f ) for all β ⊂ L. An equivalent way of saying this is: given an
ideal I and a monomial ordering ↔, for every f ⊂ K [x1 , . . . , xk ] there is a unique
normal form N F( f ) such that f − N F( f ) ⊂ I .
f 0 (x, y) := f 0 (x) := a1 x d1 − b1 = 0,
g0 (x, y) := g0 (y) := a2 y d2 − b2 = 0 (6.8)
Figure 6.4 shows a sketch of the algorithm. This algorithm is called the (linear)
homotopy continuation method and justified if the path connects t = 0 and t = 1
6 Computational Algebraic Methods in Efficient Estimation 139
continuously without an intersection. That can be proved for almost all a and b.
See [19].
For each computation for the homotopy continuation method, the number of the
paths is the number of the solutions of (6.8). In this
case, the number of paths is d1 d2 .
m
In general case with m unknowns, it becomes i=1 di and this causes a serious
problem for computational cost. Therefore decreasing the degree of second-order
efficient estimators plays an important role for the homotopy continuation method.
Note that in order to solve this computational problem, the authors of [16] pro-
posed the nonlinear homotopy continuation methods (or the polyhedral continuation
methods). But as we can see in Sect. 6.5.2, the degree of the polynomials still affects
the computational costs.
References
1. Adler, R.J., Taylor, J.E.: Random Fields and Geometry. Springer Monographs in Mathematics.
Springer, New York (2007)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical
Society, Providence (2007)
3. Amari, S.: Differential geometry of curved exponential families-curvatures and information
loss. Ann. Stat. 10, 357–385 (1982)
4. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, New York (1985)
5. Amari, S., Kumon, M.: Differential geometry of Edgeworth expansions in curved exponential
family. Ann. Inst. Stat. Math. 35(1), 1–24 (1983)
6. Andersson, S., Madsen, J.: Symmetry and lattice conditional independence in a multivariate
normal distribution. Ann. Stat. 26(2), 525–572 (1998)
7. Bergsma, W.P., Croon, M., Hagenaars, J.A.: Marginal Models for Dependent, Clustered, and
Longitudinal Categorical Data. Springer, New York (2009)
8. Buchberger, B.: Bruno Buchberger’s PhD thesis 1965: an algorithm for finding the basis ele-
ments of the residue class ring of a zero dimensional polynomial ideal. J. Symbol. Comput.
41(3), 475–511 (2006)
9. Cox, D.A., Little, J., O’Shea, D.: Ideals, Varieties, and Algorithms: An Introduction to Com-
putational Algebraic Geometry and Commutative Algebra, 3/e (Undergraduate Texts in Math-
ematics). Springer, New York (2007)
140 K. Kobayashi and H. P. Wynn
10. Dawid, A.P.: Further comments on some comments on a paper by Bradley Efron. Ann. Stat.
5(6), 1249 (1977)
11. Drton, M., Sturmfels B., Sullivant, S.: Lectures on Algebraic Statistics. Springer, New York
(2009)
12. Drton, M.: Likelihood ratio tests and singularities. Ann. Stat. 37, 979–1012 (2009)
13. Efron, B.: Defining the curvature of a statistical problem (with applications to second order
efficiency). Ann. Stat. 3, 1189–1242 (1975)
14. Gehrmann, H., Lauritzen, S.L.: Estimation of means in graphical Gaussian models with sym-
metries. Ann. Stat. 40(2), 1061–1073 (2012)
15. Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P.: Algebraic and Geometric Methods
in Statistics. Cambridge University Press, Cambridge (2009)
16. Huber, B., Sturmfels, B.: A polyhedral method for solving sparse polynomial systems. Math.
Comput. 64, 1541–1555 (1995)
17. Kuriki, S., Takemura, A.: On the equivalence of the tube and Euler characteristic methods for
the distribution of the maximum of Gaussian fields over piecewise smooth domains. Ann. Appl.
Probab. 12(2), 768–796 (2002)
18. Lee, T.L., Li, T.Y., Tsai, C.H.: HOM4PS2.0: a software package for solving polynomial systems
by the polyhedral homotopy continuation method. Computing 83(2–3), 109–133 (2008)
19. Li, T.Y.: Numerical solution of multivariate polynomial systems by homotopy continuation
methods. Acta Numer 6(1), 399–436 (1997)
20. Naiman, D.Q.: Conservative confidence bands in curvilinear regression. Ann. Stat. 14(3), 896–
906 (1986)
21. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the
probability measures equivalent to a given one. Ann. Stat. 23, 1543–1561 (1995)
22. Pistone, G., Wynn, H.P.: Generalised confounding with Gröbner bases. Biometrika 83, 653–666
(1996)
23. Rao, R.C.: Information and accuracy attainable in the estimation of statistical parameters. Bull.
Calcutta Math. Soc. 37(3), 81–91 (1945)
24. Verschelde, J.: Algorithm 795: PHCpack: a general-purpose solver for polynomial systems by
homotopy continuation. ACM Trans. Math. Softw. (TOMS) 25(2), 251–276 (1999)
25. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori,
T. (eds.) Algorithmic Learning Theory. Springer, Berlin (1999)
26. Weyl, H.: On the volume of tubes. Am. J. Math. 61, 461–472 (1939)
Chapter 7
Eidetic Reduction of Information Geometry
Through Legendre Duality of Koszul
Characteristic Function and Entropy: From
Massieu–Duhem Potentials to Geometric
Souriau Temperature and Balian Quantum
Fisher Metric
Frédéric Barbaresco
Abstract Based on Koszul theory of sharp convex cone and its hessian geometry,
Information Geometry metric is introduced by Koszul form as hessian of Koszul–
Vinberg Characteristic function logarithm (KVCFL). The front of the Legendre map-
ping of this KVCFL is the graph of a convex function, the Legendre transform of this
KVCFL. By analogy in thermodynamic with Dual Massieu–Duhem potentials (Free
Energy and Entropy), the Legendre transform of KVCFL is interpreted as a “Koszul
Entropy”. This Legendre duality is considered in more general framework of Contact
Geometry, the odd-dimensional twin of symplectic geometry, with Legendre fibra-
tion and mapping. Other analogies will be introduced with large deviation theory
with Cumulant Generating and Rate functions (Legendre duality by Laplace Princi-
ple) and with Legendre duality in Mechanics between Hamiltonian and Lagrangian.
In all these domains, we observe that the “Characteristic function” and its derivatives
capture all information of random variable, system or physical model. We present
two extensions of this theory with Souriau’s Geometric Temperature deduced from
covariant definition of thermodynamic equilibriums, and with Balian quantum Fisher
metric defined and built as hessian of von Neumann entropy. Finally, we apply Koszul
geometry for Symmetric/Hermitian Positive Definite Matrices cones, and more par-
ticularly for covariance matrices of stationary signal that are characterized by spe-
cific matrix structures: Toeplitz Hermitian Positive Definite Matrix structure (covari-
ance matrix of a stationary time series) or Toeplitz-Block-Toeplitz Hermitian Posi-
tive Definite Matrix structure (covariance matrix of a stationary space–time series).
F. Barbaresco (B)
Thales Air Systems, Voie Pierre-Gilles de Gennes, F91470 Limours, France
e-mail: [email protected]
7.1 Preamble
• Computation of Koszul
Entropy from general definition of Koszul Characteristic
function ΔΦ (x) = Φ⊂ e−∗Γ,x∈ dΓ ∀x ≡ Φ and Legendre transform:
Ω⊂ (x ⊂ ) = x, x ⊂ − Ω(x) with x ⊂ = Dx Ω and x = Dx⊂ Ω⊂ where Ω(x) = − log ΔΦ (x)
⎛ ⎠
⎞ ⎞ ⎞
Ω (x ) = Ω ⎝ Γ · px (Γ )dΓ ⎧ = − px (Γ ) log px (Γ )dΓ with x ⊂ =
⊂ ⊂ ⊂ Γ · px (Γ )dΓ
Φ⊂ Φ⊂ Φ⊂
⎤ ⎫ ⎭
⎞ px (Γ )dΓ = 1
Max ⎥− px (Γ ) log px (Γ )dΓ ⎬ such Φ
⊂
px (.) Γ · px (Γ )dΓ = x ⊂
Φ⊂ Φ⊂
⎭ −∗x,Γ ∈
px (Γ ) = e
e−∗x,Γ ∈ dΓ
∇ Φ⊂
x = Ψ−1 (x ⊂ ), Ψ(x) = dΩ(x)
dx
• Strict equivalence between Hessian Koszul metric and Fisher metric from Infor-
mation Geometry
⎜
∂ 2 log px (Γ ) ∂ 2 log ΔΦ (x)
I(x) = −EΓ =
∂x 2 ∂x 2
m−1 |dμi |2
ds2 = n · m · (d log (P0 ))2 + n (m − i) 2
i=1
1 − |μi |2
n−1 ⎜ ⎟−1 ⎟−1
+ (n − k)Tr I − Akk Ak+
k dAkk k+ k
I − Ak Ak k+
dAk
k=1
⎟
R0 , A11 , . . . , An−1 n−1 with SD = Z/ZZ + < I
with n−1 ≡ THPDm × SD m
R0 ◦ (log (P0 ) , μ1 , . . . , μm−1 ) ≡ R × Dm−1 with D = {z/zz⊂ < 1}
• New model of non-stationary signal where one time series can be split into sev-
eral stationary signals on a shorter time scale, represented by time sequence of
stationary covariance matrices or a geodesic polygon/path on covariance matrix
manifold.
• Definition of distance between non-stationary signals as Fréchet distance between
two geodesic paths in abstract metric spaces of covariance matrix manifold.
⎭
dFr échet (R1 , R2 ) = Inf t≡[0,1]
Max dgeo (R1 (α(t)), R2 (β(t)))
α,β
⎟
with dgeo2 (R (α(t)), R (β(t)))) = log R−1/2 (α(t))R (β(t))R−1/2 (α(t))
2
1 2 1 2 1
We will begin this exposition by a global view of all geometries that are interrelated
through the cornerstone concept of Koszul–Vinberg characteristic function and
metric.
7 Eidetic Reduction of Information Geometry 145
The paper title should be then considered as echo of this Bergson’s citation on
“Eidos” with twofold meaning in our development: “Eidos” considered with meaning
of “equilibrium state” as defined by Bergson “Eidos is the stable view taken of
the instability of things”, but also “Eidos” thought as “Eidetic Reduction” about
“Essence” of Information Geometry from where we will try to favor the emergence of
“Koszul characteristic function” concept. Eidetic reduction considered as the study of
“Information Geometry” reduced into a necessary essence, where will be identified
its basic and invariable components that will be declined in different domains of
Science (mathematics, physics, information theory,…).
As essence of Information Geometry, we will develop the “Koszul Characteris-
tic Function”, transverse concept in Thermodynamics (linked with Massieu–Duhem
Potentials), in Probability (linked with Poincaré generating moment function), in
Large Deviations Theory (linked with Laplace Principle), in Mechanics (linked with
Contact Geometry), in Geometric Physics (linked with Souriau Geometric temper-
ature in Mechanical Symplectic geometry) and in Quantum Physics (linked with
Balian Quantum hessian metric). We will try to explore close inter-relations between
these domains through geometric tools developed by Jean-Louis Koszul (Koszul
forms, …).
146 F. Barbaresco
Fig. 7.1 Text of Poincaré lecture on thermodynamic with development of the concept of “Massieu
characteristic function”
based on the orbit method works, that allows to define physical observables like
energy, momentum as pure geometrical objects. For Jean-Marie Souriau, equilibri-
ums states are indexed by a geometric parameter β with values in the Lie algebra
of the Lorentz–Poincaré group. Souriau approach generalizes the Gibbs equilibrium
states, β playing the role of temperature. The invariance with respect to the group, and
the fact that the entropy S is a convex function of β, imposes very strict conditions,
that allow Souriau to interpreted β as a space–time vector (the temperature vector
of Planck), giving to the metric tensor g a null Lie derivative. In our development,
before exposing Souriau theory, we will introduce Legendre Duality between the
variational Euler–Lagrange and the symplectic Hamilton–Jacobi formulations of the
equations of motion and will introduce Cartan–Poincaré invariant.
In 1986, Roger Balian has introduced a natural metric structure for the space of
states D̂ in quantum mechanics, from which we can deduce the distance between a
state and one of its approximations. Based on quantum information theory, Roger
% metric ds = d S(D̂) from
Balian has built on physical grounds this metric$ as hessian 2 2
S(D̂), von Neumann’s entropy S(D̂) = −Tr D̂ ln(D̂) . Balian has then recovered
same relations than in classical statistical physics, with also a “quantum character-
istic function” logarithm
⎟ F( & X̂) '= ln Tr exp X̂, Legendre transform of von-Neuman
Entropy S(D̂) = F X̂ − D̂, X̂ . In this framework, Balian has introduced the notion
of “Relevant Entropy”.
We will synthetize all these analogies in a table for the three models of Koszul,
Souriau and Balian (Information Geometry case being a particular case of Koszul
geometry).
In last chapters, we will apply the Koszul theory for defining geometry of Sym-
metric/Hermitian Positive Definite Matrices, and more particularly for covariance
matrix of stationary signal that are characterized by specific matrix structures:
Toeplitz Hermitian Positive Definite Matrix structure (covariance matrix of a sta-
tionary time series) or Toeplitz-Block-Toeplitz Hermitian Positive Definite Matrix
structure (covariance matrix of a stationary space–time series). We will see that
“Toeplitz” matrix structure could be captured by complex autogressive model para-
meterization. This parameterization could be naturally introduced without arbitrary
through Trench’s theorem (or equivalently Verblunsky’s theorem) or through Par-
tial Iwasawa decomposition. By extension, we introduce a new geometry for non-
stationary signal through Fréchet metric space of geodesic paths on structured matrix
manifolds.
We conclude with two general concepts of “Generating Inner Product” and “Gen-
erating Function” that extend previous developments.
The Koszul–Vinberg characteristic function is a dense knot in mathematics and
could be introduced in the framework of different geometries: Hessian Geometry
(Jean-Louis Koszul work), Homogeneous convex cones geometry (Ernest Vinberg
work), Homogeneous Symmetric Bounded Domains Geometry (Eli Cartan and Carl
Ludwig Siegel works), Symplectic Geometry (Thomas von Friedrich and Jean-Marie
Souriau work), Affine Geometry (Takeshi Sasaki and Eugenio Calabi works) and
Information Geometry (Calyampudi Rao [193] and Nikolai Chentsov [194] works).
148 F. Barbaresco
of parallel lines) then we obtain the projective plane in which the duality is given
symmetrical relationship between points and lines, and led to the classical principle
of projective duality, where the dual theorem is also a theorem.
Most Famous example is given by Pascal’s theorem (the Hexagrammum Mys-
ticum Theorem) stating that:
• If the vertices of a simple hexagon are points of a point conic, then its diagonal
points are collinear: If an arbitrary six points are chosen on a conic (i.e., ellipse,
parabola or hyperbola) and joined by line segments in any order to form a hexagon,
then the three pairs of opposite sides of the hexagon (extended if necessary) meet
in three points which lie on a straight line, called the Pascal line of the hexagon
The dual of Pascal’s Theorem is known as Brianchon’s Theorem:
• If the sides of a simple hexagon are lines of a line conic, then the diagonal lines
are concurrent (Fig. 7.3).
˜
∂ ∂ Ω̃
∂θ =η ∂η =θ
˜
∂
and (7.5)
=H ∂ Ω̃
∂Ψ ∂H =Ψ
⎟ ⎨ ⎩
with Ω̃ H̃ = E log p the Entropy.
In the theory of Information Geometry introduced by Rao [193] and Chentsov
[194], a Riemannian manifold is then defined by a metric tensor given by hessian of
these dual potential functions:
˜
∂ 2 ∂ 2 Ω̃
gij = and gij⊂ = (7.6)
∂ Ψ̃i ∂ Ψ̃j ∂ H̃i ∂ H̃j
7 Eidetic Reduction of Information Geometry 151
We define Koszul–Vinberg hessian metric on convex sharp cone, and observe that
the Fisher information metric of Information Geometry coincides with the canonical
Koszul Hessian metric (Koszul form) [1, 27–32]. We also observe, by Legendre
duality (Legendre transform of Koszul characteristic function logarithm), that we are
able to introduce a Koszul Entropy, that plays the role of general Entropy definition.
Koszul [1, 27, 32] and Vinberg [33, 146] have introduced an affinely invariant
Hessian metric on a sharp convex cone Φ through its characteristic function Δ.
In the following, Φ is a sharp open convex cone in a vector space E of finite dimen-
sion on R (a convex cone is sharp if it does not contain any full straight line). In dual
space E ⊂ of E, Φ⊂ is the set of linear strictly positive forms on Φ − {0} and Φ⊂ is
the dual cone of Φ and is a sharp open convex cone. If Γ ≡ Φ⊂ , then the intersection
Φ ← {x ≡ E/ ∗x, Γ ∈ = 1} is bounded. G = Aut(Φ) is the group of linear transform
of E that preserves Φ. G = Aut(Φ) operates on Φ⊂ by ∀g ≡ G = Aut(Φ) , ∀Γ ≡ E ⊂
then g̃ · Γ = Γ ≤ g−1 .
Koszul–Vinberg Characteristic function definition: Let dΓ be the Lebesgue mea-
sure on E ⊂ , the following integral:
⎞
ΔΦ (x) = e−∗Γ,x∈ dΓ ∀x ≡ Φ (7.7)
Φ⊂
152 F. Barbaresco
• ΔΦ is logarithmically strictly convex, and ϕΦ (x) = log (ΔΦ (x)) is strictly convex
Koszul 1-form α: The differential 1-form
β = Dα = d 2 log ΔΦ (7.12)
g = d 2 log ΔΦ (7.14)
⎜ ⎞
Δu d 2 log Δu du
d log Δ(x) = d
2 2
log Δu du =
Δu du
1 Δu Δv (d log Δu − d log Δv )2 dudv
+
2 Δu Δv dudv
7 Eidetic Reduction of Information Geometry 153
⎞ ⎞
x⊂ = Γ · e−∗Γ,x∈ dΓ / e−∗Γ,x∈ dΓ ,
Φ⊂ Φ⊂
−x ⊂ , h = dh log ΔΦ (x)
⎞ ⎞
= − ∗Γ, h∈ e−∗Γ,x∈ dΓ / e−∗Γ,x∈ dΓ (7.16)
Φ⊂ Φ⊂
From this last equation, we can deduce “Koszul Entropy” defined as Legendre
Transform of minus logarithm of Koszul–Vinberg characteristic function Ω(x):
Ω⊂ (x ⊂ ) = x, x ⊂ − Ω(x) with x ⊂ = Dx Ω and x = Dx⊂ Ω⊂ where Ω(x) = − log ΔΦ (x)
(7.17)
& ' $ %
Ω⊂ (x ⊂ ) = (Dx Ω)−1 (x ⊂ ), x ⊂ − Ω (Dx Ω)−1 (x ⊂ ) ∀x ⊂ ≡ {Dx Ω(x)/x ≡ Φ}
(7.18)
and
Ω⊂ (x ⊂ ) = ∗x, x ⊂ ∈ − Ω(x) = − log e−∗Γ,x∈ · e−∗Γ,x∈ dΓ / e−∗Γ,x∈ dΓ + log e−∗Γ,x∈ dΓ
*+ , Φ ⊂ Φ ⊂
- Φ⊂
Ω⊂ (x ⊂ ) = e−∗Γ,x∈ dΓ · log e−∗Γ,x∈ dΓ − log e−∗Γ,x∈ · e−∗Γ,x∈ dΓ / e−∗Γ,x∈ dΓ
Φ⊂ Φ⊂ Φ⊂ Φ⊂
(7.20)
154 F. Barbaresco
with
⎞
−∗x,Γ ∈−log e−∗Γ,x∈ dΓ
px (Γ ) = e −∗Γ,x∈
e−∗Γ,x∈
dΓ = e Φ⊂ = e−∗x,Γ ∈+Ω(x) and
Φ⊂
⎞
x⊂ = Γ · px (Γ )dΓ (7.22)
Φ⊂
−∗Γ,x∈
We will call px (Γ ) = e
−∗Γ,x∈ dΓ the Koszul Density, with the property that:
Φ⊂ e
⎞
log px (Γ ) = − ∗x, Γ ∈ − log e−∗Γ,x∈ dΓ = − ∗x, Γ ∈ + Ω(x) (7.23)
Φ⊂
and ⎨ ⎩
EΓ − log px (Γ ) = x, x ⊂ − Ω(x) (7.24)
The meaning of this relation is that “Barycenter of Koszul Entropy is Koszul Entropy
of Barycenter”.
This condition is achieved for x ⊂ = Dx Ω taking into account Legendre Transform
property:
⎨ ⎩
Legendre Transform: Ω* (x ⊂ ) = Sup x, x ⊂ − Ω(x)
x
* ⊂ ⊂
x ∈ − Ω(x)
Ω (x ) ≥ ∗x,
⇒ Ω* (x ⊂ ) ≥ Ω⊂ (Γ )px (Γ )dΓ
Φ⊂
$ ⊂ %
Ω⊂ (x ⊂ ) ≥ E Ω (Γ )
⇒ (7.29)
equality for x ⊂ = dΩ
dx
Classically, the density given by Maximum Entropy Principle [188–190] is given by:
⎤ ⎭
⎫
⎞ px (Γ )dΓ = 1
Max ⎥− px (Γ ) log px (Γ )dΓ such Φ
⎬ ⊂
(7.30)
px (.) Γ · px (Γ )dΓ = x ⊂
Φ⊂ Φ⊂
e−∗Γ,x∈ dΓ
If we take qx (Γ ) = e−∗Γ,x∈ / Φ⊂ e−∗Γ,x∈ dΓ = e−∗x,Γ ∈−log Φ⊂ such that:
⎭
qx (Γ ) · dΓ = e−∗Γ,x∈ dΓ / e−∗Γ,x∈ dΓ = 1
Φ⊂ Φ⊂ Φ⊂
−∗x,Γ ∈−log e−∗Γ,x∈ dΓ (7.31)
= − ∗x, Γ ∈ − log e−∗x,Γ ∈ dΓ
log qx (Γ ) = log e Φ⊂
Φ⊂
Then by using the fact that log x ≥ 1 − x −1 with equality if and only if x = 1 , we
find the following:
⎞ ⎞ ⎪
px (Γ ) qx (Γ )
− px (Γ ) log dΓ ⊆ − px (Γ ) 1 − dΓ (7.32)
qx (Γ ) px (Γ )
Φ⊂ Φ⊂
⎞ . ⎞ / ⎞
− px (Γ ) log px (Γ )dΓ ⊆ x, Γ · px (Γ )dΓ + log e−∗x,Γ ∈ dΓ (7.36)
Φ⊂ Φ⊂ Φ⊂
If we take x ⊂ = Γ · px (Γ )dΓ and Ω(x) = − log Φ⊂ e−∗x,Γ ∈ dΓ , then we deduce
Φ⊂
−∗x,Γ ∈−log Φ⊂ e−∗Γ,x∈ dΓ is
that the Koszul density qx (Γ ) = e−∗Γ,x∈ / Φ⊂ e−∗Γ,x∈
dΓ = e
the Maximum Entropy solution constrained by Φ⊂ px (Γ )dΓ = 1 and Φ⊂ Γ · px (Γ )
dΓ = x ⊂ :
⎞
− px (Γ ) log px (Γ )dΓ ⊆ x, x ⊂ − Ω(x) (7.37)
Φ⊂
⎞
− px (Γ ) log px (Γ )dΓ ⊆ Ω⊂ (x ⊂ ) (7.38)
Φ⊂
We have then proved that Koszul Entropy provides density of Maximum Entropy:
e− Γ,Ψ (Γ̄ )
−1
dΩ(x)
pΓ̄ (Γ ) = with x = Ψ−1 (Γ̄ ) and Γ̄ = Ψ(x) = (7.39)
e−∗Γ,Ψ (Γ̄ )∈ dΓ
−1
dx
Φ⊂
where ⎞ ⎞
Γ= Γ · pΓ̄ (Γ )dΓ and Ω(x) = − log e−∗x,Γ ∈ dΓ (7.40)
Φ⊂ Φ⊂
We can then deduce Maximum Entropy solution without solving Classical vari-
ational problem with Lagrangian hyperparameters, but only by inversing function
Ψ(x) .This remark was made by Jean-Marie
⎪ Souriau in the paper [97], if we take vec-
z
tor with tensor components Γ = , components of Γ̄ will provide moments of
z∧z
first and second order of the density of probability pΓ̄ (Γ ), that is defined by Gaussian
7 Eidetic Reduction of Information Geometry 157
1
∗Γ, x∈ = aT z + zT Hz (7.41)
2
1 $ T −1 $ % %
Ω(x) = − a H a + log det H −1 + n log(2π ) (7.42)
2
We can prove that first moment is equal to −H −1 a and that components of variance
tensor are equal to elements of matrix H −1 , that induces the second moment. The
Koszul Entropy, its Legendre transform, is then given by:
1$ $ % %
Ω⊂ (Γ̄ ) = log det H −1 + n log (2π · e) (7.43)
2
dΩ dΩ⊂
Ω⊂ (x ⊂ ) + Ω(x) = x, x ⊂ with x ⊂ = and x = where Ω(x) = − log ΔΦ (x)
dx dx ⊂
( dΩ
⊂ d2Ω
= dx
⊂
$ 2 ⊂ %−1
dx ⊂ = x ⇒ dx 2
⊂
dx
⇒ d 2 Ω d 2 Ω⊂
· = 1 ⇒ d2Ω
= d Ω
d Ω
2
dx ⊂ = x
dΩ
= dx 2 2 2 2
2 ⊂
dx dx ⊂ dx dx ⊂
dx ⊂ dx
$ 2 ⊂ %−1 $ 2 ⊂ %2 2 ⊂
ds2 = − ddxΩ2 dx 2 = − d Ω⊂2 · d Ω⊂2 · dx ⊂ = − d Ω⊂2
2
⇒ · dx ⊂2
dx dx dx
(7.44)
$ 2 ⊂ %−1
The relation ddxΩ2 = d Ω⊂2
2
has been established by J.P. Crouzeix in 1977 in a short
dx
communication [176] for convex smooth functions and their Legendre transforms.
This result has been extended for non-smooth function by Seeger [177] and Hiriart-
Urruty [178], using polarity relationship between the second-order subdifferentials.
This relation was mentioned in texts of variational calculus and theory of elastic
materials (with work potentials) [178].
This last relation has also be used in the framework of the Monge-Ampere measure
associated to a convex function, to prove equality with Lebesgue measure λ:
⎞
mΩ () = ϕ(x)dx = λ ({∀φ(x)/x ≡ }) (7.45)
$ %
∀ ≡ BΦ (Borel set in Φ) and φ(x) = det ∀ 2 Ω(x)
158 F. Barbaresco
⎨ ⎩−1
That is proved using Crouezix relation: ∀ 2 Ω(x) = ∀ 2 Ω (∀Ω⊂ (y)) = ∀ 2 Ω⊂ (y)
⎞ ⎞ $ %
mΩ () = ϕ(x)dx = det ∀ 2 Ω(x) · dx
⎞ $ % $ %
mΩ () = det ∀ 2 Ω ∀Ω⊂ (y) · det ∀ 2 Ω⊂ (y) dy
(∀Ω⊂ )−1 (A)
⎞
= 1 · dy = λ ({∀φ(x)/x ≡ }) (7.46)
∀Ω()
To make the link with Fisher metric given by Fisher Information matrix I(x) , we
can observe that the second derivative of log px (Γ ) is given by:
⎞
−∗x,Γ ∈−log e−∗Γ,x∈ dΓ
−∗Γ,x∈ −∗Γ,x∈
px (Γ ) = e / e dΓ = e Φ⊂
Φ⊂
⎞
⇒ log px (Γ ) = − ∗x, Γ ∈ − log e−∗Γ,x∈ dΓ (7.47)
Φ⊂
with Ω(x) = − log e−∗Γ,x∈ dΓ = − log Ω (x)
Φ⊂
* -
∂ 2 log px (Γ ) ∂ 2 Ω(x) ∂ 2 log px (Γ ) ∂ 2 Ω(x) ∂ 2 log ΔΦ (x)
= ⇒ I(x) = −EΓ =− =
∂x 2 ∂x 2 ∂x 2 ∂x 2 ∂x 2
(7.48)
⎜
∂ 2 log px (Γ ) ∂ 2 log ΔΦ (x)
I(x) = −EΓ = (7.49)
∂x 2 ∂x 2
We could then deduce the close interrelation between Fisher metric and hessian of
Koszul–Vinberg characteristic logarithm, that are totally equivalent.
Koszul [1] and Vey [2, 160] have developed these results with the following theorem
for connected Hessian manifolds:
7 Eidetic Reduction of Information Geometry 159
Last contributor is Rothaus [147] that has studied the construction of geodesics for
this hessian metric geometry, using the following property:
⎪
1 il ∂gkl ∂gjl ∂gjk 1 il ∂ 3 log ΔΦ (x) ∂ 2 log ΔΦ (x)
jk
i
= g + − = g with gij =
2 ∂xj ∂xk ∂xl 2 ∂xj ∂xk ∂xl ∂xi ∂xj
(7.50)
or expressed also according the Christoffel symbol of the first kind:
⎪
⎨ ⎩ 1 ∂gkl ∂gjl ∂gjk 1 ∂ 3 log ΔΦ (x)
i, jk = + − = (7.51)
2 ∂xj ∂xk ∂xl 2 ∂xj ∂xk ∂xl
That we can put in vector form using notations x ⊂ = −d log ΔΦ and Fisher matrix
I(x) = d 2 log ΔΦ :
d 2 x d 2 x⊂
I(x) 2 − =0 (7.56)
ds ds2
Using Koszul results of Sect. 7.4.2, let v be the volume element of g. We define a
closed 1-form α and β a symmetric bilinear form by: DX v = α(X)v and β = Dα.
The forms α and β, called the first Koszul form and the second Koszul form for a
Hessian structure (D; g) respectively are given by:
⎨ ⎩1/2 1 ∂ ⎨ ⎩ 1
v = det gij dx √ · · · dx n ⇒ αi = i log det gij 2 v and
∂x ⎨ ⎩
∂αi 1 ∂ 2 log det gkl
βij = j =
∂x 2 ∂x i ∂x j
The pair (D; g) of flat connection D and Hessian metric g define the Hessian
structure. As seen previously, Koszul studied flat manifolds endowed with a closed
1-form α such that Dα is positive definite, whereupon Dα is a Hessian metric. The
Hessian structure (D; g) is said to be of Koszul type, if there exists a closed 1-form α
such that g = Dα. The second Koszul form β plays a role similar to the Ricci tensor
for Kählerian metric.
We can then apply this Koszul geometry framework for cones of Symmetric
Positive Definite Matrices.
Let the inner product ∗x, y∈ = Tr (xy) , ∀x, y ≡ Symn (R), Φ be the set of symmetric
positive definite matrices is an open convex cone and is self-dual Φ⊂ = Φ.
7 Eidetic Reduction of Information Geometry 161
⎞
n+1
ΔΦ (x) = e−∗Γ,x∈ dΓ = det x − 2 Δ(In ) (7.57)
∗x, y∈ = Tr(xy)
Φ⊂
Φ⊂ = Φ self-dual
n+1 2 n+1
g = d 2 log ΔΦ = − d log det x and x ⊂ = −d log ΔΦ = d log det x
2 2
n + 1 −1
= x (7.58)
2
Sasaki approach could also be used for the regular convex cone consisting of
all positive definite symmetric matrices of degree n. (D, Dd log det x) is a Hessian
structure on Φ, and each level surface of det x is a minimal surface of the Riemannian
manifold (Φ, g = −Dd log det x) .
Koszul [27] has introduced another 1-form definition for homogeneous bounded
domain given by:
1
α = − d (X) with (X) = Trg/b [ad (JX) − Jad(X)] ∀X ≡ g (7.59)
4
We can illustrate this new Koszul expression for Poincaré’s Upper Half Plane V =
{z = x + iy/y > 0} (most simple symmetric homogeneous bounded domain). Let
vector fields X = y dx
d
and Y = y dy
d
, and J tensor of complex structure V defined by
(
Tr [ad (JX) − Jad (X)] = 2
JX = Y . As [X, Y ] = −Y and ad (Y ) . Z = [Y , Z] then
Tr [ad (JY ) − Jad (Y )] = 0
(7.60)
The Koszul 1-form and then the Koszul/Poincaré metric is given by:
dx 1 1 dx √ dy dx 2 + dy2
(X) = 2 ⇒ α = − d = − 2
⇒ ds2 = (7.61)
y 4 2 y 2y2
Geometry, this metric is metric for multivariate Gaussian law of covariance matrix
R and zero mean. This metric will be studied in Sect. 7.7
Convex Optimization theory has developed the notion of Universal barrier that could
be also interpreted as Koszul–Vinberg Characteristic function. Homogeneous cones
that have been studied by Elie Cartan, Carl Ludwig Siegel, Ernest Vinberg and Jean-
Louis Koszul are very specific cones because of their invariance properties under
linear transformations. They were classified and algebraic constructed by Vinberg
using Siegel domains. Homogeneous cones [41, 42, 53–55] have also been studied
for developing interior-point algorithms in optimization, through the notion of self-
concordant barriers. Nesterov and Nemirovskii [145] have shown that any cone in
Rn admits a logarithmically homogeneous universal barrier function, defined as a
volume integral. Güler and Tunçel [37, 38, 143, 148] have used a recursive scheme
to construct a barrier through Siegel domains referring to a more general construction
of Nesterov and Nemirovskii. More recently, Olena Shevchenko [142] has defined a
recursive formula for optimal dual barrier functions on homogeneous cones, based
on the primal construction of Güler and Tunçel by means of the dual Siegel cone
construction of Rothaus [144, 147].
Vinberg and Gindikin have proved that every homogeneous cones can be obtained
by a Siegel construction. To illustrate Siegel construction, we can consider K an
homogeneous cone in Rk and B(u, u) ≡ K a K-bilinear symmetric form, Siegel Cone
is then defined by:
0 1
CSiegel (K, B) = (x, u, t) ≡ Rk xRp xR/t > 0, tx − B(u, u) ≡ K (7.64)
and is homogeneous.
Rothaus has considered dual Siegel domain construction by means of a symmetric
linear mapping U(y) (positive definite for y ≡ int K ⊂ ) defined by ∗U(y)u, v∈ =
∗B(u, v), y∈, from which we can define dual Siegel cone:
0 & '1
⊂
CSiegel (K, B) = (y, v, s) ≡ K ⊂ xRp xR/t > 0, s > U(y)−1 v, v (7.65)
product ∗(x,
With respect to the inner u, t), (y, v, s)∈ = ∗x, y∈ + 2 ∗u, v∈ + st.
With the condition s > U(y)−1 v, v that is equivalent by means of Schur complement
technique to:
7 Eidetic Reduction of Information Geometry 163
⎜
s vT
>0 (7.66)
v U(y)
⎞+→ ⎞+→ $ %
ΔX (z) = e dF(x) =
izx
eizx p(x) · dx = E eizx (7.67)
−→ −→
Let μ be a positive Borel Measure on euclidean space V. Assume that the following
integral is finite for all x in an open set
⎞
Φ ⊗ V : Δx (y) = e−∗y,x∈ dμ(x) (7.68)
1 −∗y,x∈
p (y, dx) = e dμ(x) (7.69)
Δx (y)
and covariance
⎞
∗V (y)u, v∈ = ∗x − m(y), u∈ ∗x − m(y), v∈ p(y, dx) = Du Dv log Δx (y) (7.71)
1 $ %
Lim log EPn en·f (Yn ) = Sup {f (x) − I(x)} (7.72)
n◦→ n x≡Ψ
The asymptotic behavior of the last integral is determined by the largest value of
the integrand [10]:
$ % 1 ⎜ 0 1
1 n·f (Yn ) n·[f (x)−I(x)]
Lim log EPn e = log Sup e = Sup {f (x) − I(x)}
n◦→ n n x≡Ψ x≡Ψ
(7.74)
The generating function of n is defined as:
7 Eidetic Reduction of Information Geometry 165
⎞
⎨ ⎩
Δn (x) = E enx n = enxς P ( n ≡ dς ) (7.75)
In both expressions, the integral is over the domain of n . The function φ (x) defined
by the limit: φ (x) = Lim n1 log Δn (x) is called the scaled cumulant generating
n◦→
function of n . It is also called the log-generating function or free energy function
of n . In the following in thermodynamic, it will be called Massieu Potential. The
existence of this limit is equivalent to writing Δn (x) ↔ en·φ(x) .
Gärtner-Ellis Theorem: If φ(x) is differentiable, then n satisfies a large deviation
principle with rate function I (ς ) by the Legendre–Moreau [158] transform of φ(x):
= dq − q̇ · dt, Dedecker [15] has observed, that the property that among all forms
θ ∓ L·dt mod the form ω = p·dq−H ·dt is the only one satisfying dθ ∓ 0 mod ,
is a particular case of more general Lepage congruence [16] related to transversally
condition (Fig. 7.6).
Contact geometry was used in Mechanic [11, 12] and in Thermodynamic [13,
14, 17, 18, 24, 25], where integral submanifolds of dimension n in 2n + 1 dimen-
sional contact manifold are called Legendre submanifolds. A smooth fibration of a
contact manifold, all of whose are Legendre, is called a Legendre Fibration. In the
neighbourhood of each point of the total space of a Legendre Fibration there exist
contact Darboux coordinates (z, q, p) in which the fibration is given by the projection
(z, q, p) => (z, q). Indeed, the fibres (z, q) = cst are Legendre subspaces of the stan-
dard contact space. A Legendre mapping is a diagram consisting of an embedding
of a smooth manifold as a Legendre submanifold in the total space of a Legendre
fibration, and the projection of the total space of the Legendre fibration onto the base.
Let us consider the two Legendre fibrations of the standard contact space R2n+1 of
1—jets of functions on Rn : (u, p, q) ∨◦ (u, q) and (u, p, q) ∨◦ (p · q − u, p), the
projection of the l-graph of a function u = S(q) onto the base of the second fibration
gives a Legendre mapping:
⎪
∂S ∂S
q ∨◦ q − S(q), (7.82)
∂q ∂q
If S is convex, the front of this mapping is the graph of a convex function, the
Legendre transform of the function S:
⊂
S (p), p (7.83)
y = f (x1 , . . . , xn ) (7.84)
The natural contact structure of this space is defined by the following condition:
the 1-graphs {x, y = f (x), p = ∂f /∂x} ⊗ J 1 (V n , R) of all the functions on V should
be the tangent structure hyper-plane at every point. In coordinates, this conditions
means that the 1-form dy − p.dx should vanish on the hyper-planes of the contact
field (Fig. 7.7).
In thermodynamics, the Gibbs 1-form dy − p · dx will be given by:
dE = T · dS − p · dV (7.86)
The manifold of contact elements of the projective space has two natural contact
structures:
• The first is the natural contact structure of the manifold of contact elements of the
original projective space.
• The second is the natural contact structure of the manifold of contact elements of
the dual projective space.
The dual of the dual hyper-surface is the initial hyper-surface (at least if both are
smooth for instance for the boundaries of convex bodies). The affine or coordinate
version of the projective duality is called the Legendre transformation. Thus contact
geometry is the geometrical base of the theory of Legendre transformation.
In 1869, François Massieu, French Engineer from Corps des Mines, has presented
two papers to French Science Academy on “characteristic function” in Thermody-
namic [3–5]. Massieu demonstrated that some mechanical and thermal properties of
physical and chemical systems could be derived from two potentials called “charac-
teristic functions”. The infinitesimal amount of heat dQ received by a body produces
external work of dilatation, internal work, and an increase of body sensible heat.
The last two effects could not be identified separately and are noted dE (function
E accounted for the sum of mechanical and thermal effects by equivalence between
heat and work). The external work P · dV is thermally equivalent to A · P · dV (with A
the conversion factor between mechanical and thermal measures). The first principle
provides
dQ dQ = dE +A·P ·dV . For a closed reversible cycle (Joule/Carnot principles)
T = 0 that is the complete differential dS of a function S of dS = dQ
T .
If we select volume V and temperature T as independent variables:
T · dS = dQ ⇒ T · dS − dE = A · P · dV ⇒ d (TS) − dE = S · dT + A · P · dV (7.87)
∂H ∂H
If we set H = TS − E, then we have dH = S · dT + A · P · dV = · dT + · dV
∂T ∂V
(7.88)
Massieu has called H the “characteristic function” because all body characteristics
could be deduced of this function: S = ∂H 1 ∂H ∂H
∂T , P = A ∂V and E = TS − H = T ∂T − H
Massieu introduced a second characteristic function:

H′ = H − AP·V   (7.89)
We have:

dH′ = dH − AP·dV − AV·dP = S·dT − AV·dP = (∂H′/∂T)·dT + (∂H′/∂P)·dP   (7.90)

And we can deduce:

S = ∂H′/∂T and V = −(1/A)·(∂H′/∂P)   (7.91)

And the inner energy:

E = TS − H = TS − H′ − AP·V ⇒ E = T·(∂H′/∂T) − H′ + P·(∂H′/∂P)   (7.92)
The most important result of Massieu consists in deriving all body proper-
ties dealing with thermodynamics from the characteristic function and its derivatives:
"I show, in this memoir, that all the properties of a body can be deduced from a
single function, which I call the characteristic function of this body" [5].
Massieu's results were extended by Gibbs, who showed that Massieu functions play
the role of potentials in the determination of the states of equilibrium of a given
system, and by Duhem [6–9], who, in a more general presentation, introduced thermo-
dynamic potentials, putting forward the analytic development of the mechanical
theory of heat.
In thermodynamics, the Massieu potential φ(β) = −kβ·F (with F the free energy
and E the inner energy) is the Legendre transform of the entropy, and depends on
the inverse temperature β = 1/kT. The Legendre transform of the Massieu potential
recovers the entropy S (up to sign):

L(φ) = kβ·(∂φ(β)/∂(kβ)) − φ(β) = kβ·(−E) + kβ·F = kβ·(F − E) = −S   (7.96)
To conclude this section on the "characteristic function" in thermodynamics, we
introduce recent works of Pavlov and Sergeev [111] from the Steklov Mathematical
Institute, who have used ideas of Kozlov [112], Berezin [113], Poincaré [114] and
Carathéodory [115] to analyze the geometry of the space of thermodynamical states,
and demonstrate that the differential-geometric structure of this space singles out entropy
in the same way that mechanics singles out the Lagrange function, in the sense that determining the entropy
completely characterizes a thermodynamic system (this development makes a
link with our Sect. 7.4.11 related to the Legendre transform in mechanics). They have
demonstrated that thermodynamics is not separated from the other branches of theoretical
physics on the level of the mathematical apparatus.
Technically, both analytic mechanics and thermodynamics are based on the same
dynamical principle, which we can use to obtain the thermodynamic equations as
equations of motion of a system with non-holonomic constraints. The special feature
of thermodynamics is that the number of these constraints coincides with the number
of degrees of freedom.
Using the notation β = 1/kT, the laws of equilibrium thermodynamics are based
on the Gibbs distribution:

p_Gibbs(H) = (1/Z)·e^{−βH}   (7.97)
where H(p, q, a) is the Hamilton function depending on the external parameters
a = (a_1, ..., a_k), and the normalizing factor is:

Z(β, a) = ∫_N e^{−βH} dω_N   (7.98)

E = ⟨Ĥ⟩ = −∂log Z/∂β = ∫_N H·(e^{−βH}/Z) dω_N = ∫_N H·p_Gibbs(H) dω_N   (7.99)
providing:
• the second law of thermodynamics:

ω_Q = T·dS   (7.104)

dE = T·dS − A·da or dF = −S·dT − A·da with F = E − TS   (7.105)

From the last Gibbs equation, the coefficients of the 1-forms are gradients:

S = −∂F/∂T and A = −∂F/∂a   (7.106)
and the free energy is a first integral of this system of differential equations.
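A minimal discrete sketch of (7.97)–(7.99), with hypothetical energy levels standing in for the integral over N, checks that the mean energy obtained directly from the Gibbs weights equals −∂log Z/∂β:

```python
import numpy as np

H_levels = np.array([0.0, 1.0, 2.5, 4.0])   # hypothetical energy levels

def log_Z(beta):
    return np.log(np.sum(np.exp(-beta * H_levels)))

beta = 0.7
p_gibbs = np.exp(-beta * H_levels - log_Z(beta))        # Gibbs weights (7.97)
E_direct = np.sum(H_levels * p_gibbs)                   # E = sum H * p_Gibbs(H)
h = 1e-6
E_deriv = -(log_Z(beta + h) - log_Z(beta - h)) / (2*h)  # E = -d log Z / d beta
assert np.isclose(E_direct, E_deriv)
```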
Pavlov and Sergeev note that the matrix of the 2-form Φ_Q = dω_Q is degenerate and
has rank 2 (the nonzero elements of the 2-form matrix Φ_Q are concentrated in the
first row and first column):

Φ_Q = (∂A/∂T)·dT ∧ da = −(∂²F/∂a∂T)·dT ∧ da   (7.107)
They observe that from the standpoint of the analogy with mechanical dynamical
systems, the state space of a thermodynamic system (with local coordinates) is a
configuration space. By physical considerations (temperature, volume, mean particle
numbers, and other characteristics are bounded below by zero), it is a manifold with
a boundary. They obtain the requirement natural for foliation theory that the fibers
be transverse to the two-dimensional underlying manifold.
In Pavlov and Sergeev's formulation, the thermodynamic analysis of a dynamical
system begins with a heuristic definition of the system entropy as a function of
dynamical variables constructed based on general symmetry properties and the geo-
metric structure of the space of thermodynamic states is fixed to be a foliation of
codimension two.
Jean-Marie Souriau [97–106], a student of Elie Cartan at ENS Ulm, has given a
covariant definition of thermodynamic equilibria and has formulated statistical
mechanics [161, 162] and thermodynamics in the framework of symplectic geom-
etry, by use of symplectic moments and distribution-tensor concepts, giving a geo-
metric status to temperature and entropy. This work has been extended by Vallée
and de Saxcé [107, 110, 156], Iglésias [108, 109] and Dubois [157].
The first general definition of the "moment map" (constant of the motion for
dynamical systems) was introduced by J. M. Souriau during the 1970s, geometrically
generalizing such earlier notions as the Hamiltonian and the invariant theorem of
Emmy Noether describing the connection between symmetries and invariants (which
corresponds to the moment map for a one-dimensional Lie group of symmetries). In symplectic
geometry the analog of Noether's theorem is the statement that the moment map of a
Hamiltonian action which preserves a given time evolution is itself conserved by this
time evolution. The conservation of the moment of a Hamiltonian action was called
by Souriau the "symplectic or geometric Noether theorem" (considering phase
space as a symplectic manifold, the cotangent bundle of configuration space with its canonical
symplectic form: if the Hamiltonian is invariant under the action of a Lie group, the moment map is constant along
the system's integral curves; Noether's theorem is obtained by considering independently
each component of the moment map).
In the previous approach based on Koszul's work, we have defined two convex functions
Ω(x) and Ω*(x*) with dual systems of coordinates x and x* on dual cones Φ and Φ*:

Ω(x) = −log ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ, ∀x ∈ Φ and Ω*(x*) = ⟨x, x*⟩ − Ω(x) = −∫_{Φ*} p_x(Γ)·log p_x(Γ)·dΓ   (7.108)

where

x* = ∫_{Φ*} Γ·p_x(Γ)·dΓ and p_x(Γ) = e^{−⟨Γ,x⟩} / ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ = e^{−⟨x,Γ⟩+Ω(x)}   (7.109)

with

x* = ∂Ω(x)/∂x and x = ∂Ω*(x*)/∂x*   (7.110)
Souriau introduced these relations in the framework of variational problems,
to extend them with a covariant definition. Let M be a differentiable manifold with
a continuous positive density dω, let E be a finite-dimensional vector space and U(Γ) a con-
tinuous function defined on M with values in E; the continuous positive function p(Γ) is the
solution of this variational problem:

ArgMin_{p(Γ)} [ s = −∫_M p(Γ)·log p(Γ)·dω ] such that ∫_M p(Γ)·dω = 1 and ∫_M U(Γ)·p(Γ)·dω = Q   (7.111)
It is given by:

p(Γ) = e^{Ω(β)−β·U(Γ)} with Ω(β) = −log ∫_M e^{−β·U(Γ)} dω and Q = ∫_M U(Γ)·e^{−β·U(Γ)} dω / ∫_M e^{−β·U(Γ)} dω   (7.112)
The entropy s = −∫_M p(Γ)·log p(Γ)·dω can be stationary only if there exist a scalar Ω
and an element β belonging to the dual of E, where Ω and β are the Lagrange parame-
ters associated with the previous constraints. Entropy appears naturally as the Legendre
transform of Ω:

s(Q) = β·Q − Ω(β)   (7.113)

This value is a strict minimum of s, and the equation Q = ∫_M U(Γ)·e^{−β·U(Γ)} dω / ∫_M e^{−β·U(Γ)} dω has
at most one solution for each value of Q. The function Ω(β) is differentiable
and we can write dΩ = dβ·Q; identifying E with its bidual:

Q = ∂Ω/∂β   (7.114)

Uniform convergence of ∫_M U(Γ) ⊗ U(Γ)·e^{−β·U(Γ)} dω proves that −∂²Ω/∂β² > 0 and
that −Ω(β) is convex. Then Q(β) and β(Q) are mutually inverse and differentiable,
and ds = β·dQ. Identifying E with its bidual:

β = ∂s/∂Q   (7.115)
Classically, if we take U(Γ) = (Γ, Γ ⊗ Γ), the components of Q provide the moments of
first and second order of the probability density p(Γ), which is then defined by a Gaussian
law.
Souriau has applied this approach to classical statistical mechanical systems,
considering a mechanical system with n parameters q_1, ..., q_n.
Applied to free particles in an ideal gas, equilibrium is given for β = 1/kT (with k
the Boltzmann constant); if we set S = k·s, the previous relation ds = β·dQ provides
dS = dQ/T and S = ∫dQ/T, and Ω(β) is identified with the Massieu–Duhem potential.
We recover also the Maxwell speed law:

p(Γ) = const·e^{−H/kT}   (7.118)
We can then consider the time variable t as another variable q_j through an arbitrary
parameter τ, and define the new variational problem:

d∫_{t_0}^{t_1} L(q_J, q̇_J)·dτ = 0 with t = q_{n+1}, q̇_J = dq_J/dτ and J = 1, 2, ..., n+1   (7.120)
where

L(q_J, q̇_J) = l(t, q_j, q̇_j/ṫ)·ṫ   (7.121)

p_{n+1} = l − Σ_j p_j·(dq_j/dt)   (7.122)
with

s(Q) = β·Q − Ω(β), dΩ = dβ·Q and ds = β·dQ   (7.126)
A statistical state p(Γ) is invariant by δ if δ[p(Γ)] = 0 for all Γ (then p(Γ) is invariant
by the finite transforms of G generated by δ).
J. M. Souriau gave the following theorem: an equilibrium state allowed by a group G
is invariant by an element δ of the Lie algebra A if and only if [δ, β] = 0 (with [·,·]
the Lie bracket), where β = ∂s/∂Q is the generalized equilibrium temperature, with
∫_M H(Γ)·p(Γ)·dω = Q, s = −∫_M p(Γ)·log p(Γ)·dω and ∫_M p(Γ)·dω = 1.
For classical thermodynamics, where G is the abelian group of translations with respect
to time t, all equilibrium states are invariant under G.
For the group of transformations of space–time, the elements of the Lie algebra of G can be
defined as vector fields on space–time. The generalized temperature β previously
defined would then also be a vector field. For each point of the manifold M, we
can then define:
• the vector temperature:

β_M = V/kT   (7.127)

with
• the unitary mean speed:

V = β_M/|β_M| with |V| = 1   (7.128)
The Information Geometry metric has been extended to quantum physics. In 1986, Roger
Balian introduced a natural metric structure for the space of states D̂ in quantum
mechanics, from which we can deduce the distance between a state and one of its
approximations [88–90, 92–95]. Based on quantum information theory, Roger Balian
has built this metric on physical grounds as a Hessian metric:
he has looked for a physically meaningful metric on the space of states
that depends solely on this space and not on the two dual spaces. He then considers
the entropy S associated with a state D̂ that gathers all the information available on
an ensemble E of systems. The quantity S measures the amount of missing
information due to the probabilistic character of predictions.
Von Neumann introduced, in 1932, an extension of the Boltzmann–Gibbs entropy
of classical statistical mechanics:

S(D̂) = −Tr[D̂·log(D̂)]   (7.134)
used by Jaynes to assign a state to a quantum system when only partial information
about it is available based on:
• Maximum Entropy Principle: among all possible states that are compatible with
the known expectation values, the one which yields the maximum entropy under
constraints on these data is selected.
that is derived from Laplace’s principle of insufficient reason:
• Identification of expectation values in the Bayesian sense with averages over a
large number of similar systems
• The expectation value of each observable for a given system is equal to its average
over E showing that equi-probability for E implies maximum entropy for each
system.
Then von Neumann’s entropy has been interpreted by Jaynes and Brillouin as:
• Entropy = measure of missing information: the state found by maximizing this
entropy can be regarded as the least biased one, as it contains no more information
than needed to account for the given data.
Different authors have shown that von Neumann Entropy is the only one that
is physically satisfactory, since it is characterized by the properties required for a
measure of disorder:
• Additivity for uncorrelated systems
• Subadditivity for correlated systems (suppressing correlations raises the uncer-
tainty)
• Extensivity for a macroscopic system
• Concavity (putting together two different statistical ensembles for the same system
produces a mixed ensemble which has a larger uncertainty than the average of the
uncertainties of the original ensembles).
These properties are also requested to ensure that the maximum entropy criterion
follows from Laplace’s principle, since the latter principle provides an unambiguous
construction of von Neumann’s entropy.
Roger Balian has then developed the geometry derived from von Neumann's
entropy, using two physically meaningful scalar quantities:
• Tr[D̂Ô], the expectation values across the two dual spaces of observables and states;
• S = −Tr[D̂·log(D̂)], the entropy within the space of states.
The entropy S can be written as a scalar product S = −⟨D̂, log(D̂)⟩, where
log(D̂) is an element of the space of observables, allowing a physical geometric struc-
ture in these spaces. The second differential d²S is a negative quadratic form of the
coordinates of D̂, induced by the concavity of the von Neumann entropy S. Then,
Roger Balian has introduced the distance ds between the state D̂ and a neighboring
state D̂ + dD̂ as the square root of:

ds² = −d²S = Tr[dD̂·d log D̂]   (7.135)
where the Riemannian metric tensor is the Hessian of −S(D̂) as function of a set of
independent coordinates of D̂.
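A hedged numerical sketch of (7.135) for a single qubit: the Bloch-vector parameterization and the finite-difference Hessian are our own illustrative choices, not Balian's construction.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], complex)
SY = np.array([[0, -1j], [1j, 0]])
SZ = np.array([[1, 0], [0, -1]], complex)

def rho(r):
    # qubit density operator from a Bloch vector r, with |r| < 1
    return 0.5 * (np.eye(2) + r[0]*SX + r[1]*SY + r[2]*SZ)

def S_vn(r):
    # von Neumann entropy from the eigenvalues of rho
    lam = np.linalg.eigvalsh(rho(r))
    lam = lam[lam > 1e-12]
    return -np.sum(lam * np.log(lam))

def metric(r, h=1e-4):
    # ds^2 = -d^2 S: minus the Hessian of S in the Bloch coordinates
    g = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            ei, ej = h * np.eye(3)[i], h * np.eye(3)[j]
            g[i, j] = -(S_vn(r+ei+ej) - S_vn(r+ei-ej)
                        - S_vn(r-ei+ej) + S_vn(r-ei-ej)) / (4*h*h)
    return g

g = metric(np.array([0.2, 0.1, 0.3]))
assert np.all(np.linalg.eigvalsh(g) > 0)  # concavity of S makes g positive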
We recover the algebraic/geometric duality between D̂ and log D̂ through a Legen-
dre transform:

S(D̂) = F(X̂) − ⟨D̂, X̂⟩ with S(D̂) = −Tr[D̂·log D̂] = −⟨D̂, log D̂⟩   (7.136)

Roger Balian can then define a Hessian metric from F, ds² = d²F, in the conjugate
space X̂:

ds² = −d²S = Tr[dD̂·dX̂] = d²F   (7.140)
• Schrödinger picture: the observables remain fixed while the density operator D̂
changes in time according to the Liouville–von Neumann equation of motion:

i·(dD̂/dt) = [Ĥ, D̂]   (7.141)

• Heisenberg picture: the density operator D̂ remains fixed while the observables
Ô change in time according to Heisenberg's equation of motion:

i·(dÔ/dt) = [Ô, Ĥ]   (7.142)
The Liouville–von Neumann equation describes the transfer of information from
some observables to other ones generated by a completely known dynamics, and
can be modified to account for losses of information during a dynamical process.
The von Neumann entropy, the quantum analogue of the Shannon entropy, measures
in dimensionless units the uncertainty that remains when our statistical knowledge
of the system is summarized.
In the classical limit:
• observables Ô are replaced by commuting random variables, which are functions
of the positions and momenta of the N particles.
• density operators D̂ are replaced by probability densities D in the 6N-dimensional
phase space.
• the trace by an integration over this space.
• the evolution of D is governed by the Liouville equation.
At this step, we make a digression, observing that the Liouville equation was first
introduced by Henri Poincaré in 1906, in a paper entitled "Réflexions sur la
théorie cinétique des gaz", with the following first sentence: "The kinetic theory of gases
still leaves many embarrassing points for those who are accustomed to mathematical
rigour... One of the points which embarrassed me most..."
In this paper, Poincaré introduced the Liouville equation for the time evolution of
the phase-space probability density p(x_i, t):

∂p/∂t + Σ_i ∂(p·X_i)/∂x_i = 0

corresponding to a dynamical system defined by the ordinary differential equations
dx_i/dt = X_i. For a system obeying the Liouville theorem Σ_i ∂X_i/∂x_i = 0, Poincaré
gave the form:

∂p/∂t + Σ_i X_i·∂p/∂x_i = 0

which can also be written ∂p/∂t = {H, p} = Lp in terms of the Poisson
bracket {·,·} of the Hamiltonian H with the probability density p, which defines the
Liouvillian operator L.
Coming back to Balian's theory, we can notice that usually only a small set of data is
controlled:
• Ô_i are the observables that are controlled,
• O_i their expectation values for the considered set of repeated experiments,
• D̂ a density operator of the system, relying on the sole knowledge of the set:

O_i = Tr[Ô_i·D̂]   (7.143)
At this stage, Roger Balian has introduced the notion of relevant entropy as a
generalization of the Gibbs entropy. For any given set of relevant observables Ô_i, whose
expectation values O_i are known, the maximum of the von Neumann entropy
S(D̂) = −Tr[D̂·log D̂] under the constraints O_i = Tr[Ô_i·D̂] is found by introducing
Lagrange multipliers: γ_i associated with each equation O_i = Tr[Ô_i·D̂], and Ω asso-
ciated with the normalization of D̂:
D̂_R = ArgMax_{D̂} [ S(D̂) + Σ_i γ_i·(O_i − Tr[Ô_i·D̂]) ] such that Tr[D̂] = 1   (7.144)

∂Ω(γ_i)/∂γ_i = O_i   (7.147)
Roger Balian gave the name relevant entropy to S(D̂_R) = S_R(O_i), associated
with the set Ô_i of relevant observables selected in the considered situation; it measures
the amount of information which is missing when only the data O_i are available.
The relations ∂Ω(γ_i)/∂γ_i = O_i between the variables γ_i and the variables O_i can
therefore be inverted as:

∂S_R(O_i)/∂O_i = γ_i   (7.149)
i·(dD̂/dt) = [Ĥ, D̂] (the motion equation)   (7.150)

O_i = Tr[Ô_i·D̂]   (7.151)
At initialization, we have:
• O_i(t_0): the initial conditions are transformed into initial conditions on D̂(t_0);
• D̂(t_0): the least biased choice, given by the maximum entropy criterion:

D̂_R = exp[Ω − Σ_i γ_i·Ô_i] with Tr[D̂_R·Ô_i] = O_i and Tr[D̂_R] = 1   (7.152)
Along the exact evolution, the von Neumann entropy is constant (no information is
lost during the evolution), but in general D̂(t) does not keep the exponential form,
which involves only the relevant observables O_i.
The lack of information associated with the knowledge of the variables O_i(t) only
is measured by the relevant entropy:

S_R(O_i) = Σ_i γ_i·O_i − Ω(γ_i)   (7.153)
Generally, S_R(O_i(t)) > S_R(O_i(t_0)), because a part of the initial information on the
set Ô_i has leaked at time t towards irrelevant variables, due to dissipation in the
evolution of the variables O_i(t).
A reduced density operator D̂_R(t) can be built at each time from the set of relevant
variables O_i(t). As regards these variables, D̂_R(t) is equivalent to D̂(t):
D̂_R(t) has the maximum entropy form, retaining no information about the irrel-
evant variables, and is parameterized by the set of multipliers γ_i(t) (in one-to-one
correspondence with the set of relevant variables O_i(t)).
Q = I − P   (7.157)

P² = P, Q² = Q, PQ = QP = 0 and P + Q = I   (7.158)
The density operator D̂ can thereby be split at each time into its reduced relevant
part D̂_R = PD̂ and its irrelevant part D̂_IR = QD̂ (Fig. 7.9):

dD̂/dt = LD̂ ⇒ dD̂_αβ/dt = Σ_{γδ} L_{αβ,γδ}·D̂_δγ = (1/i)·Σ_{γδ} (H_αδ·δ_βγ − H_γβ·δ_αδ)·D̂_δγ   (7.160)
dD̂_R/dt = (dP/dt)·D̂_R + PL·D̂_R + PL·D̂_IR   (7.161)

dD̂_IR/dt = (dP/dt)·D̂_IR − QL·D̂_R − QL·D̂_IR

Roger Balian solves these equations by introducing a superoperator Green func-
tion.
We synthesize in Table 7.1 the results of Sect. 7.4 with the Koszul Hessian
structure of Information Geometry, the Souriau model of statistical physics with its general
concept of geometric temperature, and the Balian model of quantum physics with its notion
of quantum Fisher metric deduced from the Hessian of the von Neumann entropy. The analogies
between the models deal with the characteristic function, entropy, Legendre transform,
density of probability, dual coordinate systems, Hessian metric and Fisher metric.
In this section, we apply the theory developed above to define the geometry of the cones
associated with symmetric/Hermitian positive definite matrices, which correspond to the
case of covariance matrices for stationary signals [35, 36] with very specific matrix
structures [77–79]:
• Toeplitz Hermitian positive definite matrix structure (covariance matrix of a sta-
tionary time series);
• Toeplitz-block-Toeplitz Hermitian positive definite matrix structure (covariance
matrix of a stationary space–time series).
We will prove that the "Toeplitz" structure can be captured by a complex autoregressive
model parameterization. This parameterization can be naturally introduced through
Trench's theorem, Verblunsky's theorem or the partial Iwasawa decomposition theorem.
Table 7.1 Synthesis of the Koszul, Souriau and Balian models, respectively in Hessian geometry, statistical physics and quantum physics

Characteristic function:
• Koszul: Ω(x) = −log ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ, ∀x ∈ Φ
• Souriau: Ω(β) = −log ∫_M e^{−β·U(Γ)} dω
• Balian: Ω(γ_i) = −log Tr[exp(−Σ_i γ_i·Ô_i)]

Entropy:
• Koszul: Ω*(x*) = −∫_{Φ*} p_x(Γ)·log p_x(Γ)·dΓ
• Souriau: s = −∫_M p(Γ)·log p(Γ)·dω
• Balian: S(D̂_R) = S_R(O_i) = −Tr[D̂_R·log D̂_R]

Legendre transform:
• Koszul: Ω*(x*) = ⟨x, x*⟩ − Ω(x)
• Souriau: s(Q) = β·Q − Ω(β)
• Balian: S_R(O_i) = Σ_i γ_i·O_i − Ω(γ_i)

Density of probability:
• Koszul: p_x(Γ) = e^{−⟨Γ,x⟩} / ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ = e^{−⟨x,Γ⟩+Ω(x)}
• Souriau: p(Γ) = e^{Ω(β)−β·U(Γ)}
• Balian: D̂_R = exp[Ω − Σ_i γ_i·Ô_i]

Dual coordinate:
• Koszul: x* = ∫_{Φ*} Γ·p_x(Γ)·dΓ
• Souriau: Q = ∫_M U(Γ)·e^{−β·U(Γ)} dω / ∫_M e^{−β·U(Γ)} dω
• Balian: Tr[D̂_R(t)·Ô_i] = Tr[D̂(t)·Ô_i] = O_i(t)

Dual coordinate system:
• Koszul: x* = ∂Ω(x)/∂x and x = ∂Ω*(x*)/∂x*
• Souriau: Q = ∂Ω/∂β and β = ∂s/∂Q
• Balian: ∂Ω(γ_i)/∂γ_i = O_i and ∂S_R(O_i)/∂O_i = γ_i

Hessian metric:
• Koszul: ds² = −d²Ω*(x*)
• Souriau: ds² = −d²s(Q)
• Balian: ds² = −d²S = Tr[dD̂·dX̂] = d²F

Fisher metric:
• Koszul: I(x) = −E_Γ[∂² log p_x(Γ)/∂x²] = −∂²Ω(x)/∂x²
• Souriau: I(β) = −∂²Ω(β)/∂β²
• Balian: I(γ) = −[∂²Ω/∂γ_i∂γ_j]_{i,j}
where E[log det x] = v with |v| < +∞ ⇒ E[‖x^{−1}‖_F^γ] < +∞   (7.165)
In order to perform this calculation, we need results concerning the Siegel integral
for a positive-definite symmetric real matrix, given by Siegel in his paper "Über die
analytische Theorie der quadratischen Formen":

J(λ, Γ) = ∫_{x>0} e^{−⟨Γ,x⟩}·[det x]^{λ−1} d̃x   (7.168)
Ω_x(θ) = det[I_M − i·Γ^{−1}·θ]^{−(M−1+2λ)/2}   (7.171)

Ω_x(θ) = c_0·J(λ, Γ − iθ) and Ω_x(0) = 1 ⇒ c_0 = 1/J(λ, Γ)   (7.172)
Using:

det(I + h) = 1 + tr(h) + O(‖h‖²_F) and ∂ det b/∂b_jk = (det b)·b^{−1}   (7.175)

E[x_jk] = ((M−1+2λ)/2)·Γ^{−1}_jk   (7.176)
E[x_jk·x_j′k′] = E[x_jk]·E[x_j′k′] + (1/(M−1+2λ))·(E[x_j′k]·E[x_jk′] + E[x_jj′]·E[x_kk′])   (7.177)

The first moments derived from the characteristic function are given by Γ = ((M−1+2λ)/2)·x̄^{−1}.
The characteristic function and probability density function of a positive-definite ran-
dom matrix:

Ω_x(θ) = det[I_M − (2i/(M−1+2λ))·x̄·θ]^{−(M−1+2λ)/2}   (7.178)
β_ME = (2π)^{−n(n−1)/4}·((n+1)/2)^{n(n+1)/4} / Π_{l=1}^n Γ((n−l+2)/2)   (7.180)
To compute this density, we have first to define the inner product ⟨·,·⟩ for our
application, considering symmetric or Hermitian positive definite matrices, which
form a reductive homogeneous space. A homogeneous space G/H is said to be
reductive if there exists a decomposition g = m + h such that Ad_H(m) ⊂ m, i.e.,
hAh^{−1} ∈ m for all h ∈ H and A ∈ m, where g and h are the Lie algebras of G and H. Given a
bilinear form on m, there are an associated G-invariant metric and a G-invariant affine con-
nection on G/H. The natural metric on G/H corresponds to a restriction of the Cartan–
Killing form on g (the Lie algebra of G). For covariance matrices, g = gl(n) is the Lie
algebra of n×n matrices, which can be written as the direct sum gl(n) = m + h, with m
the symmetric or Hermitian matrices and h = so(n) the sub-Lie algebra of skew-symmetric
matrices (or h = u(n) of skew-Hermitian matrices): A = (1/2)·(A + A^T) + (1/2)·(A − A^T). The
symmetric matrices m are Ad_O(n)-invariant because for any symmetric matrix S and
orthogonal matrix Q, Ad_O(n)(S) = QSQ^{−1} = QSQ^T, which is symmetric.
As the set of symmetric covariance matrices is related to the quotient space
GL(n)/O(n), with O(n) the orthogonal Lie group (because for any covariance matrix R
there is an equivalence class R^{1/2}O(n)), and the set of Hermitian covariance matrices
is related to the quotient space GL(n)/U(n), with U(n) the Lie group of unitary matrices,
covariance matrices admit a GL(n)-invariant metric and connection corresponding
to the following Cartan–Killing bilinear form (Elie Cartan introduced this form
in his Ph.D. thesis):

⟨X, Y⟩_I = Tr(XY) at R = I   (7.183)

g_R(X, Y) = ⟨X, Y⟩_R = Tr(X·R^{−1}·Y·R^{−1}) at arbitrary R   (7.184)
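A short numerical sketch of the geodesic distance induced by (7.184), using the standard closed form d(R_1, R_2) = ‖log(R_1^{−1/2}·R_2·R_1^{−1/2})‖_F computed from generalized eigenvalues; the test matrices are arbitrary random choices:

```python
import numpy as np
from scipy.linalg import eigh

def spd_distance(R1, R2):
    # d(R1,R2) = sqrt(sum log^2 lambda_k), lambda_k the generalized
    # eigenvalues of R2 v = lambda R1 v, i.e. eigenvalues of R1^{-1} R2
    lam = eigh(R2, R1, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam)**2))

# GL(n)-invariance check: distance unchanged under R -> W R W^T
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); R1 = A @ A.T + 3*np.eye(3)
B = rng.standard_normal((3, 3)); R2 = B @ B.T + 3*np.eye(3)
W = rng.standard_normal((3, 3)) + 3*np.eye(3)   # generic invertible matrix
assert np.isclose(spd_distance(R1, R2),
                  spd_distance(W @ R1 @ W.T, W @ R2 @ W.T))
```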
With this inner product given by the Cartan–Killing bilinear form, we can then develop
the expression of the Koszul density:

⟨x, y⟩ = Tr(xy), ∀x, y ∈ Sym_n(R)

ψ_Φ(x) = ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ = det x^{−(n+1)/2}·ψ(I_n), with Φ* = Φ (self-dual cone)   (7.185)

x* = Γ̄ = −d log ψ_Φ = ((n+1)/2)·d log det x = ((n+1)/2)·x^{−1}
β_K = ((n+1)/2)^{n(n+1)/2}   (7.188)
W_n = (1 − |μ_n|²)^{1/2}·[[1, 0], [A_{n−1}, Φ_{n−1}^{1/2}]] with Φ_{n−1} = Φ_{n−1}^{1/2}·Φ_{n−1}^{1/2+}   (7.190)

where

α_n^{−1} = (1 − |μ_n|²)·α_{n−1}^{−1}, A_n = [A_{n−1}; 0] + μ_n·[A_{n−1}^{(−)}; 1] and V^{(−)} = J·V*   (7.191)
Ω̃(R_n, P_0) = log[det R_n^{−1}] − n·log(π·e) = −Σ_{k=1}^{n−1} (n−k)·log(1 − |μ_k|²) − n·log[π·e·P_0]   (7.192)

g_ij ≜ ∂²Ω̃/∂θ_i^{(n)}∂θ_j^{(n)*} where θ^{(n)} = [P_0  μ_1 ⋯ μ_{n−1}]^T   (7.193)
with {μ_k}_{k=1}^{n−1} the regularized Burg reflection coefficients [132] and P_0 = α_0^{−1} the mean
signal power. The Kählerian metric (from Information Geometry) is finally:

ds_n² = dθ^{(n)+}·[g_ij]·dθ^{(n)} = n·(dP_0/P_0)² + Σ_{i=1}^{n−1} (n−i)·|dμ_i|²/(1 − |μ_i|²)²   (7.194)
For the block (multichannel) case, we have:

R_{p,n+1} = [[α_{n−1} + A^{n+}·R_{p,n}·A^n, −A^{n+}·R_{p,n}], [−R_{p,n}·A^n, R_{p,n}]]   (7.198)

with

α_n^{−1} = (I − A_n^n·A_n^{n+})·α_{n−1}^{−1}, α_0^{−1} = R_0

and
A^n = [A_1^n; ⋯; A_n^n] = [A^{n−1}; 0_p] + A_n^n·[J_p·A_{n−1}^{(n−1)*}·J_p; ⋯; J_p·A_1^{(n−1)*}·J_p; I_p]   (7.199)
where we have the following Burg-like generalized forward and backward linear
prediction:

ε_{n+1}^f(k) = Σ_{l=0}^{n+1} A_l^{n+1}·Z(k−l) = ε_n^f(k) + A_{n+1}^{n+1}·ε_n^b(k−1)

ε_{n+1}^b(k) = Σ_{l=0}^{n+1} J·A_l^{n+1*}·J·Z(k−n+l) = ε_n^b(k−1) + J·A_{n+1}^{n+1*}·J·ε_n^f(k)

with

ε_0^f(k) = ε_0^b(k) = Z(k) and A_0^{n+1} = I_p
The symplectic group Sp(n, R) consists of the block matrices M = [[A, B], [C, D]]
whose blocks satisfy A^T·C = C^T·A, B^T·D = D^T·B and A^T·D − C^T·B = I_n.
The Siegel upper half plane is the set of all complex symmetric n×n matrices
with positive definite imaginary part: SH_n = {Z = X + iY / Z^T = Z, Y > 0}.
The action of the symplectic group on the Siegel upper half plane is transitive.
The group PSp(n, R) ≅ Sp(n, R)/{±I_2n} is the group of biholomorphisms of SH_n via
generalized Möbius transformations:

M = [[A, B], [C, D]] ⇒ M(Z) = (AZ + B)·(CZ + D)^{−1}   (7.204)
C. L. Siegel has proved that the distance in the Siegel upper half plane is given by:

d²_Siegel(Z_1, Z_2) = Σ_{k=1}^n log²((1 + √r_k)/(1 − √r_k)) with Z_1, Z_2 ∈ SH_n   (7.207)
and

ds² = Tr(Y^{−1}·dZ·Y^{−1}·dZ^+) = 2·Tr(D²R)   (7.210)

In parallel, in China in 1945, Hua Lookeng has given the equations of the geodesics in the
Siegel upper half plane:

d²Z/ds² + i·(dZ/ds)·Y^{−1}·(dZ/ds) = 0   (7.211)
The differential equation of the geodesics in the Siegel unit disk is then given by:

d²W/ds² + 2·(dW/ds)·W^+·(I − W·W^+)^{−1}·(dW/ds) = 0   (7.213)

The contour of the Siegel disk is called its Shilov boundary: ∂SD_n = {W / W·W^+ − I_n = 0_n}.
We can also define horospheres. Let U ∈ ∂SD_n and k ∈ R*_+; the fol-
lowing set is called a horosphere in the Siegel disk (Fig. 7.10):

H(k, U) = {Z / 0 < k·(I − Z^+·Z) − (I − Z^+·U)·(I − U^+·Z)} = {Z / ‖Z − U/(k+1)‖² < k/(k+1)}   (7.214)
Hua Lookeng and Siegel have proved that the previous positive definite quadratic
differential is invariant under the group of automorphisms of the Siegel disk, con-
sidering M = [[A, B], [C, D]] such that:

M*·[[I_n, 0], [0, −I_n]]·M = [[I_n, 0], [0, −I_n]]
By analogy with the Poincaré unit disk, C. L. Siegel has deduced the geodesic distance
in SD_n:

∀Z, W ∈ SD_n, d(Z, W) = (1/2)·log[(1 + |Ω_Z(W)|)/(1 − |Ω_Z(W)|)]   (7.219)
Ω̃(R_{p,n}) = −Σ_{k=1}^{n−1} (n−k)·log det(I − A_k^k·A_k^{k+}) − n·log[π·e·det R_0]   (7.221)
Paul Malliavin has proved that this form is a Kähler potential of an invariant
Kähler metric (the Information Geometry metric in our case). The metric is given, by
matrix extension, as the Hessian of this entropic potential, ds² = d²Ω̃(R_{p,n}):
ds² = n·Tr[(R_0^{−1}·dR_0)²] + Σ_{k=1}^{n−1} (n−k)·Tr[(I − A_k^k·A_k^{k+})^{−1}·dA_k^k·(I − A_k^{k+}·A_k^k)^{−1}·dA_k^{k+}]   (7.222)

ds² = n·m·(d log P_0)² + n·Σ_{i=1}^{m−1} (m−i)·|dμ_i|²/(1 − |μ_i|²)²
      + Σ_{k=1}^{n−1} (n−k)·Tr[(I − A_k^k·A_k^{k+})^{−1}·dA_k^k·(I − A_k^{k+}·A_k^k)^{−1}·dA_k^{k+}]   (7.223)
We have finally proved that, for a space–time state of a stationary time series,
the TBTHPD covariance matrix is coded in R × D^{m−1} × SD^{n−1}.
This result can be intuitively recovered and illustrated in the case m = 2 and
n = 1 (Fig. 7.11):
R = [[h, a − ib], [a + ib, h]] > 0, det R > 0 ⇔ h² > a² + b²   (7.225)
In his book on Creative Evolution, Bergson has written about "Form and Becoming":
"life is an evolution. We concentrate a period of this evolution in a stable view which
we call a form, and, when the change has become considerable enough to overcome
the fortunate inertia of our perception, we say that the body has changed its form.
But in reality the body is changing form at every moment; or rather, there is no form,
since form is immobile and the reality is movement. What is real is the continual
change of form: form is only a snapshot view of a transition".

Fig. 7.11 Bounded homogeneous cone associated to a 2 × 2 symmetric positive definite matrix

Fig. 7.12 Hadamard compactification (illustration from "Géométrie des bords: compactifications
différentiables et remplissages holomorphes", Benoît Kloeckner) [131]
In Sect. 7.7.1 we have considered stationary signals, but we can envisage extending
the application to non-stationary signals, defined as a continual change of
covariance matrices. We will then extend the approach of Sect. 7.7.1 to non-stationary
time series. Many methods have been explored to model non-stationary
time series [134–141]. We propose to extend the previous geometric approach to
non-stationary signals corresponding to fast time variations of a time series. We will
assume that each non-stationary signal in one time series can be split into several
stationary signals on a shorter time scale, represented by a time sequence of stationary
where the infimum is taken over all possible σ. A class f* of parameterized surfaces,
equivalent to f, is called a Fréchet surface. It is a generalization of the notion of a
surface in Euclidean space to the case of an arbitrary metric space (X, d).
The Fréchet surface metric between two Fréchet surfaces f_1* and f_2* is:

Inf_σ Max_{p∈M²} d(f_1(p), f_2(σ(p)))   (7.228)

For n = 1, when (X, d) is the Euclidean space R^n, this metric is the original Fréchet distance,
introduced in 1906, between parametric curves f, g: [0, 1] → R^n. For more details
on the Fréchet surface metric, see the "Dictionary of Distances" of Elena and Michel
Deza [187].
Fig. 7.13 Fréchet distance between two polygonal curves (and indexing all matching of points)
More recently, other distances have been introduced between open curves
[195–199] based on diffeomorphism theory, but for our application we will only
use the Fréchet distance and its extension on manifolds.
Fig. 7.14 Fréchet free-space diagram for two polygonal curves P and Q with monotonicity in both
directions
The Fréchet distance d_Fréchet(P, Q) is at most ε if and only if the free-space
diagram D_ε(P, Q) contains a path from the lower left corner to the upper right
corner which is monotone both in the horizontal and in the vertical direction.
In an n × m free-space diagram, shown in Fig. 7.14, the horizontal and vertical
directions of the diagram correspond to the natural parameterizations of P and Q
respectively. Therefore, if there is a monotone increasing curve from the lower left
to the upper right corner of the diagram (corresponding to a monotone mapping), it
generates a monotonic path that defines a matching between the point sets P and Q.
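In practice, sampled curves allow the simpler discrete Fréchet (coupling) distance, computed by dynamic programming over exactly this n × m grid; the sketch below follows the standard recursion, a discrete stand-in for the free-space criterion rather than the exact continuous algorithm of Alt and Godau [127].

```python
import numpy as np

def discrete_frechet(P, Q, d=lambda p, q: np.linalg.norm(p - q)):
    # ca[i, j] = discrete Frechet distance between P[:i+1] and Q[:j+1]
    n, m = len(P), len(Q)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d(P[0], Q[0])
    for i in range(1, n):
        ca[i, 0] = max(ca[i-1, 0], d(P[i], Q[0]))
    for j in range(1, m):
        ca[0, j] = max(ca[0, j-1], d(P[0], Q[j]))
    for i in range(1, n):          # monotone moves: right, up, diagonal
        for j in range(1, m):
            ca[i, j] = max(min(ca[i-1, j], ca[i-1, j-1], ca[i, j-1]),
                           d(P[i], Q[j]))
    return ca[-1, -1]

P = np.array([[0., 0.], [1., 0.], [2., 1.]])
Q = np.array([[0., .1], [1., .2], [2., .9]])
print(discrete_frechet(P, Q))
```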
As the classical Fréchet distance, with its Inf[Max] structure, does not take into account
the close dependence of elements between points of the time-series paths, we propose
to define a new distance given by:
Fig. 7.15 Geodesic path on information geometry manifold where non stationary burst is decom-
posed on a sequence of stationary covariance matrices on THPD matrix manifold
d_geo-path(R_1, R_2) = Inf_{α,β} ∫_0^1 d_geo(R_1(α(t)), R_2(β(t)))·dt   (7.233)
We then have to find the solution for computing the geodesic minimal path [133]
on the Fréchet free-space diagram. The length of the path is not given by the Euclidean
metric ds² = dt² (where L = ∫_L ds), but by a geodesic metric weighted by the distance d(·,·) of the
free-space diagram:

L_g = ∫_L g·ds = ∫_L ds_g with ds_g = d(R_1(α(t)), R_2(β(t)))·dt   (7.234)
This optimal shortest path could be computed by classical “Fast Marching method”
(Fig. 7.15).
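As a hedged discrete stand-in for the fast-marching computation of (7.233)–(7.234), a Dijkstra search over the monotone moves of the free-space grid, with node cost d(R_1(t_i), R_2(s_j)), already yields the cheapest monotone path:

```python
import heapq
import numpy as np

def cheapest_monotone_path(W):
    # W[i, j] = d(R1(t_i), R2(s_j)); moves restricted to right/up/diagonal
    n, m = W.shape
    dist = np.full((n, m), np.inf)
    dist[0, 0] = W[0, 0]
    pq = [(dist[0, 0], 0, 0)]
    while pq:
        c, i, j = heapq.heappop(pq)
        if c > dist[i, j]:
            continue
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            a, b = i + di, j + dj
            if a < n and b < m and c + W[a, b] < dist[a, b]:
                dist[a, b] = c + W[a, b]
                heapq.heappush(pq, (dist[a, b], a, b))
    return dist[-1, -1]   # total weighted cost from corner to corner

W = np.random.default_rng(2).random((5, 7))   # arbitrary cost grid
print(cheapest_monotone_path(W))
```

A true fast-marching solver would propagate the continuous eikonal front over the same weighted diagram; the discrete search above converges to it as the grid is refined.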
• Generating inner product: in Koszul geometry, we have two convex dual func-
tions Ω(x) and Ω*(x*) with dual systems of coordinates x and x* defined on dual
cones Φ and Φ*: Ω(x) = −log ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ, ∀x ∈ Φ, and Ω*(x*) = ⟨x, x*⟩ − Ω(x).
We can then remark that if we can define an inner product ⟨·,·⟩, we will be able to
build the convex characteristic function and its dual by Legendre transform, because
both depend only on the inner product, and the dual coordinate is also defined by
x* = arg min{ψ_Φ(y) / y ∈ Φ*, ⟨x, y⟩ = n} = ∫_{Φ*} Γ·e^{−⟨Γ,x⟩} dΓ / ∫_{Φ*} e^{−⟨Γ,x⟩} dΓ, where
x* is also the center of gravity of the cross section {y ∈ Φ*, ⟨x, y⟩ = n} of Φ*
(with the notation Ω(x) = −log ψ_Φ(x)).
It is not possible to define an inner product for any two elements of a Lie algebra,
but a symmetric bilinear form, called the "Cartan–Killing form", can be introduced.
This form was first introduced by Elie Cartan in 1894 in his Ph.D. report. It is
defined from the adjoint endomorphism Ad_x of g, defined for every element x of g
with the help of the Lie bracket:

Ad_x(y) = [x, y]   (7.235)

The trace of the composition of two such endomorphisms defines a bilinear form,
the Cartan–Killing form:

B(x, y) = Tr(Ad_x·Ad_y)   (7.236)
It satisfies the invariance (associativity) property:

B([x, y], z) = Tr(Ad_[x,y]·Ad_z) = Tr([Ad_x, Ad_y]·Ad_z) = Tr(Ad_x·[Ad_y, Ad_z]) = B(x, [y, z])
Elie Cartan has proved that if g is a simple Lie algebra (the Killing form is non-
degenerate), then any invariant symmetric bilinear form on g is a scalar multiple of
the Cartan–Killing form. The Cartan–Killing form is invariant under the automor-
phisms σ ∈ Aut(g) of the algebra g: since

Ad_σ(x) = σ ∘ Ad_x ∘ σ^{−1}

then

B(σ(x), σ(y)) = Tr(Ad_σ(x)·Ad_σ(y)) = Tr(σ ∘ Ad_x·Ad_y ∘ σ^{−1}) = Tr(Ad_x·Ad_y) = B(x, y)
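The sketch below computes B on sl(2, R) directly from (7.235)–(7.236) and checks the known identity B(X, Y) = 4·Tr(XY) for this particular algebra:

```python
import itertools
import numpy as np

# basis (H, E, F) of sl(2, R)
H = np.array([[1, 0], [0, -1]], float)
E = np.array([[0, 1], [0, 0]], float)
F = np.array([[0, 0], [1, 0]], float)
basis = [H, E, F]

def bracket(a, b):
    return a @ b - b @ a

def coords(x):
    # expand a traceless x in the (H, E, F) basis
    return np.array([x[0, 0], x[0, 1], x[1, 0]])

def ad(x):
    # matrix of ad_x(y) = [x, y] in the chosen basis   (7.235)
    return np.column_stack([coords(bracket(x, b)) for b in basis])

def killing(x, y):
    return np.trace(ad(x) @ ad(y))       # B(x, y) = Tr(Ad_x Ad_y)  (7.236)

# for sl(2, R) the Killing form equals 4*Tr(XY)
for x, y in itertools.product(basis, repeat=2):
    assert np.isclose(killing(x, y), 4 * np.trace(x @ y))
```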
As another generalization of the inner product, we can also consider the case of a
CAT(−1) space (a generalization of simply connected Riemannian manifolds with
curvature bounded above by −1), or of homogeneous symmetric bounded domains,
and then define a "generating" Gromov inner product between three points x, y
and z (relatively to x), defined by the distance [118]:

⟨y, z⟩_x = (1/2)·(d(x, y) + d(x, z) − d(y, z))   (7.241)
with d(·,·) the distance in the CAT(−1) space. Intuitively, this inner product measures the
distance of x to the geodesic between y and z.
This inner product can also be defined for points on the Shilov boundary of the
domain through the Busemann distance:

⟨Γ, Γ′⟩_x = (1/2)·(B_Γ(x, p) + B_{Γ′}(x, p))   (7.242)

independent of p, where B_Γ(x, y) = lim_{t→+∞} [|x − r(t)| − |y − r(t)|] is the horospheric
distance from x to y relatively to Γ, with r(t) a geodesic ray. We have the property
that:

⟨Γ, Γ′⟩_x = lim_{y→Γ, y′→Γ′} ⟨y, y′⟩_x   (7.243)
We can then define a visual metric on the Shilov boundary by (Fig. 7.16):

d_x(Γ, Γ′) = e^{−⟨Γ,Γ′⟩_x} if Γ ≠ Γ′ and d_x(Γ, Γ′) = 0 otherwise   (7.244)
where

∫_{∂Φ*} d_0(Γ, Γ′)·dΓ′   (7.250)
References
1. Koszul, J.L.: Variétés localement plates et convexité. Osaka J. Math. 2, 285–290 (1965)
2. Vey, J.: Sur les automorphismes affines des ouverts convexes saillants. Annali della Scuola
Normale Superiore di Pisa, Classe di Science, 3e série, Tome 24(4), 641–665 (1970)
3. Massieu, F.: Sur les fonctions caractéristiques des divers fluides. C. R. Acad. Sci. 69, 858–862
(1869)
4. Massieu, F.: Addition au précédent Mémoire sur les fonctions caractéristiques. C. R. Acad.
Sci. 69, 1057–1061 (1869)
5. Massieu, F.: Thermodynamique: mémoire sur les fonctions caractéristiques des divers fluides
et sur la théorie des vapeurs, 92 p. Académie des Sciences (1876)
6. Duhem, P.: Sur les équations générales de la thermodynamique. Annales Scientifiques de
l’Ecole Normale Supérieure, 3e série, Tome 8, 231 (1891)
7. Duhem, P.: Commentaire aux principes de la thermodynamique. Première partie, Journal de
Mathématiques pures et appliquées, 4e série, Tome 8, 269 (1892)
8. Duhem, P.: Commentaire aux principes de la thermodynamique—troisième partie. Journal de
Mathématiques pures et appliquées, 4e série, Tome 10, 203 (1894)
9. Duhem, P.: Les théories de la chaleur. Duhem 1992, 351–1 (1895)
10. Laplace, P.S.: Mémoire sur la probabilité des causes sur les évènements. Mémoires de Math-
ématique et de Physique, Tome Sixième (1774)
11. Arnold, V.I., Givental, A.G.: Symplectic geometry. In: Encyclopedia of Mathematical Science,
vol. 4. Springer, New York (translated from Russian) (2001)
12. Fitzpatrick, S.: On the geometric quantization of contact manifolds. http://arxiv.org/abs/0909.
2023v3 (2013). Accessed Feb 2013
13. Rajeev, S.G.: Quantization of contact manifolds and thermodynamics. Ann. Phys. 323(3),
768–82 (2008)
14. Gibbs, J.W.: Graphical methods in the thermodynamics of fluids. In: Bumstead, H.A., Van
Name, R.G. (eds.) Scientific Papers of J Willard Gibbs, 2 vols. Dover, New York (1961)
15. Dedecker, P.: A property of differential forms in the calculus of variations. Pac. J. Math. 7(4),
1545–9 (1957)
16. Lepage, T.: Sur les champs géodésiques du calcul des variations. Bull. Acad. Roy. Belg. Cl.
Sci. 27, 716–729, 1036–1046 (1936)
17. Mrugala, R.: On contact and metric structures on thermodynamic spaces. RIMS Kokyuroku
1142, 167–81 (2000)
18. Ingarden R.S., Kossakowski A.: The poisson probability distribution and information ther-
modynamics. Bull. Acad. Pol. Sci. Sér. Sci. Math. Astron. Phys. 19, 83–85 (1971)
19. Ingarden, R.S.: Information geometry in functional spaces of classical and quantum finite
statistical systems. Int. J. Eng. Sci. 19(12), 1609–33 (1981)
20. Ingarden, R.S., Janyszek, H.: On the local Riemannian structure of the state space of classical
information thermodynamics. Tensor, New Ser. 39, 279–85 (1982)
21. Ingarden, R.S., Kawaguchi, M., Sato, Y.: Information geometry of classical thermodynamical
systems. Tensor, New Ser. 39, 267–78 (1982)
22. Ingarden R.S.: Information geometry of thermodynamics. In: Transactions of the Tenth Prague
Conference Czechoslovak Academy of Sciences, vol. 10A–B, pp. 421–428 (1987)
23. Ingarden, R.S.: Information geometry of thermodynamics, information theory, statistical
decision functions, random processes. In: Transactions of the 10th Prague Conference,
Prague/Czechoslovakia 1986, vol. A, pp. 421–428 (1988)
24. Ingarden, R.S., Nakagomi, T.: The second order extension of the Gibbs state. Open Syst. Inf.
Dyn. 1(2), 243–58 (1992)
25. Arnold V.I.: Contact geometry: the geometrical method of Gibbs’s thermodynamics. In: Pro-
ceedings of the Gibbs Symposium, pp. 163–179. American Mathematical Society, Providence,
RI (1990)
26. Cartan, E.: Leçons sur les Invariants Intégraux. Hermann, Paris (1922)
27. Koszul J.L.: Exposés sur les espaces homogènes symétriques. Publicação da Sociedade de
Matematica de São Paulo (1959)
28. Koszul J.L.: Sur la forme hermitienne canonique des espaces homogènes complexes. Can. J.
Math. 7(4), 562–576 (1955)
29. Koszul, J.L.: Lectures on Groups of Transformations. Tata Institute of Fundamental Research,
Bombay (1965)
30. Koszul, J.L.: Domaines bornées homogènes et orbites de groupes de transformations affines.
Bull. Soc. Math. Fr. 89, 515–33 (1961)
31. Koszul, J.L.: Ouverts convexes homogènes des espaces affines. Math. Z. 79, 254–9 (1962)
32. Koszul, J.L.: Déformations des variétés localement plates. Ann. Inst. Fourier 18, 103–14
(1968)
33. Vinberg, E.: Homogeneous convex cones. Trans. Moscow Math. Soc. 12, 340–363 (1963)
34. Vesentini E.: Geometry of Homogeneous Bounded Domains. Springer, Berlin (2011). Reprint
of the 1st Edn. C.I.M.E., Ed. Cremonese, Roma (1968)
35. Barbaresco F.: Information geometry of covariance matrix: Cartan-Siegel homogeneous
bounded domains, Mostow/Berger fibration and Fréchet median. In: Bhatia, R., Nielsen,
F. (eds.) Matrix Information Geometry, pp. 199–256. Springer, New York (2012)
36. Arnaudon M., Barbaresco F., Le, Y.: Riemannian medians and means with applications to
radar signal processing. IEEE J. Sel. Top. Sig. Process. 7(4), 595–604 (2013)
37. Dorfmeister, J.: Inductive construction of homogeneous cones. Trans. Am. Math. Soc. 252,
321–49 (1979)
38. Dorfmeister, J.: Homogeneous siegel domains. Nagoya Math. J. 86, 39–83 (1982)
39. Poincaré, H.: Thermodynamique, Cours de Physique Mathématique. G. Carré, Paris (1892)
40. Poincaré, H.: Calcul des Probabilités. Gauthier-Villars, Paris (1896)
41. Faraut, J., Koranyi, A.: Analysis on Symmetric Cones. The Clarendon Press, New York (1994)
42. Faraut, J., Koranyi, A.: Oxford Mathematical Monographs. Oxford University Press, New
York (1994)
43. Varadhan, S.R.S.: Asymptotic probability and differential equations. Commun. Pure Appl.
Math. 19, 261–86 (1966)
44. Sanov, I.N.: On the probability of large deviations of random magnitudes. Mat. Sb. 42(84),
11–44 (1957)
45. Ellis, R.S.: The Theory of Large Deviations and Applications to Statistical Mechanics. Lecture
Notes for Ecole de Physique Les Houches, France (2009)
46. Touchette, H.: The large deviation approach to statistical mechanics. Phys. Rep. 478(1–3),
1–69 (2009)
47. Cartan, E.: Sur les domaines bornés de l'espace de n variables complexes. Abh. Math. Semin.
Univ. Hamburg 11, 116–162 (1935)
48. Lichnerowicz, A.: Espaces homogènes Kähleriens. In: Collection Géométrie Différentielle,
pp. 171–84, Strasbourg (1953)
49. Sasaki T.: A note on characteristic functions and projectively invariant metrics on a bounded
convex domain. Tokyo J. Math. 8(1), 49–79 (1985)
50. Sasaki, T.: Hyperbolic affine hyperspheres. Nagoya Math. J. 77, 107–23 (1980)
51. Trench W.F.: An algorithm for the inversion of finite Toeplitz matrices. J. Soc. Ind. Appl.
Math. 12, 515–522 (1964)
52. Verblunsky, S.: On positive harmonic functions. Proc. London Math. Soc. 38, 125–57 (1935)
53. Verblunsky, S.: On positive harmonic functions. Proc. London Math. Soc. 40, 290–20 (1936)
54. Hauser, R.A., Güler, O.: Self-scaled barrier functions on symmetric cones and their classifi-
cation. Found. Comput. Math. 2(2), 121–43 (2002)
55. Vinberg, E.B.: Structure of the group of automorphisms of a homogeneous convex cone. Tr.
Mosk. Mat. O-va 13, 56–83 (1965)
56. Siegel, C.L.: Über die analytische Theorie der quadratischen Formen. Ann. Math. 36, 527–606
(1935)
57. Duan, X., Sun, H., Peng, L.: Riemannian means on special euclidean group and unipotent
matrices group. Sci. World J. 2013, ID 292787 (2013)
58. Soize C.: A nonparametric model of random uncertainties for reduced matrix models in
structural dynamics. Probab. Eng. Mech. 15(3), 277–294 (2000)
59. Bennequin, D.: Dualités de champs et de cordes. Séminaire N. Bourbaki, exp. no. 899, pp.
117–148 (2001–2002)
60. Bennequin D.: Dualité Physique-Géométrie et Arithmétique, Brasilia (2012)
61. Chasles M.: Aperçu historique sur l’origine et le développement des méthodes en géométrie
(1837)
62. Gergonne, J.D.: Polémique mathématique. Réclamation de M. le capitaine Poncelet (extraite
du bulletin universel des annonces et nouvelles scientifiques); avec des notes. Annales de
Gergonne, vol. 18, pp. 125–125. http://www.numdam.org (1827–1828)
63. Poncelet, J.V.: Traité des propriétés projectives des figures (1822)
64. André, Y.: Dualités. Sixième séance, ENS, Mai (2008)
65. Atiyah, M.F.: Duality in mathematics and physics, lecture Riemann’s influence in geometry.
Analysis and Number Theory at the Facultat de Matematiques i Estadıstica of the Universitat
Politecnica de Catalunya (2007)
66. Von Oettingen, A.J.: Harmoniesystem in dualer Entwicklung. Studien zur Theorie der Musik,
Dorpat und Leipzig (1866)
67. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
1 (1902)
68. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
2, pp. 62–75 (1903/1904)
69. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
3, pp. 375–403 (1904)
70. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
4, pp. 241–269 (1905)
71. Von Oettingen, A.J.: Das duale system der harmonie. In: Annalen der Naturphilosophie, vol.
5, pp. 116–152, 301–338, 449–503 (1906)
72. Von Oettingen, A.J.: Das duale Harmoniesystem, Leipzig (1913)
73. Von Oettingen, A.J.: Die Grundlagen der musikwissenschaft und das duale reinistrument. In:
Abhand-lungen der mathematisch-physikalischen Klasse der Königlich Sächsischen Gesell-
schaft der Wissenschaften, vol. 34, pp. S.I–XVI, 155–361 (1917)
74. D’Alembert, J.R.: Éléments de musique, théorique et pratique, suivant les principes de M.
Rameau, Paris (1752)
75. Rameau, J.P.: Traité de l’harmonie Réduite à ses Principes Naturels. Ballard, Paris (1722)
76. Rameau, J.P.: Nouveau système de Musique Théorique. Ballard, Paris (1726)
77. Yang, L.: Médianes de mesures de probabilité dans les variétés riemanniennes et applications
à la détection de cibles radar. Thèse de l’Université de Poitiers, tel-00664188, 2011, Thales
PhD Award (2012)
78. Barbaresco, F.: Algorithme de Burg Régularisé FSDS. Comparaison avec l’algorithme de
Burg MFE, pp. 29–32 GRETSI conference (1995)
79. Barbaresco, F.: Information geometry of covariance matrix. In: Nielsen, F., Bhatia, R. (eds.)
Matrix Information Geometry Book. Springer, Berlin (2012)
80. Émery, M., Mokobodzki, G.: Sur le barycentre d’une probabilité dans une variété. Séminaire
de probabilité Strasbourg 25, 220–233 (1991)
81. Friedrich, T.: Die Fisher-information und symplektische strukturen. Math. Nachr. 153, 273–96
(1991)
82. Bingham N.H.: Szegö’s Theorem and Its Probabilistic Descendants. http://arxiv.org/abs/1108.
0368v2 (2012)
83. Landau, H.J.: Maximum entropy and the moment problem. Bull. Am. Math. Soc. 16(1), 47–77
(1987)
84. Siegel, C.L.: Symplectic geometry. Am. J. Math. 65, 1–86 (1943)
85. Libermann, P., Marle, C.M.: Symplectic Geometry and Analytical Mechanics. Reidel, Dor-
drecht (1987)
86. Delsarte, P., Genin, Y.V.: Orthogonal polynomial matrices on the unit circle. IEEE Trans.
Comput. Soc. 25(3), 149–160 (1978)
87. Kanhouche, R.: A modified burg algorithm equivalent. In: Results to Levinson algorithm.
http://hal.archives-ouvertes.fr/ccsd-00000624
88. Douady, A., Earle, C.J.: Conformally natural extension of homeomorphisms of the circle. Acta
Math. 157, 23–48 (1986)
89. Balian, R.: A metric for quantum states issued from von Neumann's entropy. In: Nielsen, F.,
Barbaresco, F. (eds.) Geometric Science of Information. Lecture Notes in Computer Science,
vol. 8085, pp. 513–518. Springer, Berlin (2013)
90. Balian, R.: Incomplete descriptions and relevant entropies. Am. J. Phys. 67, 1078–1090 (1999)
91. Balian, R.: Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 36, 323–353
(2005)
92. Allahverdyan, A., Balian, R., Nieuwenhuizen, T.: Understanding quantum measurement from
the solution of dynamical models. Phys. Rep. 525, 1–166 (2013). arXiv:1107.2138
93. Balian, R.: From Microphysics to Macrophysics: Methods and Applications of Statistical
Physics, vol. 1–2. Springer (2007)
94. Balian, R., Balazs, N.: Equiprobability, information and entropy in quantum theory. Ann.
Phys. (NY) 179, 97–144 (1987)
95. Balian, R., Alhassid, Y., Reinhardt, H.: Dissipation in many-body systems: a geometric
approach based on information theory. Phys. Rep. 131, 1–146 (1986)
96. Barbaresco F.: Information/contact geometries and Koszul entropy. In: Nielsen, F., Bar-
baresco, F. (eds.) Geometric Science of Information. Lecture Notes in Computer Science,
vol. 8085, pp. 604–611. Springer, Berlin (2013)
97. Souriau J.M.: Définition covariante des équilibres thermodynamiques. Suppl. Nuovo Cimento
1, 203–216 (1966)
98. Souriau, J.M.: Thermodynamique et Géométrie 676, 369–397 (1978)
99. Souriau, J.M.: Géométrie de l’espace de phases. Commun. Math. Phys. 1, 374 (1966)
100. Souriau, J.M.: On geometric mechanics. Discrete Continuous Dyn. Syst. 19(3), 595–607
(2007)
101. Souriau, J.M.: Structure des Systèmes Dynamiques. Dunod, Paris (1970)
102. Souriau, J.M.: Structure of dynamical systems. Progress in Mathematics, vol. 149. Birkhäuser
Boston Inc., Boston. A symplectic view of physics (translated from the French by Cushman-de
Vries, C.H.) (1997)
103. Souriau, J.M.: Thermodynamique relativiste des fluides. Rend. Sem. Mat. Univ. e Politec.
Torino, 35:21–34 (1978), 1976/77
104. Souriau, J.M., Iglesias, P.: Heat cold and geometry. In: Cahen, M., et al. (eds.) Differential
Geometry and Mathematical Physics, pp. 37–68 (1983)
105. Souriau, J.M.: Thermodynamique et géométrie. In: Differential Geometrical Methods in Math-
ematical Physics, vol. 2 (Proceedings of the International Conference, University of Bonn,
Bonn, 1977). Lecture Notes in Mathematics, vol. 676, pp. 369–397. Springer, Berlin (1978)
106. Souriau, J.M.,: Dynamic systems structure (Chap. 16 Convexité, Chap. 17 Mesures, Chap.
18 Etats Statistiques, Chap. 19 Thermodynamique), unpublished technical notes, available in
Souriau archive (document sent by Vallée, C.)
107. Vallée, C.: Lois de comportement des milieux continus dissipatifs compatibles avec la
physique relativiste, thèse, Poitiers University (1978)
108. Iglésias P., Equilibre statistiques et géométrie symplectique en relativité générale. Ann. l’Inst.
Henri Poincaré, Sect. A, Tome 36(3), 257–270 (1982)
109. Iglésias, P.: Essai de thermodynamique rationnelle des milieux continus. Ann. l’Inst. Henri
Poincaré, 34, 1–24 (1981)
110. Vallée, C.: Relativistic thermodynamics of continua. Int. J. Eng. Sci. 19(5), 589–601 (1981)
111. Pavlov, V.P., Sergeev, V.M.: Thermodynamics from the differential geometry standpoint.
Theor. Math. Phys. 157(1), 1484–1490 (2008)
112. Kozlov, V.V.: Heat Equilibrium by Gibbs and Poincaré. RKhD, Moscow (2002)
113. Berezin, F.A.: Lectures on Statistical Physics. Nauka, Moscow (2007) (English trans., World
Scientific, Singapore, 2007)
114. Poincaré, H.: Réflexions sur la théorie cinétique des gaz. J. Phys. Theor. Appl. 5, 369–403
(1906)
115. Carathéodory, C.: Math. Ann. 67, 355–386 (1909)
116. Nakajima, S.: On quantum theory of transport phenomena. Prog. Theor. Phys. 20(6), 948–959
(1958)
117. Zwanzig, R.: Ensemble method in the theory of irreversibility. J. Chem. Phys. 33(5), 1338–
1341 (1960)
118. Bourdon, M.: Structure conforme au bord et flot géodésique d’un CAT(-1)-espace.
L’Enseignement Math. 41, 63–102 (1995)
119. Viterbo, C.: Generating functions, symplectic geometry and applications. In: Proceedings of
the International Congress Mathematics, Zurich (1994)
120. Viterbo, C.: Symplectic topology as the geometry of generating functions. Math. Ann. 292,
685–710 (1992)
121. Hörmander, L.: Fourier integral operators I. Acta Math. 127, 79–183 (1971)
122. Théret, D.: A complete proof of Viterbo’s uniqueness theorem on generating functions. Topol-
ogy Appl. 96, 249–266 (1999)
123. Pansu, P.: Volume, courbure et entropie. Séminaire Bourbaki 823, 83–103 (1996)
124. Besson, G., Courtois, G., Gallot, S.: Entropies et rigidités des espaces localement symétriques
de courbure strictement négative. Geom. Funct. Anal. 5, 731–799 (1995)
125. Fréchet, M.: Sur quelques points du calcul fonctionnel. Rend. Circolo Math. Palermo 22, 1–74
(1906)
126. Fréchet, M.: L’espace des courbes n’est qu’un semi-espace de Banach. General Topology and
Its Relation to Modern Analysis and Algebra, pp. 155–156, Prague (1962)
127. Alt, H., Godau, M.: Computing the Fréchet distance between two polygonal curves. Int. J.
Comput. Geom. Appl. 5, 75–91 (1995)
128. Fréchet, M.R.: Les éléments aléatoires de nature quelconque dans un espace distancié. Ann.
l’Inst. Henri Poincaré 10(4), 215–310 (1948)
129. Marle, C.M.: On mechanical systems with a Lie group as configuration space. In: M. de Gosson
(ed.) Jean Leray ’99 Conference Proceedings: the Karlskrona Conference in the Honor of Jean
Leray, Kluwer, Dordrecht, pp. 183–203 (2003)
130. Marle, C.M.: On Henri Poincaré’s note “Sur une forme nouvelle des équations de la
mécanique”. JGSP 29, 1–38 (2013)
131. Kloeckner, B.: Géométrie des bords: compactifications différentiables et remplissages
holomorphes. Thèse Ecole Normale Supérieure de Lyon. http://tel.archives-ouvertes.fr/tel-
00120345 (2006). Accessed Dec 2006
132. Barbaresco, F.: Super Resolution Spectrum Analysis Regularization: Burg, Capon and Ago-
antagonistic Algorithms, EUSIPCO-96, pp. 2005–2008, Trieste (1996)
133. Barbaresco, F.: Computation of most threatening radar trajectories areas and corridors based
on fast-marching and level sets. In: IEEE CISDA Symposium, Paris (2011)
134. Michor, P.W., Mumford, D.: An overview of the Riemannian metrics on spaces of curves
using the Hamiltonnian approach. Appl. Comput. Harm. Anal. 23(1), 74–113 (2007)
135. Chouakria-Douzal, A., Nagabhusha, P.N.: Improved Fréchet distance for time series. In: Data
Sciences and Classification, pp. 13–20. Springer, Berlin (2006)
136. Bauer, M., et al.: Constructing reparametrization invariant metrics on spaces of plane curves.
Preprint http://arxiv.org/abs/1207.5965
137. Fréchet, M.: L’espace dont chaque élément est une courbe n’est qu’un semi-espace de Banach.
Ann. Sci. l’ENS, 3ème série. Tome 78(3), 241–272 (1961)
138. Fréchet, M.: L’espace dont chaque élément est une courbe n’est qu’un semi-espace de Banach
II. Ann.Sci. l’ENS, 3ème série. Tome 80(2), pp. 135–137 (1963)
139. Chazal, F., et al.: Gromov-Hausdorff stable signatures for shapes using persistence. In: Euro-
graphics Symposium on Geometry Processing 2009, Marc Alexa and Michael Kazhdan (Guest
editors), vol. 28, no. 5 (2009)
140. Cagliari, F., Di Fabio B., Landi, C.: The natural pseudo-distance as a quotient pseudo metric,
and applications. Preprint http://amsacta.unibo.it/3499/1/Forum_Submission.pdf
141. Frosini, P., Landi, C.: No embedding of the automorphisms of a topological space into a
compact metric space endows them with a composition that passes to the limit. Appl. Math.
Lett. 24(10), 1654–1657 (2011)
142. Shevchenko, O.: Recursive construction of optimal self-concordant barriers for homogeneous
cones. J. Optim. Theor. Appl. 140(2), 339–354 (2009)
143. Güler, O., Tunçel, L.: Characterization of the barrier parameter of homogeneous convex cones.
Math. Program. 81(1), Ser. A, 55–76 (1998)
144. Rothaus, O.S.: Domains of positivity. Bull. Am. Math. Soc. 64, 85–86 (1958)
145. Nesterov, Y., Nemirovskii, A.: Interior-point polynomial algorithms. In: Convex Program-
ming, SIAM Studies in Applied Mathematics, vol. 13 (1994)
146. Vinberg, E.B.: The theory of homogeneous convex cones. Tr. Mosk. Mat. O-va. 12, 303–358
(1963)
147. Rothaus, O.S.: The construction of homogeneous convex cones. Ann. Math. Ser. 2, 83, 358–
376 (1966)
148. Güler, O.: Barrier functions in interior point methods. Math. Oper. Res. 21(4), 860–885 (1996)
149. Uehlein, F.A.: Eidos and Eidetic Variation in Husserl’s Phenomenology. In: Language and
Schizophrenia, Phenomenology, pp. 88–102. Springer, New York (1992)
150. Bergson, H.: L’évolution créatrice. Les Presses universitaires de France, Paris (1907). http://
classiques.uqac.ca/classiques/bergson_henri/evolution_creatrice/evolution_creatrice.pdf
151. Riquier, C.: Bergson lecteur de Platon: le temps et l’eidos, dans interprétations des idées
platoniciennes dans la philosophie contemporaine (1850–1950), coll. Tradition de la pensée
classique, Paris, Vrin (2011)
152. Worms, F.: Bergson entre Russel et Husserl: un troisième terme? In: Rue Descartes, no. 29,
Sens et phénomène, philosophie analytique et phénoménologie, pp. 79–96, Presses Univer-
sitaires de France, Sept. 2000
153. Worms, F.: Le moment 1900 en philosophie. Presses Universitaires du Septentrion, premier
trimestre, Etudes réunies sous la direction de Frédéric Worms (2004)
154. Worms, F.: Bergson ou Les deux sens de la vie: étude inédite, Paris, Presses universitaires de
France, Quadrige. Essais, débats (2004)
155. Bergson, H., Poincaré, H.: Le matérialisme actuel. Bibliothèque de Philosophie Scientifique,
Paris, Flammarion (1920)
156. de Saxcé G., Vallée C.: Bargmann group, momentum tensor and Galilean invariance of
Clausius-Duhem inequality. Int. J. Eng. Sci. 50, 216–232 (2012)
157. Dubois, F.: Conservation laws invariants for Galileo group. CEMRACS preliminary results.
ESAIM Proc. 10, 233–266 (2001)
158. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R.
l’Acad. des Sci. Série A, Tome 255, 2897–2899 (1962)
159. Nielsen, F.: Hypothesis testing, information divergence and computational geometry. In:
GSI’13 Conference, Paris, pp. 241–248 (2013)
160. Vey, J.: Sur une notion d’hyperbolicité des variables localement plates. Faculté des sciences
de l’université de Grenoble, Thèse de troisième cycle de mathématiques pures (1969)
161. Ruelle, D.: Statistical mechanics. In: Rigorous Results (Reprint of the 1989 edition). World
Scientific Publishing Co., Inc, River Edge. Imperial College Press, London (1999)
162. Ruelle, D.: Hasard et Chaos. Editions Odile Jacob, Aout (1991)
163. Shima, H.: Geometry of Hessian Structures. In: Nielsen, F., Barbaresco, F. (eds.) Lecture
Notes in Computer Science, vol. 8085, pp. 37–55. Springer, Berlin (2013)
164. Shima, H.: The Geometry of Hessian Structures. World Scientific, London (2007)
165. Zia, R.K.P., Redish Edward F., McKay Susan, R.: Making Sense of the Legendre Transform
(2009), arXiv:0806.1147, June 2008
166. Fréchet, M.: Sur l’écart de deux courbes et sur les courbes limites. Trans. Am. Math. Soc.
6(4), 435–449 (1905)
216 F. Barbaresco
167. Taylor, A.E., Dugac, P.: Quatre lettres de Lebesgue à Fréchet. Rev. d’Hist. Sci. Tome 34(2),
149–169 (1981)
168. Jensen, J.L.W.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta
Math. 30(1), 175–193 (1906)
169. Needham, T.: A visual explanation of Jensen’s inequality. Am. Math. Mon. 8, 768–77 (1993)
170. Donaldson, S.K.: Scalar curvature and stability of toric variety. J. Differ. Geom. 62, 289–349
(2002)
171. Abreu, M.: Kähler geometry of toric varieties and extremal metrics. Int. J. Math. 9, 641–651
(1998)
172. Atiyah, M., Bott, R.: The moment map and equivariant cohomology. Topology 23, 1–28
(1984)
173. Guan, D.: On modified Mabuchi functional and Mabuchi moduli space of kahler metrics on
toric bundles. Math. Res. Lett. 6, 547–555 (1999)
174. Guillemin, V.: Kaehler structures on toric varieties. J. Differ. Geom. 40, 285–309 (1994)
175. Guillemin, V.: Moment maps and combinatorial invariants of Hamiltonian Tn -spaces,
Birkhauser (1994)
176. Crouzeix, J.P.: A relationship between the second derivatives of a convex function and of its
conjugate. Math. Program. 3, 364–365 (1977) (North-Holland)
177. Seeger, A.: Second derivative of a convex function and of its Legendre-Fenchel transformate.
SIAM J. Optim. 2(3), 405–424 (1992)
178. Hiriart-Urruty, J.B.: A new set-valued second-order derivative for convex functions. Mathe-
matics for Optimization, Mathematical Studies, vol. 129. North Holland, Amsterdam (1986)
179. Berezin, F.: Lectures on Statistical Physics (Preprint 157). Max-Plank-Institut für Mathematik,
Bonn (2006)
180. Hill, R., Rice, J.R.: Elastic potentials and the structure of inelastic constitutive laws. SIAM J.
Appl. Math. 25(3), 448–461 (1973)
181. Bruguières, A.: Propriétés de convexité de l’application moment, séminaire N. Bourbaki, exp.
no. 654, pp. 63–87 (1985–1986)
182. Condevaux, M., Dazord, P., Molino, P.: Géométrie du moment. Trav. Sémin. Sud-Rhodanien
Géom. Univ. Lyon 1, 131–160 (1988)
183. Delzant, T.: Hamiltoniens périodiques et images convexes de l’application moment. Bull. Soc.
Math. Fr. 116, 315–339 (1988)
184. Guillemin, V., Sternberg, S.: Convexity properties of the moment mapping. Inv. Math. 67,
491–513 (1982)
185. Guillemin, V., Sternberg, S.: Convexity properties of the moment mapping. Inv. Math. 77,
533–546 (1984)
186. Kirwan, F.: Convexity properties of the moment mapping. Inv. Math. 77, 547–552 (1984)
187. Deza, E., Deza, M.M.: Dictionary of Distances. Elsevier, Amsterdam (2006)
188. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. II 106(4), 620–630
(1957)
189. Jaynes, E.T.: Information theory and statistical mechanics II. Phys. Rev. II 108(2), 171–190
(1957)
190. Jaynes, E.T.: Prior probabilities. IEEE Trans. Syst. Sci. Cybern. 4(3), 227–241 (1968)
191. Amari, S.I., Nagaoka, H.: Methods of Information Geometry (Translation of Mathematical
Monographs), vol. 191. AMS, Oxford University Press, Oxford (2000)
192. Amari, S.I.: Differential Geometrical Methods in Statistics. Lecture Notes in Statistics, vol.
28. Springer, Berlin (1985)
193. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters.
Bull. Calcutta Math. Soc. 37, 81–89 (1945)
194. Chentsov, N.N.: Statistical decision rules and optimal inferences. In: Transactions of Mathe-
matics Monograph, vol. 53. American Mathematical Society, Providence (1982) (Published
in Russian in 1972)
195. Trouvé, A., Younes, L.: Diffeomorphic matching in 1d: designing and minimizing matching
functionals. In: Vernon, D. (ed.) Proceedings of ECCV (2000)
7 Eidetic Reduction of Information Geometry 217
196. Trouvé, A., Younes, L.: On a class of optimal matching problems in 1 dimension. SIAM J.
Control Opt. 39(4), 1112–1135 (2001)
197. Younes, L.: Computable elastic distances between shapes. SIAM J. Appl. Math 58, 565–586
(1998)
198. Younes, L.: Optimal matching between shapes via elastic deformations. Image Vis. Comput.
17, 381–389 (1999)
199. Younes, L., Michor, P.W., Shah, J., Mumford, D.: A metric on shape space with explicit
geodesics. Rend. Lincei Mat. Appl. 9, 25–57 (2008)
200. Kapranov, M.: Thermodynamics and the Moment Map (preprint), arXiv:1108.3472, Aug 2011
Chapter 8
Distances on Spaces of High-Dimensional
Linear Stochastic Processes: A Survey

Bijan Afsari and René Vidal
p-dimensional time series indexed by time t. Assume that each time series $y^i = \{y^i_t\}_{t=1}^{\infty}$ can be approximately modeled by a (stochastic) LDS $M_i$ of output-input size $(p, m)$ and order $n$, realized as

$$x^i_t = A_i x^i_{t-1} + B_i v_t, \quad y^i_t = C_i x^i_t + D_i v_t, \quad (A_i, B_i, C_i, D_i) \in \widetilde{SL}_{m,n,p} = \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times m} \times \mathbb{R}^{p \times n} \times \mathbb{R}^{p \times m}, \quad (8.1)$$
where $v_t$ is a common stimulus process (e.g., white Gaussian noise with identity covariance)² and where the realization $R_i = (A_i, B_i, C_i, D_i)$ is learnt and assumed to be known. The problem, referred to as Problem 1 in the sequel, is to: (1) choose an appropriate space S of LDSs containing the learnt models $\{M_i\}_{i=1}^{N}$; (2) geometrize S, i.e., equip it with an appropriate geometry (e.g., define a distance on S); (3) develop tools (e.g., probability distributions, averages or means, variance, PCA) to perform statistical analysis (e.g., classification and clustering) in a computationally efficient manner.
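As a concrete illustration of the state-space model (8.1), here is a minimal simulation sketch; the function and variable names are ours, and the stimulus $v_t$ is taken as white Gaussian noise with identity covariance, as in the text.

```python
# A minimal sketch of simulating one LDS realized as in (8.1); illustrative only.
import numpy as np

def simulate_lds(A, B, C, D, T, rng=None):
    """Generate y_1..y_T from x_t = A x_{t-1} + B v_t, y_t = C x_t + D v_t."""
    rng = rng or np.random.default_rng(0)
    n, m = B.shape
    p = C.shape[0]
    x = np.zeros(n)
    ys = np.empty((T, p))
    for t in range(T):
        v = rng.standard_normal(m)   # white Gaussian stimulus with identity covariance
        x = A @ x + B @ v
        ys[t] = C @ x + D @ v
    return ys

# For p >> n the parametrization has about n^2 + nm + pn + pm entries, linear in p,
# versus the roughly p^2 coefficients per lag of an unrestricted ARMA model.
```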
The first question to ask is: why model the processes using the state-space model
(representation) (8.1)? Recall that processes have equivalent ARMA and state-space
representations. Moreover, model (8.1) is quite general and with n large enough it
can approximate a large class of processes. More importantly, state-space represen-
tations (especially in high dimensions) are often more suitable for parameter learning
or system identification. In important practical cases of interest such models conveniently yield a more parsimonious parametrization than vectorial ARMA models, which suffer from the curse of dimensionality [24]. The curse of dimensionality in ARMA models stems from the fact that, for p-dimensional time series with p very large, the number of parameters of an ARMA model is roughly proportional to $p^2$, which could be much larger than the number $pT$ of available data samples, where T is the observation time period (note that the autoregressive coefficient matrices are very large $p \times p$ matrices). However, in many situations encountered in real-world examples, state-space models are more effective in overcoming the curse of dimensionality [20, 24]. The intuitive reason, as already alluded to, is that often (very)
high-dimensional time series can be well approximated as being generated by a low
order but high-dimensional dynamical system (which implies small n despite large
p in the model (8.1)). This can be attributed to the fact that the components of the
observed time series exhibit correlations (cross sectional correlation). Moreover, the
contaminating noises also show correlation across different components (see [20, 24]
for examples of exact and detailed assumptions and conditions to formalize these
intuitive facts). Therefore, overall the number of parameters in the state-space model is small compared with $p^2$, and this is readily reflected in (or encoded by) the small size of the dynamics matrix $A_i$ and the thinness of the observation matrix $C_i$ in (8.1).³
2 Note that in a different or more general setting the noise at the output could be a process $w_t$ different (independent) from the input noise $v_t$. This does not cause major changes in our developments. Since the output noise usually represents a perturbation which cannot be modeled, as far as Problem 1 is concerned, one could usually assume that $D_i = 0$.
3 Note that we are not implying that ARMA models are incapable of modeling such time series.
Rather the issue is that general or unrestricted ARMA models suffer from the curse of dimensionality
in the identification problem, and the parametrization of a restricted class of ARMA models with a
small number of parameters is complicated [20]. However, at the same time, by using state-space
models it is easier to overcome the curse of dimensionality and this approach naturally leads to
simple and effective identification algorithms [20, 22].
Also, in general, state-space models are more convenient for computational purposes
than vectorial ARMA models. For example, in the case of high-dimensional time
series most effective estimation methods are based on state-space domain system
identification rooted in control theory [7, 41, 51]. Nevertheless, it should be noted
that, in general, the identification of multi-input multi-output (MIMO) systems is a
subtle problem (see Sect. 8.4 and e.g., [11, 31, 32]). However, for the case where
p > n, there are efficient system identification algorithms available for finding the
state-space parameters [20, 22].
Notice that in Problem 1 we are assuming that all the LDSs have the same order
n (more precisely the minimal order, see Sect. 8.3.3.1). Such an assumption might
seem rather restrictive and a more realistic assumption might be that all systems be
of order not larger than n (see Sect. 8.5.1). Note that since in practice real data can be
only approximately modeled by an LDS of fixed order, if n is not chosen too large,
then gross over-fitting of n is less likely to happen. From a practical point of view, fixing the order for all systems greatly simplifies implementation. Moreover, in classification or clustering problems one might need to combine (e.g., average) such LDSs with the goal of replacing a class of LDSs with a representative LDS. Ideally one would like to define an average in such a way that LDSs of the same order have an average of the same order and not higher; otherwise the problem can become intractable. In fact, most existing approaches tend to dramatically increase the order of the average LDS, which is certainly undesirable. Therefore, intuitively, we would like to consider a space S in which the order of the LDSs is fixed or limited. From a theoretical point of view, this assumption also allows us to work with nicer mathematical spaces, namely smooth manifolds (see Sect. 8.4).
Amongst the most widely used classification and clustering algorithms for static
data are the k-nearest neighborhood and k-means algorithms, both of which rely
on a notion of distance (in a feature space) [21]. These algorithms enjoy certain
universality properties with respect to the probability distributions of the data; hence, in many practical situations where one has little prior knowledge about the
nature of the data, they prove to be very effective [21, 35]. In view of this fact,
in this paper we focus on the notion of distance between LDSs and the stochastic
processes they generate. Hence, a natural question is what space we should use and
what type of distance we should define on it. In Problem 1, obviously, the first two
steps (which are the focus of this paper) have significant impacts on the third one.
One has different choices for the space S, as well as for geometries on that space.
The gamut ranges from an infinite dimensional linear space to a finite dimensional
(non-Euclidean) manifold, and the geometry can be either intrinsic or extrinsic. By
an intrinsic geometry we mean one in which a shortest path between two points in
a space stays in the space, and by an extrinsic geometry we mean one where the
distance between the two points is measured in an ambient space. In the second part
of this paper, we study our recently developed approach, which is somewhere in
between: to design an easy-to-compute extrinsic distance, while keeping the ambient
space not too large.
This paper is organized as follows: In Sect. 8.2, we review some existing
approaches in geometrization of spaces of stochastic processes. In Sect. 8.3, we focus
on processes generated by LDSs of fixed order, and in Sect. 8.4, we study smooth
fiber bundle structures over spaces of LDSs generating such processes. Finally, in
Sect. 8.5, we introduce our class of group action induced distances, namely the alignment distances. The paper is concluded in Sect. 8.6. To avoid certain technicalities, and just to convey the main ideas, the proofs are omitted and will appear elsewhere. We should stress that the theory of alignment distances on spaces of LDSs is still under development; however, its basics have appeared in earlier papers [1–3]. For the most part, this paper is an extended version of [3].
Since the subject appears in a range of disciplines, this review is necessarily non-exhaustive. Our emphasis is on the core ideas in defining distances on spaces of
stochastic processes rather than enumerating all such distances. Other sources to
consult may include [9, 10, 25]. In view of Problem 1, our main interest is in the
finite dimensional spaces of LDSs of fixed order and the processes they generate.
However, since such a space can be embedded in the larger infinite dimensional space
of “virtually all processes,” first we consider the latter.
$E\{\cdot\}$ denotes the expectation operation under the associated probability measure. Equivalently, the process can be identified by the Fourier (or z) transform of its covariance sequence, namely the power spectral density (PSD) $P_y(\omega)$, which is a $p \times p$ Hermitian positive semi-definite matrix for every $\omega \in [0, 2\pi]$.⁴ We denote the space of all $p \times p$ PSD matrices by $\mathcal{P}_p$ and its subspace consisting of elements
4 Strictly speaking, in order to be the PSD matrix of a regular stationary process, a matrix function on $[0, 2\pi]$ must satisfy other mild technical conditions (see [62] for details).
This distance is derived in [28] and is also called the d̄2 -distance (see also [27, p. 292]).
In view of the Hellinger distance between probability measures [9], the above
distance, in the literature, is also called the Hellinger distance [23]. Interestingly,
dH remains valid as the optimal transport-based distance for certain non-Gaussian
processes, as well [27, p. 292]. The extension of the optimal transport-based defini-
tion to higher dimensions is not straightforward. However, note that in P1 , dH can be
thought of as a square root version of $d_E$. In fact, the square root based definition can be easily extended to higher dimensions, e.g., in (8.3) one could simply replace the scalar square roots with the (matrix) Hermitian square roots of $P_{y_i}(\omega)$, $i = 1, 2$ (at each frequency ω) and use a matrix norm. Recall that the Hermitian square root of the Hermitian matrix Y is the unique Hermitian solution of the equation $Y = XX^H$, where $H$ denotes conjugate transpose. We denote the Hermitian square root of Y as $Y^{1/2}$. Therefore, we could define the Hellinger distance in higher dimensions as

$$d_H^2(y_1, y_2) = \int \big\| P_{y_1}^{1/2}(\omega) - P_{y_2}^{1/2}(\omega) \big\|_F^2 \, d\omega, \quad (8.4)$$
However, note that for any unitary matrix U, $X = Y^{1/2}U$ is also a solution to $Y = XX^H$ (but not Hermitian if U differs from the identity). This suggests that one may be able to do better by finding the best unitary matrix $U(\omega)$ to minimize $\| P_{y_1}^{1/2}(\omega) - P_{y_2}^{1/2}(\omega) U(\omega) \|_F$ (at each frequency ω). In [23] this idea has been used to define the (improved) Hellinger distance on $\mathcal{P}_p$, which can be written in closed form as

$$d_{H^\circ}^2(y_1, y_2) = \int \big\| P_{y_1}^{1/2} - P_{y_2}^{1/2}\big(P_{y_2}^{1/2} P_{y_1} P_{y_2}^{1/2}\big)^{-1/2} P_{y_2}^{1/2} P_{y_1}^{1/2} \big\|_F^2 \, d\omega, \quad (8.5)$$

where the dependence of the terms on ω has been dropped. Notice that the matrix $U(\omega) = \big(P_{y_2}^{1/2} P_{y_1} P_{y_2}^{1/2}\big)^{-1/2} P_{y_2}^{1/2} P_{y_1}^{1/2}$ is unitary for every ω and is, in fact, the transfer function of an all-pass, possibly infinite dimensional, linear filter [23]. Here, by an all-pass transfer function or filter $U(\omega)$ we mean one for which $U(\omega)U(\omega)^H = I_p$. Also note that (8.5) seemingly breaks down if either of the PSDs is not full-rank; however, solving the related optimization shows that by continuity the expression remains valid. We should point out that recently a class of distances on $\mathcal{P}_1$ has been introduced by Georgiou et al. based on the notion of optimal mass transport or morphism between PSDs (rather than probability distributions, as above) [25]. Such distances enjoy some nice properties, e.g., in terms of robustness with respect to multiplicative and additive noise [25]. An extension to $\mathcal{P}_p$ has also been proposed [53]; however, the extension is no longer a distance and it is not clear whether it inherits the robustness property.
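As an illustration of (8.4)–(8.5), the following is a minimal NumPy sketch approximating the improved Hellinger distance when the PSDs are sampled on a uniform frequency grid; the function names are ours, the PSDs are assumed Hermitian positive definite at each sampled frequency, and the integral is replaced by a Riemann sum.

```python
import numpy as np

def herm_sqrt(Y, inv=False):
    """Hermitian (inverse) square root of a Hermitian PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(Y)
    w = np.clip(w, 1e-12, None)            # guard against tiny negative eigenvalues
    d = w ** (-0.5 if inv else 0.5)
    return (V * d) @ V.conj().T

def hellinger_improved(psd1, psd2, omegas):
    """Approximate (8.5); psd1, psd2 have shape (K, p, p) on the grid omegas (length K)."""
    total = 0.0
    for P1, P2 in zip(psd1, psd2):
        S1, S2 = herm_sqrt(P1), herm_sqrt(P2)
        # optimal all-pass alignment U = (S2 P1 S2)^{-1/2} S2 S1, unitary at each frequency
        U = herm_sqrt(S2 @ P1 @ S2, inv=True) @ S2 @ S1
        total += np.linalg.norm(S1 - S2 @ U, 'fro') ** 2
    return total * (omegas[1] - omegas[0])  # Riemann-sum approximation of the integral
```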
Another (possibly deeper) aspect of working with the square root of the PSD is related to the ideas of spectral factorization and the innovations process. We review some basics, which can be found, e.g., in [6, 31, 32, 38, 62, 65]. The important fact is that the PSD $P_y(\omega)$ of a regular process $y_t$ in $\mathcal{P}_p$ is of constant rank $m \le p$ almost everywhere in $[0, 2\pi]$. Moreover, it admits a factorization of the form $P_y(\omega) = P^l_y(\omega) P^l_y(\omega)^H$, where $P^l_y(\omega)$ is $p \times m$-dimensional and uniquely determines its analytic extension $P^l_y(z)$ outside the unit disk in $\mathbb{C}$. In this factorization, $P^l_y(\omega)$, itself, is not determined uniquely, and any two such factors are related by an $m \times m$-dimensional all-pass filter. However, if we require the extension $P^l_y(z)$ to be in the class of minimum phase filters, then the choice of the factor $P^l_y(\omega)$ becomes unique up to a constant unitary matrix. A $p \times m$
Such divergences enjoy certain invariance properties, e.g., if we filter both processes
with a common minimum phase filter, then the divergence remains unchanged. In
particular, it is scale-invariant. Such properties are shared by the distances or diver-
gences that are based on the ratios of PSDs (see below for more examples). Scale
invariance in the case of 1D PSDs has been advocated as a desirable property, since in
many cases the shape of the PSDs rather than their relative scale is the discriminative
feature (see e.g., [9, 26]).
One can arrive at similar distances from other geometric or probabilistic paths.
One example is the famous Itakura-Saito divergence (sometimes called distance)
5 In fact, our approach (in Sects. 8.3–8.5) is also based on the idea of comparing the minimum phase
(i.e., canonical) filters or factors in the case of processes with rational spectra. However, instead of
comparing the associated transfer functions or impulse responses, we try to compare the associated
state-space realizations (in a specific sense). This approach, therefore, is in some sense structural or
generative, since it tries to compare how the processes are generated (according to the state-space
representation) and the model order plays an explicit role in it.
This divergence has been used in practice, at least, since the 1970s (see [48] for
references). The Itakura-Saito divergence can be derived from the Kullback-Leibler
divergence between (infinite dimensional) probability densities of the two processes
(the definition is time-domain based; however, the final result is readily expressible in the frequency domain).⁶ On the other hand, Amari's information geometry-based approach [5, Chap. 5] allows one to geometrize $\mathcal{P}_1^+$ in various ways and yields different distances including the Itakura-Saito distance (8.7) or a Riemannian distance such as

$$d_R^2(y_1, y_2) = \int \Big(\log \frac{P_{y_1}}{P_{y_2}}\Big)^2 d\omega. \quad (8.8)$$

Furthermore, in this framework one can define geodesics between two processes under various Riemannian or non-Riemannian connections. The high-dimensional version of the Itakura-Saito distance has also been known since the 1980s [42] but is less used in practice:

$$d_{IS}(y_1, y_2) = \int \Big(\operatorname{trace}\big(P_{y_2}^{-1} P_{y_1}\big) - \log \det\big(P_{y_2}^{-1} P_{y_1}\big) - p\Big) d\omega. \quad (8.9)$$
where log is the standard matrix logarithm. In general, such approaches are not suited
for large p due to computational costs and the full-rankness requirement. We should
stress that in (very) high dimensions the assumption of full-rankness of PSDs is not
a viable one, in particular because usually not only the actual time series are highly
correlated but also the contaminating noises are correlated as well. In fact, this has led to the search for models capturing this quality. One example is the class of
generalized linear dynamic factor models, which are closely related to the tall, full
rank LDS models (see Sect. 8.3.3 and [20, 24]).
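For comparison, here is a similarly hedged sketch of the high-dimensional Itakura-Saito distance (8.9), under the same full-rankness assumption the text warns about; names and grid conventions are ours.

```python
import numpy as np

def itakura_saito(psd1, psd2, omegas):
    """Approximate (8.9) on a uniform frequency grid; psd1, psd2 of shape (K, p, p)."""
    p = psd1.shape[-1]
    total = 0.0
    for P1, P2 in zip(psd1, psd2):
        R = np.linalg.solve(P2, P1)            # P2^{-1} P1 without forming the inverse
        sign, logdet = np.linalg.slogdet(R)    # det(R) > 0 for positive-definite inputs
        total += np.trace(R).real - logdet - p
    return total * (omegas[1] - omegas[0])
```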
Leaving the above-mentioned issues aside, for the purposes of Problem 1, the space $\mathcal{P}_p$ (or even $\mathcal{P}_p^+$) is too large. The reason is that it includes, e.g., ARMA processes of arbitrarily large orders, and it is not clear, e.g., how an average of some ARMA models
6 Notice that defining distances between probability densities in the time domain is a more general
approach than the PSD-based approaches, and it can be employed in the case of nonstationary as
well as non-Gaussian processes. However, such an approach, in general, is computationally difficult.
7 Interestingly, for an average defined based on the Itakura-Saito divergence in the space of 1D AR models this property holds [26]; see also [5, Sect. 5.3].
More relevant to us are [33, 46], where (intrinsic) state-space based Riemannian dis-
tances between LDSs of fixed size and fixed order have been studied. Such approaches
ideally suit Problem 1, but they are computationally demanding. More recently, in
[1] and subsequently in [2, 3], we introduced group action induced distances on
certain spaces of LDSs of fixed size and order. As it will become clear in the next
section, an important feature of this approach is that the LDS order is explicit in the
construction of the distance, and the state-space parameters appear in the distance
in a simple form. These features make certain related calculations (e.g., optimiza-
tion) much more convenient (compared with other methods). Another aspect of our
approach is that, contrary to most of the distances discussed so far, which compare
the PSDs or the canonical factors directly, our approach amounts to comparing the
8 It is interesting to note that by a simple modification some of the spectral-ratio based distances can attain this property, e.g., by modifying $d_R$ in (8.8) as $d_{RI}^2(y_1, y_2) = \int \big(\log \frac{P_{y_1}}{P_{y_2}}\big)^2 d\omega - \big(\int \log \frac{P_{y_1}}{P_{y_2}} \, d\omega\big)^2$ (see also [9, 25, 49]).
9 This and the results in [53] underline the fact that defining distances on P p for p > 1 may
be challenging, not only from a computational point of view but also from a theoretical one. In
particular, certain nice properties in 1D do not automatically carry over to higher dimensions by a
simple extension of the definitions in 1D.
generative or structural models of the processes, that is, how they are generated. This feature could also be useful in designing more application-specific or structure-aware distances.
Two (stochastic) LDSs are indistinguishable if their output PSDs are equal. Using this
equivalence on the entire set of LDSs is not useful because, as mentioned earlier, two transfer functions which differ by an all-pass filter result in the same PSD. Therefore,
the equivalence relation could induce a complicated many-to-one correspondence
between the LDSs and the subspace of stochastic processes they generate. However, if
we restrict ourselves to the subspace of minimum phase LDSs the situation improves.
Let us denote the subspace of minimum-phase realizations by $\widetilde{SL}^{a,mp}_{m,n,p} \subset \widetilde{SL}^{a}_{m,n,p}$. This is clearly an open submanifold of $\widetilde{SL}^{a}_{m,n,p}$. In $\widetilde{SL}^{a,mp}_{m,n,p}$, the canonical spectral factorization of the output PSD is unique up to an orthogonal matrix [6, 62, 65]: let $T_1(z)$ and $T_2(z)$ have realizations in $\widetilde{SL}^{a,mp}_{m,n,p}$ and let $T_1(z)T_1^\top(z^{-1}) = T_2(z)T_2^\top(z^{-1})$; then $T_1(z) = T_2(z)\Phi$ for a unique $\Phi \in O(m)$, where $O(m)$ is the Lie group of $m \times m$ orthogonal matrices. Therefore, any p-dimensional process with PSD of normal rank m can be identified with a simple equivalence class of stable and minimum-phase transfer functions and the corresponding LDSs.¹¹
10 It is crucial to have in mind that we explicitly distinguish between the LDS, M, and its realization R, which is not unique. As it will become clear soon, an LDS has an equivalence class of realizations.
11 These rank conditions, interestingly, have differential geometric significance in yielding nice
A fundamental fact is that there are symmetries or invariances due to certain Lie group actions in the model (8.1). Let $GL(n)$ denote the Lie group of $n \times n$ non-singular (real) matrices. We say that the Lie group $GL(n) \times O(m)$ acts on the realization space $\widetilde{SL}_{m,n,p}$ (or its subspaces) via the action • defined as¹²
One can easily verify that under this action the output covariance sequence (or PSD)
remains invariant. In general, the converse is not true. That is, two output covariance
sequences might be equal while their corresponding realizations are not related via
• (due to non-minimum phase and the action not being free [47], also see below).
Recall that the action of a group on a set is called free if every element of the set is
fixed only by the identity element of the group. For the converse to hold we need to
impose further rank conditions, as we will see next.
Recall that the controllability and observability matrices of order k associated with a realization $R = (A, B, C, D)$ are defined as $\mathcal{C}_k = [B, AB, \ldots, A^{k-1}B]$ and $\mathcal{O}_k = [C^\top, (CA)^\top, \ldots, (CA^{k-1})^\top]^\top$, respectively. A realization is called controllable (resp. observable) if $\mathcal{C}_k$ (resp. $\mathcal{O}_k$) is of rank n for $k = n$. We denote the subspace of controllable (resp. observable) realizations by $\widetilde{SL}^{co}_{m,n,p}$ (resp. $\widetilde{SL}^{ob}_{m,n,p}$). The space $\widetilde{SL}^{min}_{m,n,p} = \widetilde{SL}^{co}_{m,n,p} \cap \widetilde{SL}^{ob}_{m,n,p}$ is called the space of minimal realizations. An important fact is that we cannot reduce the order (i.e., the size of A) of a minimal realization without changing its input-output behavior.
linear dynamic factor models for (very) high-dimensional time series [20] and also appear in video sequence modeling [1, 12, 60]. It is easy to verify that all the above realization spaces are smooth open submanifolds of $\widetilde{SL}_{m,n,p}$. Their corresponding submanifolds of stable or minimum-phase LDSs (e.g., $\widetilde{SL}^{a,mp,co}_{m,n,p}$) are defined in an obvious way.
The following proposition forms the basis of our approach to defining distances
between processes: any distance on the space of LDSs with realizations in the above
submanifolds (with rank conditions) can be used to define a distance on the space of
processes generated by those LDSs.
results, e.g., in [33] shows that, in fact, we have a principal fiber bundle structure.
Theorem 1 Let $\widetilde{\Gamma}_{m,n,p}$ be as in Proposition 1 and $\Gamma_{m,n,p} = \widetilde{\Gamma}_{m,n,p}/(GL(n) \times O(m))$ be the corresponding quotient LDS space. The realization-system pair $(\widetilde{\Gamma}_{m,n,p}, \Gamma_{m,n,p})$ has the structure of a smooth principal fiber bundle with structure group $GL(n) \times O(m)$. In the case of $\widetilde{SL}^{a,mp,tC}_{m,n,p}$ the bundle is trivial (i.e., diffeomorphic to a product); otherwise it is trivial only when $m = 1$ or $n = 1$.
The last part of the theorem has an important consequence. Recall that a principal bundle is trivial if it is diffeomorphic to the global product of its base space and its structure group. Equivalently, this means that a trivial bundle admits a global smooth cross section or what is known as a smooth canonical form in the case of LDSs, i.e., a globally smooth mapping $s: \Gamma_{m,n,p} \to \widetilde{\Gamma}_{m,n,p}$ which assigns to every system a unique realization. This theorem implies that the minimality condition is a complicated nonlinear constraint, in the sense that it makes the bundle twisted and nontrivial, so that no continuous canonical form exists. Establishing this obstruction put an end to control theorists' search for canonical forms for MIMO LDSs in the 1970s and explained why system identification for MIMO LDSs is a challenging task [11, 15, 36].
On the other hand, one can verify that $(\widetilde{SL}^{a,mp,tC}_{m,n,p}, SL^{a,mp,tC}_{m,n,p})$ is a trivial bundle. Therefore, for such systems global canonical forms exist and they can be used to define distances, i.e., if $s: SL^{a,mp,tC}_{m,n,p} \to \widetilde{SL}^{a,mp,tC}_{m,n,p}$ is such a canonical form, then $d_{SL^{a,mp,tC}_{m,n,p}}(M_1, M_2) = \tilde{d}_{\widetilde{SL}^{a,mp,tC}_{m,n,p}}(s(M_1), s(M_2))$ defines a distance on $SL^{a,mp,tC}_{m,n,p}$ for any distance $\tilde{d}_{\widetilde{SL}^{a,mp,tC}_{m,n,p}}$ on the realization space. In general, unless one has some specific knowledge, there is no preferred choice for a section or canonical form. If one has a group-invariant distance on the realization space, then the distance induced from using a cross section might be inferior to the group action induced distance, in the sense that it may result in an artificially larger distance. In the next section we review the basic idea behind group action induced distances in our application.
Figure 8.1a schematically shows a realization bundle $\widetilde{\Gamma}$ and its base LDS space $\Gamma$. Systems $M_1, M_2 \in \Gamma$ have realizations $R_1$ and $R_2$ in $\widetilde{\Gamma}$, respectively. Let us assume that a $G = GL(n) \times O(m)$-invariant distance $\tilde{d}_G$ on the realization bundle is given. The realizations $R_1$ and $R_2$, in general, are not aligned with each other, i.e., $\tilde{d}_G(R_1, R_2)$ can still be reduced by sliding one realization along its fiber, as depicted in Fig. 8.1b. This leads to the definition of the group action induced distance:¹³

In fact, one can show that $d_\Gamma(\cdot, \cdot)$ is a true distance on $\Gamma$, i.e., it is symmetric and positive definite and obeys the triangle inequality (see e.g., [66]).¹⁴

The main challenge in the above approach is the fact that, due to the non-compactness of $GL(n)$, constructing a $GL(n) \times O(m)$-invariant distance is computationally difficult.
13 We may call this an alignment distance. However, based on the same principle in Sect. 8.5 we
define another group action induced distance, which we explicitly call the alignment distance. Since
our main object of interest is that distance, we prefer not to call the distance in (8.13) an alignment
distance.
14 It is interesting to note that some of the good properties of the k-nearest neighborhood algorithms
Fig. 8.1 Over each LDS in $\Gamma$ sits a realization fiber. The fibers together form the realization space (bundle) $\widetilde{\Gamma}$. If given a G-invariant distance on the realization bundle, then one can define a distance on the LDS space by aligning any realizations $R_1, R_2$ of the two LDSs $M_1, M_2$ as in (8.13)
Next, we recall the notion of reducing a bundle with non-compact structure group to one with a compact structure group. This will be useful in our geometrization approach in the next section. Interestingly, bundle reduction also appears in the statistical analysis of shapes under the name of standardization [43]. The basic fact is that any principal fiber G-bundle $(\widetilde{\Gamma}, \Gamma)$ can be reduced to an OG-subbundle $O\widetilde{\Gamma} \subset \widetilde{\Gamma}$, where OG is the maximal compact subgroup of G [44]. This reduction means that $\Gamma$ is diffeomorphic to $O\widetilde{\Gamma}/OG$ (i.e., no topological information is lost by going to the subbundle and the subgroup). Therefore, in our cases of interest we can reduce a $GL(n) \times O(m)$-bundle to an $OG(n, m) = O(n) \times O(m)$-subbundle. We call
15 This problem, in general, is difficult, among other things, because it is a non-convex (infinite-
dimensional) variational problem. Recall that in Riemannian geometry the non-convexity of the arc
length variational problem can be related to the non-trivial topology of the manifold (see e.g., [17]).
Fig. 8.2 A standardized subbundle $O\widetilde{\Gamma}_{m,n,p}$ of $\widetilde{\Gamma}_{m,n,p}$ is a subbundle on which G acts via its compact subgroup OG. The quotient space $O\widetilde{\Gamma}_{m,n,p}/OG$ is still diffeomorphic to the base space $\Gamma_{m,n,p}$. One can define an alignment distance on the base space by aligning realizations $R_1, R_2 \in O\widetilde{\Gamma}_{m,n,p}$ of $M_1, M_2 \in \Gamma_{m,n,p}$ as in (8.15)
for some $Q \in O(n)$ [37]. Minimizing h is called balancing (in the sense of Helmke [37]). One can show that balancing is, in fact, a standardization in the sense that we defined (a proof of this fact will appear elsewhere). Note that a more specific form of balancing, called diagonal balancing (due to Moore [52]), is more common in the control literature; however, it cannot be considered as a form of reduction of the structure group. The intuitive reason is that it tries to reduce the structure group beyond the orthogonal group to the identity element, i.e., to get a canonical form (see also [55]). However, it fails in the sense that, as mentioned above, it cannot give a smooth canonical form, i.e., a section which is diffeomorphic to $SL^{a,mp,min}_{m,n,p}$.
In this section, we propose to use the large class of extrinsic unitary invariant distances on a standardized realization subbundle to build distances on the LDS base space. The main benefits are that such distances are abundant, the ambient space is not too large (e.g., not infinite dimensional), and calculating the distance in the base space boils down to a static optimization problem (albeit non-convex). Specifically, let $\tilde{d}_{O\widetilde{\Gamma}_{m,n,p}}$ be a unitary invariant distance on a standardized realization subbundle $O\widetilde{\Gamma}_{m,n,p}$ with the base $\Gamma_{m,n,p}$ (as in Theorem 1). One example of such a distance is

$$\tilde{d}^2_{O\widetilde{\Gamma}_{m,n,p}}(R_1, R_2) = \lambda_A \|A_1 - A_2\|_F^2 + \lambda_B \|B_1 - B_2\|_F^2 + \lambda_C \|C_1 - C_2\|_F^2 + \lambda_D \|D_1 - D_2\|_F^2. \quad (8.14)$$
In [39] a fast algorithm is developed which (with little modification) can be used to
compute this distance.
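To convey the flavor of such computations, the following sketch (not the Jacobi-type algorithm of [39], whose details are in that reference) locally minimizes the realization mismatch over the orthogonal group for the (A, C) part of (8.14) with unit weights. It parametrizes only the SO(n) component of O(n) through skew-symmetric generators and uses random restarts against non-convexity; all names are illustrative.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def _orth(x, n):
    """Map n(n-1)/2 parameters to a rotation matrix via a skew-symmetric generator."""
    X = np.zeros((n, n))
    X[np.triu_indices(n, 1)] = x
    return expm(X - X.T)

def best_alignment(A1, C1, A2, C2, n_restarts=10, seed=0):
    """Q in SO(n) (locally) minimizing ||Q^T A1 Q - A2||_F^2 + ||C1 Q - C2||_F^2,
    together with the square root of the attained value (the candidate distance)."""
    n = A1.shape[0]

    def cost(x):
        Q = _orth(x, n)
        return (np.linalg.norm(Q.T @ A1 @ Q - A2, 'fro') ** 2
                + np.linalg.norm(C1 @ Q - C2, 'fro') ** 2)

    rng = np.random.default_rng(seed)
    best_val, best_Q = np.inf, np.eye(n)
    for _ in range(n_restarts):             # restarts to mitigate non-convexity
        res = minimize(cost, rng.standard_normal(n * (n - 1) // 2))
        if res.fun < best_val:
            best_val, best_Q = res.fun, _orth(res.x, n)
    return best_Q, np.sqrt(best_val)
```

Given two realizations, `best_alignment(A1, C1, A2, C2)` returns the aligning Q and the corresponding candidate distance between the two LDSs, up to the local-minimum caveat above.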
Remark 4 We stress that, via the identification of a process with its canonical spectral factors (Proposition 1 and Theorem 1), $d_{\Gamma_{m,n,p}}(\cdot, \cdot)$ is (or induces) a distance on the space of processes generated by the LDSs in $\Gamma_{m,n,p}$. Therefore, in the spirit of the distances studied in Sect. 8.2 we could have written $d_{\Gamma_{m,n,p}}(y_1, y_2)$ instead of $d_{\Gamma_{m,n,p}}(M_1, M_2)$, where $y_1$ and $y_2$ are the processes generated by $M_1$ and $M_2$ when excited by the standard Gaussian process. However, the chosen notation seems more convenient.
Remark 5 Calling the static global minimization problem (8.15) "easy" in absolute terms is an oversimplification. However, even this global minimization over orthogonal matrices is definitely simpler than solving the nonlinear geodesic ODEs and finding shortest geodesics globally (an infinite-dimensional dynamic programming problem). It is our ongoing research to develop fast and reliable algorithms to solve (8.15). Our experiments indicate that the Jacobi algorithm in [39] is quite effective in finding global minimizers.
In [1], this distance was first introduced on $SL^{a,mp,tC}_{m,n,p}$ with the standardized subbundle $O\widetilde{SL}^{a,mp,tC}_{m,n,p}$. The distance was used for efficient video sequence classification (using 1-nearest neighborhood and nearest mean methods) and clustering (e.g., via defining averages or a k-means like algorithm). However, it should be mentioned that in video applications (for reasons which are not completely understood) the comparison of LDSs based on the (A, C) part in (8.1) has proven quite effective (in fact, such distances are more commonly used than distances based on comparing the full model). Therefore, in [1], the alignment distance (8.15) with parameters $\lambda_B = \lambda_D = 0$ was used, see (8.14). An algorithm, called align and average, is developed to do averaging on $SL^{a,mp,tC}_{m,n,p}$ (see also [2]). One defines the average $\bar{M}$ of LDSs $\{M_i\}_{i=1}^{N} \subset SL^{a,mp,tC}_{m,n,p}$ (the so-called Fréchet mean or average) as a minimizer of the sum of the squares of distances:

$$\bar{M} = \operatorname{argmin}_{M} \sum_{i=1}^{N} d^2_{SL^{a,mp,tC}_{m,n,p}}(M, M_i). \quad (8.16)$$
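Continuing the previous sketch (and reusing its `best_alignment`), here is a hedged version of one possible align-and-average iteration for (8.16): each pass aligns every realization to the current mean along its fiber and then averages the aligned realizations. This is only a heuristic fixed-point iteration under our assumptions; re-standardization of the average and stability of the averaged A, which the actual algorithm of [1, 2] must handle, are ignored here, and all names are illustrative.

```python
import numpy as np

def align_and_average(As, Cs, n_iter=20):
    """Heuristic Frechet-mean iteration for (8.16) on the (A, C) parts only."""
    A_bar, C_bar = As[0].copy(), Cs[0].copy()   # initialize at one of the systems
    for _ in range(n_iter):
        A_acc, C_acc = [], []
        for A, C in zip(As, Cs):
            Q, _ = best_alignment(A, C, A_bar, C_bar)   # from the sketch above
            A_acc.append(Q.T @ A @ Q)                   # realization slid along its fiber
            C_acc.append(C @ Q)
        A_bar, C_bar = np.mean(A_acc, axis=0), np.mean(C_acc, axis=0)
    return A_bar, C_bar
```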
8.5.1 Extensions
Now, we briefly point to some possible directions along which this basic idea can
be extended (see also [2]). First, note that the Frobenius norm in (8.14) can be
replaced by any other unitary invariant matrix norm (e.g., the nuclear norm). A less
trivial extension is to get rid of O(m) in (8.15) by passing to covariance matrices. For example, in the case of $O\widetilde{SL}^{a,mp,tC}_{m,n,p}$ it is easy to verify that $SL^{a,mp,tC}_{m,n,p} = O\widetilde{SL}^{a,mp,tC,cv}_{m,n,p}/(O(n) \times I_m)$, where $O\widetilde{SL}^{a,mp,tC,cv}_{m,n,p} = \{(A, Z, C, S) \mid (A, B, C, D) \in O\widetilde{SL}^{a,mp,tC}_{m,n,p},\ Z = BB^\top,\ S = DD^\top\}$. On this standardized subspace one only has the action of $O(n)$, which we denote as $Q \bullet (A, Z, C, S) = (Q^\top A Q, Q^\top Z Q, C Q, S)$. One can use the same ambient distance on this space as in (8.14) and get

$$d^2_{\Gamma_{m,n,p}}(M_1, M_2) = \min_{Q \in O(n)} \tilde{d}^2_{O\widetilde{\Gamma}_{m,n,p}}(Q \bullet R_1, R_2), \quad (8.17)$$

for realizations $R_1, R_2 \in O\widetilde{SL}^{a,mp,tC,cv}_{m,n,p}$. One could also replace the $\|\cdot\|_F$ in the terms associated with B and D in (8.14) with some known distances in the spaces of positive definite matrices or positive semi-definite matrices of fixed rank (see e.g., [14, 63]). Another possible extension is, e.g., to consider other submanifolds of $O\widetilde{SL}^{a,mp,tC}_{m,n,p}$, e.g., a submanifold where $\|C\|_F = \|B\|_F = 1$. In this case the corresponding alignment distance is essentially a scale-invariant distance, i.e., two processes which are scaled versions of one another will have zero distance. A more significant and subtle extension is to extend the underlying space of LDSs of fixed size and order n to that of fixed size but (minimal) order not larger than n. The details of this approach will appear elsewhere.
8.6 Conclusion
In this paper our focus was the geometrization of spaces of stochastic processes
generated by LDSs of fixed size and order, for use in pattern recognition of high-
dimensional time-series data (e.g., in the prototype Problem 1). We reviewed some
of the existing approaches. We then studied the newly developed class of group
action induced distances called the alignment distances. The approach is a general
and flexible geometrization framework, based on the quotient structure of the space
of such LDSs, which leads to a large class of extrinsic distances. The theory of
alignment distances and their properties is still in the early stages of development, and we hope to tackle some interesting problems in control theory as well as in pattern recognition for time-series data.
Acknowledgments The authors are thankful to the anonymous reviewers for their insightful
comments and suggestions, which helped to improve the quality of this paper. The authors also
thank the organizers of the GSI 2013 conference and the editor of this book Prof. Frank Nielsen.
This work was supported by the Sloan Foundation and by grants ONR N00014-09-10084, NSF
0941362, NSF 0941463, NSF 0931805, and NSF 1335035.
References
1. Afsari, B., Chaudhry, R., Ravichandran, A., Vidal, R.: Group action induced distances for
averaging and clustering linear dynamical systems with applications to the analysis of dynamic
visual scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
2. Afsari, B., Vidal, R.: The alignment distance on spaces of linear dynamical systems. In: IEEE
Conference on Decision and Control (2013)
3. Afsari, B., Vidal, R.: Group action induced distances on spaces of high-dimensional linear
stochastic processes. In: Geometric Science of Information, LNCS, vol. 8085, pp. 425–432
(2013)
4. Amari, S.I.: Differential geometry of a parametric family of invertible linear systems-
Riemannian metric, dual affine connections, and divergence. Math. Syst. Theory 20, 53–82
(1987)
5. Amari, S.I., Nagaoka, H.: Methods of information geometry. In: Translations of Mathematical
Monographs, vol. 191. American Mathematical Society, Providence (2000)
6. Anderson, B.D., Deistler, M.: Properties of zero-free spectral matrices. IEEE Trans. Autom.
Control 54(10), 2365–5 (2009)
7. Aoki, M.: State Space Modeling of Time Series. Springer, Berlin (1987)
8. Barbaresco, F.: Information geometry of covariance matrix: Cartan-Siegel homogeneous
bounded domains, Mostow/Berger fibration and Frechet median. In: Matrix Information Geom-
etry, pp. 199–255. Springer, Berlin (2013)
9. Basseville, M.: Distance measures for signal processing and pattern recognition. Sig. Process.
18, 349–9 (1989)
10. Basseville, M.: Divergence measures for statistical data processing: an annotated bibliography. Sig. Process. 93(4), 621–33 (2013)
11. Bauer, D., Deistler, M.: Balanced canonical forms for system identification. IEEE Trans.
Autom. Control 44(6), 1118–1131 (1999)
12. Béjar, B., Zappella, L., Vidal, R.: Surgical gesture classification from video data. In: Medical
Image Computing and Computer Assisted Intervention, pp. 34–41 (2012)
13. Boets, J., Cock, K.D., Moor, B.D.: A mutual information based distance for multivariate
Gaussian processes. In: Modeling, Estimation and Control, Festschrift in Honor of Giorgio
Picci on the Occasion of his Sixty-Fifth Birthday, Lecture Notes in Control and Information
Sciences, vol. 364, pp. 15–33. Springer, Berlin (2007)
14. Bonnabel, S., Collard, A., Sepulchre, R.: Rank-preserving geometric means of positive semi-
definite matrices. Linear Algebra. Its Appl. 438, 3202–16 (2013)
15. Byrnes, C.I., Hurt, N.: On the moduli of linear dynamical systems. In: Advances in Mathemat-
ical Studies in Analysis, vol. 4, pp. 83–122. Academic Press, New York (1979)
16. Chaudhry, R., Vidal, R.: Recognition of visual dynamical processes: Theory, kernels and
experimental evaluation. Technical Report 09–01. Department of Computer Science, Johns
Hopkins University (2009)
17. Chavel, I.: Riemannian Geometry: A Modern Introduction, vol. 98, 2nd edn. Cambridge Uni-
versity Press, Cambridge (2006)
18. Cock, K.D., Moor, B.D.: Subspace angles and distances between ARMA models. Syst. Control
Lett. 46(4), 265–70 (2002)
19. Corduas, M., Piccolo, D.: Time series clustering and classification by the autoregressive metric.
Comput. Stat. Data Anal. 52(4), 1860–72 (2008)
20. Deistler, M., Anderson, B.O., Filler, A., Zinner, C., Chen, W.: Generalized linear dynamic
factor models: an approach via singular autoregressions. Eur. J. Control 3, 211–24 (2010)
21. Devroye, L.: A probabilistic Theory of Pattern Recognition, vol. 31. Springer, Berlin (1996)
22. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. Int. J. Comput. Vision 51(2),
91–109 (2003)
23. Ferrante, A., Pavon, M., Ramponi, F.: Hellinger versus Kullback-Leibler multivariable spec-
trum approximation. IEEE Trans. Autom. Control 53(4), 954–67 (2008)
24. Forni, M., Hallin, M., Lippi, M., Reichlin, L.: The generalized dynamic-factor model: Identi-
fication and estimation. Rev. Econ. Stat. 82(4), 540–54 (2000)
25. Georgiou, T.T., Karlsson, J., Takyar, M.S.: Metrics for power spectra: an axiomatic approach.
IEEE Trans. Signal Process. 57(3), 859–67 (2009)
26. Gray, R., Buzo, A., Gray Jr, A., Matsuyama, Y.: Distortion measures for speech processing.
IEEE Trans. Acoust. Speech Signal Process. 28(4), 367–76 (1980)
27. Gray, R.M.: Probability, Random Processes, and Ergodic Properties. Springer, Berlin (2009)
28. Gray, R.M., Neuhoff, D.L., Shields, P.C.: A generalization of Ornstein’s d̄ distance with appli-
cations to information theory. The Ann. Probab. 3, 315–328 (1975)
29. Gray Jr, A., Markel, J.: Distance measures for speech processing. IEEE Trans. Acoust. Speech
Signal Process. 24(5), 380–91 (1976)
30. Grenander, U.: Abstract Inference. Wiley, New York (1981)
31. Hannan, E.J.: Multiple Time Series, vol. 38. Wiley, New York (1970)
32. Hannan, E.J., Deistler, M.: The Statistical Theory of Linear Systems. Wiley, New York (1987)
33. Hanzon, B.: Identifiability, Recursive Identification and Spaces of Linear Dynamical Systems,
vol. 63–64. Centrum voor Wiskunde en Informatica (CWI), Amsterdam (1989)
34. Hanzon, B., Marcus, S.I.: Riemannian metrics on spaces of stable linear systems, with appli-
cations to identification. In: IEEE Conference on Decision & Control, pp. 1119–1124 (1982)
35. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, New
York (2003)
36. Hazewinkel, M.: Moduli and canonical forms for linear dynamical systems II: the topological
case. Math. Syst. Theory 10, 363–85 (1977)
37. Helmke, U.: Balanced realizations for linear systems: a variational approach. SIAM J. Control
Optim. 31(1), 1–15 (1993)
38. Jiang, X., Ning, L., Georgiou, T.T.: Distances and Riemannian metrics for multivariate spectral
densities. IEEE Trans. Autom. Control 57(7), 1723–35 (2012)
39. Jimenez, N.D., Afsari, B., Vidal, R.: Fast Jacobi-type algorithm for computing distances
between linear dynamical systems. In: European Control Conference (2013)
40. Kailath, T.: Linear Systems. Prentice Hall, NJ (1980)
41. Katayama, T.: Subspace Methods for System Identification. Springer, Berlin (2005)
42. Kazakos, D., Papantoni-Kazakos, P.: Spectral distance measures between Gaussian processes.
IEEE Trans. Autom. Control 25(5), 950–9 (1980)
43. Kendall, D.G., Barden, D., Carne, T.K., Le, H.: Shape and Shape Theory. Wiley Series In
Probability And Statistics. Wiley, New York (1999)
44. Kobayashi, S., Nomizu, K.: Foundations of Differential Geometry Volume I. Wiley Classics
Library Edition. Wiley, New York (1963)
45. Krishnaprasad, P.S.: Geometry of Minimal Systems and the Identification Problem. PhD thesis,
Harvard University (1977)
46. Krishnaprasad, P.S., Martin, C.F.: On families of systems and deformations. Int. J. Control
38(5), 1055–79 (1983)
47. Lee, J.M.: Introduction to Smooth Manifolds. Springer, Graduate Texts in Mathematics (2002)
48. Liao, T.W.: Clustering time series data—a survey. Pattern Recogn. 38, 1857–74 (2005)
49. Makhoul, J.: Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–80 (1975)
50. Martin, A.: A metric for ARMA processes. IEEE Trans. Signal Process. 48(4), 1164–70 (2000)
51. Moor, B.D., Overschee, P.V., Suykens, J.: Subspace algorithms for system identification and
stochastic realization. Technical Report ESAT-SISTA Report 1990–28, Katholieke Universiteit
Leuven (1990)
52. Moore, B.C.: Principal component analysis in linear systems: Controllability, observability,
and model reduction. IEEE Trans. Autom. Control 26, 17–32 (1981)
53. Ning, L., Georgiou, T.T., Tannenbaum, A.: Matrix-valued Monge-Kantorovich optimal mass
transport. arXiv, preprint arXiv:1304.3931 (2013)
54. Nocerino, N., Soong, F.K., Rabiner, L.R., Klatt, D.H.: Comparative study of several distortion
measures for speech recognition. Speech Commun. 4(4), 317–31 (1985)
55. Ober, R.J.: Balanced realizations: canonical form, parametrization, model reduction. Int. J.
Control 46(2), 643–70 (1987)
56. Papoulis, A., Pillai, S.U.: Probability, Random Variables and Stochastic Processes. McGraw-Hill Education, New York (2002)
57. Piccolo, D.: A distance measure for classifying ARIMA models. J. Time Ser. Anal. 11(2),
153–64 (1990)
58. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall International,
NJ (1993)
59. Rao, M.M.: Stochastic Processes: Inference Theory, vol. 508. Springer, New York (2000)
60. Ravichandran, A., Vidal, R.: Video registration using dynamic textures. IEEE Trans. Pattern
Anal. Mach. Intell. 33(1), 158–171 (2011)
61. Ravishanker, N., Melnick, E.L., Tsai, C.-L.: Differential geometry of ARMA models. J. Time
Ser. Anal. 11(3), 259–274 (1990)
62. Rozanov, Y.A.: Stationary Random Processes. Holden-Day, San Francisco (1967)
63. Vandereycken, B., Absil, P.-A., Vandewalle, S.: A Riemannian geometry with complete geo-
desics for the set of positive semi-definite matrices of fixed rank. Technical Report Report
TW572, Katholieke Universiteit Leuven (2010)
64. Vishwanathan, S., Smola, A., Vidal, R.: Binet-Cauchy kernels on dynamical systems and its
application to the analysis of dynamic scenes. Int. J. Comput. Vision 73(1), 95–119 (2007)
65. Youla, D.: On the factorization of rational matrices. IRE Trans. Inf. Theory 7(3), 172–189
(1961)
66. Younes, L.: Shapes and Diffeomorphisms. In: Applied Mathematical Sciences, vol. 171.
Springer, New York (2010)
Chapter 9
Discrete Ladders for Parallel Transport
in Transformation Groups with an Affine
Connection Structure

Marco Lorenzi and Xavier Pennec

9.1 Introduction
manifold by preserving its properties with respect to the space geometry. It is one of the fundamental operations of differential geometry, which enables one to compare tangent vectors, and thus the underlying trajectories, across the whole manifold. For this reason, parallel transport in transformation groups is currently an important field of research, with applications in medical imaging such as the development of spatio-temporal atlases for brain images [26], the study of hippocampal shapes [35], or cardiac motion [1]. In computer vision one also finds applications to motion tracking and, more generally, to statistical analysis [12, 42, 45, 47].
Even though the notion of parallel transport is in some ways intuitive, its practical implementation requires precise knowledge of the space geometry, in particular of the underlying connection. This is not always easy, especially in infinite dimensions, such as in the setting of diffeomorphic image registration. Moreover, parallel transport is a continuous operation involving the computation of (covariant) derivatives which, from the practical point of view, might lead to issues concerning numerical stability and robustness. These issues are related to the unavoidable approximations arising when continuous energy functionals and operators are discretized on grids, especially concerning the evaluation of derivatives through finite difference schemes.
The complexity and limitations deriving from the direct computation of continuous parallel transport methods can be alleviated by considering discrete approximations. In the 1970s, [31] proposed a scheme for performing parallel transport with a very simple geometrical construction. This scheme was called Schild's ladder since it was in the spirit of the work of the theoretical physicist Alfred Schild. The computational interest of Schild's ladder resides in its generality, since it enables the transport of vectors in manifolds by computing geodesics only. This way, the implementation of covariant derivatives is not required anymore, and we can concentrate the implementation effort on the geodesics only.
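To fix ideas, one rung of the ladder can be sketched generically in terms of two placeholder callables, exp(p, v) and log(p, q), standing for the manifold's exponential and logarithm maps (hypothetical interfaces, not a specific library); the construction below builds the geodesic parallelogram detailed in Sect. 9.4 and only assumes that tangent vectors are array-like.

```python
def schild_ladder_step(exp, log, p0, p1, v):
    """One rung of Schild's ladder: transport v from T_{p0} to T_{p1} using geodesics only.

    `exp(p, w)` and `log(p, q)` are placeholder callables for the manifold's exponential
    and logarithm maps; the construction is a first-order approximation of parallel
    transport."""
    a = exp(p0, v)                  # endpoint of the vector to be transported
    m = exp(a, 0.5 * log(a, p1))    # geodesic midpoint between a and p1
    b = exp(p0, 2.0 * log(p0, m))   # extend the geodesic from p0 through m to twice its length
    return log(p1, b)               # read off the transported vector at p1
```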
We recently showed that numerical schemes derived from Schild's ladder can
be effectively applied in the setting of diffeomorphic image registration, by appro-
priately taking advantage of the underlying geometrical setting [28, 29]. Based on
this experience, we believe that discrete transport methods represent promising and
powerful techniques for the analysis of transformations due to their simplicity and
generality. Bearing in mind the applicative context of the development of transport
techniques, this chapter aims to illustrate the principles of discrete schemes for par-
allel transport in smooth groups equipped with affine connection or Riemannian
structures.
The chapter is structured as follows. In Sect. 9.2 we provide fundamental notions of
finite-dimensional Lie Groups and Riemannian geometry concerning affine connec-
tion and covariant derivative. These notions are the basis for the continuous parallel
transport methods defined in Sect. 9.3. In Sect. 9.4 we introduce the Schild’s ladder.
After detailing its construction and mathematical properties, we derive from it the
more efficient Pole ladder construction in Sect. 9.4.2. These theoretical concepts are
then contextualized and discussed in the applicative setting of diffeomorphic image
registration, in which some limitations arise when considering infinite dimensional
Lie groups (Sect. 9.5). Finally, after illustrating numerical implementations of the
We recall here the theoretical notions of Lie group theory and affine geometry that will be extensively used in the following sections.
A Lie group G is a smooth manifold provided with an identity element id, a smooth associative composition rule $(g, h) \in G \times G \mapsto gh \in G$ and a smooth inversion rule $g \mapsto g^{-1}$, both compatible with the differential manifold structure. As such, we have a tangent space $T_gG$ at each point $g \in G$. A vector field X is a smooth function that maps a tangent vector $X|_g$ to each point g of the manifold. The set of vector fields (the tangent bundle) is denoted TG. Vector fields can be viewed as the directional (or Lie) derivative of a scalar function α along the vector field at each point: $\partial_X \alpha|_g = \frac{d}{dt}\alpha(\tau_t)\big|_{t=0}$, where $\tau_t$ is the flow of X and $\tau_0 = g$. Composing directional derivatives $\partial_X \partial_Y \alpha$ leads in general to a second order derivation. However, we can remove the second order terms by subtracting $\partial_Y \partial_X \alpha$ (this can be checked by writing these expressions in a local coordinate system). We obtain the Lie bracket, which acts as an internal multiplication in the algebra of vector fields:

$$[X, Y](\alpha) = \partial_X \partial_Y \alpha - \partial_Y \partial_X \alpha.$$
Left-invariant vector fields are complete in the sense that their flow $\tau_t$ is defined for all time. Moreover, this flow is such that $\tau_t(g) = g\tau_t(\mathrm{id})$ by left invariance. The map $X \mapsto \tau_1(\mathrm{id})$ from $\mathfrak{g}$ into G is called the Lie group exponential and denoted by exp. In particular, the group exponential defines the one-parameter subgroup associated to the vector X and has the following properties:
• $\tau_t(\mathrm{id}) = \exp(tX)$, for each $t \in \mathbb{R}$;
• $\exp((t + s)X) = \exp(tX)\exp(sX)$, for each $t, s \in \mathbb{R}$.
In finite dimension, it can be shown that the Lie group exponential is a diffeomorphism from a neighborhood of 0 in $\mathfrak{g}$ to a neighborhood of id in G.
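For matrix Lie groups the group exponential is the matrix exponential, and the one-parameter subgroup properties above can be checked numerically; a small sketch for a rotation generator in so(3) (the choice of group and matrix is ours):

```python
import numpy as np
from scipy.linalg import expm

# Generator of rotations about the z-axis: a skew-symmetric element of so(3).
X = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])
t, s = 0.3, 0.5

# One-parameter subgroup property: exp((t+s)X) = exp(tX) exp(sX).
assert np.allclose(expm((t + s) * X), expm(t * X) @ expm(s * X))

# exp(tX) indeed lands in the group SO(3): orthogonal with unit determinant.
R = expm(t * X)
assert np.allclose(R.T @ R, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
```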
For each tangent vector $X \in \mathfrak{g}$, the one-parameter subgroup $\exp(tX)$ is a curve that starts from the identity with this tangent vector. One may ask whether this curve can be seen as a geodesic, as in Riemannian manifolds. To answer this question, we first need to define what geodesics are. In a Euclidean space, straight lines are curves which have the same tangent vector at all times. In a manifold, tangent vectors at different times belong to different tangent spaces. When one wants to compare tangent vectors at different points, one needs to define a specific mapping between their tangent spaces: this is the notion of parallel transport. There is generally no way to globally define a linear operator $\Pi_g^h : T_gG \to T_hG$ which is consistent with composition (i.e., $\Pi_g^h \circ \Pi_f^g = \Pi_f^h$). However, specifying the parallel transport for infinitesimal displacements allows integrating along a path, thus resulting in a parallel transport that depends on the path. This specification of the parallel transport for infinitesimal displacements is called the (affine) connection.
The affine connection is therefore a derivation on the tangent space which infinitesimally maps tangent vectors from one tangent plane to another. The connection gives rise to two very important geometrical objects: the torsion and curvature tensors. The torsion quantifies the failure to close infinitesimal geodesic parallelograms:

$$T(X, Y) = \nabla_X Y - \nabla_Y X - [X, Y],$$

while the curvature measures the local deviation of the space from being flat, and is defined as

$$R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z.$$
Given a Riemannian metric, its Levi-Civita connection is the unique affine connection that:
• is compatible with the metric, so that parallel transport preserves the scalar products of tangent vectors;
• is torsion free:
$$\nabla_X Y - \nabla_Y X = [X, Y],$$
thus the parallel transport is symmetric with respect to the Lie bracket.
By choosing the Levi Civita connection of a given Riemannian metric, the affine
geodesics are the length minimizing paths (i.e. classical Riemannian geodesics).
However, given a general affine connection, there may not exist any Riemannian
metric for which affine geodesics are length minimizing.
Given an affine connection ∇ and a vector $X \in T_{\mathrm{id}}G$, we can therefore define two curves on G passing through id and having X as tangent vector, one given by the Lie group exponential exp and the other given by the affine exponential $\exp_{\mathrm{id}}$. When do they coincide?
The connection ≡ on G is left-invariant if, for each left translation La (a ⊂ G)
and any vector fields X and Y, we have ≡DLa X (DLa Y) = DLa ≡X (Y). Using two left
invariant vector fields X̃, Ỹ ⊂ g generated by the tangent vectors X, Y ⊂ Tid G, we see
that ≡X̃ Ỹ is itself a left-invariant vector field generated by its value at identity. Since
a connection is completely determined by its action on the left-invariant vector fields
(we can recover the connection on arbitrary vector fields using Eqs. (9.1) and (9.2)
from their decomposition on the Lie algebra), we conclude that each left-invariant
connection ∇ is uniquely determined by a product ψ (a bilinear operator) on T_id G
through

ψ(X, Y) = (∇_X̃ Ỹ)|_id.

Notice that such a product can be uniquely decomposed into a commutative part
ψ′(X, Y) = ½(ψ(X, Y) + ψ(Y, X)) and a skew-symmetric part ψ″(X, Y) = ½(ψ(X, Y) − ψ(Y, X)).
The symmetric part specifies the geodesics (i.e. the parallel transport of a vector
along its own direction) while the skew-symmetric part specifies the torsion which
governs the parallel transport of a vector along a transverse direction (the rotation
around the direction of the curve if we have a metric connection with torsion).
Following [34], a left-invariant connection ∇ on a Lie group G is a Cartan-
Schouten connection if, for any tangent vector X at the identity, the one-parameter
subgroups and the affine geodesics coincide, i.e. exp(tX) = γ(t; id, X), where
γ(t; id, X) denotes the affine geodesic starting at id with initial tangent vector X. We can see
that a Cartan-Schouten connection satisfies ψ(X, X) = 0 or, equivalently, is purely skew-
symmetric.
The one-dimensional family of connections generated by ψ(X, Y) = φ[X, Y]
obviously satisfies this skew-symmetry condition. Moreover, the connections of this
family are also invariant by right translation [33], and thus invariant by inversion as well,
since they are already left-invariant. This makes them particularly interesting since
they are fully compatible with all the group operations.
In this family, three connections have special curvature or symmetry properties
and are called the canonical Cartan-Schouten connections [11]. The zero-curvature
connections given by φ = 0, 1 (with torsion T = −[X̃, Ỹ] and T = [X̃, Ỹ] respec-
tively on left-invariant vector fields) are called the left and right Cartan connections.
The choice of φ = 1/2 averages the left and right Cartan connections; it is called the
symmetric (or mean) Cartan connection. It is torsion-free, but has curvature

R(X̃, Ỹ)Z̃ = −¼ [[X̃, Ỹ], Z̃].

As a summary, the three canonical Cartan connections of a Lie group are (for two
left-invariant vector fields): ∇_X̃ Ỹ = 0 (left), ∇_X̃ Ỹ = ½[X̃, Ỹ] (symmetric) and
∇_X̃ Ỹ = [X̃, Ỹ] (right).
Since the three canonical Cartan connections only differ by torsion, they share
the same affine geodesics which are the left and right translations of one parameter
subgroups. In the following, we call them group geodesics. However, the parallel
transport of general vectors along these group geodesics is specific to each connection
as we will see below.
Given a metric ⟨X, Y⟩ on the tangent space at the identity of a group, one can propagate
this metric to all tangent spaces using left (resp. right) translation to obtain a left-
(resp. right-) invariant Riemannian metric on the group. In the left-invariant case
we have ⟨DL_a X, DL_a Y⟩_a = ⟨X, Y⟩, and one can show [24] that the Levi-Civita
connection is the left-invariant connection generated by the product

ψ(X, Y) = ½ [X, Y] − ½ (ad*(X, Y) + ad*(Y, X)),

where the operator ad* is defined by ⟨ad*(Y, X), Z⟩ = ⟨[X, Z], Y⟩ for all
X, Y, Z ∈ g. A similar formula can be established for right-invariant metrics using
the algebra of right-invariant vector fields.
We clearly see that this left-invariant Levi-Civita connection has a symmetric part
which makes it differ from the symmetric Cartan connection ψ(X, Y) = ½[X, Y]. In
fact, the quantity ad*(X, X) specifies the rate at which a left-invariant geodesic and
a one-parameter subgroup starting from the identity with the same tangent vector
X deviate from each other. More generally, the condition ad*(X, X) = 0 for all
X ∈ g turns out to be a necessary and sufficient condition for the existence of a bi-invariant
metric [34]. It is important to notice that geodesics of the left- and right-invariant
metrics differ in general, as there do not exist bi-invariant metrics even for simple
groups like the Euclidean motions [33]. However, right-invariant geodesics can be
easily obtained from the left-invariant ones through inversion: if α(t) is a left-invariant
geodesic joining the identity to the transformation α_1, then α^{-1}(t) is a right-invariant
geodesic joining the identity to α_1^{-1}.
After having introduced the theoretical bases of affine connection spaces, in this
section we detail the relationship between the parallel transport of tangent
vectors and, respectively, the Cartan-Schouten and the Riemannian (Levi-Civita) connections.
For the left Cartan connection, the only fields that are covariantly constant are the
left-invariant vector fields, and the parallel transport is induced by the differential of
the left translation [34], i.e. Π^L : T_p G → T_q G is defined as

Π^L(X) = DL_{q p^{-1}} X.
One can see that the parallel transport is actually independent of the path, which is
due to the fact that the curvature is null: we are in a space with absolute parallelism.
Similarly, the right-invariant vector fields are covariantly constant with respect to
the right Cartan connection only. As above, the parallel transport is given by the
differential of the right translation:

Π^R(X) = DR_{p^{-1} q} X.
In the Riemannian setting, the parallel transport with the Levi-Civita connection can
be computed by solving a system of differential equations which locally depends on the
associated metric (the interested reader can refer to [16] for a more comprehensive description
of the parallel transport in Riemannian geometry). Let x^i be a local coordinate chart,
with ∂_i = ∂/∂x^i a local basis of the tangent space. The tangent vector to the curve δ is
δ̇ = Σ_i v^i ∂_i. It can be easily shown that a vector Y = Σ_i y^i ∂_i is parallel transported
along δ with respect to the affine connection ∇ iff

∇_δ̇ Y = Σ_k ( Σ_{i,j} Γ^k_{ij} v^i y^j + δ̇(y^k) ) ∂_k = 0,    (9.6)

with the Christoffel symbols of the connection being defined from the covariant
derivative ∇_{∂_i} ∂_j = Σ_k Γ^k_{ij} ∂_k.
Let us consider the local expression of the metric tensor g_{lk} = ⟨∂/∂x^l, ∂/∂x^k⟩. Thanks
to the compatibility condition, the Christoffel symbols of the Levi-Civita connection
can be locally expressed via the metric tensor, leading to

Γ^k_{ij} = ½ Σ_l ( ∂g_{jl}/∂x^i + ∂g_{li}/∂x^j − ∂g_{ij}/∂x^l ) g^{lk}.
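As an illustration, the Christoffel symbols can be computed symbolically from a given metric tensor by a direct implementation of this formula. A minimal sketch with SymPy; the Poincaré upper half-plane metric g = diag(1/y², 1/y²) is an arbitrary choice for the example:

import sympy as sp

x, y = sp.symbols('x y', positive=True)
coords = [x, y]
g = sp.Matrix([[1/y**2, 0], [0, 1/y**2]])  # metric tensor g_{ij}
g_inv = g.inv()                            # inverse metric g^{lk}
n = len(coords)

# Gamma[k][i][j] = 1/2 sum_l g^{lk} (d_i g_{jl} + d_j g_{li} - d_l g_{ij})
Gamma = [[[sp.simplify(sp.Rational(1, 2) * sum(
              g_inv[l, k] * (sp.diff(g[j, l], coords[i])
                             + sp.diff(g[l, i], coords[j])
                             - sp.diff(g[i, j], coords[l]))
              for l in range(n)))
           for j in range(n)] for i in range(n)] for k in range(n)]

print(Gamma[0][0][1], Gamma[1][0][0])  # -1/y and 1/y for this metric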
Fig. 9.1 (1) The transport of the vector A along the curve C is performed by Schild's ladder
by (2) the construction of geodesic parallelograms in a sufficiently small neighborhood. (3) The
construction is iterated for a sufficient number of neighborhoods
In Sect. 9.3 we showed that the parallel transport closely depends on the underlying
connection, and thus assumes very specific formulations depending on the
underlying geometry. In this section we introduce discrete methods for the compu-
tation of the parallel transport which do not explicitly depend on the connection and
only make use of geodesics. Such techniques could be applied more generally when
working on arbitrary geodesic spaces.
Schild’s ladder is a general method for the parallel transport, introduced in the theory
of gravitation in [31] after Schild’s similar constructions [39]. The method infinites-
imally transports a vector along a given curve through the construction of geodesic
parallelograms (Fig. 9.1). The Schild’s ladder provides a straightforward method to
compute a second order approximation of the parallel transport of a vector along a
curve using geodesics only.
Let M be a manifold and C a curve parametrized by the parameter β, with ∂C/∂β|_{P_0} = u,
and let A ∈ T_{P_0} M be a tangent vector on the curve at the point P_0 = C(0). Let P_1 be a point
on the curve relatively close to P_0, i.e. separated from it by a sufficiently small parameter
value β.
The Schild’s ladder computes the parallel transport of A along the curve C as
follows:
1. Define a curve on the manifold parametrized by a parameter γ passing through
∂
the point P0 with tangent vector ∂γ |P0 = A. Chose a point P2 on the curve
separated by P0 by the value of the parameters γ. The values of the parameters γ
and β should be chosen in order to construct this step of the ladder within a single
coordinate neighborhood.
2. Let l be the geodesic connecting P2 = l(0) and P1 = l(φ), we choose the “middle
point” P3 = l(φ/2). Now, let us define the geodesic r connecting the starting point
P0 and P3 parametrized by ω such that P3 = r(ω). Extending the geodesic at the
parameter 2ω we reach the point P4 . We can now compute the geodesic curve
connecting P1 and P4 . The vector A◦ tangent to the curve at the point P1 is the
parallel translation of A along C.
3. If the distance between the points P0 and P1 is large, the above construction can
be iterated for a sufficient number of steps.
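A minimal sketch of one rung of this construction, under the assumption that closed-form exponential and log maps of the manifold are available; the unit sphere serves here as a toy example, and all the function names are ours:

import numpy as np

def sphere_exp(p, v):
    # Exponential map on the unit sphere: follow the great circle from p with velocity v.
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    # Log map on the unit sphere: initial velocity of the geodesic from p to q.
    w = q - np.dot(p, q) * p                               # projection on the tangent plane at p
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    n = np.linalg.norm(w)
    return theta * w / n if n > 1e-12 else np.zeros_like(p)

def schilds_ladder_step(exp, log, P0, P1, A):
    # One rung of Schild's ladder: transport A from T_{P0} to T_{P1}.
    P2 = exp(P0, A)                    # step 1: shoot a geodesic from P0 along A
    P3 = exp(P2, 0.5 * log(P2, P1))    # step 2: middle point of the geodesic P2 -> P1
    P4 = exp(P0, 2.0 * log(P0, P3))    # extend the diagonal P0 -> P3 to twice its length
    return log(P1, P4)                 # tangent vector of the geodesic P1 -> P4 at P1

# Usage: transport a small tangent vector along a short arc of the equator.
P0 = np.array([1.0, 0.0, 0.0])
P1 = sphere_exp(P0, np.array([0.0, 0.1, 0.0]))
A = np.array([0.0, 0.0, 0.05])
print(schilds_ladder_step(sphere_exp, sphere_log, P0, P1, A))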
The algorithmic interest of Schild's ladder is that it relies only on the com-
putation of geodesics. Although the geodesics of the manifold are not sufficient to
recover all the information about the properties of the space, such as the torsion of the
connection, it has been shown that Schild's ladder implements the parallel transport
with respect to the symmetric part of the connection of the space [23]. An intuitive
view of this point is that the construction of the above diagram is commutative and
can be symmetrized with respect to the points P_1 and P_2. If the original connection is
symmetric, then this procedure provides a correct linear approximation of the parallel
transport of vectors.
The accuracy of the ladder can be analyzed through Taylor expansions of geodesics
in a local coordinate system. Writing the geodesic equation ẍ^k(t) + Γ^k_{ij}(x(t)) ẋ^i(t) ẋ^j(t) = 0
at t = 0 and by integrating:
x^k(t) = x^k(0) + t v^k(0) − (t²/2) Γ^k_{ij}(x(0)) v^i(0) v^j(0) + O(t³).
By renormalizing the length of the vector v so that C(−1) = P_0, C(0) = M and
C(1) = Q_0 (and denoting Γ^k_{ij} = Γ^k_{ij}(M)), we obtain the relations:

P_0^k = M^k − v_M^k − ½ Γ^k_{ij} v_M^i v_M^j + O(‖v‖³),
Q_0^k = M^k + v_M^k − ½ Γ^k_{ij} v_M^i v_M^j + O(‖v‖³).
Similarly, we have along the second geodesic:

P_1^k = M^k − u_M^k − ½ Γ^k_{ij} u_M^i u_M^j + O(‖u‖³),
Q_1^k = M^k + u_M^k − ½ Γ^k_{ij} u_M^i u_M^j + O(‖u‖³).
Now, to compute the geodesics joining P_0 to P_1 and Q_0 to Q_1, we have to use a
Taylor expansion of the Christoffel symbols Γ^k_{ij} around the point M. In the following,
we indicate the coordinate with respect to which a quantity is differentiated by an index
after a comma: Γ^k_{ij,a} = ∂_a Γ^k_{ij}. For instance:

Γ^k_{ij}(P_0) = Γ^k_{ij} + Γ^k_{ij,a} (−v_M^a − ½ Γ^a_{bc} v_M^b v_M^c) + ½ Γ^k_{ij,ab} v_M^a v_M^b + O(‖v_M‖³).
However, the Christoffel symbols are multiplied by a term of order O(‖A‖²), so that
only the first term will be quadratic and all others will be of order 3 with respect to
A and v_M. Thus, the geodesics joining P_0 to P_1 and Q_0 to Q_1 have equations:
P_1^k = P_0^k + A^k − ½ Γ^k_{ij} A^i A^j + O((‖A‖ + ‖v_M‖)³),
Q_1^k = Q_0^k + B^k − ½ Γ^k_{ij} B^i B^j + O((‖B‖ + ‖v_M‖)³).

Identifying the two expressions of P_1 (through M with tangent vector −u_M and through
P_0 with tangent vector A) gives:

u_M^k + ½ Γ^k_{ij} u_M^i u_M^j = v_M^k − A^k + ½ Γ^k_{ij} (v_M^i v_M^j + A^i A^j) + O((‖B‖ + ‖v_M‖)³).
Solving for u as a second-order polynomial in v_M and A gives

u^k = v_M^k − A^k + ½ (Γ^k_{ij} + Γ^k_{ji}) A^i v_M^j + O((‖A‖ + ‖v_M‖)³).

Identifying similarly the two expressions of Q_1 and substituting this value of u, we obtain

B^k − ½ Γ^k_{ij} B^i B^j = −A^k + (Γ^k_{ij} + Γ^k_{ji}) A^i v_M^j − ½ Γ^k_{ij} A^i A^j + O((‖A‖ + ‖v_M‖)³).    (9.7)
To verify that this is the correct formula for the parallel transport of A, let us
observe that the field A(x) is parallel in the direction of v if ∇_v A = 0, i.e. if
∂_v A^k + Γ^k_{ij} A^i v^j = 0, which means that A^k(x + ξv) = A^k − ξ Γ^k_{ij} A^i v^j + O(ξ²). If the
connection is symmetric, i.e. if Γ^k_{ij} = Γ^k_{ji}, Eq. (9.7) shows that the pole ladder leads
to B^k ≈ −A^k + 2 Γ^k_{ij} A^i v^j. Thus the pole ladder realizes the parallel transport for
a length ξ = 2 (remember that our initial geodesic was defined from −1 to 1).
We have thus demonstrated that the vector −B of Fig. 9.2 is a second-order approx-
imation of the transport of A. In order to optimize the number of time steps we should
evaluate the error in Eq. (9.7) at higher orders in ‖A‖ and ‖v_M‖. The computation is
not straightforward and involves a large number of terms, thus preventing us from
synthesizing a useful result. However, we believe that the dependency on ‖A‖
is more important than the one on ‖v_M‖, and that we could obtain larger time steps
provided that ‖A‖ is sufficiently small.
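Under the same assumptions as in the Schild's ladder sketch above (closed-form exp and log maps, e.g. the sphere maps defined there), one rung of the pole ladder analyzed in this section can be sketched as follows; the transported vector is −B, read at Q0:

def pole_ladder_step(exp, log, P0, Q0, A):
    # One rung of the pole ladder: transport A from T_{P0} to T_{Q0},
    # using the main geodesic P0 -> Q0 itself as diagonal of the parallelogram.
    M = exp(P0, 0.5 * log(P0, Q0))   # middle point of the main geodesic
    P1 = exp(P0, A)                  # endpoint of the geodesic defined by A
    Q1 = exp(M, -log(M, P1))         # extend the geodesic P1 -> M past M by the same length
    return -log(Q0, Q1)              # second-order approximation of the transport of A

Given the main geodesic, each rung requires only one new diagonal geodesic, which is why the pole ladder needs roughly half the geodesics of Schild's ladder (cf. Fig. 9.3).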
We now describe a practical context in which the previous theoretical insights find
useful application. For this purpose, we describe here the link between the theory
described in Sect. 9.2 and the context of computational anatomy, in particular through
the diffeomorphic non-linear registration of time series of images.
Modeling the temporal evolution of the tissues of the body is an important goal of
medical image analysis for understanding the structural changes of organs affected
by a pathology, or for studying the physiological growth during the life span. For
such purposes we need to analyze and compare the observed anatomical differ-
ences between follow-up sequences of anatomical images of different subjects. Non-
rigid registration is one of the main instruments for modeling anatomical differences
from images. The aim of non-linear registration is to encode the observed structural
changes as deformation fields densely represented in the image space, which repre-
sent the warping required to match the observed differences. This way, the anatomical
changes can be modeled and quantified by analyzing the associated deformations.
We can identify two distinct settings for the application of non-linear registration:
longitudinal and cross-sectional. In the former, non-linear registration estimates the
deformation field which explains the longitudinal anatomical (intra-subject) changes
that usually reflect biological phenomena of interest, like atrophy or growth. In the
latter, the deformation field accounts for the anatomical differences between different
subjects (inter-subject), in order to match homologous anatomical regions. These
two settings are profoundly different: the cross-sectional setting does not involve
any physical or mechanical deformations and we might wish to compare different
anatomies with different topologies. Moreover, inter-subject deformations are often
an order of magnitude larger than the ones characterizing the usually subtle variations
of the longitudinal setting.
In the case of group-wise analysis of longitudinal deformations, the longitudinal and
cross-sectional settings must be integrated in a consistent manner. In fact, the com-
parison of longitudinal deformations is usually performed after normalizing them in
a common reference frame through the inter-subject registration, and the choice of
the normalization method might have a deep impact on the subsequent analysis. In
order to accurately identify longitudinal deformations in a common reference frame,
a rigorous and reliable normalization procedure thus needs to be defined.
Normalization of longitudinal deformations can be done in different ways,
depending on the analyzed feature. For instance, the scalar Jacobian determinant
of longitudinal deformations represents the associated local volume change, and can
be compared by scalar resampling in a common reference frame via inter-subject
registration. This simple transport of scalar quantities is the basis of the classical
deformation/tensor based morphometry techniques [5, 38]. However, transporting
the Jacobian determinant is not sufficient to reconstruct a deformation in the Template
space.
If we consider vector-valued characteristics of deformations instead of scalar quan-
tities, the transport is no longer uniquely defined. For instance, a simple method
In Sect. 9.2.2, we derived the equivalence of one-parameter subgroups and the affine
geodesics of the canonical Cartan connections in a finite dimensional Lie group. In
order to use such a framework for diffeomorphisms, we have to generalize the theory
A second algorithm is at the heart of the efficiency of the optimization algorithms with
SVFs: the Baker-Campbell-Hausdorff (BCH) formula [9] tells us how to approximate
the log of the composition of two exponentials. Since

lim_{k→∞} ‖f_{n,ξ}‖_{H^k} = ∞,

this function is not well behaved from the regularity point of view, which is a critical
feature when dealing with image registration.
In practice, we have a spatial discretization of the SVF (and of the deformations)
on a grid, and a temporal discretization of the time-varying velocity fields by a fixed
number of time steps. This intrinsically limits the frequency of the deformation below
a kind of "Nyquist" threshold, which prevents these diffeomorphisms from being reached
anyway, both by the SVF and by the "discrete" LDDMM frameworks. Therefore,
it seems more important to understand the impact of using stationary velocity
fields in registration from the practical point of view than from the theoretical point
of view, because we will necessarily have to deal with the unavoidable numerical
implementation and related approximation issues.
The continuous and discrete methods for parallel transport provided in Sects. 9.3
and 9.4 can be applied in the diffeomorphic registration setting, once the
appropriate geometrical context is provided (Sect. 9.5). In this section we discuss and illustrate
practical implementations of the parallel transport in diffeomorphic registration, with
a special focus on the application of the ladder schemes exposed in Sect. 9.4.
Fig. 9.3 Geometrical schemes of Schild's ladder and of the pole ladder. By using the curve C
as diagonal, the pole ladder requires the computation of half of the geodesics (blue) required
by Schild's ladder (red)
compute differential operators on discrete image grids are definitely required to com-
pare them on a fair basis.
We assume that a well-posed Riemannian metric is given on the space of images.
This could be L², H^k, or the metric induced on the space of images by the action of
diffeomorphisms equipped with a well-chosen right-invariant metric (LDDMM).
Schild’s ladder can be naturally translated in the image context (Algorithm 1), by
requiring the computation of two diagonal geodesics.
The pole ladder is similar to the Schild’s one, with the difference of explicitly
using as a diagonal the geodesic C which connects I0 and T0 (Algorithm 2). This
is an interesting property since, given C, it requires the computation of only one
additional geodesic, thus the transport of time series of several images is based on
the same baseline-to-reference curve C (Fig. 9.3).
Despite their straightforward formulation, Algorithms 1 and 2 require multiple
evaluations of image geodesics, and consequently a high cost in terms of computation
time and resources if we compute them with registration. Moreover, since we look
for regular transformations of the space, the registration is usually constrained to be
smooth, and a perfect match of corresponding intensities in the registered images is
not possible. For instance, defining I_{1/2} using the forward deformation of I_1 or
the backward one from T_0 would lead to different results. Since in computational
anatomy we work with deformations, it seems more natural to perform the parallel
transport directly in the group of diffeomorphisms.
Given a pair of images I_i, i ∈ {0, 1}, the SVF framework parametrizes the diffeomor-
phism τ required to match the reference I_0 to the moving image I_1 by an SVF u. The
velocity field u is an element of the Lie algebra g of the Lie group of diffeomorphisms
G, i.e. an element of the tangent space at the identity T_id G. The diffeomorphism τ
belongs to the one-parameter subgroup exp(t·u) generated by the flow of u, i.e. τ = exp(u).
We can therefore define paths in the space of diffeomorphisms from the one-
parameter subgroup parametrization l(φ) = exp(φ · u).
Figure 9.4 illustrates how we can take advantage of the stationarity properties of
the one-parameter subgroups in order to define the following robust scheme:
1. Let I_1 = exp(u) ∗ I_0.
2. Compute v = argmin_v E(T_0 ∘ exp(−v/2), I_0 ∘ exp(v/2)), where E is a generic
registration energy functional to be minimized.
The half-way image I_{1/2} can be defined in terms of v/2 as exp(−v/2) ∗ T_0 or as
exp(v/2) ∗ I_0. While from the theoretical point of view the two images are identical,
the choice of one of them, or even of their mean, introduces a bias in the construction.
The definition of the half-way image can be bypassed by relying on the symmetric
construction of the parallelogram.
3. The transformation from I_1 to I_{1/2} is ω = exp(v/2) ∘ exp(−u), and the symmetry
leads to exp(Π(u)) = exp(v/2) ∘ ω^{-1} = exp(v/2) ∘ exp(u) ∘ exp(−v/2).
The transport of the deformation τ = exp(u) can therefore be obtained through
the conjugate action operated by the deformation parametrized by v/2.
Since the direct computation of the conjugation by composition is potentially
biased by the spatial discretization, we propose a numerical scheme to evaluate the
transport more robustly, directly in the Lie algebra.
The Baker-Campbell-Hausdorff (BCH) formula was introduced in SVF diffeomor-
phic registration in [9] and provides an explicit way to compose diffeomorphisms
parametrized by SVFs by operating in the associated Lie algebra. More specifi-
cally, if v, u are SVFs, then exp(v) ∘ exp(u) = exp(w) with

w = BCH(v, u) = v + u + ½[v, u] + (1/12)[v, [v, u]] − (1/12)[u, [v, u]] + …

In particular, for small u, the computation can be truncated at any order to obtain
a valid approximation for the composition of diffeomorphisms. Applying the
truncated BCH to the conjugate action
leads to

Π_BCH(u) ≈ u + [v/2, u] + ½ [v/2, [v/2, u]].    (9.9)
To establish this formula, let us consider the following second-order truncation of the
BCH formula:

BCH(v/2, u) ≈ v/2 + u + ½[v/2, u] + (1/12)[v/2, [v/2, u]] − (1/12)[u, [v/2, u]].

The composition Π_BCH(u) = BCH(v/2, BCH(u, −v/2)) is

Π(u) = v/2 + BCH(u, −v/2)                                (A)
    + ½ [v/2, BCH(u, −v/2)]                              (B)
    + (1/12) [v/2, [v/2, BCH(u, −v/2)]]                  (C)
    − (1/12) [BCH(u, −v/2), [v/2, BCH(u, −v/2)]].        (D)
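The identity behind Eq. (9.9) is purely algebraic (it is the second-order expansion of the adjoint action of exp(v/2)), so it can be checked numerically on matrices standing in for the SVFs. A minimal sketch, where the sizes and magnitudes are arbitrary choices:

import numpy as np
from scipy.linalg import expm, logm

def bracket(a, b):
    # Lie bracket, here the matrix commutator standing in for the bracket of SVFs.
    return a @ b - b @ a

rng = np.random.default_rng(0)
u = 0.05 * rng.standard_normal((3, 3))   # stand-in for the SVF u
v = 0.05 * rng.standard_normal((3, 3))   # stand-in for the SVF v

conj = expm(v / 2) @ expm(u) @ expm(-v / 2)                           # conjugate action
w = u + bracket(v / 2, u) + 0.5 * bracket(v / 2, bracket(v / 2, u))   # Eq. (9.9)

print(np.linalg.norm(logm(conj) - w))  # small: the residual is of higher order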
Once the formula for the computation of the ladder is defined, we need a consistent
scheme for the iterative construction along trajectories. We recall that the transport by
geodesic parallelograms holds only if both sides of the parallelogram are sufficiently
small, which in our case means that both the longitudinal and the inter-subject vectors
must be small. This is not the case in practice, since the inter-subject deformation is
usually very large. By definition, the ladder requires scaling down the vectors to a
We provide here an application of the pole ladder for the estimation of a group-wise
model of the longitudinal changes in a group of patients affected by Alzheimer’s
disease (AD). AD is a neurodegenerative pathology of the brain, characterized by
the co-occurrence of different phenomena, starting from the deposition of amy-
loid plaques and neurofibrillary tangles, to the development of functional loss and
finally to cell death [20]. In particular, brain atrophy detectable from magnetic res-
onance imaging (MRI) is currently considered as a potential outcome measure for
monitoring the disease progression. Structural atrophy was shown to strongly
correlate with cognitive performance and neuropsychological scores, and to charac-
terize the progression from pre-clinical to pathological stages [20]. For this reason, the
development of reliable atlases of the pathological longitudinal evolution of the brain
is of paramount importance for improving the understanding of the pathology.
A preliminary approach to the group-wise analysis of longitudinal morphological
changes in AD consists in performing the longitudinal analysis after the subject-
to-template normalization [13, 43]. A key issue here is the different nature of the
changes occurring at the intra-subject level, which reflects the biological phenomena
of interest, and the changes across different subjects, which are usually large and
not related to any biological process. In fact, the inter-subject variability is an order
of magnitude larger than the more subtle longitudinal subject-specific variations. To
provide a more sensitive quantification of the longitudinal dynamics, the intra-subject
changes should be modeled independently from the subject-to-template normaliza-
tion, and only transported to the common reference frame for statistical analysis afterwards.
Thus, novel techniques such as the parallel transport of longitudinal deformations
might lead to better accuracy and precision for the modeling and quantification of
longitudinal pathological brain changes.
Images corresponding to the baseline I0 and the one-year follow-up I1 scans were
selected for 135 subjects affected by Alzheimer’s disease. For each subject i, the
pairs of scans were rigidly aligned. The baseline was linearly registered to a reference
template and the parameters of the transformation were applied to I_1^i. Finally, for each
subject, the longitudinal changes were measured by non-linear registration using the
LCC-Demons algorithm [25].
The resulting deformation fields τ_i = exp(v_i) were transported with the pole
ladder (BCH scheme) to the template reference along the subject-to-template defor-
mation. The group-wise longitudinal progression was modeled as the mean of the
transported SVFs v_i. The areas of significant longitudinal change were investi-
gated by a one-sample t-test on the group of log-Jacobian scalar maps corresponding
to the transported deformations.
Fig. 9.5 One-year structural changes for 135 Alzheimer's patients. a Mean of the longitudinal
SVFs transported to the template space with the pole ladder. We notice the lateral expansion of the
ventricles and the contraction in the temporal areas. b T-statistic for the corresponding log-Jacobian
values significantly different from 0 (p < 0.001, FDR corrected). c T-statistic for longitudinal
log-Jacobian scalar maps resampled from the subject to the template space. Blue: significant
expansion; red: significant contraction. The figure is reproduced from [27]
Fig. 9.6 Apparent relative volume changes encoded by the average longitudinal trajectory com-
puted with the pole ladder (Fig. 9.5). The trajectory describes a pattern of apparent volume gain in
the CSF areas, and of apparent volume loss in temporal areas and around the ventricles
9.7 Conclusions
This is a rather interesting characteristic that enables one to employ the ladder without
requiring the design of any additional tool beyond geodesics.
From the practical point of view, discrete methods can alleviate the numerical
problems arising from the discretization of continuous functionals on finite grids,
and thus provide feasible and numerically stable alternatives to continuous transport
approaches. The application shown in Sect. 9.6.4 is a promising example of the
potential of such approaches when applied to challenging problems such as the
estimation of longitudinal atlases in diffeomorphic registration.
As shown in Sect. 9.4, the construction of the ladder holds in sufficiently small
neighborhoods. From the practical point of view this is related to the choice of an
appropriate step size for the iterative scheme proposed in Sect. 9.6, and future studies
are required to investigate the impact of the step size from the numerical
point of view.
Finally, future studies aimed at directly comparing discrete and continuous
approaches might shed more light on the theoretical and numerical properties of
the different methods of transport.
References
1. Ardekani, S., Weiss, R.G., Lardo, A.C., George, R.T., Lima, J.A.C., Wu, K.C., Miller, M.I.,
Winslow, R.L., Younes, L.: Cardiac motion analysis in ischemic and non-ischemic cardiomy-
opathy using parallel transport. In: Proceedings of the Sixth IEEE International Conference on
Symposium on Biomedical Imaging: From Nano to Macro, ISBI’09, pp. 899–902. IEEE Press,
Piscataway (2009)
2. Arnold, V.I.: Mathematical Methods of Classical Mechanics, vol. 60. Springer, New York
(1989)
3. Arsigny, V., Commowick, O., Pennec, X., Ayache, N.: A log-Euclidean framework for statistics
on diffeomorphisms. In: Proceedings of Medical Image Computing and Computer-Assisted
Intervention - MICCAI, vol. 9, pp. 924–931. Springer, Heidelberg (2006)
4. Ashburner, J., Ridgway, G.R.: Symmetric diffeomorphic modeling of longitudinal structural
MRI. Front. Neurosci. 6, (2012)
5. Ashburner, J., Friston, K.J.: Voxel-based morphometry—the methods. NeuroImage 11, 805–21
(2000)
6. Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1), 95–113
(2007)
7. Avants, B., Anderson, C., Grossman, M., Gee, J.: Spatiotemporal normalization for longitudinal
analysis of gray matter atrophy in frontotemporal dementia. In Ayache, N., Ourselin, S., Maeder,
A. (eds.) Medical Image Computing and Computer-Assisted Intervention, MICCAI, pp. 303–
310. Springer, Heidelberg (2007)
8. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings
via geodesic flows of diffeomorphisms. Int. J. Comput. Vis. 61(2), 139–157 (2005)
9. Bossa, M., Hernandez, M., Olmos, S.: Contributions to 3d diffeomorphic atlas estimation:
application to brain images. In: Proceedings of Medical Image Computing and Computer-
Assisted Intervention- MICCAI, vol. 10, pp. 667–74 (2007)
10. Bossa, M.N., Zacur, E., Olmos, S.: On changing coordinate systems for longitudinal tensor-
based morphometry. In: Proceedings of Spatio Temporal Image Analysis Workshop (STIA),
(2010)
11. Cartan, E., Schouten, J.A.: On the geometry of the group-manifold of simple and semi-simple
groups. Proc. Akad. Wekensch (Amsterdam) 29, 803–815 (1926)
12. Charpiat, G.: Learning shape metrics based on deformations and transport. In: Second Work-
shop on Non-Rigid Shape Analysis and Deformable Image Alignment, Kyoto, Japan (2009)
13. Chetelat, G., Landeau, B., Eustache, F., Mezenge, F., Viader, F., de la Sayette, V., Desgranges,
B., Baron, J.-C.: Using voxel-based morphometry to map the structural changes associated
with rapid conversion to MCI. NeuroImage 27, 934–46 (2005)
14. Thompson, D'A.W.: On Growth and Form. University Press, Cambridge (1945)
15. Davis, B.C., Fletcher, P.T., Bullitt, E., Joshi, S.: Population shape regression from random design
data. In: ICCV vol.4, pp. 375–405 (2007)
16. do Carmo, M.P.: Riemannian Geometry. Mathematics. Birkhäuser, Boston, Basel, Berlin (1992)
17. Durrleman, S., Pennec, X., Trouvé, A., Gerig, G., Ayache, N.: Spatiotemporal atlas estimation
for developmental delay detection in longitudinal datasets. In: Medical Image Computing and
Computer-Assisted Intervention—MICCAI, vol. 12, pp. 297–304 (2009)
18. Helgason, S.: Differential Geometry, Lie groups, and Symmetric Spaces. Academic Press, New
York (1978)
19. Hernandez, M., Bossa, M., Olmos, S.: Registration of anatomical images using paths of diffeo-
morphisms parameterized with stationary vector field flows. Int. J. Comput. Vis. 85, 291–306
(2009)
20. Jack, C.R., Knopman, D.S., Jagust, W.J., Shaw, L.M., Aisen, P.S., Weiner, M.W., Petersen, R.C.,
Trojanowski, J.Q.: Hypothetical model of dynamic biomarkers of the Alzheimer's pathological
cascade. Lancet Neurol. 9, 119–28 (2010)
21. Joshi, S., Miller, M.I.: Landmark matching via large deformation diffeomorphisms. IEEE Trans.
Image Process. 9(8), 1357–70 (2000)
22. Khesin, B.A., Wendt, R.: The Geometry of Infinite Dimensional Lie groups, volume 51 of
Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in
Mathematics. Springer (2009)
23. Kheyfets, A., Miller, W., Newton, G.: Schild’s ladder parallel transport for an arbitrary con-
nection. Int. J. Theoret. Phys. 39(12), 41–56 (2000)
24. Kolev, B.: Groupes de Lie et mécanique. http://www.cmi.univ-mrs.fr/kolev/. Notes of a Master
course in 2006–2007 at Université de Provence (2007)
25. Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X.: LCC-Demons: a robust and accurate sym-
metric diffeomorphic registration algorithm. NeuroImage 81, 470–83 (2013)
26. Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X.: Mapping the effects of Aβ1−42 levels on the
longitudinal changes in healthy aging: hierarchical modeling based on stationary velocity fields.
In: Medical Image Computing and Computer-Assisted Intervention—MICCAI, pp. 663–670,
(2011)
27. Lorenzi, M., Pennec, X.: Efficient parallel transport of deformations in time series of images:
from Schild’s to pole ladder. J. Math. Imaging Vis. (2013) (Published online)
28. Lorenzi, M., Pennec, X.: Geodesics, parallel transport and one-parameter subgroups for dif-
feomorphic image registration. Int. J. Comput. Vis.—IJCV 105(2), 111–127 (2012)
29. Lorenzi, M., Ayache, N., Pennec, X.: Schild’s ladder for the parallel transport of deformations
in time series of images. Inf. Process. Med. Imaging—IPMI 22, 463–74 (2011)
30. Milnor, J.: Remarks on infinite-dimensional Lie groups. In: Relativity, Groups and Topology,
pp. 1009–1057. Elsevier Science Publishers, Les Houches (1984)
31. Misner, C.W., Thorne, K.S., Wheeler, J.A.: Gravitation. W.H. Freeman and Company, San
Francisco, California (1973)
32. Modat, M., Ridgway, G.R., Daga, P., Cardoso, M.J., Hawkes, D.J., Ashburner, J., Ourselin, S.:
Log-Euclidean free-form deformation. In: Proceedings of SPIE Medical Imaging 2011. SPIE,
(2011)
33. Pennec, X., Arsigny, V.: Exponential barycenters of the canonical cartan connection and invari-
ant means on Lie groups. In: Barbaresco, F., Mishra, A., Nielsen, F. (eds.) Matrix Information
Geometry. Springer, Heidelberg (2012)
34. Postnikov, M.M.: Geometry VI: Riemannian Geometry. Encyclopedia of mathematical science.
Springer, Berlin (2001)
35. Qiu, A., Younes, L., Miller, M., Csernansky, J.G.: Parallel transport in diffeomorphisms dis-
tinguish the time-dependent pattern of hippocampal surface deformation due to healthy aging
and dementia of the Alzheimer’s type. NeuroImage, 40(1):68–76 (2008)
36. Qiu, A., Albert, M., Younes, L., Miller, M.: Time sequence diffeomorphic metric mapping and
parallel transport track time-dependent shape changes. NeuroImage 45(1), S51–60 (2009)
37. Rao, A., Chandrashekara, R., Sanchez-Ortiz, G., Mohiaddin, R., Aljabar, P., Hajnal, J., Puri, B.,
Rueckert, D.: Spatial transformation of motion and deformation fields using nonrigid registration.
IEEE Trans. Med. Imaging 23(9), 1065–76 (2004)
38. Riddle, W.R., Li, R., Fitzpatrick, J.M., DonLevy, S.C., Dawant, B.M., Price, R.R.: Character-
izing changes in MR images with color-coded Jacobians. Magn. Reson. Imaging 22(6), 769–77
(2004)
39. Schild, A.: Tearing geometry to pieces: more on conformal geometry. Unpublished lecture at
the Princeton University relativity seminar, Jan 19, 1970 (1970)
40. Schmid, R.: Infinite dimensional Lie groups with applications to mathematical physics. J. Geom.
Symmetry Phys. 1, 1–67 (2004)
41. Schmid, R.: Infinite-dimensional Lie groups and algebras in mathematical physics. Adv. Math.
Phys. 2010, 1–36 (2010)
42. Subbarao, R.: Robust Statistics Over Riemannian Manifolds for Computer Vision. Graduate
School New Brunswick, Rutgers The State University of New Jersey, New Brunswick, (2008)
43. Thompson, P., Hayashi, K.M., de Zubicaray, G., Janke, A.L., Rose, S.E., Semple, J., Herman,
D., Hong, M.S., Dittmer, S.S., Doddrell, D.M., Toga, A.W.: Dynamics of gray matter loss in
Alzheimer's disease. J. Neurosci. 23(3), 994–1005 (2003)
44. Trouvé, A.: Diffeomorphisms groups and pattern matching in image analysis. Int. J. Comput.
Vis. 28(3), 213–21 (1998)
45. Twining, C., Marsland, S., Taylor, C.: Metrics, connections, and correspondence: the setting
for groupwise shape analysis. In: Proceedings of the 8th International Conference on Energy
Minimization Methods in Computer Vision and Pattern Recognition, EMMCVPR’11, pp. 399–
412. Springer, Berlin, Heidelberg (2011)
46. Vercauteren, T., Pennec, X., Perchant, A., Ayache, N.: Symmetric log-domain diffeomorphic
registration: a demons-based approach. In: Medical Image Computing and Computer-Assisted
Intervention—MICCAI. Lecture Notes in Computer Science, vol. 5241, pp. 754–761. Springer,
Heidelberg (2008)
47. Wei, D., Lin, D., Fisher, J.: Learning deformations with parallel transport. In: ECCV, pp. 287–
300 (2012)
48. Younes, L.: Shapes and diffeomorphisms. Number 171 in Applied Mathematical Sciences.
Springer, Berlin (2010)
49. Younes L.: Jacobi fields in groups of diffeomorphisms and applications. Q. Appl. Math. pp.
113–134 (2007)
Chapter 10
Diffeomorphic Iterative Centroid
Methods for Template Estimation
on Large Datasets
10.1 Introduction
A common point of all these methods is that they need a surface matching
algorithm, which is very expensive in terms of computation time in the LDDMM
framework. When no specific optimization is used, computing only one match-
ing between two surfaces, each composed of 3000 vertices, takes approximately
30–40 min. Then, computing a template composed of one hundred such surfaces
until convergence can take a few days or some weeks. This is a limitation for the
study of large databases. Different strategies can be used to reduce computation time.
GPU implementation can substantially speed up the computation of convolutions that
are heavily used in LDDMM deformations. Matching pursuit on current can also be
used to reduce the computation time [9]. Sparse representations of deformations
allow to reduce the number of optimized parameters of the deformations [7].
Here, we propose a new approach to reduce the computation time called diffeo-
morphic iterative centroid using currents. The method provides in N − 1 steps (with
N the number of shapes of the population) a centroid already correctly centered
within the population of shapes. It increases the convergence speed of the template
estimation by providing an initialization that is closer to the target.
Our method has some close connections with more general iterative methods for
computing means on Riemannian manifolds. For example, Arnaudon et al. [10] defined
a stochastic iterative method which converges to the Fréchet mean of a set of points.
Ando et al. [11] gave a recursive definition of the mean of positive definite matrices
which verifies important properties of geometric means. However, these methods
require a large number of iterations (much larger than the number of points of the
dataset), while in our case, due to the high computational cost of matchings, we aim
at limiting the number of iterations as much as possible.
The chapter is organized as follows. First, we present the mathematical framework
of LDDMM and currents (Sect. 10.2). Section 10.3.1 then introduces the template
estimation and the iterative centroid method. In Sect. 10.4.1, we evaluate the approach
on datasets of real and synthetic hippocampi extracted from brain magnetic resonance
images (MRI).
For the Diffeomorphic Iterative Centroid method, we use the LDDMM frame-
work (Sect. 10.2.1) to quantify the difference between shapes. To model surfaces
of the population, we use the framework of currents (Sect. 10.2.2) which does not
assume point-to-point correspondences.
has a unique solution, and one sets ϕ^v = φ^v(·, 1), the diffeomorphism induced by
v(x, t). The induced set of diffeomorphisms A_V is a subgroup of the group of C¹
diffeomorphisms. To enforce velocity fields to stay in this space, one must control
the energy

E(v) := ∫_0^1 ‖v(·, t)‖²_V dt.    (10.2)

A distance between diffeomorphisms is then defined by

D(ϕ, ψ) = D(Id, ϕ^{-1} ∘ ψ),  with
D(Id, ϕ) = inf { ∫_0^1 ‖v(·, t)‖_V dt ; v ∈ L²([0, 1], V), ϕ^v = ϕ }.    (10.3)
Discrete matching functionals. Considering two surfaces S and T, the optimal match-
ing between them is defined, in an ideal setting, as the map ϕ^v minimizing E(v) under
the constraint ϕ^v(S) = T. In practice such an exact matching is often not feasible, and
one writes inexact, unconstrained matching functionals which minimize both E(v)
and a matching criterion which evaluates the spatial proximity between ϕ^v(S) and
T, as we will see in the next section.
In a discrete setting, when the matching criterion depends on ϕ^v only via the
images ϕ^v(x_i) of a finite number of points x_i (such as the vertices of the mesh S), one
can show that the vector fields v(x, t) which induce the optimal deformation map
can be written via a convolution formula over the surface involving the reproducing
kernel K_V of the R.K.H.S. V. This is due to the reproducing property of V; indeed
V is the closed span of vector fields of the form K_V(x, ·)α, and therefore v(x, t)
writes

v(x, t) = Σ_{i=1}^n K_V(x, x_i(t)) α_i(t),    (10.5)
where x_i(t) = φ^v(x_i, t) are the trajectories of the points x_i, and α_i(t) ∈ R³ are time-
dependent vectors called momentum vectors, which completely parametrize the
deformation. The trajectories x_i(t) depend only on these vectors, as solutions of the
following system of ordinary differential equations:

dx_j(t)/dt = Σ_{i=1}^n K_V(x_j(t), x_i(t)) α_i(t),    (10.6)

for 1 ≤ j ≤ n. This is obtained by plugging formula (10.5) for the optimal velocity
fields into the flow equation (10.1), taken at x = x_j. Moreover, the energy E(v) takes
an explicit form when expressed in terms of the trajectories and momentum vectors:

E(v) = ∫_0^1 Σ_{i,j=1}^n α_i(t)^T K_V(x_i(t), x_j(t)) α_j(t) dt.    (10.7)
Note that the first equation of this system is nothing more than Eq. (10.6), which allows
computing the trajectories x_i(t) from any time-dependent momentum vectors α_i(t), while
the second equation gives the evolution of the momentum vectors themselves. This new
set of ODEs can be solved from any initial conditions (x_i(0), α_i(0)), which means
that the initial momenta α_i(0) fully determine the subsequent time evolution of
the system (since the x_i(0) are fixed points). As a consequence, these initial momen-
tum vectors encode all the information of the optimal diffeomorphism. This is a very
important point for applications, specifically for group studies, since it allows
analysing the set of deformation maps from a given template to the observed shapes by
performing statistics on the initial momentum vectors located on the template shape.
We can also use geodesic shooting from initial conditions (x_i(0), α_i(0)) in order to
generate any arbitrary deformation of a shape in the shape space. We will use this
tool for the construction of our synthetic dataset Data1 (see Sect. 10.4.1).
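A minimal sketch of such geodesic shooting for landmarks, assuming a Gaussian kernel K_V(x, y) = exp(−‖x − y‖²/σ²)·Id and the standard Hamiltonian form of the momentum equation (which is not written out above); a plain forward Euler integrator is used for simplicity:

import numpy as np

def shoot(x0, a0, sigma=1.0, n_steps=100):
    # x0: (n, d) initial points, a0: (n, d) initial momentum vectors.
    x, a = x0.astype(float).copy(), a0.astype(float).copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        diff = x[:, None, :] - x[None, :, :]             # pairwise x_j - x_i
        k = np.exp(-(diff ** 2).sum(-1) / sigma ** 2)    # kernel matrix K(x_j, x_i)
        dx = k @ a                                       # dx_j/dt = sum_i K(x_j, x_i) a_i
        aa = a @ a.T                                     # pairwise <a_j, a_i>
        # da_j/dt = (2/sigma^2) sum_i <a_j, a_i> K(x_j, x_i) (x_j - x_i)
        da = (2.0 / sigma ** 2) * ((aa * k)[:, :, None] * diff).sum(axis=1)
        x, a = x + dt * dx, a + dt * da
    return x, a

# Usage: shoot two 2D landmarks along their initial momenta.
x0 = np.array([[0.0, 0.0], [1.0, 0.0]])
a0 = np.array([[0.0, 0.5], [0.0, -0.5]])
print(shoot(x0, a0)[0])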
10.2.2 Currents
The idea of the mathematical object named “currents” is related to the theory of
distributions as presented by Schwartz in 1952 [12], in which distributions are char-
acterized by their action on any smooth function with compact support. In 1955, De
Rham [13] generalized distributions to differential forms to represent submanifolds,
and called this representation currents. This mathematical object serves to model
geometrical objects using a non parametric representation.
The use of currents in computational anatomy was introduced by J. Glaunès and
M. Vaillant in 2005 [14, 15] and subsequently developed by Durrleman [16] in order
to provide a dissimilarity measure between meshes which does not assume point-
to-point correspondence between anatomical structures. The approach proposed by
Vaillant and Glaunès is to represent meshes as objects in a linear space and supply
it with a computable norm. Using currents to represent surfaces has some benefits.
First it avoids the point correspondence issue: one does not need to define pairs
of corresponding points between two surfaces to evaluate their spatial proximity.
Moreover, metrics on currents are robust to different samplings and topologies and
take into account not only the global shapes but also their local orientations. Another
important benefit is that the space of currents is a vector space, which allows one to
consider linear combinations such as means of shapes in the space of currents. This
property will be used in the centroid and template methods that we introduce in the
following.
We limit the framework to surfaces embedded in R³. Let S be an oriented compact
surface, possibly with boundary. Any smooth and compactly supported differential
2-form ω of R³ (i.e. a mapping x ↦ ω(x) such that for any x ∈ R³, ω(x) is a
2-form, an alternated bilinear mapping from R³ × R³ to R) can be integrated over S:

∫_S ω = ∫_S ω(x)(u_1(x), u_2(x)) dσ(x),    (10.9)
where (u_1(x), u_2(x)) is an orthonormal basis of the tangent plane at point x, and dσ is
the Lebesgue measure on the surface S. Hence one can define a linear form [S] over
the space of 2-forms via the rule [S](ω) := ∫_S ω. If one defines a Hilbert metric on
the space of 2-forms such that the corresponding space is continuously embedded
in the space of continuous bounded 2-forms, this mapping will be continuous [14],
which makes [S] an element of the space of 2-currents, the dual space to the space
of 2-forms.
Note that since we are working with 2-forms on R³, we can use a vectorial
representation via the cross product: for every 2-form ω and x ∈ R³ there exists a
vector ω̄(x) ∈ R³ such that for every α, β ∈ R³,

ω(x)(α, β) = ⟨ω̄(x), α × β⟩.    (10.10)
Therefore we can work with vector fields ω̄ instead of 2-forms ω. In the following,
with a slight abuse of notation, we will use ω(x) to represent both the bilinear
alternated form and its vectorial representative. Hence the current of a surface S can
be re-written from Eq. (10.9) as follows:

[S](ω) = ∫_S ⟨ω(x), n(x)⟩ dσ(x),    (10.11)
with n(x) the unit normal vector to the surface: n(x) := u_1(x) × u_2(x).
We define a Hilbert metric ⟨· , ·⟩_W on the space of vector fields of R³, and require
the space W to be continuously embedded in C_0^1(R³, R³). The space of currents we
consider is the space of continuous linear forms on W, i.e. the dual space W*, and
the required embedding property ensures that for a large class of oriented surfaces
S in R³, comprising smooth surfaces and also triangulated meshes, the associated
linear mapping [S] is indeed a current, i.e. it belongs to W*.
The central object from the computational point of view is the reproducing kernel
of the space W, which we introduce here. For any point x ∈ R³ and vector α ∈ R³ one
can consider the Dirac functional δ_x^α : ω ↦ ⟨ω(x), α⟩, which is an element of W*.
The Riesz representation theorem then states that there exists a unique u ∈ W such
that for all ω ∈ W, ⟨u, ω⟩_W = δ_x^α(ω) = ⟨ω(x), α⟩. u is thus a vector field which
depends on x and linearly on α, and we write it u = K_W(·, x)α. Thus we have the
rule

⟨K_W(·, x)α, ω⟩_W = ⟨ω(x), α⟩.    (10.12)

Moreover, applying this formula to ω = K_W(·, y)β for any other point y ∈ R³
and vector β ∈ R³, we get

⟨K_W(·, x)α, K_W(·, y)β⟩_W = α^T K_W(x, y)β.    (10.13)
Thus, using Eq. (10.13), one can prove that for two surfaces S and T,

⟨[S], [T]⟩_{W*} = ∫_S ∫_T ⟨n_S(x), K_W(x, y) n_T(y)⟩ dσ(x) dσ(y).    (10.15)
This formula defines the metric we use for evaluating spatial proximity between
shapes. It is clear that the type of kernel one uses fully determines the metric and
therefore has a direct impact on the behaviour of the algorithms. We use scalar
invariant kernels of the form K_W(x, y) = h(‖x − y‖²/σ_W²) I_3, where h is a real
function such as h(r) = e^{−r} (Gaussian kernel) or h(r) = 1/(1 + r) (Cauchy kernel),
and σ_W a scale factor. In practice this scale parameter has a strong influence on the
results; we will come back to this point later.
We can now define the optimal match between two currents [S] and [T], which
is the diffeomorphism minimizing the functional

J_{S,T}(v) = γ E(v) + ‖[ϕ^v(S)] − [T]‖²_{W*}.    (10.16)

This functional is non-convex, and in practice we use a gradient descent algorithm
to perform the optimization, which cannot be guaranteed to reach a global minimum.
We observed empirically that local minima can be avoided by using a multi-scale
approach in which several optimization steps are performed with decreasing values
of the width σ_W of the kernel K_W (each step provides an initial guess for the next
one).
In practice, surfaces are given as triangulated meshes, which we discretize in
the space of currents W* by combinations of Dirac functionals: [S] ≈ Σ_{f∈S} δ_{c_f}^{n_f},
where the sum is taken over all triangles f = (f_1, f_2, f_3) of the mesh S, and
c_f = ⅓(f_1 + f_2 + f_3), n_f = ½(f_2 − f_1) × (f_3 − f_1) denote respectively
the center and normal vector of the triangle. Given a deformation map ϕ and a
triangulated surface S, we also approximate its image ϕ(S) by the triangulated mesh
obtained by letting ϕ act only on the vertices of S. This leads us to the following
discrete formulation of the matching problem:
discrete formulation of the matching problem:
1
n
d
JS,T (α) = γ αi (t)T K V (xi (t), x j (t))α j (t) dt
0 i=1
+ T
n ϕ( f ) K W (cϕ( f ) , cϕ( f ⊆ ) )n ϕ( f ⊆ )
f, f ⊆ ∗S
10 Diffeomorphic Iterative Centroid Methods 281
+ n g K W (cg , cg⊆ )n g⊆ − 2 T
n ϕ( f ) K W (cϕ( f ) , cg )n g (10.17)
g,g ⊆ ∗T f ∗S,g∗T
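The three data-attachment sums of this functional are cheap to evaluate directly from the Dirac approximation of the meshes. A minimal sketch (the Gaussian choice of K_W, the value of σ_W and the toy single-triangle meshes are illustrative assumptions):

import numpy as np

def centers_normals(verts, faces):
    # Centers and (area-weighted) normal vectors of a triangulated mesh.
    f1, f2, f3 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    return (f1 + f2 + f3) / 3.0, 0.5 * np.cross(f2 - f1, f3 - f1)

def currents_inner(cS, nS, cT, nT, sigma_w=1.0):
    # <[S], [T]>_{W*} for the Dirac approximation, with a Gaussian kernel K_W.
    d2 = ((cS[:, None, :] - cT[None, :, :]) ** 2).sum(-1)
    return np.einsum('fg,fd,gd->', np.exp(-d2 / sigma_w ** 2), nS, nT)

def currents_dist2(meshS, meshT, sigma_w=1.0):
    # Squared currents norm ||[S] - [T]||^2_{W*} used as the data term.
    cS, nS = centers_normals(*meshS)
    cT, nT = centers_normals(*meshT)
    return (currents_inner(cS, nS, cS, nS, sigma_w)
            + currents_inner(cT, nT, cT, nT, sigma_w)
            - 2.0 * currents_inner(cS, nS, cT, nT, sigma_w))

# Usage: two single-triangle "meshes", the second one shifted along z.
S = (np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]]), np.array([[0, 1, 2]]))
T = (np.array([[0.0, 0, 0.1], [1, 0, 0.1], [0, 1, 0.1]]), np.array([[0, 1, 2]]))
print(currents_dist2(S, T))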
The template estimation method jointly optimizes the template T, as a current, and
the deformations of the N surfaces S_i of the population towards it:

(v̂_i, T̂) = argmin_{v_i, T} Σ_{i=1}^N ( ‖T − ϕ_{v_i}(S_i)‖²_{W*} + γ E(v_i) ),    (10.18)
where the minimization is performed over the spaces L²([0, 1], V) for the velocity
fields v_i and over the space of currents W* for T. The method uses an alternated
optimization, i.e. surfaces are successively matched to the template, then the template
is updated, and this sequence is iterated until convergence. One can observe that when
the ϕ_i are fixed, the functional is minimized when T is the average of the [ϕ_{v_i}(S_i)] in
the space W*:

T = (1/N) Σ_{i=1}^N [ϕ_{v_i}(S_i)],    (10.19)

which makes the optimization with respect to T straightforward. This optimal current
is not a surface itself; in practice it is constituted by the union of all the surfaces ϕ_{v_i}(S_i),
where the 1/N factor acts as if all normal vectors to these surfaces were weighted by 1/N.
At the end of the optimization process, however, all surfaces being co-registered, the
ϕ̂_{v_i}(S_i) are close to each other, which makes the optimal template T̂ close to being
a true surface.
In practice, we stop the template estimation method after P loops; with the
datasets we use, P = 7 seems to be sufficient to obtain an adequate template.
As detailed in Sect. 10.2, obtaining a template allows performing statistical analysis
of the deformation maps via the initial momentum representation to characterize the
population. One can run analyses on momentum vectors, such as Principal Component
Analysis (PCA), or estimate an approximation of the pairwise diffeomorphic distances
between subjects using the estimated template [17], in order to use manifold learning
methods like Isomap [18].
In the present case, the optimal template for the population is not a true surface
but is defined, in the space of currents, by the mean T̂ = (1/N) Σ_{j=1}^N [ϕ̂_{v_j}(S_j)]. However
this makes no difference from the point of view of statistical analysis, because this
template can be used in the LDDMM framework exactly as if it were a true surface.
One may speed up the estimation process and avoid local minima issues by defin-
ing a good initialization of the optimization process. The standard initialization consists
in setting T = (1/N) Σ_{i=1}^N [S_i], which means that the initial template is defined as the
combination of all the unregistered shapes of the population. Alternatively, if one is given
a good initial guess T, the convergence speed of the method can be improved. This
is the primary motivation for the introduction of the iterative centroid method, which
we present in the next section.
b_N = (1/N) Σ_{i=1}^N x_i.    (10.20)
Many mathematical studies (as for example Kendall [19], Karcher [20], Le [21],
Afsari [22, 23]) have focused on proving the existence and uniqueness of the mean,
as well as on proposing algorithms to compute it. The more general notion of p-mean
of a probability measure μ on a Riemannian manifold M is defined by:

b = argmin_{x∈M} F_p(x),  F_p(x) = ∫_M d_M(x, y)^p μ(dy).    (10.23)
Other definitions of centroids in the Riemannian setting can be proposed. The fol-
lowing ideas are more directly connected to our method. Going back to the Euclidean
case, one can observe that b_N satisfies the following iterative relation:

b_1 = x_1,
b_{k+1} = (k/(k+1)) b_k + (1/(k+1)) x_{k+1},  1 ≤ k ≤ N − 1,    (10.24)

which has the side benefit that at each step b_k is the centroid of the x_i, 1 ≤ i ≤ k. This
iterative process has an analogue in the Riemannian case, because one can interpret
the convex combination (k/(k+1)) b_k + (1/(k+1)) x_{k+1} as the point located along the
geodesic linking b_k to x_{k+1}, at a distance equal to 1/(k+1) of the total length of the
geodesic, which we write geod(b_k, x_{k+1}, 1/(k+1)). This leads to the following definition
in the Riemannian setting: b_1 = x_1 and b_{k+1} = geod(b_k, x_{k+1}, 1/(k+1)).
Of course this new definition of centroid does not coincide with the Fréchet mean
when the metric is not Euclidean, and furthermore it has the drawback of depending on
the ordering of the points x_i. Moreover, one may consider other iterative procedures,
such as computing midpoints between arbitrary pairs of points x_i, and then midpoints of
the midpoints, etc. In other words, all procedures that are based on decomposing the
Euclidean equality b_N = (1/N) Σ_{i=1}^N x_i as a sequence of pairwise convex combinations
lead to possible alternative definitions of centroid in a Riemannian setting. Based on
these remarks, Emery and Mokobodzki [24] proposed to define the centroid not as a
unique point but as the set B_N of points x ∈ M satisfying

f(x) ≤ (1/N) Σ_{i=1}^N f(x_i)    (10.26)

for every convex function f on M.
Fig. 10.1 Illustration of the method. Left: red stars are the subjects of the population, the yellow
star is the final centroid, and orange stars are iterations of the centroid. Right: final centroid
with the hippocampus population from Data1 (red). See Sect. 10.4.1 for more details about the datasets
Fig. 10.2 Diagrams of the iterative processes which lead to the centroid computation. The tops
of the diagrams represent the final centroid. The diagram on the left corresponds to the iterative
centroid algorithms (IC1 and IC2). The diagram on the right corresponds to the pairwise algorithm
(PW)
Data: N surfaces S_i
Result: 1 surface B_N representing the centroid of the population
B_1 = S_1;
for i from 1 to N − 1 do
    B_i is matched using Eq. (10.16) to S_{i+1}, which results in a deformation map φ_{v_i}(x, t);
    Set B_{i+1} = φ_{v_i}(B_i, 1/(i+1)), which means we transport B_i along the geodesic and stop at
    time t = 1/(i+1);
end
Algorithm 1: Iterative Centroid 1 (IC1)
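Stripped of the matching machinery, IC1 is the recursion of Eq. (10.24) transposed to geodesics. A minimal sketch with abstract exp/log maps; in the Euclidean case, where geodesics are straight lines, the recursion must reproduce the arithmetic mean, which gives a simple sanity check:

import numpy as np

def iterative_centroid(shapes, exp, log):
    # IC1 with the LDDMM matching replaced by exact exp/log maps: at step i
    # the estimate moves a fraction 1/(i+1) along the geodesic to the new shape.
    b = shapes[0]
    for i, s in enumerate(shapes[1:], start=1):
        b = exp(b, log(b, s) / (i + 1))
    return b

euc_exp = lambda p, v: p + v   # Euclidean exponential map
euc_log = lambda p, q: q - p   # Euclidean log map
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 3.0])]
print(iterative_centroid(pts, euc_exp, euc_log))  # [1/3, 1]
print(np.mean(pts, axis=0))                       # the same point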
Iterative centroid with averaging in the space of currents: IC2. Because matchings
are inaccurate, the centroid computed with the method presented above accumulates
small errors which can have an impact on the final centroid. Furthermore, the centroid
computed with Algorithm 1 is in fact a deformation of the first shape S_1, which makes
the procedure even more dependent on the ordering of the subjects than it would be in
an ideal exact matching setting. In this second algorithm, we modify the updating
step by computing a mean in the space of currents between the deformation of the
current centroid and the backward flow of the current shape being matched. Hence the
computed centroid is not a true surface but a current, i.e. a combination of surfaces, as
in the template estimation method. The weights chosen in the averaging reflect the
relative importance of the new shape, so that at the end of the procedure all shapes
forming the centroid have equal weight 1/N. The algorithm proceeds as presented in
Algorithm 2.
Data: N surfaces S_i
Result: 1 current B_N representing the centroid of the population
B_1 = [S_1];
for i from 1 to N − 1 do
    B_i is matched using Eq. (10.16) to S_{i+1}, which results in a deformation map φ_{v_i}(x, t);
    Set B_{i+1} = (i/(i+1)) φ_{v_i}(B_i, 1/(i+1)) + (1/(i+1)) [φ_{u_i}(S_{i+1}, i/(i+1))], which means
    we transport B_i along the geodesic and stop at time t = 1/(i+1);
    where u_i(x, t) = −v_i(x, 1 − t), i.e. φ_{u_i} is the reverse flow map.
end
Algorithm 2: Iterative Centroid 2 (IC2)
Data: N surfaces S_i
Result: 1 surface B representing the centroid of the population
if N ≥ 2 then
    B_left = Pairwise Centroid (S_1, ..., S_{[N/2]});
    B_right = Pairwise Centroid (S_{[N/2]+1}, ..., S_N);
    B_left is matched to B_right, which results in a deformation map φ_v(x, t);
    Set B = φ_v(B_left, (N − [N/2])/N), which means we transport B_left along the geodesic and
    stop at time t = (N − [N/2])/N;
end
else
    B = S_1
end
Algorithm 3: Pairwise Centroid
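With the same abstract exp/log stand-ins as in the IC1 sketch above, the pairwise algorithm becomes a short recursion; the fraction (N − [N/2])/N weights the two halves by their sizes:

def pairwise_centroid(shapes, exp, log):
    # Recursive pairwise centroid (PW): split, average each half, then combine.
    n = len(shapes)
    if n == 1:
        return shapes[0]
    left = pairwise_centroid(shapes[:n // 2], exp, log)
    right = pairwise_centroid(shapes[n // 2:], exp, log)
    # Move from the left centroid towards the right one by the weight of the right half.
    return exp(left, ((n - n // 2) / n) * log(left, right))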
10.3.3 Implementation
The methods presented above require several parameters. In each algorithm,
we have to compute the matching from one surface to another. Each matching
minimizes the corresponding functional (see Eq. 10.17 at the end of Sect. 10.2.2),
which estimates the new momentum vectors α that are then used to update the
positions of the points $x_i$ of the surface. A gradient descent with adaptive step size is used
for the minimization of the functional $J$. Evaluating the functional and its gradient
requires numerical integration of high-dimensional ordinary differential equations,
which is done using the Euler trapezoidal rule.
The main parameters for computing $J$ are maxiter, the maximum number
of iterations for the adaptive step size gradient descent algorithm; γ, for the regularity
of the matching; and $\sigma_W$ and $\sigma_V$, the sizes of the kernels which control the metrics of
the spaces $W$ and $V$.
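As an illustration, a simple sketch of such an adaptive step-size gradient descent, under the assumption that J and grad_J evaluate the functional of Eq. (10.17) and its gradient (hypothetical callables, not the authors' implementation):

import numpy as np

def adaptive_gradient_descent(alpha0, J, grad_J, maxiter=300):
    """Grow the step after a successful move, shrink it after a failed
    one; stop after maxiter iterations or when the step vanishes."""
    alpha, J_old, step = np.asarray(alpha0, dtype=float), J(alpha0), 1.0
    for _ in range(maxiter):
        candidate = alpha - step * grad_J(alpha)
        J_new = J(candidate)
        if J_new < J_old:
            alpha, J_old = candidate, J_new
            step *= 1.5   # accept and increase the step
        else:
            step *= 0.5   # reject and decrease the step
            if step < 1e-12:
                break
    return alpha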
We selected parameters in order to obtain relatively good matchings in a short time.
We chose γ close enough to zero to enforce the matching to bring the first object
onto the second one. Nevertheless, we must be prudent: choosing γ too small is risky
because the regularity of the deformation may not be preserved. For
each pairwise matching, we use the multi-scale approach described in Sect. 10.2.2,
performing four consecutive optimization processes with values of the $\sigma_W$ parameter
(the kernel size of the R.K.H.S. $W$) decreased by a constant factor, to
increase the precision of the matching. At the beginning, we fix $\sigma_W$
to a sufficiently large value in order to capture the possible large variations or
differences between shapes. For this reason, we use a small maxiter parameter for the
first two minimizations of the functional. For the results presented
below, we used very small values of the maxiter parameter, [50, 50, 100, 300], to
speed up the method. Results can be less accurate than in our previous
study [25], which used different values of maxiter, [40, 40, 100, 1000], taking
twice as much time to compute. For the kernel size $\sigma_V$ of the deformation space, we
fix this parameter at the beginning and have to adapt it to the size of the data.
Fig. 10.3 On the left, an iterative centroid of the dataset data2 (see Sect. 10.4.1 for more details
about datasets) computed using the IC1 algorithm, and on the right the IC2 algorithm
The first method starts from N surfaces and gives a centroid composed of only
one surface, which is a deformation of the surface used at the initialization step. An
example is shown in Fig. 10.3. This method is rather fast, because at each step we
only have to match one mesh composed of $n_1$ vertices to another, where $n_1$ is the
number of vertices of the first mesh of the iterative procedure.
The second method starts from N surfaces and gives a centroid composed of
deformations of all surfaces of the population. At each step it forms a combination
in the space of currents between the current centroid and a backward flow of the new
surface being matched. In practice this implies that the centroid grows in complexity;
at step $i$ its number of vertices is $\sum_{j=1}^{i} n_j$. Hence this algorithm is slower than
the first one, but the mesh structure of the final centroid does not depend on the mesh
of only one subject of the population, and the combination compensates for the bias
introduced by the inaccuracy of matchings.
The results of the Iterative Centroid algorithms depend on the ordering of subjects.
We will study this dependence in the experimental part, and also study the effect of
stopping the IC before it completes all iterations.
10.4.1 Data
To evaluate our approach, we used data from 95 young (14–16 years old) subjects
from the European database IMAGEN. The anatomical structure that we considered
was the hippocampus, which is a small bilateral structure of the temporal lobe of the
brain involved in memory processes. The hippocampus is one of the first structures
to be damaged in Alzheimer’s disease; it is also implicated in temporal lobe epilepsy,
and is altered in stress and depression. Ninety five left hippocampi were segmented
from T1-weighted Magnetic Resonance Images (MRI) of this database (see Fig. 10.4)
10 Diffeomorphic Iterative Centroid Methods 289
Fig. 10.4 Left panel coronal view of the MRI with the meshes of hippocampi segmented by the
SACHA software [26], the right hippocampus is in green and the left one in pink. Right panel 3D
view of the hippocampi
Fig. 10.5 Top to bottom meshes from Data1 (n = 500), Data2 (n = 95) and RealData (n = 95)
with the software SACHA [26], before computing meshes from the binary masks
using BrainVISA software.1
We denote as RealData the dataset composed of all 95 hippocampi meshes. We
rigidly aligned all hippocampi to one subject of the population. For this rigid registration,
we used a similarity term based on measures (as in [27]) rather than currents.
We also built two synthetic populations of hippocampi meshes, denoted Data1
and Data2. Data1 is composed of a large number of subjects, in order to test our
algorithms on a large dataset; to study the effect of population size separately,
the meshes of this population are simple. Data2 is a synthetic population close
to the real one, with the difference that all subjects have the same mesh structure.
1 http://www.brainvisa.info
This allows us to test our algorithms on a population with a single mesh structure, thus
disregarding the effects of different mesh structures. These two datasets are defined
as follows (examples of subjects from these datasets are shown in Fig. 10.5):
• Data1 We chose one subject S0 that we decimated (down to 135 vertices) and
deformed using geodesic shooting in 500 random directions with a sufficiently
large kernel and a reasonable momentum vector norm in order to preserve the
overall hippocampal shape, resulting in 500 deformed objects. Each deformed
object was then further transformed by a translation and a rotation of small magni-
tude. This resulted in the 500 different shapes of Data1. All shapes in Data1 have
the same mesh structure. Data1 thus provides a large dataset with simple meshes
and mainly global deformations.
• Data2 We chose the same initial subject S0 that we decimated to 1001 vertices.
We matched this mesh to each subject of the dataset RealData (n = 95), using
diffeomorphic deformation, resulting in 95 meshes with 1001 vertices. Data2 has
more local variability than Data1, and is closer to the anatomical truth.
Table 10.1 Distances between centroids computed with different subject orderings, for each dataset and each of the 3 algorithms

                    From different orderings                 To the dataset
                    mean (m1)   max      std                 mean (m2)   m1/m2
Data1      IC1      0.8682      1.3241   0.0526              91.25       0.0095
           IC2      0.5989      0.9696   0.0527              82.66       0.0072
           PW       3.5861      7.1663   0.1480              82.89       0.0433
Data2      IC1      2.4951      3.9516   0.2205              16.29       0.1531
           IC2      0.2875      0.4529   0.0164              15.95       0.0181
           PW       3.8447      5.3172   0.1919              17.61       0.2184
RealData   IC1      4.7120      6.1181   0.0944              18.54       0.2540
           IC2      0.5583      0.7867   0.0159              17.11       0.0326
           PW       5.3443      6.1334   0.1253              19.73       0.2708

The first three columns present the mean, the maximum and the standard deviation of the distances between all pairs of centroids computed with different orderings. The fourth column displays the mean of the distances between each centroid and all subjects of the dataset. Distances are computed in the space of currents
Table 10.2 Average distances between centroids computed using the different algorithms

           IC1 versus IC2   IC1 versus PW   IC2 versus PW
Data1      1.57             5.72            6.31
Data2      1.89             3.60            3.42
RealData   3.51             5.31            4.96
Distances between centroids computed with different orderings were larger for IC1
than for IC2; this may be due to the fact that IC1 provides a less precise estimate of the centroid between two shapes,
since it does not incorporate the reverse flow. For all datasets, distances for PW were
larger than those for IC1 and IC2, suggesting that the PW algorithm is the most
dependent on the subject ordering. Centroids computed with PW are
also farther from those computed using IC1 or IC2. Furthermore, we speculate that
the increased sensitivity of PW compared to IC1 may be due to the fact that, in IC1, $n-1$
levels of averaging are performed (versus only $\log_2 n$ for PW), leading to a reduction of
matching errors.
Finally, in order to provide a visualization of the differences, we present matchings
between 3 centroids computed with the IC1 algorithm in the case of RealData.
Figure 10.6 shows that the shape differences are local and residual. Visually, the 3 centroids
are almost identical, and the amplitudes of the momentum vectors, which bring one
centroid to another, are small and local.
Fig. 10.6 a First row: 3 initial subjects used for 3 different centroid computations with IC1 (mean distance between such centroids, in the space of currents, is 4.71) on RealData. Second row: the 3 centroids computed using the 3 subjects from the first row as initialization. b Maps of the amplitude of the momentum vectors that map each centroid to another. Top and bottom views of the maps are displayed. One can note that the differences are small and local

We also assessed whether the centroids are close to the center of the population. To
that purpose, we calculated the ratio

$$ R = \frac{\left\| \frac{1}{N}\sum_{i=1}^{N} v_0(S_i) \right\|_V}{\frac{1}{N}\sum_{i=1}^{N} \left\| v_0(S_i) \right\|_V}, \qquad (10.27) $$
with $v_0(S_i)$ the vector field corresponding to the initial momentum vector of the
deformation from the template or the centroid to subject $i$. This ratio gives some
indication about the centering of the centroid. In a pure Riemannian setting
(i.e., disregarding the inaccuracies of matchings), a zero ratio would mean that we are
at a critical point of the Fréchet functional; under some reasonable assumptions
on the curvature of the shape space in the neighbourhood of the dataset (which we
cannot check, however), it would mean that we are at the Fréchet mean. To compute
$R$, we need to match the centroid to all subjects of the population. We computed this
ratio on the best centroid (i.e., the centroid which is the closest to all other centroids)
for each algorithm and for each dataset.
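A sketch of this computation, assuming the initial vector fields are discretized as numpy arrays and approximating the V-norm by the Euclidean norm (the names are illustrative):

import numpy as np

def centering_ratio(v0_list):
    """Ratio R of Eq. (10.27): norm of the mean initial vector field
    divided by the mean of the norms (Euclidean norm used here as a
    stand-in for the RKHS V-norm)."""
    mean_field = np.mean(v0_list, axis=0)
    return np.linalg.norm(mean_field) / np.mean([np.linalg.norm(v) for v in v0_list])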
Results are presented in Table 10.3. We can observe that the centroids obtained
with the three different algorithms are reasonably centered for all datasets. Centroids
for Data1 are particularly well centered, which was expected given the nature of this
population. Centroids for Data2 and RealData are slightly less well centered but they
10 Diffeomorphic Iterative Centroid Methods 293
Table 10.3 Ratio values for assessing the position of the representative centroid within the
population, computed using Eq. 10.27 (for each algorithm and for each dataset)
R IC1 IC2 PW
Data1 0.046 0.038 0.085
Data2 0.106 0.102 0.107
RealData 0.106 0.107 0.108
are still close to the Fréchet mean. It is likely that using more accurate matchings (and
thus increasing the computation time of the algorithms) we could reduce this ratio
for RealData and Data2. Besides, one can note that ratios for Data2 and RealData
are very similar; this indicates that the centering of the centroid is not altered by the
variability of mesh structures in the population.
Table 10.4 Distances between templates initialized via different IC1 centroids (T(IC1)) for each dataset, and distances between the template with the standard initialization (T(StdInit)) and templates initialized via IC1

           T(IC1) versus T(IC1)   T(IC1) versus T(StdInit)
Data1      0.9833                 40.9333
Data2      0.6800                 20.4666
RealData   4.0433                 26.8667
Table 10.5 Ratios R for templates initialized via IC1 (T(IC1)) and for the template with its usual initialization (T(StdInit)), for each dataset

R          T(IC1)   T(StdInit)
Data1      0.0057   0.0062
Data2      0.0073   0.0077
RealData   0.0073   0.0074
Fig. 10.7 Estimated template from RealData. On the left, the template initialized via the standard initialization, which is the whole population. On the right, the estimated template initialized via an IC1 centroid
Templates initialized via different IC1 centroids are close to each other (Table 10.4), while templates initialized
via IC1 are far, in terms of distances in the space of currents, from the template
obtained with the standard initialization. These results could be alarming, but the
ratios (see Table 10.5) show that the templates are all very close to the Fréchet
mean, and that the differences are not due to a bad template estimation. Moreover,
both templates are visually similar, as seen in Fig. 10.7.
Since it is possible to stop the Iterative Centroid methods IC1 and IC2 at any step,
we wanted to assess the influence of computing only a fraction of the N iterations
on the estimated template. Indeed, one may wonder if computing an IC at, e.g.,
40 % (thus saving 60 % of computation time for the IC method) could be enough
to initialize a template estimation.

Fig. 10.8 First row: graphs of average $W^*$-distances between the IC1 at x % and the final one. The second row presents the same results for IC2

Moreover, for large datasets, the last subject will
have a very small influence: for a database composed of 1000 subjects, the weight
of the last subject is 1/1000. We performed this experiment in the case of IC1. In
the following, we call “IC1 at x%” an IC1 computed using x × N /100 subjects of
the population.
We computed the distance in the space of currents between the “IC1 at x %” and the final IC1.
Results are presented in Fig. 10.8. These distances are averaged over the 10 centroids
computed for each dataset. We can note that after processing 40 % of the population,
the IC1 covers more than 75 % of the distance to the final centroid for all datasets.
We also compared T(IC1 at 40 %) to T(IC1) and to T(StdInit), using distances in
the space of currents as well as the ratio R defined in Eq. 10.27. Results are shown in
Table 10.6. They show that using only 40 % of the subjects substantially lowers the quality
of the resulting template. Indeed, the estimated template seems trapped in the local
minimum found by the IC1 at 40 %. We certainly have to take into account the size of
the dataset. Nevertheless, we believe that if the dataset is very large and sufficiently
homogeneous we could stop the Iterative Centroid method before the end.
Table 10.7 Computation times (in h) for the Iterative Centroids and for template estimations initialized by IC1 (T(IC1)), by the standard initialization (T(StdInit)) and by IC1 at 40 % (T(IC1 at 40 %))

Computation time (h)   Data1                Data2                RealData
IC1                    1.7                  0.7                  1.2
IC2                    5.2                  2.4                  7.5
PW                     1.4                  0.7                  1.2
T(IC1)                 21.1 (= 1.7 + 19.4)  13.3 (= 0.7 + 12.6)  27.9 (= 1.2 + 26.7)
T(StdInit)             96.1                 20.6                 99
T(IC1 at 40 %)         24.4 (= 0.7 + 23.7)  10.4 (= 0.3 + 10.1)  40.7 (= 0.5 + 40.2)

For T(IC1), we give the complete time for the whole process, i.e. the time for the IC1 computation plus the time for the T(IC1) computation itself
10.5 Conclusion
We have proposed a new approach for the initialization of template estimation meth-
ods. The aim was to reduce computation time by providing a rough initial estimation,
making more feasible the application of template estimation on large databases.
To that purpose, we proposed to iteratively compute a centroid which is correctly
centered within the population. We proposed three different algorithms to compute
this centroid: the first two algorithms are iterative (IC1 and IC2) and the third one is
recursive (PW). We have evaluated the different approaches on one real and two syn-
thetic datasets of brain anatomical structures. Overall, the centroids computed with
all three approaches are close to the Fréchet mean of the population, thus providing
a reasonable centroid or initialization for template estimation methods. Furthermore,
for all methods, centroids computed using different orderings are similar. It can be
noted that IC2 seems to be more robust to the ordering than IC1, which in turn seems
more robust than PW. Nevertheless, in general, all methods appear relatively robust
with respect to the ordering.
The advantage of iterative methods, like IC1 and IC2, is that we can stop the
deformation at any step, resulting in a centroid built with part of the population.
Thus, for large databases (composed for instance of 1000 subjects), it may not be
necessary to include all subjects in the computation since the weight of these subjects
will be very small. The iterative nature of IC1 and IC2 provides another interesting
advantage, which is the possible online refinement of the centroid estimation as new
subjects are added to the dataset. This leads to an increased possibility of interaction
with the image analysis process. On the other hand, the recursive PW method has
the advantage that it can be parallelized (still using GPU implementation), although
we did not implement this specific feature in the present work.
Using the centroid as initialization of the template estimation can substantially
speed up the convergence. For instance, using IC1 (which is the fastest one) as initialization
saved up to 70 % of computation time. Moreover, this method could certainly
be used to initialize other template estimation methods, such as the method proposed
by Durrleman et al. [6].
As we observed, the centroids obtained with rough parameters are close to the
Fréchet mean of the population; thus we believe that, by computing the IC with more
precise parameters (but still reasonable in terms of computation time), we could obtain
centroids closer to the center. Such an accurate centroid could be seen as a cheap
alternative to true template estimation methods, particularly if computing a precise mean
of the population of shapes is not required. Indeed, in the LDDMM framework,
template-based shape analysis gives only a first-order, linearized approximation of
the geometry in shape space. In a future work, we will study the impact of using IC
as a cheap template on results of population analysis based for instance on kernel
principal component analysis. Finally, the present work deals with surfaces for which
the metric based on currents seems to be well-adapted. Nevertheless, the proposed
algorithms for centroid computation are general and could be applied to images,
provided that an adapted metric is used.
Acknowledgments The authors are grateful to Vincent Frouin, Jean-Baptiste Poline, Roberto Toro
and Edouard Duschenay for providing a sample of subjects from the IMAGEN dataset, and to Marie
Chupin for the use of the SACHA software. The authors thank Professor Thomas Hotz for his
suggestion on the pairwise centroid during the discussion at the GSI’13 conference. The research
leading to these results has received funding from the ANR (project HM-TC, grant number ANR-09-
EMER-006, and project KaraMetria, grant number ANR-09-BLAN-0332) and from the program
Investissements d’avenir ANR-10-IAIHU-06.
References
1. Grenander, U., Miller, M.I.: Computational anatomy: an emerging discipline. Q. Appl. Math.
56(4), 617–694 (1998)
2. Christensen, G.E., Rabbitt, R.D., Miller, M.I.: Deformable templates using large deformation
kinematics. IEEE Trans. Image Process. 5(10), 1435–1447 (1996)
3. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings
via geodesic flows of diffeomorphisms. Int. J. Comput. Vision 61(2), 139–157 (2005)
4. Vaillant, M., Miller, M.I., Younes, L., Trouvé, A.: Statistics on diffeomorphisms via tangent
space representations. Neuroimage 23, S161–S169 (2004)
5. Glaunès, J., Joshi, S.: Template estimation from unlabeled point set data and surfaces for com-
putational anatomy. In: Pennec, X., Joshi, S., (eds.) Proceedings of the International Workshop
on the Mathematical Foundations of Computational Anatomy (MFCA-2006), pp. 29–39, 1 Oct
2006
6. Durrleman, S., Pennec, X., Trouvé, A., Ayache, N., et al.: A forward model to build unbiased
atlases from curves and surfaces. In: 2nd Medical Image Computing and Computer Assisted
Intervention. Workshop on Mathematical Foundations of Computational Anatomy, pp. 68–79
(2008)
7. Durrleman, S., Prastawa, M., Korenberg, J.R., Joshi, S., Trouvé, A., Gerig, G.: Topology pre-
serving atlas construction from shape data without correspondence using sparse parameters.
In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) Medical Image Computing and
Computer-Assisted Intervention—MICCAI 2012. Lecture Notes in Computer Science, vol.
7512, pp. 223–230. Springer, Berlin (2012)
8. Ma, J., Miller, M.I., Trouvé, A., Younes, L.: Bayesian template estimation in computational
anatomy. Neuroimage 42(1), 252–261 (2008)
9. Durrleman, S., Pennec, X., Trouvé, A., Ayache, N.: Statistical models of sets of curves and
surfaces based on currents. Med. Image Anal. 13(5), 793–808 (2009)
10. Arnaudon, M., Dombry, C., Phan, A., Yang, L.: Stochastic algorithms for computing means of
probability measures. Stoch. Process. Appl. 122(4), 1437–1455 (2012)
11. Ando, T., Li, C.K., Mathias, R.: Geometric means. Linear Algebra Appl. 385, 305–334 (2004)
12. Schwartz, L.: Théorie des distributions. Bull. Amer. Math. Soc. 58, 78–85 (1952)
13. de Rham, G.: Variétés différentiables. Formes, courants, formes harmoniques. Actualits Sci.
Ind., no. 1222, Publ. Inst. Math. Univ. Nancago III. Hermann, Paris (1955)
14. Vaillant, M., Glaunes, J.: Surface matching via currents. In: Information Processing in Medical
Imaging, pp. 381–392. Springer, Berlin (2005)
15. Glaunes, J.: Transport par difféomorphismes de points, de mesures et de courants pour la
comparaison de formes et l’anatomie numérique. PhD thesis, Université Paris 13 (2005)
16. Durrleman, S.: Statistical models of currents for measuring the variability of anatomical curves,
surfaces and their evolution. PhD thesis, University of Nice-Sophia Antipolis (2010)
17. Yang, X.F., Goh, A., Qiu, A.: Approximations of the diffeomorphic metric and their applications
in shape learning. In: Information Processing in Medical Imaging (IPMI), pp. 257–270 (2011)
18. Tenenbaum, J., Silva, V., Langford, J.: A global geometric framework for nonlinear dimen-
sionality reduction. Science 290(5500), 2319–2323 (2000)
19. Kendall, W.S.: Probability, convexity, and harmonic maps with small image i: uniqueness and
fine existence. Proc. Lond. Math. Soc. 3(2), 371–406 (1990)
20. Karcher, H.: Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math.
30(5), 509–541 (1977)
21. Le, H.: Estimation of riemannian barycentres. LMS J. Comput. Math. 7, 193–200 (2004)
22. Afsari, B.: Riemannian Lp center of mass: existence, uniqueness, and convexity. Proc. Am.
Math. Soc. 139(2), 655–673 (2011)
23. Afsari, B., Tron, R., Vidal, R.: On the convergence of gradient descent for finding the riemannian
center of mass. SIAM J. Control Optim. 51(3), 2230–2260 (2013)
24. Emery, M., Mokobodzki, G.: Sur le barycentre d’une probabilité dans une variété. In: Séminaire
de probabilités, vol. 25, pp. 220–233. Springer, Berlin (1991)
25. Cury, C., Glaunès, J.A., Colliot, O.: Template estimation for large database: a diffeomorphic
iterative centroid method using currents. In: Nielsen, F., Barbaresco, F. (eds.) GSI. Lecture
Notes in Computer Science, vol. 8085, pp. 103–111. Springer, Berlin (2013)
26. Chupin, M., Hammers, A., Liu, R.S.N., Colliot, O., Burdett, J., Bardinet, E., Duncan, J.S.,
Garnero, L., Lemieux, L.: Automatic segmentation of the hippocampus and the amygdala
driven by hybrid constraints: method and validation. Neuroimage 46(3), 749–761 (2009)
27. Glaunes, J., Trouvé, A., Younes, L.: Diffeomorphic matching of distributions: a new approach
for unlabelled point-sets and sub-manifolds matching. In: Proceedings of the 2004 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 712–718
(2004)
Chapter 11
Hartigan’s Method for k-MLE: Mixture
Modeling with Wishart Distributions
and Its Application to Motion Retrieval
Mixture models are a powerful and flexible tool to model an unknown probability
density function f (x) as a weighted sum of parametric density functions $p_j(x; \theta_j)$:

$$ f(x) = \sum_{j=1}^{K} w_j\, p_j(x; \theta_j), \quad \text{with } w_j > 0 \text{ and } \sum_{j=1}^{K} w_j = 1. \qquad (11.1) $$

C. Saint-Jean (B)
Mathématiques, Image, Applications (MIA), Université de La Rochelle,
17000 La Rochelle, France
e-mail: [email protected]
F. Nielsen
Sony Computer Science Laboratories, Inc., 3-14-13 Higashi Gotanda, 141-0022 Shinagawa-Ku,
Tokyo, Japan
F. Nielsen
Laboratoire d’Informatique (LIX), Ecole Polytechnique, Palaiseau Cedex, France
By far the most common case is the mixture of Gaussians, for which the Expectation-
Maximization (EM) method has been used for decades to estimate the parameters $\{(w_j, \theta_j)\}_j$
from the maximum likelihood principle. Many extensions have aimed at overcoming its
slowness and lack of robustness [1]. Since the seminal work of Banerjee et al. [2], several
methods have been generalized to exponential families in connection with
Bregman divergences. In particular, Bregman soft clustering provides a unifying
and elegant framework for the EM algorithm with mixtures of exponential families.
In a recent work [3], the k-Maximum Likelihood Estimator (k-MLE) has been
proposed as a fast alternative to EM for learning any exponential family mixture:
k-MLE relies on the bijection between exponential families and Bregman divergences to
transform the mixture learning problem into a geometric clustering problem. We
refer the reader to the review paper [4] for an introduction to clustering.
This paper proposes several variations around the initial k-MLE algorithm, with
a specific focus on mixtures of Wishart distributions [5]. Such a mixture can model complex
distributions over the set $\mathcal{S}_{++}^d$ of $d \times d$ symmetric positive definite matrices. Data of
this kind arise naturally in some applications, like diffusion tensor imaging and radar
imaging, but also artificially as signatures for a multivariate dataset (a region of interest
in a multispectral image, or a temporal sequence of measures from several sensors).
In the literature, the Wishart distribution is rarely used for modeling data, but more
often in Bayesian approaches as a (conjugate) prior for the inverse covariance matrix
of a Gaussian vector. This explains why few works concern the estimation of the
parameters of a Wishart distribution from a set of matrices. To the best of our knowledge, the
most closely related work is the one of Tsai [6] concerning MLE and restricted MLE
with ordering constraints. From the application viewpoint, one may cite polarimetric
SAR imaging [7] and bio-medical imaging [8]. Another example is a recent paper on
people tracking [9], which applies a Dirichlet process mixture model (infinite mixture
model) to the clustering of covariance matrices.
The paper is organized as follows: Sect. 11.2 recalls the definition of an exponential
family (EF), the principle of maximum likelihood estimation in EFs, and
how it is connected with Bregman divergences. From these definitions, the complete
description of the k-MLE technique is derived by following the formalism of the
Expectation-Maximization algorithm in Sect. 11.3. In the same section, the Hartigan
approach for k-MLE is proposed and discussed, as well as how to initialize it properly.
Section 11.4 concerns the learning of a mixture of Wishart distributions with k-MLE. For this
purpose, an iterative procedure that converges to the MLE (when it exists) is derived. In Sect. 11.5, we
describe an application scenario to motion retrieval, before concluding in Sect. 11.6.
An exponential family is a set of probability distributions admitting densities of the canonical form

$$ p_F(x; \theta) = \exp\{\langle t(x), \theta\rangle + k(x) - F(\theta)\}, $$

with $t(x)$ the sufficient statistic, $\theta$ the natural parameter, $k$ the carrier measure and
$F$ the log-normalizer [10]. Most commonly used distributions, such as the Bernoulli,
Gaussian, Multinomial, Dirichlet, Poisson, Beta, Gamma and von Mises distributions, are indeed
exponential families (see the above reference for a complete list). Later on in the chapter,
a canonical decomposition of the Wishart distribution as an exponential family will
be detailed.
The framework of exponential families gives a direct solution for finding the maximum
likelihood estimator $\hat{\theta}$ from a set of i.i.d. observations $\chi = \{x_1, \ldots, x_N\}$.
Denoting $L$ the likelihood function,

$$ L(\theta; \chi) = \prod_{i=1}^{N} p_F(x_i; \theta) = \prod_{i=1}^{N} \exp\{\langle t(x_i), \theta\rangle + k(x_i) - F(\theta)\}, \qquad (11.2) $$

and $\bar{l}$ the average log-likelihood,

$$ \bar{l}(\theta; \chi) = \frac{1}{N}\sum_{i=1}^{N} \left( \langle t(x_i), \theta\rangle + k(x_i) - F(\theta) \right). \qquad (11.3) $$

It follows that the MLE $\hat{\theta} = \arg\max_{\Theta} \bar{l}(\theta; \chi)$ satisfies

$$ \nabla F(\hat{\theta}) = \frac{1}{N}\sum_{i=1}^{N} t(x_i). \qquad (11.4) $$

Recall that the functional reciprocal $(\nabla F)^{-1}$ of $\nabla F$ is also $\nabla F^*$ for $F^*$ the convex
conjugate of $F$ [11]. It is a mapping from the expectation parameter space $H$ to the
natural parameter space $\Theta$. Thus, the MLE is obtained by mapping $(\nabla F)^{-1}$ onto the
average of sufficient statistics:

$$ \hat{\theta} = (\nabla F)^{-1}\left( \frac{1}{N}\sum_{i=1}^{N} t(x_i) \right). \qquad (11.5) $$
Whereas determining $(\nabla F)^{-1}$ may be trivial for some univariate distributions like
the Bernoulli, Poisson and Gaussian, the multivariate case is much more challenging and leads to
considering approximate methods to solve this variational problem [12].
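For instance, for the univariate Gaussian with both parameters free, t(x) = (x, x²) and (∇F)⁻¹ is available in closed form; a sketch (standard decomposition, hypothetical function name):

import numpy as np

def gaussian_mle_via_suff_stats(x):
    """MLE by Eq. (11.5): average the sufficient statistics t(x) = (x, x^2),
    then map back through (grad F)^{-1}; this recovers the sample mean and
    the (biased) sample variance, returned here in natural coordinates."""
    eta1, eta2 = np.mean(x), np.mean(np.square(x))  # average sufficient statistics
    mu, sigma2 = eta1, eta2 - eta1 ** 2
    # natural parameters: theta = (mu / sigma^2, -1 / (2 sigma^2))
    return mu / sigma2, -1.0 / (2.0 * sigma2)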
In this part, the link between the MLE and the Kullback-Leibler (KL) divergence is recalled.
Banerjee et al. [2] interpret the log-density of a (regular) exponential family as a
(regular) Bregman divergence:

$$ \log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x). \qquad (11.6) $$

In Eq. 11.6, the term $B_{F^*}(t(x) : \eta)$ says how much the sufficient statistic $t(x)$ of observation
$x$ is dissimilar to $\eta \in H$.
The Kullback-Leibler divergence between two members of the same exponential family
is equivalent to the Bregman divergence of the associated log-normalizer on swapped
natural parameters [10]:

$$ \mathrm{KL}(p_F(.; \theta_1) \,\|\, p_F(.; \theta_2)) = B_F(\theta_2 : \theta_1). $$

Let us remark that $B_F$ is always known in closed form using the canonical decomposition
of $p_F$, whereas $B_{F^*}$ requires the knowledge of $F^*$. Finding the maximizer
of the log-likelihood on $\Theta$ amounts to finding the minimizer $\hat{\eta}$ of

$$ \sum_{i=1}^{N} B_{F^*}(t(x_i) : \eta) = \sum_{i=1}^{N} \mathrm{KL}\left( p_{F^*}(.; t(x_i)) \,\|\, p_{F^*}(.; \eta) \right) $$

on $H$, since the last two terms in Eq. (11.6) are constant with respect to $\eta$.
This section presents how to fit a mixture of exponential families with k-MLE.
This algorithm requires having an MLE (see the previous section) for each component
distribution $p_{F_j}$ of the considered mixture. As it shares many properties with the EM
algorithm for mixtures, the latter is recalled first. The Lloyd and Hartigan heuristics
for k-MLE are then completely described. Also, two methods for the initialization of
k-MLE are proposed, depending on whether or not the component distributions are known.
11.3.1 EM Algorithm
Consider a mixture of $K$ exponential families,

$$ f(x) = \sum_{j=1}^{K} w_j\, p_{F_j}(x; \theta_j), \qquad (11.10) $$

where $K$ is the number of components and the $w_j$ are the mixture weights, which sum
up to unity. Finding the mixture parameters $\{(w_j, \theta_j)\}_j$ can again be addressed by
maximizing the log-likelihood of the mixture distribution:

$$ L(\{(w_j, \theta_j)\}_j; \chi) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} w_j\, p_{F_j}(x_i; \theta_j). \qquad (11.11) $$
For $K > 1$, the sum of terms appearing inside the logarithm makes the optimization much
more difficult than the one of Sect. 11.2.1 ($K = 1$). A classical solution, also well
suited for clustering purposes, is to augment the model with hidden indicator vector
variables $z_i$, where $z_{ij} = 1$ iff observation $x_i$ is generated by the $j$-th component and 0
otherwise. The previous equation is now replaced by the complete log-likelihood of the
mixture distribution:

$$ L_c(\{(w_j, \theta_j)\}_j; \{(x_i, z_i)\}_i) = \sum_{i=1}^{N}\sum_{j=1}^{K} z_{ij} \log\left( w_j\, p_{F_j}(x_i; \theta_j) \right). \qquad (11.12) $$
The EM algorithm then alternates the two following steps until convergence:
1. Compute the expectations of the hidden variables given the current parameters:

$$ \hat{z}_{ij}^{(t)} = \frac{w_j^{(t)}\, p_{F_j}(x_i; \theta_j^{(t)})}{\sum_{j'} w_{j'}^{(t)}\, p_{F_{j'}}(x_i; \theta_{j'}^{(t)})}. \qquad (11.13) $$
2. Update the mixture parameters by maximizing $Q$ (i.e., Eq. (11.12) where the hidden values
$z_{ij}$ are replaced by $\hat{z}_{ij}^{(t)}$):

$$ \hat{w}_j^{(t+1)} = \frac{\sum_{i=1}^{N} \hat{z}_{ij}^{(t)}}{N}, \qquad \hat{\theta}_j^{(t+1)} = \arg\max_{\theta_j \in \Theta_j} \sum_{i=1}^{N} \hat{z}_{ij}^{(t)} \log p_{F_j}(x_i; \theta_j). \qquad (11.14) $$

While $\hat{w}_j^{(t+1)}$ is always known in closed form whatever the $F_j$ are, the $\hat{\theta}_j^{(t+1)}$ are obtained
by component-wise specific optimizations involving all observations.
Many properties of this algorithm are known (e.g., maximization of $Q$ implies maximization
of $L$, slow convergence to a local maximum, etc.). From a clustering perspective,
components are identified with clusters and the values $\hat{z}_{ij}$ are interpreted as the soft
membership of $x_i$ to cluster $C_j$. In order to get a strict partition after convergence,
each $x_i$ is assigned to the cluster $C_j$ iff $\hat{z}_{ij}$ is maximal over $\hat{z}_{i1}, \hat{z}_{i2}, \ldots, \hat{z}_{iK}$.
A main reason for the slowness of EM is that all observations are taken into account
for the update of the parameters of each component, since $\hat{z}_{ij}^{(t)} \in [0, 1]$. A natural idea is
then to generate smaller sub-samples of $\chi$ from the $\hat{z}_{ij}^{(t)}$ in a deterministic manner. The
simplest way to do this is to get a strict partition of $\chi$ with the MAP assignment:

$$ \tilde{z}_{ij}^{(t)} = \begin{cases} 1 & \text{if } \hat{z}_{ij}^{(t)} = \max_k \hat{z}_{ik}^{(t)} \\ 0 & \text{otherwise} \end{cases}. $$

When multiple maxima exist, the component with the smallest index is chosen. If this
classification step is inserted between the E-step and the M-step, the Classification EM (CEM)
algorithm [14] is retrieved. Moreover, for isotropic Gaussian components with fixed
unit variance, CEM is shown to be equivalent to the Lloyd k-means algorithm [4].
More recently, CEM was reformulated in a closely related way under the name k-MLE [3]
in the context of exponential families and Bregman divergences. In the following
of the paper, we will refer only to the latter. Replacing $z_{ij}^{(t)}$ by $\tilde{z}_{ij}^{(t)}$ in Eq. (11.12), the
criterion to be maximized in the M-step can be reformulated as

$$ \tilde{L}_c(\{(w_j, \theta_j)\}_j; \{(x_i, \tilde{z}_i^{(t)})\}_i) = \sum_{j=1}^{K}\sum_{i=1}^{N} \tilde{z}_{ij}^{(t)} \log\left( w_j\, p_{F_j}(x_i; \theta_j) \right). \qquad (11.15) $$
Each term leads to a separate optimization to get the parameters of the corresponding
component:

$$ \hat{w}_j^{(t+1)} = \frac{|C_j^{(t)}|}{N}, \qquad \hat{\theta}_j^{(t+1)} = \arg\max_{\theta_j \in \Theta_j} \sum_{x \in C_j^{(t)}} \log p_{F_j}(x; \theta_j). \qquad (11.17) $$
The last equation is nothing but the MLE equation for the $j$-th component restricted to
a subset of $\chi$. Algorithm 1 summarizes k-MLE with the Lloyd method, given an initial
description of the mixture.
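A schematic Python view of this Lloyd-style loop, assuming per-component log_pdf and mle routines (hypothetical names standing in for the component densities and their MLEs):

import numpy as np

def k_mle_lloyd(points, thetas, weights, log_pdf, mle, n_rounds=20):
    """Sketch of k-MLE with Lloyd's heuristic: alternate the hard MAP
    assignment (C-step) with per-cluster MLE updates (Eq. 11.17); the
    weights are refreshed from the cluster sizes at the end."""
    K = len(thetas)
    for _ in range(n_rounds):
        # C-step: assign each point to the most likely weighted component
        scores = np.array([[np.log(weights[j]) + log_pdf(x, thetas[j])
                            for j in range(K)] for x in points])
        labels = scores.argmax(axis=1)
        # M-step: component-wise MLE on the induced partition
        for j in range(K):
            cluster = [x for x, l in zip(points, labels) if l == j]
            if cluster:
                thetas[j] = mle(cluster)
    weights = np.bincount(labels, minlength=K) / len(points)
    return thetas, weights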
Contrary to the CEM algorithm, the k-MLE algorithm updates the mixture weights after the
convergence of $\tilde{L}_c$ (line 8), and not simultaneously with the component parameters.
Despite this difference, both algorithms can be proved to converge to a local maximum
of $\tilde{L}_c$ with the same kind of arguments (see [3, 14]). In practice, the local maxima
(and also the mixture parameters) are not necessarily equal for the two algorithms.
In Hartigan's approach, a single observation $x_c$ is picked from its current cluster $C_c$ and
reassigned to the component maximizing its log posterior:

$$ j^* = \arg\max_j \log\left( \hat{w}_j^{(t)}\, p_{F_j}(x_c; \hat{\theta}_j^{(t)}) \right), $$

where $\hat{w}_j^{(t)}$ and $\hat{\theta}_j^{(t)}$ denote the weight and the parameters of the $j$-th component at
some iteration. Then, the parameters of the two components are updated with the MLE:

$$ \hat{\theta}_c^{(t+1)} = \arg\max_{\theta_c \in \Theta_c} \sum_{x \in C_c^{(t)} \setminus \{x_c\}} \log p_{F_c}(x; \theta_c) \qquad (11.18) $$

$$ \hat{\theta}_{j^*}^{(t+1)} = \arg\max_{\theta_{j^*} \in \Theta_{j^*}} \sum_{x \in C_{j^*}^{(t)} \cup \{x_c\}} \log p_{F_{j^*}}(x; \theta_{j^*}). \qquad (11.19) $$
The mixture weights $\hat{w}_c$ and $\hat{w}_{j^*}$ remain unchanged in this step (see line 9 of
Algorithm 1). Consequently, $\tilde{L}_c$ increases by $\Phi^{(t)}(x_c, C_c, C_{j^*})$, where $\Phi^{(t)}(x_c, C_c, C_j)$
is more generally defined as

$$ \Phi^{(t)}(x_c, C_c, C_j) = \sum_{x \in C_c \setminus \{x_c\}} \log p_{F_c}(x; \hat{\theta}_c^{(t+1)}) - \sum_{x \in C_c \cup \{x_c\}} \log p_{F_c}(x; \hat{\theta}_c^{(t)}) - \log\frac{\hat{w}_c}{\hat{w}_j} $$
$$ \qquad + \sum_{x \in C_j \cup \{x_c\}} \log p_{F_j}(x; \hat{\theta}_j^{(t+1)}) - \sum_{x \in C_j \setminus \{x_c\}} \log p_{F_j}(x; \hat{\theta}_j^{(t)}). \qquad (11.20) $$
This procedure is nothing more than a partial assignment (C-step) in the Lloyd
method for k-MLE. This is an indirect way to reach our initial goal which is the
maximization of L̃c .
Following Telgarsky and Vattani [16], a better approach is to consider as “optimal”
the assignment to the cluster $C_{j^*}$ which maximizes $\Phi^{(t)}(x_c, C_c, C_j)$ over $j$.
Since $\Phi^{(t)}(x_c, C_c, C_c) = 0$, such an assignment satisfies $\Phi^{(t)}(x_c, C_c, C_{j^*}) \geq 0$ and therefore
guarantees the increase of $\tilde{L}_c$. As the optimization space is finite (partitions of $\{x_1, \ldots, x_N\}$),
this procedure converges to a local maximum of $\tilde{L}_c$. There is no guarantee that $C_{j^*}$
coincides with the MAP assignment for $x_c$.
For the k-means loss function, Hartigan's method avoids empty clusters, since any
assignment to one of those empty clusters necessarily decreases the loss [16]. An analogous
property will now be studied for k-MLE through the formulation of $\tilde{L}_c$ in
$\eta$-coordinates:

$$ \tilde{L}_c(\{(w_j, \eta_j)\}_j; \{C_j^{(t)}\}_j) = \sum_{j=1}^{K} \sum_{x \in C_j^{(t)}} \left[ F_j^*(\eta_j) + k_j(x) + \langle t_j(x) - \eta_j, \nabla F_j^*(\eta_j)\rangle + \log w_j \right]. \qquad (11.22) $$
Recalling that the MLE satisfies $\hat{\eta}_j^{(t)} = |C_j^{(t)}|^{-1} \sum_{x \in C_j^{(t)}} t_j(x)$, the dot product vanishes
when $\eta_j = \hat{\eta}_j^{(t)}$, and it follows after small calculations that

$$ \hat{\eta}_c^{(t+1)} = \frac{|C_c^{(t)}|\,\hat{\eta}_c^{(t)} - t_c(x_c)}{|C_c^{(t)}| - 1}, \qquad \hat{\eta}_j^{(t+1)} = \frac{|C_j^{(t)}|\,\hat{\eta}_j^{(t)} + t_j(x_c)}{|C_j^{(t)}| + 1}. \qquad (11.24) $$
When $C_c^{(t)} = \{x_c\}$, there is no particular reason for $\Phi^{(t)}(x_c, \{x_c\}, C_j)$ to be always
negative. The simplifications occurring for k-means in the Euclidean case (e.g., $k_j(x_c) = 0$,
clusters with equal weight $w_j = K^{-1}$, etc.) do not exist in this more general case.
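The updates of Eq. (11.24) make a single Hartigan swap cheap; a sketch, assuming both clusters share the same sufficient statistic and that the expectation parameters are stored as numpy arrays (illustrative names):

def hartigan_swap_update(eta_c, size_c, eta_j, size_j, t_xc):
    """Eq. (11.24): move one observation with sufficient statistic t_xc
    from the donor cluster c to the receiver cluster j, updating the
    expectation parameters (sample means of sufficient statistics)."""
    new_eta_c = (size_c * eta_c - t_xc) / (size_c - 1)
    new_eta_j = (size_j * eta_j + t_xc) / (size_j + 1)
    return new_eta_c, new_eta_j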
Discarding the weight terms, consider the criterion

$$ \mathring{L}_c(\{\theta_j\}_j; \{C_j^{(t)}\}_j) = \sum_{j=1}^{K} \sum_{x \in C_j^{(t)}} \log p_{F_j}(x; \theta_j), \qquad (11.25) $$

or equivalently

$$ \mathring{L}_c(\{\eta_j\}_j; \{C_j^{(t)}\}_j) = \sum_{j=1}^{K} \sum_{x \in C_j^{(t)}} \left[ F_j^*(\eta_j) + k_j(x) + \langle t_j(x) - \eta_j, \nabla F_j^*(\eta_j)\rangle \right]. \qquad (11.26) $$

When all $F_j^* = F^*$ are identical and the partition $\{C_j^{(t)}\}_j$ corresponds to the MAP
assignment, $\mathring{L}$ is exactly the objective function $\breve{L}$ of Bregman k-means [2].
Rewriting $\breve{L}$ as an equivalent criterion to be minimized, it follows that
$$ \breve{L}(\{\eta_j\}_j) = \sum_{i=1}^{N} \min_{j=1}^{K} B_{F^*}(t(x_i) : \eta_j). \qquad (11.27) $$
Bregman k-means++ [20, 21] provides initial centers $\{\eta_j^{(0)}\}_j$ which guarantee to find
a clustering that is $O(\log K)$-competitive with the optimal Bregman k-means clustering.
The k-MLE++ algorithm amounts to using Bregman k-means++ on the dual log-normalizer
$F^*$ (see Algorithm 3).
Algorithm 3: k-MLE++
Input: A sample $\chi = \{x_1, \ldots, x_N\}$, $t$ the sufficient statistic and $F^*$ the dual log-normalizer of an exponential family, $K$ the number of clusters
Output: Initial mixture parameters $\{(w_j^{(0)}, \eta_j^{(0)})\}_j$
1 $w_1^{(0)} = 1/K$;
2 Choose the first seed $\eta_1^{(0)} = t(x_i)$ for $i$ uniformly random in $\{1, 2, \ldots, N\}$;
3 for $j = 2, \ldots, K$ do
4     $w_j^{(0)} = 1/K$;
      // Compute relative contributions to $\breve{L}(\{\eta_j\}_j)$
5     foreach $x_i \in \chi$ do $p_i = \min_{j'=1}^{j-1} B_{F^*}(t(x_i) : \eta_{j'}) \,/\, \sum_{i'=1}^{N} \min_{j'=1}^{j-1} B_{F^*}(t(x_{i'}) : \eta_{j'})$;
6     Choose $\eta_j^{(0)} \in \{t(x_1), \ldots, t(x_N)\}$ with probability $p_i$;
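A sketch of this seeding, assuming a bregman_div callable for B_{F*} (hypothetical name; in practice it is derived from the dual log-normalizer):

import numpy as np

def k_mle_pp_seeds(suff_stats, K, bregman_div, rng=None):
    """Sketch of Algorithm 3 (k-MLE++): D^2-style seeding in the
    expectation space, with distances measured by B_{F*}."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(suff_stats)
    seeds = [suff_stats[rng.integers(N)]]
    for _ in range(1, K):
        d = np.array([min(bregman_div(t, s) for s in seeds) for t in suff_stats])
        p = d / d.sum()                   # relative contributions p_i
        seeds.append(suff_stats[rng.choice(N, p=p)])
    return seeds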
When $K$ is unknown, the same strategy can still be applied, but a stopping criterion
has to be set. The probability $p_i$ in Algorithm 3 is the relative contribution of observation $x_i$,
through $t(x_i)$, to $\breve{L}(\{\eta_1, \ldots, \eta_K\})$, where $K$ is the number of already selected centers.
A high $p_i$ indicates that $x_i$ is relatively far from these centers, and thus atypical with respect to the
mixture $\{(w_1^{(0)}, \eta_1^{(0)}), \ldots, (w_K^{(0)}, \eta_K^{(0)})\}$ for $w_j^{(0)} = w^{(0)}$ an arbitrary constant. When
a new center is selected, $p_i$ necessarily decreases in the next iteration. A good covering
of $\chi$ is obtained when all $p_i$ are lower than some threshold $\lambda \in [0, 1]$. Algorithm 4
describes the initialization, named DP-k-MLE++.
The higher the threshold $\lambda$, the lower the number of generated centers. In particular,
the value $\frac{1}{N}$ should be considered as a reasonable minimum setting for $\lambda$. For
$\lambda = 1$, the algorithm will simply return one center. Since $p_i = 0$ for already selected
centers, this method guarantees all centers to be distinct.
Algorithm 4: DP-k-MLE++
Input: A sample $\chi = \{x_1, \ldots, x_N\}$, $t$ the sufficient statistic and $F^*$ the dual log-normalizer of an exponential family, $\lambda \in [0, 1]$
Output: Initial mixture parameters $\{(w_j^{(0)}, \eta_j^{(0)})\}_j$, $K$ the number of clusters
1 Choose the first seed $\eta_1^{(0)} = t(x_i)$ for $i$ uniformly random in $\{1, 2, \ldots, N\}$;
2 $K = 1$;
3 repeat
      // Compute relative contributions to $\breve{L}(\{\eta_1, \ldots, \eta_K\})$
4     foreach $x_i \in \chi$ do $p_i = \min_{j=1}^{K} B_{F^*}(t(x_i) : \eta_j) \,/\, \sum_{i'=1}^{N} \min_{j=1}^{K} B_{F^*}(t(x_{i'}) : \eta_j)$;
5     if $\exists\, p_i > \lambda$ then
6         $K = K + 1$;
          // Select the next seed
7         Choose $\eta_K^{(0)} \in \{t(x_1), \ldots, t(x_N)\}$ with probability $p_i$;
8 until all $p_i \leq \lambda$;
9 for $j = 1, \ldots, K$ do $w_j^{(0)} = 1/K$;
The component distributions need not all belong to the same exponential family: a family can
be chosen for each seed through a choice function $H$ when some extra
knowledge $\xi_i$ about $x_i$ is available (see Sect. 11.5 for an example). Given such a choice
function $H$, Algorithm 5, called “DP-comp-k-MLE”, describes this new flexible
initialization method. DP-comp-k-MLE is clearly a generalization of DP-k-MLE++
when $H$ always returns the same exponential family. However, in the general case, it
remains to be proved whether a DP-comp-k-MLE clustering is $O(\log K)$-competitive
with the optimal k-MLE clustering (with equal weights). Without this difficult theoretical
study, the suffix “++” is deliberately omitted in the name DP-comp-k-MLE.
To end this section, let us recall that all we need to know for using the
proposed algorithms is the MLE for the considered exponential family, whether it is
available in closed form or not. For many exponential families, all details (canonical
decomposition, $F$, $\nabla F$, $F^*$, $\nabla F^* = (\nabla F)^{-1}$) are already known [10]. The next
section focuses on the case of the Wishart distribution.
Algorithm 5: DP-comp-k-MLE
Input: A sample $\chi = \{x_1, \ldots, x_N\}$ with extra knowledge $\xi = \{\xi_1, \ldots, \xi_N\}$, $H$ a choice function of an exponential family, $\lambda \in [0, 1]$
Output: Initial mixture parameters $\{(w_j^{(0)}, \eta_j^{(0)})\}_j$, $\{(t_j, F_j^*)\}_j$ the sufficient statistics and dual log-normalizers of the exponential families, $K$ the number of clusters
   // Select the first seed and exponential family
1  for $i$ uniformly random in $\{1, 2, \ldots, N\}$ do
2      Obtain $t_1, F_1^*$ from $H(x_i, \xi_i)$;
3      Select the first seed $\eta_1^{(0)} = t_1(x_i)$;
4  $K = 1$;
5  repeat
6      foreach $x_i \in \chi$ do $p_i = \min_{j=1}^{K} B_{F_j^*}(t_j(x_i) : \eta_j) \,/\, \sum_{i'=1}^{N} \min_{j=1}^{K} B_{F_j^*}(t_j(x_{i'}) : \eta_j)$;
7      if $\exists\, p_i > \lambda$ then
8          $K = K + 1$;
           // Select the next seed and exponential family
9          for $i$ with probability $p_i$ in $\{1, 2, \ldots, N\}$ do
10             Obtain $t_K, F_K^*$ from $H(x_i, \xi_i)$;
11             Select the next seed $\eta_K^{(0)} = t_K(x_i)$;
12 until all $p_i \leq \lambda$;
13 for $j = 1, \ldots, K$ do $w_j^{(0)} = 1/K$;

This section recalls the definition of the Wishart distribution and proposes a maximum
likelihood estimator for its parameters. Some known facts, such as the Kullback-Leibler
divergence between two Wishart densities, are also recalled, and their use with the above
algorithms is discussed.

The Wishart distribution [5] is the multidimensional version of the chi-square distribution;
it characterizes the empirical scatter-matrix estimator for the multivariate
Gaussian distribution. Let $X$ be an $n$-sample consisting of independent realizations of
a random Gaussian vector with $d$ dimensions, zero mean and covariance matrix $S$.
Then the scatter matrix $X = {}^t\!XX$ follows a central Wishart distribution with scale matrix
$S$ and degree of freedom $n$, denoted by $X \sim \mathcal{W}_d(n, S)$. Its density function is

$$ \mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}(S^{-1}X)\right\}}{2^{\frac{nd}{2}}\, |S|^{\frac{n}{2}}\, \Gamma_d\!\left(\frac{n}{2}\right)}, $$

where for $y > 0$, $\Gamma_d(y) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(y - \frac{j-1}{2}\right)$ is the multivariate gamma
function. Let us remark immediately that this definition implies that $n$ is constrained
to be strictly greater than $d - 1$.
The Wishart distribution is an exponential family, since

$$ \mathcal{W}_d(X; \theta_n, \theta_S) = \exp\left\{ \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \left\langle \theta_S, -\tfrac{1}{2}X \right\rangle_{HS} + k(X) - F(\theta_n, \theta_S) \right\}, $$

with natural parameters $\theta_n = \frac{n-d-1}{2}$ and $\theta_S = S^{-1}$, carrier measure $k(X) = 0$, and log-normalizer

$$ F(\theta_n, \theta_S) = \left(\theta_n + \tfrac{d+1}{2}\right)\left(d\log 2 - \log|\theta_S|\right) + \log \Gamma_d\!\left(\theta_n + \tfrac{d+1}{2}\right). \qquad (11.28) $$
Let us recall (see Sect. 11.2.1) that the MLE is obtained by mapping $(\nabla F)^{-1}$ onto the
average of the sufficient statistics. Finding $(\nabla F)^{-1}$ amounts here to solving the following
system (see Eqs. 11.5 and 11.28):

$$ \begin{cases} d\log 2 - \log|\theta_S| + \Psi_d\!\left(\theta_n + \frac{d+1}{2}\right) = \eta_n, \\ -\left(\theta_n + \frac{d+1}{2}\right)\theta_S^{-1} = \eta_S, \end{cases} \qquad (11.29) $$

where $\Psi_d$ denotes the derivative of $\log\Gamma_d$.
Since this system has no closed-form solution, consider first the sub-family $\mathcal{W}_{d,n}$ obtained by fixing $n$, whose log-normalizer reduces to

$$ F_n(\theta_S) = \frac{nd}{2}\log 2 - \frac{n}{2}\log|\theta_S| + \log\Gamma_d\!\left(\frac{n}{2}\right). \qquad (11.30) $$

Using classical results for matrix derivatives, Eq. (11.5) can then be easily solved:

$$ -\frac{n}{2}\hat{\theta}_S^{-1} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}X_i \;\Longrightarrow\; \hat{\theta}_S = Nn\left(\sum_{i=1}^{N} X_i\right)^{-1}. \qquad (11.31) $$
Similarly, in the sub-family $\mathcal{W}_{d,S}$ with $S$ fixed, the MLE for $\theta_n$ is

$$ \hat{\theta}_n = \Psi_d^{-1}\!\left(\frac{1}{N}\sum_{i=1}^{N}\log|X_i| - \log|2S|\right) - \frac{d+1}{2}, \qquad (11.33) $$

with $\Psi_d^{-1}$ the functional reciprocal of $\Psi_d$. The latter can be computed with any
optimization method on a bounded domain (e.g., Brent's method [24]). Let us mention
that the notation is simplified here, since $\hat{\theta}_S$ and $\hat{\theta}_n$ should be indexed by their
corresponding family. Algorithm 6 summarizes the estimation of the parameters $\hat{\theta}$ of
the Wishart distribution. By precomputing $\left(\sum_{i=1}^{N} X_i\right)^{-1}$ and $N^{-1}\sum_{i=1}^{N}\log|X_i|$,
much computation time can be saved. The computation of $\Psi_d^{-1}$ remains an
expensive part of the algorithm.
Let us now prove the convergence and the consistency of this method. Maximizing
$\bar{l}$ amounts equivalently to minimizing $E(\theta) = F(\theta) - \left\langle \frac{1}{N}\sum_{i=1}^{N} t(X_i), \theta \right\rangle$. The
following properties are satisfied by $E$:
• The Hessian $\nabla^2 E = \nabla^2 F$ of $E$ is positive definite on $\Theta$, since $F$ is strictly convex.
• Its unique minimizer on $\Theta$ is the MLE $\hat{\theta} = \nabla F^*\!\left(\frac{1}{N}\sum_{i=1}^{N} t(X_i)\right)$ whenever it
exists (although $F^*$ is not known for the Wishart family, and $F$ is not separable).
Algorithm 6: MLE for the Wishart distribution
Input: A sample $X_1, \ldots, X_N$ of symmetric positive definite matrices
Output: Estimate $\hat{\theta}$ given by the terminal values of the MLE sequences $\{\hat{\theta}_n^{(t)}\}$ and $\{\hat{\theta}_S^{(t)}\}$
   // Initialization of the $\{\hat{\theta}_n^{(t)}\}$ sequence
1  $\hat{\theta}_n^{(0)} = 1$; $t = 0$;
2  repeat
       // Compute the MLE in $\mathcal{W}_{d,n}$ using Eq. 11.31
       $\hat{\theta}_S^{(t+1)} = Nn\left(\sum_{i=1}^{N} X_i\right)^{-1}$ with $n = 2\hat{\theta}_n^{(t)} + d + 1$;
       // Compute the MLE in $\mathcal{W}_{d,S}$ using Eq. 11.33
       $\hat{\theta}_n^{(t+1)} = \Psi_d^{-1}\!\left(\frac{1}{N}\sum_{i=1}^{N}\log|X_i| - \log|2S|\right) - \frac{d+1}{2}$ with $S = \left(\hat{\theta}_S^{(t+1)}\right)^{-1}$;
       $t = t + 1$;
3  until convergence of the likelihood;
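A sketch of this procedure with SciPy, where the multivariate digamma Ψ_d is the derivative of log Γ_d and its inverse is obtained by root bracketing (the function names are ours, not from the chapter):

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def psi_d(y, d):
    # Multivariate digamma: derivative of log Gamma_d at y
    return sum(digamma(y - j / 2.0) for j in range(d))

def psi_d_inv(target, d, hi=1e6):
    # Invert psi_d on its domain y > (d-1)/2 by Brent's method
    lo = (d - 1) / 2.0 + 1e-6
    return brentq(lambda y: psi_d(y, d) - target, lo, hi)

def wishart_mle(X_list, max_iter=100, tol=1e-8):
    """Sketch of Algorithm 6: alternate the closed-form MLE of theta_S
    in W_{d,n} (Eq. 11.31) with the MLE of theta_n in W_{d,S} (Eq. 11.33)."""
    d, N = X_list[0].shape[0], len(X_list)
    sum_X_inv = np.linalg.inv(np.sum(X_list, axis=0))                 # precomputed
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in X_list])  # precomputed
    theta_n, theta_S = 1.0, None
    for _ in range(max_iter):
        n = 2.0 * theta_n + d + 1.0
        theta_S = N * n * sum_X_inv                                   # Eq. (11.31)
        logdet_2S = d * np.log(2.0) - np.linalg.slogdet(theta_S)[1]   # log|2S|, S = theta_S^{-1}
        new_theta_n = psi_d_inv(mean_logdet - logdet_2S, d) - (d + 1.0) / 2.0
        if abs(new_theta_n - theta_n) < tol:
            theta_n = new_theta_n
            break
        theta_n = new_theta_n
    return theta_n, theta_S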
Algorithm 6 is an instance of grouped coordinate descent [25] on $E$:

$$ \hat{\theta}_S^{(t+1)} = \arg\min_{\theta_S} E(\hat{\theta}_n^{(t)}, \theta_S) \qquad (11.34) $$

$$ \hat{\theta}_n^{(t+1)} = \arg\min_{\theta_n} E(\theta_n, \hat{\theta}_S^{(t+1)}) \qquad (11.35) $$
The resulting sequences $\{\hat{\theta}_n^{(t)}\}$ and $\{\hat{\theta}_S^{(t)}\}$ are shown to converge linearly to the coordinates
of $\hat{\theta}$.
Looking carefully at the previous algorithms, let us remark that the initialization
methods require being able to compute the divergence $B_{F^*}$ between two elements $\eta_1$ and
$\eta_2$ of the expectation space $H$. Whereas $F^*$ is known for $\mathcal{W}_{d,n}$ and $\mathcal{W}_{d,S}$, Eq. (11.9)
gives a potential solution for $\mathcal{W}_d$ by considering $B_F$ on the natural parameters $\theta_2$ and $\theta_1$
in $\Theta$. Searching for the correspondence $H \mapsto \Theta$ is analogous to computing the MLE for
a single observation...
The previous MLE procedure does not converge with a single observation $X_1$.
Bogdan and Bogdan [26] proved that the MLE exists and is unique in an exponential
family iff the affine envelope of the $N$ points $t(X_1), \ldots, t(X_N)$ is of dimension $D$, the
order of this exponential family. Since the affine envelope of $t(X_1)$ is of dimension
$d \times d$ (instead of $D = d \times d + 1$), the MLE does not exist and the likelihood function
goes to infinity.² Unboundedness of the likelihood function is a well-known problem that
can be tackled by adding a penalty term to it [27]. A simpler solution is to take the
MLE in the family $\mathcal{W}_{d,n}$ for some $n$ (known or arbitrarily fixed above $d - 1$) instead
of $\mathcal{W}_d$.
Looking at the KL divergences of the two Wishart sub-families $\mathcal{W}_{d,n}$ and $\mathcal{W}_{d,S}$ gives
an interesting perspective on this formula. Applying Eqs. 11.9 and 11.8, it follows that

$$ \mathrm{KL}(\mathcal{W}_{d,n}^1 \,\|\, \mathcal{W}_{d,n}^2) = \frac{n}{2}\left( -\log\frac{|S_1|}{|S_2|} + \mathrm{tr}(S_2^{-1}S_1) - d \right) \qquad (11.37) $$

$$ \mathrm{KL}(\mathcal{W}_{d,S}^1 \,\|\, \mathcal{W}_{d,S}^2) = -\log\frac{\Gamma_d\!\left(\frac{n_1}{2}\right)}{\Gamma_d\!\left(\frac{n_2}{2}\right)} + \frac{n_1 - n_2}{2}\,\Psi_d\!\left(\frac{n_1}{2}\right) \qquad (11.38) $$

Detailed calculations can be found in the Appendix. Notice that $\mathrm{KL}(\mathcal{W}_d^1 \,\|\, \mathcal{W}_d^2)$ is
simply the sum of these two divergences.
² The product $\hat{\theta}_n^{(t)}\hat{\theta}_S^{(t)}$ is constant through the iterations.
In this part, some simple simulations are given for $d = 2$. Since the observations are
positive semi-definite matrices, it is possible to visualize them as ellipses parameterized
by their eigendecompositions. For example, Fig. 11.1 shows 20 matrices
generated from $\mathcal{W}_d(.; n, S)$ for $n = 5$ and for $n = 50$, with $S$ having eigenvalues
$\{2, 1\}$. This visualization highlights the difficulty of estimating the parameters
(even for small $d$) when $n$ is small.
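A sketch of such an ellipse rendering for a 2×2 SPD matrix with matplotlib (illustrative parameterization; one may also take the square roots of the eigenvalues as radii):

import numpy as np
from matplotlib.patches import Ellipse

def spd_ellipse(X, center=(0.0, 0.0), **kwargs):
    """Represent a 2x2 SPD matrix X as an ellipse parameterized by its
    eigendecomposition: axes along the eigenvectors, radii equal to the
    eigenvalues (as coded here)."""
    vals, vecs = np.linalg.eigh(X)   # eigenvalues in ascending order
    angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
    return Ellipse(center, width=2 * vals[-1], height=2 * vals[0],
                   angle=angle, fill=False, **kwargs)

# Usage: ax.add_patch(spd_ellipse(np.array([[2.0, 0.5], [0.5, 1.0]])))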
Then, a dataset of 60 matrices is generated from a three-component mixture
with parameters $\mathcal{W}_d(.; 10, S_1)$, $\mathcal{W}_d(.; 20, S_2)$, $\mathcal{W}_d(.; 30, S_3)$ and equal weights
$w_1 = w_2 = w_3 = 1/3$. The respective eigenvalues of $S_1, S_2, S_3$ are in turn
$\{2, 1\}$, $\{2, 0.5\}$, $\{1, 1\}$. Figure 11.2 illustrates this dataset. To study the influence of
a good initialization for k-MLE, the Normalized Mutual Information (NMI) [28]
is computed between the final partition and the ground-truth partition for different
initializations. This value, between 0 and 1, is higher when the two partitions are more
similar. The following table gives the average and standard deviation of the NMI over 30 runs.
From this small experiment, we can easily verify the importance of a good initialization.
Also, the partitions having the highest NMI are reported in Fig. 11.4 for each
method. Let us mention that the Hartigan method almost always gives a better partition
than Lloyd's for the same initial mixture.
A last simulation indicates that the initialization with DP-k-MLE++ is very
sensitive to its parameter $\lambda$. Again with the same set of matrices, Fig. 11.3 shows how
the number of generated clusters $K$ and the average log-likelihood evolve with $\lambda$.
Not surprisingly, both quantities decrease when $\lambda$ increases.
Fig. 11.4 Best partitions with Rand. Init/Lloyd (left), Rand. Init/Hartigan (middle), k-MLE++
Hartigan (right)
Within the same exponential family $p_F$, the integral of the product of two mixtures is

$$ \int m(x)\, m'(x)\, dx = \sum_{j=1}^{K}\sum_{j'=1}^{K'} w_j\, w'_{j'} \int p_F(x; \theta_j)\, p_F(x; \theta'_{j'})\, dx. \qquad (11.41) $$
When the carrier measure $k(X) = 0$, as is the case for $\mathcal{W}_d$ but not for $\mathcal{W}_{d,n}$ and $\mathcal{W}_{d,S}$, the
integral can be further expanded as

$$ \int p_F(X; \theta_j)\, p_F(X; \theta'_{j'})\, dX = \int e^{\langle\theta_j, t(X)\rangle - F(\theta_j)}\, e^{\langle\theta'_{j'}, t(X)\rangle - F(\theta'_{j'})}\, dX $$
$$ = \int e^{\langle\theta_j + \theta'_{j'}, t(X)\rangle - F(\theta_j) - F(\theta'_{j'})}\, dX $$
$$ = e^{F(\theta_j + \theta'_{j'}) - F(\theta_j) - F(\theta'_{j'})} \underbrace{\int e^{\langle\theta_j + \theta'_{j'}, t(X)\rangle - F(\theta_j + \theta'_{j'})}\, dX}_{=1}. $$

Note that $\theta_j + \theta'_{j'}$ must be in the natural parameter space $\Theta$ to ensure that $F(\theta_j + \theta'_{j'})$
is finite. An equivalent condition is that $\Theta$ is a convex cone.
When $p_F = \mathcal{W}_d$, the space $\Theta = \left]-1; +\infty\right[ \times \mathcal{S}_{++}^d$ is not a convex cone, since
$\theta_{n_j} + \theta'_{n_{j'}} < -1$ for $n_j$ and $n_{j'}$ smaller than $d + 1$. Practically, this constraint
is tested for each parameter pair before going on with the computation of the CS
divergence. Writing $\Delta(\theta, \theta') = F(\theta + \theta') - F(\theta) - F(\theta')$, it follows that

$$ \mathrm{CS}(m : m') = \frac{1}{2}\log\left( \sum_{j=1}^{K}\sum_{j'=1}^{K} w_j w_{j'}\, e^{\Delta(\theta_j, \theta_{j'})} \right) \quad \text{(within } m\text{)} $$
$$ \quad + \frac{1}{2}\log\left( \sum_{j=1}^{K'}\sum_{j'=1}^{K'} w'_j w'_{j'}\, e^{\Delta(\theta'_j, \theta'_{j'})} \right) \quad \text{(within } m'\text{)} $$
$$ \quad - \log\left( \sum_{j=1}^{K}\sum_{j'=1}^{K'} w_j w'_{j'}\, e^{\Delta(\theta_j, \theta'_{j'})} \right) \quad \text{(between } m \text{ and } m'\text{)} \qquad (11.42) $$
Note that the CS divergence is symmetric, since $\Delta(\theta_j, \theta'_{j'})$ is. A numerical value of
$\Delta(\theta_j, \theta'_{j'})$ can be computed for $\mathcal{W}_d$ from Eq. 11.28 (see Eq. 11.45 or 11.46 in the
Appendix).
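A sketch of Eq. (11.42), assuming the Δ values have been precomputed into matrices (hypothetical names; in practice one would use a log-sum-exp for numerical stability):

import numpy as np

def cs_divergence(w, Delta_mm, wp, Delta_mpmp, Delta_mmp):
    """Cauchy-Schwarz divergence between two mixtures of the same
    exponential family; Delta_* hold Delta(theta, theta') for the pairs
    within m, within m', and between m and m'."""
    within_m  = w  @ np.exp(Delta_mm)   @ w
    within_mp = wp @ np.exp(Delta_mpmp) @ wp
    between   = w  @ np.exp(Delta_mmp)  @ wp
    return 0.5 * np.log(within_m) + 0.5 * np.log(within_mp) - np.log(between)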
To conclude this section, let us recall the elements of our proposal for a motion
retrieval system. A movement is represented by a Wishart mixture model learned by
k-MLE, initialized by DP-k-MLE++ or DP-comp-k-MLE. In the case of a single-component
mixture, a simple application of the MLE for $\mathcal{W}_{d,n}$ is sufficient. Although the
Wishart distribution appears to be an inadequate model for the scatter matrix $X$ of a movement,
it has been shown that this crude assumption provides good classification rates
on a real dataset [23]. Learning the representations of the movements may be performed
offline, since it is computationally demanding. Using the CS divergence as dissimilarity, we
can then extract a taxonomy of movements with any spectral clustering algorithm. For
a query movement, its representation by a mixture has to be computed first. Then it is
possible to search the database for the most similar movements according to the CS
divergence, or to predict its type by a majority vote among them. More details of the
implementation and results for the real dataset will appear in a forthcoming technical report.
Hartigan's swap clustering method for k-MLE was studied for the general case of
an exponential family. Unlike for k-means, this method does not guarantee to avoid
empty clusters, but it generally achieves better performance than Lloyd's heuristic.
Appendix A
$$ \mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2}\mathrm{tr}(S^{-1}X)\}}{2^{\frac{nd}{2}}\, |S|^{\frac{n}{2}}\, \Gamma_d\!\left(\frac{n}{2}\right)} $$
$$ = \exp\left\{ \frac{n-d-1}{2}\log|X| - \frac{1}{2}\mathrm{tr}(S^{-1}X) - \frac{nd}{2}\log 2 - \frac{n}{2}\log|S| - \log\Gamma_d\!\left(\frac{n}{2}\right) \right\} $$
Letting $\theta_n = \frac{n-d-1}{2}$ and $\theta_S = S^{-1}$, the log-normalizer reads

$$ F(\Theta) = \left(\theta_n + \frac{d+1}{2}\right)\left(d\log 2 - \log|\theta_S|\right) + \log\Gamma_d\!\left(\theta_n + \frac{d+1}{2}\right) $$

$$ \frac{\partial F}{\partial\theta_n}(\theta_n, \theta_S) = d\log 2 - \log|\theta_S| + \Psi_d\!\left(\theta_n + \frac{d+1}{2}\right) \qquad (11.43) $$
In the source parameters $\lambda = (n, S)$ and $\lambda' = (n', S')$,

$$ \Delta(\lambda, \lambda') = -\frac{d+1}{2}\,d\log 2 - \frac{n}{2}\log|S| - \frac{n'}{2}\log|S'| - \frac{n + n' - d - 1}{2}\log|S^{-1} + S'^{-1}| + \log\frac{\Gamma_d\!\left(\frac{n+n'-d-1}{2}\right)}{\Gamma_d\!\left(\frac{n}{2}\right)\Gamma_d\!\left(\frac{n'}{2}\right)} \qquad (11.46) $$
$$ \mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2}\mathrm{tr}(S^{-1}X)\}}{2^{\frac{nd}{2}}\, |S|^{\frac{n}{2}}\, \Gamma_d\!\left(\frac{n}{2}\right)} = \exp\left\{ \frac{n-d-1}{2}\log|X| - \frac{1}{2}\mathrm{tr}(S^{-1}X) - \frac{nd}{2}\log 2 - \frac{n}{2}\log|S| - \log\Gamma_d\!\left(\frac{n}{2}\right) \right\} $$

Letting $\theta_S = S^{-1}$,

$$ \mathcal{W}_d(X; n, \theta_S) = \exp\left\{ -\frac{1}{2}\mathrm{tr}(\theta_S X) + \frac{n-d-1}{2}\log|X| - \frac{nd}{2}\log 2 - \frac{n}{2}\log|\theta_S^{-1}| - \log\Gamma_d\!\left(\frac{n}{2}\right) \right\} $$
$$ = \exp\left\{ \left\langle \theta_S, -\frac{1}{2}X \right\rangle_{HS} + k_n(X) - F_n(\theta_S) \right\} $$

with $F_n(\theta_S) = \frac{nd}{2}\log 2 - \frac{n}{2}\log|\theta_S| + \log\Gamma_d\!\left(\frac{n}{2}\right)$
and $k_n(X) = \frac{n-d-1}{2}\log|X|$.
2
∂log|X |
Using the rule ∂X =t (X −1 ) [33] and the symmetry of θ S , we get
n
∈θ S Fn (θ S ) = − θ−1
2 S
The correspondence between natural parameter θ S and expectation parameter η S is
n n −1
η S = ∈θ S Fn (θ S ) = − θ−1 ∗ −1
S √∇ θ S = ∈η S Fn (η S ) = (∈θ S Fn ) (η S ) = − η S
2 2
Finally, we obtain the MLE for θ S in this sub family:
⎛−1 N ⎛−1
1 1
N
n
θ̂ S = − − Xi = nN Xi
2 N 2
i=1 i=1
$$ \mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\{-\frac{1}{2}\mathrm{tr}(S^{-1}X)\}}{|2S|^{\frac{n}{2}}\, \Gamma_d\!\left(\frac{n}{2}\right)} = \exp\left\{ \frac{n-d-1}{2}\log|X| - \frac{1}{2}\mathrm{tr}(S^{-1}X) - \frac{n}{2}\log|2S| - \log\Gamma_d\!\left(\frac{n}{2}\right) \right\} $$

Letting $\theta_n = \frac{n-d-1}{2}$ ($n = 2\theta_n + d + 1$),

$$ \mathcal{W}_d(X; \theta_n, S) = \exp\left\{ \theta_n\log|X| - \frac{1}{2}\mathrm{tr}(S^{-1}X) - \left(\theta_n + \frac{d+1}{2}\right)\log|2S| - \log\Gamma_d\!\left(\theta_n + \frac{d+1}{2}\right) \right\} $$
$$ = \exp\left\{ \langle\theta_n, \log|X|\rangle + k_S(X) - F_S(\theta_n) \right\} $$

with $F_S(\theta_n) = \left(\theta_n + \frac{d+1}{2}\right)\log|2S| + \log\Gamma_d\!\left(\theta_n + \frac{d+1}{2}\right)$
and $k_S(X) = -\frac{1}{2}\mathrm{tr}(S^{-1}X)$.
The expectation parameter is $\eta_n = \nabla F_S(\theta_n) = \log|2S| + \Psi_d\!\left(\theta_n + \frac{d+1}{2}\right)$, so that

$$ \iff \Psi_d\!\left(\theta_n + \frac{d+1}{2}\right) = \eta_n - \log|2S| $$
$$ \iff \theta_n + \frac{d+1}{2} = \Psi_d^{-1}\!\left(\eta_n - \log|2S|\right) $$
$$ \iff \theta_n = \Psi_d^{-1}\!\left(\eta_n - \log|2S|\right) - \frac{d+1}{2} = (\nabla F_S)^{-1}(\eta_n) = \nabla F_S^*(\eta_n) $$
⁴ Since $|2S| = 2^d|S|$, the normalization $2^{\frac{nd}{2}}|S|^{\frac{n}{2}}$ is equivalent to $|2S|^{\frac{n}{2}}$.
The MLE of $\theta_n$, hence of $n$, follows:

$$ \frac{\hat{n} - d - 1}{2} = \Psi_d^{-1}\!\left( \frac{1}{N}\sum_{i=1}^{N}\log|X_i| - \log|2S| \right) - \frac{d+1}{2} $$

$$ \hat{n} = 2\,\Psi_d^{-1}\!\left( \frac{1}{N}\sum_{i=1}^{N}\log|X_i| - \log|2S| \right) $$
The Bregman divergence associated with $F_S^*$ is then

$$ B_{F_S^*}(\eta_{n_1} : \eta_{n_2}) = F_S^*(\eta_{n_1}) - F_S^*(\eta_{n_2}) - \langle \eta_{n_1} - \eta_{n_2},\, \nabla F_S^*(\eta_{n_2}) \rangle_{HS} $$
$$ = \Psi_d^{-1}(\eta_{n_1} - \log|2S|)\,(\eta_{n_1} - \log|2S|) - \frac{d+1}{2}\eta_{n_1} - \log\Gamma_d\!\left(\Psi_d^{-1}(\eta_{n_1} - \log|2S|)\right) $$
$$ \quad - \Psi_d^{-1}(\eta_{n_2} - \log|2S|)\,(\eta_{n_2} - \log|2S|) + \frac{d+1}{2}\eta_{n_2} + \log\Gamma_d\!\left(\Psi_d^{-1}(\eta_{n_2} - \log|2S|)\right) $$
$$ \quad - \left\langle \eta_{n_1} - \eta_{n_2},\, \Psi_d^{-1}(\eta_{n_2} - \log|2S|) - \frac{d+1}{2} \right\rangle_{HS} $$

which simplifies to

$$ B_{F_S^*}(\eta_{n_1} : \eta_{n_2}) = \log\frac{\Gamma_d\!\left(\Psi_d^{-1}(\eta_{n_2} - \log|2S|)\right)}{\Gamma_d\!\left(\Psi_d^{-1}(\eta_{n_1} - \log|2S|)\right)} - \left[\Psi_d^{-1}(\eta_{n_2} - \log|2S|) - \Psi_d^{-1}(\eta_{n_1} - \log|2S|)\right](\eta_{n_1} - \log|2S|) $$
References
1. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley Series in
Probability and Statistics. Wiley-Interscience, New York (2008)
2. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J.
Mach. Learn. Res. 6, 1705–1749 (2005)
3. Nielsen, F.: k-MLE: a fast algorithm for learning statistical mixture models. In: International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 869–872 (2012). Long
version as arXiv:1203.5181
4. Jain, A.K.: Data clustering: 50 years beyond K -means. Pattern Recogn. Lett. 31, 651–666
(2010)
5. Wishart, J.: The generalised product moment distribution in samples from a Normal multivariate
population. Biometrika 20(1/2), 32–52 (1928)
6. Tsai, M.-T.: Maximum likelihood estimation of Wishart mean matrices under Löwner order
restrictions. J. Multivar. Anal. 98(5), 932–944 (2007)
7. Formont, P., Pascal, T., Vasile, G., Ovarlez, J.-P., Ferro-Famil, L.: Statistical classification for
heterogeneous polarimetric SAR images. IEEE J. Sel. Top. Sign. Proces. 5(3), 567–576 (2011)
8. Jian, B., Vemuri, B.: Multi-fiber reconstruction from diffusion MRI using mixture of wisharts
and sparse deconvolution. In: Information Processing in Medical Imaging, pp. 384–395,
Springer, Berlin (2007)
9. Cherian, A., Morellas, V., Papanikolopoulos, N., Bedros, S.: Dirichlet process mixture mod-
els on symmetric positive definite matrices for appearance clustering in video surveillance
applications. In: Computer Vision and Pattern Recognition (CVPR), pp. 3417–3424 (2011)
10. Nielsen, F., Garcia, V.: Statistical exponential families: a digest with flash cards. http://arxiv.org/abs/0911.4863. Accessed Nov 2009
11. Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1997)
12. Wainwright, M.J., Jordan, M.J.: Graphical models, exponential families, and variational infer-
ence. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. J. Roy. Stat. Soc. B (Methodological) 39(1), 1–38 (1977)
14. Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic ver-
sions. Comput. Stat. Data Anal. 14(3), 315–332 (1992)
15. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat.
Soc. C (Applied Statistics). 28(1), 100–108 (1979)
16. Telgarsky, M., Vattani, A.: Hartigan’s method: k-means clustering without Voronoi. In: Pro-
ceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp.
820–827 (2010)
17. Nielsen, F., Boissonnat, J.D., Nock, R.: On Bregman Voronoi diagrams. In: ACM-SIAM Sym-
posium on Discrete Algorithms, pp. 746–755 (2007)
18. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3),
645–678 (2005)
19. Kulis, B., Jordan, M.I.: Revisiting k-means: new algorithms via Bayesian nonparametrics. In:
International Conference on Machine Learning (ICML) (2012)
20. Ackermann, M.R.: Algorithms for the Bregman K -median problem. PhD thesis. Paderborn
University (2009)
21. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings
of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035
(2007)
22. Ji, S., Krishnapuram, B., Carin, L.: Variational Bayes for continuous hidden Markov models
and its application to active learning. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 522–532
(2006)
23. Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture
model: application to movement clustering. Pattern Recogn. Lett. 31(14), 2318–2324 (2010)
24. Brent. R.P.: Algorithms for Minimization Without Derivatives. Courier Dover Publications,
Mineola (1973)
25. Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., Windham, M.P.: Local convergence
analysis of a grouped variable version of coordinate descent. J. Optim. Theory Appl. 54(3),
471–477 (1987)
26. Bogdan, K., Bogdan, M.: On existence of maximum likelihood estimators in exponential fam-
ilies. Statistics 34(2), 137–149 (2000)
27. Ciuperca, G., Ridolfi, A., Idier, J.: Penalized maximum likelihood estimator for normal mix-
tures. Scand. J. Stat. 30(1), 45–59 (2003)
28. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison:
variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–
2854 (2010)
29. Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: Inter-
national Conference on Pattern Recognition (ICPR), pp. 1723–1726 (2012)
30. Haff, L.R., Kim, P.T., Koo, J.-Y., Richards, D.: Minimax estimation for mixtures of Wishart
distributions. Ann. Stat. 39(6), 3417–3440 (2011)
31. Jebara, T., Kondor, R., Howard, A.: Probability product kernels. J. Mach. Learn. Res. 5, 819–
844 (2004)
32. Moreno, P.J., Ho, P., Vasconcelos, N.: A Kullback-Leibler divergence based kernel for SVM
classification in multimedia applications. In: Advances in Neural Information Processing Sys-
tems (2003)
33. Petersen, K.B., Pedersen, M.S.: The matrix cookbook. http://www2.imm.dtu.dk/pubdb/p.php?
3274. Accessed Nov 2012
Chapter 12
Morphological Processing of Univariate
Gaussian Distribution-Valued Images Based
on Poincaré Upper-Half Plane Representation
12.1 Introduction
J. Angulo (B)
CMM-Centre de Morphologie Mathématique, Mathématiques et Systèmes,
MINES ParisTech, Paris, France
e-mail: [email protected]
S. Velasco-Forero
Department of Mathematics, National University of Singapore, Buona Vista, Singapore
e-mail: [email protected]
where Δ is the support space of pixels p (e.g., Δ ⊂ Z² for 2D images) and N denotes the family of univariate Gaussian probability distribution functions (pdfs). Nowadays most imaging sensors produce only single scalar values, since CCD (charge-coupled device) cameras typically integrate the light (arriving photons) during a given exposure time Δt. To increase the signal-to-noise ratio (SNR), the exposure time is increased to Δt′ = τΔt, τ > 1. If we suppose that τ is a positive integer, this is equivalent to a multiple acquisition of τ frames of exposure Δt each (i.e., a kind of temporal oversampling). The standard approach only considers the sum (or average) of the multiple intensities [28], without taking into account the variance, which is a basic estimator of the noise and useful for probabilistic image processing. Another example of such a representation, built from a gray scale image, consists in describing each pixel by the mean and the variance of the intensity distribution over its centered neighboring patch. This model has for instance recently been used in [10] for computing local estimators which can be interpreted as pseudo-morphological operators.
Let us consider the example of a gray scale image parameterized by the mean and the standard deviation of patches given in Fig. 12.1. We observe that the underlying geometry of this space of patches is not Euclidean; e.g., the geodesics are clearly curves. In fact, as we discuss in the paper, this parametrization corresponds to one of the models of hyperbolic geometry.
Henceforth, the corresponding image processing operators should be able to deal with Gaussian distribution-valued pixels. In particular, morphological operators for images f ∈ F(Δ, N) require that the space of Gaussian distributions N be endowed with a partial ordering leading to a complete lattice structure. In practice, this means that given a set of Gaussian pdfs, such as the example of Fig. 12.2, we need to be able to define a Gaussian pdf corresponding to the infimum (inf) of the set and another one corresponding to the supremum (sup). Mathematical morphology is a nonlinear image processing methodology based on the computation of sup/inf-convolution filters (i.e., dilation/erosion operators) in local neighborhoods [31]. Mathematical morphology is theoretically formulated in the framework of complete lattices and operators defined on them [21, 29]. When only the supremum or the infimum is well defined, other morphological operators can be formulated in the framework of complete semilattices [22, 23]. Both cases are considered here for images f ∈ F(Δ, N).
A possible way to deal with the partial ordering problem of N could be founded on stochastic ordering (or stochastic dominance) [30], which is basically defined in terms of majorization of cumulative distribution functions.
However, we prefer to adopt here an information geometry approach [4], based on considering the univariate Gaussian pdfs as points in a hyperbolic space [3, 9]. More generally, Fisher geometry amounts to a hyperbolic geometry of constant curvature for other location-scale families of probability distributions (Cauchy, Laplace, elliptical), p(x; μ, σ) = (1/σ) f((x − μ)/σ), where the curvature depends on
Fig. 12.1 Parametrization of a gray scale image where each pixel is described by the mean and the variance of the intensity distribution over its centered neighboring patch of 5 × 5 pixels: a left, original gray scale image; center, image of the mean of each patch; right, image of the standard deviation of each patch. b Visualization of the patches according to their coordinates in the mean/std.dev. space
Fig. 12.2 a Example of a set of nine univariate Gaussian pdfs, N_k(μ_k, σ_k²), 1 ≤ k ≤ 9. b The same set of Gaussian pdfs represented as points of coordinates (x_k = μ_k/√2, y_k = σ_k) in the upper-half plane
the dimension and the density profile [3, 14, 15]. For a deeper flavor of hyperbolic geometry, see [12]. There are several models representing the hyperbolic space in R^d, d > 1, such as the three following ones: the (Poincaré) upper half-space model H^d, the Poincaré disk model P^d and the Klein disk model K^d.
1. The (Poincaré) upper half-space model is the domain H^d = {(x_1, ..., x_d) ∈ R^d | x_d > 0} with the Riemannian metric ds² = (dx_1² + ··· + dx_d²)/x_d²;
2. The Poincaré disk model is the domain P^d = {(x_1, ..., x_d) ∈ R^d | x_1² + ··· + x_d² < 1} with the Riemannian metric ds² = 4 (dx_1² + ··· + dx_d²)/(1 − x_1² − ··· − x_d²)²;
3. The Klein disk model is the space K^d = {(x_1, ..., x_d) ∈ R^d | x_1² + ··· + x_d² < 1} with the Riemannian metric ds² = (dx_1² + ··· + dx_d²)/(1 − x_1² − ··· − x_d²) + (x_1 dx_1 + ··· + x_d dx_d)²/(1 − x_1² − ··· − x_d²)².
These models are isomorphic to each other, in the sense that one-to-one correspondences can be set up between the points and lines of one model and those of another so as to preserve the relations of incidence, betweenness and congruence. In particular, there exists an isometric mapping between any pair of these models, and the analytical transformations to convert from one to another are well known [12].
The Klein disk model has been considered for instance in computational information geometry (Voronoi diagrams, clustering, etc.) [26], and the Poincaré disk model in information geometric radar processing [5–7]. In this paper, we focus on the Poincaré half-plane model, H², which is sufficient for our practical purposes of manipulating Gaussian pdfs. Figure 12.2b illustrates the example of a set of nine Gaussian pdfs N_k(μ_k, σ_k²) represented as points of coordinates (μ_k/√2, σ_k) in the upper-half plane, following the embedding
(μ, σ) ↦ z = μ/√2 + iσ.
The rationale behind the scaling factor √2 is given in Sect. 12.3.
In summary, from a theoretical viewpoint, the aim of this paper is to endow H² with partial orderings which lead to useful invariance properties, in order to formulate appropriate morphological operators for images f: Δ → H². This work is an extension of the conference paper [1]. The rest of the paper is organized as follows. Section 12.2 reviews the basics of the geometry of the Poincaré half-plane model. The connection between the Poincaré half-plane model of hyperbolic geometry and the Fisher information geometry of Gaussian distributions is briefly recalled in Sect. 12.3. Then, various partial orderings on H² are studied in Sect. 12.4. Based on the corresponding complete lattice structure of H², Sect. 12.5 presents the definition of morphological operators for images in F(Δ, H²) and their application to the morphological processing of univariate Gaussian distribution-valued images. Section 12.6 concludes the paper with the perspectives of the present work.
In complex analysis, the upper-half plane is the set of complex numbers with positive imaginary part:
H² = {z = x + iy ∈ C | y > 0}. (12.1)
We also use the notation x = Re(z) and y = Im(z). The boundary of the upper-half plane (sometimes called the circle at infinity) is the real axis together with the point at infinity, i.e., ∂H² = R ∪ {∞} = {z = x + iy | y = 0, x = ±∞ or y = ∞}.
ds² = Σ_{k,l=1,2} g_kl dx^k dx^l = (dx² + dy²)/y² = |dz|²/y² = y⁻¹ dz y⁻¹ dz*. (12.3)
With this metric, the Poincaré upper-half plane is a complete Riemannian manifold of constant sectional curvature K equal to −1. We can consider a continuum of other hyperbolic spaces by multiplying the hyperbolic arc length (12.3) by a positive constant k, which leads to a metric of constant Gaussian curvature K = −1/k². The tangent space to H² at a point z is defined as the space of tangent vectors at z. It has the structure of a 2-dimensional real vector space, T_z H² ≅ R². The Riemannian metric (12.3) is induced by the following inner product on T_z H²: for ψ_1, ψ_2 ∈ T_z H², with ψ_k = (φ_k, β_k), we put
⟨ψ_1, ψ_2⟩_z = (ψ_1, ψ_2)/Im(z)², (12.4)
where (·, ·) denotes the Euclidean inner product. The hyperbolic angle θ between ψ_1 and ψ_2 is then given by
cos θ = ⟨ψ_1, ψ_2⟩_z/(‖ψ_1‖_z ‖ψ_2‖_z) = (ψ_1, ψ_2)/(√(ψ_1, ψ_1) √(ψ_2, ψ_2)). (12.5)
We see that this notion of angle measure coincides with the Euclidean one. Consequently, the Poincaré upper-half plane is a conformal model.
The distance between two points z_1 = x_1 + iy_1 and z_2 = x_2 + iy_2 in (H², ds²) is the function
dist_{H²}(z_1, z_2) = cosh⁻¹(1 + ((x_1 − x_2)² + (y_1 − y_2)²)/(2 y_1 y_2)). (12.6)
The distance (12.6) is derived from the logarithm of the cross-ratio between these two points and the points at infinity, i.e., dist_{H²}(z_1, z_2) = log D(z_1^∞, z_1, z_2, z_2^∞), where
D(z_1^∞, z_1, z_2, z_2^∞) = ((z_1 − z_2^∞)(z_2 − z_1^∞))/((z_1 − z_1^∞)(z_2 − z_2^∞)),
z_1^∞ and z_2^∞ being the ideal endpoints on ∂H² of the geodesic through z_1 and z_2 (z_i^∞ on the side of z_i). To obtain their equivalence, we recall that cosh⁻¹(x) = log(x + √(x² − 1)). From this formulation it is easy to check that for two points with x_1 = x_2 the distance is dist_{H²}(z_1, z_2) = |log(y_2/y_1)|.
To see that dist_{H²}(z_1, z_2) is a metric distance on H², we first notice that the argument of cosh⁻¹ always lies in [1, ∞) and that cosh(x) = (e^x + e^{−x})/2, so cosh is increasing and convex on [0, ∞). Thus cosh⁻¹(1) = 0 and cosh⁻¹ is increasing and concave on [1, ∞), growing logarithmically. The properties required of a metric (non-negativity, symmetry and the triangle inequality) are proven using the cross-ratio formulation of the distance.
We note that the distance from any point z ∈ H² to ∂H² is infinite.
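As a quick illustration of Eq. (12.6), the following is a minimal NumPy sketch of the upper half-plane distance (the helper name dist_H2 is ours):

    import numpy as np

    def dist_H2(z1: complex, z2: complex) -> float:
        """Hyperbolic distance in the Poincare upper half-plane, Eq. (12.6)."""
        x1, y1 = z1.real, z1.imag
        x2, y2 = z2.real, z2.imag
        return np.arccosh(1.0 + ((x1 - x2)**2 + (y1 - y2)**2) / (2.0 * y1 * y2))

For instance, dist_H2(1j, 2j) returns log 2, in agreement with the vertical-line formula |log(y_2/y_1)|.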
12.2.2 Geodesics
The geodesics of H² are the vertical lines, VL(a) = {z ∈ H² | Re(z) = a}, and the semi-circles in H² which meet the horizontal axis Im(z) = 0 orthogonally, SC_r(a) = {z ∈ H² | |z − z′| = r; Re(z′) = a and Im(z′) = 0}; see Fig. 12.3a. Thus, given any pair z_1, z_2 ∈ H², there is a unique geodesic connecting them, which can be parameterized for instance in polar coordinates by the angle (Eq. 12.7).
Fig. 12.3 a Geodesics of H²: z_1 and z_2 are connected by a unique semi-circle; the geodesic between z_2 and z_3 is a segment of a vertical line. b Hyperbolic polar coordinates
Then, the unique geodesic parameterized by arc length, t ↦ γ(z_1, z_2; t), γ: [0, 1] → H², joining two points z_1 = x_1 + iy_1 and z_2 = x_2 + iy_2, with γ(z_1, z_2; 0) = z_1 and γ(z_1, z_2; 1) = z_2, is given by
γ(z_1, z_2; t) = x_1 + i e^{φt + t_0}, if x_1 = x_2;
γ(z_1, z_2; t) = [r tanh(φt + t_0) + a] + i r/cosh(φt + t_0), if x_1 ≠ x_2; (12.9)
with a and r given in (12.8), and where for x_1 = x_2, t_0 = log(y_1) and φ = log(y_2/y_1), while for x_1 ≠ x_2,
t_0 = cosh⁻¹(r/y_1) = sinh⁻¹((x_1 − a)/y_1),  φ = log( y_1 (r + √(r² − y_2²)) / ( y_2 (r + √(r² − y_1²)) ) ).
If we take a parameterized smooth curve γ(t) = x(t) + iy(t), where x(t) and y(t) are continuously differentiable for b ≤ t ≤ c, then the hyperbolic length along the curve is determined by integrating the metric (12.3):
L(γ) = ∫_γ ds = ∫_b^c ‖γ̇(t)‖_{γ(t)} dt = ∫_b^c √(ẋ(t)² + ẏ(t)²)/y(t) dt = ∫_b^c |ż(t)|/y(t) dt.
Note that this expression is independent of the choice of parameterization. Hence, using the polar angle parameterization (12.7), we obtain an alternative expression of the geodesic distance:
dist_{H²}(z_1, z_2) = inf_γ L(γ) = ∫_{θ_1}^{θ_2} (r/(r sin t)) dt = |log cot(θ_2/2) − log cot(θ_1/2)|.
Fig. 12.4 a Example of interpolation of 5 points in H² between the points z_1 = −6.3 + i2.6 (in red) and z_2 = 3.5 + i0.95 (in blue) using their geodesic t ↦ γ(z_1, z_2; t), with t = 0.2, 0.4, 0.5, 0.6, 0.8. The average point (in green) corresponds to γ(z_1, z_2; 0.5) = 0.89 + i4.6. b Original (in red and blue) univariate Gaussian pdfs and the corresponding interpolated ones
Remark (Interpolation between two univariate normal distributions) Using the closed-form expression of the geodesics t ↦ γ(z_1, z_2; t) given in (12.9), it is possible to compute the average univariate Gaussian pdf between N(μ_1, σ_1²) and N(μ_2, σ_2²), with (μ_k = √2 x_k, σ_k = y_k), by taking t = 0.5. More generally, we can interpolate a series of distributions between them by discretizing t between 0 and 1. An example of such a method is given in Fig. 12.4. We note in particular that the average Gaussian pdf can have a variance bigger than both σ_1² and σ_2². We note also that, due to the "logarithmic scale" of the imaginary axis, equally spaced points in t do not have equal Euclidean arc-length on the semi-circle.
We notice that the center of the geodesic passing through (x, y) from OH2 has
Cartesian coordinates given by (tan ξ, 0); see Fig. 12.3b.
Let PSL(2, R) = SL(2, R)/{±I} be the projective special linear group, where the special linear group SL(2, R) consists of the 2 × 2 matrices with real entries whose determinant equals +1, i.e.,
g ∈ SL(2, R): g = (a b; c d), ad − bc = 1,
and I denotes the identity matrix. This defines the group of Möbius transformations M_g: H² → H² by setting, for each g ∈ SL(2, R),
z ↦ M_g(z) = (a b; c d) · z = (az + b)/(cz + d) = (ac|z|² + bd + (ad + bc)Re(z) + i Im(z))/|cz + d|²,
such that Im(M_g(z)) = y(ad − bc)/((cx + d)² + (cy)²) > 0. The inverse map is easily computed, i.e., z ↦ M_g^{-1}(z) = (dz − b)/(−cz + a). Möbius transformations are thus well defined on H² and map H² to H² homeomorphically.
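A small sketch of this action (the helper name mobius is ours; dist_H2 is the distance sketched in Sect. 12.2.1), with a numerical check of the isometry property discussed next:

    import numpy as np

    def mobius(g, z: complex) -> complex:
        """Action M_g(z) = (a z + b)/(c z + d) of g = [[a, b], [c, d]] in SL(2, R)."""
        (a, b), (c, d) = g
        return (a * z + b) / (c * z + d)

    # Moebius maps preserve the hyperbolic distance
    g = [[2.0, 1.0], [1.0, 1.0]]                      # det = 2*1 - 1*1 = 1
    z1, z2 = 0.5 + 1.0j, -1.0 + 2.0j
    assert np.isclose(dist_H2(z1, z2), dist_H2(mobius(g, z1), mobius(g, z2)))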
The Lie group PSL(2, R) acts on the upper half-plane by preserving the hyperbolic distance; in particular, for any geodesic semi-circle SC_r(a) one can find g ∈ SL(2, R) mapping it onto the imaginary axis, with
M_g(a + ir) = i.
Because PSL(2, R) acts transitively by isometries on the upper half-plane, this geodesic is mapped into the other geodesics through the action of PSL(2, R). Thus, the general unit-speed geodesic is given by
γ(t) = (a b; c d) (e^{t/2} 0; 0 e^{−t/2}) · i = (a i e^t + b)/(c i e^t + d). (12.11)
We note that the hyperbolic center is always below the Euclidean center. The inverse equations are
c = (x_c, y_c) = (x_h, y_h cosh r_h);  r = y_h sinh r_h. (12.12)
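Eq. (12.12) translates directly into code; a minimal sketch (helper name ours):

    import numpy as np

    def hyperbolic_circle_to_euclidean(ch: complex, rh: float):
        """Euclidean center and radius of the hyperbolic ball B(ch, rh), Eq. (12.12)."""
        xh, yh = ch.real, ch.imag
        return complex(xh, yh * np.cosh(rh)), yh * np.sinh(rh)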
Naturally, the hyperbolic ball of center c_h and radius r_h is defined by B_{H², r_h}(c_h) = {z ∈ H² | dist_{H²}(c_h, z) ≤ r_h}. Let us consider a hyperbolic ball centered at the origin, B_{H², r_h}((0, 1)), parameterized by its boundary curve ∂B in Euclidean coordinates:
x = r cos θ;  y = b + r sin θ,
where, using (12.12), we have b = cosh r_h and r = sinh r_h. The length of the boundary and the area of this ball are respectively given by [32]:
L(∂B) = ∫_0^{2π} r/(b + r sin θ) dθ = 2π sinh r_h, (12.13)
Area(B) = ∫∫_B dx dy/y² = ∮_γ dx/y = 2π(cosh r_h − 1). (12.14)
Comparing with a Euclidean ball, which has area πr_h² and boundary circle of length 2πr_h, and considering the Taylor series sinh r_h = r_h + r_h³/3! + r_h⁵/5! + ··· and cosh r_h = 1 + r_h²/2! + r_h⁴/4! + ···, one can note that the hyperbolic space is much larger than the Euclidean one. Curvature is defined through derivatives of the metric, but the fact that infinitesimally the hyperbolic ball grows faster than the Euclidean ball can be used as a measure of the curvature of the space at the origin (0, 1) [32]:
K = lim_{r_h → 0} 3[2πr_h − L(∂B)]/(πr_h³) = −1.
Since there is an isometry that maps the neighborhood of any point to the neighborhood of the origin, the curvature of hyperbolic space is identically constant and equal to −1.
Remark (Minimax center in H²) Finding the smallest circle that contains a whole set of points x_1, x_2, ..., x_N in the Euclidean plane is a classical problem in computational geometry, called the minimum enclosing circle (MEC) problem. Its statistical estimation is also relevant, since the unique center c* of the circle (called the 1-center or minimax center) is defined as the L^∞ center of mass, i.e., for R²,
c* = arg min_{x ∈ R²} max_{1 ≤ i ≤ N} ‖x_i − x‖₂.
Computing the smallest enclosing sphere exactly in Euclidean spaces is intractable in high dimensions, but efficient approximation algorithms have been proposed. The Bădoiu and Clarkson algorithm [8] leads to a fast and simple approximation (of known precision ε after ⌈1/ε²⌉ iterations, using the notion of core-set, and independent of the dimensionality n). The computation of the minimax center is particularly relevant in information geometry (smallest enclosing information disk [25]) and has been considered for hyperbolic models such as the Klein disk, using a Riemannian extension of the Bădoiu-Clarkson algorithm [2], which only requires a closed form for the geodesics. Figure 12.5 depicts an example of minimax center computation using the Bădoiu-Clarkson algorithm for a set of univariate Gaussian pdfs represented in H². We note that, using this property of circle preservation, the computation of the minimal enclosing hyperbolic circle of a given set of points Z = {z_k}_{1 ≤ k ≤ K}, z_k ∈ H², denoted MEC_{H²}(Z), is equivalent to computing the corresponding minimal enclosing circle MEC(Z) if and only if MEC(Z) ⊂ H². This is the case for the example given in Fig. 12.5.
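Since the Riemannian extension of [2] only requires the geodesics in closed form, the iteration can be sketched by reusing the dist_H2 and geodesic helpers above; the step size 1/(t + 1) mimics the Euclidean algorithm and is an assumption of this sketch:

    def minimax_center(Z, n_iter=2000):
        """Riemannian Badoiu-Clarkson approximation of the hyperbolic 1-center."""
        c = Z[0]
        for t in range(1, n_iter + 1):
            farthest = max(Z, key=lambda z: dist_H2(c, z))   # farthest point of Z from c
            c = geodesic(c, farthest, 1.0 / (t + 1))         # geodesic step toward it
        return c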
Fig. 12.5 a Example of the minimax center (x_h, y_h) (red ×) of a set of nine points Z = {z_k}_{1 ≤ k ≤ 9} in H² (original points ∗ in black); the minimal enclosing circle MEC_{H²}(Z) is also depicted (in red). b Corresponding minimax center Gaussian N(μ = √2 x_h, σ² = y_h²) of the set of nine univariate Gaussian pdfs N_k(μ_k, σ_k²), 1 ≤ k ≤ 9
ds²(θ) = Σ_{k,l=1}^n g_kl(θ) dθ^k dθ^l
is defined as the Fisher information metric. In the univariate Gaussian case, p(x|θ) ∼ N(μ, σ²), we have in particular θ = (μ, σ), and it can easily be deduced that the Fisher information matrix is
(g_kl(μ, σ)) = (1/σ² 0; 0 2/σ²). (12.15)
Hence, given two univariate Gaussian pdfs N(μ_1, σ_1²) and N(μ_2, σ_2²), the Fisher distance between them, dist_Fisher: N × N → R₊, defined from the Fisher information metric, is given by [9, 14]:
dist_Fisher((μ_1, σ_1), (μ_2, σ_2)) = √2 dist_{H²}(μ_1/√2 + iσ_1, μ_2/√2 + iσ_2). (12.17)
The change of variable also implies that the geodesics in the hyperbolic Fisher space of normal distributions are half-lines and half-ellipses meeting the axis σ = 0 orthogonally, with eccentricity 1/√2.
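Eq. (12.17) gives an immediate recipe for the Fisher distance between two Gaussians; a minimal sketch reusing dist_H2 from Sect. 12.2.1:

    import numpy as np

    def fisher_rao_distance(mu1, sig1, mu2, sig2):
        """Fisher-Rao distance between N(mu1, sig1^2) and N(mu2, sig2^2), Eq. (12.17)."""
        z1 = mu1 / np.sqrt(2) + 1j * sig1
        z2 = mu2 / np.sqrt(2) + 1j * sig2
        return np.sqrt(2) * dist_H2(z1, z2)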
The canonical approach can be generalized according to the Burbea and Rao geometric framework [9], which is based on replacing the Shannon entropy by the notion of τ-order entropy, whose associated Hessian metric leads to an extended large class of information metric geometries. Focussing on the particular case of univariate normal distributions, p(x|θ) ∼ N(μ, σ²), we consider again points in the upper half-plane, z = x + iy ∈ H², and for a given τ > 0 the τ-order entropy metric is given by [9]:
x = [A(τ)]^{−1/2} μ,  y = σ;
ds_τ² = B(τ) y^{−(τ+1)} (dx² + dy²); (12.18)
where
A(τ) = (τ^{1/2} − τ^{−1/2})² + 2τ^{−1};  B(τ) = τ^{−3/2} (2π)^{(1−τ)/2} A(τ);  τ > 0. (12.19)
The corresponding geodesics are parameterized in terms of
κ = (τ + 1)/2,  r > 0,  a ∈ R,  F_ω(θ) = −ω ∫_{π/2}^{θ} sin^ω t dt.
Figure 12.6 shows examples of the geodesics of the Burbea-Rao τ-order entropy metric for a = 0, r = 1 and τ = 0.01, 0.5, 1, 5 and 20.
By integration of the metric, one obtains the Burbea-Rao τ-order entropy geodesic distance for z_1, z_2 ∈ H² [9]:
dist_{H²}(z_1, z_2; τ) = (2√B(τ)/|1 − τ|) | (x_1 − x_2)/r + y_1^{(1−τ)/2} √(1 − r^{−2} y_1^{τ+1}) − y_2^{(1−τ)/2} √(1 − r^{−2} y_2^{τ+1}) |. (12.22)
The notion of ordering invariance in the Poincaré upper-half plane was considered in the Soviet literature [19, 20]. Ordering invariance with respect to a simply transitive subgroup T of the group of motions was studied there; i.e., the group T consists of the transformations t of the form
z = x + iy ↦ z′ = (λx + τ) + iλy,
where λ > 0 and τ are real numbers. We name T the Guts group. We note that t is just the composition of a translation and a scaling in H², and consequently T is a group of isometries (see Sect. 2.4).
Nevertheless, to the best of our knowledge, the formulation of partial orders on the Poincaré upper-half plane has not been widely studied. We introduce here partial orders on H² and study their invariance properties with respect to the transformations of the Guts group or of other subgroups of SL(2, R) (Möbius transformations).
On the other hand, the corresponding partial ordering ≤_{H²} is determined by the positive cone in H² defined by H²₊ = {z ∈ H² | x ≥ 0 and y ≥ 1}, i.e.,
z_1 ≤_{H²} z_2 ⟺ (x_2 − x_1) + i(y_2/y_1) ∈ H²₊, (12.25)
which amounts to x_1 ≤ x_2 and log(y_1) ≤ log(y_2). We easily note that, in fact, exp(log(y_1) ∨ log(y_2)) = y_1 ∨ y_2, and similarly for the infimum, since the logarithm is an isotone mapping (i.e., monotone increasing) and therefore order-preserving. Therefore, the partial ordering ≤_{H²} does not involve any particular structure of H² and does not take into account the Riemannian nature of the upper half-plane. Accordingly, we note also that the partial ordering ≤_{H²} is invariant to the Guts group of transformations.
Let us consider a symmetrization of the product ordering with respect to the origin of the upper half-plane. Given any pair of points z_1, z_2 ∈ H², we define the upper half-plane symmetric ordering as
z_1 ≼_{H²} z_2 ⟺ 0 ≤ x_1 ≤ x_2 and 0 ≤ log(y_1) ≤ log(y_2), or
x_2 ≤ x_1 ≤ 0 and 0 ≤ log(y_1) ≤ log(y_2), or
x_2 ≤ x_1 ≤ 0 and log(y_2) ≤ log(y_1) ≤ 0, or
0 ≤ x_1 ≤ x_2 and log(y_2) ≤ log(y_1) ≤ 0. (12.29)
The four conditions of this partial ordering entail that only points belonging to the same quadrant of H² can be ordered, where the four quadrants {H²₊₊, H²₋₊, H²₋₋, H²₊₋} are defined with respect to the origin O_{H²} = (0, 1), which corresponds to the pure imaginary complex z_0 = i. In other words, we can summarize the partial ordering (12.29) by saying that if z_1 and z_2 belong to the same O-quadrant of H², we have z_1 ≼_{H²} z_2 ⟺ |x_1| ≤ |x_2| and |log(y_1)| ≤ |log(y_2)|. Endowed with the partial ordering (12.29), H² becomes a partially ordered set (poset) whose bottom element is z_0, but we notice that there is no top element. In addition, the infimum of a pair of points z_1 and z_2 lying in the same quadrant is given by
z_1 ⋏ z_2 = (x_1 ∧ x_2) + i(y_1 ∧ y_2) if z_1, z_2 ∈ H²₊₊;
z_1 ⋏ z_2 = (x_1 ∨ x_2) + i(y_1 ∧ y_2) if z_1, z_2 ∈ H²₋₊;
z_1 ⋏ z_2 = (x_1 ∨ x_2) + i(y_1 ∨ y_2) if z_1, z_2 ∈ H²₋₋;
z_1 ⋏ z_2 = (x_1 ∧ x_2) + i(y_1 ∨ y_2) if z_1, z_2 ∈ H²₊₋.
Due to the strong dependency of the partial ordering ≼_{H²} on O_{H²}, it is easy to see that this ordering is only invariant to transformations that do not move points from one quadrant to another. This is typically the case for mappings of the form z ↦ λz, λ > 0.
where (β, ξ) are the hyperbolic polar coordinates defined in Eq. (12.10). The polar supremum z_1 ∨^{pol}_{H²} z_2 and infimum z_1 ∧^{pol}_{H²} z_2 are naturally obtained from the order (12.32) for any subset of points Z, and are denoted ⋁^{pol}_{H²} Z and ⋀^{pol}_{H²} Z. The total order ≤^{pol}_{H²} leads to a complete lattice, bounded from the bottom (i.e., the origin O_{H²}) but not from the top. Furthermore, as ≤^{pol}_{H²} is a total ordering, the supremum and the infimum of a pair will be either z_1 or z_2.
The polar total order is invariant to any Möbius transformation M_g which preserves the distance to the origin (the isometry group), and more generally to maps that are isotone in distance, i.e., β(z_1) ≤ β(z_2) ⟺ β(M_g(z_1)) ≤ β(M_g(z_2)), provided they also preserve the orientation order, i.e., the order on the polar angle. This is for instance the case of the orientation group SO(2) and of the scaling maps z ↦ M_g(z) = λz, 0 < λ ∈ R.
We note also that, instead of taking O_{H²} as the origin, the polar hyperbolic coordinates can be defined with respect to a different origin z_0′; the total order is then adapted to the new origin (i.e., the bottom element is just z_0′).
One can replace in the polar ordering the distance dist_{H²}(O_{H²}, z) by the τ-order Hellinger distance to obtain the total ordering ≤^{τ-pol}_{H²} parameterized by τ:
z_1 ≤^{τ-pol}_{H²} z_2 ⟺ dist_Hellinger(O_{H²}, z_1; τ) < dist_Hellinger(O_{H²}, z_2; τ), or
dist_Hellinger(O_{H²}, z_1; τ) = dist_Hellinger(O_{H²}, z_2; τ) and tan ξ_1 ≤ tan ξ_2. (12.33)
As discussed above, there is a unique hyperbolic geodesic joining any pair of points. Given two points z_1, z_2 ∈ H² such that x_1 ≠ x_2, let SC_{r_{12}}(a_{12}) be the semi-circle defining their geodesic, where the center a_{12} and the radius r_{12} are given by Eq. (12.8). Let us denote by z_{12} the point of SC_{r_{12}}(a_{12}) having the maximal imaginary part, i.e., its imaginary part equals the radius: z_{12} = a_{12} + ir_{12}.
The upper half-plane geodesic ordering ≼^{geo}_{H²} defines an order for points lying in the same half of their geodesic semi-circle as follows:
z_1 ≼^{geo}_{H²} z_2 ⟺ a_{12} ≤ x_1 < x_2 or x_2 < x_1 ≤ a_{12}. (12.34)
The property of transitivity of this partial ordering, i.e., z_1 ≼^{geo}_{H²} z_2 and z_2 ≼^{geo}_{H²} z_3 ⇒ z_1 ≼^{geo}_{H²} z_3, holds for points belonging to the same geodesic. For two points on a vertical geodesic line, x_1 = x_2, we have z_1 ≼^{geo}_{H²} z_2 ⟺ y_2 ≤ y_1. We note that, considering duality with respect to the involution (12.28), one has
z_1 ≼^{geo}_{H²} z_2 ⟺ z̄_1 ≽^{geo}_{H²} z̄_2,
where z̄ denotes the dual of z.
which geometrically means that the geodesic connecting z_inf to any point z_k of Z always lies in one half of the semi-circle defined by z_inf and z_k.
In practice, the minimal enclosing semi-circle defining z_inf can be easily computed by means of the following algorithm, based on the minimum enclosing Euclidean circle (MEC) of a set of points: (1) working in R², define the set of points given, on the one hand, by Z and, on the other hand, by Z*, corresponding to the points reflected with respect to the x-axis (complex conjugates), i.e., the points Z = {(x_k, y_k)} and Z* = {(x_k, −y_k)}, 1 ≤ k ≤ K; (2) compute MEC(Z ∪ Z*) ↦ C_r(c); by the symmetry of the point configuration, the center necessarily lies on the x-axis, i.e., c = (x_c, 0); (3) the infimum ⋀^{geo}_{H²} Z = z_inf is given by z_inf = x_c + ir. Figure 12.7a-b gives an example of the computation of the geodesic infimum of a set of points in H².
As in the case of two points, the geodesic supremum of Z is defined by duality with respect to the involution (12.28), i.e.,
z_sup = ⋁^{geo}_{H²} Z = (⋀^{geo}_{H²} Z̄)‾ = a_sup + ir_sup, (12.37)
with a_sup = −x_c^dual and r_sup = 1/r^dual, where SC_{r^dual}(x_c^dual) is the minimal enclosing semi-circle of the dual set of points Z̄. An example of the computation of the geodesic supremum z_sup is also given in Fig. 12.7a-b. It is easy to see that the geodesic infimum and supremum have the following properties for any Z ⊂ H²:
1. z_inf ≼^{geo}_{H²} z_sup;
2. Im(z_inf) ≥ Im(z_k) and Im(z_sup) ≤ Im(z_k), ∀z_k ∈ Z;
3. ⋀_{1≤k≤K} Re(z_k) < Re(z_inf), Re(z_sup) < ⋁_{1≤k≤K} Re(z_k).
The proofs are straightforward from the notion of minimal enclosing semi-circle and the fact that z_sup lies inside the semi-circle defined by z_inf.
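Assuming the involution (12.28) acts as z = x + iy ↦ −x + i/y, which is consistent with a_sup = −x_c^dual and r_sup = 1/r^dual in Eq. (12.37), the supremum can be sketched by dualizing, enclosing and dualizing back:

    def geodesic_supremum(Z):
        """Geodesic supremum via duality, Eq. (12.37) (involution assumed: -x + i/y)."""
        dual = [-z.real + 1j / z.imag for z in Z]
        zd = geodesic_infimum(dual)          # x_c^dual + i r^dual
        return -zd.real + 1j / zd.imag       # a_sup + i r_sup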
The geodesic infimum and supremum being defined through minimal enclosing semi-circles, their invariance properties are related to the translation and scaling of the points of Z as defined in Sect. 2.4, but not to inversion. This invariance domain corresponds exactly to the Guts group of transformations, i.e.,
⋀^{geo}_{H²} {T(z_k)}_{1≤k≤K} = T(⋀^{geo}_{H²} {z_k}_{1≤k≤K}).
Fig. 12.7 a Set of nine points in H², Z = {z_k}_{1≤k≤9}. b Computation of the infimum ⋀^{geo}_{H²} Z = z_inf (blue ×) and of the supremum ⋁^{geo}_{H²} Z = z_sup (red ×); the black ∗ are the original points and the green ∗ the corresponding dual ones. c In black, the set of Gaussian pdfs associated to Z, i.e., N_k(μ = √2 x_k, σ² = y_k²); in blue, the infimum Gaussian pdf N_inf(μ = √2 x_inf, σ² = y_inf²); in red, the supremum Gaussian pdf N_sup(μ = √2 x_sup, σ² = y_sup²). d Cumulative distribution functions of the Gaussian pdfs from c
Fig. 12.8 a Set of nine points in H², Z = {z_k}_{1≤k≤9}. b Computation of the smallest Burbea-Rao τ-order geodesic enclosing the set Z, for τ = 0.01 (in green), τ = 1 (in red), τ = 5 (in magenta) and τ = 20 (in blue)
infimum will correspond to the z_k having the largest imaginary part, and dually for the supremum, i.e., the z_k having the smallest imaginary part. In the case of large τ, we note that the real part of both the τ-geodesic infimum and supremum equals (⋁_{1≤k≤K} Re(z_k) − ⋀_{1≤k≤K} Re(z_k))/2, and that the imaginary part of the infimum goes to +∞ and that of the supremum to 0 as τ → +∞.
According to the properties of the geodesic infimum z_inf and supremum z_sup discussed above, we note that their real parts Re(z_inf) and Re(z_sup) belong to the interval bounded by the real parts of the points of the set Z. Moreover, Re(z_inf) and Re(z_sup) are not ordered between them; the real part of the supremum can therefore be smaller than that of the infimum. For instance, in the extreme case of a set Z in which all the imaginary parts are equal, the real parts of its geodesic infimum and supremum are both equal to the average of the real parts of the points, i.e., given Z = {z_k}_{1≤k≤K}, if y_k = y, 1 ≤ k ≤ K, then Re(z_inf) = Re(z_sup) = (1/K) Σ_{k=1}^K x_k. From the viewpoint of morphological image filtering, it can be potentially interesting to impose an asymmetric behavior on the infimum and supremum, such that Re(z_inf^{−→+}) ≤ Re(z_k) ≤ Re(z_sup^{−→+}), 1 ≤ k ≤ K. Note that the proposed notation − → + indicates a partially ordered set on the x-axis. In order to fulfil these requirements, we can geometrically consider the rectangle bounding the minimal enclosing semi-circle, which is of dimensions 2r_inf × r_inf, and use it to define the asymmetric infimum z_inf^{−→+} as the upper-left corner of this rectangle. The asymmetric supremum z_sup^{−→+} is similarly defined from the bounding rectangle of the dual minimal enclosing semi-circle. Mathematically, given the geodesic infimum z_inf and supremum z_sup, we have the following definitions for the asymmetric geodesic infimum and supremum (Fig. 12.9):
z_inf^{−→+} = ⋀^{−→+}_{H²} Z = (a_inf − r_inf) + ir_inf;
z_sup^{−→+} = ⋁^{−→+}_{H²} Z = −(x_c^dual − r^dual) + i(1/r^dual). (12.38)
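Continuing the same sketch (and the same assumed involution), Eq. (12.38) reads:

    def asymmetric_inf_sup(Z):
        """Asymmetric geodesic infimum/supremum, Eq. (12.38)."""
        zi = geodesic_infimum(Z)                                    # a_inf + i r_inf
        zd = geodesic_infimum([-z.real + 1j / z.imag for z in Z])   # x_c^dual + i r^dual
        z_inf = (zi.real - zi.imag) + 1j * zi.imag                  # (a_inf - r_inf) + i r_inf
        z_sup = -(zd.real - zd.imag) + 1j / zd.imag                 # -(x_c^dual - r^dual) + i / r^dual
        return z_inf, z_sup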
In terms of Gaussian pdfs (Fig. 12.7c), the infimum Gaussian pdf N_inf(μ = √2 x_inf, σ² = y_inf²) has a larger variance than the K Gaussian pdfs of the set, and its mean is a kind of barycenter of the Gaussian pdfs having a larger variance. The supremum Gaussian pdf N_sup(μ = √2 x_sup, σ² = y_sup²) has a smaller variance than the K Gaussian pdfs, and its mean lies between the means of those of smaller variance. In terms of the corresponding cumulative distribution functions, we observe that the geodesic supremum/infimum do not have a natural interpretation. In the case of the asymmetric Gaussian geodesic infimum N_inf^{−→+}(μ = √2 x_inf^{−→+}, σ² = (y_inf^{−→+})²) and supremum N_sup^{−→+}(μ = √2 x_sup^{−→+}, σ² = (y_sup^{−→+})²), we observe how the means are
Fig. 12.9 a Infimum and supremum Gaussian pdfs (in green and red, respectively) from the asymmetric geodesic infimum z_inf^{−→+} and supremum z_sup^{−→+} of the set of Fig. 12.7. b Cumulative distribution functions of the Gaussian pdfs from (a)
ordered with respect to the K others, which also implies that the corresponding cdfs are ordered. The latter is related to the notion of stochastic dominance [30] and will be explored in detail in ongoing research.
Let us consider that H² has been endowed with one of the partial orderings discussed above, generically denoted ≤. Hence (H², ≤) is a poset, which also has the structure of a complete lattice, since we consider that the infimum ⋀ and the supremum ⋁ are defined for any set of points in H².
i.e., δ(z′) = ⋀{z ∈ H² : z′ ≤ ε(z)}, z′ ∈ H². Similarly, one can define a unique erosion from a given dilation: ε(z) = ⋁{z′ ∈ H² : δ(z′) ≤ z}, z ∈ H².
Given an adjunction (ε, δ), the composition product operators ω(z) = δ(ε(z)) and ϕ(z) = ε(δ(z)) are respectively an opening and a closing, which are the basic morphological filters, enjoying very useful properties [21, 29]: idempotency, ωω(z) = ω(z); anti-extensivity, ω(z) ≤ z, and extensivity, z ≤ ϕ(z); and increasingness. Another relevant result is the fact that, given an erosion ε, the opening and closing by adjunction are exclusively defined in terms of the erosion [21] as
ω(z) = ⋀{z′ ∈ H² : ε(z) ≤ ε(z′)},  ϕ(z) = ⋀{ε(z′) : z′ ∈ H², z ≤ ε(z′)},  ∀z ∈ H².
In the case of a complete inf-semilattice (H², ≤), where the infimum ⋀ is defined but the supremum is not necessarily so, we have the following particular results [22, 23]: (a) it is always possible to associate an opening ω to a given erosion ε by means of ω(z) = ⋀{z′ ∈ H² : ε(z) ≤ ε(z′)}; (b) even though the adjoint dilation δ is not well defined on all of H², it is always well defined for elements of the image of H² by ε; and (c) ω = δε. The closing defined by ϕ = εδ is only partially defined. Obviously, in the case of an inf-semilattice it is still possible to define δ such that δ(⋁ z_k) = ⋁ δ(z_k) for the families for which the supremum exists.
If (H², ≤) is a complete lattice, the set of images F(Δ, H²) is also a complete lattice, defined as follows: for all f, g ∈ F(Δ, H²), (i) f ≤ g ⟺ f(p) ≤ g(p), ∀p ∈ Δ; (ii) (f ∧ g)(p) = f(p) ∧ g(p), ∀p ∈ Δ; (iii) (f ∨ g)(p) = f(p) ∨ g(p), ∀p ∈ Δ, where ∧ and ∨ are the infimum and supremum in H². One can now define the following adjoint pair of flat erosion ε_B(f) and flat dilation δ_B(f) at each pixel p of the image f [21, 29]:
ε_B(f)(p) = ⋀_{q ∈ B(p)} f(p + q), (12.40)
δ_B(f)(p) = ⋁_{q ∈ B(p)} f(p − q), (12.41)
such that the pair satisfies the adjunction
δ_B(f) ≤ g ⟺ f ≤ ε_B(g), (12.42)
where the set B is called the structuring element, which defines the set of points of Δ considered when it is centered at point p, denoted B(p) [31]. These operators, which are translation invariant, can be seen as constant-weight (this is the reason why they are called flat) inf/sup-convolutions, where the structuring element B works as a moving window.
The above erosion (resp. dilation) moves the object edges within the image in such a way that it expands the image structures with values in H² close to the bottom element
Fig. 12.10 Supremum and infimum of a set of 25 patches parameterized by their mean and standard deviation: a in red, the region where the overlapping patches are taken; b embedding into the space H² according to the coordinates μ/√2 and σ, and the corresponding sup and inf for the different ordering strategies
(resp. close to the top) of the lattice F(Δ, H²), and shrinks the objects with values close to the top element (resp. close to the bottom).
Let us now consider the various cases of supremum and infimum introduced above. In order to support the discussion, we have included an example in Fig. 12.10. We have taken all the patches of size 5 × 5 pixels surrounding one of the pixels of the image of Fig. 12.1. The 25 patches are then embedded into the space H² according to the coordinates μ/√2 and σ. Finally, the supremum and the infimum of this set of points are computed for the different cases. This corresponds exactly to the way the dilation and the erosion are obtained for the current pixel, the center of the red region in Fig. 12.10a.
Everything works perfectly for the supremum ⋁_{H²} and infimum ⋀_{H²} of the upper half-plane product ordering, which can consequently be used to construct dilation and erosion operators on F(Δ, H²). In fact, this is exactly equivalent to the classical operators applied to the real and imaginary parts separately.
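For this product-ordering case, the marginal construction can be sketched with standard grayscale morphology applied componentwise; we use scipy.ndimage for the inf/sup-convolutions, and the complex-image encoding is our own convention:

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion

    def erosion_product(f: np.ndarray, size: int = 5) -> np.ndarray:
        """Flat erosion (12.40), product ordering: marginal min over the window B."""
        B = np.ones((size, size), dtype=bool)
        return grey_erosion(f.real, footprint=B) + 1j * grey_erosion(f.imag, footprint=B)

    def dilation_product(f: np.ndarray, size: int = 5) -> np.ndarray:
        """Flat dilation (12.41), product ordering: marginal max over the window B."""
        B = np.ones((size, size), dtype=bool)
        return grey_dilation(f.real, footprint=B) + 1j * grey_dilation(f.imag, footprint=B)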
Similarly, the supremum and infimum of the upper half-plane polar ordering, ⋁^{pol}_{H²} and ⋀^{pol}_{H²}, based on a total ordering, also lead respectively to dilation and erosion operators. The erosion produces a point which corresponds here to the patch closest to the origin. That means a patch of intermediate mean and standard deviation intensity, since the image intensity is normalized; see Sect. 5.4. On the contrary, the dilation gives the point associated to the farthest patch from the origin, in this example a homogeneous bright patch. Note that patches at great distance correspond to the most "contrasted" ones in the image: either homogeneous patches of dark or bright intensity, or patches with a strong variation in intensity (edge patches).
We note that for the symmetric ordering ≼_{H²} one only has an inf-semilattice structure, associated to ⋀_{H²}. Moreover, in the case of the upper half-plane geodesic ordering, the pair of operators (12.40) and (12.41) associated to our supremum ⋁^{geo}_{H²} and infimum ⋀^{geo}_{H²} will not verify the adjunction (12.42). The same limitation also holds for the upper half-plane asymmetric geodesic supremum and infimum. Hence, the geodesic supremum and infimum do not strictly yield a pair of dilation and erosion operators.
Given the adjoint image operators (ε_B, δ_B), the opening and closing by adjunction of an image f, according to the structuring element B, are defined as the composition operators [21, 29]:
ω_B(f) = δ_B(ε_B(f)), (12.43)
ϕ_B(f) = ε_B(δ_B(f)). (12.44)
Openings and closings are referred to as morphological filters; they remove the objects of the image f that do not comply with a criterion related, on the one hand, to the invariance of the object support under the structuring element B and, on the other hand, to the values of the object in H² being far from (in the case of the opening) or near to (in the case of the closing) the bottom element of H², according to the given partial ordering ≤.
Once the pairs of dual operators (ε_B, δ_B) and (ω_B, ϕ_B) are defined, the other morphological filters and transformations can be naturally defined [31] for images in F(Δ, H²). We limit the illustrative examples here to the basic ones.
Following our analysis of the particular cases of ordering and supremum/infimum in H², we can conclude that opening and closing in F(Δ, H²) are well formulated for the upper half-plane product ordering and for the upper half-plane polar ordering. In the case of the upper half-plane symmetric ordering, the opening is always defined but the closing cannot be computed. Again, we insist on the fact that for the upper half-plane geodesic ordering, the composition operators obtained from the supremum ⋁^{geo}_{H²} and infimum ⋀^{geo}_{H²} do not produce an opening and a closing stricto sensu. Notwithstanding, the corresponding composition operators yield a regularization effect on F(Δ, H²)-images which can be of interest for practical applications.
g(p) ↦ ĝ(p) = (g(p) − Mean(g))/√Var(g) ↦ f(p) = Mean_W(ĝ)(p) + i√(Var_W(ĝ)(p)).
We note that the definition of our representation space requires f_y(p) > 0. This means that the variance of each patch should always be strictly positive, which is obviously not the case for constant patches. In order to cope with this problem, we propose to add a small ε to the value of the standard deviation.
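A minimal sketch of this construction over square patches of side w, using uniform box filters for the window statistics (the helper name and the default ε value are ours):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def gaussian_valued_image(g: np.ndarray, w: int = 5, eps: float = 1e-3) -> np.ndarray:
        """Embed a scalar image into F(Delta, H^2): mean + i (std + eps) per patch."""
        g_hat = (g - g.mean()) / g.std()                       # global normalization
        mean_w = uniform_filter(g_hat, size=w)                 # patch means Mean_W
        var_w = uniform_filter(g_hat**2, size=w) - mean_w**2   # patch variances Var_W
        return mean_w + 1j * (np.sqrt(np.maximum(var_w, 0.0)) + eps)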
Figure 12.11 gives a comparison of the morphological erosions ε_B(f)(p) and openings ω_B(f)(p) of this image f using the five complete (inf-semi)lattices of H² considered in the paper. We have also included the pseudo-erosions and pseudo-openings associated to the geodesic supremum and infimum and to the asymmetric geodesic ones. The same structuring element B, a square of 5 × 5 pixels, has been used for all the examples. First of all, we recall again that working on the product complete lattice (H², ≤_{H²}) is equivalent to a marginal processing of the real and imaginary components. As expected, the symmetric ordering-based inf-semilattice (H², ≼_{H²}) and the polar ordering-based lattice (H², ≤^{pol}_{H²}) produce rather similar results for openings. We observe that in both cases the opening produces a symmetric filtering effect between bright/dark intensity in the mean and standard deviation components. But it is important to remark that the processing effects depend on how the image components are valued with respect to the origin z_0 = (0, 1). This is the reason why it is proposed to always normalize the image by its mean/variance.
The pseudo-openings produced by working on the geodesic lattice (H², ⋁^{geo}_{H²}, ⋀^{geo}_{H²}) and on the asymmetric geodesic lattice (H², ⋁^{−→+}_{H²}, ⋀^{−→+}_{H²}) involve a processing which is mainly driven by the values of the standard deviation. Hence, the filtering effects are potentially more interesting for applications requiring to deal with pixel uncertainty, either in a symmetric processing of both bright/dark mean values with (H², ⋁^{geo}_{H²}, ⋀^{geo}_{H²}), or in a more classical morphological asymmetrization with (H², ⋁^{−→+}_{H²}, ⋀^{−→+}_{H²}).
Example 2 Figure 12.12 illustrates a comparative example of erosions ε_B(f)(p) on a very noisy image g(p). We note that g(p) is mean centered. The "noise" is related to an acquisition at the limit of exposure time/spatial resolution. We consider an image model f(p) = f_x(p) + if_y(p), where f_x(p) = g(p) and f_y(p) is the standard deviation of the intensities in a patch of radius equal to 4 pixels. The results of the erosion obtained by the product and symmetric partial orderings are compared to those obtained by the polar ordering and, more generally, by the τ-polar ordering with four values of τ. We observe, on the one hand, that the polar orderings are more relevant than the product or symmetric ones. As expected, the τ-polar erosion with τ = 1 is almost equivalent to the hyperbolic polar ordering. We note, on the other hand, the interest of the limit cases of the τ-polar erosion. The erosion for small τ produces a strongly regularized image where the bright/dark objects with respect to the background have been nicely
enhanced. In the case of large τ, the background (i.e., the pixel values close to the origin in H²) is enhanced, which involves removing all the image structures smaller than the structuring element B.
Example 3 In Fig. 12.13 a limited comparison for the case of the dilation δ_B(f)(p) is depicted. The image f(p) = f_x(p) + if_y(p) is obtained similarly to the case of Example 1. We can compare the supremum by product ordering with those obtained by the polar supremum and the τ-polar supremum, with τ = 0.01. The analysis is similar to the previous case.
Fig. 12.12 Comparison of erosions of a Gaussian distribution-valued noisy image, ε_B(f)(p): a original image f ∈ F(Δ, H²), showing both the real and the imaginary components; b upper half-plane product ordering (equivalent to standard processing); c upper half-plane symmetric ordering; d upper half-plane polar ordering; e-h upper half-plane τ-polar ordering, with four values of τ. In all cases the structuring element B is a square of 5 × 5 pixels
Example 4 Figure 12.14 involves again the noisy retinal image; it shows a comparison of the results of the (pseudo-)opening ω_B(f)(p) and (pseudo-)closing ϕ_B(f)(p) obtained for the product ordering, the geodesic lattice (H², ⋁^{geo}_{H²}, ⋀^{geo}_{H²}) and the asymmetric geodesic lattice (H², ⋁^{−→+}_{H²}, ⋀^{−→+}_{H²}). The structuring element B is a square of 5 × 5 pixels. In order to compare their enhancement effects with an averaging operator, the result of filtering by computing the minimax center in a square of 5 × 5 pixels [2, 8] is also given; see the Remark in Sect. 2.5. We note that the operators associated to the asymmetric geodesic supremum and infimum yield mean images relatively similar to the standard ones underlying the supremum and infimum of the product lattice. However, by including the information given by the local standard deviation, the contrast of the structures is better in the asymmetric geodesic case.
Based on the discussion given in Sect. 5.2 as well as on the examples from Sect. 5.4,
we can draw some conclusions on the experimental part of this chapter.
• First of all, we note that the examples considered here are only a preliminary exploration of the potential applications of morphologically processing univariate Gaussian distribution-valued images.
• We have two main case studies. First, standard images, which are embedded into the Poincaré upper-half plane representation by parameterizing each local patch by its mean and standard deviation. Second, images which naturally involve a distribution of values at each pixel. Note that in the first case, the information of the standard deviation mainly serves to discriminate between homogeneous zones and inhomogeneous ones (textures or contours). In the second case, the
Fig. 12.15 Morphological detail extraction from a multiple-acquisition image modeled as Gaussian distribution-valued: a original image f ∈ F(Δ, H²), showing both the real and the imaginary components; b morphological opening ω_B(f) working on the polar ordering-based lattice; c corresponding residue (pixelwise hyperbolic difference) between the original and the opened image; d morphological pseudo-opening ω_B(f) working on the asymmetric geodesic lattice; e corresponding residue. In both cases the structuring element B is a square of 7 × 7 pixels
standard deviation involves relevant information on the nature of the noise during the acquisition.
• For either of these two cases, we should remark that the different alternatives of orderings and derived operators considered in the paper produce nonlinear processing whose main property is that the filtering effects are strongly driven by the standard deviation.
• The upper half-plane product ordering is nothing more than the standard processing of the mean and the standard deviation separately. The symmetric ordering, leading to an inf-semilattice, has a limited interest, since similar effects are obtained with the polar ordering.
• The upper half-plane polar ordering, using the standard hyperbolic polar coordinates or the τ-order Hellinger distance, produces morphological operators appropriate for image regularization and enhancement. We remind the reader that points close to the origin (selected by the erosion) correspond, in the case of the patches, to those of intermediate mean and standard deviation intensity after normalization. On the contrary, patches far from the origin correspond to the "contrasted" ones: either homogeneous patches of dark or bright intensity, or patches with a strong variation in intensity (edge patches).
We note that, with respect to filters based on averaging, the half-plane polar dilation/erosion, as well as their product operators, produce strongly simplified images where the edges and the main objects are enhanced without any blurring effect.
From our viewpoint this is useful for both cases of images. The choice of a high or a low value of τ will then depend on the particular nature of the features to be enhanced. In any case, this parameter can be optimized.
• The upper half-plane geodesic ordering involves a nonlinear filtering framework which takes into account the intrinsic geometry of H². It is mainly based on the notion of the minimal enclosing geodesic covering the set of points.
In practice, the geodesic infimum gives a point whose standard deviation is equal to or larger than that of every point, and whose mean can be seen as intermediate between the mean values of the points of high standard deviation. The supremum produces a point of standard deviation equal to or smaller than the others, whose mean is obtained by averaging around the means of the points having a small standard deviation.
Consequently, the erosion involves a nonlinear filtering which enhances the image zones of high standard deviation, typically the contours; the dilation enhances the homogeneous zones. We should note that the mean images processed by the composition of these two operators (i.e., openings and closings) are strongly enhanced by an increase of their bright/dark contrast. Therefore, this should be considered an appropriate tool for the enhancement of contrasted structures on irregular backgrounds.
The asymmetric version of the geodesic ordering implies that the dilation and the erosion have the same interpretation for the mean as the classical ones, but the filtering effects are driven by the zones of low or high standard deviation. These operators are potentially useful for object extraction by the residue between the original image and the opening/closing. In comparison with classical residues, the new ones produce sharper extracted objects.
12.6 Perspectives
Levelings are a powerful family of self-dual morphological operators which have also been formulated in vector spaces [24], using geometric notions such as minimum enclosing balls and half-plane intersections. We intend to explore the formulation of levelings in the upper half-plane in future work.
The complete lattice structures for the Poincaré upper-half plane introduced in this work, and the corresponding morphological operators, can be applied to process other hyperbolic-valued images. For instance, on the one hand, it was proven in [13] that the structure tensors of 2D images, i.e., where each pixel is given a 2 × 2 symmetric positive definite matrix whose determinant is equal to 1, are isomorphic to the Poincaré unit disk model. On the other hand, polarimetric images [17], where at each pixel a partially polarized state is given, can be embedded in the Poincaré unit disk model. In both cases, we only need the mapping from the Poincaré disk model to the Poincaré half-plane, i.e.,
z ↦ −i (z + 1)/(z − 1).
References
1. Angulo, J., Velasco-Forero, S.: Complete lattice structure of Poincaré upper-half plane and
mathematical morphology for hyperbolic-valued images. In: Nielsen, F., Barbaresco, F. (eds.)
Proceedings of First International Conference Geometric Science of Information (GSI’2013),
vol. 8085, pp. 535–542. Springer LNCS (2013)
2. Arnaudon, M., Nielsen, F.: On approximating the Riemannian 1-center. Comput. Geom. 46(1), 93–104 (2013)
3. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differential
geometry in statistical inference. Lecture Notes-Monograph Series, vol. 10, pp. 19–94, Institute
of Mathematical Statistics, Hayward (1987)
4. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. American Mathematical Society (2000)
5. Barbaresco, F.: Interactions between symmetric cone and information geometries: Bruhat-Tits and Siegel spaces models for high resolution autoregressive Doppler imagery. In: Nielsen, F. (ed.) Emerging Trends in Visual Computing (ETVC'08), Springer LNCS, vol. 5416, pp. 124–163. Springer, Heidelberg (2009)
6. Barbaresco, F.: Geometric radar processing based on Fréchet distance: information geome-
try versus optimal transport theory. In: Proceedings of IEEE International Radar Symposium
(IRS’2011), pp. 663–668 (2011)
7. Barbaresco, F.: Information geometry of covariance matrix: Cartan-Siegel homogeneous bounded domains, Mostow/Berger fibration and Fréchet median. In: Nielsen, F., Bhatia, R. (eds.) Matrix Information Geometry, pp. 199–255. Springer, Heidelberg (2013)
8. Bădoiu, M., Clarkson, K.L.: Smaller core-sets for balls. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 801–802. ACM, New York (2003)
9. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence measures in probability spaces: a unified approach. J. Multivar. Anal. 12(4), 575–596 (1982)
10. Cǎliman, A., Ivanovici, M., Richard, N.: Probabilistic pseudo-morphology for grayscale and color images. Pattern Recogn. 47(2), 721–735 (2014)
11. Cammarota, V., Orsingher, E.: Travelling randomly on the Poincaré half-plane with a Pythagorean compass. J. Stat. Phys. 130(3), 455–482 (2008)
12. Cannon, J.W., Floyd, W.J., Kenyon, R., Parry, W.R.: Hyperbolic geometry. Flavors of Geometry,
vol. 31, MSRI Publications, Cambridge (1997)
13. Chossat, P., Faugeras, O.: Hyperbolic planforms in relation to visual edges and textures perception. PLoS Comput. Biol. 5(12), e1000625 (2009)
14. Costa, S.I.R., Santos, S.A., Strapasson, J.E.: Fisher information matrix and hyperbolic geometry. In: Proceedings of the IEEE ISOC ITW2005 on Coding and Complexity, pp. 34–36 (2005)
15. Costa, S.I.R., Santos, S.A., Strapasson, J.E.: Fisher information distance: a geometrical reading. arXiv:1210.2354v1, p. 15 (2012)
16. Dodson, C.T.J., Matsuzoe, H.: An affine embedding of the gamma manifold. Appl. Sci. 5(1),
7–12 (2003)
17. Frontera-Pons, J., Angulo, J.: Morphological operators for images valued on the sphere. In:
Proceedings of IEEE ICIP’12 ( IEEE International Conference on Image Processing), pp.
113–116, Orlando (Florida), USA, October (2012)
18. Fuchs, L.: Partially Ordered Algebraic Systems. Pergamon, Oxford (1963)
19. Guts, A.K.: Mappings of families of oricycles in Lobachevsky space. Math. USSR-Sb. 19, 131–138 (1973)
20. Guts, A.K.: Mappings of an ordered Lobachevsky space. Siberian Math. J. 27(3), 347–361 (1986)
21. Heijmans, H.J.A.M.: Morphological Image Operators. Academic Press, Boston (1994)
22. Heijmans, H.J.A.M., Keshet, R.: Inf-semilattice approach to self-dual morphology. J. Math.
Imaging Vis. 17(1), 55–80 (2002)
23. Keshet, R.: Mathematical morphology on complete semilattices and its applications to image
processing. Fundamenta Informaticæ 41, 33–56 (2000)
24. Meyer, F.: Vectorial Levelings and Flattenings. In: Mathematical Morphology and its Appli-
cations to Image and Signal Processing (Proc. of ISMM’02), pp. 51–60, Kluwer Academic
Publishers, Dordrecht (2000)
25. Nielsen, F., Nock, R.: On the smallest enclosing information disk. Inform. Process. Lett. 105, 93–97 (2008)
26. Nielsen, F., Nock, R.: Hyperbolic Voronoi diagrams made easy. In: Proceedings of the 2010 IEEE International Conference on Computational Science and Its Applications, pp. 74–80. IEEE Computer Society, Washington (2010)
27. Sachs, Z.: Classification of the isometries of the upper half-plane, p. 14. University of Chicago,
VIGRE REU (2011)
28. Sbaiz, L., Yang, F., Charbon, E., Süsstrunk, S., Vetterli, M.: The gigavision camera. In: Pro-
ceedings of IEEE ICASSP’09, pp. 1093–1096 (2009)
29. Serra, J.: Image Analysis and Mathematical Morphology. Vol. II: Theoretical Advances. Academic Press, London (1988)
30. Shaked, M., Shanthikumar, G.: Stochastic Orders and Their Applications. Academic Press, New York (1994)
31. Soille, P.: Morphological Image Analysis. Springer-Verlag, Berlin (1999)
32. Treibergs, A.: The hyperbolic plane and its immersions into R3 , Lecture Notes in Department
of Mathematics, p. 13. University of Utah (2003)
Chapter 13
Dimensionality Reduction for Classification of Stochastic Texture Images
C. T. J. Dodson (B)
School of Mathematics, University of Manchester, Manchester, M13 9PL, UK
e-mail: [email protected]
W. W. Sampson
School of Materials, University of Manchester, Manchester, M13 9PL, UK
e-mail: [email protected]
13.1 Introduction
The new contribution in this paper is to couple information geometry with dimension-
ality reduction, to identify small numbers of prominent features concerning density
fluctuation and clustering in stochastic texture images, for classification of group-
ings in large datasets. Our methodology applies to stochastic texture images
in one, two or three dimensions, but to gain an impression of the nature of exam-
ples we analyse some familiar materials for which we have areal density arrays, and
derive analytic expressions of spatial covariance matrices for Poisson processes of
finite objects in one and two dimensions. Information geometry provides a natural
distance structure on the textures via their spatial covariances, which allows us to
obtain multidimensional scaling or dimensionality reduction and hence 3D embed-
dings of sets of samples. See Mardia et al. [14] for an account of the original work
on multidimensional scaling.
The simplest one-dimensional stochastic texture arises as the density variation
along a cotton yarn, consisting of a near-Poisson process of finite-length cotton
fibres on a line; another is an audio noise drone consisting of a Poisson process of
superposed finite-length notes or chords. A fundamental microscopic 1-dimensional
stochastic process is the distribution of the 20 amino acids along protein chains in a
genome [1, 3]. Figure 13.1 shows a sample of such a sequence of the 20 amino acids
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y mapped onto the 20 grey
level values 0.025, 0.075, . . . , 0.975 from the database [19], so yielding a grey-level
barcode as a 1-dimensional texture. We analyse such textures in Sect. 13.6.5.
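The grey-level mapping is easy to reproduce. The following minimal sketch (with a made-up stand-in sequence; not the authors' code) is one way to generate such a barcode:

```python
# Map an amino-acid sequence onto the 20 grey levels 0.025, 0.075, ..., 0.975,
# giving a 1-dimensional grey-level "barcode" texture.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GREY_LEVEL = {a: 0.025 + 0.05 * i for i, a in enumerate(AMINO_ACIDS)}

def barcode(sequence):
    """Grey-level values for a protein sequence; unknown symbols are skipped."""
    return [GREY_LEVEL[a] for a in sequence if a in GREY_LEVEL]

print(barcode("MKTAYIAKQR"))   # arbitrary stand-in sequence
```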
The largest 3-dimensional stochastic structure is the cosmological void distribu-
tion, which is observable via radio astronomy [1]. More familiar three-dimensional
stochastic porous materials include metallic (Fig. 13.2) and plastic solid foams,
geological strata and dispersions in gels, observable via computer tomography [1].
Near-planar, non-woven stochastic fibre networks are manufactured for a variety of
applications: at the macroscale for printing, textiles, reinforcing and filtration,
and at the nanoscale in medicine. Figure 13.3 shows a selection of electron
micrographs for networks at different scales. Radiography or optical densitometry
yield areal density images of the kinds shown in Fig. 13.4.
Much analytic work has been done on modelling of the statistical geometry of sto-
chastic fibrous networks [1, 6, 7, 17]. Using complete sampling by square cells, their
areal density distribution is typically well represented by a log-gamma or a
(truncated) Gaussian distribution whose variance decreases monotonically with
increasing cell size; the rate of decay depends on fibre and fibre cluster dimensions. They
have gamma void size distributions with a long tail. Clustering of fibres is well-
approximated by Poisson processes of Poisson clusters of differing density and size.
Fig. 13.1 Example of a 1-dimensional stochastic texture, a grey level barcode for the amino acid
sequence in a sample of the Saccharomyces cerevisiae yeast genome from the database [19]
Fig. 13.2 Aluminium foam with a narrow Gaussian-like distribution of void sizes of around 1 cm
diameter partially wrapped in fragmented metallic shells, used as crushable buffers inside vehicle
bodies. The cosmological void distribution is by contrast gamma-like with a long tail [8], inter-
spersed with 60 % of galaxies in large-scale sheets, 20 % in rich filaments and 20 % in sparse
filaments [12]. Such 3D stochastic porous materials can both be studied by tomographic meth-
ods, albeit at different scales by different technologies, yielding sequences of 2D stochastic texture
images
Fig. 13.3 Electron micrographs of four stochastic fibrous materials. Top left Nonwoven carbon
fibre mat; top right glass fibre filter; bottom left electrospun nylon nanofibrous network (Courtesy
S. J. Eichhorn and D. J. Scurr); bottom right paper using wood cellulose fibres—typically flat
ribbonlike, of length 1–2 mm and width 0.02–0.03 mm
Fig. 13.4 Areal density radiographs of three paper networks made from natural wood cellulose
fibres, with constant mean coverage, c̄ ≈ 20 fibres, but different distributions of fibres. Each image
represents a square region of side length 5 cm; darker regions correspond to higher coverage. The
left image is similar to that expected for a Poisson process of the same fibres, so typical real samples
exhibit clustering of fibres
Fig. 13.5 Trivariate distribution of pixel density values for radiograph of a 5 cm square newsprint
sample. Left source density map; centre histogram of β̃i, β̃1,i and β̃2,i; right 3D scatter plot of β̃i,
β̃1,i and β̃2,i
The trivariate distribution of local pixel densities captures the spatial structure of
density clusters; this may be extended to more random variables by using also third,
fourth, etc., neighbours. In some cases, of course, other pixel density distributions
may be more appropriate, such as mixtures of Gaussians.
The mean of a random variable p is its average value, p̄, over the population. The
covariance Cov(p, q) of a pair of random variables p and q is a measure of the
degree of association between them, the difference between their mean product and
the product of their means:

Cov(p, q) = \overline{pq} − p̄ q̄.  (13.1)
In particular, the covariance of a variable with itself is its variance. From the array of
local average pixel density values β̃i , we generate two numbers associated with each:
the average density of the six first-neighbour pixels, β̃1,i and the average density of
the 16 second-neighbour pixels, β̃2,i . Thus, we have a trivariate distribution of the
random variables (β̃i , β̃1,i , β̃2,i ) with β̄2 = β̄1 = β̄.
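As an illustration of this construction (not the authors' code), the sketch below computes the three variables and their 3 × 3 covariance from a density array. It uses square-lattice shells, so its first-neighbour shell has 8 pixels rather than the 6 quoted above for the authors' pixel geometry, while the second shell has the same 16 pixels; the input array is a synthetic stand-in for a radiograph:

```python
import numpy as np

def shell_mean(density, radius, margin):
    """Mean over the square shell of pixels at Chebyshev distance `radius`,
    evaluated on the interior region left after cropping `margin` pixels."""
    h, w = density.shape
    offsets = [(dr, dc)
               for dr in range(-radius, radius + 1)
               for dc in range(-radius, radius + 1)
               if max(abs(dr), abs(dc)) == radius]
    acc = np.zeros((h - 2 * margin, w - 2 * margin))
    for dr, dc in offsets:
        acc += density[margin + dr:h - margin + dr, margin + dc:w - margin + dc]
    return acc / len(offsets)

rng = np.random.default_rng(0)
density = rng.gamma(20.0, 1.0, size=(250, 250))   # stand-in for a radiograph

m = 2                                             # crop so the second shell fits
beta  = density[m:-m, m:-m]                       # central pixel densities
beta1 = shell_mean(density, radius=1, margin=m)   # first-neighbour means
beta2 = shell_mean(density, radius=2, margin=m)   # second-neighbour means

X = np.stack([beta.ravel(), beta1.ravel(), beta2.ravel()])
mu, Sigma = X.mean(axis=1), np.cov(X)             # trivariate mean, 3x3 covariance
```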
Figure 13.5 provides an example of a typical data set obtained from a radiograph
of a 5 cm square commercial newsprint sample; the histogram and three-dimensional
scatter plot show data obtained for pixels of side 1 mm.
From the Central Limit Theorem, we expect the marginal distributions of β̃i ,
β̃1,i and β̃2,i to be well approximated by Gaussian distributions. For the example in
Fig. 13.5, these Gaussians are represented by the solid lines on the histogram; this
Gaussian approximation holds for all samples investigated in this study.
We have a simulator for creating stochastic fibre networks [10]. The code works by
dropping clusters of fibres within a circular region where the centre of each cluster is
distributed as a point Poisson process in the plane and the number of fibres per cluster,
Fig. 13.6 Simulated areal density maps each representing a 4 cm × 4 cm region formed from fibres
with length λ = 1 mm, to a mean coverage of 6 fibres
Consider a Poisson process in the plane for finite rectangles of length λ and width
ω ≤ λ, with uniform orientation of rectangle axes to a fixed direction. The covariance
or autocorrelation function for such objects is known and given by [7]:
For 0 < r ≤ ω,

α₁(r) = 1 − (2/π) ( r/λ + r/ω − r²/(2ωλ) ).  (13.3)

For ω < r ≤ λ,

α₂(r) = (2/π) ( arcsin(ω/r) − ω/(2λ) − r/ω + √(r²/ω² − 1) ).  (13.4)

For λ < r ≤ √(λ² + ω²),

α₃(r) = (2/π) ( arcsin(ω/r) − arccos(λ/r) − ω/(2λ) − λ/(2ω) − r²/(2λω) + √(r²/λ² − 1) + √(r²/ω² − 1) ).  (13.5)
Then, the coverage c at a point is the number of rectangles overlapping that point,
a Poisson variable with grand mean value c̄, and the average coverage or density
in finite pixels, c̃, tends to a Gaussian random variable. For sampling of the process
using, say, square inspection pixels of side length x, the variance of their density c̃(x)
is

Var(c̃(x)) = Var(c(0)) ∫₀^{√2·x} α(r, ω, λ) b(r) dr,  (13.6)

where b is the probability density function for the distance r between two points
chosen independently and at random in the given type of pixel; it was derived by
Ghosh [13]. For a square pixel of side x,

b(r, x) = (2r/x⁴) ( πx² − 4xr + r² ),  0 ≤ r ≤ x,  (13.7)

b(r, x) = (4r/x⁴) ( x²(arcsin(x/r) − arccos(x/r)) + 2x√(r² − x²) − (r² + 2x²)/2 ),  x < r ≤ √2·x.  (13.8)
Fig. 13.7 Probability density function b(r, 1) from Eqs. (13.7), (13.8) for the distance r between
two points chosen independently and at random in a unit square
A plot of this function is given in Fig. 13.7. Observe that, for vanishingly small
pixels, that is points, b degenerates into a delta function on r = 0. Ghosh [13] gave
also the form of b for other types of pixels; for arbitrary rectangular pixels those
expressions can be found in [7]. For values of r small compared with the pixel
dimensions, the formulae for convex pixels of area A and perimeter P all reduce to

b(r, A, P) = 2πr/A − 2Pr²/A²,

which would be appropriate to use when the rectangle dimensions ω, λ are small
compared with the dimensions of the pixel.
It helps to visualize practical variance computations by considering the case of
sampling using large square pixels of side mx, say, which themselves consist of
exactly m² small square pixels of side x. The variance Var(c̃(mx)) is related to
Var(c̃(x)) through the covariance Cov(x, mx) of x-pixels in mx-pixels [7]:

Var(c̃(mx)) = (1/m²) Var(c̃(x)) + ((m² − 1)/m²) Cov(x, mx).

The fractional between-pixel variance for pixels of side mx is

ρ̃(mx) = Cov(0, mx)/Var(c(0)) = Var(c̃(mx))/Var(c(0)),
which increases monotonically with λ and with ω but decreases monotonically with
mx; see Deng and Dodson [6] for more details. In fact, for a Poisson process
of rectangles the variance of coverage at points is precisely the mean coverage,
V ar (c(0)) = c̄, so if we agree to measure coverage as a fraction of the mean cover-
age then Eq. (13.6) reduces to the integral
Var(c̃(x))/c̄ = ∫₀^{√2·x} α(r, ω, λ) b(r) dr = ρ̃(x).  (13.9)
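Equation (13.9) is straightforward to evaluate numerically. The sketch below assumes the reconstructed forms of Eqs. (13.3)–(13.5) and (13.7)–(13.8) above; the rectangle dimensions are illustrative stand-ins:

```python
import numpy as np
from scipy.integrate import quad

def alpha(r, w, lam):
    """Point autocorrelation for a Poisson process of w x lam rectangles,
    Eqs. (13.3)-(13.5); zero beyond the rectangle diagonal."""
    if r <= 0.0:
        return 1.0
    if r <= w:
        return 1.0 - (2/np.pi)*(r/lam + r/w - r*r/(2*w*lam))
    if r <= lam:
        return (2/np.pi)*(np.arcsin(w/r) - w/(2*lam) - r/w
                          + np.sqrt(r*r/(w*w) - 1.0))
    if r <= np.hypot(w, lam):
        return (2/np.pi)*(np.arcsin(w/r) - np.arccos(lam/r)
                          - w/(2*lam) - lam/(2*w) - r*r/(2*lam*w)
                          + np.sqrt(r*r/(lam*lam) - 1.0)
                          + np.sqrt(r*r/(w*w) - 1.0))
    return 0.0

def b_square(r, x):
    """Ghosh's inter-point distance density for a square of side x,
    Eqs. (13.7)-(13.8)."""
    if 0.0 <= r <= x:
        return (2*r/x**4)*(np.pi*x*x - 4*x*r + r*r)
    if x < r <= np.sqrt(2.0)*x:
        return (4*r/x**4)*(x*x*(np.arcsin(x/r) - np.arccos(x/r))
                           + 2*x*np.sqrt(r*r - x*x) - 0.5*(r*r + 2*x*x))
    return 0.0

def rho_tilde(x, w=0.2, lam=1.0):
    """Fractional between-pixel variance, Eq. (13.9)."""
    val, _ = quad(lambda r: alpha(r, w, lam) * b_square(r, x),
                  0.0, np.sqrt(2.0)*x, limit=200)
    return val

print(rho_tilde(1.0))   # variance ratio for unit inspection pixels
```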
Now, the covariance among points inside mx-pixels, Cov(0, mx), is the expec-
tation of the covariance between pairs of points separated by distance r, taken over
the possible values for r in an mx-pixel; that amounts to the integral in Eq. (13.6).
By this means we have continuous families of 2 × 2 covariance matrices, for x ∈ ℝ⁺
and 2 ≤ m ∈ ℤ⁺, given by

Δ_{x,m} = [ σ₁₁  σ₁₂ ; σ₁₂  σ₂₂ ] = [ Var(c̃(x))  Cov(x, mx) ; Cov(x, mx)  Var(c̃(x)) ] = [ ρ̃(x)  ρ̃(mx) ; ρ̃(mx)  ρ̃(x) ],  (13.10)
which encodes information about the spatial structure formed from the Poisson
process of rectangles, for each choice of rectangle dimensions ω ≤ λ ∈ ℝ⁺. This can
be extended to include mixtures of different rectangles with given relative abundances
and processes of more complex objects such as Poisson clusters of rectangles.
There is a one-dimensional version of the above, discussed in [6, 7], with point
autocorrelation calculated easily as

α(r) = 1 − r/λ,  0 ≤ r ≤ λ;  α(r) = 0,  λ < r.  (13.11)

Also, the probability density function for points chosen independently and at
random with separation r in a pixel, which is here an interval of length x, is

b(r) = (2/x)(1 − r/x),  0 ≤ r ≤ x.  (13.12)
Then the integral (13.6) gives the fractional between-pixel variance as

ρ̃(x, λ) = 1 − x/(3λ),  0 ≤ x ≤ λ;  ρ̃(x, λ) = (λ/x)(1 − λ/(3x)),  λ < x.  (13.13)
The corresponding covariance matrices follow as in Eq. (13.10):

Δ_{x,m}(λ) = [ ρ̃(x, λ)  ρ̃(mx, λ) ; ρ̃(mx, λ)  ρ̃(x, λ) ].  (13.14)
In particular, if we take unit length intervals as the base pixels, for the Poisson
process of unit length line segments, x = λ = 1, we obtain

Δ_{1,m}(1) = [ 1 − 1/3   (1/m)(1 − 1/(3m)) ; (1/m)(1 − 1/(3m))   1 − 1/3 ]  for m = 2, 3, . . . .  (13.15)
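These matrices are immediate to generate from Eq. (13.13); a minimal sketch, with the m = 2 case of Eq. (13.15) as a check:

```python
import numpy as np

def rho_1d(x, lam=1.0):
    """Fractional between-pixel variance for the 1-D process, Eq. (13.13)."""
    return 1.0 - x/(3.0*lam) if x <= lam else (lam/x)*(1.0 - lam/(3.0*x))

def delta(m, x=1.0, lam=1.0):
    """Spatial covariance matrix of Eq. (13.14); Eq. (13.15) when x = lam = 1."""
    a, b = rho_1d(x, lam), rho_1d(m*x, lam)
    return np.array([[a, b], [b, a]])

print(delta(2))   # [[2/3, 5/12], [5/12, 2/3]]
```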
Given the family of pixel density distributions, with associated spatial covariance
structure among neighbours, we can use the Fisher metric [1] to yield an arc length
function on the curved space of parameters which represent mean and covariance
matrices. Then the information distance between any two such distributions is given
by the length of the shortest curve between them, a geodesic, in this space. The
computational difficulty is in finding the length of this shortest curve since it is the
infimum over all curves between the given two points. Fortunately, in the cases we
need, multivariate Gaussians, this problem has been largely solved analytically by
Atkinson and Mitchell [2].
Accordingly, some of our illustrative examples use information geometry of
trivariate Gaussian spatial distributions of pixel density with covariances among
first and second neighbours to reveal features related to sizes and density of clusters,
which could arise in one, two or three dimensions. For isotropic spatial processes,
which we consider here, the variables are means over shells of first and second neigh-
bours, respectively. For anisotropic networks the neighbour sets would be split into
more new variables to pick up the spatial anisotropy in the available spatial directions.
Other illustrations will use the analytic bivariate covariances given in Sect. 13.3
by Eq. (13.10).
What we know analytically is the geodesic distance between two multivariate
Gaussians f_A, f_B with the same number n of variables in two particular cases [2]:

1. μ_A ≠ μ_B, Δ_A = Δ_B = Δ: f_A = (n, μ_A, Δ), f_B = (n, μ_B, Δ), and

D_μ(f_A, f_B) = √( (μ_A − μ_B)ᵀ · Δ⁻¹ · (μ_A − μ_B) ).  (13.16)

2. μ_A = μ_B = μ, Δ_A ≠ Δ_B: f_A = (n, μ, Δ_A), f_B = (n, μ, Δ_B), and

D_Δ(f_A, f_B) = √( (1/2) Σ_{j=1}^{n} log²(λ_j) ),  (13.17)
with {λ_j} = Eig(Δ_A^{−1/2} · Δ_B · Δ_A^{−1/2}).

Fig. 13.8 Plot of D_Δ(f_A, f_B) from (13.17) against Φ_Δ(f_A, f_B) from (13.18) for 185 different
trivariate Gaussian covariance matrices
In the present paper we use Eqs. (13.16) and (13.17) and take the simplest choice
of a linear combination of both when both mean and covariance are different.
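Both closed forms are simple to compute. The following minimal sketch (not the authors' code) uses the fact that the eigenvalues of Δ_A^{−1/2} · Δ_B · Δ_A^{−1/2} coincide with the generalized eigenvalues of the pair (Δ_B, Δ_A):

```python
import numpy as np
from scipy.linalg import eigh

def d_mu(mu_a, mu_b, sigma):
    """Eq. (13.16): geodesic distance for equal covariances."""
    d = np.asarray(mu_a, float) - np.asarray(mu_b, float)
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

def d_sigma(sigma_a, sigma_b):
    """Eq. (13.17): geodesic distance for equal means."""
    lam = eigh(sigma_b, sigma_a, eigvals_only=True)   # generalized eigenvalues
    return float(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))
```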
However, from the form of D_Δ(f_A, f_B) in (13.17) we deduce that an approximate
monotonic relationship arises with a more easily computed symmetrized log-trace
function given by

Φ_Δ(f_A, f_B) = √( log[ (1/(2n)) ( Tr(Δ_A^{−1/2} · Δ_B · Δ_A^{−1/2}) + Tr(Δ_B^{−1/2} · Δ_A · Δ_B^{−1/2}) ) ] ).  (13.18)
We shall illustrate the differences of spatial features in given data sets obtained
from the distribution of local density for real and simulated planar stochastic fibre
networks. In such cases there is benefit in mutual information difference comparisons
of samples in the set, but the difficulty is often the large number of samples in a set
of interest, perhaps a hundred or more. Human brains do this kind of reduction very
well: enormous numbers of optical sensors stream information from the eyes into the
brain, and the result is a 3-dimensional reduction that helps us
'see' the external environment. We want to see a large data set organised in such a
way that natural groupings are revealed and quantitative dispositions among groups
are preserved. The problem is how to present the information contained in the whole
data set, each sample yielding a 3 × 3 covariance matrix Δ and mean μ. The optimum
presentation is to use a 3-dimensional plot, but the question is what to put on the
axes.
To solve this problem we use multi-dimensional scaling, or dimensionality reduc-
tion, to extract the three most significant features from the set of samples so that all
samples can be displayed graphically in a 3-dimensional plot. The aim is to reveal
groupings of data points that correspond to the prominent characteristics; in our
context we have different former types, grades and differing scales and intensities
of fibre clustering. Such a methodology has particular value in process quality
control, where applications frequently have to study large data sets of samples
from a trial, or from a change in the conditions of manufacture or constituents.
Moreover, it can reveal anomalous behaviour of a process or unusual deviation in a
product. The raw data of one sample from a study of spatial variability might typi-
cally consist of a spatial array of 250 × 250 pixel density values, so what we solve
is a problem in classification for stochastic image textures.
The method, which we introduced in a preliminary report [11], depends on extract-
ing the three largest eigenvalues and their eigenvectors from a matrix of mutual infor-
mation distances among distributions representing the samples in the data set. The
number in the data set is unimportant, except for the computation time in finding
eigenvalues. This follows the methods described by Carter et al. [4, 5]. Our study
is for datasets of pixel density arrays from complete sampling of density maps of
stochastic textures which incorporate spatial covariances. We report the results of
such work on a large collection of radiographs from commercial papers made from
continuous filtration of cellulose and other fibres [9].
The series of computational stages is as follows:
1. Obtain mutual ‘information distances’ D(i, j) among the members of the data set
of N textures X 1 , X 2 , . . . , X N using the fitted trivariate Gaussian pixel density
distributions.
2. The array of N × N differences D(i, j) is a real symmetric matrix with zero
diagonal. This is centralized by subtracting row and column means and then
adding back the grand mean to give CD(i, j).
3. The centralized matrix CD(i, j) is again a real symmetric matrix with zero diag-
onal. We compute its N eigenvalues ECD(i), which are necessarily real, and the
N corresponding N -dimensional eigenvectors VCD(i).
4. Make a 3 × 3 diagonal matrix A of the three eigenvalues of largest absolute
magnitude and a 3 × N matrix B of the corresponding eigenvectors. The matrix
product A · B yields a 3 × N matrix whose transpose is an N × 3 matrix T, which
gives us N coordinate values (xᵢ, yᵢ, zᵢ) to embed the N samples in 3-space (see
the sketch below).
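Stages 2–4 amount to a small linear-algebra routine. A minimal sketch (assuming the N × N distance matrix D from stage 1 is already assembled; not the authors' code):

```python
import numpy as np

def embed_3d(D):
    """Centre the symmetric distance matrix and embed its rows in 3-space
    using the three eigenvalues of largest absolute magnitude (stages 2-4)."""
    CD = D - D.mean(axis=0)[None, :] - D.mean(axis=1)[:, None] + D.mean()
    evals, evecs = np.linalg.eigh(CD)          # real, since CD is symmetric
    idx = np.argsort(-np.abs(evals))[:3]       # three largest |eigenvalues|
    A = np.diag(evals[idx])                    # 3 x 3
    B = evecs[:, idx].T                        # 3 x N
    return (A @ B).T                           # T: N x 3 embedding coordinates
```

Plotting the rows of the returned array against each other then gives 3-dimensional displays of the kind shown in the figures that follow.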
Example: Bivariate Gaussians

f(x, y) = (1/(2π√Φ)) exp( −(1/(2Φ)) [ (y − μ₂)²σ₁₁ + (x − μ₁)((x − μ₁)σ₂₂ + 2(−y + μ₂)σ₁₂) ] ),

μ = (μ₁, μ₂),  Φ = Det[Δ] = σ₁₁σ₂₂ − σ₁₂²,

Δ = [ σ₁₁  σ₁₂ ; σ₁₂  σ₂₂ ] = σ₁₁ [ 1 0 ; 0 0 ] + σ₁₂ [ 0 1 ; 1 0 ] + σ₂₂ [ 0 0 ; 0 1 ],

Δ⁻¹ = [ σ₂₂/Φ  −σ₁₂/Φ ; −σ₁₂/Φ  σ₁₁/Φ ].
Numerical example:

Δ^A = [ 1 0 ; 0 1 ],  Δ^B = [ 3 2 ; 2 6 ],  (Δ^B)⁻¹ = [ 3/7  −1/7 ; −1/7  3/14 ],

(Δ^A)^{−1/2} · Δ^B · (Δ^A)^{−1/2} = [ 1 0 ; 0 1 ] · [ 3 2 ; 2 6 ] · [ 1 0 ; 0 1 ] = [ 3 2 ; 2 6 ],

with eigenvalues λ₁ = 7, λ₂ = 2.
D_Δ(Δ^A, Δ^B) = √( (1/2) Σ_{j=1}^{n} log²(λ_j) ) ≈ 1.46065,

Φ_Δ(Δ^A, Δ^B) = √( log( (7 + 2)/4 ) ) ≈ 0.9005.
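The worked example is reproduced directly by the sketch given after Eqs. (13.16)–(13.17); the Φ_Δ line follows the reading of the printed value adopted here:

```python
import numpy as np
from scipy.linalg import eigh

sigma_a = np.eye(2)
sigma_b = np.array([[3.0, 2.0], [2.0, 6.0]])

lam = eigh(sigma_b, sigma_a, eigvals_only=True)  # array([2., 7.])
print(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))   # 1.46065...
print(np.sqrt(np.log(lam.sum() / 4.0)))          # 0.90052..., the quoted 0.9005
```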
Fig. 13.9 Embedding of 20 evaluations of information distance for the bivariate covariances arising
from a Poisson line process of line segments, (13.15), with x = λ = 1 and m = 2, 3, . . . 21. The
starting green point in the lower right is for m = 2 and the red end point is for m = 21
Fig. 13.10 Embedding of 18 evaluations of information distance for the bivariate covariances
arising from a planar Poisson process of squares, (13.10), with ω = λ = 1. The two groups arise
from different schemes of inspection pixels. Right group used small base pixels with x = 0.1, from
blue to pink m = 2, 3, . . . , 10; left group used large base pixels with x = 1, from green to red
m = 2, 3, . . . , 10
Fig. 13.11 Embedding of 22 evaluations of information distance for the bivariate covariances
arising from a planar Poisson process of rectangles, (13.10), with ω = 0.2, λ = 1. The two groups
arise from different schemes of inspection pixels. Left group used large base pixels x = 1, from
green to red m = 2, 3, . . . , 10; right group used small base pixels x = 0.1, from blue to pink
m = 2, 3, . . . , 10
Figure 13.11 shows the corresponding embedding for a planar Poisson process of
rectangles with aspect ratio 5:1, from (13.10), with ω = 0.2, λ = 1. Again it shows
the separation into two groups of samples analysed with small pixels, right, and
with large pixels, left.
Our three spatial variables for each spatial array of data are the mean density in a cen-
tral pixel, mean of its first neighbours, and mean of its second neighbours. We begin
with analysis of a set of 16 samples of areal density maps for simulated stochastic
fibre networks made from the same number of 1 mm fibres but with differing scales
(clump sizes) and intensities (clump densities) of fibre clustering. Among these is the
standard unclustered Poisson fibre network; all samples have the same mean density.
Figure 13.12 gives analyses for spatial arrays of pixel density differences from
Poisson networks. It shows a plot of DΔ ( f A , f B ) as a cubic-smoothed surface (left),
and the same data grouped by numbers of fibres in clusters and cluster densities
(right), for geodesic information distances among 16 datasets of 1 mm pixel density
differences between a Poisson network and simulated networks made from 1 mm
fibres. Each network has the same mean density but with different scales and densities
of clustering; thus the mean difference is zero in this case. Using pixels of the order of
fibre length is appropriate for extracting information on the sizes of typical clusters.
The embedding reveals the clustering features as orthogonal subgroups.
Fig. 13.12 Pixel density differences from Poisson networks. Left plot of D_Δ(f_A, f_B) as a cubic-
smoothed surface, for trivariate Gaussian information distances among 16 datasets of 1 mm pixel
density differences between a Poisson network and simulated networks made from 1 mm fibres;
each network has the same mean density but with different clustering. Right embedding of the same
data grouped by numbers of fibres in clusters and cluster densities
Next, Fig. 13.13 gives analyses for pixel density arrays of the clustered networks.
It shows on the left the plot of D_Δ(f_A, f_B) as a cubic-smoothed surface
for trivariate Gaussian information distances among the 16 datasets of 1 mm pixel
densities for simulated networks made from 1 mm fibres, each network with the same
mean density but with different clustering. In this case the trivariate Gaussians all
have the same mean vectors. Shown on the right is the dimensionality reduction
embedding of the same data grouped by numbers of fibres in clusters and cluster
densities; the solitary point is a Poisson network of the same fibres.
Figure 13.14 gives analyses for pixel density arrays for Poisson networks of different
mean density. It shows, left, the plot of D_Δ(f_A, f_B) as a cubic-smoothed surface,
for trivariate Gaussian information distances among 16 simulated Poisson networks
made from 1 mm fibres, with different mean density, using pixels at 1 mm scale. Also
shown, right, is the dimensionality reduction embedding of the same Poisson network
data, showing the effect of mean network density.
Fig. 13.13 Pixel density arrays for clustered networks. Left plot of D_Δ(f_A, f_B) as a cubic-
smoothed surface, for trivariate Gaussian information distances among 16 datasets of 1 mm pixel
density arrays for simulated networks made from 1 mm fibres, each network with the same mean
density but with different clustering. Right embedding of the same data grouped by numbers of
fibres in clusters and cluster densities; the solitary point is an unclustered Poisson network
Fig. 13.14 Pixel density arrays for Poisson networks of different mean density. Left plot of
D_Δ(f_A, f_B) as a cubic-smoothed surface, for trivariate Gaussian information distances
among 16 simulated Poisson networks made from 1 mm fibres, with different mean density, using
pixels at 1 mm scale. Right embedding of the same Poisson network data, showing the effect of
mean network density
Fig. 13.15 Embedding using 182 trivariate Gaussian distributions for samples from the data set [9].
Blue points are from gap formers; orange are various handsheets, purple are from pilot paper
machines and green are from hybrid formers. The embedding separates these different forming
methods into subgroups
Figure 13.15 shows a 3-dimensional embedding for a data set from [9] including
182 paper samples from gap formers, handsheets, pilot machine samples and hybrid
formers. We see that to differing degrees the embedding separates these different and
very disparate forming methods by assembling them into subgroups. This kind of
discrimination could be valuable in evaluating trials, comparing different installations
of similar formers and for identifying anomalous behaviour.
The benefit from these analyses is the representation of the important structural
features, the number of fibres per cluster and the cluster density, by almost orthogonal
subgroups in the embedding.
The yeast Saccharomyces cerevisiae is the genome studied in [3], for which we
showed that all 20 amino acids along the protein chains exhibit mutual clustering,
with separations of 3–12 generally favoured between repeated amino acids, perhaps
because this is the usual length of secondary structure, cf. also [1].
Fig. 13.16 Determinants of 12-variate spatial covariances for 20 samples of yeast amino acid
sequences, black Y, together with three Poisson sequences of 100,000 amino acids with the yeast
relative abundances, blue RY. Also shown are 20 samples of human sequences, red H, and three
Poisson sequences of 100,000 amino acids with the human relative abundances, green RH.
Fig. 13.17 Twelve-variate spatial covariance embeddings for 20 samples of yeast amino acid
sequences, small black points, together with three Poisson sequences of 100,000 amino acids with
the yeast relative abundances, large blue points. Also shown are 20 human DNA sequences, medium
red points, and three Poisson sequences of 100,000 amino acids with the human relative abundances,
large green points
The database of sample sequences
is available on the Saccharomyces Genome Database [19]. Here we mapped the
sequences of the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V,
W, Y onto the 20 grey-level values 0.025, 0.075, . . . , 0.975 so yielding a grey-level
barcode for each sequence, Fig. 13.1. Given that the usual length of secondary
structure ranges from 3 to 12 places along a sequence, we used spatial covariances between
each pixel and its successive 12 neighbours. Figure 13.16 plots the determinants of
the 12-variate spatial covariances of 20 samples for yeast, black Y, together with three Poisson
random sequences of 100,000 amino acids with the yeast relative abundances, blue
RY. Also shown are 20 samples of human sequences, red H, and three Poisson
sequences of 100,000 amino acids with the human relative abundances, green RH.
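One reading of this lagged-covariance construction is sketched below, under our own conventions: the 12 variables are taken as lags 0–11 of the barcode signal, and the input is a synthetic Poisson stand-in sequence:

```python
import numpy as np

def lag_covariance(signal, n_vars=12):
    """Covariance matrix of (s_t, s_{t+1}, ..., s_{t+n_vars-1})."""
    s = np.asarray(signal, float)
    n = len(s) - n_vars + 1
    X = np.stack([s[k:k + n] for k in range(n_vars)])
    return np.cov(X)

rng = np.random.default_rng(0)
levels = np.arange(0.025, 1.0, 0.05)            # the 20 grey levels
barcode_values = rng.choice(levels, size=1000)  # Poisson stand-in sequence
Sigma = lag_covariance(barcode_values)
print(np.linalg.det(Sigma))   # determinant, as plotted in Fig. 13.16
```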
Figure 13.17 shows an embedding of these 20 12-variate spatial covariances for yeast,
small black points, together with three Poisson sequences of 100,000 amino acids
with the yeast relative abundances, large blue points, and 20 human DNA sequences,
medium red points using data from the NCBI Genbank Release 197.0 [15], and three
Poisson sequences of 100,000 amino acids with the human relative abundances,
large green points. The sequences ranged in length from 340 to 1,900 amino acids.
As with the original analysis of recurrence spacings [3] which revealed clustering,
the difference of the yeast and human sequence structures from Poisson is evident.
However, it is not particularly easy to distinguish yeast from human sequences by
this technique: both lie in a convex region with the Poisson sequences just outside,
but there is much scatter. Further analyses of genome structures will be reported
elsewhere.
References
1. Arwini, K., Dodson, C.T.J.: Information Geometry: Near Randomness and Near Independence.
Lecture Notes in Mathematics 1953, Springer-Verlag, Berlin, New York (2008); Chapter 9
(with Sampson, W.W.): Stochastic Fibre Networks, pp. 161–194
2. Atkinson, C., Mitchell, A.F.S.: Rao's distance measure. Sankhyā: Indian J. Stat. Ser. A 43(3),
345–365 (1981)
3. Cai, Y., Dodson, C.T.J., Wolkenhauer, O., Doig, A.J.: Gamma distribution analysis of protein
sequences shows that amino acids self cluster. J. Theor. Biol. 218(4), 409–418 (2002)
4. Carter, K.M., Raich, R., Hero, A.O.: Learning on statistical manifolds for clustering and
visualization. In: 45th Allerton Conference on Communication, Control, and Computing, Mon-
ticello, Illinois (2007). https://wiki.eecs.umich.edu/global/data/hero/images/c/c6/Kmcarter-
learnstatman.pdf
5. Carter, K.M.: Dimensionality reduction on statistical manifolds. Ph.D. thesis, University of
Michigan (2009). http://tbayes.eecs.umich.edu/kmcarter/thesis
6. Deng, M., Dodson, C.T.J.: Paper: An Engineered Stochastic Structure. Tappi Press, Atlanta
(1994)
7. Dodson, C.T.J.: Spatial variability and the theory of sampling in random fibrous networks. J.
Roy. Statist. Soc. B 33(1), 88–94 (1971)
8. Dodson, C.T.J.: A geometrical representation for departures from randomness of the inter-
galactic void probability function. In: Workshop on Statistics of Cosmological Data Sets,
NATO-ASI, Isaac Newton Institute, 8–13 August 1999. http://arxiv.org/abs/0811.4390
9. Dodson, C.T.J., Ng, W.K., Singh, R.R.: Paper: stochastic structure analysis archive. Pulp and
Paper Centre, University of Toronto (1995) (3 CDs)
10. Dodson, C.T.J., Sampson, W.W.: In: I'Anson, S.J. (ed.) Advances in Pulp and Paper Research,
Oxford 2009, Transactions of the XIVth Fundamental Research Symposium, pp. 665–691.
FRC, Manchester (2009)
11. Dodson, C.T.J., Sampson, W.W.: Dimensionality reduction for classification of stochastic fibre
radiographs. In: Proceedings of GSI2013, Geometric Science of Information, Paris, 28–30
August 2013. Lecture Notes in Computer Science 8085, Springer-Verlag, Berlin (2013)
12. Doroshkevich, A.G., Tucker, D.L., Oemler, A., Kirshner, R.P., Lin, H., Shectman, S.A., Landy,
S.D., Fong, R.: Large- and superlarge-scale structure in the Las Campanas Redshift Survey. Mon.
Not. R. Astr. Soc. 283(4), 1281–1310 (1996)
13. Ghosh, B.: Random distances within a rectangle and between two rectangles. Bull. Calcutta
Math. Soc. 43(1), 17–24 (1951)
14. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1980)
15. NCBI Genbank of The National Center for Biotechnology Information. Samples from
CCDS_protein. 20130430.faa.gz. ftp://ftp.ncbi.nlm.nih.gov/genbank/README.genbank
16. Nielsen, F., Garcia, V., Nock, R.: Simplifying Gaussian mixture models via entropic quanti-
zation. In: Proceedings of 17th European Signal Processing Conference, Glasgow, Scotland
24–28 August 2009, pp. 2012–2016
17. Sampson, W.W.: Modelling Stochastic Fibre Materials with Mathematica. Springer-Verlag,
Berlin, New York (2009)
18. Sampson, W.W.: Spatial variability of void structure in thin stochastic fibrous materials. Model.
Simul. Mater. Sci. Eng. 20, 015008 (2012). doi:10.1088/0965-0393/20/1/015008
19. Saccharomyces Cerevisiae Yeast Genome Database. http://downloads.yeastgenome.org/
sequence/S288C_reference/orf_protein/
Index
Symbols
α-Conformally equivalent, 34, 62, 89
α-Conformally flat, 34, 62
α-Hessian, 1, 2, 9, 10, 22–25, 28, 29
α-connection, 63, 88
α-divergence, 75
χ-Fisher metric, 73
χ-cross entropy, 74
χ-cross entropy of Bregman type, 72
χ-cubic form, 73
χ-divergence, 72, 73
χ-exponential connection, 73
χ-exponential function, 66
χ-logarithm function, 65
χ-mixture connection, 73
χ-score function, 69, 74
η-potential, 60
θ-potential, 60

A
Affine connection, 246
Affine coordinate system, 59
Affine harmonic, 83
Affine hypersurface theory, 42
Algebraic estimator, 124
Algebraic statistics, 119
Alignment distance, 237
Aluminium foam, 369
Alzheimer's disease, 266
Amino acids, 368
Anomalous behaviour, 378
Areal density arrays, 368
Autocorrelation, 375

B
Balian quantum metric, 179
Beta-divergence, 45
Bias corrected χ-score function, 72
Bias corrected q-score function, 73
Bivariate Gaussian, 379
Buchberger's algorithm, 137

C
1-conformally equivalent, 75
1-conformally flat, 75
Canonical divergence, 35, 60
Carl Ludwig Siegel, 147, 162, 196
Cartan–Killing form, 206, 207
Cartan–Schouten connections, 248
Cartan–Siegel domains, 193
CEM algorithm, 306
Central limit theorem, 371
Centroid, 283
Clustered networks, 383
Clustering, 367
Clusters of fibres, 371
Complete lattice, 331, 332, 334, 346, 348, 349, 353, 354, 364, 365
Computational anatomy, 274
Constant curvature, 34
Contact geometry, 141, 148, 165, 167, 168
Contrast function, 61
Controllable realization, 232
Convex cones, 94, 141, 146, 147, 151, 161
Convex function, 1, 13, 26, 29
Cosmological voids, 369
Cotton yarn, 368
Cross entropy, 65
Crouzeix relation, 157
Cubic form, 59, 60, 64
E
e-(exponential) representation, 65
Efficient estimator
  first order, 123
  second order, 123
Eigenvalue, 379
Eigenvector, 379
EM algorithm, 305
e-(mixture) representation, 65
Equiaffine geometry, 28
Escort distribution, 67
Euler-Lagrange equation, 84
Exponential connection, 63
Exponential family, 64, 302
  algebraic curved, 124
  curved, 121
  full, 121
  MLE for an, 303
Extrinsic distance, 222

F
Fibre network simulator, 371
First and second neighbours, 367
Fisher information matrix, 63

H
Harmonic function, 83
Harmonic map, 84
Harmonic map relative to (g, ∇̂), 85
Harmonic map relative to (h, ∇, ∇̄), 90
Hellinger distance, 225
Henri Poincaré, 146, 182
Hessian manifold, 59
Hessian structure, 59
Hessian domain, 34
High-dimensional stochastic processes, 220
High-dimensional time series, 220
Hippocampus, 288
Homotopy continuation method, 138
Hyperbolic, 332, 334, 335, 337, 338, 340, 341, 343, 347, 348, 358, 361, 363, 364
Hyperbolic partial ordering, 332, 345–347, 349, 353, 355, 356, 358

I
Induced Hessian manifold, 61
Induced statistical manifold, 61
J
Jean-Louis Koszul, 145, 147, 151, 159, 162, 205
Jean-Marie Souriau, 146, 147, 174, 176

K
Kähler geometry, 1, 15
Kähler affine manifold, 83
Kähler affine structure, 83
k-MLE algorithm, 306
  Hartigan's method for, 308
  Initialization of, 310, 312
  Lloyd's method for, 307
Koszul characteristic function, 145, 146, 151, 157
Koszul entropy, 142, 143, 146, 148, 151, 153, 156, 157
Koszul forms, 160
Kullback–Leibler, 377
Kullback–Leibler divergence, 64

L
Laplace principle, 146, 163, 164, 180, 181
Laplacian, 83
Laplacian of the gradient mapping, 88
LDDMM, 256, 275
Legendre transform, 86
Levi-Civita connection, 249
Lie groups, 245
Linear dynamical systems, 220
Linear predictive coding, 220
Liouville measure, 176
Lookeng Hua, 197, 198

M
Marginal distributions, 371
Mathematical morphology, 331, 332
Maurice Fréchet, 144, 147, 148, 202, 204
Maximum likelihood estimator (MLE), 120, 303

N
Nanofibrous network, 370
Normal form, 129
Normalized Tsallis relative entropy, 75

O
Observable realization, 232
One-parameter subgroups, 246
Optimal mass transport, 225
Ordered Poincaré half-plane, 334

P
Parallel transport, 249
Pattern recognition, 220
Pierre Duhem, 146
Pixel density arrays, 379
Poisson clusters, 368
Poisson line process, 380
Poisson process, 367, 380
Poisson process of rectangles, 375
Poisson rectangle process, 382
Pole ladder, 252
Positive definite matrix, 141, 147, 187
Power spectral density matrix, 224
Power potential, 39
Principal fiber bundle, 233
Pythagorean relation, 43

Q
q-covariant derivative, 114
q-cubic form, 73
q-divergence functional, 112
q-exponential, 66
q-exponential family, 75
q-exponential function, 98
q-exponential manifold, 108, 110
q-exponential model, 105, 108, 110
q-Fisher metric, 73
q-independent, 76
q-likelihood function, 77